Official repository for the EMNLP 2025 Main Conference Oral 🏆 paper:
Deriving Strategic Market Insights with Large Language Models: A Benchmark for Forward Counterfactual Generation
📄 Paper
Fin-Force is a benchmark for evaluating the ability of large language models (LLMs) to generate forward counterfactual scenarios grounded in financial news. Given a real market news headline, a model must generate two plausible alternative market developments:
- **Risk Counterfactual**: a materially adverse market development relative to the original headline.
- **Opportunity Counterfactual**: a materially favorable market development relative to the original headline.
These counterfactuals must be both forward-compatible, i.e., plausible developments that could occur after the original news event rather than negations of it, and directional, i.e., clear and meaningful shifts toward either a risk or an opportunity in market conditions. For example, given a headline about a central bank raising interest rates, a risk counterfactual might describe a subsequent sell-off in rate-sensitive sectors, while an opportunity counterfactual might describe strengthening bank margins.
**Note:** Fin-Force is designed primarily as a benchmark for forward counterfactual generation in financial markets, but it can also serve as a testbed for studying future-oriented reasoning, narrative understanding, and scenario generation in large language models. Below we outline several research directions that Fin-Force can support, whether as a primary benchmark, an auxiliary evaluation dataset, or a source of examples for methodological studies:
- **Forward Counterfactual Reasoning**: Investigating how well language models can generate plausible future developments conditioned on real-world events, rather than negating past outcomes.
- **Narrative Understanding in Financial News**: Studying how models interpret and extrapolate the implicit market narratives embedded in financial headlines.
- **Strategic Scenario Generation**: Evaluating the ability of models to generate risk and opportunity scenarios that could inform strategic planning or decision-making.
- **Controlled and Directional Text Generation**: Developing techniques that explicitly steer models toward directional outputs (e.g., adverse vs. favorable market developments).
- **Benchmarking Counterfactual Generation Methods**: Comparing approaches such as prompting, distillation, preference optimization, and model editing for structured counterfactual generation.
- **Text-Based Financial Forecasting**: Exploring how language models can use narrative signals from news to generate forward-looking financial scenarios.
- **Reasoning Benchmark for Large Language Models**: Using Fin-Force to evaluate the reasoning capabilities of frontier LLMs on structured, future-oriented reasoning tasks.
- **Alignment and Preference Learning for Scenario Generation**: Studying how techniques such as RLHF, DPO, or self-rewarding training influence the quality and directionality of generated scenarios.
- **Human–AI Collaborative Decision Support**: Investigating how AI-generated forward counterfactuals can assist analysts in risk assessment, scenario planning, or strategic analysis.
- **Cross-Domain Scenario Planning Benchmarks**: Extending the Fin-Force framework to other domains requiring scenario planning, such as climate risk analysis, disaster response, policy forecasting, or supply-chain resilience.
```
fin_force/
├── data/
│   ├── fin_force.json                  # Main benchmark dataset (test split)
│   └── supplementary_data.json         # Supplementary SFT training data
├── evaluation/
│   ├── llm_evaluate.py                 # LLM-based evaluation (directionality & forward compatibility)
│   ├── nonllm_evaluate.py              # Non-LLM evaluation (perplexity)
│   ├── run_eval.yaml                   # Evaluation config template (with parameter comments)
│   └── rubrics/
│       ├── directionality_risk.txt
│       ├── directionality_opportunity.txt
│       └── forward_compatibility.txt
├── models/
│   ├── baseline_prompting/             # Zero-shot / few-shot prompting baselines
│   ├── cf_distillation/                # Counterfactual distillation via masked headlines
│   ├── self_rewarding_llm/             # Self-Rewarding LLM fine-tuning pipeline
│   └── lm-counterfactuals/             # Gumbel-based counterfactual generation
└── README.md
```
`data/fin_force.json` is the primary test benchmark. Each entry contains a financial news headline along with its associated risk and opportunity counterfactuals.
Key fields:
| Field | Description |
|---|---|
| `headline` | The original financial news headline |
| `classification` | Event type (e.g., `"market_event"`) |
| `category` | Market category (e.g., `"Financial markets & asset performance"`) |
| `output.original_headline` | Headline used as the evaluation prompt |
| `output.risk_counterfactual` | The risk counterfactual |
| `output.opportunity_counterfactual` | The opportunity counterfactual |
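For orientation, here is a minimal sketch of loading the benchmark and reading the fields above (it assumes the JSON-array layout described under the evaluation section below):

```python
import json

# Load the benchmark and inspect one entry (field names as in the table above)
with open("data/fin_force.json") as f:
    data = json.load(f)

entry = data[0]
print(entry["headline"])                              # original news headline
print(entry["output"]["risk_counterfactual"])         # adverse development
print(entry["output"]["opportunity_counterfactual"])  # favorable development
```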
`data/supplementary_data.json` provides supplementary data for supervised fine-tuning; it shares the same schema as `fin_force.json`.
Fin-Force uses four evaluation criteria across two scripts.
| Criterion | Type | Description |
|---|---|---|
| `directionality_risk` | LLM-based | Does the risk counterfactual represent a materially adverse market development? |
| `directionality_opportunity` | LLM-based | Does the opportunity counterfactual represent a materially favorable market development? |
| `forward_compatibility` | LLM-based | Is the counterfactual a plausible future development that does not negate the original headline? |
| `perplexity` | Non-LLM | Sentence-level perplexity (fluency proxy) via GPT-2 |
`evaluation/llm_evaluate.py` uses the OpenAI API with structured outputs to score counterfactuals against the rubrics. Configure it via `evaluation/run_eval.yaml`.
```python
import os

from pydantic import BaseModel

from evaluation.llm_evaluate import LLMEvaluator


class DirectionalityOutput(BaseModel):
    directionality: dict  # e.g. {"value": true}


evaluator = LLMEvaluator(
    api_key=os.environ["OPENAI_API_KEY"],
    input_file="data/fin_force.json",
    rubric_filepath="evaluation/rubrics/directionality_risk.txt",
    opportunity_rubric_filepath="evaluation/rubrics/directionality_opportunity.txt",
    model="gpt-4o",
    response_model=DirectionalityOutput,
    load_prompting_directly=True,         # True for the fin_force.json format
    raw_output_file="results_raw.jsonl",  # Optional: stream results to file
)

results = evaluator.evaluate_responses()
```

Input formats:

- `load_prompting_directly=True`: a JSON array where each item has an `"output"` key containing `original_headline`, `risk_counterfactual`, and `opportunity_counterfactual`.
- `load_prompting_directly=False`: JSONL where each line has `"prompt"`, `"risk counterfactual"`, and `"opportunity counterfactual"` keys.
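If your generations are stored in the JSON-array format but a tool expects the JSONL format, a small conversion along these lines should suffice (a sketch; the key names follow the two formats described above, and `fin_force.jsonl` is an arbitrary output path):

```python
import json

# Convert the JSON-array format (load_prompting_directly=True) into the
# JSONL format expected when load_prompting_directly=False
with open("data/fin_force.json") as f:
    data = json.load(f)

with open("fin_force.jsonl", "w") as f:
    for entry in data:
        out = entry["output"]
        f.write(json.dumps({
            "prompt": out["original_headline"],
            "risk counterfactual": out["risk_counterfactual"],
            "opportunity counterfactual": out["opportunity_counterfactual"],
        }) + "\n")
```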
`evaluation/nonllm_evaluate.py` computes perplexity using GPT-2 (or any HuggingFace LM).
```python
from evaluation.nonllm_evaluate import NonLLMEval

evaluator = NonLLMEval("data/fin_force.json", load_prompting_directly=True)
scores = evaluator.compute_perplexity_scores(model_id="gpt2")

print(scores["risk_perplexity"])  # list of floats, one per example
print(scores["opp_perplexity"])
print(scores["org_perplexity"])   # perplexity of original headlines
```

All evaluation parameters in `evaluation/run_eval.yaml` are documented with inline comments: the API key placeholder, input/output paths, model, rubric paths, and which criteria to run.
| File | Criterion | Output key |
|---|---|---|
| `directionality_risk.txt` | Risk directionality | `directionality.value` (bool) |
| `directionality_opportunity.txt` | Opportunity directionality | `directionality.value` (bool) |
| `forward_compatibility.txt` | Forward compatibility | `forward_compatibility.value` (bool) |
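Once the judge has run, the boolean verdicts can be aggregated into pass rates. A hypothetical sketch, assuming each line of the streamed `results_raw.jsonl` carries the rubric's output key from the table above (the exact record layout may differ):

```python
import json

verdicts = []
with open("results_raw.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # Assumed layout: the rubric's output key nests a boolean "value"
        verdicts.append(bool(record["directionality"]["value"]))

print(f"Directionality pass rate: {sum(verdicts) / len(verdicts):.2%}")
```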
All baseline methods are in `models/`.
Zero-shot and few-shot counterfactual generation using frontier LLMs via the OpenAI or OpenRouter APIs.
Files:
- `_inference.py`: Config-driven inference script with retry logic (supports OpenAI and OpenRouter)
- `response_config.py`: Pydantic response schemas (`Counterfactuals`, `Counterfactuals_COT`, `Counterfactual_masked`)
- `configs/`: One YAML per model/variant, all with commented parameters and API key placeholders
- `prompts/`: System prompt and few-shot example files
Usage:
```bash
export OPENAI_API_KEY=sk-...
export OPENROUTER_API_KEY=sk-or-...

python models/baseline_prompting/_inference.py --config models/baseline_prompting/configs/gpt4o.yaml
```

Available configs:
| Config | Model | Variant |
|---|---|---|
| `gpt4o.yaml` | GPT-4o | Zero-shot |
| `gpt4o_fs_sample1.yaml` | GPT-4o | Few-shot (sample 1) |
| `claude3.5_haiku.yaml` | Claude 3.5 Haiku | Zero-shot |
| `gemini_2.0_flash.yaml` | Gemini 2.0 Flash | Zero-shot |
| `llama4_mav.yaml` | Llama 4 Maverick | Zero-shot |
| `qwen2.5_72B.yaml` | Qwen 2.5 72B | Zero-shot |
| `gpt4o_distil.yaml` | GPT-4o | CF Distillation (masked input) |
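The configs share a common template; the snippet below is a hypothetical illustration of the kind of fields to expect (the authoritative keys live in `models/baseline_prompting/configs/`, each documented with inline comments):

```yaml
# Hypothetical config sketch -- consult the shipped YAMLs for the real keys
api_key: "YOUR_API_KEY_HERE"      # placeholder; prefer environment variables
provider: openai                  # openai | openrouter
model: gpt-4o
input_file: data/fin_force.json
output_file: results/gpt4o.jsonl
few_shot: false
```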
A two-step pipeline: (1) extract the topic word of each headline and mask its noun phrases to de-identify it, then (2) generate counterfactuals from the masked input using `baseline_prompting` with `gpt4o_distil.yaml`.
Files:
- `cf_distillation.py`: Step 1, topic extraction and noun-phrase masking via OpenAI
- `cf_distillation_config.yaml`: Commented config for step 1 (API key placeholder, paths)
- `process_distill_results.py`: Post-processing utility for step 1 output
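To make step 1 concrete, here is a minimal sketch of noun-phrase masking with spaCy (the repo's script additionally performs topic extraction via OpenAI; the `mask_noun_phrases` helper and `[MASK]` token below are illustrative):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # see Installation below

def mask_noun_phrases(headline: str, mask: str = "[MASK]") -> str:
    """Replace every noun chunk in the headline with a mask token."""
    doc = nlp(headline)
    masked = headline
    # Replace longer chunks first so nested chunks resolve cleanly
    for chunk in sorted(doc.noun_chunks, key=lambda c: len(c.text), reverse=True):
        masked = masked.replace(chunk.text, mask)
    return masked

print(mask_noun_phrases("Tech stocks slide as bond yields climb"))
# -> e.g. "[MASK] slide as [MASK] climb"
```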
Usage:
```bash
export OPENAI_API_KEY=sk-...

# Step 1: produce masked headlines
python models/cf_distillation/cf_distillation.py

# Step 2: generate counterfactuals from masked input
python models/baseline_prompting/_inference.py --config models/baseline_prompting/configs/gpt4o_distil.yaml
```

`self_rewarding_llm/` implements an iterative self-improvement pipeline: SFT → generate → LLM-score → DPO → repeat. Adapted from the Self-Rewarding Language Models framework.
Pipeline:
| Script | Stage |
|---|---|
| `scripts/00_sft.py` | Supervised fine-tuning on `supplementary_data.json` |
| `scripts/01_gen_prompts.py` | Generate evaluation prompts |
| `scripts/02_gen_responses.py` | Generate candidate responses |
| `scripts/03_gen_scores.py` | Score responses with LLM-as-a-judge |
| `scripts/04_gen_preferences.py` | Build preference pairs |
| `scripts/05_dpo.py` | Direct Preference Optimization |
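As a rough illustration of stage 04, DPO preference pairs are typically formed by pairing a high-scoring response with a low-scoring one for the same prompt. A hypothetical sketch (field names are illustrative, not the repo's actual schema):

```python
from collections import defaultdict

def build_preference_pairs(scored):
    """scored: iterable of {"prompt": str, "response": str, "score": float}."""
    by_prompt = defaultdict(list)
    for item in scored:
        by_prompt[item["prompt"]].append(item)

    pairs = []
    for prompt, items in by_prompt.items():
        items.sort(key=lambda x: x["score"], reverse=True)
        best, worst = items[0], items[-1]
        if best["score"] > worst["score"]:  # skip ties: no preference signal
            pairs.append({"prompt": prompt,
                          "chosen": best["response"],
                          "rejected": worst["response"]})
    return pairs
```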
Other files:
- `scripts/configs/`: YAML configs for each stage
- `src/srlm/`: Core model, trainer, and inference modules
- `process/`: Data preprocessing utilities
- `my_llm_judge_prompts/`: LLM judge prompt templates
- `self-reward_init_model.sh` / `self-reward_iter_train.sh`: Full pipeline shell scripts
- `multi_self-reward_*.sh`: Multi-model variants
Usage:
```bash
bash models/self_rewarding_llm/self-reward_init_model.sh  # Initial SFT + iteration 1
bash models/self_rewarding_llm/self-reward_iter_train.sh  # Subsequent iterations
```

`lm-counterfactuals/` performs counterfactual generation using Gumbel-max sampling and model editing (MEMIT/ROME), adapted from the lm-counterfactuals framework.
Files:
- `run.py`: Main generation script
- `run_mimic.py` / `mimic.py`: MIMIC counterfactual method
- `sampling.py`: Gumbel sampling utilities
- `utils.py`: Shared utilities
- `analyze.py` / `analyze_edit.py`: Analysis and edit-distance tools
- `gumbel_config.yaml`: Generation configuration
- `example.ipynb`: Usage notebook
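For background, the Gumbel-max trick underlying this method reparameterizes categorical sampling: adding independent Gumbel(0, 1) noise to the logits and taking the argmax is equivalent to sampling from the softmax distribution, and fixing that noise lets the same "random" choices be replayed under an edited model. A minimal sketch (illustrative, not the repo's `sampling.py`):

```python
import torch

def gumbel_max_sample(logits: torch.Tensor, generator=None) -> int:
    """Sample a token index from softmax(logits) via the Gumbel-max trick."""
    uniform = torch.rand(logits.shape, generator=generator)
    gumbel_noise = -torch.log(-torch.log(uniform))  # Gumbel(0, 1) samples
    return int(torch.argmax(logits + gumbel_noise))

# Reusing the same seed replays identical noise, so output differences are
# attributable to the (edited) model rather than to sampling randomness
g = torch.Generator().manual_seed(0)
print(gumbel_max_sample(torch.tensor([2.0, 1.0, 0.5]), generator=g))
```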
Usage:
```bash
python models/lm-counterfactuals/run.py
```

Installation depends on the component you want to run:

```bash
# Evaluation scripts
pip install openai pydantic tqdm evaluate

# Baseline prompting & CF distillation (spaCy is used for noun-phrase masking)
pip install openai pydantic tqdm instructor spacy
python -m spacy download en_core_web_sm

# Self-Rewarding LLM pipeline
pip install -r models/self_rewarding_llm/requirements.txt

# lm-counterfactuals
pip install -r models/lm-counterfactuals/requirements.txt
```

API keys: never hardcode keys. Use environment variables when running the different models:

```bash
export OPENAI_API_KEY=sk-...
export OPENROUTER_API_KEY=sk-or-...
```
Keane Ong: Google Scholar | LinkedIn | Email | Website
If you use Fin-Force in your research, please cite:
```bibtex
@inproceedings{ong-etal-2025-deriving,
title = "Deriving Strategic Market Insights with Large Language Models: A Benchmark for Forward Counterfactual Generation",
author = "Ong, Keane and
Mao, Rui and
Varshney, Deeksha and
Liang, Paul Pu and
Cambria, Erik and
Mengaldo, Gianmarco",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-main.575/",
doi = "10.18653/v1/2025.emnlp-main.575",
pages = "11411--11434",
ISBN = "979-8-89176-332-6",
abstract = "Counterfactual reasoning typically involves considering alternatives to actual events. While often applied to understand past events, a distinct form{---}forward counterfactual reasoning{---}focuses on anticipating plausible future developments. This type of reasoning is invaluable in dynamic financial markets, where anticipating market developments can powerfully unveil potential risks and opportunities for stakeholders, guiding their decision-making. However, performing this at scale is challenging due to the cognitive demands involved, underscoring the need for automated solutions. Large Language Models (LLMs) offer promise, but remain unexplored for this application. To address this gap, we introduce a novel benchmark, Fin-Force{---}**FIN**ancial **FOR**ward **C**ounterfactual **E**valuation. By curating financial news headlines and providing structured evaluation, Fin-Force supports LLM based forward counterfactual generation. This paves the way for scalable and automated solutions for exploring and anticipating future market developments, thereby providing structured insights for decision-making. Through experiments on Fin-Force, we evaluate state-of-the-art LLMs and counterfactual generation methods, analyzing their limitations and proposing insights for future research."
}
```