Official repository for the EMNLP 2025 Main Conference Oral 🏆 paper:
Deriving Strategic Market Insights with Large Language Models: A Benchmark for Forward Counterfactual Generation
📄 Paper
Fin-Force is a benchmark for evaluating the ability of large language models (LLMs) to generate forward counterfactual scenarios grounded in financial news. Given a real market news headline, a model must generate two plausible alternative market developments:
- **Risk Counterfactual**: a materially adverse market development relative to the original headline.
- **Opportunity Counterfactual**: a materially favorable market development relative to the original headline.
These counterfactuals must be both forward-compatible, i.e., plausible developments that could occur after the original news event rather than negations of it, and directional, i.e., clear and meaningful shifts toward either a risk or an opportunity in market conditions. For example, given a headline about a central bank raising interest rates, a risk counterfactual might describe a subsequent sell-off in rate-sensitive sectors, while an opportunity counterfactual might describe strengthening bank margins.
**Note:** Fin-Force is designed primarily as a benchmark for forward counterfactual generation in financial markets, but it can also serve as a testbed for studying future-oriented reasoning, narrative understanding, and scenario generation in large language models. Below we outline several research directions that Fin-Force can support, whether as a primary benchmark, an auxiliary evaluation dataset, or a source of examples for methodological studies:
- **Forward Counterfactual Reasoning**: Investigating how well language models can generate plausible future developments conditioned on real-world events, rather than negating past outcomes.
- **Narrative Understanding in Financial News**: Studying how models interpret and extrapolate the implicit market narratives embedded in financial headlines.
- **Strategic Scenario Generation**: Evaluating the ability of models to generate risk and opportunity scenarios that could inform strategic planning or decision-making.
- **Controlled and Directional Text Generation**: Developing techniques that explicitly steer models toward directional outputs (e.g., adverse vs. favorable market developments).
- **Benchmarking Counterfactual Generation Methods**: Comparing approaches such as prompting, distillation, preference optimization, and model editing for structured counterfactual generation.
- **Text-Based Financial Forecasting**: Exploring how language models can use narrative signals from news to generate forward-looking financial scenarios.
- **Reasoning Benchmark for Large Language Models**: Using Fin-Force to evaluate the reasoning capabilities of frontier LLMs on structured, future-oriented reasoning tasks.
- **Alignment and Preference Learning for Scenario Generation**: Studying how techniques such as RLHF, DPO, or self-rewarding training influence the quality and directionality of generated scenarios.
- **Human–AI Collaborative Decision Support**: Investigating how AI-generated forward counterfactuals can assist analysts in risk assessment, scenario planning, or strategic analysis.
- **Cross-Domain Scenario Planning Benchmarks**: Extending the Fin-Force framework to other domains requiring scenario planning, such as climate risk analysis, disaster response, policy forecasting, or supply-chain resilience.
```
fin_force/
├── data/
│   ├── fin_force.json                  # Main benchmark dataset (test split)
│   └── supplementary_data.json         # Supplementary SFT training data
├── evaluation/
│   ├── llm_evaluate.py                 # LLM-based evaluation (directionality & forward compatibility)
│   ├── nonllm_evaluate.py              # Non-LLM evaluation (perplexity)
│   ├── run_eval.yaml                   # Evaluation config template (with parameter comments)
│   └── rubrics/
│       ├── directionality_risk.txt
│       ├── directionality_opportunity.txt
│       └── forward_compatibility.txt
├── models/
│   ├── baseline_prompting/             # Zero-shot / few-shot prompting baselines
│   ├── cf_distillation/                # Counterfactual distillation via masked headlines
│   ├── self_rewarding_llm/             # Self-Rewarding LLM fine-tuning pipeline
│   └── lm-counterfactuals/             # Gumbel-based counterfactual generation
└── README.md
```
`data/fin_force.json` is the primary test benchmark. Each entry contains a financial news headline along with its associated risk and opportunity counterfactuals.
Key fields:
| Field | Description |
|---|---|
| `headline` | The original financial news headline |
| `classification` | Event type (e.g., `"market_event"`) |
| `category` | Market category (e.g., `"Financial markets & asset performance"`) |
| `output.original_headline` | Headline used as the evaluation prompt |
| `output.risk_counterfactual` | The risk counterfactual |
| `output.opportunity_counterfactual` | The opportunity counterfactual |
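For orientation, here is a minimal sketch of loading the benchmark and reading the fields above (it assumes the JSON-array layout described under the evaluation section below):

```python
import json

# Load the benchmark and inspect one entry (field names as in the table above)
with open("data/fin_force.json") as f:
    data = json.load(f)

entry = data[0]
print(entry["headline"])                              # original news headline
print(entry["output"]["risk_counterfactual"])         # adverse development
print(entry["output"]["opportunity_counterfactual"])  # favorable development
```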
`data/supplementary_data.json` provides supplementary data for supervised fine-tuning; it shares the same schema as `fin_force.json`.
Fin-Force uses four evaluation criteria across two scripts.
| Criterion | Type | Description |
|---|---|---|
| `directionality_risk` | LLM-based | Does the risk counterfactual represent a materially adverse market development? |
| `directionality_opportunity` | LLM-based | Does the opportunity counterfactual represent a materially favorable market development? |
| `forward_compatibility` | LLM-based | Is the counterfactual a plausible future development that does not negate the original headline? |
| `perplexity` | Non-LLM | Sentence-level perplexity (fluency proxy) via GPT-2 |
`evaluation/llm_evaluate.py` uses the OpenAI API with structured outputs to score counterfactuals against the rubrics. Configure it via `evaluation/run_eval.yaml`.
```python
import os

from pydantic import BaseModel

from evaluation.llm_evaluate import LLMEvaluator


class DirectionalityOutput(BaseModel):
    directionality: dict  # e.g. {"value": true}


evaluator = LLMEvaluator(
    api_key=os.environ["OPENAI_API_KEY"],
    input_file="data/fin_force.json",
    rubric_filepath="evaluation/rubrics/directionality_risk.txt",
    opportunity_rubric_filepath="evaluation/rubrics/directionality_opportunity.txt",
    model="gpt-4o",
    response_model=DirectionalityOutput,
    load_prompting_directly=True,         # True for the fin_force.json format
    raw_output_file="results_raw.jsonl",  # Optional: stream results to file
)

results = evaluator.evaluate_responses()
```

Input formats:

- `load_prompting_directly=True`: a JSON array where each item has an `"output"` key containing `original_headline`, `risk_counterfactual`, and `opportunity_counterfactual`.
- `load_prompting_directly=False`: JSONL where each line has `"prompt"`, `"risk counterfactual"`, and `"opportunity counterfactual"` keys.
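If your generations are stored in the JSON-array format but a tool expects the JSONL format, a small conversion along these lines should suffice (a sketch; the key names follow the two formats described above, and `fin_force.jsonl` is an arbitrary output path):

```python
import json

# Convert the JSON-array format (load_prompting_directly=True) into the
# JSONL format expected when load_prompting_directly=False
with open("data/fin_force.json") as f:
    data = json.load(f)

with open("fin_force.jsonl", "w") as f:
    for entry in data:
        out = entry["output"]
        f.write(json.dumps({
            "prompt": out["original_headline"],
            "risk counterfactual": out["risk_counterfactual"],
            "opportunity counterfactual": out["opportunity_counterfactual"],
        }) + "\n")
```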
`evaluation/nonllm_evaluate.py` computes perplexity using GPT-2 (or any HuggingFace LM).
```python
from evaluation.nonllm_evaluate import NonLLMEval

evaluator = NonLLMEval("data/fin_force.json", load_prompting_directly=True)
scores = evaluator.compute_perplexity_scores(model_id="gpt2")

print(scores["risk_perplexity"])  # list of floats, one per example
print(scores["opp_perplexity"])
print(scores["org_perplexity"])   # perplexity of original headlines
```

All evaluation parameters in `evaluation/run_eval.yaml` are documented with inline comments: the API key placeholder, input/output paths, model, rubric paths, and which criteria to run.
| File | Criterion | Output key |
|---|---|---|
| `directionality_risk.txt` | Risk directionality | `directionality.value` (bool) |
| `directionality_opportunity.txt` | Opportunity directionality | `directionality.value` (bool) |
| `forward_compatibility.txt` | Forward compatibility | `forward_compatibility.value` (bool) |
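Once the judge has run, the boolean verdicts can be aggregated into pass rates. A hypothetical sketch, assuming each line of the streamed `results_raw.jsonl` carries the rubric's output key from the table above (the exact record layout may differ):

```python
import json

verdicts = []
with open("results_raw.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # Assumed layout: the rubric's output key nests a boolean "value"
        verdicts.append(bool(record["directionality"]["value"]))

print(f"Directionality pass rate: {sum(verdicts) / len(verdicts):.2%}")
```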
All baseline methods are in `models/`.
Zero-shot and few-shot counterfactual generation using frontier LLMs via the OpenAI or OpenRouter APIs.
Files:
- `_inference.py`: Config-driven inference script with retry logic (supports OpenAI and OpenRouter)
- `response_config.py`: Pydantic response schemas (`Counterfactuals`, `Counterfactuals_COT`, `Counterfactual_masked`)
- `configs/`: One YAML per model/variant, all with commented parameters and API key placeholders
- `prompts/`: System prompt and few-shot example files
Usage:
```bash
export OPENAI_API_KEY=sk-...
export OPENROUTER_API_KEY=sk-or-...

python models/baseline_prompting/_inference.py --config models/baseline_prompting/configs/gpt4o.yaml
```

Available configs:
| Config | Model | Variant |
|---|---|---|
| `gpt4o.yaml` | GPT-4o | Zero-shot |
| `gpt4o_fs_sample1.yaml` | GPT-4o | Few-shot (sample 1) |
| `claude3.5_haiku.yaml` | Claude 3.5 Haiku | Zero-shot |
| `gemini_2.0_flash.yaml` | Gemini 2.0 Flash | Zero-shot |
| `llama4_mav.yaml` | Llama 4 Maverick | Zero-shot |
| `qwen2.5_72B.yaml` | Qwen 2.5 72B | Zero-shot |
| `gpt4o_distil.yaml` | GPT-4o | CF Distillation (masked input) |
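The configs share a common template; the snippet below is a hypothetical illustration of the kind of fields to expect (the authoritative keys live in `models/baseline_prompting/configs/`, each documented with inline comments):

```yaml
# Hypothetical config sketch -- consult the shipped YAMLs for the real keys
api_key: "YOUR_API_KEY_HERE"      # placeholder; prefer environment variables
provider: openai                  # openai | openrouter
model: gpt-4o
input_file: data/fin_force.json
output_file: results/gpt4o.jsonl
few_shot: false
```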
A two-step pipeline: (1) extract the topic word of each headline and mask its noun phrases to de-identify it, then (2) generate counterfactuals from the masked input using `baseline_prompting` with `gpt4o_distil.yaml`.
Files:
- `cf_distillation.py`: Step 1, topic extraction and noun-phrase masking via OpenAI
- `cf_distillation_config.yaml`: Commented config for step 1 (API key placeholder, paths)
- `process_distill_results.py`: Post-processing utility for step 1 output
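To make step 1 concrete, here is a minimal sketch of noun-phrase masking with spaCy (the repo's script additionally performs topic extraction via OpenAI; the `mask_noun_phrases` helper and `[MASK]` token below are illustrative):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # see Installation below

def mask_noun_phrases(headline: str, mask: str = "[MASK]") -> str:
    """Replace every noun chunk in the headline with a mask token."""
    doc = nlp(headline)
    masked = headline
    # Replace longer chunks first so nested chunks resolve cleanly
    for chunk in sorted(doc.noun_chunks, key=lambda c: len(c.text), reverse=True):
        masked = masked.replace(chunk.text, mask)
    return masked

print(mask_noun_phrases("Tech stocks slide as bond yields climb"))
# -> e.g. "[MASK] slide as [MASK] climb"
```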
Usage:
```bash
export OPENAI_API_KEY=sk-...

# Step 1: produce masked headlines
python models/cf_distillation/cf_distillation.py

# Step 2: generate counterfactuals from masked input
python models/baseline_prompting/_inference.py --config models/baseline_prompting/configs/gpt4o_distil.yaml
```

`self_rewarding_llm/` implements an iterative self-improvement pipeline: SFT → generate → LLM-score → DPO → repeat. Adapted from the Self-Rewarding Language Models framework.
Pipeline:
| Script | Stage |
|---|---|
| `scripts/00_sft.py` | Supervised fine-tuning on `supplementary_data.json` |
| `scripts/01_gen_prompts.py` | Generate evaluation prompts |
| `scripts/02_gen_responses.py` | Generate candidate responses |
| `scripts/03_gen_scores.py` | Score responses with LLM-as-a-judge |
| `scripts/04_gen_preferences.py` | Build preference pairs |
| `scripts/05_dpo.py` | Direct Preference Optimization |
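As a rough illustration of stage 04, DPO preference pairs are typically formed by pairing a high-scoring response with a low-scoring one for the same prompt. A hypothetical sketch (field names are illustrative, not the repo's actual schema):

```python
from collections import defaultdict

def build_preference_pairs(scored):
    """scored: iterable of {"prompt": str, "response": str, "score": float}."""
    by_prompt = defaultdict(list)
    for item in scored:
        by_prompt[item["prompt"]].append(item)

    pairs = []
    for prompt, items in by_prompt.items():
        items.sort(key=lambda x: x["score"], reverse=True)
        best, worst = items[0], items[-1]
        if best["score"] > worst["score"]:  # skip ties: no preference signal
            pairs.append({"prompt": prompt,
                          "chosen": best["response"],
                          "rejected": worst["response"]})
    return pairs
```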
Other files:
- `scripts/configs/`: YAML configs for each stage
- `src/srlm/`: Core model, trainer, and inference modules
- `process/`: Data preprocessing utilities
- `my_llm_judge_prompts/`: LLM judge prompt templates
- `self-reward_init_model.sh` / `self-reward_iter_train.sh`: Full pipeline shell scripts
- `multi_self-reward_*.sh`: Multi-model variants
Usage:
```bash
bash models/self_rewarding_llm/self-reward_init_model.sh  # Initial SFT + iteration 1
bash models/self_rewarding_llm/self-reward_iter_train.sh  # Subsequent iterations
```

`lm-counterfactuals/` performs counterfactual generation using Gumbel-max sampling and model editing (MEMIT/ROME), adapted from the lm-counterfactuals framework.
Files:
- `run.py`: Main generation script
- `run_mimic.py` / `mimic.py`: MIMIC counterfactual method
- `sampling.py`: Gumbel sampling utilities
- `utils.py`: Shared utilities
- `analyze.py` / `analyze_edit.py`: Analysis and edit-distance tools
- `gumbel_config.yaml`: Generation configuration
- `example.ipynb`: Usage notebook
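For background, the Gumbel-max trick underlying this method reparameterizes categorical sampling: adding independent Gumbel(0, 1) noise to the logits and taking the argmax is equivalent to sampling from the softmax distribution, and fixing that noise lets the same "random" choices be replayed under an edited model. A minimal sketch (illustrative, not the repo's `sampling.py`):

```python
import torch

def gumbel_max_sample(logits: torch.Tensor, generator=None) -> int:
    """Sample a token index from softmax(logits) via the Gumbel-max trick."""
    uniform = torch.rand(logits.shape, generator=generator)
    gumbel_noise = -torch.log(-torch.log(uniform))  # Gumbel(0, 1) samples
    return int(torch.argmax(logits + gumbel_noise))

# Reusing the same seed replays identical noise, so output differences are
# attributable to the (edited) model rather than to sampling randomness
g = torch.Generator().manual_seed(0)
print(gumbel_max_sample(torch.tensor([2.0, 1.0, 0.5]), generator=g))
```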
Usage:
```bash
python models/lm-counterfactuals/run.py
```

Installation depends on the component you want to run:

```bash
# Evaluation scripts
pip install openai pydantic tqdm evaluate

# Baseline prompting & CF distillation (spaCy is used for noun-phrase masking)
pip install openai pydantic tqdm instructor spacy
python -m spacy download en_core_web_sm

# Self-Rewarding LLM pipeline
pip install -r models/self_rewarding_llm/requirements.txt

# lm-counterfactuals
pip install -r models/lm-counterfactuals/requirements.txt
```

API keys: never hardcode keys. Use environment variables when running the different models:

```bash
export OPENAI_API_KEY=sk-...
export OPENROUTER_API_KEY=sk-or-...
```
Keane Ong: Google Scholar | LinkedIn | Email | Website
If you use Fin-Force in your research, please cite:
```bibtex
@inproceedings{ong-etal-2025-deriving,
title = "Deriving Strategic Market Insights with Large Language Models: A Benchmark for Forward Counterfactual Generation",
author = "Ong, Keane and
Mao, Rui and
Varshney, Deeksha and
Liang, Paul Pu and
Cambria, Erik and
Mengaldo, Gianmarco",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-main.575/",
doi = "10.18653/v1/2025.emnlp-main.575",
pages = "11411--11434",
ISBN = "979-8-89176-332-6",
abstract = "Counterfactual reasoning typically involves considering alternatives to actual events. While often applied to understand past events, a distinct form{---}forward counterfactual reasoning{---}focuses on anticipating plausible future developments. This type of reasoning is invaluable in dynamic financial markets, where anticipating market developments can powerfully unveil potential risks and opportunities for stakeholders, guiding their decision-making. However, performing this at scale is challenging due to the cognitive demands involved, underscoring the need for automated solutions. Large Language Models (LLMs) offer promise, but remain unexplored for this application. To address this gap, we introduce a novel benchmark, Fin-Force{---}**FIN**ancial **FOR**ward **C**ounterfactual **E**valuation. By curating financial news headlines and providing structured evaluation, Fin-Force supports LLM based forward counterfactual generation. This paves the way for scalable and automated solutions for exploring and anticipating future market developments, thereby providing structured insights for decision-making. Through experiments on Fin-Force, we evaluate state-of-the-art LLMs and counterfactual generation methods, analyzing their limitations and proposing insights for future research."
}
```