This is the GitHub repository for the paper "ClimRetrieve: A Benchmarking Dataset for Information Retrieval from Corporate Climate Disclosures". Here, you will find the datasets created in the paper. The paper itself provides a detailed description of how the datasets were created. We provide two datasets and the code for the experiments.
- 10/03/2025: I have to thank Aida Usmanova and Rana Abdullah for their great help. They pointed out some errors in the base dataset, which I have corrected. Both datasets are now updated. Please don't hesitate to reach out if you encounter any issues with the new data.
The first product of the paper is the expert-annotated dataset. In this dataset, three expert labelers have walked through reports (in the "Reports" folder) and searched for information corresponding to climate change questions.
As a result, they created the "ClimRetrieve_base" dataset with the following form:
- Document: Report under investigation.
- Question: Question under investigation.
- Relevant: Question-relevant information in full-sentence form.
- Context: Context of the question-relevant information (extending the relevant information by one sentence before and one after).
- Page: Page of the relevant information.
- Source From: Indicates whether the relevant information comes from text, a table, or a graph.
- Source Relevance Score: Classifies from 1-3 how relevant the information is for answering the question (see Appendix E for details on the relevance classification).
- Unsure Flag: Flag whether it is unclear if this source is question-relevant.
- Addressed directly: Flag whether the relevant information addresses the question directly or indirectly.
- Answer: Answer to the question based on all retrieved relevant information.
- Core 16 Question: Indicator variable that signals whether the question is among the core 16 questions. These are the main questions under investigation. Other questions have also been answered and might be of use to researchers (hence included here).
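As a minimal sketch of working with this schema (the rows below are toy examples mirroring the column names listed above, not real data; load the actual "ClimRetrieve_base" file from this repository instead), the core-question subset can be selected like this:

```python
import pandas as pd

# Toy rows mirroring the ClimRetrieve_base schema (illustration only).
df = pd.DataFrame([
    {"Document": "report_a.pdf", "Question": "What are the climate targets?",
     "Source Relevance Score": 3, "Core 16 Question": 1},
    {"Document": "report_a.pdf", "Question": "An auxiliary question",
     "Source Relevance Score": 1, "Core 16 Question": 0},
])

# Restrict to the core 16 questions and to highly relevant sources.
core = df[(df["Core 16 Question"] == 1) & (df["Source Relevance Score"] >= 2)]
print(len(core))
```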
Furthermore, each of the core 16 questions under investigation was defined by the labelers.
As a result, we created the "Core_Questions_with_Explanations" dataset with the following columns:
- question: Question under investigation.
- Definition: Initial labeler definition.
- Refined Definition: Refined definition after labeling.
- Raw Information Retrieval Background: Concepts that the question contained (see paper Appendix A).
- Information Retrieval Background: Same as above, but concepts are classified as Core (Co) or Latent (La) concepts, and further as very broad (C1), concrete (C2), or very concrete (C3) concepts.
- Answer Definition Help: Help to answer the question.
As mentioned in the paper, these concepts and definitions are personal aids for the labelers rather than general truths about the question. Thus, they shed light on the labelers' mental models without capturing them entirely. Rather, this represents a first step toward making labeler reasoning transparent.
The "ClimRetrieve_base" data contains relevant sources extracted from reports. The report-level dataset results from parsing each entire report and searching it for this content (see Appendix G in the paper).
The "ClimRetrieve_ReportLevel" dataset has the following form:
- paragraph: Paragraph of a given report.
- report: Report under investigation.
- question: Question that was investigated for the report.
- relevant_text: Potentially relevant text for the question.
- relevance: Relevance label from 0-3. Classifies how relevant the information is for answering the question (see Appendix E for details on the relevance classification).
- relevance_text_sim: Similarity of the relevant_text to the paragraph. Values range from 0 (not at all similar) to 1 (identical).
- sim_text_relevance: Relevance label of the most similar relevant_text. This only has explanatory power if the relevant_text and the paragraph are very similar (i.e., they are essentially the same text). In the paper, we treat texts with a similarity above 0.9 as identical; in that case, relevance is set to sim_text_relevance, otherwise relevance is set to zero.
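The similarity-threshold rule for sim_text_relevance can be sketched as follows (the function name `merged_relevance` is ours for illustration, not an identifier from the repository's code):

```python
def merged_relevance(relevance_text_sim, sim_text_relevance, threshold=0.9):
    """Adopt the relevance label of the most similar relevant_text only
    when the paragraph is (near-)identical to it; otherwise label it 0."""
    return sim_text_relevance if relevance_text_sim >= threshold else 0

print(merged_relevance(0.95, 3))  # near-identical paragraph -> 3
print(merged_relevance(0.50, 3))  # too dissimilar -> 0
```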
You will find the code for creating this dataset in the "Report-Level Dataset" folder. A demo run has already been implemented. When using the code, familiarize yourself with the implementation details (e.g., the selected thresholds).
To set up the environment, install the required dependencies:
```
pip install -r requirements.txt
```

Note: After installing spacy, you'll need to download the English language model:

```
python -m spacy download en_core_web_sm
```

We follow three steps to execute the experiments:
- Create the new information retrieval background for the embedding search (see "create_embedding_search_queries.py" and folder "Embedding_Search_Queries").
- Search for the top-k paragraphs with different information retrieval strategies (see "create_embedding_search_results.py" and folder "Embedding_Search_Results").
- Aggregate model results and create a visualization (see "analyse_results.py").
To analyze the coverage and matching between the base dataset and search results, use the analyze_data_coverage.py tool:
```
# Analyze a single search result file
python analyze_data_coverage.py --search-file ./Embedding_Search_Results/text-embedding-3-small__5.csv

# Analyze all search result files in a directory
python analyze_data_coverage.py --search-dir ./Embedding_Search_Results --output coverage_summary.csv
```

This tool provides:
- Match rates between base dataset and search results
- Analysis of unmatched rows and why they can't be matched
- Breakdown by report and question
- Identification of reports/paragraphs not in the base dataset
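To illustrate what a match rate means here (a minimal sketch with toy data; the actual matching logic lives in analyze_data_coverage.py and may differ in its key columns and normalization):

```python
import pandas as pd

# Toy base dataset and search results keyed by report and paragraph
# (illustration only; column names are assumptions).
base = pd.DataFrame({"report": ["a", "a", "b"], "paragraph": ["p1", "p2", "p3"]})
search = pd.DataFrame({"report": ["a", "b"], "paragraph": ["p1", "p3"]})

# Left-merge with an indicator column to see which base rows were found.
merged = base.merge(search, on=["report", "paragraph"], how="left", indicator=True)
match_rate = (merged["_merge"] == "both").mean()
unmatched = merged[merged["_merge"] == "left_only"]
print(round(match_rate, 3))
```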
The repository includes comprehensive unit tests for the evaluation and analysis scripts. To run the tests:
```
# Install test dependencies (if not already installed)
pip install "pytest>=7.0.0"

# Run all tests
cd Experiments
pytest tests/ -v

# Run specific test file
pytest tests/test_analyse_results.py -v
pytest tests/test_analyze_data_coverage.py -v

# Run with coverage report
pytest tests/ --cov=. --cov-report=html
```

The test suite includes:
- 22 tests covering core functionality
- Unit tests for individual functions (findlen, allMetrics_allThreholds, normalize_report_path, etc.)
- Integration tests for data processing workflows
- Edge case handling (empty files, unmatched rows, etc.)
All tests use small, real data files to ensure realistic testing scenarios.