This is the GitHub repository for the paper "ClimRetrieve: A Benchmarking Dataset for Information Retrieval from Corporate Climate Disclosures". Here, you will find the datasets created in the paper. The paper itself provides a detailed description of how the datasets were created. We provide two datasets and the code for the experiments.
- 10/03/2025: I have to thank Aida Usmanova and Rana Abdullah for their great help. They pointed out some errors in the base dataset, which I have corrected. Both datasets are now updated. Please don't hesitate to reach out if you encounter any issues with the new data.
The first product of the paper is the expert-annotated dataset. In this dataset, three expert labelers have walked through reports (in the "Reports" folder) and searched for information corresponding to climate change questions.
As a result, they created the "ClimRetrieve_base" dataset with the following form:
- Document: Report under investigation.
- Question: Question under investigation.
- Relevant: Question-relevant information in full-sentence form.
- Context: Context of the question-relevant information (extending the relevant information by one sentence before and one after).
- Page: Page of the relevant information.
- Source From: Indicates whether the relevant information comes from text, a table, or a graph.
- Source Relevance Score: Classifies from 1-3 how relevant the information is for answering the question (see Appendix E for details on the relevance classification).
- Unsure Flag: Flag whether it is unclear if this source is question-relevant.
- Addressed directly: Flag whether the relevant information addresses the question directly or indirectly.
- Answer: Answer to the question based on all retrieved relevant information.
- Core 16 Question: Indicator variable that signals whether the question is among the core 16 questions. These are the main questions under investigation. Other questions have also been answered and might be of use to researchers (hence included here).
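As a minimal sketch of working with this schema (the rows below are toy examples mirroring the column names listed above, not real data; load the actual "ClimRetrieve_base" file from this repository instead), the core-question subset can be selected like this:

```python
import pandas as pd

# Toy rows mirroring the ClimRetrieve_base schema (illustration only).
df = pd.DataFrame([
    {"Document": "report_a.pdf", "Question": "What are the climate targets?",
     "Source Relevance Score": 3, "Core 16 Question": 1},
    {"Document": "report_a.pdf", "Question": "An auxiliary question",
     "Source Relevance Score": 1, "Core 16 Question": 0},
])

# Restrict to the core 16 questions and to highly relevant sources.
core = df[(df["Core 16 Question"] == 1) & (df["Source Relevance Score"] >= 2)]
print(len(core))
```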
Furthermore, each of the core 16 questions under investigation was defined by the labelers.
As a result, we created the "Core_Questions_with_Explanations" dataset with the following columns:
- question: Question under investigation.
- Definition: Initial labeler definition.
- Refined Definition: Refined definition after labeling.
- Raw Information Retrieval Background: Concepts that the question contained (see paper Appendix A).
- Information Retrieval Background: Same as above, but concepts are classified as Core (Co) or Latent (La) concepts, and further as very broad (C1), concrete (C2), or very concrete (C3) concepts.
- Answer Definition Help: Help to answer the question.
As mentioned in the paper, these concepts and definitions are personal aids for the labelers rather than general truths about the question. Thus, they shed light on the labelers' mental models without capturing them entirely. Rather, this represents a first step toward making labeler reasoning transparent.
The "ClimRetrieve_base" data contains relevant sources extracted from reports. The report-level dataset results from parsing each entire report and searching it for this content (see Appendix G in the paper).
The "ClimRetrieve_ReportLevel" dataset has the following form:
- paragraph: Paragraph of a given report.
- report: Report under investigation.
- question: Question that was investigated for the report.
- relevant_text: Potentially relevant text for the question.
- relevance: Relevance label from 0-3. Classifies how relevant the information is for answering the question (see Appendix E for details on the relevance classification).
- relevance_text_sim: Similarity of the relevant_text to the paragraph. Values range from 0 (not at all similar) to 1 (identical).
- sim_text_relevance: Relevance label of the most similar relevant_text. This only has explanatory power if the relevant_text and the paragraph are very similar (i.e., they are essentially the same text). In the paper, we treat texts with a similarity above 0.9 as identical; in that case, relevance is set to sim_text_relevance, otherwise relevance is set to zero.
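The similarity-threshold rule for sim_text_relevance can be sketched as follows (the function name `merged_relevance` is ours for illustration, not an identifier from the repository's code):

```python
def merged_relevance(relevance_text_sim, sim_text_relevance, threshold=0.9):
    """Adopt the relevance label of the most similar relevant_text only
    when the paragraph is (near-)identical to it; otherwise label it 0."""
    return sim_text_relevance if relevance_text_sim >= threshold else 0

print(merged_relevance(0.95, 3))  # near-identical paragraph -> 3
print(merged_relevance(0.50, 3))  # too dissimilar -> 0
```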
You will find the code for creating this dataset in the "Report-Level Dataset" folder. A demo run has already been implemented. When using the code, familiarize yourself with the implementation details (e.g., the selected thresholds).
To set up the environment, install the required dependencies:
```
pip install -r requirements.txt
```

Note: After installing spacy, you'll need to download the English language model:

```
python -m spacy download en_core_web_sm
```

We follow three steps to execute the experiments:
- Create the new information retrieval background for the embedding search (see "create_embedding_search_queries.py" and folder "Embedding_Search_Queries").
- Search for the top-k paragraphs with different information retrieval strategies (see "create_embedding_search_results.py" and folder "Embedding_Search_Results").
- Aggregate model results and create a visualization (see "analyse_results.py").
To analyze the coverage and matching between the base dataset and search results, use the analyze_data_coverage.py tool:
```
# Analyze a single search result file
python analyze_data_coverage.py --search-file ./Embedding_Search_Results/text-embedding-3-small__5.csv

# Analyze all search result files in a directory
python analyze_data_coverage.py --search-dir ./Embedding_Search_Results --output coverage_summary.csv
```

This tool provides:
- Match rates between base dataset and search results
- Analysis of unmatched rows and why they can't be matched
- Breakdown by report and question
- Identification of reports/paragraphs not in the base dataset
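To illustrate what a match rate means here (a minimal sketch with toy data; the actual matching logic lives in analyze_data_coverage.py and may differ in its key columns and normalization):

```python
import pandas as pd

# Toy base dataset and search results keyed by report and paragraph
# (illustration only; column names are assumptions).
base = pd.DataFrame({"report": ["a", "a", "b"], "paragraph": ["p1", "p2", "p3"]})
search = pd.DataFrame({"report": ["a", "b"], "paragraph": ["p1", "p3"]})

# Left-merge with an indicator column to see which base rows were found.
merged = base.merge(search, on=["report", "paragraph"], how="left", indicator=True)
match_rate = (merged["_merge"] == "both").mean()
unmatched = merged[merged["_merge"] == "left_only"]
print(round(match_rate, 3))
```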
The repository includes comprehensive unit tests for the evaluation and analysis scripts. To run the tests:
```
# Install test dependencies (if not already installed)
pip install "pytest>=7.0.0"

# Run all tests
cd Experiments
pytest tests/ -v

# Run specific test file
pytest tests/test_analyse_results.py -v
pytest tests/test_analyze_data_coverage.py -v

# Run with coverage report
pytest tests/ --cov=. --cov-report=html
```

The test suite includes:
- 22 tests covering core functionality
- Unit tests for individual functions (findlen, allMetrics_allThreholds, normalize_report_path, etc.)
- Integration tests for data processing workflows
- Edge case handling (empty files, unmatched rows, etc.)
All tests use small, real data files to ensure realistic testing scenarios.