Replication package for the paper:
"Theory Building from Data Strategy Studies: Aggregating Evidence on Model Quantization in Deep Learning Systems" submitted to the Empirical Software Engineering Journal.
This replication package consists of the following components:
Data:
- Raw, external, interim, and processed data are stored in the data directory.
Source Code:
- Located in the src directory, it includes scripts for data processing, analysis, and evidence extraction.
- Key modules:
- data/papers/entities.py & data/papers/knowledge_extraction.py: Define the structure and data extraction logic for the papers analyzed.
- data/download.py: Downloads the list of papers from arXiv and merges them with the Scopus list.
- data/selection/llm.py: Implements logic for selecting studies using Gemini 3.0 Flash.
Jupyter Notebooks:
- Located in the notebooks directory, these notebooks contain the analysis and visualization of the data.
- Notebooks include:
- 1.0-llm-promt-refinement.ipynb: Refines the prompt for the LLMs and the selection of the LLM to use.
- 2.0-model-quantization-paper-selection.ipynb: Filters the raw list of papers using the selected model, Gemini 3.0 Flash.
- 3.0-final-selection-analysis.ipynb: Analyzes the final selection of papers.
- 4.0-paper-metadata-analysis.ipynb: Analyzes metadata from selected papers.
- 5.0-evidence-analysis.ipynb: Analyzes evidence extracted from the papers and generates the forest plot.
Documentation:
- data/processed/evidence-diagrams-mapping.md: Links to evidence diagrams generated during the study.
- data/processed/{paperkey}/metadata.json: Contains metadata for the specific paper.
- data/processed/{paperkey}/systematic-studies-quality-evaluation.md: Contains the filled quality evaluation form for the specific paper.
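As an illustration of how these per-paper files might be read programmatically, here is a minimal Python sketch. The helper name, the "example-paper" key, and the "title" field are invented placeholders; only the data/processed/{paperkey}/metadata.json layout is taken from this package.

```python
import json
import tempfile
from pathlib import Path

def load_paper_metadata(processed_dir: str) -> dict:
    """Collect metadata.json from every {paperkey} subdirectory."""
    metadata = {}
    for meta_file in Path(processed_dir).glob("*/metadata.json"):
        # The parent directory name is the paper key.
        metadata[meta_file.parent.name] = json.loads(meta_file.read_text())
    return metadata

# Demo on a temporary layout mimicking data/processed/ (field names invented):
with tempfile.TemporaryDirectory() as tmp:
    paper_dir = Path(tmp) / "example-paper"
    paper_dir.mkdir()
    (paper_dir / "metadata.json").write_text('{"title": "Example"}')
    print(load_paper_metadata(tmp))  # {'example-paper': {'title': 'Example'}}
```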
The project is organized as follows:
├── data/
│ ├── raw/ <- Contains the original list of papers retrieved from Scopus
│ ├── external/ <- Contains the raw data obtained from the selected papers
│ ├── interim/ <- Contains the interim data used in the analysis
│ └── processed/ <- Contains the processed data used in the analysis
│ └── evidence-diagrams-mapping.md <- Contains links to the evidence diagrams
├── notebooks/
│ ├── 1.0-llm-promt-refinement.ipynb
│ ├── 2.0-model-quantization-paper-selection.ipynb
│ ├── 3.0-second-selection-analysis.ipynb
│ ├── 4.0-paper-metadata-analysis.ipynb
│ └── 5.0-evidence-analysis.ipynb
├── reports/
│ └── figures/
├── src/
│ ├── data/
│ │ ├── papers/ <- Contains the logic for extracting and analyzing data from papers
│ │ │ ├── entities.py
│ │ │ └── knowledge_extraction.py
│ │ ├── download.py
│ │ └── selection/ <- Utility functions for selecting studies using LLMs, including the prompt
│ │ └── llm.py
│ ├── forestplot/ <- Utility functions for generating the forest plot
│ ├── effect_intensity.py <- Definition of the effect intensity thresholds
│ ├── run_evidence_extraction.py
│ └── config.py
├── .pre-commit-config.yaml
├── dot-env-template <- Template for environment variables
├── requirements.txt <- List of Python dependencies
├── uv.lock <- Environment lock file
├── LICENSE
├── pyproject.toml <- Project configuration file
└── README.md
Setup:
Clone the repository:
git clone <repository-url>
cd green-tactics-synthesis
Install dependencies:
The project is managed with uv. To install the dependencies, run:
uv sync
Alternatively, you can use pip to install the dependencies listed in requirements.txt:
pip install -r requirements.txt
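If you run the LLM-dependent steps outside Docker, the environment variables described in dot-env-template must be available. A minimal sketch, assuming only the GEMINI_API_KEY variable name used in the Docker instructions (the touch line is a stand-in so the snippet is self-contained; in a real clone, dot-env-template already exists):

```shell
touch dot-env-template                    # stand-in; the real template ships with the repo
cp dot-env-template .env                  # create a local environment file from the template
echo "GEMINI_API_KEY=your_key" >> .env    # add your own API key
```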
Using Docker (recommended for reproducibility):
A pre-built Docker image is available on Docker Hub:
docker pull santidr/model-quantization-aggregation
Run the container with Jupyter Lab:
docker run -it -p 8888:8888 santidr/model-quantization-aggregation
To use LLM features (paper selection), pass your API key:
docker run -it -p 8888:8888 \
  -e GEMINI_API_KEY=your_key \
  santidr/model-quantization-aggregation
To persist data changes, mount local directories:
docker run -it -p 8888:8888 \
  -v $(pwd)/data:/app/data \
  -v $(pwd)/reports:/app/reports \
  santidr/model-quantization-aggregation
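The two invocations above can also be combined into a single command; as a sketch, using the same image, variable name, and mount points shown above:

```shell
docker run -it -p 8888:8888 \
  -e GEMINI_API_KEY=your_key \
  -v $(pwd)/data:/app/data \
  -v $(pwd)/reports:/app/reports \
  santidr/model-quantization-aggregation
```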
Getting the Data:
Run the download script to fetch the list of papers from arXiv and merge it with the Scopus list:
python src/data/download.py
We do not provide the raw data from the selected papers, to avoid potential copyright issues. However, each paper's README file, located in the data/external directory, provides instructions on how to obtain its data.
Extracting the evidence:
- Use the run_evidence_extraction.py module to extract the evidence from the selected papers.
Explore the data with Jupyter Notebooks:
- Open the Jupyter notebooks in the notebooks directory to explore the data and analysis.
- Ensure all required data is placed in the appropriate directories.
- For any issues or questions, please contact the authors of the paper.
This project is licensed under the Apache 2.0 License. See the LICENSE file for details.