Welcome! This repository contains the source code and experimental results for our paper, "A Causal Perspective on Measuring, Explaining and Mitigating Smells in LLM-Generated Code".
Our work investigates the prevalence of code smells in code generated by large language models (LLMs), provides insights into their causes, and explores strategies for mitigation. Here you will find all materials necessary to reproduce our analyses and findings.
To use a GPU for predictions, ensure you have PyTorch installed and a compatible GPU available.
Check your GPU setup with:
```bash
!nvidia-smi
```

Example output:

```text
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 12.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  Off  | 00000000:61:00.0 Off |                    0 |
| N/A   33C    P0    34W / 250W |      4MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
```
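You can also verify from Python that PyTorch detects the GPU. A minimal check (assuming a standard PyTorch installation):

```python
import torch

# Verify that PyTorch can see a CUDA-capable GPU before running predictions.
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
    device = torch.device("cpu")
    print("No compatible GPU found; falling back to CPU.")
```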
- Create a virtual environment (using conda, mamba, or virtualenv) and activate it:

  ```bash
  mamba create -n code-smell-env
  conda activate code-smell-env
  ```

- Navigate to the project base path:

  ```bash
  cd CodeSmells
  ```

- Install dependencies:

  ```bash
  pip install .
  ```

- Compile and install the code_smell_lib using nbdev:

  ```bash
  cd code_smell_lib
  nbdev_export
  pip install .
  ```
Note: Some dependencies may not be included in `requirements.txt`. Install them manually if prompted.
- The dataset folder contains the datasets (`CodeSmellData`) used for our experiments.
- The notebooks folder contains Jupyter notebooks for each research question.
| Research Question | Subfolder | Notebook File | Function/Explanation |
|---|---|---|---|
| RQ1 (Measure) | baseline | 01_information_gain.ipynb | Computes information gain for baseline model comparisons. |
| RQ1 (Measure) | robustness_generation | 05_1_data_engineering-CodeLlama.ipynb | Processes logits for robustness generation experiments (CodeLlama model). |
| RQ1 (Measure) | robustness_generation | 06_alignment_and_aggregation.ipynb | Aggregates and aligns logits for robustness generation experiments. |
| RQ1 (Measure) | robustness_transformations | 04_1_data_engineering-CodeLlama.ipynb | Processes logits for robustness transformation experiments (CodeLlama model). |
| RQ1 (Measure) | robustness_transformations | 05_alignment_and_aggregation.ipynb | Aggregates and aligns logits for robustness transformation experiments. |
| RQ2 (Explain) | causal_analysis | 01_dataset_preprocessing.ipynb | Preprocesses datasets for causal analysis experiments. |
| RQ2 (Explain) | causal_analysis | 02_causal_analysis.ipynb | Performs causal analysis to identify relationships between code smells and model outputs. |
| RQ2 (Explain) | causal_analysis | 03_result_analysis.ipynb | Analyzes and visualizes results from causal analysis. |
| RQ2 (Explain) | prompting | 04_alignment_and_aggregation.ipynb | Aggregates and aligns logits for PSC computation in prompting experiments. |
| RQ3 (Mitigation) | mitigation | 01_dataset_preparation.ipynb | Prepares datasets for mitigation experiments. |
| RQ3 (Mitigation) | mitigation | 02_extractor_CausalLM.ipynb | Extracts logits from CausalLM models for mitigation analysis. |
| RQ3 (Mitigation) | mitigation | 03_2_data_engineering-CausalLM.ipynb | Processes extracted logits for mitigation experiments. |
| RQ3 (Mitigation) | mitigation | 04_alignment_and_aggregation.ipynb | Aggregates and aligns logits to compute Propensity Smelly Score (PSC) for mitigation. |
| RQ3 (Mitigation) | mitigation | 05_analysis.ipynb | Performs statistical analysis and visualization for mitigation results. |
| RQ4 (Pipeline) | pipeline | 01_extractor_CausalLM.ipynb | Extracts logits from CausalLM models for pipeline experiments. |
| RQ4 (Pipeline) | pipeline | 02_2_data_engineering-CausalLM.ipynb | Processes logits for pipeline experiments. |
| RQ4 (Pipeline) | pipeline | 03_alignment_and_aggregation.ipynb | Aggregates and aligns logits for PSC computation in pipeline experiments. |
| RQ4 (Survey) | survey | result_analysis.ipynb | Analyzes survey results related to code smell perceptions. |
| All RQs | extension | models.md | Documents the models used in all experiments. |
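Several notebooks above align and aggregate model logits into per-token probabilities before computing the Propensity Smelly Score (PSC). The exact computation is defined in the notebooks themselves; the sketch below only illustrates the general idea of extracting the probability of the actually generated token and the maximum probability over the vocabulary from a logit matrix (the function name and tensor shapes are assumptions for illustration, not the repository's API):

```python
import torch
import torch.nn.functional as F

def token_probabilities(logits: torch.Tensor, generated_ids: torch.Tensor):
    """Illustrative helper: convert per-step logits into token probabilities.

    logits:        (seq_len, vocab_size) raw scores for each generated position
    generated_ids: (seq_len,) ids of the tokens the model actually produced
    """
    probs = F.softmax(logits, dim=-1)  # normalize scores over the vocabulary
    # Probability assigned to the token that was actually generated at each step.
    actual_prob = probs.gather(1, generated_ids.unsqueeze(1)).squeeze(1)
    # Highest probability over the whole vocabulary at each step.
    max_prob = probs.max(dim=-1).values
    return actual_prob, max_prob
```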
- baseline: Notebooks for baseline model comparisons and metrics.
- causal_analysis: Notebooks for dataset preprocessing, causal analysis, and result visualization.
- mitigation: Notebooks for preparing data, extracting logits, engineering features, aggregating results, and analyzing mitigation strategies.
- pipeline: Notebooks for extracting, processing, and aggregating logits in pipeline experiments.
- prompting: Notebooks for aggregation and analysis in prompting experiments.
- robustness_generation: Notebooks for robustness generation experiments, including data engineering and aggregation.
- robustness_transformations: Notebooks for robustness transformation experiments, including data engineering and aggregation.
- survey: Notebooks for analyzing survey data on code smell perceptions.
This folder contains the nbdev source notebooks that implement and document the main components of the code_smell_lib library. Each notebook focuses on a specific aspect of code smell detection and analysis:
- 00_ast_utils.ipynb: Utility functions for parsing and analyzing Python Abstract Syntax Trees (ASTs), essential for identifying structural code smells.
- 01_pos_tagging.ipynb: Implements part-of-speech tagging for code tokens, supporting advanced code analysis and smell detection.
- 02_smell_detectors.ipynb: Core logic for detecting various code smells, including heuristics and rule-based approaches.
- 03_metrics.ipynb: Defines metrics for quantifying code smells and evaluating code quality.
- 04_data_processing.ipynb: Functions for loading, cleaning, and transforming code datasets used in experiments.
- 05_visualization.ipynb: Tools for visualizing code smell distributions and analysis results.
Each notebook is designed for literate programming: code cells implement functionality, while markdown cells explain usage and design decisions. The code is exported to Python modules using nbdev_export, ensuring that documentation and implementation remain synchronized.
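For readers unfamiliar with nbdev, the pattern looks roughly as follows: a `#| default_exp` directive names the target module, and every cell marked `#| export` is written to that module when `nbdev_export` runs. The cell below is an illustrative sketch (the module and function names are hypothetical, not taken from the actual notebooks):

```python
#| default_exp ast_utils
# Directive cell: declares that exported cells in this notebook are written to
# the library module `ast_utils` when `nbdev_export` is run.

#| export
import ast

def count_function_defs(source: str) -> int:
    """Count function definitions (including nested ones) in a Python source string."""
    tree = ast.parse(source)
    return sum(isinstance(node, ast.FunctionDef) for node in ast.walk(tree))
```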
The scripts folder contains shell-executable versions of the main analysis pipelines found in the notebooks. Each subfolder corresponds to a research question or experimental setting (e.g., causal_analysis, mitigation, pipeline, prompting, robustness_generation, robustness_transformations) and includes scripts and configuration files tailored for batch execution.
Key points:
- Purpose: These scripts automate the execution of data preprocessing, model inference, code smell detection, and result aggregation steps, making it easy to reproduce experiments from the command line.
- Structure: The folder structure mirrors the organization of the main notebooks, with each subfolder containing the relevant Python scripts and shell scripts (`run_script.sh`) for its experimental stage.
- Usage: To run an experiment, navigate to the desired subfolder and execute the provided shell script. Logs and outputs are saved in dedicated directories for easy inspection.
This setup enables reproducible, large-scale experiments and is ideal for running analyses on remote servers or clusters.
The survey folder contains materials and analysis related to the user study conducted for our research. This study investigates human perceptions of code smells in generated code.
- control.pdf and treatment.pdf: These files contain the survey forms shown to participants. The control version presents code snippets without explicit code smell annotations, while the treatment version includes additional information or highlighting related to code smells.
- result_analysis.ipynb: This notebook performs statistical analysis of the survey responses. It processes Likert-scale ratings from participants, compares control and treatment groups, and applies significance tests (e.g., Mann-Whitney U) to assess the impact of code smell annotations on user judgments.
Use this folder to reproduce the survey analysis and explore how code smell explanations affect developer perceptions.
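As a rough illustration of the group comparison performed in result_analysis.ipynb, a Mann-Whitney U test on Likert-scale ratings can be run as sketched below (the ratings shown are made-up placeholders, not actual survey data):

```python
from scipy.stats import mannwhitneyu

# Hypothetical Likert-scale ratings (1-5) from the two survey groups.
control_ratings = [3, 4, 2, 3, 5, 4, 3, 2]
treatment_ratings = [4, 5, 4, 3, 5, 5, 4, 4]

# Two-sided Mann-Whitney U test: do the two groups' rating distributions differ?
statistic, p_value = mannwhitneyu(control_ratings, treatment_ratings, alternative="two-sided")
print(f"U = {statistic:.1f}, p = {p_value:.3f}")
```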
The causation plots folder contains additional visualizations that were excluded from the main paper but provide valuable insights into the robustness of Propensity Smelly Score (PSC) for each type of code smell. These plots illustrate how PSC and related metrics behave under various robustness experiments, helping to further understand the stability and reliability of code smell detection across different scenarios.
- The plots are organized by robustness experiment and include both mean and median statistics for actual and maximum probabilities associated with code smells.
- You will find visualizations such as `code_smell_actual_prob_mean`, `code_smell_actual_prob_median`, `code_smell_max_prob_mean`, and `code_smell_max_prob_median` in PDF and PNG formats.
- These results can be used to explore the sensitivity of PSC to code transformations and generation settings, offering deeper context for interpreting the main findings.
Researchers interested in the detailed behavior of PSC under robustness conditions can refer to these plots for supplementary analysis.
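For orientation, the mean and median statistics shown in these plots correspond to a straightforward aggregation over per-sample probabilities. The sketch below assumes a hypothetical table with columns code_smell, actual_prob, and max_prob (the column names and values are illustrative, not the repository's actual schema):

```python
import pandas as pd

# Hypothetical per-sample probabilities, one row per generated code snippet.
df = pd.DataFrame({
    "code_smell":  ["long_method", "long_method", "magic_number", "magic_number"],
    "actual_prob": [0.62, 0.71, 0.35, 0.41],
    "max_prob":    [0.80, 0.88, 0.55, 0.60],
})

# Mean and median of actual and maximum probabilities per code smell,
# mirroring the statistics reported in the causation plots.
summary = df.groupby("code_smell")[["actual_prob", "max_prob"]].agg(["mean", "median"])
print(summary)
```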