A system for grading commit messages and generating actionable feedback to improve adherence to best practices in commit writing. This repository includes tools for automated grading and feedback generation using GPT.
- Introduction
- Features
- Getting Started
- Usage
- Functions
- Datasets
- Grading Guidelines
- Contributing
- License
- Acknowledgements
Developed as part of the Machine Learning (CS-433) course project, this work automates the evaluation of commit messages and provides constructive feedback. In collaboration with the Dependable Systems Lab at EPFL, led by Professor George Candea, we designed this system for the SWENT (CS-311) course, where students write commit messages as part of their software development projects.
The system evaluates the quality of commit messages based on established best practices, such as conventional commit guidelines, and generates actionable feedback to help students improve their writing. By leveraging BERT-based models for grading and GPT-based models for feedback generation, the system promotes better version control practices and enhances the learning experience for students.
- Grading of commit messages based on clarity, structure, and adherence to best practices.
- Dynamic feedback generation using GPT-4o with structured prompts.
- Hyperparameter tuning for optimizing model performance.
- Support for SWENT-style and conventional commit standards.
- Comprehensive evaluation metrics (see Metrics.ipynb for more details).
- Python 3.8 or above
- Required Python libraries listed in requirements.txt
- Clone the repository:
git clone https://github.com/your-repo-name.git
cd your-repo-name
- Install dependencies:
pip install -r requirements.txt
Warning
To download the SpaCy model used for parsing and checking imperative verbs, run:
python -m spacy download en_core_web_sm
- Obtain the OpenAI API key: The code requires an API key to generate feedback using GPT. To request my personal API key, contact me at albert.fares@epfl.ch.
Warning
- Usage Guidelines:
- Please use the key responsibly, as it is tied to my personal account and has limited funds.
- Avoid excessive testing to conserve the available funds.
- If any issues arise or additional funds are needed, feel free to reach out, and I will assist you accordingly.
You can grade a single commit message and generate feedback using the process_single_commit function.
from helpers import process_single_commit
model_name_or_path = "albertfares/CommitGrader"
commit_message = "feat: add user authentication"
result = process_single_commit(commit_message, model_name_or_path)
To process multiple commit messages grouped by SCIPER, generate feedback, and calculate average grades, use the process_commits_and_generate_feedback_with_sciper function.
import openai
from helpers import process_commits_and_generate_feedback_with_sciper

input_json_path = "testing_data.json"
output_json_path = "output_feedback.json"
model_name_or_path = "albertfares/CommitGrader"

# Set your API key here
openai.api_key = ""

# Process a subset of SCIPERs
results = process_commits_and_generate_feedback_with_sciper(input_json_path, output_json_path, model_name_or_path, start=0, end=2)

# Process all SCIPERs
results = process_commits_and_generate_feedback_with_sciper(input_json_path, output_json_path, model_name_or_path)
The input JSON file should be structured as follows, where each SCIPER ID maps to a list of commit objects, each containing a hash and a commit message:
{
"123456": [
{
"hash": "675f7201cce20ae7f757324909b9a87ab474873c",
"commit_message": "test: Add ui tests for the map screen"
},
{
"hash": "117406f8b77ef98ce6b8e1fe89bed3c8a816eafe",
"commit_message": "test: Add unit tests for ListToDosViewModel and ToDosRepositoryFirestore"
}
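]
}
This structure is essential for the process_commits_and_generate_feedback_with_sciper function to process the data correctly.
The output JSON file maps each SCIPER ID to its graded commits, with feedback for each commit and an average grade per SCIPER: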
"123456": {
"commits": [
{
"hash": "675f7201cce20ae7f757324909b9a87ab474873c",
"commit_message": "test: Add ui tests for the map screen",
"grade": 4.5,
"feedback": "The commit message should start the description with a lowercase letter to align with conventional commit message guidelines."
},
{
"hash": "117406f8b77ef98ce6b8e1fe89bed3c8a816eafe",
"commit_message": "test: Add unit tests for ListToDosViewModel and ToDosRepositoryFirestore",
"grade": 4.0,
"feedback": "The identified issue is that the commit message description is too long and should be concise, ideally under 50 characters, to improve readability. Additionally, the description starts with an uppercase letter, whereas conventional commit messages typically start with a lowercase letter."
}
],
"average_grade": 4.17
},
train_bert_for_grading(json_path, output_dir="./bert_grade_model", num_epochs=3, batch_size=16, learning_rate=2e-5, weight_decay=0.0)
- Description: Trains a BERT model for commit message grading using a labeled dataset.
- Inputs:
- json_path (str): Path to the JSON file containing commit_message and grade fields.
- output_dir (str): Path to save the trained model and tokenizer.
- num_epochs (int): Number of training epochs.
- batch_size (int): Batch size for training.
- learning_rate (float): Learning rate for the optimizer.
- weight_decay (float): Weight decay for regularization.
- Outputs:
- Saves the trained BERT model and tokenizer in the specified output_dir.
- Description: Extracts the prefix, description, and body from a commit message.
- Inputs:
- commit_message (str): The full commit message.
- Outputs: A dictionary containing:
- prefix (str): Extracted prefix if valid.
- description (str): Message description after the prefix.
- body (str): The body of the commit message.
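The split described above can be sketched with a regular expression. This is a minimal illustration, not the repository's implementation: the name `split_commit_message`, the `type(scope)!: description` pattern, and the choice to treat everything after the first line as the body are all assumptions. Unlike the documented function, it does not check whether the prefix is valid.

```python
import re

# Assumed shape of a prefixed subject line: "type(scope)!: description".
PREFIX_RE = re.compile(
    r"^(?P<prefix>[A-Za-z]+)(?:\([^)]*\))?!?:\s*(?P<description>.*)$"
)


def split_commit_message(commit_message):
    """Split a commit message into prefix, description, and body (hypothetical helper)."""
    lines = commit_message.strip().splitlines()
    subject = lines[0].strip() if lines else ""
    # Assumption: everything after the first line is the body.
    body = "\n".join(lines[1:]).strip()
    match = PREFIX_RE.match(subject)
    if match:
        prefix = match.group("prefix")
        description = match.group("description")
    else:
        # No "type:" pattern found, so the whole subject is the description.
        prefix, description = "", subject
    return {"prefix": prefix, "description": description, "body": body}


print(split_commit_message("feat: add user authentication"))
```

A message without a colon-terminated prefix, such as a SWENT-style message, comes back with an empty `prefix` and the full first line as `description`.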
- Description: Evaluates whether a prefix is valid according to the Conventional Commit standard and returns a grade.
- Inputs:
- prefix (str): The prefix to evaluate.
- Outputs: A dictionary containing:
- grade (float): Grade of the prefix (0, 0.5, or 1).
- error_code (int): Error type: 0 = perfect prefix, 1 = uppercase error, 2 = typo error, 3 = both uppercase and typo errors, 4 = invalid prefix.
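One way to produce these grades and error codes is fuzzy matching against a list of accepted prefixes. The sketch below is only an illustration under stated assumptions: the function name `grade_prefix`, the prefix list, the use of `difflib` for typo detection, the 0.75 similarity cutoff, and treating any uppercase letter as an uppercase error are all choices made here, not details confirmed by the project.

```python
import difflib

# Assumed list of accepted prefixes; the project's actual list may differ.
CONVENTIONAL_PREFIXES = [
    "feat", "fix", "docs", "style", "refactor",
    "perf", "test", "build", "ci", "chore", "revert",
]


def grade_prefix(prefix):
    """Return a grade and error code for a prefix (hypothetical helper)."""
    lowered = prefix.lower()
    # Assumption: any uppercase letter counts as an uppercase error.
    has_uppercase = prefix != lowered
    if lowered in CONVENTIONAL_PREFIXES:
        if has_uppercase:
            return {"grade": 0.5, "error_code": 1}
        return {"grade": 1.0, "error_code": 0}
    # Assumption: a close fuzzy match to a known prefix counts as a typo.
    close = difflib.get_close_matches(lowered, CONVENTIONAL_PREFIXES, n=1, cutoff=0.75)
    if close:
        if has_uppercase:
            return {"grade": 0.0, "error_code": 3}
        return {"grade": 0.5, "error_code": 2}
    return {"grade": 0.0, "error_code": 4}


for candidate in ["feat", "Feat", "featt", "Featt", "wip"]:
    print(candidate, grade_prefix(candidate))
```

The cutoff controls how forgiving typo detection is: a lower value accepts more distant misspellings but risks matching unrelated words to a valid prefix.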
- Description: Grades the description of a commit message using a pre-trained BERT model.
- Inputs:
- commit_message (str): The full commit message to grade.
- tokenizer: The tokenizer loaded with the pre-trained BERT model.
- model: The trained BERT model used for grading.
- Outputs:
- An integer representing the predicted grade (0, 1, 2, or 3).
- Description: Evaluates whether the body of a commit message adds meaningful information to its description. The evaluation uses either a simplified rule-based approach (if no_openai is True) or GPT-4o (if no_openai is False).
- Inputs:
- description (str): The description (title) of the commit message.
- body (str): The body of the commit message.
- no_openai (bool, optional): If True, the function assumes the body is meaningful if it exists. Defaults to True.
- Outputs:
- True if the body is meaningful, False otherwise.
- Description: Checks if the commit message description exceeds 50 characters (excluding spaces).
- Inputs:
- content (str): The commit message description.
- Outputs:
- True if the description is too long, False otherwise.
- Description: Checks if any line in the commit message body exceeds 72 characters.
- Inputs:
- body (str): The body of the commit message.
- Outputs:
- True if any line exceeds 72 characters, False otherwise.
- Description: Checks if the first letter of the commit message description is uppercase.
- Inputs:
- content (str): The commit message description.
- Outputs:
- True if the first letter is uppercase, False otherwise.
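The three formatting checks described above (description length, body line length, and leading uppercase letter) are simple enough to sketch in a few lines. The function names below are hypothetical. The body check counts every character on a line, which is an assumption: this README describes the 72-character limit both with and without the "excluding spaces" qualifier.

```python
def is_description_too_long(content, limit=50):
    """True if the description has more than `limit` characters, not counting spaces."""
    return len(content.replace(" ", "")) > limit


def is_body_line_too_long(body, limit=72):
    """True if any line of the body has more than `limit` characters."""
    # Assumption: spaces are counted here; adjust if they should be excluded.
    return any(len(line) > limit for line in body.splitlines())


def starts_with_uppercase(content):
    """True if the first character of the description is an uppercase letter."""
    stripped = content.lstrip()
    return bool(stripped) and stripped[0].isupper()


description = "Add unit tests for ListToDosViewModel and ToDosRepositoryFirestore"
print(is_description_too_long(description))
print(starts_with_uppercase(description))
```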
- Description: Determines if the first word of a commit message description is a verb in imperative mood using a SpaCy model.
- Inputs:
- sentence (str): The commit message description to analyze.
- nlp: A loaded SpaCy language model used for natural language processing.
- Outputs:
- True if the first word is a verb in imperative mood, False otherwise.
- Description: Grades a commit message by evaluating its prefix, description, and body, while checking for errors and adherence to best practices.
- Inputs:
- commit_message (str): The full commit message to grade.
- nlp: A loaded SpaCy language model used for natural language processing.
- tokenizer: A preloaded tokenizer for the BERT model.
- model: A preloaded BERT model used for evaluating the description grade.
- no_openai (bool): If True, a body is treated as meaningful whenever it is present, without calling GPT.
- Outputs: A tuple containing:
- description_grade (int): Initial description grade (0–3).
- final_grade (float): Adjusted final grade after applying all checks and adjustments.
- Various boolean flags indicating detected issues:
- is_desc_too_long (bool): True if the description exceeds 50 characters (excluding spaces).
- is_uppercase (bool): True if the first letter of the description is uppercase (only for conventional commit messages).
- is_not_imp_verb (bool): True if the description does not start with an imperative verb (checked using SpaCy).
- is_perfect_prefix (bool): True if the prefix is valid and correctly formatted.
- is_uppercase_prefix (bool): True if the prefix has an uppercase error.
- is_typo_prefix (bool): True if the prefix contains a typo.
- is_uppercase_and_typo_prefix (bool): True if the prefix has both uppercase and typo errors.
- is_invalid_prefix (bool): True if the prefix is invalid or non-standard.
- is_body_meaningful (bool): True if the body adds meaningful context to the description.
- is_body_too_long (bool): True if any line in the body exceeds 72 characters (excluding spaces).
- is_body_evaluated (bool): True if the body has been evaluated for meaningfulness.
- Description: Extracts commit messages from a JSON file.
- Inputs:
- json_path (str): Path to the JSON file.
- Outputs: A list of commit messages.
- Description: Extracts grades from a JSON file.
- Inputs:
- json_path (str): Path to the JSON file.
- Outputs: A list of grades.
actual_vs_pred(commit_messages, validation_grades, num_messages, nlp, tokenizer, model, no_openai=True)
- Description: Validates commit message grades by grading a subset of messages and comparing them with validation grades. Calculates accuracy, custom F1-like metrics, and flags mismatched grades for detailed analysis.
- Inputs:
- commit_messages (list): List of commit messages to validate.
- validation_grades (list): List of ground-truth validation grades.
- num_messages (int): Number of commit messages to validate (processed sequentially from the beginning).
- nlp: A loaded SpaCy language model for natural language processing.
- tokenizer: A preloaded tokenizer for the BERT model.
- model: A preloaded BERT model for evaluating the description grades.
- no_openai (bool): If True, a body is treated as meaningful whenever it is present, without calling GPT.
- Outputs: A dictionary containing:
- flagged_messages (list): A list of messages with mismatched grades, along with detailed results:
- Commit message.
- Predicted grade.
- Validation grade.
- Detected errors and their flags.
- custom_accuracy (float): Fraction of predictions within the error margin (the default margin is 1).
- custom_f1 (float): A custom F1-like metric adapted for continuous grade values.
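The margin-based accuracy can be illustrated as follows. This is a sketch with hypothetical function names. It assumes that "within the error margin" means an absolute difference of at most the margin, and that a flagged message is one falling outside it. The custom F1-like metric is not reproduced because its definition is not given here.

```python
def accuracy_within_margin(predicted, actual, margin=1.0):
    """Fraction of predictions whose absolute error is at most `margin`."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        return 0.0
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) <= margin)
    return hits / len(predicted)


def flag_mismatches(messages, predicted, actual, margin=1.0):
    """Collect messages whose predicted grade falls outside the margin."""
    return [
        {"commit_message": m, "predicted_grade": p, "validation_grade": a}
        for m, p, a in zip(messages, predicted, actual)
        if abs(p - a) > margin
    ]


messages = ["feat: add login", "stuff", "fix: handle null user"]
predicted = [4.5, 2.0, 0.0]
actual = [5, 4, 1]
print(accuracy_within_margin(predicted, actual))
print(flag_mismatches(messages, predicted, actual))
```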
- Description: Generates targeted feedback for a commit message using GPT-4o based on the detected issues.
- Inputs:
- commit_message (str): The full commit message.
- final_grade (float): Final grade of the commit.
- description_grade (int): Intermediate description grade.
- errors (dict): Dictionary of error flags.
- Outputs:
- A string containing the feedback.
- Description: Processes commit messages from a JSON file, grades them, and generates feedback.
- Inputs:
- input_json_path (str): Path to the input JSON file.
- output_json_path (str): Path to save the graded commit messages with feedback.
- start (int, optional): Start index for processing commits.
- end (int, optional): End index for processing commits.
- Outputs: A list of processed commit messages with grades and feedback.
- Description: Extracts commit messages grouped by SCIPER ID from a JSON file.
- Inputs:
- json_path (str): Path to the JSON file.
- Outputs: A dictionary where keys are SCIPER IDs and values are lists of commit messages.
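A minimal sketch of this extraction step is shown below, assuming the input layout documented earlier (each SCIPER ID maps to a list of objects with `hash` and `commit_message`). The function name `load_commits_by_sciper` and the strictness of the validation are assumptions made for illustration.

```python
import json
import os
import tempfile


def load_commits_by_sciper(json_path):
    """Read a SCIPER-keyed JSON file and return {sciper: [commit_message, ...]}."""
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)
    grouped = {}
    for sciper, commits in data.items():
        messages = []
        for commit in commits:
            # Fail early on entries that lack the documented field.
            if "commit_message" not in commit:
                raise ValueError(f"commit without 'commit_message' for SCIPER {sciper}")
            messages.append(commit["commit_message"])
        grouped[sciper] = messages
    return grouped


# Demo: write a small input file to a temporary directory and read it back.
sample = {
    "123456": [
        {
            "hash": "675f7201cce20ae7f757324909b9a87ab474873c",
            "commit_message": "test: Add ui tests for the map screen",
        }
    ]
}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "commits.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f)
    grouped = load_commits_by_sciper(path)

print(grouped)
```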
process_commits_and_generate_feedback_with_sciper(input_json_path, output_json_path, model_path, spacy_nlp_name="en_core_web_sm", start=None, end=None)
- Description: Processes commit messages grouped by SCIPER ID, generates grades and feedback, and calculates average grades using a trained BERT model and a specified SpaCy language model.
- Inputs:
- input_json_path (str): Path to the input JSON file with commit messages grouped by SCIPER.
- output_json_path (str): Path to save the graded results.
- model_path (str): Path to the trained BERT model used for grading.
- spacy_nlp_name (str, optional): The name of the SpaCy model to use. Defaults to "en_core_web_sm".
- start (int, optional): Starting index for SCIPER processing.
- end (int, optional): Ending index for SCIPER processing.
- Outputs:
- A JSON file containing commit messages, grades, feedback, and average grades per SCIPER.
- Metrics Printed:
- Total execution time.
- Average time per SCIPER.
- Average time per commit.
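The per-SCIPER aggregation can be sketched as below. This is an illustration only: the name `summarize_by_sciper` is hypothetical, and it assumes that `start` and `end` slice the SCIPER IDs in file order and that averages are rounded to two decimals, as the sample output suggests. Grading and feedback generation are omitted, so the commits passed in are taken to be graded already.

```python
from itertools import islice


def summarize_by_sciper(graded_by_sciper, start=None, end=None):
    """Attach an average grade to each SCIPER's list of graded commits."""
    # Assumption: start/end select a contiguous range of SCIPERs in file order.
    selected = islice(graded_by_sciper.items(), start, end)
    results = {}
    for sciper, commits in selected:
        grades = [commit["grade"] for commit in commits]
        average = round(sum(grades) / len(grades), 2) if grades else None
        results[sciper] = {"commits": commits, "average_grade": average}
    return results


graded = {
    "111111": [{"grade": 4.5}, {"grade": 4.0}],
    "222222": [{"grade": 3.0}],
    "333333": [],
}
print(summarize_by_sciper(graded, start=0, end=2))
```

Leaving `start` and `end` as `None` processes every SCIPER, mirroring the "process all SCIPERs" call shown in the usage section.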
- Description: Processes and grades a single commit message using a specified BERT model and SpaCy language model, then generates feedback.
- Inputs:
- commit_message (str): The commit message to evaluate.
- model_path (str): Path to the trained BERT model for grading.
- spacy_nlp_name (str, optional): Name of the SpaCy model to use. Defaults to "en_core_web_sm".
- Outputs:
- A dictionary containing:
- commit_message (str): The original commit message.
- grade (float): The final grade assigned to the commit message.
- feedback (str): Detailed feedback based on detected issues and grading criteria.
- Notes:
- This function integrates both natural language processing (via SpaCy) and deep learning (via BERT) to provide a robust analysis of commit messages.
The project uses the following datasets:
- Purpose: Used for metric computation and to compare with the new and improved training dataset.
- Description: This dataset contains the original commit messages and grades, which served as the foundation for preprocessing and creating the improved training dataset. The grades in here are integers ranging from 0-5.
- Purpose: Serves as the new training dataset to fine-tune the BERT-based classification model.
- Description: This dataset is the result of extensive preprocessing and improvements made to the old training data. It provides a clean and well-structured set of commit messages and corresponding grades for model training. The grades in here are integers ranging from 0-3.
- Purpose: Used to test and evaluate the grading system's performance.
- Description: This dataset contains commit messages from 154 students who took the CS-311 course during the winter semester of the 2024-2025 academic period. It is used exclusively for testing purposes and should remain private to ensure student privacy. The commit messages in this dataset are not graded.
Warning
This dataset will be removed from the GitHub repository after the Machine Learning project grading to uphold the privacy of the students.
- Purpose: Used to compute performance metrics in Metrics.ipynb.
- Description: This dataset contains a collection of graded commit messages distinct from both the new and the old training data. The grades are integers ranging from 0-5.
The Metrics.ipynb notebook evaluates the grading system's performance using both the final and original datasets. It includes:
- Model Comparison:
- Compares models trained on the final and original datasets.
- Outputs metrics such as accuracy, F1 scores, and confusion matrices.
- Validation:
- Validates predictions using validation_data.json.
- Saves predictions for comparison (y_final_pred.txt, y_old_pred.txt).
- Visualization:
- Displays confusion matrices and classification reports for both datasets.
- Ensure the required datasets (validation_data.json) and models are available.
- Run the notebook to:
- Compute predictions for commit messages.
- Evaluate model performance.
- Visualize results with confusion matrices.
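For readers unfamiliar with the metrics involved, the following dependency-free sketch shows how a confusion matrix and plain accuracy can be computed from two lists of grades. It is not taken from the notebook, which may well rely on a library instead; the function names and the sample grades are made up for illustration.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Rows are actual labels, columns are predicted labels."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def accuracy(actual, predicted):
    """Fraction of predictions that match the actual label exactly."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


# Made-up description grades on the 0-3 scale.
actual = [0, 1, 2, 3, 3, 1]
predicted = [0, 1, 1, 3, 2, 1]
matrix = confusion_matrix(actual, predicted, labels=[0, 1, 2, 3])
for row in matrix:
    print(row)
print(accuracy(actual, predicted))
```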
Warning
The validation_data.json dataset contains sensitive information and will be removed post-grading to ensure privacy.
Below are the official grading guidelines followed by this project to evaluate commit messages:
- Description Grade (Out of 3):
- 3/3: The description is clear and thoroughly describes the changes made.
- 2/3: The description is relatively clear but could benefit from improved wording or structure.
- 1/3: The description lacks sufficient detail and does not adequately describe the changes made.
- 0/3: The message is too vague, off-topic, or unrelated to code changes.
- Mapped Description Grade (Out of 4):
- (0/3 -> 0/4)
- (1/3 -> 1/4)
- (2/3 -> 3/4)
- (3/3 -> 4/4)
- Prefix Evaluation (For Messages with Prefixes): Only one scenario applies based on the state of the prefix:
- +1: The prefix is perfect and adheres to conventional commit standards.
- +0.5: The prefix has a minor typo but is still recognizable.
- +0.5: The first letter of the prefix is uppercase.
- +0: Both a typo and an uppercase letter are present in the prefix.
- +0: The prefix is not one of the conventional prefixes.
- Body Evaluation (If a Body is Present): Only one scenario applies based on the description grade and the body:
- If the description grade is 1:
- +1: The body adds meaningful details to the description, and each line has fewer than 72 characters (excluding spaces).
- +0.5: The body adds meaningful details, but some lines exceed 72 characters (excluding spaces).
- +0: The body is not meaningful.
- If the description grade is 2:
- +0.5: The body adds meaningful details, and each line has fewer than 72 characters (excluding spaces).
- +0: Otherwise.
- If the description grade is 0 or 3, no bonus points are awarded for the body.
- Penalties (-0.5 Each):
- For messages with prefixes (conventional commit):
- The description starts with an uppercase letter.
- The description does not start with an imperative mood verb.
- The description exceeds 50 characters (excluding spaces).
- For messages without prefixes (SWENT-style):
- The description does not start with an imperative mood verb.
- The description exceeds 50 characters (excluding spaces).
Note
For both prefix and body evaluation, only one scenario is applied per message. This ensures that the evaluation is precise and avoids applying multiple scores simultaneously.
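The guidelines above can be combined into a single scoring function. The sketch below is one reading of them, not the project's code: the name `compute_final_grade` and its parameters are hypothetical, the prefix is passed as the error code described earlier (or `None` for a SWENT-style message without a prefix), and no clamping is applied because the guidelines do not say how negative totals are handled.

```python
# Mapping from the 0-3 description grade to the 0-4 scale.
DESCRIPTION_MAP = {0: 0, 1: 1, 2: 3, 3: 4}
# Bonus per prefix error code: perfect, uppercase, typo, both, invalid.
PREFIX_BONUS = {0: 1.0, 1: 0.5, 2: 0.5, 3: 0.0, 4: 0.0}


def compute_final_grade(description_grade, prefix_error_code=None,
                        body_meaningful=False, body_too_long=False,
                        starts_uppercase=False, not_imperative=False,
                        desc_too_long=False):
    """Apply the mapping, prefix bonus, body bonus, and penalties (hypothetical helper)."""
    grade = float(DESCRIPTION_MAP[description_grade])
    has_prefix = prefix_error_code is not None
    if has_prefix:
        grade += PREFIX_BONUS[prefix_error_code]
    # Body bonus: only for description grades 1 and 2, one scenario at most.
    if body_meaningful:
        if description_grade == 1:
            grade += 0.5 if body_too_long else 1.0
        elif description_grade == 2 and not body_too_long:
            grade += 0.5
    # Penalties of 0.5 each; the uppercase penalty applies only with a prefix.
    penalties = [not_imperative, desc_too_long]
    if has_prefix:
        penalties.append(starts_uppercase)
    grade -= 0.5 * sum(penalties)
    return grade


print(compute_final_grade(3, prefix_error_code=0, starts_uppercase=True))
```

Under this reading, a message with a top description grade, a perfect prefix, and an uppercase description scores 4 + 1 - 0.5 = 4.5, which matches the first commit in the sample output shown earlier.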
The final workflow for the grading and feedback generation process is illustrated below.
The diagram provides an overview of the entire process, from dataset preparation to model evaluation and feedback generation.
We welcome contributions! To contribute:
- Fork this repository.
- Create a new branch:
git checkout -b feature/your-feature-name
- Commit your changes:
git commit -m "Add your commit message here"
- Push to the branch:
git push origin feature/your-feature-name
- Create a pull request.
This project is licensed under the MIT License.
- BERT
- OpenAI GPT-4
- SpaCy
- Contributors: Albert Fares, Hugo Jeannin, Daniel Polka
