End-to-end notebooks to train, analyze, and ensemble models that classify source code as human-written or generated by one of 10 LLM families. The pipeline includes universal code canonicalization (Tree-sitter), hard negative mining, a 3‑phase training schedule, and a weighted ensemble.
- `hardnegative.ipynb`: Mines hard negatives by scoring the full train set, selecting the top 20% highest-loss samples, and saving their indices to JSON for re-weighting/augmentation.
- `1_unixcoder_latest.ipynb`: Three-phase training with `microsoft/unixcoder-base`. Includes:
  - UniversalCanonicalizer (Tree-sitter) and data augmentation
  - Phase 1 (Weighted), Phase 2 (Natural), Phase 3 (Full) training
  - Saves both "Latest" and "Best" checkpoints' predictions; probability files saved for P2 Best and P3 Final
- `2_graphcodebert_latest.ipynb`: Same pipeline as above using `microsoft/graphcodebert-base`.
- `ensemble.ipynb`: Weighted ensemble of probability files from UnixCoder and GraphCodeBERT to produce a final submission CSV.
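The hard-negative mining step above (score the train set, take the top 20% by loss, dump indices to JSON) can be sketched as follows. This is a minimal illustration assuming per-sample losses are already computed; `select_hard_negatives` is a hypothetical helper, not the notebook's exact code:

```python
import json

import numpy as np

def select_hard_negatives(losses, top_frac=0.2):
    """Return sorted indices of the top `top_frac` highest-loss samples."""
    losses = np.asarray(losses)
    k = max(1, int(len(losses) * top_frac))
    # argsort ascending, reverse for descending, keep the first k
    hard = np.argsort(losses)[::-1][:k]
    return sorted(hard.tolist())

# Example: per-sample losses from a scoring pass over the train set
losses = [0.1, 2.3, 0.05, 1.7, 0.4, 3.0, 0.2, 0.9, 1.2, 0.3]
hard_indices = select_hard_negatives(losses, top_frac=0.2)
print(hard_indices)  # → [1, 5] (the two highest-loss samples)

# Persist with the same top-level key the training notebooks read
with open("hard_negatives.json", "w") as f:
    json.dump({"hard_indices": hard_indices}, f)
```

The JSON written here matches the `hard_indices` key the training notebooks expect; the real notebook also stores a `distribution` key.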
- Hard Negative Mining: run `hardnegative.ipynb` to create the hard negative indices JSON.
- Train UnixCoder: run `1_unixcoder_latest.ipynb` (Phases P1→P2→P3).
- Train GraphCodeBERT: run `2_graphcodebert_latest.ipynb` (Phases P1→P2→P3).
- Ensemble: run `ensemble.ipynb` to create the final submission CSV.
The binary human-vs-AI origin label is expanded to 11 classes:
0: human, 1: deepseek, 2: qwen, 3: 01-ai, 4: bigcode, 5: gemma, 6: phi, 7: meta-llama, 8: ibm-granite, 9: mistral, 10: openai
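The class list above can be captured as a lookup table for decoding predictions (the names `ID2LABEL`/`LABEL2ID` are illustrative, not identifiers from the notebooks):

```python
# Label ids used throughout the pipeline: 0 = human, 1-10 = LLM families.
ID2LABEL = {
    0: "human", 1: "deepseek", 2: "qwen", 3: "01-ai", 4: "bigcode",
    5: "gemma", 6: "phi", 7: "meta-llama", 8: "ibm-granite",
    9: "mistral", 10: "openai",
}
# Inverse mapping, e.g. for encoding string labels back to ids
LABEL2ID = {name: i for i, name in ID2LABEL.items()}
```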
Default (Colab) paths used by training notebooks:
- `BASE_DIR = /content/drive/MyDrive/SemEval_Models`
- Files required in `BASE_DIR`:
  - `train.parquet` (columns: `code` [str], `label` [int 0–10], optional `language`)
  - `validation.parquet` (columns: `code`, `label`)
  - `test.parquet` (columns: `code`, optional `ID`)
- Hard negatives JSON (optional but recommended):
  - Training notebooks expect `JSON_PATH = /content/hard_negative_indices.json` with the schema: `{ "hard_indices": [12, 57, 103, ...] }`
  - Note: `hardnegative.ipynb` writes to `BASE_DIR/hard_negatives.json` with keys `hard_indices` and `distribution`. You can either:
    - Copy/rename it to `/content/hard_negative_indices.json`, or
    - Change `JSON_PATH` in the training notebooks to point to `BASE_DIR/hard_negatives.json`.
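A third option is to make the loading code tolerant of both locations. This is a sketch (the helper `load_hard_indices` is hypothetical; the notebooks read a single `JSON_PATH`):

```python
import json
import os

# Candidate locations, in priority order: the path the training notebooks
# expect first, then the file hardnegative.ipynb actually writes.
DEFAULT_PATHS = (
    "/content/hard_negative_indices.json",
    "/content/drive/MyDrive/SemEval_Models/hard_negatives.json",
)

def load_hard_indices(paths=DEFAULT_PATHS):
    """Return `hard_indices` from the first existing JSON file, else []."""
    for path in paths:
        if os.path.exists(path):
            with open(path) as f:
                return json.load(f)["hard_indices"]
    return []  # training still works, just without hard-negative weighting
```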
Prerequisites:
- Upload your parquet files to Google Drive at `/content/drive/MyDrive/SemEval_Models`.
- Use a GPU runtime (recommended for fp16).
Execution Order:
- Run `hardnegative.ipynb`: generates `hard_negatives.json` in `BASE_DIR` (or set `JSON_PATH` accordingly).
- Run `1_unixcoder_latest.ipynb`:
  - Installs `tree-sitter` deps
  - Mounts Drive
  - Loads data, augments, tokenizes
  - Trains P1→P2→P3 and writes predictions/probabilities to `BASE_DIR`
- Run `2_graphcodebert_latest.ipynb` to repeat training for GraphCodeBERT.
- Run `ensemble.ipynb`:
  - Place the required `.npy` probability files in the Colab working directory (`.`) or adjust paths
  - Produces `submission_ENSEMBLE_UnixDominant-9.csv`
- Phase 1 (Weighted): Uses `WeightedRandomSampler` over augmented data with sample weights that up-weight hard negatives and positive classes.
- Phase 2 (Natural): Trains without a custom sampler.
- Phase 3 (Full): Trains on the union of train + validation (no eval saving).
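The Phase 1 weighting can be sketched as below. The boost factors are illustrative only (the notebooks' exact values may differ); the resulting per-sample weights would be passed to `torch.utils.data.WeightedRandomSampler`:

```python
import numpy as np

def sample_weights(labels, hard_indices, hard_boost=2.0, positive_boost=1.5):
    """Per-sample weights up-weighting hard negatives and non-human classes.

    labels       : per-sample class ids (0 = human, 1-10 = LLM families)
    hard_indices : indices mined by hardnegative.ipynb
    Boost factors here are assumptions for illustration.
    """
    w = np.ones(len(labels), dtype=np.float64)
    w[np.asarray(labels) > 0] *= positive_boost      # positive (LLM) classes
    w[np.asarray(sorted(hard_indices), dtype=int)] *= hard_boost  # hard negatives
    return w
```

In Phase 1 these weights drive sampling; Phases 2 and 3 drop the sampler and train on the natural (then full) distribution.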
Common hyperparameters (edit in notebook):
- Learning rate: `3e-5`
- Max length: `512`
- Batch size: `128`
- Mixed precision: `fp16=True` (requires GPU)
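For reference, these defaults could be kept in one place and forwarded to the tokenizer (`max_length`) and `TrainingArguments` (the rest). A sketch; the name `HPARAMS` is not from the notebooks:

```python
# Shared hyperparameters mirrored from the notebooks; edit per experiment.
HPARAMS = {
    "learning_rate": 3e-5,
    "max_length": 512,    # tokenizer truncation length
    "batch_size": 128,    # reduce if you hit OOM
    "fp16": True,         # requires a GPU; set False on CPU-only runs
}
```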
`UniversalCanonicalizer` (Tree-sitter) normalizes identifiers and code layout in a language-aware way (Python, Java, C/C++, C#, JavaScript, PHP, Go). `AugmentationPipeline` duplicates examples (with some probability), applying canonicalization; training samples receive weights based on hardness and class.
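To give a feel for identifier normalization, here is a rough Python-only analogue using the stdlib `tokenize` module. It is *not* the Tree-sitter implementation (which is language-aware and handles seven languages), but it shows the core idea, including the best-effort fallback to the original code on parse failure:

```python
import io
import keyword
import tokenize

def canonicalize_python(code):
    """Rename identifiers to VAR_i in order of first appearance (Python only).

    Falls back to returning the code unchanged if tokenization fails,
    mirroring the canonicalizer's best-effort behavior.
    """
    try:
        mapping, toks = {}, []
        for tok in tokenize.generate_tokens(io.StringIO(code).readline):
            if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
                # First sighting of an identifier gets the next VAR_i slot
                mapping.setdefault(tok.string, f"VAR_{len(mapping)}")
                toks.append((tokenize.NAME, mapping[tok.string]))
            else:
                toks.append((tok.type, tok.string))
        return tokenize.untokenize(toks)
    except Exception:
        return code  # best-effort: unparsable input passes through unchanged
```

Unlike this sketch, the real canonicalizer also normalizes layout and avoids renaming builtins/library names.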
Saved in BASE_DIR by each training notebook:
- CSV predictions for test: `submission_<Model>_<Phase>_{Latest|Best}.csv`
- Probability files (for ensembling):
  - UnixCoder: `submission_UnixCoder_P2_Best_probs.npy`, `submission_UnixCoder_P3_Final_probs.npy`
  - GraphCodeBERT: `submission_GCB_P2_Best_probs.npy`, `submission_GCB_P3_Final_probs.npy`
Ensemble output (in the current working directory by default): `submission_ENSEMBLE_UnixDominant-9.csv`
Default ensemble weights (can be edited in ensemble.ipynb):
- UnixCoder P2 Best: `0.43`
- UnixCoder P3 Final: `0.23`
- GraphCodeBERT P3 Final: `0.34`
- GraphCodeBERT P2 Best: `0.00`
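The ensemble itself is a weighted average of the per-class probability arrays followed by an argmax. A minimal sketch (the `weighted_ensemble` helper is hypothetical; in practice each array would come from `np.load` on the files listed above):

```python
import numpy as np

# File -> weight, mirroring the defaults above; edit to match ensemble.ipynb.
WEIGHTS = {
    "submission_UnixCoder_P2_Best_probs.npy": 0.43,
    "submission_UnixCoder_P3_Final_probs.npy": 0.23,
    "submission_GCB_P3_Final_probs.npy": 0.34,
    "submission_GCB_P2_Best_probs.npy": 0.00,
}

def weighted_ensemble(prob_arrays, weights):
    """Weighted average of (n_samples, n_classes) arrays -> predicted class ids."""
    total = sum(w * p for p, w in zip(prob_arrays, weights))
    return np.argmax(total / sum(weights), axis=1)
```

Predictions would then be written out with the test set's `ID` column as the submission CSV.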
These notebooks are Colab-ready. To run locally:
- Install dependencies (example): `pip install torch transformers datasets scikit-learn pandas numpy pyarrow fastparquet tree-sitter==0.21.3 tree-sitter-languages==1.10.2`
- In each training notebook:
  - Comment out `from google.colab import drive` and `drive.mount(...)`
  - Set `BASE_DIR` and `JSON_PATH` to local paths
  - If no GPU or no AMP support, set `fp16=False` in `TrainingArguments`
- If probability files are missing for ensembling, ensure you completed P2 Best and P3 Final in both training notebooks (those phases save the `.npy` files).
- If Tree-sitter parsing fails for a language, the original code is used as-is (canonicalization is best-effort).
- If you hit out-of-memory (OOM), reduce `BATCH_SIZE` and/or `MAX_LENGTH`.
- Switch models by editing `MODEL_CHECKPOINT`:
  - UnixCoder: `microsoft/unixcoder-base`
  - GraphCodeBERT: `microsoft/graphcodebert-base`
- Edit `FOLDER_NAME`, `SUB_PREFIX`, and the training/eval/save step intervals to match your experiment plan.