End-to-end notebooks to train, analyze, and ensemble models that classify source code as human-written or generated by one of 10 LLM families. The pipeline includes universal code canonicalization (Tree-sitter), hard negative mining, a 3‑phase training schedule, and a weighted ensemble.
- `hardnegative.ipynb`: Mines hard negatives by scoring the full train set, selecting the top 20% highest-loss samples, and saving their indices to JSON for re-weighting/augmentation.
- `1_unixcoder_latest.ipynb`: Three-phase training with `microsoft/unixcoder-base`. Includes:
  - UniversalCanonicalizer (Tree-sitter) and data augmentation
  - Phase 1 (Weighted), Phase 2 (Natural), Phase 3 (Full) training
  - Saves both "Latest" and "Best" checkpoints' predictions; probability files saved for P2 Best and P3 Final
- `2_graphcodebert_latest.ipynb`: Same pipeline as above using `microsoft/graphcodebert-base`.
- `ensemble.ipynb`: Weighted ensemble of probability files from UnixCoder and GraphCodeBERT to produce a final submission CSV.
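The hard-negative mining step above (score the train set, take the top 20% by loss, dump indices to JSON) can be sketched as follows. This is a minimal illustration assuming per-sample losses are already computed; `select_hard_negatives` is a hypothetical helper, not the notebook's exact code:

```python
import json

import numpy as np

def select_hard_negatives(losses, top_frac=0.2):
    """Return sorted indices of the top `top_frac` highest-loss samples."""
    losses = np.asarray(losses)
    k = max(1, int(len(losses) * top_frac))
    # argsort ascending, reverse for descending, keep the first k
    hard = np.argsort(losses)[::-1][:k]
    return sorted(hard.tolist())

# Example: per-sample losses from a scoring pass over the train set
losses = [0.1, 2.3, 0.05, 1.7, 0.4, 3.0, 0.2, 0.9, 1.2, 0.3]
hard_indices = select_hard_negatives(losses, top_frac=0.2)
print(hard_indices)  # → [1, 5] (the two highest-loss samples)

# Persist with the same top-level key the training notebooks read
with open("hard_negatives.json", "w") as f:
    json.dump({"hard_indices": hard_indices}, f)
```

The JSON written here matches the `hard_indices` key the training notebooks expect; the real notebook also stores a `distribution` key.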
- Hard Negative Mining: run `hardnegative.ipynb` to create the hard negative indices JSON.
- Train UnixCoder: run `1_unixcoder_latest.ipynb` (Phases P1→P2→P3).
- Train GraphCodeBERT: run `2_graphcodebert_latest.ipynb` (Phases P1→P2→P3).
- Ensemble: run `ensemble.ipynb` to create the final submission CSV.
The binary human-vs-AI origin label is expanded to 11 classes:
0: human, 1: deepseek, 2: qwen, 3: 01-ai, 4: bigcode, 5: gemma, 6: phi, 7: meta-llama, 8: ibm-granite, 9: mistral, 10: openai
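The class list above can be captured as a lookup table for decoding predictions (the names `ID2LABEL`/`LABEL2ID` are illustrative, not identifiers from the notebooks):

```python
# Label ids used throughout the pipeline: 0 = human, 1-10 = LLM families.
ID2LABEL = {
    0: "human", 1: "deepseek", 2: "qwen", 3: "01-ai", 4: "bigcode",
    5: "gemma", 6: "phi", 7: "meta-llama", 8: "ibm-granite",
    9: "mistral", 10: "openai",
}
# Inverse mapping, e.g. for encoding string labels back to ids
LABEL2ID = {name: i for i, name in ID2LABEL.items()}
```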
Default (Colab) paths used by training notebooks:
- `BASE_DIR = /content/drive/MyDrive/SemEval_Models`
- Files required in `BASE_DIR`:
  - `train.parquet` (columns: `code` [str], `label` [int 0–10], optional `language`)
  - `validation.parquet` (columns: `code`, `label`)
  - `test.parquet` (columns: `code`, optional `ID`)
- Hard negatives JSON (optional but recommended):
  - Training notebooks expect `JSON_PATH = /content/hard_negative_indices.json` with the schema: `{ "hard_indices": [12, 57, 103, ...] }`
  - Note: `hardnegative.ipynb` writes to `BASE_DIR/hard_negatives.json` with keys `hard_indices` and `distribution`. You can either:
    - Copy/rename it to `/content/hard_negative_indices.json`, or
    - Change `JSON_PATH` in the training notebooks to point to `BASE_DIR/hard_negatives.json`.
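A third option is to make the loading code tolerant of both locations. This is a sketch (the helper `load_hard_indices` is hypothetical; the notebooks read a single `JSON_PATH`):

```python
import json
import os

# Candidate locations, in priority order: the path the training notebooks
# expect first, then the file hardnegative.ipynb actually writes.
DEFAULT_PATHS = (
    "/content/hard_negative_indices.json",
    "/content/drive/MyDrive/SemEval_Models/hard_negatives.json",
)

def load_hard_indices(paths=DEFAULT_PATHS):
    """Return `hard_indices` from the first existing JSON file, else []."""
    for path in paths:
        if os.path.exists(path):
            with open(path) as f:
                return json.load(f)["hard_indices"]
    return []  # training still works, just without hard-negative weighting
```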
Prerequisites:
- Upload your parquet files to Google Drive at `/content/drive/MyDrive/SemEval_Models`.
- Use a GPU runtime (recommended for fp16).
Execution Order:
- Run `hardnegative.ipynb`: generates `hard_negatives.json` in `BASE_DIR` (or set `JSON_PATH` accordingly).
- Run `1_unixcoder_latest.ipynb`:
  - Installs `tree-sitter` deps
  - Mounts Drive
  - Loads data, augments, tokenizes
  - Trains P1→P2→P3 and writes predictions/probabilities to `BASE_DIR`
- Run `2_graphcodebert_latest.ipynb` to repeat training for GraphCodeBERT.
- Run `ensemble.ipynb`:
  - Place the required `.npy` probability files in the Colab working directory (`.`) or adjust paths
  - Produces `submission_ENSEMBLE_UnixDominant-9.csv`
- Phase 1 (Weighted): Uses `WeightedRandomSampler` over augmented data with sample weights that up-weight hard negatives and positive classes.
- Phase 2 (Natural): Trains without a custom sampler.
- Phase 3 (Full): Trains on the union of train + validation (no eval saving).
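The Phase 1 weighting can be sketched as below. The boost factors are illustrative only (the notebooks' exact values may differ); the resulting per-sample weights would be passed to `torch.utils.data.WeightedRandomSampler`:

```python
import numpy as np

def sample_weights(labels, hard_indices, hard_boost=2.0, positive_boost=1.5):
    """Per-sample weights up-weighting hard negatives and non-human classes.

    labels       : per-sample class ids (0 = human, 1-10 = LLM families)
    hard_indices : indices mined by hardnegative.ipynb
    Boost factors here are assumptions for illustration.
    """
    w = np.ones(len(labels), dtype=np.float64)
    w[np.asarray(labels) > 0] *= positive_boost      # positive (LLM) classes
    w[np.asarray(sorted(hard_indices), dtype=int)] *= hard_boost  # hard negatives
    return w
```

In Phase 1 these weights drive sampling; Phases 2 and 3 drop the sampler and train on the natural (then full) distribution.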
Common hyperparameters (edit in notebook):
- Learning rate: `3e-5`
- Max length: `512`
- Batch size: `128`
- Mixed precision: `fp16=True` (requires GPU)
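For reference, these defaults could be kept in one place and forwarded to the tokenizer (`max_length`) and `TrainingArguments` (the rest). A sketch; the name `HPARAMS` is not from the notebooks:

```python
# Shared hyperparameters mirrored from the notebooks; edit per experiment.
HPARAMS = {
    "learning_rate": 3e-5,
    "max_length": 512,    # tokenizer truncation length
    "batch_size": 128,    # reduce if you hit OOM
    "fp16": True,         # requires a GPU; set False on CPU-only runs
}
```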
`UniversalCanonicalizer` (Tree-sitter) normalizes identifiers and code layout in a language-aware way (Python, Java, C/C++, C#, JavaScript, PHP, Go). `AugmentationPipeline` duplicates examples (with some probability), applying canonicalization; training samples receive weights based on hardness and class.
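To give a feel for identifier normalization, here is a rough Python-only analogue using the stdlib `tokenize` module. It is *not* the Tree-sitter implementation (which is language-aware and handles seven languages), but it shows the core idea, including the best-effort fallback to the original code on parse failure:

```python
import io
import keyword
import tokenize

def canonicalize_python(code):
    """Rename identifiers to VAR_i in order of first appearance (Python only).

    Falls back to returning the code unchanged if tokenization fails,
    mirroring the canonicalizer's best-effort behavior.
    """
    try:
        mapping, toks = {}, []
        for tok in tokenize.generate_tokens(io.StringIO(code).readline):
            if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
                # First sighting of an identifier gets the next VAR_i slot
                mapping.setdefault(tok.string, f"VAR_{len(mapping)}")
                toks.append((tokenize.NAME, mapping[tok.string]))
            else:
                toks.append((tok.type, tok.string))
        return tokenize.untokenize(toks)
    except Exception:
        return code  # best-effort: unparsable input passes through unchanged
```

Unlike this sketch, the real canonicalizer also normalizes layout and avoids renaming builtins/library names.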
Saved in BASE_DIR by each training notebook:
- CSV predictions for test: `submission_<Model>_<Phase>_{Latest|Best}.csv`
- Probability files (for ensembling):
  - UnixCoder: `submission_UnixCoder_P2_Best_probs.npy`, `submission_UnixCoder_P3_Final_probs.npy`
  - GraphCodeBERT: `submission_GCB_P2_Best_probs.npy`, `submission_GCB_P3_Final_probs.npy`
Ensemble output (in the current working directory by default): `submission_ENSEMBLE_UnixDominant-9.csv`
Default ensemble weights (can be edited in ensemble.ipynb):
- UnixCoder P2 Best: `0.43`
- UnixCoder P3 Final: `0.23`
- GraphCodeBERT P3 Final: `0.34`
- GraphCodeBERT P2 Best: `0.00`
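The ensemble itself is a weighted average of the per-class probability arrays followed by an argmax. A minimal sketch (the `weighted_ensemble` helper is hypothetical; in practice each array would come from `np.load` on the files listed above):

```python
import numpy as np

# File -> weight, mirroring the defaults above; edit to match ensemble.ipynb.
WEIGHTS = {
    "submission_UnixCoder_P2_Best_probs.npy": 0.43,
    "submission_UnixCoder_P3_Final_probs.npy": 0.23,
    "submission_GCB_P3_Final_probs.npy": 0.34,
    "submission_GCB_P2_Best_probs.npy": 0.00,
}

def weighted_ensemble(prob_arrays, weights):
    """Weighted average of (n_samples, n_classes) arrays -> predicted class ids."""
    total = sum(w * p for p, w in zip(prob_arrays, weights))
    return np.argmax(total / sum(weights), axis=1)
```

Predictions would then be written out with the test set's `ID` column as the submission CSV.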
These notebooks are Colab-ready. To run locally:
- Install dependencies (example): `pip install torch transformers datasets scikit-learn pandas numpy pyarrow fastparquet tree-sitter==0.21.3 tree-sitter-languages==1.10.2`
- In each training notebook:
  - Comment out `from google.colab import drive` and `drive.mount(...)`
  - Set `BASE_DIR` and `JSON_PATH` to local paths
  - If no GPU or no AMP support, set `fp16=False` in `TrainingArguments`
- If probability files are missing for ensembling, ensure you completed P2 Best and P3 Final in both training notebooks (those phases save the `.npy` files).
- If Tree-sitter parsing fails for a language, the original code is used as-is (canonicalization is best-effort).
- If you hit out-of-memory (OOM), reduce `BATCH_SIZE` and/or `MAX_LENGTH`.
- Switch models by editing `MODEL_CHECKPOINT`:
  - UnixCoder: `microsoft/unixcoder-base`
  - GraphCodeBERT: `microsoft/graphcodebert-base`
- Edit `FOLDER_NAME`, `SUB_PREFIX`, and the training/eval/save step intervals to match your experiment plan.