AyhanMeherrem/SemEval-2026---Machine-Generated-Code

Code Origin Classification with UnixCoder & GraphCodeBERT

End-to-end notebooks to train, analyze, and ensemble models that classify source code as human-written or generated by one of 10 LLM families. The pipeline includes universal code canonicalization (Tree-sitter), hard negative mining, a 3‑phase training schedule, and a weighted ensemble.

Repository Contents

  • hardnegative.ipynb: Mines hard negatives by scoring the full train set, selecting the top 20% highest-loss samples, and saving their indices to JSON for re-weighting/augmentation.
  • 1_unixcoder_latest.ipynb: Three-phase training with microsoft/unixcoder-base. Includes:
    • UniversalCanonicalizer (Tree-sitter) and data augmentation
    • Phase 1 (Weighted), Phase 2 (Natural), Phase 3 (Full) training
    • Saves both “Latest” and “Best” checkpoints’ predictions; probability files saved for P2 Best and P3 Final
  • 2_graphcodebert_latest.ipynb: Same pipeline as above using microsoft/graphcodebert-base.
  • ensemble.ipynb: Weighted ensemble of probability files from UnixCoder and GraphCodeBERT to produce a final submission CSV.
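
The selection logic in hardnegative.ipynb can be sketched as follows, assuming a per-sample loss array has already been computed by scoring the full train set (the function name and toy losses here are illustrative, not the notebook's exact code):

```python
import json
import numpy as np

def mine_hard_negatives(losses, top_frac=0.2):
    """Select indices of the top-`top_frac` highest-loss training samples."""
    n_hard = max(1, int(len(losses) * top_frac))
    order = np.argsort(losses)[::-1]  # sort descending by loss
    return sorted(int(i) for i in order[:n_hard])

# Toy per-sample losses standing in for a real scoring pass over train.parquet.
losses = np.array([0.1, 2.3, 0.4, 1.8, 0.2, 0.9, 3.1, 0.3, 0.5, 0.6])
hard_indices = mine_hard_negatives(losses)          # -> [1, 6]
payload = json.dumps({"hard_indices": hard_indices})
```

The notebook then writes this payload (plus a class-distribution summary) to BASE_DIR/hard_negatives.json.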

Recommended Execution Order

  1. Hard Negative Mining: run hardnegative.ipynb to create hard negative indices JSON.
  2. Train UnixCoder: run 1_unixcoder_latest.ipynb (Phases P1→P2→P3).
  3. Train GraphCodeBERT: run 2_graphcodebert_latest.ipynb (Phases P1→P2→P3).
  4. Ensemble: run ensemble.ipynb to create the final submission CSV.

Task & Labels

The binary human-vs-machine task is expanded to 11 classes (human plus 10 LLM families):

0: human, 1: deepseek, 2: qwen, 3: 01-ai, 4: bigcode, 5: gemma, 6: phi, 7: meta-llama, 8: ibm-granite, 9: mistral, 10: openai
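
The mapping above can be kept as a plain dict for decoding predictions (the variable names here are illustrative, not taken from the notebooks):

```python
# Label ids used across the notebooks: 0 is human, 1-10 are LLM families.
ID2LABEL = {
    0: "human", 1: "deepseek", 2: "qwen", 3: "01-ai", 4: "bigcode",
    5: "gemma", 6: "phi", 7: "meta-llama", 8: "ibm-granite",
    9: "mistral", 10: "openai",
}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}
```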

Data Layout (expected)

Default (Colab) paths used by training notebooks:

  • BASE_DIR = /content/drive/MyDrive/SemEval_Models
  • Files required in BASE_DIR:
    • train.parquet (columns: code [str], label [int 0–10], optional language)
    • validation.parquet (columns: code, label)
    • test.parquet (columns: code, optional ID)
  • Hard negatives JSON (optional but recommended):
    • Training notebooks expect JSON_PATH = /content/hard_negative_indices.json with the schema:
{ "hard_indices": [12, 57, 103, ...] }

Note: hardnegative.ipynb writes to BASE_DIR/hard_negatives.json with keys hard_indices and distribution. You can either:

  • Copy/rename it to /content/hard_negative_indices.json, or
  • Change JSON_PATH in the training notebooks to point to BASE_DIR/hard_negatives.json.
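
If you prefer the copy/rename route, a small adapter can re-save the miner's output under the schema the training notebooks expect (the function name is illustrative; the demo uses a temp directory in place of BASE_DIR):

```python
import json
import tempfile
from pathlib import Path

def adapt_hard_negatives(src, dst):
    """Re-save hardnegative.ipynb's output under the schema JSON_PATH expects,
    dropping the extra `distribution` key."""
    data = json.loads(Path(src).read_text())
    Path(dst).write_text(json.dumps({"hard_indices": data["hard_indices"]}))
    return data["hard_indices"]

# Demo with a toy file standing in for BASE_DIR/hard_negatives.json.
tmp = Path(tempfile.mkdtemp())
(tmp / "hard_negatives.json").write_text(
    json.dumps({"hard_indices": [12, 57, 103], "distribution": {"1": 2}})
)
indices = adapt_hard_negatives(tmp / "hard_negatives.json",
                               tmp / "hard_negative_indices.json")
```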

Quick Start (Google Colab)

Prerequisites:

  • Upload your parquet files to Google Drive at /content/drive/MyDrive/SemEval_Models.
  • Use a GPU runtime (recommended for fp16).

Execution Order:

  1. Run hardnegative.ipynb: generates hard_negatives.json in BASE_DIR (or set JSON_PATH accordingly).
  2. Run 1_unixcoder_latest.ipynb:
    • Installs tree-sitter deps
    • Mounts Drive
    • Loads data, augments, tokenizes
    • Trains P1→P2→P3 and writes predictions/probabilities to BASE_DIR
  3. Run 2_graphcodebert_latest.ipynb to repeat training for GraphCodeBERT.
  4. Run ensemble.ipynb:
    • Place the required .npy probability files in the Colab working directory (.) or adjust paths
    • Produces submission_ENSEMBLE_UnixDominant-9.csv

Training Schedule (both models)

  • Phase 1 (Weighted): Uses WeightedRandomSampler over augmented data with sample weights that up-weight hard negatives and positive classes.
  • Phase 2 (Natural): Trains without custom sampler.
  • Phase 3 (Full): Trains on the union of train + validation (no eval saving).
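
The Phase 1 weighting can be sketched like this; the boost factors are illustrative assumptions, not the notebooks' exact values:

```python
import numpy as np

def phase1_sample_weights(labels, hard_indices, hard_boost=2.0, llm_boost=1.5):
    """Per-sample weights for Phase 1: up-weight hard negatives and the
    LLM-generated classes (labels 1-10)."""
    labels = np.asarray(labels)
    w = np.ones(len(labels), dtype=np.float64)
    w[labels > 0] *= llm_boost                       # non-human classes
    w[np.asarray(hard_indices, dtype=int)] *= hard_boost  # mined hard negatives
    return w

labels = [0, 0, 3, 7, 0, 10]
weights = phase1_sample_weights(labels, hard_indices=[1, 2])
# These weights would feed torch.utils.data.WeightedRandomSampler(
#     weights, num_samples=len(weights), replacement=True)
```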

Common hyperparameters (edit in notebook):

  • Learning rate: 3e-5
  • Max length: 512
  • Batch size: 128
  • Mixed precision: fp16=True (requires GPU)
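
These settings map onto Hugging Face `TrainingArguments` roughly as follows; this is a sketch, and the exact argument set, output directory, and epoch/step counts in the notebooks may differ (MAX_LENGTH=512 is applied on the tokenizer side, not here):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="checkpoints",        # illustrative; the notebooks write under BASE_DIR
    learning_rate=3e-5,
    per_device_train_batch_size=128,
    fp16=True,                       # set False on CPU or GPUs without AMP support
    num_train_epochs=3,              # per-phase epoch count is an assumption here
    save_strategy="steps",
    save_steps=500,
    logging_steps=100,
)
```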

Augmentation & Canonicalization

  • UniversalCanonicalizer (Tree-sitter) normalizes identifiers and code layout in a language-aware way (Python, Java, C/C++, C#, JavaScript, PHP, Go).
  • AugmentationPipeline duplicates examples (with probability), applying canonicalization; training samples receive weights based on hardness and class.
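
To illustrate the idea, here is a deliberately simplified, Python-only stand-in for the canonicalizer built on the stdlib tokenizer (the real notebooks use Tree-sitter across seven languages); note it keeps the same best-effort contract of returning the original code when parsing fails:

```python
import io
import keyword
import tokenize

def canonicalize_python(code):
    """Rename every identifier (including builtins) to id_0, id_1, ...
    and fall back to the original code on tokenizer errors."""
    try:
        mapping, out = {}, []
        for tok in tokenize.generate_tokens(io.StringIO(code).readline):
            if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
                mapping.setdefault(tok.string, f"id_{len(mapping)}")
                out.append((tok.type, mapping[tok.string]))
            else:
                out.append((tok.type, tok.string))
        return tokenize.untokenize(out)
    except (tokenize.TokenError, IndentationError):
        return code  # best-effort: keep the original, as the notebooks do

canon = canonicalize_python("def add(a, b):\n    return a + b\n")
```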

Outputs

Saved in BASE_DIR by each training notebook:

  • CSV predictions for test: submission_<Model>_<Phase>_{Latest|Best}.csv
  • Probability files (for ensembling):
    • UnixCoder: submission_UnixCoder_P2_Best_probs.npy, submission_UnixCoder_P3_Final_probs.npy
    • GraphCodeBERT: submission_GCB_P2_Best_probs.npy, submission_GCB_P3_Final_probs.npy

Ensemble output (in current working dir by default):

  • submission_ENSEMBLE_UnixDominant-9.csv

Default ensemble weights (can be edited in ensemble.ipynb):

  • UnixCoder P2 Best: 0.43
  • UnixCoder P3 Final: 0.23
  • GraphCodeBERT P3 Final: 0.34
  • GraphCodeBERT P2 Best: 0.00
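
The blending step amounts to a weighted sum of the probability arrays followed by an argmax; a minimal sketch (the toy 3-class arrays stand in for the real 11-class .npy files, which would be loaded with np.load):

```python
import numpy as np

# Default mix from ensemble.ipynb (file name -> weight).
WEIGHTS = {
    "submission_UnixCoder_P2_Best_probs.npy": 0.43,
    "submission_UnixCoder_P3_Final_probs.npy": 0.23,
    "submission_GCB_P3_Final_probs.npy": 0.34,
    "submission_GCB_P2_Best_probs.npy": 0.00,
}

def ensemble_predict(prob_arrays, weights):
    """Weighted sum of (n_samples, n_classes) probability arrays, then argmax."""
    blended = sum(w * p for p, w in zip(prob_arrays, weights))
    return blended.argmax(axis=1)

# Toy 2-sample, 3-class arrays standing in for the real probability files.
a = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
b = np.array([[0.1, 0.7, 0.2], [0.1, 0.1, 0.8]])
preds = ensemble_predict([a, b], [0.7, 0.3])  # -> array([0, 2])
```

The predicted label ids would then be written out as the final submission CSV.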

Running Locally (optional)

These notebooks are Colab-ready. To run locally:

  1. Install dependencies (example):
pip install torch transformers datasets scikit-learn pandas numpy pyarrow fastparquet \
    tree-sitter==0.21.3 tree-sitter-languages==1.10.2
  2. In each training notebook:
    • Comment out from google.colab import drive and drive.mount(...)
    • Set BASE_DIR and JSON_PATH to local paths
    • If no GPU or no AMP support, set fp16=False in TrainingArguments

Tips & Troubleshooting

  • If probability files are missing for ensembling, ensure you completed P2 Best and P3 Final in both training notebooks (those phases save .npy).
  • If Tree-sitter parsing fails for a language, the original code is used as-is (canonicalization is best-effort).
  • If you hit out-of-memory (OOM), reduce BATCH_SIZE and/or MAX_LENGTH.

Customization

  • Switch models by editing MODEL_CHECKPOINT:
    • UnixCoder: microsoft/unixcoder-base
    • GraphCodeBERT: microsoft/graphcodebert-base
  • Edit FOLDER_NAME, SUB_PREFIX, and training/eval/save step intervals to match your experiment plan.
