This repository contains an end-to-end framework for detecting illicit cryptocurrency transactions using a combination of Graph Neural Networks (GNNs) and Active Learning (AL). The project evaluates how graph structure, temporal dynamics, and selective label acquisition influence model performance under severe label scarcity, using the Elliptic Bitcoin transaction dataset.
The Elliptic dataset provides a large-scale Bitcoin transaction graph with 203k transactions, 234k directed edges, and 166 node features. Only 2% of nodes are labeled as illicit, creating a highly imbalanced and label-scarce environment.
We study:
- Whether GNNs improve detection over feature-only baselines.
- Whether Active Learning can reduce the labeling cost needed to reach high accuracy.
- How both components interact under real-world AML constraints.
We evaluate:
- Models: MLP, GCN, EvolveGCN-O, DySAT
- AL methods: Random, Entropy, CMCS (certainty-oriented minority class sampling), Sequential (temporal-aware)
- Graph variants: Directed vs. Undirected, Local vs. Full feature sets
Project goals:
- Build a full dynamic-graph learning pipeline over the Elliptic dataset.
- Integrate multiple GNN architectures, including temporal ones (EvolveGCN, DySAT).
- Implement a complete Active Learning framework tailored to temporal, imbalanced graphs.
- Analyze convergence, label-efficiency, minority sampling behavior, and feature vs. structure importance.
- Provide a fully reproducible code base with clean configuration files.
**MLP:** A fully-connected neural network that treats each transaction independently.
- Uses only the node features (no graph structure).
- Strong baseline since Elliptic features are heavily engineered.
- Helps isolate the contribution of graph connectivity.
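As a minimal sketch of why the MLP ignores structure entirely, here is a two-layer forward pass on per-transaction features only (function and parameter names are illustrative, not the repo's API):

```python
import numpy as np

def mlp_forward(x, w1, b1, w2, b2):
    """Two-layer MLP forward pass on per-transaction features only.

    No adjacency information ever enters, which is what makes this a
    structure-free baseline for the Elliptic features.
    """
    h = np.maximum(x @ w1 + b1, 0)   # hidden layer with ReLU
    logits = h @ w2 + b2             # class scores (licit vs. illicit)
    return logits
```

Because the Elliptic features are heavily engineered, this simple map from features to logits is already a strong competitor to the graph models.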
**GCN:** A static GNN based on neighborhood aggregation.
- Each node updates its embedding by aggregating information from its neighbors.
- Captures local structural patterns such as suspicious clusters or dense fund-flow regions.
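A single GCN layer can be sketched in a few lines of NumPy: symmetric-normalized aggregation over neighbors (plus self-loops) followed by a linear map and ReLU. This is the standard Kipf-Welling formulation, not the repo's exact implementation:

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    """One GCN layer: ReLU(D^-1/2 (A+I) D^-1/2 X W).

    adj: (n, n) binary adjacency, feats: (n, f_in), weight: (f_in, f_out).
    Each node's new embedding mixes its own features with its neighbors'.
    """
    a_hat = adj + np.eye(adj.shape[0])           # add self-loops
    deg = a_hat.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    norm = d_inv_sqrt @ a_hat @ d_inv_sqrt       # symmetric normalization
    return np.maximum(norm @ feats @ weight, 0)  # aggregate, project, ReLU
```

Stacking two such layers lets information flow across two hops, which is how dense fund-flow regions become visible to the model.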
**EvolveGCN-O:** A temporal extension of GCN for dynamic graphs.
- Uses a recurrent mechanism (GRU) to evolve the GCN weights over time.
- Learns how transaction behavior changes across the 49 time steps.
- Does not rely on fixed node embeddings, making it suitable for evolving financial networks.
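The core EvolveGCN-O idea, treating the GCN weight matrix itself as the recurrent state of a GRU, can be illustrated with a simplified gate-by-gate update (a sketch under the assumption of a GRU with no external input; the gate parameters `U_z`, `U_r`, `U_h` are hypothetical names):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def evolve_weights(W_prev, U_z, U_r, U_h):
    """Simplified GRU-style update where the GCN weight matrix W is the
    recurrent hidden state, evolved from one time step to the next."""
    z = sigmoid(W_prev @ U_z)           # update gate: how much to change
    r = sigmoid(W_prev @ U_r)           # reset gate: how much history to keep
    h = np.tanh((r * W_prev) @ U_h)     # candidate weight matrix
    return (1 - z) * W_prev + z * h     # interpolate old and candidate weights
```

At each of the 49 snapshots, the evolved weights parameterize a fresh GCN pass, so node embeddings are recomputed rather than stored.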
**DySAT:** A dynamic GNN using dual attention mechanisms.
- Structural attention: learns which neighbors are most informative at each time step.
- Temporal attention: learns which past snapshots are relevant for the current state.
- Captures long-range dependencies and evolving laundering patterns.
- More expressive than simple GCN aggregation, especially in temporal settings.
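The temporal half of the dual attention can be sketched as scaled dot-product attention over one node's per-snapshot embeddings: the current state queries the history and the softmax weights decide which past snapshots matter (a miniature of DySAT's temporal self-attention, not the full multi-head version):

```python
import numpy as np

def temporal_attention(snapshots, query_idx=-1):
    """Attend over a node's embeddings across time.

    snapshots: (T, d) array, one row per time step. Returns a (d,)
    context vector: a softmax-weighted mix of all snapshots, weighted
    by similarity to the current (query) snapshot.
    """
    q = snapshots[query_idx]                           # current state as query
    scores = snapshots @ q / np.sqrt(snapshots.shape[1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                           # softmax over time
    return weights @ snapshots
```

High attention on distant snapshots is exactly how long-range laundering patterns can influence the current prediction.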
The Active Learning loop:
- Train on the current labeled set
- Score the unlabeled pool
- Acquire the most informative nodes
- Expand the labeled set and repeat
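The four steps above can be sketched as one generic loop. The callables (`train_fn`, `score_fn`, `acquire_fn`) are placeholders for the project's actual model and strategy implementations, not its API:

```python
import numpy as np

def active_learning_loop(train_fn, score_fn, pool_idx, labeled_idx,
                         acquire_fn, budget, batch_size):
    """Generic AL loop: train -> score pool -> acquire -> expand labeled set.

    train_fn(labeled) -> model; score_fn(model, pool) -> per-node scores;
    acquire_fn(scores, pool, k) -> the k indices to label next.
    """
    pool = list(pool_idx)
    labeled = list(labeled_idx)
    while budget > 0 and pool:
        model = train_fn(labeled)                   # Train
        scores = score_fn(model, pool)              # Score
        k = min(batch_size, budget, len(pool))
        picked = list(acquire_fn(scores, pool, k))  # Acquire
        labeled.extend(picked)                      # Expand labeled set
        picked_set = set(picked)
        pool = [i for i in pool if i not in picked_set]
        budget -= k
    return labeled
```

Plugging in a uniform `acquire_fn` gives the Random baseline; the strategies below only change that one callable.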
**Random:** Selects unlabeled nodes uniformly at random.
- Serves as a control baseline.
- Useful for measuring whether more sophisticated strategies actually provide value.
- Ensures unbiased but inefficient coverage of the data.
**Entropy:** Selects nodes with the highest predictive uncertainty.
- Measures uncertainty as the entropy of the predicted probability distribution.
- Focuses on samples near the decision boundary.
- Often accelerates learning when the model's confidence is meaningful.
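Entropy acquisition reduces to a few lines: compute the entropy of each pool node's predicted distribution and take the top-k (the function name is illustrative):

```python
import numpy as np

def entropy_acquire(probs, pool_idx, k):
    """Pick the k pool nodes whose predicted class distribution has
    maximum entropy (i.e., the model is least decided about them).

    probs: (n_pool, n_classes) predicted probabilities for the pool.
    """
    eps = 1e-12                                   # guard against log(0)
    ent = -(probs * np.log(probs + eps)).sum(axis=1)
    top = np.argsort(-ent)[:k]                    # highest entropy first
    return [pool_idx[i] for i in top]
```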
**CMCS:** A class-aware strategy designed for highly imbalanced datasets like Elliptic.
- Prioritizes nodes the model predicts as illicit (minority class).
- Addresses the tendency of uncertainty-based methods to oversample majority (licit) nodes.
- Aims to increase the proportion of illicit samples in the labeled pool, improving minority-class F1.
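A simplified reading of this strategy: rank the pool by the model's confidence that a node is illicit and take the top-k (this is a sketch of the idea, not the repo's exact scoring rule):

```python
import numpy as np

def cmcs_acquire(probs, pool_idx, k, illicit_class=1):
    """Class-aware acquisition: pick the k pool nodes the model is most
    confident are illicit, countering the licit-heavy bias of pure
    uncertainty sampling on an imbalanced graph."""
    illicit_p = probs[:, illicit_class]
    top = np.argsort(-illicit_p)[:k]   # highest illicit probability first
    return [pool_idx[i] for i in top]
```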
**Sequential:** Selects nodes based on chronological order in the Bitcoin transaction graph.
- Mimics real-world AML workflows where future data is unavailable.
- Ensures the model learns only from information available "up to this time".
- Useful for dynamic or streaming scenarios.
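In sketch form, sequential acquisition simply sorts the pool by time step and labels the earliest nodes first (function name illustrative):

```python
import numpy as np

def sequential_acquire(timesteps, pool_idx, k):
    """Pick the k chronologically earliest unlabeled nodes.

    timesteps: per-pool-node time step (1..49 in Elliptic). Only nodes
    available 'up to this time' ever enter the labeled set, as in a
    real AML review queue.
    """
    order = np.argsort(timesteps, kind="stable")[:k]
    return [pool_idx[i] for i in order]
```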
- 203,769 transactions
- 49 time steps
- Labels: 21% licit, 2% illicit, 77% unknown
- Chronological split: Train 1-34, Val 35-41, Test 42-49
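Given each node's time step, the chronological split reduces to three boolean masks (a sketch of the split logic, not the repo's loader):

```python
import numpy as np

def temporal_split(timesteps):
    """Boolean masks for the chronological Elliptic split:
    train on steps 1-34, validate on 35-41, test on 42-49."""
    t = np.asarray(timesteps)
    train = t <= 34
    val = (t >= 35) & (t <= 41)
    test = t >= 42
    return train, val, test
```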
Metrics:
- F1-illicit
- AUPRC
- Performance vs labeling budget
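F1 on the illicit class can be computed directly from the confusion counts; this sketch assumes hard 0/1 predictions with 1 = illicit:

```python
import numpy as np

def f1_illicit(y_true, y_pred, illicit=1):
    """F1 score for the illicit (minority) class from hard predictions.

    For AUPRC, a ranking metric, one would instead feed the model's
    illicit-class scores to e.g. sklearn.metrics.average_precision_score.
    """
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == illicit) & (y_true == illicit))
    fp = np.sum((y_pred == illicit) & (y_true != illicit))
    fn = np.sum((y_pred != illicit) & (y_true == illicit))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

"Performance vs labeling budget" then just means recomputing these metrics after every acquisition round and plotting them against the number of labels spent.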
Key findings:
- MLP is the strongest baseline, outperforming all GNNs under passive learning.
- The graph signal in Elliptic is weak, causing GCN/EvolveGCN/DySAT to underperform relative to feature-only models.
- Active Learning improves label efficiency: all models reach passive performance with far fewer labels.
- CMCS & Sequential increase illicit coverage but do not improve F1; Random correlates best with actual F1 gains.
- Artificial minority balancing does not help and often destabilizes training.
Future work:
- Richer AL experiments: more AL rounds, multiple random seeds, alternative temporal splits.
- Test stronger temporal GNNs (e.g., DySAT) in the AL setting under full training budgets to check whether GNNs can close the gap.
- Explore minority-learning mechanisms: understand which models naturally detect rare illicit patterns without forced balancing.
- Cross-dataset validation: determine whether Elliptic's weak graph signal is dataset-specific or general to AML graphs.
- Investigate temporally-aware AL strategies designed for streaming or sequential transaction environments.
Crypto_Fraud/
├── code/
│   ├── active_learning.py
│   ├── data.py
│   ├── models.py
│   ├── run_experiments.py
│   ├── training.py
│   └── visual.py
├── configs/
├── elliptic_bitcoin_dataset/
├── results/
├── visualizations/
├── README.md
└── requirements.txt
Follow these steps to run the project from scratch.
Open a terminal and run:
git clone https://github.com/danielbehargithub/Crypto-Fraud.git
cd Crypto-Fraud

Due to licensing restrictions, the Elliptic dataset is not included in this repository.
- Download the dataset manually from Kaggle: https://www.kaggle.com/datasets/ellipticco/elliptic-data-set
- Extract the files into the following folder inside the project: `elliptic_bitcoin_dataset/`
Final structure should look like:
Crypto_Fraud/
└── elliptic_bitcoin_dataset/
    ├── elliptic_txs_classes.csv
    ├── elliptic_txs_features.csv
    └── elliptic_txs_edgelist.csv
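With the files in place, a minimal stdlib sketch of loading them looks like this (the function name is illustrative; the repo's own loader lives in `code/data.py`). Note that the features CSV has no header row, while the classes and edgelist CSVs do:

```python
import csv

def load_elliptic(root="elliptic_bitcoin_dataset"):
    """Load the three Elliptic CSVs into plain Python structures.

    Returns (features, classes, edges): features maps txId -> float list,
    classes maps txId -> '1' (illicit) / '2' (licit) / 'unknown',
    edges is a list of (src, dst) txId pairs.
    """
    with open(f"{root}/elliptic_txs_features.csv") as f:
        features = {row[0]: [float(v) for v in row[1:]] for row in csv.reader(f)}
    with open(f"{root}/elliptic_txs_classes.csv") as f:
        reader = csv.reader(f)
        next(reader)                              # skip 'txId,class' header
        classes = {tx: cls for tx, cls in reader}
    with open(f"{root}/elliptic_txs_edgelist.csv") as f:
        reader = csv.reader(f)
        next(reader)                              # skip 'txId1,txId2' header
        edges = [(src, dst) for src, dst in reader]
    return features, classes, edges
```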
From the project root, run:
pip install -r requirements.txt

The main configuration file is located at:
configs/config_run_experiments.yaml
In this file you can modify the following parameters:
# Model / Graph combinations
graph_modes: ["dag", "undirected"]   # graph construction: DAG (directed) or undirected
model_names: ["GCN", "MLP"]          # models to run: GCN, MLP, EVOLVEGCN, DYSAT
feature_sets: ["local", "all"]       # feature configuration: local-only or all features
split_types: ["temporal"]            # data split type: temporal or random

# Active Learning Methods
al_methods:
  - "entropy"
  - "random"
  - "cmcs"
  - "sequential"

Then run the experiments and generate the plots:

python code/run_experiments.py
python code/visual.py

All plots will be saved under:
visualizations/
Enjoy exploring illicit transaction detection using GNNs + Active Learning! 🎯