Active Learning and GNN-based models for illicit Bitcoin transaction detection using the Elliptic dataset.

danielbehargithub/Crypto-Fraud

Active Learning for Illicit Transaction Detection on the Elliptic Bitcoin Dataset

This repository contains an end-to-end framework for detecting illicit cryptocurrency transactions using a combination of Graph Neural Networks (GNNs) and Active Learning (AL). The project evaluates how graph structure, temporal dynamics, and selective label acquisition influence model performance under severe label scarcity, using the Elliptic Bitcoin transaction dataset.

📄 Project Report

Project Overview

The Elliptic dataset provides a large-scale Bitcoin transaction graph with 203k transactions, 234k directed edges, and 166 node features. Only 2% of nodes are labeled as illicit (21% are licit and 77% are unlabeled), creating a highly imbalanced and label-scarce environment.

We study:

  • Whether GNNs improve detection over feature-only baselines.
  • Whether Active Learning can reduce the labeling cost needed to reach high accuracy.
  • How both components interact under real-world AML constraints.

We evaluate:

  • Models: MLP, GCN, EvolveGCN-O, DySAT
  • AL methods: Random, Entropy, CMCS (Certainty-Oriented Minority Class Sampling), Sequential (temporal-aware)
  • Graph variants: Directed vs. Undirected, Local vs. Full feature sets

Main Contributions

  • Build a full dynamic-graph learning pipeline over the Elliptic dataset.
  • Integrate multiple GNN architectures, including temporal ones (EvolveGCN, DySAT).
  • Implement a complete Active Learning framework tailored to temporal, imbalanced graphs.
  • Analyze convergence, label-efficiency, minority sampling behavior, and feature vs. structure importance.
  • Provide a fully reproducible code base with clean configuration files.

Models

1. MLP (Baseline)

A fully-connected neural network that treats each transaction independently.

  • Uses only the node features (no graph structure).
  • Strong baseline since Elliptic features are heavily engineered.
  • Helps isolate the contribution of graph connectivity.

2. GCN (Graph Convolutional Network)

A static GNN based on neighborhood aggregation.

  • Each node updates its embedding by aggregating information from its neighbors.
  • Captures local structural patterns such as suspicious clusters or dense fund-flow regions.
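
The neighborhood aggregation described above can be sketched in a few lines. This is a minimal illustration, not the repository's implementation: it uses the symmetric normalization from Kipf and Welling's GCN and omits the learned weight matrix and nonlinearity. The function name `gcn_aggregate` and the toy graph are assumptions made for the example.

```python
import math


def gcn_aggregate(features, edges):
    """One GCN propagation step without learned weights (illustrative only).

    features: list of per-node feature vectors (lists of floats)
    edges: list of undirected (u, v) index pairs
    """
    n = len(features)
    # Add a self-loop so every node keeps part of its own signal.
    neighbors = [{i} for i in range(n)]
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    degree = [len(nb) for nb in neighbors]

    dim = len(features[0])
    out = []
    for i in range(n):
        agg = [0.0] * dim
        for j in sorted(neighbors[i]):
            # Symmetric normalization: 1 / sqrt(d_i * d_j).
            w = 1.0 / math.sqrt(degree[i] * degree[j])
            for k in range(dim):
                agg[k] += w * features[j][k]
        out.append(agg)
    return out


# Toy path graph 0 - 1 - 2 where only node 0 carries a signal.
smoothed = gcn_aggregate([[1.0], [0.0], [0.0]], [(0, 1), (1, 2)])
```

After one step the signal reaches node 1 but not node 2, which is why stacking layers widens the receptive field.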

3. EvolveGCN-O

A temporal extension of GCN for dynamic graphs.

  • Uses a recurrent mechanism (GRU) to evolve the GCN weights over time.
  • Learns how transaction behavior changes across the 49 time steps.
  • Does not rely on fixed node embeddings, making it suitable for evolving financial networks.

4. DySAT (Dynamic Self-Attention Network)

A dynamic GNN using dual attention mechanisms.

  • Structural attention: learns which neighbors are most informative at each time step.
  • Temporal attention: learns which past snapshots are relevant for the current state.
  • Captures long-range dependencies and evolving laundering patterns.
  • More expressive than simple GCN aggregation, especially in temporal settings.
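
To make the temporal attention idea concrete, here is a simplified sketch for a single node. It assumes scaled dot-product attention with a causal mask (each snapshot attends only to itself and earlier snapshots) and leaves out the learned query/key/value projections and multiple heads that a full DySAT layer would have. The name `temporal_attention` is hypothetical.

```python
import math


def temporal_attention(snapshots):
    """Causal self-attention over one node's per-snapshot embeddings.

    snapshots: list (ordered by time) of embedding vectors for a single node.
    Returns one attended vector per time step.
    """
    d = len(snapshots[0])
    out = []
    for t, query in enumerate(snapshots):
        # Only snapshots 0..t are visible: the future is masked out.
        logits = [
            sum(q * k for q, k in zip(query, snapshots[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        # Numerically stable softmax over the visible snapshots.
        m = max(logits)
        weights = [math.exp(x - m) for x in logits]
        z = sum(weights)
        weights = [w / z for w in weights]
        out.append(
            [sum(w * snapshots[s][k] for s, w in enumerate(weights)) for k in range(d)]
        )
    return out


attended = temporal_attention([[1.0, 0.0], [0.0, 1.0]])
```

The first output equals the first snapshot (nothing earlier to attend to), while the second is a weighted mix of both snapshots.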

Active Learning Framework

Loop:

  1. Train
  2. Score
  3. Acquire
  4. Expand labeled set

Active Learning Strategies

1. Random Sampling (Baseline)

Selects unlabeled nodes uniformly at random.

  • Serves as a control baseline.
  • Useful for measuring whether more sophisticated strategies actually provide value.
  • Ensures unbiased but inefficient coverage of the data.

2. Entropy Sampling

Selects nodes with the highest predictive uncertainty.

  • Measures uncertainty as the entropy of the predicted probability distribution.
  • Focuses on samples near the decision boundary.
  • Often accelerates learning when the model's confidence is meaningful.
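
As a small sketch of the scoring rule, the entropy of a predicted distribution p is H(p) = -Σ p_c log p_c, and the nodes with the largest H are queried first. The function names and the toy probabilities below are assumptions for illustration.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_sampling(probabilities, k):
    """Return the k node ids whose predictions have the highest entropy.

    probabilities: dict mapping node id -> list of class probabilities
    """
    ranked = sorted(
        probabilities,
        key=lambda n: predictive_entropy(probabilities[n]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical [licit, illicit] predictions for three unlabeled transactions.
preds = {"a": [0.5, 0.5], "b": [0.99, 0.01], "c": [0.7, 0.3]}
query = entropy_sampling(preds, k=2)
```

Here `a` and `c` are selected because their predictions lie closest to the decision boundary, while the confident prediction for `b` is skipped.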

3. CMCS (Certainty-Oriented Minority Class Sampling)

A class-aware strategy designed for highly imbalanced datasets like Elliptic.

  • Prioritizes nodes the model predicts as illicit (minority class).
  • Addresses the tendency of uncertainty-based methods to oversample majority (licit) nodes.
  • Aims to increase the proportion of illicit samples in the labeled pool, improving minority-class F1.
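
One simple way to realize this idea is to rank unlabeled nodes by their predicted illicit probability and query the most confident ones. The sketch below assumes exactly that rule; the scoring actually used in the repository may differ, and `cmcs_sampling` is a hypothetical name.

```python
def cmcs_sampling(illicit_probs, k):
    """Pick the k nodes the model is most certain are illicit.

    illicit_probs: dict mapping node id -> predicted probability of the
    illicit (minority) class. Assumed scoring rule, for illustration only.
    """
    ranked = sorted(illicit_probs, key=lambda n: illicit_probs[n], reverse=True)
    return ranked[:k]


# Hypothetical predictions for four unlabeled transactions.
probs = {"t1": 0.10, "t2": 0.95, "t3": 0.60, "t4": 0.40}
query = cmcs_sampling(probs, k=2)
```

Unlike entropy sampling, which would favor `t3` and `t4` (closest to 0.5), this rule picks `t2` first because it targets likely minority-class nodes rather than uncertain ones.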

4. Sequential (Temporal-Aware) Sampling

Selects nodes based on chronological order in the Bitcoin transaction graph.

  • Mimics real-world AML workflows where future data is unavailable.
  • Ensures the model learns only from information available "up to this time".
  • Useful for dynamic or streaming scenarios.
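
A minimal sketch of chronological acquisition, assuming each node carries its Elliptic time step and that ties within a time step are broken by node id. Both the function name and the tie-breaking rule are assumptions, not taken from the repository.

```python
def sequential_sampling(time_steps, unlabeled, k):
    """Pick the k unlabeled nodes with the earliest time steps.

    time_steps: dict mapping node id -> integer time step
    unlabeled: iterable of node ids still available for labeling
    """
    # Sort by time step first, then by node id for a deterministic order.
    ranked = sorted(unlabeled, key=lambda n: (time_steps[n], n))
    return ranked[:k]


steps = {"a": 3, "b": 1, "c": 2, "d": 1}
query = sequential_sampling(steps, ["a", "b", "c", "d"], k=3)
```

No model scores are involved here, so the labeled set simply grows forward in time, one batch after another.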

🧪 Experimental Setup

  • 203,769 transactions
  • 49 time steps
  • Labels: 21% licit, 2% illicit, 77% unknown
  • Chronological split: Train 1–34, Val 35–41, Test 42–49
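
The chronological split can be expressed as a simple bucketing by time step. The boundaries 34 and 41 come from the setup above; the function name and the dict-based input are assumptions for this sketch.

```python
def temporal_split(time_steps, train_end=34, val_end=41):
    """Assign nodes to train/val/test by time step.

    time_steps: dict mapping node id -> integer time step (1..49)
    Train covers steps 1..train_end, validation train_end+1..val_end,
    and test everything after val_end.
    """
    split = {"train": [], "val": [], "test": []}
    for node, t in time_steps.items():
        if t <= train_end:
            split["train"].append(node)
        elif t <= val_end:
            split["val"].append(node)
        else:
            split["test"].append(node)
    return split


example = temporal_split({"a": 1, "b": 34, "c": 35, "d": 41, "e": 42, "f": 49})
```

Because the test steps are strictly later than the training steps, no information from the future leaks into training.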

Metrics:

  • F1-illicit
  • AUPRC
  • Performance vs labeling budget
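
For reference, the two metrics can be computed as follows. This is a dependency-free sketch: it assumes the illicit class is encoded as 1 and uses average precision as the AUPRC estimate, which may differ slightly from whatever estimator the repository uses. Tied scores are not treated specially.

```python
def f1_illicit(y_true, y_pred, positive=1):
    """F1 score of the illicit class (assumed to be encoded as `positive`)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_precision(y_true, scores, positive=1):
    """Average precision: mean of precision@rank over the positive examples."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == positive:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


labels = [1, 0, 1, 0, 0]
f1 = f1_illicit(labels, [1, 1, 0, 0, 0])
ap = average_precision(labels, [0.9, 0.8, 0.7, 0.2, 0.1])
```

Both metrics focus on the minority class, so a model that labels everything licit scores zero despite high overall accuracy.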

📈 Key Findings

  • MLP is the strongest baseline, outperforming all GNNs under passive learning.

  • Graph signal in Elliptic is weak, causing GCN/EvolveGCN/DySAT to underperform relative to feature-only models.

  • Active Learning improves label efficiency: all models reach their passive-learning performance with far fewer labels.

  • CMCS and Sequential increase illicit coverage in the labeled pool but do not improve F1; Random correlates best with actual F1 gains.

  • Artificial minority balancing does not help and often destabilizes training.


🔮 Future Work

  • Richer AL experiments: more AL rounds, multiple random seeds, alternative temporal splits.

  • Test stronger temporal GNNs (e.g., DySAT) within the AL loop under full training budgets to check whether GNNs can close the gap.

  • Explore minority-learning mechanisms: understand which models naturally detect rare illicit patterns without forced balancing.

  • Cross-dataset validation: determine whether Elliptic's weak graph signal is dataset-specific or general to AML graphs.

  • Investigate temporally-aware AL strategies designed for streaming or sequential transaction environments.


📂 Repository Structure

Crypto-Fraud/
├── code/
│   ├── active_learning.py
│   ├── data.py
│   ├── models.py
│   ├── run_experiments.py
│   ├── training.py
│   └── visual.py
├── configs/
├── elliptic_bitcoin_dataset/
├── results/
├── vizualizations/
├── README.md
└── requirements.txt

Setup Instructions

Follow these steps to run the project from scratch.


1. Clone the repository

Open a terminal and run:

git clone https://github.com/danielbehargithub/Crypto-Fraud.git
cd Crypto-Fraud

2. Download and place the Elliptic dataset

Due to licensing restrictions, the Elliptic dataset is not included in this repository.

  1. Download the dataset manually from Kaggle:
    🔗 https://www.kaggle.com/datasets/ellipticco/elliptic-data-set

  2. Extract the files into the following folder inside the project:

elliptic_bitcoin_dataset/

Final structure should look like:

Crypto-Fraud/
├── elliptic_bitcoin_dataset/
│   ├── elliptic_txs_classes.csv
│   ├── elliptic_txs_features.csv
│   └── elliptic_txs_edgelist.csv
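
Before running anything, it can help to confirm that the three CSV files are where the project expects them. The helper below is a hypothetical convenience check, not part of the repository; it assumes it is run from the project root.

```python
from pathlib import Path

# File names taken from the expected structure shown above.
EXPECTED_FILES = (
    "elliptic_txs_classes.csv",
    "elliptic_txs_features.csv",
    "elliptic_txs_edgelist.csv",
)


def missing_dataset_files(root="elliptic_bitcoin_dataset"):
    """Return the expected dataset files that are absent under `root`."""
    root = Path(root)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_dataset_files()
    if missing:
        print("Missing dataset files:", ", ".join(missing))
    else:
        print("All Elliptic dataset files found.")
```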

3. Install dependencies

From the project root, run:

pip install -r requirements.txt

4. Configure the experiment

The main configuration file is located at:

configs/config_run_experiments.yaml

In this file you can modify the following parameters:

# Model / Graph combinations
graph_modes: ["dag", "undirected"]   # graph construction: DAG (directed) or undirected
model_names: ["GCN", "MLP"]          # models to run: GCN, MLP, EVOLVEGCN, DYSAT
feature_sets: ["local", "all"]       # feature configuration: local-only or all features
split_types: ["temporal"]            # data split type: temporal or random

# Active Learning Methods
al_methods:
  - "entropy"
  - "random"
  - "cmcs"
  - "sequential"
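
To get a feel for how many runs a configuration implies, the sketch below expands the lists above into individual experiments. It assumes that every combination is run (a full Cartesian product), which is a guess about how `run_experiments.py` consumes the file; `expand_runs` is a hypothetical helper and the config is inlined as a dict rather than loaded from YAML.

```python
from itertools import product

# Same values as the example configuration above, inlined for illustration.
config = {
    "graph_modes": ["dag", "undirected"],
    "model_names": ["GCN", "MLP"],
    "feature_sets": ["local", "all"],
    "split_types": ["temporal"],
    "al_methods": ["entropy", "random", "cmcs", "sequential"],
}


def expand_runs(cfg):
    """Return one dict per combination of the configured options."""
    keys = list(cfg)
    return [dict(zip(keys, combo)) for combo in product(*(cfg[k] for k in keys))]


runs = expand_runs(config)
print(len(runs), "runs; first:", runs[0])
```

Under this assumption the example configuration yields 2 × 2 × 2 × 1 × 4 = 32 runs, so trimming the lists is the easiest way to shorten an experiment.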

5. Run the experiments

python code/run_experiments.py

6. Generate visualizations

python code/visual.py

All plots will be saved under:

visualizations/

Enjoy exploring illicit transaction detection using GNNs + Active Learning 🎯
