This project involves training and evaluating various machine learning models for classification tasks using the CGT dataset. The models include a BERT-based model, a Feedforward Neural Network (FFNN), an LSTM-based classifier, and a pool of traditional classifiers.
Companion workshop paper presented at OVERLAY 2024: "A Comparison of Machine Learning Techniques for Ethereum Smart Contract Vulnerability Detection".
To get started with this project, install the required libraries using pip:

```bash
pip install numpy pandas torch scikit-learn tqdm transformers xgboost
```
The dataset used in this project is the CGT dataset, which can be found at https://github.com/gsalzer/cgt.
- Clone the dataset repository:

```bash
git clone https://github.com/gsalzer/cgt.git
```
- Place the cloned repository in the appropriate directory structure:

```
project_directory/
├── dataset/              # Cloned CGT dataset repository
└── your_project_files/   # Your project files
```
The project uses a GPU if available:

```python
DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
```
Random seeds are set for reproducibility:

```python
RANDOM_SEED = 0
```
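A minimal sketch of how the seed might be applied across libraries (only `RANDOM_SEED = 0` is given above; the exact seeding calls are an assumption):

```python
import random
import numpy as np
import torch

random.seed(RANDOM_SEED)        # Python's built-in RNG
np.random.seed(RANDOM_SEED)     # NumPy
torch.manual_seed(RANDOM_SEED)  # PyTorch (CPU and CUDA)
```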
Specify the path to the dataset:

```python
PATH_TO_DATASET = os.path.join("..", "dataset", "cgt")
```
- Model Type: BERT (`microsoft/codebert-base`)
- Max Features: 500
- Batch Size: 1
- Number of Folds: 10
- Number of Epochs: 25
- Number of Labels: 20
- Learning Rate: 0.001
- Test Size: 0.1
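These settings map onto module-level constants; some names (`BERT_MODEL_TYPE`, `MAX_FEATURES`, `NUM_LABELS`, `TEST_SIZE`) appear in the snippets below, while the remaining names are assumptions chosen to match. A sketch:

```python
BERT_MODEL_TYPE = "microsoft/codebert-base"
MAX_FEATURES = 500
BATCH_SIZE = 1     # assumed name
NUM_FOLDS = 10     # assumed name
NUM_EPOCHS = 25    # assumed name
NUM_LABELS = 20
LEARNING_RATE = 0.001  # assumed name
TEST_SIZE = 0.1
```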
The project handles three file types: `source`, `runtime`, and `bytecode`.
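The selected type is stored in the `FILE_TYPE` constant used below to build the log directory; a hypothetical example value:

```python
FILE_TYPE = "source"  # one of "source", "runtime", "bytecode" (example value, an assumption)
```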
Logs are stored in a directory created if it doesn't already exist:

```python
LOG_DIR = os.path.join("log", FILE_TYPE)
if not os.path.exists(LOG_DIR):
    os.makedirs(LOG_DIR)
```
Functions are provided to preprocess hex data and Solidity code:
- Hex Data Preprocessing: Converts hex data to a readable byte string.
- Solidity Code Preprocessing: Removes comments and blank lines.
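A minimal sketch of what these helpers might look like (the function names and exact regexes are assumptions; only the behavior described above is given):

```python
import re

def preprocess_hex(hex_data):
    # Assumed helper: strip an optional "0x" prefix and decode the
    # hex string into a byte string.
    hex_data = hex_data[2:] if hex_data.startswith("0x") else hex_data
    return bytes.fromhex(hex_data)

def preprocess_solidity(code):
    # Assumed helper: drop // line comments, /* ... */ block comments,
    # and blank lines from Solidity source.
    code = re.sub(r"//.*", "", code)
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.DOTALL)
    return "\n".join(line for line in code.splitlines() if line.strip())
```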
Initialize inputs, labels, and groundtruth from the dataset:

```python
inputs, labels, gt = init_inputs_and_gt(dataset)
```
Set up labels based on the groundtruth:

```python
labels = set_labels(dataset, labels, gt)
```
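Since the models are configured for multi-label classification over 20 labels (see `problem_type = "multi_label_classification"` in the BERT snippet below), each entry in `labels` is presumably a multi-hot vector; an illustrative sketch (the encoding is an assumption):

```python
import numpy as np

# Hypothetical illustration: a contract tagged with vulnerability
# classes 2 and 7 out of NUM_LABELS = 20 classes.
label_vector = np.zeros(20, dtype=np.float32)
label_vector[[2, 7]] = 1.0
```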
A TF-IDF vectorizer is used to convert text data into numerical features:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

VECTORIZER = TfidfVectorizer(max_features=MAX_FEATURES)
```
The `BERTModelTrainer` class handles training and evaluation of a BERT-based model; it uses the `transformers` library to load a BERT model for sequence classification.
`FFNNClassifier` is a simple feedforward neural network with three fully connected layers for classification tasks.
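A minimal sketch of such a network (the hidden sizes are assumptions; the input and output dimensions follow `MAX_FEATURES` and `NUM_LABELS` above):

```python
import torch.nn as nn

class FFNNClassifier(nn.Module):
    # Sketch: three fully connected layers; hidden sizes 256 and 64 are assumed.
    def __init__(self, input_dim=500, num_labels=20):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 64),
            nn.ReLU(),
            nn.Linear(64, num_labels),
        )

    def forward(self, x):
        # Returns raw logits; BCEWithLogitsLoss is typical for multi-label setups.
        return self.net(x)
```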
`LSTMClassifier` is an LSTM-based model for text classification, initialized with pretrained GloVe embeddings.
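A sketch matching the constructor signature used in the usage example below, `LSTMClassifier(vocab_size, embedding_dim, hidden_dim, embedding_matrix)` (the layer layout is an assumption):

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    # Sketch: embedding layer initialized from a pretrained GloVe matrix,
    # a single LSTM layer, and a linear classification head.
    def __init__(self, vocab_size, embedding_dim, hidden_dim, embedding_matrix, num_labels=20):
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(
            torch.tensor(embedding_matrix, dtype=torch.float), freeze=False
        )
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_labels)

    def forward(self, x):
        _, (h_n, _) = self.lstm(self.embedding(x))
        return self.fc(h_n[-1])  # last hidden state -> logits
```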
Download the GloVe embeddings from Kaggle and extract the file to the appropriate directory:

```
project_directory/
├── asset/
│   └── glove.6B.100d.txt   # GloVe embeddings file
└── your_project_files/     # Your project files
```
Load the GloVe embeddings:

```python
glove_embeddings = load_glove_embeddings(os.path.join("..", "asset", "glove.6B.100d.txt"))
```
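A sketch of what `load_glove_embeddings` might do, assuming the standard GloVe text format (one `word v1 v2 ... v100` entry per line):

```python
import numpy as np

def load_glove_embeddings(path):
    # Parse each line into a {word: vector} mapping.
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings
```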
The `Trainer` class handles the training and evaluation of a neural network model.
`CrossValidator` performs k-fold cross-validation of a model, training and evaluating it across multiple folds.
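A sketch of the loop presumably wrapped by `CrossValidator.k_fold_cv` (the `trainer.fit`/`trainer.evaluate` calls are hypothetical; the fold count and batch size follow the configuration above):

```python
import numpy as np
from sklearn.model_selection import KFold
from torch.utils.data import DataLoader, Subset

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kfold.split(np.arange(len(train_data)))):
    train_loader = DataLoader(Subset(train_data, train_idx), batch_size=1)
    val_loader = DataLoader(Subset(train_data, val_idx), batch_size=1)
    # trainer.fit(train_loader)            # hypothetical Trainer API
    # metrics = trainer.evaluate(val_loader)
```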
`ClassifiersPoolEvaluator` evaluates a pool of traditional classifiers using TF-IDF features and k-fold cross-validation.
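The exact pool is not listed in this README; a plausible sketch given the dependency list (scikit-learn and xgboost), with the specific classifiers being assumptions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Hypothetical pool; each base classifier is wrapped for multi-label output.
POOL = {
    "logreg": OneVsRestClassifier(LogisticRegression(max_iter=1000)),
    "tree": OneVsRestClassifier(DecisionTreeClassifier()),
    "forest": OneVsRestClassifier(RandomForestClassifier()),
    "xgb": OneVsRestClassifier(XGBClassifier()),
}
```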
- BERT Model:

```python
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset
from transformers import RobertaForSequenceClassification, RobertaTokenizer

model = RobertaForSequenceClassification.from_pretrained(
    BERT_MODEL_TYPE, num_labels=NUM_LABELS, ignore_mismatched_sizes=True
)
model.config.problem_type = "multi_label_classification"
model.to(DEVICE)

tokenizer = RobertaTokenizer.from_pretrained(BERT_MODEL_TYPE)
x, y = tokenizer(
    INPUTS,
    add_special_tokens=True,
    max_length=512,
    return_token_type_ids=False,
    padding="max_length",
    truncation=True,
    return_attention_mask=True,
    return_tensors="pt",
), LABELS

# Split input ids, attention masks, and labels together so they stay aligned.
x_train, x_test, train_masks, test_masks, y_train, y_test = train_test_split(
    x["input_ids"], x["attention_mask"], y, test_size=TEST_SIZE, random_state=RANDOM_SEED
)

train_data = TensorDataset(x_train, train_masks, torch.tensor(y_train).float())
test_data = TensorDataset(x_test, test_masks, torch.tensor(y_test).float())
CrossValidator(BERTModelTrainer(model), train_data, test_data).k_fold_cv(log_id="bert")
```
- FFNN Model:

```python
model = FFNNClassifier()
x = torch.FloatTensor(VECTORIZER.fit_transform(INPUTS).toarray())
y = torch.FloatTensor(LABELS)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=TEST_SIZE, random_state=RANDOM_SEED
)
train_data = TensorDataset(x_train, y_train)
test_data = TensorDataset(x_test, y_test)
CrossValidator(Trainer(model), train_data, test_data).k_fold_cv(log_id="ffnn")
```
- LSTM Model:

```python
embeddings = load_glove_embeddings("path_to_glove_file")
vocab_size = len(embeddings)
embedding_dim = len(next(iter(embeddings.values())))
hidden_dim = 128
model = LSTMClassifier(vocab_size, embedding_dim, hidden_dim, np.array(list(embeddings.values())))

# Assumes a tokenizer that converts text to sequences of vocabulary indices.
tokenizer = SomeTokenizer(vocab=embeddings.keys())
sequences = tokenizer.texts_to_sequences(INPUTS)
# Assumes sequences are padded to a maximum length.
padded_sequences = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

x = torch.tensor(padded_sequences)
y = torch.FloatTensor(LABELS)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=TEST_SIZE, random_state=RANDOM_SEED
)
train_data = TensorDataset(x_train, y_train)
test_data = TensorDataset(x_test, y_test)
CrossValidator(Trainer(model), train_data, test_data).k_fold_cv(log_id="lstm")
```
- Classifiers Pool Evaluation:

```python
evaluator = ClassifiersPoolEvaluator()
evaluator.pool_evaluation()
```
Metrics such as precision, recall, and F1 score are calculated and saved to a CSV file.
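A sketch of how such metrics might be computed and exported (the averaging mode, column layout, and file name are assumptions; only the metric names and CSV output are stated above):

```python
import os
import numpy as np
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

# y_true / y_pred: multi-hot label matrices of shape (n_samples, NUM_LABELS);
# tiny stand-ins here for illustration.
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0]])

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
pd.DataFrame([{"precision": precision, "recall": recall, "f1": f1}]).to_csv(
    os.path.join("log", "metrics.csv"), index=False  # hypothetical file name
)
```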