CGT Classification Project

This project trains and evaluates several machine learning models for smart contract vulnerability classification on the CGT dataset. The models include a BERT-based model, a feedforward neural network (FFNN), an LSTM-based classifier, and a pool of traditional classifiers.

A companion workshop paper was presented at OVERLAY 2024: "A Comparison of Machine Learning Techniques for Ethereum Smart Contract Vulnerability Detection".

Installation

To get started with this project, install the required libraries using pip:

pip install numpy pandas torch scikit-learn tqdm transformers xgboost

Dataset

The project uses the CGT dataset, available at https://github.com/gsalzer/cgt.

Download and Setup

  1. Clone the dataset repository:
git clone https://github.com/gsalzer/cgt.git
  2. Place the cloned repository under dataset/ so the layout matches PATH_TO_DATASET below:
project_directory/
├── dataset/
│   └── cgt/               # Cloned CGT dataset repository
└── your_project_files/    # Your project files

Configuration

Device Setup

The project uses a GPU if available:

DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

Random Seed

Random seeds are set for reproducibility:

RANDOM_SEED = 0
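
The constant alone does not seed anything; how it is applied is not shown in the README. A minimal sketch, assuming the usual random/NumPy/PyTorch calls:

import random

import numpy as np
import torch

# Apply RANDOM_SEED to every source of randomness (assumed wiring)
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)
torch.cuda.manual_seed_all(RANDOM_SEED)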

Dataset Path

Specify the path to the dataset:

PATH_TO_DATASET = os.path.join("..", "dataset", "cgt")

Training Configurations

  • Model Type: BERT (microsoft/codebert-base)
  • Max Features: 500
  • Batch Size: 1
  • Number of Folds: 10
  • Number of Epochs: 25
  • Number of Labels: 20
  • Learning Rate: 0.001
  • Test Size: 0.1
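
Expressed as module-level constants (BERT_MODEL_TYPE, NUM_LABELS, MAX_FEATURES, and TEST_SIZE appear in the snippets below; the remaining names are assumptions):

BERT_MODEL_TYPE = "microsoft/codebert-base"
MAX_FEATURES = 500
BATCH_SIZE = 1
NUM_FOLDS = 10         # assumed name
NUM_EPOCHS = 25        # assumed name
NUM_LABELS = 20
LEARNING_RATE = 0.001  # assumed name
TEST_SIZE = 0.1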

File Configurations

The pipeline can work on three representations of a contract, selected via the FILE_TYPE setting: Solidity source code (source), runtime bytecode (runtime), and creation bytecode (bytecode).
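
For example (FILE_TYPE is the constant referenced in the log-directory snippet below; the value names follow the description above):

FILE_TYPE = "source"  # one of "source", "runtime", "bytecode"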

Log Directory

Logs are stored in a directory that is created if it does not already exist:

LOG_DIR = os.path.join("log", FILE_TYPE)
if not os.path.exists(LOG_DIR):
    os.makedirs(LOG_DIR)

Preprocessing

Preprocessing Functions

Two functions preprocess the hex data and the Solidity code (a sketch of both follows the list):

  • Hex Data Preprocessing: Converts hex data to a readable byte string.
  • Solidity Code Preprocessing: Removes comments and blank lines.
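
A minimal sketch of both, with hypothetical function names (preprocess_hex, preprocess_solidity); the repository's actual implementations may differ:

import re

def preprocess_hex(hex_data: str) -> str:
    """Convert a hex string to a space-separated sequence of byte tokens."""
    if hex_data.startswith("0x"):
        hex_data = hex_data[2:]
    return " ".join(hex_data[i:i + 2] for i in range(0, len(hex_data), 2))

def preprocess_solidity(code: str) -> str:
    """Remove // and /* ... */ comments, then drop blank lines."""
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.DOTALL)
    code = re.sub(r"//.*", "", code)
    return "\n".join(line for line in code.splitlines() if line.strip())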

Data Initialization

Initialize inputs, labels, and groundtruth from the dataset:

inputs, labels, gt = init_inputs_and_gt(dataset)

Setting Labels

Set up labels based on groundtruth:

labels = set_labels(dataset, labels, gt)

Vectorization

A TF-IDF vectorizer converts the text data into numerical features:

VECTORIZER = TfidfVectorizer(max_features=MAX_FEATURES)

Models

BERTModelTrainer

Handles training and evaluation of a BERT-based model. Uses the transformers library to load a BERT model for sequence classification.

FFNNClassifier

A simple feedforward neural network with three fully connected layers for classification tasks.
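
The exact architecture is not shown in the README; a minimal sketch, assuming MAX_FEATURES = 500 inputs and NUM_LABELS = 20 outputs from the configuration, with an illustrative hidden size:

import torch.nn as nn

class FFNNClassifier(nn.Module):
    """Illustrative three-layer network over TF-IDF features."""

    def __init__(self, input_dim=500, hidden_dim=256, num_labels=20):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_labels),
        )

    def forward(self, x):
        # Raw logits; pair with BCEWithLogitsLoss for multi-label targets
        return self.net(x)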

LSTMClassifier

An LSTM-based model for text classification, initialized with pretrained GloVe embeddings.
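
A sketch matching the constructor used later (vocab_size, embedding_dim, hidden_dim, embedding matrix); the layer layout and the num_labels default are assumptions:

import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """Illustrative LSTM classifier initialized with pretrained GloVe embeddings."""

    def __init__(self, vocab_size, embedding_dim, hidden_dim, embedding_matrix, num_labels=20):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.embedding.weight.data.copy_(torch.from_numpy(embedding_matrix))
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_labels)

    def forward(self, x):
        _, (hidden, _) = self.lstm(self.embedding(x))
        return self.fc(hidden[-1])  # logits from the last hidden state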

Load GloVe Embeddings

Download the GloVe embeddings (glove.6B.100d.txt) from Kaggle and place the extracted file in the asset/ directory:

project_directory/
├── asset/
│   └── glove.6B.100d.txt  # GloVe embeddings file
└── your_project_files/    # Your project files

Load the GloVe embeddings:

glove_embeddings = load_glove_embeddings(os.path.join("..", "asset", "glove.6B.100d.txt"))
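
The loader itself is not shown in the README; a typical implementation, assuming the standard GloVe text format (one token followed by its space-separated vector per line):

import numpy as np

def load_glove_embeddings(path):
    """Parse a GloVe text file into a {word: vector} dictionary."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings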

Training and Evaluation

Trainer Class

Handles the training and evaluation of a neural network model.

CrossValidator Class

Performs k-fold cross-validation of a model, training and evaluating it across multiple folds.
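
The class API is internal to the repository, but the core of such a loop over a TensorDataset might look like this sketch (NUM_FOLDS, BATCH_SIZE, and RANDOM_SEED come from the configuration; the trainer calls are placeholders):

from sklearn.model_selection import KFold
from torch.utils.data import DataLoader, Subset

# Illustrative k-fold split; the real CrossValidator may differ
kf = KFold(n_splits=NUM_FOLDS, shuffle=True, random_state=RANDOM_SEED)
for fold, (train_idx, val_idx) in enumerate(kf.split(range(len(train_data)))):
    fold_train = DataLoader(Subset(train_data, train_idx), batch_size=BATCH_SIZE)
    fold_val = DataLoader(Subset(train_data, val_idx), batch_size=BATCH_SIZE)
    # trainer.train(fold_train) and trainer.evaluate(fold_val) would run here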

ClassifiersPoolEvaluator Class

Evaluates a pool of classifiers using TF-IDF features and k-fold cross-validation.
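
The README does not list the pool's members; given the scikit-learn and xgboost dependencies, a plausible pool (names and members are assumptions) could be:

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Hypothetical pool; in a multi-label setting these would typically be
# wrapped, e.g. with sklearn.multiclass.OneVsRestClassifier
CLASSIFIERS = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=RANDOM_SEED),
    "random_forest": RandomForestClassifier(random_state=RANDOM_SEED),
    "xgboost": XGBClassifier(random_state=RANDOM_SEED),
}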

Initializing and Training Models

  1. BERT Model:

    model = RobertaForSequenceClassification.from_pretrained(BERT_MODEL_TYPE, num_labels=NUM_LABELS, ignore_mismatched_sizes=True)
    model.config.problem_type = "multi_label_classification"
    model.to(DEVICE)

    tokenizer = RobertaTokenizer.from_pretrained(BERT_MODEL_TYPE)

    x = tokenizer(INPUTS, add_special_tokens=True, max_length=512, return_token_type_ids=False, padding="max_length", truncation=True, return_attention_mask=True, return_tensors='pt')
    y = LABELS

    # The shared random_state keeps input IDs and attention masks aligned across the two splits
    x_train, x_test, y_train, y_test = train_test_split(x['input_ids'], y, test_size=TEST_SIZE, random_state=RANDOM_SEED)
    train_masks, test_masks = train_test_split(x['attention_mask'], test_size=TEST_SIZE, random_state=RANDOM_SEED)

    train_data = TensorDataset(x_train, train_masks, torch.tensor(y_train).float())
    test_data = TensorDataset(x_test, test_masks, torch.tensor(y_test).float())

    CrossValidator(BERTModelTrainer(model), train_data, test_data).k_fold_cv(log_id="bert")
  2. FFNN Model:

    model = FFNNClassifier()
    
    x = torch.FloatTensor(VECTORIZER.fit_transform(INPUTS).toarray())
    y = torch.FloatTensor(LABELS)
    
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=TEST_SIZE, random_state=RANDOM_SEED)
    train_data = TensorDataset(x_train, y_train)
    test_data = TensorDataset(x_test, y_test)
    
    CrossValidator(Trainer(model), train_data, test_data).k_fold_cv(log_id="ffnn")
  3. LSTM Model:

    embeddings = load_glove_embeddings(os.path.join("..", "asset", "glove.6B.100d.txt"))
    vocab_size = len(embeddings)
    embedding_dim = len(next(iter(embeddings.values())))
    hidden_dim = 128

    model = LSTMClassifier(vocab_size, embedding_dim, hidden_dim, np.array(list(embeddings.values())))

    # Map each whitespace-separated token to its row in the embedding matrix (0 for unknown tokens)
    word_to_idx = {word: i for i, word in enumerate(embeddings)}
    sequences = [[word_to_idx.get(token, 0) for token in text.split()] for text in INPUTS]

    # Truncate and zero-pad every sequence to a fixed maximum length
    padded = [seq[:MAX_SEQUENCE_LENGTH] + [0] * max(0, MAX_SEQUENCE_LENGTH - len(seq)) for seq in sequences]
    x = torch.tensor(padded)
    y = torch.FloatTensor(LABELS)

    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=TEST_SIZE, random_state=RANDOM_SEED)
    train_data = TensorDataset(x_train, y_train)
    test_data = TensorDataset(x_test, y_test)

    CrossValidator(Trainer(model), train_data, test_data).k_fold_cv(log_id="lstm")
  4. Classifiers Pool Evaluation:

    evaluator = ClassifiersPoolEvaluator()
    evaluator.pool_evaluation()

Evaluation and Results

Metrics such as precision, recall, and F1 score are calculated and saved to a CSV file.
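
A sketch of how such metrics might be computed and exported with scikit-learn and pandas (the averaging mode and column names are assumptions):

import os

import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

# y_true and y_pred are binary multi-label matrices; macro-average over the labels
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro", zero_division=0)
pd.DataFrame([{"precision": precision, "recall": recall, "f1": f1}]).to_csv(os.path.join(LOG_DIR, "metrics.csv"), index=False)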
