This project involves training and evaluating various machine learning models for classification tasks using the CGT dataset. The models include a BERT-based model, a Feedforward Neural Network (FFNN), an LSTM-based classifier, and a pool of traditional classifiers.
Companion workshop paper presented at OVERLAY 2024: "A Comparison of Machine Learning Techniques for Ethereum Smart Contract Vulnerability Detection".
To get started with this project, install the required libraries using pip:

```bash
pip install numpy pandas torch scikit-learn tqdm transformers xgboost
```
The dataset used in this project is the CGT dataset, which can be found at https://github.com/gsalzer/cgt.
- Clone the dataset repository:

```bash
git clone https://github.com/gsalzer/cgt.git
```
- Place the cloned repository in the appropriate directory structure:

```
project_directory/
├── dataset/              # Cloned CGT dataset repository
└── your_project_files/   # Your project files
```
The project uses a GPU if available:

```python
DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
```
Random seeds are set for reproducibility:

```python
RANDOM_SEED = 0
```
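A minimal sketch of how the seed might be applied across libraries (only `RANDOM_SEED = 0` is given above; the exact seeding calls are an assumption):

```python
import random
import numpy as np
import torch

random.seed(RANDOM_SEED)        # Python's built-in RNG
np.random.seed(RANDOM_SEED)     # NumPy
torch.manual_seed(RANDOM_SEED)  # PyTorch (CPU and CUDA)
```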
Specify the path to the dataset:

```python
PATH_TO_DATASET = os.path.join("..", "dataset", "cgt")
```
- Model Type: BERT (`microsoft/codebert-base`)
- Max Features: 500
- Batch Size: 1
- Number of Folds: 10
- Number of Epochs: 25
- Number of Labels: 20
- Learning Rate: 0.001
- Test Size: 0.1
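These settings map onto module-level constants; some names (`BERT_MODEL_TYPE`, `MAX_FEATURES`, `NUM_LABELS`, `TEST_SIZE`) appear in the snippets below, while the remaining names are assumptions chosen to match. A sketch:

```python
BERT_MODEL_TYPE = "microsoft/codebert-base"
MAX_FEATURES = 500
BATCH_SIZE = 1     # assumed name
NUM_FOLDS = 10     # assumed name
NUM_EPOCHS = 25    # assumed name
NUM_LABELS = 20
LEARNING_RATE = 0.001  # assumed name
TEST_SIZE = 0.1
```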
The project handles three file types: `source`, `runtime`, and `bytecode`.
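The selected type is stored in the `FILE_TYPE` constant used below to build the log directory; a hypothetical example value:

```python
FILE_TYPE = "source"  # one of "source", "runtime", "bytecode" (example value, an assumption)
```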
Logs are stored in a directory created if it doesn't already exist:

```python
LOG_DIR = os.path.join("log", FILE_TYPE)
if not os.path.exists(LOG_DIR):
    os.makedirs(LOG_DIR)
```
Functions are provided to preprocess hex data and Solidity code:
- Hex Data Preprocessing: Converts hex data to a readable byte string.
- Solidity Code Preprocessing: Removes comments and blank lines.
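A minimal sketch of what these helpers might look like (the function names and exact regexes are assumptions; only the behavior described above is given):

```python
import re

def preprocess_hex(hex_data):
    # Assumed helper: strip an optional "0x" prefix and decode the
    # hex string into a byte string.
    hex_data = hex_data[2:] if hex_data.startswith("0x") else hex_data
    return bytes.fromhex(hex_data)

def preprocess_solidity(code):
    # Assumed helper: drop // line comments, /* ... */ block comments,
    # and blank lines from Solidity source.
    code = re.sub(r"//.*", "", code)
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.DOTALL)
    return "\n".join(line for line in code.splitlines() if line.strip())
```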
Initialize inputs, labels, and groundtruth from the dataset:

```python
inputs, labels, gt = init_inputs_and_gt(dataset)
```
Set up labels based on the groundtruth:

```python
labels = set_labels(dataset, labels, gt)
```
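Since the models are configured for multi-label classification over 20 labels (see `problem_type = "multi_label_classification"` in the BERT snippet below), each entry in `labels` is presumably a multi-hot vector; an illustrative sketch (the encoding is an assumption):

```python
import numpy as np

# Hypothetical illustration: a contract tagged with vulnerability
# classes 2 and 7 out of NUM_LABELS = 20 classes.
label_vector = np.zeros(20, dtype=np.float32)
label_vector[[2, 7]] = 1.0
```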
A TF-IDF vectorizer is used to convert text data into numerical features:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

VECTORIZER = TfidfVectorizer(max_features=MAX_FEATURES)
```
The `BERTModelTrainer` class handles training and evaluation of a BERT-based model; it uses the `transformers` library to load a BERT model for sequence classification.
`FFNNClassifier` is a simple feedforward neural network with three fully connected layers for classification tasks.
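A minimal sketch of such a network (the hidden sizes are assumptions; the input and output dimensions follow `MAX_FEATURES` and `NUM_LABELS` above):

```python
import torch.nn as nn

class FFNNClassifier(nn.Module):
    # Sketch: three fully connected layers; hidden sizes 256 and 64 are assumed.
    def __init__(self, input_dim=500, num_labels=20):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 64),
            nn.ReLU(),
            nn.Linear(64, num_labels),
        )

    def forward(self, x):
        # Returns raw logits; BCEWithLogitsLoss is typical for multi-label setups.
        return self.net(x)
```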
`LSTMClassifier` is an LSTM-based model for text classification, initialized with pretrained GloVe embeddings.
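A sketch matching the constructor signature used in the usage example below, `LSTMClassifier(vocab_size, embedding_dim, hidden_dim, embedding_matrix)` (the layer layout is an assumption):

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    # Sketch: embedding layer initialized from a pretrained GloVe matrix,
    # a single LSTM layer, and a linear classification head.
    def __init__(self, vocab_size, embedding_dim, hidden_dim, embedding_matrix, num_labels=20):
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(
            torch.tensor(embedding_matrix, dtype=torch.float), freeze=False
        )
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_labels)

    def forward(self, x):
        _, (h_n, _) = self.lstm(self.embedding(x))
        return self.fc(h_n[-1])  # last hidden state -> logits
```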
Download the GloVe embeddings from Kaggle and extract the file to the appropriate directory:

```
project_directory/
├── asset/
│   └── glove.6B.100d.txt   # GloVe embeddings file
└── your_project_files/     # Your project files
```
Load the GloVe embeddings:

```python
glove_embeddings = load_glove_embeddings(os.path.join("..", "asset", "glove.6B.100d.txt"))
```
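A sketch of what `load_glove_embeddings` might do, assuming the standard GloVe text format (one `word v1 v2 ... v100` entry per line):

```python
import numpy as np

def load_glove_embeddings(path):
    # Parse each line into a {word: vector} mapping.
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings
```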
The `Trainer` class handles the training and evaluation of a neural network model.
`CrossValidator` performs k-fold cross-validation of a model, training and evaluating it across multiple folds.
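A sketch of the loop presumably wrapped by `CrossValidator.k_fold_cv` (the `trainer.fit`/`trainer.evaluate` calls are hypothetical; the fold count and batch size follow the configuration above):

```python
import numpy as np
from sklearn.model_selection import KFold
from torch.utils.data import DataLoader, Subset

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kfold.split(np.arange(len(train_data)))):
    train_loader = DataLoader(Subset(train_data, train_idx), batch_size=1)
    val_loader = DataLoader(Subset(train_data, val_idx), batch_size=1)
    # trainer.fit(train_loader)            # hypothetical Trainer API
    # metrics = trainer.evaluate(val_loader)
```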
`ClassifiersPoolEvaluator` evaluates a pool of traditional classifiers using TF-IDF features and k-fold cross-validation.
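The exact pool is not listed in this README; a plausible sketch given the dependency list (scikit-learn and xgboost), with the specific classifiers being assumptions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Hypothetical pool; each base classifier is wrapped for multi-label output.
POOL = {
    "logreg": OneVsRestClassifier(LogisticRegression(max_iter=1000)),
    "tree": OneVsRestClassifier(DecisionTreeClassifier()),
    "forest": OneVsRestClassifier(RandomForestClassifier()),
    "xgb": OneVsRestClassifier(XGBClassifier()),
}
```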
- BERT Model:

```python
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset
from transformers import RobertaForSequenceClassification, RobertaTokenizer

model = RobertaForSequenceClassification.from_pretrained(
    BERT_MODEL_TYPE, num_labels=NUM_LABELS, ignore_mismatched_sizes=True
)
model.config.problem_type = "multi_label_classification"
model.to(DEVICE)

tokenizer = RobertaTokenizer.from_pretrained(BERT_MODEL_TYPE)
x, y = tokenizer(
    INPUTS,
    add_special_tokens=True,
    max_length=512,
    return_token_type_ids=False,
    padding="max_length",
    truncation=True,
    return_attention_mask=True,
    return_tensors="pt",
), LABELS

# Split input ids, attention masks, and labels together so they stay aligned.
x_train, x_test, train_masks, test_masks, y_train, y_test = train_test_split(
    x["input_ids"], x["attention_mask"], y, test_size=TEST_SIZE, random_state=RANDOM_SEED
)

train_data = TensorDataset(x_train, train_masks, torch.tensor(y_train).float())
test_data = TensorDataset(x_test, test_masks, torch.tensor(y_test).float())
CrossValidator(BERTModelTrainer(model), train_data, test_data).k_fold_cv(log_id="bert")
```
- FFNN Model:

```python
model = FFNNClassifier()
x = torch.FloatTensor(VECTORIZER.fit_transform(INPUTS).toarray())
y = torch.FloatTensor(LABELS)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=TEST_SIZE, random_state=RANDOM_SEED
)
train_data = TensorDataset(x_train, y_train)
test_data = TensorDataset(x_test, y_test)
CrossValidator(Trainer(model), train_data, test_data).k_fold_cv(log_id="ffnn")
```
- LSTM Model:

```python
embeddings = load_glove_embeddings("path_to_glove_file")
vocab_size = len(embeddings)
embedding_dim = len(next(iter(embeddings.values())))
hidden_dim = 128
model = LSTMClassifier(vocab_size, embedding_dim, hidden_dim, np.array(list(embeddings.values())))

# Assumes a tokenizer that converts text to sequences of vocabulary indices.
tokenizer = SomeTokenizer(vocab=embeddings.keys())
sequences = tokenizer.texts_to_sequences(INPUTS)
# Assumes sequences are padded to a maximum length.
padded_sequences = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

x = torch.tensor(padded_sequences)
y = torch.FloatTensor(LABELS)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=TEST_SIZE, random_state=RANDOM_SEED
)
train_data = TensorDataset(x_train, y_train)
test_data = TensorDataset(x_test, y_test)
CrossValidator(Trainer(model), train_data, test_data).k_fold_cv(log_id="lstm")
```
- Classifiers Pool Evaluation:

```python
evaluator = ClassifiersPoolEvaluator()
evaluator.pool_evaluation()
```
Metrics such as precision, recall, and F1 score are calculated and saved to a CSV file.
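A sketch of how such metrics might be computed and exported (the averaging mode, column layout, and file name are assumptions; only the metric names and CSV output are stated above):

```python
import os
import numpy as np
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

# y_true / y_pred: multi-hot label matrices of shape (n_samples, NUM_LABELS);
# tiny stand-ins here for illustration.
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0]])

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
pd.DataFrame([{"precision": precision, "recall": recall, "f1": f1}]).to_csv(
    os.path.join("log", "metrics.csv"), index=False  # hypothetical file name
)
```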