Purpose

We want to gain useful, generalizable knowledge about the inner workings of non-finetuned transformer language models. To do so, we run a number of experiments that probe the hidden-layer activations using different datasets, model sizes, pooling methods and classifiers, and visualize the results interactively with a Streamlit app. This information will help us in future embedding experiments and eegi, among other things.

Usage

Prepare data by running the get_data.py script: python get_data.py <data_save_path>
Create classifiers.
Classifiers come in two flavours, sklearn and PyTorch, which are taken care of by the base classes SingleStepOpt and MultiStepOpt respectively (see the Modules section).

Sklearn-type classifiers need the following attributes:

name: identifier name for logging
classifier: sklearn classifier. It needs a fit and a transform method; if transform is not available, the forward method needs to be overridden.
discrete_targets: boolean to indicate whether the classifier requires binary (0 or 1) targets.

For example

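from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from pipeline import SingleStepOpt

class LDA(SingleStepOpt):
    def __init__(self, input_size):
        super().__init__()
        self.name = 'LDA'  # identifier used in the logs
        self.discrete_targets = True  # LDA needs binary (0 or 1) targets
        self.classifier = LinearDiscriminantAnalysis(n_components=1)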

PyTorch-type classifiers need:

name attribute: identifier name for logging
forward method

For example

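import torch.nn as nn
from pipeline import MultiStepOpt

class SingleLayer(MultiStepOpt):
    def __init__(self, input_size):
        super().__init__()
        self.name = 'SingleLayer'  # identifier used in the logs
        self.linear = nn.Linear(input_size, 1)

    def forward(self, X):
        # Map (batch_size, input_size) embeddings to one output per sample
        return self.linear(X)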

Run the experiments by constructing RunExperiments and calling run_all. For example:

import glob
import torch
import torch.nn as nn
from pipeline import NLPDataset, LMModel, PoolToken, SingleStepOpt, MultiStepOpt, RunExperiments, evaluator
from classifiers import LDA, SingleLayer

device = 'cuda' if torch.cuda.is_available() else 'cpu'
n_folds = 10
dataset_paths = glob.glob('data/*')
poolers = [PoolToken(layer=1), PoolToken(quantile=0.4), PoolToken(layer='all', layer_method='mean')]
classifiers = [LDA, SingleLayer]
model_names = ['gpt2', 'gpt2-large']

exps = RunExperiments(n_folds, dataset_paths, poolers, classifiers, model_names, evaluator, device)
exps.run_all()

Visualise the results by running the Streamlit app in visualisation_experiments.py: streamlit run visualisation_experiments.py

Modules

The program consists of the following parts

NLPDataset

Loads the text data into a dataset class. It can be given either a path to a JSON data file or an already-loaded dict.

__init__ args:

json_input: Path to the dataset JSON file, or the loaded JSON (dict) itself

Methods:

load_embeds_into_memory: Given a model and pooler, processes all embeddings and stores them
__len__
__getitem__
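
As a rough illustration of the path-or-dict behaviour described above, the sketch below shows one way such a dataset class could accept either input. The class name TextDataset and the JSON keys "texts" and "targets" are assumptions made for this example only; the actual layout of the files written by get_data.py is not documented here, and embedding storage is left out.

```python
import json
from pathlib import Path


class TextDataset:
    """Hypothetical stand-in for NLPDataset: holds texts and their targets."""

    def __init__(self, json_input):
        # Accept either a path to a JSON file or an already-loaded dict.
        if isinstance(json_input, (str, Path)):
            with open(json_input, encoding="utf-8") as f:
                data = json.load(f)
        else:
            data = json_input
        # Assumed keys; the real data files may be organised differently.
        self.texts = list(data["texts"])
        self.targets = list(data["targets"])
        if len(self.texts) != len(self.targets):
            raise ValueError("texts and targets must have the same length")

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return self.texts[idx], self.targets[idx]


dataset = TextDataset({"texts": ["good film", "bad film"], "targets": [1, 0]})
```

Indexing the toy dataset returns a (text, target) pair, which is the shape of item a probing pipeline would typically iterate over.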

LMModel

Wrapper around a Hugging Face language model

__init__ args:

model_name: Hugging Face model name, or a dictionary that contains a model and tokenizer

Methods:

__call__: Takes a list of strings and returns a Hugging Face style output dictionary

PoolToken

Selects (combinations of) tokens from the Hugging Face output dictionary

__init__ args:

layer: Which layer the tokens are to be selected from
- layer index
- list of layer indices
- -1: second to last layer
- 'all': all layers except the first and last one
quantile: Which quantile of tokens is selected (0. = first token, 1. = last)
- quantile
- -1: last token
layer_method: If multiple layers are selected, layer_method can be used to reduce them
- 'mean', 'max', 'min': pool over layer dimension
- 'extend': concatenates embeddings in embed dimension

Methods:

__call__: Takes a Hugging Face output dictionary and returns a tensor of embeddings of shape (batch_size, hidden_size)
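
The layer, quantile and layer_method options above can be sketched roughly as follows. This is not the PoolToken implementation: it uses NumPy arrays instead of a Hugging Face output, the function name pool_tokens and its defaults are made up for the example, the mapping from quantile to token position (rounding quantile * (seq_len - 1)) is an assumption, and padding is ignored.

```python
import numpy as np


def pool_tokens(hidden_states, layer=-1, quantile=-1, layer_method=None):
    """Select one token per sequence from one or more layers.

    hidden_states: sequence of arrays, one per layer, each shaped
    (batch_size, seq_len, hidden_size).
    """
    n_layers = len(hidden_states)
    # Resolve the layer argument into a list of layer indices.
    if isinstance(layer, str) and layer == "all":
        layer_ids = list(range(1, n_layers - 1))  # skip first and last layer
    elif layer == -1:
        layer_ids = [n_layers - 2]  # second to last layer, as listed above
    elif isinstance(layer, int):
        layer_ids = [layer]
    else:
        layer_ids = list(layer)

    # Resolve the quantile into a token position (assumes no padding).
    seq_len = hidden_states[0].shape[1]
    if quantile == -1:
        pos = seq_len - 1
    else:
        pos = int(round(quantile * (seq_len - 1)))

    # Stack the chosen token from each layer: (n_selected, batch, hidden).
    selected = np.stack([hidden_states[i][:, pos, :] for i in layer_ids])

    if len(layer_ids) == 1:
        return selected[0]
    if layer_method == "mean":
        return selected.mean(axis=0)
    if layer_method == "max":
        return selected.max(axis=0)
    if layer_method == "min":
        return selected.min(axis=0)
    if layer_method == "extend":
        # Concatenate the per-layer embeddings along the embedding dimension.
        return np.concatenate(list(selected), axis=-1)
    raise ValueError("layer_method is required when several layers are selected")


# Toy input: 4 layers, batch of 2, 5 tokens, hidden size 3.
rng = np.random.default_rng(0)
states = [rng.normal(size=(2, 5, 3)) for _ in range(4)]
single = pool_tokens(states, layer=1, quantile=0.0)
averaged = pool_tokens(states, layer="all", layer_method="mean")
extended = pool_tokens(states, layer=[0, 1, 2], layer_method="extend")
```

Note that in this sketch 'extend' widens the output to (batch_size, n_selected_layers * hidden_size), whereas the pooling reductions keep it at (batch_size, hidden_size).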

Classifiers

SingleStepOpt

Base class for sklearn-style classifiers. Distinguishes between regression (continuous targets) and classification (binary targets).

__init__ args: None

Methods:

fit: fits classifier and sets optimal thresholds if the task is classification
set_thresholds: Uses a validation set to find optimal thresholds for both accuracy and F1
predict: Returns predictions for targets, using the thresholds if necessary
forward: Maps inputs to targets
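
To make the idea behind set_thresholds more concrete, here is a minimal sketch of picking a decision threshold on a validation set, once for accuracy and once for F1. The helper names and the candidate grid (midpoints between sorted validation scores) are assumptions for illustration; the actual search in SingleStepOpt may work differently.

```python
import numpy as np


def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))


def f1(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0


def best_threshold(scores, y_true, metric):
    """Return the cut-off on `scores` that maximises `metric`."""
    # Candidate cut-offs: midpoints between consecutive sorted unique scores.
    uniq = np.unique(scores)
    if len(uniq) == 1:
        return float(uniq[0])
    candidates = (uniq[:-1] + uniq[1:]) / 2
    values = [metric(y_true, (scores > t).astype(int)) for t in candidates]
    # np.argmax keeps the first (lowest) cut-off when several tie.
    return float(candidates[int(np.argmax(values))])


# Toy validation set: continuous classifier outputs and binary targets.
val_scores = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
val_targets = np.array([0, 0, 1, 0, 1, 1])

acc_threshold = best_threshold(val_scores, val_targets, accuracy)
f1_threshold = best_threshold(val_scores, val_targets, f1)
```

At prediction time the continuous output would then be compared against the stored threshold for whichever metric is being reported.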

MultiStepOpt

Base class for PyTorch-style classifiers, trained with gradient descent.

__init__ args:

batch_size
max_epoch

Methods:

init_optimizer: Initializes optimizer, scheduler and loss function
training_step: Training step used in fit method
test_step: Test step used in fit method for validation and in predict to get predictions
fit: Training loop, stores losses
predict: Given a test set, returns the corresponding predictions from the classifier
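
The division of labour between training_step, test_step, fit and predict can be illustrated with a small mini-batch gradient-descent loop. The sketch below is deliberately written in NumPy with a hand-derived gradient so that it runs without PyTorch; the class name TinyLinearProbe, the squared-error loss, the learning rate and the absence of a scheduler are all assumptions, not a description of MultiStepOpt itself.

```python
import numpy as np


class TinyLinearProbe:
    """NumPy stand-in for a gradient-trained linear probe (illustration only)."""

    def __init__(self, input_size, batch_size=4, max_epoch=200, lr=0.05, seed=0):
        self.batch_size = batch_size
        self.max_epoch = max_epoch
        self.lr = lr
        self.rng = np.random.default_rng(seed)
        self.w = np.zeros(input_size)
        self.b = 0.0
        self.train_losses, self.val_losses = [], []

    def forward(self, X):
        return X @ self.w + self.b

    def training_step(self, X, y):
        # One gradient-descent update on the mean squared error of a batch.
        err = self.forward(X) - y
        self.w -= self.lr * 2 * X.T @ err / len(y)
        self.b -= self.lr * 2 * err.mean()
        return float(np.mean(err ** 2))  # loss measured before the update

    def test_step(self, X, y):
        # Loss only, without any parameter update.
        return float(np.mean((self.forward(X) - y) ** 2))

    def fit(self, X_train, y_train, X_val, y_val):
        for _ in range(self.max_epoch):
            order = self.rng.permutation(len(y_train))  # reshuffle each epoch
            batch_losses = []
            for start in range(0, len(order), self.batch_size):
                idx = order[start:start + self.batch_size]
                batch_losses.append(self.training_step(X_train[idx], y_train[idx]))
            self.train_losses.append(float(np.mean(batch_losses)))
            self.val_losses.append(self.test_step(X_val, y_val))

    def predict(self, X):
        return self.forward(X)


# Toy regression problem: y = 2*x0 - x1, no noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = 2 * X[:, 0] - X[:, 1]
probe = TinyLinearProbe(input_size=2)
probe.fit(X[:32], y[:32], X[32:], y[32:])
```

Storing the per-epoch training and validation losses, as fit does here, is what makes it possible to inspect convergence afterwards.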

RunExperiments

Runs all possible combinations of experiments given datasets, poolers, classifiers and language models, and logs them

__init__ args:

n_folds: Number of folds for cross-validation (not true folds, because the samples will typically overlap)
dataset_paths
poolers
classifiers
model_names
evaluator
device
logs_folder: Folder in which the experiment logs are stored; it is created if it doesn't already exist

Methods:

log_data: Logs experiment results to logs_folder
run_all: Runs all experiment combinations; single_batch_limit is not yet implemented
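
The "all possible combinations" and "not true folds" points above can be sketched with the standard library alone. The function names, the iteration order, the test_fraction parameter and the file names are hypothetical; this only shows one plausible reading, in which each fold is an independent random split so that test sets may overlap between folds.

```python
import itertools
import random


def experiment_grid(dataset_paths, poolers, classifiers, model_names):
    """Yield every (dataset, model, pooler, classifier) combination."""
    yield from itertools.product(dataset_paths, model_names, poolers, classifiers)


def random_splits(n_samples, n_folds, test_fraction=0.2, seed=0):
    """Yield n_folds independent random (train, test) index splits.

    Test sets drawn this way can overlap between repetitions, unlike
    the disjoint folds of classical k-fold cross-validation.
    """
    rng = random.Random(seed)
    n_test = max(1, int(n_samples * test_fraction))
    for _ in range(n_folds):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield indices[n_test:], indices[:n_test]


# Placeholder strings standing in for real poolers and classifier classes.
grid = list(experiment_grid(
    dataset_paths=["data/a.json", "data/b.json"],
    poolers=["pool_layer_1", "pool_all_mean"],
    classifiers=["LDA", "SingleLayer"],
    model_names=["gpt2", "gpt2-large"],
))
splits = list(random_splits(n_samples=10, n_folds=3))
```

With two options for each of the four arguments the grid holds 2 * 2 * 2 * 2 = 16 experiments, each of which would then be repeated once per split.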

TODO

make concat work properly
log val_size, test_size
add classification using attention
BERT type models
log number of params per model
larger datasets: generate embeddings on-the-fly
