Conceptual Landscaping with BERT and Friends

Welcome to the Conceptual Landscaping with BERT and Friends repository! This project is designed to help you explore and visualize the conceptual landscape of a given text using advanced language models like BERT, RoBERTa, and DistilBERT. The goal is to provide a user-friendly interface for analyzing and understanding the relationships between different concepts in your text.

Installation

We recommend forking the repo first so that you have your own copy to experiment with. Then clone the repo and move into its directory:

git clone https://github.com/acceleratescience/conceptual-cartography.git
cd conceptual-cartography

The conceptual-cartography repo comes with a setup.sh file that handles most of the installation and environment management for you. To run this software on a remote cloud server, spin up a CPU or GPU instance with a base Linux image such as Ubuntu, clone the repo, and run the setup.

We recommend installing Python 3.12 first. Instructions for Linux are below; instructions for other operating systems such as macOS are easy to find.

sudo apt-get update
sudo apt-get install python3.12

Now run the setup file:

source ./setup.sh

After a lot of install information, you should see something like the following:

Installing the current project: conceptual-cartography (0.1.0)
✓ Poetry environment created successfully

=== Setup Status ===

✓ Setup Complete with regular dependencies! 🎉
✓ To activate the virtual environment, run: poetry shell
✓ Or use: source .venv/bin/activate
✓ To run commands in the environment: poetry run <command>

Do what it says and activate your virtual environment!

You can also run the setup with development dependencies by adding the --dev flag.

Installation of PyTorch

Because PyTorch installations differ across hardware, we have left the installation of PyTorch to the user. For installation instructions, please refer to the PyTorch installation page. The majority of the development for this repo was done using ROCm, not CUDA.

If using an AMD GPU, you can install PyTorch with ROCm support:

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.4/

Then run the following:

export HIP_VISIBLE_DEVICES=0
export HSA_OVERRIDE_GFX_VERSION=11.0.0  # adjust version for your GPU

To make it permanent:

echo 'export HIP_VISIBLE_DEVICES=0' >> ~/.bashrc
echo 'export HSA_OVERRIDE_GFX_VERSION=11.0.0' >> ~/.bashrc  # adjust version for your GPU

Then restart your terminal or run source ~/.bashrc. Don't forget to reactivate your virtual environment if you are using one.

CPU only

For CPU-only PyTorch, you can run:

pip install torch --index-url https://download.pytorch.org/whl/cpu
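
After either install route, it can be useful to confirm which PyTorch build actually ended up in the environment. The following is a minimal sketch, not part of this repo: the helper name describe_torch_backend is hypothetical. It relies on the fact that ROCm builds of PyTorch report their GPUs through the torch.cuda interface and set torch.version.hip.

```python
def describe_torch_backend() -> str:
    """Return a short label for the PyTorch build in the current environment.

    Hypothetical helper for illustration; not part of conceptual-cartography.
    """
    try:
        import torch  # imported lazily so the check also works before PyTorch is installed
    except ImportError:
        return "not installed"

    if torch.cuda.is_available():
        # ROCm builds expose GPUs through torch.cuda and set torch.version.hip.
        return "rocm" if getattr(torch.version, "hip", None) else "cuda"
    return "cpu"


backend = describe_torch_backend()
print(f"PyTorch backend: {backend}")
```

If this prints "cpu" on a machine with an AMD GPU, the environment variables above (in particular HSA_OVERRIDE_GFX_VERSION) are a reasonable first thing to check.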

Data Format Requirements

Important: Your input text data must be formatted with one sentence per line. The system reads text files line-by-line, where each non-empty line is treated as a separate sentence for analysis.

Example format:

I need to go to the bank today to deposit this cheque.
She decided to open a new savings account at a different bank for a better interest rate.
The bank approved their mortgage application after reviewing their financial history.
We decided to have our picnic on the grassy bank of the river.
The children loved skipping stones from the river bank into the water.

❌ Incorrect format:

I need to go to the bank today. She decided to open a savings account. The bank approved their mortgage application.
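
If your text is currently in paragraph form, a small script can convert it. The sketch below is an illustration under stated assumptions, not the repo's own loader: both function names are hypothetical, and the sentence splitter is deliberately naive (it splits after ".", "!" or "?" followed by whitespace, so abbreviations such as "Dr." will be split incorrectly). The read_sentences function mirrors the rule described above: each non-empty line is one sentence.

```python
import re

# Naive boundary: sentence-ending punctuation followed by whitespace.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def to_one_sentence_per_line(text: str) -> str:
    """Reflow paragraph text so that each sentence sits on its own line."""
    sentences = [s.strip() for s in _SENTENCE_BOUNDARY.split(text)]
    return "\n".join(s for s in sentences if s)


def read_sentences(path: str) -> list[str]:
    """Read a file line by line, keeping each non-empty line as one sentence."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


paragraph = (
    "I need to go to the bank today. She decided to open a savings account. "
    "The bank approved their mortgage application."
)
converted = to_one_sentence_per_line(paragraph)
print(converted)
```

For real corpora, a proper sentence segmenter is likely to give better results than this regular expression.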

Running an experiment

The file directory should look something like:

.
├── configs
│   └── bank-test-metrics.yaml
├── data
│   ├── testing_data
│   │   └── bank_test.txt
├── scripts
├── src
...etc

The important thing here is that we have some testing data at ./data/testing_data/bank_test.txt and a testing config file at ./configs/bank-test-metrics.yaml.

Configs

This config file should look something like this:

model:
  model_name: 'sentence-transformers/all-MiniLM-L6-v2'
data:
  sentences_path: 'data/testing_data/bank_test.txt'
  output_path: 'output'
experiment:
  model_batch_size: 32
  context_window: None
  target_word: 'bank'
metrics:
  anisotropy_correction: False
  layers: 'all'
  metrics: ['similarity_matrix', 'mev', 'inter_similarity', 'intra_similarity', 'average_similarity', 'similarity_std']
landscapes:
  pca_min: 2
  pca_max: 5
  pca_step: 1
  cluster_min: 2
  cluster_max: 5
  cluster_step: 1
  generate_all: True
  save_optimization: True
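
Before launching a long run, it may help to sanity-check the numeric ranges in the landscapes section. The sketch below is purely illustrative: check_ranges is a hypothetical helper, the dictionary simply mirrors the landscapes values shown above, and the checks (min not greater than max, step at least 1) are assumptions about what a sensible config looks like, not rules taken from the repo.

```python
def check_ranges(landscapes: dict) -> list[str]:
    """Return a list of human-readable problems found in the min/max/step settings."""
    problems = []
    for prefix in ("pca", "cluster"):
        low = landscapes[f"{prefix}_min"]
        high = landscapes[f"{prefix}_max"]
        step = landscapes[f"{prefix}_step"]
        if low > high:
            problems.append(f"{prefix}_min ({low}) is greater than {prefix}_max ({high})")
        if step < 1:
            problems.append(f"{prefix}_step ({step}) must be at least 1")
    return problems


# Values copied from the landscapes section of the example config.
landscapes = {
    "pca_min": 2, "pca_max": 5, "pca_step": 1,
    "cluster_min": 2, "cluster_max": 5, "cluster_step": 1,
}
problems = check_ranges(landscapes)
print(problems or "ranges look consistent")
```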

Running experiments on a corpus

To run the experiment defined by this config, run the following on the command line:

poetry run experiment --config 'configs/bank-test-metrics.yaml'

When running for the first time, you will see the model being downloaded. The corpus examples will then be run through the model, and the landscapes will be generated and saved.

You should now have a new directory:

├── output
│   └── sentence-transformers_all-MiniLM-L6-v2
│       └── window_None
│           └── bank
│               ├── contexts.txt
│               ├── final_embeddings.pt
│               ├── hidden_embeddings.pt
│               ├── indices.txt
│               ├── landscapes
│               │   ├── landscape_layer-0.pt
│               │   ├── landscape_layer-1.pt
│               │   ├── landscape_layer-2.pt
│               │   ├── landscape_layer-3.pt
│               │   ├── landscape_layer-4.pt
│               │   └── landscape_layer-5.pt
│               └── metrics
│                   ├── metrics_layer-0.pt
│                   ├── metrics_layer-1.pt
│                   ├── metrics_layer-2.pt
│                   ├── metrics_layer-3.pt
│                   ├── metrics_layer-4.pt
│                   └── metrics_layer-5.pt
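
The directory name appears to be derived from the config values: the model name with "/" replaced by "_", then window_ followed by the context_window value, then the target word. The sketch below reproduces that pattern so you can locate results from a script. It is inferred from the example tree above, not taken from the repo's code, and expected_output_dir is a hypothetical helper.

```python
from pathlib import Path


def expected_output_dir(output_path, model_name, context_window, target_word) -> Path:
    """Build the results directory following the pattern seen in the example tree."""
    safe_model = model_name.replace("/", "_")  # "/" cannot appear in a directory name
    return Path(output_path) / safe_model / f"window_{context_window}" / target_word


out_dir = expected_output_dir(
    output_path="output",
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    context_window=None,
    target_word="bank",
)
print(out_dir.as_posix())
```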

Each landscape .pt file is a Landscape object containing the information required to visualize the conceptual landscapes:

class Landscape:
    X: np.ndarray
    Y: np.ndarray
    Z: np.ndarray
    X_pca: np.ndarray
    consensus_labels: np.ndarray
    ari_scores: list
    pca_components: int = None
    cluster_count: int = None
    covariance_type: str = None
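
To inspect these files outside the app, something like the following may work. This is a hedged sketch: it assumes the .pt files were written with torch.save (suggested by the extension but not confirmed here), and the helper names are hypothetical. Because each file holds a pickled Landscape object rather than plain tensors, torch.load needs weights_only=False, and the Landscape class must be importable in the environment. Only load files you trust, since unpickling can execute arbitrary code.

```python
import re
from pathlib import Path


def layer_index(path: Path) -> int:
    """Extract the layer number from a name such as landscape_layer-3.pt."""
    match = re.search(r"layer-(\d+)\.pt$", path.name)
    if match is None:
        raise ValueError(f"No layer number found in {path.name!r}")
    return int(match.group(1))


def sorted_layer_files(directory) -> list[Path]:
    """List landscape files in numeric layer order (so layer 10 follows layer 9)."""
    return sorted(Path(directory).glob("landscape_layer-*.pt"), key=layer_index)


def load_landscape(path):
    """Load one pickled Landscape object. Assumes the file was saved with torch.save."""
    import torch  # imported lazily so the helpers above work without PyTorch

    return torch.load(path, weights_only=False)


names = [Path("landscape_layer-10.pt"), Path("landscape_layer-2.pt")]
print(sorted(names, key=layer_index))
```

Run it from inside the project environment (for example via poetry run python) so that the class definition can be found when unpickling.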

Visualizing the experiment

Visualizing a completed experiment is straightforward:

poetry run visualize --config 'configs/bank-test-metrics.yaml'

This will then show something like:

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://172.17.0.6:8501
  External URL: http://213.173.105.105:8501

Head to the URL to open the app and click through the layers to explore the different clusters!
