
LexSubLM-Lite

Fast, context-aware lexical substitution that really fits on a laptop.


1 · What is it?

LexSubLM-Lite is a Python toolkit for proposing single-word substitutes that preserve both the meaning and syntax of a target word within its sentence. It is ideal for NLP applications that require controlled synonym generation without heavy dependencies.

2 · Why use LexSubLM-Lite?

  • Test new models quickly: Swap in your own models—HF repos or local GGUF quantisations—to see how they perform on standard lexical substitution tasks.
  • Lightweight research: Run benchmarks on your laptop CPU or GPU without needing large server infrastructure.
  • Easy model extension: Add new generators by editing model_registry.yaml, no code changes needed.
  • Reproducible results: Dockerfile and helper scripts enable one-command setup and dataset downloads.
  • Research-grade metrics: Built-in evaluation scripts output P@1, Recall@k, GAP, and ProLex Pro-F1 for rigorous analysis.

3 · Key features

Each stage of the pipeline, and why it matters (a rough end-to-end sketch follows this list):

  • Prompted generation: a causal LLM (4-bit when CUDA is available, fp16/fp32 otherwise) returns k candidates, so it runs on a laptop CPU or GPU.
  • Sanitisation: strips punctuation and multi-word babble, keeping outputs clean.
  • POS + morph filter: spaCy + pymorphy3 preserve tense, number, and degree, preventing form errors (e.g., “cats” → “cat”).
  • Ranking: log-prob or cosine similarity with e5-small-v2 (<40 MB), balancing quality against footprint.
  • Evaluation: optional P@1, Recall@k, GAP, and ProF1 for research-grade metrics.
  • Model registry: model_registry.yaml maps alias → HF repo / GGUF path, so new models can be added without touching code.
  • Benchmark script: python -m lexsublm_lite.bench.bench_models prints a table via tabulate2 for a one-shot comparison across aliases.
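To make the stages concrete, here is a rough, self-contained sketch of the generate → sanitise → filter → rank idea written directly against transformers and spaCy. The prompt wording, sampling settings, and frequency-based ranking are illustrative assumptions, not the toolkit's actual implementation (which ranks by log-prob or e5-small-v2 cosine).

import re
from collections import Counter

import spacy
from transformers import pipeline


def sketch_substitutes(sentence: str, target: str, top_k: int = 5) -> list[str]:
    nlp = spacy.load("en_core_web_sm")  # python -m spacy download en_core_web_sm
    target_pos = next((t.pos_ for t in nlp(sentence) if t.text == target), None)

    # 1. Prompted generation: sample several short continuations from a small causal LM.
    generate = pipeline("text-generation", model="distilgpt2")
    prompt = f'In the sentence "{sentence}", another word for "{target}" is "'
    outputs = generate(prompt, max_new_tokens=3, num_return_sequences=20,
                       do_sample=True, temperature=0.8, pad_token_id=50256)

    ranked = Counter()
    for out in outputs:
        completion = out["generated_text"][len(prompt):]
        # 2. Sanitisation: keep only the first alphabetic word, drop punctuation/babble.
        match = re.match(r"[A-Za-z-]+", completion.strip())
        if not match:
            continue
        word = match.group(0).lower()
        if word == target.lower():
            continue
        # 3. POS filter: keep candidates whose part of speech matches the target's.
        swapped = sentence.replace(target, word, 1)
        cand = next((t for t in nlp(swapped) if t.text == word), None)
        if cand is not None and cand.pos_ == target_pos:
            # 4. Ranking: here simply by sampling frequency, a stand-in for
            #    log-prob or embedding-cosine ranking.
            ranked[word] += 1

    return [w for w, _ in ranked.most_common(top_k)]


print(sketch_substitutes("The bright student aced the exam.", "bright"))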

4 · Install

# 1. Create & activate a virtualenv or conda environment
python -m pip install --upgrade pip

# 2. Clone & install in development mode
git clone https://github.com/shamspias/lexsublm-lite
cd lexsublm-lite
pip install -e .

CUDA users: install bitsandbytes to enable true 4-bit quant (pip install bitsandbytes).
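A quick way to check whether the 4-bit path will actually be used (assumes torch is already installed; bitsandbytes only matters when CUDA is available):

import importlib.util

import torch

print("CUDA available:      ", torch.cuda.is_available())
print("bitsandbytes present:", importlib.util.find_spec("bitsandbytes") is not None)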

5 · Quick start (run)

Generate top-k substitutes:

lexsub run \
  --sentence "The bright student aced the exam." \
  --target bright \
  --top_k 5 \
  --model llama3-mini

Example JSON output:
[
  "brilliant",
  "smart",
  "gifted",
  "clever",
  "talented"
]
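To consume the result from Python rather than the shell, one option is to call the CLI and parse its output; this sketch assumes lexsub run prints the JSON array to stdout as shown above.

import json
import subprocess

result = subprocess.run(
    ["lexsub", "run",
     "--sentence", "The bright student aced the exam.",
     "--target", "bright",
     "--top_k", "5",
     "--model", "llama3-mini"],
    capture_output=True, text=True, check=True,
)
substitutes = json.loads(result.stdout)
print(substitutes[0])  # e.g. "brilliant"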

Toggle model source

# By alias
lexsub run ... --model distilgpt2

# Direct HF repo
lexsub run ... --model EleutherAI/gpt-neo-125m

# Local GGUF (fastest on macOS)
lexsub run ... --model ./models/MyQuantizedModel.gguf

Run a mini-benchmark on 5 hand‑crafted cases across all registry aliases:

python -m lexsublm_lite.bench.bench_models --top_k 5

6 · Evaluation (eval)

Benchmark any model on standard datasets (SWORDS, ProLex, TSAR-2022) and output aggregate metrics:

lexsub eval \
  --dataset swords \
  --split dev \
  --model llama3-mini

  • --dataset: swords | prolex | tsar22
  • --split:
    • swords/prolex: dev or test
    • tsar22: test (alias for test_none), test_none, or test_gold
  • --model: alias, HF repo, or .gguf path (overrides default)

The command prints mean P@1, Recall@k, GAP (and ProF1 for ProLex).
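For orientation, P@1 and Recall@k follow the usual per-item-then-average convention; the sketch below illustrates that convention only and is not the toolkit's evaluation code (GAP and ProF1 are omitted).

def precision_at_1(predicted: list[str], gold: set[str]) -> float:
    # 1.0 if the top-ranked candidate appears in the gold substitute set.
    return float(bool(predicted) and predicted[0] in gold)

def recall_at_k(predicted: list[str], gold: set[str], k: int = 5) -> float:
    # Fraction of gold substitutes recovered among the top-k candidates.
    return len(set(predicted[:k]) & gold) / len(gold) if gold else 0.0

# Mean over the dataset (toy example with a single item):
items = [(["brilliant", "smart"], {"brilliant", "clever", "smart"})]
print(sum(precision_at_1(p, g) for p, g in items) / len(items))  # 1.0
print(sum(recall_at_k(p, g, 5) for p, g in items) / len(items))  # ~0.67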

7 · Datasets

Each corpus ships with a download helper (except the legacy SemEval-2007 set):

  • SWORDS (2021): 4 848 targets / 57 k substitutes, CC-BY-4.0. Download: python -m lexsub.datasets.swords.download
  • ProLex (2024): 6 000 sentences with proficiency ranks, CC-BY-4.0. Download: python -m lexsub.datasets.prolex.download
  • TSAR-2022: 1 133 sentences (EN/ES/PT), CC-BY-4.0. Download: python -m lexsub.datasets.tsar.download
  • SemEval-2007 (legacy): 2 000 sentences, CC-BY-2.5. No download helper.

8 · Default model registry

llama3-mini: meta-llama/Llama-3.2-1B

distilgpt2: distilbert/distilgpt2

qwen500m: Qwen/Qwen2.5-0.5B

tinyllama: Maykeye/TinyLLama-v0

gpt-neo-125m: EleutherAI/gpt-neo-125m

opt-125m: facebook/opt-125m

Drop new entries in model_registry.yaml; aliases are auto-discovered.
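Resolution can be thought of as a simple alias → path lookup with a pass-through fallback. The snippet below is a sketch of that idea; the flat {alias: repo-or-path} layout is an assumption, not the registry's real schema.

import yaml  # pip install pyyaml

with open("model_registry.yaml", encoding="utf-8") as fh:
    registry = yaml.safe_load(fh)

def resolve(model_arg: str) -> str:
    # Known alias -> mapped HF repo / GGUF path; anything else passes through
    # (a direct HF repo id or local .gguf path).
    return registry.get(model_arg, model_arg)

print(resolve("llama3-mini"))                      # meta-llama/Llama-3.2-1B
print(resolve("./models/MyQuantizedModel.gguf"))   # passed through unchanged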

9 · Performance vs. footprint (sample, SWORDS dev, M2 Pro CPU)

  • tinyllama: 0.8 GB RAM, P@1 0.20, R@5 0.04, Jaccard 0.04 (Q4 GGUF, fast + stable)
  • llama3-mini: 1.2 GB RAM, P@1 0.00, R@5 0.16, Jaccard 0.13 (gated HF model)
  • distilgpt2: 1.1 GB RAM, P@1 0.10, R@5 0.05, Jaccard 0.08 (compact transformer)

🔧 To improve performance, ensure safetensors is installed and configure offload_folder when using accelerate.
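With accelerate installed, disk offload is configured when the model is loaded. A hedged example with transformers (the model id is one of the registry defaults; the offload directory name is arbitrary):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "distilbert/distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",         # requires accelerate
    offload_folder="offload",  # weights that don't fit in RAM/VRAM spill to this folder
)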

10 · Citing

@software{lexsublm_lite_2025,
  author  = {Shamsuddin Ahmed},
  title   = {LexSubLM-Lite: Lightweight Contextual Lexical Substitution Toolkit},
  year    = {2025},
  url     = {https://github.com/shamspias/lexsublm-lite},
  license = {MIT}
}

11 · Licenses

  • Code – MIT
  • Models – See individual model cards (Apache-2, commercial, etc.)
  • Datasets – CC-BY-4.0 unless noted

12 · Roadmap

  • 🔜 LoRA fine-tuning on SWORDS (opt-in GPU)
  • 🔜 Gradio playground demo
  • 🔜 Multilingual eval on TSAR-2022 ES/PT

13 · 🐛 Known Issues

🔐 Hugging Face: Gated model access (e.g., LLaMA 3)

If you encounter:

401 Client Error: Unauthorized for url: https://huggingface.co/.../config.json

You need to authenticate for gated models:

  1. Obtain an access token from https://huggingface.co/settings/tokens
  2. Log in via the CLI (it will prompt for the token):
    huggingface-cli login
  3. Or export the token as an environment variable (the sketch below reads it):
    export HUGGINGFACE_TOKEN=your_token_here
  4. (If needed) Request access to the model: https://huggingface.co/meta-llama/Llama-3.2-1B
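Alternatively, the exported token can be used programmatically; this sketch assumes HUGGINGFACE_TOKEN was set as in step 3 and uses huggingface_hub's login helper.

import os

from huggingface_hub import login

login(token=os.environ["HUGGINGFACE_TOKEN"])  # authenticates downloads of gated models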
