Notebook-first NLP lab for building small transformer systems end-to-end, including model training, task fine-tuning, routing, and retrieval pipelines.
High-level architecture of the Mini Transformer NLP Lab showing the encoder/decoder pipelines, router, and RAG workflow.
- train decoder and encoder models from scratch
- fine-tune for multiple NLP tasks
- train a TF-IDF query router
- build a mini RAG index
- compose everything into a routed mini app
```
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
jupyter notebook
```

Open notebooks from the `notebooks/` folder and run cells top-to-bottom.
- Python: 3.10+ recommended
- GPU: strongly recommended for pretraining/fine-tuning
- Internet: required when notebooks download Hugging Face datasets/models
- Core deps (from `requirements.txt`): `torch`, `transformers`, `datasets`, `trl`, `tokenizers`, `evaluate`, `numpy`, `matplotlib`
Decoder-only track (LlamaForCausalLM style).
0. `prepare data.ipynb`: stream and clean corpus into `data.txt`
1. `train tokenizer.ipynb`: train a ByteLevel BPE tokenizer
2. `pretrain model from scratch.ipynb`: pretrain the base decoder LM into `model/`
3. `prepare tokenizer for chat_template.ipynb`: add chat-template compatibility
4. `sft fine-tune assistant.ipynb`: instruction SFT (`yahma/alpaca-cleaned`, `databricks/databricks-dolly-15k`)
5. `fine-tune -> Classification.ipynb`: sentiment classification on `tweet_eval` into `classifier/`
- `inference.ipynb`: inference for base LM / assistant / classifier
- `zz.ipynb`: scratch experiments
Main artifacts: data.txt, model/, classifier/
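The tokenizer-training step (notebook 1) can be sketched in a few lines with the `tokenizers` library. This is a minimal illustration on an in-memory corpus; the vocabulary size and sample sentences are assumptions, not the notebook's actual settings, which train on `data.txt`:

```python
# Minimal ByteLevel BPE tokenizer training sketch (illustrative settings).
from tokenizers import ByteLevelBPETokenizer

corpus = [
    "Transformers process text as sequences of tokens.",
    "Byte-level BPE can encode any input string.",
]

tokenizer = ByteLevelBPETokenizer()
# vocab_size here is a toy value; the real notebook uses a larger vocabulary.
tokenizer.train_from_iterator(corpus, vocab_size=400, min_frequency=1)

ids = tokenizer.encode("Transformers process text.").ids
print(tokenizer.decode(ids))  # byte-level BPE round-trips the input text
```

Because the byte-level alphabet covers every possible byte, the tokenizer never produces unknown tokens, which is why this style is used for the decoder track.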
Encoder track (BERT-style).
0. `prepare data.ipynb`: stream and clean corpus into `data.txt`
1. `train tokenizer.ipynb`: train a WordPiece tokenizer
2. `pretrain model MLM from scratch.ipynb`: MLM pretraining into `model/`
3. `fine-tune -> Classification.ipynb`: `tweet_eval` classification into `classifier/`
4. `fine-tune -> NER.ipynb`: NER on `eriktks/conll2003` into `ner/`
5. `fine-tune -> QA.ipynb`: extractive QA on `squad` into `qa/`
6. `fine-tune -> contrastive-similarity-embeddings like (all-MiniLM).ipynb`: contrastive embeddings on `sentence-transformers/all-nli` into `embed_model/`
7. `export embeddings for embeddings-projector.ipynb`: export `embeddings.tsv` and `metadata.tsv` for the projector
- `inference.ipynb`: fill-mask, classification, NER, QA, and embedding checks
Main artifacts: data.txt, model/, classifier/, ner/, qa/, embed_model/
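The projector export step (notebook 7) writes two tab-separated files. A minimal sketch of the expected format, with toy vectors and labels standing in for real sentence embeddings:

```python
# Write embeddings.tsv / metadata.tsv in the format the embedding projector expects.
# The vectors and labels below are toy placeholders, not real model output.
embeddings = [
    [0.12, -0.40, 0.88],
    [0.05, 0.33, -0.27],
]
labels = ["sentence one", "sentence two"]

# One embedding per line, dimensions separated by tabs.
with open("embeddings.tsv", "w", encoding="utf-8") as f:
    for vec in embeddings:
        f.write("\t".join(f"{x:.6f}" for x in vec) + "\n")

# One label per line, in the same order as the embeddings.
with open("metadata.tsv", "w", encoding="utf-8") as f:
    for label in labels:
        f.write(label + "\n")
```

With a single metadata column, the projector expects no header row; multi-column metadata would need a tab-separated header line.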
Intent router training:
- `1. train.ipynb`: TF-IDF + LogisticRegression router training
- `data.json`: labeled routes (`retrieve_generate`, `direct_qa`, `chat`)
- Output: `router_tfidf.pkl`
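The router boils down to a standard scikit-learn text-classification pipeline. A self-contained sketch with the same three route labels; the training sentences below are invented for illustration and do not come from `data.json`:

```python
# TF-IDF + LogisticRegression router sketch; training data is illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "search the documents for information about transformers",
    "find relevant passages in the corpus",
    "what is tokenization",
    "define attention in one sentence",
    "help me understand embeddings simply",
    "let's chat about language models",
]
routes = [
    "retrieve_generate", "retrieve_generate",
    "direct_qa", "direct_qa",
    "chat", "chat",
]

router = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
router.fit(texts, routes)

print(router.predict(["search the documents for information about RAG"])[0])
```

The fitted pipeline can then be persisted with `pickle` to produce an artifact like `router_tfidf.pkl`.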
Minimal retrieval pipeline:
1. `prepare-documents-and-chunking.ipynb`: produce `documents.json`, `chunks.json`
2. `build-embedding-index.ipynb`: build and save `rag_index.pkl`
3. `retrieval-test.ipynb`: retrieval sanity checks
- `embed_model_utils.py`: embedding helper using `sentence-transformers/all-MiniLM-L6-v2` for chunk encoding and retrieval.
The repository also contains a locally trained embedding model (Mini-Encoder-LLM/embed_model) that can replace this baseline if desired.
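Retrieval over the index reduces to nearest-neighbour search by cosine similarity. A dependency-free sketch of that step, with a toy bag-of-words counter standing in for the sentence-transformers encoder (the function names and chunks here are illustrative, not the repo's actual API):

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy stand-in for the sentence-transformers encoder: word-count vectors.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

chunks = [
    "RAG combines information retrieval with text generation.",
    "Tokenization splits text into smaller units called tokens.",
    "Attention lets a model weigh tokens against each other.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]  # analogous to rag_index.pkl

def retrieve(query, top_k=1):
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

print(retrieve("information about retrieval and generation")[0])
```

The real pipeline replaces `embed` with dense model embeddings, but the ranking logic is the same.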
Composed app that routes a query to QA, Chat, or RAG:
- `app.ipynb`: entry notebook
- `pipelines.py`: orchestration (`handle_query`)
- `router_utils.py`: loads `../TF-IDF router/router_tfidf.pkl`
- `qa_utils.py`: baseline direct-QA pipeline (SmolLM2-135M-Instruct)
- `chat_utils.py`: baseline chat pipeline (SmolLM2-135M-Instruct)
- `rag_utils.py`: retrieval pipeline over `../Y-Mini-RAG/rag_index.pkl`, currently returning the best retrieved chunk as the answer baseline
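The orchestration amounts to a route-then-dispatch step. A self-contained sketch with stubbed pipelines; the stub functions and the keyword rule standing in for the trained router are illustrative, not the actual code in `pipelines.py`:

```python
# Route-then-dispatch sketch; the real app loads router_tfidf.pkl and model pipelines.
def route(query):
    # Keyword stand-in for the trained TF-IDF router.
    if "search the documents" in query.lower():
        return "retrieve_generate"
    if query.strip().endswith("?"):
        return "direct_qa"
    return "chat"

def run_qa(query):
    return f"[qa] answer to: {query}"

def run_chat(query):
    return f"[chat] reply to: {query}"

def run_rag(query):
    return f"[rag] best chunk for: {query}"

PIPELINES = {"direct_qa": run_qa, "chat": run_chat, "retrieve_generate": run_rag}

def handle_query(query):
    return PIPELINES[route(query)](query)

print(handle_query("What is tokenization?"))
```

Keeping the route names as dictionary keys means adding a new pipeline is just one more entry in `PIPELINES` plus a new label in the router's training data.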
Standalone reference notebook for SmolLM2 instruction generation.
The QA and Chat pipelines in notebooks/Z-Mini-App currently use the small instruction model:
HuggingFaceTB/SmolLM2-135M-Instruct
This model is used only as a lightweight baseline so the mini application can run without requiring heavy training.
The intended assistant model for this project is the one trained in:
notebooks/Mini-Decoder-LLM/4. sft fine-tune assistant.ipynb
This notebook produces an instruction-tuned assistant checkpoint based on the decoder model trained from scratch in this repository.
With sufficient training data and compute, the pipelines in notebooks/Z-Mini-App can be switched to use this locally trained assistant instead of the SmolLM2 baseline.
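In principle the swap is a one-line configuration change in the pipeline utilities. A hedged sketch: the constant names and the local checkpoint path below are assumptions for illustration, since the actual variables in `qa_utils.py` / `chat_utils.py` may differ:

```python
# Illustrative model-selection switch; variable names and the local path
# are assumptions, not the repo's actual configuration.
USE_LOCAL_ASSISTANT = False  # flip once the SFT notebook has produced a checkpoint

MODEL_ID = (
    "../Mini-Decoder-LLM/model"  # assumed path to the locally trained assistant
    if USE_LOCAL_ASSISTANT
    else "HuggingFaceTB/SmolLM2-135M-Instruct"  # lightweight baseline
)
```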
- `Mini-Decoder-LLM`: run in numeric order (0 to 5)
- `Mini-Encoder-LLM`: run in numeric order (0 to 7)
- `TF-IDF router/1. train.ipynb`
- `Y-Mini-RAG` notebooks (1 to 3)
- `Z-Mini-App/app.ipynb`
For the embedding projector:
- Run `Mini-Encoder-LLM/7. export embeddings for embeddings-projector.ipynb`
- Place `embeddings.tsv` and `metadata.tsv` into `embedding-projector-standalone/oss_data/`
- Open `embedding-projector-standalone/index.html`
- Many folders already contain trained weights/checkpoints (`model.safetensors`, tokenizer/config files), so inference notebooks can be run without retraining.
- If you retrain, artifacts are overwritten in track-local folders such as `model/`, `classifier/`, `ner/`, `qa/`, and `embed_model/`.
- Notebook outputs depend on random seeds, hardware, and dataset revisions; exact metrics can vary.
The project demonstrates a modular transformer-based NLP stack:
- Decoder track → generative language modeling and instruction tuning
- Encoder track → representation learning and classic NLP tasks
- Router → lightweight intent classification for query routing
- Retrieval → semantic document search via embeddings
- Mini App → unified pipeline combining routing, QA, chat, and retrieval
Each component can be studied independently or composed together to build a small but complete NLP system.
```
Query: What is tokenization?
Route: direct_qa

Query: Help me understand RAG simply
Route: chat

Query: Search the documents for information about RAG
Route: retrieve_generate
```
```
Query: Search the documents for information about RAG
Top retrieved chunk:
RAG (Retrieval Augmented Generation) is a technique that combines
information retrieval with text generation. Instead of relying only
on the model's internal knowledge, RAG retrieves relevant documents
and uses them as context for generation.
```
```
Input: What is tokenization?
Pipeline: direct_qa
Output: Tokenization is the process of splitting text into smaller units called tokens.
```
You can also run a simple CLI demo from `notebooks/Z-Mini-App/`:

```
cd notebooks/Z-Mini-App
python run_app.py
```

- CUDA out-of-memory: reduce batch size, sequence length, or use CPU.
- Missing model/dataset download: verify internet access and Hugging Face availability.
- Notebook path issues: run notebooks from their own folders so relative paths resolve correctly.
