These notebooks were developed in 2024 for learning and experimentation—exploring NLP and modern LLMs end-to-end rather than serving production systems.
A portfolio of hands-on NLP and transformer notebooks, originally built in Google Colab, covering classical text pipelines through modern LLM fine-tuning with Hugging Face Transformers, parameter-efficient adaptation (LoRA), and standard evaluation metrics.
Each notebook is self-contained and includes an Open in Colab badge pointing at this repository.
| Notebook | What you explore | Stack & datasets (high level) |
|---|---|---|
| `Text_Cleaning_Project.ipynb` | End-to-end text normalization for downstream NLP: regex cleanup, punctuation and chat-word handling, light spell correction, stop-word removal, and emoji stripping on real review text. | pandas; IMDB sentiment corpus (CSV via URL in notebook) |
| `Word2Vec_Project.ipynb` | Train a Word2Vec model from scratch with Gensim, preprocess long-form fiction text, and inspect similarity and doesn't-match behavior: classic static embeddings before transformers. | NLTK, Gensim; Game of Thrones books (Kaggle-style layout referenced in notebook) |
| `Sentence_Embedding_Project.ipynb` | BERT-based sentence embeddings: tokenization with `bert-base-uncased`, a forward pass through `BertModel`, pooling/context vectors, and decoding back to readable tokens. | Hugging Face Transformers; PyTorch/CUDA where available |
| `NER_with_BERT_Project.ipynb` | Named Entity Recognition as token classification: fine-tune `bert-base-uncased` with `AutoModelForTokenClassification`, aligned token-label encoding, training with `Trainer`, and evaluation (`seqeval` / `evaluate`). Optional push to the Hugging Face Hub. | `datasets` (CoNLL-2003), Transformers, Accelerate |
| `Q&A_with_LM.ipynb` | Extractive question answering on SQuAD-style data: predict an answer span in the passage (or "unanswerable" behavior as supported by the pipeline). Uses a BERT-style reader via SimpleTransformers for fast training/eval. | SimpleTransformers; `bert-base-cased`; Stanford QA data (paths in notebook) |
| `Text_Summarization_Project.ipynb` | Abstractive summarization with a seq2seq transformer: start from Pegasus (`google/pegasus-cnn_dailymail`), fine-tune on dialogue-to-summary data (`Samsung/samsum`), use `TrainingArguments` + `Trainer`, and measure quality with ROUGE via `evaluate`. | Transformers, Datasets, ROUGE |
| `Llama2_LoRA_Project.ipynb` | Instruction tuning for Llama 2 7B with LoRA (PEFT), quantization-friendly tooling (bitsandbytes where applicable), and TRL-style supervised fine-tuning on an instruction dataset (`mlabonne/guanaco-llama2-1k`). Illustrates efficient LLM adaptation vs. full fine-tuning. | PEFT, TRL, Accelerate, Transformers |
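To give a flavor of the first notebook, the normalization steps (regex cleanup, chat-word expansion, punctuation and stop-word removal, emoji stripping) can be sketched with the standard library alone. The `CHAT_WORDS` map and `STOP_WORDS` set below are tiny illustrative placeholders, not the notebook's actual resources (which draw on fuller lists such as NLTK's stop words):

```python
import re
import string

# Illustrative placeholders -- the notebook uses much fuller resources.
CHAT_WORDS = {"u": "you", "gr8": "great", "imo": "in my opinion"}
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "it"}

def clean_text(text: str) -> str:
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)           # strip HTML tags
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # strip URLs
    # Drop emoji and other non-ASCII symbols
    text = text.encode("ascii", errors="ignore").decode()
    tokens = []
    for tok in text.split():
        # Remove punctuation, then expand chat abbreviations
        tok = tok.translate(str.maketrans("", "", string.punctuation))
        tok = CHAT_WORDS.get(tok, tok)
        if tok and tok not in STOP_WORDS:
            tokens.append(tok)
    return " ".join(tokens)

print(clean_text("The movie was gr8!! <br/> IMO u should watch it 😀"))
# → movie great in my opinion you should watch
```

The real pipeline adds spell correction on top of this, which needs an external dictionary and is deliberately omitted here.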
These notebooks map cleanly to common hiring and portfolio keywords:
- Natural Language Processing: cleaning, tokenization, embeddings, span QA, NER, summarization
- Transformers & LLMs: BERT (encoder), Pegasus (encoder–decoder), Llama 2 (decoder)
- Fine-tuning: full fine-tuning (NER, summarization), LoRA / PEFT (Llama 2)
- Hugging Face ecosystem: `transformers`, `datasets`, `evaluate`, Hub workflows
- Model evaluation: span metrics for QA/NER, ROUGE for summarization
- Efficient training: mixed precision / Trainer patterns, parameter-efficient LLM tuning
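The "parameter-efficient" claim is easy to make concrete: LoRA freezes a pretrained weight matrix `W` and learns only a low-rank update `B @ A`. The sketch below uses toy NumPy matrices with made-up dimensions (not Llama 2's actual shapes, and not PEFT's implementation) to show both the forward pass and the parameter savings:

```python
import numpy as np

# Toy dimensions standing in for one projection matrix (illustrative only).
d_out, d_in, r = 1024, 1024, 8

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, r))               # zero-initialised, so the update starts at 0

def lora_forward(x: np.ndarray, alpha: float = 16.0) -> np.ndarray:
    # Effective weight is W + (alpha / r) * B @ A, applied without materialising it
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

full_params = d_out * d_in           # what full fine-tuning would update
lora_params = r * (d_out + d_in)     # what LoRA trains instead
print(f"full: {full_params:,}  lora: {lora_params:,}")  # ~64x fewer here
```

At rank 8 the trainable count drops from roughly a million to about sixteen thousand per matrix; at 7B scale the same arithmetic is what makes single-GPU adaptation feasible.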
GPT-class APIs, LangChain, RAG, AWS Bedrock, and Azure AI are important parts of the modern LLM stack but are not implemented as standalone projects in this repository. If you use those in production or other repos, call them out on your profile or add a short “Elsewhere” section there—keep this README tightly aligned with what visitors can actually run from these notebooks.
Python · Jupyter / Google Colab · PyTorch · Hugging Face (transformers, datasets, accelerate, evaluate, peft, trl) · NLTK · Gensim · SimpleTransformers (QA) · ROUGE · seqeval (NER)
- Recommended: open any notebook on Google Colab via the badge at the top of the file (points to `jackychh7878/Colab_AI_Project` on GitHub).
- Locally: use Python 3.10+ (as in the Colab logs), install dependencies per notebook (`pip install …` cells), and ensure GPU availability for larger models (BERT fine-tunes, Pegasus, Llama 2 + LoRA).
- Secrets: notebooks that use the Hugging Face Hub may expect a token (`notebook_login` / `HF_TOKEN` patterns); add your own token where prompted.
- GitHub shows "Invalid Notebook" (missing `application/vnd.jupyter.widget-state+json`): Colab/Hugging Face often save tqdm output as ipywidgets, and GitHub's viewer is strict about widget JSON. From the repo root run `python fix_notebooks_for_github.py`; it removes only the widget MIME bundle from each output and keeps `text/plain` (and images, etc.), so you do not need to re-run the notebook.
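For the curious, the kind of cleanup that script performs can be sketched in a few lines of stdlib Python; a notebook is just JSON, so stripping the widget MIME bundles (while leaving `text/plain` and image outputs intact) is a dictionary walk. This is an illustration of the approach under that assumption, not the actual contents of `fix_notebooks_for_github.py`:

```python
import json
from pathlib import Path

WIDGET_VIEW = "application/vnd.jupyter.widget-view+json"
WIDGET_STATE = "application/vnd.jupyter.widget-state+json"

def strip_widget_outputs(path: str) -> None:
    nb = json.loads(Path(path).read_text(encoding="utf-8"))
    # Drop the notebook-level widget state that GitHub's validator chokes on
    nb.get("metadata", {}).pop("widgets", None)
    for cell in nb.get("cells", []):
        for output in cell.get("outputs", []):
            data = output.get("data", {})
            # Remove only the widget MIME entries; text/plain, images, etc. survive
            data.pop(WIDGET_VIEW, None)
            data.pop(WIDGET_STATE, None)
    Path(path).write_text(json.dumps(nb, indent=1), encoding="utf-8")
```

Because only output metadata is touched, the cleaned notebook re-renders on GitHub without re-executing any cell.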
Colab_AI_Project/
├── Text_Cleaning_Project.ipynb
├── Word2Vec_Project.ipynb
├── Sentence_Embedding_Project.ipynb
├── NER_with_BERT_Project.ipynb
├── Q&A_with_LM.ipynb
├── Text_Summarization_Project.ipynb
├── Llama2_LoRA_Project.ipynb
└── README.md
- Author: Jacky Chong — Colab AI project collection.
- Medical chatbot (Retrieval-Augmented Generation) — write-up / notes: Google Doc
- Google Drive backup of materials: folder