Quick Summary

An end-to-end data pipeline that scrapes Canvas files and lecture transcripts, processes them, and enables semantic search using sentence embeddings.

Uses:

Scraping Tools: Selenium, Requests, Concurrency
Embeddings: Transformers, PyTorch, PostgreSQL
(hopefully) Web App: Next.js, Tailwind, Transformers.js

Pipeline Order

Canvas Scraper
- Uses auth cookies and Requests to scrape all Canvas files, then uses Selenium to load the lecture hosting site and access transcripts.
- The lecture scraper is slow: it takes 30-40 minutes on my machine, though a more powerful PC or better-optimized code should speed it up.
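As a rough illustration of the cookie-plus-Requests half of this step, here is a minimal sketch that lists a course's files through the Canvas REST API and follows its pagination. It is not the notebook's actual code: the function name, the host, and the way the cookies get into the session are all assumptions, and `session` only needs to behave like a `requests.Session`.

```python
from typing import Any, Dict, Iterator


def iter_canvas_files(session: Any, base_url: str, course_id: int) -> Iterator[Dict[str, Any]]:
    """Yield file records for one course, following Canvas pagination.

    `session` is assumed to behave like a `requests.Session` that already
    carries the auth cookies copied from a logged-in browser.
    """
    url = f"{base_url}/api/v1/courses/{course_id}/files"
    params = {"per_page": 100}
    while url:
        resp = session.get(url, params=params, timeout=30)
        resp.raise_for_status()
        yield from resp.json()
        # Canvas advertises the next page in the HTTP Link header, which
        # requests parses into `resp.links`.
        url = resp.links.get("next", {}).get("url")
        params = None  # the "next" URL already embeds its query string


# Example wiring (hypothetical host and cookie value):
#   import requests
#   session = requests.Session()
#   session.cookies.set("canvas_session", "<value copied from the browser>")
#   for f in iter_canvas_files(session, "https://canvas.example.edu", 12345):
#       print(f["display_name"], f["url"])
```

Each yielded record is a plain dict, so the download URLs can be collected into whatever structure the next step expects.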
Course Downloading
- Turns the scraped download URLs into local files.
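A sketch of what this step could look like, assuming the scraper produced a mapping from download URL to file name. The names `download_all` and `fetch_bytes` are hypothetical, and the default fetcher uses only the standard library; an authenticated Canvas download would more likely pass in a fetcher built on the cookie-carrying Requests session.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, List


def fetch_bytes(url: str) -> bytes:
    """Default fetcher (stdlib only, no auth); swap in a session-based one as needed."""
    with urllib.request.urlopen(url, timeout=60) as resp:
        return resp.read()


def download_all(
    url_to_name: Dict[str, str],
    out_dir: str,
    fetch: Callable[[str], bytes] = fetch_bytes,
    max_workers: int = 8,
) -> List[Path]:
    """Download every URL to out_dir/<name>, skipping files that already exist."""
    root = Path(out_dir)
    root.mkdir(parents=True, exist_ok=True)

    def _one(item):
        url, name = item
        # Keep only the final path component so a name cannot escape out_dir.
        target = root / Path(name).name
        if not target.exists():
            target.write_bytes(fetch(url))
        return target

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order; a worker's exception is
        # re-raised here when its result is read.
        return list(pool.map(_one, url_to_name.items()))
```

Skipping files that already exist makes the step safe to re-run after a partial failure.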
Text Extractor
- Wrangles resources into raw text files. There is plenty of room for improvement here (Tesseract OCR, better chunking, etc.).
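Since chunking is called out as an area for improvement, here is one simple baseline to compare against: fixed-size word windows with overlap. This is an illustrative sketch, not the notebook's chunker; the function name and the default sizes are assumptions.

```python
from typing import List


def chunk_words(text: str, size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into windows of `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of storing some text twice.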
Embeddings Generation
- Generates embeddings for the text files. So far I've only experimented with lectures; I will experiment with PDFs soon.
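A minimal sketch of this step, assuming the `sentence-transformers` package and the `sentence-transformers/all-MiniLM-L6-v2` checkpoint (the notebook may load a different MiniLM variant or use Transformers directly). `average_embedding` shows one plausible way to derive a per-lecture vector from chunk vectors, namely a normalized mean; that choice is also an assumption.

```python
import math
from typing import List, Sequence


def embed_chunks(
    chunks: Sequence[str],
    model_name: str = "sentence-transformers/all-MiniLM-L6-v2",  # assumed checkpoint
) -> List[List[float]]:
    """Encode text chunks into unit-length vectors (384 dimensions for this model)."""
    # Imported lazily so average_embedding() works without the heavy dependency.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(list(chunks), normalize_embeddings=True).tolist()


def average_embedding(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Mean of the chunk vectors, rescaled to unit length."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean] if norm else mean
```

Normalizing both chunk and lecture vectors means cosine similarity and dot product rank results identically.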

TODOs:

Data Exploration
- Look into fun ways to cluster and visualize lectures
- Cluster similar classes and similar lectures, and look at semantic variation within lectures, etc.
UI and UX
- Turn into a functional web app, pushing embeddings to Postgres and allowing easy semantic search
- Think about useful interfaces
  - Find similar passages in lectures and suggest lecture slides afterwards?
  - Allow filtering for specific topics within classes (e.g., find where dynamic programming was introduced in an Algorithms class or a specific lecture)
- More!
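To make the semantic search idea concrete, here is a hedged sketch of how a query could be built against the `lectures` and `chunks` tables described under Setting Up Postgres. It assumes the pgvector extension (whose `<=>` operator is cosine distance) and a driver that uses `%s` placeholders, such as psycopg; the function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def to_vector_literal(vec: Sequence[float]) -> str:
    """Format a vector in pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"


def build_search_query(
    query_vec: Sequence[float],
    k: int = 5,
    class_name: Optional[str] = None,
) -> Tuple[str, List[object]]:
    """Return (sql, params) for the k chunks nearest to query_vec."""
    sql = (
        "SELECT l.title, c.position, c.content, "
        "c.embedding <=> %s::vector AS distance "
        "FROM chunks c JOIN lectures l USING (lecture_id) "
    )
    params: List[object] = [to_vector_literal(query_vec)]
    if class_name is not None:
        # Optional filter, e.g. restrict the search to one class.
        sql += "WHERE l.class = %s "
        params.append(class_name)
    sql += "ORDER BY distance LIMIT %s"
    params.append(k)
    return sql, params
```

With an open cursor, `cur.execute(*build_search_query(vec, k=5, class_name="Algorithms"))` would run the search; the query vector has to come from the same model that produced the stored embeddings.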

Setting Up Postgres

Here's the structure of the tables I'm currently using. There's definitely room for expansion and modification; off the top of my head, links to live resources, etc.

Note that I used MiniLM-L6-v2 to generate my text embeddings. If you use a different model, you will likely have to change the vector size to accommodate it.

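postgres=# CREATE EXTENSION IF NOT EXISTS vector;  -- pgvector must be installed; it provides the VECTOR type
postgres=# CREATE TABLE lectures (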
    lecture_id SERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    class TEXT,  -- URL/filepath
    avg_embedding VECTOR(384),  -- Dimension matches MiniLM-L6-v2
    created_at TIMESTAMP DEFAULT NOW(),
    metadata JSONB  -- author, date, tags, etc.
);
postgres=# CREATE TABLE chunks (
    chunk_id SERIAL PRIMARY KEY,
    lecture_id INT NOT NULL REFERENCES lectures(lecture_id) ON DELETE CASCADE,
    content TEXT NOT NULL,
    embedding VECTOR(384),  -- Dimension matches MiniLM-L6-v2
    position INT,  -- Original order in lecture
    metadata JSONB,  -- page numbers, timestamps, etc.
    created_at TIMESTAMP DEFAULT NOW()
);
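For reference, loading one lecture into these tables might look like the sketch below. It assumes a DB-API cursor with `%s` placeholders (for example from psycopg) and passes vectors in pgvector's text form; `insert_lecture` and its argument layout are hypothetical, not part of the repository.

```python
import json
from typing import Any, Dict, Optional, Sequence, Tuple


def to_vector_literal(vec: Sequence[float]) -> str:
    """Format a vector in pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"


def insert_lecture(
    cur: Any,
    title: str,
    class_name: str,
    avg_embedding: Sequence[float],
    chunks: Sequence[Tuple[str, Sequence[float]]],
    metadata: Optional[Dict[str, Any]] = None,
) -> int:
    """Insert one lecture plus its (content, embedding) chunks; return the new lecture_id."""
    cur.execute(
        "INSERT INTO lectures (title, class, avg_embedding, metadata) "
        "VALUES (%s, %s, %s::vector, %s::jsonb) RETURNING lecture_id",
        (title, class_name, to_vector_literal(avg_embedding), json.dumps(metadata or {})),
    )
    lecture_id = cur.fetchone()[0]
    # enumerate() supplies the original order stored in chunks.position.
    for position, (content, embedding) in enumerate(chunks):
        cur.execute(
            "INSERT INTO chunks (lecture_id, content, embedding, position) "
            "VALUES (%s, %s, %s::vector, %s)",
            (lecture_id, content, to_vector_literal(embedding), position),
        )
    return lecture_id
```

Running this inside a single transaction keeps a lecture and its chunks consistent if any insert fails.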
