diegotyner/CanvasResourceSemanticSearch

Quick Summary

An end-to-end data pipeline that scrapes Canvas files and lecture transcripts, processes them, and enables semantic search using sentence embeddings.

Uses:

  • Scraping Tools: Selenium, Requests, Concurrency
  • Embeddings: Transformers, PyTorch, PostgreSQL
  • (hopefully) Web App: Next.js, Tailwind, Transformers.js

Pipeline Order

  1. Canvas Scraper
    • Uses auth cookies and Requests to scrape all Canvas files, then uses Selenium to load the lecture hosting site and access transcripts.
    • The lecture scraper is slow: it takes 30-40 minutes on my machine, though a more powerful PC or better-optimized code should speed it up.
  2. Course Downloading
    • Turns the scraped download URLs into local files.
  3. Text Extractor
    • Wrangles resources into raw text files. There is a lot of room for improvement here (Tesseract OCR, better chunking, etc.).
  4. Embeddings Generation
    • Generates embeddings for the text files. So far I've only experimented with lectures; PDFs are next.
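
As a rough illustration of the chunking mentioned in step 3, here is a minimal sketch that splits extracted text into overlapping word windows before embedding. The function name `chunk_words` and the default `size` and `overlap` values are assumptions for illustration, not the chunking this repo actually uses.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `size` words that share `overlap` words.

    Hypothetical helper: the window sizes are illustrative defaults only.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")

    words = text.split()
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        # Stop once a window has reached the end of the text,
        # so no tiny trailing duplicate chunk is produced.
        if start + size >= len(words):
            break
    return chunks
```

Overlapping windows keep a sentence that straddles a boundary fully present in at least one chunk, at the cost of embedding some words twice.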

TODOs:

  1. Data Exploration
    • Look into fun ways to cluster and visualize lectures
    • Cluster similar classes, similar lectures, semantic variation within lectures, etc.
  2. UI and UX
    • Turn into a functional web app, pushing embeddings to Postgres and allowing easy semantic search
    • Think about useful interfaces
      • Find similar passages in lectures and suggest lecture slides afterwards?
      • Allow filtering for specific topics within classes (e.g., find where dynamic programming was introduced in an Algorithms class or a specific lecture)
    • More!
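
To make the "find similar passages" idea concrete, here is a small in-memory sketch of semantic search: rank stored chunks by cosine similarity to a query embedding. The names `cosine_similarity` and `top_k` and the `(text, vector)` pair layout are assumptions for illustration; a real web app would more likely push this ranking into Postgres.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunks, k=3):
    """Return the k best (score, text) pairs, highest similarity first.

    `chunks` is a list of (text, vector) pairs; this layout is hypothetical.
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

The toy vectors you pass in stand in for real 384-dimensional sentence embeddings; the ranking logic is the same at any dimension.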

Setting Up Postgres

Here's the structure of the tables I'm currently using. There's definitely room for expansion or modification; off the top of my head: links to live resources, etc.

Note that I used MiniLM-L6-v2 to generate my text embeddings. If you use a different model, you will likely have to change the vector size to accommodate it.

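postgres=# CREATE EXTENSION IF NOT EXISTS vector;  -- VECTOR type comes from pgvector (assumed installed)
postgres=# CREATE TABLE lectures (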
    lecture_id SERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    class TEXT,  -- URL/filepath
    avg_embedding VECTOR(384),  -- Dimension matches MiniLM-L6-v2
    created_at TIMESTAMP DEFAULT NOW(),
    metadata JSONB  -- author, date, tags, etc.
);
postgres=# CREATE TABLE chunks (
    chunk_id SERIAL PRIMARY KEY,
    lecture_id INT NOT NULL REFERENCES lectures(lecture_id) ON DELETE CASCADE,
    content TEXT NOT NULL,
    embedding VECTOR(384),  -- Dimension matches MiniLM-L6-v2
    position INT,  -- Original order in lecture
    metadata JSONB,  -- page numbers, timestamps, etc.
    created_at TIMESTAMP DEFAULT NOW()
);
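
A minimal sketch of how these tables might be used from Python, assuming `avg_embedding` holds the element-wise mean of a lecture's chunk embeddings (the schema suggests this but does not confirm it). The helpers `mean_embedding` and `to_pgvector` and the `SEARCH_SQL` query are hypothetical. The `%s` placeholders assume a psycopg-style driver, and `<=>` is pgvector's cosine-distance operator.

```python
def mean_embedding(vectors):
    """Element-wise mean of equal-length vectors (a candidate avg_embedding)."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def to_pgvector(vec):
    """Format a vector in pgvector's text form, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"


# Nearest chunks to a query embedding; smaller cosine distance = more similar.
# Parameters: (query vector in text form, number of rows to return).
SEARCH_SQL = """
SELECT c.chunk_id, l.title, c.content,
       c.embedding <=> %s::vector AS distance
FROM chunks AS c
JOIN lectures AS l USING (lecture_id)
ORDER BY distance
LIMIT %s;
"""
```

With a psycopg cursor, the query could be run as `cursor.execute(SEARCH_SQL, (to_pgvector(query_vec), 5))`, where `query_vec` is a 384-dimensional embedding from the same model used to populate the table.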
