A lightweight search engine built with Streamlit that uses TF-IDF to index Wikipedia articles hosted on GitHub Pages. The project demonstrates document preprocessing, vectorization, ranked retrieval, and adjustable query expansion with WordNet.
This project provides a simple text retrieval system designed for experimentation and instructional use. Users can load a set of locally downloaded Wikipedia articles, build a TF-IDF index, and perform ranked retrieval through a Streamlit interface.
The system includes:
- HTML text extraction
- Tokenization, stopword filtering, and lemmatization
- TF-IDF vectorization
- Cosine similarity ranking
- WordNet-based query expansion
No external database is required; all data is processed locally.
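The preprocessing steps above (tokenization, stopword filtering, lemmatization) can be sketched as a small pipeline. This is a simplified stand-in: it uses a tiny hard-coded stopword list and a crude suffix-stripping lemmatizer purely for illustration, where the actual app relies on NLTK's `word_tokenize`, `stopwords` corpus, and `WordNetLemmatizer`.

```python
import re

# Toy stand-ins for NLTK's English stopword list and WordNetLemmatizer;
# the real pipeline calls into nltk instead of these simplified versions.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "are", "in", "to", "was"}

def lemmatize(token: str) -> str:
    # Crude plural stripping; WordNetLemmatizer does real morphology.
    if token.endswith("ies"):
        return token[:-3] + "y"
    if token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def preprocess(text: str) -> list[str]:
    # Lowercase, split on non-letters, drop stopwords, then lemmatize.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [lemmatize(t) for t in tokens if t not in STOPWORDS]

print(preprocess("The cats and dogs are animals."))
# → ['cat', 'dog', 'animal']
```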
- TF-IDF document indexing with scikit-learn
- Automatic index caching to improve performance
- Adjustable query expansion (narrow, neutral, broad)
- Extraction of paragraph text from Wikipedia HTML pages
- Streamlit interface for querying and displaying ranked results
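The indexing and ranking features can be sketched with scikit-learn directly. The documents and `search` helper below are hypothetical stand-ins for the app's own article texts and retrieval code; the shape of the computation (fit a `TfidfVectorizer`, project the query into the same space, rank by cosine similarity) matches the approach described above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus standing in for the extracted article paragraphs.
docs = [
    "Rome is the capital of Italy.",
    "Athens is the capital of Greece.",
    "The Colosseum is an amphitheatre in Rome.",
]

# Build the TF-IDF index once; scikit-learn handles tokenization here.
vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(docs)

# For caching, the fitted index can be persisted with joblib so later
# runs skip re-vectorizing, e.g.:
#   joblib.dump((vectorizer, doc_matrix), "index.joblib")

def search(query: str, top_k: int = 3) -> list[tuple[int, float]]:
    # Project the query into the same TF-IDF space and rank documents
    # by cosine similarity, highest score first.
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_matrix).ravel()
    ranked = scores.argsort()[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in ranked]

results = search("capital of Italy")
# The Rome/Italy document should rank first.
```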
Note: While the app can run locally with downloaded HTML articles, the preferred method is to use the live Streamlit app, which accesses the articles via GitHub Pages.
git clone https://github.com/cwilburn-dev/INFO556Project.git
cd INFO556Project
The following libraries are required for the project:
- streamlit
- beautifulsoup4
- nltk
- scikit-learn
- numpy
- joblib
To install the dependencies, execute the command below:
pip install -r requirements.txt
We recommend viewing the project via the live Streamlit app:
https://info556project-wilburn.streamlit.app/
If you choose to run the app locally, note that some functionality or behavior may differ from the deployed version.
Run the Streamlit application:
streamlit run streamlit_app.py
Once running, the Streamlit app should open in your browser automatically.
If not, navigate to:
http://localhost:8501
The query expansion slider supports three modes:
- −1 (Narrow): removes very short terms to tighten the query
- 0 (Neutral): searches using only the original query terms
- +1 (Broad): expands the query with WordNet synonyms and hypernyms (limited to selected semantic domains)
The expansion process includes token normalization and lemmatization.
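The three slider modes can be sketched as follows. This toy version substitutes a hard-coded related-terms table for the real WordNet lookups (with NLTK, `wordnet.synsets(term)` supplies synonyms and `synset.hypernyms()` the hypernyms), and it assumes "very short" means fewer than three characters; both are illustrative assumptions, not the app's exact rules.

```python
# Toy stand-in for WordNet: maps a term to synonyms and hypernyms.
# The real app queries nltk.corpus.wordnet instead.
RELATED = {
    "car": ["automobile", "motor_vehicle"],
    "dog": ["canine", "domestic_animal"],
}

def expand(tokens: list[str], mode: int) -> list[str]:
    if mode == -1:
        # Narrow: drop very short terms to tighten the query.
        return [t for t in tokens if len(t) > 2]
    if mode == 0:
        # Neutral: original query terms only.
        return list(tokens)
    # Broad: append synonyms/hypernyms for terms we have entries for.
    expanded = list(tokens)
    for t in tokens:
        expanded.extend(RELATED.get(t, []))
    return expanded

print(expand(["an", "old", "car"], -1))   # narrow drops the short "an"
print(expand(["old", "car"], 1))          # broad adds related terms
```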