A lightweight search engine built with Streamlit that uses TF-IDF to index Wikipedia articles hosted on GitHub Pages. The project demonstrates document preprocessing, vectorization, ranked retrieval, and adjustable query expansion with WordNet.
This project provides a simple text retrieval system designed for experimentation and instructional use. Users can load a set of locally downloaded Wikipedia articles, build a TF-IDF index, and perform ranked retrieval through a Streamlit interface.
The system includes:
- HTML text extraction
- Tokenization, stopword filtering, and lemmatization
- TF-IDF vectorization
- Cosine similarity ranking
- WordNet-based query expansion
No external database is required; all data is processed locally.
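The preprocessing steps above (tokenization, stopword filtering, lemmatization) can be sketched as a small pipeline. This is a simplified stand-in: it uses a tiny hard-coded stopword list and a crude suffix-stripping lemmatizer purely for illustration, where the actual app relies on NLTK's `word_tokenize`, `stopwords` corpus, and `WordNetLemmatizer`.

```python
import re

# Toy stand-ins for NLTK's English stopword list and WordNetLemmatizer;
# the real pipeline calls into nltk instead of these simplified versions.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "are", "in", "to", "was"}

def lemmatize(token: str) -> str:
    # Crude plural stripping; WordNetLemmatizer does real morphology.
    if token.endswith("ies"):
        return token[:-3] + "y"
    if token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def preprocess(text: str) -> list[str]:
    # Lowercase, split on non-letters, drop stopwords, then lemmatize.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [lemmatize(t) for t in tokens if t not in STOPWORDS]

print(preprocess("The cats and dogs are animals."))
# → ['cat', 'dog', 'animal']
```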
- TF-IDF document indexing with scikit-learn
- Automatic index caching to improve performance
- Adjustable query expansion (narrow, neutral, broad)
- Extraction of paragraph text from Wikipedia HTML pages
- Streamlit interface for querying and displaying ranked results
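The indexing and ranking features can be sketched with scikit-learn directly. The documents and `search` helper below are hypothetical stand-ins for the app's own article texts and retrieval code; the shape of the computation (fit a `TfidfVectorizer`, project the query into the same space, rank by cosine similarity) matches the approach described above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus standing in for the extracted article paragraphs.
docs = [
    "Rome is the capital of Italy.",
    "Athens is the capital of Greece.",
    "The Colosseum is an amphitheatre in Rome.",
]

# Build the TF-IDF index once; scikit-learn handles tokenization here.
vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(docs)

# For caching, the fitted index can be persisted with joblib so later
# runs skip re-vectorizing, e.g.:
#   joblib.dump((vectorizer, doc_matrix), "index.joblib")

def search(query: str, top_k: int = 3) -> list[tuple[int, float]]:
    # Project the query into the same TF-IDF space and rank documents
    # by cosine similarity, highest score first.
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_matrix).ravel()
    ranked = scores.argsort()[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in ranked]

results = search("capital of Italy")
# The Rome/Italy document should rank first.
```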
Note: While the app can run locally with downloaded HTML articles, the preferred method is to use the live Streamlit app, which accesses the articles via GitHub Pages.
git clone https://github.com/cwilburn-dev/INFO556Project.git
cd INFO556Project
The following libraries are required for the project:
- streamlit
- beautifulsoup4
- nltk
- scikit-learn
- numpy
- joblib
To install the dependencies, execute the command below:
pip install -r requirements.txt
We recommend viewing the project via the live Streamlit app:
https://info556project-wilburn.streamlit.app/
If you choose to run the app locally, note that some functionality or behavior may differ from the deployed version.
Run the Streamlit application:
streamlit run streamlit_app.py
Once running, the Streamlit app should open in your browser automatically.
If not, navigate to:
http://localhost:8501
The query expansion slider supports three modes:
- −1 (Narrow): removes very short terms to tighten the query
- 0 (Neutral): searches using only the original query terms
- +1 (Broad): expands the query with WordNet synonyms and hypernyms (limited to selected semantic domains)
The expansion process includes token normalization and lemmatization.
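The three slider modes can be sketched as follows. This toy version substitutes a hard-coded related-terms table for the real WordNet lookups (with NLTK, `wordnet.synsets(term)` supplies synonyms and `synset.hypernyms()` the hypernyms), and it assumes "very short" means fewer than three characters; both are illustrative assumptions, not the app's exact rules.

```python
# Toy stand-in for WordNet: maps a term to synonyms and hypernyms.
# The real app queries nltk.corpus.wordnet instead.
RELATED = {
    "car": ["automobile", "motor_vehicle"],
    "dog": ["canine", "domestic_animal"],
}

def expand(tokens: list[str], mode: int) -> list[str]:
    if mode == -1:
        # Narrow: drop very short terms to tighten the query.
        return [t for t in tokens if len(t) > 2]
    if mode == 0:
        # Neutral: original query terms only.
        return list(tokens)
    # Broad: append synonyms/hypernyms for terms we have entries for.
    expanded = list(tokens)
    for t in tokens:
        expanded.extend(RELATED.get(t, []))
    return expanded

print(expand(["an", "old", "car"], -1))   # narrow drops the short "an"
print(expand(["old", "car"], 1))          # broad adds related terms
```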