This guide is for a Python application that implements a web page search engine. The engine uses Singular Value Decomposition (SVD) on the matrix of word frequencies across pages as its fuzzy similarity method for queries, roughly as sketched just below.
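A minimal sketch of that idea, assuming the truncated SVD factors are available as NumPy arrays; the function name `rank_pages` and its interface are hypothetical illustrations, not code taken from `main.py`:

```python
import numpy as np

def rank_pages(query_vec, u, s, vt):
    """Rank pages by cosine similarity to the query in the SVD concept space.

    query_vec -- word-frequency vector of the query over the shared vocabulary
    u, s, vt  -- truncated SVD factors of the term x page matrix, A ~ u @ diag(s) @ vt
    (Hypothetical interface; the engine's own code may differ.)
    """
    q_hat = (u.T @ query_vec) / s  # fold the query into the K-dim concept space
    docs = vt.T                    # row j holds page j's concept-space coordinates
    sims = (docs @ q_hat) / (
        np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat) + 1e-12
    )
    return np.argsort(-sims)       # page indices, best match first
```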
Before using the browser/search engine, you need to prepare the data from which the engine is built. Follow these steps:

- Download the required number of pages using the `get_articles.py` script. The downloaded files will be saved in a directory named with an integer (e.g., `42`) located under `gettingArticles/wikipedia_repos/`. If you have two separate directories of downloaded pages, you can merge them into one with `utils/merge_wikipedia_repos.py`.
- Create a title list with the `make_titles_files.py` script, passing the path to the relevant `wikipedia_repos` subfolder as an argument.
- Process the downloaded dataset using the `parsing/parsing.py` script, adjusting the script's arguments as needed for your specific data (the first sketch after this list illustrates the general idea).
- Compute the SVD for selected values of K (the number of singular values) with the `parsing/svd/calc_svds.py` script. Run it on the chosen dataset located within the `parsing/parsed` directory (see the second sketch after this list).
- Run the main program (`main.py`).
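The first sketch below shows, in rough form, what the parsing stage conceptually produces: a term-by-page word-frequency matrix. The file layout (plain-text pages under `gettingArticles/wikipedia_repos/42`), the tokenizer, and the output path are all assumptions made for illustration; consult `parsing/parsing.py` for the actual behavior.

```python
import re
from collections import Counter
from pathlib import Path

import numpy as np
from scipy.sparse import csc_matrix, save_npz

# Assumed layout: one plain-text page per file in a downloaded repo directory.
pages = sorted(Path("gettingArticles/wikipedia_repos/42").glob("*.txt"))
counts = [Counter(re.findall(r"[a-z]+", p.read_text(encoding="utf-8").lower()))
          for p in pages]

# Build a deterministic vocabulary and fill a sparse term x page matrix.
vocab = {w: i for i, w in enumerate(sorted(set().union(*counts)))}
rows, cols, data = [], [], []
for j, page_counts in enumerate(counts):
    for word, n in page_counts.items():
        rows.append(vocab[word])
        cols.append(j)
        data.append(n)

term_doc = csc_matrix((data, (rows, cols)),
                      shape=(len(vocab), len(pages)), dtype=np.float64)
save_npz("parsing/parsed/term_doc.npz", term_doc)  # assumed output location
```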
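The second sketch covers the SVD stage: computing truncated SVDs for several values of K, which is what `parsing/svd/calc_svds.py` is responsible for. The input and output file names continue the assumptions above; the real script's arguments and formats may differ.

```python
import numpy as np
from scipy.sparse import load_npz
from scipy.sparse.linalg import svds

# Assumed input: the sparse word-frequency matrix produced by the parsing step.
term_doc = load_npz("parsing/parsed/term_doc.npz").asfptype()

# K must be smaller than both matrix dimensions for a truncated SVD.
for k in (50, 100, 200):                # chosen values of K
    u, s, vt = svds(term_doc, k=k)      # truncated SVD: A ~ u @ diag(s) @ vt
    np.savez(f"svd_k{k}.npz", u=u, s=s, vt=vt)
```

The saved factors are exactly the kind of input a query-ranking routine like the earlier `rank_pages` sketch would consume.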