This guide is for a Python application that implements a web page search engine. The engine uses Singular Value Decomposition (SVD) on the matrix of word frequencies across pages as its fuzzy similarity method for queries, roughly as sketched just below.
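A minimal sketch of that idea, assuming the truncated SVD factors are available as NumPy arrays; the function name `rank_pages` and its interface are hypothetical illustrations, not code taken from `main.py`:

```python
import numpy as np

def rank_pages(query_vec, u, s, vt):
    """Rank pages by cosine similarity to the query in the SVD concept space.

    query_vec -- word-frequency vector of the query over the shared vocabulary
    u, s, vt  -- truncated SVD factors of the term x page matrix, A ~ u @ diag(s) @ vt
    (Hypothetical interface; the engine's own code may differ.)
    """
    q_hat = (u.T @ query_vec) / s  # fold the query into the K-dim concept space
    docs = vt.T                    # row j holds page j's concept-space coordinates
    sims = (docs @ q_hat) / (
        np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat) + 1e-12
    )
    return np.argsort(-sims)       # page indices, best match first
```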
Before using the browser/search engine, you need to prepare the data from which the engine is built. Follow these steps:

- Download the required number of pages using the `get_articles.py` script. The downloaded files will be saved in a directory named with an integer (e.g., `42`) located under `gettingArticles/wikipedia_repos/`. If you have two separate directories of downloaded pages, you can merge them into one with `utils/merge_wikipedia_repos.py`.
- Create a title list with the `make_titles_files.py` script, passing the path to the relevant `wikipedia_repos` subfolder as an argument.
- Process the downloaded dataset using the `parsing/parsing.py` script, adjusting the script's arguments as needed for your specific data (the first sketch after this list illustrates the general idea).
- Compute the SVD for selected values of K (the number of singular values) with the `parsing/svd/calc_svds.py` script. Run it on the chosen dataset located within the `parsing/parsed` directory (see the second sketch after this list).
- Run the main program (`main.py`).
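The first sketch below shows, in rough form, what the parsing stage conceptually produces: a term-by-page word-frequency matrix. The file layout (plain-text pages under `gettingArticles/wikipedia_repos/42`), the tokenizer, and the output path are all assumptions made for illustration; consult `parsing/parsing.py` for the actual behavior.

```python
import re
from collections import Counter
from pathlib import Path

import numpy as np
from scipy.sparse import csc_matrix, save_npz

# Assumed layout: one plain-text page per file in a downloaded repo directory.
pages = sorted(Path("gettingArticles/wikipedia_repos/42").glob("*.txt"))
counts = [Counter(re.findall(r"[a-z]+", p.read_text(encoding="utf-8").lower()))
          for p in pages]

# Build a deterministic vocabulary and fill a sparse term x page matrix.
vocab = {w: i for i, w in enumerate(sorted(set().union(*counts)))}
rows, cols, data = [], [], []
for j, page_counts in enumerate(counts):
    for word, n in page_counts.items():
        rows.append(vocab[word])
        cols.append(j)
        data.append(n)

term_doc = csc_matrix((data, (rows, cols)),
                      shape=(len(vocab), len(pages)), dtype=np.float64)
save_npz("parsing/parsed/term_doc.npz", term_doc)  # assumed output location
```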
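The second sketch covers the SVD stage: computing truncated SVDs for several values of K, which is what `parsing/svd/calc_svds.py` is responsible for. The input and output file names continue the assumptions above; the real script's arguments and formats may differ.

```python
import numpy as np
from scipy.sparse import load_npz
from scipy.sparse.linalg import svds

# Assumed input: the sparse word-frequency matrix produced by the parsing step.
term_doc = load_npz("parsing/parsed/term_doc.npz").asfptype()

# K must be smaller than both matrix dimensions for a truncated SVD.
for k in (50, 100, 200):                # chosen values of K
    u, s, vt = svds(term_doc, k=k)      # truncated SVD: A ~ u @ diag(s) @ vt
    np.savez(f"svd_k{k}.npz", u=u, s=s, vt=vt)
```

The saved factors are exactly the kind of input a query-ranking routine like the earlier `rank_pages` sketch would consume.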