SemanticDocsSearcherSVD

Preparing Data for the Search Engine

Before using the search engine, you first need to prepare the data from which the engine is built. Follow these steps:

  1. Download the required number of pages using the get_articles.py script.
    The downloaded files will be saved in a directory named with an integer (e.g., "42"), which will be located under gettingArticles/wikipedia_repos/.
    If you have two separate directories of downloaded pages, you can merge them into one using utils/merge_wikipedia_repos.py.

  2. Create a title list with the make_titles_files.py script, passing the path to the relevant wikipedia_repos subfolder as an argument.

  3. Process the downloaded dataset using the parsing/parsing.py script, adjusting the script’s arguments as needed for your specific data.

  4. Compute the SVD for selected values of "K" (the number of singular values to keep) with the parsing/svd/calc_svds.py script.
    Run it on the chosen dataset located within the parsing/parsed directory; a sketch of steps 3 and 4 follows this list.

  5. Run the main program (main.py).
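
Below is a minimal sketch of the indexing stage behind steps 3 and 4, assuming the parsed pages are available as plain-text strings. The function names (build_term_doc_matrix, compute_svds), the sample K values, and the in-memory data flow are illustrative assumptions only; the repository's own scripts (parsing/parsing.py, parsing/svd/calc_svds.py) define the actual file formats and arguments.

    # Sketch only: illustrates the word-frequency matrix and truncated SVDs,
    # not the repository's actual I/O or script interfaces.
    import numpy as np
    from scipy.sparse import csc_matrix
    from scipy.sparse.linalg import svds

    def build_term_doc_matrix(docs):
        """Build a sparse word-frequency matrix: rows = terms, columns = documents."""
        vocab = {}
        rows, cols, vals = [], [], []
        for j, text in enumerate(docs):
            counts = {}
            for word in text.lower().split():
                counts[word] = counts.get(word, 0) + 1
            for word, c in counts.items():
                i = vocab.setdefault(word, len(vocab))
                rows.append(i)
                cols.append(j)
                vals.append(float(c))
        shape = (len(vocab), len(docs))
        return csc_matrix((vals, (rows, cols)), shape=shape), vocab

    def compute_svds(A, ks=(50, 100, 200)):
        """Truncated SVD A ~ U @ diag(s) @ Vt for each chosen K."""
        factors = {}
        for k in ks:
            k = min(k, min(A.shape) - 1)  # scipy's svds requires k < min(A.shape)
            U, s, Vt = svds(A, k=k)       # singular values come back in ascending order
            factors[k] = (U, s, Vt)
        return factors

Each (U, s, Vt) triple, together with the vocabulary, is all the query stage sketched further below needs.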

This guide accompanies a Python application that implements a web-page search engine. The engine uses Singular Value Decomposition (SVD) of the matrix of word frequencies across pages as its fuzzy similarity method for matching queries, as sketched below.
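
To make the fuzzy matching concrete, here is a minimal sketch of the query side under the same assumptions as the indexing sketch above (the factors U, s, Vt and the vocab mapping are reused; the function name rank_documents is illustrative, and main.py wires up the real data files). A query is folded into the K-dimensional latent space and documents are ranked by cosine similarity there, which is what lets the engine match pages that share meaning rather than exact words.

    import numpy as np

    def rank_documents(query, vocab, U, s, Vt):
        """Rank documents by cosine similarity to the query in the latent space."""
        # Bag-of-words vector for the query, over the index vocabulary.
        q = np.zeros(U.shape[0])
        for word in query.lower().split():
            if word in vocab:
                q[vocab[word]] += 1.0
        # Fold the query into the latent space: q_k = diag(s)^-1 @ U.T @ q.
        q_k = (U.T @ q) / s
        # Columns of Vt hold each document's latent coordinates.
        sims = (q_k @ Vt) / (np.linalg.norm(q_k) * np.linalg.norm(Vt, axis=0) + 1e-12)
        return np.argsort(-sims)  # document indices, best match first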
