This utility discovers web pages reachable from a given set of URLs, up to a limited search depth.
It takes an input file containing links and a maximum depth.
The program returns all links that can be reached with no more than $depth clicks.
The utility is built on the well-known MapReduce model for distributed computation, so it can run on powerful server machines and use all available resources in parallel.
First of all, we need to compile the binaries. If you trust me, you can use the binaries compiled on my system from the Build folder. Otherwise, run the build script:
bash BuildScripts.sh
Now the utility can be invoked with a simple command:
bash WebScraperMainScript input.txt output.txt depth
where
- input.txt - the file with links, one per line, in the format link\t1\n.
- output.txt - the file for the results.
- depth - the maximum depth of the link search.
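For example, a small input file in the expected link\t1\n format can be created like this (the URLs are just placeholders):

```shell
# Create a sample input file: one URL per line, followed by a tab and "1".
printf 'https://example.com\t1\nhttps://example.org\t1\n' > input.txt
```

It could then be processed with, e.g., bash WebScraperMainScript input.txt output.txt 2.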
The model can be described simply as a BFS over the link graph.
The script just runs the map-reduce cycle $depth times.
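The driver loop can be sketched roughly as follows. The map_step and reduce_step functions here are hypothetical stand-ins for the repository's actual scripts and only illustrate the control flow:

```shell
#!/bin/sh
# Sketch of the driver loop. The real Map step would fetch pages and extract
# their links; this stand-in just copies the frontier to keep the sketch runnable.
map_step()    { cp "$1" "$2"; }            # stand-in: would expand links by one click
reduce_step() { sort "$1" | uniq > "$2"; } # sort and deduplicate the map output

depth=2
printf 'https://example.com\nhttps://example.org\n' > frontier.txt
i=0
while [ "$i" -lt "$depth" ]; do            # run the map-reduce cycle $depth times
    map_step frontier.txt mapped.txt
    reduce_step mapped.txt frontier.txt
    i=$((i + 1))
done
cp frontier.txt output.txt
```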
- Map splits the input links across several files and runs a process for each file that collects all links reachable from the given ones. At the end, it merges all the files into one.
- Reduce sorts the file from the previous step, splits it into small parts, and produces the final file with unique values.
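Concretely, the deduplication in the Reduce step corresponds to the classic sort-then-uniq pattern. The filenames below are illustrative, not the ones the scripts actually use:

```shell
# Illustrative Reduce step: sort the merged Map output, then keep unique lines.
printf 'b\na\nb\nc\na\n' > merged.txt   # pretend this is the merged Map output
sort merged.txt | uniq > unique.txt     # duplicates become adjacent, then collapse
```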
To generate a simple dataset for testing, I used generate.cpp:
./generate number_of_links > file_name.txt
P.S. The links are taken from a "top-100 most visited websites" list, so $number_of_links must be at most 100.