This utility discovers web pages reachable from a given set of URLs, up to a limited search depth.
It takes an input file containing links and a maximum depth.
The program returns all links that can be reached with no more than $depth clicks.
The utility is built on the well-known MapReduce model for distributed computation, so it can run on powerful server machines and use all available resources in parallel.
First of all, we need to compile the binaries. If you trust me, you can use the binaries compiled on my system from the Build folder. Otherwise, run the build script:
bash BuildScripts.sh
Now the utility can be invoked with a simple command:
bash WebScraperMainScript input.txt output.txt depth
where
- input.txt - the file with links, one per line, in the format link\t1\n.
- output.txt - the file for the results.
- depth - the maximum depth of the link search.
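For example, a small input file in the expected link\t1\n format can be created like this (the URLs are just placeholders):

```shell
# Create a sample input file: one URL per line, followed by a tab and "1".
printf 'https://example.com\t1\nhttps://example.org\t1\n' > input.txt
```

It could then be processed with, e.g., bash WebScraperMainScript input.txt output.txt 2.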
The model can be described simply as a BFS over the link graph.
The script just runs the map-reduce cycle $depth times.
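The driver loop can be sketched roughly as follows. The map_step and reduce_step functions here are hypothetical stand-ins for the repository's actual scripts and only illustrate the control flow:

```shell
#!/bin/sh
# Sketch of the driver loop. The real Map step would fetch pages and extract
# their links; this stand-in just copies the frontier to keep the sketch runnable.
map_step()    { cp "$1" "$2"; }            # stand-in: would expand links by one click
reduce_step() { sort "$1" | uniq > "$2"; } # sort and deduplicate the map output

depth=2
printf 'https://example.com\nhttps://example.org\n' > frontier.txt
i=0
while [ "$i" -lt "$depth" ]; do            # run the map-reduce cycle $depth times
    map_step frontier.txt mapped.txt
    reduce_step mapped.txt frontier.txt
    i=$((i + 1))
done
cp frontier.txt output.txt
```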
- Map splits the input links across several files and runs a process for each file that collects all links reachable from the given ones. At the end, it merges all the files into one.
- Reduce sorts the file from the previous step, splits it into small parts, and produces the final file with unique values.
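Concretely, the deduplication in the Reduce step corresponds to the classic sort-then-uniq pattern. The filenames below are illustrative, not the ones the scripts actually use:

```shell
# Illustrative Reduce step: sort the merged Map output, then keep unique lines.
printf 'b\na\nb\nc\na\n' > merged.txt   # pretend this is the merged Map output
sort merged.txt | uniq > unique.txt     # duplicates become adjacent, then collapse
```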
To generate a simple dataset for testing, I used generate.cpp:
./generate number_of_links > file_name.txt
P.S. The links are taken from a "top-100 most visited websites" list, so $number_of_links must be at most 100.