Web Crawler for the Human Trafficking Project

This is the core web crawler used for the Human Trafficking Project.

Building

Get the code

Clone or Fork

  # clone
  git clone git@gitlab.com:atl-ads/palantiri.git     # ssh
  # or
  git clone https://gitlab.com/atl-ads/palantiri.git # http
  # build
  cd palantiri
  
  # Make sure you are using Python 3, then use pip to install dependencies.
  # The Anaconda package and environment manager is the easiest way to set
  # this up: https://www.continuum.io/downloads
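  # For example, with conda (these commands assume conda is installed):
  conda create -n palantiri python=3
  conda activate palantiri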
  pip install -e .

  # test
  python setup.py test

Running

Start a MongoDB or PostgreSQL Server

Install MongoDB or PostgreSQL and use the PostgreSQLDump or MongoDBDump class to store the collected data in a database.
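A minimal sketch of wiring a dump class into a crawl; the import path, constructor argument, and method name below are assumptions for illustration, not palantiri's confirmed API:

  # Hypothetical usage -- the import path, constructor signature, and
  # method name are assumed; check the palantiri source for the real API.
  from palantiri import PostgreSQLDump

  # Assumed: the dump class is constructed from a database connection URI.
  dump = PostgreSQLDump("postgresql://user:password@localhost:5432/palantiri")

  # Assumed: scraped records are persisted one at a time as dictionaries.
  dump.save({"url": "http://example.com/listing/123", "title": "..."})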

Scrape

  python search.py -[cgb] <site> <optional arguments>
  • -[cgb] selects the domain to crawl, e.g. -b for backpage.com
  • site is a comma-separated list of the subdirectories to search, e.g. BusinessServices,ComputerServices
  • optional arguments are passed as --<argument> value; see the example invocation below
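For example, a run against the BusinessServices and ComputerServices sections of backpage.com, using only the flags described above, might look like:

  # crawl the BusinessServices and ComputerServices sections of backpage.com
  python search.py -b BusinessServices,ComputerServices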

A more detailed list of options is available by running python search.py --help. example.py shows what we currently run; a full run takes around 30 minutes.

More Documentation

Dependencies

Contributing

Please see CONTRIBUTING.md for more information about contributing to this project.

Questions

Please check out our Slack if you are already part of the project, or contact @danlrobertson if you have any questions.
