Web Crawler for the Human Trafficking Project

This is the core web crawler used for the Human Trafficking Project.

Building

Get the code

Clone or Fork

  # clone
  git clone git@gitlab.com:atl-ads/palantiri.git     # ssh
  # or
  git clone https://gitlab.com/atl-ads/palantiri.git # http
  # build
  cd palantiri
  
  # Make sure you are using Python 3, then use pip to install dependencies.
  # The Anaconda package and environment manager is the easiest way to set
  # this up: https://www.continuum.io/downloads
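  # For example, with conda (these commands assume conda is installed):
  conda create -n palantiri python=3
  conda activate palantiri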
  pip install -e .

  # test
  python setup.py test

Running

Start a MongoDB or PostgreSQL Server

Install MongoDB or PostgreSQL and use the PostgreSQLDump or MongoDBDump class to store the collected data in a database.
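A minimal sketch of wiring a dump class into a crawl; the import path, constructor argument, and method name below are assumptions for illustration, not palantiri's confirmed API:

  # Hypothetical usage -- the import path, constructor signature, and
  # method name are assumed; check the palantiri source for the real API.
  from palantiri import PostgreSQLDump

  # Assumed: the dump class is constructed from a database connection URI.
  dump = PostgreSQLDump("postgresql://user:password@localhost:5432/palantiri")

  # Assumed: scraped records are persisted one at a time as dictionaries.
  dump.save({"url": "http://example.com/listing/123", "title": "..."})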

Scrape

  python search.py -[cgb] <site> <optional arguments>
  • -[cgb] selects the domain to crawl, e.g. -b for backpage.com
  • site is a comma-separated list of the subdirectories to search, e.g. BusinessServices,ComputerServices
  • optional arguments are passed as --<argument> value; see the example invocation below
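For example, a run against the BusinessServices and ComputerServices sections of backpage.com, using only the flags described above, might look like:

  # crawl the BusinessServices and ComputerServices sections of backpage.com
  python search.py -b BusinessServices,ComputerServices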

A more detailed list of options is available by running python search.py --help. example.py shows what we currently run; a full run takes around 30 minutes.

More Documentation

Dependencies

Contributing

Please see CONTRIBUTING.md for more information about contributing to this project.

Questions

Please check out our Slack if you are already part of the project, or contact @danlrobertson if you have any questions.
