This is the core web crawler used for the human trafficking project.
## Clone or Fork

```shell
# clone
git clone git@gitlab.com:atl-ads/palantiri.git # ssh
# or
git clone https://gitlab.com/atl-ads/palantiri.git # https

# build
cd palantiri
# Make sure you are using Python 3, then use pip to install dependencies.
# The Anaconda package and version manager is the easiest way to do this:
# https://www.continuum.io/downloads
pip install -e .
```
```shell
# test
python setup.py test
```

Install MongoDB or PostgreSQL and use the `PostgreSQLDump` or `MongoDBDump` class to store the collected data in a database.
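As an illustration of the storage step, a minimal sketch of normalizing a crawled listing into a plain dict before handing it to one of the dump classes (the function name and field names here are assumptions for illustration, not the repo's actual schema):

```python
def to_document(listing):
    """Normalize a crawled listing into a dict suitable for storage.

    Hypothetical helper: the real schema is defined by the
    PostgreSQLDump / MongoDBDump classes in this repository.
    """
    return {
        "url": listing.get("url", ""),
        "title": (listing.get("title") or "").strip(),
        "body": (listing.get("body") or "").strip(),
        "section": listing.get("section", ""),
    }

# A MongoDBDump-style class would then persist documents like this,
# e.g. via pymongo's collection.insert_one(doc).
doc = to_document({
    "url": "https://example.com/ad/1",
    "title": "  Example ad  ",
    "section": "BusinessServices",
})
```

Keeping the normalization separate from the database client makes it easy to swap between the PostgreSQL and MongoDB backends.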
```shell
python search.py -[cgb] <site> <optional arguments>
```

- `-[cgb]` defines the domain name, e.g. `-b` for `.backpage.com`.
- `<site>` takes a comma-separated list which defines the subdirectories to search, e.g. `BusinessServices,ComputerServices`.
- Optional arguments are defined with `--<argument> value`.
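The flag handling described above could be sketched with `argparse` as follows. This is a hypothetical reconstruction, not the actual code in `search.py`; only the `-b` → `.backpage.com` mapping is stated in this README, so the domains for `-c` and `-g` are left as placeholders:

```python
import argparse

def build_parser():
    # Hypothetical sketch of how search.py's flags might be parsed;
    # run `python search.py --help` for the real option list.
    parser = argparse.ArgumentParser(description="palantiri crawler (sketch)")
    group = parser.add_mutually_exclusive_group(required=True)
    # -b for .backpage.com is documented above; -c and -g domains are unknown here.
    group.add_argument("-b", dest="domain", action="store_const",
                       const=".backpage.com")
    group.add_argument("-c", dest="domain", action="store_const",
                       const="<c-domain>")  # placeholder, actual domain unknown
    group.add_argument("-g", dest="domain", action="store_const",
                       const="<g-domain>")  # placeholder, actual domain unknown
    parser.add_argument("site", help="comma-separated subdirectories to search")
    return parser

args = build_parser().parse_args(["-b", "BusinessServices,ComputerServices"])
sections = args.site.split(",")
```

With those arguments, `args.domain` resolves to `.backpage.com` and `sections` to the two subdirectories to crawl.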
A more detailed list of options may be obtained by running `python search.py --help`. `example.py` is an example of what we currently run. The run time for the program is around 30 minutes.
Please see CONTRIBUTING.md for more information about contributing to this project.
Please check out our Slack if you are already a part of the project, or contact @danlrobertson if you have any questions.