This project consists of two modules: webcrawler, which crawls data from news websites, and noah, which crawls data from Twitter.
Prerequisites: Scala, sbt, AsterixDB, TextGeoLocator
Follow the official AsterixDB documentation to set up a fully functional cluster.
Webcrawler is an application that uses the Webhose.io API to crawl data from news websites, geotag it, and ingest it into AsterixDB.
Parameters description:
- -tk or --api-key: Webhose.io API key
- -kw or --keywords: Keywords to search for in the news
- -co or --country-code: Thread country code
- -ds or --days-ago: Crawl since the given number of days ago, default 1
- -tglurl or --textgeolocatorurl: URL of the TextGeoLocator API, default: "http://localhost:9000/location"
- -u or --url: URL of the feed adapter
- -p or --port: Port of the feed socket
- -w or --wait: Waiting milliseconds per record, default 500
- -b or --batch: Batch size per waiting period, default 50
- -c or --count: Maximum number to feed, default unlimited
- -fo or --file-only: Only store in a file, do not geotag or ingest, default false
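To illustrate what -ds means in practice, here is a minimal sketch of turning a "days ago" value into a cutoff timestamp. It assumes the cutoff is expressed in epoch milliseconds, which may not match what the crawler actually sends to Webhose.io. The names DaysAgo and cutoffMillis are hypothetical and are not part of this project.

```scala
import java.time.{Duration, Instant}

// Hypothetical helper, not taken from the webcrawler source.
object DaysAgo {
  /** Returns the epoch-millisecond timestamp that lies `daysAgo` days before `now`. */
  def cutoffMillis(daysAgo: Int, now: Instant = Instant.now()): Long = {
    require(daysAgo >= 0, "daysAgo must be non-negative")
    now.minus(Duration.ofDays(daysAgo.toLong)).toEpochMilli
  }
}
```

With the default of -ds 1, DaysAgo.cutoffMillis(1) would give the timestamp of exactly 24 hours before the current instant.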
You can run the following example command in a separate command line window:
>cd crawler
>sbt "project webcrawler" "run-main Crawler \
>-tk "Your Webhose.io token" \
>-kw "dengue", "zika", "zikavirus", "microcefalia", "febreamarela", "chikungunya" \
>-co "BR" \
>-ds 1 \
>-tglurl "http://localhost:9000/location" \
>-u 127.0.0.1 \
>-p 10010 \
>-w 0 \
>-b 50"
Noah is a module that continuously crawls new tweets that mention the specified keywords, geotags them, and ingests them into AsterixDB.
Parameters description:
- -ck or --consumer-key: Consumer key for Twitter API OAuth
- -cs or --consumer-secret: Consumer secret for Twitter API OAuth
- -tk or --token: Token for Twitter API OAuth
- -ts or --token-secret: Token secret for Twitter API OAuth
- -tr or --tracker: Tracked terms
- -u or --url: URL of the feed adapter
- -p or --port: Port of the feed socket
- -w or --wait: Waiting milliseconds per record, default 500
- -b or --batch: Batch size per waiting period, default 50
- -c or --count: Maximum number to feed, default unlimited
- -fo or --file-only: Only store in a file, do not geotag or ingest, default false
You can run the following example command in a separate command line window:
> cd crawler
> sbt "project noah" "run-main edu.uci.ics.cloudberry.noah.feed.TwitterFeedStreamDriver \
> -ck Your consumer key \
> -cs Your consumer secret \
> -tk Your token \
> -ts Your token secret \
> -tr dengue zikavirus microcefalia febreamarela chikungunya \
> -u 127.0.0.1 \
> -p 10001 \
> -w 0 \
> -b 50"
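Both modules share the -u, -p, -w, -b and -c options. The sketch below shows one way these options could interact when feeding records to a socket. It assumes that the feeder pauses for the -w interval after every -b records and that records are newline-delimited UTF-8 text. The actual behavior and framing in the modules may differ. ThrottledFeed, feed and feedToSocket are hypothetical names, not the project's API.

```scala
import java.io.{BufferedOutputStream, OutputStream}
import java.net.Socket
import java.nio.charset.StandardCharsets

// Hypothetical sketch, not the project's feed client.
object ThrottledFeed {
  /**
   * Writes each record as one UTF-8 line to `out`.
   * After every `batchSize` records it flushes and sleeps `waitMillis` (assumed semantics of -b and -w).
   * Stops after `maxCount` records when a limit is given (assumed semantics of -c).
   * Returns the number of records written.
   */
  def feed(records: Iterator[String],
           out: OutputStream,
           waitMillis: Long = 500L,
           batchSize: Int = 50,
           maxCount: Option[Int] = None): Int = {
    require(batchSize > 0, "batchSize must be positive")
    val limited = maxCount.fold(records)(n => records.take(n))
    var sent = 0
    for (record <- limited) {
      out.write((record + "\n").getBytes(StandardCharsets.UTF_8))
      sent += 1
      if (sent % batchSize == 0) {
        out.flush()
        if (waitMillis > 0) Thread.sleep(waitMillis)
      }
    }
    out.flush() // push out any partial final batch
    sent
  }

  /** Opens a TCP connection to host:port (as in -u and -p) and feeds the records through it. */
  def feedToSocket(host: String,
                   port: Int,
                   records: Iterator[String],
                   waitMillis: Long = 500L,
                   batchSize: Int = 50,
                   maxCount: Option[Int] = None): Int = {
    val socket = new Socket(host, port)
    try feed(records, new BufferedOutputStream(socket.getOutputStream), waitMillis, batchSize, maxCount)
    finally socket.close()
  }
}
```

Under these assumptions, the options -u 127.0.0.1 -p 10001 -w 0 -b 50 from the command above would correspond to ThrottledFeed.feedToSocket("127.0.0.1", 10001, records, waitMillis = 0L, batchSize = 50), where records is any Iterator[String] of serialized records.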
- The Noah and Gnosis modules were adapted from TwitterMap.
- Currently, geotagging works only for Brazil.
- Users and developers are welcome to contact me through moniquemoledo@id.uff.br