Backend
OK, we are ready to go at this point. Start the Twitter capture using the Streaming API with the following command:
java -jar fom.jar --captureStream --filterGeoTagged
You never know what is going to happen with Twitter: they randomly push a lot of content that your stream may be unable to handle, which turns the connection into a zombie process on your server. To avoid this, I have set up a cron job to restart the capture; please refer to the page Cron jobs.
This is an example of what I usually run; the parameters should be pretty straightforward to understand:
java -jar fom.jar --clusterAnalysis --geoGran 40 --minFollCount 150 --nOfTopics 4 --nOfWords 3 --disableLangMaps --day 2011-07-22
In this example:
- execute HAC with geo-granularity set to a 40 km radius (enough to cover a European city and its neighborhood)
- use tweets created by users with at least 150 followers. this should remove a bit of noise
- execute LDA extracting 4 topics, each containing 3 keywords
- disable clustering of language when creating the LDA models
- use a daily time window, and extract topics just from 2011-07-22 (other possible windows are: range of days or hours)
java -jar fom.jar --clusterAnalysis
- [--sourceName <source name>]: for future use; select posts from the database with a specific source field (defaults to twitter)
- {--rangeStartDay YYYY-MM-DD --rangeEndDay YYYY-MM-DD, --day YYYY-MM-DD, --hour YYYY-MM-DD-HH}: self-explanatory, time windows
- {--geoGran {poi, neighborhood, city, <custom km radius>}}: self-explanatory, HAC parameters
- [--considerApproxGeolocations]: this initially seemed a clever idea: matching the location name specified in the Twitter profile against a GeoNames database (indexed with Apache Lucene and stored under /java/fom/data). Unfortunately, most people on Twitter have complete nonsense as their location name.
- [--minRTcount <n>]: the ideal world is for rich people: since we usually use location boxes to maximize output from a spritzer account, we will not have this information
- [--minFollCount <n>]: instead of using RT, it is useful to filter on number of followers
- [--nOfTopics <n>] [--nOfWords <n>]: self-explanatory, LDA params
- [--nOfKeywords <n>]: meta keywords and opengraph content of related links
- [--disableLangDetection]: add/remove stop words depending on language
- [--excludeRelLinksText]: add/remove related links
- [--disableLangMaps]: enable/disable single buckets of languages when clustering
- [--consoleLog] [--csvLog] [--folderLog] [--rpcLog]
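As a sketch of the other two time windows, the commands below reuse the parameters of the daily example and only swap the window option. The dates and the hour are illustrative values, and it is an assumption that the remaining flags combine with --rangeStartDay/--rangeEndDay and --hour the same way they do with --day.

```shell
# Range of days: topics from 2011-07-20 through 2011-07-22 (illustrative dates),
# using the predefined "city" geo-granularity instead of a custom km radius.
java -jar fom.jar --clusterAnalysis --geoGran city --minFollCount 150 \
  --nOfTopics 4 --nOfWords 3 --rangeStartDay 2011-07-20 --rangeEndDay 2011-07-22

# Single hour: topics from 2011-07-22, hour 18 (format YYYY-MM-DD-HH).
java -jar fom.jar --clusterAnalysis --geoGran 40 --minFollCount 150 \
  --nOfTopics 4 --nOfWords 3 --hour 2011-07-22-18
```

If a combination is rejected, java -jar fom.jar --help remains the authoritative list of accepted options.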
General help is printed by typing: java -jar fom.jar --help