grudelsud edited this page Sep 25, 2011 · 1 revision

Fetch data

ok we are ready to go at this point: start the twitter capture using the streaming API with the following command:

java -jar fom.jar --captureStream --filterGeoTagged

you never know what is going to happen with twitter: every now and then they push more content than your stream can handle, and the connection is left hanging as a zombie process on your server. to avoid this, I have set up a cron job that restarts the capture, please refer to the page Cron jobs
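a minimal sketch of what such a restart script could look like. this is not the script described on the Cron jobs page: the install path /opt/fom, the log file name, the hourly schedule and the use of pkill are all assumptions of this example, only the java command is taken from above.

```shell
#!/bin/sh
# restart_capture.sh: hypothetical helper, meant to be called from cron, e.g.
#   0 * * * * /opt/fom/restart_capture.sh --run
# FOM_DIR and capture.log are assumptions, adjust them to your setup.
FOM_DIR="${FOM_DIR:-/opt/fom}"

# print the capture command documented above
capture_cmd() {
    echo "java -jar fom.jar --captureStream --filterGeoTagged"
}

restart_capture() {
    # stop a capture that may be hanging; pkill exits non-zero when nothing matched.
    # the [f] keeps the pattern from matching a shell that carries it in its own command line
    pkill -f "[f]om.jar --captureStream" || true
    sleep 2
    cd "$FOM_DIR" || return 1
    # start a fresh capture, detached from the cron shell
    nohup $(capture_cmd) >> capture.log 2>&1 &
}

# without --run the script only defines the functions and does nothing
if [ "${1:-}" = "--run" ]; then
    restart_capture
fi
```

running the script without --run is harmless, which makes it easy to check the path and the command before adding the line to the crontab.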

Clustering

this is an example of what I usually run, the parameters should be pretty straightforward to understand:

java -jar fom.jar --clusterAnalysis --geoGran 40 --minFollCount 150 --nOfTopics 4 --nOfWords 3 --disableLangMaps --day 2011-07-22

in this example:

  • execute HAC with geo-granularity set to 40km radius (it is enough to cover a european city and its neighborhood)
  • use tweets created by users with at least 150 followers. this should remove a bit of noise
  • execute LDA extracting 4 topics, each containing 3 keywords
  • disable clustering of language when creating the LDA models
  • use a daily time window, and extract topics just from 2011-07-22 (other possible windows are: range of days or hours)
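if you need the same analysis for several days, the example above can be wrapped in a small loop. this is only a sketch: the script name and the date check are made up for this example, the parameters are the ones from the command above.

```shell
#!/bin/sh
# cluster_days.sh: hypothetical wrapper, runs the example analysis once per day
# usage: ./cluster_days.sh 2011-07-20 2011-07-21 2011-07-22

# succeed only if the argument looks like YYYY-MM-DD (shape check, not a calendar check)
is_day() {
    case "$1" in
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

# run the clustering for one day, same parameters as the example above
run_day() {
    java -jar fom.jar --clusterAnalysis --geoGran 40 --minFollCount 150 \
        --nOfTopics 4 --nOfWords 3 --disableLangMaps --day "$1"
}

for day in "$@"; do
    if is_day "$day"; then
        run_day "$day"
    else
        echo "skipping '$day': expected YYYY-MM-DD" >&2
    fi
done
```

the days run one after the other, so a failure on one day does not stop the following ones.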

complete list of parameters

java -jar fom.jar --clusterAnalysis

  • [--sourceName <source name>] for future use, select posts from database with specific source field (defaults to twitter)
  • {--rangeStartDay YYYY-MM-DD --rangeEndDay YYYY-MM-DD, --day YYYY-MM-DD, --hour YYYY-MM-DD-HH} self-explanatory, time windows: a range of days, a day, or an hour
  • {--geoGran {poi, neighborhood, city, <custom km radius>}} self-explanatory, HAC parameters
  • [--considerApproxGeolocations] this initially seemed a clever idea: matching the location name specified in the twitter profile against a geonames database (indexed with Apache Lucene and stored under /java/fom/data). unfortunately most people on twitter put complete nonsense in their location field.
  • [--minRTcount <n>] filter on retweet count. the ideal world is for rich people: since we usually use location boxes to maximize output from a spritzer account, we will not have this information
  • [--minFollCount <n>] instead of using retweets, it is useful to filter on the number of followers
  • [--nOfTopics <n>] [--nOfWords <n>] self-explanatory, LDA parameters
  • [--nOfKeywords <n>] meta keywords and opengraph content of related links
  • [--disableLangDetection] [--excludeRelLinksText] [--disableLangMaps] add/remove stop words depending on language, add/remove related links, enable/disable single buckets of languages when clustering
  • [--consoleLog] [--csvLog] [--folderLog] [--rpcLog]
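the three time windows can be confusing at first, so here is a small sketch that picks the right flags from the shape of its arguments. window_args is a made-up helper, not part of fom.jar, and it assumes that exactly one of the three windows is given per run and that the hour is written as two digits.

```shell
# print the time-window flags for 1 argument (day or hour) or 2 arguments (range of days)
window_args() {
    if [ $# -eq 2 ]; then
        echo "--rangeStartDay $1 --rangeEndDay $2"
        return 0
    fi
    case "${1:-}" in
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]-[0-9][0-9]) echo "--hour $1" ;;
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9])            echo "--day $1" ;;
        *) echo "usage: window_args DAY | HOUR | START_DAY END_DAY" >&2; return 1 ;;
    esac
}

# example (dates are arbitrary): one hour at city granularity
# java -jar fom.jar --clusterAnalysis --geoGran city $(window_args 2011-07-22-18)
```

the helper only checks the shape of the dates, it does not verify that they exist in the calendar or in the database.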

Help

general help is printed by typing: java -jar fom.jar --help
