Backend

Fetch data

At this point we are ready to go. Start the Twitter capture through the Streaming API with the following command:

java -jar fom.jar --captureStream --filterGeoTagged

You never know what is going to happen with Twitter: from time to time they push more content than your stream can handle, which leaves the connection hanging as a zombie process on your server. To avoid this, I have set up a cron job that restarts the capture; please refer to the Cron jobs page.
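A restart script along these lines could be what cron calls. This is only a sketch and not the actual job described on the Cron jobs page: the script name, the install directory (FOM_DIR), the log file name and the hourly schedule are all assumptions, and it also assumes the capture is the only process whose command line contains "fom.jar --captureStream".

```shell
#!/bin/sh
# restart-capture.sh: hypothetical helper, not shipped with fom.

# Assumed install directory; override by exporting FOM_DIR.
FOM_DIR="${FOM_DIR:-$HOME/fom}"
# Pattern used to find a running (possibly hung) capture.
CAPTURE_PATTERN="fom.jar --captureStream"

capture_command() {
    # The capture command from this page, as a single string.
    echo "java -jar fom.jar --captureStream --filterGeoTagged"
}

restart_capture() {
    # Kill the old capture; ignore the error if none is running.
    pkill -f "$CAPTURE_PATTERN" || true
    cd "$FOM_DIR" || return 1
    # Start a fresh capture in the background, appending output to a log.
    nohup $(capture_command) >> capture.log 2>&1 &
}

# Only act when called with --run, so the file can be sourced safely.
if [ "${1:-}" = "--run" ]; then
    restart_capture
fi

# Example crontab entry (hourly restart, the interval is a guess):
# 0 * * * * /path/to/restart-capture.sh --run
```

Killing by pattern is blunt but it also catches a process that is still alive and no longer receiving anything, which is the case that matters here.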

Clustering

This is an example of what I usually run; the parameters should be pretty straightforward:

java -jar fom.jar --clusterAnalysis --geoGran 40 --minFollCount 150 --nOfTopics 4 --nOfWords 3 --disableLangMaps --day 2011-07-22

In this example:

execute HAC (hierarchical agglomerative clustering) with geo-granularity set to a 40 km radius (enough to cover a European city and its surroundings)
use only tweets created by users with at least 150 followers; this should remove a bit of noise
execute LDA, extracting 4 topics, each containing 3 keywords
disable clustering by language when creating the LDA models
use a daily time window and extract topics only from 2011-07-22 (other possible windows are a range of days or a single hour)
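The other time windows use the --rangeStartDay / --rangeEndDay and --hour flags from the parameter list. Here is a small sketch that keeps the same analysis settings and only swaps the window. The fom_cluster wrapper and the DRY_RUN switch are my own additions for illustration, and the dates and the hour are placeholders:

```shell
#!/bin/sh
# Sketch: same analysis settings, different time windows.
# DRY_RUN=1 (the default here) only prints the commands;
# set DRY_RUN=0 to actually run them.

# Settings shared with the example above.
COMMON="--geoGran 40 --minFollCount 150 --nOfTopics 4 --nOfWords 3 --disableLangMaps"

fom_cluster() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "java -jar fom.jar --clusterAnalysis $COMMON $*"
    else
        # $COMMON is left unquoted on purpose so it splits into separate flags.
        java -jar fom.jar --clusterAnalysis $COMMON "$@"
    fi
}

# Range of days (placeholder dates).
fom_cluster --rangeStartDay 2011-07-20 --rangeEndDay 2011-07-22

# Single hour, in YYYY-MM-DD-HH form (placeholder value).
fom_cluster --hour 2011-07-22-18
```

Printing first is handy for checking the flags before launching a long run.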

Complete list of parameters

java -jar fom.jar --clusterAnalysis

[--sourceName <source name>] for future use: select posts from the database with a specific source field (defaults to twitter)
{--rangeStartDay YYYY-MM-DD --rangeEndDay YYYY-MM-DD, --day YYYY-MM-DD, --hour YYYY-MM-DD-HH} self-explanatory, time windows: a range of days, a single day or a single hour
{--geoGran {poi, neighborhood, city, <custom km radius>}} self-explanatory, HAC parameter
[--considerApproxGeolocations] this initially looked like a clever idea: matching the location name given in the Twitter profile against a GeoNames database (indexed with Apache Lucene and stored under /java/fom/data). Unfortunately, most people on Twitter put complete nonsense in their location field.
[--minRTcount <n>] minimum retweet count. The ideal world is for rich people: since we usually use location boxes to maximize the output of a Spritzer account, we will not have this information
[--minFollCount <n>] instead of using retweets, it is useful to filter on the number of followers
[--nOfTopics <n>] [--nOfWords <n>] self-explanatory, LDA parameters
[--nOfKeywords <n>] number of keywords taken from the meta keywords and Open Graph content of related links
[--disableLangDetection] turn off language detection (stop words are added/removed depending on the language)
[--excludeRelLinksText] leave out the text of related links
[--disableLangMaps] turn off the single per-language buckets when clustering
[--consoleLog] [--csvLog] [--folderLog] [--rpcLog]

Help

General help is printed by typing: java -jar fom.jar --help
