Getting started
The quickest way to get started is to run a few examples.
The simplest tag cloud can be created using the library provided by the 'tagcloud' project. Check out the project and build it using the Maven build file provided. The project has no dependencies and should build on any machine with JDK 1.6 installed. Alternatively, import the project into an IDE such as Eclipse, which compiles the classes automatically.
Edit the output path where the tag cloud image file is to be created, then run Simple Tag Cloud.
Sift also supports creating tag clouds from input files containing free-form text with some structure. One example is an input file of user reviews for products, where each line has this structure:
<Product ID><tab><User review contents>
For example:
productid_123 this is a good product. it has some good features
productid_234 this is a lousy product.
productid_123 this is a decent product. have been mostly happy using it
productid_123 I bought this good product from xyz company. the good product suffers from the service of xyz company which sucks!
Sift uses the Trooper Batch runtime profile to read and process this input file. Trooper builds are distributed via the Clojars community-maintained repository for open source libraries.
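As an aside on the input format described above, the following is a minimal Java sketch for sanity-checking an input file before handing it to a batch job. It is not part of Sift or Trooper: the class name `ReviewLineParser` and the method `groupByProduct` are hypothetical, and the sketch assumes Java 8 or later and that the product ID and review text are separated by a single tab character.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Sift: groups review text by product ID.
class ReviewLineParser {

    /** Groups review texts by product ID; lines without a usable tab separator are skipped. */
    public static Map<String, List<String>> groupByProduct(List<String> lines) {
        Map<String, List<String>> reviews = new LinkedHashMap<>();
        for (String line : lines) {
            int tab = line.indexOf('\t');
            if (tab <= 0) {
                continue; // no tab, or empty product ID: not in the expected format
            }
            String productId = line.substring(0, tab);
            String review = line.substring(tab + 1).trim();
            reviews.computeIfAbsent(productId, k -> new ArrayList<>()).add(review);
        }
        return reviews;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "productid_123\tthis is a good product. it has some good features",
            "productid_234\tthis is a lousy product.",
            "productid_123\tthis is a decent product. have been mostly happy using it");
        Map<String, List<String>> grouped = groupByProduct(lines);
        for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue().size() + " review(s)");
        }
    }
}
```

A check like this makes it easy to spot lines where the tab separator was lost, for example when the file was edited in a tool that converts tabs to spaces.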
Add the following repository to your Maven settings or POM build file to access release builds:
<repository>
<id>clojars</id>
<name>Clojars repository</name>
<url>https://clojars.org/repo</url>
</repository>
A Trooper batch job therefore reads the source file and uses the Sift 'runtime' and 'tagcloud' libraries to generate the tag cloud(s). A sample batch job demonstrating this is available in the Sift 'batch' project. The steps below explain how to run this example:
- Check out or clone the Sift project. Sift provides a master build file that builds all Sift modules. It is available as: Sift Master Build.
- Build all Sift modules using Maven and the Sift master build POM file:
cd /workspace/Sift
mvn clean install -DskipTests
- The path to the input file containing the text to be processed and the tag cloud output directory are configured in: Sample Job Configuration. Edit the following bean snippets in this file to point to the relevant paths on the local machine:
>8 --snip--
<bean id="inputFileResource1" class="org.springframework.core.io.FileSystemResource">
<constructor-arg value="/Users/regunath.balasubramanian/Documents/workspace/experiments/Samples/scripts/nikon.txt"/>
</bean>
<bean id="imageFileWriter" class="org.sift.tagcloud.impl.service.ImageFilePersistenceService">
<property name="tagCloudsDirectory" value="/Users/regunath.balasubramanian/Documents/junk/tagclouds" />
</bean>
<bean id="marshallerFileWriter" class="org.sift.batch.tag.service.TagCloudMarshallerService">
<property name="tagCloudsDirectory" value="/Users/regunath.balasubramanian/Documents/junk/tagclouds" />
<property name="marshaller">
<bean class="org.trpr.platform.integration.impl.json.JSONTranscoderImpl" />
</property>
</bean>
>8 --snip--
- Execute the batch job by running:
cd /workspace/Sift/batch
java -cp "./target/classes:./target/test-classes:./target/lib/*" org.trpr.platform.batch.client.StandAloneBatchClient ./src/test/resources/external/bootstrap.xml tagCloudJob
This will create image and JSON files representing the tag clouds in the specified tag cloud output directory. Note that you might have to tune the 'nGram' and 'wordWeights' properties to influence the term frequencies, i.e. the tags generated in the final output, as shown in this snippet:
>8 --snip--
<bean class="org.sift.runtime.impl.WordSplitterProcessor">
<property name="nGram" value="3" />
<property name="stopWords">
<bean class="org.sift.winnow.StopWords" />
</property>
</bean>
<bean class="org.sift.batch.test.TagIdentifierProcessor">
<property name="wordWeights">
<map> <!-- the word weights have to be tweaked as per requirements -->
<entry key="1" value="1" />
<entry key="2" value="2" />
<entry key="3" value="3" />
</map>
</property>
<property name="sourceBoosts">
<map>
<entry key="nikon.txt" value="1" />
</map>
</property>
</bean>
>8 --snip--
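To build some intuition for how 'nGram' and 'wordWeights' might interact, here is a minimal, self-contained Java sketch. It does not use Sift's classes and is not Sift's implementation: the class `NGramWeightSketch` and its methods are hypothetical. It assumes that 'nGram' is the maximum number of words per candidate tag, that the 'wordWeights' keys are n-gram lengths mapped to a per-occurrence weight, and that stop words are dropped before n-grams are formed. It also assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration, not Sift code: weighted n-gram counting.
class NGramWeightSketch {

    /** Returns all word n-grams of length 1..maxN after removing stop words. */
    public static List<String> nGrams(String text, int maxN, Set<String> stopWords) {
        List<String> words = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                words.add(token);
            }
        }
        List<String> grams = new ArrayList<>();
        for (int n = 1; n <= maxN; n++) {
            for (int start = 0; start + n <= words.size(); start++) {
                grams.add(String.join(" ", words.subList(start, start + n)));
            }
        }
        return grams;
    }

    /** Adds up a weight per occurrence; the weight is looked up by the n-gram's word count. */
    public static Map<String, Integer> score(List<String> grams, Map<Integer, Integer> wordWeights) {
        Map<String, Integer> scores = new LinkedHashMap<>();
        for (String gram : grams) {
            int length = gram.split(" ").length;
            int weight = wordWeights.getOrDefault(length, 1); // assumed default of 1
            scores.merge(gram, weight, Integer::sum);
        }
        return scores;
    }

    public static void main(String[] args) {
        // Illustrative stop words only; Sift ships its own StopWords class.
        Set<String> stopWords = new HashSet<>(Arrays.asList("this", "is", "a", "it", "has", "some"));
        Map<Integer, Integer> wordWeights = new LinkedHashMap<>();
        wordWeights.put(1, 1);
        wordWeights.put(2, 2);
        wordWeights.put(3, 3);
        List<String> grams = nGrams("this is a good product. it has some good features", 3, stopWords);
        System.out.println(score(grams, wordWeights));
    }
}
```

Under these assumptions, raising the weight for longer n-grams favours multi-word tags such as "good product" over single words, while lowering 'nGram' limits how long a candidate tag can be.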