dongzhaoan edited this page Jun 4, 2017 · 4 revisions

To execute the algorithm, you need to point to files that contain the data that should be processed. At a minimum, you need to pass the categoriesfile and the inputfile with the labels assigned by the workers to the different objects. See below for a more detailed explanation.

categoriesfile

A text file that contains the list of categories used to annotate the objects, one category per line. The categoriesfile can optionally be used to define the prior values for the different categories, instead of letting the priors be defined by the data. In that case, it becomes a tab-separated file and each line has the form category<tab>prior where:

  • category is the name of the category
  • prior is a number from 0 to 1, indicating the prior probability of observing this category in the data (this is optional)
  • <tab> is the tab character
For example, for an adult website detection task with two categories (porn and notporn), where we want the two priors to be equal, the file will be:
 porn	0.5
 notporn	0.5

If we do not want to specify the priors but rather let them be estimated from the data (in a maximum likelihood manner) then the file will be:

 porn
 notporn
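
As a rough illustration of the two layouts above, the following Python sketch reads either form of the categoriesfile from a string. The function name parse_categories is hypothetical and not part of the tool; the range check simply mirrors the stated 0-to-1 constraint on priors.

```python
def parse_categories(text):
    """Return {category: prior or None} from the contents of a categoriesfile."""
    priors = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        name = fields[0]
        # The prior column is optional; None means "estimate from the data".
        prior = float(fields[1]) if len(fields) > 1 else None
        if prior is not None and not 0.0 <= prior <= 1.0:
            raise ValueError(f"prior for {name!r} must be in [0, 1], got {prior}")
        priors[name] = prior
    return priors


with_priors = parse_categories("porn\t0.5\nnotporn\t0.5\n")
without_priors = parse_categories("porn\nnotporn\n")
```

Here with_priors maps each category to 0.5, while without_priors maps each category to None.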

inputfile

A plain tab-separated text file containing the labels assigned by the workers. Each line has the form workerid<tab>objectid<tab>assigned_label where:

  • workerid is the id of the worker that assigned the label
  • objectid is the id of the object that has been labeled
  • assigned_label is the label assigned by workerid to objectid
  • <tab> is the tab character
For example, if five workers label websites as porn or notporn, the file with the labels assigned by the workers will look similar to this:
 worker1	http://sunnyfun.com	porn
 worker1	http://sex-mission.com	porn
 worker1	http://google.com	porn
 worker1	http://youporn.com	porn
 worker1	http://yahoo.com	porn
 worker2	http://sunnyfun.com	notporn
 worker2	http://sex-mission.com	porn
 worker2	http://google.com	notporn
 worker2	http://youporn.com	porn
 worker2	http://yahoo.com	porn
 worker3	http://sunnyfun.com	notporn
 worker3	http://sex-mission.com	porn
 worker3	http://google.com	notporn
 worker3	http://youporn.com	porn
 worker3	http://yahoo.com	notporn
 worker4	http://sunnyfun.com	notporn
 worker4	http://sex-mission.com	porn
 worker4	http://google.com	notporn
 worker4	http://youporn.com	porn
 worker4	http://yahoo.com	notporn
 worker5	http://sunnyfun.com	porn
 worker5	http://sex-mission.com	notporn
 worker5	http://google.com	porn
 worker5	http://youporn.com	notporn
 worker5	http://yahoo.com	porn
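
A minimal sketch of reading this format, assuming the file contents are already in a string. The names parse_labels and votes_per_object are hypothetical helpers; the per-object vote tally is only a sanity check on the data, not the EM-based estimation that the tool performs.

```python
from collections import Counter, defaultdict


def parse_labels(text):
    """Return a list of (workerid, objectid, assigned_label) tuples."""
    rows = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.strip().split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"line {lineno}: expected 3 tab-separated fields, got {len(fields)}"
            )
        rows.append(tuple(fields))
    return rows


def votes_per_object(rows):
    """Count how many times each label was assigned to each object."""
    votes = defaultdict(Counter)
    for _worker, objectid, label in rows:
        votes[objectid][label] += 1
    return votes


# Three rows taken from the example above.
sample = (
    "worker1\thttp://google.com\tporn\n"
    "worker2\thttp://google.com\tnotporn\n"
    "worker3\thttp://google.com\tnotporn\n"
)
rows = parse_labels(sample)
votes = votes_per_object(rows)
```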

goldfile

A plain tab-separated text file containing the correct labels for some of the objects in the data. Having some pre-labeled data helps in better estimating the quality of the workers and, in turn, the correct labels for the objects for which we do not have the correct label. Each line has the form objectid<tab>gold_label where:

  • objectid is the id of the object that has been labeled
  • gold_label is the correct label for objectid
  • <tab> is the tab character
For example, if we checked the first two URLs from the inputfile example above and have an authoritative judgement about their category, the file could look like this:
 http://sex-mission.com	porn
 http://sunnyfun.com	notporn

Note that it is not necessary to have any data in this file.
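
One simple way to see why gold labels help: with them, each worker's agreement with the known answers can be measured directly. The sketch below uses hypothetical helper names and the rows for three of the workers from the inputfile example. It is an illustration only, not the estimation procedure the tool uses.

```python
def parse_gold(text):
    """Return {objectid: gold_label} from the contents of a goldfile."""
    gold = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # an empty goldfile is allowed
        objectid, label = line.strip().split("\t")
        gold[objectid] = label
    return gold


def accuracy_on_gold(labels, gold):
    """Fraction of each worker's labels on gold objects that match the gold label."""
    hits, seen = {}, {}
    for worker, objectid, label in labels:
        if objectid not in gold:
            continue  # no known answer for this object
        seen[worker] = seen.get(worker, 0) + 1
        hits[worker] = hits.get(worker, 0) + (label == gold[objectid])
    return {worker: hits[worker] / seen[worker] for worker in seen}


gold = parse_gold("http://sex-mission.com\tporn\nhttp://sunnyfun.com\tnotporn\n")
labels = [
    ("worker1", "http://sunnyfun.com", "porn"),
    ("worker1", "http://sex-mission.com", "porn"),
    ("worker2", "http://sunnyfun.com", "notporn"),
    ("worker2", "http://sex-mission.com", "porn"),
    ("worker5", "http://sunnyfun.com", "porn"),
    ("worker5", "http://sex-mission.com", "notporn"),
]
accuracy = accuracy_on_gold(labels, gold)
```

On these rows, worker1 matches one of the two gold labels, worker2 matches both, and worker5 matches neither.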

costfile

A plain tab-separated text file containing the costs of classifying an object that belongs to a category A into a category B. Each line has the form correct_class<tab>classified_class<tab>classification_cost where:

  • correct_class is the correct class of the object
  • classified_class is the class in which the object is classified
  • classification_cost is the cost of classifying an object from correct_class into classified_class
  • <tab> is the tab character
If we want to assign equal costs to the misclassification decisions, then the file could look like this:
 porn	porn	0
 notporn	notporn	0
 porn	notporn	1
 notporn	porn	1

If we want to assign unequal costs to the misclassification decisions, we can do that by changing the costs. For example, if it is much worse to classify a porn page as notporn, we can assign a cost of 5 to that error and keep the cost of misclassifying a notporn page as porn at 1. Then the file could look like this:

 porn	porn	0
 notporn	notporn	0
 porn	notporn	5
 notporn	porn	1
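
To illustrate how unequal costs can change a decision, the sketch below parses the second cost table and picks the label with the lowest expected cost for one object. The helper names and the probabilities (0.3 for porn, 0.7 for notporn) are made up for illustration, and whether the tool applies the costs in exactly this way is an assumption.

```python
def parse_costs(text):
    """Return {(correct_class, classified_class): cost} from a costfile."""
    costs = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        correct, classified, cost = line.strip().split("\t")
        costs[(correct, classified)] = float(cost)
    return costs


def expected_cost(costs, probabilities, decision):
    """Expected cost of choosing `decision`, given P(true class) for one object."""
    return sum(p * costs[(true_class, decision)]
               for true_class, p in probabilities.items())


def min_cost_label(costs, probabilities):
    """Label whose expected misclassification cost is lowest."""
    return min(probabilities,
               key=lambda label: expected_cost(costs, probabilities, label))


costs = parse_costs(
    "porn\tporn\t0\n"
    "notporn\tnotporn\t0\n"
    "porn\tnotporn\t5\n"
    "notporn\tporn\t1\n"
)
# Hypothetical class probabilities for a single object.
probabilities = {"porn": 0.3, "notporn": 0.7}
cost_if_porn = expected_cost(costs, probabilities, "porn")        # 0.7 * 1
cost_if_notporn = expected_cost(costs, probabilities, "notporn")  # 0.3 * 5
best = min_cost_label(costs, probabilities)
```

With these numbers the cheaper decision is porn (expected cost 0.7 versus 1.5), even though notporn is the more probable class.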

evaluation

A plain tab-separated text file containing the correct labels for (some of) the objects in the data. These evaluation labels, unlike the labels in the goldfile, are not passed to the algorithm during execution. Instead, they are used only to evaluate the quality of the estimates of worker quality and of the object labels. Each line has the form objectid<tab>correct_label where:

  • objectid is the id of the object that has been labeled
  • correct_label is the correct label for objectid
  • <tab> is the tab character
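
As a hedged sketch of what such an evaluation could look like, the code below compares a set of estimated labels against evaluation labels and reports accuracy plus a confusion count. The function name evaluate and both dictionaries are invented for illustration; the tool's own evaluation report may be computed differently.

```python
from collections import Counter


def evaluate(predicted, correct):
    """Compare predicted labels with evaluation labels.

    Returns (accuracy, confusion), where confusion counts
    (correct_label, predicted_label) pairs over objects present in both.
    """
    confusion = Counter()
    for objectid, correct_label in correct.items():
        if objectid in predicted:
            confusion[(correct_label, predicted[objectid])] += 1
    total = sum(confusion.values())
    hits = sum(n for (c, p), n in confusion.items() if c == p)
    accuracy = hits / total if total else float("nan")
    return accuracy, confusion


# Hypothetical estimated labels and evaluation labels.
predicted = {
    "http://google.com": "notporn",
    "http://youporn.com": "porn",
    "http://yahoo.com": "porn",
}
correct = {
    "http://google.com": "notporn",
    "http://youporn.com": "porn",
    "http://yahoo.com": "notporn",
}
accuracy, confusion = evaluate(predicted, correct)
```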

iterations

An integer indicating the maximum number of iterations of the EM algorithm. Usually a small number of iterations is enough to estimate the error rate of each worker. The default value is 50. Typically, there is no need to change this variable.

epsilon

A decimal number giving the convergence threshold: the EM algorithm stops iterating once the difference in log-likelihood between two consecutive iterations falls below this value. The default value is 10E-6 (i.e., 0.00001). Typically, there is no need to change this variable.
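
The way iterations and epsilon interact can be sketched as a generic stopping loop. This is an assumption about the control flow, not the tool's actual code: run_until_converged and step are hypothetical names, and step stands in for one EM iteration that returns the current log-likelihood.

```python
def run_until_converged(step, max_iterations=50, epsilon=1e-5):
    """Call step() until the log-likelihood changes by less than epsilon,
    or until max_iterations calls have been made.

    Returns (iterations_run, last_log_likelihood).
    """
    previous = None
    for iteration in range(1, max_iterations + 1):
        current = step()
        if previous is not None and abs(current - previous) < epsilon:
            return iteration, current  # converged
        previous = current
    return max_iterations, previous  # iteration cap reached


# Toy stand-in for EM: a fixed sequence of log-likelihood values.
values = iter([-10.0, -5.0, -4.0, -4.0, -4.0])
result = run_until_converged(lambda: next(values))

# With a cap of 2 iterations the loop stops before converging.
capped_values = iter([-10.0, -5.0, -4.0])
capped = run_until_converged(lambda: next(capped_values), max_iterations=2)
```

In the toy run the loop stops at the fourth call, the first time two consecutive values differ by less than epsilon; in the capped run it stops after two calls regardless.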
