Input Files
To run the algorithm, point it to the files that contain the data to process. At a minimum, you need to pass the categoriesfile, which lists the categories, and the inputfile, which contains the labels that the workers assigned to the objects. Each file is described in more detail below.
The categoriesfile is a text file that contains the list of categories used to annotate the objects, one category per line. It can optionally define the prior values for the categories, instead of letting the priors be estimated from the data. In that case, it becomes a tab-separated file and each line has the form category<tab>prior where:
- category is the name of the category
- prior is a number from 0 to 1, indicating the prior probability of observing this category in the data (this is optional)
- <tab> is the tab character
porn	0.5
notporn	0.5
If we do not want to specify the priors but rather let them be estimated from the data (in a maximum likelihood manner) then the file will be:
porn
notporn
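The two forms of the categoriesfile can be read with the same routine. The following is a minimal sketch, not the tool's own parser: the function name `parse_categories` is hypothetical, and representing a missing prior as `None` is an assumption made for illustration.

```python
def parse_categories(text):
    """Parse categoriesfile content into a {category: prior or None} dict.

    A prior of None means "no prior given; estimate it from the data".
    """
    priors = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        fields = line.split("\t")
        name = fields[0]
        prior = float(fields[1]) if len(fields) > 1 else None
        if prior is not None and not 0.0 <= prior <= 1.0:
            raise ValueError(f"prior for {name!r} must be between 0 and 1")
        priors[name] = prior
    return priors


# Form with explicit priors, and form without priors.
with_priors = parse_categories("porn\t0.5\nnotporn\t0.5\n")
without_priors = parse_categories("porn\nnotporn\n")
print(with_priors)
print(without_priors)
```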
The inputfile is a plain tab-separated text file containing the labels assigned by the workers. Each line has the form workerid<tab>objectid<tab>assigned_label where:
- workerid is the id of the worker that assigned the label
- objectid is the id of the object that has been labeled
- assigned_label is the label assigned by workerid to objectid
- <tab> is the tab character
worker1	http://sunnyfun.com	porn
worker1	http://sex-mission.com	porn
worker1	http://google.com	porn
worker1	http://youporn.com	porn
worker1	http://yahoo.com	porn
worker2	http://sunnyfun.com	notporn
worker2	http://sex-mission.com	porn
worker2	http://google.com	notporn
worker2	http://youporn.com	porn
worker2	http://yahoo.com	porn
worker3	http://sunnyfun.com	notporn
worker3	http://sex-mission.com	porn
worker3	http://google.com	notporn
worker3	http://youporn.com	porn
worker3	http://yahoo.com	notporn
worker4	http://sunnyfun.com	notporn
worker4	http://sex-mission.com	porn
worker4	http://google.com	notporn
worker4	http://youporn.com	porn
worker4	http://yahoo.com	notporn
worker5	http://sunnyfun.com	porn
worker5	http://sex-mission.com	notporn
worker5	http://google.com	porn
worker5	http://youporn.com	notporn
worker5	http://yahoo.com	porn
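Before running the algorithm, it can help to confirm that the worker labels parse as three tab-separated fields. The sketch below does that and tallies the labels per object. The names `parse_labels` and `votes_per_object` are hypothetical, and the vote tally is only a sanity check on the input, not the EM estimation that the algorithm performs.

```python
from collections import Counter, defaultdict


def parse_labels(text):
    """Return a list of (workerid, objectid, assigned_label) tuples."""
    labels = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"line {number}: expected 3 fields, got {len(fields)}")
        labels.append(tuple(fields))
    return labels


def votes_per_object(labels):
    """Count how many times each label was assigned to each object."""
    votes = defaultdict(Counter)
    for _worker, object_id, label in labels:
        votes[object_id][label] += 1
    return votes


# Three rows taken from the example above.
sample = (
    "worker1\thttp://google.com\tporn\n"
    "worker2\thttp://google.com\tnotporn\n"
    "worker3\thttp://google.com\tnotporn\n"
)
votes = votes_per_object(parse_labels(sample))
print(dict(votes["http://google.com"]))
```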
The goldfile is a plain tab-separated text file containing the correct labels for some of the objects in the data. Pre-labeled data helps the algorithm estimate the quality of the workers more accurately and, in turn, the correct labels for the objects that have no gold label. Each line has the form objectid<tab>gold_label where:
- objectid is the id of the object that has been labeled
- gold_label is the correct label for objectid
- <tab> is the tab character
http://sex-mission.com	porn
http://sunnyfun.com	notporn
Note that it is not necessary to have any data in this file.
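A reader for the gold labels should therefore accept an empty file. The following sketch assumes a hypothetical helper `parse_gold`; the optional check against the known categories is an illustrative addition, not documented behavior of the tool.

```python
def parse_gold(text, categories=None):
    """Parse gold-label content into an {objectid: gold_label} dict.

    An empty input yields an empty dict, since gold data is optional.
    """
    gold = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        object_id, label = line.split("\t")
        if categories is not None and label not in categories:
            raise ValueError(f"unknown category {label!r} for {object_id}")
        gold[object_id] = label
    return gold


empty_gold = parse_gold("")
example_gold = parse_gold(
    "http://sex-mission.com\tporn\nhttp://sunnyfun.com\tnotporn\n",
    categories={"porn", "notporn"},
)
print(empty_gold, example_gold)
```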
A plain tab-separated text file containing the costs of classifying an object that belongs to a category A into a category B. Each line has the form correct_class<tab>classified_class<tab>classification_cost where:
- correct_class is the correct class of the object
- classified_class is the class in which the object is classified
- classification_cost is the cost of classifying an object from `correct_class` into `classified_class`
- <tab> is the tab character
porn	porn	0
notporn	notporn	0
porn	notporn	1
notporn	porn	1
To assign unequal costs to the misclassification decisions, change the corresponding costs. For example, if it is much worse to classify a porn page as notporn, we can assign a cost of 5 to that error and keep the cost of misclassifying a notporn page as porn at 1. The file would then look like this:
porn	porn	0
notporn	notporn	0
porn	notporn	5
notporn	porn	1
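To see the effect of unequal costs, consider one common way of using such a matrix: pick the class with the lowest expected cost under a probability estimate for the object. The sketch below is an illustration under assumptions. The function names and the `belief` values are hypothetical, and whether the tool applies the costs in exactly this way is not stated here.

```python
def parse_costs(text):
    """Parse cost-file content into a {(correct, classified): cost} dict."""
    costs = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        correct, classified, cost = line.split("\t")
        costs[(correct, classified)] = float(cost)
    return costs


def expected_cost(probabilities, decision, costs):
    """Expected cost of choosing `decision` given class probabilities."""
    return sum(
        p * costs[(true_class, decision)]
        for true_class, p in probabilities.items()
    )


def min_cost_decision(probabilities, costs):
    """Return the class whose expected cost is lowest."""
    return min(probabilities, key=lambda d: expected_cost(probabilities, d, costs))


costs = parse_costs(
    "porn\tporn\t0\nnotporn\tnotporn\t0\nporn\tnotporn\t5\nnotporn\tporn\t1\n"
)
# Hypothetical estimate: the object is probably notporn.
belief = {"porn": 0.25, "notporn": 0.75}
print(expected_cost(belief, "notporn", costs))  # 0.25 * 5 + 0.75 * 0 = 1.25
print(expected_cost(belief, "porn", costs))     # 0.25 * 0 + 0.75 * 1 = 0.75
print(min_cost_decision(belief, costs))         # porn
```

With the cost of 5, the cheaper decision is porn even though notporn is the more probable class; with equal costs of 1, the same estimate would favor notporn.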
A plain tab-separated text file containing the correct labels for (some of) the objects in the data. These evaluation labels, unlike the labels in the goldfile, do not get passed to the algorithm during execution. Instead, they are used only for evaluating the quality of the estimations about the worker quality and about the object labels. Each line has the form objectid<tab>correct_label where:
- objectid is the id of the object that has been labeled
- correct_label is the correct label for objectid
- <tab> is the tab character
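As a rough illustration of how evaluation labels can be used after a run, the sketch below scores estimated object labels against them. The `accuracy` helper and all label values are hypothetical, and accuracy is only one possible measure; the metrics the tool itself reports are not described here.

```python
def accuracy(predicted, evaluation):
    """Fraction of evaluation objects whose predicted label is correct.

    Objects without a prediction are skipped; returns None if none overlap.
    """
    scored = [obj for obj in evaluation if obj in predicted]
    if not scored:
        return None
    correct = sum(predicted[obj] == evaluation[obj] for obj in scored)
    return correct / len(scored)


# Hypothetical estimated labels and evaluation labels.
predicted = {
    "http://google.com": "notporn",
    "http://youporn.com": "porn",
    "http://yahoo.com": "porn",
}
evaluation = {
    "http://google.com": "notporn",
    "http://yahoo.com": "notporn",
}
score = accuracy(predicted, evaluation)
print(score)
```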
An integer giving the maximum number of iterations of the EM algorithm. Usually a small number of iterations is enough to estimate the error rate of each worker. The default value is 50. Typically, there is no need to change this value.
A decimal number giving the convergence threshold: the EM algorithm stops iterating once the change in log-likelihood between iterations falls below this value. The default value is 10E-6 (i.e., 0.00001). Typically, there is no need to change this value.
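The two stopping parameters interact as in the following sketch: iteration ends when the log-likelihood changes by less than the threshold, or when the iteration cap is reached, whichever comes first. The function `run_until_converged`, the use of an absolute difference with a strict comparison, and the toy log-likelihood sequence are all assumptions for illustration, not the tool's actual implementation.

```python
def run_until_converged(step, max_iterations=50, epsilon=1e-5):
    """Call step() until the log-likelihood stabilizes or the cap is hit.

    step() performs one EM iteration and returns the new log-likelihood.
    Returns (iterations_used, last_log_likelihood).
    """
    previous = None
    for iteration in range(1, max_iterations + 1):
        current = step()
        if previous is not None and abs(current - previous) < epsilon:
            return iteration, current
        previous = current
    return max_iterations, previous


def make_toy_step():
    """Toy stand-in for an EM step: log-likelihoods -1/2, -1/4, -1/8, ..."""
    values = iter(-(0.5 ** k) for k in range(1, 1000))
    return lambda: next(values)


# Converges well before the default cap of 50 iterations.
iterations_used, final_value = run_until_converged(make_toy_step())
print(iterations_used, final_value)

# With a low cap, the loop stops at the cap instead.
capped_iterations, capped_value = run_until_converged(make_toy_step(), max_iterations=5)
print(capped_iterations, capped_value)
```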