A simple app that leverages an SQLite database and synonym searches to locate
semantic pair candidates.
It supports reading verified sentence pairs from an existing .csv file.
Inverse sentence pairs are not stored in the DB; s1-s2 is considered to be
equivalent to s2-s1. Same-sentence pairs are also excluded, since they
require no evaluation. The initial number of unrated pairs is equal to:
item_count! / (2!(item_count - 2)!)
For example, a corpus consisting of 3875 unique sentences should result in
3875! / (2 x 3873!) = (3875 x 3874) / 2 = 7505875
unrated pairs.
If 250 rows are present in the verified sentence file, the number of unrated
pairs after the call to the initialize() function should be 7505625.
Implement the generation of two .csv files:
- a list of automatically rated sentence pairs that can then be manually verified
- a list of all the sentence pairs in the corpus, which should include both
same-sentence pairs (
s1-s1) and inverse pairs (s1-s2ands2-s1)