Programming assignment for COMP3009J Information Retrieval.
A basic information retrieval system that is capable of performing preprocessing, indexing, retrieval (using BM25) and evaluation.
The project can be divided into three modules: Indexing, Querying and Evaluating.
- For running the indexing programs (index_small_corpus.py and index_large_corpus.py):

      ./index_small_corpus.py -p /path/to/comp3009j-corpus-small
      # or
      ./index_large_corpus.py -p /path/to/comp3009j-corpus-large

- For running the querying programs (query_small_corpus.py and query_large_corpus.py):
  - In interactive mode:

        ./query_small_corpus.py -m interactive -p /path/to/comp3009j-corpus-small
        # or
        ./query_large_corpus.py -m interactive -p /path/to/comp3009j-corpus-large

  - In automatic mode:

        ./query_small_corpus.py -m automatic -p /path/to/comp3009j-corpus-small
        # or
        ./query_large_corpus.py -m automatic -p /path/to/comp3009j-corpus-large

- For running the evaluation programs (evaluate_small_corpus.py and evaluate_large_corpus.py):

      ./evaluate_small_corpus.py -p /path/to/comp3009j-corpus-small
      # or
      ./evaluate_large_corpus.py -p /path/to/comp3009j-corpus-large

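The -m and -p options used in the commands above could be parsed along the following lines. This is only a minimal sketch using the standard argparse module, not the project's actual code: the long option names (--mode, --path), the help texts and the function name build_parser are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser for the -m (mode) and -p (corpus path) options."""
    parser = argparse.ArgumentParser(description="Query a COMP3009J corpus.")
    # Only the short forms -m and -p appear in the README; the long
    # names are assumptions added for readability.
    parser.add_argument("-m", "--mode", choices=["interactive", "automatic"],
                        required=True, help="how queries are supplied")
    parser.add_argument("-p", "--path", required=True,
                        help="path to the corpus directory")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["-m", "automatic", "-p", "/path/to/comp3009j-corpus-small"])
print(args.mode, args.path)
```

The indexing and evaluation programs take only -p, so for them the mode argument would simply be left out.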
The indexing module produces a file named <ucd_id>-small.index or <ucd_id>-large.index, in which the index is organized as JSON in the following form:
    {
        "term": {
            "file-id": "BM25 weight",
            ...
        },
        ...
    }

The querying module searches through the files under the directory ../documents/. It behaves slightly differently in each mode.
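Because the index maps each term to per-document BM25 weights, answering a query can be reduced to summing the stored weights of the query terms. The sketch below illustrates that idea under several assumptions: the README does not state which BM25 variant or parameters the project uses, so the smoothed IDF, k1 = 1.0 and b = 0.75 are illustrative choices, and the names build_index, search and top_n are hypothetical. Documents and queries are assumed to be already preprocessed into token lists.

```python
import math
from collections import Counter, defaultdict


def build_index(docs, k1=1.0, b=0.75):
    """Return {term: {doc_id: bm25_weight}} from {doc_id: [tokens]}.

    Assumes at least one non-empty document; k1 and b are assumed values.
    """
    n_docs = len(docs)
    avg_len = sum(len(toks) for toks in docs.values()) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in docs.values() for term in set(toks))
    index = defaultdict(dict)
    for doc_id, toks in docs.items():
        for term, tf in Counter(toks).items():
            # Smoothed IDF (an assumed variant that is never negative).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            index[term][doc_id] = idf * tf * (k1 + 1) / norm
    return dict(index)


def search(index, query_terms, top_n=15):
    """Sum stored weights per document and return the best top_n pairs."""
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] += weight
    # Sort by descending score, breaking ties by document ID.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


docs = {
    "d1": ["cat", "sat", "mat"],
    "d2": ["dog", "sat", "log"],
    "d3": ["cat", "cat", "dog"],
}
index = build_index(docs)
results = search(index, ["cat"])
for rank, (doc_id, score) in enumerate(results, start=1):
    print(rank, doc_id, round(score, 4))
```

Precomputing the weights at indexing time keeps the query step to dictionary lookups and additions, which is presumably why the index stores weights rather than raw term frequencies.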
In interactive mode, the program takes your query and prints the results immediately. The format of the results is as follows:
| Query ID | <always 0 in the interactive mode> | Document ID | Relevance Judgment |
In automatic mode, the program runs every query in ../queries.txt against the documents and saves the results in a file named <ucd_id>-small.results or <ucd_id>-large.results. The results are in a format similar to that of interactive mode:
| Query ID | Document ID | Rank of the Document | Similarity Score |
The evaluation module evaluates the results produced by the querying module. It uses the following evaluation metrics:
- Precision
- Recall
- R-Precision
- Precision at K
- Mean Average Precision (MAP)
- Normalized Discounted Cumulative Gain (NDCG)
- Binary Preference (for large corpus only)
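Most of the metrics listed above can be computed from a ranked list of document IDs and a set of relevant document IDs for each query. The following is a rough sketch rather than the project's implementation: all function names are hypothetical, the default cut-off k = 10 is an assumption, and the NDCG variant shown (gain divided by log2(rank + 1)) is only one common formulation. MAP is the mean of average_precision over all queries; binary preference is not shown because it also needs the judged non-relevant documents.

```python
import math


def precision(ranked, relevant):
    """Fraction of returned documents that are relevant."""
    return sum(d in relevant for d in ranked) / len(ranked) if ranked else 0.0


def recall(ranked, relevant):
    """Fraction of relevant documents that were returned."""
    return sum(d in relevant for d in ranked) / len(relevant) if relevant else 0.0


def precision_at_k(ranked, relevant, k=10):
    """Precision over the first k results (k must be positive)."""
    return sum(d in relevant for d in ranked[:k]) / k


def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    return precision_at_k(ranked, relevant, r) if r else 0.0


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank holding a relevant doc."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def ndcg_at_k(ranked, gains, k=10):
    """NDCG at k; gains maps doc_id to a graded relevance (0 if absent)."""
    def dcg(values):
        return sum(g / math.log2(i + 1) for i, g in enumerate(values, start=1))
    actual = dcg([gains.get(d, 0) for d in ranked[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal else 0.0


ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d5"}
print(precision(ranked, relevant), recall(ranked, relevant))
print(r_precision(ranked, relevant), average_precision(ranked, relevant))
```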