dat640_group_project

Prerequisites

First, download the dataset from https://gustav1.ux.uis.no/dat640/msmarco-passage.tar.gz. The dataset is large, so it is not included in the repository.
Then install the required pip packages:

pip install -r requirements.txt --index-url https://download.pytorch.org/whl/cu118
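
The download step can also be scripted. The following is a minimal sketch, not part of the repository: the helper names `download` and `extract` are hypothetical, and the destination folder is an assumption, since the README does not say where the scripts expect the extracted files.

```python
import tarfile
import urllib.request
from pathlib import Path

DATASET_URL = "https://gustav1.ux.uis.no/dat640/msmarco-passage.tar.gz"


def download(url: str, target: Path) -> Path:
    """Download url to target unless the file already exists (hypothetical helper)."""
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, target)
    return target


def extract(archive: Path, dest: Path) -> list[str]:
    """Extract a .tar.gz archive into dest and return the member names."""
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names
```

For example, `extract(download(DATASET_URL, Path("data/msmarco-passage.tar.gz")), Path("data"))` would fetch the archive once and unpack it into `data/`, assuming that is the folder the indexing scripts read from.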

Indexing

There are two indexing scripts: one for the baseline and one with document expansion.
One of them must be run first. It creates a Terrier index for the dataset.

To run the baseline indexing

python indexingBaseline.py

To run the document expansion indexing

python indexingExpando.py

Baseline Retrieval

baseline_retrival.py must be run after the baseline index is created. It uses the Terrier index and BM25 to retrieve the documents.

python baseline_retrival.py
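
To illustrate what BM25 ranking does, here is a minimal, self-contained sketch of the textbook Okapi BM25 formula. It is not the code in baseline_retrival.py and not Terrier's exact implementation; the function name `bm25_scores` and the parameter values `k1=1.2`, `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in docs against a tokenized query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Documents that do not contain any query term score 0, and repeated occurrences of a query term raise the score with diminishing returns.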

Expando Mono Duo

mono_duo_retrival.py must be run after the baseline index is created. It uses the Terrier index and Expando Mono Duo to retrieve the documents.

python mono_duo_retrival.py
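
The sketch below shows the general shape of a mono/duo reranking stage, assuming the usual two-step layout: a pointwise (mono) scorer ranks every candidate on its own, then a pairwise (duo) scorer reorders only the top few. It is not the code in mono_duo_retrival.py. The callables `mono_score` and `duo_prob` are hypothetical stand-ins for the actual models, and `top_k` is an illustrative parameter.

```python
from itertools import permutations


def mono_duo_rerank(query, candidates, mono_score, duo_prob, top_k=3):
    """Pointwise rerank all candidates, then pairwise rerank the top_k.

    candidates must be hashable (for example document ids or strings).
    mono_score(query, doc) returns a relevance score for one document.
    duo_prob(query, a, b) returns how strongly a is preferred over b.
    """
    # Stage 1 (mono): score every candidate independently.
    mono_ranked = sorted(candidates, key=lambda d: mono_score(query, d), reverse=True)
    head, tail = mono_ranked[:top_k], mono_ranked[top_k:]
    # Stage 2 (duo): sum the pairwise preferences each document wins.
    totals = {d: 0.0 for d in head}
    for a, b in permutations(head, 2):
        totals[a] += duo_prob(query, a, b)
    # sorted() is stable, so ties keep their mono order.
    duo_ranked = sorted(head, key=lambda d: totals[d], reverse=True)
    return duo_ranked + tail
```

Restricting the pairwise stage to `top_k` documents keeps the number of comparisons small, since it grows quadratically with the number of documents compared.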

Expando Mono Duo with T5 Query Rewrites

final_retrival.py must be run after the document expansion index is created. It uses the Terrier index and Expando Mono Duo with query rewrites to retrieve the documents.

python final_retrival.py

Scores

To enable trec_eval, clone this repository: https://github.com/usnistgov/trec_eval
Then cd into the trec_eval folder and run make.
Once you are back in the root directory, you can score an output file with this command:

./trec_eval/trec_eval -c -m recall.1000 -m map -m recip_rank -m ndcg_cut.3 -l2 -M1000 data/qrels_train.txt {file}

This reports Recall@1000, NDCG@3, MAP, and MRR. Replace {file} with the file to be scored:

{./data/bm25score.txt} for the baseline results
{./data/monoduo.txt} for the Expando Mono Duo results
{./data/finalretrival.txt} for the Expando Mono Duo with T5 Query Rewrites results
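
For intuition about two of the reported metrics, the following is a minimal sketch of reciprocal rank and recall at a cutoff, averaged over all judged queries. It is an illustration only and not a replacement for trec_eval. The function names and the in-memory `run` and `qrels` dictionaries are assumptions, and graded relevance levels are ignored.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1 / rank of the first relevant document, or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k=1000):
    """Return the fraction of relevant documents found in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def mean_over_queries(metric, run, qrels):
    """Average metric over every query in qrels; missing queries count as empty runs."""
    values = [metric(run.get(qid, []), relevant) for qid, relevant in qrels.items()]
    return sum(values) / len(values)
```

Here `run` maps a query id to its ranked list of document ids and `qrels` maps a query id to the set of relevant document ids. `mean_over_queries(reciprocal_rank, run, qrels)` then gives the mean reciprocal rank.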

The file top3.py formats the output of the retrieval scripts into the format required by Kaggle.
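
As a rough idea of this kind of post-processing, the sketch below keeps the three highest-scoring documents per query. It is not the code in top3.py. It assumes the retrieval output uses the standard six-column TREC run format (query id, Q0, document id, rank, score, tag), and it returns a plain dictionary because the exact Kaggle format is not described here. The function name `top_k_per_query` is hypothetical.

```python
from collections import defaultdict


def top_k_per_query(run_lines, k=3):
    """Map each query id to its k highest-scoring document ids."""
    by_query = defaultdict(list)
    for line in run_lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        qid, _q0, doc_id, _rank, score, _tag = parts
        by_query[qid].append((float(score), doc_id))
    return {
        qid: [doc for _score, doc in sorted(scored, key=lambda p: p[0], reverse=True)[:k]]
        for qid, scored in by_query.items()
    }
```

The result can then be written out in whatever layout the submission requires.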
