Description
The WhooshIndex seems to cap out at about 6 million PDFs, so we need a new, more scalable solution for keyword search. This solution should scale to 100+ million pages, live on disk, and have acceptable latency (<0.5 sec). There are a few out-of-the-box solutions we could use:
- SQLite FTS5
- LanceDB
- Elasticsearch (open-source version)
- PostgreSQL full-text search
This is roughly the order I would try them in. The first two run as "embedded databases," meaning they operate as a normal library within a Python program. The latter two run as "server utilities," meaning we would need to boot them up separately (i.e., outside the Python program) and then interact with them through their Python client. The embedded options can be a bit easier to manage, so I'd recommend we start there.
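For a sense of what the embedded route looks like, here's a minimal SQLite FTS5 sketch (the database path and the `doc_id`/`body` column names are placeholders, not anything from our codebase):

```python
import sqlite3

# The index is just a file on disk; no server process to manage.
conn = sqlite3.connect("keyword_index.db")
conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS pages USING fts5(doc_id, body)")

# Index a couple of pages (doc_id/body are placeholder names).
conn.executemany(
    "INSERT INTO pages (doc_id, body) VALUES (?, ?)",
    [("pdf-001:p1", "full text of page one ..."),
     ("pdf-001:p2", "full text of page two ...")],
)
conn.commit()

# Keyword query, best matches first via FTS5's built-in BM25 ranking.
rows = conn.execute(
    "SELECT doc_id FROM pages WHERE pages MATCH ? ORDER BY rank LIMIT 10",
    ("search terms",),
).fetchall()
print(rows)
```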
In the code base, I think there are 4 files that would be touched by this PR:
- "indexing.py": This file has an abstract class called
AbstractKeywordIndex, and the bulk of this PR would be about implementing a sub-class that uses the new index type. - "scripts/generate_index_keywords.py": This file manages downloading text files from s3 and adding them to the keyword index. You'll just need to add a command-line argument that chooses which kind of index to use.
- "scripts/index_keywords.sh": This file just calls the previous one, so you'll have to update it to use the new index type argument.
- "pyproject.toml": You shouldn't edit this directly, but you'll want to call "poetry add XXX" to add any new dependencies that you need. In the case of sqlite, you shouldn't need to add any because it comes by default with Python and we already use it, but ymmv.