Description
I'm using Elasticsearch 7.11.1 and Python 3.7.13.
In the "Build QA engine" section, when I enter the following query:
Enter your query here: what does covid-19 cause
It outputs an error:
WARNING:allennlp.data.fields.sequence_label_field:Your label namespace was 'pos'. We recommend you use a namespace ending with 'labels' or 'tags', so we don't add UNK and PAD tokens by default to your vocabulary.
See documentation for `non_padded_namespaces` parameter in Vocabulary.
INFO:elasticsearch:GET http://localhost:9200/ [status:200 request:0.520s]
INFO:elasticsearch:POST http://localhost:9200/elastic_index/_search [status:200 request:0.353s]
The number of datapacks(including query) is 1
Traceback (most recent call last):
File "./examples/pipeline/inference/search_cord19.py", line 97, in <module>
data_pack = next(nlp.process_dataset()).get_pack_at(1)
File "/home/ubuntu/.pyenv/versions/3.7.13/lib/python3.7/site-packages/forte/data/multi_pack.py", line 491, in get_pack_at
return self.packs[index]
IndexError: list index out of range
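For context, `get_pack_at(1)` fails because the MultiPack only contains the query pack (at index 0) and no retrieved document packs, so indexing the second slot of a one-element list raises exactly this error. A minimal illustration in plain Python (not Forte itself):

```python
# The MultiPack holds only the query pack, so there is nothing at index 1.
packs = ["query_pack"]  # one pack: the query itself, no retrieved documents

try:
    result = packs[1]  # analogous to multi_pack.get_pack_at(1)
except IndexError as err:
    print(f"IndexError: {err}")  # list index out of range
```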
It seems the datasets aren't being read at all, even though I indexed the sample datasets provided in the previous step with
python examples/pipeline/indexer/cordindexer.py --data-dir ./data/document_parses/sample_pdf_json
That command finished very quickly and only produced the output below, so it doesn't look like it indexed any data:
WARNING:root:Re-declared a new class named [ConstituentNode], which is probably used in import.
INFO:elasticsearch:GET http://localhost:9200/ [status:200 request:0.008s]
/home/ubuntu/.pyenv/versions/3.7.13/lib/python3.7/site-packages/elasticsearch/connection/base.py:200: ElasticsearchWarning: [types removal] Specifying types in bulk requests is deprecated.
warnings.warn(message, category=ElasticsearchWarning)
INFO:elasticsearch:POST http://localhost:9200/_bulk?refresh=true [status:200 request:0.338s]
and that directory contains three dataset files:
- 55736408816d3f956d830854659f24109444a36c.json
- aadc3e716b6cb0e898953dff056124378b31483c.json
- ffff73d17bc392ee68f3f16ef37d25579cb99322.json
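For what it's worth, these files follow the CORD-19 pdf_json layout, where the title sits under `metadata` and the text/section pairs under `body_text`. A quick sketch of how I'd pull those fields out (the `sample` dict below is an abbreviated stand-in for one of the real files, which have the same top-level layout):

```python
def extract_fields(doc):
    """Pull title, text, and sections out of a CORD-19 pdf_json document."""
    title = doc.get("metadata", {}).get("title", "")
    text = " ".join(p["text"] for p in doc.get("body_text", []))
    sections = [p["section"] for p in doc.get("body_text", [])]
    return title, text, sections

# Abbreviated stand-in for one of the sample files above.
sample = {
    "paper_id": "55736408816d3f956d830854659f24109444a36c",
    "metadata": {"title": "Example paper title"},
    "body_text": [
        {"text": "First paragraph.", "section": "Introduction"},
        {"text": "Second paragraph.", "section": "Methods"},
    ],
}

title, text, sections = extract_fields(sample)
```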
I also noticed that the indexer's config.yml lists the fields doc_id and content (https://github.com/petuum/composing_information_system/blob/main/examples/pipeline/indexer/config.yml#L3). However, the dataset files above don't contain those fields at all; most of the content is in the title, text, and section fields. But even after updating config.yml to the following, I get the same outcome:
create_index:
  batch_size: 10000
  fields:
    # - doc_id
    # - content
    - title
    - text
    - section
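In case it helps narrow things down, this is how I'd expect each file to be flattened into index documents under the updated field list (a sketch matching my config above, not the actual cordindexer code):

```python
def to_index_docs(doc):
    """Flatten a pdf_json-style dict into one index document per body_text
    passage, using the title/text/section fields from the config above.
    (A sketch of what I expect the indexer to produce, not its real code.)"""
    title = doc.get("metadata", {}).get("title", "")
    return [
        {"title": title, "text": p["text"], "section": p["section"]}
        for p in doc.get("body_text", [])
    ]

# Abbreviated stand-in for one of the sample files.
sample = {
    "metadata": {"title": "Example"},
    "body_text": [{"text": "Some text.", "section": "Results"}],
}
index_docs = to_index_docs(sample)
```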