ProcessDocumentHeader() in WikipediaDumpProcessor should use analyzer. #5

Currently, ProcessDocumentHeader() does not run the document title through the Lucene analyzer. This causes problems with terms that contain colons. For example, in the file AA\wiki_83, document 11327 (https://en.wikipedia.org/?curid=11327) has the title "Wikipedia:Free On-line Dictionary of Computing/symbols - B". Because this title is not passed through the Lucene tokenizer, the colon survives and the term "Wikipedia:Free" ends up in the Document Frequency Table. When the Document Frequency Table is later used as a source of test queries, parsing the query "Wikipedia:Free" fails because the parser treats "Wikipedia" as a stream name prefix.
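To make the failure mode concrete, here is a minimal, self-contained sketch. It does not use Lucene or the real WikipediaDumpProcessor code. `TitleTermsSketch`, `whitespaceTerms`, and `analyzedTerms` are hypothetical names, and `analyzedTerms` is only a toy stand-in for an analyzer (lowercase, split on anything that is not a letter or digit). It assumes the unanalyzed title is split on whitespace alone; the actual Lucene analyzer may tokenize differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TitleTermsSketch {
    // Assumption: mimics a title that bypasses the analyzer and is split
    // on whitespace only, so punctuation such as ':' stays inside terms.
    static List<String> whitespaceTerms(String title) {
        List<String> terms = new ArrayList<>();
        for (String piece : title.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                terms.add(piece);
            }
        }
        return terms;
    }

    // Hypothetical stand-in for an analyzer: lowercases the title and splits
    // on every run of characters that are not letters or digits, so no term
    // can contain a colon.
    static List<String> analyzedTerms(String title) {
        List<String> terms = new ArrayList<>();
        String lowered = title.toLowerCase(Locale.ROOT);
        for (String piece : lowered.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                terms.add(piece);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String title = "Wikipedia:Free On-line Dictionary of Computing/symbols - B";
        System.out.println(whitespaceTerms(title));
        System.out.println(analyzedTerms(title));
    }
}
```

With the title from document 11327, the first list contains the term "Wikipedia:Free", while the second contains "wikipedia" and "free" as separate terms, which is the kind of output that would no longer be misread as a stream name prefix.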

