ProcessDocumentHeader() in WikipediaDumpProcessor should use analyzer. #5

@MikeHopcroft

Currently ProcessDocumentHeader() does not use the Lucene analyzer for the document title. This leads to problems with terms that contain colons. As an example, in the file AA\wiki_83, document 11327 https://en.wikipedia.org/?curid=11327 has the title "Wikipedia:Free On-line Dictionary of Computing/symbols - B". Since this title is not passed through the Lucene tokenizer, the colon makes it through and we end up with the term "Wikipedia:Free" in the Document Frequency Table. When we use the Document Frequency Table as a source of test queries, we try to parse the query "Wikipedia:Free" and fail because the parser thinks that "Wikipedia" is a stream name prefix.
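The failure mode described above can be sketched in a few lines. This is only an illustration, not the WikipediaDumpProcessor code: `naive_title_terms`, `analyzed_title_terms`, and `has_stream_prefix` are hypothetical helpers, and the regex tokenizer is a rough stand-in for a Lucene analyzer, whose real output may differ (for example in how it handles stop words or hyphens).

```python
import re


def naive_title_terms(title):
    """Split on whitespace only, roughly what happens when the
    title bypasses the analyzer (assumption for illustration)."""
    return title.split()


def analyzed_title_terms(title):
    """Hypothetical stand-in for an analyzer: keep runs of ASCII
    letters and digits, lowercase them, and drop all punctuation."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", title)]


def has_stream_prefix(term):
    """Hypothetical check mirroring a query parser that reads
    'name:term' as a stream name prefix followed by a term."""
    return ":" in term


title = "Wikipedia:Free On-line Dictionary of Computing/symbols - B"

naive = naive_title_terms(title)
analyzed = analyzed_title_terms(title)

# Terms the query parser would misread as carrying a stream prefix.
problem_terms = [t for t in naive if has_stream_prefix(t)]

print(naive)
print(analyzed)
print(problem_terms)
```

With the whitespace-only split, "Wikipedia:Free" survives as a single term and would reach the Document Frequency Table; with the analyzer-like split, no term contains a colon, so none of them can be mistaken for a stream-prefixed query.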
