Corpus#removeWords is not working properly with unicode characters #8

@namirsab

Observed

If your documents contain a word like zurück and the list of words to remove is ['zur'],
this step removes the zur inside the word, turning zurück into ück.
This happens because the function relies on word boundaries (\b), which in JavaScript regular expressions treat only ASCII letters, digits, and _ as word characters, so the position between r and ü counts as a boundary.
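
A minimal reproduction sketch, assuming the removal is built roughly as `\b` + word + `\b`. The function name `removeWordsAscii` is hypothetical and is not the library's actual source; it only illustrates the boundary behaviour described above.

```javascript
// Hypothetical stand-in for the removal step (NOT the library's real code).
// Assumption: each word is wrapped in \b ... \b and replaced globally.
function removeWordsAscii(text, words) {
  let result = text;
  for (const word of words) {
    // \b is defined in terms of \w, which only covers [A-Za-z0-9_],
    // so a non-ASCII letter such as "ü" is treated as a non-word character.
    const pattern = new RegExp('\\b' + word + '\\b', 'g');
    result = result.replace(pattern, '');
  }
  return result;
}

// 'zur\u00fcck' is "zurück" with a precomposed "ü".
const asciiResult = removeWordsAscii('zur\u00fcck', ['zur']);
console.log(asciiResult); // prints "ück"
```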

Expected

  • The function uses a Unicode-compatible regexp.
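
One possible way to get there, sketched under assumptions: replace `\b` with lookarounds built from Unicode property escapes and the `u` flag. The names `removeWordsUnicode`, `escapeRegExp`, and `WORD_CHAR` are hypothetical, and this needs a runtime that supports lookbehind and `\p{…}` (recent Node.js and browsers). It is not a patch against the library itself.

```javascript
// Escape regex syntax characters so a stop word is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical Unicode-aware variant of the removal step.
function removeWordsUnicode(text, words) {
  // Letters, combining marks, digits, and underscore count as word characters.
  const WORD_CHAR = '[\\p{L}\\p{M}\\p{N}_]';
  let result = text;
  for (const word of words) {
    // Match the word only when it is not preceded or followed by a word character.
    const pattern = new RegExp(
      '(?<!' + WORD_CHAR + ')' + escapeRegExp(word) + '(?!' + WORD_CHAR + ')',
      'gu'
    );
    result = result.replace(pattern, '');
  }
  return result;
}

// "zurück" is left intact; a standalone "zur" is still removed.
console.log(removeWordsUnicode('zur\u00fcck', ['zur']));
console.log(removeWordsUnicode('zur Arbeit', ['zur']));
```

Including `\p{M}` in the word-character class is a design choice so that letters written with separate combining marks are also treated as part of the word.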