pyzet grep two-way support for searching for non-ASCII characters #34

@tpwo

Description

Alphabets that use non-ASCII characters are annoying to grep for, and there should be a way to enable convenient search patterns that deal with this problem.

Problem statement

  • A user used a non-ASCII character in the zettel content and would like to find it using only ASCII characters

  • A user used an ASCII character in the zettel content (for any reason: laziness, a mistake, copied text) and would like to find it when searching for its non-ASCII counterpart as well

Example

E.g. for Polish we have:

ą -- a
ć -- c
ę -- e
ł -- l
ń -- n
ó -- o
ś -- s
ź -- z
ż -- z

Of course, capital letters should be supported as well.
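
As a rough illustration, the table above could be stored as a plain Python dict, with the capital letters derived from it rather than typed by hand. The names `POLISH_TO_ASCII`, `with_capitals` and `ascii_to_non_ascii` are made up for this sketch and are not part of pyzet.

```python
# The table from the issue, written as non-ASCII -> ASCII (lowercase only).
POLISH_TO_ASCII = {
    "ą": "a", "ć": "c", "ę": "e", "ł": "l", "ń": "n",
    "ó": "o", "ś": "s", "ź": "z", "ż": "z",
}


def with_capitals(mapping):
    """Return a copy of `mapping` extended with the upper-case pairs."""
    full = dict(mapping)
    for non_ascii, ascii_char in mapping.items():
        full[non_ascii.upper()] = ascii_char.upper()
    return full


def ascii_to_non_ascii(mapping):
    """Invert the table; one ASCII letter may have several counterparts."""
    inverse = {}
    for non_ascii, ascii_char in mapping.items():
        inverse.setdefault(ascii_char, []).append(non_ascii)
    return inverse
```

The inverted table maps `z` to both `ź` and `ż`, which is the direction needed when the user types only ASCII.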

Behaviors

  • grepping for zolta ges should find żółta gęś

  • grepping for żółta gęś should find zolta ges -- (use case: we want to find text copied from someone who hasn't used diacritics)

  • probably controlled with a special flag or even multiple flags (i.e. there could be different modes: a single two-way mode or two one-way modes)
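
The behaviors above could be sketched roughly as follows, assuming the query is expanded character by character into a regular expression. The `Mode` names and the `expand` function are hypothetical, and the result is only checked against Python's `re` module here; whether the same pattern can be handed to git grep unchanged would still need to be verified.

```python
import re
from enum import Enum


class Mode(Enum):
    # Hypothetical mode names; the issue does not fix any flag names.
    TWO_WAY = "two-way"
    ASCII_TO_NON_ASCII = "ascii-to-non-ascii"  # zolta ges finds żółta gęś
    NON_ASCII_TO_ASCII = "non-ascii-to-ascii"  # żółta gęś finds zolta ges


TO_ASCII = {
    "ą": "a", "ć": "c", "ę": "e", "ł": "l", "ń": "n",
    "ó": "o", "ś": "s", "ź": "z", "ż": "z",
}


def expand(query, mode=Mode.TWO_WAY):
    """Turn a literal query into a regex with one character class per letter."""
    parts = []
    for ch in query:
        options = [ch]
        if ch in TO_ASCII and mode is not Mode.ASCII_TO_NON_ASCII:
            # A non-ASCII letter also matches its own ASCII counterpart.
            options.append(TO_ASCII[ch])
        elif ch not in TO_ASCII and mode is not Mode.NON_ASCII_TO_ASCII:
            # An ASCII letter also matches every letter that maps to it.
            options += [k for k, v in TO_ASCII.items() if v == ch]
        if len(options) == 1:
            parts.append(re.escape(ch))
        else:
            parts.append("[" + "".join(options) + "]")
    return "".join(parts)
```

For example, `re.search(expand("zolta ges"), "żółta gęś")` finds a match in two-way mode, while `expand("ż")` produces `[żz]` and therefore does not match `ź`.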

Implementation

  • The git grep pattern should probably be modified so that it contains OR parts wherever one or the other character should match

  • However, multiple non-ASCII chars can map to a single ASCII char, e.g. both ż and ź map to z. In such a case, all three should be detected when grepping for z, but only two when grepping for ż or ź (the letter itself and z, because ż and ź shouldn't be treated as the same letter)

  • There are many languages, so hard-coding these rules for Polish doesn't seem like the best idea under the sun. I would prefer to create some kind of abstraction layer, so the rules can be added independently for each language. Maybe it could even be part of a config file for custom mappings (how YAML handles non-ASCII characters still needs to be checked), but I think built-in support for selected languages can be included.

  • Above, I only considered char-to-char mappings. But there are cases where multiple ASCII characters map to a single non-ASCII char (e.g. German ß maps to ss). I'm not sure whether the approach is trivial to extend to cover that.
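
One possible shape for the per-language layer, including the multi-character case, is sketched below. It assumes each language is a simple table from a non-ASCII char to an ASCII string of any length; the `LANGUAGES` registry, the language codes, and the function names are invented for illustration, and the German table contains only the ß example from this issue. The output uses plain `(a|b)` groups and was only tried with Python's `re`, not with git grep itself.

```python
import re

# Hypothetical registry: language code -> {non-ASCII char: ASCII string}.
LANGUAGES = {
    "pl": {"ą": "a", "ć": "c", "ę": "e", "ł": "l", "ń": "n",
           "ó": "o", "ś": "s", "ź": "z", "ż": "z"},
    "de": {"ß": "ss"},
}


def build_tables(codes):
    """Merge the selected languages and build the reverse lookup."""
    to_ascii = {}
    for code in codes:
        to_ascii.update(LANGUAGES[code])
    from_ascii = {}
    for non_ascii, ascii_str in to_ascii.items():
        from_ascii.setdefault(ascii_str, []).append(non_ascii)
    return to_ascii, from_ascii


def alt(options):
    """Join already-escaped alternatives into one OR group."""
    return options[0] if len(options) == 1 else "(" + "|".join(options) + ")"


def char_regex(ch, to_ascii, from_ascii):
    """Regex for a single query character."""
    if ch in to_ascii:
        options = [ch, to_ascii[ch]]              # ż -> (ż|z), ß -> (ß|ss)
    else:
        options = [ch] + from_ascii.get(ch, [])   # z -> (z|ź|ż)
    return alt([re.escape(o) for o in options])


def expand(query, codes):
    """Two-way expansion of a literal query for the given languages."""
    to_ascii, from_ascii = build_tables(codes)
    # Multi-character ASCII keys such as "ss", longest first.
    multi = sorted((k for k in from_ascii if len(k) > 1), key=len, reverse=True)
    parts, i = [], 0
    while i < len(query):
        key = next((k for k in multi if query.startswith(k, i)), None)
        if key is None:
            parts.append(char_regex(query[i], to_ascii, from_ascii))
            i += 1
        else:
            # Either the single non-ASCII char, or the key spelled out
            # with each of its letters expanded on its own.
            spelled = "".join(char_regex(c, to_ascii, from_ascii) for c in key)
            parts.append(alt([re.escape(n) for n in from_ascii[key]] + [spelled]))
            i += len(key)
    return "".join(parts)
```

With this sketch, `expand("strasse", ["de"])` gives `stra(ß|ss)e` and `expand("straße", ["de"])` gives the same pattern, so both spellings find each other. Custom mappings from a config file could be merged into `LANGUAGES` before calling `expand`; how they would be loaded is left open here.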

Metadata

Labels: enhancement (New feature or request)
