pyzet grep two-way support for searching for non-ASCII characters #34
Description
Alphabets that use non-ASCII characters are annoying to grep for, and there should be a way to enable convenient search patterns that deal with this problem.
Problem statement
- A user used a non-ASCII character in the zettel content and would like to find it using only ASCII characters.
- A user used an ASCII character in the zettel content (for any reason: laziness, a mistake, copied text) and would like to find it when searching for its non-ASCII counterpart as well.
Example
E.g. for Polish we have:
ą -- a
ć -- c
ę -- e
ł -- l
ń -- n
ó -- o
ś -- s
ź -- z
ż -- z
Of course, capital letters should also be supported.
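As a rough illustration, the table above could be stored as a plain dictionary, with the capital-letter pairs derived from the lowercase ones instead of being typed out by hand. The names `POLISH_LOWER`, `with_capitals`, and `POLISH` are made up for this sketch and are not part of pyzet.

```python
# Lowercase Polish diacritics and their ASCII counterparts, as listed above.
POLISH_LOWER = {
    "ą": "a", "ć": "c", "ę": "e", "ł": "l", "ń": "n",
    "ó": "o", "ś": "s", "ź": "z", "ż": "z",
}


def with_capitals(lower_map):
    """Return a copy of lower_map extended with the uppercase pairs."""
    full = dict(lower_map)
    for src, dst in lower_map.items():
        full[src.upper()] = dst.upper()
    return full


# Hypothetical combined table covering both cases.
POLISH = with_capitals(POLISH_LOWER)
```

Deriving capitals this way only works when `str.upper()` gives a single character, which holds for the Polish letters above but not for every alphabet.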
Behaviors
- grepping for `zolta ges` should find `żółta gęś`
- grepping for `żółta gęś` should find `zolta ges` (use case: we want to find text copied from someone who didn't use diacritics)
- this is probably controlled with a special flag, or even multiple flags (i.e. there can be different modes: a single two-way mode or two one-way modes)
Implementation
- The `git grep` pattern should probably be modified so that it contains OR parts wherever either of two characters should match.
- However, multiple non-ASCII characters can map to a single ASCII character, e.g. both `ż` and `ź` map to `z`. In that case, all three should be detected when grepping for `z`, but only two when grepping for `ż` or `ź` (because `ż` and `ź` shouldn't be treated as the same letter).
- There are many languages, so hard-coding these rules for Polish doesn't seem like the best idea under the sun. I would prefer to create some kind of abstraction layer so that rules can be added independently for each language. Maybe custom mappings could even be part of a config file (how YAML handles non-ASCII still needs to be checked), but I think built-in support for selected languages can be included.
- Above, I only considered char-to-char mappings. But there are cases where multiple ASCII characters map to a single non-ASCII character (e.g. German `ß` maps to `ss`). I'm not sure whether it's trivial to extend the approach to cover that.
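To make the points above more concrete, here is a minimal sketch of how a query could be expanded into a pattern with OR parts. Everything in it is an assumption for illustration: the `RULES` layout, the function names, and the two boolean switches standing in for the one-way modes are hypothetical and not existing pyzet code. The pattern is only checked with Python's `re` here. Passing it to `git grep -E` is the intended use, but how git handles non-ASCII characters in patterns depends on the locale and has not been verified.

```python
import re

# Hypothetical per-language rules: non-ASCII character -> ASCII replacement.
# The replacement may be longer than one character (German "ß" -> "ss").
RULES = {
    "pl": {"ą": "a", "ć": "c", "ę": "e", "ł": "l", "ń": "n",
           "ó": "o", "ś": "s", "ź": "z", "ż": "z"},
    "de": {"ß": "ss"},
}

# Metacharacters that need a backslash in an extended regular expression.
ERE_SPECIAL = set(r"\.[]()*+?{}|^$")


def escape(text):
    """Backslash-escape regex metacharacters in text."""
    return "".join("\\" + ch if ch in ERE_SPECIAL else ch for ch in text)


def group(options):
    """Join alternatives into an OR group; a lone option stays ungrouped."""
    escaped = [escape(opt) for opt in options]
    return escaped[0] if len(escaped) == 1 else "(" + "|".join(escaped) + ")"


def build_pattern(query, mapping, ascii_to_diacritic=True, diacritic_to_ascii=True):
    """Expand query into a regex; each flag enables one search direction."""
    # Reverse index: ASCII string -> all non-ASCII characters that map to it.
    reverse = {}
    for src, dst in mapping.items():
        reverse.setdefault(dst, []).append(src)
    # Try longer ASCII sequences first, so "ss" is preferred over "s".
    ascii_keys = sorted(reverse, key=len, reverse=True)

    parts = []
    i = 0
    while i < len(query):
        ch = query[i]
        if ch in mapping:
            # Non-ASCII in the query: match itself, optionally its ASCII form.
            options = [ch] + ([mapping[ch]] if diacritic_to_ascii else [])
            parts.append(group(options))
            i += 1
            continue
        key = None
        if ascii_to_diacritic:
            key = next((k for k in ascii_keys if query.startswith(k, i)), None)
        if key is not None:
            # ASCII in the query: match itself or any non-ASCII counterpart.
            parts.append(group([key] + sorted(reverse[key])))
            i += len(key)
        else:
            parts.append(escape(ch))
            i += 1
    return "".join(parts)
```

With the Polish rules, `build_pattern("zolta ges", RULES["pl"])` expands `z` into an OR group over `z`, `ź`, and `ż`, while a query containing `ż` only expands to `ż` or `z`, so `ź` is not matched. One known limitation of this sketch: the greedy longest-match step turns `ss` into a single group, so combining rules from several languages in one mapping (e.g. Polish `ś` plus German `ß`) would need more care. Capital letters are not handled here either.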