Non-emoji numerals are detected as emoji #3

kainosnoema wants to merge 1 commit into toddkramer:master from
Conversation
Progress: it seems that the only way to properly detect a sequence of codepoints is using composed character sequences. The one hitch to this solution is the one I mentioned about modifiers, but that can be handled by checking if the sequence is made up of two codepoints, the first one being an emoji and the second one being a modifier. I'm working on a pull request now with this approach, adding tests as I go.
Due to the use of `unicodeScalars` previously, some ASCII characters
were being identified as emoji. In particular, the "Keycap Digit N"
characters are composed of the ASCII character followed by two other
codepoints. Keycap Digit Zero, for example, contains these scalars:
```
- "0"        // U+0030 DIGIT ZERO
- "\u{FE0F}" // U+FE0F VARIATION SELECTOR-16
- "\u{20E3}" // U+20E3 COMBINING ENCLOSING KEYCAP
```
In order to properly handle these sequences without false positives, we
have to split emojis into their composed character sequences and store
those as a set instead. The one complication is that there are many
permutations of emoji with skin tone modifiers. Instead of storing
each permutation, we simply check whether a character sequence has two
codepoints, and if so, whether the first is an emoji and the
second is a skin tone modifier. This is a fairly simple and efficient
way to accurately identify the presence of valid emoji.
Signed-off-by: Evan Owen <kainosnoema@gmail.com>
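In Swift terms, the check described in the commit message might be sketched roughly like this. This is an illustrative sketch, not the actual implementation: the set contents and the function name `isEmoji` are assumptions, and a real implementation would enumerate the full set of emoji composed character sequences.

```swift
import Foundation

// Skin tone modifiers occupy U+1F3FB through U+1F3FF.
let skinToneModifiers: ClosedRange<UInt32> = 0x1F3FB...0x1F3FF

// A tiny sample set of composed emoji sequences; the real implementation
// would build this from the full emoji list.
let emojiSet: Set<String> = ["😀", "👍", "0️⃣"]

// Hypothetical helper showing the two-codepoint modifier check.
func isEmoji(_ character: Character) -> Bool {
    let string = String(character)
    if emojiSet.contains(string) { return true }

    // Handle skin tone permutations without storing each one:
    // a two-scalar sequence whose first scalar is an emoji and whose
    // second scalar is a skin tone modifier counts as an emoji.
    let scalars = Array(string.unicodeScalars)
    if scalars.count == 2,
       skinToneModifiers.contains(scalars[1].value),
       emojiSet.contains(String(Character(scalars[0]))) {
        return true
    }
    return false
}
```

With this shape, `isEmoji("👍🏽")` matches via the modifier branch even though "👍🏽" itself is never stored, while a plain ASCII `"0"` fails both checks.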
Alright, here's a stab at fixing things. It requires a dramatically different approach to emoji detection, but it seems to be the most straightforward way to accurately detect emoji without false positives on ASCII digits. Performance is good too after the first enumeration of all emoji sequences. Because it's so different, though, you may have some suggestions on how to improve it.

Edit: The other major change here is that I've removed
Non-emoji numerals are treated as emoji. e.g. this fails:

This is because `String.unicodeScalars` splits emojis into their codepoints, which for some characters yields standard ASCII. As an example, the codepoints for the "0 in a box" emoji are the ASCII `"0"` followed by `U+FE0F` and `U+20E3`.

It's not as easy as removing ASCII characters from the list of unicode scalars, since that would break the implementation of `containsEmojiOnly()`. One solution would be to find a way to split strings into their composed character sequences, but then you'd have to also combine all possible modifier permutations. Still thinking of the proper way to solve this.
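The false positive described above can be reproduced with a naive scalar-based check like the following sketch (the set contents and the name `naiveContainsEmoji` are illustrative, not the library's actual code):

```swift
import Foundation

// Naive approach: store every unicode scalar of every known emoji.
// Because "0️⃣" decomposes into "0" + U+FE0F + U+20E3, the ASCII
// digit "0" ends up in the set, causing the false positive.
let naiveEmojiScalars: Set<UInt32> = Set("0️⃣".unicodeScalars.map { $0.value })

func naiveContainsEmoji(_ string: String) -> Bool {
    // Matches if ANY scalar of the input appears in the set,
    // so plain ASCII digits are wrongly flagged as emoji.
    return string.unicodeScalars.contains { naiveEmojiScalars.contains($0.value) }
}
```

Here `naiveContainsEmoji("0")` wrongly returns `true`, which is exactly the bug: the scalar-level set cannot distinguish a bare digit from the keycap sequence that contains it.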