Library doesn't seem to take character sets into account #43

@sixtyfive

I'm trying to distinguish between a couple of European languages and Turkish/Arabic/Aramaic. Whatlanguage does a fair job of the European languages, but beyond that falls apart at the seams.

main » wl.language("Bilgi Teknolojileri Kurumu (BTK) tarafından 29 Nisan 2017 tarihinde.")
=> :russian

It cannot be Russian, as that would be written with a different set of characters.

main » wl.language("البرنامج ليس ذكي جدا")
=> :arabic

Works fine, even though I wasn't very nice to it. But as evidenced in #41, there are issues there, too. The reporter of #41 doesn't make it very explicit, but hits the same spot, especially with the numbers. (His first and second strings are easily recognizable as Farsi, not Arabic, because their second word, i.e. the one on the left, is not part of the Arabic dictionary but is very commonly used in Farsi.)

main » wl.language("ܣܰܪܐ: ܓܷܕ ܣܳܚܝܢܰܐ، ܓܷܕ ܡܫܰܡܣܝܢܰܐ، ܓܷܕ ܫܳܬܝܢܰܐ ܩܰܚܘܰܐ ܘܦܰܠܓܶܗ ܕܝܰܘܡܐ ܠܰܦ ܐܝ ܣܰܥܰܐ ܬܪܰܥܣܰܪ ܘܦܰܠܓܶܗ ܓܷܕ ܡܰܥܪܝܢܰܐ. ܗܰܘܟ݂ܰܐ ܓܷܕ ܫܳܦܰܥ ܐܘ ܝܰܘܡܰܝܕ݂ܰܢ.")
=> :russian

Makes one wonder if Russian is a last-resort fallback. Again, though, it cannot possibly be Russian, because it is written in a completely different character set, namely the Syriac script used for Aramaic.
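
To illustrate what "taking character sets into account" could look like, here is a minimal sketch that uses Ruby's Unicode script properties to reject a guess whose script does not match the text. It is not part of whatlanguage: `SCRIPTS`, `LANGUAGE_SCRIPTS`, `dominant_script` and `plausible?` are hypothetical names, and the language-to-script table is an assumption covering only the languages mentioned in this issue.

```ruby
# Hypothetical helpers, not part of whatlanguage.
# Regexps built on Ruby's Unicode script properties.
SCRIPTS = {
  latin:    /\p{Latin}/,
  cyrillic: /\p{Cyrillic}/,
  arabic:   /\p{Arabic}/,
  syriac:   /\p{Syriac}/
}.freeze

# Assumed mapping from a guessed language to the script it is written in.
LANGUAGE_SCRIPTS = {
  russian: :cyrillic,
  turkish: :latin,
  arabic:  :arabic,
  farsi:   :arabic
}.freeze

# Returns the script that most characters of +text+ belong to,
# or nil if no character matches any known script (e.g. digits only).
def dominant_script(text)
  counts = Hash.new(0)
  text.each_char do |ch|
    SCRIPTS.each do |name, pattern|
      if ch.match?(pattern)
        counts[name] += 1
        break # a character is counted for at most one script
      end
    end
  end
  counts.max_by { |_, n| n }&.first
end

# A guess is implausible only when both the expected script and the
# text's dominant script are known and they differ.
def plausible?(language, text)
  expected = LANGUAGE_SCRIPTS[language]
  actual = dominant_script(text)
  expected.nil? || actual.nil? || expected == actual
end

plausible?(:russian, "Bilgi Teknolojileri Kurumu (BTK) tarafından 29 Nisan 2017 tarihinde.") # => false
plausible?(:arabic, "البرنامج ليس ذكي جدا")                                                   # => true
plausible?(:russian, "ܣܰܪܐ")                                                                   # => false
```

A check like this could run after the library's guess to discard results such as `:russian` for Latin or Syriac text. It cannot by itself tell Arabic from Farsi, since both use the Arabic script.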

I'd also like to point out #27 again at this point. I do so with a sad face. In addition I would like to point out that these matters have been noticed elsewhere as well: "[...] this project still has a way [sic] to go [...]", posted to StackExchange on July 9, 2014.
