Replacing MeCab with alternative parsing dictionary / adding implementation for user input when reviewing cards #66

@nlovell1

I'm new to GitHub and not sure this is the best place to post, so please let me know if it isn't.

I'm interested in helping replace MeCab with another parser, mainly out of frustration with two things: 1. homophonic grammar structures that are marked as 'known' often have more than one semantic use, and those uses can differ considerably; and 2. collocations, colloquialisms, and figures of speech are not treated as units but broken up into pieces. In my experience, both have pushed cards to i+2 or greater. Solving this seems to be beyond the scope of general tokenizers and morphological analyzers. Morphemizers like MeCab or even Sudachi tokenize a sentence into "morphemes", but the results I expect are actually 文節 (bunsetsu, i.e. phrase units). The only software I can find that does that is J.DepP (https://www.tkl.iis.u-tokyo.ac.jp/~ynaga/jdepp/), and I'm unsure how it would be implemented here.
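To illustrate the gap between morphemes and 文節, here is a minimal sketch of a naive grouping rule: particles and auxiliary verbs attach to the preceding chunk, and a prefix attaches to the word that follows it. This is not how J.DepP works and is far too simple for real text; the function name `chunk_bunsetsu` is hypothetical, the part-of-speech tags are assumed to be IPADIC-style, and the tokens are hard-coded rather than taken from an actual MeCab run.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface form, coarse part-of-speech tag)

# Assumption: IPADIC-style top-level tags. Particles and auxiliary verbs
# are treated as function words that attach to the preceding chunk.
ATTACHING_POS = {"助詞", "助動詞"}
PREFIX_POS = "接頭詞"


def chunk_bunsetsu(tokens: List[Token]) -> List[str]:
    """Group morpheme tokens into rough phrase units (hypothetical helper)."""
    chunks: List[List[str]] = []
    prev_was_prefix = False
    for surface, pos in tokens:
        # A content word opens a new chunk unless it directly follows a prefix.
        starts_new = pos not in ATTACHING_POS and not prev_was_prefix
        if starts_new or not chunks:
            chunks.append([surface])
        else:
            chunks[-1].append(surface)
        prev_was_prefix = pos == PREFIX_POS
    return ["".join(chunk) for chunk in chunks]


# Hand-written tokens for 私は本を読みました.
sentence = [
    ("私", "名詞"), ("は", "助詞"),
    ("本", "名詞"), ("を", "助詞"),
    ("読み", "動詞"), ("まし", "助動詞"), ("た", "助動詞"),
]
print(chunk_bunsetsu(sentence))  # ['私は', '本を', '読みました']
```

A rule like this does nothing for collocations or figures of speech, which would still need a dictionary of multi-word expressions or a real dependency parser.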

I would also like to revamp the system for comprehension cards, because the morph indicated as the target of a sentence is often not the morph that is actually unknown to the user. Ideally, I would like to see a feature that asks for user input to redefine the target (that is, the unfamiliar or unknown morph in the sentence) when the parsing dictionary gets it wrong. It is unclear at this point whether improving the parser would make this feature unnecessary, but for now I think it could serve as a band-aid.
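A rough sketch of what such an override could look like, under the assumption that a sentence is already available as a list of morphs and that the set of known morphs is given. The function `choose_target` and its parameters are hypothetical and do not correspond to any existing code in this add-on.

```python
from typing import List, Optional, Set


def choose_target(
    morphs: List[str],
    known: Set[str],
    user_choice: Optional[str] = None,
) -> Optional[str]:
    """Pick the target morph, letting the user override the automatic guess."""
    if user_choice is not None:
        # The override must be one of the morphs in the sentence.
        if user_choice not in morphs:
            raise ValueError(f"{user_choice!r} is not in this sentence")
        return user_choice
    # Automatic guess (assumed here): the first morph not marked as known.
    for morph in morphs:
        if morph not in known:
            return morph
    return None  # every morph is already known


morphs = ["猫", "が", "寝る"]
known = {"が", "寝る"}
auto_target = choose_target(morphs, known)
overridden_target = choose_target(morphs, known, user_choice="寝る")
```

Here `auto_target` is the parser's guess, while `overridden_target` reflects what the user says they do not know, even though that morph is marked as known.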

I would love to help out with the development of this, but I am a little unsure where to start. Please let me know if there is anything I can do.
Thank you.
