@beniza beniza commented Jan 7, 2020

I've made the following changes:

  • In the grammar
    • Allow lines with a period at the end
  • In the code
    • The parsed data is saved to an output file
    • Entries that fail to parse are saved to the error.log file
  • Documentation
    • Expanded the README.md
    • Added a description of MDF lexical entries

Please review the code before merging. The rest of the files/data may be merged directly.

beniza and others added 10 commits January 7, 2020 10:21
  - Removed extra whitespace from the dictionary input file
  - Removed an irrelevant line from the input file
  - Ran the new code and generated output
    - output is in data/dict.txt
    - entries not parsed by the grammar are in error.log
  - Changes in the grammar
    - an entry is valid even if there is a period at the end of the line
    - a pos can be terminated with either a full stop or a comma
      - the comma is a typo in the input
    - glosses can be terminated with a colon
      - the colons are typos in the input
    - attempted to support phrase entries (multiple words in the headword)
      - this failed, so the code is commented out
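
To make the two relaxed terminators concrete, here is a minimal sketch using Python's standard `re` module rather than the project's grammar. The names `POS`, `TRAILING_PERIOD`, `read_pos` and `strip_final_period` are hypothetical, and the assumption that a pos is a lowercase abbreviation is mine, not something the input file guarantees.

```python
import re

# Hypothetical, simplified stand-ins for the grammar rules listed above.
# Assumption: a pos is a lowercase abbreviation such as "n" or "adj".
# It is closed by "." or, tolerating the typo, by ",".
POS = re.compile(r"(?P<pos>[a-z]+)[.,]")

# A line is accepted with or without a final period.
TRAILING_PERIOD = re.compile(r"\.?\s*$")


def read_pos(text):
    """Return the pos abbreviation at the start of text, or None."""
    match = POS.match(text)
    return match.group("pos") if match else None


def strip_final_period(line):
    """Drop an optional final period and any trailing whitespace."""
    return TRAILING_PERIOD.sub("", line)
```

With these definitions, `read_pos("n. ...")` and `read_pos("n, ...")` both yield `"n"`, which mirrors the choice to accept the typo instead of rejecting the entry.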

  - bailey now generates SFM output (in MDF) from the input text
  - better handling of the input file
    - in case of a malformed line in the input text,
      - bailey will copy the line to error.log
      - continue parsing with the next line
    - the parsed output will be stored in the dict.txt file
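
The error-handling flow described in this commit can be sketched roughly as follows. This is not bailey's actual code: `convert`, `save` and `toy_parse` are hypothetical names, `toy_parse` is a stand-in for the real grammar, and raising `ValueError` on a malformed line is an assumption made for the example. Only the file names `dict.txt` and `error.log` come from the commit message.

```python
from pathlib import Path


def convert(lines, parse_entry):
    """Split input lines into parsed records and rejected lines."""
    parsed, rejected = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            parsed.append(parse_entry(line))
        except ValueError:
            # Malformed line: remember it and carry on with the next one.
            rejected.append(line)
    return parsed, rejected


def save(parsed, rejected, out_dir="."):
    """Write parsed records to dict.txt and rejected lines to error.log."""
    Path(out_dir, "dict.txt").write_text("\n\n".join(parsed) + "\n", encoding="utf-8")
    Path(out_dir, "error.log").write_text("\n".join(rejected) + "\n", encoding="utf-8")


def toy_parse(line):
    """Hypothetical stand-in for the grammar: '#headword, pos. glosses'."""
    head, comma, rest = line.lstrip("#").partition(", ")
    pos, dot, _glosses = rest.partition(".")
    if not comma or not dot:
        raise ValueError(line)
    # \lx (lexeme) and \ps (part of speech) are standard MDF markers.
    return f"\\lx {head}\n\\ps {pos}"
```

Collecting the rejected lines instead of aborting means a single bad entry no longer stops the whole conversion, and error.log doubles as a worklist for manual correction.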
This has been recreated from the MDF documentation.
entry = hash headword comma pos ws senses subentry period emptyline
# entry = hash headphrase comma pos ws senses subentry period emptyline
hash = (~"#")*
# headphrase = headword (ws headword)*

I added this to capture headwords that consist of multiple words. Most of the parse exceptions are caused by such entries.

However, I couldn't get this to work.
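
For comparison, the intent of the commented-out `headphrase` rule can be sketched with a plain regular expression. This is only an illustration of the target behaviour, not a fix for the grammar: `HEADWORD`, `HEADPHRASE` and `headphrase` are hypothetical names, and the sketch assumes headwords are Latin-script words separated by single spaces.

```python
import re

# Assumption: a headword is made of Latin letters, apostrophes or hyphens.
HEADWORD = r"[A-Za-z'-]+"

# Optional leading hashes, one or more space-separated headwords, then a comma.
HEADPHRASE = re.compile(rf"#*(?P<head>{HEADWORD}(?: {HEADWORD})*),")


def headphrase(line):
    """Return the (possibly multi-word) headword of an entry line, or None."""
    match = HEADPHRASE.match(line)
    return match.group("head") if match else None
```

Here `headphrase("#give up, v. ...")` returns `"give up"`, while a line without a comma after the headword returns `None`.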

sense = (ml ws ml)* ml
ml = ~"[\u0d00-\u0d7f]*"
- semicolon = ~";"
+ semicolon = ~"[;:]"

In many places the keyboardists made typos, putting a : in place of a ;. Since we are not preserving the original data, I decided to accept both characters as the separator.
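
The effect of widening the separator to the character class `[;:]` can be illustrated with the standard `re` module. `SENSE_SEP` and `split_senses` are hypothetical names, and trimming whitespace around the separator is an assumption of this sketch, not something the grammar rule itself does.

```python
import re

# Accept ";" and, tolerating the typo, ":" as the separator between senses.
SENSE_SEP = re.compile(r"\s*[;:]\s*")


def split_senses(text):
    """Split a run of glosses on ';' or ':' and drop empty pieces."""
    return [sense for sense in SENSE_SEP.split(text.strip()) if sense]
```

For example, `split_senses("പൂച്ച; നായ: ആന")` yields the same three glosses whether the separators were typed as semicolons or colons. The original punctuation is lost in the process, which matches the decision not to preserve it.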