Conversation

@hmlee245 hmlee245 commented Aug 19, 2025

What does this PR do?

This PR adds the time semiotic class to Korean ITN (inverse text normalization).

Before your PR is "Ready for review"

Pre checks:

  • Have you signed your commits? Use git commit -s to sign.
  • Do all unittests finish successfully before sending PR?
    1. pytest or (if your machine does not have GPU) pytest --cpu from the root folder (given you marked your test cases accordingly @pytest.mark.run_only_on('CPU')).
    2. Sparrowhawk tests bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
  • If you are adding a new feature: Have you added test cases for both pytest and Sparrowhawk here.
  • Have you added __init__.py for every folder and subfolder, including data folder which has .TSV files?
  • Have you followed the CodeQL results and removed unused variables and imports (the report is at the bottom of the PR in the GitHub review box)?
  • Have you added the correct license header Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. to all newly added Python files?
  • If you copied nemo_text_processing/text_normalization/en/graph_utils.py your header's second line should be Copyright 2015 and onwards Google, Inc.. See an example here.
  • Remove import guards (try import: ... except: ...) if not already done.
  • If you added a new language or a new feature please update the NeMo documentation (lives in different repo).
  • Have you added your language support to tools/text_processing_deployment/pynini_export.py.

PR Type:

  • New Feature
  • Bugfix
  • Documentation
  • Test

If you haven't finished some of the above items you can still open "Draft" PR.

hmlee245 and others added 30 commits May 13, 2025 15:02
Sparrowhawk testing is not done yet.

Signed-off-by: hmlee245 <hmlee245@gmail.com>
changes made to 9f7e876.

Signed-off-by: hmlee245 <hmlee245@gmail.com>
… test cases

Signed-off-by: hmlee245 <hmlee245@gmail.com>
…45/NeMo-text-processing into Draft-Version2/Korean-ITN

Signed-off-by: hmlee245 <hmlee245@gmail.com>

Member

@tbartley94 tbartley94 left a comment

Minor comments

Hyunmin Lee and others added 2 commits August 28, 2025 16:46
Signed-off-by: Hyunmin Lee <hyunminl@hyunminl-mlt.client.nvidia.com>

Member

@tbartley94 tbartley94 left a comment

I believe the minutes part is too liberal with potential cardinals. Correct me if I'm wrong.


hour_component = pynutil.insert("hours: \"") + (graph_hours + spacing + hour_suffix) + pynutil.insert("\"")

minute_component = (

Member

Won't this graph accept values beyond 0-59, though?

Author

It only accepts 0-59 correctly. Anything beyond that range is still accepted, but parsed awkwardly. For example, "60분" is tokenized as cardinal 6 followed by minute 10.

Member

Can we add a block to that to limit awkward examples?

Author

Yes, that's what I've been working on this week, along with the money semiotic class. I will try to push both updates ASAP this week.
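
One way to block the awkward parses discussed in this thread is to build the minute acceptor from an explicit, finite list of readings instead of reusing a general cardinal graph. The sketch below is plain Python, not the PR's grammar: `sino_korean`, `MINUTE_PAIRS`, and `tag_minute` are hypothetical names, reading 0 as "영" is an assumption, and the resulting (spoken, written) pairs are only the kind of input that `pynini.string_map` accepts.

```python
# Sino-Korean digits, which Korean uses when reading minutes and seconds.
DIGITS = ["", "일", "이", "삼", "사", "오", "육", "칠", "팔", "구"]


def sino_korean(n: int) -> str:
    """Spoken Sino-Korean form of n, restricted to 0 <= n <= 59."""
    if not 0 <= n <= 59:
        raise ValueError(f"minute out of range: {n}")
    if n == 0:
        return "영"  # assumption: 0 is read as "영"
    tens, ones = divmod(n, 10)
    # 10-19 use a bare "십" (not "일십"); 20 and above prefix the tens digit.
    if tens == 0:
        tens_part = ""
    elif tens == 1:
        tens_part = "십"
    else:
        tens_part = DIGITS[tens] + "십"
    return tens_part + DIGITS[ones]


# Hypothetical (spoken, written) pairs covering exactly 0-59.
MINUTE_PAIRS = [(sino_korean(n), str(n)) for n in range(60)]
MINUTE_LOOKUP = dict(MINUTE_PAIRS)


def tag_minute(spoken: str):
    """Return the written minute, or None when the reading is outside 0-59."""
    return MINUTE_LOOKUP.get(spoken)
```

Because the list is closed, a reading such as "육십" has no entry at all, so it cannot be split into a cardinal plus a smaller minute by this component.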

hmlee245 and others added 4 commits September 23, 2025 05:00
money = MoneyFst()
money_graph = money.fst

graph = cardinal_graph | ordinal_graph | decimal_graph | fraction_graph | time_graph | date_graph | money_graph

Member

Use pynini.union and split it across multiple lines. The linter should catch this.

Member

@tbartley94 tbartley94 left a comment

LGTM. Minor formatting change.

@github-actions

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

@github-actions github-actions bot added the Stale label Oct 14, 2025
graph_regular = hour + minute + second

# 오전 = AM, 오후 = PM
prefix_words = pynini.union(

Member

Make it union(...) + spacing. Optimization usually catches these, but it's not a given, so we might as well save the quick op.

Author

I'm not sure how to union the pynini.accep calls. It gives me a TypeError involving str and tuple.

+ pynini.closure(delete_space + suffix_tag, 0, 1)
)

cardinal = CardinalFst()

Member

Make the cardinal FST an argument of the init function. This allows you to pass in the FST from the cardinal in the tagger graph and avoids instantiating the graph twice.

Author

I am using the cardinal FST only to detect hours/minutes/seconds that are outside the normal range, not for anything else. So wouldn't it be instantiated only once?
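
To make the point about instantiation concrete, here is a small sketch with stand-in classes. It assumes the tagger already builds one cardinal for its own graph, so a second `CardinalFst()` inside the time class would be a second build. The class bodies are placeholders rather than the real grammars, and `TimeFst` and `build_tagger` are hypothetical names. Only `MoneyFst(cardinal)` mirrors a call that appears in the diff.

```python
class CardinalFst:
    """Stand-in for the real grammar; counts how often it is built."""

    instances = 0

    def __init__(self):
        CardinalFst.instances += 1
        self.fst = "cardinal-graph"  # placeholder for the compiled graph


class TimeFst:
    """Receives the shared cardinal instead of constructing its own."""

    def __init__(self, cardinal: CardinalFst):
        self.cardinal = cardinal
        self.fst = f"time-graph using {cardinal.fst}"


class MoneyFst:
    def __init__(self, cardinal: CardinalFst):
        self.cardinal = cardinal
        self.fst = f"money-graph using {cardinal.fst}"


def build_tagger():
    """Build the cardinal once and hand the same object to each class."""
    cardinal = CardinalFst()
    time = TimeFst(cardinal)
    money = MoneyFst(cardinal)
    return cardinal, time, money
```

With this shape, every class that needs the cardinal graph reuses the tagger's single instance, however many classes are added later.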

money = MoneyFst(cardinal)
money_graph = money.fst

telephone = TelephoneFst()

Member

Pass cardinal to your TelephoneFst, as with MoneyFst above.

Author

I am actually not using cardinal for telephone. I am just using the same digits as the cardinal class.

Member

@tbartley94 tbartley94 left a comment

Request regarding the telephone graph.

@github-actions github-actions bot removed the Stale label Oct 15, 2025

2 participants