Improve the ja character set per ARIB feedback #614
|
Looking at the liaison text and ARIB B-62 2.2E1 F1, I think line 3642 should target the character codes listed in this list of the Japanese Supplementary character set, e.g. |
Co-authored-by: himorin / Atsushi Shimono <atsushi@himor.in>
|
The Timed Text Working Group just discussed
The full IRC log of that discussion
<nigel> Topic: Improve the ja character set per ARIB feedback #614
<nigel> github: https://github.com//pull/614
<cpn> Pierre: Atsushi, if I accept both your comments, are you happy with the PR?
<cpn> Atsushi: The comment about ARIB was just a suggestion
<nigel> -> Atsushi's comment https://github.com//pull/614#issuecomment-3332729041
<cpn> Atsushi: The comment relates to the suggested change above.
<cpn> Pierre: I accepted the suggestion. Is the PR ok now?
<cpn> Atsushi: Yes
<nigel> SUMMARY: Atsushi's comments accepted |
|
I've read through the whole updated text again. |
Ideographic Variation Selector is not a defined term in Unicode. 23.4 Variation Selectors, CJK Compatibility Ideographs states:
I assumed ARIB meant standardized variation sequences for CJK compatibility ideographs. |
|
In UTS #37, an IVS (Ideographic Variation Sequence) is defined as a sequence of two coded characters: the first an ideograph, the second a variation selector. Because an IVS is a sequence of two Unicode code points but contains exactly one variation selector, that selector is sometimes written as Ideographic Variation Selector (as in About IVD/IVS at CITPC); the term was also used in the early phase of development, e.g. in some proposals. CJK Compatibility Ideographs are compatibility ideographs, most of which normalize to CJK Unified Ideographs, but which are required for backward compatibility with local character encodings. Some CJK Compatibility Ideographs are also included in the collections listed here; for example, the IBM 32 compatibility ideographs, U+FA0E to U+FA2D, are listed as part of collection 287 Common Japanese. Since Unicode 6.3, these ranges have an additional table of SVSes (Standardized Variation Sequences, using the standardized variation selectors U+FE00 to U+FE02), as described in that section, which allows code points in CJK Compatibility Ideographs to be written as a CJK Unified Ideograph followed by one of those selectors. So I believe the note on the last line of list 2 in the liaison text should be read as "use the ideograph-specific Variation Selector defined in Unicode", or "use the Ideographic Variation Sequence (IVS) defined in Unicode". |
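As an illustration of the compatibility point above, here is a minimal Python sketch (not part of the PR). It uses only the standard `unicodedata` module and therefore depends on the Unicode database bundled with the interpreter; the helper name `folds_under_normalization` is hypothetical. It shows that a CJK Compatibility Ideograph with a canonical decomposition is replaced by a unified ideograph under normalization, which is the distinction a variation sequence can preserve.

```python
import unicodedata


def folds_under_normalization(ch: str) -> bool:
    """Return True if NFC normalization replaces `ch` with a different string."""
    return unicodedata.normalize("NFC", ch) != ch


# U+FA10 lies in the U+FA0E..U+FA2D range mentioned above and has a
# canonical decomposition to the unified ideograph U+585A.
normalized = unicodedata.normalize("NFC", "\uFA10")
print(f"U+FA10 -> U+{ord(normalized):04X}")

# U+FA0E is in the same range but has no decomposition, so it is unchanged.
print(folds_under_normalization("\uFA0E"))
```

Text that passes through a normalizing pipeline can therefore lose the compatibility code point, whereas a base ideograph followed by a variation selector is not altered by normalization.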
Ideographic variation sequences are not part of Unicode and are instead specified in UTS 37, whereas Standardized variation sequences are specified in Unicode. Does ARIB STD-B62 reference UTS 37? Are you sure that ARIB does not mean Standardized variation sequences? Can ARIB provide examples of what they mean by "variation of Kanji character"? |
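For readers following the distinction, a rough Python sketch of how the two kinds of selector can be told apart by code point range. The function names and the labels `"VS1-VS16"` and `"VS17-VS256"` are hypothetical; the ranges are the Variation Selectors block (U+FE00 to U+FE0F) and the Variation Selectors Supplement block (U+E0100 to U+E01EF), the latter being the selectors used by IVSes. The sketch only looks at code point ranges and does not check whether a sequence is actually registered or standardized.

```python
from typing import List, Optional, Tuple


def classify_selector(cp: int) -> Optional[str]:
    """Classify a code point by variation selector block, or return None."""
    if 0xFE00 <= cp <= 0xFE0F:
        return "VS1-VS16"      # block used by standardized variation sequences
    if 0xE0100 <= cp <= 0xE01EF:
        return "VS17-VS256"    # block used by ideographic variation sequences
    return None


def find_variation_sequences(text: str) -> List[Tuple[str, str, str]]:
    """Return (base, selector, kind) for each selector that follows a character."""
    found = []
    for base, selector in zip(text, text[1:]):
        kind = classify_selector(ord(selector))
        if kind is not None:
            found.append((base, selector, kind))
    return found


sample = "\u8FBB\U000E0100 \u2764\uFE0F"
for base, selector, kind in find_variation_sequences(sample):
    print(f"U+{ord(base):04X} U+{ord(selector):04X} {kind}")
```

Note that the first range also contains the emoji presentation selectors (U+FE0E and U+FE0F), so a requirement phrased only in terms of "variation selectors" would be broader than ideographs.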
|
The basic question is: which clauses of the Unicode standard, and what expected conformance, does the following requirement from the ARIB liaison refer to? "For variation of Kanji characters, the Ideographic Variation Selector defined in [ISO10646] shall be used." In particular, are specifications beyond ISO 10646 required to specify and/or conform to the requirement? In addition, a few examples would be appreciated. |
|
The Timed Text Working Group just discussed
The full IRC log of that discussion
<nigel> Subtopic: Improve the ja character set per ARIB feedback #614
<nigel> github: https://github.com//pull/614
<nigel> Pierre: [shares screen]
<nigel> .. Liaison from ARIB raises the question at hand.
<nigel> .. ARIB kindly suggested character set changes for ja, which is great.
<nigel> .. There's a note about Ideographic Variation Selector.
<nigel> .. However that is not a defined term.
<nigel> .. Atsushi and I have been discussing how to interpret it.
<nigel> .. We need to figure out what that means, so we don't write something different from
<nigel> .. what they intend.
<nigel> .. From Atsushi's last comment I think "ideographic variation sequence"?
<nigel> Atsushi: CJK compatibility ideographs are there for compatibility.
<nigel> .. There can be mismapping between character set and what Unicode says.
<nigel> .. For backward compatibility between local character set and unicode some characters
<nigel> .. have both mappings within [scribe missed].
<nigel> .. I believe that is not related to variation sequence or anything else.
<nigel> .. If someone wants to say about the variation selector usually we say
<nigel> .. "ideographic variation selector" or "ideographic variation sequence"
<nigel> .. so they should mean the same as each other. They are terms used interchangeably.
<nigel> .. I believe what the point means is that the ideographic variation sequences shall be used.
<nigel> Pierre: That's not part of main Unicode, it's part of UCS-37. Does ARIB reference UCS-37?
<nigel> Atsushi: Variation selector itself is in ISO10646
<nigel> Pierre: That's a much broader thing though, includes emoji selectors which I think we don't want.
<nigel> Atsushi: shows [Ideographic variation sequence] in Unicode 17.0.0
<nigel> Pierre: You have to know how to represent it.
<nigel> Atsushi: Representation is described in a separate database, not in ISO10646.
<nigel> Pierre: Before saying you must or should support this I want to know absolutely certainly that
<nigel> .. is what ARIB has in mind. Can we get a sample?
<nigel> .. I don't want to suggest a mandatory thing that's wrong or won't be used.
<nigel> Atsushi: I wonder if I can ask a "side" way from colleagues in NHK.
<nigel> Pierre: Please ask informally! I'm interested as an Editor in knowing which part of Unicode
<nigel> .. this "SHALL" exactly means.
<nigel> .. Just to clarify the terminology that doesn't exactly match the spec.
<nigel> Atsushi: Is it okay to reply to the liaison email by myself?
<nigel> Nigel: Yes I think that would be good. I'd suggest if you can write informally in response
<nigel> .. that we noticed this small difference in language and want to make sure that we understand
<nigel> .. correctly and ask for guidance or even sample data then that would help clear this up for us.
<nigel> .. I don't want to go around a whole formal liaison/response loop which will take a long time.
<nigel> Pierre: [drafts the essential request in the GitHub issue]
<nigel> SUMMARY: @himorin to ask informally for clarification as per the above discussion. |
|
(Still waiting for a reply from ARIB colleagues.) |
|
The Timed Text Working Group just discussed
The full IRC log of that discussion
<nigel> Subtopic: Improve the ja character set per ARIB feedback #614
<nigel> github: https://github.com//pull/614
<wschildbach> pierre: we added a recommend charset based on ARIB input.
<wschildbach> .. unfortunately, there is in the liaison some vagueness. We should make sure we get it right.
<wschildbach> .. we asked for more details but got no clarification.
<wschildbach> .. don't want to remove the text but we need clarification. This is informative (should not a shall), it is useful but not necessary.
<wschildbach> nigel: I think that the ideographic selector is not defined where it says it is.
<wschildbach> .. translation issue?
<wschildbach> atsushi: this is not a stopping issue
<wschildbach> nigel: if your colleague comes later, let's ask them
<wschildbach> .. is there a choice of terminology and we need to use the correct one?
<wschildbach> pierre: this is a complex part with many things falling underneath it.
<wschildbach> .. what would be most useful would be an example of what is meant.
<wschildbach> .. I find it a complex part of the unicode spec.
<wschildbach> .. as atsushi pointed out, terms may have changed. Ideally have an example.
<wschildbach> .. here is sample tet that uses IVS, and here is what we expect the rendering to be.
<wschildbach> s/tet/text/
<wschildbach> .. and we could include a spec action.
<nigel> s/a spec action/in the spec actually
<nigel> s/in/it in
<wschildbach> nigel: this is unresolved right now. so we are saying we can proceed to CRS without resolving?
<wschildbach> atsushi: agrees.
<wschildbach> nigel: we merge later.
<wschildbach> atsushi: this is not normative, so don't need another crs
<wschildbach> nigel: we can put change in and request transition to rec
<wschildbach> .. implementation report will be empty. It is a formality.
<wschildbach> s/crs/CRS/
<nigel> SUMMARY: Hold this PR open pending feedback and hopefully an example, and do not hold up CRS publication
<nigel> forcedDisplay and visibility="hidden" #484
<nigel> s/forced/Subtopic: forced |
|
@himorin I have updated the ja character set with the list you provided: |
|
Could you add the IVS part back, as in the suggestion? IVS is not related to
The slides mention that ARIB STD-B62 "picks 19 glyphs from IVD", rather than supporting any and all contents of the IVD. Can you provide the list of the 19? |
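To illustrate why a bounded list is easier to conform to than all of the IVD, here is a hedged Python sketch of an allow-list check. Everything in it is an assumption for illustration: the function name `unlisted_ivs` is hypothetical, and the single entry in `ALLOWED_IVS` is a placeholder, not one of the 19 sequences, which are not reproduced in this thread.

```python
from typing import List, Set, Tuple

# Placeholder allow-list of (base, selector) code point pairs.
# NOT taken from ARIB documents; replace with the actual list once known.
ALLOWED_IVS: Set[Tuple[int, int]] = {
    (0x8FBB, 0xE0100),
}


def unlisted_ivs(text: str, allowed: Set[Tuple[int, int]]) -> List[Tuple[str, str]]:
    """Return (base, selector) pairs using VS17-VS256 that are not in `allowed`."""
    unlisted = []
    for base, selector in zip(text, text[1:]):
        is_ivs_selector = 0xE0100 <= ord(selector) <= 0xE01EF
        if is_ivs_selector and (ord(base), ord(selector)) not in allowed:
            unlisted.append((base, selector))
    return unlisted


print(unlisted_ivs("\u8FBB\U000E0100", ALLOWED_IVS))        # pair is in the list
print(len(unlisted_ivs("\u8FBB\U000E0101", ALLOWED_IVS)))   # pair is not in the list
```

With a closed list, a validator or renderer only has to handle a known, finite set of sequences; referencing the IVD as a whole would leave that set open-ended.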
|
Add Table 7-8: Operational Ideographic Variation Sequence from |
|
The Timed Text Working Group just discussed
The full IRC log of that discussion
<cpn> Subtopic: Japanese character set
<nigel> github: https://github.com//pull/614
<nigel> s/set/set #614
<cpn> Nigel: We have some feedback from ARIB
<cpn> ... What's the status now?
<cpn> Pierre: Atsushi has asked for text to be added that requires conformance with IVS
<cpn> Atsushi: What ARIB is used is standard IVS and IVD specified in ISO ?? spec
<cpn> ... I asked to remove CJK Compatibility Ideographs, and add a note on using IVS for ideographic characters. This is background material for that
<cpn> Pierre: My concern is IVD is huge, with lots of unrelated stuff. Can we just include a list of the 19 glyphs?
<cpn> Atsushi: ARIB-STD-62 refers to IVD ...
<cpn> Pierre: My objection from the beginning about referencing IVD is that it's unbounded, and we don't want people to have to support all of IVD just to support the Japanese character set
<cpn> ... Can we copy the list?
<cpn> Atsushi: I believe so. I'm not sure exactly where they are
<atsushi> https://www.arib.or.jp/kikaku/kikaku_hoso/tr-b39.html
<cpn> Atsushi: I found the English version
<cpn> ... Look at the second table
<nigel> -> English translation of Fascicle 2 https://www.arib.or.jp/english/html/overview/doc/8-TR-B39v2_5-2p5-E1.pdf
<cpn> Nigel: Found it, Table 7-8, page 3-63, Fascicle 2
<nigel> Table 7-8 in section 7.4.4 includes the 19 IVS characters
<cpn> Pierre: I'll add it to the PR
<cpn> Nigel: In Section 8.4.1, it says the ideographic variation sequence is not operated.
<cpn> Atsushi: This is not a standard, but an operational recommendation
<cpn> ... In the discussion in Japan, they list commonly used variation selector characters. We mention Table 5.2 and 3 in the IMSC document
<cpn> ... I believe this is just a set of characters that are actually used in current broadcast systems
<cpn> Pierre: Not sure how to interpret that sentence, Nigel. Atsushi, what does the original say?
<cpn> Atsushi: I don't have access, I only have the text provided by Ohmata-san
<cpn> Nigel: I think those sequences have proven useful, but not actually required
<cpn> Pierre: I recommend drafting the PR and ask ARIB for feedback
<cpn> Atsushi: The table in 7.4.4 are commonly used in broadcasting in Japan. It describes fallback operation, commonly used for IVS and IVD characters.
<cpn> ... Not all fonts support IVD glyphs, as IVD includes several sets of variation sequences
<cpn> ... But in any case, I'll ask Ohmata-san
<cpn> Nigel: I'll also look at the re-ordering PR
<nigel> SUMMARY: @palemieux to add the 19 IVS, @himorin to check details with contact |
|
@himorin Please share the revised ja character set with ARIB folks for their feedback. |
|
Sorry to leak the PR Preview fix attempt into here, but since it's the only open PR that contains content I thought it worthwhile. I cherry-picked @himorin's fix in #637 and the build step seemed to pass, but the PR description wasn't updated to add the Preview and Diff links. However, manually navigating to https://pr-preview.s3.amazonaws.com/w3c/imsc/pull/614.html shows that the PR Preview did actually build. The diff link I was expecting is https://pr-preview.s3.amazonaws.com/w3c/imsc/614/9ea529d...875ec4c.html but it doesn't look like that has been created; at least, I get an Access Denied page from it. |
Co-authored-by: Nigel Megitt <nigel.megitt@bbc.co.uk>
|
The Timed Text Working Group just discussed
The full IRC log of that discussion
<nigel> Topic: Improve the ja character set per ARIB feedback #614
<nigel> github: https://github.com//pull/614
<nigel> Nigel: We had some good input that we processed last meeting. What's the status?
<nigel> Pierre: [shares screen showing preview of the pull request]
<nigel> .. Atsushi, what do you suggest?
<nigel> Atsushi: ARIB TR document is some sort of operational manual which records the current situation.
<nigel> .. It is not normative.
<nigel> .. It could be changed.
<nigel> Pierre: Remove the TR-B39 sequence?
<nigel> Atsushi: Maybe just note that these are operationally used but not normative.
<nigel> Pierre: suggests removing the explicit list and just referencing TR-B39
<nigel> Atsushi: the note also could apply to CJK ideographic characters
<nigel> Pierre: Make the reference to the ARIB TR a note?
<nigel> Atsushi: Yes something like that, or suggest that IVS is used for CJK and operationally used IVSes are
<nigel> .. used in ARIB.
<nigel> Pierre: What's the down side of referencing ARIB-TR-B39?
<nigel> Atsushi: It's a link from normative text to a non-normative document.
<nigel> Pierre: The entire annex is just a SHOULD not a SHALL.
<nigel> Atsushi: SHOULD is normative too.
<nigel> Pierre: It's useful though.
<nigel> Atsushi: Yes, useful but the normative definition in ARIB STD is that IVS may be used, but operationally
<nigel> .. characters listed in ARIB TR are the ones currently used.
<nigel> Pierre: Exactly.
<nigel> Atsushi: That's why I'm afraid that the TR definition may be changed, so I want to turn that part into a non-normative note.
<nigel> Pierre: Sure, [makes edit that the IVS is a Note.
<nigel> s/e./e.]
<nigel> Pierre: Nigel, are you happy with this?
<nigel> Nigel: Yes. Are there any other issues related to the ARIB ja character set that we should be covering off here?
<nigel> Pierre: [checks] I think so
<nigel> Atsushi: Yes
<nigel> Nigel: Great, let's go for it then.
<nigel> Pierre: [pushes the change]
<nigel> Nigel: The note about 10646 and Unicode - why is that in the ja section? Oh, because it's only referenced in the ja character set
<nigel> Pierre: That's right
<nigel> Atsushi: There are corrections that are only in the ISO spec.
<nigel> .. I approved the PR
<nigel> Nigel: I approved it too
<nigel> Pierre: I have to fix the merge conflicts then I'll merge the PR.
<nigel> SUMMARY: PR to be merged. |
Closes #613
Preview | Diff