Test set prediction scores file does not properly format the prediction and reference strings #959

@mmartin9684-sil

Description

When the test set predictions are written to the predictions scores file (test.trg-predictions.detok.txt..scores.tsv), the prediction string and the reference string(s) are not properly escaped. If any of these strings contains an unmatched quotation mark, the file will not load correctly in tools that parse TSV (e.g., Google Sheets). The parser treats the unmatched quotation mark as the start of a quoted field and searches for a closing quotation mark, reading past tab separators and line breaks, so multiple fields and/or lines of the TSV file are treated as part of the same string.

For instance, the Prediction string in this line of a prediction file has an unterminated double-quote:
76 14.54 42.86 16.67 10.00 6.25 1.000 39.15 41.80 39.55 25.35 0.55227655 "इमिगु तुति हिया स्‍यायेत याकनं च्‍वनी। मनूतय्‌त स्‍यायेत इमिगु तुति न्‍ह्यज्‍याः।
Importing this predictions file into Google Sheets causes the Prediction string, the Reference string, and the following two lines of predictions to be combined into a single value for the Prediction field. The unmatched quotation mark needs to be escaped when it is written to the predictions scores file.
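
One possible way to get this escaping is to write the rows with Python's standard `csv` module instead of joining fields with tabs by hand. The sketch below is only an illustration under that assumption: the function name `write_scores_tsv` and the sample rows are hypothetical and are not taken from the project's code. With `QUOTE_MINIMAL`, any field containing a quotation mark is wrapped in quotes and the embedded quotation mark is doubled, which is the convention spreadsheet importers generally expect.

```python
import csv
import io


def write_scores_tsv(out, rows):
    """Write rows as TSV, quoting only fields that need it.

    Hypothetical helper for illustration. Fields containing a double
    quote, a tab, or a newline are wrapped in quotes, and embedded
    double quotes are doubled.
    """
    writer = csv.writer(
        out,
        delimiter="\t",
        quotechar='"',
        quoting=csv.QUOTE_MINIMAL,
        lineterminator="\n",
    )
    for row in rows:
        writer.writerow(row)


# Made-up sample rows; the first prediction starts with an unmatched quote.
rows = [
    ["76", "14.54", '"unterminated prediction', "reference one"],
    ["77", "20.00", "plain prediction", "reference two"],
]

buffer = io.StringIO()
write_scores_tsv(buffer, rows)
tsv_text = buffer.getvalue()

# Reading the text back with a TSV-aware parser recovers the original fields.
parsed = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
```

Here the round trip through `csv.reader` returns the same two rows, so the unmatched quotation mark no longer swallows the following fields or lines.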

Metadata

Labels

bug (Something isn't working)
pipeline 5: test (Issue relating to testing a model quality with Bleu or other metrics.)

Status

Ready
