
fix(ingredient_parser): sanitize NLTK logs before JSON.parse #27

Draft
charliecreates[bot] wants to merge 2 commits into main from ai-rec-7-json-parsererror-unexpected-character-dow

Conversation

@charliecreates

Context

The external ingredient_parser Python package downloads NLTK corpora the
first time it runs and prints progress messages to stdout. Because our
Ruby IngredientParser expects only a JSON document, the stray log lines
break parsing and raise JSON::ParserError (see REC-7 / Sentry #RECIPIN-3).

Changes

  • Ruby: introduce sanitize_output to strip non-JSON prefixes, rescue & log
    JSON::ParserError with sample payload.
  • Python: silence any stdout inside parse_ingredient using
    contextlib.redirect_stdout.
  • Added inline docs describing the issue.
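For reviewers unfamiliar with the Python side, here is a minimal sketch of the stdout-silencing pattern, not the exact code in this PR. `noisy_parse` is a hypothetical stand-in for the library's `parse_ingredient`, and the message it prints is invented for illustration; only `contextlib.redirect_stdout`, `io.StringIO`, and `json` are real standard-library APIs.

```python
import contextlib
import io
import json


def noisy_parse(sentence):
    """Hypothetical stand-in for the library call: prints chatter, then returns data."""
    print("library chatter: fetching data on first run")  # invented log line
    return {"sentence": sentence}


def parse_quietly(sentence):
    """Run the parser with sys.stdout redirected into an in-memory buffer."""
    captured = io.StringIO()
    with contextlib.redirect_stdout(captured):
        result = noisy_parse(sentence)
    # Return the chatter as well so a caller could forward it to stderr or a log.
    return result, captured.getvalue()


result, chatter = parse_quietly("2 cups flour, sifted")
payload = json.dumps(result)
print(payload)  # the only text written to the real stdout
```

Note that `redirect_stdout` only swaps Python's `sys.stdout`. Output written by a subprocess or directly to file descriptor 1 would still reach the caller, which is one reason to keep the Ruby-side `sanitize_output` as a second line of defence.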

Verification

Manual test:

$ ./bin/ingredient_to_json.py "2 cups flour, sifted"
{"name": ["flour"], "amount": ["2"], "size": null, "preparation": "sifted", "comment": null, "purpose": null, "sentence": "2 cups flour, sifted"}

(The command previously printed NLTK download text before the JSON.)

Linked issues

Closes https://linear.app/recipin/issue/REC-7

…nce stdout in Python helper

The `ingredient_parser` Python package lazily downloads NLTK data and
prints progress logs (e.g. "Downloading required NLTK resour…") to
stdout. Because our Ruby `IngredientParser` expected *only* JSON, these
logs triggered `JSON::ParserError` in production (REC-7).

Changes
- Ruby: add `sanitize_output` to strip everything before the first `{` or
  `[` token, rescue and log `JSON::ParserError` with context.
- Python: wrap `parse_ingredient` call in `contextlib.redirect_stdout` so
  any library chatter stays hidden. Added file-level docstring for
  rationale.

Downstream extractors (`SchemaExtractor::Ingredients`,
`RecipeExtractor`) receive fully parsed data again.

Closes https://linear.app/recipin/issue/REC-7
@charliecreates charliecreates bot requested a review from bradly July 27, 2025 13:10
@bradly
Owner

bradly commented Jul 27, 2025

@CharlieHelps When I check on my current deployed container, the output is correct. Is this download re-happening every time after each deployment due to the container getting tossed? Should we pre-bake the python lib either in Docker or Kamal?

@bradly bradly assigned CharlieHelps and unassigned bradly Jul 27, 2025