This document describes the approach taken in the InvoiceParser implementation for parsing Switch invoices using Veryfi OCR text. It covers the overall architecture, format detection logic, parsing strategies, error handling, and testing considerations.
- Paradigm: The solution follows an object‑oriented design centered around the `InvoiceParser` class, which encapsulates all behavior needed to process a single invoice: calling Veryfi, validating the format, and extracting structured fields from `ocr_text`.
- Single responsibility per method:
  - `process_document()` orchestrates the full pipeline: calls the Veryfi API, runs the format detector, then triggers parsing steps.
  - `parse_vendor`, `parse_bill_to`, `parse_general_fields`, `parse_line_items`, and `parse_total` each handle one logical section of the invoice.
  - `matches_switch_format()` is a separate pure function responsible only for deciding whether an `ocr_text` belongs to the “Switch invoice” layout.
- Configuration via environment variables: Veryfi credentials are read from environment variables and validated in `__init__`, so secrets are not hard‑coded and failures are explicit.
- Data modeling:
  - Simple dictionaries (`vendor`, `bill_to`, `general_fields`) hold top‑level fields.
  - Line items are stored as a list of dicts with explicit keys (`sku`, `description`, `quantity`, `tax_rate`, `price`, `total`), which can later be converted to a pandas `DataFrame`.
  - `DetectResult` is a `@dataclass`, making the format detector’s return value explicit (`is_supported`, `reason`).
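
The configuration and data-modeling points above can be illustrated with a minimal sketch. The environment variable names, the `InvoiceParserSketch` class name, the optional `env` parameter, and the choice of `ValueError` are assumptions made for illustration; the actual implementation may differ.

```python
import os
from dataclasses import dataclass

# Hypothetical variable names; the real ones are not listed in this document.
REQUIRED_ENV_VARS = (
    "VERYFI_CLIENT_ID",
    "VERYFI_CLIENT_SECRET",
    "VERYFI_USERNAME",
    "VERYFI_API_KEY",
)


@dataclass
class DetectResult:
    """Explicit return value of the format detector."""
    is_supported: bool
    reason: str


class InvoiceParserSketch:
    """Illustrative skeleton only, not the real InvoiceParser."""

    def __init__(self, env=None):
        # Accepting a mapping makes the sketch testable without touching os.environ.
        env = os.environ if env is None else env
        missing = [name for name in REQUIRED_ENV_VARS if not env.get(name)]
        if missing:
            # Fail fast so a misconfigured environment is obvious (exception type assumed).
            raise ValueError("Missing Veryfi credentials: " + ", ".join(missing))
        self.credentials = {name: env[name] for name in REQUIRED_ENV_VARS}
        self.reset()

    def reset(self):
        # Clear all parsed state so the instance can process another document.
        self.ocr_text = None
        self.vendor = {}
        self.bill_to = {}
        self.general_fields = {}
        self.line_items = None
```

Reading credentials once in the constructor keeps every later method free of configuration checks.
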
- The `matches_switch_format(ocr_text)` function implements the requirement “support any document with the same format while excluding other documents”.
- It computes several boolean indicators:
  - Vendor markers: presence of `"switch"` and the specific `"PO Box 674592"` string.
  - Header labels: `"Invoice Date"`, `"Due Date"`, `"Invoice No"`.
  - Line‑item table headers: `"Description"`, `"Quantity"`, `"Rate"`, `"Amount"`.
  - Footer marker: `"Please update your system"`.
  - Vendor line structure: reuses the same header logic as the parser (slice between the first and second `"Invoice"`, remove `"Page X of Y"`, and search for `Invoice\n<name>\t<city, ST ZIP>`).
- These indicators are aggregated into a `score`. The document is accepted only if:
  - `score >= 4`,
  - the vendor line is present, and
  - header labels are present.
- `process_document()` calls `matches_switch_format` immediately after obtaining `ocr_text`. If `is_supported` is `False`, it raises `ValueError("Document format not supported: ...")` and does not run any parsing. This cleanly separates supported invoices from all other documents (including the candidate’s own test document).
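
The detection rule described above can be sketched as follows. This is a simplified reading of the description, not the actual implementation: it assumes each of the five indicators contributes one point to `score`, that the `"switch"` check is case-insensitive, and that the vendor line is matched with the same regex the parser uses. The `reason` wording is also made up.

```python
import re
from dataclasses import dataclass


@dataclass
class DetectResult:
    is_supported: bool
    reason: str


def matches_switch_format(ocr_text):
    text = ocr_text or ""

    # Indicators 1-4: plain substring checks.
    vendor_markers = "switch" in text.lower() and "PO Box 674592" in text
    header_labels = all(s in text for s in ("Invoice Date", "Due Date", "Invoice No"))
    table_headers = all(s in text for s in ("Description", "Quantity", "Rate", "Amount"))
    footer = "Please update your system" in text

    # Indicator 5: vendor line inside the header slice, which runs from the
    # first "Invoice" up to (not including) the second one.
    vendor_line = False
    first = text.find("Invoice")
    if first != -1:
        second = text.find("Invoice", first + 1)
        header = text[first:second] if second != -1 else text[first:]
        header = re.sub(r"Page\s+\d+\s+of\s+\d+", "", header)
        vendor_line = re.search(r"Invoice\s*\n([^\t\n]+)\t([^\n]+)", header) is not None

    # Assumed aggregation: one point per indicator that is true.
    score = sum([vendor_markers, header_labels, table_headers, footer, vendor_line])
    supported = score >= 4 and vendor_line and header_labels
    reason = f"score={score}, vendor_line={vendor_line}, header_labels={header_labels}"
    return DetectResult(is_supported=supported, reason=reason)
```

Requiring the vendor line and the header labels in addition to the score means a document that merely shares the table and footer wording with a Switch invoice is still rejected.
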
- Parsing operates only on OCR text returned by Veryfi (`self.ocr_text`), never on raw PDFs.
- Each section of the invoice is parsed using targeted, layout‑aware regexes:
  - Vendor:
    - Slice a `header` substring between the first and second `"Invoice"` and remove `"Page X of Y"`.
    - Extract `vendor_name` and `vendor_city_state` with `Invoice\s*\n([^\t\n]+)\t([^\n]+)`.
    - Extract `PO Box` with `(PO Box\s*\d+)[^\n]*` and combine with the city/state line into a single address.
  - Bill‑to block:
    - Slice between `"Invoice No."` and `"Account No."`.
    - Extract three consecutive non‑empty lines (name + two address lines) with `\n([^\n]+)\n([^\n]+)\n([^\n]+)\n`.
  - General fields:
    - Invoice and due dates and invoice number from the header table using a strict pattern for dates and numeric ID.
    - Account number and PO number from the `"Account No."` block using `Account No\.[^\n]*\n[^\n]*\n([A-Z0-9\-]+)\s+([A-Z0-9\-]+)`.
  - Line items:
    - Slice `items_block` between `"Description"` and `"Please update your system"`, then drop the header row.
    - Split into physical lines and classify each line as:
      - an item line (contains three decimal numbers) or
      - a continuation line (no triple‑number pattern).
    - Build “logical items” by:
      - For each new item line, closing the previous item and starting a new description + numeric part.
      - Appending continuation lines to the current description.
    - Use a second regex on the numeric part to extract `quantity`, `rate`, and `amount` as floats.
- This two‑pass approach (line classification + numeric parsing) makes the solution robust to variations like:
  - different numbers of tabs between columns,
  - long descriptions that wrap to the next line, and
  - slight OCR differences between invoices.
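
The two‑pass line‑item strategy can be sketched roughly as below. The exact regexes are not given in this document, so the number pattern (`NUM`), the assumption that the three numbers sit at the end of an item line, and the output keys are illustrative choices rather than the real implementation.

```python
import re

# Assumed number shape: optional sign, optional thousands separators, decimals.
NUM = r"-?\d[\d,]*\.\d+"
# Pass 1 pattern: an item line ends with three decimal numbers.
ITEM_LINE = re.compile(rf"^(?P<desc>.*?)\s*(?P<nums>{NUM}\s+{NUM}\s+{NUM})\s*$")
# Pass 2 pattern: pull the three numbers out of the numeric part.
NUMERIC = re.compile(rf"({NUM})\s+({NUM})\s+({NUM})")


def parse_line_items(ocr_text):
    if not ocr_text:
        return None
    start = ocr_text.find("Description")
    end = ocr_text.find("Please update your system")
    if start == -1 or end == -1 or end <= start:
        return None
    lines = ocr_text[start:end].split("\n")[1:]  # drop the header row

    # Pass 1: classify physical lines and build logical items.
    logical = []  # each entry is [description, numeric_part]
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = ITEM_LINE.match(line)
        if match:
            # A new item line closes the previous item and starts a new one.
            logical.append([match.group("desc").strip(), match.group("nums")])
        elif logical:
            # Continuation line: extend the current description.
            logical[-1][0] = (logical[-1][0] + " " + line).strip()

    # Pass 2: convert the numeric part of each logical item to floats.
    items = []
    for description, nums in logical:
        match = NUMERIC.search(nums)
        if match is None:
            continue
        quantity, rate, amount = (float(g.replace(",", "")) for g in match.groups())
        items.append({"description": description, "quantity": quantity,
                      "rate": rate, "amount": amount})
    return items or None
```

Because columns are separated with `\s+`, one tab or several between columns are handled the same way, and a wrapped description simply extends the preceding item.
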
- Missing OCR text: every `parse_*` method first checks `if not self.ocr_text` and initializes its output to `None`/empty structures, preventing attribute errors.
- Missing matches: regex searches are always guarded:
  - If `match` is `None`, the corresponding fields are set to `None` instead of trying to access `.group`.
  - If no valid line items are found, `self.line_items` is set to `None`.
- API credentials: `__init__` validates that all Veryfi credentials are present in environment variables and raises immediately if anything is missing.
- Idempotence: `reset()` clears all parsed state, and `set_invoice()` builds a fresh invoice dict; the parser can be reused across multiple documents.
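
As an illustration of the guarding pattern, here is a hedged sketch of a bill‑to parser written as a standalone function. The slice markers and regex come from the description above; the function shape, the output keys (`name`, `address`), and the way the two address lines are joined are assumptions.

```python
import re


def parse_bill_to(ocr_text):
    # Default output: every field present but set to None.
    empty = {"name": None, "address": None}

    # Guard 1: no OCR text at all.
    if not ocr_text:
        return empty

    # Guard 2: the slice markers must both exist, in the expected order.
    start = ocr_text.find("Invoice No.")
    end = ocr_text.find("Account No.")
    if start == -1 or end == -1 or end <= start:
        return empty
    block = ocr_text[start:end]

    # Guard 3: never call .groups() on a failed match.
    match = re.search(r"\n([^\n]+)\n([^\n]+)\n([^\n]+)\n", block)
    if match is None:
        return empty

    name, line1, line2 = (g.strip() for g in match.groups())
    return {"name": name, "address": f"{line1}, {line2}"}
```

Returning the same dictionary shape on every path means callers can read the fields without checking for missing keys.
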
Tests were written in the `tests/` folder prior to the final class implementation. More tests need to be added in the future.