Add type coercion for LLM output before validation errors #28

@michaeldistel

Summary

Attempt to parse/coerce field values to the expected type before throwing validation errors, making the system more forgiving of minor LLM formatting mistakes.

Problem

Models sometimes return the correct value with the wrong JSON type, for example a number serialized as a string. Validation currently rejects these fields outright.

Example from benchmarks:

mistral:7b on 02-receipt-medium:
  ⚠️  [subtotal] Field 'subtotal' must be a number, got string
  ⚠️  [tax] Field 'tax' must be a number, got string  
  ⚠️  [total] Field 'total' must be a number, got string

The model likely returned:

{
  "subtotal": "8.75",
  "tax": "0.70",
  "total": "9.45"
}

Instead of:

{
  "subtotal": 8.75,
  "tax": 0.70,
  "total": 9.45
}

The data is valid; only the JSON type is wrong.

Proposed Solution

Add type coercion in src/core/validator.ts before validation:

String → Number

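case 'number':
    let numValue = value;
    
    // Try to coerce a numeric string to a number
    if (typeof value === 'string') {
        const trimmed = value.trim();
        // Number() rejects trailing garbage such as "8.75abc", which
        // parseFloat() would accept as 8.75. Number('') is 0, so empty
        // strings are excluded explicitly.
        const parsed = Number(trimmed);
        if (trimmed !== '' && Number.isFinite(parsed)) {
            numValue = parsed;
            // Optionally update the data object with the coerced value
        }
    }
    
    if (typeof numValue !== 'number') {
        errors.push(...);
    }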
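
As a standalone illustration of the string-to-number rule, here is a minimal sketch. The helper name coerceToNumber is hypothetical and does not exist in the validator; the sketch assumes strict parsing is wanted, so it uses Number() rather than parseFloat(), because parseFloat("8.75abc") returns 8.75 while Number("8.75abc") returns NaN.

```typescript
// Hypothetical helper (not part of src/core/validator.ts).
// Returns the numeric value, or undefined when the input is not
// an unambiguous number.
function coerceToNumber(value: unknown): number | undefined {
    if (typeof value === 'number') {
        return isFinite(value) ? value : undefined;
    }
    if (typeof value !== 'string') {
        return undefined;
    }
    const trimmed = value.trim();
    // Number('') evaluates to 0, so reject empty strings explicitly.
    if (trimmed === '') {
        return undefined;
    }
    // Number() yields NaN for strings with trailing garbage.
    const parsed = Number(trimmed);
    return isFinite(parsed) ? parsed : undefined;
}
```

With this sketch, "8.75" and " 0.70 " coerce to 8.75 and 0.7, while "8.75abc", "" and non-string, non-number inputs yield undefined and would still fail validation. Note that Number() also accepts forms such as "1e3" and "0x10"; a stricter pattern check may be preferable if those should be rejected.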

String → Boolean

case 'boolean':
    let boolValue = value;
    
    // Try to coerce string to boolean
    if (typeof value === 'string') {
        const lower = value.toLowerCase().trim();
        if (lower === 'true' || lower === '1' || lower === 'yes') {
            boolValue = true;
        } else if (lower === 'false' || lower === '0' || lower === 'no') {
            boolValue = false;
        }
    }
    
    if (typeof boolValue !== 'boolean') {
        errors.push(...);
    }

String → Date

The validator already accepts both strings and Date objects, but it could recognize more date formats:

case 'date':
    // Already accepts both string and Date object
    // Could add more date format parsing (e.g., "12/15/2024" → ISO format)
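
One way the extra date parsing might look is sketched below. The helper usDateToIso is hypothetical, and it assumes slash-separated dates are US-style MM/DD/YYYY. That assumption matters: a value like "03/04/2024" is ambiguous between March 4 and April 3, so this kind of coercion is probably best kept opt-in.

```typescript
// Hypothetical helper: converts MM/DD/YYYY to ISO YYYY-MM-DD.
// Assumes US month-first ordering; returns undefined for anything else.
function usDateToIso(value: string): string | undefined {
    const match = /^(\d{1,2})\/(\d{1,2})\/(\d{4})$/.exec(value.trim());
    if (match === null) {
        return undefined;
    }
    const month = Number(match[1]);
    const day = Number(match[2]);
    const year = Number(match[3]);
    // Round-trip through Date.UTC to reject impossible dates such as
    // 02/30/2024, which Date would otherwise roll over into March.
    const date = new Date(Date.UTC(year, month - 1, day));
    if (
        date.getUTCFullYear() !== year ||
        date.getUTCMonth() !== month - 1 ||
        date.getUTCDate() !== day
    ) {
        return undefined;
    }
    // toISOString() gives "YYYY-MM-DDTHH:mm:ss.sssZ"; keep the date part.
    return date.toISOString().slice(0, 10);
}
```

Under these assumptions, "12/15/2024" becomes "2024-12-15", while "02/30/2024" and strings in other formats return undefined and are left for the existing validation to handle.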

Number/Boolean → String

case 'string':
    let strValue = value;
    
    // Coerce number/boolean to string
    if (typeof value === 'number' || typeof value === 'boolean') {
        strValue = String(value);
    }
    
    if (typeof strValue !== 'string') {
        errors.push(...);
    }

Benefits

  1. Better model compatibility: Different models have different JSON serialization behaviors
  2. More forgiving extraction: Focus on data accuracy rather than formatting
  3. Reduced false negatives: Valid data won't be rejected for minor formatting issues
  4. Still validates: Coercion only happens when the conversion is unambiguous; anything else still fails validation

Considerations

  • Data mutation: Should coerced values update the original data object?
  • Logging: Should we log when coercion happens for debugging?
  • Opt-in/out: Should this be configurable per schema?
  • Confidence scoring: Should coerced fields have lower confidence?
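
To make these questions concrete, here is one possible shape, offered as a sketch rather than a decision. It leaves the input untouched and returns a coerced copy, records each coercion so the caller can log it or lower a confidence score, and takes a strictTypes flag to turn coercion off. All names here (FieldType, CoercionRecord, coerceFields) are hypothetical, and the flat field-to-type schema is a simplification of whatever the real schema format is.

```typescript
type FieldType = 'number' | 'boolean' | 'string';

interface CoercionRecord {
    field: string;
    from: string;     // typeof the original value
    to: FieldType;
}

interface CoercionResult {
    data: Record<string, unknown>;  // coerced copy; the input is not mutated
    coercions: CoercionRecord[];    // available for logging or confidence scoring
}

function coerceFields(
    data: Record<string, unknown>,
    schema: Record<string, FieldType>,
    strictTypes: boolean = false,
): CoercionResult {
    const coerced: Record<string, unknown> = {};
    const coercions: CoercionRecord[] = [];
    for (const field of Object.keys(data)) {
        const value = data[field];
        const expected = schema[field];
        let next: unknown = value;
        if (!strictTypes) {
            if (expected === 'number' && typeof value === 'string') {
                const trimmed = value.trim();
                const parsed = Number(trimmed);
                if (trimmed !== '' && isFinite(parsed)) {
                    next = parsed;
                    coercions.push({ field, from: 'string', to: 'number' });
                }
            } else if (
                expected === 'string' &&
                (typeof value === 'number' || typeof value === 'boolean')
            ) {
                next = String(value);
                coercions.push({ field, from: typeof value, to: 'string' });
            }
        }
        coerced[field] = next;
    }
    return { data: coerced, coercions };
}
```

For the receipt example, passing { subtotal: "8.75", tax: "0.70", total: 9.45 } with all three fields typed as number would yield numeric values in the copy and two coercion records, while the original object keeps its strings. With strictTypes set to true, nothing is coerced and validation behaves as it does today.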

Implementation

  1. Add coercion logic to validateFieldValue() in src/core/validator.ts
  2. Add tests for each coercion case
  3. Update documentation to explain coercion behavior
  4. Consider adding strictTypes schema option to disable coercion

Priority

Medium. Coercion would improve model compatibility and reduce false negatives, but the current strict validation is also valuable.
