Summary
Attempt to parse/coerce field values to the expected type before throwing validation errors, making the system more forgiving of minor LLM formatting mistakes.
Problem
Models sometimes return the correct data in the wrong JSON type (for example, a number wrapped in quotes). Currently this fails validation immediately.
Example from benchmarks:
mistral:7b on 02-receipt-medium:
⚠️ [subtotal] Field 'subtotal' must be a number, got string
⚠️ [tax] Field 'tax' must be a number, got string
⚠️ [total] Field 'total' must be a number, got string
The model likely returned:
{
  "subtotal": "8.75",
  "tax": "0.70",
  "total": "9.45"
}
Instead of:
{
  "subtotal": 8.75,
  "tax": 0.70,
  "total": 9.45
}
This is valid data, just wrong type formatting.
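A minimal reproduction of the mismatch, assuming the number check is a plain typeof test (the actual check in src/core/validator.ts is not shown in this issue, so this is only a sketch):

```typescript
// Hypothetical reproduction; `raw` mirrors the quoted-number output shown above.
const raw = '{"subtotal": "8.75", "tax": "0.70", "total": "9.45"}';
const parsed: Record<string, unknown> = JSON.parse(raw);

// A strict typeof check rejects every field, even though each string
// is a well-formed decimal.
const rejected = Object.entries(parsed)
  .filter(([, v]) => typeof v !== "number")
  .map(([k]) => k);

console.log(rejected); // ["subtotal", "tax", "total"]
```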
Proposed Solution
Add type coercion in src/core/validator.ts before validation:
String → Number
case 'number':
  let numValue = value;
  // Try to coerce string to number
  if (typeof value === 'string') {
    const parsed = parseFloat(value.trim());
    if (isValidNumber(parsed)) {
      numValue = parsed;
      // Optionally update the data object with coerced value
    }
  }
  if (typeof numValue !== 'number') {
    errors.push(...);
  }
String → Boolean
case 'boolean':
  let boolValue = value;
  // Try to coerce string to boolean
  if (typeof value === 'string') {
    const lower = value.toLowerCase().trim();
    if (lower === 'true' || lower === '1' || lower === 'yes') {
      boolValue = true;
    } else if (lower === 'false' || lower === '0' || lower === 'no') {
      boolValue = false;
    }
  }
  if (typeof boolValue !== 'boolean') {
    errors.push(...);
  }
String → Date
Already somewhat lenient (accepts Date objects), but could be more forgiving:
case 'date':
  // Already accepts both string and Date object
  // Could add more date format parsing (e.g., "12/15/2024" → ISO format)
Number → String
case 'string':
  let strValue = value;
  // Coerce number/boolean to string
  if (typeof value === 'number' || typeof value === 'boolean') {
    strValue = String(value);
  }
  if (typeof strValue !== 'string') {
    errors.push(...);
  }
Benefits
- Better model compatibility: Different models have different JSON serialization behaviors
- More forgiving extraction: Focus on data accuracy rather than formatting
- Reduced false negatives: Valid data won't be rejected for minor formatting issues
- Still validates: Type coercion only happens when safe/unambiguous
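What counts as "safe/unambiguous" needs a concrete rule. One possible rule is sketched below, with two caveats. First, parseFloat stops at the first invalid character, so parseFloat("12abc") returns 12; a whole-string check avoids accepting such input. Second, a date like "12/15/2024" can only be converted if a field order is assumed. The helper names, the regular expression, and the month-first (US) assumption are all illustrative and not taken from the project:

```typescript
// Hypothetical helpers; names and rules are illustrative, not from the project.

/** Returns a finite number only when the whole trimmed string is a plain decimal. */
function coerceToNumber(value: unknown): number | undefined {
  if (typeof value === "number") return Number.isFinite(value) ? value : undefined;
  if (typeof value !== "string") return undefined;
  const trimmed = value.trim();
  // Whole-string match: rejects "", "12abc", "1,234", "0x10", "1e3", "Infinity".
  if (!/^[+-]?(\d+\.?\d*|\.\d+)$/.test(trimmed)) return undefined;
  const n = Number(trimmed);
  return Number.isFinite(n) ? n : undefined;
}

/** Converts "MM/DD/YYYY" to "YYYY-MM-DD". ASSUMPTION: month comes first (US order). */
function coerceUsDateToIso(value: string): string | undefined {
  const m = /^(\d{1,2})\/(\d{1,2})\/(\d{4})$/.exec(value.trim());
  if (!m) return undefined;
  const month = Number(m[1]);
  const day = Number(m[2]);
  const year = Number(m[3]);
  // Round-trip through Date.UTC to reject impossible dates such as 02/30/2024.
  const d = new Date(Date.UTC(year, month - 1, day));
  if (
    d.getUTCFullYear() !== year ||
    d.getUTCMonth() !== month - 1 ||
    d.getUTCDate() !== day
  ) {
    return undefined;
  }
  return d.toISOString().slice(0, 10);
}
```

A value such as "03/04/2024" is valid under both month-first and day-first readings, so a date rule like this is probably best kept opt-in.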
Considerations
- Data mutation: Should coerced values update the original data object?
- Logging: Should we log when coercion happens for debugging?
- Opt-in/out: Should this be configurable per schema?
- Confidence scoring: Should coerced fields have lower confidence?
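One way to keep these questions open is to have coercion return a result record instead of writing into the input. The sketch below uses the boolean tokens from the proposal above. The CoercionResult shape and both function names are hypothetical:

```typescript
// Hypothetical shape; none of these names come from the project.
interface CoercionResult<T> {
  value: T;
  coerced: boolean; // true when the input had to be converted
  note?: string;    // human-readable trace, usable for debug logging
}

function coerceBoolean(value: unknown): CoercionResult<boolean> | undefined {
  if (typeof value === "boolean") return { value, coerced: false };
  if (typeof value !== "string") return undefined;
  const lower = value.trim().toLowerCase();
  if (["true", "1", "yes"].includes(lower)) {
    return { value: true, coerced: true, note: `string "${value}" -> true` };
  }
  if (["false", "0", "no"].includes(lower)) {
    return { value: false, coerced: true, note: `string "${value}" -> false` };
  }
  return undefined; // not a recognised token; the caller reports a validation error
}

// Builds a new object instead of mutating the input, and collects notes
// that a caller could log or use to lower a confidence score.
function coerceBooleans(data: Record<string, unknown>, fields: string[]) {
  const out: Record<string, unknown> = { ...data };
  const notes: string[] = [];
  for (const field of fields) {
    const result = coerceBoolean(data[field]);
    if (result?.coerced) {
      out[field] = result.value;
      notes.push(`${field}: ${result.note}`);
    }
  }
  return { data: out, notes };
}
```

With this shape, whether to mutate, log, or adjust confidence becomes a decision for the caller rather than something fixed inside the validator.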
Implementation
- Add coercion logic to validateFieldValue() in src/core/validator.ts
- Add tests for each coercion case
- Update documentation to explain coercion behavior
- Consider adding strictTypes schema option to disable coercion
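A rough sketch of how a strictTypes option could gate coercion for the number case. The real signature of validateFieldValue() is not shown in this issue, so the function name, the options object, and the return shape here are assumptions. The error text copies the format of the benchmark warnings above:

```typescript
// Hypothetical signature; the real validateFieldValue() is not shown in this issue.
interface FieldOptions {
  strictTypes?: boolean; // when true, no coercion is attempted
}

interface NumberFieldResult {
  value?: number;
  errors: string[];
}

function validateNumberField(
  name: string,
  value: unknown,
  opts: FieldOptions = {},
): NumberFieldResult {
  let candidate: unknown = value;
  if (!opts.strictTypes && typeof value === "string" && value.trim() !== "") {
    // Number() is used for brevity; it also accepts forms such as "1e3" and "0x10".
    const n = Number(value.trim());
    if (Number.isFinite(n)) candidate = n;
  }
  if (typeof candidate !== "number" || !Number.isFinite(candidate)) {
    return { errors: [`Field '${name}' must be a number, got ${typeof value}`] };
  }
  return { value: candidate, errors: [] };
}
```

The same three inputs (a coercible string, a non-coercible string, and a coercible string with strictTypes set) could serve as the first test cases for each coercion type.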
Priority
Medium - Would improve model compatibility and reduce false negatives, but current strict validation is also valuable
Related
- Seen in model comparison benchmarks (mistral:7b returning string numbers)
- Related to #26 (Add support for additional data types)