Every extraction includes detailed confidence metrics to help you assess the quality of the results.

```json
{
  "confidence": {
    "image_analysis": 85,
    "data_extraction": 90,
    "validation": 88,
    "overall": 88
  }
}
```

- `image_analysis` (0-100): Quality of the source image or text, based on OCR confidence or PDF text extraction quality.
- `data_extraction` (0-100): Accuracy of data extraction from the text, based on the LLM's extraction confidence.
- `validation` (0-100): Schema validation success rate. Reduced by 5% per missing required field or type mismatch (maximum 30% penalty).
- `overall` (0-100): Combined confidence score, calculated as the average of the other three metrics.
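
As a quick check of the averaging rule, the sketch below recomputes `overall` from the three component scores. Rounding to the nearest integer is an assumption (the text only says "average"), and the function name `overall_confidence` is illustrative, not part of the API.

```python
def overall_confidence(image_analysis: float, data_extraction: float, validation: float) -> int:
    """Average the three component scores (rounding is an assumption)."""
    scores = (image_analysis, data_extraction, validation)
    return round(sum(scores) / len(scores))


# Component scores from the sample response above.
print(overall_confidence(85, 90, 88))  # 88
```

The same calculation reproduces the `overall` values in the example responses later in this section.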

Use the overall score to decide how much review the extracted data needs:

| Score Range | Quality | Recommendation |
|---|---|---|
| 90-100 | Excellent | Data is highly reliable; use with confidence |
| 75-89 | Good | Data is generally reliable; spot-check important fields |
| 60-74 | Fair | Review extracted data carefully |
| 0-59 | Poor | Manual review required; consider re-processing |
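
The score ranges translate directly into a small lookup. The following is a minimal sketch; `quality_band` is a hypothetical helper name, and the thresholds are copied from the table.

```python
def quality_band(overall: float) -> tuple[str, str]:
    """Return (quality, recommendation) for an overall score in 0-100."""
    if not 0 <= overall <= 100:
        raise ValueError(f"score out of range: {overall}")
    if overall >= 90:
        return "Excellent", "Use with confidence"
    if overall >= 75:
        return "Good", "Spot-check important fields"
    if overall >= 60:
        return "Fair", "Review extracted data carefully"
    return "Poor", "Manual review required, consider re-processing"


quality, recommendation = quality_band(88)
print(quality)  # Good
```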

The system automatically improves extraction quality in several ways.

If the overall confidence is below 60, the system retries the extraction with enhanced settings:
- Increased OCR preprocessing
- More detailed extraction prompts
- Stricter validation
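
The retry rule can be pictured roughly as follows. This only illustrates the threshold logic: the service performs the retry itself, and `run_extraction`, its `enhanced` flag, the single-retry limit, and the "keep the better attempt" rule are all hypothetical stand-ins rather than documented behavior.

```python
from typing import Callable

RETRY_THRESHOLD = 60  # overall confidence below this triggers a retry


def extract_with_retry(run_extraction: Callable[[bool], dict]) -> dict:
    """Run once with default settings; retry once with enhanced settings if needed.

    `run_extraction(enhanced)` is a hypothetical callable returning a result
    dict shaped like {"confidence": {"overall": <0-100>}}.
    """
    result = run_extraction(False)
    if result["confidence"]["overall"] < RETRY_THRESHOLD:
        retried = run_extraction(True)
        # Keep whichever attempt scored higher (an assumption).
        if retried["confidence"]["overall"] >= result["confidence"]["overall"]:
            result = retried
    return result


# Stand-in extractor: the enhanced pass scores higher on a faded scan.
def fake_extraction(enhanced: bool) -> dict:
    return {"confidence": {"overall": 72 if enhanced else 55}}


print(extract_with_retry(fake_extraction)["confidence"]["overall"])  # 72
```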
Missing required fields or type mismatches reduce validation confidence:
- Missing required field: -5% per field
- Type mismatch: -5% per field
- Maximum penalty: -30%
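
A worked version of the penalty arithmetic, as a minimal sketch. It assumes the validation score starts at 100 and that each 5% corresponds to 5 points on the 0-100 scale; neither detail is spelled out above, and `validation_confidence` is an illustrative name.

```python
PENALTY_PER_ISSUE = 5  # points per missing required field or type mismatch
MAX_PENALTY = 30       # cap on the total penalty


def validation_confidence(missing_fields: int, type_mismatches: int, base: int = 100) -> int:
    """Subtract 5 points per issue from `base`, never more than 30 in total."""
    issues = missing_fields + type_mismatches
    penalty = min(PENALTY_PER_ISSUE * issues, MAX_PENALTY)
    return base - penalty


print(validation_confidence(missing_fields=2, type_mismatches=1))  # 85
print(validation_confidence(missing_fields=6, type_mismatches=4))  # 70 (penalty capped)
```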
Images are automatically preprocessed to improve OCR accuracy:
- Grayscale conversion
- Contrast enhancement (2x)
- Sharpening
- Denoising
- Brightness adjustment (1.2x)
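
For readers who want to reproduce a similar pipeline locally, here is a minimal sketch using Pillow. The library choice, the step order (taken from the list above), and the median filter used for denoising are assumptions; the service's actual implementation is not described here.

```python
from PIL import Image, ImageEnhance, ImageFilter, ImageOps


def preprocess_for_ocr(image: Image.Image) -> Image.Image:
    """Apply the five listed steps in order and return a grayscale image."""
    gray = ImageOps.grayscale(image)                               # grayscale conversion
    contrasted = ImageEnhance.Contrast(gray).enhance(2.0)          # contrast enhancement (2x)
    sharpened = contrasted.filter(ImageFilter.SHARPEN)             # sharpening
    denoised = sharpened.filter(ImageFilter.MedianFilter(size=3))  # denoising (assumed filter)
    return ImageEnhance.Brightness(denoised).enhance(1.2)          # brightness adjustment (1.2x)


if __name__ == "__main__":
    page = Image.new("RGB", (64, 64), (120, 130, 140))  # synthetic stand-in for a scan
    processed = preprocess_for_ocr(page)
    print(processed.mode, processed.size)  # L (64, 64)
```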

The overall score combines confidence from multiple sources:
- OCR confidence (if applicable)
- LLM extraction confidence
- Schema validation results

High-confidence example: a clean, typed PDF invoice with clear text.

```json
{
  "confidence": {
    "image_analysis": 98,
    "data_extraction": 95,
    "validation": 92,
    "overall": 95
  }
}
```

Characteristics:
- Typed PDF with selectable text
- Clear formatting and structure
- All required fields present
- No type mismatches

Medium-confidence example: a handwritten timesheet with some unclear text.

```json
{
  "confidence": {
    "image_analysis": 74,
    "data_extraction": 78,
    "validation": 73,
    "overall": 75
  }
}
```

Characteristics:
- Handwritten text (harder to read)
- Some unclear characters
- Most fields extracted successfully
- Minor validation issues

Low-confidence example: a faded scan with poor image quality.

```json
{
  "confidence": {
    "image_analysis": 52,
    "data_extraction": 60,
    "validation": 53,
    "overall": 55
  }
}
```

Characteristics:
- Poor image quality (faded, low contrast)
- OCR struggled with text recognition
- Missing some required fields
- Multiple validation errors

Action: The system automatically retries with enhanced settings.

Flag documents with low confidence for manual review.

```python
result = extract_document(file, schema)

if result['confidence']['overall'] < 75:
    flag_for_review(result['job_id'], result['confidence'])
```

Apply different processing based on confidence.

```python
confidence = result['confidence']['overall']

if confidence >= 90:
    # Auto-approve
    approve_document(result['extracted_data'])
elif confidence >= 75:
    # Spot-check critical fields
    if validate_critical_fields(result['extracted_data']):
        approve_document(result['extracted_data'])
    else:
        flag_for_review(result['job_id'])
else:
    # Manual review required
    flag_for_review(result['job_id'])
```

Monitor confidence scores over time to identify trends.

```python
# Track average confidence by document type
invoice_avg_confidence = 92    # Excellent
timesheet_avg_confidence = 78  # Good, but could improve
receipt_avg_confidence = 85    # Good

# Identify problem areas
if timesheet_avg_confidence < 80:
    # Consider improving prompt or image quality
    improve_timesheet_processing()
```

Send alerts when confidence drops below a threshold.

```python
result = extract_document(file, schema)

if result['confidence']['overall'] < 60:
    send_alert(
        f"Low confidence extraction: {result['job_id']} "
        f"(confidence: {result['confidence']['overall']}%)"
    )
```

To improve image quality:
- Scan at 300 DPI or higher
- Use good lighting
- Ensure text is clear and legible
- Avoid shadows and glare

When designing schemas:
- Use correct field types
- Make optional fields nullable
- Avoid overly complex nested structures
- Test the schema with sample documents

When writing extraction prompts:
- Provide clear extraction instructions
- Specify format requirements
- Handle edge cases explicitly
- Tell the model how to handle unclear text

When preparing documents:
- Prefer typed PDFs over scans
- Convert scanned PDFs to typed PDFs when possible
- Clean up documents before scanning
- Use high-quality scanners
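
To make the schema advice above concrete, here is a minimal sketch of an invoice schema written as a JSON Schema style Python dict. The schema format the service accepts is not specified in this section, so the layout and the field names (`invoice_number`, `total`, `due_date`, `notes`) are assumptions for illustration only.

```python
import json

# Hypothetical invoice schema: flat structure, explicit types,
# and optional fields that also accept null.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},                # numeric, not a string
        "due_date": {"type": ["string", "null"]},   # optional, so nullable
        "notes": {"type": ["string", "null"]},      # optional, so nullable
    },
    "required": ["invoice_number", "total"],
}

print(json.dumps(invoice_schema, indent=2))
```

Keeping optional fields nullable and out of `required` means a document that simply lacks them should not be counted as having a missing required field.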