
Every extraction includes detailed confidence metrics to help you assess the quality of the results.

Confidence Score Breakdown

{
  "confidence": {
    "image_analysis": 85,
    "data_extraction": 90,
    "validation": 88,
    "overall": 88
  }
}

Metric Definitions

image_analysis (0-100)
Quality of the source image or text. Based on OCR confidence for images and scans, or on text extraction quality for PDFs.

data_extraction (0-100)
Accuracy of data extraction from the text. Based on LLM extraction confidence.

validation (0-100)
Schema validation success rate. Reduced by 5% per missing required field or type mismatch (max 30% penalty).

overall (0-100)
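Combined confidence score. Average of the three metrics above, rounded to the nearest whole number (for example, 85, 90, and 88 average to 87.67, reported as 88).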
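
The averaging can be sketched as follows. This is a minimal illustration that assumes an unweighted mean of the three component metrics, rounded to the nearest integer, which agrees with the sample payloads on this page. The function name `overall_confidence` is hypothetical and not part of the API.

```python
def overall_confidence(confidence: dict) -> int:
    """Average the three component metrics and round to a whole number.

    Assumption: an unweighted mean. The service may combine metrics differently.
    """
    components = ("image_analysis", "data_extraction", "validation")
    total = sum(confidence[name] for name in components)
    return round(total / len(components))


sample = {"image_analysis": 85, "data_extraction": 90, "validation": 88}
print(overall_confidence(sample))  # 88
```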

Confidence Score Ranges

Score Range | Quality   | Recommendation
90-100      | Excellent | Data is highly reliable, use with confidence
75-89       | Good      | Data is generally reliable, spot-check important fields
60-74       | Fair      | Review extracted data carefully
0-59        | Poor      | Manual review required, consider re-processing
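
If you want to apply these ranges in your own code, a small lookup is enough. The sketch below mirrors the table; `quality_band` is a hypothetical helper name, not something the API provides.

```python
def quality_band(score: float) -> str:
    """Map a 0-100 confidence score to the quality label from the ranges table."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score >= 90:
        return "Excellent"
    if score >= 75:
        return "Good"
    if score >= 60:
        return "Fair"
    return "Poor"


print(quality_band(88))  # Good
```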

Automatic Quality Improvements

The system automatically improves extraction quality:

Low Confidence Retry

If the overall confidence is below 60%, the system automatically retries the extraction with enhanced settings:

  • Increased OCR preprocessing
  • More detailed extraction prompts
  • Stricter validation

Validation Penalties

Missing required fields or type mismatches reduce validation confidence:

  • Missing required field: -5% per field
  • Type mismatch: -5% per field
  • Maximum penalty: -30%
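
The penalty arithmetic can be illustrated with a short sketch. It assumes the percentages are points subtracted from a 0-100 base score; the page does not say whether the reduction is absolute or relative, so treat that as an assumption. The names `validation_penalty` and `apply_penalty` are hypothetical.

```python
PENALTY_PER_ISSUE = 5  # points per missing required field or type mismatch
MAX_PENALTY = 30       # the total penalty never exceeds this


def validation_penalty(missing_required: int, type_mismatches: int) -> int:
    """Return the capped penalty for the given number of schema issues."""
    issues = missing_required + type_mismatches
    return min(issues * PENALTY_PER_ISSUE, MAX_PENALTY)


def apply_penalty(base_score: float, missing_required: int, type_mismatches: int) -> float:
    """Subtract the penalty from a base score (assumed to be absolute points)."""
    return max(0, base_score - validation_penalty(missing_required, type_mismatches))


print(validation_penalty(2, 1))  # 15
print(validation_penalty(5, 4))  # 30 (capped)
```

With this reading, the cap is reached at six issues; any further missing fields or mismatches do not lower the score more.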

OCR Preprocessing

Images are automatically preprocessed to improve OCR accuracy:

  • Grayscale conversion
  • Contrast enhancement (2x)
  • Sharpening
  • Denoising
  • Brightness adjustment (1.2x)
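
To see what such a pipeline does to your own scans, you can approximate it locally. The sketch below uses the Pillow library and applies the listed steps in the order shown; the step order, the 3x3 median filter used for denoising, and the function name `preprocess_for_ocr` are assumptions, not a description of the service's internals.

```python
from PIL import Image, ImageEnhance, ImageFilter, ImageOps


def preprocess_for_ocr(image: Image.Image) -> Image.Image:
    """Approximate the listed preprocessing steps with Pillow."""
    gray = ImageOps.grayscale(image)                       # grayscale conversion
    contrasted = ImageEnhance.Contrast(gray).enhance(2.0)  # contrast enhancement (2x)
    sharpened = contrasted.filter(ImageFilter.SHARPEN)     # sharpening
    denoised = sharpened.filter(ImageFilter.MedianFilter(size=3))  # denoising (assumed filter)
    return ImageEnhance.Brightness(denoised).enhance(1.2)  # brightness adjustment (1.2x)


# Demo on an in-memory image; replace with Image.open("your_scan.png") for a real file.
demo = Image.new("RGB", (16, 16), (120, 120, 120))
processed = preprocess_for_ocr(demo)
print(processed.mode, processed.size)
```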

Confidence Blending

The overall score combines confidence from multiple sources:

  • OCR confidence (if applicable)
  • LLM extraction confidence
  • Schema validation results

Real-World Examples

High Confidence (95%)

Clean typed PDF invoice with clear text.

{
  "confidence": {
    "image_analysis": 98,
    "data_extraction": 95,
    "validation": 92,
    "overall": 95
  }
}

Characteristics:

  • Typed PDF with selectable text
  • Clear formatting and structure
  • All required fields present
  • No type mismatches

Medium Confidence (75%)

Handwritten timesheet with some unclear text.

{
  "confidence": {
    "image_analysis": 74,
    "data_extraction": 78,
    "validation": 73,
    "overall": 75
  }
}

Characteristics:

  • Handwritten text (harder to read)
  • Some unclear characters
  • Most fields extracted successfully
  • Minor validation issues

Low Confidence (55%)

Faded scan with poor image quality.

{
  "confidence": {
    "image_analysis": 52,
    "data_extraction": 60,
    "validation": 53,
    "overall": 55
  }
}

Characteristics:

  • Poor image quality (faded, low contrast)
  • OCR struggled with text recognition
  • Missing some required fields
  • Multiple validation errors

Action: System automatically retries with enhanced settings.

Using Confidence Scores

Flag for Review

Flag documents with low confidence for manual review.

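# extract_document and flag_for_review stand in for your own client and review-queue code
result = extract_document(file, schema)

# Anything below the "Good" range (75) goes to a human reviewer
if result['confidence']['overall'] < 75:
    flag_for_review(result['job_id'], result['confidence'])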

Conditional Processing

Apply different processing based on confidence.

# Assumes result was returned by extract_document(file, schema)
confidence = result['confidence']['overall']

if confidence >= 90:
    # Excellent: auto-approve
    approve_document(result['extracted_data'])
elif confidence >= 75:
    # Good: spot-check critical fields before approving
    if validate_critical_fields(result['extracted_data']):
        approve_document(result['extracted_data'])
    else:
        flag_for_review(result['job_id'], result['confidence'])
else:
    # Fair or Poor: manual review required
    flag_for_review(result['job_id'], result['confidence'])

Track Quality Metrics

Monitor confidence scores over time to identify trends.

# Track average confidence by document type
invoice_avg_confidence = 92  # Excellent
timesheet_avg_confidence = 78  # Good, but could improve
receipt_avg_confidence = 85  # Good

# Identify problem areas
if timesheet_avg_confidence < 80:
    # Consider improving prompt or image quality
    improve_timesheet_processing()
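
In practice the averages would be computed from stored results, not hard-coded. The sketch below shows one way to do that with the standard library. It assumes each stored result carries a `doc_type` label that you attach yourself; that field and both function names are hypothetical and not part of the extraction response.

```python
from collections import defaultdict
from statistics import mean


def average_confidence_by_type(results: list[dict]) -> dict[str, float]:
    """Group overall confidence scores by document type and average each group."""
    scores = defaultdict(list)
    for result in results:
        # "doc_type" is a label added by the caller, not an API field
        scores[result["doc_type"]].append(result["confidence"]["overall"])
    return {doc_type: mean(values) for doc_type, values in scores.items()}


def types_below(averages: dict[str, float], threshold: float = 80) -> list[str]:
    """Return the document types whose average confidence is under the threshold."""
    return sorted(t for t, avg in averages.items() if avg < threshold)


history = [
    {"doc_type": "invoice", "confidence": {"overall": 92}},
    {"doc_type": "invoice", "confidence": {"overall": 94}},
    {"doc_type": "timesheet", "confidence": {"overall": 75}},
    {"doc_type": "timesheet", "confidence": {"overall": 79}},
]
averages = average_confidence_by_type(history)
print(types_below(averages))  # ['timesheet']
```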

Alert on Low Confidence

Send alerts when confidence drops below threshold.

result = extract_document(file, schema)

if result['confidence']['overall'] < 60:
    send_alert(
        f"Low confidence extraction: {result['job_id']} "
        f"(confidence: {result['confidence']['overall']}%)"
    )

Improving Confidence Scores

Improve Image Quality

  • Scan at 300 DPI or higher
  • Use good lighting
  • Ensure text is clear and legible
  • Avoid shadows and glare

Optimize Schema

  • Use correct field types
  • Make optional fields nullable
  • Avoid overly complex nested structures
  • Test schema with sample documents

Refine Prompts

  • Provide clear extraction instructions
  • Specify format requirements
  • Handle edge cases explicitly
  • Tell the model how to handle unclear text

Use Better Source Documents

  • Prefer typed PDFs over scans
  • Convert scanned PDFs to typed when possible
  • Clean up documents before scanning
  • Use high-quality scanners