# Understanding Confidence Metrics

Every extraction includes detailed confidence metrics to help you assess the quality of the results.

## Confidence Score Breakdown

```json
{
  "confidence": {
    "image_analysis": 85,
    "data_extraction": 90,
    "validation": 88,
    "overall": 88
  }
}
```

### Metric Definitions

**`image_analysis`** (0-100)
Quality of the source image or text. Based on OCR confidence or PDF text extraction quality.

**`data_extraction`** (0-100)
Accuracy of data extraction from the text. Based on LLM extraction confidence.

**`validation`** (0-100)
Schema validation success rate. Reduced by 5% per missing required field or type mismatch (max 30% penalty).

**`overall`** (0-100)
Combined confidence score. Average of all metrics.

## Confidence Score Ranges

| Score Range | Quality | Recommendation |
| --- | --- | --- |
| 90-100 | Excellent | Data is highly reliable; use with confidence |
| 75-89 | Good | Data is generally reliable; spot-check important fields |
| 60-74 | Fair | Review extracted data carefully |
| 0-59 | Poor | Manual review required; consider re-processing |

## Automatic Quality Improvements

The system automatically improves extraction quality:

### Low Confidence Retry

If overall confidence is below 60%, the system automatically retries with enhanced settings:

- Increased OCR preprocessing
- More detailed extraction prompts
- Stricter validation

### Validation Penalties

Missing required fields or type mismatches reduce validation confidence:

- Missing required field: -5% per field
- Type mismatch: -5% per field
- Maximum penalty: -30%

### OCR Preprocessing

Images are automatically preprocessed to improve OCR accuracy:

- Grayscale conversion
- Contrast enhancement (2x)
- Sharpening
- Denoising
- Brightness adjustment (1.2x)

### Confidence Blending

The overall score blends confidence from multiple sources:

- OCR confidence (if applicable)
- LLM extraction confidence
- Schema validation results

## Real-World Examples

### High Confidence (95%)

Clean typed PDF invoice with clear text.

```json
{
  "confidence": {
    "image_analysis": 98,
    "data_extraction": 95,
    "validation": 92,
    "overall": 95
  }
}
```

**Characteristics:**

- Typed PDF with selectable text
- Clear formatting and structure
- All required fields present
- No type mismatches

### Medium Confidence (75%)

Handwritten timesheet with some unclear text.

```json
{
  "confidence": {
    "image_analysis": 74,
    "data_extraction": 78,
    "validation": 73,
    "overall": 75
  }
}
```

**Characteristics:**

- Handwritten text (harder to read)
- Some unclear characters
- Most fields extracted successfully
- Minor validation issues

### Low Confidence (55%)

Faded scan with poor image quality.

```json
{
  "confidence": {
    "image_analysis": 52,
    "data_extraction": 60,
    "validation": 53,
    "overall": 55
  }
}
```

**Characteristics:**

- Poor image quality (faded, low contrast)
- OCR struggled with text recognition
- Missing some required fields
- Multiple validation errors

**Action:** The system automatically retries with enhanced settings.

## Using Confidence Scores

### Flag for Review

Flag documents with low confidence for manual review.

```python
result = extract_document(file, schema)

if result['confidence']['overall'] < 75:
    flag_for_review(result['job_id'], result['confidence'])
```
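
### Map Scores to Quality Bands

It can also help to translate the numeric score into the quality bands from the table above before routing documents. The sketch below is illustrative only: `quality_band` is not part of the API, and `extract_document` / `flag_for_review` are the same placeholder helpers used in the other examples on this page.

```python
def quality_band(score: int) -> str:
    """Map an overall confidence score to the documented quality bands."""
    if score >= 90:
        return "excellent"
    if score >= 75:
        return "good"
    if score >= 60:
        return "fair"
    return "poor"

result = extract_document(file, schema)
band = quality_band(result['confidence']['overall'])

# Route anything below "good" to manual review
if band in ("fair", "poor"):
    flag_for_review(result['job_id'], result['confidence'])
```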

### Conditional Processing

Apply different processing based on confidence.

```python
confidence = result['confidence']['overall']

if confidence >= 90:
    # Auto-approve
    approve_document(result['extracted_data'])
elif confidence >= 75:
    # Spot-check critical fields
    if validate_critical_fields(result['extracted_data']):
        approve_document(result['extracted_data'])
    else:
        flag_for_review(result['job_id'])
else:
    # Manual review required
    flag_for_review(result['job_id'])
```

### Track Quality Metrics

Monitor confidence scores over time to identify trends.

```python
# Track average confidence by document type
invoice_avg_confidence = 92    # Excellent
timesheet_avg_confidence = 78  # Good, but could improve
receipt_avg_confidence = 85    # Good

# Identify problem areas
if timesheet_avg_confidence < 80:
    # Consider improving the prompt or image quality
    improve_timesheet_processing()
```

### Alert on Low Confidence

Send alerts when confidence drops below a threshold.

```python
result = extract_document(file, schema)

if result['confidence']['overall'] < 60:
    send_alert(
        f"Low confidence extraction: {result['job_id']} "
        f"(confidence: {result['confidence']['overall']}%)"
    )
```

## Improving Confidence Scores

### Improve Image Quality

- Scan at 300 DPI or higher
- Use good lighting
- Ensure text is clear and legible
- Avoid shadows and glare

### Optimize Schema

- Use correct field types
- Make optional fields nullable
- Avoid overly complex nested structures
- Test the schema with sample documents

### Refine Prompts

- Provide clear extraction instructions
- Specify format requirements
- Handle edge cases explicitly
- Guide the AI on how to handle unclear text

### Use Better Source Documents

- Prefer typed PDFs over scans
- Convert scanned PDFs to typed PDFs when possible
- Clean up documents before scanning
- Use high-quality scanners

## Related Resources

- [Extract Document Data](/docs/api-reference/documents/extract)
- [Best Practices & Guidelines](/docs/api-reference/documents/best-practices)
- [Document Extraction Overview](/docs/api-reference/documents/introduction)