# Document Extraction Best Practices

## Schema Design

### Be Specific with Field Names

Use descriptive field names that clearly indicate the data type and purpose.

```json
{
  "invoice_number": "string",
  "invoice_date": "string",
  "due_date": "string",
  "vendor_name": "string",
  "total_amount": "number",
  "currency": "string"
}
```

### Use Correct Types

Match field types to the expected data format.

- `string` - Text content, IDs, dates
- `number` - Numeric values (integers or decimals)
- `integer` - Whole numbers only
- `boolean` - True/false values
- `array` - Lists of items
- `object` - Nested structures

### Handle Missing Data

Allow null values for optional fields to handle incomplete documents.

```json
{
  "invoice_number": "string",
  "total_amount": "number",
  "notes": "string | null",
  "purchase_order": "string | null"
}
```

### Use Nested Structures

Use arrays for repeating items like line items or entries.

```json
{
  "invoice_number": "string",
  "line_items": [
    {
      "description": "string",
      "quantity": "number",
      "unit_price": "number",
      "total": "number"
    }
  ]
}
```

## Prompt Engineering

### Be Specific

Clearly state what you want extracted and how to handle edge cases.

```
"Extract timesheet data from this handwritten form. Focus on:
1. Date entries in YYYY-MM-DD format
2. Start and end times in 24-hour format
3. Calculate total hours by subtracting 45-minute break
4. If handwriting is unclear, mark field as null
5. Ignore notes or comments in margins"
```

### Provide Context

Explain the document type and structure to guide the AI.

```
"This is a medical prescription form. Extract:
- Patient name and date of birth from top section
- Medication names and dosages from middle section
- Doctor signature date from bottom
- Ignore pharmacy stamps and barcodes"
```

### Handle Edge Cases

Specify how to handle unclear or missing data.

```
"For handwritten text, if unclear, mark the field as null rather than guessing.
If multiple values are present, extract only the most recent one."
```

### Format Requirements

Specify exact formats for dates, numbers, and other structured data.

```
"Extract dates in YYYY-MM-DD format. Convert any other date formats.
Extract currency amounts without symbols (e.g., 1250.00 instead of $1,250.00).
Round all decimal numbers to 2 decimal places."
```

## Processing Strategy

### Synchronous vs Asynchronous

**Use Synchronous (`sync: true`) for:**

- Small documents (< 5 pages)
- Real-time extraction needs
- Interactive UIs where users wait for results
- Testing and development

**Use Asynchronous (`sync: false`) for:**

- Large documents (> 5 pages)
- Batch processing
- Background jobs
- High-volume processing

### Batch Processing

Process similar documents together for better efficiency.

```python
# Good: Batch similar documents
files = ["invoice1.pdf", "invoice2.pdf", "invoice3.pdf"]
response = extract_batch(files, invoice_schema)

# Less efficient: Process individually
for file in files:
    response = extract_single(file, invoice_schema)
```

### Model Selection

- **ftg-3.0**: Default, best balance of speed and accuracy
- **gpt-4o-mini**: Faster, lower cost, good for simple documents
- **gpt-4o**: Highest accuracy, use for complex or critical documents

## Quality Assurance

### Check Confidence Scores

Review documents with low confidence scores (<75%).

```python
result = extract_document(file, schema)

if result['confidence']['overall'] < 75:
    # Flag for manual review
    flag_for_review(result['job_id'])
```

### Confidence Score Ranges

| Score | Quality | Action |
| --- | --- | --- |
| 90-100 | Excellent | Use with confidence |
| 75-89 | Good | Spot-check important fields |
| 60-74 | Fair | Review carefully |
| 0-59 | Poor | Manual review required |

### Validate Critical Fields

Always validate critical fields like amounts and dates.
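One way to make such checks reusable is to collect problems into a list rather than acting on each one inline. The following is a minimal sketch under stated assumptions: `validate_critical_fields`, its `max_amount` and `date_field` parameters, and the issue messages are hypothetical names chosen for illustration, not part of the extraction API. The 100000 threshold mirrors the amount check used in this guide.

```python
from datetime import datetime


def validate_critical_fields(extracted, max_amount=100_000, date_field="date"):
    """Return a list of human-readable problems; an empty list means every check passed."""
    issues = []

    # Amount must be a real number (bool is excluded because it subclasses int)
    amount = extracted.get("total_amount")
    if not isinstance(amount, (int, float)) or isinstance(amount, bool):
        issues.append("total_amount is missing or not a number")
    elif amount < 0:
        issues.append("total_amount is negative")
    elif amount > max_amount:
        issues.append("total_amount exceeds the review threshold")

    # Date must parse as year-month-day; None raises TypeError, bad text raises ValueError
    try:
        datetime.strptime(extracted.get(date_field), "%Y-%m-%d")
    except (TypeError, ValueError):
        issues.append(f"{date_field} is missing or not in YYYY-MM-DD format")

    return issues


print(validate_critical_fields({"total_amount": 1250.0, "date": "2024-03-15"}))
print(validate_critical_fields({"total_amount": 250000, "date": "15/03/2024"}))
```

Returning a list lets the caller decide whether to alert, flag for review, or reject the result, and keeps the checks easy to unit test. The inline version that follows wires the same two checks directly to alerting helpers.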
```python
from datetime import datetime

extracted = result['extracted_data']

# Validate amount is reasonable
if extracted['total_amount'] > 100000:
    alert_high_amount(extracted)

# Validate date format
try:
    datetime.strptime(extracted['date'], '%Y-%m-%d')
except ValueError:
    flag_invalid_date(extracted)
```

### Track Accuracy

Monitor extraction accuracy over time to identify problem areas.

```python
# Track success rate
total_extractions = 1000
successful = 950
accuracy = successful / total_extractions  # 0.95 (95%)

# Track by document type
invoice_accuracy = 0.98    # 98%
timesheet_accuracy = 0.85  # 85%; may need prompt improvement
```

## Performance Optimization

### Optimize File Size

Downsample high-resolution scans to about 300 DPI, which is sufficient for OCR.

```python
from PIL import Image

img = Image.open('scan.jpg')

# Halve the pixel dimensions (e.g. a 600 DPI scan becomes 300 DPI)
img = img.resize((int(img.width * 0.5), int(img.height * 0.5)))

# dpi= only writes metadata; the resize above does the actual downsampling
img.save('scan_optimized.jpg', quality=85, dpi=(300, 300))
```

### Pre-process Documents

Convert scanned PDFs to typed PDFs (PDFs with selectable text) when possible for 10-50x cost savings.

### Use Appropriate Sync Mode

Don't block on large documents; use async mode.

```python
# Good: Async for large documents
if file_size > 5_000_000:  # 5 MB
    result = extract_async(file, schema)
    job_id = result['job_id']
    # Poll for results later
else:
    result = extract_sync(file, schema)
```

### Cache Results

Cache extracted data to avoid re-processing the same document.

```python
import hashlib

def get_file_hash(file_path):
    # Hash the file contents so renamed copies still hit the cache
    with open(file_path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

def extract_cached(file_path, schema):
    file_hash = get_file_hash(file_path)
    cached = cache.get(file_hash)
    if cached:
        return cached
    result = extract_document(file_path, schema)
    cache.set(file_hash, result, ttl=86400)  # 24 hours
    return result
```

## Error Handling

### Implement Retry Logic

Retry failed extractions with exponential backoff.
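To make the wait times concrete, the delay before each retry can be computed separately from the retry loop. The sketch below is an illustration only, not part of any SDK: `backoff_delays`, `base`, and `cap` are hypothetical names, and the upper cap is an added safeguard that this guide does not itself prescribe.

```python
def backoff_delays(max_retries=3, base=1.0, cap=60.0):
    """Return the wait in seconds before each retry.

    With max_retries attempts there are max_retries - 1 waits.
    Each wait doubles the previous one and never exceeds `cap`.
    """
    return [min(cap, base * 2 ** attempt) for attempt in range(max_retries - 1)]


print(backoff_delays())               # [1.0, 2.0]
print(backoff_delays(max_retries=5))  # [1.0, 2.0, 4.0, 8.0]
```

With three attempts this gives waits of 1 and 2 seconds, which matches the `2 ** attempt` expression in the retry loop that follows.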
```python
import time

def extract_with_retry(file, schema, max_retries=3):
    for attempt in range(max_retries):
        try:
            result = extract_document(file, schema)
            if result['success']:
                return result
        except Exception:
            # Re-raise once the final attempt has failed
            if attempt == max_retries - 1:
                raise
        # Wait before the next attempt, whether it raised or returned a failure
        if attempt < max_retries - 1:
            time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, 4s, ...
    raise RuntimeError(f"Extraction failed after {max_retries} attempts")
```

### Handle Common Errors

Catch specific error types and apply a targeted recovery for each.

```python
import time

try:
    result = extract_document(file, schema)
except FileTooLargeError:
    # Compress file and retry
    compressed = compress_file(file)
    result = extract_document(compressed, schema)
except UnsupportedFileTypeError:
    # Convert to supported format
    converted = convert_to_pdf(file)
    result = extract_document(converted, schema)
except RateLimitError:
    # Wait and retry
    time.sleep(60)
    result = extract_document(file, schema)
```

## Cost Optimization

### Use Typed Documents

PDFs with selectable text are 10-50x cheaper than scanned images.

### Choose Appropriate Model

Use ftg-3.0 for most cases, gpt-4o only when needed.

### Batch Similar Documents

Process multiple documents in one batch for better efficiency.

### Pre-process Images

Reduce image resolution to 300 DPI before uploading.

### Monitor Costs

Track costs per document type to identify optimization opportunities.

```python
# Track costs by document type (cost per document, in dollars)
invoice_cost = 0.002    # Typed PDF
timesheet_cost = 0.015  # Handwritten image
receipt_cost = 0.008    # Scanned image

# Monthly costs
monthly_invoices = 1000 * invoice_cost     # $2
monthly_timesheets = 500 * timesheet_cost  # $7.50
monthly_receipts = 2000 * receipt_cost     # $16

total_monthly = monthly_invoices + monthly_timesheets + monthly_receipts  # $25.50
```

## Related Resources

- [Extract Document Data](/docs/api-reference/documents/extract)
- [Batch Document Extraction](/docs/api-reference/documents/extract-batch)
- [Get Job Status](/docs/api-reference/documents/get-job-status)
- [Document Extraction Overview](/docs/api-reference/documents/introduction)