Document Extraction Best Practices

Schema Design

Be Specific with Field Names

Use descriptive field names that clearly indicate the data type and purpose.

{
  "invoice_number": "string",
  "invoice_date": "string",
  "due_date": "string",
  "vendor_name": "string",
  "total_amount": "number",
  "currency": "string"
}

Use Correct Types

Match field types to the expected data format.

string - Text content, IDs, dates
number - Numeric values (integers or decimals)
integer - Whole numbers only
boolean - True/false values
array - Lists of items
object - Nested structures
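
The type names above can be checked against extracted values after the fact. The following is a minimal sketch, assuming the extracted data arrives as a Python dict and the schema is a flat mapping of field name to type name; `matches_type` and `find_type_mismatches` are hypothetical helpers, not part of any extraction API.

```python
def matches_type(value, type_name):
    """Return True if value is compatible with the schema type name."""
    if type_name == "string":
        return isinstance(value, str)
    if type_name == "integer":
        # bool is a subclass of int in Python, so exclude it explicitly
        return isinstance(value, int) and not isinstance(value, bool)
    if type_name == "number":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if type_name == "boolean":
        return isinstance(value, bool)
    if type_name == "array":
        return isinstance(value, list)
    if type_name == "object":
        return isinstance(value, dict)
    raise ValueError(f"Unknown schema type: {type_name}")


def find_type_mismatches(data, schema):
    """Return the names of fields whose values do not match a flat schema."""
    return [
        field for field, type_name in schema.items()
        if field in data and not matches_type(data[field], type_name)
    ]
```

A typical catch is an amount that comes back as the string "1250.00" for a field declared as number.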

Handle Missing Data

Allow null values for optional fields to handle incomplete documents.

{
  "invoice_number": "string",
  "total_amount": "number",
  "notes": "string | null",
  "purchase_order": "string | null"
}
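
Downstream code then has to tell a legitimately empty optional field from a missing required one. A minimal sketch, assuming the "type | null" notation shown above and extracted data held in a Python dict; `split_schema` and `missing_required` are hypothetical helper names.

```python
def split_schema(schema):
    """Split a flat schema into (required, nullable) field name lists.

    Assumes nullable fields are written as "type | null".
    """
    required, nullable = [], []
    for field, type_spec in schema.items():
        parts = [part.strip() for part in type_spec.split("|")]
        (nullable if "null" in parts else required).append(field)
    return required, nullable


def missing_required(data, schema):
    """Return required fields that are absent or null in the extracted data."""
    required, _ = split_schema(schema)
    return [field for field in required if data.get(field) is None]
```

An empty result from `missing_required` means every non-nullable field was extracted, even if notes or purchase_order came back as null.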

Use Nested Structures

Use arrays for repeating items like line items or entries.

{
  "invoice_number": "string",
  "line_items": [
    {
      "description": "string",
      "quantity": "number",
      "unit_price": "number",
      "total": "number"
    }
  ]
}
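
A nested structure also makes cross-field checks possible. The sketch below assumes each line item carries the four fields shown above and flags rows where quantity times unit price does not match the extracted total; the helper name and the one-cent tolerance are illustrative choices.

```python
from decimal import Decimal


def check_line_items(line_items, tolerance=Decimal("0.01")):
    """Return the indexes of line items where quantity * unit_price != total."""
    mismatched = []
    for index, item in enumerate(line_items):
        # Convert via str() so float inputs like 9.99 keep their printed value
        expected = Decimal(str(item["quantity"])) * Decimal(str(item["unit_price"]))
        if abs(expected - Decimal(str(item["total"]))) > tolerance:
            mismatched.append(index)
    return mismatched
```

Rows returned here are good candidates for manual review, since one of the three numbers was probably misread.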

Prompt Engineering

Be Specific

Clearly state what you want extracted and how to handle edge cases.

"Extract timesheet data from this handwritten form. Focus on:
1. Date entries in YYYY-MM-DD format
2. Start and end times in 24-hour format
3. Calculate total hours as end time minus start time, minus a 45-minute break
4. If handwriting is unclear, mark field as null
5. Ignore notes or comments in margins"
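
Calculated fields such as total hours are cheap to recompute and compare against what the model returned. A minimal sketch of the arithmetic the prompt asks for, assuming times in 24-hour HH:MM format on the same day; `total_hours` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def total_hours(start, end, break_minutes=45):
    """Hours between two HH:MM times on the same day, minus the break."""
    fmt = "%H:%M"
    worked = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    worked -= timedelta(minutes=break_minutes)
    if worked < timedelta(0):
        raise ValueError("Shift is shorter than the break, or end precedes start")
    return round(worked.total_seconds() / 3600, 2)
```

For a 09:00 to 17:30 shift this gives 7.75 hours (8.5 hours minus the 0.75-hour break).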

Provide Context

Explain the document type and structure to guide the AI.

"This is a medical prescription form. Extract:
- Patient name and date of birth from top section
- Medication names and dosages from middle section
- Doctor signature date from bottom
- Ignore pharmacy stamps and barcodes"

Handle Edge Cases

Specify how to handle unclear or missing data.

"For handwritten text, if unclear, mark the field as null rather than guessing.
If multiple values are present, extract only the most recent one."

Format Requirements

Specify exact formats for dates, numbers, and other structured data.

"Extract dates in YYYY-MM-DD format. Convert any other date formats.
Extract currency amounts without symbols (e.g., 1250.00 instead of $1,250.00).
Round all decimal numbers to 2 decimal places."
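
The same format rules can be enforced in post-processing as a safety net. This is a sketch under stated assumptions: the list of input date formats is illustrative (and treats slash dates as month/day/year), and amounts are assumed to use "." as the decimal separator and "," for thousands.

```python
import re
from datetime import datetime
from decimal import Decimal, ROUND_HALF_UP

# Assumed input formats; extend for the documents you actually receive.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %B %Y", "%B %d, %Y")


def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None


def normalize_amount(text):
    """Strip currency symbols and thousands separators, round to 2 places."""
    cleaned = re.sub(r"[^\d.\-]", "", text)
    return Decimal(cleaned).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
```

Returning None for an unrecognized date mirrors the guidance to mark unclear fields as null rather than guess.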

Processing Strategy

Synchronous vs Asynchronous

Use Synchronous (sync: true) for:

Small documents (5 pages or fewer)
Real-time extraction needs
Interactive UIs where users wait for results
Testing and development

Use Asynchronous (sync: false) for:

Large documents (> 5 pages)
Batch processing
Background jobs
High-volume processing

Batch Processing

Process similar documents together for better efficiency.

# Good: Batch similar documents
files = ["invoice1.pdf", "invoice2.pdf", "invoice3.pdf"]
response = extract_batch(files, invoice_schema)

# Less efficient: Process individually
for file in files:
    response = extract_single(file, invoice_schema)

Model Selection

gpt-4o: Default, best balance of speed and accuracy
gpt-4o-mini: Faster, lower cost, good for simple documents
gpt-4o: Highest accuracy, use for complex or critical documents

Quality Assurance

Check Confidence Scores

Flag documents with an overall confidence score below 75 for manual review.

result = extract_document(file, schema)

if result['confidence']['overall'] < 75:
    # Flag for manual review
    flag_for_review(result['job_id'])

Confidence Score Ranges

Score	Quality	Action
90-100	Excellent	Use with confidence
75-89	Good	Spot-check important fields
60-74	Fair	Review carefully
0-59	Poor	Manual review required
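
The table translates directly into a routing function. A minimal sketch, assuming scores on a 0-100 scale; `review_action` is a hypothetical helper, and it uses lower bounds so that fractional scores such as 89.5 fall into a band as well.

```python
def review_action(score):
    """Map an overall confidence score (0-100) to the action from the table."""
    if not 0 <= score <= 100:
        raise ValueError(f"Score out of range: {score}")
    if score >= 90:
        return "Use with confidence"
    if score >= 75:
        return "Spot-check important fields"
    if score >= 60:
        return "Review carefully"
    return "Manual review required"
```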

Validate Critical Fields

Always validate critical fields like amounts and dates.

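from datetime import datetime

extracted = result['extracted_data']

# Validate amount is reasonable (100000 is an example threshold; tune it to your data)
if extracted['total_amount'] is not None and extracted['total_amount'] > 100000:
    alert_high_amount(extracted)

# Validate date format (expects YYYY-MM-DD)
try:
    datetime.strptime(extracted['date'], '%Y-%m-%d')
except (TypeError, ValueError):
    # TypeError covers a null date; ValueError covers a malformed one
    flag_invalid_date(extracted)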

Track Accuracy

Monitor extraction accuracy over time to identify problem areas.

# Track success rate
total_extractions = 1000
successful = 950
accuracy = successful / total_extractions  # 0.95 (95%)

# Track by document type
invoice_accuracy = 0.98    # 98%
timesheet_accuracy = 0.85  # 85%; may need prompt improvement
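
In practice these numbers come from counting reviewed extractions rather than from constants. A minimal sketch of a per-type tracker; the `AccuracyTracker` class and the 0.9 threshold are illustrative assumptions.

```python
from collections import defaultdict


class AccuracyTracker:
    """Count reviewed extractions per document type (illustrative helper)."""

    def __init__(self):
        self._totals = defaultdict(int)
        self._successes = defaultdict(int)

    def record(self, document_type, success):
        """Record one reviewed extraction and whether it was correct."""
        self._totals[document_type] += 1
        if success:
            self._successes[document_type] += 1

    def accuracy(self, document_type):
        """Return the success rate as a fraction, or None with no data."""
        total = self._totals.get(document_type, 0)
        if total == 0:
            return None
        return self._successes.get(document_type, 0) / total

    def below(self, threshold=0.9):
        """Return document types whose accuracy is under the threshold."""
        return sorted(
            doc_type for doc_type in self._totals
            if self.accuracy(doc_type) < threshold
        )
```

Document types returned by `below` are the ones whose prompts or schemas are worth revisiting first.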

Performance Optimization

Optimize File Size

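Downsample high-resolution scans to 300 DPI, which is sufficient for OCR.

from PIL import Image

img = Image.open('scan.jpg')
# Halve the pixel dimensions, e.g. to bring a 600 DPI scan down to 300 DPI
img = img.resize((int(img.width * 0.5), int(img.height * 0.5)))
# dpi only writes metadata; the resize above is what shrinks the file
img.save('scan_optimized.jpg', quality=85, dpi=(300, 300))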

Pre-process Documents

Where possible, submit typed PDFs (PDFs with selectable text) instead of scanned ones; they are 10-50x cheaper to process.

Use Appropriate Sync Mode

Don't block on large documents - use async mode.

import os

# Good: Async for large documents
file_size = os.path.getsize(file)  # assumes file is a path on disk
if file_size > 5_000_000:  # 5 MB
    result = extract_async(file, schema)
    job_id = result['job_id']
    # Poll for results later
else:
    result = extract_sync(file, schema)
Cache Results

Cache extracted data to avoid re-processing the same document.

import hashlib

def get_file_hash(file_path):
    # The hash is only a cache key here, not a security measure
    with open(file_path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

def extract_cached(file_path, schema):
    # If one file is extracted with several schemas, add the schema to the key
    file_hash = get_file_hash(file_path)
    cached = cache.get(file_hash)
    if cached:
        return cached

    result = extract_document(file_path, schema)
    cache.set(file_hash, result, ttl=86400)  # 24 hours
    return result

Error Handling

Implement Retry Logic

Retry failed extractions with exponential backoff.

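import time

def extract_with_retry(file, schema, max_retries=3):
    for attempt in range(max_retries):
        try:
            result = extract_document(file, schema)
            if result['success']:
                return result
        except Exception:
            if attempt == max_retries - 1:
                raise  # Out of retries: surface the last error
        if attempt < max_retries - 1:
            time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, ...
    # Every attempt returned an unsuccessful result
    raise RuntimeError(f"Extraction failed after {max_retries} attempts")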

Handle Common Errors

import time

try:
    result = extract_document(file, schema)
except FileTooLargeError:
    # Compress file and retry
    compressed = compress_file(file)
    result = extract_document(compressed, schema)
except UnsupportedFileTypeError:
    # Convert to supported format
    converted = convert_to_pdf(file)
    result = extract_document(converted, schema)
except RateLimitError:
    # Wait and retry
    time.sleep(60)
    result = extract_document(file, schema)

Cost Optimization

Use Typed Documents

PDFs with selectable text are 10-50x cheaper than scanned images.

Choose Appropriate Model

Use the default model for most cases, and switch to the highest-accuracy model only for complex or critical documents.

Batch Similar Documents

Process multiple documents in one batch for better efficiency.

Pre-process Images

Reduce image resolution to 300 DPI before uploading.

Monitor Costs

Track costs per document type to identify optimization opportunities.

# Track costs by document type
invoice_cost = 0.002  # Typed PDF
timesheet_cost = 0.015  # Handwritten image
receipt_cost = 0.008  # Scanned image

# Monthly costs
monthly_invoices = 1000 * invoice_cost  # $2
monthly_timesheets = 500 * timesheet_cost  # $7.50
monthly_receipts = 2000 * receipt_cost  # $16
total_monthly = monthly_invoices + monthly_timesheets + monthly_receipts  # $25.50
