# Document Extraction Best Practices

## Schema Design

### Be Specific with Field Names

Use descriptive field names that clearly indicate the data type and purpose.


```json
{
  "invoice_number": "string",
  "invoice_date": "string",
  "due_date": "string",
  "vendor_name": "string",
  "total_amount": "number",
  "currency": "string"
}
```

### Use Correct Types

Match field types to the expected data format.

- `string` - Text content, IDs, dates
- `number` - Numeric values (integers or decimals)
- `integer` - Whole numbers only
- `boolean` - True/false values
- `array` - Lists of items
- `object` - Nested structures


### Handle Missing Data

Allow null values for optional fields to handle incomplete documents.


```json
{
  "invoice_number": "string",
  "total_amount": "number",
  "notes": "string | null",
  "purchase_order": "string | null"
}
```

### Use Nested Structures

Use arrays for repeating items like line items or entries.


```json
{
  "invoice_number": "string",
  "line_items": [
    {
      "description": "string",
      "quantity": "number",
      "unit_price": "number",
      "total": "number"
    }
  ]
}
```

## Prompt Engineering

### Be Specific

Clearly state what you want extracted and how to handle edge cases.


```
"Extract timesheet data from this handwritten form. Focus on:
1. Date entries in YYYY-MM-DD format
2. Start and end times in 24-hour format
3. Calculate total hours by subtracting 45-minute break
4. If handwriting is unclear, mark field as null
5. Ignore notes or comments in margins"
```

### Provide Context

Explain the document type and structure to guide the AI.


```
"This is a medical prescription form. Extract:
- Patient name and date of birth from top section
- Medication names and dosages from middle section
- Doctor signature date from bottom
- Ignore pharmacy stamps and barcodes"
```

### Handle Edge Cases

Specify how to handle unclear or missing data.


```
"For handwritten text, if unclear, mark the field as null rather than guessing.
If multiple values are present, extract only the most recent one."
```

### Format Requirements

Specify exact formats for dates, numbers, and other structured data.


```
"Extract dates in YYYY-MM-DD format. Convert any other date formats.
Extract currency amounts without symbols (e.g., 1250.00 instead of $1,250.00).
Round all decimal numbers to 2 decimal places."
```

## Processing Strategy

### Synchronous vs Asynchronous

**Use Synchronous (`sync: true`) for:**

- Small documents (< 5 pages)
- Real-time extraction needs
- Interactive UIs where users wait for results
- Testing and development


**Use Asynchronous (`sync: false`) for:**

- Large documents (> 5 pages)
- Batch processing
- Background jobs
- High-volume processing


### Batch Processing

Process similar documents together for better efficiency.


```python
# Good: Batch similar documents
files = ["invoice1.pdf", "invoice2.pdf", "invoice3.pdf"]
response = extract_batch(files, invoice_schema)

# Less efficient: Process individually
for file in files:
    response = extract_single(file, invoice_schema)
```

### Model Selection

- **ftg-3.0**: Default, best balance of speed and accuracy
- **gpt-4o-mini**: Faster, lower cost, good for simple documents
- **gpt-4o**: Highest accuracy, use for complex or critical documents


## Quality Assurance

### Check Confidence Scores

Review documents with low confidence scores (<75%).


```python
result = extract_document(file, schema)

if result['confidence']['overall'] < 75:
    # Flag for manual review
    flag_for_review(result['job_id'])
```

### Confidence Score Ranges

| Score | Quality | Action |
|  --- | --- | --- |
| 90-100 | Excellent | Use with confidence |
| 75-89 | Good | Spot-check important fields |
| 60-74 | Fair | Review carefully |
| 0-59 | Poor | Manual review required |


### Validate Critical Fields

Always validate critical fields like amounts and dates.


```python
extracted = result['extracted_data']

# Validate amount is reasonable
if extracted['total_amount'] > 100000:
    alert_high_amount(extracted)

# Validate date format
try:
    datetime.strptime(extracted['date'], '%Y-%m-%d')
except ValueError:
    flag_invalid_date(extracted)
```

### Track Accuracy

Monitor extraction accuracy over time to identify problem areas.


```python
# Track success rate
total_extractions = 1000
successful = 950
accuracy = successful / total_extractions  # 95%

# Track by document type
invoice_accuracy = 98%
timesheet_accuracy = 85%  # May need prompt improvement
```

## Performance Optimization

### Optimize File Size

Compress images to 300 DPI (sufficient for OCR).


```python
from PIL import Image

img = Image.open('scan.jpg')
img = img.resize((int(img.width * 0.5), int(img.height * 0.5)))
img.save('scan_optimized.jpg', quality=85, dpi=(300, 300))
```

### Pre-process Documents

Convert scanned PDFs to typed PDFs when possible for 10-50x cost savings.

### Use Appropriate Sync Mode

Don't block on large documents - use async mode.


```python
# Good: Async for large documents
if file_size > 5_000_000:  # 5 MB
    result = extract_async(file, schema)
    job_id = result['job_id']
    # Poll for results later
else:
    result = extract_sync(file, schema)
```

### Cache Results

Cache extracted data to avoid re-processing the same document.


```python
import hashlib

def get_file_hash(file_path):
    with open(file_path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

file_hash = get_file_hash('invoice.pdf')
cached = cache.get(file_hash)

if cached:
    return cached
else:
    result = extract_document('invoice.pdf', schema)
    cache.set(file_hash, result, ttl=86400)  # 24 hours
    return result
```

## Error Handling

### Implement Retry Logic

Retry failed extractions with exponential backoff.


```python
import time

def extract_with_retry(file, schema, max_retries=3):
    for attempt in range(max_retries):
        try:
            result = extract_document(file, schema)
            if result['success']:
                return result
        except Exception as e:
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt  # Exponential backoff
                time.sleep(wait_time)
            else:
                raise
```

### Handle Common Errors


```python
try:
    result = extract_document(file, schema)
except FileToolargeError:
    # Compress file and retry
    compressed = compress_file(file)
    result = extract_document(compressed, schema)
except UnsupportedFileTypeError:
    # Convert to supported format
    converted = convert_to_pdf(file)
    result = extract_document(converted, schema)
except RateLimitError:
    # Wait and retry
    time.sleep(60)
    result = extract_document(file, schema)
```

## Cost Optimization

### Use Typed Documents

PDFs with selectable text are 10-50x cheaper than scanned images.

### Choose Appropriate Model

Use ftg-3.0 for most cases, gpt-4o only when needed.

### Batch Similar Documents

Process multiple documents in one batch for better efficiency.

### Pre-process Images

Reduce image resolution to 300 DPI before uploading.

### Monitor Costs

Track costs per document type to identify optimization opportunities.


```python
# Track costs by document type
invoice_cost = 0.002  # Typed PDF
timesheet_cost = 0.015  # Handwritten image
receipt_cost = 0.008  # Scanned image

# Monthly costs
monthly_invoices = 1000 * invoice_cost  # $2
monthly_timesheets = 500 * timesheet_cost  # $7.50
monthly_receipts = 2000 * receipt_cost  # $16
total_monthly = monthly_invoices + monthly_timesheets + monthly_receipts  # $25.50
```

## Related Resources

- [Extract Document Data](/docs/api-reference/documents/extract)
- [Batch Document Extraction](/docs/api-reference/documents/extract-batch)
- [Get Job Status](/docs/api-reference/documents/get-job-status)
- [Document Extraction Overview](/docs/api-reference/documents/introduction)