Use descriptive field names that clearly indicate the data type and purpose.
{
  "invoice_number": "string",
  "invoice_date": "string",
  "due_date": "string",
  "vendor_name": "string",
  "total_amount": "number",
  "currency": "string"
}

Match field types to the expected data format.
- string: Text content, IDs, dates
- number: Numeric values (integers or decimals)
- integer: Whole numbers only
- boolean: True/false values
- array: Lists of items
- object: Nested structures
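
The type names above can also be used to sanity-check extracted values after the fact. The following is a minimal sketch, not part of any SDK: `TYPE_CHECKS` and `find_type_mismatches` are hypothetical names, and it assumes a flat schema whose values are type-name strings, optionally written as "string | null" for nullable fields.

```python
# Hypothetical helper: checks extracted values against the simple type names
# listed above. Assumes a flat schema (no nested arrays or objects).
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
    "array": lambda v: isinstance(v, list),
    "object": lambda v: isinstance(v, dict),
}


def find_type_mismatches(schema, data):
    """Return the names of fields whose value does not match the declared type."""
    mismatches = []
    for field, type_name in schema.items():
        # "string | null" style declarations allow None as well as the base type
        allowed = [part.strip() for part in type_name.split("|")]
        value = data.get(field)
        if value is None:
            # Missing or null values are only acceptable for nullable fields
            if "null" not in allowed:
                mismatches.append(field)
            continue
        if not any(TYPE_CHECKS[t](value) for t in allowed if t != "null"):
            mismatches.append(field)
    return mismatches
```

Fields returned by this check are good candidates for manual review before the data is used downstream.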
Allow null values for optional fields to handle incomplete documents.
{
  "invoice_number": "string",
  "total_amount": "number",
  "notes": "string | null",
  "purchase_order": "string | null"
}

Use arrays for repeating items like line items or entries.
{
  "invoice_number": "string",
  "line_items": [
    {
      "description": "string",
      "quantity": "number",
      "unit_price": "number",
      "total": "number"
    }
  ]
}

Clearly state what you want extracted and how to handle edge cases.
"Extract timesheet data from this handwritten form. Focus on:
1. Date entries in YYYY-MM-DD format
2. Start and end times in 24-hour format
3. Calculate total hours as end time minus start time, minus a 45-minute break
4. If handwriting is unclear, mark field as null
5. Ignore notes or comments in margins"

Explain the document type and structure to guide the AI.
"This is a medical prescription form. Extract:
- Patient name and date of birth from top section
- Medication names and dosages from middle section
- Doctor signature date from bottom
- Ignore pharmacy stamps and barcodes"

Specify how to handle unclear or missing data.
"For handwritten text, if unclear, mark the field as null rather than guessing.
If multiple values are present, extract only the most recent one."

Specify exact formats for dates, numbers, and other structured data.
"Extract dates in YYYY-MM-DD format. Convert any other date formats.
Extract currency amounts without symbols (e.g., 1250.00 instead of $1,250.00).
Round all decimal numbers to 2 decimal places."

Use Synchronous (sync: true) for:
- Small documents (< 5 pages)
- Real-time extraction needs
- Interactive UIs where users wait for results
- Testing and development
Use Asynchronous (sync: false) for:
- Large documents (5 pages or more)
- Batch processing
- Background jobs
- High-volume processing
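
Asynchronous mode returns before extraction finishes, so the caller has to check back for the result. The sketch below shows one way to poll with a timeout. It is illustrative only: `wait_for_result`, the `fetch_status` callable, and the status values "processing", "completed", and "failed" are assumptions, not documented API names.

```python
import time


def wait_for_result(job_id, fetch_status, interval=2.0, timeout=300.0, sleep=time.sleep):
    """Poll fetch_status(job_id) until the job finishes or the timeout elapses.

    fetch_status is a hypothetical callable returning a dict with a "status"
    key; the status strings used here are assumptions.
    """
    waited = 0.0
    while True:
        job = fetch_status(job_id)
        if job["status"] == "completed":
            return job
        if job["status"] == "failed":
            raise RuntimeError(f"extraction job {job_id} failed")
        if waited >= timeout:
            raise TimeoutError(f"extraction job {job_id} still running after {timeout}s")
        # Wait before asking again so the service is not flooded with requests
        sleep(interval)
        waited += interval
```

Passing `sleep` as a parameter keeps the function easy to test without real delays.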
Process similar documents together for better efficiency.
# Good: Batch similar documents
files = ["invoice1.pdf", "invoice2.pdf", "invoice3.pdf"]
response = extract_batch(files, invoice_schema)

# Less efficient: Process individually
for file in files:
    response = extract_single(file, invoice_schema)

Choose a model based on document complexity and cost:
- ftg-3.0: Default, best balance of speed and accuracy
- gpt-4o-mini: Faster, lower cost, good for simple documents
- gpt-4o: Highest accuracy, use for complex or critical documents
Review documents with low confidence scores (<75%).
result = extract_document(file, schema)
if result['confidence']['overall'] < 75:
    # Flag for manual review
    flag_for_review(result['job_id'])

| Score | Quality | Action |
|---|---|---|
| 90-100 | Excellent | Use with confidence |
| 75-89 | Good | Spot-check important fields |
| 60-74 | Fair | Review carefully |
| 0-59 | Poor | Manual review required |
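
The table can be turned into a small routing function. This is a sketch under the assumption that the overall confidence score is a number on a 0-100 scale; `review_action` is a hypothetical name.

```python
def review_action(score):
    """Map an overall confidence score (0-100) to the action from the table above."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score >= 90:
        return "Use with confidence"
    if score >= 75:
        return "Spot-check important fields"
    if score >= 60:
        return "Review carefully"
    return "Manual review required"
```

For example, `review_action(82)` returns "Spot-check important fields".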
Always validate critical fields like amounts and dates.
from datetime import datetime

extracted = result['extracted_data']

# Validate amount is reasonable
if extracted['total_amount'] > 100000:
    alert_high_amount(extracted)

# Validate date format
try:
    datetime.strptime(extracted['date'], '%Y-%m-%d')
except ValueError:
    flag_invalid_date(extracted)

Monitor extraction accuracy over time to identify problem areas.
# Track success rate
total_extractions = 1000
successful = 950
accuracy = successful / total_extractions  # 0.95 (95%)

# Track by document type
invoice_accuracy = 0.98    # 98%
timesheet_accuracy = 0.85  # 85%; may need prompt improvement

Compress images to 300 DPI (sufficient for OCR).
from PIL import Image

img = Image.open('scan.jpg')
# Halve the pixel dimensions (assumes the scan was made at about 600 DPI)
img = img.resize((int(img.width * 0.5), int(img.height * 0.5)))
img.save('scan_optimized.jpg', quality=85, dpi=(300, 300))

Convert scanned PDFs to typed PDFs (with selectable text) when possible for 10-50x cost savings.
Don't block on large documents - use async mode.
# Good: Async for large documents
if file_size > 5_000_000:  # 5 MB
    result = extract_async(file, schema)
    job_id = result['job_id']
    # Poll for results later
else:
    result = extract_sync(file, schema)

Cache extracted data to avoid re-processing the same document.
import hashlib

def get_file_hash(file_path):
    with open(file_path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

# Wrapped in a function so the return statements are valid
# (the name extract_cached is illustrative)
def extract_cached(file_path, schema):
    # Key the cache on file contents so renamed copies still hit
    file_hash = get_file_hash(file_path)
    cached = cache.get(file_hash)
    if cached:
        return cached
    result = extract_document(file_path, schema)
    cache.set(file_hash, result, ttl=86400)  # 24 hours
    return result

Retry failed extractions with exponential backoff.
import time

def extract_with_retry(file, schema, max_retries=3):
    for attempt in range(max_retries):
        try:
            result = extract_document(file, schema)
            if result['success']:
                return result
        except Exception:
            # Re-raise once the final attempt has failed
            if attempt == max_retries - 1:
                raise
        if attempt < max_retries - 1:
            time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, ...
    return result  # Last unsuccessful result

Handle specific error types with a matching recovery step.
try:
    result = extract_document(file, schema)
except FileTooLargeError:
    # Compress file and retry
    compressed = compress_file(file)
    result = extract_document(compressed, schema)
except UnsupportedFileTypeError:
    # Convert to supported format
    converted = convert_to_pdf(file)
    result = extract_document(converted, schema)
except RateLimitError:
    # Wait and retry
    time.sleep(60)
    result = extract_document(file, schema)

PDFs with selectable text are 10-50x cheaper than scanned images.
Use ftg-3.0 for most cases, and gpt-4o only when needed.
Process multiple documents in one batch for better efficiency.
Reduce image resolution to 300 DPI before uploading.
Track costs per document type to identify optimization opportunities.
# Track costs by document type
invoice_cost = 0.002 # Typed PDF
timesheet_cost = 0.015 # Handwritten image
receipt_cost = 0.008 # Scanned image
# Monthly costs
monthly_invoices = 1000 * invoice_cost # $2
monthly_timesheets = 500 * timesheet_cost # $7.50
monthly_receipts = 2000 * receipt_cost # $16
total_monthly = monthly_invoices + monthly_timesheets + monthly_receipts # $25.50
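
The same arithmetic can be generalized so that the most expensive document types surface first. This is a minimal sketch: `monthly_cost_report` is a hypothetical helper, and the volumes and per-document costs are the illustrative figures from the example above, not real pricing.

```python
def monthly_cost_report(volumes, unit_costs):
    """Return (per-type costs sorted from highest to lowest, total monthly cost)."""
    # Cost per type = documents processed per month * cost per document
    costs = {doc_type: volumes[doc_type] * unit_costs[doc_type] for doc_type in volumes}
    ranked = sorted(costs.items(), key=lambda item: item[1], reverse=True)
    return ranked, sum(costs.values())


# Illustrative figures taken from the example above
volumes = {"invoice": 1000, "timesheet": 500, "receipt": 2000}
unit_costs = {"invoice": 0.002, "timesheet": 0.015, "receipt": 0.008}
ranked, total = monthly_cost_report(volumes, unit_costs)
for doc_type, cost in ranked:
    print(f"{doc_type}: ${cost:.2f}")
print(f"total: ${total:.2f}")
```

With these figures, receipts rank first even though their per-document cost is lower than that of timesheets, because volume dominates.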