
Extract structured data from any document type (PDFs, Word documents, Excel files, images, scanned documents) by providing a file and a JSON schema.

Overview

The Document Data Extraction API uses an intelligent multi-stage processing pipeline to automatically detect the best extraction strategy, handle both typed and scanned documents, and use AI to structure the extracted content according to your schema.
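
To make the schema input concrete, here is a minimal sketch of an invoice schema. It assumes the API accepts standard JSON Schema objects. The field names and the `schema_as_json` helper are hypothetical illustrations, not part of the documented API.

```python
import json

# Hypothetical invoice schema; field names are illustrative only and are
# not taken from the API reference.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "issue_date": {"type": "string", "description": "ISO 8601 date"},
        "total_amount": {"type": "number"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "number"},
                    "unit_price": {"type": "number"},
                },
                "required": ["description"],
            },
        },
    },
    "required": ["invoice_number", "total_amount"],
}


def schema_as_json(schema: dict) -> str:
    """Serialize the schema to a JSON string for inclusion in a request."""
    return json.dumps(schema, indent=2)
```

Keeping `required` short is a deliberate choice in this sketch: fields that are often missing from real documents are left optional so that an absent value does not fail validation.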

Authentication

All endpoints require an API key, sent in the X-API-Key request header:

X-API-Key: YOUR_ACCESS_TOKEN

Get your API key from Freddy Hub
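
As a small sketch of how a client might attach the header shown above, the helper below builds the request headers. The function name and the `FREDDY_API_KEY` environment variable are assumptions made for illustration.

```python
import os


def build_auth_headers(api_key: str) -> dict[str, str]:
    """Return request headers carrying the API key in X-API-Key."""
    if not api_key:
        raise ValueError("API key must not be empty")
    return {"X-API-Key": api_key}


# FREDDY_API_KEY is a hypothetical environment variable name; reading the
# key from the environment keeps it out of source control.
headers = build_auth_headers(os.environ.get("FREDDY_API_KEY", "YOUR_ACCESS_TOKEN"))
```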

How It Works

Processing Pipeline

  1. Document Analysis: Automatically detects document type and determines optimal extraction strategy
  2. Text Extraction:
    • Fast Path: Direct text extraction for typed PDFs, Word, and Excel documents
    • OCR Path: Advanced OCR with image preprocessing for images and scanned documents
  3. Data Structuring: Uses AI with JSON mode to structure extracted text according to your schema
  4. Validation: Validates extracted data against your schema and adjusts confidence scores
  5. Quality Assurance: Automatic retry with enhanced settings if confidence is low (<60%)
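
The quality assurance step can be pictured as a single retry guarded by the 60% threshold. The sketch below is an illustration of that idea, not the service's implementation. The `extract` callable, its `enhanced` flag, and the choice to keep the higher-scoring attempt are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable

CONFIDENCE_THRESHOLD = 0.60  # retry threshold stated in the pipeline description


@dataclass
class ExtractionResult:
    data: dict
    confidence: float  # 0.0 to 1.0


def extract_with_retry(
    extract: Callable[[bool], ExtractionResult],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> ExtractionResult:
    """Run extraction once; retry once with enhanced settings if confidence is low.

    `extract` is a hypothetical callable that takes an `enhanced` flag.
    """
    first = extract(False)
    if first.confidence >= threshold:
        return first
    second = extract(True)
    # Assumption: keep whichever attempt scored higher.
    return second if second.confidence > first.confidence else first
```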

Automatic Strategy Selection

The system automatically chooses the best extraction method:

  • Typed Documents (PDF, DOCX, XLSX): Direct text extraction → Fast & cost-effective (~$0.001 per document)
  • Images/Scanned Documents: OCR with preprocessing → High accuracy (~$0.01-0.05 per document)
  • Fallback: If direct extraction fails, automatically falls back to OCR
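
The selection and fallback rules above can be sketched roughly as follows. This version chooses by file extension only, which is a simplifying assumption; the service is described as analyzing the document itself. The `direct` and `ocr` callables are hypothetical stand-ins for the two extraction paths.

```python
from pathlib import Path
from typing import Callable

TYPED_EXTENSIONS = {".pdf", ".docx", ".xlsx"}
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff"}


def choose_strategy(filename: str) -> str:
    """Pick 'direct' for typed documents and 'ocr' for images."""
    ext = Path(filename).suffix.lower()
    if ext in TYPED_EXTENSIONS:
        return "direct"
    if ext in IMAGE_EXTENSIONS:
        return "ocr"
    raise ValueError(f"Unsupported file type: {ext or filename}")


def extract_text(
    filename: str,
    direct: Callable[[str], str],
    ocr: Callable[[str], str],
) -> tuple[str, str]:
    """Return (text, strategy_used), falling back to OCR if direct extraction fails."""
    if choose_strategy(filename) == "direct":
        try:
            text = direct(filename)
        except Exception:
            text = ""
        if text.strip():
            return text, "direct"
    # Either the file is an image or direct extraction produced no text.
    return ocr(filename), "ocr"
```

A scanned PDF is the typical fallback case: the extension suggests the fast path, but direct extraction returns no text, so the OCR path runs instead.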

OCR Image Preprocessing

For images and scanned documents, the system applies advanced preprocessing to improve OCR accuracy:

  • Grayscale conversion: Removes color noise
  • Contrast enhancement: 2x contrast boost for better text visibility
  • Sharpening: Enhances text edges
  • Denoising: Removes background noise
  • Brightness adjustment: 1.2x brightness boost

This preprocessing significantly improves OCR accuracy for handwritten text, low-quality scans, and faded documents.
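
To show the arithmetic behind the per-pixel steps, here is a dependency-free sketch covering grayscale conversion, the 2x contrast boost, and the 1.2x brightness boost. It is not the service's code. The luma weights, the contrast midpoint of 128, and the order of operations are assumptions. Sharpening and denoising are omitted because they need neighborhood filters.

```python
def clamp(value: float) -> int:
    """Round and clip a value into the 0-255 range."""
    return max(0, min(255, round(value)))


def to_grayscale(pixel: tuple[int, int, int]) -> int:
    """Convert an RGB pixel to gray using ITU-R BT.601 luma weights (assumed)."""
    r, g, b = pixel
    return clamp(0.299 * r + 0.587 * g + 0.114 * b)


def boost_contrast(gray: int, factor: float = 2.0, midpoint: int = 128) -> int:
    """Stretch the distance from the midpoint by `factor`."""
    return clamp(midpoint + factor * (gray - midpoint))


def boost_brightness(gray: int, factor: float = 1.2) -> int:
    """Scale intensity by `factor`."""
    return clamp(gray * factor)


def preprocess(image: list[list[tuple[int, int, int]]]) -> list[list[int]]:
    """Apply grayscale, contrast, then brightness to every pixel."""
    return [
        [boost_brightness(boost_contrast(to_grayscale(p))) for p in row]
        for row in image
    ]
```

For example, a mid-light pixel `(150, 150, 150)` becomes gray 150, then 172 after the contrast stretch, then 206 after the brightness boost, while pure black and pure white stay at 0 and 255.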

Supported File Types

Documents: PDF (.pdf), Word (.docx), Excel (.xlsx)
Images: JPEG (.jpg, .jpeg), PNG (.png), GIF (.gif), BMP (.bmp), TIFF (.tiff)

File Size Limit: 50 MB (lower on the Free and Basic tiers; see Rate Limits)

Key Features

  • Automatic document type detection
  • Intelligent extraction strategy selection
  • Support for typed and scanned documents
  • Advanced OCR with image preprocessing
  • Schema-based data structuring
  • Detailed confidence metrics
  • Automatic quality improvement (retry on low confidence)
  • Batch processing (up to 50 documents)
  • Custom prompt support
  • Cost optimization

Use Cases

  • Timesheet processing: extract work hours, shifts, and employee data from timesheets (see the timesheet guide)
  • Invoice and receipt processing
  • Form data extraction
  • Document digitization
  • Data entry automation
  • Contract analysis
  • ID card and passport scanning
  • Medical record processing
  • Financial document parsing

Rate Limits

| Tier | Requests/Day | Concurrent Jobs | Max File Size |
|---|---|---|---|
| Free | 100 | 1 | 10 MB |
| Basic | 1,000 | 3 | 25 MB |
| Pro | 10,000 | 10 | 50 MB |
| Enterprise | Custom | Custom | Custom |
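
A client can check these limits locally before submitting a job. The sketch below encodes the published tier values. The `can_submit` helper and its counters are hypothetical, and the Enterprise tier is left out because its limits are custom.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierLimits:
    requests_per_day: int
    concurrent_jobs: int
    max_file_size_mb: int


# Values copied from the rate limit table; Enterprise limits are custom.
TIER_LIMITS = {
    "free": TierLimits(100, 1, 10),
    "basic": TierLimits(1_000, 3, 25),
    "pro": TierLimits(10_000, 10, 50),
}


def can_submit(tier: str, file_size_mb: float, requests_today: int, active_jobs: int) -> bool:
    """Return True if one more request fits within the tier's limits."""
    limits = TIER_LIMITS[tier]
    return (
        file_size_mb <= limits.max_file_size_mb
        and requests_today < limits.requests_per_day
        and active_jobs < limits.concurrent_jobs
    )
```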

Pricing

Costs are calculated based on the processing method used:

| Method | Cost Range | When Used |
|---|---|---|
| Direct Text Extraction | ~$0.001 per document | Typed PDFs, Word, Excel |
| OCR + LLM Extraction | ~$0.01-0.05 per document | Images, scanned documents, handwritten text |
| Retry (Low Confidence) | +$0.001-0.005 per document | Automatic retry if confidence < 60% |
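
The approximate per-document prices above can be turned into a rough budget range. The sketch below is only an estimate built from those published figures. The function name is hypothetical, and actual billing may differ.

```python
# Approximate (low, high) cost per document in USD, from the pricing table.
DIRECT_COST = (0.001, 0.001)
OCR_COST = (0.01, 0.05)
RETRY_COST = (0.001, 0.005)


def estimate_cost(typed_docs: int, scanned_docs: int, retries: int = 0) -> tuple[float, float]:
    """Return a (low, high) cost estimate in USD for a mixed workload."""
    low = typed_docs * DIRECT_COST[0] + scanned_docs * OCR_COST[0] + retries * RETRY_COST[0]
    high = typed_docs * DIRECT_COST[1] + scanned_docs * OCR_COST[1] + retries * RETRY_COST[1]
    return round(low, 4), round(high, 4)
```

For 1,000 typed documents, 200 scans, and 50 retries, the estimate is between $3.05 and $11.25. The scanned documents dominate the upper bound.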

Cost Optimization Tips

  1. Use Typed Documents: PDFs with selectable text are 10-50x cheaper than scanned images
  2. Choose Appropriate Model: Use ftg-3.0 for most cases, gpt-4o only when needed
  3. Batch Processing: Process multiple documents in one batch for better efficiency
  4. Optimize Images: Downsample images scanned above 300 DPI to 300 DPI, which is sufficient for OCR
  5. Pre-process Documents: Convert scanned PDFs to typed PDFs when possible

Next Steps