# Document Data Extraction

Extract structured data from PDFs, Word documents, Excel files, images, and scanned documents by providing a file and a JSON schema.

## Overview

The Document Data Extraction API uses a multi-stage processing pipeline that detects the best extraction strategy for each file, handles both typed and scanned documents, and uses AI to structure the extracted content according to your schema.

## Authentication

All endpoints require authentication. Pass your API key in the `X-API-Key` header:

```bash
X-API-Key: YOUR_ACCESS_TOKEN
```

[Get your API key from Freddy Hub](https://freddy-hub.aitronos.com/freddy/api)

## How It Works

### Processing Pipeline

1. **Document Analysis**: Detects the document type and determines the optimal extraction strategy
2. **Text Extraction**:
   - **Fast Path**: Direct text extraction for typed PDFs, Word, and Excel documents
   - **OCR Path**: OCR with image preprocessing for images and scanned documents
3. **Data Structuring**: Uses AI with JSON mode to structure the extracted text according to your schema
4. **Validation**: Validates the extracted data against your schema and adjusts confidence scores
5. **Quality Assurance**: Automatically retries with enhanced settings when confidence falls below 60%

### Automatic Strategy Selection

The system automatically chooses the extraction method:

- **Typed Documents** (PDF, DOCX, XLSX): Direct text extraction → fast and cost-effective (~$0.001 per document)
- **Images/Scanned Documents**: OCR with preprocessing → high accuracy (~$0.01-0.05 per document)
- **Fallback**: If direct extraction fails, the system falls back to OCR

### OCR Image Preprocessing

For images and scanned documents, the system applies the following preprocessing steps before OCR:

- **Grayscale conversion**: Removes color noise
- **Contrast enhancement**: 2x contrast boost for better text visibility
- **Sharpening**: Enhances text edges
- **Denoising**: Removes background noise
- **Brightness adjustment**: 1.2x brightness boost

This preprocessing improves OCR accuracy for handwritten text, low-quality scans, and faded documents.

## Supported File Types

**Documents**: PDF (.pdf), Word (.docx), Excel (.xlsx)

**Images**: JPEG (.jpg, .jpeg), PNG (.png), GIF (.gif), BMP (.bmp), TIFF (.tiff)

**File Size Limit**: Up to 50 MB, depending on your tier (see Rate Limits)

## Key Features

- Automatic document type detection
- Intelligent extraction strategy selection
- Support for typed and scanned documents
- Advanced OCR with image preprocessing
- Schema-based data structuring
- Detailed confidence metrics
- Automatic quality improvement (retry on low confidence)
- Batch processing (up to 50 documents per batch)
- Custom prompt support
- Cost optimization

## Use Cases

- **Timesheet Processing** - Extract work hours, shifts, and employee data from timesheets ([See guide](/docs/documentation/examples/timesheet-extraction))
- Invoice and receipt processing
- Form data extraction
- Document digitization
- Data entry automation
- Contract analysis
- ID card and passport scanning
- Medical record processing
- Financial document parsing

## Rate Limits

| Tier | Requests/Day | Concurrent Jobs | Max File Size |
| --- | --- | --- | --- |
| Free | 100 | 1 | 10 MB |
| Basic | 1,000 | 3 | 25 MB |
| Pro | 10,000 | 10 | 50 MB |
| Enterprise | Custom | Custom | Custom |

## Pricing

Costs depend on the processing method used:

| Method | Cost Range | When Used |
| --- | --- | --- |
| Direct Text Extraction | ~$0.001 per document | Typed PDFs, Word, Excel |
| OCR + LLM Extraction | ~$0.01-0.05 per document | Images, scanned documents, handwritten text |
| Retry (Low Confidence) | +$0.001-0.005 per document | Automatic retry if confidence < 60% |

### Cost Optimization Tips

1. **Use Typed Documents**: PDFs with selectable text are 10-50x cheaper to process than scanned images
2. **Choose an Appropriate Model**: Use ftg-3.0 for most cases and gpt-4o only when needed
3. **Batch Processing**: Process multiple documents in one batch for better efficiency
4. **Optimize Images**: Reduce image resolution to 300 DPI, which is sufficient for OCR
5. **Pre-process Documents**: Convert scanned PDFs to typed PDFs when possible

## Next Steps

- [Extract Document Data](/docs/api-reference/documents/extract) - Extract structured data from a single document
- [Batch Document Extraction](/docs/api-reference/documents/extract-batch) - Process multiple documents in parallel
- [Get Job Status](/docs/api-reference/documents/get-job-status) - Check the status of an extraction job
- [Analyze Image](/docs/api-reference/documents/analyze-image) - Analyze images using Vision API
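
## Appendix: Client-Side Pre-flight Sketch

The file-type, tier-limit, and pricing rules on this page can be checked locally before uploading a document. The following is a minimal sketch, not part of the API or any official SDK: the function names `preflight` and `needs_retry`, the lowercase tier keys, and the returned dictionary shape are all hypothetical. The constants are copied from the tables on this page and are approximate. It makes no network calls.

```python
from pathlib import Path

# Values copied from the Supported File Types, Rate Limits, and Pricing
# sections of this page. Costs are approximate (low, high) ranges in USD.
TYPED_EXTENSIONS = {".pdf", ".docx", ".xlsx"}
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff"}
MAX_FILE_SIZE_MB = {"free": 10, "basic": 25, "pro": 50, "enterprise": None}
DIRECT_COST = (0.001, 0.001)
OCR_COST = (0.01, 0.05)
RETRY_SURCHARGE = (0.001, 0.005)
LOW_CONFIDENCE_THRESHOLD = 0.60


def preflight(filename: str, size_mb: float, tier: str = "free") -> dict:
    """Hypothetical helper: guess the extraction path and cost for one file.

    The guess is based on the file extension only. A scanned PDF still has a
    .pdf extension but may fall back to OCR, so the "direct" estimate is a
    lower bound for PDFs.
    """
    ext = Path(filename).suffix.lower()
    if ext in TYPED_EXTENSIONS:
        path, cost = "direct", DIRECT_COST
    elif ext in IMAGE_EXTENSIONS:
        path, cost = "ocr", OCR_COST
    else:
        raise ValueError(f"Unsupported file type: {ext or filename}")

    limit = MAX_FILE_SIZE_MB[tier]  # None means a custom (Enterprise) limit
    if limit is not None and size_mb > limit:
        raise ValueError(f"{size_mb} MB exceeds the {limit} MB limit for tier '{tier}'")

    return {
        "expected_path": path,
        "cost_range_usd": cost,
        # Upper bound if the low-confidence retry is also triggered.
        "worst_case_usd": round(cost[1] + RETRY_SURCHARGE[1], 3),
    }


def needs_retry(confidence: float) -> bool:
    """Mirror the documented rule: a retry happens when confidence is below 60%."""
    return confidence < LOW_CONFIDENCE_THRESHOLD


if __name__ == "__main__":
    print(preflight("invoice.pdf", 2.0, "free"))
    print(preflight("receipt.jpg", 8.0, "basic"))
```

A check like this only rejects files that are certain to fail (unsupported extension, over the tier limit). The service still decides the actual extraction path and whether to retry.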