# Document Data Extraction

Extract structured data from any document type (PDFs, Word documents, Excel files, images, scanned documents) by providing a file and a JSON schema.

## Overview

The Document Data Extraction API runs a multi-stage processing pipeline: it detects the best extraction strategy for each file, handles both typed and scanned documents, and uses AI to structure the extracted content according to your schema.

## Authentication

All endpoints require an API key, sent in the `X-API-Key` request header:


```bash
X-API-Key: YOUR_ACCESS_TOKEN
```

[Get your API key from Freddy Hub](https://freddy-hub.aitronos.com/freddy/api)
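
As a minimal sketch of attaching the header shown above, the snippet below builds (but does not send) a request with Python's standard library. The endpoint URL and the empty JSON body are placeholders, not documented values; see the API reference pages for the real paths and request format.

```python
import urllib.request

# Hypothetical endpoint URL: the real path is not given on this page.
EXTRACT_URL = "https://api.example.com/v1/documents/extract"


def build_request(api_key: str, body: bytes, url: str = EXTRACT_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request that carries the API key header."""
    if not api_key:
        raise ValueError("api_key must not be empty")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        # urllib stores header names with normalized capitalization ("X-api-key");
        # HTTP header names are case-insensitive, so this is harmless.
        headers={"X-API-Key": api_key},
    )


req = build_request("YOUR_ACCESS_TOKEN", b"{}")
```

Passing `req` to `urllib.request.urlopen` would send it; that step is left out here because the endpoint above is only a placeholder.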

## How It Works

### Processing Pipeline

1. **Document Analysis**: Automatically detects document type and determines optimal extraction strategy
2. **Text Extraction**:
  - **Fast Path**: Direct text extraction for typed PDFs, Word, and Excel documents
  - **OCR Path**: Advanced OCR with image preprocessing for images and scanned documents
3. **Data Structuring**: Uses AI with JSON mode to structure extracted text according to your schema
4. **Validation**: Validates extracted data against your schema and adjusts confidence scores
5. **Quality Assurance**: Automatic retry with enhanced settings if confidence is low (<60%)
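
The quality-assurance step can be pictured with a small client-side sketch. The service's internal logic is not documented here, so this is only an illustration under assumptions: `run` is a hypothetical callable that performs one extraction pass, confidence is expressed as a fraction between 0 and 1, and keeping the higher-confidence result after a retry is a design choice of this sketch.

```python
from typing import Any, Callable, Dict

CONFIDENCE_THRESHOLD = 0.60  # from the pipeline description: retry below 60%


def extract_with_retry(run: Callable[[bool], Dict[str, Any]]) -> Dict[str, Any]:
    """Run extraction once, then once more with enhanced settings if confidence is low.

    `run` is a hypothetical callable: run(enhanced) -> {"data": ..., "confidence": float}.
    """
    first = run(False)
    if first["confidence"] >= CONFIDENCE_THRESHOLD:
        return {**first, "retried": False}

    # Low confidence: retry with enhanced settings and keep the better result
    # (keeping the better of the two is an assumption of this sketch).
    second = run(True)
    best = second if second["confidence"] >= first["confidence"] else first
    return {**best, "retried": True}
```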


### Automatic Strategy Selection

The system automatically chooses the best extraction method:

- **Typed Documents** (PDF, DOCX, XLSX): Direct text extraction → Fast & cost-effective (~$0.001 per document)
- **Images/Scanned Documents**: OCR with preprocessing → High accuracy (~$0.01-0.05 per document)
- **Fallback**: If direct extraction fails, automatically falls back to OCR
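
The selection rules above can be sketched as follows. This is an illustration, not the service's implementation: the real system presumably inspects file content rather than just the extension, the `direct` and `ocr` callables are hypothetical stand-ins for the two extraction paths, and treating empty text as a failed direct extraction is an assumption.

```python
from pathlib import Path
from typing import Callable, Tuple

TYPED_EXTENSIONS = {".pdf", ".docx", ".xlsx"}
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff"}


def choose_strategy(filename: str) -> str:
    """Pick the initial strategy from the file extension."""
    ext = Path(filename).suffix.lower()
    if ext in TYPED_EXTENSIONS:
        return "direct"
    if ext in IMAGE_EXTENSIONS:
        return "ocr"
    raise ValueError(f"unsupported file type: {ext or filename}")


def extract_text(
    filename: str,
    direct: Callable[[str], str],
    ocr: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (text, method_used), falling back to OCR when direct extraction fails."""
    if choose_strategy(filename) == "direct":
        try:
            text = direct(filename)
            if text.strip():
                return text, "direct"
        except Exception:
            pass  # direct extraction failed; fall through to OCR
    return ocr(filename), "ocr"
```

A scanned PDF is the typical fallback case: it has a `.pdf` extension, yields no selectable text, and so ends up on the OCR path.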


### OCR Image Preprocessing

For images and scanned documents, the system applies advanced preprocessing to improve OCR accuracy:

- **Grayscale conversion**: Removes color noise
- **Contrast enhancement**: 2x contrast boost for better text visibility
- **Sharpening**: Enhances text edges
- **Denoising**: Removes background noise
- **Brightness adjustment**: 1.2x brightness boost


This preprocessing significantly improves OCR accuracy for handwritten text, low-quality scans, and faded documents.
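
To make the numeric steps concrete, here is a pure-Python sketch of grayscale conversion, a 2x contrast boost, and a 1.2x brightness boost on a flat list of RGB pixels. The service's actual implementation and step order are not documented, so the formulas are assumptions: luma weights from ITU-R BT.601 for grayscale, and contrast scaled about the mean gray level. Sharpening and denoising are omitted because they need neighborhood filters.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def clamp(value: float) -> int:
    """Round to the nearest integer and clip into the 0-255 range."""
    return max(0, min(255, int(round(value))))


def to_grayscale(pixels: List[Pixel]) -> List[int]:
    # ITU-R BT.601 luma weights (an assumption about the grayscale formula).
    return [clamp(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in pixels]


def enhance_contrast(gray: List[int], factor: float = 2.0) -> List[int]:
    """Stretch values away from the mean gray level by `factor`."""
    if not gray:
        return []
    mean = sum(gray) / len(gray)
    return [clamp(mean + factor * (v - mean)) for v in gray]


def adjust_brightness(gray: List[int], factor: float = 1.2) -> List[int]:
    """Scale every value by `factor`."""
    return [clamp(v * factor) for v in gray]


def preprocess(pixels: List[Pixel]) -> List[int]:
    return adjust_brightness(enhance_contrast(to_grayscale(pixels)))
```

For example, two mid-gray pixels `(100, 100, 100)` and `(120, 120, 120)` become `[108, 156]`: the gap between them doubles from 20 to 40 (90 and 130), and then both values are scaled by 1.2.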

## Supported File Types

- **Documents**: PDF (.pdf), Word (.docx), Excel (.xlsx)
- **Images**: JPEG (.jpg, .jpeg), PNG (.png), GIF (.gif), BMP (.bmp), TIFF (.tiff)

**File Size Limit**: 50 MB (lower on the Free and Basic tiers; see [Rate Limits](#rate-limits))
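
Checking a file locally before upload avoids a wasted request. The helper below is a sketch: `validate_upload` is a hypothetical name, and it assumes 1 MB means 1,048,576 bytes, which this page does not specify.

```python
from pathlib import Path
from typing import List

SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".xlsx",                          # documents
    ".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff",  # images
}
MAX_FILE_SIZE_BYTES = 50 * 1024 * 1024  # assumes 1 MB = 1,048,576 bytes


def validate_upload(
    filename: str,
    size_bytes: int,
    max_bytes: int = MAX_FILE_SIZE_BYTES,
) -> List[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems: List[str] = []
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file type: {ext or '(none)'}")
    if size_bytes > max_bytes:
        problems.append(f"file is {size_bytes} bytes; limit is {max_bytes}")
    return problems
```

Pass a smaller `max_bytes` if your plan has a lower limit than 50 MB.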

## Key Features

- Automatic document type detection
- Intelligent extraction strategy selection
- Support for typed and scanned documents
- Advanced OCR with image preprocessing
- Schema-based data structuring
- Detailed confidence metrics
- Automatic quality improvement (retry on low confidence)
- Batch processing (up to 50 documents)
- Custom prompt support
- Cost optimization
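
Because a batch holds at most 50 documents, larger sets have to be split on the client side. A minimal sketch (the helper name `chunk_documents` is illustrative, not part of the API):

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")
MAX_BATCH_SIZE = 50  # batch limit listed under Key Features


def chunk_documents(items: Sequence[T], size: int = MAX_BATCH_SIZE) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each with at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])
```

For 120 documents this yields batches of 50, 50, and 20.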


## Use Cases

- **Timesheet Processing** - Extract work hours, shifts, and employee data from timesheets ([See guide](/docs/documentation/examples/timesheet-extraction))
- Invoice and receipt processing
- Form data extraction
- Document digitization
- Data entry automation
- Contract analysis
- ID card and passport scanning
- Medical record processing
- Financial document parsing


## Rate Limits

| Tier | Requests/Day | Concurrent Jobs | Max File Size |
|  --- | --- | --- | --- |
| Free | 100 | 1 | 10 MB |
| Basic | 1,000 | 3 | 25 MB |
| Pro | 10,000 | 10 | 50 MB |
| Enterprise | Custom | Custom | Custom |


## Pricing

Costs are calculated based on the processing method used:

| Method | Cost Range | When Used |
|  --- | --- | --- |
| Direct Text Extraction | ~$0.001 per document | Typed PDFs, Word, Excel |
| OCR + LLM Extraction | ~$0.01-0.05 per document | Images, scanned documents, handwritten text |
| Retry (Low Confidence) | +$0.001-0.005 per document | Automatic retry if confidence < 60% |
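
The table translates into a rough budget estimate. The sketch below uses the approximate per-document figures above as (low, high) bounds; actual billing may differ, and the function name is illustrative.

```python
from typing import Tuple

# (low, high) USD per document, taken from the pricing table above.
DIRECT_COST = (0.001, 0.001)
OCR_COST = (0.01, 0.05)
RETRY_SURCHARGE = (0.001, 0.005)


def estimate_cost(direct_docs: int, ocr_docs: int, retried_docs: int = 0) -> Tuple[float, float]:
    """Return an approximate (low, high) cost range in USD."""
    low = (direct_docs * DIRECT_COST[0]
           + ocr_docs * OCR_COST[0]
           + retried_docs * RETRY_SURCHARGE[0])
    high = (direct_docs * DIRECT_COST[1]
            + ocr_docs * OCR_COST[1]
            + retried_docs * RETRY_SURCHARGE[1])
    return round(low, 4), round(high, 4)
```

For example, 1,000 typed documents, 200 scans, and 50 retries come to roughly $3.05 to $11.25.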


### Cost Optimization Tips

1. **Use Typed Documents**: PDFs with selectable text are 10-50x cheaper than scanned images
2. **Choose an Appropriate Model**: Use `ftg-3.0` for most cases and reserve `gpt-4o` for documents that need it
3. **Batch Processing**: Process multiple documents in one batch for better efficiency
4. **Optimize Images**: Reduce image resolution to 300 DPI (sufficient for OCR)
5. **Pre-process Documents**: Convert scanned PDFs to typed PDFs when possible


## Next Steps

- [Extract Document Data](/docs/api-reference/documents/extract) - Extract structured data from a single document
- [Batch Document Extraction](/docs/api-reference/documents/extract-batch) - Process multiple documents in parallel
- [Get Job Status](/docs/api-reference/documents/get-job-status) - Check the status of an extraction job
- [Analyze Image](/docs/api-reference/documents/analyze-image) - Analyze images using the Vision API