Extract structured timesheet data from PDF documents (typed or handwritten) using the Document Extraction API. This guide shows you how to build a complete timesheet extraction solution.

Overview

The Document Extraction API can automatically extract timesheet data including:

  • Employee information (ID, name, company)
  • Work shift details (start/end times, durations)
  • Monthly summaries (total hours, days worked)
  • Document type classification (handwritten vs typed)
  • Confidence scores for quality control

Quick Start

1. Define Your Schema

Create a JSON schema that matches your timesheet structure:

{
  "type": "object",
  "properties": {
    "confidence": {
      "type": "object",
      "properties": {
        "image_analysis": {"type": "number", "minimum": 0, "maximum": 100},
        "data_extraction": {"type": "number", "minimum": 0, "maximum": 100},
        "validation": {"type": "number", "minimum": 0, "maximum": 100},
        "overall": {"type": "number", "minimum": 0, "maximum": 100}
      },
      "required": ["image_analysis", "data_extraction", "validation", "overall"]
    },
    "employee_id": {"type": "string"},
    "employee_name": {"type": "string"},
    "company_name": {"type": "string"},
    "summary": {
      "type": "object",
      "properties": {
        "month": {"type": "string"},
        "year": {"type": "integer"},
        "total_days_worked": {"type": "integer"},
        "total_time_minutes": {"type": "integer"},
        "total_time_hours": {"type": "number"},
        "total_time_formatted": {"type": "string"}
      }
    },
    "log_entries": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "start_date_time": {"type": "string", "format": "date-time"},
          "end_date_time": {"type": "string", "format": "date-time"},
          "total_time": {"type": "integer"}
        }
      }
    },
    "document_type": {"type": "string", "enum": ["handwritten", "typed"]}
  }
}

2. Create Custom Prompt

Guide the AI with specific extraction instructions:

Extract all timesheet data. For confidence scores: estimate image_analysis 
based on text clarity (0-100), data_extraction based on how complete the 
data is (0-100), validation as 100 if all required fields are present, and 
overall as the average of all three. Classify document_type as "handwritten" 
if the document appears to be handwritten, otherwise "typed". For log_entries, 
extract all work shifts with start and end times in ISO-8601 format with 
timezone (+02:00 for Central European Time). Calculate total_time in minutes 
for each entry. In the summary, calculate totals across all entries.
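The timestamp and duration conventions the prompt asks for can be sketched with the standard library; this is only an illustration of the requested format, using the shift values from the sample response later in this guide:

```python
from datetime import datetime, timedelta, timezone

# +02:00 offset, as the prompt requests for Central European Time
cet = timezone(timedelta(hours=2))

start = datetime(2025, 8, 7, 16, 50, tzinfo=cet)
end = datetime(2025, 8, 7, 22, 17, tzinfo=cet)

# ISO-8601 with timezone, e.g. 2025-08-07T16:50:00+02:00
iso_start = start.isoformat()

# total_time in minutes for the entry
total_minutes = int((end - start).total_seconds() // 60)  # 327
```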

3. Extract Data

import requests
import json

url = "https://api.aitronos.com/v1/documents/extract"
headers = {"Authorization": "Bearer YOUR_API_TOKEN"}

schema = {
  # ... your schema from step 1
}

with open("timesheet.pdf", "rb") as f:
    files = {"file": f}
    data = {
        "organization_id": "org_abc123",
        "schema": json.dumps(schema),
        "sync": "true",
        "model": "gpt-4o",
        "prompt": "Extract all timesheet data..."  # Your custom prompt
    }
    response = requests.post(url, headers=headers, files=files, data=data)

result = response.json()

if result.get('success') and result.get('status') == 'completed':
    print(f"Employee: {result['extracted_data']['employee_name']}")
    print(f"Total Hours: {result['extracted_data']['summary']['total_time_formatted']}")
    print(f"Confidence: {result['confidence']}%")
else:
    print(f"Error: {result.get('error_message')}")

Complete Example

Python Implementation

import requests
import json
from typing import Dict, Any

class TimesheetExtractor:
    def __init__(self, api_token: str, organization_id: str):
        self.api_token = api_token
        self.organization_id = organization_id
        self.api_url = "https://api.aitronos.com/v1/documents/extract"
    
    def get_schema(self) -> Dict[str, Any]:
        """Return the timesheet extraction schema."""
        return {
            "type": "object",
            "properties": {
                "confidence": {
                    "type": "object",
                    "properties": {
                        "image_analysis": {"type": "number", "minimum": 0, "maximum": 100},
                        "data_extraction": {"type": "number", "minimum": 0, "maximum": 100},
                        "validation": {"type": "number", "minimum": 0, "maximum": 100},
                        "overall": {"type": "number", "minimum": 0, "maximum": 100}
                    },
                    "required": ["image_analysis", "data_extraction", "validation", "overall"]
                },
                "employee_id": {"type": "string"},
                "employee_name": {"type": "string"},
                "company_name": {"type": "string"},
                "summary": {
                    "type": "object",
                    "properties": {
                        "month": {"type": "string"},
                        "year": {"type": "integer"},
                        "total_days_worked": {"type": "integer"},
                        "total_time_minutes": {"type": "integer"},
                        "total_time_hours": {"type": "number"},
                        "total_time_formatted": {"type": "string"}
                    },
                    "required": ["month", "year", "total_days_worked", 
                               "total_time_minutes", "total_time_hours", 
                               "total_time_formatted"]
                },
                "log_entries": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "start_date_time": {"type": "string", "format": "date-time"},
                            "end_date_time": {"type": "string", "format": "date-time"},
                            "total_time": {"type": "integer"}
                        },
                        "required": ["start_date_time", "end_date_time", "total_time"]
                    }
                },
                "document_type": {"type": "string", "enum": ["handwritten", "typed"]}
            },
            "required": ["confidence", "employee_id", "employee_name", 
                        "company_name", "summary", "log_entries", "document_type"]
        }
    
    def get_prompt(self) -> str:
        """Return the extraction prompt."""
        return """Extract all timesheet data. For confidence scores: estimate 
        image_analysis based on text clarity (0-100), data_extraction based on 
        how complete the data is (0-100), validation as 100 if all required 
        fields are present, and overall as the average of all three. Classify 
        document_type as "handwritten" if the document appears to be handwritten, 
        otherwise "typed". For log_entries, extract all work shifts with start 
        and end times in ISO-8601 format with timezone (+02:00 for Central 
        European Time). Calculate total_time in minutes for each entry. In the 
        summary, calculate totals across all entries."""
    
    def extract(self, file_path: str, sync: bool = True) -> Dict[str, Any]:
        """Extract timesheet data from a PDF file."""
        headers = {"Authorization": f"Bearer {self.api_token}"}
        
        with open(file_path, "rb") as f:
            files = {"file": f}
            data = {
                "organization_id": self.organization_id,
                "schema": json.dumps(self.get_schema()),
                "prompt": self.get_prompt(),
                "sync": str(sync).lower(),
                "model": "gpt-4o"
            }
            response = requests.post(self.api_url, headers=headers,
                                     files=files, data=data)
        return response.json()
    
    def validate_extraction(self, result: Dict[str, Any]) -> bool:
        """Validate extraction quality."""
        if not result.get('success'):
            return False
        
        if result.get('status') != 'completed':
            return False
        
        # Check confidence threshold
        confidence = result.get('confidence', 0)
        if confidence < 70:
            print(f"Warning: Low confidence ({confidence}%) - manual review recommended")
            return False
        
        # Check required fields
        data = result.get('extracted_data', {})
        required_fields = ['employee_id', 'employee_name', 'summary', 'log_entries']
        for field in required_fields:
            if not data.get(field):
                print(f"Warning: Missing required field: {field}")
                return False
        
        return True

# Usage
extractor = TimesheetExtractor(
    api_token="YOUR_API_TOKEN",
    organization_id="org_abc123"
)

result = extractor.extract("timesheet.pdf")

if extractor.validate_extraction(result):
    data = result['extracted_data']
    print(f"✅ Extraction successful!")
    print(f"Employee: {data['employee_name']} ({data['employee_id']})")
    print(f"Company: {data['company_name']}")
    print(f"Period: {data['summary']['month']} {data['summary']['year']}")
    print(f"Total Hours: {data['summary']['total_time_formatted']}")
    print(f"Days Worked: {data['summary']['total_days_worked']}")
    print(f"Document Type: {data['document_type']}")
    print(f"Confidence: {result['confidence']}%")
    print(f"Cost: CHF {result['cost_chf']:.4f}")
else:
    print("❌ Extraction failed or quality too low")

Response Structure

Successful Extraction

{
  "success": true,
  "job_id": "doc_abc123",
  "status": "completed",
  "extracted_data": {
    "confidence": {
      "image_analysis": 95,
      "data_extraction": 90,
      "validation": 100,
      "overall": 95
    },
    "employee_id": "12326",
    "employee_name": "Frei (T) Celine",
    "company_name": "Spitex Region Zofingen AG",
    "summary": {
      "month": "August",
      "year": 2025,
      "total_days_worked": 5,
      "total_time_minutes": 1973,
      "total_time_hours": 32.88,
      "total_time_formatted": "32:53"
    },
    "log_entries": [
      {
        "start_date_time": "2025-08-07T16:50:00+02:00",
        "end_date_time": "2025-08-07T22:17:00+02:00",
        "total_time": 327
      },
      {
        "start_date_time": "2025-08-08T12:00:00+02:00",
        "end_date_time": "2025-08-08T12:15:00+02:00",
        "total_time": 15
      }
    ],
    "document_type": "typed"
  },
  "explanation": "Extracted timesheet for employee 12326 (Celine Frei) from Spitex Region Zofingen AG for August 2025. Found 5 work days with a total of 32.88 hours (1,973 minutes).",
  "confidence": 95.0,
  "processing_time": 8.78,
  "cost_chf": 0.015,
  "model_used": "gpt-4o",
  "created_at": "2025-12-19T14:27:07Z",
  "completed_at": "2025-12-19T14:27:20Z"
}
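A quick sanity check against a response like the one above, assuming overall is the average of the three sub-scores (as the prompt requests) and that timestamps parse as ISO-8601:

```python
from datetime import datetime

# Subset of the sample response shown above
result = {
    "extracted_data": {
        "confidence": {"image_analysis": 95, "data_extraction": 90,
                       "validation": 100, "overall": 95},
        "log_entries": [
            {"start_date_time": "2025-08-07T16:50:00+02:00",
             "end_date_time": "2025-08-07T22:17:00+02:00",
             "total_time": 327},
        ],
    },
    "confidence": 95.0,
}

# overall should equal the average of the three sub-scores
conf = result["extracted_data"]["confidence"]
expected_overall = round((conf["image_analysis"] + conf["data_extraction"]
                          + conf["validation"]) / 3)
assert conf["overall"] == expected_overall

# Timestamps carry the +02:00 offset and parse directly
entry = result["extracted_data"]["log_entries"][0]
start = datetime.fromisoformat(entry["start_date_time"])
end = datetime.fromisoformat(entry["end_date_time"])
assert int((end - start).total_seconds() // 60) == entry["total_time"]
```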

Best Practices

1. Use Confidence Scores

Always check confidence scores before processing data:

if result['confidence'] < 70:
    # Flag for manual review
    send_to_manual_review(result)
elif result['confidence'] < 85:
    # Automated processing with validation
    process_with_validation(result)
else:
    # High confidence - automated processing
    process_automatically(result)

2. Handle Missing Data

Not all fields may be present in every document:

data = result['extracted_data']

# Safe field access
employee_id = data.get('employee_id', 'UNKNOWN')
company = data.get('company_name', 'Not specified')

# Check for empty log entries
if not data.get('log_entries'):
    print("Warning: No work shifts found")

3. Validate Calculations

Verify that totals match individual entries:

def validate_totals(data: Dict[str, Any]) -> bool:
    """Verify summary totals match log entries."""
    log_entries = data.get('log_entries', [])
    summary = data.get('summary', {})
    
    # Calculate actual total from entries
    actual_minutes = sum(entry['total_time'] for entry in log_entries)
    reported_minutes = summary.get('total_time_minutes', 0)
    
    # Allow small rounding differences
    if abs(actual_minutes - reported_minutes) > 5:
        print(f"Warning: Total mismatch - {actual_minutes} vs {reported_minutes}")
        return False
    
    return True
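The same idea applies at the entry level: each total_time can be cross-checked against its own start and end timestamps. A minimal sketch (the function name and tolerance are illustrative, not part of the API):

```python
from typing import Any, Dict
from datetime import datetime

def validate_entry_durations(data: Dict[str, Any], tolerance_minutes: int = 1) -> bool:
    """Check each entry's total_time against its start/end timestamps."""
    ok = True
    for entry in data.get('log_entries', []):
        start = datetime.fromisoformat(entry['start_date_time'])
        end = datetime.fromisoformat(entry['end_date_time'])
        computed = int((end - start).total_seconds() // 60)
        if abs(computed - entry['total_time']) > tolerance_minutes:
            print(f"Warning: entry mismatch - computed {computed}, "
                  f"reported {entry['total_time']}")
            ok = False
    return ok
```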

4. Batch Processing

Process multiple timesheets efficiently:

import os
from concurrent.futures import ThreadPoolExecutor

def process_timesheet_batch(file_paths: list[str]) -> list[Dict[str, Any]]:
    """Process multiple timesheets in parallel."""
    extractor = TimesheetExtractor(
        api_token=os.environ["FREDDY_API_KEY"],
        organization_id="org_abc123"
    )
    
    with ThreadPoolExecutor(max_workers=5) as executor:
        results = list(executor.map(extractor.extract, file_paths))
    
    return results

# Process all PDFs in a directory
pdf_files = [f for f in os.listdir("timesheets/") if f.endswith('.pdf')]
results = process_timesheet_batch([f"timesheets/{f}" for f in pdf_files])

# Filter successful extractions
successful = [r for r in results if r.get('success') and r.get('confidence', 0) >= 70]
print(f"Processed {len(successful)}/{len(results)} timesheets successfully")

Troubleshooting

Low Confidence Scores

Problem: Extraction confidence below 70%

Solutions:

  • Ensure document is high quality (300+ DPI for scans)
  • Check that document is not rotated or skewed
  • Verify document contains expected data
  • Try using gpt-4o instead of gpt-4o-mini for better accuracy

Missing Log Entries

Problem: Some work shifts not extracted

Solutions:

  • Add to custom prompt: "Extract ALL work shifts, including partial days"
  • Check explanation field for details
  • Verify document format matches expected structure
  • Ensure shifts are clearly visible in the document

Incorrect Time Calculations

Problem: Total hours don't match individual entries

Solutions:

  • Specify timezone in custom prompt
  • Verify time format in source document (AM/PM vs 24-hour)
  • Check for overlapping shifts
  • Validate date boundaries (shifts crossing midnight)
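For the midnight-boundary case, one defensive approach is to roll the end time forward a day whenever it precedes the start. A sketch (the helper is hypothetical, not part of the API):

```python
from datetime import datetime, timedelta

def shift_minutes(start_str: str, end_str: str) -> int:
    """Shift duration in minutes; roll the end forward a day if it crosses midnight."""
    start = datetime.fromisoformat(start_str)
    end = datetime.fromisoformat(end_str)
    if end < start:  # e.g. a 22:00-06:15 night shift logged with the same date
        end += timedelta(days=1)
    return int((end - start).total_seconds() // 60)
```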

Wrong Document Type Classification

Problem: Typed document classified as handwritten

Solutions:

  • Add to prompt: "Classify as handwritten only if text is clearly handwritten"
  • Check OCR confidence scores
  • Manually override if needed

Cost Optimization

Choose the Right Model

  • gpt-4o-mini: ~$0.005-0.02 per document (good for simple typed timesheets)
  • gpt-4o: ~$0.01-0.05 per document (better for handwritten or complex documents)

Pick the model based on what you expect the document to be (the extracted document_type is only available after processing):

# Use mini when you expect a typed document
if expected_document_type == "typed":
    model = "gpt-4o-mini"
else:
    model = "gpt-4o"

Cache Results

Results are cached for 24 hours. Avoid re-processing the same document:

import hashlib

def get_document_hash(file_path: str) -> str:
    """Generate a hash of the file contents for cache lookups."""
    with open(file_path, 'rb') as f:
        return hashlib.sha256(f.read()).hexdigest()

# Check a local cache before processing
processed_cache: dict = {}

doc_hash = get_document_hash("timesheet.pdf")
if doc_hash in processed_cache:
    result = processed_cache[doc_hash]
else:
    result = extractor.extract("timesheet.pdf")
    processed_cache[doc_hash] = result