# Timesheet Data Extraction Guide

Extract structured timesheet data from PDF documents (typed or handwritten) using the Document Extraction API. This guide shows you how to build a complete timesheet extraction solution.

## Overview

The Document Extraction API can automatically extract timesheet data, including:

- Employee information (ID, name, company)
- Work shift details (start/end times, durations)
- Monthly summaries (total hours, days worked)
- Document type classification (handwritten vs. typed)
- Confidence scores for quality control

## Quick Start

### 1. Define Your Schema

Create a JSON schema that matches your timesheet structure:

```json
{
  "type": "object",
  "properties": {
    "confidence": {
      "type": "object",
      "properties": {
        "image_analysis": {"type": "number", "minimum": 0, "maximum": 100},
        "data_extraction": {"type": "number", "minimum": 0, "maximum": 100},
        "validation": {"type": "number", "minimum": 0, "maximum": 100},
        "overall": {"type": "number", "minimum": 0, "maximum": 100}
      },
      "required": ["image_analysis", "data_extraction", "validation", "overall"]
    },
    "employee_id": {"type": "string"},
    "employee_name": {"type": "string"},
    "company_name": {"type": "string"},
    "summary": {
      "type": "object",
      "properties": {
        "month": {"type": "string"},
        "year": {"type": "integer"},
        "total_days_worked": {"type": "integer"},
        "total_time_minutes": {"type": "integer"},
        "total_time_hours": {"type": "number"},
        "total_time_formatted": {"type": "string"}
      }
    },
    "log_entries": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "start_date_time": {"type": "string", "format": "date-time"},
          "end_date_time": {"type": "string", "format": "date-time"},
          "total_time": {"type": "integer"}
        }
      }
    },
    "document_type": {"type": "string", "enum": ["handwritten", "typed"]}
  }
}
```

### 2. Create a Custom Prompt

Guide the AI with specific extraction instructions:

```text
Extract all timesheet data.
For confidence scores: estimate image_analysis based on text clarity (0-100),
data_extraction based on how complete the data is (0-100), validation as 100 if
all required fields are present, and overall as the average of all three.
Classify document_type as "handwritten" if the document appears to be
handwritten, otherwise "typed". For log_entries, extract all work shifts with
start and end times in ISO-8601 format with timezone (+02:00 for Central
European Time). Calculate total_time in minutes for each entry. In the summary,
calculate totals across all entries.
```

### 3. Extract Data

```python
import json

import requests

url = "https://api.aitronos.com/v1/documents/extract"
headers = {"Authorization": "Bearer YOUR_API_TOKEN"}

schema = {
    # ... your schema from step 1
}

data = {
    "organization_id": "org_abc123",
    "schema": json.dumps(schema),
    "sync": "true",
    "model": "gpt-4o",
    "prompt": "Extract all timesheet data...",  # Your custom prompt
}

# Use a context manager so the file handle is closed after the upload
with open("timesheet.pdf", "rb") as f:
    response = requests.post(url, headers=headers, files={"file": f}, data=data)
result = response.json()

if result["success"] and result["status"] == "completed":
    print(f"Employee: {result['extracted_data']['employee_name']}")
    print(f"Total Hours: {result['extracted_data']['summary']['total_time_formatted']}")
    print(f"Confidence: {result['confidence']}%")
else:
    print(f"Error: {result.get('error_message')}")
```

## Complete Example

### Python Implementation

```python
import json
from typing import Dict, Any

import requests


class TimesheetExtractor:
    def __init__(self, api_token: str, organization_id: str):
        self.api_token = api_token
        self.organization_id = organization_id
        self.api_url = "https://api.aitronos.com/v1/documents/extract"

    def get_schema(self) -> Dict[str, Any]:
        """Return the timesheet extraction schema."""
        return {
            "type": "object",
            "properties": {
                "confidence": {
                    "type": "object",
                    "properties": {
                        "image_analysis": {"type": "number", "minimum": 0, "maximum": 100},
                        "data_extraction":
                            {"type": "number", "minimum": 0, "maximum": 100},
                        "validation": {"type": "number", "minimum": 0, "maximum": 100},
                        "overall": {"type": "number", "minimum": 0, "maximum": 100}
                    },
                    "required": ["image_analysis", "data_extraction", "validation", "overall"]
                },
                "employee_id": {"type": "string"},
                "employee_name": {"type": "string"},
                "company_name": {"type": "string"},
                "summary": {
                    "type": "object",
                    "properties": {
                        "month": {"type": "string"},
                        "year": {"type": "integer"},
                        "total_days_worked": {"type": "integer"},
                        "total_time_minutes": {"type": "integer"},
                        "total_time_hours": {"type": "number"},
                        "total_time_formatted": {"type": "string"}
                    },
                    "required": ["month", "year", "total_days_worked",
                                 "total_time_minutes", "total_time_hours",
                                 "total_time_formatted"]
                },
                "log_entries": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "start_date_time": {"type": "string"},
                            "end_date_time": {"type": "string"},
                            "total_time": {"type": "integer"}
                        },
                        "required": ["start_date_time", "end_date_time", "total_time"]
                    }
                },
                "document_type": {"type": "string", "enum": ["handwritten", "typed"]}
            },
            "required": ["confidence", "employee_id", "employee_name",
                         "company_name", "summary", "log_entries", "document_type"]
        }

    def get_prompt(self) -> str:
        """Return the extraction prompt."""
        return """Extract all timesheet data.
For confidence scores: estimate image_analysis based on text clarity (0-100),
data_extraction based on how complete the data is (0-100), validation as 100 if
all required fields are present, and overall as the average of all three.
Classify document_type as "handwritten" if the document appears to be
handwritten, otherwise "typed". For log_entries, extract all work shifts with
start and end times in ISO-8601 format with timezone (+02:00 for Central
European Time). Calculate total_time in minutes for each entry.
In the summary, calculate totals across all entries."""

    def extract(self, file_path: str, sync: bool = True) -> Dict[str, Any]:
        """Extract timesheet data from a PDF file."""
        headers = {"Authorization": f"Bearer {self.api_token}"}
        data = {
            "organization_id": self.organization_id,
            "schema": json.dumps(self.get_schema()),
            "prompt": self.get_prompt(),
            "sync": str(sync).lower(),
            "model": "gpt-4o"
        }
        # Close the file handle once the upload completes
        with open(file_path, "rb") as f:
            response = requests.post(self.api_url, headers=headers,
                                     files={"file": f}, data=data)
        return response.json()

    def validate_extraction(self, result: Dict[str, Any]) -> bool:
        """Validate extraction quality."""
        if not result.get('success'):
            return False
        if result.get('status') != 'completed':
            return False

        # Check confidence threshold
        confidence = result.get('confidence', 0)
        if confidence < 70:
            print(f"Warning: Low confidence ({confidence}%) - manual review recommended")
            return False

        # Check required fields
        data = result.get('extracted_data', {})
        required_fields = ['employee_id', 'employee_name', 'summary', 'log_entries']
        for field in required_fields:
            if not data.get(field):
                print(f"Warning: Missing required field: {field}")
                return False

        return True


# Usage
extractor = TimesheetExtractor(
    api_token="YOUR_API_TOKEN",
    organization_id="org_abc123"
)

result = extractor.extract("timesheet.pdf")

if extractor.validate_extraction(result):
    data = result['extracted_data']
    print("✅ Extraction successful!")
    print(f"Employee: {data['employee_name']} ({data['employee_id']})")
    print(f"Company: {data['company_name']}")
    print(f"Period: {data['summary']['month']} {data['summary']['year']}")
    print(f"Total Hours: {data['summary']['total_time_formatted']}")
    print(f"Days Worked: {data['summary']['total_days_worked']}")
    print(f"Document Type: {data['document_type']}")
    print(f"Confidence: {result['confidence']}%")
    print(f"Cost: CHF {result['cost_chf']:.4f}")
else:
    print("❌ Extraction failed or quality too low")
```

## Response Structure

### Successful Extraction

```json
{
  "success": true,
  "job_id": "doc_abc123",
  "status": "completed",
  "extracted_data": {
    "confidence": {
      "image_analysis": 95,
      "data_extraction": 90,
      "validation": 100,
      "overall": 95
    },
    "employee_id": "12326",
    "employee_name": "Frei (T) Celine",
    "company_name": "Spitex Region Zofingen AG",
    "summary": {
      "month": "August",
      "year": 2025,
      "total_days_worked": 5,
      "total_time_minutes": 1973,
      "total_time_hours": 32.88,
      "total_time_formatted": "32:53"
    },
    "log_entries": [
      {
        "start_date_time": "2025-08-07T16:50:00+02:00",
        "end_date_time": "2025-08-07T22:17:00+02:00",
        "total_time": 327
      },
      {
        "start_date_time": "2025-08-08T12:00:00+02:00",
        "end_date_time": "2025-08-08T12:15:00+02:00",
        "total_time": 15
      }
    ],
    "document_type": "typed"
  },
  "explanation": "Extracted timesheet for employee 12326 (Celine Frei) from Spitex Region Zofingen AG for August 2025. Found 5 work days with a total of 32.88 hours (1,973 minutes).",
  "confidence": 95.0,
  "processing_time": 8.78,
  "cost_chf": 0.015,
  "model_used": "gpt-4o",
  "created_at": "2025-12-19T14:27:07Z",
  "completed_at": "2025-12-19T14:27:20Z"
}
```

## Best Practices

### 1. Use Confidence Scores

Always check confidence scores before processing data:

```python
if result['confidence'] < 70:
    # Flag for manual review
    send_to_manual_review(result)
elif result['confidence'] < 85:
    # Automated processing with validation
    process_with_validation(result)
else:
    # High confidence - automated processing
    process_automatically(result)
```

### 2. Handle Missing Data

Not all fields may be present in every document:

```python
data = result['extracted_data']

# Safe field access
employee_id = data.get('employee_id', 'UNKNOWN')
company = data.get('company_name', 'Not specified')

# Check for empty log entries
if not data.get('log_entries'):
    print("Warning: No work shifts found")
```

### 3. Validate Calculations

Verify that totals match individual entries:

```python
def validate_totals(data: Dict[str, Any]) -> bool:
    """Verify summary totals match log entries."""
    log_entries = data.get('log_entries', [])
    summary = data.get('summary', {})

    # Calculate actual total from entries
    actual_minutes = sum(entry['total_time'] for entry in log_entries)
    reported_minutes = summary.get('total_time_minutes', 0)

    # Allow small rounding differences
    if abs(actual_minutes - reported_minutes) > 5:
        print(f"Warning: Total mismatch - {actual_minutes} vs {reported_minutes}")
        return False

    return True
```

### 4. Batch Processing

Process multiple timesheets efficiently:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def process_timesheet_batch(file_paths: list[str]) -> list[Dict[str, Any]]:
    """Process multiple timesheets in parallel."""
    extractor = TimesheetExtractor(
        api_token=os.environ["FREDDY_API_KEY"],
        organization_id="org_abc123"
    )

    with ThreadPoolExecutor(max_workers=5) as executor:
        results = list(executor.map(extractor.extract, file_paths))

    return results

# Process all PDFs in a directory
pdf_files = [f for f in os.listdir("timesheets/") if f.endswith('.pdf')]
results = process_timesheet_batch([f"timesheets/{f}" for f in pdf_files])

# Filter successful extractions
successful = [r for r in results
              if r.get('success') and r.get('confidence', 0) >= 70]
print(f"Processed {len(successful)}/{len(results)} timesheets successfully")
```

## Troubleshooting

### Low Confidence Scores

**Problem**: Extraction confidence below 70%

**Solutions**:
- Ensure the document is high quality (300+ DPI for scans)
- Check that the document is not rotated or skewed
- Verify the document contains the expected data
- Try `gpt-4o` instead of `gpt-4o-mini` for better accuracy

### Missing Log Entries

**Problem**: Some work shifts are not extracted

**Solutions**:
- Add to the custom prompt: "Extract ALL work shifts, including partial days"
- Check the `explanation` field for details
- Verify the document format matches the expected structure
- Ensure shifts are clearly visible in the document

### Incorrect Time Calculations

**Problem**: Total hours don't match individual entries

**Solutions**:
- Specify the timezone in the custom prompt
- Verify the time format in the source document (AM/PM vs. 24-hour)
- Check for overlapping shifts
- Validate date boundaries (shifts crossing midnight)

### Wrong Document Type Classification

**Problem**: A typed document is classified as handwritten

**Solutions**:
- Add to the prompt: "Classify as handwritten only if text is clearly handwritten"
- Check OCR confidence scores
- Manually override if needed

## Cost Optimization

### Choose the Right Model

- **gpt-4o-mini**: ~$0.005-0.02 per document (good for simple typed timesheets)
- **gpt-4o**: ~$0.01-0.05 per document (better for handwritten or complex documents)

```python
# Use mini for typed documents
if document_type == "typed":
    model = "gpt-4o-mini"
else:
    model = "gpt-4o"
```

### Cache Results

Results are cached for 24 hours. Avoid re-processing the same document:

```python
import hashlib

processed_cache: dict = {}  # document hash -> previous extraction result

def get_document_hash(file_path: str) -> str:
    """Generate hash for document caching."""
    with open(file_path, 'rb') as f:
        return hashlib.sha256(f.read()).hexdigest()

# Check the cache before processing
doc_hash = get_document_hash("timesheet.pdf")
if doc_hash in processed_cache:
    result = processed_cache[doc_hash]
else:
    result = extractor.extract("timesheet.pdf")
    processed_cache[doc_hash] = result
```

## Related Resources

- [Document Extraction API Reference](/docs/api-reference/documents/extract)
- [Batch Document Extraction](/docs/api-reference/documents/extract-batch)
- [Get Job Status](/docs/api-reference/documents/get-job-status)
- [Document Extraction Overview](/docs/api-reference/documents/introduction)
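
## Appendix: Cross-Checking Entry Durations

The "Incorrect Time Calculations" troubleshooting tips can be partly automated on the client side: because `log_entries` carry timezone-aware ISO-8601 timestamps, each entry's `total_time` can be recomputed locally and compared. The sketch below uses only the response fields shown earlier; the helper names and the one-minute tolerance are illustrative choices, not part of the API:

```python
from datetime import datetime
from typing import Any, Dict, List


def entry_minutes(entry: Dict[str, Any]) -> int:
    """Duration in minutes implied by an entry's ISO-8601 timestamps.

    datetime.fromisoformat understands "+02:00" offsets, and subtracting
    timezone-aware datetimes handles shifts that cross midnight correctly.
    """
    start = datetime.fromisoformat(entry["start_date_time"])
    end = datetime.fromisoformat(entry["end_date_time"])
    return int((end - start).total_seconds() // 60)


def mismatched_entries(log_entries: List[Dict[str, Any]],
                       tolerance: int = 1) -> List[int]:
    """Indices of entries whose reported total_time disagrees with the
    duration implied by their own timestamps."""
    return [
        i for i, entry in enumerate(log_entries)
        if abs(entry_minutes(entry) - entry["total_time"]) > tolerance
    ]
```

Entries flagged this way are good candidates for the same manual-review queue used for low-confidence extractions.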
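
## Appendix: A Reusable Content-Hash Cache

The "Cache Results" idea can be wrapped into a small helper so callers never manage hashes directly. This is a purely client-side sketch, not an API feature: it keys results by the SHA-256 of the file bytes, so a renamed copy of the same PDF still hits the cache. The function and parameter names are illustrative:

```python
import hashlib
from pathlib import Path
from typing import Any, Callable, Dict


def cached_extract(file_path: str,
                   extract_fn: Callable[[str], Dict[str, Any]],
                   cache: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Call extract_fn only when this exact file content is new.

    cache maps content hashes to previously returned results; pass the
    same dict (or any persistent mapping) across calls to get reuse.
    """
    doc_hash = hashlib.sha256(Path(file_path).read_bytes()).hexdigest()
    if doc_hash not in cache:
        cache[doc_hash] = extract_fn(file_path)
    return cache[doc_hash]
```

With a `TimesheetExtractor` instance, `cached_extract("timesheet.pdf", extractor.extract, cache)` would skip the API call whenever an identical document has already been processed in this session.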