Structured Data Extraction
Turn any document into clean, structured JSON data.
DocuDevs allows you to extract specific fields from documents using natural language prompts or strict JSON schemas. This is the core feature of the platform, enabling you to process invoices, contracts, forms, and more with high precision.
How It Works
- Upload a document (PDF, image, Word, etc.).
- Provide instructions (a prompt) or a schema.
- Receive the extracted data as JSON.
Basic Extraction (Prompt-based)
The simplest way to extract data is to describe what you want in plain English.
Example: Extracting data from an invoice.
cURL:
curl -X POST https://api.docudevs.ai/document/upload-files \
  -H "Authorization: Bearer $API_KEY" \
  -F "document=@invoice.pdf" \
  -F "instructions=Extract the invoice number, date, total amount, and vendor name."

Python SDK:
import asyncio
import os

from docudevs.docudevs_client import DocuDevsClient

async def main():
    client = DocuDevsClient(token=os.getenv("API_KEY"))
    # Read the file and submit it together with a plain-English prompt
    with open("invoice.pdf", "rb") as f:
        guid = await client.submit_and_process_document(
            document=f.read(),
            document_mime_type="application/pdf",
            prompt="Extract the invoice number, date, total amount, and vendor name.",
        )
    # Wait for processing to finish and fetch the result as JSON
    result = await client.wait_until_ready(guid, result_format="json")
    print(result)

# The SDK methods are coroutines, so they need a running event loop
asyncio.run(main())

CLI:
docudevs process invoice.pdf --prompt "Extract the invoice number, date, total amount, and vendor name."
Response:
{
  "invoice_number": "INV-2024-001",
  "date": "2024-01-15",
  "total_amount": "1500.00 USD",
  "vendor_name": "Acme Corp"
}
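Note that the prompt-based response above returns `total_amount` as a string with the currency attached. If you need a numeric value, one option is to split it client-side. The sketch below is only an illustration: the helper name `parse_amount` is hypothetical, and the "amount, space, currency code" layout is an assumption based on this single sample response, since prompt-based output can vary between documents.

```python
import json
from decimal import Decimal

# Sample response copied from the example above.
raw = """{
  "invoice_number": "INV-2024-001",
  "date": "2024-01-15",
  "total_amount": "1500.00 USD",
  "vendor_name": "Acme Corp"
}"""

def parse_amount(text):
    """Split a string like '1500.00 USD' into (Decimal amount, currency code).

    Hypothetical helper: assumes the amount comes first, followed by an
    optional space and currency code.
    """
    amount, _, currency = text.strip().partition(" ")
    return Decimal(amount), currency or None

data = json.loads(raw)
amount, currency = parse_amount(data["total_amount"])
print(amount, currency)
```

Using `Decimal` rather than `float` avoids rounding surprises when amounts are later summed or compared.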
Structured Extraction (Schema-based)
For production use cases, you should use a JSON Schema. This ensures the API returns data in the exact format you need, with correct data types (numbers, dates, arrays).
Why use a Schema?
- Consistency: Always get the same JSON structure.
- Type Safety: Numbers are returned as JSON numbers, and dates as strings in a standard format such as 2024-01-15.
- Validation: The schema tells the model exactly which fields are required.
Example: Strict Invoice Schema
Let's define a schema for an invoice with line items.
{
  "type": "object",
  "properties": {
    "invoice_number": {"type": "string"},
    "date": {"type": "string", "format": "date"},
    "total": {"type": "number"},
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": {"type": "string"},
          "amount": {"type": "number"}
        }
      }
    }
  },
  "required": ["invoice_number", "total"]
}
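To make the role of `required` and `type` concrete, here is a minimal sketch that checks a response against the top-level part of the schema above using only the standard library. The helper `check_top_level` and both sample responses are hypothetical. This is not a full JSON Schema validator (it ignores `format`, nested `items`, and everything else), so a dedicated validation library would be the usual choice in production.

```python
# Top-level part of the invoice schema above (line_items omitted for brevity).
schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "date": {"type": "string", "format": "date"},
        "total": {"type": "number"},
    },
    "required": ["invoice_number", "total"],
}

# JSON Schema type names mapped to the Python types that json.loads produces.
PY_TYPES = {
    "string": str,
    "number": (int, float),
    "array": list,
    "object": dict,
}

def check_top_level(data, schema):
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    for name in schema.get("required", []):
        if name not in data:
            problems.append(f"missing required field: {name}")
    for name, spec in schema.get("properties", {}).items():
        if name not in data:
            continue
        expected = PY_TYPES.get(spec.get("type"))
        value = data[name]
        # bool is a subclass of int in Python, so reject it explicitly for numbers.
        bool_as_number = spec.get("type") == "number" and isinstance(value, bool)
        if expected and (not isinstance(value, expected) or bool_as_number):
            problems.append(
                f"{name}: expected {spec['type']}, got {type(value).__name__}"
            )
    return problems

# Hypothetical responses: one that matches the schema, one that does not.
good = {"invoice_number": "INV-2024-001", "date": "2024-01-15", "total": 1500.0}
bad = {"date": "2024-01-15", "total": "1500.00 USD"}

print(check_top_level(good, schema))
print(check_top_level(bad, schema))
```

The second response fails on both counts: `invoice_number` is absent, and `total` is a string rather than a number.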
cURL:
# Save the schema above to schema.json first
curl -X POST https://api.docudevs.ai/document/upload-files/sync \
  -H "Authorization: Bearer $API_KEY" \
  -F "document=@invoice.pdf" \
  -F "schema=@schema.json"

Python SDK:
import json

# Assumes `client` is the DocuDevsClient from the first example and
# `doc_bytes` holds the PDF contents; run this inside an async function.
schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
    },
}

# Schema must be passed as a JSON string
guid = await client.submit_and_process_document(
    document=doc_bytes,
    document_mime_type="application/pdf",
    schema=json.dumps(schema),
)

CLI:
docudevs process invoice.pdf --schema-file schema.json
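The cURL and CLI examples above read the schema from `schema.json`. One way to produce that file is to write the schema from Python with the standard `json` module. This is a small sketch: the file name matches the commands above, and the schema is the invoice schema defined in this section.

```python
import json

# The invoice schema from this section, as a Python dictionary.
schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "date": {"type": "string", "format": "date"},
        "total": {"type": "number"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"},
                },
            },
        },
    },
    "required": ["invoice_number", "total"],
}

# Write the schema to schema.json in the current directory.
with open("schema.json", "w", encoding="utf-8") as f:
    json.dump(schema, f, indent=2)
```

Keeping the schema in one Python dictionary and generating the file from it means the SDK and the command-line examples share a single definition.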
Type-Safe Extraction with Pydantic (Python)
The Python SDK integrates seamlessly with Pydantic, allowing you to define your schema using Python classes. This gives you full type safety and IDE autocompletion.
Python SDK:
from pydantic import BaseModel, Field
from typing import List
import json

# 1. Define your data model
class LineItem(BaseModel):
    description: str
    quantity: int
    price: float

class Invoice(BaseModel):
    number: str = Field(description="The invoice number")
    date: str
    items: List[LineItem]
    total: float

# 2. Generate schema from model
schema = json.dumps(Invoice.model_json_schema())

# 3. Process document
# Assumes `client` is the DocuDevsClient from the first example and
# `doc_bytes` holds the PDF contents; run this inside an async function.
guid = await client.submit_and_process_document(
    document=doc_bytes,
    document_mime_type="application/pdf",
    schema=schema,
)

# 4. Validate result back into Pydantic model
result_dict = await client.wait_until_ready(guid, result_format="json")
invoice = Invoice.model_validate(result_dict)
print(f"Invoice {invoice.number} total: {invoice.total}")
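Step 4 raises an error if the extracted data does not match the model. The sketch below shows how that failure can be caught and inspected with Pydantic v2's `ValidationError`. The `incomplete_result` dictionary is made-up sample data standing in for an API response, not real DocuDevs output.

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError

class LineItem(BaseModel):
    description: str
    quantity: int
    price: float

class Invoice(BaseModel):
    number: str = Field(description="The invoice number")
    date: str
    items: List[LineItem]
    total: float

# Hypothetical result with the required "total" field missing.
incomplete_result = {
    "number": "INV-2024-001",
    "date": "2024-01-15",
    "items": [{"description": "Consulting", "quantity": 2, "price": 750.0}],
}

problems = []
try:
    invoice = Invoice.model_validate(incomplete_result)
except ValidationError as exc:
    # Each error records where the problem is ("loc") and what kind it is ("type").
    problems = [(err["loc"], err["type"]) for err in exc.errors()]

print(problems)
```

Catching the error at this point lets you log or retry a document whose extraction came back incomplete, rather than failing later when a missing field is first accessed.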
Advanced Configuration
You can combine extraction with other features, such as barcode/QR code scanning or specific OCR settings.
Extracting QR Codes
cURL:
# config.json contains: {"barcodes": true}
curl -X POST https://api.docudevs.ai/document/upload-files \
  -H "Authorization: Bearer $API_KEY" \
  -F "document=@invoice.pdf" \
  -F "metadata=@config.json" \
  -F "instructions=Extract QR codes"

Python SDK:
# Assumes `client` is the DocuDevsClient from the first example and
# `doc_bytes` holds the PDF contents; run this inside an async function.
guid = await client.submit_and_process_document(
    document=doc_bytes,
    document_mime_type="application/pdf",
    prompt="Extract QR codes",
    barcodes=True,  # Enable barcode scanning
)

CLI:
docudevs process invoice.pdf --barcodes --prompt "Extract QR codes"
Next Steps
- Map-Reduce Extraction: Handle very large documents (50+ pages).
- Batch Processing: Process thousands of documents efficiently.
- Named Configurations: Save your schemas and prompts for reuse.