Map-Reduce Extraction

Process long documents (50+ pages) by breaking them into manageable chunks.

Map-Reduce is a powerful processing mode designed for long, repetitive, or multi-section documents. Instead of trying to fit the entire document into a single AI context window, DocuDevs splits the document into overlapping windows ("chunks"), extracts data from each chunk, and then aggregates the results.

How It Works

  1. Map: The document is split into chunks (e.g., 5 pages each).
  2. Process: Each chunk is processed independently to extract data.
  3. Reduce: The results are combined into a single, clean JSON output.

You always receive a consistent JSON payload:

{
  "header": { ... optional ... },
  "records": [ ... rows ... ]
}
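
The three steps above can be sketched in Python. This is a conceptual illustration only; `make_chunks` and `reduce_records` are hypothetical helpers, not part of the DocuDevs API:

```python
def make_chunks(num_pages, pages_per_chunk, overlap_pages):
    """Yield (start, end) page windows that overlap by `overlap_pages` pages."""
    step = pages_per_chunk - overlap_pages
    if step <= 0:
        raise ValueError("overlap_pages must be smaller than pages_per_chunk")
    start = 0
    while start < num_pages:
        yield (start, min(start + pages_per_chunk, num_pages))
        start += step

def reduce_records(chunk_results, dedup_key):
    """Merge per-chunk row lists, keeping the first row seen for each key."""
    seen, merged = set(), []
    for rows in chunk_results:
        for row in rows:
            key = row.get(dedup_key)
            if key in seen:
                continue  # duplicate row caused by overlapping pages
            seen.add(key)
            merged.append(row)
    return merged
```

With `pages_per_chunk=4` and `overlap_pages=1`, a 10-page document yields windows (0, 4), (3, 7), (6, 10), (9, 10); any row that appears in two overlapping windows is kept only once.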

When to Use Map-Reduce

  • Long Documents: Annual reports, 100+ page contracts, large catalogs.
  • Repetitive Data: Bank statements, medical records, logs.
  • Header + Rows: Documents with a summary section (header) and pages of detailed rows.

Configuration Options

  • splitType (default PAGE): Chunking mode: PAGE or MARKDOWN_HEADER.
  • splitHeaderLevel (default 2): Header level used when splitType=MARKDOWN_HEADER (1 for #, 2 for ##).
  • pagesPerChunk (default 1): Number of pages per extraction window.
  • overlapPages (default 0): Number of pages to overlap between chunks to catch data spanning pages.
  • dedupKey (default null): JSON path to a unique field (e.g., sku, date) used to remove duplicates caused by overlap.
  • parallel_processing (default false): When true, chunks are processed in parallel by multiple workers for faster throughput.
  • header_options (default null): Configuration for extracting document-level metadata (header).

When splitType=MARKDOWN_HEADER, chunking is based on OCR markdown sections instead of pages. In this mode, overlapPages and dedupKey do not apply.
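
Conceptually, markdown-header chunking amounts to splitting the OCR markdown at headings of the configured level. The sketch below is an assumption about the idea, not the actual server-side logic:

```python
import re

def split_by_header(markdown, level=2):
    """Split markdown text into sections at ATX headings of the given level."""
    pattern = re.compile(rf"^({'#' * level} .+)$", re.MULTILINE)
    parts = pattern.split(markdown)
    # parts = [preamble, heading1, body1, heading2, body2, ...]
    sections = []
    for i in range(1, len(parts), 2):
        body = parts[i + 1] if i + 1 < len(parts) else ""
        sections.append(parts[i] + body)
    return sections
```

With `level=2`, each `##` heading starts a new chunk; text before the first matching heading is treated as preamble and not chunked.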

Example: Processing a Long Invoice

Imagine a 50-page invoice where the first page has the Invoice Number and Date (Header), and the remaining 49 pages contain Line Items (Rows).

from docudevs.docudevs_client import DocuDevsClient
import os
import asyncio

client = DocuDevsClient(token=os.getenv('API_KEY'))

async def process_long_invoice():
    with open("large-invoice.pdf", "rb") as f:
        document_data = f.read()

    # Submit Map-Reduce job
    job_id = await client.submit_and_process_document_map_reduce(
        document=document_data,
        document_mime_type="application/pdf",

        # Instructions for extracting rows (line items)
        prompt="Extract invoice line items (sku, description, quantity, unitPrice, total).",

        # Chunking strategy
        pages_per_chunk=4,
        overlap_pages=1,
        dedup_key="lineItems.sku",   # unique key to deduplicate rows
        parallel_processing=True,    # process chunks in parallel for faster results

        # Header extraction strategy
        header_options={
            "page_limit": 1,           # look for header only on page 1
            "include_in_rows": False,  # don't include page 1 in row processing
            "row_prompt_augmentation": "This is an invoice from Acme Corp."  # context for every chunk
        },
        header_schema='{"invoiceNumber":"string","issueDate":"string","billingAddress":"string"}'
    )

    print(f"Job submitted: {job_id}")

    # Wait for results
    result = await client.wait_until_ready(job_id, result_format="json")

    print("Header:", result.get("header"))
    print(f"Extracted {len(result.get('records', []))} line items.")

if __name__ == "__main__":
    asyncio.run(process_long_invoice())

Example: Split by Markdown Headers

Use this when OCR text is markdown and your document is organized by headings.

job_id = await client.submit_and_process_document_map_reduce(
    document=document_data,
    document_mime_type="application/pdf",
    prompt="Extract obligations by section.",
    split_type="markdown_header",
    split_header_level=2,  # split on ## headings
    parallel_processing=True
)

Header Capture Strategy

Often, the first few pages contain "Header" information (Summary, Dates, Parties) that applies to the whole document.

  • page_limit: How many pages at the start are considered "Header".
  • header_schema: A specific JSON schema just for the header data.
  • row_prompt_augmentation: Text injected into every subsequent chunk prompt. Useful for passing context (e.g., "This is Invoice #123") to help the AI understand isolated pages.
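
The effect of row_prompt_augmentation can be pictured as simple prompt composition. The exact server-side wording is an assumption; `build_chunk_prompt` is a hypothetical helper for illustration only:

```python
def build_chunk_prompt(row_prompt, augmentation=None):
    """Prepend document-level context to the per-chunk extraction prompt."""
    if augmentation:
        return f"{augmentation}\n\n{row_prompt}"
    return row_prompt
```

Every chunk then sees the shared context (e.g., "This is Invoice #123") even though it only contains a few isolated pages of rows.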

Best Practices

  1. Always use a dedupKey: When overlapPages > 0, the same row may appear in two chunks. The dedupKey tells DocuDevs how to identify and merge duplicates.
  2. Start small: Try pagesPerChunk=4. If the AI misses context, increase it; if processing is too slow, decrease it.
  3. Monitor progress: Map-Reduce jobs take longer than single-pass extraction. Use the status endpoint to track chunk completion:

status = await client.status(job_id)
print(f"Progress: {status.parsed.mapReduceStatus.completedChunks} / {status.parsed.mapReduceStatus.totalChunks}")

  4. Use markdown-header split for sectioned docs: If the OCR output is markdown with clear headings, set splitType=MARKDOWN_HEADER and choose splitHeaderLevel (1 for top-level sections, 2 for subsection chunks).

Re-running Map-Reduce on an Existing Job

If you have already processed or OCR'd a document, you can run map-reduce on it again without re-uploading the file or re-running OCR. This is useful when you want to:

  • Iterate on your prompt/schema without paying for OCR again.
  • Run different chunking strategies on the same document.
  • Extract different fields from a document that was already OCR'd.

Use submit_and_wait_for_map_reduce with the parent_job_id of the completed job:

import asyncio
import os

from docudevs.docudevs_client import DocuDevsClient

client = DocuDevsClient(token=os.getenv('API_KEY'))

async def reprocess_with_map_reduce():
    # Assume we already have a completed job from a previous run
    existing_job_id = "your-completed-job-guid"

    # Re-run with map-reduce: no upload, no OCR cost
    result = await client.submit_and_wait_for_map_reduce(
        parent_job_id=existing_job_id,
        prompt="Extract all line items (sku, description, qty, total)",
        schema='{"type":"array","items":{"type":"object"}}',
        pages_per_chunk=5,
        overlap_pages=1,
        dedup_key="sku",
        parallel_processing=True,
        timeout=300
    )

    print("Header:", result.get("header"))
    print(f"Extracted {len(result.get('records', []))} line items.")

asyncio.run(reprocess_with_map_reduce())

Tip: Since OCR is the most expensive part of processing, re-running extraction with different prompts or schemas on an already-processed document is very cost-effective.

Troubleshooting

  • Missing Rows: Check if rows are split across pages. Increase overlapPages.
  • Duplicate Rows: Ensure your dedupKey is truly unique for each row.
  • Slow Processing: Reduce pagesPerChunk to speed up individual chunk processing.
  • Wrong Markdown Sections: For markdown-header chunking, switch between splitHeaderLevel=1 and splitHeaderLevel=2 depending on your document structure.