How to convert documents into markdown
Goal
Pull external document formats (Word, PDF, EPUB, HTML, LaTeX, …) into markdown so they can flow through the rest of the comma pipeline — analyze, polish, distill, transform, render to a different format.
Prerequisites
- pandoc in $PATH: handles html, docx, epub, odt, latex, rst, org
- pdftotext in $PATH (only for from-pdf): part of poppler; brew install poppler on macOS
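A quick way to confirm both prerequisites before running any recipe. This is a minimal sketch in plain POSIX shell, not a comma command; the helper name check_tool is hypothetical.

```shell
# check_tool is a hypothetical helper, not part of comma.
# It reports whether a program is reachable through $PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# Usage:
#   check_tool pandoc
#   check_tool pdftotext   # only needed for from-pdf
```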
Basic recipes
From a Word document
from-docx report.docx
Returns markdown on stdout. Pipe wherever you need it:
from-docx report.docx | save -f report.md
from-docx report.docx | proof | save -f report.md
from-docx report.docx | polish --brief "quarterly update" | save -f polished.md
Extract embedded images alongside the text:
from-docx report.docx --extract-media ./figures
From a PDF paper
from-pdf paper.pdf
PDF output is plain text, not true markdown — pandoc cannot read PDF directly, so we fall back to pdftotext. Headings, tables and multi-column layouts may not survive. Still useful for downstream analysis:
from-pdf paper.pdf | distill | iwe new "Research paper"
from-pdf paper.pdf | report --no-llm | to yaml
To preserve the original layout (columns, line breaks):
from-pdf paper.pdf --layout
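The source only says that from-pdf falls back to pdftotext; assuming it is a thin wrapper, the default and --layout modes would correspond roughly to the direct calls sketched here in plain POSIX shell. The name pdf_text is hypothetical; -layout and the trailing - (write to stdout) are standard pdftotext options.

```shell
# pdf_text is a hypothetical stand-in for what from-pdf may do internally.
# Usage: pdf_text paper.pdf            -> reading-order text on stdout
#        pdf_text paper.pdf --layout   -> keeps the physical page layout
pdf_text() {
  if [ "${2:-}" = "--layout" ]; then
    pdftotext -layout "$1" -   # "-" sends the text to stdout
  else
    pdftotext "$1" -
  fi
}
```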
From an EPUB book
from-epub book.epub | save -f book.md
from-epub book.epub | sentences | length # count sentences in the whole book
from-epub book.epub | report --lang en # full analysis of the corpus
From HTML
from-html page.html
open --raw page.html | from-html # also works
Note that for live URLs you usually want fetch instead: it uses Mozilla Readability to extract just the article content, whereas from-html keeps everything in the HTML body.
From other formats
| Format | Command | Input type |
|---|---|---|
| LaTeX | from-latex paper.tex or piped | text |
| reStructuredText | from-rst manual.rst or piped | text |
| Org-mode | from-org notes.org or piped | text |
| LibreOffice ODT | from-odt document.odt | binary |
All text-format commands accept either a positional file path or piped text. Binary formats require a file path.
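The path-or-pipe rule can be illustrated with pandoc itself, which reads stdin when no input file is given. This is a sketch in plain POSIX shell under the assumption that the from-* commands delegate to pandoc; to_md is a hypothetical helper, not a comma command.

```shell
# to_md is a hypothetical helper illustrating the rule with plain pandoc.
# Usage: to_md FORMAT [FILE]
to_md() {
  fmt=$1
  case "$fmt" in
    docx|epub|odt)
      # Binary formats: insist on a file path.
      if [ -z "${2:-}" ]; then
        echo "to_md: $fmt needs a file path" >&2
        return 2
      fi
      pandoc -f "$fmt" -t markdown "$2" ;;
    *)
      # Text formats: use the file if given, otherwise read stdin.
      if [ -n "${2:-}" ]; then
        pandoc -f "$fmt" -t markdown "$2"
      else
        pandoc -f "$fmt" -t markdown
      fi ;;
  esac
}

# Usage:
#   to_md latex paper.tex
#   cat manual.rst | to_md rst
```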
Round-trip with the to-* commands
The from-* and to-* commands are inverses. You can round-trip a document through markdown for any transformation that needs structural rewriting:
# Open a Word doc, polish, re-render as Word with a custom template
from-docx draft.docx \
| polish --level editorial --brief "Q3 review" \
| to-docx polished.docx --reference-doc company-style.docx
# Open an EPUB, translate every chapter, build a new EPUB
from-epub book-en.epub \
| tr da \
| to-epub book-da.epub --title "Bog på dansk" --author "Translator Name"
# Open a LaTeX paper, summarize, output a PDF brief
from-latex paper.tex \
| sum --max 10 \
| to-pdf summary.pdf --title "Paper summary"
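For orientation, the Word round trip might look like this when written against pandoc directly. This is a sketch in plain POSIX shell that assumes to-docx wraps pandoc, which the source does not state; roundtrip_docx is a hypothetical name, and cat stands in for whatever transform (polish, tr, sum) sits in the middle. --reference-doc is a real pandoc option.

```shell
# roundtrip_docx is a hypothetical helper, not part of comma.
# Usage: roundtrip_docx draft.docx polished.docx [company-style.docx]
roundtrip_docx() {
  src=$1 dest=$2 ref=${3:-}
  pandoc -f docx -t markdown "$src" |
    cat |   # placeholder: a markdown-to-markdown transform goes here
    if [ -n "$ref" ]; then
      # pandoc infers the output format from the .docx extension
      pandoc -f markdown -o "$dest" --reference-doc="$ref"
    else
      pandoc -f markdown -o "$dest"
    fi
}
```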
Working with extracted images
The binary formats (DOCX, EPUB, ODT) embed images inside the file. Pass --extract-media <dir> to from-docx, from-epub or from-odt to write them to disk alongside the markdown:
from-docx report.docx --extract-media ./media | save -f report.md
ls ./media # check what was extracted
The output markdown will contain ![](path)-style image references pointing at the extracted files.
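To check that every image reference in the saved markdown resolves to a file on disk, something like the following could be used. It is a sketch in plain POSIX shell; check_media is a hypothetical helper, it resolves relative targets against the current directory, and it only recognizes simple inline images without titles.

```shell
# check_media is a hypothetical helper: it lists every image target in a
# markdown file and flags the ones that do not exist on disk.
# Usage: check_media report.md
check_media() {
  # Pull out each ![alt](target) occurrence, then keep only the target.
  grep -o '!\[[^]]*]([^)]*)' "$1" |
    sed 's/^!\[[^]]*](\([^)]*\))$/\1/' |
    while IFS= read -r target; do
      if [ -e "$target" ]; then
        echo "found: $target"
      else
        echo "missing: $target"
      fi
    done
}
```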
Caveats
- PDF quality varies hugely. A clean LaTeX-generated PDF extracts well. A scanned image of a printed page extracts as garbage; you would need OCR (tesseract) first, which is out of comma’s scope.
- Tables in DOCX/ODT convert to markdown tables, which look ugly inline. Run from-docx | proof to normalize them, or accept that markdown tables aren’t pretty.
- EPUB fidelity depends on the source. A well-structured EPUB with semantic markup extracts beautifully; a DRM-stripped, image-heavy EPUB may produce noise.
- LaTeX with custom commands may need pandoc to be told about your packages. For complex documents, expect to clean up the result manually.
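For the scanned-PDF caveat, one rough heuristic is to count how much non-whitespace text the extraction produced, since image-only scans often carry little or no text layer. This is a sketch in plain POSIX shell; has_text is a hypothetical helper, and the default threshold of 100 characters is an arbitrary assumption.

```shell
# has_text is a hypothetical helper: it succeeds when stdin contains at
# least MIN non-whitespace characters (default 100, an arbitrary threshold).
has_text() {
  min=${1:-100}
  # Strip all whitespace (including form feeds between pages) and count bytes.
  count=$(( $(tr -d '[:space:]' | wc -c) ))
  [ "$count" -ge "$min" ]
}

# Usage, assuming pdftotext is installed:
#   pdftotext scan.pdf - | has_text || echo "probably scanned: run OCR first"
```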
Related
- reference/convert — full flag reference for every from-* and to-* command
- How to render to PDF — the inverse direction
- Tutorial 02 — research-to-publish pipeline — sees from-* commands as the natural entry point when capture is a file rather than a URL