How to analyze text statistically
Goal
Get quantitative facts about a text — length, readability, frequency, repetition — without spending LLM tokens.
The all-in-one report
If you want everything at once:
open --raw text.md | report --lang en --top 15 --no-llm
report aggregates stats, lix, ttr, freq, ngrams (bi and tri), hapax, repeats and extract into one structured record. --no-llm skips the LLM-backed extras (sentiment, keywords, detect, readability) so it stays deterministic.
To get YAML or JSON output, pipe the result through to yaml or to json:
open --raw text.md | report --no-llm | to yaml
Individual commands
Basic counts
open --raw text.md | stats
Returns lines, words, bytes, chars, graphemes, sentences, paragraphs and average word length. stats --verbose adds top word frequencies and word length extremes.
Readability
open --raw text.md | lix
Returns the Lix score plus a Scandinavian-standard interpretation band:
lix: 41.2
sentences: 14
words: 312
long_words: 59
avg_sentence_length: 22.29
long_word_pct: 18.91
interpretation: middel (dagblade)
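Lix is calibrated for Scandinavian languages but works as a rough indicator in English too. The interpretation band in the sample above is in Danish: middel (dagblade) means medium difficulty, typical of daily newspapers.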
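For reference, the standard Lix formula is the average sentence length plus the percentage of long words (words of more than six letters). The Python sketch below illustrates that formula only; the function name, the sentence-splitting rule and the tokenizer are assumptions for illustration, not the lix command's actual implementation.

```python
import re

def lix_score(text: str) -> float:
    """Lix = words / sentences + 100 * long_words / words."""
    # Assumption: a sentence ends at '.', '!' or '?'; the tool may split differently.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    if not sentences or not words:
        return 0.0
    long_words = [w for w in words if len(w) > 6]  # more than six letters
    return len(words) / len(sentences) + 100 * len(long_words) / len(words)

sample = "The cat sat. Statistical readability measures are approximate."
print(round(lix_score(sample), 2))  # 8 words / 2 sentences + 100 * 4 / 8 = 54.0
```

Short samples like this one swing wildly; the score only becomes meaningful over a few paragraphs of text.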
Word frequency
open --raw text.md | freq --lang en --top 20
Returns a table of value/count rows after stopword filtering. Built-in stopword sets: en and da. Pass --no-stop to skip filtering or --stop [list] for a custom set.
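Conceptually, this is a word count with stopwords removed. A minimal Python sketch of the idea follows; the tiny STOPWORDS set and the top_words helper are hypothetical stand-ins, and the built-in en set is presumably much larger.

```python
import re
from collections import Counter

# Hypothetical mini stopword set for illustration only.
STOPWORDS = {"the", "a", "and", "of", "is", "on"}

def top_words(text, top=20, stopwords=STOPWORDS):
    """Count lowercase words, drop stopwords, return the `top` most common."""
    words = re.findall(r"\w+", text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    return counts.most_common(top)

print(top_words("The cat and the dog. The cat is on the mat.", top=1))
# [('cat', 2)]
```

Passing an empty set as stopwords corresponds to counting without any filtering.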
N-grams
open --raw text.md | ngrams --n 2 --top 20 # bigrams
open --raw text.md | ngrams --n 3 --top 10 # trigrams
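An n-gram is a run of n consecutive words, counted with a sliding window over the text. The Python sketch below shows that windowing under a simple lowercase word tokenizer, which is an assumption; the ngrams command may tokenize or filter differently.

```python
import re
from collections import Counter

def ngram_counts(text, n=2, top=20):
    """Return the `top` most frequent n-word sequences."""
    words = re.findall(r"\w+", text.lower())
    # zip over n staggered copies of the word list yields every window of n words
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams).most_common(top)

print(ngram_counts("to be or not to be", n=2, top=1))
# [('to be', 2)]
```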
Repeated phrases (editor’s tool)
Find accidentally duplicated multi-word phrases:
open --raw text.md | repeats --min-length 4 --min-count 2
Phrases of 4+ words that appear 2+ times. Useful for catching unconscious repetition in long drafts.
Lexical richness
open --raw text.md | ttr
Type-token ratio: unique_words / total_words. Higher = more vocabulary variation. Above 0.5 is usually solid for non-fiction; fiction often runs higher. The ratio falls as a text gets longer, so only compare texts of similar length.
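The ratio can be sketched in a few lines of Python. The tokenizer here (lowercased word characters) is an assumption, so the ttr command may report slightly different numbers for the same text.

```python
import re

def type_token_ratio(text):
    """Unique words divided by total words; 0.0 for empty input."""
    words = re.findall(r"\w+", text.lower())
    return len(set(words)) / len(words) if words else 0.0

# 4 unique words (the, cat, saw, other) out of 6 tokens
print(round(type_token_ratio("the cat saw the other cat"), 3))  # 0.667
```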
Hapax legomena
open --raw text.md | hapax
Words that appear exactly once. A high count relative to the text's length signals lexical variety; a low count signals repetition. Classic stylometric measure.
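As an illustration of the definition, the Python sketch below collects every word with a count of one. The helper name and tokenizer are assumptions, not the hapax command's internals.

```python
import re
from collections import Counter

def hapax_legomena(text):
    """Return, in sorted order, the words that occur exactly once."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    return sorted(w for w, c in counts.items() if c == 1)

print(hapax_legomena("the cat saw the other cat"))  # ['other', 'saw']
```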
Concordance (keyword in context)
open --raw text.md | kwic espresso --window 5
For each occurrence of “espresso”, show 5 words before and 5 after. Pure corpus-linguistic tool — see how a term is actually used.
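A concordance of this kind amounts to slicing a fixed window of words around each hit. The Python sketch below assumes whitespace tokenization and case-insensitive matching with surrounding punctuation stripped; how the kwic command actually matches is not specified here.

```python
def kwic(text, keyword, window=5):
    """Return (left context, hit, right context) for each keyword occurrence."""
    words = text.split()
    hits = []
    for i, w in enumerate(words):
        # Compare without surrounding punctuation, ignoring case
        if w.strip(".,;:!?\"'()").lower() == keyword.lower():
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            hits.append((left, w, right))
    return hits

text = "I ordered an espresso. The espresso was strong."
for left, hit, right in kwic(text, "espresso", window=2):
    print(f"{left:>20} [{hit}] {right}")
```

Right-aligning the left context lines the keyword up in one column, which is the usual way to read a concordance.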
Compare two texts
open --raw mine.md | compare reference.md --top 15
Returns two lists: words distinctive to your text, and words distinctive to the reference. Useful for stylistic positioning.
Similarity
open --raw chapter1.md | similar chapter2.md --k 4
Jaccard similarity over 4-word shingles. Score in [0, 1]; higher = more similar. Useful for near-duplicate detection or style consistency checks.
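Jaccard similarity is the size of the intersection of the two shingle sets divided by the size of their union. The Python sketch below illustrates this for k-word shingles; the tokenizer and the choice to return 0.0 when neither text has any shingles are assumptions.

```python
import re

def shingles(text, k=4):
    """Set of all runs of k consecutive words."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b, k=4):
    """|A ∩ B| / |A ∪ B| over k-word shingles."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 0.0  # assumption: texts shorter than k words score 0
    return len(sa & sb) / len(sa | sb)

# Each text has 2 shingles; 1 is shared, 3 are distinct overall
print(round(jaccard("the quick brown fox jumps", "the quick brown fox sleeps"), 3))  # 0.333
```

A larger k makes the measure stricter: a single changed word breaks up to k shingles.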
Extract structured data
open --raw text.md | extract --kind url # all URLs
open --raw text.md | extract --kind email # all emails
open --raw text.md | extract --kind hashtag # all #tags
open --raw text.md | extract --kind mention # all @handles
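Extraction of this sort is typically pattern matching. The Python sketch below uses deliberately simple regular expressions that are assumptions for illustration; the patterns the extract command actually applies are not documented on this page and are likely stricter.

```python
import re

# Hypothetical, simplified patterns; real-world URL and email grammars are more involved.
PATTERNS = {
    "url": r"https?://[^\s<>\"')]+",
    "email": r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",
    "hashtag": r"(?<!\w)#\w+",
    # The lookbehind keeps the '@' inside an email address from counting as a mention
    "mention": r"(?<![\w.+-])@\w+",
}

def extract(text, kind):
    """Return every match of the chosen kind, in order of appearance."""
    return re.findall(PATTERNS[kind], text)

sample = "Mail ann@example.org or ping @ann about #lix at https://example.org/docs"
for kind in PATTERNS:
    print(kind, extract(sample, kind))
```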
Per-paragraph analysis
The sentences and paragraphs commands are primitives that split text into lists. Compose them with any of the above:
open --raw text.md | paragraphs | each {|p| $p | lix}
Returns a Lix score per paragraph. Spot the dense ones.
open --raw text.md | sentences | length
Total sentence count (faster than reading stats.sentences if that’s all you want).
Cost
Everything on this page runs locally: no network, no tokens. For typical documents, the slowest step is reading the file. For very large corpora (book-length), word-frequency operations take seconds; everything else is sub-second.
If you want LLM-backed analyses (sentiment, named entities, qualitative readability assessment), see reference/analyze. For verification against external sources (fact-checking, quote verification, claim extraction), see reference/validate.
Related
- reference/analyze — full flag reference
- Explanation: deterministic vs. LLM — when to reach for each