CommonMark Document Model: Characters, Lines, Blocks, and Inlines

Step 4 • Beginner

Build the mental model of CommonMark parsing units before advanced syntax work.

CommonMark starts with text primitives (characters and lines), then determines block structure, and only then resolves inline syntax.

From characters to document structure

The CommonMark document model begins with the smallest practical units: characters and lines. This may sound obvious, but it is essential for understanding why Markdown parsing behaves the way it does. Markdown source is not first interpreted as visual layout. It is processed as a sequence of lines containing Unicode characters, spaces, tabs, punctuation, and line endings. Structure emerges from those lines according to rules.

A line can be blank, contain text, begin with indentation, start a block marker, or continue an existing container. A blank line can separate paragraphs, loosen a list, or terminate certain parsing contexts. A tab is not converted to spaces in the text itself, but where indentation defines block structure it behaves as if it advanced to the next tab stop, with stops every 4 columns. These low-level definitions prevent confusion when documents contain hard-wrapped prose, nested lists, or code blocks.
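The line-level definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a parser: the function names (split_lines, indent_width, is_blank) are hypothetical and chosen only for this example.

```python
import re

TAB_STOP = 4  # in indentation, a tab advances to the next multiple of 4 columns


def split_lines(source: str) -> list[str]:
    """Split on the line endings CommonMark recognises: CRLF, CR, or LF."""
    lines = re.split(r"\r\n|\r|\n", source)
    # A final line ending terminates the last line; it does not start a new one.
    if lines and lines[-1] == "":
        lines.pop()
    return lines


def indent_width(line: str) -> int:
    """Measure leading indentation in columns, honouring tab stops."""
    column = 0
    for ch in line:
        if ch == " ":
            column += 1
        elif ch == "\t":
            column += TAB_STOP - (column % TAB_STOP)
        else:
            break
    return column


def is_blank(line: str) -> bool:
    """A blank line is empty or contains only spaces and tabs."""
    return line.strip(" \t") == ""
```

Note that two spaces followed by a tab measure 4 columns, not 6, because the tab only advances to the next tab stop.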

CommonMark divides document elements into blocks and inlines. Blocks represent larger structural units such as paragraphs, headings, lists, block quotes, thematic breaks, and code blocks. Inlines represent content inside certain blocks, such as emphasis, links, code spans, images, entity references, and text. This separation is one of the most important mental models for Markdown.

Blocks before inlines

CommonMark parsing is best understood as block-first. The parser first identifies the block structure of the document. Only after that does it parse inline syntax inside the blocks that support inline content. This explains why a piece of punctuation may be interpreted structurally before it has any chance to act as inline formatting.

For example, a line starting with - followed by a space may begin a list item. A line consisting of three dashes may become a thematic break, or a setext heading underline if it directly follows paragraph text. A line beginning with # followed by a space may become an ATX heading. These block decisions happen before emphasis or link parsing. If you expect inline syntax to dominate, Markdown behavior can seem surprising. If you understand block precedence, the output becomes more predictable.
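To make block precedence concrete, here is a rough sketch that guesses which block a single line would open. It is a simplification under stated assumptions: it looks at one line with no context (so it cannot tell a thematic break from a setext underline, and it ignores container nesting and fenced code), and classify_line and the pattern names are hypothetical.

```python
import re

# Three or more of the same marker (*, -, or _), optionally separated by spaces or tabs.
THEMATIC_BREAK = re.compile(r"^(?:(?:\*[ \t]*){3,}|(?:-[ \t]*){3,}|(?:_[ \t]*){3,})$")
# One to six # characters followed by a space, a tab, or the end of the line.
ATX_HEADING = re.compile(r"^#{1,6}(?:[ \t]|$)")
# A bullet or ordered marker followed by a space, a tab, or the end of the line.
BULLET_ITEM = re.compile(r"^[-+*](?:[ \t]|$)")
ORDERED_ITEM = re.compile(r"^\d{1,9}[.)](?:[ \t]|$)")


def classify_line(line: str) -> str:
    """Guess the block type a line would start, ignoring all surrounding context."""
    expanded = line.expandtabs(4)  # approximation of tab stops, for classification only
    content = expanded.lstrip(" ")
    indent = len(expanded) - len(content)
    if content == "":
        return "blank"
    if indent >= 4:
        return "indented code"
    # Thematic breaks are checked before list items, so "- - -" is not a list.
    if THEMATIC_BREAK.match(content):
        return "thematic break"
    if ATX_HEADING.match(content):
        return "ATX heading"
    if BULLET_ITEM.match(content) or ORDERED_ITEM.match(content):
        return "list item"
    return "paragraph"
```

With this sketch, "* starts a list, not emphasis*" is classified as a list item, while "*emphasis* in a paragraph" falls through to paragraph: the block decision is made before any inline rule sees the asterisks.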

This two-phase approach is also efficient. Block parsing can collect reference definitions and determine container boundaries. Inline parsing can then resolve links and emphasis with the necessary context. A parser does not have to fully understand every inline construct before it knows where paragraphs and lists begin.

Container blocks and leaf blocks

CommonMark distinguishes between container blocks and leaf blocks. Container blocks can contain other blocks. Block quotes and list items are the most common examples. A list item may contain paragraphs, nested lists, block quotes, or code blocks. A block quote may contain headings, lists, and paragraphs.

Leaf blocks do not contain other blocks. Headings, thematic breaks, code blocks, HTML blocks, link reference definitions, and paragraphs are examples. They represent terminal structures in the block tree. Understanding this distinction is important when debugging nested Markdown. If something is a leaf block, it will not absorb later blocks as children. If something is a container, indentation and continuation rules matter.

The document itself can be viewed as a sequence of blocks. Those blocks may contain nested blocks, and some blocks contain inline content. The resulting model is tree-like, even if the original source is plain text. This tree is what renderers use to produce HTML, ASTs, or other outputs.
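The tree model can be illustrated with a small data structure. This is a sketch, not the node format of any particular parser: the Block class, the kind strings, and the outline helper are all hypothetical names for this example.

```python
from dataclasses import dataclass, field

# Kinds that may hold other blocks; everything else is treated as a leaf.
CONTAINER_KINDS = {"document", "block_quote", "list", "list_item"}


@dataclass
class Block:
    kind: str
    children: list["Block"] = field(default_factory=list)  # used by containers only
    text: str = ""  # raw inline content for leaves such as paragraphs and headings

    def add(self, child: "Block") -> "Block":
        """Attach a child block, refusing if this block is a leaf."""
        if self.kind not in CONTAINER_KINDS:
            raise ValueError(f"{self.kind} is a leaf block and cannot contain blocks")
        self.children.append(child)
        return child


def outline(block: Block, depth: int = 0) -> list[str]:
    """Return one indented line per block, depth-first, to show the nesting."""
    lines = ["  " * depth + block.kind]
    for child in block.children:
        lines.extend(outline(child, depth + 1))
    return lines


# A block quote containing a heading and a one-item list.
doc = Block("document")
quote = doc.add(Block("block_quote"))
quote.add(Block("heading", text="Title"))
items = quote.add(Block("list"))
item = items.add(Block("list_item"))
item.add(Block("paragraph", text="First point"))
```

Calling outline(doc) lists the blocks in document order with two spaces of indentation per nesting level, and trying to add a child to the paragraph raises ValueError, which mirrors the rule that leaf blocks do not absorb other blocks.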

Paragraphs as fallback structure

Paragraphs are the fallback block type for many lines of text. If a line does not begin another block construct and is not blank, it often contributes to a paragraph. Consecutive paragraph lines are combined, and soft line breaks inside the paragraph may render as spaces or line breaks depending on renderer behavior.
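A minimal sketch of the fallback behavior, assuming the source uses \n line endings and contains nothing but paragraphs and blank lines. It ignores every other block type and hard line breaks; group_paragraphs and render_paragraph are hypothetical names.

```python
def group_paragraphs(source: str) -> list[str]:
    """Merge consecutive non-blank lines into paragraphs; blank lines separate them."""
    paragraphs: list[str] = []
    current: list[str] = []
    for line in source.split("\n"):
        if line.strip(" \t") == "":
            if current:
                paragraphs.append("\n".join(current))
                current = []
        else:
            # Surrounding whitespace is dropped here for simplicity; the real rules
            # treat trailing spaces specially (they can form hard line breaks).
            current.append(line.strip(" \t"))
    if current:
        paragraphs.append("\n".join(current))
    return paragraphs


def render_paragraph(text: str, softbreak: str = " ") -> str:
    """Render a paragraph, letting the caller choose how soft breaks appear."""
    return "<p>" + text.replace("\n", softbreak) + "</p>"
```

The softbreak parameter makes the renderer choice explicit: passing " " joins hard-wrapped lines with spaces, while passing "\n" keeps the source line endings in the output.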

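This fallback behavior is part of Markdown’s readability. You can write normal prose without wrapping it in tags. But it also means paragraph interruption rules matter. Some block constructs can interrupt a paragraph; others need a blank line first. An ATX heading or a thematic break can interrupt a paragraph, but an indented code block cannot. A list can interrupt a paragraph only under restricted conditions: its first item must not be empty, and an ordered list must start with 1. A line of dashes directly below paragraph text is read as a setext heading underline rather than a thematic break.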

When Markdown output looks wrong, ask whether the parser saw a paragraph where you expected another block, or another block where you expected a paragraph. Most structural surprises come from that boundary.

Inlines as local semantics

Inline parsing applies within inline-capable blocks. The parser identifies emphasis, strong emphasis, code spans, links, images, autolinks, escapes, and entity references. Inline parsing has its own precedence and delimiter rules. For example, code spans treat their contents literally, so Markdown syntax inside a code span is not parsed as emphasis or links.
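The code span rule can be sketched as a first inline step that separates literal code from text that remains eligible for other inline rules. This is a simplified illustration: it matches backtick runs of equal length but leaves out backslash escapes, line-ending normalization, and the stripping of a single leading and trailing space. The name split_code_spans is hypothetical.

```python
import re

BACKTICKS = re.compile(r"`+")


def split_code_spans(text: str) -> list[tuple[str, str]]:
    """Split inline text into ("code", ...) and ("text", ...) pieces."""
    parts: list[tuple[str, str]] = []
    literal_start = 0  # where the pending run of plain text begins
    pos = 0
    while True:
        opener = BACKTICKS.search(text, pos)
        if opener is None:
            break
        # A code span closes at the next backtick run of exactly the same length.
        closer = None
        for candidate in BACKTICKS.finditer(text, opener.end()):
            if len(candidate.group()) == len(opener.group()):
                closer = candidate
                break
        if closer is None:
            # No matching closer: these backticks stay literal text.
            pos = opener.end()
            continue
        if opener.start() > literal_start:
            parts.append(("text", text[literal_start:opener.start()]))
        parts.append(("code", text[opener.end():closer.start()]))
        pos = literal_start = closer.end()
    if literal_start < len(text):
        parts.append(("text", text[literal_start:]))
    return parts
```

Only the "text" pieces would go on to emphasis and link parsing, so the asterisks inside a code span never get the chance to act as emphasis delimiters.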

Inline semantics are local to the containing block, but they can depend on document-level reference definitions. Reference-style links use labels defined elsewhere in the document. That means the parser’s earlier block pass must collect definitions before inline resolution is complete.
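The dependency between the two passes can be shown with a deliberately small sketch. It assumes one definition per line in the form [label]: destination, handles only full reference links of the form [text][label], and does no HTML escaping. The patterns and the names collect_definitions and resolve_links are hypothetical, and the URL is a placeholder.

```python
import re

DEFINITION = re.compile(r"^ {0,3}\[([^\]]+)\]:\s*(\S+)\s*$")
REFERENCE = re.compile(r"\[([^\]]+)\]\[([^\]]*)\]")


def normalize(label: str) -> str:
    """Labels match case-insensitively, with internal whitespace collapsed."""
    return " ".join(label.casefold().split())


def collect_definitions(lines: list[str]) -> tuple[dict[str, str], list[str]]:
    """First pass: pull out definitions, keep every other line for later."""
    definitions: dict[str, str] = {}
    remaining: list[str] = []
    for line in lines:
        match = DEFINITION.match(line)
        if match:
            # When a label is defined more than once, the first definition wins.
            definitions.setdefault(normalize(match.group(1)), match.group(2))
        else:
            remaining.append(line)
    return definitions, remaining


def resolve_links(line: str, definitions: dict[str, str]) -> str:
    """Second pass: turn [text][label] into a link when the label is defined."""
    def replace(match: re.Match) -> str:
        text = match.group(1)
        label = match.group(2) or text  # [text][] reuses the text as the label
        url = definitions.get(normalize(label))
        if url is None:
            return match.group(0)  # undefined label: leave the source untouched
        return f'<a href="{url}">{text}</a>'
    return REFERENCE.sub(replace, line)


lines = ["See the [spec][cm] for details.", "", "[CM]: https://example.com/spec"]
definitions, body = collect_definitions(lines)
html = resolve_links(body[0], definitions)
```

The link on the first line resolves even though its definition appears two lines later, which is only possible because the whole document was scanned for definitions before any inline content was resolved.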

This is a useful design pattern for large Markdown systems. Treat source Markdown as a pipeline: raw text becomes block structure, block structure provides inline contexts, inline contexts produce a richer document tree, and the tree becomes output. AST tools such as mdast make this pipeline visible.

Document outline integrity

The document model determines whether the rendered output preserves the author’s intended outline. If a heading is parsed as a paragraph, it will not become a structural section. If an indented paragraph becomes a code block, the renderer will treat prose as literal text. If a list container closes earlier than expected, nested content moves to a different position in the tree.
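One way to watch for the first kind of defect is a small outline check. The sketch below is an assumption-laden lint, not a parser: it recognises only ATX headings, assumes \n line endings, ignores closing # sequences and fenced code blocks, and will also flag ordinary text that happens to start with #. The name outline_report is hypothetical.

```python
import re

# Up to 3 spaces of indentation, 1 to 6 # characters, then whitespace or end of line.
ATX = re.compile(r"^ {0,3}(#{1,6})(?:[ \t]+(.*?))?[ \t]*$")
# The same opening immediately followed by text: parsed as a paragraph, not a heading.
SUSPECT = re.compile(r"^ {0,3}#{1,6}[^#\s]")


def outline_report(source: str) -> tuple[list[tuple[int, str]], list[int]]:
    """Return (headings, warnings): (level, title) pairs and suspect line numbers."""
    headings: list[tuple[int, str]] = []
    warnings: list[int] = []
    for number, line in enumerate(source.split("\n"), start=1):
        match = ATX.match(line)
        if match:
            headings.append((len(match.group(1)), match.group(2) or ""))
        elif SUSPECT.match(line):
            warnings.append(number)
    return headings, warnings


headings, warnings = outline_report("# Intro\n#Setup\n\n## Usage\n")
```

Here line 2 is reported as suspect because #Setup lacks the space after the marker, so it would become paragraph text and drop out of the outline.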

For technical documents, outline integrity is a correctness issue. The source must parse into a predictable block tree before inline semantics can be trusted. The next step is understanding block parsing and precedence, because most outline defects begin at block boundaries.

FAQ

What is a Markdown block?

A block is a structural document unit such as a paragraph, heading, list, block quote, code block, or thematic break.

What is a Markdown inline?

An inline is content inside certain blocks, such as emphasis, links, images, code spans, text, escapes, and entity references.

Why does CommonMark parse blocks first?

Block-first parsing resolves document structure before local inline formatting. This makes precedence predictable and allows reference definitions to be collected.

What is a container block?

A container block is a block that can contain other blocks. Block quotes, lists, and list items are common examples.

Why does this model matter for authors?

It helps authors understand why indentation, blank lines, and punctuation placement change the rendered structure of a document.

Read the standards context in CommonMark Standardization.
Apply the model in Markdown Block Parsing and Precedence Rules.
Study block examples in Markdown Headings, Paragraphs, Line Breaks, and Thematic Breaks.
Move from source structure to trees in Markdown AST with mdast.
Build processing workflows in unified and remark Pipelines.

Continue with Markdown Block Parsing and Precedence Rules.

References

CommonMark Specification
