Guide to the WIX Grammar

This is part tutorial, part notes-to-myself because I often forget this stuff.

Atoms that fall across an indenting linebreak

About 90% of the tricky stuff in the grammar is there to handle this single case. Here's the most common incarnation (literal text is in cyan, indent/outdent symbols added by the pre-pass are shown in red).

The tough part here is that the region which matches the Link production (green below) overlaps the region which matches the Balanced production (red below), and neither contains the other:

This is really tough for boolean CFGs to deal with. At first I considered just prohibiting it, but that would've meant not being able to use Emacs' fill-paragraph freely.
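To make the indent/outdent symbols concrete, here is a rough Python sketch of what a pre-pass of this kind might do. It is an illustration only, not the actual WIX pre-pass: the function name `mark_indentation`, the token spellings `>>` and `<<`, and the choice to split on whitespace are all assumptions made for the example.

```python
INDENT, OUTDENT = ">>", "<<"  # hypothetical spellings for the two symbols


def mark_indentation(text):
    """Tokenize text, inserting INDENT/OUTDENT where leading whitespace changes."""
    stack = [0]  # indentation widths that are currently open
    out = []
    for line in text.splitlines():
        if not line.strip():
            continue  # this sketch simply skips blank lines
        width = len(line) - len(line.lstrip(" "))
        # Close every level that is deeper than the current line.
        while width < stack[-1]:
            stack.pop()
            out.append(OUTDENT)
        # Open a new level if the current line is deeper than what remains.
        if width > stack[-1]:
            stack.append(width)
            out.append(INDENT)
        out.extend(line.split())
    # Close whatever is still open at the end of the input.
    out.extend(OUTDENT for _ in stack[1:])
    return out


tokens = mark_indentation("* see [the\n  manual] for more\n")
print(tokens)
```

In this output the link `[the >> manual]` contains the indent but not its outdent, while the indent/outdent pair encloses `manual] for more` but not the link's opening bracket. Those are the two overlapping regions described above.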

The Solution

It took a lot of trial and error, but the solution turns out to be three simple principles:

  1. Indent symbols are treated as ordinary whitespace, just like a space or a newline. Outdent symbols are not.

  2. Most productions are specified to be unbalanced – that is, they may contain more indentations than outdentations.

  3. The grammar writer very, very carefully chooses where to replace “Foo” with “Foo <<* & Balanced”. Wherever this happens, it effectively tosses out potential matchings that have too many indents. This compensates for the fact that we're effectively ignoring the indent symbols in #1.

    Note: “<<” in the grammar matches outdent symbols in the input, “*” means zero-or-more, and “&” is conjunction (language intersection). For more detail, see sbp.

This has the effect of expanding the red region (Balanced) to encompass the entire paragraph. In other words, we match Balanced against stuff that ends with outdents, but might not start with indents – there may be “leading junk” on the region we check for Balancedness.
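The effect of “Foo <<* & Balanced” can be imitated on a flat token list. The Python below is a sketch under assumptions, not how sbp evaluates a conjunction: the helper names `net_depth`, `is_balanced` and `match_balanced` are made up, and `>>` / `<<` stand in for the indent and outdent symbols.

```python
INDENT, OUTDENT = ">>", "<<"  # hypothetical spellings for the two symbols


def net_depth(tokens):
    """Indents minus outdents, or None if an outdent closes nothing."""
    depth = 0
    for tok in tokens:
        if tok == INDENT:
            depth += 1
        elif tok == OUTDENT:
            depth -= 1
            if depth < 0:
                return None
    return depth


def is_balanced(tokens):
    """True when every indent is closed by a later outdent and vice versa."""
    return net_depth(tokens) == 0


def match_balanced(foo, rest):
    """Imitate `Foo <<* & Balanced`.

    foo is an (unbalanced) match; rest is the input that follows it.
    Absorb exactly as many leading outdents from rest as foo left open.
    Returns (balanced_region, remaining_input), or None if that is impossible.
    """
    need = net_depth(foo)
    if need is None or rest[:need] != [OUTDENT] * need:
        return None
    return foo + rest[:need], rest[need:]


paragraph = ["*", "see", "[the", ">>", "manual]", "for", "more"]
link = ["[the", ">>", "manual]"]

print(is_balanced(link))                             # the link alone is unbalanced
print(match_balanced(paragraph, ["<<", "*", "next"]))
```

Note that the balanced region starts at `*`, not at an indent symbol. That is the “leading junk” mentioned above.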

So, if you're modifying the grammar, the basic idea is that you should follow #1 and #2, and try not to add any more instances of #3 unless you really, truly know what you're doing.

What this means in practice is that typically when you have a list of indentable entities (for example, bullets within a bullet list) you'll want to force all but the last one to be balanced, and leave the last one unbalanced.
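As a rough illustration of that convention, the Python sketch below checks a list of already-tokenized bullets: every bullet except the last must be balanced, while the last may leave indents open for an enclosing production to close. The names `net_depth` and `follows_list_convention` and the `>>` / `<<` token spellings are assumptions for the example, not part of the grammar.

```python
INDENT, OUTDENT = ">>", "<<"  # hypothetical spellings for the two symbols


def net_depth(tokens):
    """Indents minus outdents, or None if an outdent closes nothing."""
    depth = 0
    for tok in tokens:
        if tok == INDENT:
            depth += 1
        elif tok == OUTDENT:
            depth -= 1
            if depth < 0:
                return None
    return depth


def follows_list_convention(bullets):
    """All bullets but the last are balanced; the last may have open indents."""
    if not bullets:
        return False
    *init, last = bullets
    return all(net_depth(b) == 0 for b in init) and net_depth(last) is not None


good = [
    ["*", "one", ">>", "continued", "<<"],  # balanced
    ["*", "two", ">>", "continued"],        # last bullet: left unbalanced
]
bad = list(reversed(good))                  # unbalanced bullet is no longer last

print(follows_list_convention(good))
print(follows_list_convention(bad))
```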

Why not force all paragraph-level entities to be balanced?

(FIXME: explain this better)

It has to do with prioritized matches. Originally, bullet lists forced every bullet in the list to be balanced. The problem was that the parser would try to match a paragraph against both the “bulleted list” and “plain text paragraph” productions. The former would match a region that included the trailing outdent, while the latter would match a region that did not. This meant that we had to “hoist” the prioritized match up to the point after TextParagraph had been turned into (TextParagraph <<* & Balanced). This structural constraint on the grammar wound up being cumbersome and a major source of bugs: you had to think really carefully about where you were trying to do a priority match involving both balanced and unbalanced things. So standardizing on “always unbalanced at the end” simplified things a lot.