2016-10-16 15:18:52

Creating a markup language


In my first pass at implementing this blog, the content was stored using a highly structured tree format. For example:

paragraph
| I wanted to play around with static site generation, exploring ideas about
| data representation.
paragraph
| First things first: how to represent the data for a post.
header
| Data!
paragraph
| ...

This works pretty well, but the need for repetitive paragraph sections is a bit unfortunate.
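
To make the tree format concrete, here is a minimal sketch of how it could be read. The function name `parse_tree` and the rule that one space after the pipe is dropped are assumptions for illustration, not the blog's actual implementation.

```python
def parse_tree(source: str) -> list[tuple[str, str]]:
    """Parse 'name' lines followed by '| text' lines into (name, text) pairs."""
    nodes: list[tuple[str, list[str]]] = []
    for line in source.splitlines():
        if not line.strip():
            continue  # blank lines carry no meaning in this sketch
        if line.startswith("|"):
            if not nodes:
                raise ValueError("content line before any node name")
            # Drop the pipe and the single space that conventionally follows it.
            rest = line[1:]
            nodes[-1][1].append(rest[1:] if rest.startswith(" ") else rest)
        else:
            nodes.append((line.strip(), []))  # a bare line starts a new node
    return [(name, "\n".join(lines)) for name, lines in nodes]


post = """\
paragraph
| First things first: how to represent the data for a post.
header
| Data!
"""
print(parse_tree(post))
```

Every paragraph needs its own `paragraph` line in this format, which is where the repetition comes from.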

Additionally, without a proper markup language, there's no way to style inline text. For that reason, I relied on nested HTML before:

paragraph
| The <code>content</code> body could be improved ...

Let's see if we can make our own markup language instead.

The goal

What is a markup language, anyway? To quote the Wikipedia article:

A markup language is a system for annotating a document in a way that is syntactically distinguishable from the text.

I like to think of it as a language for writing augmentable text.

Overall, our goal is to replace this:

paragraph
| First things first: how to represent the data for a post.
header
| Data!

With something like this:

First things first: how to represent the data for a post.

<header>Data!</header>

The one character markup

In order to support inline annotations, we have to use a special character to escape normal interpretation. I'm using $.

That's all we need - we can simply place a command name between two of our special characters ($bold$) to open an annotation, and use a special command ($end$) to mark where it ends:

Yay for $bold$fun stuff$end$!

This works, but can start to look messy very fast. For example:

$bold$$italic$This statement is bold and italic!$end$$end$

Additionally, since we've claimed the $ character for our language, we'll need a special command to escape it as well:

That'll be $dol$4.50 please!
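
One way to read this one-character form is to split the text on $: the pieces at odd positions are then command names and the rest is plain text. The sketch below assumes that approach; the `TAGS` mapping to HTML elements and the name `render_one_char` are hypothetical, and the text itself is not HTML-escaped.

```python
# Assumed mapping from command names to HTML elements (illustrative only).
TAGS = {"bold": "b", "italic": "i", "code": "code"}


def render_one_char(text: str) -> str:
    """Render $name$ ... $end$ annotations, with $dol$ as a literal dollar sign."""
    parts = text.split("$")
    if len(parts) % 2 == 0:
        raise ValueError("unmatched '$'")  # an odd number of '$' characters
    out: list[str] = []
    open_tags: list[str] = []
    for index, part in enumerate(parts):
        if index % 2 == 0:
            out.append(part)  # even positions are plain text
        elif part == "dol":
            out.append("$")
        elif part == "end":
            if not open_tags:
                raise ValueError("'$end$' with nothing open")
            out.append(f"</{open_tags.pop()}>")
        elif part in TAGS:
            open_tags.append(TAGS[part])
            out.append(f"<{TAGS[part]}>")
        else:
            raise ValueError(f"unknown command: {part!r}")
    if open_tags:
        raise ValueError("missing '$end$'")
    return "".join(out)


print(render_one_char("Yay for $bold$fun stuff$end$!"))
# prints: Yay for <b>fun stuff</b>!
print(render_one_char("That'll be $dol$4.50 please!"))
# prints: That'll be $4.50 please!
```

Because every $ belongs to a command, a literal dollar sign can only be produced by a command such as $dol$.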

Adding a second character

Since the $end$ command is used quite a bit, we might make our language less verbose by reserving a second character to use in its place. I'll use |:

$bold$$italic$This statement is bold and italic!||

Actually, it looks a bit better if we use it not just as a replacement for $end$, but more precisely as a character to denote boundaries:

$bold|$italic|This statement is bold and italic!||

Bringing in a second character has another nice benefit as well - we can now use $$ or $| to escape our two special characters, since there's no ambiguity in the parser:

That'll be $$4.50 please!
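
The two-character rules fit in a small left-to-right scanner. This is only a sketch under assumptions: the `TAGS` mapping, the HTML output, and the name `render_inline` are made up for illustration, and a bare | with nothing open is treated as an error.

```python
# Assumed mapping from command names to HTML elements (illustrative only).
TAGS = {"bold": "b", "italic": "i", "code": "code"}


def render_inline(text: str) -> str:
    """Render $name|...| annotations; $$ and $| are escapes for $ and |."""
    out: list[str] = []
    open_tags: list[str] = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "$":
            nxt = text[i + 1 : i + 2]  # empty string at end of input
            if nxt in ("$", "|"):
                out.append(nxt)  # escaped special character
                i += 2
                continue
            end = text.find("|", i + 1)
            if end == -1:
                raise ValueError("command is missing its '|'")
            name = text[i + 1 : end]
            if name not in TAGS:
                raise ValueError(f"unknown command: {name!r}")
            open_tags.append(TAGS[name])
            out.append(f"<{TAGS[name]}>")
            i = end + 1
        elif ch == "|":
            if not open_tags:
                raise ValueError("'|' with nothing open")
            out.append(f"</{open_tags.pop()}>")  # close the innermost annotation
            i += 1
        else:
            out.append(ch)
            i += 1
    if open_tags:
        raise ValueError("unclosed annotation")
    return "".join(out)


print(render_inline("$bold|$italic|This statement is bold and italic!||"))
# prints: <b><i>This statement is bold and italic!</i></b>
print(render_inline("That'll be $$4.50 please!"))
# prints: That'll be $4.50 please!
```

The scanner only ever needs to look one character past a $ to tell an escape from a command, which is why $$ and $| introduce no ambiguity as long as command names are never empty.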

All in all, that pretty well covers our ability to annotate inline text.

What about a larger scale?

Annotating sections

Suppose we have a block of code that we'd like to annotate.

With our language so far, there's nothing preventing us from spanning boundaries across multiple lines:

$code|
  int main() {
    return 0;
  }
|

One idea I played with was to define the boundaries using pipes and indentation instead:

$code
| int main() {
|   return 0;
| }

This works pretty well - it adds a bit of complexity to the parser, but it can be implemented without ambiguity. And it looks nice.
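
A rough sketch of the extra parser step: a line holding only `$name` opens a block, and the lines that follow belong to it for as long as they start with a pipe. The name `split_blocks`, the restriction of block names to lowercase letters, and the `inline` label for ordinary lines are all assumptions for this example.

```python
import re

# A block opener is a line containing only "$name" (lowercase names assumed).
BLOCK_OPEN = re.compile(r"^\$([a-z]+)$")


def split_blocks(source: str) -> list[tuple[str, str]]:
    """Return (kind, text) pairs; kind is a block name, or 'inline' for other lines."""
    items: list[tuple[str, str]] = []
    lines = source.splitlines()
    i = 0
    while i < len(lines):
        match = BLOCK_OPEN.match(lines[i])
        if match:
            body: list[str] = []
            i += 1
            # Collect the piped lines that follow the opener.
            while i < len(lines) and lines[i].startswith("|"):
                rest = lines[i][1:]
                body.append(rest[1:] if rest.startswith(" ") else rest)
                i += 1
            items.append((match.group(1), "\n".join(body)))
        else:
            items.append(("inline", lines[i]))
            i += 1
    return items


sample = "Some text\n$code\n| int main() {\n|   return 0;\n| }\nMore text"
for kind, text in split_blocks(sample):
    print(kind, repr(text))
```

The block body returned here is still raw: special characters inside it would need the same escape handling as inline text.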

There's something bothering me though - the need to escape special characters within code blocks:

$code
| $$code
| $| int main() {
| $|   return 0;
| $| }

This can't really be avoided - when our language is relying on special characters within text, we have to give them special treatment within that text. Go figure.
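
For illustration, the escaping that a writer (or a tool) has to apply to raw code can be sketched in one line. The helper name `escape_markup` is hypothetical.

```python
def escape_markup(raw: str) -> str:
    """Escape the two special characters so raw text survives the markup parser."""
    # Order matters: dollars first, otherwise the '$' added for each '|'
    # would itself be doubled.
    return raw.replace("$", "$$").replace("|", "$|")


for line in ["$code", "| int main() {", "|   return 0;", "| }"]:
    print("| " + escape_markup(line))
```

Run on the earlier code block, this produces the escaped lines shown above.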

Compromise

The best compromise I could come up with was to keep our new markup language for doing inline stuff - we need that functionality anyway - but also keep our indented tree structure for handling blocks and sections.

Compare this:

Since we've claimed the $code|$$| character for our language, we'll need
a special command to escape it as well:

$code|That'll be $$dol$$4.50 please!|

To this:

text
| Since we've claimed the $code|$$| character for our language, we'll need
| a special command to escape it as well:
code
| That'll be $dol$4.50 please!

Both work, but I think I prefer the latter, if only to avoid the escaped characters.

- ava