The Goofy Markup Language Processor

Last update to this text: April 12, 2022.

GMLP is a set of functions for converting text written in any markup language into any other text.

The docs/ directory has more on the GMLP design, the reasoning behind the GMLP code, an in-depth look at the default ("goofy") GMLP markup, and how to use its CLI.

Introduction

GMLP simply runs text input lines through various preg_replace() and/or str_replace() calls defined by an external array—a user defined array. To help with more complex processing there are various function hooks to be called for each line, multiple lines (lists for example) or the entire text.

Conversions

There are five types of conversions that can be defined. (These are based on the "parts", or "categories" of typical English grammar, more about which later.) I call them:

Lines
Inlines
Entities
Character
Block

All of them occur only as specified by the data definitions, and in a particular order.

Conversions are processed as key/value associative arrays, with the key being replaced by the value, the exceptions being the Character and Block conversions explained below.

What follows is a brief summary of each type and then a detailed look at the mechanisms of each one.

Lines

These convert an entire line directly to a paragraph (enclosed within block-level HTML elements) by a regular expression generally beginning with a start of line meta-character (^).

Inlines

These replace a regular expression key, matching words within a line, with its value. They provide the bulk of the conversions.

Entities

The name refers to the default rule to convert certain characters to named entities, but these are simply an array of string replacements (and not regular expressions), with keys replaced by their values.

Characters

These replace character pairs with start and end tags, and are defined as single character keys with the value the tag replacements. (To be replaced by inlines rules.)

Blocks

Block conversions have a start line, followed by any number of lines making up the block and then an end line. The start and end lines are regular expressions. Each block definition has a number of conditions that can be "applied" to the block.

Paragraphs

Everybody probably knows this, but some basic rules for processing text are that the text is delimited by newlines, with two consecutive newlines indicating a paragraph and empty paragraphs are ignored (but some defined rules may depend on blank lines).

For HTML each paragraph is enclosed in <p></p>, and each single newline in a paragraph converted to a <br>. For example (with extra spaces for clarity):

    TEXT: Line One
    HTML: <p>Line One</p>
    TEXT: Line One\n\n Line Two
    HTML: <p>Line One</p> <p>Line Two</p>
    TEXT: Line One\n\n A\nB\nC\n\n Line Two
    HTML: <p>Line One</p> <p>A<br>B<br>C</p> <p>Line Two</p>

With the creation of paragraphs—or not—no other processing of the text occurs by the code. Any other text conversion is only done by the "conversion rules" in the conversion definition data. If no conversion rules are defined, only these paragraphing rules are applied to the text.

That is what I call "Algorithm P": the basics of converting text, whether or not the lines within a paragraph have their own line endings.

The basic data for the above is (yet another) array:

    array(
        'BR' => '<br>',
        'P1' => '<p>',
        'P2' => '</p>',
        'NL' => "\n",
    );

Which means that for non-HTML output BR, P1 and P2 are simply set to empty.
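The paragraphing rules above can be sketched in a few lines of PHP. This is only an illustration of the rules as described here, not the GMLP code itself; the function name sketch_paragraphs() is made up for the example, and only the four array keys come from the text.

```php
<?php
// Hypothetical helper (not part of GMLP): applies the paragraphing rules
// using the BR/P1/P2/NL data array described above.
function sketch_paragraphs(string $text, array $p): string {
    // Two or more consecutive newlines separate paragraphs;
    // PREG_SPLIT_NO_EMPTY drops empty paragraphs.
    $paragraphs = preg_split('/\n{2,}/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $out = array();
    foreach ($paragraphs as $para) {
        // Each single newline left inside a paragraph becomes BR.
        $body = str_replace("\n", $p['BR'], trim($para));
        $out[] = $p['P1'] . $body . $p['P2'];
    }
    return implode($p['NL'], $out);
}

$html = array('BR' => '<br>', 'P1' => '<p>', 'P2' => '</p>', 'NL' => "\n");
echo sketch_paragraphs("Line One\n\nA\nB\nC\n\nLine Two", $html), "\n";
// prints:
// <p>Line One</p>
// <p>A<br>B<br>C</p>
// <p>Line Two</p>
```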

There is one other thing, based on paragraphs without or with line-breaks. Think of this example where A, B and C are sentences. First, without line-breaks.

    A B C

And with.

    A
    B
    C

This brings up the term "greediness", in that the latter example can be set to not add BR to the end of each line for HTML (using spaces essentially). For text, BR is empty to simulate greediness, or set to \n, for greedy-like paragraphs. (It is actually slightly different in the code; more on that later.)

One stand-out thing about the code is that it operates solely line-by-line and does not look forward or back, so multiple-line markup (setext style markup) was not designed to be supported. (Though there is a way to support that.)

A Look At The Regular Expressions

The regular expressions for these examples are in the Default Definition File, something I created to support a personalized blog. It was not based on any known markup language but solely on my personal preference at the time. Only later, after the first version of GMLP, did I learn about Markdown and others.

The conversion "rules" are arrays and placed into a PHP file which is included by the code. The arrays are processed in a particular order—conversions can affect other conversions. The order of the types of conversions is controlled by the code and not the data.

What follows are details about each conversion array in the order that they are applied to the text.

As with all the array definitions, if they do not exist or are empty, they will have no effect on the input text.

Lines

The first array is lines; its rules convert a line and then skip all further conversions for that line. Some examples (all definitions use the $gmlp_translate array):

    $gmlp_translate['lines'] = array(
    '/^;/' => NULL,
    '/^<!--.*-->$/' => '',
    '/^h([1-6]):\s*(.*)/' => '<h$1>$2</h$1>',
    '/^[A-Z ]+$/' => 'gmlp_convertcase',
    );

With the lines array, if the key matches the line, what happens to the line is based on the value. The details are:

  1. If value is NULL the line is discarded.
  2. If value is FALSE keep line and do not continue processing this array.
  3. If value is an empty string use line as is or $1 if subpattern is defined.
  4. If value is function or closure, the function is called with line and any search matches, and line is replaced by the return value.
  5. If the preg match did not produce subpatterns, the line is replaced by the value; if subpatterns exist, the line is replaced by the result of preg_replace() with the key and value on the line.
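The five rules above could be read as the following dispatch. This is a sketch under assumptions: sketch_lines() is a hypothetical name, the closure stands in for gmlp_convertcase (whose behavior is not shown here), and returning NULL for a discarded line is my own convention, not necessarily what the GMLP code does.

```php
<?php
// Hypothetical helper (not part of GMLP): applies a 'lines' style array
// to one line. Returns the new line, or NULL if the line is discarded.
function sketch_lines(string $line, array $rules): ?string {
    foreach ($rules as $re => $value) {
        if (!preg_match($re, $line, $m)) {
            continue;                           // key does not match, try the next rule
        }
        if ($value === NULL) {
            return NULL;                        // rule 1: discard the line
        }
        if ($value === FALSE) {
            return $line;                       // rule 2: keep the line, stop here
        }
        if ($value === '') {
            return isset($m[1]) ? $m[1] : $line; // rule 3: line as is, or $1
        }
        if (is_callable($value)) {
            return $value($line, $m);           // rule 4: function or closure
        }
        // rule 5: plain value without subpatterns, preg_replace() with them
        return count($m) > 1 ? preg_replace($re, $value, $line) : $value;
    }
    return $line;                               // no rule matched
}

$rules = array(
    '/^;/' => NULL,
    '/^<!--.*-->$/' => '',
    '/^h([1-6]):\s*(.*)/' => '<h$1>$2</h$1>',
    // Stand-in for gmlp_convertcase; the real function may do something else.
    '/^[A-Z ]+$/' => function ($line, $m) { return ucwords(strtolower($line)); },
);

var_dump(sketch_lines('h2: Title', $rules));    // string(14) "<h2>Title</h2>"
```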

Blocks

Block conversions are done next by the code, if a block begin RE matches. Because block processing is more complex, it is explained later.

Inlines

For inlines, the value is either the name of a PHP function to be called on the line or a RE replacement string.

    $gmlp_translate['inlines'] = array(
    '/\[(.*)\]\((.*)\)/U' => '<a href="$2">$1</a>',
    '/\'\'(.*)\'\'/U' => 'htmlentities',
    );

The details of the inlines array are:

  1. If value is a string preg_replace() with the key and value on the line.
  2. If value is a function name or closure, replace all subpatterns in line with function return value.

There is an option to change how closures work:

  1. The closure is called with the line and the matches array, and the line is replaced by the return value.

These are applied to each line, in order, from first to last, so the results of each one propagate to the next; this means there can be order dependencies.
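A rough sketch of how the two kinds of inlines values might be applied, one rule after another. The name sketch_inlines() is made up, and replacing the whole match with the function's result on the first subpattern is only my reading of "replace all subpatterns"; the real code may differ.

```php
<?php
// Hypothetical helper (not part of GMLP): runs one line through an
// 'inlines' style array, in order, so each rule sees the previous result.
function sketch_inlines(string $line, array $rules): string {
    foreach ($rules as $re => $value) {
        if (is_string($value) && !function_exists($value)) {
            // Value is an RE replacement string.
            $line = preg_replace($re, $value, $line);
        } else {
            // Value is a function name or closure: call it on the first
            // subpattern (or on the whole match if there is none).
            $line = preg_replace_callback($re, function ($m) use ($value) {
                return $value($m[1] ?? $m[0]);
            }, $line);
        }
    }
    return $line;
}

$rules = array(
    '/\[(.*)\]\((.*)\)/U' => '<a href="$2">$1</a>',
    '/\'\'(.*)\'\'/U' => 'htmlentities',
);

echo sketch_inlines("See [docs](docs/) and ''<b>''.", $rules), "\n";
// prints: See <a href="docs/">docs</a> and &lt;b&gt;.
```

If the two rules were swapped and the link text contained doubled single quotes, the result would differ, which is the kind of order dependency mentioned above.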

Entities

The entities array holds simple string replacements, so named because they look like this:

    $gmlp_translate['entities'] = array(
    '--' => '&mdash;'
    );

but, of course, they can be anything. Like, perhaps:

    $gmlp_translate['entities'] = array(
    '---' => '&#8212;', 
    '--' => '&#8211;', 
    'xn&#8211;' => 'xn--',
    '...' => '&#8230;',
    '``' => '&#8220;', 
    '\'\'' => '&#8221;', 
    '(tm)' => '&#8482;',
    );

(Why that third one is seemingly backward... I don't know—they were a copy/paste from a popular blogware.)
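One plausible reading of that third rule: "xn--" is the prefix used by internationalized (Punycode) domain names, so the rule puts back what the '--' rule just changed. The sketch below assumes the replacements are done with a single str_replace() call on the keys and values, which works through the pairs in order; sketch_entities() is a made-up name and GMLP may apply the array differently.

```php
<?php
// Hypothetical helper (not part of GMLP): plain string replacement of
// keys by values. str_replace() handles the pairs in array order, each
// one working on the result of the previous one.
function sketch_entities(string $line, array $entities): string {
    return str_replace(array_keys($entities), array_values($entities), $line);
}

$entities = array(
    '---' => '&#8212;',      // must come before '--' or it would never match
    '--' => '&#8211;',
    'xn&#8211;' => 'xn--',   // undoes the '--' rule after an "xn" prefix
    '...' => '&#8230;',
);

echo sketch_entities('wait---what', $entities), "\n";      // wait&#8212;what
echo sketch_entities('xn--abc.example', $entities), "\n";  // xn--abc.example
```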

This one is enabled by the definition options array:

    $gmlp_translate['OPTIONS'] = array(
    'entities' => 1,
    );

Characters

The chars array is last, and these enclose key delimited words with value HTML tags:

    $gmlp_translate['chars'] = array(
    '*' => 'b',
    '_' => 'u',
    '^' => 'em',
    '\'' => 'code',
    );

These are not just straight replacements though, as they try to figure out just what a word is:

Beginning of sentence: "Bold word."
Trailing punctuation: "Italic word."
Preserve inside chars: "$identifier_with_underscores;"
And intermixed: "This is bold with underline."
And not within HTML: font family Verdana.

Word boundaries can be defined.

Blocks

A block definition is an associative array defined with the following:

"begin"
the RE to start the block
"end"
the RE to end the block
"pre"
a string to prepend to the block
"post"
a string to append to the block or a function to call on the entire block
"replace"
a string to append to each line or a function to call on each line
"first"
a boolean if true to use the begin line in the block
"last"
a boolean if true to use the end line in the block
"continue"
a boolean to retain end line (for the following paragraph)
"newline"
a boolean to retain all newlines in the block

Of them, only "begin" and "end" are required; all others are optional and default to FALSE (and if no others are defined, the result is that nothing will be done to the block).

Here is an example:

    $gmlp_translate['blocks']['html'] = array(
    'begin' => '/^<html>/',
    'end' => '/^<\/html>/',
    'replace' => 'htmlentities',
    'pre' => '<pre>',
    'post' => '</pre>'
    );

The html block starts with "<html>", ends with "</html>", runs each line through the PHP function htmlentities(), prefixes the block with <pre> and postfixes the block with </pre>. Which is, this:

<html>
HTML <b>example</b> of bold.
</html>

will result in this:

HTML <b>example</b> of bold.
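To make the mechanics concrete, here is a much simplified sketch of how one block definition could be applied to an array of lines. It only handles "begin", "end", "pre", "post", "replace", "first" and "last", the name sketch_block() is made up, and the real block convert function certainly does more than this.

```php
<?php
// Hypothetical helper (not part of GMLP): applies a single block
// definition to an array of lines and returns the resulting lines.
function sketch_block(array $lines, array $def): array {
    // Optional members default to "off".
    $def += array('pre' => '', 'post' => '', 'replace' => NULL,
                  'first' => FALSE, 'last' => FALSE);
    $out = array();
    $block = NULL;                 // NULL while outside a block
    foreach ($lines as $line) {
        if ($block === NULL) {
            if (preg_match($def['begin'], $line)) {
                $block = $def['first'] ? array($line) : array();
            } else {
                $out[] = $line;    // ordinary line, passed through
            }
            continue;
        }
        $is_end = (bool) preg_match($def['end'], $line);
        if (!$is_end || $def['last']) {
            $block[] = $line;
        }
        if ($is_end) {
            if (is_callable($def['replace'])) {
                $block = array_map($def['replace'], $block);  // function per line
            } elseif (is_string($def['replace'])) {
                foreach ($block as $i => $l) {
                    $block[$i] = $l . $def['replace'];        // string appended per line
                }
            }
            $text = implode("\n", $block);
            // "post" has a dual use: function on the whole block, or a string.
            $out[] = is_callable($def['post'])
                ? $def['pre'] . $def['post']($text)
                : $def['pre'] . $text . $def['post'];
            $block = NULL;
        }
    }
    return $out;   // a block that never ends is silently dropped in this sketch
}

$def = array(
    'begin' => '/^<html>/',
    'end' => '/^<\/html>/',
    'replace' => 'htmlentities',
    'pre' => '<pre>',
    'post' => '</pre>',
);
$out = sketch_block(array('Before', '<html>', 'HTML <b>example</b> of bold.', '</html>', 'After'), $def);
echo implode("\n", $out), "\n";
// prints:
// Before
// <pre>HTML &lt;b&gt;example&lt;/b&gt; of bold.</pre>
// After
```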

Another example is:

    $gmlp_translate['blocks']['php'] = array(
    'begin' => '/^<\?php/',
    'end' => '/^\?>/',
    'post' => 'gmlp_highlightstr',
    'first' => 1,
    'last' => 1
    );

where post is a function to call on the entire block (note how post has a dual use which is done to reduce data complexity), and first and last means to keep the begin and end lines in the block. (See Functions.)

The function gmlp_highlightstr() is defined in the data definition file and is:

    <?php
    function gmlp_highlightstr($s) {
        return highlight_string($s, TRUE);
    }
    ?>

And a way to do PHP code without the <?php ?> by using <php> </php>:

    function gmlp_highlightstr($s) {
        return highlight_string($s, TRUE);
    }

There is a similar block definition that encloses highlighted PHP in styled <div> to do this:

PHP code

    function gmlp_highlightstr($s) {
        return highlight_string($s, TRUE);
    }

The block convert function is not a pretty one. It is also a bit confusing. And it is a bit too long. And, well, you can See the Code for yourself.

There is a way to use your own function to process a block, the function member:

    $gmlp_translate['blocks'][] = array(
    'begin' => '/^\s*\/\*\*$/',
    'function' => 'phpdoc',
    );

It just needs to follow a few rules. See `defs/phpdoc.php` and `gmlp_func_ls.php`.

Functions

Since a Definition File is a PHP file it can define its own support functions within it. Optionally, a Definition File can keep its support functions in a separate file, perhaps simply for maintenance. A separate functions file can also be included by a Definition File.

But the main API function that loads a Definition File can take a second argument for a Function File.

PHP code

    function gmlp_open($definitions = NULL, $functions = NULL) {
        global $gmlp_translate;

        gmlp_def_def();
        if ($definitions)
            include $definitions;
        if ($functions)
            include $functions;
        return TRUE;
    }

The function gmlp_def_def() initializes $gmlp_translate to a default state (see also defs/def.php).

The global array can be modified by the function gmlp_add().

Hooks

Hooks are like callbacks, but are really just user-defined functions called at certain times during the text processing:

    $gmlp_translate['HOOKS'] = array(
    'pre-convert' => 'pre_convert_function',
    'convert' => 'convert_function',
    'lines' => 'lines_function',
    'post-convert' => 'post_convert_function',
    );

The pre-convert hook is called with the data to convert in its entirety as a string. The convert hook is called with the data to convert in its entirety as an array of lines. The lines hook is called for each line of the data except those that are processed as lines or blocks. The post-convert hook is called with the converted data in its entirety as a string.
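As an illustration only, hook functions with those argument types might look like the following. The dispatcher sketch_run_hook() is made up for the example, and it assumes that a hook returns the (possibly changed) data; whether GMLP hooks return data or change it some other way is not stated here.

```php
<?php
// Example hooks, written as closures; the keys are the ones from the
// HOOKS array above, the bodies are arbitrary examples.
$hooks = array(
    // Whole input as one string: normalize Windows line endings.
    'pre-convert' => function (string $text) { return str_replace("\r\n", "\n", $text); },
    // Whole input as an array of lines: strip trailing whitespace.
    'convert' => function (array $lines) { return array_map('rtrim', $lines); },
    // Whole output as one string: make sure it ends with a newline.
    'post-convert' => function (string $text) { return rtrim($text, "\n") . "\n"; },
);

// Hypothetical dispatcher (not part of GMLP): calls the named hook if it
// is defined, otherwise passes the data through unchanged.
function sketch_run_hook(array $hooks, string $name, $data) {
    if (isset($hooks[$name]) && is_callable($hooks[$name])) {
        return $hooks[$name]($data);
    }
    return $data;
}

var_dump(sketch_run_hook($hooks, 'convert', array("a  ", "b")));
```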

Conclusion

GMLP works well enough as it currently is for a dozen regular expressions to define a markup language. The code is not too slow, but mostly that is due to the implementation of the PHP PCRE code.

There are two basic uses for GMLP: a Web application that wants to have simple markup for input data or converting text files in various ways. I use it for both.

The "markup" I chose is extremely simple and was developed to suit my style of writing. If you examine the "source" to the HTML you are viewing (gmlp.txt) you will see that it is mostly just straight text, with a few odd exceptions.

Caveats

Most HTML in the input text will be left as is. Using block-level tags may not turn out well though as they end up between <p> tags. The lines array can directly allow some as in:

    '/^<hr>$/' => '',

and there are block definitions for <ul>, <ol>, <div> and <blockquote>, and any number can be added (although it would be a bit tedious and data intensive).

But the idea is to convert text with as few HTML tags as possible.

HTML comments within a string, like this: (view page source to see) will have the double dashes (‐‐) converted to &mdash;. But comments on a line are spared that by the lines entry:

    '/^<!--.*-->$/' => '',

Since lines skip further processing when matched, that line has its double dashes preserved.

Bugs

The code does not try to "close things". That is, if a start tag (either line or block) does not have an end tag the output is undefined. (Single character codes though, ' ^ * |, are preserved.)

One currently known bug is in the default Characters conversion code (see gmlp_chars()). This text does not compute:

    Sample line--*emphasis clause*--more words.

That results in: Sample line—*emphasis clause*—more words. This is because the entity substitution for –– is &mdash;:

    Sample line&mdash;*emphasis clause*&mdash;more words.

The "bug" is that the character begin word regular expression is [^a-zA-Z=;\/], so the * that directly follows the ; of &mdash; is not considered to be at a word boundary.

By redefining word boundaries this problem goes away.

Or, just use spaces around the --.
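The effect can be seen with the character class alone. This is only a demonstration of what that class matches, not a copy of gmlp_chars(); the function name and the way the class is combined with the * are assumptions made for the example.

```php
<?php
// Hypothetical check (not part of GMLP): is there a * that sits at the
// start of the line or right after a character accepted by the
// begin-word class [^a-zA-Z=;\/] ?
function sketch_starts_word(string $line): bool {
    return preg_match('/(^|[^a-zA-Z=;\/])\*/', $line) === 1;
}

// After entity substitution the * follows a ";", which the class excludes.
var_dump(sketch_starts_word('Sample line&mdash;*emphasis clause*&mdash;more words.'));    // bool(false)

// With spaces around the dashes the * follows a space, which the class accepts.
var_dump(sketch_starts_word('Sample line &mdash; *emphasis clause* &mdash; more words.')); // bool(true)
```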

Notes
  1. The order of conversions is one of the tricky things about stateless conversion as GMLP does it, and was the hardest part to "get right". It also requires the occasional use of named elements (such as &#039;) to avoid some conversions; this is a flaw in the chosen markup data and not in the code.
  2. When I read that I say to myself, "regular expression," rather than "are-eee," so I always write "a RE."
  3. I chose for that <H2>, a personal preference that happens to tie in with other stuff I use this code for.
  4. I should one day write about how I figured out how to figure out just what a word is—it took a while, and in the end, the code does what it does with just one, short RE.
  5. Laziness is the true mother of invention, Norbert Wiener said, and laziness is (mostly) why I wrote this code. First, to reduce the amount of markup required to format text. And second, I was tired of constantly modifying code each time I came up with yet another markup sequence, especially a new block code—it is much easier to maintain, edit and use data files.
  6. Which some might consider a flaw. Data can be written to properly handle this—note that I said data; I would rather not maintain an array of block-level HTML elements and try to deal with it that way. The basic premise of this, though, is to not have to use HTML to format the text. Let us try first to see if the text's own attributes can be used.
  7. Not supporting HTML comments within strings is not different than the HTML "rule" of not nesting some elements, which is in effect, "Please don't do that."

Extended Notes

Block HTML

For handling block HTML elements in the text, I opted to not maintain an array of current block HTML elements, as the point of the code is that it should not know anything about such things. In fact, the code can be considered to be truly dumb as it does nothing by itself to the text beyond the newline conversions mentioned. The data definitions "do" all the work. The result is simple, small loops of dumb code that only appear to be "smart"; it is the data that controls the code. Change the data even slightly and the result will be completely different.

The result is a combination of a few elements handled by block conversion and "Please don't do that." This code is not just about having markup codes in a text file, but also to reduce the amount of markup needed to convert text to HTML.

End

The possibilities are many, and are easy to implement, and that is one reason to code this way. Admittedly, though, this code and some of the definition function code is a bit sloppy in places. (And some of the earlier releases were pretty poorly implemented—but the code is not too embarrassing—except in a few places...)


"A man's got to know his possibilities."