The GMLP Markup Language Processor

Last update: July 23, 2014.

GMLP is a simple set of functions for converting text from a markup language to any other text.

Introduction

GMLP simply runs each input line through a series of preg_replace() calls defined in an external, user-defined array. To help with more complex processing, there are function hooks that can be called for each line, for multiple lines (lists, for example) or for the entire text.

Paragraphs

The basic rules for processing text are that the text is delimited by newlines, that two consecutive newlines mark a paragraph break, and that empty paragraphs are ignored (although some defined rules may depend on blank lines).

Each paragraph is enclosed in <P></P>, and each single newline within a paragraph is converted to a <BR>. For example (with extra spaces for clarity):

    TEXT: Line One
    HTML: <p>Line One</p>
    TEXT: Line One\n\n Line Two
    HTML: <p>Line One</p> <p>Line Two</p>
    TEXT: Line One\n\n A\nB\nC\n\n Line Two
    HTML: <p>Line One</p> <p>A<br>B<br>C</p> <p>Line Two</p>
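
The paragraph rules above can be sketched in a few lines of PHP. This is a minimal illustration and not GMLP's own code; the function name sketch_paragraphs() is hypothetical, and it assumes that any run of two or more newlines separates paragraphs.

```php
<?php
// Hypothetical helper, not part of GMLP: split on blank lines,
// drop empty paragraphs, and turn single newlines into <br>.
function sketch_paragraphs(string $text): string {
    $out = array();
    foreach (preg_split('/\n{2,}/', $text) as $para) {
        $para = trim($para);
        if ($para === '') {
            continue; // empty paragraphs are ignored
        }
        $out[] = '<p>' . str_replace("\n", '<br>', $para) . '</p>';
    }
    return implode("\n", $out);
}

echo sketch_paragraphs("Line One\n\nA\nB\nC\n\nLine Two"), "\n";
```

With no conversion rules defined, this is roughly all that would happen to the text.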

There is one caveat with converting the text to paragraphs: paragraphs should not contain block-level elements. Any markup that creates block-level HTML elements must create them outside of a paragraph, and any block-level elements already in the text must not be enclosed in paragraphs. Here is an example:

    TEXT: Line One\n <ul>\n<li>item</li>\n</ul>\n Line Two
    HTML: <p>Line One</p> <ul><li>item</li></ul> <p>Line Two</p>

GMLP does not know about the difference between inline and block elements, and does not track whether or not it is in a block element. But there are ways to deal with this. One is to define an array of block HTML elements and try to "look for them." Another is to have rules to "handle" them. A third is, "Please don't do that."

With the creation of paragraphs—or not—no other processing of the text occurs by the code. Any other text conversion is only done by the "conversion rules" in the conversion definition data. If no conversion rules are defined, only these paragraphing rules are applied to the text.

There is more on the basic Paragraph Algorithm. One standout thing about this code is that it operates solely line-by-line and does not look forward or back, so multiple-line markup (setext style markup) is not supported (and is not designed to be).

Conversions

There are five types of conversions that can be defined. I call them:

Lines
Inlines
Entities
Character
Block

All only occur according to the data definitions, and in a particular order. There are two special case data definitions, one to "skip a line" from the conversion process, and one to be applied to some of the Block conversions.

Conversions are processed as key/value associative arrays, with the key being replaced by the value, the exceptions being the Character and Block conversions explained below.

What follows is a brief summary of each type and then a detailed look at the mechanisms of each one.

Line Conversions

These convert an entire line directly to a paragraph (enclosed within block-level HTML elements) by a regular expression generally beginning with a start of line meta-character (^). If the value is a function, the result is the function called with the match.

Inlines Conversions

These replace a regular expression key with its value. If the value is a function, the result is the function called with the match. These provide the bulk of the conversions.

Entities Conversions

The name refers to the default rule to convert certain characters to named entities, but these are simply an array of string replacements (and not regular expressions), with keys replaced by their values.

Character Conversions

These replace character pairs with start and end tags, and are defined as single character keys with the value the tag replacements.

Block Conversions

Block conversions have a start line, followed by any number of lines making up the block and then an end line. The start and end lines are regular expressions. Each block definition has a number of conditions that can be "applied" to the block. The defined start/end tags are in an HTML-like format, e.g. <code>...</code>, with each one on its own line; however, they can be any regular expression.

A Look At The Regular Expressions

The regular expressions I came up with for these examples are not based on any existing markup language but are solely my personal preference. The purpose of GMLP is that any markup codes can be designed.

The conversion "rules" are arrays and placed into a PHP file which is then simply included by the code. The arrays are processed in a particular order—conversions can affect other conversions. The order of the types of conversions is controlled by the code and not the data.

The Default Translation Data are viewable, as are translation data for Markdown and Textile.

What follows are details about each conversion array in the order that they are applied to the text.

Skip

The first array is the skip array and this test is performed first for all lines:

    $translate['skip'] = array(
        "/^`(.*)'$/"
    );

This means that lines beginning with a backtick (`) and ending with a single quote (') are excluded from processing, and those characters are discarded from the line. If the regular expression (RE) does not have a sub-pattern, the line is used as-is.

This is a good example of what GMLP does—any regular expression can be used to skip a line. Two backticks? Double single quotes? Whatever. As with all the other conversion arrays there can also be more than one regular expression to meet the condition.

(One thing to note about the skip array is that it is a regular array and not an associative array as are most others.)
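
The skip test can be sketched as follows. The helper sketch_skip() is hypothetical and only illustrates the rule described above: the first matching RE wins, and the first sub-pattern, if there is one, replaces the line.

```php
<?php
// Data shape copied from the skip example above.
$translate['skip'] = array("/^`(.*)'$/");

// Hypothetical helper: returns the text to emit unprocessed,
// or null when no skip rule matches the line.
function sketch_skip(array $rules, string $line): ?string {
    foreach ($rules as $re) {
        if (preg_match($re, $line, $m)) {
            // With a sub-pattern, keep only the captured text;
            // without one, the line is used as-is.
            return isset($m[1]) ? $m[1] : $line;
        }
    }
    return null;
}

var_dump(sketch_skip($translate['skip'], "`*not bold*'"));
var_dump(sketch_skip($translate['skip'], "ordinary line"));
```

A line for which this returns null would carry on to the remaining conversions.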

Lines

The next array is lines and they convert a line and then skip all further conversions. Three examples:

    $translate['lines'] = array(
        '/^<!--.*-->$/' => '',
        '/^h([1-6]):\s*(.*)/' => '<h$1>$2</h$1>',
        '/^[A-Z &\?\'!]+$/' => 'gmlp_convertcase',
    );

With the lines array the value is either a PHP function to be called on the line (or the first sub-pattern) or a RE replacement string, except that an empty value means the line is used as-is. In the last example the function is part of the translation code, and it turns an uppercase line into a header element.
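
A rough sketch of how such a lines array might be applied to one line is shown here. Both sketch_lines() and sketch_convertcase() are hypothetical; the latter merely stands in for gmlp_convertcase(), whose real body is not shown in this text, and its choice of <h2> and title case is an assumption.

```php
<?php
// Hypothetical stand-in for gmlp_convertcase().
function sketch_convertcase(string $line): string {
    return '<h2>' . ucwords(strtolower($line)) . '</h2>';
}

$rules = array(
    '/^<!--.*-->$/'       => '',
    '/^h([1-6]):\s*(.*)/' => '<h$1>$2</h$1>',
    '/^[A-Z &\?\'!]+$/'   => 'sketch_convertcase',
);

// Returns the converted line, or null when no rule matches
// (in which case later conversions would still apply).
function sketch_lines(array $rules, string $line): ?string {
    foreach ($rules as $re => $value) {
        if (!preg_match($re, $line, $m)) {
            continue;
        }
        if ($value === '') {
            return $line; // empty value: use the line as-is
        }
        if (function_exists($value)) {
            // function value: call it on the first sub-pattern, or the line
            return $value(isset($m[1]) ? $m[1] : $line);
        }
        return preg_replace($re, $value, $line);
    }
    return null;
}

echo sketch_lines($rules, 'h2: Title'), "\n";
echo sketch_lines($rules, 'HELLO WORLD'), "\n";
```

The first matching rule decides the result, which is why the order of entries in the array matters.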

The Default Translation Functions are viewable.

Blocks

The code performs block conversions next, but they are more complex and are explained later.

Inlines

The inlines array is next and the value is either a PHP function to be called on the line or a RE replacement string and the code continues. Here is an example to convert `GNU style' markup, PHP functions (like preg_match() and echo('string')) and links:

    $translate['inlines'] = array(
        '/`([^\'].*)\'/U' => '<code>`$1&#039;</code>',
        '/([a-zA-Z_]+\([^\)]*\))/' => 'gmlp_php_string',
        '/(http[s]*:\/\S*)([:;])([a-zA-Z0-9 ]+)/' => '<a href="$1">$3</a>',
    );

The first one uses &#039; to avoid some character translations that can occur later.
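
To see what these three rules do, here is a small sketch that applies them in order. The names sketch_inlines() and sketch_php_string() are hypothetical; the second stands in for gmlp_php_string(), which is not shown here, and passing it the first sub-pattern is an assumption.

```php
<?php
// Hypothetical stand-in for gmlp_php_string().
function sketch_php_string(string $match): string {
    return '<code>' . $match . '</code>';
}

$rules = array(
    '/`([^\'].*)\'/U' => '<code>`$1&#039;</code>',
    '/([a-zA-Z_]+\([^\)]*\))/' => 'sketch_php_string',
    '/(http[s]*:\/\S*)([:;])([a-zA-Z0-9 ]+)/' => '<a href="$1">$3</a>',
);

// Apply every rule in turn; unlike the lines array, processing continues.
function sketch_inlines(array $rules, string $line): string {
    foreach ($rules as $re => $value) {
        if (function_exists($value)) {
            $line = preg_replace_callback($re, function ($m) use ($value) {
                return $value($m[1]);
            }, $line);
        } else {
            $line = preg_replace($re, $value, $line);
        }
    }
    return $line;
}

echo sketch_inlines($rules, "Use `ls' here"), "\n";
echo sketch_inlines($rules, 'See http://example.com/;Example Site.'), "\n";
```

In the link rule, the text after the ; or : separator becomes the link text, and it ends at the first character that is not a letter, digit or space.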

Entities

The entities array holds simple string replacements, so named because they look like this:

    $translate['entities'] = array(
        '--' => '&mdash;'
    );

but, of course, they can be anything. Like, perhaps:

    $translate['entities'] = array(
        '---' => '&#8212;',
        '--' => '&#8211;',
        'xn&#8211;' => 'xn--',
        '...' => '&#8230;',
        '``' => '&#8220;',
        '\'\'' => '&#8221;',
        '(tm)' => '&#8482;',
    );

(Why that third one is backwards... I don't know.) This one is enabled by the definition array line:

    $translate['entitytranslate'] = 1;
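
Because these are plain string replacements, the order of the entries matters. The sketch below assumes the array is applied with str_replace(), which works through the pairs one after another; sketch_entities() is a hypothetical name. Under that assumption, one plausible reading of the "backwards" third entry is that it undoes the en-dash substitution inside an xn-- prefix (as used by internationalized domain names).

```php
<?php
$entities = array(
    '---'       => '&#8212;',
    '--'        => '&#8211;',
    'xn&#8211;' => 'xn--',
);

// Hypothetical helper: str_replace() applies each pair in array order,
// so later pairs see the output of earlier ones.
function sketch_entities(array $entities, string $text): string {
    return str_replace(array_keys($entities), array_values($entities), $text);
}

echo sketch_entities($entities, 'wait---what, 1--2, xn--example'), "\n";
```

Listing '---' before '--' is what keeps three dashes from being consumed as an en dash followed by a stray hyphen.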

Chars

The chars array is last, and these enclose key delimited words with value HTML tags:

    $translate['chars'] = array(
        '*' => 'b',
        '_' => 'u',
        '^' => 'em',
        '\'' => 'code',
    );

These are not just straight replacements though, as they try to figure out just what a word is:

Beginning of sentence: "Bold word."
Trailing punctuation: "Italic word."
Preserve inside chars: "$identifier_with_underscores;"
And intermixed: "This is bold with underline."
And not within HTML: font family Arial.

Word boundaries can be defined.
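
The idea of "figuring out what a word is" can be illustrated with a single RE per character. This is a sketch and not the RE that gmlp_chars() uses: sketch_chars() and its pattern are hypothetical, and only the lead class (the characters that may not come before an opening delimiter) is taken from the Bugs section of this text.

```php
<?php
$chars = array('*' => 'b', '_' => 'u');

// Hypothetical helper. $lead lists characters that may NOT precede an
// opening delimiter; the closing delimiter may not be followed by a
// word character.
function sketch_chars(array $chars, string $line, string $lead = 'a-zA-Z=;\/'): string {
    foreach ($chars as $c => $tag) {
        $q = preg_quote($c, '/');
        $re = '/(^|[^' . $lead . '])' . $q . '(\w[^' . $q . ']*)' . $q . '(?!\w)/';
        $line = preg_replace($re, '$1<' . $tag . '>$2</' . $tag . '>', $line);
    }
    return $line;
}

echo sketch_chars($chars, '*Bold* word.'), "\n";
echo sketch_chars($chars, '$identifier_with_underscores;'), "\n";
echo sketch_chars($chars, 'This is *bold with _underline_*.'), "\n";
```

Underscores inside an identifier are left alone because each one is preceded by a letter, which the lead class excludes.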

Blocks

A block definition is an associative array defined with the following:

"begin": the RE to start the block
"end": the RE to end the block
"pre": a string to prepend to the block
"post": a string to append to the block or a function to call on the entire block
"replace": a string to append to each line or a function to call on each line
"first": a boolean, if true to use the begin line in the block
"last": a boolean, if true to use the end line in the block
"continue": a boolean to retain the end line (for the following paragraph)
"newline": a boolean to retain all newlines in the block

Of these, only "begin" and "end" are required; all others are optional and default to FALSE (and if no others are defined, nothing is done to the block).

Here is an example:

    $translate['blocks']['html'] = array(
        'begin' => '/^<html>/',
        'end' => '/^<\/html>/',
        'replace' => 'htmlentities',
        'pre' => '<pre>',
        'post' => '</pre>'
    );

The html block starts with <html>, ends with </html>, runs each line through the PHP function htmlentities(), prefixes the block with <pre> and postfixes the block with </pre>. Which is, this:

<html>
HTML <b>example</b> of bold.
</html>

will result in this:

HTML <b>example</b> of bold.
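
A much simplified version of a block handler, covering only the keys used in the html definition above, might look like this. sketch_block() is hypothetical and is not GMLP's block convert function; it treats "replace" as a function only and silently drops a block that is never closed.

```php
<?php
$block = array(
    'begin'   => '/^<html>/',
    'end'     => '/^<\/html>/',
    'replace' => 'htmlentities',
    'pre'     => '<pre>',
    'post'    => '</pre>',
);

// Hypothetical handler: collect lines between begin and end, run each
// through "replace", then wrap the result in "pre" and "post".
function sketch_block(array $def, array $lines): string {
    $out = array();
    $body = null; // null means "not inside a block"
    foreach ($lines as $line) {
        if ($body === null) {
            if (preg_match($def['begin'], $line)) {
                $body = array();
            } else {
                $out[] = $line;
            }
        } elseif (preg_match($def['end'], $line)) {
            $out[] = ($def['pre'] ?? '') . implode("\n", $body) . ($def['post'] ?? '');
            $body = null;
        } else {
            $fn = $def['replace'] ?? null;
            $body[] = $fn ? $fn($line) : $line;
        }
    }
    return implode("\n", $out);
}

echo sketch_block($block, array('<html>', 'HTML <b>example</b> of bold.', '</html>')), "\n";
```

The begin and end lines themselves are discarded here, which matches "first" and "last" defaulting to FALSE.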

Another example is:

    $translate['blocks']['php'] = array(
        'begin' => '/^\s*<\?php/',
        'end' => '/^\s*\?>/',
        'post' => 'gmlp_highlightstr',
        'first' => 1,
        'last' => 1
    );

where post is a function to call on the entire block (note that post has a dual use, which reduces data complexity), and first and last mean that the begin and end lines are kept in the block.
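
The dual use of post, together with first and last, can be sketched like this. The names sketch_php_block() and sketch_mark() are hypothetical; sketch_mark() stands in for gmlp_highlightstr() so that the result does not depend on the exact markup highlight_string() produces.

```php
<?php
// Hypothetical stand-in for a "post" function such as gmlp_highlightstr().
function sketch_mark(string $s): string {
    return '[' . $s . ']';
}

$def = array(
    'begin' => '/^\s*<\?php/',
    'end'   => '/^\s*\?>/',
    'post'  => 'sketch_mark',
    'first' => 1,
    'last'  => 1,
);

// Hypothetical handler for a single block: keeps the begin and end lines
// when "first" and "last" are set, then applies "post".
function sketch_php_block(array $def, array $lines): string {
    $body = array();
    $inside = false;
    foreach ($lines as $line) {
        if (!$inside && preg_match($def['begin'], $line)) {
            $inside = true;
            if (!empty($def['first'])) {
                $body[] = $line;
            }
        } elseif ($inside && preg_match($def['end'], $line)) {
            if (!empty($def['last'])) {
                $body[] = $line;
            }
            break;
        } elseif ($inside) {
            $body[] = $line;
        }
    }
    $text = implode("\n", $body);
    $post = $def['post'] ?? '';
    // A function name is called on the whole block; any other string is appended.
    return function_exists($post) ? $post($text) : $text . $post;
}

echo sketch_php_block($def, array('<?php', 'echo 1;', '?>')), "\n";
```

Testing whether the post value names an existing function is one simple way to tell the two uses apart; whether GMLP does it this way is not stated here.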

The function gmlp_highlightstr() is defined in the data definition file and is:

    <?php
    function gmlp_highlightstr($s) {
        return highlight_string($s, TRUE);
    }
    ?>

There is also a way to mark up PHP code without the <?php ?> tags, by using <php></php>:

    function gmlp_highlightstr($s) {
        return highlight_string($s, TRUE);
    }

There is a similar block definition that encloses highlighted PHP in a styled <DIV> to do this:

    PHP code
    function gmlp_highlightstr($s) {
        return highlight_string($s, TRUE);
    }

The block convert function is not a pretty one. It is also a bit confusing. And it is a bit too long. And, well, you can See the Code for yourself.

Conclusion

GMLP works well enough as it currently is for a dozen regular expressions to define a markup language. The code is certainly not slow, though that is mostly due to the implementation of the PHP PCRE code.

There are two basic uses for GMLP: a Web application that wants to have simple markup for input data or converting text files to HTML. I use it for both.

The "markup" codes I chose are extremely simple and were developed to suit my style of writing. If you examine the "source" to the HTML you are viewing (gmlp.txt) you will see that it is mostly just straight text, with a few exceptions, in a format that is familiar.

Caveats

Any HTML will be left as is. Using block-level tags may not turn out well though as they end up between <P> tags. The lines array can directly allow some as in:

    '/^<hr>$/' => '',

and there are block definitions for <UL>, <OL>, <DIV> and <BLOCKQUOTE>, and any number can be added (although it would be a bit tedious and data intensive).

But the idea is to convert text with as few markup codes as possible.

HTML comments within a string, like this: (view page source to see) will have the double dashes (‐‐) converted to &mdash;. But comments on a line are spared that by the lines entry:

    '/^<!--.*-->$/' => '',

Since lines skip further processing when matched, that line has its double dashes preserved.

Bugs

The code does not try to "close things". That is, if a start tag (either line or block) does not have an end tag the output is undefined. (Single character codes though, ' ^ * |, are preserved.)

One currently known bug is in the default Characters conversion code (see gmlp_chars()). This text does not convert correctly:

    Sample line--*emphasis clause*--more words.

That results in: Sample line—*emphasis clause*—more words. This is because the entity substitution for –– is &mdash;:

    Sample line&mdash;*emphasis clause*&mdash;more words.

The "bug" is that the character RE lead is [^a-zA-Z=;\/], so the ; that ends &mdash; is not accepted before the *, and ;* is not considered a word boundary.
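
The effect of the lead class can be reproduced in isolation. In this sketch only the lead class [^a-zA-Z=;\/] comes from the text; sketch_bold() and the rest of its pattern are hypothetical and are not the RE used by gmlp_chars().

```php
<?php
// Hypothetical one-rule version of the chars conversion.
// $lead lists characters that may NOT precede the opening "*".
function sketch_bold(string $line, string $lead): string {
    $re = '/(^|[^' . $lead . '])\*(\w[^*]*)\*(?!\w)/';
    return preg_replace($re, '$1<b>$2</b>', $line);
}

$line = 'Sample line&mdash;*emphasis clause*&mdash;more words.';

// Lead class from the text: ";" blocks the match, so nothing changes.
echo sketch_bold($line, 'a-zA-Z=;\/'), "\n";

// Same class without ";": the "*" after "&mdash;" now opens a word.
echo sketch_bold($line, 'a-zA-Z=\/'), "\n";
```

Dropping ; from the class is only one possible redefinition, and it may let through cases the original class was meant to exclude.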

By redefining word boundaries this problem goes away.

Notes

  1. These are not part of the code but part of the translation data, which will be clear by viewing the lines array. These work along with some CSS.
  2. The order of conversions is one of the tricky things about stateless conversion as GMLP does it, and was the hardest part to "get right". It also requires the occasional use of character entities (such as &#039;) to avoid some conversions; this is a flaw in the chosen markup data and not in the code.
  3. It would be very interesting to write a version of the code that would apply the data in the order it was defined. That might even result in smaller code.
  4. When I read that I say to myself, "regular expression," rather than "are—eee," so I always write "a RE."
  5. I chose for that <H2>, a personal preference that happens to tie in with other stuff I use this code for.
  6. This is not unlike having to use &amp;#039; to display &#039;.
  7. I should one day write about how I figured out how to figure out just what a word is. It took a while, and in the end the code does what it does with just one short RE.
  8. Laziness is the true mother of invention, Norbert Wiener said, and laziness is (mostly) why I wrote this code. First, to reduce the amount of markup required to format text. And second, I was tired of constantly modifying code each time I came up with yet another markup sequence, especially a new block code—it is much easier to maintain, edit and use data files.
  9. Which might be considered a flaw. Code can be written to properly handle this—note that I said code; I would rather not maintain an array of block-level HTML elements and try to deal with it that way. The basic premise of this, though, is to not have to use HTML to format the text. Let us try first to see if the text's own attributes can be used.
  10. Not supporting HTML comments within strings is not different than the HTML "rule" of not nesting some elements, which is in effect, "Please don't do that."

Extended Notes

Block HTML

For handling block HTML elements in the text, I opted to not maintain an array of current block HTML elements, as the point of the code is that it should not know anything about such things. In fact, the code can be considered to be truly dumb as it does nothing by itself to the text beyond the newline conversions mentioned. The data definitions "do" all the work. And the result is simple, small loops of dumb code that only appear to be "smart"; it is the data that controls the code. Change the data even slightly and the result will be completely different.

The result is a combination of a few elements handled by block conversion and "Please don't do that." This code is not just about having markup codes in a text file, but also to reduce the amount of markup needed to convert text to HTML.

End

The possibilities are many, and are easy to implement, and that is one reason to code this way. (The code is less than 500 lines.)


"A man's got to know his possibilities."