The Gmlp Markup Language Processor

Last code update: March 9, 2014.
Last documentation update: March 9, 2014.

GMLP converts text in one markup to text in another markup (although only conversion to HTML is 100% complete; the next version will fix that).

March 9: The bug mentioned the other day has been put back! Well, it was not actually a bug but an inadequately designed function. That function has merit; it just needs some tweaking. What I did was make it optional, and doing so actually enhanced the ability and flexibility of the code. The documentation is in the code. (I also fixed the real bug related to CRs in the text.)

The previous archive might have had the file permissions wrong. This has been fixed. I apologize for that as it was a stupid mistake.

Introduction

Some basic rules for processing text are that the text is delimited by newlines, with two consecutive newlines indicating a paragraph. The input text is always at least one paragraph, consisting of at least one line, consisting of at least one word, consisting of at least one character. (Empty paragraphs are ignored.)

This document has many footnotes. There is also an Extended Notes section that starts with more about that last paragraph.

To convert text to HTML something must be done to mark paragraphs. There are two ways to do this. One way is to append a <BR> to each line, with the result of two newlines being two <BR>s to separate paragraphs. A more flexible way (a <BR> element cannot have margins or padding) is to enclose each paragraph in <P></P> and convert each single newline in a paragraph to a <BR>. For example (with extra spaces for clarity):

    TEXT: Line One
    HTML: <p>Line One</p>
    TEXT: Line One\n\n Line Two
    HTML: <p>Line One</p> <p>Line Two</p>
    TEXT: Line One\n\n A\nB\nC\n\n Line Two
    HTML: <p>Line One</p> <p>A<br>B<br>C</p> <p>Line Two</p>
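
As a rough illustration of the paragraph rules above, the following PHP sketch splits text on blank lines, wraps each paragraph in <p></p> and turns the remaining single newlines into <br>. The function name sketch_paragraphs() is made up for this example and is not part of GMLP; the real Paragraph Algorithm may well differ.

```php
<?php
// Hypothetical helper, not GMLP's own code: a minimal paragraphing pass.
function sketch_paragraphs(string $text): string {
    // Normalize CR and CRLF line endings to plain newlines.
    $text = str_replace(array("\r\n", "\r"), "\n", $text);
    $out = array();
    // Two or more consecutive newlines separate paragraphs;
    // leading and trailing newlines are ignored.
    foreach (preg_split('/\n{2,}/', trim($text, "\n")) as $para) {
        if (trim($para) === '') {
            continue; // empty paragraphs are ignored
        }
        // A single newline inside a paragraph becomes a <br>.
        $out[] = '<p>' . str_replace("\n", '<br>', $para) . '</p>';
    }
    return implode("\n", $out);
}
```

Calling sketch_paragraphs("Line One\n\nA\nB\nC\n\nLine Two") yields the three paragraphs of the last example, one per output line.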

There is one caveat with converting the text to paragraphs, in that they should not contain block-level elements. This can be accounted for in two ways. First, if any markup creates block-level HTML elements they must be created outside of a paragraph. Second, if the text contains any block-level elements they must not be enclosed in paragraphs. Here is an example:

    TEXT: Line One\n <ul>\n<li>item</li>\n</ul>\n Line Two
    HTML: <p>Line One</p> <ul><li>item</li></ul> <p>Line Two</p>

Any HTML in the text will be left alone, but GMLP is stateless and does not track whether or not it is in a block element; therefore, block elements will be wrapped in paragraphs. There are two ways to deal with this. One is to have rules to "handle" block elements. The other is, "Please don't do that."

The output of <P>s and <BR>s is going to be configurable/changeable. There is more on the Next Version.

With the creation of paragraphs—or not—no other processing of the text occurs by the code. Any other text conversion is only done by the "conversion rules" in the conversion definition data. If no conversion rules are defined, only these paragraphing rules are applied to the text.

There is more on the basic Paragraph Algorithm.

This text and the other (and the source files—the actual source files, the code can display itself) are translated when viewed. Obviously, for turning documentation or source files into HTML, the output would be saved as a file. But this is a demonstration of the code, and as I have been writing this text I have been able to tweak the code.

Conversions

There are five types of conversions that can be defined. I call them:

Lines
Inlines
Entities
Character
Block

All of them occur only according to the data definitions, and in a particular order. There are two special-case data definitions: one to "skip a line" from the conversion process, and one to be applied to some of the Block conversions.

Conversions are processed as key/value associative arrays, with the key being replaced by the value, the exceptions being the Character and Block conversions explained below.

What follows is a brief summary of each type and then a detailed look at the mechanisms of each one.

Line Conversions

These convert an entire line directly to a paragraph (enclosed within block-level HTML elements) by a regular expression generally beginning with a start of line meta-character (^). If the value is a function, the result is the function called with the match.

Inlines Conversions

These replace a regular expression key with its value. If the value is a function, the result is the function called with the match. These provide the bulk of the conversions.

Entities Conversions

The name refers to the default rule to convert certain characters to named entities, but these are simply an array of string replacements (and not regular expressions), with keys replaced by their values.

Character Conversions

These replace character pairs with start and end tags, and are defined as single character keys with the value the tag replacements.

Block Conversions

Block conversions have a start line, followed by any number of lines making up the block and then an end line. The start and end lines are regular expressions. Each block definition has a number of conditions that can be "applied" to the block. The defined start/end tags are in an HTML-like format, e.g. <code>...</code>, with each one on its own line; however, they can be any regular expression.

HTML Tagged Blocks

The Lines data has a way to mark the beginning and end of one or more paragraphs with Block-like tags that needs special mention: tags such as <message>...</message>, with each one on its own line. These tags, if they exist in a line in the Lines data, will be replaced with classed <DIV>s like: <div class="message">...</div>.
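
A minimal sketch of how such tag lines could be turned into classed <DIV>s with a pair of regular expressions. The tag names (message, note), the $tagged array and the function sketch_tagged_line() are assumptions made for this example; the actual rules live in the default Lines data and may look different.

```php
<?php
// Hypothetical rules: an opening tag line becomes a classed <div>,
// the matching closing tag line becomes </div>.
$tagged = array(
    '/^<(message|note)>$/'   => '<div class="$1">',
    '/^<\/(message|note)>$/' => '</div>',
);

// Apply the tag rules to a single line; other lines pass through unchanged.
function sketch_tagged_line(string $line, array $rules): string {
    return preg_replace(array_keys($rules), array_values($rules), $line);
}

echo sketch_tagged_line('<message>', $tagged), "\n";
echo sketch_tagged_line('</message>', $tagged), "\n";
```

Listing the tag names explicitly keeps ordinary HTML such as <ul> from being rewritten by accident.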

A Look At The Regular Expressions

The regular expressions I came up with for these examples are not based on any existing markup language but are solely my personal preference. The purpose of GMLP is that any markup codes can be designed.

In fact, there is more on defining rules for Existing Markup Languages such as Markdown and Textile. (Which will make more sense after reading the following descriptions.)

The conversion "rules" are arrays placed in a PHP file, which is then simply included by the code. The arrays are processed in a particular order, because conversions can affect other conversions. The order of the types of conversions is controlled by the code and not the data.

The Default Translation Data are viewable.

What follows are details about each conversion array in the order that they are applied to the text.

Skip

The first array is the skip array and this test is performed first for all lines:

    $translate['skip'] = array(
        "/^`(.*)'$/"
    );

Which means that lines beginning with a backtick (`) and ending with a single quote (') are excluded from processing, and those characters are discarded from the line. If the regular expression (RE) does not have a subpattern, the line is used as-is.
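
The skip test described above can be sketched roughly as follows. The function sketch_skip() is hypothetical and only mirrors the behavior stated in the text: a matching line is returned without further conversion, using the first subpattern when the RE has one.

```php
<?php
// Hypothetical helper, not GMLP's own code.
// Returns the text to output if the line should be skipped,
// or null if the line should go on to the other conversions.
function sketch_skip(string $line, array $patterns): ?string {
    foreach ($patterns as $re) {
        if (preg_match($re, $line, $m)) {
            // With a subpattern the delimiters are discarded;
            // without one the line is used as-is.
            return isset($m[1]) ? $m[1] : $line;
        }
    }
    return null;
}

$skip = array("/^`(.*)'$/");
var_dump(sketch_skip("`*not bold*'", $skip));
var_dump(sketch_skip("an ordinary line", $skip));
```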

This is a good example of what GMLP does—any regular expression can be used to skip a line. Two backticks? Double single quotes? Whatever. As with all the other conversion arrays there can also be more than one regular expression to meet the condition.

(One thing to note about the skip array is that it is a regular array and not an associative array as are most others.)

Lines

The next array is lines and they convert a line and then skip all further conversions. Three examples:

    $translate['lines'] = array(
        '/^<!--.*-->$/' => '',
        '/^h([1-6]):\s*(.*)/' => '<h$1>$2</h$1>',
        '/^[A-Z &\?\'!]+$/' => 'gmlp_convertcase',
    );

With the lines array the value is either a PHP function to be called on the line (or the first subpattern) or a RE replacement string; with the exception of an empty value meaning use the line as-is. For the last example the function is part of the translation code and it turns an uppercase line into a header element.
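
A minimal sketch of how one line could be run through such an array, assuming the three value types just described. Both sketch_apply_lines() and sketch_convertcase() are hypothetical stand-ins written for this example; the latter only imitates what gmlp_convertcase is said to do and is not the real function.

```php
<?php
// Hypothetical stand-in for gmlp_convertcase: uppercase line to capital-case header.
function sketch_convertcase(string $line): string {
    return '<h2>' . ucwords(strtolower($line)) . '</h2>';
}

// Returns the converted line, or null when no lines rule matched.
function sketch_apply_lines(string $line, array $rules): ?string {
    foreach ($rules as $re => $value) {
        if (!preg_match($re, $line, $m)) {
            continue;
        }
        if ($value === '') {
            return $line; // empty value: use the line as-is
        }
        if (function_exists($value)) {
            // Function value: call it on the first subpattern, or on the line.
            return call_user_func($value, isset($m[1]) ? $m[1] : $line);
        }
        return preg_replace($re, $value, $line); // RE replacement string
    }
    return null;
}

$lines = array(
    '/^<!--.*-->$/' => '',
    '/^h([1-6]):\s*(.*)/' => '<h$1>$2</h$1>',
    '/^[A-Z &\?\'!]+$/' => 'sketch_convertcase',
);
echo sketch_apply_lines('h3: Sub Heading', $lines), "\n";
echo sketch_apply_lines('A LOOK AT THIS', $lines), "\n";
```

A non-null result would be emitted directly, which is what "skip all further conversions" amounts to in this sketch.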

The Default Translation Functions are viewable.

Blocks

Block conversions occur next by the code but are more complex and are explained later.

Inlines

The inlines array is next and the value is either a PHP function to be called on the line or a RE replacement string and the code continues. Here is an example to convert `GNU style' markup, PHP functions (like preg_match() and echo('string')) and links:

    $translate['inlines'] = array(
        '/`([^\'].*)\'/U' => '<code>`$1&#039;</code>',
        '/([a-zA-Z_]+\([^\)]*\))/' => 'gmlp_php_string',
        '/(http[s]*:\/\S*)([:;])([a-zA-Z0-9 ]+)/' => '<a href="$1">$3</a>',
    );

The first one uses &#039; to avoid some character translations that can occur later.
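
Here is a rough sketch of an inlines pass, assuming each rule is applied in turn to the whole line. The names sketch_apply_inlines() and sketch_php_string() are invented for this example; in particular sketch_php_string() is only a guess at the kind of thing gmlp_php_string does.

```php
<?php
// Hypothetical stand-in for gmlp_php_string: wrap the match in <code>.
function sketch_php_string(string $s): string {
    return '<code>' . $s . '</code>';
}

// Apply every inlines rule to the line, in array order.
function sketch_apply_inlines(string $line, array $rules): string {
    foreach ($rules as $re => $value) {
        if (function_exists($value)) {
            // Function value: call it with each match.
            $line = preg_replace_callback($re, function ($m) use ($value) {
                return call_user_func($value, $m[0]);
            }, $line);
        } else {
            // Otherwise the value is an RE replacement string.
            $line = preg_replace($re, $value, $line);
        }
    }
    return $line;
}

$inlines = array(
    '/`([^\'].*)\'/U' => '<code>`$1&#039;</code>',
    '/([a-zA-Z_]+\([^\)]*\))/' => 'sketch_php_string',
    '/(http[s]*:\/\S*)([:;])([a-zA-Z0-9 ]+)/' => '<a href="$1">$3</a>',
);
echo sketch_apply_inlines("use `ls' here", $inlines), "\n";
echo sketch_apply_inlines('see http://example.com;Example site', $inlines), "\n";
```

Because each rule sees the output of the rules before it, the order of the array entries matters.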

Entities

The entities array holds simple string replacements, so named because they look like:

    $translate['entities'] = array(
        '--' => '&mdash;'
    );

but, of course, they can be anything. Like, perhaps:

    $translate['entities'] = array(
        '---' => '&#8212;',
        '--' => '&#8211;',
        'xn&#8211;' => 'xn--',
        '...' => '&#8230;',
        '``' => '&#8220;',
        '\'\'' => '&#8221;',
        '(tm)' => '&#8482;',
    );

(Why that third one is backwards... I don't know.) This one is enabled by the definition array line:

    $translate['entitytranslate'] = 1;
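
A small sketch of how such an array could be applied, assuming the replacements run one after another in array order (which is how PHP's str_replace() treats array arguments). The function sketch_entities() is hypothetical. Under that assumption the order of the entries matters: '---' has to come before '--', and the 'xn&#8211;' entry restores an 'xn--' prefix after the '--' rule has already rewritten it.

```php
<?php
// Hypothetical helper, not GMLP's own code: plain string replacements in array order.
function sketch_entities(string $line, array $entities): string {
    return str_replace(array_keys($entities), array_values($entities), $line);
}

$entities = array(
    '---' => '&#8212;',
    '--' => '&#8211;',
    'xn&#8211;' => 'xn--',
    '...' => '&#8230;',
    '(tm)' => '&#8482;',
);
echo sketch_entities('a -- b --- c', $entities), "\n";
echo sketch_entities('xn--abc', $entities), "\n";
```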

Chars

The chars array is last; it wraps words delimited by a key character in the tag named by the value:

    $translate['chars'] = array(
        '*' => 'b',
        '_' => 'u',
        '^' => 'em',
        '\'' => 'code',
    );

These are not just straight replacements though, as they try to figure out just what a word is:

Beginning of sentence: "Bold word."
Trailing punctuation: "Italic word."
Preserve inside chars: "$identifier_with_underscores;"
And intermixed: "This is bold with underline."
And not within HTML: font family Arial Narrow.

March 8: I can do some boasting about this one as it does what it does with a single, small RE (for each character pair). It's really cool.
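
The actual RE is in the code. Purely as an illustration of the idea, here is one way a character pair could be matched with a single RE: the opening character must not follow a word character or "<", and the closing character must not be followed by a word character or ">". The function sketch_chars() and its RE are assumptions of this example, not the RE GMLP uses.

```php
<?php
// Hypothetical helper, not GMLP's own RE: wrap delimited words in tags.
function sketch_chars(string $line, array $chars): string {
    foreach ($chars as $c => $tag) {
        $q = preg_quote($c, '/');
        // Opening delimiter: not preceded by a word character or "<".
        // Content: starts and ends with a non-space character.
        // Closing delimiter: not followed by a word character or ">".
        $re = '/(?<![\w<])' . $q . '(\S(?:.*?\S)?)' . $q . '(?![\w>])/';
        $line = preg_replace($re, "<$tag>\$1</$tag>", $line);
    }
    return $line;
}

$chars = array('*' => 'b', '_' => 'u', '^' => 'em');
echo sketch_chars('*Bold* word.', $chars), "\n";
echo sketch_chars('$identifier_with_underscores;', $chars), "\n";
```

The lookbehind is what keeps the underscores inside an identifier from being treated as markup.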

Blocks

A block definition is an associative array defined with the following:

"begin"
    the RE to start the block
"end"
    the RE to end the block
"pre"
    a string to prepend to the block
"post"
    a string to append to the block or a function to call on the entire block
"replace"
    a string to append to each line or a function to call on each line
"first"
    a boolean if true to use the begin line in the block
"last"
    a boolean if true to use the end line in the block
"continue"
    a boolean to retain end line (for the following paragraph)
"newline"
    a boolean to retain all newlines in the block

Of these, only "begin" and "end" are required; all others are optional and default to FALSE (and if no others are defined, nothing will be done to the block).

Here is an example:

    $translate['blocks']['html'] = array(
        'begin' => '/^<html>/',
        'end' => '/^<\/html>/',
        'replace' => 'htmlentities',
        'pre' => '<pre>',
        'post' => '</pre>'
    );

The html block starts with <html>, ends with </html>, runs each line through the PHP function htmlentities(), prefixes the block with <pre> and postfixes the block with </pre>. Which is, this:

<html>
HTML <b>example</b> of bold.
</html>

will result in this:

HTML <b>example</b> of bold.

Another example is:

    $translate['blocks']['php'] = array(
        'begin' => '/^\s*<\?php/',
        'end' => '/^\s*\?>/',
        'post' => 'gmlp_highlightstr',
        'first' => 1,
        'last' => 1
    );

where post is a function to call on the entire block (note how post has a dual use which is done to reduce data complexity), and first and last means to keep the begin and end lines in the block.

The function gmlp_highlightstr() is defined in the data definition file and is:

<?php

function gmlp_highlightstr($s) {
    return highlight_string($s,TRUE);
}

?>

There is a similar block definition that encloses highlighted PHP in a styled <DIV> to do this:

PHP code
function gmlp_highlightstr($s) {
    return highlight_string($s,TRUE);
}
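
To make the block fields more concrete, here is a much simplified sketch of a block pass for a single definition. It is not the real block convert function: sketch_block() and sketch_finish_block() are hypothetical names, and only "begin", "end", "pre", "post", "replace", "first" and "last" are handled ("continue" and "newline" are left out).

```php
<?php
// Hypothetical, simplified block pass for one block definition.
function sketch_block(array $lines, array $def): string {
    // Optional fields default to "do nothing".
    $def += array('pre' => '', 'post' => '', 'replace' => '',
                  'first' => false, 'last' => false);
    $out = array();
    $block = null; // null while outside a block, array of lines while inside
    foreach ($lines as $line) {
        if ($block === null) {
            if (preg_match($def['begin'], $line)) {
                $block = $def['first'] ? array($line) : array();
            } else {
                $out[] = $line;
            }
            continue;
        }
        $ended = preg_match($def['end'], $line) === 1;
        if (!$ended || $def['last']) {
            $block[] = $line;
        }
        if ($ended) {
            $out[] = sketch_finish_block($block, $def);
            $block = null;
        }
    }
    // A block that never ends is dropped here; the text calls that case undefined.
    return implode("\n", $out);
}

function sketch_finish_block(array $block, array $def): string {
    if ($def['replace'] !== '') {
        foreach ($block as $i => $line) {
            // "replace" is a function to call on each line, or a string to append.
            $block[$i] = function_exists($def['replace'])
                ? call_user_func($def['replace'], $line)
                : $line . $def['replace'];
        }
    }
    $text = implode("\n", $block);
    // "post" has a dual use: a function for the whole block, or a string to append.
    if ($def['post'] !== '' && function_exists($def['post'])) {
        return $def['pre'] . call_user_func($def['post'], $text);
    }
    return $def['pre'] . $text . $def['post'];
}

$html = array('begin' => '/^<html>/', 'end' => '/^<\/html>/',
              'replace' => 'htmlentities', 'pre' => '<pre>', 'post' => '</pre>');
echo sketch_block(array('<html>', 'HTML <b>example</b> of bold.', '</html>'), $html), "\n";
```

The defaults are merged with the += array union, so a definition only has to list the fields it actually uses.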

The block convert function is not a pretty one. It is also a bit confusing. And it is a bit too long. And, well, you can See the Code for yourself.

Special Block Conversions

March 9: This function is the function that does not work as well as it should, so the block code now has a new function handler to call an external function on a block—it's pretty cool.

This version of the code includes a new way to convert text with leading spaces. Previously, such text was simply a block conversion. Now, the text is "looked at" to see if it might be PHP code or plain text by a simple test. (It's the "looking" that is not good enough.)

Conclusion

GMLP works well enough as it currently is for a dozen regular expressions to define a markup language. The code is certainly not slow, but mostly that is due to the implementation of the PHP PCRE code.

There are two basic uses for GMLP: a Web application that wants to have simple markup for input data or converting text files to HTML. I use it for both.

The "markup" I chose is extremely simple and was developed to suit my style of writing. If you examine the "source" to the HTML you are viewing (gmlp.txt) you will see that it is mostly just straight text, with a few exceptions, in a format that is familiar.

Caveats

Any HTML will be left as is. Using block-level tags may not turn out well though as they end up between <P> tags. The lines array can directly allow some as in:

    '/^<hr>$/' => '',

and there are block definitions for <UL>, <OL>, <DIV> and <BLOCKQUOTE>, and a few more can be added without a performance hit.

It would be inefficient to parse many more block elements this way. The idea is to convert text with as few markup codes as possible.

HTML comments within a string, like this: (view page source to see) will have the double dashes (‐‐) converted to &mdash;. But comments on a line are spared that by the lines entry:

    '/^<!--.*-->$/' => '',

Since lines skip further processing when matched, that line has its double dashes preserved.

Known Bugs

The code does not try to "close things". That is, if a start tag (either line or block) does not have an end tag the output is undefined. (Single character codes though, ' ^ * |, are preserved.)

Since the code knows nothing about HTML, and currently wraps paragraphs within <P>, block HTML elements will get wrapped as well unless there is a rule of some kind to handle them. This is how the main code is designed to work. All conversions occur solely on rule definitions.

Notes
  1. As of this version there is more documentation than code.
  2. Note here the subtle difference between a paragraph and a line: paragraphs are separated by two newlines, lines by one newline. The first is an invention of mine, the second of computing. In addition, a single paragraph can consist of a single line. Leading, trailing and multiple (more than two) newlines are ignored.
  3. These are not part of the code but part of the translation data which will be clear by viewing the lines array. These work along with some CSS.
  4. The order of conversions is one of the tricky things about stateless conversion as GMLP does, and was the hardest part to "get right". It also requires the occasional use of numeric entities (such as &#039;) to avoid some conversions; this is a flaw in the chosen markup data and not in the code.
  5. It would be very interesting to write a version of the code that would apply the data in the order it was defined. That might even result in smaller code.
  6. When I read that I say to myself, "regular expression," rather than "are—eee," so I always write "a RE."
  7. I chose for that <H2>, a personal preference that happens to tie in with other stuff I use this code for.
  8. This is not unlike having to use &amp;#039; to display &#039;.
  9. I should one day write about how I figured out how to figure out just what a word is. It took a while, and in the end, the code does what it does with just one, short RE.
  10. Laziness is the true mother of invention, Norbert Wiener said, and laziness is (mostly) why I wrote this code. First, to reduce the amount of markup required to format text. And second, I was tired of constantly modifying code each time I came up with yet another markup sequence, especially a new block code—it is much easier to maintain, edit and use data files.
  11. Which might be considered a flaw. Code can be written to properly handle this—note that I said code; I would rather not maintain an array of block-level HTML elements and try to deal with it that way. The basic premise of this, though, is to not have to use HTML to format the text. Let us try first to see if the text's own attributes can be used.
  12. Not supporting HTML comments within strings is not different than the HTML "rule" of not nesting some elements, which is in effect, "Please don't do that."

Extended Notes

Block HTML

For handling block HTML elements in the text, I opted to not maintain an array of current block HTML elements, as the point of the code is that it should not know anything about such things. In fact, the code can be considered to be truly dumb as it does nothing by itself to the text beyond the newline conversions mentioned. The data definitions "do" all the work. And the result is simple, small loops of dumb code that only appear to be "smart"; it is the data that controls the code. Change the data even slightly and the result will be completely different.

The result is a combination of a few elements handled by block conversion and "Please don't do that." This code is not just about having markup codes in a data file, but also to reduce the amount of markup needed to convert text to HTML.

File Names

Here is an interesting rule. I first started writing file names in uppercase to make them stand out for the reader. When I came up with this code I did this:

    '/([A-Z_\/]+\.(INI|TXT|HTML|HTM|PHP)+)/' => '<code>$1</code>',

But, since grammar has rules of its own I can start using lowercase file names and use:

    '/([a-z_\/]+\.(ini|txt|html|htm|php)+)/' => '<code>$1</code>',

as periods, outside of file names, do not occur directly in front of a letter. But, since many documents have uppercase file names like that (like in files such as README and INSTALL and other documentation) my convention still exists (and it has become a habit of sorts).

Or, there can be a function applied like:

    function gmlp_clower($s) {
        return '<code>'.strtolower($s).'</code>';
    }

Or, strtolower() can be applied directly and the output could be used to convert the file. Multiple definition files can be created to do any of these things.
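
As a quick illustration, the uppercase file name rule can be combined with gmlp_clower() like this. The wrapper sketch_filenames() is a hypothetical name used only for this example; it applies just that one rule to a line.

```php
<?php
// gmlp_clower() as given above.
function gmlp_clower($s) {
    return '<code>' . strtolower($s) . '</code>';
}

// Hypothetical wrapper: apply only the uppercase file name rule to a line.
function sketch_filenames(string $line): string {
    return preg_replace_callback(
        '/([A-Z_\/]+\.(INI|TXT|HTML|HTM|PHP)+)/',
        function ($m) {
            return gmlp_clower($m[1]);
        },
        $line
    );
}

echo sketch_filenames('See README.TXT first.'), "\n";
```

The period that ends the sentence is left alone because it is not followed by one of the listed extensions.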

The H2

A line of all uppercase (with a few other characters) is converted to capital case within <H2> (as shown in the Lines section above). It would be interesting to be able to convert lines of capital case within <H3>. Currently that would take a very complicated RE beyond my skill (really, I wouldn't even attempt it).

The code can easily be changed to test for a key being a function name...

Footnotes

Footnotes are created by two rules: an Inlines rule and a Block rule. The first converts double brackets ([]) at the end of a sentence to a (slightly raised and smaller) number. (It works through the use of a CSS "rule".) The second converts a number of lines to a numbered list.
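
A rough sketch of the two footnote steps, written as two small functions. The names sketch_footnote_marks() and sketch_footnote_list() and the "footnote" class are assumptions made for this example; the real rules are in the translation data and functions and may be built differently.

```php
<?php
// Hypothetical helper: replace each [] with a raised, running number.
function sketch_footnote_marks(string $text): string {
    $n = 0;
    return preg_replace_callback('/\[\]/', function ($m) use (&$n) {
        $n++;
        // The "footnote" class is an assumed hook for the CSS rule.
        return '<sup class="footnote">' . $n . '</sup>';
    }, $text);
}

// Hypothetical helper: turn a number of note lines into a numbered list.
function sketch_footnote_list(array $notes): string {
    $items = array();
    foreach ($notes as $note) {
        $items[] = '<li>' . $note . '</li>';
    }
    return "<ol>\n" . implode("\n", $items) . "\n</ol>";
}

echo sketch_footnote_marks('First claim.[] Second claim.[]'), "\n";
echo sketch_footnote_list(array('Note one.', 'Note two.')), "\n";
```

The counter is the only state involved, and it lives inside one call, so the numbering restarts for each text that is converted.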

The conversion functions code can easily be changed to create # links and span IDs to "connect" them...

End

The possibilities are many, and are easy to implement, and that is one reason to code this way.

The GMLP Motto

I originally thought of the "possibilities" phrase below (formatted in various ways) as this code's motto, but I think I have the right one in:

If you can think it, you can do it.

"A man's got to know his possibilities."

"A man's got to know his possibilities."

"A man's got to know his possibilities."
"A man's got to know his possibilities."

"A man's got to know his possibilities."

"A man's got to know his possibilities."

"A man's got to know his possibilities."


Written by Greg Jennings. You can comment on this code at Freecode and/or SourceForge.

Older archives: GMLP 0.6, GMLP 0.9a.

See also: DEBUG, sekrets, WordBash