The GMLP Markup Language Processor
GMLP is for converting user entered text—along with "markup codes"—into HTML.
The GMLP code, though, does not contain a single string literal test for markup.
GMLP conversions are based on user defined regular expressions.
GMLP is written in PHP: GMLP 0.3 Download
Introduction
Some basic rules for processing the text are that the text is delimited by newlines, with two consecutive newlines delimiting a paragraph. The input text is always at least one paragraph, consisting of at least one line, consisting of at least one word, consisting of at least one character. (With the special case that empty paragraphs are ignored.)
First, to convert text to HTML something must be done to mark paragraphs. There are two ways to do this, both with similar implications. One way is to append each line with a <BR>, with the result of two newlines being two <BR>s to separate paragraphs. However, that way is not flexible enough (a <BR> element cannot have margins or padding). Therefore, each paragraph is enclosed in <P> tags, and each single newline in a paragraph converted to a <BR>. For example (with extra spaces for clarity):
TEXT: Line One
HTML: <p>Line One</p>
TEXT: Line One\n\n Line Two
HTML: <p>Line One</p> <p>Line Two</p>
TEXT: Line One\n\n A\nB\nC\n\n Line Two
HTML: <p>Line One</p> <p>A<br>B<br>C</p> <p>Line Two</p>
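The paragraphing rule above can be sketched in a few lines of PHP. This is only an illustration, not GMLP's actual code: the function name sketch_paragraphs is made up for this example, and it does not handle the block-level caveat.

```php
<?php
// Hypothetical helper, not part of GMLP: wrap each paragraph in <p> tags
// and turn single newlines inside a paragraph into <br>.
function sketch_paragraphs(string $text): string
{
    $out = [];
    // Two or more consecutive newlines delimit a paragraph.
    foreach (preg_split('/\n{2,}/', $text) as $para) {
        $para = trim($para);
        if ($para === '') {
            continue; // empty paragraphs are ignored
        }
        // Trim each line so the "extra spaces for clarity" do not reach the output.
        $lines = array_map('trim', explode("\n", $para));
        $out[] = '<p>' . implode('<br>', $lines) . '</p>';
    }
    return implode("\n", $out);
}
```

Calling sketch_paragraphs("Line One\n\n A\nB\nC\n\n Line Two") yields the three paragraphs of the last example, separated by newlines instead of spaces.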
There is one caveat with converting the text to paragraphs, in that they cannot contain block-level elements. This must be accounted for in two ways. First, if any markup creates HTML in block-level elements, they must be created outside of a paragraph. Second, if the text contains any block-level elements they must not be enclosed in paragraphs. Here is an example:
TEXT: Line One\n <ul>\n<li>item</li>\n</ul>\n Line Two
HTML: <p>Line One</p> <ul><li>item</li></ul> <p>Line Two</p>
GMLP is stateless, so it does not "know" about the difference between inline and block elements, and cannot "track" whether or not it is "in a block element". But there is one way to deal with this, and that is to require that any block-level HTML be on its own line, with no blank lines in between. This simply goes along with the requirement that "two newlines mark a new paragraph". It also means that if the text violates these requirements the output is "undefined" and most likely will not be rendered properly, just as any mismatched HTML tags would not be rendered properly by a web browser.
So, with the creation of paragraphs—or not—no other processing of the text occurs by the code. Any other text conversion is only done by the "conversion rules" in the conversion definition data. If no conversion rules are defined, only these paragraphing rules are applied to the text.
Conversions
There are four types of conversions that can be defined.
Character
Word
Line
Block
All of them occur only according to the data definitions, and in a particular order which is explained later. There is one special-case data definition, which is to "skip a line" from the conversion process.
All conversions are defined as key/value associative arrays. For most conversions the key is a regular expression that is replaced by the value; the exceptions are the "character pair" and the "block" conversions explained below.
Character Conversions
There are two types of character conversion, each with its own definitions. The first type is simply an array of string replacements, with keys replaced by their values. The second type replaces character pairs with start and end tags; these are defined as single-character keys whose values name the replacement tags (which will be further explained later).
The first type as implemented is a string to string replacement and not strictly character to character. This type is also run-time dependent, being able to be disabled by a configuration setting.
The character conversion definitions are the only ones that are not based on regular expressions.
Word Conversions
These conversions are not words as "space delimited" words, but rather simply regular expression replacements applied per line before the lines are converted to paragraphs. The bulk of a markup language will be defined by these. There are two types.
The first type passes a regular expression (RE) match through a function: the key is the RE and the value is the name of a function, and if the RE matches, the match is replaced by the result of calling that function with it.
The other type is a straight RE replacement applied to each line.
Line Conversions
Lines can be converted directly to a paragraph if they match a regular expression (which starts with a "^" character). These turn the line (one or more sentences) into a paragraph (enclosed in block-level HTML elements).
Block Conversions
Block conversions have a start line, followed by any number of lines making up the block (not treated as paragraphs) and then an end line. The start and end lines are regular expressions. Each block definition has a number of conditions that can be "applied" to the block. Each line within the block is appended with a newline.
A Look At The Regular Expressions
The regular expressions I came up with for these examples are not based on any existing markup language. I discuss my choices later on. But again, I will state that the purpose of GMLP is for anyone to create any markup language.
The conversion "rules" are arrays and placed into a PHP file which is then simply included by the code. The arrays are processed in a particular order, which is kind of important as some conversions affect other conversions. The type of conversion and the processing order is controlled by the code and not the data.
The array name chosen is translate because the code kind of "translates one markup language to another" (not exactly right in its current form but the idea is sound).
Skip
The first array is the skip array and this test is performed first for all lines:
$translate['skip'] = array(
"/^`(.*)'$/"
);
Which means that lines beginning with a backtick (`) and ending with a single quote (') are ignored from processing, and those characters are discarded from the line—if the RE does not have a subpattern the line is used as-is.
This is the best example of what GMLP does—any regular expression can be used to skip a line. Two backticks? Double single quotes? Whatever. As with all the other conversion arrays there can also be more than one regular expression to meet the condition.
(One thing to note about the skip array is that it is a regular array and not an associative array as are most others.)
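Here is a minimal sketch of how such a skip test might be applied to a line. The helper name sketch_skip is hypothetical and the real GMLP code may differ; the only behaviour taken from the text is that a matching line bypasses processing and that a subpattern, if present, supplies the line's content.

```php
<?php
$translate = [];
$translate['skip'] = array(
    "/^`(.*)'$/"
);

// Hypothetical helper: returns the line to emit untouched,
// or null when no skip rule matches and normal processing should go on.
function sketch_skip(string $line, array $skip): ?string
{
    foreach ($skip as $re) {
        if (preg_match($re, $line, $m)) {
            // With a subpattern, keep only what it captured;
            // without one, the line is used as-is.
            return $m[1] ?? $line;
        }
    }
    return null;
}
```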
Lines
The next array is lines. Its rules convert a line (actually, a paragraph) and then skip all further conversions. I'll provide three examples:
$translate['lines'] = array(
'/^<pre>.*<\/pre>/' => 'continue',
'/^<html>.*<\/html>/' => 'htmlentities',
'/^h([1-6]):\s*(.*)/' => '<h$1>$2</h$1>'
);
With the lines array the value is either continue to indicate that the line is to be used as-is, a PHP function to be called on the line, or a RE replacement string.
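The three kinds of value could be dispatched roughly as follows. This is a sketch under assumptions: sketch_line is a made-up name, and using function_exists() to tell a function name from a replacement string is a guess at how the distinction could be made, not a description of GMLP's code.

```php
<?php
$translate = [];
$translate['lines'] = array(
    '/^<pre>.*<\/pre>/'   => 'continue',
    '/^<html>.*<\/html>/' => 'htmlentities',
    '/^h([1-6]):\s*(.*)/' => '<h$1>$2</h$1>'
);

// Hypothetical helper: returns the converted line,
// or null when no rule in the lines array matches.
function sketch_line(string $line, array $rules): ?string
{
    foreach ($rules as $re => $value) {
        if (!preg_match($re, $line)) {
            continue;
        }
        if ($value === 'continue') {
            return $line;                        // use the line as-is
        }
        if (function_exists($value)) {
            return $value($line);                // PHP function called on the line
        }
        return preg_replace($re, $value, $line); // RE replacement string
    }
    return null;
}
```

With these rules, the line "h2: Title" becomes "<h2>Title</h2>".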
Block
Block conversions are done next in the processing but are explained later as they require much more detail.
Codes
Next in the processing are codes, and these are like:
$translate['codes'] = array(
"/''(.*)''/U" => 'htmlentities'
);
with the value a PHP function to replace the match with.
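A codes rule maps naturally onto PHP's preg_replace_callback(). The sketch below assumes that the function receives the first subpattern (so the '' delimiters are dropped); the text does not say whether GMLP passes the subpattern or the whole match, and sketch_codes is a hypothetical name.

```php
<?php
$translate = [];
$translate['codes'] = array(
    "/''(.*)''/U" => 'htmlentities'
);

// Hypothetical helper: run every codes rule over one line.
function sketch_codes(string $line, array $codes): string
{
    foreach ($codes as $re => $fn) {
        $line = preg_replace_callback($re, function (array $m) use ($fn) {
            // Assumption: pass the first subpattern, or the whole match if there is none.
            return $fn($m[1] ?? $m[0]);
        }, $line);
    }
    return $line;
}
```

The /U modifier makes the match ungreedy, so two marked spans on one line are converted separately.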
Inlines
The inlines array is next (and as I write this I realize that they can be combined with the codes array), and they perform a RE replace on each line. Here is an example to convert `GNU style' markup:
$translate['inlines'] = array(
'/`([^\'].*)\'/U' => '<code>`$1\'</code>',
);
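Since inlines are plain RE replacements, applying them can be as small as one preg_replace() call. The function name sketch_inlines is made up for this illustration.

```php
<?php
$translate = [];
$translate['inlines'] = array(
    '/`([^\'].*)\'/U' => '<code>`$1\'</code>',
);

// Hypothetical helper: preg_replace() accepts arrays of patterns and
// replacements and applies them to the line in array order.
function sketch_inlines(string $line, array $inlines): string
{
    return preg_replace(array_keys($inlines), array_values($inlines), $line);
}
```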
Entities
The entities array holds simple string replacements, so named because they are like:
$translate['entities'] = array(
'--' => '&mdash;'
);
but, of course, they can be anything. This one also can be disabled if the definition array has a line such as:
$translate['entitytranslate'] = 0;
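A sketch of the entities step, including the switch that disables it. Two things here are assumptions rather than facts from the text: the value is written as the &mdash; entity, and a missing entitytranslate setting is treated as "enabled". The name sketch_entities is hypothetical.

```php
<?php
$translate = [];
$translate['entities'] = array(
    '--' => '&mdash;'
);
$translate['entitytranslate'] = 1;

// Hypothetical helper: plain string replacement, no regular expressions.
function sketch_entities(string $line, array $translate): string
{
    // Assumption: the conversion is on unless entitytranslate is set to 0.
    if (empty($translate['entities']) || !($translate['entitytranslate'] ?? 1)) {
        return $line;
    }
    // strtr() with an array replaces each key by its value, longest key first.
    return strtr($line, $translate['entities']);
}
```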
Chars
The chars array is last; it wraps words delimited by the key character in the tag named by the value:
$translate['chars'] = array(
'*' => 'b',
'_' => 'u',
'\'' => 'code'
);
These are not just straight replacements though, as they try to figure out just what a word is:
Beginning of sentence: "Bold word."
Trailing punctuation: "Italic word."
Preserve inside chars: "$identifier_with_underscores;"
And intermixed: "This is bold with underline."
And not within HTML: font-family Arial Narrow.
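The "figure out what a word is" logic might look something like the sketch below. It is one possible reading of the cases listed above, not GMLP's implementation: a delimiter only counts when it is not glued to a word character on its outer side, and anything inside an HTML tag is left alone. The name sketch_chars is hypothetical.

```php
<?php
$translate = [];
$translate['chars'] = array(
    '*'  => 'b',
    '_'  => 'u',
    '\'' => 'code'
);

// Hypothetical helper: wrap delimiter-enclosed words in tags.
function sketch_chars(string $line, array $chars): string
{
    // Split into HTML tags and the text between them; with DELIM_CAPTURE
    // the even indexes are text and the odd indexes are the tags.
    $parts = preg_split('/(<[^>]*>)/', $line, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            continue; // leave HTML tags untouched
        }
        foreach ($chars as $c => $tag) {
            $q = preg_quote((string) $c, '/');
            // The opening delimiter must not follow a word character, the
            // closing one must not precede one, and the enclosed text must
            // start and end with a non-space character.
            $re = '/(?<!\w)' . $q . '(?=\S)([^' . $q . ']+)(?<=\S)' . $q . '(?!\w)/';
            $part = preg_replace($re, "<$tag>\$1</$tag>", $part);
        }
        $parts[$i] = $part;
    }
    return implode('', $parts);
}
```

Because \w includes the underscore, an identifier such as $identifier_with_underscores passes through unchanged.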
Back To Blocks
A block definition is an associative array defined with the following:
- "begin": the RE to start the block
- "end": the RE to end the block
- "pre": a string to prepend to the block
- "post": a string to append to the block or a function to call on the entire block
- "replace": a string to append to each line or a function to call on each line
- "first": a boolean; if true, use the begin line in the block
- "last": a boolean; if true, use the end line in the block
- "continue": a boolean; if true, retain the end line (for the following paragraph)
Of these, only "begin" and "end" are required; all others are optional (and if no others are defined, nothing will be done to the block).
Here is an example:
$translate['blocks']['html'] = array(
'begin' => '/^<html>/',
'end' => '/^<\/html>/',
'replace' => 'htmlentities',
'pre' => '<pre>',
'post' => '</pre>'
);
The html block starts with <html>, ends with </html>, runs each line through the PHP function htmlentities, prefixes the block with <pre> and postfixes the block with </pre>. Which is, this:
<html>
HTML <b>example</b> of bold.
</html>
will result in this:
HTML <b>example</b> of bold.
Another example is:
$translate['blocks']['php'] = array(
'begin' => '/^\s*<\?php/',
'end' => '/^\s*\?>/',
'post' => 'highlightstr',
'first' => 1,
'last' => 1
);
where post is a function to call on the entire block (note how post has a dual use which is done to reduce data complexity), and first and last means to keep the begin and end lines in the block.
The function highlightstr is defined in the data definition file and is:
<?php
function highlightstr($s) {
    return highlight_string($s, TRUE) . '<br>';
}
?>
(Which of course was converted by the code.)
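To make the block keys concrete, here is a rough sketch of how one definition could be applied to an array of lines. The function names are hypothetical. The order in which "pre" and "post" are applied is an assumption, and "continue" is left out.

```php
<?php
$translate = [];
$translate['blocks']['html'] = array(
    'begin'   => '/^<html>/',
    'end'     => '/^<\/html>/',
    'replace' => 'htmlentities',
    'pre'     => '<pre>',
    'post'    => '</pre>'
);

// Hypothetical helper: apply one block definition to an array of lines.
function sketch_block(array $lines, array $def): string
{
    $out = '';
    $block = null; // null while outside a block, else the collected lines
    foreach ($lines as $line) {
        if ($block === null) {
            if (preg_match($def['begin'], $line)) {
                $block = !empty($def['first']) ? [$line] : [];
            } else {
                $out .= $line . "\n"; // not in a block: passed through
            }
            continue;
        }
        $ended = preg_match($def['end'], $line) === 1;
        if (!$ended || !empty($def['last'])) {
            $block[] = $line;
        }
        if ($ended) {
            $out .= sketch_finish_block($block, $def);
            $block = null;
        }
    }
    // Assumption: an unterminated block is still emitted.
    return $block === null ? $out : $out . sketch_finish_block($block, $def);
}

function sketch_finish_block(array $block, array $def): string
{
    $text = '';
    foreach ($block as $line) {
        if (isset($def['replace'])) {
            // A function is called on the line; a plain string is appended to it.
            $line = function_exists($def['replace'])
                ? $def['replace']($line)
                : $line . $def['replace'];
        }
        $text .= $line . "\n"; // each line in the block gets a newline
    }
    $text = ($def['pre'] ?? '') . $text;
    if (isset($def['post'])) {
        // A function is called on the whole block; a plain string is appended.
        $text = function_exists($def['post']) ? $def['post']($text) : $text . $def['post'];
    }
    return $text;
}
```

Fed the three-line <html> example above, this returns the escaped line wrapped in <pre> and </pre>.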
Conclusion
GMLP works well enough as it currently is for a few dozen regular expressions to define a markup language. The code is certainly not slow, but mostly that is due to the implementation of the PHP PCRE code.
There are two uses for GMLP: for a BBS/CMS/Forum type application that wants to have simple markup, and for converting text files to HTML, which is why I wrote it, actually.
The "markup" I chose is extremely simple and was developed to do the latter. If you examine the "source" to the HTML you are viewing (gmlp.txt) you will see that it is mostly just straight text in a format that everyone uses.
Bugs
There is one known bug. Paragraphs—text between double newlines—are wrapped in <P> tags as explained at the beginning. But that cannot be done all the time. For example, we do not want to wrap block tags:
<dl>
<dt>"begin"<dd>the RE to start the block</dd>
</dl>
So I use this:
if (preg_match('/^<.*>$/',$_))
...
But that skips a line like this:
<b>Section</b>
which should be within <P> tags. Frankly, I do not know what to do to fix that, as I do not want to create an array of all possible block (start and end) tags... The idea behind GMLP is to eliminate inline string literals in the code for the markup conversions. Perhaps one or two really big regular expressions? I don't know... it'll come to me (I hope).
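One way the "really big regular expression" idea could go, purely as a sketch: keep the expression in the definition data so the code still holds no markup literals. The 'blocktags' key and the function name are hypothetical, and the tag list is deliberately incomplete.

```php
<?php
// Hypothetical data entry; 'blocktags' is not a key GMLP defines.
$translate = [];
$translate['blocktags'] =
    '/^<\/?(?:blockquote|div|dl|dt|dd|h[1-6]|hr|li|ol|p|pre|table|tr|td|th|ul)\b[^>]*>/i';

// Hypothetical helper: true when the line starts with a block-level tag
// and so should NOT be wrapped in <p> tags.
function sketch_is_block_line(string $line, array $translate): bool
{
    return preg_match($translate['blocktags'], $line) === 1;
}
```

With this test a line such as <b>Section</b> is no longer skipped, while <dl> and </dl> still are.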
Notes
1. The order of conversions is one of the tricky things about stateless conversion of the kind GMLP does, and was the hardest part to "get right".
2. Now that I write this, I see that I could redesign the code to apply the conversions in the order of the arrays. This might eliminate the hardcoded array names from the code as well as eliminating some functions!
3. There are some <P>s and <BR>s and that's enough!
4. Of course, I could make this code ignore HTML entirely, converting them via htmlentities(), and only use non-HTML markup code. (That should be fairly easy to do, actually.)
"A man's got to know his possibilities."
"A man's got to know his possibilities."
"A man's got to know his possibilities."
"A man's got to know his possibilities."
"A man's got to know his possibilities."
"A man's got to know his possibilities."
"A man's got to know his possibilities."