Simple Markup
Simple Markup (SMU) is a "Text to Text Algorithm" – any input text to any output text. Download: smu-1.4.2.zip.
Oh, and this might be useful: php-debug.zip (see DEBUG Readme).
SMU is not at all related to textutils like SED and AWK.
The "From > To" text definitions are external to the main code, in an associative array of regular expressions and replacement strings.
One does not need complex, multi-line expressions to transform text. SMU uses short, simple, easy to understand (and to write) regular expressions (why simple is in the name).
The design is that the main code neither knows nor cares "how" the text is translated.
Generally, the input data will be formatted, i.e. created by humans, such as "Markdown" text. The output data too will generally be formatted for humans, but need not be.
However, with SMU, even very complex text such as (the worst of) HTML can be parsed with regular expressions.
Herein, the term "text" implies formatted text; "data" means text that may be unformatted or even binary. While either may be created/written/parsed by machine, the former is generally written by humans.
Version
This is version 1.4.2, 5th release, September 2022.
The current base algorithms total just under 600 lines of PHP code, and SMU claims that Markdown markup is now implemented in just 250 lines of PHP data and code.
Quick Start
In this example the function smuread() returns the next line of the INPUT, including the newline:
while (($line = smuread($infile)) !== FALSE) { echo str_replace("foo","bar",$line); }
SMU reads lines as paragraphs, which end in two newlines. (The way documentation is written.)
This is a \n paragraph.\n \n This is a paragraph.\n \n
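As a rough illustration of paragraph-wise reading, here is a minimal sketch. The helper read_paragraph() is hypothetical and is not SMU's reader (the internals of smuread() are not shown on this page); it only assumes that a blank line, i.e. two consecutive newlines, ends a paragraph.

```php
<?php
// Hypothetical helper, NOT part of SMU: collects lines from a stream
// until a blank line (two consecutive newlines) or end of input.
function read_paragraph($fh) {
    $para = '';
    while (($line = fgets($fh)) !== false) {
        if (trim($line) === '') {
            if ($para !== '') {
                break;      // a blank line ends the current paragraph
            }
            continue;       // skip blank lines before a paragraph starts
        }
        $para .= $line;
    }
    return $para === '' ? null : $para;
}

// An in-memory stream keeps the example free of files on disk.
$fh = fopen('php://memory', 'w+');
fwrite($fh, "This is a\nparagraph.\n\nThis is a paragraph.\n\n");
rewind($fh);

$paragraphs = [];
while (($p = read_paragraph($fh)) !== null) {
    $paragraphs[] = $p;
}
fclose($fh);

print_r($paragraphs);   // two entries, the first spanning two lines
```

The first paragraph keeps its inner newline; whether SMU joins such lines depends on the "greediness" setting described later.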
SMU search and replacement strings are defined as a PHP data array.
$smudata = [ 'foo' => 'bar', ];
SMU's algorithm is nested loops.
while ($line = smuread($infile)) { foreach ($smudata as $k => $v) { echo str_replace($k,$v,$line); } }
SMU actually uses regular expressions.
$smudata = [ '/foo/' => 'bar', ];
while ($line = smuread($infile)) { foreach ($smudata as $k => $v) { echo preg_replace($k,$v,$line); } }
And there is a "switch/case" on the value of the data array, so a function/closure can be defined to replace a match, or "conditionals" can be applied such as to ignore a line.
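A minimal sketch of that dispatch on the value, assuming only two cases (replacement string or callable). The helper name apply_rule() and the sample data are made up for illustration; the callable test mirrors the one used in the SMU functions shown further down.

```php
<?php
// Illustrative data only: one replacement string, one closure.
$smudata = [
    '/foo/' => 'bar',
    '/\d+/' => function ($m) { return (string) ($m[0] * 2); },
];

// Hypothetical helper, not SMU's API: apply one rule according to the
// type of its value.
function apply_rule($regex, $repl, $line) {
    if (is_object($repl) || function_exists($repl)) {
        // Closure or function name: called with the matches array.
        return preg_replace_callback($regex, $repl, $line);
    }
    // Plain replacement string.
    return preg_replace($regex, $repl, $line);
}

$line = "foo costs 21\n";
foreach ($smudata as $regex => $repl) {
    $line = apply_rule($regex, $repl, $line);
}
echo $line;   // bar costs 42
```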
Usage
The main code is not called directly, but is included by a "definition file"; these are of the notation smu_<name>.php, where name kinda indicates what it does.
A definition file – or as is sometimes called, a "dataset" or "the DATA" – defines data in the form of PHP associative arrays (of particular names) that get applied to the input text to convert to the output text.
For example, to convert Markdown to HTML:
php smu_md.php readme.md > readme.html
Command Line Interface
While the "typical" use of this code is to create (or modify an existing) dataset to perform new conversion rules, there is a CLI to adjust the runtime configuration of existing datasets.
For example, to convert Markdown to HTML:
./smu readme.md smu_md.php > readme.html
The input file and data file are swapped only because there is a default data file, smu_.php, described next.
Non-Usage
By non-usage is meant using the code without a dataset. The main code is just an API and does nothing on its own; it is meant to be used by a dataset, with the "default dataset" used as a template.
The way into the API is the function simple_markup(), which is called with a file name (or the input text to use).
Given this test text, where the character · is used for a newline:
One.· Two.· · Three four.·
The default dataset is like:
include 'smu.php';
echo simple_markup($argv[1]);
which, used as ./smu test_text.txt, will output:
One.· Two.· · Three four.·
Which is the same. By design.
To do more to the output, a combination of defined constants and data is used. For example, here is applying "greediness":
const MU_GREEDY = 1;
include 'smu.php';
echo simple_markup($argv[1]);
which will output:
One. Two.· · Three four.·
Adding const HTML = 1; it will output:
<p>One. Two.</p>· · <p>Three four.</p>·
The CLI is more comprehensive, with options to apply those changes, -greedy=1 and -html=1, and much more.
DATA
The purpose of SMU is to apply DATA to convert input TEXT to output TEXT, with the DATA as simple arrays.
There are two types of arrays, one for "blocks" and one for "inlines", with regular expressions as keys and their replacement text as values.
Lines
An example from Markdown is to replace all # Header lines with <h1>Header</h1>, with a regular expression like:
/^#\s*([^#]*)(#*)$/
with a replacement string like:
<h1>$1</h1>
It's just that SMU does not have dozens of complex, multi-line expressions applied to an entire input text, but simple expressions in an array applied to each line of input like:
$mu_lines = [
'/^#\s*([^#]+)$/' => '<h1>$1</h1>',
];
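To see the array at work outside of SMU, here is a small stand-alone sketch. The driver loop is an assumption for illustration, not SMU's own (that is shown later); the input lines have their trailing newline stripped so that it does not end up inside $1.

```php
<?php
$mu_lines = [
    '/^#\s*([^#]+)$/' => '<h1>$1</h1>',
];

// Sample input, already split into lines without trailing newlines.
$input = ['# Header', 'Plain text.'];

$out = [];
foreach ($input as $line) {
    foreach ($mu_lines as $regex => $repl) {
        // Lines that do not match come back unchanged.
        $line = preg_replace($regex, $repl, $line);
    }
    $out[] = $line;
}

echo implode("\n", $out), "\n";
// <h1>Header</h1>
// Plain text.
```

Adding another conversion rule means adding another key and value to the array; the loop stays the same.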
If that seems more complicated, one might be surprised.
Inlines
Inlines, or emphasis in Markdown terminology, is similar, with emphasis like:
$mu_inlines = [
'/\*(.+)\*/U' => '<em>$1</em>',
];
The actual expression used handles word boundaries, escapes etc.; but that is the basic expression.
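The U modifier in that basic expression makes the match ungreedy, which matters as soon as a line has more than one emphasis. A quick sketch (the sample sentence is made up) comparing the two behaviours:

```php
<?php
$mu_inlines = [
    '/\*(.+)\*/U' => '<em>$1</em>',
];

$text = 'Some *very* short and *quite* simple words.';

// With U: each *...* pair is matched on its own.
$line = $text;
foreach ($mu_inlines as $regex => $repl) {
    $line = preg_replace($regex, $repl, $line);
}
echo $line, "\n";
// Some <em>very</em> short and <em>quite</em> simple words.

// Without U: the greedy .+ runs from the first * to the last one.
$greedy = preg_replace('/\*(.+)\*/', '<em>$1</em>', $text);
echo $greedy, "\n";
// Some <em>very* short and *quite</em> simple words.
```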
Data Code
The value of any regular expression can be a function, which is how more complex output can occur, like multi-line blocks such as for ordered and unordered lists. This is called "data code".
A really simple example is one expression for all Markdown headers with the data code for it a closure like:
$mu_lines = [
'/^(#+)\s*([^#]+)/' => function($m){
$n = strlen($m[1]); /* header level = number of leading #s */
return "<h$n>{$m[2]}</h$n>";},
];
This shows that assigned functions get passed the $matches argument of preg_match(). Or the data can just be the function name, like md_header, for this to be used:
function md_header($m) {
$h = trim($m[2]);
$n = strlen($m[1]);
return "<h$n>$h</h$n>";
}
This way all datasets can be easily expanded to do anything in two ways, the array data or its functions, like:
function md_header($m) {
$h = trim($m[2]);
$n = strlen($m[1]);
$i = str_replace(' ','_',strtolower($h));
return "<h$n id=\"$i\">$h</h$n>";
}
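A short usage sketch for that last function. Calling it through preg_replace_callback() is an assumption made here to keep the example stand-alone; how SMU itself invokes data code is shown in the algorithm section.

```php
<?php
function md_header($m) {
    $h = trim($m[2]);
    $n = strlen($m[1]);
    $i = str_replace(' ', '_', strtolower($h));
    return "<h$n id=\"$i\">$h</h$n>";
}

// The data names the function; the driver below is illustrative only.
$mu_lines = [
    '/^(#+)\s*([^#]+)/' => 'md_header',
];

$html = '';
foreach ($mu_lines as $regex => $fn) {
    $html = preg_replace_callback($regex, $fn, '## Quick Start');
}
echo $html, "\n";
// <h2 id="quick_start">Quick Start</h2>
```

The only change from the previous version is inside the function; the array entry and the main code are untouched.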
Conclusion
SMU implements Markdown to HTML in under 800 lines of code and data.
The SMU Motto: "You can change the code without changing the code."
The SMU Algorithm
This is a slightly "trimmed" version of the SMU Algorithm. It is made up of two other (sub) algorithms that do the "lines" and "blocks" markup and then the "inlines" markup.
<?php #
/* simple_markup - the simple markup function - main entry point */
function simple_markup($data) {
    $s = ''; /* return string */
    $p = MU_P; /* paragraph wrappers "paragraph" */
    $b = MU_B; /* "break" */
    $n = MU_EOL; /* newline */
    smu_open($data);
    while (($_ = smu_read($data)) !== NULL) {
        $l = markup_lines($_,$data);
        if ($l === FALSE) {
            markup_getline($_,$data);
            $l = markup_inlines($_);
            if ($l) {
                $l = "$p$l$b";
            }
        }
        $s .= "$l$n";
    }
    return $s;
}
The initialization of the local variables with the defined constants makes the code easier to read and work with.
For the "opening" of the input text, $data is overloaded and passed by reference. Overloaded in that it was a file name (string) and then becomes the file handle.
Most people will decry such overloading and modification. No need to justify that usage now though. (If a reader is grimacing no one is forcing them to continue.)
The algorithm starts with the "while (there's another line)" loop based on smu_read(), with the next line read into $_, which is NULL to indicate no more.
The single letter variable names (and $_) are also a convention, and $l is overloaded and $_ passed by reference.
Same caveats as previous note.
The markup_lines() function, itself another algorithm, will either mark up a line (or multiple lines) or not. If it did mark up some text, the result ($l) will be part of the formatted output; if not, its return is FALSE (and the unmodified line is still in $_).
The markup_inlines() function, itself another algorithm, does the inlines markup (or not). But before that, the markup_getline() function is used to possibly read a paragraph (greediness as defined by Markdown).
"Inlines" can sometimes be called "emphasis", as in what Markdown does, though SMU does nothing on its own; the DATA defines what "inlines" are.
The result of inlines markup is the line(s), marked up or not, in $l. Then the line (paragraph) is enclosed within the "paragraph wrappers", $p and $b. For HTML these are the typical <p> and </p> tags; for TEXT, $p and $b are (usually) empty.
The formatted output text keeps growing by append, which upon loop termination gets returned.
That is all. And is why "Simple" is used in the name. The functions that do the markup are next.
Lines Markup
This is what SMU does: it applies a list of regular expression replacements to each line of input text—with the ability to group lines together.
Here is the lines algorithm and commentary.
function markup_lines(&$line, &$data) {
    global $mu_lines;
    foreach ($mu_lines as $regex => $repl) {
        $n = preg_match($regex,$line,$m);
        if (!$n) {
            continue;
        }
        if (is_object($repl) || function_exists($repl)) {
            $res = $repl($line,$regex,$data,$m);
            return $res;
        }
        $res = preg_replace($regex,$repl,$line);
        return $res;
    }
    return FALSE;
}
The lines array is a list of regular expression keys with a value of what to do with, or how to apply, the regex to the line. If the value is a function that function is applied to the line. Otherwise the value is a replacement string applied to the line.
Two very important points with this are: 1) $data being passed by reference is how a (defined data) function can read multiple lines for block markup, such as a Markdown quote block; 2) if a regex matches a line, the result is returned and lines markup stops for that line.
The function returns marked-up text or FALSE to indicate that no markup occurred.
While some people too will decry multiple return points in a function, there is a reason for doing so here. Again, this code is not being forced on anyone.
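To make the first point concrete, here is a toy sketch of data code that consumes further input through the by-reference $data. Everything named toy_* is hypothetical: an array with a cursor stands in for SMU's file handle, and toy_peek()/toy_read() stand in for whatever reading functions a real dataset would use. Only the argument order ($line, $regex, $data, $m) is taken from markup_lines() above.

```php
<?php
// Toy input source (assumption): an array of lines plus a cursor.
function toy_peek(array &$data) {
    return $data['lines'][$data['pos']] ?? null;
}
function toy_read(array &$data) {
    $line = toy_peek($data);
    if ($line !== null) {
        $data['pos']++;
    }
    return $line;
}

// Hypothetical data code for a Markdown quote block.
function toy_quote($line, $regex, &$data, $m) {
    $text = [$m[1]];
    // Keep consuming input while the following lines also match.
    while (($next = toy_peek($data)) !== null
           && preg_match($regex, $next, $mm)) {
        $text[] = $mm[1];
        toy_read($data);
    }
    return '<blockquote>' . implode(' ', $text) . '</blockquote>';
}

$data  = ['lines' => ['> one', '> two', 'three'], 'pos' => 0];
$regex = '/^>\s?(.*)$/';

$line = toy_read($data);
preg_match($regex, $line, $m);
$block = toy_quote($line, $regex, $data, $m);
$rest  = toy_read($data);

echo $block, "\n";   // <blockquote>one two</blockquote>
echo $rest, "\n";    // three
```

Because the cursor lives in $data and $data is shared by reference, the main loop simply continues with whatever line the block function left behind.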
Inlines Markup
The inlines algorithm is similar in architecture but differing in process.
function markup_inlines($line) {
    global $mu_inlines;
    foreach ($mu_inlines as $regex => $repl) {
        $n = preg_match($regex,$line);
        if (!$n) {
            continue;
        }
        if (is_object($repl) || function_exists($repl)) {
            $line = preg_replace_callback($regex,$repl,$line);
        }
        else {
            $line = preg_replace($regex,$repl,$line);
        }
    }
    return $line;
}
That applies a list of regular expressions to a line, applying them all to the line if there is a match. Again the value for a regex is a function or a replacement string.
The line is always returned, either marked up or not.
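The difference in process between the two algorithms can be sketched with one small rule set. The two rules are illustrative, not SMU's actual Markdown data: inlines processing applies every matching rule in turn, while lines processing stops at the first match.

```php
<?php
// Illustrative rules; array order matters (strong before em).
$rules = [
    '/\*\*(.+)\*\*/U' => '<strong>$1</strong>',
    '/\*(.+)\*/U'     => '<em>$1</em>',
];
$line = 'A **strong** and *soft* word.';

// Inlines style: every matching rule is applied, results accumulate.
$all = $line;
foreach ($rules as $regex => $repl) {
    $all = preg_replace($regex, $repl, $all);
}

// Lines style: the first matching rule wins and processing stops.
$first = false;
foreach ($rules as $regex => $repl) {
    if (preg_match($regex, $line)) {
        $first = preg_replace($regex, $repl, $line);
        break;
    }
}

echo $all, "\n";     // A <strong>strong</strong> and <em>soft</em> word.
echo $first, "\n";   // A <strong>strong</strong> and *soft* word.
```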
Conclusion
An astute reader may have seen the flaws of this code, one of operation, one of performance.
The first is exposed by a simple question: "How does line/block markup apply inlines markup?" In this version (1.4.2, Autumn, 2022) the functions that mark up blocks call the markup_inlines() function, if they want. While that does add some complexity to the overall process, its justification is that it also adds flexibility.
Just how block inlines markup occurs is documented elsewhere.
Then, a reader may be thinking, "All inlines regular expression definitions are applied to each line?" Yup. "Why not apply them in turn to the entire input at once?" Nope. For the SMU markdown data there are about 23 regular expressions, averaging about 32 characters each. This code ain't gonna be slow.
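Anyone who would rather measure than take that on faith can use a sketch like the following. The 23 rules and the 1000 input lines are synthetic stand-ins, not SMU's real data, so the timing only gives a feel for the order of magnitude on a given machine.

```php
<?php
// Synthetic stand-ins for a dataset's inline rules (NOT SMU's data).
$rules = [];
for ($i = 0; $i < 23; $i++) {
    $rules["/\\bword$i\\b/"] = "<b>word$i</b>";
}

// 1000 identical input lines, each matching two of the rules.
$lines = array_fill(0, 1000,
    "Some word3 text with word17 and *emphasis* in it.\n");

$start = microtime(true);
foreach ($lines as &$line) {
    foreach ($rules as $regex => $repl) {
        // Same shape as markup_inlines(): test first, then replace.
        if (preg_match($regex, $line)) {
            $line = preg_replace($regex, $repl, $line);
        }
    }
}
unset($line);   // drop the reference left over from the loop
$elapsed = microtime(true) - $start;

printf("%d lines x %d rules: %.4f s\n",
    count($lines), count($rules), $elapsed);
```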
This code has, needs, no complex, multi-line regular expressions, with perhaps dozens of assertions. That is one of the reasons for this code.