
I Reckon This Must be the Place, I Reckon

A Place that is small, efficient and fast or something.

Parsing HTML with Regular Expressions

Presented here is a simple, straightforward technique for parsing HTML with regular expressions.

If anyone says to you, "You can't parse HTML with regular expressions," please ask them, "Have you tried?"

First Things First

The first thing to do when starting something is to define just what it is.

What do you mean by "parsing?"

Do you mean an AST (Abstract Something Tree)? Maybe you want to do some Rendering? Or are you making a Compiler? Whatever those things are, this is about:

Parsing HTML as its elements: TAGs and DATA.

HTML is made of elements and data, as defined by the HTML Standard. Elements are "start tags" and "end tags", and data is everything outside of those tags.
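To make the three kinds of pieces concrete, here is a minimal sketch that labels a single piece as a start tag, an end tag, or data. The function name classify_token is made up for this illustration, and the patterns are simplified assumptions: comments and doctype declarations are not treated as tags here.

```php
<?php
// Hypothetical helper (not from the original post): classify one piece of HTML.
function classify_token(string $token): string
{
    // An end tag looks like </name>
    if (preg_match('/^<\/\w+\s*>$/', $token)) {
        return 'end tag';
    }
    // A start tag looks like <name ...attributes...>
    if (preg_match('/^<\w+[^>]*>$/', $token)) {
        return 'start tag';
    }
    // Everything else is data
    return 'data';
}

foreach (['<p class="x">', 'Hello', '</p>'] as $token) {
    echo classify_token($token), ': ', $token, "\n";
}
```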

Parsing, therefore, in this case, is simply to "read" an HTML document element by element, data by data.
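As a rough illustration of "reading" a document piece by piece, the sketch below uses preg_split to cut a string into tags and data in document order. The name read_html is hypothetical, and the single tag pattern is an assumption that ignores comments, doctype declarations, and a ">" inside attribute values.

```php
<?php
// Hypothetical helper: return the tags and data of $html in document order.
function read_html(string $html): array
{
    // Split on start/end tags and keep the tags themselves in the result.
    $parts = preg_split(
        '/(<\/?\w+[^>]*>)/',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    $tokens = [];
    foreach ($parts as $part) {
        $part = trim($part);
        if ($part !== '') {
            $tokens[] = $part; // skip whitespace-only data
        }
    }
    return $tokens;
}

print_r(read_html('<p>Hello <b>world</b></p>'));
```

For the sample string, this yields the pieces <p>, Hello, <b>, world, </b>, </p>, in that order.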

The Code

Here is the basic POC:

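PHP code:

<?php
const TAGB = '/^<[\w]+[^>]*>/';     // a start tag at the head of the input
const TAGA = '/(.*)(<\/[\w]+>)/U';  // ungreedy: data, then the next end tag

$data = file_get_contents('test.html');
$data = str_replace("\n", '', $data);

while ($data) {
    $r = preg_match(TAGB, $data, $m);
    if ($r) {
        // The input begins with a start tag, so no data precedes it.
        $d = '';
        $t = $m[0];
    } else {
        // Otherwise, everything up to the next end tag is data.
        $r = preg_match(TAGA, $data, $m);
        if (!$r) exit("im confused");
        $d = $m[1];
        $t = $m[2];
        // Consume the data.
        $data = substr($data, strlen($d));
    }

    // Consume the tag, then any whitespace before the next piece.
    $data = substr($data, strlen($t));
    $data = ltrim($data);
    echo "tag: \"$t\" , $d \n";
}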
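One case the loop above does not separate is data that is followed by a start tag rather than an end tag, as in "<p>Hello <b>world</b></p>": the text and the start tag are reported together as data. The sketch below is one possible variant, not the original method. It assumes a single anchored pattern (the names NEXT_TAG and tag_data_pairs are invented here) that captures any leading data and then the next tag of either kind.

```php
<?php
// Hypothetical variant: group 1 is any data before the next tag,
// group 2 is that tag (start or end).
const NEXT_TAG = '/^([^<]*)(<\/?\w+[^>]*>)/';

function tag_data_pairs(string $html): array
{
    $data = ltrim(str_replace("\n", '', $html));
    $pairs = [];
    while ($data !== '') {
        if (!preg_match(NEXT_TAG, $data, $m)) {
            break; // trailing text with no tag after it, or a stray "<"
        }
        $pairs[] = ['tag' => $m[2], 'data' => rtrim($m[1])];
        // Consume the data and the tag, then leading whitespace.
        $data = ltrim(substr($data, strlen($m[0])));
    }
    return $pairs;
}

foreach (tag_data_pairs('<p>Hello <b>world</b></p>') as $p) {
    echo "tag: \"{$p['tag']}\" , {$p['data']}\n";
}
```

Each pair holds a tag together with the data that came just before it, so "Hello" is reported with <b> and "world" with </b>.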

But all that's moot...

If one parses HTML with Perl it's with Regular Expressions...

If one parses HTML with SED it's with Regular Expressions...

...

So, really, WTF?