Often it is necessary to present PHP source code via XHTML, for documentation or (as in these articles) to illustrate a point. When presenting such code, we require:
- Valid XHTML
- Semantic Validity (<code> tags)
- Syntax highlighting
- Line numbering
- Easy for viewers to copy and paste code
So, what are our options? PHP provides a built-in function to highlight and display code, highlight_string. However, the XHTML output from this is, frankly, a mess. Although it produces valid XHTML with highlighted syntax, there is no line numbering, and the styling is embedded inline in the markup. It is possible to use this function and then clean up the output; the best example of this approach is probably Chris Shiflett's highlighting code. He encapsulates the code in an XHTML ordered list, and uses spans to mark up the syntax:
XHTML
<ol>
  <li><code> ...<span class="comment"> ... </span>... </code></li>
  <li><code> ... </code></li>
</ol>
The advantage of this approach is that it creates line numbers automatically, and still allows viewers to copy and paste easily from the presented code. As valid semantic XHTML, it therefore hits all our targets, and is a good solution to the problem of highlighting source code. However, while I think the markup is excellent, the methodology bugs me a little: re-parsing the original output from highlight_string is a hack, and makes the resulting code fragile. The output from highlight_string already differs between PHP 4 and PHP 5; what happens if it changes again? There must be a better way. There is, and indeed Chris mentions it in his original article: the PHP tokenizer library.
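To see what any clean-up code has to work with, here is a minimal sketch that captures the output of highlight_string rather than printing it. The sample source and the variable names are my own, and the exact markup you get back depends on the PHP version you run it on.

```php
<?php
// Illustrative sample source; any short PHP snippet will do.
$source = '<?php echo "hi"; ?>';

// Passing true as the second argument returns the markup instead of printing it.
$html = highlight_string($source, true);

// Count the inline style attributes emitted by the built-in highlighter.
$inlineStyles = substr_count($html, 'style="');

echo $html, PHP_EOL;
echo 'inline style attributes: ', $inlineStyles, PHP_EOL;
```

Any code that post-processes this string is tied to its exact shape, which is the fragility the tokenizer library lets us avoid.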
The rest of this article explains how this library can be used to produce highlighted source code as XHTML. There are a number of niggly problems which need to be overcome, and the impatient among you may want to skip ahead directly to our final PHP highlighting using the tokenizer library example.
Laying the Foundations
The PHP tokenizer library provides access to the Zend Engine's lexical scanner, and is enabled by default. The flagship function token_get_all converts a source code string into an array of component pieces like variables, strings, keywords, and so on. This array can then be parsed to produce markup similar to Chris' original.
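As a quick way to get a feel for these pieces, the sketch below lists each one next to a readable token name, using the tokenizer's own token_name function. The helper describeTokens is a hypothetical name of mine, not part of the library.

```php
<?php
// Hypothetical helper: pair each source piece with a readable token name.
function describeTokens($source)
{
    $rows = array();
    foreach (token_get_all($source) as $piece) {
        if (is_array($piece)) {
            // Tokenized piece: element 0 is the token id, element 1 the content.
            $rows[] = array(token_name($piece[0]), $piece[1]);
        } else {
            // Pieces with no token (such as "=" or ";") arrive as plain strings.
            $rows[] = array('(plain)', $piece);
        }
    }
    return $rows;
}

foreach (describeTokens('<?php $a = 1; ?>') as $row) {
    echo str_pad($row[0], 28), var_export($row[1], true), PHP_EOL;
}
```

Running this against a snippet of your own is the easiest way to check which tokens your PHP version actually produces.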
The first step is to write a function that will convert each token into a class name. There are quite a number of different tokens, which gives us very fine-grained control over the syntax highlighting, but makes for a rather verbose function. If we want to mark up comments, variables, strings, numbers, tags and content outside PHP tags separately, this function might be:
PHP 5
<?php
function getClassFromToken($type)
{
    switch ($type) {
        case T_OPEN_TAG:
        case T_OPEN_TAG_WITH_ECHO:
        case T_CLOSE_TAG:
            return 'tag';
        case T_COMMENT:
        case T_DOC_COMMENT:
            return 'comment';
        case T_INLINE_HTML:
            return 'inline-content';
        case T_CONSTANT_ENCAPSED_STRING:
            return 'string';
        case T_VARIABLE:
        case T_STRING_VARNAME:
            return 'variable';
        case T_LNUMBER:
        case T_DNUMBER:
            return 'number';
        case T_ABSTRACT:
        case T_ARRAY:
        case T_AS:
        /* ...a number of keyword tokens
           have been omitted for brevity!... */
        case T_TRY:
        case T_UNSET:
        case T_WHILE:
            return 'keyword';
        default:
            return false;
    }
}
?>
The source code is parsed into tokens by the function token_get_all, which returns an array of source pieces. If a piece has a token type, it is returned as an array holding the token identifier followed by the content. If no token can be associated with that piece, the plain string is returned. For our purposes, we need to normalise each piece to get the token type, the content and the class name.
PHP 5
<?php
$tokens = token_get_all($source);
foreach ($tokens as $piece) {
    /* standardise token */
    if (is_array($piece)) {
        list($type, $content) = $piece;
        $class = getClassFromToken($type);
    } else {
        $type = false;
        $class = false;
        $content = $piece;
    }
    /* $type    == token type
       $class   == class name required
       $content == source string */
}
?>
At this point we are ready to tackle the main task of iterating over the tokenized source and encapsulating it in an ordered list, highlighting syntax with XHTML class spans. Line breaks in the original source are simply treated as whitespace by the tokenizer library and need to be explicitly detected by the iteration code.
PHP 5
<?php
$tokens = token_get_all($source);
$out = '<ol class="php">'.PHP_EOL.'<li><code>';
foreach ($tokens as $piece) {
    /* standardise token:
       $type    == token type
       $class   == class name required
       $content == source string */

    /* act on token. */
    $content = explode("\n", $content);
    for ($i=0, $max=count($content)-1; $i<=$max; $i++) {
        /* add new line on 2nd iteration or greater */
        if ($i>0) {
            $out .= '</code></li>'.PHP_EOL.'<li><code>';
        }
        /* trim if at end of line */
        if ($i<$max) {
            $content[$i] = rtrim($content[$i]);
        }
        /* wrap content in spans */
        if ($class !== false) {
            $out .= '<span class="'.$class.'">'.htmlentities($content[$i]).'</span>';
        } else {
            $out .= htmlentities($content[$i]);
        }
    }
}
$out .= '</code></li>'.PHP_EOL.'</ol>';
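The line-splitting step is easier to follow in isolation. The sketch below applies the same logic to a single piece of content, here a two-line comment; the helper wrapPiece is a hypothetical name introduced only for this illustration.

```php
<?php
// Hypothetical helper: wrap one piece of content in spans, closing the current
// list item and opening a new one at every line break inside the piece.
function wrapPiece($content, $class)
{
    $out = '';
    $lines = explode("\n", $content);
    $max = count($lines) - 1;
    foreach ($lines as $i => $line) {
        if ($i > 0) {
            // Second part or later: the previous source line has ended.
            $out .= '</code></li>' . PHP_EOL . '<li><code>';
        }
        if ($i < $max) {
            // This part is followed by a line break, so trailing space is noise.
            $line = rtrim($line);
        }
        $escaped = htmlentities($line);
        $out .= ($class !== false)
            ? '<span class="' . $class . '">' . $escaped . '</span>'
            : $escaped;
    }
    return $out;
}

$comment = "/* first line\n   second line */";
echo '<ol class="php">' . PHP_EOL
    . '<li><code>' . wrapPiece($comment, 'comment') . '</code></li>' . PHP_EOL
    . '</ol>' . PHP_EOL;
```

Note that the span is closed and reopened around the line break, so every list item stays well formed on its own.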
We have now written the core of the source highlighting process, but there is still one situation where the current code will fail, and that is within a double-quoted string which has embedded variables.
Dealing with Variables within a Double-Quoted String
There are two token types associated with strings, T_STRING and T_CONSTANT_ENCAPSED_STRING. The latter token is associated with single quoted strings, or double quoted strings with no embedded variables. The T_STRING type is associated with all plain strings located within the source, strings such as function names, class names, and so on. We want to highlight only quoted strings, thus only the T_CONSTANT_ENCAPSED_STRING type returns a 'string' classname from our getClassFromToken function. However, when a double-quoted string contains embedded variables, the quoted string is broken down into a number of sub-tokens:
PHP 5
<?php
$source = '<?php "embedded $variable"; ?>';
print_r(token_get_all($source));
/*
Array (
    [0] => Array( [0] => T_OPEN_TAG   [1] => <?php )
    [1] => "
    [2] => Array( [0] => T_STRING     [1] => embedded )
    [3] => Array( [0] => T_WHITESPACE [1] =>  )
    [4] => Array( [0] => T_VARIABLE   [1] => $variable )
    [5] => "
    [6] => ;
    [7] => Array( [0] => T_WHITESPACE [1] =>  )
    [8] => Array( [0] => T_CLOSE_TAG  [1] => ?> )
)
*/
?>
So when a double quoted string is broken into sub-pieces we want to highlight T_STRING tokens within the double quotes. An active double quote is always parsed into a piece by itself, so we simply need to track whether a double quote is open or closed and treat T_STRING appropriately:
PHP 5
<?php
$tokens = token_get_all($source);
$in_quotes = false;
foreach ($tokens as $piece) {
    /* standardise token
       $type    == token type
       $class   == class name required
       $content == source string */

    /* account for embedded variables */
    if ($type === false && $content === '"') {
        $class = getClassFromToken(T_CONSTANT_ENCAPSED_STRING);
        $in_quotes = !$in_quotes;
    } elseif ($in_quotes && $type === T_STRING) {
        $class = getClassFromToken(T_CONSTANT_ENCAPSED_STRING);
    }

    /* parse token piece to XHTML */
}
?>
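The quote tracking can be tried out on its own with the sketch below. One caveat: the token types reported inside a double-quoted string vary between PHP versions, and on newer releases the literal text is, as far as I can tell, reported as T_ENCAPSED_AND_WHITESPACE rather than T_STRING. The sketch therefore accepts both, which is my assumption rather than part of the code above, and markStringPieces is a hypothetical helper name.

```php
<?php
// Hypothetical helper: flag each source piece as string content or not,
// tracking whether a double quote is currently open.
function markStringPieces($source)
{
    // Token types counted as literal text while a quote is open.
    // Including T_ENCAPSED_AND_WHITESPACE is an assumption about newer PHP versions.
    $stringText = array(T_STRING, T_ENCAPSED_AND_WHITESPACE);
    $inQuotes = false;
    $marked = array();

    foreach (token_get_all($source) as $piece) {
        if (is_array($piece)) {
            $type = $piece[0];
            $content = $piece[1];
        } else {
            $type = false;
            $content = $piece;
        }

        // Quoted strings without embedded variables arrive as a single token.
        $isString = ($type === T_CONSTANT_ENCAPSED_STRING);

        if ($type === false && $content === '"') {
            // A lone double quote opens or closes an interpolated string.
            $isString = true;
            $inQuotes = !$inQuotes;
        } elseif ($inQuotes && in_array($type, $stringText, true)) {
            $isString = true;
        }

        $marked[] = array($content, $isString);
    }
    return $marked;
}

foreach (markStringPieces('<?php echo "embedded $variable in string"; ?>') as $row) {
    echo var_export($row[0], true), ' => ', ($row[1] ? 'string' : 'other'), PHP_EOL;
}
```

The embedded variable stays outside the string class, so it can still be highlighted as a variable.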
And voila! All quoted strings are highlighted appropriately, irrespective of whether they have embedded variables or not:
PHP 5
<?php
echo "embedded $variable in string";
echo "embedded {$array['key']} in string";
?>
The Final Highlighter Script
Bring together all the elements discussed so far into a single class, add in some odd-even line highlighting, sprinkle in some CSS magic... And we arrive at the final PHP highlighting using the tokenizer library example. This validates to XHTML Strict Standards, numbers the source lines, highlights PHP syntax and allows viewers to copy and paste the code easily.
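For readers who want something to experiment with straight away, here is a compact sketch of how the pieces might fit together in one function. It is not the final script referred to above: tokenClass and highlightPhpSource are hypothetical names, the keyword list is cut down to a handful of tokens, the odd/even class names are my own choice, and accepting T_ENCAPSED_AND_WHITESPACE inside quotes is an assumption about newer PHP versions.

```php
<?php
// Sketch only, not the article's final script. All names here are hypothetical.
function tokenClass($type)
{
    switch ($type) {
        case T_OPEN_TAG:
        case T_OPEN_TAG_WITH_ECHO:
        case T_CLOSE_TAG:
            return 'tag';
        case T_COMMENT:
        case T_DOC_COMMENT:
            return 'comment';
        case T_INLINE_HTML:
            return 'inline-content';
        case T_CONSTANT_ENCAPSED_STRING:
            return 'string';
        case T_VARIABLE:
        case T_STRING_VARNAME:
            return 'variable';
        case T_LNUMBER:
        case T_DNUMBER:
            return 'number';
        // Abbreviated keyword list, for illustration only.
        case T_ECHO:
        case T_FUNCTION:
        case T_RETURN:
        case T_IF:
        case T_ELSE:
        case T_WHILE:
            return 'keyword';
        default:
            return false;
    }
}

function highlightPhpSource($source)
{
    $lines = array('');   // markup collected per source line
    $inQuotes = false;    // true while inside an interpolated string

    foreach (token_get_all($source) as $piece) {
        // Normalise the piece to a type, its content and a class name.
        if (is_array($piece)) {
            $type = $piece[0];
            $content = $piece[1];
            $class = tokenClass($type);
        } else {
            $type = false;
            $content = $piece;
            $class = false;
        }

        // Quotes, and literal text between them, count as string content.
        if ($type === false && $content === '"') {
            $class = 'string';
            $inQuotes = !$inQuotes;
        } elseif ($inQuotes && ($type === T_STRING || $type === T_ENCAPSED_AND_WHITESPACE)) {
            $class = 'string';
        }

        // Split on line breaks so each source line becomes its own list item.
        $parts = explode("\n", $content);
        $max = count($parts) - 1;
        foreach ($parts as $i => $text) {
            if ($i > 0) {
                $lines[] = '';
            }
            if ($i < $max) {
                $text = rtrim($text);
            }
            if ($text === '') {
                continue;
            }
            $escaped = htmlentities($text);
            $lines[count($lines) - 1] .= ($class !== false)
                ? '<span class="' . $class . '">' . $escaped . '</span>'
                : $escaped;
        }
    }

    // Emit the ordered list, alternating a class on each line for striping.
    $out = '<ol class="php">' . PHP_EOL;
    foreach ($lines as $n => $markup) {
        $rowClass = ($n % 2 === 0) ? 'odd' : 'even';
        $out .= '<li class="' . $rowClass . '"><code>' . $markup . '</code></li>' . PHP_EOL;
    }
    return $out . '</ol>';
}

$demo = "<?php\n\$a = 1; // set\necho \"x \$a\";\n?>";
echo highlightPhpSource($demo), PHP_EOL;
```

Collecting the markup per line, rather than appending to one long string, makes it straightforward to attach the odd/even classes afterwards. Indentation inside each list item still needs CSS (for example white-space: pre on the code elements) to survive rendering.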