Philadelphia Reflections: XHTML vs. HTML

The markup language used by web browsers continues to evolve. The most current version (as of April 2009) is XHTML 1.1, an XML version of HTML.

Many browsers, most particularly IE, do not support XHTML. Technically speaking, they support only the "text/html" MIME type, not "application/xhtml+xml". Lots of web developers have gone to the trouble of putting self-closing slashes ( />) in their BR, HR, META and INPUT tags and an XHTML DOCTYPE at the top, but then serve the code as "text/html".

This produces a syntactic mishmash which may be worse than using strict HTML 4.01.

Why "worse"? Because of the possibility of unintended results from providing incorrect instructions to the browser. If you care about the output produced by the browser, which most developers and content providers emphatically do, then you have to be careful about what instructions you give the browser. You simply cannot count on getting what you want if what you're telling the browser to do is syntactically incorrect.

However, it's a little difficult to see just what good XHTML is:

There are rumors that it renders the non-image portion of a page as much as 50% faster than HTML, but what with gzip and broadband being pretty common these days, it's hard to see that as an especially compelling reason to be bothered.
Furthermore, those browsers that do render XHTML (Mozilla, Firefox) are very picky about syntax and blow up much too easily.
And the claim that XHTML is the way to get your web pages onto cell phones and toaster ovens leaves me cold. It's just not believable that the format required for these special devices will be the same as for a computer monitor. (For the current status of handheld support on this site, see How to detect an iPhone and other mobile devices.)

Internet cognoscenti speak disparagingly of "tag soup" but the Internet is a lot more about content than it is about syntax, so who really cares?

Well, somehow, I do. A little. Since we use PHP on this site, we have the opportunity to figure out what features a browser supports and send the matching tags, MIME type, etc.
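
The kind of check this makes possible can be sketched in a few lines. This is only an illustration, not the site's script (which appears in full below): the function names accept_q and pick_mime are hypothetical, and the sketch assumes the HTTP rule that a media type listed without a q-value has an implied q of 1.0.

```php
<?php
// Hypothetical helper: returns the q-value a browser assigns to $type
// in its Accept header, or 0.0 if the type is not listed at all.
function accept_q($accept, $type)
{
    foreach (explode(',', $accept) as $entry) {
        $params = explode(';', $entry);
        $name   = strtolower(trim(array_shift($params)));
        if ($name !== strtolower($type)) {
            continue;
        }
        // A listed type with no explicit q-value counts as 1.0
        $q = 1.0;
        foreach ($params as $param) {
            if (preg_match('/^\s*q\s*=\s*([0-9.]+)\s*$/i', $param, $m)) {
                $q = (float) $m[1];
            }
        }
        return $q;
    }
    return 0.0;
}

// Hypothetical chooser: XHTML only when the browser lists it and
// does not rate text/html higher.
function pick_mime($accept)
{
    $xhtml_q = accept_q($accept, 'application/xhtml+xml');
    $html_q  = accept_q($accept, 'text/html');
    return ($xhtml_q > 0 && $xhtml_q >= $html_q) ? 'application/xhtml+xml' : 'text/html';
}
```

In a page this would be called as pick_mime(isset($_SERVER['HTTP_ACCEPT']) ? $_SERVER['HTTP_ACCEPT'] : ''). A browser that sends only a wildcard Accept header falls through to "text/html".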

Check out the HTTP headers and the page source to see the following script in action:

It renders XHTML 1.1 whenever it encounters a browser that can support it
It uses output buffering (which demonstrably if illogically improves rendering response time)
It sends the whole thing using gzip compression if the browser will support it
But also, it concedes certain issues based on experience for the sake of a smoothly-operating website

<?php
//
//  This script figures out what kind of mime type (HTML vs XHTML) the browser supports and sends the correct headers
//  It also initiates compression, specifies caching and sends the other headers that would otherwise be <meta http-equiv=...> tags
// 
//  My thanks to https://www.workingwith.me.uk/articles/scripting/mimetypes for the basic idea and structure
// 
//  $_SERVER["HTTP_ACCEPT"] describes the mime_types a browser supports in a comma-separated list:
// 
//    mime_type,mime_type,mime_type
// 
//  If a browser prefers one mime_type or group of mime_types, it adds a q-value
// 
//    mime_type,mime_type;q=x.x, mime_type,mime_type,mime_type,...,mime_type;q=x.x
// 
//  The q-value is a number between 0.0 and 1.0 ... the higher the number, the greater the preference
//  The idea is that if we can serve more than one mime_type we should serve the browser's higher preference
//
//  ob_start("ob_gzhandler"); does all the work to compress the output if the browser can handle it
//  ob_start("fix_code"); calls the "fix_code" function instead, so initiating gzip is my responsibility
//
//  $_SERVER["HTTP_USER_AGENT"] is an opaque description of the browser itself
//
//  $_SERVER['HTTP_ACCEPT_ENCODING'] describes compression capabilities
//
//  I output these three variables as an HTML comment so I can debug things more easily
//
//  Despite my desire to do things "right", you will see I accommodate myself to the reality of user-supplied content 
//  and browser peculiarities in order to have a working website
//

function fix_code($buffer)
  {
  #
  # Called for HTML browsers to turn the lovely XHTML-style " />" tag endings back into plain ">"
  # it's up to me to initiate the gzipping because ob_start is called with "fix_code" instead of "ob_gzhandler"
  #
  $html = str_replace(" />", ">", $buffer);

  # a client that sends no Accept-Encoding header at all gets uncompressed output
  if (isset($_SERVER['HTTP_ACCEPT_ENCODING']) && stristr($_SERVER['HTTP_ACCEPT_ENCODING'], 'gzip'))
    {
    header("Content-Encoding: gzip"); // notifies the far-end to un-gzip 
    return gzencode($html, 6, FORCE_GZIP);
    }

  return $html;
  }

#
# default values
#
$charset          = "UTF-8";       # See https://en.wikipedia.org/wiki/UTF-8
$mime             = "text/html";   # Plain vanilla
$cache_control    = "max-age=200"; # Cache expires after 200 seconds

$xhtml_q          = 0;
$html_q           = 0;

# see https://www.w3.org/QA/2002/04/valid-dtd-list.html
$DOCTYPE_xhtml11  = "<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.1//EN' 'https://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd'>\n"; 
$DOCTYPE_xhtml10  = "<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.0 Strict//EN' 'https://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd'>\n";
$DOCTYPE_wap      = "<!DOCTYPE html PUBLIC '-//WAPFORUM//DTD XHTML Mobile 1.2//EN' 'https://www.openmobilealliance.org/tech/DTD/xhtml-mobile12.dtd'>\n";
$DOCTYPE_html401  = "<!DOCTYPE html PUBLIC '-//W3C//DTD HTML 4.01//EN' 'https://www.w3.org/TR/html4/strict.dtd'>\n";
$DOCTYPE_html401l = "<!DOCTYPE html PUBLIC '-//W3C//DTD HTML 4.01 Transitional//EN' 'https://www.w3.org/TR/html4/loose.dtd'>\n";

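# the XHTML namespace is a fixed identifier, not a link: it must be spelled with http://, not https://
$html_xhtml       = "<html xmlns='http://www.w3.org/1999/xhtml' xml:lang='en'>\n\n";
$html_iphone      = "<html xmlns='http://www.w3.org/1999/xhtml' xml:lang='en' manifest='iphone.manifest'>\n\n";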
$html_html401     = "<html lang='en'>\n\n";
$html_html401_IE  = "<html lang='en' xmlns:v='urn:schemas-microsoft-com:vml'>\n\n";  # xmlns:v='urn:schemas-microsoft-com:vml' is recommended by Google for maps display using IE
$html_plain       = "<html>\n\n";

# parental control tag
$pics_Label       = '(pics-1.1 "https://www.icra.org/pics/vocabularyv03/" l 
	gen true for "https://philadelphia-reflections.com" r (n 0 s 0 v 0 l 0 oa 0 ob 0 oc 0 od 0 oe 0 of 0 og 0 oh 0 c 0) 
	gen true for "https://www.philadelphia-reflections.com" r (n 0 s 0 v 0 l 0 oa 0 ob 0 oc 0 od 0 oe 0 of 0 og 0 oh 0 c 0) 
	gen true for "https://search.freefind.com" r (n 0 s 0 v 0 l 0 oa 0 ob 0 oc 0 od 0 oe 0 of 0 og 0 oh 0 c 0) 
	gen true for "https://www.search.freefind.com" r (n 0 s 0 v 0 l 0 oa 0 ob 0 oc 0 od 0 oe 0 of 0 og 0 oh 0 c 0) 
	gen true for "https://statcounter.com" r (n 0 s 0 v 0 l 0 oa 0 ob 0 oc 0 od 0 oe 0 of 0 og 0 oh 0 c 0) 
	gen true for "https://www.statcounter.com" r (n 0 s 0 v 0 l 0 oa 0 ob 0 oc 0 od 0 oe 0 of 0 og 0 oh 0 c 0) 
	gen true for "https://c3.statcounter.com" r (n 0 s 0 v 0 l 0 oa 0 ob 0 oc 0 od 0 oe 0 of 0 og 0 oh 0 c 0) 
	gen true for "https://www.c3.statcounter.com" r (n 0 s 0 v 0 l 0 oa 0 ob 0 oc 0 od 0 oe 0 of 0 og 0 oh 0 c 0))';

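# I include the following HTML comment for my ongoing debugging purposes
# the values are supplied by the browser, so "--" and ">" are neutralized to keep them from ending the comment early
$show_info        = "<!-- \n";
foreach (array("HTTP_USER_AGENT", "HTTP_ACCEPT_ENCODING", "HTTP_ACCEPT") as $key)
  {
  $value      = isset($_SERVER[$key]) ? $_SERVER[$key] : "";
  $value      = str_replace(">", "&gt;", preg_replace('/-(?=-)/', '- ', $value));
  $show_info .= str_pad($key, 21) . "$value\n";
  }
$show_info       .= " -->\n\n";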

# note that I eval $prolog_type below so that the xml header (if any) gets the right charset
$prolog_type      = '$DOCTYPE_html401l $html_plain $show_info';

#
# the logic
# 

# W3C Validator
if (stristr($_SERVER["HTTP_USER_AGENT"],"W3C_Validator")) 
  {
  ob_start("ob_gzhandler");
  $mime        = "application/xhtml+xml";
    # UTF-8 produces character-type errors
    $charset     = "iso-8859-1";
  $prolog_type = '$xml_header $DOCTYPE_xhtml11 $html_xhtml $show_info';
  }
  else
    {
    # fancy wap-enabled handheld device
    if(stristr($_SERVER["HTTP_ACCEPT"],"application/vnd.wap.xhtml+xml")) 
      { 
      ob_start("ob_gzhandler");
        # per https://www.ready.mobi/ and https://www.w3.org/TR/mobileOK-basic10-tests/ application/xhtml+xml is preferred
//      $mime        = "application/vnd.wap.xhtml+xml";
        $mime        = "application/xhtml+xml";
      $prolog_type = '$xml_header $DOCTYPE_wap $html_plain $show_info';
      }
      else
        {
        # non-wap xhtml-enabled browser
        if(stristr($_SERVER["HTTP_ACCEPT"],"application/xhtml+xml")) 
          { 
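          # retrieve the q values for "application/xhtml+xml" and "text/html"
          # the number pattern accepts any value from 0 to 1, with or without decimals (0, 0.05, 0.9, 1.0)
          # note: when a type carries no q-value of its own, [^;]*? runs on to the next q-value in the list

          if (preg_match('%application/xhtml\+xml[^;]*?;\s*q=([01](?:\.[0-9]+)?)%i', $_SERVER["HTTP_ACCEPT"], $matches)) 
            {
            $xhtml_q = (float)$matches[1];
            }

          if (preg_match('%text/html[^;]*?;\s*q=([01](?:\.[0-9]+)?)%i', $_SERVER["HTTP_ACCEPT"], $matches)) 
            {
            $html_q = (float)$matches[1];
            }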

          # if the q value for HTML is greater than for XHTML
          # then treat output as HTML 4.01 strict (Opera 9.64, for instance)

          if($html_q > $xhtml_q) 
            {
            ob_start("fix_code");
            $mime        = "text/html";
              # UTF-8 produces character-type errors
              $charset     = "iso-8859-1";
            $prolog_type = '$DOCTYPE_html401 $html_html401 $show_info';
            }

            # otherwise, go with XHTML
            else
              {
              ob_start("ob_gzhandler");
                # for the time-being application/xhtml+xml is too strict for us: unless your tags are PERFECT, it blows up
//              $mime        = "application/xhtml+xml";
                $mime        = "text/html";
                # UTF-8 produces character-type errors
                $charset = "iso-8859-1";

              # see "Safari Web Content Guide for iPhone OS" for cache manifest description
              if (stristr($_SERVER["HTTP_USER_AGENT"],"iPhone")) 
                {
                $prolog_type = '$xml_header $DOCTYPE_xhtml11  $html_iphone $show_info';
                }
                else
                 {
                  $prolog_type = '$xml_header $DOCTYPE_xhtml11  $html_xhtml $show_info';
                 }
              }
            }
          
          else
            {
            # plain text/html browser
            if(stristr($_SERVER["HTTP_ACCEPT"],"text/html")) 
              { 
              ob_start("fix_code");
              $mime        = "text/html";
                # UTF-8 produces character-type errors
                $charset     = "iso-8859-1";
              $prolog_type = '$DOCTYPE_html401 $html_html401 $show_info';
              }
              else
                {
                # if the browser doesn't specify any X/HTML mime type, treat like HTML 4.01 Transitional (IE 7, for instance)
                ob_start("fix_code");
                $mime        = "text/html";
                  # UTF-8 produces character-type errors
                  $charset     = "iso-8859-1";
                $prolog_type = '$DOCTYPE_html401l $html_plain $show_info';
                # if IE then include Google's recommended "xmlns:v  ..." 
                if(stristr($_SERVER["HTTP_USER_AGENT"],"MSIE")) 
                  {
                  $prolog_type = '$DOCTYPE_html401l $html_html401_IE $show_info';
                  }
                }
            }
        }
    }

#
# output the mime type, prolog type and other <meta http-equiv= variables
#
header("Content-Type: $mime; charset=$charset");
header("Content-Language: en-us");
header("Vary: Accept");

header("Cache-Control: $cache_control");

header("Content-Script-Type: text/javascript");
header("Content-Style-Type: text/css");
header("imagetoolbar: no");

// parental controls from https://www.icra.org/
header("pics-Label: $pics_Label");

// privacy header created at https://www.p3pwiz.com/
header("P3P: policyref=\"https://www.philadelphia-reflections.com/w3c/p3p.xml\", CP=\"NID DSP NOI COR\"");

$xml_header       = "<?xml version='1.0' encoding='$charset' ?>\n";
eval("\$prolog_type = \"$prolog_type\";");

print $prolog_type;
?>
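
The script stores its prolog as a template of variable names and expands it with eval() once the final charset is known. The same late expansion can probably be done without eval(). What follows is a sketch only: the function name expand_prolog is hypothetical, the list of parts is trimmed down for brevity, and it assumes the template uses the same $name placeholders as the script above.

```php
<?php
// Hypothetical eval-free expansion of a prolog template.
// $template uses the same '$name' placeholders as the script above;
// $parts maps each name (without the dollar sign) to its final text.
function expand_prolog($template, array $parts)
{
    $map = array();
    foreach ($parts as $name => $text) {
        $map['$' . $name] = $text;
    }
    // With an array argument, strtr() tries the longest keys first and does
    // not rescan replaced text, so '$DOCTYPE_html401l' is not mistaken
    // for '$DOCTYPE_html401' followed by a stray "l".
    return strtr($template, $map);
}

// Example with a few of the script's parts; the charset is decided first
// so the XML header picks up the right encoding.
$charset = 'iso-8859-1';
$parts   = array(
    'xml_header'       => "<?xml version='1.0' encoding='$charset' ?>\n",
    'DOCTYPE_html401'  => "<!DOCTYPE html PUBLIC '-//W3C//DTD HTML 4.01//EN' 'https://www.w3.org/TR/html4/strict.dtd'>\n",
    'DOCTYPE_html401l' => "<!DOCTYPE html PUBLIC '-//W3C//DTD HTML 4.01 Transitional//EN' 'https://www.w3.org/TR/html4/loose.dtd'>\n",
    'html_html401'     => "<html lang='en'>\n\n",
    'html_plain'       => "<html>\n\n",
);
$prolog = expand_prolog('$DOCTYPE_html401l $html_plain', $parts);
```

The trade-off is that every part has to be listed in the array explicitly, whereas eval() picks up whatever variables happen to be in scope.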

Here's an interesting article on Doctype Switching: https://gutfeldt.ch/matthias/articles/doctypeswitch.html

The Philadelphia Reflections webmaster: George IV

(my thanks to https://centricle.com/tools/html-entities/ for HTML encoding)

Originally published: Thursday, August 03, 2006; most-recently modified: Monday, June 04, 2012