Migrating to XForms
November 1, 2006
In 2001, the W3C set out to create an XML standard for implementing user forms in XHTML by publishing the XForms 1.0 Working Draft. The purpose of XForms is to eventually replace existing HTML forms, which are limited in capability and notoriously difficult to develop in. If you are not familiar with XForms, or aren't convinced of their benefits, start off by checking out What are XForms.
In March of this year, the W3C announced the XForms 1.0 Second Edition Recommendation. In July, Mozilla announced Preview Release 0.6 of their XForms extension. It won't be long until browsers begin supporting XForms, and once this happens, they will be the prevalent and preferred method of user data collection on the internet. Until then, it's in our best interest to begin migrating our current XHTML forms to XForms so that we're ready once the new standard is mainstream.
Our goal here is to take an XHTML document containing one or more standard forms, convert the forms into XForms format while preserving all of the information, and generate a new XHTML document as a result. To achieve this, we will be using PHP's XML parser functions, which have been around since PHP 4 and have been used in many PHP APIs, such as Magpie (an RSS parser) and nuSOAP (a library for web services support).
Figure 1. XForms Parser
Figure 1 is an overview of how the system will work. Essentially, there are three main phases (grey). In Phase 1, we prepare the input file for parsing and split it into several segments. In Phase 2, we actually pass the data through the parser. Note that the only segment of the input file that is actually parsed is the <body> tag (green). Because XForms require elements in both the <head> and <body> HTML, the parser will also append data to the contents of the <head> tag. This appended data is labeled "A" (orange). "B" represents the portion of the input XHTML that closes the <head> tag. Each phase will be explained separately.
Phase 1: Preparing the Input
As is evident in Figure 1, it is crucial that we split the input file into several segments so that we parse only the portion of the XHTML file that we need to, and so that we can append the necessary XForms elements to the <head> tag. To accomplish this, we use two PHP functions: stripos() and substr(). The first function tells us the position of a string (needle) inside a larger string (haystack). We will pass the result we get from this function to the second function, substr(). As you might guess, substr() gives us a part (substring) of a larger string -- all we have to tell it is the start position and the substring's desired length.
Now that you understand what we're doing, you're probably wondering why we're doing it. Take a look at the code below, and you should get a clearer idea:
    /*A*/
    $instr = file_get_contents("inputform.html");
    $pos["headstart"] = stripos($instr, "<head>");
    $pos["headend"]   = stripos($instr, "</head>");
    $pos["bodystart"] = stripos($instr, "<body>");
    $pos["bodyend"]   = stripos($instr, "</body>") + 7;
    /*B*/
    $input["top"]    = substr($instr, 0, $pos["headstart"]);
    $input["head"]   = substr($instr, $pos["headstart"], $pos["headend"] - $pos["headstart"]);
    $input["middle"] = substr($instr, $pos["headend"], $pos["bodystart"] - $pos["headend"]);
    $input["body"]   = substr($instr, $pos["bodystart"], $pos["bodyend"] - $pos["bodystart"]);
    $input["bottom"] = substr($instr, $pos["bodyend"]);
A: file_get_contents() fetches the contents of the input HTML (inputform.html) and stores it in the variable $instr. The next four lines call stripos() to get the positions where the <head> tag begins, the <head> tag ends, the <body> tag begins, and the <body> tag ends (respectively). We added 7 (the length of the string "</body>") to the position of the closing </body> tag so that the position is that of the first character after it. To understand why we've made this exception, let's look at the second part of the code.
B: Here we call substr() and split the input into the five sections outlined in Figure 1. The first parameter passed to substr() is the input string (in this case, $instr), the second is the position of the first character of the substring that will be returned, and the third parameter is the length of the desired substring. We already have the right positions (the simple algebra used to verify this has been omitted), so we simply pass the positions we got in the previous four lines. We added "7" to the last position retrieved (i.e., the closing </body> tag) so that we include this closing tag inside the $input["body"] substring. We do this because this substring will be the one passed to the parser; we include the closing tag so that the substring runs through the parser without throwing an error.
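To make the position arithmetic concrete, here is a minimal sketch that applies the same stripos() and substr() calls to a one-line document. The miniature input string and the $segments array name are assumptions made for illustration; the real script reads inputform.html.

```php
<?php
// Hypothetical miniature input; the article's script reads inputform.html instead.
$instr = "<html><head><title>t</title></head><body><p>Hi</p></body></html>";

$headStart = stripos($instr, "<head>");
$headEnd   = stripos($instr, "</head>");
$bodyStart = stripos($instr, "<body>");
$bodyEnd   = stripos($instr, "</body>") + strlen("</body>"); // same as + 7

// The five segments from Figure 1, cut with the same substr() arithmetic.
$segments = array(
    "top"    => substr($instr, 0, $headStart),
    "head"   => substr($instr, $headStart, $headEnd - $headStart),
    "middle" => substr($instr, $headEnd, $bodyStart - $headEnd),
    "body"   => substr($instr, $bodyStart, $bodyEnd - $bodyStart),
    "bottom" => substr($instr, $bodyEnd),
);

foreach ($segments as $name => $text) {
    echo str_pad($name, 7), "| ", $text, "\n";
}
```

Gluing the five pieces back together in order reproduces the input exactly, which is the property Phase 3 relies on. Note that the "head" segment stops just before </head>, leaving room for the XForms model to be appended.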
Because the PHP parser is designed primarily for XML input, we will need to make some minor changes to the contents of the <body> tag (stored in $input["body"]). For example, the following three form tags would each throw a PHP parser error:
    <input type="text" name="t" disabled />
    <input type="checkbox" name="c" value="c1" checked />
    <select multiple name="s">
      <option value="1">One</option>
    </select>
This happens because element attributes without set values are not allowed in XML; here, those attributes are disabled, checked, and multiple. To avoid this, we will "trick" the parser by assigning empty values to these attributes so that the modified HTML looks like this:
    <input type="text" name="t" disabled="" />
    <input type="checkbox" name="c" value="c1" checked="" />
    <select multiple="" name="s">
      <option value="1">One</option>
    </select>
The following code accomplishes this task:
    $fixatt = array("multiple", "checked", "disabled");
    foreach ($fixatt as $a)
        $input["body"] = str_replace(" $a ", " $a=\"\" ", $input["body"]);
str_replace() is another useful PHP function. It searches for a certain string (first parameter) inside a larger string (third parameter), and replaces it with a replacement string (second parameter). The function returns the new, modified string. Note that if you plan to extend this code to larger HTML files with mixed data, you should use the preg_replace() function instead, because str_replace() will not be selective enough in some cases. That is, if the text of your HTML body contains any of the words in $fixatt surrounded by spaces, they will automatically have ="" appended to them. You can be more specific with preg_replace() since it uses regular expressions, thus allowing you to limit modifications to only those within <form> tags.
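One way the regular-expression approach might look is sketched below. The helper name fixBareAttributes(), the list of form-control tags, and the sample markup are all assumptions for illustration, not part of the original script. It uses preg_replace_callback(), a sibling of preg_replace(), so that only start tags are examined and quoted attribute values are left alone. It also assumes that no attribute value contains a ">" character.

```php
<?php
// Hypothetical helper (not from the article): give bare boolean attributes
// an empty value, but only inside form-control start tags.
function fixBareAttributes($html, $names = array("multiple", "checked", "disabled"))
{
    $alt = implode("|", array_map("preg_quote", $names));

    // Whole start tags of form controls; ordinary body text never matches.
    $tagPattern = '/<(?:input|select|option|textarea|button)\b[^>]*>/i';

    // Inside a tag: either a quoted value (kept as is) or a bare attribute
    // name that is not already followed by "=".
    $attrPattern = '/("[^"]*"|\'[^\']*\')|(?<![\w-])(' . $alt . ')(?![\w-])(?!\s*=)/i';

    return preg_replace_callback($tagPattern, function ($tag) use ($attrPattern) {
        return preg_replace_callback($attrPattern, function ($m) {
            // $m[1] is non-empty when a quoted value matched; keep it verbatim.
            return $m[1] !== "" ? $m[0] : $m[2] . '=""';
        }, $tag[0]);
    }, $html);
}

$html = '<p>Is it checked or disabled?</p>'
      . '<input type="checkbox" name="c" value="checked" checked />'
      . '<select multiple name="s"><option value="1">One</option></select>';

echo fixBareAttributes($html), "\n";
```

In this sample, the words in the paragraph text and the quoted value="checked" are untouched, while the bare checked and multiple attributes gain an empty value.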
As we have successfully prepared the HTML for parsing, we can move on to the main phase: the parser.
Phase 2: The Parser
Initially, we will construct the parser so that it is able to read XHTML and reconstruct it as output. Thus, the output will be identical to the input. The purpose of this first step is to ensure that the parser is able to preserve the portions of the HTML that are not form elements.
Before we actually go into the parsing logic, we define the initial parser configuration as follows:
    /*A*/
    define("NSPACES_ON", true); // the constant name must be a quoted string
    $f = (NSPACES_ON) ? "f:" : "";
    /*B*/
    $parser = xml_parser_create();
    xml_set_element_handler($parser, "tagOpen", "tagClosed");
    xml_set_character_data_handler($parser, "tagContent");
    $curtags = array();
A: To allow for greater syntax flexibility, we provide a way to turn namespaces on or off. If you are unfamiliar with XML namespaces, check out XML Namespaces By Example. The current W3C proposal for XHTML requires the namespace references for XForms to be included. However, once XHTML 2.0 becomes a recommendation, they will not be required. Visit the W3C HTML Homepage for more information.
B: Here is where we set up the parser itself. The first line simply creates a parser resource. The second line is critical: it defines the functions that the parser calls when it encounters the start and end of an XHTML element (or tag). The PHP function that is used to accomplish this, xml_set_element_handler(), takes three parameters: the variable representing the parser resource, the name of the function that is called at the start of a tag, and the name of the function called when the tag (XHTML element) is closed. Next, xml_set_character_data_handler() defines the function called when the parser encounters any text that is not markup (known as character data). The parameters are similar: the first is the parser resource, and the second is the name of the function to call when character data is encountered. The functions tagOpen(), tagClosed(), and tagContent() are known as "handlers," since they are called by an internal system rather than by programmer-written code. The internal system in this case is the PHP parser. On the last line, we initialize the $curtags array. This array (implemented as a stack) will be visible to all three handlers so that we always know what tag is being read and what other tags are open. The way $curtags works will be explained in more detail later in this article.
Parser Foundation
As an abstract example, here is some simple XML. Let's assume the parser is running with the settings that we've just defined above:

    <greeting friendly="true">
    Hello World!
    </greeting>

The parser walks through the above XML character by character. When it reaches the end of the first line, it calls tagOpen(), passing the name and attributes of the <greeting> tag. Once the function executes, the parser continues to traverse the XML until it detects more markup on the third line. At this point, it calls tagContent() and passes the text inside the <greeting> tag (including the two line breaks). After that function runs, it reads the name of the closing tag and passes it to the tagClosed() function. That's essentially how the PHP parser works.
Now that we've gone through some PHP parser basics, we can start tackling the logic of the parser itself. As mentioned before, this initial version is only meant to pass the input file through the parser and reconstruct a file with identical data as output. We will add the form translation code once we get this first part right. Let's start with the tagOpen() function (the start element handler):
    function tagOpen($parser, $name, $attrs) {
        /*A*/
        global $outbody, $curtags, $sctag;
        $sctag = true;
        /*B*/
        array_unshift($curtags, $name);
        /*C*/
        switch ($curtags[0]) {
            /*Cases for form tag translation go here*/
            default:
                /*D*/
                $outbody .= "<" . $name;
                foreach ($attrs as $k => $v)
                    $outbody .= " $k=\"$v\"";
                $outbody .= ">";
                break;
        }
    }
A: First we declare all variables that have to be seen by all handlers. $outbody contains the parsed output for the <body> tag, while the purpose of $curtags has been previously mentioned. The Boolean variable $sctag records whether the current tag is self-closing. For example, <br/>, <hr/>, <img/>, and <input/> are all self-closing tags. It is set to true by default.
B: The function array_unshift(), in conjunction with array_shift(), allows us to implement $curtags as a simple stack. array_unshift() puts $name as an element at the front of the array while shifting all other elements of the array down one position. On the other hand, array_shift() does the opposite: it removes the first element of the array and shifts all other elements up one position. Implementing a stack like this is convenient in PHP, as the top of the stack can be examined (without changes) simply by accessing $curtags[0]. Thus, the first element in the array is the most recently opened XHTML tag, the second element is the open tag that is one level up from the current one, and so on. Also, the size of $curtags tells us our current tag depth.
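As a quick illustration of the stack behaviour just described, the following sketch pushes and pops a few tag names by hand. The upper-case names are assumed here only to mimic what the parser passes to the handlers.

```php
<?php
$curtags = array();

array_unshift($curtags, "BODY");   // <body> opens
array_unshift($curtags, "FORM");   // <form> opens inside it
array_unshift($curtags, "INPUT");  // <input> opens inside the form

echo $curtags[0], "\n";            // current tag: INPUT
echo $curtags[1], "\n";            // one level up: FORM
echo count($curtags), "\n";        // current depth: 3

array_shift($curtags);             // <input> closes
echo $curtags[0], "\n";            // the current tag is FORM again
```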
C: This switch statement determines what to do based on the current tag. As we add the forms translation logic, we will add more cases to the switch statement. For now, we are only concerned with the default case, which should completely preserve the original XHTML syntax.
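D: The unchanged XHTML syntax is appended to $outbody here. The foreach loop traverses the associative array that contains the attribute information, and appends to $outbody as appropriate. For example, the tag <style id="1"> would result in $attrs having an element with a key of "ID" and an associated value of "1". The key is in upper case because the parser folds element and attribute names to upper case by default, which is also why the handlers compare against names such as "FORM" and "INPUT".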
Now we'll examine the tagContent() and tagClosed() functions (the character data and end element handlers, respectively):
    function tagContent($parser, $data) {
        global $outbody, $curtags, $sctag;
        switch ($curtags[0]) {
            /*Cases for form tag translation go here*/
            default:
                /*A*/
                $sctag = false;
                $outbody .= $data;
                break;
        }
    }

    function tagClosed($parser, $name) {
        global $outbody, $curtags, $sctag;
        switch ($name) {
            /*Cases for form tag translation go here*/
            default:
                /*B*/
                if ($sctag) //self-closing tag
                    $outbody = substr($outbody, 0, -1) . "/>";
                else
                    $outbody .= "</$name>";
                break;
        }
        // A parent that has just closed a child cannot itself be self-closing
        $sctag = false;
        /*C*/
        array_shift($curtags);
    }
When comparing these two handlers with the first one we discussed, we see similarities: both begin by exposing the required variables globally, and both contain a switch statement that selects cases based on the current tag name. As with tagOpen(), we will add more cases to these switch statements once we add support for XForms translation.
A: Once we reach this point, we know that we're in a standard XHTML tag that contains non-HTML data. In other words, it is not a self-closing tag. Therefore, we set $sctag to false. Also, we make sure that this non-HTML data is carried through to the output file by appending it to $outbody.
B: If the tag that we're currently parsing turns out to be a self-closing tag, we have to remove the ">" character that was added in tagOpen() and replace it with "/>". Otherwise, we close the tag the expected way, with a separate closing tag.
C: At the end of tagClosed(), we are done with the current tag, so we remove it from the top of the stack using array_shift().
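To see the three handlers cooperate, here is a compact sketch that wires up only their default branches and runs a small body through the parser. The sample markup is an assumption. The handlers also reset $sctag after every closing tag, so that a parent containing nothing but an empty child is not mistaken for a self-closing tag. Treat that line as a suggested safeguard rather than a required part of the design.

```php
<?php
$outbody = "";
$curtags = array();
$sctag   = true;

function tagOpen($parser, $name, $attrs) {
    global $outbody, $curtags, $sctag;
    $sctag = true;                       // assume self-closing until proven otherwise
    array_unshift($curtags, $name);
    $outbody .= "<" . $name;
    foreach ($attrs as $k => $v)
        $outbody .= " $k=\"$v\"";
    $outbody .= ">";
}

function tagContent($parser, $data) {
    global $outbody, $sctag;
    $sctag = false;                      // text means the tag is not self-closing
    $outbody .= $data;
}

function tagClosed($parser, $name) {
    global $outbody, $curtags, $sctag;
    if ($sctag)
        $outbody = substr($outbody, 0, -1) . "/>";   // turn "<BR>" into "<BR/>"
    else
        $outbody .= "</$name>";
    $sctag = false;                      // the parent now has a child, so it is not empty
    array_shift($curtags);
}

$parser = xml_parser_create();
xml_set_element_handler($parser, "tagOpen", "tagClosed");
xml_set_character_data_handler($parser, "tagContent");

// Hypothetical sample body
xml_parse($parser, '<body><p class="x">Hi<br/></p><div><hr/></div></body>', true);
echo $outbody, "\n";
```

The output preserves the structure and the self-closing tags, but tag and attribute names come back in upper case because of the parser's default case folding.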
Now that we've set the foundations of our parser, we can start adding the logic necessary to translate the HTML form elements into XForms elements.
Translating to XForms
From this point on, an understanding of XForms is assumed -- if you are unfamiliar or need brushing up, I recommend "What Are XForms" (mentioned earlier).
Let's look at an input XHTML file containing a simple form:
    <html>
    <head>
    <title>sample form</title>
    </head>
    <body>
    <form action="#" method="get" name="s">
    Find <input type="text" name="Find" />
    <input type="submit" value="Go" />
    </form>
    </body>
    </html>
If we translate this form into the XForms model, it looks like this:
    <html xmlns:f='http://www.w3.org/2002/xforms'>
    <head>
    <title>sample form</title>
    <f:model><f:submission action='#' method='get' id='s'/></f:model></head>
    <body>
    <div class='form'>
    Find <f:input ref='Find'><f:label>Find</f:label></f:input>
    <f:submit submission='s'><f:label>Go</f:label></f:submit>
    </div>
    </body>
    </html>
Now that we have our input and output requirements, we can add the necessary XForms translation logic to our element handlers. The additions are the second global line and the new case branches. Let's start with tagOpen():
    function tagOpen($parser, $name, $attrs) {
        /*A*/
        global $outbody, $curtags, $sctag;
        global $outhead, $curformid, $f;
        $sctag = true;
        array_unshift($curtags, $name);
        switch ($curtags[0]) {
            case "FORM":
                /*B*/
                // Stays empty if method="post" is given without enctype (error handling omitted)
                $method = "";
                if (!isset($attrs["ENCTYPE"])) {
                    if ($attrs["METHOD"] != "post")
                        $method = $attrs["METHOD"];
                }
                else if ($attrs["ENCTYPE"] == "application/x-www-form-urlencoded")
                    $method = "urlencoded-post";
                else if ($attrs["ENCTYPE"] == "multipart/form-data")
                    $method = "form-data-post";
                /*C*/
                $curformid = $attrs["NAME"];
                $outhead .= "<$f"."submission action='".$attrs["ACTION"]
                    . "' method='" . $method
                    . "' id='" . $attrs["NAME"] . "'/>";
                $outbody .= "<div class='form'>";
                break;
            case "INPUT":
                /*D*/
                $sctag = false;
                switch ($attrs["TYPE"]) {
                    /*Add'l cases for form tag translation go here*/
                    case "text":
                        $outbody .= "<$f"."input ref='".$attrs["NAME"]
                            . "'><$f" . "label>".$attrs["NAME"]."</$f"."label>"."</$f"."input>";
                        break;
                    case "submit":
                        $outbody .= "<$f"."submit submission='$curformid'><$f"
                            . "label>".$attrs["VALUE"]."</$f"."label>"."</$f"."submit>";
                        break;
                }
                break;
            default:
                $outbody .= "<" . $name;
                foreach ($attrs as $k => $v)
                    $outbody .= " $k=\"$v\"";
                $outbody .= ">";
                break;
        }
    }
A: We had to add some more globally scoped variables to support the new logic. $outhead contains all the XForms tags that need to be added to the <head> tag (represented by the orange box labeled "A" in Figure 1). $curformid contains the unique identifier of the current form; although not strictly necessary for this example, it can be useful for scaling the parser to handle multiple forms, and for detecting errors in the HTML when the forms are improperly nested. Lastly, $f either contains "f:" or is an empty string. As discussed previously, this is included so that we can easily turn namespaces on and off without changing more than one part of the code.
B: To determine the submission behavior, HTML forms use two attributes: enctype and method. However, XForms only uses one attribute -- method -- to accomplish this. The appropriate mapping is defined here. Using a series of if/else statements, we can assign the appropriate value to $method. For the sake of simplicity, error handling is omitted; however, it's worth noting that there's an opportunity here to throw an exception if the HTML data is incomplete: e.g., enctype should be set if method="post".
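The mapping, together with the error handling hinted at above, could be pulled out into a small function. The sketch below is one possible shape, not the article's code. The function name xformsMethod(), the lower-casing of the attribute values, the fallback to "get" when method is absent (HTML's default), and the choice of InvalidArgumentException are all assumptions.

```php
<?php
// Hypothetical helper mirroring the if/else chain in tagOpen().
// $attrs uses the upper-case keys that the parser produces.
function xformsMethod(array $attrs)
{
    $method  = isset($attrs["METHOD"]) ? strtolower($attrs["METHOD"]) : "get";
    $enctype = isset($attrs["ENCTYPE"]) ? strtolower($attrs["ENCTYPE"]) : null;

    if ($enctype === null) {
        if ($method === "post") {
            // The incomplete-HTML case mentioned in the text
            throw new InvalidArgumentException('method="post" requires an enctype');
        }
        return $method;
    }
    if ($enctype === "application/x-www-form-urlencoded") {
        return "urlencoded-post";
    }
    if ($enctype === "multipart/form-data") {
        return "form-data-post";
    }
    throw new InvalidArgumentException("Unsupported enctype: $enctype");
}

echo xformsMethod(array("METHOD" => "get")), "\n";
echo xformsMethod(array("METHOD" => "post", "ENCTYPE" => "multipart/form-data")), "\n";
```

As in the original chain, the enctype decides the result whenever it is present, regardless of the method value. A stricter translator might also check that the two attributes agree.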
C: Although HTML form elements can have an id attribute, we have chosen to populate the id attribute of the created XForms submission from the HTML form's name attribute (instead of its id attribute). The reason is that name is more commonly used as a unique identifier for an HTML form than id. Finally, note that all the data in the <form> tag is stored in the <head> tag of the output XHTML. For the body, we use a <div> element to replace the <form> element as a container for all child tags and the form contents. If, for example, there was style information associated with the <form> tag, we could easily redefine the CSS so that it refers to the new <div> tag instead.
D: This is where we extract all the info from a form's <input> tag. Note that HTML forms have multiple input types, so we need another switch/case control that selects a case on the value of the type attribute. Because our sample form has only two input types, we define only two cases for now.
As you would expect, most of the work is done by tagOpen(). Here are tagClosed() and tagContent(); again, the additions are the second global line and the new case branches:
    function tagClosed($parser, $name) {
        global $outbody, $curtags, $sctag;
        global $outhead, $curformid, $f;
        switch ($name) {
            /*A*/
            case "INPUT":
                //do nothing
                break;
            case "FORM":
                $curformid = "";
                $outbody .= "</div>";
                break;
            default:
                if ($sctag) //self-closing tag
                    $outbody = substr($outbody, 0, -1) . "/>";
                else
                    $outbody .= "</$name>";
                break;
        }
        // A parent that has just closed a child cannot itself be self-closing
        $sctag = false;
        array_shift($curtags);
    }

    function tagContent($parser, $data) {
        global $outbody, $curtags, $sctag;
        global $outhead, $curformid, $f;
        switch ($curtags[0]) {
            /*B*/
            default:
                $sctag = false; // the original listing was missing this semicolon
                $outbody .= $data;
                break;
        }
    }
A: We have added the cases for both the <input> and <form> tags. It's important to add a case for the <input> tag, even though no code is executed, so that it's not treated as the default case. The reason is that we have already added the XForms closing tags in tagOpen() for the HTML input tag, so no further tags need to be added at this point. The logic for handling the closing of the <form> tag is also straightforward -- we just close the <div> tag that was opened when we handled the start of the <form> tag in tagOpen().
B: For the tags we added so far (<form> and <input>), we don't need to add any cases in the tagContent() function. However, we will need to do so when we include support for tags such as <option> (nested in a <select> tag).
You can add further HTML form support using a similar approach -- just add the cases in the switch statements. Note the nested switch statement in the tagOpen() function: this will eventually have the most cases because most form tags are <input> tags, and there will be one case for every possible value of the type attribute. Here is a useful table that you can use as a translation guide. It shows you the XForms element that each HTML form element should be mapped to.
Now that we have some basic functionality, we can move on to Phase 3, which completes the operation of the parser.
Phase 3: Finalizing the Output
As you've hopefully noticed, we haven't actually called the function that runs the parser. This is what we do next:
    /*A*/
    if (!xml_parse($parser, $input["body"], true)) {
        $error = xml_error_string(xml_get_error_code($parser));
        $line = xml_get_current_line_number($parser);
        die("HTML error: " . $error . " , line " . $line);
    }
    xml_parser_free($parser);
    /*B*/
    $outhead = "<$f"."model>".$outhead."</$f"."model>";
    $finaloutput = $input["top"].$input["head"].$outhead.$input["middle"].$outbody.$input["bottom"];
    if (NSPACES_ON)
        $finaloutput = str_replace("<html", "<html xmlns:f='http://www.w3.org/2002/xforms'", $finaloutput);
    /*C*/
    $outfile = "output.html";
    $fh = fopen($outfile, "w");
    if (!fwrite($fh, $finaloutput))
        die("Failed to write to file.");
    fclose($fh); // release the file handle once the write has succeeded
A: xml_parse() is what puts the gears in motion -- we pass the variable representing the parser and the input that we wish to parse as the first two parameters, respectively. The third parameter tells the parser whether this is the final piece of input: it is set to false while the input is being passed in smaller chunks (this is done when the input is very large and a lot of processing is required). In our case, we will be parsing the input in one go, so we set the third parameter to true. If xml_parse() returns a false value, it has encountered an error and is unable to finish parsing the input. When this happens, we use the xml_get_error_code() function to find out what happened (xml_error_string() turns the code into a readable message), and xml_get_current_line_number() to find out where it happened. The final parser-related function, xml_parser_free(), removes the parser resource from memory. This is only done once we're finished with the parser entirely.
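The error path is easy to try in isolation. The sketch below feeds the parser a deliberately malformed fragment (the sample string is an assumption, reusing the bare disabled attribute from Phase 1) and prints a message in the same format as the die() call above. The exact wording of the message depends on the XML library PHP was built with.

```php
<?php
$parser = xml_parser_create();

// A bare attribute (no value) is not well-formed XML, so parsing should fail.
$bad = "<body>\n<input type=\"text\" disabled />\n</body>";

if (!xml_parse($parser, $bad, true)) {
    $code = xml_get_error_code($parser);
    printf(
        "HTML error: %s , line %d\n",
        xml_error_string($code),
        xml_get_current_line_number($parser)
    );
} else {
    echo "Parsed without errors\n";
}
```

Running the same fragment with disabled="" in place of the bare attribute should take the other branch, which is what the Phase 1 fix-up is for.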
B: As previously mentioned, $outhead contains all the XForms elements that need to be added to the <head> tag of the output XHTML. However, before we do this, we encase all of this in a <model> tag to indicate that they are XForms tags. Now, we stick the file segments back together (as shown in Figure 1) and store the end result in $finaloutput. Before storing our result in a file, we add the namespace declaration to the <html> tag, using str_replace(). This function was explained when we used it in Phase 1.
C: Now that we have our translated form, we need to put it into a file. fopen() returns a file handle, which tells PHP that we will be doing something with a file. In this case, we will be writing to it, so we pass "w" as the second parameter (the first parameter is the name of the output file). The function that does the actual file writing is fwrite() -- we pass the file handle we opened earlier, along with the data we wish to write. We produce an error message if the write fails.
At this point, we have consolidated the translated XHTML and written it to an output file. This marks the end of Phase 3, and the completion of the parser.
Scaling the Parser
What has been provided here are the rudimentary building blocks for a complete HTML to XForms translator. As explained earlier, the parser can be easily scaled to handle all possible HTML form elements and translate them into XForms. In addition to the introductory XForms article mentioned earlier, you may find this link useful: XForms for HTML Authors. It explains in detail how to use XForms to provide all features available with HTML forms.
All the files that were discussed (including the main translator) are available below: