Euro-XML
September 18, 2002
The new European currency, the euro, has a symbol € in Unicode 3.2 as character U+20AC. How can we use it with XML?
There are three ways of representing the euro in XML:
- numeric character references,
- character entity references, and
- direct characters.
This article examines these and other more arcane but important ramifications.
Numeric Character References
You can enter the Euro character as data in element content or attribute values using numeric character references in any XML document: hexadecimal &#x20AC; or decimal &#8364;. This character is allowed both in XML 1.0 and in the proposed XML 1.1.
Numeric character references will not be recognized in CDATA marked sections and cannot be used in XML names, such as element names, attribute names and IDs.
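As a minimal sketch of the rules above, the following uses Python's standard xml.etree.ElementTree parser (an assumption for illustration; the article does not name a parser) to show that the hexadecimal and decimal references both yield U+20AC, while the same text inside a CDATA section is left untouched.

```python
import xml.etree.ElementTree as ET

# The same character written as a decimal reference (attribute value)
# and as a hexadecimal reference (element content).
doc = '<price currency="&#8364;">&#x20AC; 9.99</price>'
root = ET.fromstring(doc)

print(ascii(root.text))               # '\u20ac 9.99'
print(ascii(root.get("currency")))    # '\u20ac'

# Inside a CDATA section the reference is not recognized: it stays literal text.
cdata = ET.fromstring("<price><![CDATA[&#x20AC;]]></price>")
print(cdata.text)                     # &#x20AC;
```

The ascii() calls only keep the output printable on consoles that cannot display the euro themselves.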
Character Entity References
A friendlier alternative is to use the standard entity &euro;. This can be used in the same places that you can use numeric character references.
An entity must have a declaration. The most failsafe approach is to supply your own: make sure your document has a DOCTYPE declaration with the following declaration as part of its internal subset.
<!ENTITY euro "&#x20AC;">
The internal subset is the thing between the brackets in many DOCTYPE declarations:
<!DOCTYPE ... [ <!-- internal subset --> ]>
If you are using XHTML or HTML you are in luck: there is already a declaration provided for the euro as part of the HTML Special entity set. If you wish to use those entity declarations, include the following markup declarations:
<!ENTITY % HTMLspecial PUBLIC "-//W3C//ENTITIES Special//EN//HTML"> %HTMLspecial;
Earlier this year, ISO JTC1 SC34 decided to add the Euro to the ISONum public entity set, with the same definition as HTML's. The updated version has not been released, and this will not be dependably available for some time.
If you are using the &euro; form in XML other than XHTML, you should provide your own definition. It is not an error to have an entity defined multiple times; the first declaration found is used in preference to subsequent ones. Because there is no disagreement about which Unicode character should be used, there should be no harm in putting the entity declaration at the end of the internal subset.
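The declaration rules above can be sketched with Python's standard xml.etree.ElementTree parser (an assumption for illustration, as is the element name price). The document declares the euro entity twice in its internal subset; the first declaration is the one that takes effect.

```python
import xml.etree.ElementTree as ET

doc = """<!DOCTYPE price [
  <!ENTITY euro "&#x20AC;">
  <!-- A later declaration of the same entity is not an error; it is ignored. -->
  <!ENTITY euro "EUR">
]>
<price>&euro; 9.99</price>"""

root = ET.fromstring(doc)
print(ascii(root.text))   # '\u20ac 9.99'
```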
Direct Characters
Third, if you are using UTF-8 or UTF-16, then you can enter the character directly. Your GUI may provide a mechanism, and editors aimed at publishing will also provide some mechanism.
In Adobe® FrameMaker®, for example, you hold Alt down and type 0128 on the keypad. In my Topologi™ Collaborative Markup Editor, you can enter the character by hex number or use the Keyboard>Currency menu.
The character type of modern programming languages such as Java, C#, Python and recently Perl is Unicode, typically in the UTF-16 encoding, which represents most characters as a single 16-bit code unit.
When a Unicode character greater than U+FFFF is needed, two UTF-16 code units are used, using a mechanism called surrogates, which complicates the simple expectation that one code should equal one character if you have 16-bit characters. For more information on characters and encodings, there are two excellent books: Ken Lunde's CJKV Information Processing (O'Reilly), which concentrates on East Asian encoding issues, and Tony Graham's Unicode: a Primer (IDG), which has useful information on Unicode in particular.
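To make the encoding arithmetic concrete, here is a small Python sketch. The helper utf16_units and the choice of U+1D11E (MUSICAL SYMBOL G CLEF) as the example above U+FFFF are illustrative assumptions, not from the article. The euro fits in one 16-bit unit, while the clef needs a surrogate pair.

```python
euro = "\u20ac"        # EURO SIGN, inside the 16-bit range
clef = "\U0001d11e"    # MUSICAL SYMBOL G CLEF, above U+FFFF

def utf16_units(text):
    """Return the 16-bit code units of text in UTF-16 (big-endian, no BOM)."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

print(euro.encode("utf-8").hex())             # e282ac (three bytes in UTF-8)
print([hex(u) for u in utf16_units(euro)])    # ['0x20ac']
print([hex(u) for u in utf16_units(clef)])    # ['0xd834', '0xdd1e']
```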
Goodbye ISO 8859-1?
However, the most common way that the Euro will be used will be as part of an XML document encoded using your system's local or regional character set. And this is where the Euro will complicate our lives in XML.
Web developers familiar with HTML will typically choose to use ISO 8859-1 (Latin 1) as the encoding for Western European documents: indeed, it is the default for HTML.
The problem? ISO 8859-1 does not have the euro character in it. Instead, the developers of the ISO 8859 series have issued ISO 8859-15 (Latin 9), which both adds the euro and replaces some unfortunate and rarely-used characters in 8859-1 with letters for better support of French and Finnish. In particular, the euro takes over the 0xA4 code point used as the generic CURRENCY SIGN in ISO 8859-1.
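The difference between the two character sets can be checked at the byte level. This is a sketch using Python's built-in codecs, with Python's own codec labels; the variable names are illustrative.

```python
euro = "\u20ac"

# ISO 8859-15 stores the euro in the single byte 0xA4.
print(euro.encode("iso-8859-15"))              # b'\xa4'

# The same byte read as ISO 8859-1 is the generic CURRENCY SIGN, U+00A4.
print(ascii(b"\xa4".decode("iso-8859-1")))     # '\xa4'

# ISO 8859-1 has no euro at all, so encoding it fails.
try:
    euro.encode("iso-8859-1")
    latin1_has_euro = True
except UnicodeEncodeError:
    latin1_has_euro = False
print(latin1_has_euro)                         # False
```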
Even more confusingly, this set, which is officially Latin 9, is now being called Latin 0, especially in Linux circles; probably a fitting brand name.
Character encodings are registered with IANA. Here is the registration:
Name: ISO-8859-15
Alias: iso-ir-203
Alias: iso-8859-15 (preferred MIME name)
Alias: latin9
Alias: latin0
Alias: csISOLatin15
So if you are using Latin 0 with XML, the preferred XML header is
<?xml version="1.0" encoding="iso-8859-15"?>
When sending XML documents as text over HTTP, use
Content-Type: text/xml;charset=iso-8859-15
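As a rough illustration of what the encoding declaration does, the sketch below serializes a small document in ISO 8859-15 with Python's xml.etree.ElementTree and parses the bytes back. The library and the element name price are assumptions for the example.

```python
import xml.etree.ElementTree as ET

# Serialize in ISO 8859-15; ElementTree writes the XML declaration itself.
root = ET.Element("price")
root.text = "\u20ac 9.99"
data = ET.tostring(root, encoding="iso-8859-15")

print(data)
print(b"\xa4 9.99" in data)    # True: the euro travels as the single byte 0xA4

# Parsing the bytes back relies on the encoding named in the declaration.
parsed = ET.fromstring(data)
print(ascii(parsed.text))      # '\u20ac 9.99'
```

If the declaration named ISO 8859-1 instead, the same byte would come back as the generic currency sign rather than the euro.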
The Case of the Missing On-Screen Character
This is a trap for new and old players: you look at the document on your screen and the character is not there. The obvious conclusion: the euro has been stripped out during some import or processing. Not so fast. While it may be that the euro was deleted during import (many transcoding systems just discard characters they don't know what to do with), the more likely explanation is that the current font does not have the euro character. The place in your document where the euro character is expected may be empty, or perhaps have some other glyph (picture) showing (a square box, for example).
Since Win98 and MacOS 8.5, operating systems, transcoders, fonts and applications have had time to become euro-friendly in preparation for 2002; check that your systems have been updated.
Until last month, Microsoft had euro-friendly versions of their Core Web Fonts available at http://www.microsoft.com/typography. This was a great way for older versions of Windows to keep abreast. Microsoft have removed this now, but the independent Corefonts project has been set up to redistribute the fonts under the license originally granted by Microsoft, offering support for Windows and Linux systems.
Linux systems are a little fiddly with respect to fonts. You may have to check that you have Latin 0 fonts. Try the utility |
Windows Code Pages
On the Windows side, the most common Windows code page for Western documents is CP1252, sometimes aliased as ANSI. (If you are at a party and wish to avoid standards-people, just loudly talk about the ANSI code page and watch whose nostrils twitch.) CP1252 is a superset of ISO 8859-1.
The euro character has been introduced as 0x80 in most Microsoft code pages for Europe:
- CP1252 (Western Europe),
- CP1250 (Eastern Europe),
- CP1253 (Greek),
- CP1254 (Turkish),
- CP1257 (Baltic).
An exception is the Cyrillic code page CP1251, where it is at code point 0x88.
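The byte positions can be checked with Python's built-in codecs. This is a sketch: the codec labels are Python's names for these code pages, cp1251 is Python's name for the Cyrillic one, and the dictionary name is illustrative.

```python
euro = "\u20ac"

# Python codec names for several of the Windows code pages discussed above.
codecs_to_check = ("cp1250", "cp1252", "cp1253", "cp1254", "cp1251")
euro_byte = {name: euro.encode(name) for name in codecs_to_check}

for name, raw in euro_byte.items():
    print(name, raw.hex())
# cp1250 80
# cp1252 80
# cp1253 80
# cp1254 80
# cp1251 88
```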
In Unicode, the character at U+0080 is reserved for control characters, to be determined by the application, but suggested as the ISO 6429 C1 set by default. (The C1 controls are the characters in an 8-bit set between 0x80 and 0x9F, reserved for control functions.)
In any case, neither Unicode nor ISO 6429 specifies a character for the control code 0x80, so by any criteria if you find a character at 0x80 in your Unicode data there is something fishy going on. The most likely explanation is that someone has used the new CP1252 but has mislabeled it as ISO 8859-1 or ISO 8859-15.
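The mislabeling scenario is easy to reproduce. In this sketch, which uses Python's built-in codecs with an illustrative price string, a euro saved as CP1252 and then decoded as ISO 8859-1 turns into the control character U+0080 without any complaint.

```python
# Bytes as saved by an editor using Windows code page 1252.
data = "\u20ac 9.99".encode("cp1252")
print(data)                               # b'\x80 9.99'

# Decoded under the wrong label, the euro silently becomes the C1 control U+0080.
mislabeled = data.decode("iso-8859-1")
print(ascii(mislabeled))                  # '\x80 9.99'

# Decoded under the right label, the euro comes back.
correct = data.decode("cp1252")
print(ascii(correct))                     # '\u20ac 9.99'
```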
Developers should note that they cannot rely on transcoding software to catch the error where there is a 0x80 in data labeled ISO 8859-n; even though modern APIs such as Java 1.4's transcoders will generate exceptions when a bad encoding is detected, 0x80 is a legitimate (though unused) code in ISO 8859-n encodings and so will probably not generate an error. I note that at least some versions of MSXML 4 do the right thing and complain, but in XML 1.0 the behavior has been underspecified and largely up to the skills and expectations of the programmers creating XML parsers.
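Since the transcoder will not complain, an application can look for the suspicious characters itself. The following Python sketch is one possible approach, not an established API: find_c1_controls and reinterpret_as_cp1252 are hypothetical helper names, and the repair step assumes the data really was CP1252 labeled as ISO 8859-1.

```python
def find_c1_controls(text):
    """Return (index, code point) pairs for C1 controls U+0080..U+009F in text."""
    return [(i, ord(ch)) for i, ch in enumerate(text) if 0x80 <= ord(ch) <= 0x9F]

def reinterpret_as_cp1252(text):
    """Undo an ISO 8859-1 mislabel, assuming the bytes were really CP1252."""
    return text.encode("iso-8859-1").decode("cp1252")

suspect = b"Total: \x80 9.99".decode("iso-8859-1")   # decodes without any error
hits = find_c1_controls(suspect)
print(hits)                                          # [(7, 128)]
if hits:
    repaired = reinterpret_as_cp1252(suspect)
    print(ascii(repaired))                           # 'Total: \u20ac 9.99'
```

The repair helper raises an exception if the text contains characters outside ISO 8859-1 or one of the few bytes that CP1252 leaves undefined, so a real application would want to handle that case.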
How could XML 1.1 help?
There has been some discussion recently of how to treat those control characters in XML 1.1. In order to catch as many encoding mislabeling problems as possible, it would be best to ban the C1 characters outright; but that would leave a range of characters that can appear in a DOM but not be interchanged.
The best compromise between the two conflicting requirements of interchange and error-detection will be to say that control characters, except for the various whitespace characters like CR, NL, SPACE, TAB and NEL, must be serialized in XML 1.1 as numeric character references. (The NEL character |
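A serializer following the compromise described above might look roughly like this Python sketch. The function name escape_c1_controls is hypothetical, and the sketch covers only the C1 range U+0080 to U+009F, leaving NEL (U+0085) as a literal character.

```python
def escape_c1_controls(text):
    """Write C1 controls as numeric character references, leaving NEL (U+0085) alone."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x80 <= cp <= 0x9F and cp != 0x85:
            out.append("&#x%X;" % cp)   # e.g. U+0080 becomes &#x80;
        else:
            out.append(ch)
    return "".join(out)

print(escape_c1_controls("\x80 9.99"))           # &#x80; 9.99
print(ascii(escape_c1_controls("\u20ac 9.99")))  # '\u20ac 9.99' (a real euro is untouched)
```

With such a rule, a stray raw 0x80 byte in a document can only be an encoding mistake, which is what makes the error detectable.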
Until recently most people in Europe could get away with editing their documents with a CP1252 editor: the extra characters are not that common.
But now XML developers in the West join their East Asian colleagues in being able to recognize encoding-related problems.
Hints
If you are using CP1252, you may find some problems: older transcoders may not know the correct name, and transcoding software may not understand the correct alias. The official IANA name seems to be windows-1252 but the Java transcoder uses the name cp1252.
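Naming differences of this kind can be probed before relying on a label. As a sketch, Python's codecs.lookup resolves an alias to the name its own transcoder uses; the helper resolve and the bogus label no-such-charset are illustrative assumptions, and other platforms will report different canonical names.

```python
import codecs

def resolve(label):
    """Return the canonical codec name for label, or None if it is unknown."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None

for label in ("windows-1252", "cp1252", "iso-8859-15", "latin9", "no-such-charset"):
    print(label, "->", resolve(label))
# windows-1252 -> cp1252
# cp1252 -> cp1252
# iso-8859-15 -> iso8859-15
# latin9 -> iso8859-15
# no-such-charset -> None
```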
If you are using Windows, there is usually an excellent Character Map utility available in the Accessories menu: look through each font to find one which has the euro glyph available, then use that font in the application in which you are trying to view the XML document. Other operating systems have similar utilities available.