XML, the eXtensible Markup Language, has been attracting a lot of attention recently. This article provides a 5-minute overview of XML, and explains why it matters to you.
The Standard Generalized Markup Language (SGML) is about two decades old. SGML was originally designed for processing large documentation sets, but it is neither a programming language nor a text formatting language. Instead, it's a meta-language that allows defining customized markup languages. The most famous SGML-based language today is unquestionably HTML.
Because SGML has been around for two decades, many companies offer SGML tools and products, and it's firmly entrenched in many high-end document processing applications. SGML is quite a large language; however, understanding the basics isn't very difficult. It does contain many rarely used features which are harder to understand. Implementing a full SGML parser is difficult, and this has given SGML a reputation for fearsome complexity. This reputation isn't truly deserved, but it's been enough to scare many people away from using it.
XML, then, is a stripped-down version of SGML that sacrifices some power in return for easier understanding and implementation. It's still a meta-language, but many of SGML's lesser-used features and options have been dropped. The XML 1.0 specification is about 40 pages long, and a parser can be implemented with a few weeks of effort.
A mark-up language specified using XML looks a lot like HTML:
<?xml version="1.0"?>
<!DOCTYPE myth SYSTEM "myth.dtd">
<myth>
  <name lang='latin'>Hercules</name>
  <name lang='greek'>Herakles</name>
  <description>Son of Zeus and Alcmena.</description>
  <mortal/>
</myth>
An XML document consists of a single element containing
sub-elements which can have further sub-elements inside them.
Elements are indicated by tags in the text, consisting of text within
angle brackets <...>. Two forms of elements are available. An
element may contain content between opening and closing tags, as in
<name>Hercules</name>, which is a name element containing the data
"Hercules". This content may be text data, other XML elements, or a
mixture of text and elements. Elements can also be empty, in which
case they're represented as a single tag ending with a slash, as in
<mortal/>, which is an empty mortal element. This is different from
HTML, where empty elements such as <BR> or <IMG> aren't indicated
differently from a non-empty element such as <H1>. Also unlike HTML,
XML element names are case-sensitive; mortal and Mortal are two
different element types.
Opening and empty tags can also contain attributes, which specify
values associated with an element. For example, in text such as
<name lang='greek'>Herakles</name>, the name element has a lang
attribute with a value of "greek". In
<name lang='latin'>Hercules</name>, the attribute's value is "latin".
Another difference from HTML is that quotation marks around an
attribute's value are not optional.
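
To make the terms above concrete, here is a minimal sketch that reads the sample document with Python's standard xml.etree.ElementTree module. The choice of module is mine, not something the article depends on, and I've dropped the DOCTYPE line so that no myth.dtd file is needed; the variable names are only for illustration.

```python
import xml.etree.ElementTree as ET

# The sample document from this article, minus the DOCTYPE line so that
# no external myth.dtd file is needed (an assumption of this sketch).
DOCUMENT = """<?xml version="1.0"?>
<myth>
  <name lang='latin'>Hercules</name>
  <name lang='greek'>Herakles</name>
  <description>Son of Zeus and Alcmena.</description>
  <mortal/>
</myth>"""

root = ET.fromstring(DOCUMENT)

# Each <name> element carries text content and a lang attribute.
names = {el.get("lang"): el.text for el in root.findall("name")}

# <mortal/> is an empty element: it has no text and no child elements.
mortal = root.find("mortal")
mortal_is_empty = mortal.text is None and len(mortal) == 0

# Element names are case-sensitive, so looking for "Mortal" finds nothing.
wrong_case = root.find("Mortal")

print(root.tag, names, mortal_is_empty, wrong_case)
```

The single root element is myth, the two name elements are told apart by their lang attribute, and the lookup for Mortal comes back empty because the name differs in case.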
The rules for a given XML application are specified with a Document
Type Definition, or DTD. The DTD carefully lists the allowed element names
and how elements can be nested inside each other. For example, to make
a comparison with HTML, the LI element, representing an entry in a
list, can only occur inside certain elements which represent lists,
such as OL or UL. The DTD also specifies the attributes which can be
defined for each element, their default values, and whether they can
be omitted.
The document type definition is specified in the DOCTYPE declaration; the above document uses a DTD called "mythology" that I invented for this article. The "mythology" DTD might contain the following declarations:
<!ELEMENT myth (name+, description, mortal?)>
<!ELEMENT name (#PCDATA)>
<!ATTLIST name lang ( latin | greek ) "latin">
<!ELEMENT description (#PCDATA)>
<!ELEMENT mortal EMPTY>
I won't go into every detail of these lines here, but will draw your
attention to the most significant features. The lines beginning with
<!ELEMENT are element declarations. They declare the element's name
and what it can contain. So, the myth element must contain one or more
name elements, followed by a single description element, followed by
an optional mortal element. (+, *, and ? have the same meanings as in
regular expressions: one or more, zero or more, and zero or one
occurrences.) The mortal element, on the other hand, must always be
empty.
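
Since the occurrence markers mean the same thing as in regular expressions, the content model for myth can be mimicked with an ordinary regular expression over the names of the child elements. The following is only a rough sketch under that assumption: check_myth and MYTH_MODEL are names I made up, the pattern was translated from the DTD by hand, and the function ignores attributes and stray text, so it is nowhere near a real validating parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical helper: the content model (name+, description, mortal?)
# rewritten by hand as a regular expression over child element names.
# Each name is followed by one space so that names cannot run together.
MYTH_MODEL = re.compile(r"(name )+description (mortal )?")

def check_myth(xml_text):
    """Return a list of problems found; an empty list means none found."""
    root = ET.fromstring(xml_text)
    problems = []
    # Join the child names into one string, e.g. "name name description ".
    children = "".join(child.tag + " " for child in root)
    if not MYTH_MODEL.fullmatch(children):
        problems.append("children do not match (name+, description, mortal?)")
    # The mortal element is declared EMPTY: no text, no child elements.
    mortal = root.find("mortal")
    if mortal is not None and (mortal.text or len(mortal)):
        problems.append("mortal must be empty")
    return problems

good = ("<myth><name lang='greek'>Herakles</name>"
        "<description>Hero.</description><mortal/></myth>")
bad = "<myth><description>Hero.</description><name>Hercules</name></myth>"

print(check_myth(good))
print(check_myth(bad))
```

The first document yields an empty list, while the second is reported because its description comes before its name.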
The third line, beginning with <!ATTLIST, declares the name element to
have an attribute named lang; this attribute can have either of the
two values "latin" or "greek" and defaults to "latin" if it's not
specified.
A validating parser can be given a DTD and a document in order to verify that a given document is valid, i.e., it follows all the DTD's rules. This is quite different from HTML, since Web browsers have historically had very forgiving parsers, and so relatively few people make any effort to write valid HTML. This looseness means that code to render HTML text must be full of hacks and special cases; it's to be hoped that XML doesn't fall into the same trap of leniency.
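
Even without a DTD, an XML parser is far less forgiving than a Web browser. The sketch below uses Python's standard xml.etree.ElementTree module, which is a non-validating parser, so it only checks that a document is well-formed and not that it follows a DTD; is_well_formed is a name I made up for the illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

samples = [
    "<name lang='greek'>Herakles</name>",   # fine
    "<name lang=greek>Herakles</name>",     # unquoted attribute value
    "<name>Herakles</Name>",                # closing tag differs in case
    "<myth><mortal></myth>",                # mortal is never closed
]
for sample in samples:
    print(is_well_formed(sample), sample)
```

Only the first sample is accepted. The other three are slips that an HTML browser would typically shrug off, but here each one stops the parse with an error.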
This article doesn't cover all of XML's features -- I haven't discussed all the possible attribute types, what entities are, or XML's use of Unicode, which enables XML processors to handle data written in practically any human alphabet. For the full details of XML's syntax, the one definitive source is the XML 1.0 specification, available on the Web at the World Wide Web Consortium's XML page (see Resources). However, like all specifications, it's quite formal and isn't intended to be a friendly introduction or a tutorial. Gentler introductions are beginning to appear on the Web and on bookstore shelves.
XML will most likely become very common over the next few years. Many new Web-related data formats are being drafted as XML DTDs; three examples are MathML for specifying mathematical equations, RDF (Resource Description Framework) for classifying and describing information resources, and SMIL for synchronized multimedia. There are also individual efforts to define DTDs for all sorts of applications, including genealogical data, electronic data interchange, vector graphics, you name it; the list is growing all the time.
XML isn't primarily a competitor to HTML. The World Wide Web Consortium is planning to base the next generation of HTML on XML, but HTML as it currently stands isn't going to disappear any time soon. Many people have already learned HTML and are happily using it; they don't particularly want or need the ability to create new mark-up languages. There are millions of existing HTML documents now on the Web, and converting them to XML would take a lot of time and effort; many documents may never be converted.
But XML is going to be very significant, and XML support will be very common. The next versions of the Mozilla and Internet Explorer browsers will each support XML, and will use it internally in various ways. More and more new data formats will be written as XML DTDs. The argument driving this is simple laziness: if XML is available on every platform, and if its capabilities are suited to the task, then using XML will save time with little effort, which is always a persuasive argument.
In addition, XML will be easily accessible to programmers. James
Clark's Expat parser (see Resources) is high-quality code and is
freely available under the terms of the Mozilla Public License. I
wouldn't be surprised to see future Linux distributions coming with
Expat as part of the base system. Interfaces to Expat for scripting
languages such as Python, Perl, and Tcl are already in development,
and will probably have been finished by the time you read this. Soon,
adding XML parsing to a program will be as easy as adding
from xml import parsers or use XML::Parser to your code.
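
As a rough sketch of what such an interface can look like, here is the sample data run through the Expat binding in today's Python standard library. Note that the module is xml.parsers.expat, which is not quite the import guessed at above; the handler functions and the seen list are names I chose for the illustration.

```python
import xml.parsers.expat

# Expat is event-driven: it calls back into your code as it meets each tag.
seen = []

def start_element(name, attrs):
    # attrs maps attribute names to their values for this opening tag.
    seen.append(("start", name, dict(attrs)))

def end_element(name):
    seen.append(("end", name))

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
parser.EndElementHandler = end_element

# The second argument tells Expat that this is the final chunk of input.
parser.Parse("<myth><name lang='greek'>Herakles</name><mortal/></myth>", True)

for event in seen:
    print(event)
```

The empty mortal element produces both a start and an end event, so the program can treat <mortal/> and <mortal></mortal> in exactly the same way.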