Schema Standards

A DTD makes it possible to validate the structure of relatively simple XML documents, but that's as far as it goes.

A DTD can't restrict the content of elements, and it can't specify complex relationships. For example, it is impossible to specify that a <heading> for a <book> must have both a <title> and an <author>, whereas a <heading> for a <chapter> needs only a <title>. In a DTD, you get to specify the structure of the <heading> element only one time. There is no context sensitivity, because a DTD specification is not hierarchical.

For example, for a mailing address that contains several parsed character data (PCDATA) elements, the DTD might look something like this:

<!ELEMENT mailAddress (name, address, zipcode)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT zipcode (#PCDATA)>

As you can see, the specifications are linear. So if you need another "name" element in the DTD, you need a different identifier for it. You could not simply call it "name" without conflicting with the <name> element defined for use in a <mailAddress>.

Another problem with the nonhierarchical nature of DTD specifications is that it is not clear what the comments are meant to explain. A comment at the top might be intended to apply to the whole structure, or it might be intended only for the first item. In addition, DTDs do not allow you to formally specify field-validation criteria, such as the 5-digit (or 5 and 4) limitation for the zipcode field.
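To make the zipcode limitation concrete, here is a minimal Python sketch of the kind of check a DTD cannot express and an application would have to perform itself. The pattern, the hyphen between the 5-digit and 4-digit parts, and the function name is_valid_zipcode are assumptions made for illustration.

```python
import re

# Assumed format: five digits, optionally followed by a hyphen and four digits.
ZIP_PATTERN = re.compile(r"[0-9]{5}(-[0-9]{4})?")


def is_valid_zipcode(text: str) -> bool:
    """Return True if text is a 5-digit or 5-plus-4 zipcode."""
    # fullmatch requires the whole string to match, not just a prefix.
    return ZIP_PATTERN.fullmatch(text) is not None


print(is_valid_zipcode("94043"))       # True
print(is_valid_zipcode("94043-1351"))  # True
print(is_valid_zipcode("9404"))        # False
```

With a DTD, a check like this lives in application code, separate from the document's structural definition.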

Finally, a DTD uses syntax that is substantially different from that of XML, so it can't be processed by using a standard XML parser. This means that you can't, for example, read a DTD into a DOM, modify it, and then write it back out again.
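The following sketch illustrates that point using Python's standard library DOM parser: the mailAddress DTD shown earlier is rejected as not well-formed XML, while an ordinary document parses. The helper name parses_as_xml and the sample document are illustrative assumptions.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# The DTD declarations from the mailing-address example.
DTD_TEXT = """<!ELEMENT mailAddress (name, address, zipcode)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT zipcode (#PCDATA)>"""


def parses_as_xml(text: str) -> bool:
    """Return True if text can be loaded into a DOM as an XML document."""
    try:
        minidom.parseString(text)
        return True
    except ExpatError:
        # The parser reports bare DTD declarations as a syntax error.
        return False


print(parses_as_xml(DTD_TEXT))                                      # False
print(parses_as_xml("<mailAddress><name>Ann</name></mailAddress>"))  # True
```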

To remedy these shortcomings, a number of standards have arisen that define a more database-like, hierarchical schema that specifies validation criteria. The major proposals are discussed in the following sections.

XML Schema

XML Schema is a large, complex standard that has two parts. One part specifies structure relationships. (This is the largest and most complex part.) The other part specifies mechanisms for validating the content of XML elements by specifying a (potentially very sophisticated) data type for each element. The good news is that XML Schema for Structures lets you specify virtually any relationship you can imagine. The bad news is that it is very difficult to implement, and it's hard to learn. Most of the alternatives provide simpler structure definitions while incorporating XML Schema's data-typing mechanisms.
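As a rough illustration of the data-typing side, the sketch below embeds a small schema fragment that restricts a zipcode element with a pattern facet. Because the schema is itself XML, a standard XML parser can read it. The schema fragment, the pattern, and the helper matches_zip are assumptions for illustration. This is not a schema validator: it only reads the pattern out of the schema and applies it with Python's re module, which happens to accept this particular pattern but does not implement the XML Schema regular-expression syntax in general.

```python
import re
import xml.etree.ElementTree as ET

NS = {"xs": "http://www.w3.org/2001/XMLSchema"}

# Hypothetical schema fragment: zipcode is a string restricted by a pattern.
SCHEMA_TEXT = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="zipcode">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:pattern value="[0-9]{5}(-[0-9]{4})?"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
</xs:schema>"""

# Unlike a DTD, the schema loads with an ordinary XML parser.
schema = ET.fromstring(SCHEMA_TEXT)
zip_element = schema.find("xs:element[@name='zipcode']", NS)
zip_pattern = zip_element.find(".//xs:pattern", NS).get("value")


def matches_zip(text: str) -> bool:
    """Apply the pattern facet; schema patterns must match the whole value."""
    return re.fullmatch(zip_pattern, text) is not None


print(zip_pattern)                # [0-9]{5}(-[0-9]{4})?
print(matches_zip("94043-1351"))  # True
print(matches_zip("9404"))        # False
```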

For more information on XML Schema, see the W3C specs XML Schema (Structures) and XML Schema (Data Types), as well as other information accessible at http://www.w3.org/XML/Schema.

RELAX NG

Simpler than XML Schema for Structures, Regular Language Description for XML (Next Generation) is an emerging standard under the auspices of OASIS (Organization for the Advancement of Structured Information Standards). It may also become an ISO standard in the near future.

RELAX NG uses regular-expression patterns to express constraints on structure relationships, and it uses XML Schema data-typing mechanisms to express content constraints. This standard also uses XML syntax, and it includes a DTD-to-RELAX converter. It is "next generation" because it is a newer version of the RELAX schema mechanism, one that integrated TREX (Tree Regular Expressions for XML), a means of expressing validation criteria by describing a pattern for the structure and content of an XML document.
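The sketch below shows what such a pattern might look like for the <heading> example discussed earlier: the schema nests one heading pattern inside book and a different one inside chapter, which a DTD cannot do. The schema text is a hypothetical illustration. The Python code only parses the schema as XML and lists what each heading pattern requires; validating a document against it would need a RELAX NG validator, which is not shown.

```python
import xml.etree.ElementTree as ET

RNG = "http://relaxng.org/ns/structure/1.0"
ELEMENT = "{%s}element" % RNG

# Hypothetical RELAX NG schema: <heading> is defined once per context.
RELAXNG_TEXT = """<element name="book" xmlns="http://relaxng.org/ns/structure/1.0">
  <element name="heading">
    <element name="title"><text/></element>
    <element name="author"><text/></element>
  </element>
  <oneOrMore>
    <element name="chapter">
      <element name="heading">
        <element name="title"><text/></element>
      </element>
    </element>
  </oneOrMore>
</element>"""

root = ET.fromstring(RELAXNG_TEXT)

# For each heading pattern, collect the names of its direct child elements.
heading_children = [
    [child.get("name") for child in node.findall(ELEMENT)]
    for node in root.iter(ELEMENT)
    if node.get("name") == "heading"
]

print(heading_children)  # [['title', 'author'], ['title']]
```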

For more information on RELAX NG, see http://www.oasis-open.org/committees/relax-ng/

SOX

Schema for Object-oriented XML is a schema proposal that includes extensible data types, namespaces, and embedded documentation.

For more information on SOX, see http://www.w3.org/TR/NOTE-SOX.

Schematron

Schematron is an assertion-based schema mechanism that allows for sophisticated validation.
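To give a feel for the assertion-based style, here is a minimal Python sketch that checks the context-sensitive <heading> rule described earlier. It is not Schematron itself, which expresses its assertions in an XML vocabulary; the rule table REQUIRED_IN_HEADING, the function check_headings, and the sample document are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical assertions: which children a <heading> needs in each context.
REQUIRED_IN_HEADING = {
    "book": ["title", "author"],
    "chapter": ["title"],
}


def check_headings(root: ET.Element) -> list:
    """Return a message for every assertion that fails in the document."""
    failures = []
    for parent in root.iter():
        required = REQUIRED_IN_HEADING.get(parent.tag)
        if required is None:
            continue  # no assertions apply in this context
        for heading in parent.findall("heading"):
            for name in required:
                if heading.find(name) is None:
                    failures.append(
                        "<heading> in <%s> must contain <%s>" % (parent.tag, name)
                    )
    return failures


DOC = """<book>
  <heading><title>XML Basics</title></heading>
  <chapter><heading><title>Schemas</title></heading></chapter>
</book>"""

print(check_headings(ET.fromstring(DOC)))
# ['<heading> in <book> must contain <author>']
```

Each failure is reported as a readable message rather than a single pass/fail result.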

For more information on the Schematron validation mechanism, see http://www.ascc.net/xml/resource/schematron/schematron.html.