Using the Validating Parser

By now, you have done a lot of experimenting with the nonvalidating parser. It's time to have a look at the validating parser to find out what happens when you use it to parse the sample presentation.

You need to understand about two things about the validating parser at the outset:

Configuring the Factory

The first step is to modify the Echo program so that it uses the validating parser instead of the nonvalidating parser.


Note: The code in this section is contained in Echo10.java.


To use the validating parser, make the following highlighted changes:

public static void main(String argv[])
{
  if (argv.length != 1) {
    ...
  }
  // Use the default (non-validating) parser
  // Use the validating parser
  SAXParserFactory factory = SAXParserFactory.newInstance();
  factory.setValidating(true);
  try {
    ... 

Here, you configure the factory so that it will produce a validating parser when newSAXParser is invoked. To configure it to return a namespace-aware parser, you can also use setNamespaceAware(true). Sun's implementation supports any combination of configuration options. (If a combination is not supported by a particular implementation, it is required to generate a factory configuration error.)

Validating with XML Schema

Although a full treatment of XML Schema is beyond the scope of this tutorial, this section shows you the steps you take to validate an XML document using an existing schema written in the XML Schema language. (To learn more about XML Schema, you can review the online tutorial, XML Schema Part 0: Primer, at http://www.w3.org/TR/xmlschema-0/. You can also examine the sample programs that are part of the JAXP download. They use a simple XML Schema definition to validate personnel data stored in an XML file.)


Note: There are multiple schema-definition languages, including RELAX NG, Schematron, and the W3C "XML Schema" standard. (Even a DTD qualifies as a "schema," although it is the only one that does not use XML syntax to describe schema constraints.) However, "XML Schema" presents us with a terminology challenge. Although the phrase "XML Schema schema" would be precise, we'll use the phrase "XML Schema definition" to avoid the appearance of redundancy.


To be notified of validation errors in an XML document, the parser factory must be configured to create a validating parser, as shown in the preceding section. In addition, the following must be true:

Setting the SAX Parser Properties

It's helpful to start by defining the constants you'll use when setting the properties:

static final String JAXP_SCHEMA_LANGUAGE =
    "http://java.sun.com/xml/jaxp/properties/schemaLanguage";

static final String W3C_XML_SCHEMA =
    "http://www.w3.org/2001/XMLSchema"; 

Next, you configure the parser factory to generate a parser that is namespace-aware as well as validating:

...
  SAXParserFactory factory = SAXParserFactory.newInstance();
  factory.setNamespaceAware(true);
  factory.setValidating(true); 

You'll learn more about namespaces in Validating with XML Schema. For now, understand that schema validation is a namespace-oriented process. Because JAXP-compliant parsers are not namespace-aware by default, it is necessary to set the property for schema validation to work.

The last step is to configure the parser to tell it which schema language to use. Here, you use the constants you defined earlier to specify the W3C's XML Schema language:

saxParser.setProperty(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA); 

In the process, however, there is an extra error to handle. You'll take a look at that error next.

Setting Up the Appropriate Error Handling

In addition to the error handling you've already learned about, there is one error that can occur when you are configuring the parser for schema-based validation. If the parser is not 1.2-compliant and therefore does not support XML Schema, it can throw a SAXNotRecognizedException.

To handle that case, you wrap the setProperty() statement in a try/catch block, as shown in the code highlighted here:

...
SAXParser saxParser = factory.newSAXParser();
try {
  saxParser.setProperty(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
} 
catch (SAXNotRecognizedException x) {
  // Happens if the parser does not support JAXP 1.2
  ...
}
... 

Associating a Document with a Schema

Now that the program is ready to validate the data using an XML Schema definition, it is only necessary to ensure that the XML document is associated with one. There are two ways to do that:


Note: When the application specifies the schema to use, it overrides any schema declaration in the document.


To specify the schema definition in the document, you create XML such as this:

<documentRoot
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation='YourSchemaDefinition.xsd'
>
  ... 

The first attribute defines the XML namespace (xmlns) prefix, xsi, which stands for XML Schema instance. The second line specifies the schema to use for elements in the document that do not have a namespace prefix--that is, for the elements you typically define in any simple, uncomplicated XML document.


Note: You'll learn about namespaces in Validating with XML Schema. For now, think of these attributes as the "magic incantation" you use to validate a simple XML file that doesn't use them. After you've learned more about namespaces, you'll see how to use XML Schema to validate complex documents that use them. Those ideas are discussed in Validating with Multiple Namespaces.


You can also specify the schema file in the application:

static final String JAXP_SCHEMA_SOURCE =
    "http://java.sun.com/xml/jaxp/properties/schemaSource";

...
SAXParser saxParser = spf.newSAXParser();
...
saxParser.setProperty(JAXP_SCHEMA_SOURCE,
    new File(schemaSource)); 

Now that you know how to use an XML Schema definition, we'll turn to the kinds of errors you can see when the application is validating its incoming data. To do that, you'll use a document type definition (DTD) as you experiment with validation.

Experimenting with Validation Errors

To see what happens when the XML document does not specify a DTD, remove the DOCTYPE statement from the XML file and run the Echo program on it.


Note: The output shown here is contained in Echo10-01.txt. (The browsable version is Echo10-01.html.)


The result you see looks like this:

<?xml version='1.0' encoding='UTF-8'?>
** Parsing error, line 9, uri .../slideSample01.xml
  Document root element "slideshow", must match DOCTYPE root 
"null" 

Note: This message was generated by the JAXP 1.2 libraries. If you are using a different parser, the error message is likely to be somewhat different.


This message says that the root element of the document must match the element specified in the DOCTYPE declaration. That declaration specifies the document's DTD. Because you don't yet have one, it's value is null. In other words, the message is saying that you are trying to validate the document, but no DTD has been declared, because no DOCTYPE declaration is present.

So now you know that a DTD is a requirement for a valid document. That makes sense. What happens when you run the parser on your current version of the slide presentation, with the DTD specified?


Note: The output shown here is produced using slideSample07.xml, as described in Referencing Binary Entities. The output is contained in Echo10-07.txt. (The browsable version is Echo10-07.html.)


This time, the parser gives a different error message:

  ** Parsing error, line 29, uri file:...
  The content of element type "slide" must match 
"(image?,title,item*) 

This message says that the element found at line 29 (<item>) does not match the definition of the <slide> element in the DTD. The error occurs because the definition says that the slide element requires a title. That element is not optional, and the copyright slide does not have one. To fix the problem, add a question mark to make title an optional element:

<!ELEMENT slide (image?, title?, item*)> 

Now what happens when you run the program?


Note: You could also remove the copyright slide, producing the same result shown next, as reflected in Echo10-06.txt. (The browsable version is Echo10-06.html.)


The answer is that everything runs fine until the parser runs into the <em> tag contained in the overview slide. Because that tag is not defined in the DTD, the attempt to validate the document fails. The output looks like this:

  ...
  ELEMENT: <title>
  CHARS:   Overview
  END_ELM: </title>
  ELEMENT: <item>
  CHARS:   Why ** Parsing error, line 28, uri: ...
Element "em" must be declared.
org.xml.sax.SAXParseException: ...
... 

The error message identifies the part of the DTD that caused validation to fail. In this case it is the line that defines an item element as (#PCDATA | item).

As an exercise, make a copy of the file and remove all occurrences of <em> from it. Can the file be validated now? (In the next section, you'll learn how to define parameter entries so that we can use XHTML in the elements we are defining as part of the slide presentation.)

Error Handling in the Validating Parser

It is important to recognize that the only reason an exception is thrown when the file fails validation is as a result of the error-handling code you entered in the early stages of this tutorial. That code is reproduced here:

public void error(SAXParseException e)
throws SAXParseException
{
  throw e;
} 

If that exception is not thrown, the validation errors are simply ignored. Try commenting out the line that throws the exception. What happens when you run the parser now?

In general, a SAX parsing error is a validation error, although you have seen that it can also be generated if the file specifies a version of XML that the parser is not prepared to handle. Remember that your application will not generate a validation exception unless you supply an error handler such as the one here.