Displaying Special Characters and CDATA

Next, we will customize the parser a bit so that you can see how to get information it usually ignores. In this section, you'll learn how the parser handles:

Special characters (<, &, and so on)

Text with XML-style syntax

Handling Special Characters

In XML, an entity is an XML structure (or plain text) that has a name. Referencing the entity by name causes it to be inserted into the document in place of the entity reference. To create an entity reference, you surround the entity name with an ampersand and a semicolon. For example, the predefined entity lt stands for the left angle bracket (<), so the reference &lt; puts that character into the text:

Market Size &lt; predicted

When you run the Echo program on slideSample03.xml, you see the following output:

ELEMENT:        <item>
CHARS:        Market Size < predicted
END_ELM:        </item>

The parser has converted the reference into the entity it represents and has passed the entity to the application.
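The conversion described above can be reproduced with a few lines of SAX code. The following is a minimal sketch, not the tutorial's Echo program: the class name EntityEchoSketch and the method collectText are hypothetical, and the XML is parsed from an in-memory string rather than from slideSample03.xml.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; it is not part of the Echo program.
class EntityEchoSketch {

    // Parses the XML string and returns all character data the parser reports.
    static String collectText(String xml) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // The parser may deliver one run of text in several calls,
                // so append each chunk instead of assuming a single call.
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse: " + xml, e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // The application sees the character, not the reference.
        System.out.println(collectText("<item>Market Size &lt; predicted</item>"));
        // prints: Market Size < predicted
    }
}
```

Because the handler receives the left angle bracket itself, an application that writes the text back out as XML has to restore the reference on its own.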

Handling Text with XML-Style Syntax

When you are handling large blocks of XML or HTML that include many special characters, you can use a CDATA section.

A CDATA section works like <pre>...</pre> in HTML, only more so: all whitespace in a CDATA section is significant, and characters in it are not interpreted as XML. A CDATA section starts with <![CDATA[ and ends with ]]>. The file slideSample04.xml contains this CDATA section for a fictitious technical slide:

   ...
  <slide type="tech">
    <title>How it Works</title>
    <item>First we fozzle the frobmorten</item>
    <item>Then we framboze the staten</item>
    <item>Finally, we frenzle the fuznaten</item>
    <item><![CDATA[Diagram:
      frobmorten <--------------- fuznaten
        |            <3>             ^
        | <1>                        | <1> = fozzle
        V                            | <2> = framboze 
      staten-------------------------+ <3> = frenzle
               <2>
    ]]></item>
  </slide>
</slideshow>

When you run the Echo program on slideSample04.xml, you see the following output:

  ELEMENT: <item>
  CHARS:   Diagram:
frobmorten <--------------- fuznaten
  |            <3>             ^
  | <1>                        | <1> = fozzle
  V                            | <2> = framboze 
staten-------------------------+ <3> = frenzle
         <2>

END_ELM: </item>

You can see here that the text in the CDATA section arrived as it was written. Because the parser didn't treat the angle brackets as XML, they didn't generate the fatal errors they would otherwise cause. (If the angle brackets weren't in a CDATA section, the document would not be well formed.)
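To see both behaviors side by side, you can parse the same text once inside a CDATA section and once outside it. This is a small sketch under stated assumptions: the class CdataEchoSketch and its method collectText are hypothetical names, and a return value of null is simply this sketch's way of signalling that the parser reported a fatal error.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; it is not part of the Echo program.
class CdataEchoSketch {

    // Returns the character data the parser reports, or null when the
    // document is not well formed (the parser raised a fatal error).
    static String collectText(String xml) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            return null; // DefaultHandler rethrows fatal errors as SAXParseException
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Inside CDATA, the angle brackets are ordinary characters.
        System.out.println(collectText("<item><![CDATA[| <1> = fozzle]]></item>"));
        // Outside CDATA, "<1>" looks like broken markup and parsing fails.
        System.out.println(collectText("<item>| <1> = fozzle</item>"));
    }
}
```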

Handling CDATA and Other Characters

The existence of CDATA makes the proper echoing of XML a bit tricky. If the text to be output is not in a CDATA section, then any angle brackets, ampersands, and other special characters in the text should be replaced with the appropriate entity reference. (Replacing left angle brackets and ampersands is most important; other characters will be interpreted properly without misleading the parser.)
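The substitution for ordinary text might look like the following sketch. The class TextEscapeSketch and the method escapeText are hypothetical helpers, not part of the Echo program. Replacing the right angle bracket is a choice made here for safety; strictly, it only has to be replaced when it would otherwise complete the sequence ]]>.

```java
// Hypothetical helper; it is not part of the Echo program.
class TextEscapeSketch {

    // Replaces the characters that would mislead a parser if they were
    // written out as-is in ordinary (non-CDATA) element text.
    static String escapeText(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;");  break;
                case '>': out.append("&gt;");  break; // optional in most positions
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeText("Market Size < predicted"));
        // prints: Market Size &lt; predicted
    }
}
```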

But if the output text is in a CDATA section, then the substitutions should not occur, resulting in text like that in the earlier example. In a simple program such as our Echo application, it's not a big deal. But many XML-filtering applications will want to keep track of whether the text appears in a CDATA section, so that they can treat special characters properly. (Later, you will see how to use a LexicalHandler to find out whether or not you are processing a CDATA section.)
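As a preview of that technique, here is a minimal sketch of CDATA tracking. It assumes the standard SAX2 lexical-handler property and uses DefaultHandler2 from org.xml.sax.ext, which supplies empty LexicalHandler methods to override. The class CdataTrackingSketch and the method splitText are hypothetical names chosen for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical demo class; it is not part of the Echo program.
class CdataTrackingSketch {

    // Returns a two-element array: [0] is the text found outside CDATA
    // sections, [1] is the text found inside them.
    static String[] splitText(String xml) {
        StringBuilder plain = new StringBuilder();
        StringBuilder cdata = new StringBuilder();
        DefaultHandler2 handler = new DefaultHandler2() {
            private boolean inCdata = false;

            @Override
            public void startCDATA() { inCdata = true; }

            @Override
            public void endCDATA() { inCdata = false; }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Route each chunk according to the flag set by the
                // lexical events above.
                (inCdata ? cdata : plain).append(ch, start, length);
            }
        };
        try {
            XMLReader reader =
                    SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.setErrorHandler(handler);
            // Lexical events are delivered only if a handler is registered
            // under this property name.
            reader.setProperty(
                    "http://xml.org/sax/properties/lexical-handler", handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return new String[] { plain.toString(), cdata.toString() };
    }

    public static void main(String[] args) {
        String[] parts = splitText("<item>a &lt; b<![CDATA[<1>]]></item>");
        System.out.println("outside CDATA: " + parts[0]);
        System.out.println("inside CDATA:  " + parts[1]);
    }
}
```

A filtering application could use the same flag to decide whether to substitute entity references when it writes each chunk back out.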

One other area to watch for is attributes. The text of an attribute value can also contain angle brackets and ampersands that need to be replaced by entity references. (Attribute text can never be in a CDATA section, though, so there is never any question about doing that substitution.)
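For attribute values, the substitution could be sketched as follows. The class AttributeEscapeSketch and the method escapeAttribute are hypothetical, and the sketch assumes the value will be written between double quotes, which is why it also replaces the double-quote character.

```java
// Hypothetical helper; it is not part of the Echo program.
class AttributeEscapeSketch {

    // Escapes a value so it can be written inside a double-quoted attribute.
    // The ampersand must be handled first; otherwise the ampersands
    // introduced by the other replacements would be escaped a second time.
    static String escapeAttribute(String value) {
        return value.replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        String value = "size < 10 & \"tech\"";
        System.out.println("<slide note=\"" + escapeAttribute(value) + "\"/>");
        // prints: <slide note="size &lt; 10 &amp; &quot;tech&quot;"/>
    }
}
```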