... in Java by Richard G Baldwin

Java JAXP, Exposing a DOM Tree

Baldwin shows you how to write a program to display a DOM tree on the screen in a format that is much easier to interpret than raw XML code.

Published: December 16, 2003
By Richard G. Baldwin

Java Programming Notes # 2204

Preface
Preview
Discussion and Sample Code
Run the Program
Summary
What's Next?
Complete Program Listings

Preface

What is JAXP?

As the name implies, the Java API for XML Processing (JAXP) is an API designed to help you write programs for processing XML documents. JAXP is very important for many reasons, not the least of which is the fact that it is a critical part of the Java Web Services Developer Pack (JWSDP). As you are probably already aware, many expect web services to be a very important aspect of the Internet of the future.

This is the third lesson in a series designed to initially help you understand how to use JAXP, and to eventually help you understand how to use the JWSDP.

The first lesson was entitled Java API for XML Processing (JAXP), Getting Started. The previous lesson was entitled Getting Started with Java JAXP and XSL Transformations (XSLT).

What is XML?

XML is an acronym for the eXtensible Markup Language. I will not attempt to teach XML in this series of tutorial lessons. Rather, I will assume that you already understand XML, and I will teach you how to use JAXP to write programs for creating and processing XML documents.

I have published numerous tutorial lessons on XML at Gamelan.com and www.DickBaldwin.com. You may find it useful to refer to those lessons. In addition, I provided a review of the salient aspects of XML in the first lesson in this series. From time to time, I will also provide background information regarding XML in the lessons in this series.

Viewing tip

You may find it useful to open another copy of this lesson in a separate browser window. That will make it easier for you to scroll back and forth among the different listings and figures while you are reading about them.

Supplementary material

I recommend that you also study the other lessons in my extensive collection of online Java tutorials. You will find those lessons published at Gamelan.com. However, as of the date of this writing, Gamelan doesn't maintain a consolidated index of my Java tutorial lessons, and sometimes they are difficult to locate there. You will find a consolidated index at www.DickBaldwin.com.

Preview

A tree structure in memory

A DOM parser can be used to create a tree structure in memory that represents an XML document. In Java, that tree structure is encapsulated in an object of the interface type Document. Document and its superinterface Node declare numerous methods that may be used to navigate, extract information from, modify, and otherwise manipulate the DOM tree. As is always the case, classes that implement Document must provide concrete definitions of those methods.

Many operations are possible

Given an object of type Document, there are many methods that can be invoked on the object to perform a variety of operations. For example, it is possible to move nodes from one location in the tree to another location in the tree, thus rearranging the structure of the XML document represented by the Document object. It is also possible to delete nodes, and to insert new nodes. It is also possible to recursively traverse the tree, extracting information about the nodes along the way.
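(To make the operations described above a little more concrete, here is a minimal sketch that moves, deletes, and inserts nodes in a tiny DOM tree. The class name DomEditSketch, its helper methods, and the sample XML string are hypothetical and are not part of the programs discussed in this lesson. The sketch parses XML from a string rather than from a file simply to keep it self-contained.)

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the lesson's programs.
class DomEditSketch {

  // Parses an XML string into a DOM Document using default JAXP settings.
  static Document parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalArgumentException("Could not parse XML", e);
    }
  }

  // Concatenates the names of the root element's element children, in order.
  static String childNames(Document doc) {
    StringBuilder sb = new StringBuilder();
    Node child = doc.getDocumentElement().getFirstChild();
    while (child != null) {
      if (child.getNodeType() == Node.ELEMENT_NODE) {
        sb.append(child.getNodeName());
      }
      child = child.getNextSibling();
    }
    return sb.toString();
  }

  // Moves the first child to the end, deletes the new first child,
  // and inserts a new element named N at the front.
  static String rearrange(String xml) {
    Document doc = parse(xml);
    Element root = doc.getDocumentElement();
    root.appendChild(root.getFirstChild());   // move: appending an attached node relocates it
    root.removeChild(root.getFirstChild());   // delete
    root.insertBefore(doc.createElement("N"), root.getFirstChild()); // insert
    return childNames(doc);
  }

  public static void main(String[] args) {
    // Children start as X, Y, Z and end as N, Z, X.
    System.out.println(rearrange("<A><X/><Y/><Z/></A>"));
  }
}
```

Note that appendChild is used here to move a node. When the node passed to appendChild is already in the tree, it is first removed from its old location.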

I showed you ...

In the previous lesson on Java JAXP, I began by providing a brief review of XSL and XSL Transformations (XSLT).

Then I showed you how to create an identity Transformer object, and how to use that object to:

Display a DOM tree structure on the screen in XML format.
Write the contents of a DOM tree structure into an output XML file.

Following that, I showed you how to write exception handlers that provide meaningful information in the event of errors and exceptions, with particular emphasis on parser errors and exceptions.

I will show you ...

In this lesson, I will show you how to write a program to display a DOM tree on the screen in a format that is much easier to interpret than raw XML code. I will explain two different versions of the program. One version will simply identify text nodes in the output tree. The other will display the value of text nodes in the output tree. The first version will ignore attributes in the output tree. The second version will include attributes in the output tree.

Discussion and Sample Code

The first program that I will discuss, named DomTree01, analyzes a DOM tree that represents an XML document, and produces an output on the screen similar to the tree shown in Figure 1.

#document DOCUMENT_NODE
  A DOCUMENT_TYPE_NODE
  #comment COMMENT_NODE
  xml-stylesheet PROCESSING_INSTRUCTION_NODE
  processor PROCESSING_INSTRUCTION_NODE
  A ELEMENT_NODE
    Q ELEMENT_NODE
      #text TEXT_NODE
    B ELEMENT_NODE
      C ELEMENT_NODE
        #text TEXT_NODE
        #cdata-section CDATA_SECTION_NODE
      R ELEMENT_NODE
        #text TEXT_NODE
      C ELEMENT_NODE
        #text TEXT_NODE
      S ELEMENT_NODE
        #text TEXT_NODE
      B ELEMENT_NODE
        C ELEMENT_NODE
          #text TEXT_NODE
      S ELEMENT_NODE
        #text TEXT_NODE
      B ELEMENT_NODE
        C ELEMENT_NODE
          #text TEXT_NODE
        T ELEMENT_NODE
          #text TEXT_NODE
        B ELEMENT_NODE
          C ELEMENT_NODE
            #text TEXT_NODE
          D ELEMENT_NODE
            E ELEMENT_NODE
              #text TEXT_NODE
              G ELEMENT_NODE
                #text TEXT_NODE
              #text TEXT_NODE
            F ELEMENT_NODE
              #text TEXT_NODE
            E ELEMENT_NODE
              #text TEXT_NODE
            F ELEMENT_NODE
              #text TEXT_NODE
            E ELEMENT_NODE
              #text TEXT_NODE
            F ELEMENT_NODE
              #text TEXT_NODE
          C ELEMENT_NODE
            #text TEXT_NODE
        C ELEMENT_NODE
          #text TEXT_NODE
      R ELEMENT_NODE
        #text TEXT_NODE
      C ELEMENT_NODE
        #text TEXT_NODE
    B ELEMENT_NODE
      R ELEMENT_NODE
        #text TEXT_NODE
      C ELEMENT_NODE
        #text TEXT_NODE

Figure 1

The physical tree structure shown in Figure 1 represents the corresponding XML document as a visual tree. As I discuss the various parts of the XML document, you should be able to correlate those parts of the document to the tree structure shown in Figure 1.

The sample XML file named DomTree01.xml

The tree structure in Figure 1 corresponds to an XML file named DomTree01.xml. As is often the case, I will discuss the XML files and the programs in fragments. A complete listing of DomTree01.xml is shown in Listing 21 near the end of the lesson. Listing 1 shows the beginning of the XML file.

The structure of the XML file named DomTree01.xml

That portion of the XML file shown in Listing 1 consists of five items that are represented by the following nodes in the DOM tree:

A Document node

A Document-Type node
A Comment node
A Processing Instruction node representing a stylesheet
A Processing Instruction node representing a dummy processing instruction

The last four nodes in the above list are children of the Document node. The Document node is the root of the entire DOM tree, and all other nodes in the DOM tree are descendants of the Document node.

The five items are separated by blank lines in Listing 1, so you should be able to correlate them visually with the five nodes in the above list.

(Note that although it is tempting to believe that the Document node correlates with the XML declaration in the first line of Listing 1, the XML declaration is not required, and the DOM tree will be rooted in a Document node, even in the absence of an XML declaration.)
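(If you would like to confirm this for yourself, the following minimal sketch parses the same tiny document with and without an XML declaration and describes the root of the resulting tree. The class name DocumentRootSketch and the method describeRoot are hypothetical names that I made up for this illustration.)

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the lesson's programs.
class DocumentRootSketch {

  // Parses the XML text and describes the root of the resulting DOM tree.
  static String describeRoot(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      boolean isDocumentNode = doc.getNodeType() == Node.DOCUMENT_NODE;
      return doc.getNodeName()
          + " isDocumentNode=" + isDocumentNode
          + " children=" + doc.getChildNodes().getLength();
    } catch (Exception e) {
      throw new IllegalArgumentException("Could not parse XML", e);
    }
  }

  public static void main(String[] args) {
    // With an XML declaration
    System.out.println(describeRoot("<?xml version=\"1.0\"?><A/>"));
    // Without an XML declaration
    System.out.println(describeRoot("<A/>"));
  }
}
```

Both calls should report a root named #document with exactly one child (the element named A), which suggests that the XML declaration does not contribute a node of its own.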

The DOM tree exposed

Figure 2 shows a reproduction of the first five lines from Figure 1. Each line in Figure 2 represents a node in the DOM tree. You should be able to correlate each line in Figure 2 with one of the nodes in the above list, and also with one of the items in Listing 1 (except for the DOCUMENT_NODE for which there is no explicit item in Listing 1).

The indentation in Figure 2 indicates that the last four lines represent nodes that are children of the Document node shown in the first line.

#document DOCUMENT_NODE
  A DOCUMENT_TYPE_NODE
  #comment COMMENT_NODE
  xml-stylesheet PROCESSING_INSTRUCTION_NODE
  processor PROCESSING_INSTRUCTION_NODE

Figure 2

The prolog of the XML document

Listing 1 shows the prolog for this XML document, which includes everything prior to the start tag for the root element. Figure 2 shows the DOM nodes associated with the prolog.

The root element in the XML document

Listing 2 shows the XML code for the root element and the six nodes following the root-element node in the DOM tree. The XML code in Listing 2 produces the following node types in the DOM tree, with the parent-child relationships shown.

An Element node named A, which is the root element node
  An Element node named Q
    A Text node
  An Element node named B
    An Element node named C
      A Text node
      A CDATA Section node

<A>
<Q>A Big Header</Q>

<B>
<C>Level 0. This is the beginning of a B.
This text is in the Introduction section.
<![CDATA[This is CDATA < > " &]]></C>

Listing 2

A is a child of the document root node

Referring back to Figure 1, you can see that the Element node named A is a child of the Document node, which forms the root of the DOM tree. The node for element A is the root element node for the DOM tree (which is different from the root node for the DOM tree). All of the data stored in an XML document is stored in the root element node and its children.

Figure 3 shows a reproduction of the next seven lines from Figure 1, showing the tree structure and the parent-child relationships among the nodes. The nodes shown in Figure 3 correspond to the XML code in Listing 2.

  A ELEMENT_NODE
    Q ELEMENT_NODE
      #text TEXT_NODE
    B ELEMENT_NODE
      C ELEMENT_NODE
        #text TEXT_NODE
        #cdata-section CDATA_SECTION_NODE

Figure 3

Easier to interpret

Unless you have a lot of practice reading XML code, you may have concluded by now that the representations of the DOM tree in Figures 2 and 3 are much easier to get your mind around than the raw XML shown in Listings 1 and 2.

Node types seen thus far

So far, we have seen the following types of nodes:

Document node
Document-Type node
Comment node
Processing Instruction node
Element node
Text node
CDATA Section node

It will be useful at this point to provide a brief explanation for each of these node types.

The Document node and the XML declaration

According to XML in a Nutshell by Harold and Means, which I recommend as an excellent book,

"XML documents should, (but do not have to) begin with an XML declaration. The XML declaration looks like a processing instruction with the name xml and version, standalone, and encoding attributes. Technically, it's not a processing instruction though, just the XML declaration; nothing more, nothing less."

As I mentioned earlier, every XML DOM tree is rooted in a Document node, even in the absence of an XML declaration. Apparently, the DOM tree does not contain a node that represents the XML declaration, and the XML document doesn't contain any specific text that represents the Document node.

Although the XML declaration is used for information purposes by a validating XML parser, if it is possible to recover the XML declaration from the DOM tree, I don't know how to do that at this time.

Document-Type node

A valid XML document contains a document type declaration that references a Document Type Definition (DTD), against which the document is validated. The DTD can also be included inline in the XML document prolog, as is the case in Listing 1.

(The DTD in Listing 1 begins with <!DOCTYPE and ends with ]>)

According to XML in a Nutshell,

"DTDs are written in a formal syntax that explains precisely which elements and entities may appear where in the document and what the elements' contents and attributes are."

For example, the DTD in Listing 1 states that the element named A must contain the elements named Q, B, and B, in that order. I'm not going to try to explain the rules for writing DTDs. There are numerous tutorials on the Web that you can refer to in this regard.

The DTD in Listing 1 produced the Document-Type node in the tree in Figure 2.

(In certain situations, a schema can be used for validation in place of a DTD.)

Comment node

A comment in XML means pretty much the same thing as a comment in Java. XML comments are generally ignored by XML processors. They are intended primarily for human consumption.

Listing 1 contains an XML comment with the file name and some other information. This comment produced the Comment node in the tree of Figure 2.

Processing Instruction node

XML processing instructions begin with <? and end with ?>. Processing instructions are intended to provide instructions to processing programs that may be called upon to process an XML document.

Listing 1 contains two separate processing instructions. The two processing instructions gave rise to the two Processing Instruction nodes in the tree in Figure 2.

Element node

As you learned in the previous two lessons, XML syntax includes elements, consisting of start tags, end tags, optional content, and optional attributes.

Listing 2 contains all or part of several elements. The elements gave rise to the Element nodes in Figure 3. The text content of the elements gave rise to the Text nodes in Figure 3.

(Note that the actual text in this XML document is not intended to have any meaning other than to constitute text nodes in the DOM tree for illustration purposes.)

Text node

When you include text as part or all of the content of an XML element, each chunk of text gives rise to a text node in the DOM tree. Figure 3 shows two text nodes produced by the text content of the elements in Listing 2.

CDATA Section node

XML recognizes two kinds of text data, PCDATA and CDATA. PCDATA stands for parsed character data. CDATA stands for character data.

The primary difference between the two is as follows. PCDATA cannot contain certain characters such as left angle brackets (<) and ampersands (&). The reason is that a left angle bracket would confuse the parser, causing it to believe that it had encountered the first character in a start or end tag. Therefore, if these characters appear in PCDATA, they must be represented by entities, such as &lt;.

A CDATA section

When a block of text is marked as a CDATA section, the parser does not interpret its contents as markup. Therefore, it can contain any characters (with the exception of the sequence ]]>, which ends the section). A block of CDATA always begins with <![CDATA[. The block always ends with ]]>.

(Note that the periods in the above sentences are not parts of the CDATA beginning and ending syntax.)

Listing 2 contains a block of CDATA, which gave rise to the CDATA Section node in Figure 3.

Note that the Element node named C in Figure 3 has two children. One child is a text node. The other child is a CDATA Section node.
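(The following minimal sketch illustrates this behavior on a one-element document. The class name CdataSketch, the method describeChildren, and the sample XML string are hypothetical. The sketch assumes that the parser is not configured to coalesce CDATA sections into adjacent text nodes, which is the default for DocumentBuilderFactory.)

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the lesson's programs.
class CdataSketch {

  // Reports the child count of the root element, the names of its first
  // and last children, and the value of the last child.
  static String describeChildren(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      // This is the default setting: keep CDATA sections as separate nodes.
      factory.setCoalescing(false);
      Element root = factory.newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)))
          .getDocumentElement();
      Node first = root.getFirstChild();
      Node last = root.getLastChild();
      return root.getChildNodes().getLength() + ":"
          + first.getNodeName() + "," + last.getNodeName()
          + "=" + last.getNodeValue();
    } catch (Exception e) {
      throw new IllegalArgumentException("Could not parse XML", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(describeChildren(
        "<C>Some text.<![CDATA[This is CDATA < > \" &]]></C>"));
  }
}
```

The characters inside the CDATA section come back from getNodeValue exactly as they were written, without any entity substitution.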

An interesting case involving whitespace

I'm not going to bore you by discussing the entire XML document in this level of detail. By now, you should be able to compare the XML in Listing 21 with the DOM tree represented by Figure 1, and understand how the XML code relates to the DOM tree.

However, there is one tricky aspect involving whitespace that deserves a little more explanation. The DOM tree nodes shown in Figure 4 represent the XML code shown in Listing 3.

            E ELEMENT_NODE
              #text TEXT_NODE
              G ELEMENT_NODE
                #text TEXT_NODE
              #text TEXT_NODE
            F ELEMENT_NODE
              #text TEXT_NODE

Figure 4

Too many text nodes

I have colored the obvious text in Listing 3 green for emphasis. At first glance, it would appear that there are too many Text nodes showing in Figure 4 to correspond to the text shown in Listing 3.

<E>First list item in E
<G>Nested G text element</G>
</E>
<F>First list item in F</F>

Listing 3

Another representation of the DOM tree

Figure 5 shows another representation of the DOM tree, similar to Figure 4, except that the actual text belonging to each Text node is shown in Figure 5.

            E ELEMENT_NODE
              #text First list item in E

              G ELEMENT_NODE
                #text Nested G text element
              #text

            F ELEMENT_NODE
              #text First list item in F

Figure 5

Note the blank lines in Figure 5. These are caused by newline characters in the actual XML code in Listing 3. In particular, there are two Text nodes belonging to the element named E. One of those Text nodes appears before the element named G and the other appears after the element named G. The Text node after the element named G was caused by the newline character immediately following the end tag for the element named G.

Element E may contain PCDATA

This happens because of one line in the DTD shown in Listing 1 and repeated below for convenience.

<!ELEMENT E (#PCDATA | G)*>

This DTD statement says that the content for an element named E may contain Text nodes (#PCDATA) and/or elements named G in any number and in any order. Thus, simple newline characters inserted into the XML to make it easier to read were interpreted as Text nodes. This gave rise to what appears to be extra Text nodes in Figure 4.
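(As a brief aside, here is a minimal sketch of this effect. It wraps an abbreviated version of Listing 3 in a small inline DTD of my own, which is an assumption and is not the DTD from Listing 1 apart from the declaration for E. The class name MixedContentSketch and its members are hypothetical. The parser is configured to validate and to ignore element-content whitespace, yet the newline after the G element should still appear as a Text node because E is declared with mixed content.)

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the lesson's programs.
class MixedContentSketch {

  // A small document whose root element E is declared with mixed content.
  static final String SAMPLE =
      "<!DOCTYPE E [\n"
      + "<!ELEMENT E (#PCDATA | G)*>\n"
      + "<!ELEMENT G (#PCDATA)>\n"
      + "]>\n"
      + "<E>First list item in E\n"
      + "<G>Nested G text element</G>\n"
      + "</E>";

  // Returns the names of the root element's children, separated by commas.
  static String childNames(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setValidating(true);
      factory.setIgnoringElementContentWhitespace(true);
      NodeList children = factory.newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)))
          .getDocumentElement()
          .getChildNodes();
      StringBuilder sb = new StringBuilder();
      for (int i = 0; i < children.getLength(); i++) {
        Node child = children.item(i);
        if (i > 0) {
          sb.append(',');
        }
        sb.append(child.getNodeName());
      }
      return sb.toString();
    } catch (Exception e) {
      throw new IllegalArgumentException("Could not parse XML", e);
    }
  }

  public static void main(String[] args) {
    // The trailing #text entry is the newline that follows the G end tag.
    System.out.println(childNames(SAMPLE));
  }
}
```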

That's probably enough talk. It's time to see some Java code.

The program named DomTree01

With the preceding discussion as background, I will now discuss the program named DomTree01, which was used to process the file named DomTree01.xml and to produce the DOM tree representation shown in Figure 1. As usual, I will discuss the program in fragments. A complete listing of the program is shown in Listing 20 near the end of the lesson.

Purpose and limitations of the program

This program produces a text-based output on the screen that represents the DOM tree structure for an XML file. Note that although the code was written to support these node types, the program was not actually tested for the following node types:

DOCUMENT_FRAGMENT_NODE
ENTITY_NODE
ENTITY_REFERENCE_NODE
NOTATION_NODE

Note also that this program does not display attributes. That will be accomplished in the sample program named DomTree02 to be discussed later in this lesson.

Also note that for simplicity, no effort was made to cause the program to produce meaningful output in the event of errors and exceptions.

The program was tested using Sun's SDK 1.4.2 under WinXP.

Overall program structure

This program consists of a single class with a main method that runs as a Java application. Listing 4 shows the beginning of the class definition and the beginning of the main method.

public class DomTree01{

  int indent = -1;//Indentation level for display

  public static void main(String argv[]){
    if (argv.length != 2){
      System.err.println(
          "usage: java DomTree01 fileIn validate");
      System.err.println(
          "validate = n for no, y for yes");
      System.exit(0);
    }//end if

Listing 4

The code in Listing 4 is straightforward:

It declares and initializes an instance variable that is used later for control of indentation in the output display.
It also provides usage instructions if the user starts the program with the wrong number of command-line arguments.

Running the program

Two command-line parameters are required. The first parameter is the path and file name of the file containing the XML document to be processed. The second command-line parameter is either "y" or "n" specifying whether or not the parser should attempt to validate the XML document.

(If the program is instructed to validate the document, a DTD (or schema) must be provided either inline or as a reference in the XML document.)

Steps for creating a Document object

As you learned in an earlier lesson, three steps are required to create a Document object:

Create a DocumentBuilderFactory object
Use the DocumentBuilderFactory object to create a DocumentBuilder object
Use the parse method of the DocumentBuilder object to create a Document object

Create a DocumentBuilderFactory object

The first step in the above list is accomplished by the code in Listing 5.

    try{
      //Get a factory object for DocumentBuilder
      // objects
      DocumentBuilderFactory factory =
          DocumentBuilderFactory.newInstance();

      //Configure the factory object setting
      // validating true or false based on user
      // input.
      if(argv[1].equals("n")){
        factory.setValidating(false);
      }else{
        factory.setValidating(true);
      }//end if/else

      factory.setNamespaceAware(false);

      //Set to ignore cosmetic white space
      // between elements.
      factory.
          setIgnoringElementContentWhitespace(true);

Listing 5

There is very little in Listing 5 that wasn't discussed in detail in earlier lessons. About the only thing that is new is the invocation of the setter method at the end of Listing 5 to cause the parser to ignore cosmetic whitespace in the XML document.

(Cosmetic whitespace consists of spaces, tabs, newlines, etc., inserted into the XML document between elements to make the document easier to read.)

This wasn't discussed in the previous lessons because it only works with a validating parser. The parsers used in the two previous lessons were not validating parsers.
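(The following minimal sketch shows the effect of this setter on a validating parser. The class name WhitespaceSketch, the method rootChildCount, and the small inline DTD are hypothetical and were made up for this illustration. The element A is declared to contain only a B element, so the newlines and spaces around B are cosmetic whitespace.)

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the lesson's programs.
class WhitespaceSketch {

  // A has element-only content, so whitespace around B is cosmetic.
  static final String SAMPLE =
      "<!DOCTYPE A [\n"
      + "<!ELEMENT A (B)>\n"
      + "<!ELEMENT B (#PCDATA)>\n"
      + "]>\n"
      + "<A>\n"
      + "  <B>text</B>\n"
      + "</A>";

  // Counts the children of the root element using a validating parser,
  // with cosmetic whitespace either ignored or retained.
  static int rootChildCount(boolean ignoreWhitespace) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setValidating(true);
      factory.setIgnoringElementContentWhitespace(ignoreWhitespace);
      Document doc = factory.newDocumentBuilder()
          .parse(new InputSource(new StringReader(SAMPLE)));
      return doc.getDocumentElement().getChildNodes().getLength();
    } catch (Exception e) {
      throw new IllegalStateException("Could not parse sample XML", e);
    }
  }

  public static void main(String[] args) {
    // Only the B element remains when cosmetic whitespace is ignored.
    System.out.println("ignoring:  " + rootChildCount(true));
    // Two whitespace Text nodes plus the B element otherwise.
    System.out.println("retaining: " + rootChildCount(false));
  }
}
```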

Create a Document object

The remaining two steps required to create a Document object are accomplished in Listing 6.

      //Get a DocumentBuilder (parser) object
      DocumentBuilder builder =
          factory.newDocumentBuilder();

      //Parse the XML input file to create a
      // Document object that represents the
      // input XML file.
      Document document = builder.parse(
          new File(argv[0]));

Listing 6

The code in Listing 6 was also discussed in detail in the two previous lessons, so I won't discuss that code further here.

Process the Document object

Code that is new to this lesson begins in Listing 7. The code in Listing 7 instantiates a new object of the program class and invokes the processNode method on that object, passing the Document object's reference as a parameter.

      //Instantiate an object of this class
      DomTree01 thisObj = new DomTree01();

      thisObj.processNode(document);

    }catch(Exception e){
      e.printStackTrace(System.err);
    }//end catch

  }// end main()

Listing 7

Listing 7 also contains a simple exception handler, which signals the end of the main method.

The processNode method

The processNode method, which begins in Listing 8, is used to recursively process the DOM tree, identifying and displaying the tree structure along the way.

  private void processNode(Node node){
    if (node == null){
      System.err.println(
          "Nothing to do, node is null");
      return;
    }//end if

Listing 8

Recall from an earlier lesson that the Document interface extends the Node interface, which provides a multiplicity of methods that can be used to navigate and manipulate the DOM tree. Therefore, a Document object can be treated as type Node. The required type for the incoming parameter to the processNode method is type Node.

The code in Listing 8 simply checks to confirm that the incoming reference does not have a value of null. If it does, the code in Listing 8 prints an error message and returns.

Perform the recursive processing on the incoming node

The code in Listing 9 shows the beginning of what happens if the incoming parameter is not null.

    indent++;

    String nodeName = node.getNodeName();
    int type = node.getNodeType();

Listing 9

As you will see later, the processNode method will continue calling itself recursively until all of the nodes in the DOM tree have been examined. Information about the tree structure will be extracted and displayed as each node is examined. When all of the nodes in the DOM tree have been examined, the program will terminate.

Indentation

Recall the instance variable named indent that was declared and initialized in Listing 4. Each time control enters the processNode method (with a non-null Node parameter), the value of that instance variable is incremented. Each time control exits the method (except for the case of a null Node parameter), the value of that instance variable is decremented. Therefore, at any point in time, the value of indent indicates the current depth (in the DOM tree) of the node that is being examined.

Get node name and type

The variable named indent is incremented in Listing 9. Following this, two methods are called on the incoming Node parameter to get and save the name and the type of the node currently being examined.

Some types of nodes have generic names, such as #text. Other types of nodes have actual names, which match element names in the XML document.

The doIndent method

At this point, I am going to skip ahead and show you a very simple method named doIndent, (which actually appears near the end of the program code in Listing 20). The code for this method is shown in Listing 10.

  private void doIndent(){
    for(int cnt = 0; cnt < indent; cnt++){
      System.out.print("  ");
    }//end for loop
  }//end doIndent

Listing 10

The purpose of this method is to move the cursor to the right on the screen to accomplish indentation in the display. Each time the method is called, it moves the cursor to the right by an amount equal to twice the value of the variable named indent. This produces two spaces for each level of indentation.

Display the name of the node

Returning to the discussion of the processNode method, Listing 11 invokes the doIndent method to produce the required indentation, and then displays the name of the current node, followed by a space. Note that the cursor remains immediately to the right of the space and does not advance to the next line at this time.

    doIndent();

    System.out.print(nodeName + " ");

Listing 11

Display the type of the node on the same line

Recall that the invocation of the getNodeType method in Listing 9 returned a value of type int. The Node interface defines about a dozen symbolic constants that correlate the type values to names such as CDATA_SECTION_NODE.

A switch statement

Listing 12 shows the beginning of a switch statement that uses the type value from Listing 9, along with the constants from the Node interface, to display the alphanumeric node type to the right of the node name that was displayed by the code in Listing 11.

    switch(type){
      case Node.CDATA_SECTION_NODE:{
        System.out.println("CDATA_SECTION_NODE");
        break;
      }//end case Node.CDATA_SECTION_NODE

Listing 12

When the alphanumeric node type is displayed, the cursor moves down to the left-hand side of the next line.

For example, the code in Listings 11 and 12 would produce output similar to that shown in Figure 6 (the indentation may be different for different XML documents).

        #cdata-section CDATA_SECTION_NODE

Figure 6

The remainder of the switch statement

Listing 13 shows the remainder of the switch statement. There is nothing special about the code in Listing 13. As each node is examined, the code in Listing 11 performs the proper indentation and displays the name of the node. Then one of the cases in the switch statement is invoked to display the alphanumeric node type to the right of the node name and to advance the display cursor to the next line.

      case Node.COMMENT_NODE:{
        System.out.println("COMMENT_NODE");
        break;
      }//end case

      case Node.DOCUMENT_FRAGMENT_NODE:{
        System.out.println(
            "DOCUMENT_FRAGMENT_NODE");
        break;
      }//end case

      case Node.DOCUMENT_NODE:{
        System.out.println("DOCUMENT_NODE");
        break;
      }//end case Node.DOCUMENT_NODE

      case Node.DOCUMENT_TYPE_NODE:{
        System.out.println("DOCUMENT_TYPE_NODE");
        break;
      }//end case

      case Node.ELEMENT_NODE:{
        System.out.println("ELEMENT_NODE");
        break;
      }//end case Node.ELEMENT_NODE

      case Node.ENTITY_NODE:{
        System.out.println("ENTITY_NODE");
        break;
      }//end case

      case Node.ENTITY_REFERENCE_NODE:{
        System.out.println(
            "ENTITY_REFERENCE_NODE");
        break;
      }//end case Node.ENTITY_REFERENCE_NODE

      case Node.NOTATION_NODE:{
        System.out.println("NOTATION_NODE");
        break;
      }//end case

      case Node.PROCESSING_INSTRUCTION_NODE:{
        System.out.println(
            "PROCESSING_INSTRUCTION_NODE");
        break;
      }//end case

      //Handle text nodes
      case Node.TEXT_NODE:{
        System.out.println("TEXT_NODE");
        break;
      }//end case Node.TEXT_NODE

      default:{
        System.out.println("Unknown Node Type");
      }//end default case
    }//end switch

Listing 13

Get and process children of the current node

Following the switch statement, the code in Listing 14 invokes the getChildNodes method on the current node to get a list of the nodes that are children of the current node. That list is returned as an object of type NodeList. The NodeList object's reference is stored in the reference variable named children.

    NodeList children = node.getChildNodes();

Listing 14

A NodeList object provides an ordered collection of nodes, and provides two methods for accessing the items in the list:

A method named getLength returns the number of nodes in the list.
A method named item takes a parameter of type int, and uses that parameter to return the Node object's reference that is stored at that index.
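(Here is a minimal, stand-alone sketch of those two methods in use. The class name NodeListSketch, the method childNames, and the sample XML string are hypothetical and are not part of the program being discussed.)

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the lesson's programs.
class NodeListSketch {

  // Returns the names of the root element's children, separated by spaces.
  static String childNames(String xml) {
    try {
      NodeList children = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)))
          .getDocumentElement()
          .getChildNodes();
      StringBuilder sb = new StringBuilder();
      // getLength gives the number of nodes; item(i) gives the node at index i.
      for (int i = 0; i < children.getLength(); i++) {
        if (i > 0) {
          sb.append(' ');
        }
        sb.append(children.item(i).getNodeName());
      }
      return sb.toString();
    } catch (Exception e) {
      throw new IllegalArgumentException("Could not parse XML", e);
    }
  }

  public static void main(String[] args) {
    // The children are listed in document order.
    System.out.println(childNames("<A><Q>hi</Q><B/><B/></A>"));
  }
}
```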

Make recursive call to processNode method on each child node

Provided that the NodeList reference in the variable named children is not null, the code in Listing 15 uses a for loop to process each node whose reference is stored in the list.

    if (children != null){
      int len = children.getLength();

      for (int i = 0; i < len; i++){

        //Recursion !!!
        processNode(children.item(i));

      }//end for loop
    }//end if children

Listing 15

This is where the recursive processing occurs. The call to processNode in Listing 15 (marked by the Recursion comment) invokes the method once for each item in the list, passing the item as a parameter.

This causes the program to recursively examine every node in the DOM tree (except for attribute nodes), extracting and displaying information about each node as it is examined. This includes nodes in the prolog of the XML document as well as nodes in the body of the XML document.

(The issue of attribute nodes will be addressed in the next sample program.)

Decrease indentation level and terminate processNode method

When all the invocations of the processNode method finally return and the current instance of the processNode method terminates, it decreases the value of the variable named indent prior to termination as shown in Listing 16.

    indent--;

  }//end processNode(Node)
  //-------------------------------------------//

  // doIndent method goes here

}//end class DomTree01

Listing 16

Listing 16 signals the end of the processNode method, and the beginning of the method named doIndent, which was discussed earlier.

(Because it was discussed earlier, the code for the doIndent method was not included in Listing 16.)

The end of the doIndent method signals the end of the class and the end of the program named DomTree01.
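(For readers who would like to see the whole idea in one place, here is a condensed sketch of the same recursive traversal. It is not the DomTree01 program. It is a variation that I have labeled as such: it passes the depth as a method parameter instead of keeping it in an instance variable, it builds the output in a StringBuilder instead of printing directly, it parses XML from a string, and it names only a few node types. The class name TreeDumpSketch and its methods are hypothetical.)

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; a condensed variation, not the DomTree01 program.
class TreeDumpSketch {

  // Maps a few node type codes to display names; the rest are lumped together.
  static String typeName(short type) {
    switch (type) {
      case Node.DOCUMENT_NODE: return "DOCUMENT_NODE";
      case Node.ELEMENT_NODE:  return "ELEMENT_NODE";
      case Node.TEXT_NODE:     return "TEXT_NODE";
      case Node.COMMENT_NODE:  return "COMMENT_NODE";
      default:                 return "OTHER_NODE";
    }
  }

  // Appends one line for this node, then recurses into its children.
  static void dump(Node node, int depth, StringBuilder out) {
    for (int i = 0; i < depth; i++) {
      out.append("  "); // two spaces per level of indentation
    }
    out.append(node.getNodeName()).append(' ')
       .append(typeName(node.getNodeType())).append('\n');
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      dump(children.item(i), depth + 1, out);
    }
  }

  // Parses the XML text and returns the tree display as a string.
  static String dump(String xml) {
    try {
      Node document = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      StringBuilder out = new StringBuilder();
      dump(document, 0, out);
      return out.toString();
    } catch (Exception e) {
      throw new IllegalArgumentException("Could not parse XML", e);
    }
  }

  public static void main(String[] args) {
    System.out.print(dump("<A><Q>Header</Q><!--note--></A>"));
  }
}
```

Passing the depth as a parameter removes the need to increment and decrement a shared variable on entry and exit, at the cost of one extra parameter on each recursive call.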

The program named DomTree02

The program named DomTree02 is an upgraded version of DomTree01. This program displays the actual text belonging to text nodes instead of simply showing the type of node as TEXT_NODE.

DomTree02 also displays attribute names and values, which is not the case with DomTree01.
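(Before getting into the details of DomTree02, here is a minimal sketch of the two DOM calls that make this kind of display possible: getAttributes, which returns the attributes of an element as a NamedNodeMap, and getNodeValue, which returns the text of a Text node. This is not the code from DomTree02. The class name AttributeSketch and the method describe are hypothetical, and the sample element has a single attribute because I don't want to assume anything about the order in which a NamedNodeMap returns its items.)

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; not the DomTree02 program.
class AttributeSketch {

  // Describes the root element, its attributes, and its first text child.
  static String describe(String xml) {
    try {
      Element element = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)))
          .getDocumentElement();
      StringBuilder out = new StringBuilder();
      out.append(element.getNodeName()).append(" ELEMENT_NODE\n");

      // Each attribute is itself a Node with a name and a value.
      NamedNodeMap attrs = element.getAttributes();
      for (int i = 0; i < attrs.getLength(); i++) {
        Node attr = attrs.item(i);
        out.append("  Attribute: ").append(attr.getNodeName())
           .append('=').append(attr.getNodeValue()).append('\n');
      }

      // For a Text node, getNodeValue returns the actual text.
      Node child = element.getFirstChild();
      if (child != null && child.getNodeType() == Node.TEXT_NODE) {
        out.append("  #text ").append(child.getNodeValue()).append('\n');
      }
      return out.toString();
    } catch (Exception e) {
      throw new IllegalArgumentException("Could not parse XML", e);
    }
  }

  public static void main(String[] args) {
    System.out.print(describe(
        "<title another_test_attr=\"More Test\">Dogs</title>"));
  }
}
```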

Sample output from DomTree02

Figure 7 shows the output produced by using DomTree02 to process the XML file named DomTree02.xml. (You can view a listing of this XML file in Listing 23 near the end of the lesson.)

I colored the attributes red and the text green in Figure 7 to make them easy to spot.

(Note that some of the text consists of invisible newline characters, which are impossible to color green.)

#document DOCUMENT_NODE
  top DOCUMENT_TYPE_NODE
  #comment COMMENT_NODE
  xml-stylesheet PROCESSING_INSTRUCTION_NODE
  top ELEMENT_NODE
    #comment COMMENT_NODE
    theData ELEMENT_NODE
        Attribute: type=Programming
        Attribute: test_attr=Testing
      title ELEMENT_NODE
        #text Java
      author ELEMENT_NODE
        #text R.Baldwin
      price ELEMENT_NODE
        #text $9.95
        uvw ELEMENT_NODE
          #text abc-
          xyz ELEMENT_NODE
            #text def-
          #text

        uvw ELEMENT_NODE
          #text ghi-
        #text each

    theData ELEMENT_NODE
        Attribute: type=Pets
      title ELEMENT_NODE
          Attribute: another_test_attr=More Test
        #text Dogs
      author ELEMENT_NODE
        #text R.U.Barking
      price ELEMENT_NODE
        #text $19.95

Figure 7

Displaying text versus displaying node type

Sometimes it can be very useful to display the actual text values in the tree. At other times, the text is so voluminous that it completely overwhelms the display, making it difficult to pick out the structure of the tree. In those cases, the version that simply identifies each text node as a text node is probably the better choice.

(A good learning exercise would be for you to write a single program where the user specifies whether the tree is to simply identify text nodes, or is to display the actual text value of each text node, by entering a parameter on the command line.)
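One possible starting point for that exercise is sketched below. It is only a sketch under stated assumptions, not one of the programs in this lesson. The class name TextModeSketch, the methods describeText and demo, and the convention that a first command-line argument of "text" selects the display of text values are all hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch of the exercise.
class TextModeSketch {

  // Returns what should follow the node name on
  // the output line for a text node.
  static String describeText(Node textNode,
                             boolean showText) {
    return showText ? textNode.getNodeValue()
                    : "TEXT_NODE";
  }

  // Parses the XML text and describes the first
  // child of the root element, which is assumed
  // to be a text node.
  static String demo(String xml, boolean showText) {
    try {
      Document doc =
          DocumentBuilderFactory.newInstance()
              .newDocumentBuilder()
              .parse(new InputSource(
                  new StringReader(xml)));
      Node text =
          doc.getDocumentElement().getFirstChild();
      return text.getNodeName() + " "
          + describeText(text, showText);
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    //Assumed convention: "text" as the first
    // argument selects display of text values.
    boolean showText =
        args.length > 0 && args[0].equals("text");
    System.out.println(
        demo("<title>Java</title>", showText));
  }
}
```

In a complete program, the boolean would be stored where the TEXT_NODE case of the switch statement can consult it.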

Will discuss in fragments

I will discuss the program named DomTree02 in fragments. A complete listing of the program is shown in Listing 22 near the end of the lesson.

Large portions of this program are identical or very similar to the code in the program named DomTree01, discussed earlier in this lesson. Therefore, I won't repeat the discussion of that code. Rather, I will restrict this discussion to those parts of this program that differ from the earlier program.

The main method in this program is essentially the same as the main method in the previous program, so I will skip a discussion of the main method.

As before, the method named processNode is used to recursively process the entire DOM tree, extracting and displaying information about the nodes in the tree along the way. The method named processNode in this program is the same as in the previous program except for the code in a couple of cases in the switch statement.

New features in DomTree02

Previously, the cases in the switch statement were used to display the alphanumeric type of each node in the tree. In this program, the case for TEXT_NODE is modified to cause the actual text value of the text node to be displayed instead of the type of the node.

In addition, the case for ELEMENT_NODE in this program is modified to get and display the names and values of all attributes associated with elements.

The ELEMENT_NODE case

I will begin by explaining the changes to the ELEMENT_NODE case in the switch statement. Listing 17 shows the beginning of the ELEMENT_NODE case.

  private void processNode(Node node){

    //Code deleted for brevity

    switch(type){

      //Case code deleted for brevity

      case Node.ELEMENT_NODE:{
        System.out.println("ELEMENT_NODE");
        //Get and display attributes if any
        NamedNodeMap attrList =
          node.getAttributes();
        int attrLen = 0;
        if(attrList != null){
          attrLen = attrList.getLength();
        }//end if

Listing 17

A map of attribute nodes

There is a very important conceptual issue to deal with here. Specifically, attribute nodes are not child nodes of element nodes. All child nodes of an element node can be obtained in a collection of type NodeList by invoking the method named getChildNodes on the element node, but that collection does not include the attributes of the element.

In order to get the attributes belonging to an element node, it is necessary to invoke the method named getAttributes on the element node. This method returns a reference to an object of type NamedNodeMap containing unordered references to the attribute nodes.
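The difference between the two methods can be seen in a small experiment. The following is a minimal sketch, not part of DomTree02. The class name AttrVsChildSketch and the method counts are hypothetical, and the XML is parsed from a String to keep the sketch self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical sketch; not part of DomTree02.
class AttrVsChildSketch {

  // Returns {number of child nodes, number of
  // attributes} for the root element.
  static int[] counts(String xml) {
    try {
      Element root =
          DocumentBuilderFactory.newInstance()
              .newDocumentBuilder()
              .parse(new InputSource(
                  new StringReader(xml)))
              .getDocumentElement();
      return new int[] {
        root.getChildNodes().getLength(),
        root.getAttributes().getLength()
      };
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    int[] c = counts("<theData type=\"Pets\">"
        + "<title>Dogs</title></theData>");
    //The title element is the only child node.
    // The type attribute is not counted as a
    // child node.
    System.out.println("child nodes: " + c[0]);
    System.out.println("attributes: " + c[1]);
  }
}
```

For this input, the element reports one child node and one attribute, which shows that the attribute is reachable only through getAttributes.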

NamedNodeMap versus NodeList

A NamedNodeMap is a different type of data structure than a NodeList.

A NodeList is an ordered collection of references to Node objects. Items in the list are accessed on the basis of an ordinal index. They cannot be accessed on the basis of the name of a node. The order of the items in the list matches the ordering of the corresponding nodes in the DOM tree.

NamedNodeMap

Sun describes objects of type NamedNodeMap as

"collections of nodes that can be accessed by name".

Sun goes on to tell us,

"NamedNodeMaps are not maintained in any particular order. Objects contained in an object implementing NamedNodeMap may also be accessed by an ordinal index, but this is simply to allow convenient enumeration of the contents of a NamedNodeMap, and does not imply that the DOM specifies an order to these Nodes."

Therefore, references to objects representing attribute nodes can be accessed in a NamedNodeMap object either on the basis of the attribute name, or on the basis of an ordinal index. I will use an ordinal index in this program, as shown in Listing 18.
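Both styles of access can be sketched as follows. This is a minimal sketch and not part of DomTree02. The class name NamedNodeMapSketch and the methods rootAttrs, byName, and sortedPairs are hypothetical. Because the order of items in a NamedNodeMap is not specified, the sketch sorts the name=value pairs before returning them, which is one way to get repeatable output.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch; not part of DomTree02.
class NamedNodeMapSketch {

  // Returns the attributes of the root element.
  static NamedNodeMap rootAttrs(String xml) {
    try {
      return DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(
              new StringReader(xml)))
          .getDocumentElement()
          .getAttributes();
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  // Access by name. getNamedItem returns null
  // when no attribute has the specified name.
  static String byName(String xml, String name) {
    Node attr = rootAttrs(xml).getNamedItem(name);
    return attr == null ? null
                        : attr.getNodeValue();
  }

  // Access by ordinal index, followed by a sort
  // so that the result does not depend on the
  // order chosen by the parser.
  static List<String> sortedPairs(String xml) {
    NamedNodeMap map = rootAttrs(xml);
    List<String> pairs = new ArrayList<String>();
    for (int i = 0; i < map.getLength(); i++) {
      Node attr = map.item(i);
      pairs.add(attr.getNodeName() + "="
          + attr.getNodeValue());
    }
    Collections.sort(pairs);
    return pairs;
  }

  public static void main(String[] args) {
    String xml = "<theData type=\"Programming\" "
        + "test_attr=\"Testing\"/>";
    System.out.println(byName(xml, "type"));
    System.out.println(sortedPairs(xml));
  }
}
```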

Get and display name and value of attribute nodes

Listing 18 shows the remaining code for the ELEMENT_NODE case in the switch statement.

        for(int i = 0; i < attrLen; i++){
          Node attrNode = attrList.item(i);
          doIndent();
          System.out.println("    Attribute: "
            + attrNode.getNodeName()
            + "="
            + attrNode.getNodeValue());
        }//end for loop
        break;
      }//end case Node.ELEMENT_NODE

Listing 18

Listing 18 uses a for loop to iterate on the NamedNodeMap object, getting a reference to each attribute node in sequence, and using that reference to get and display the name and value of the attribute properly indented.

(Note in Listing 18 and Figure 7 that the attribute information is indented an additional four spaces relative to the element node to visually separate the attribute information from the child nodes of the element. This was done solely for cosmetic purposes.)

The modified TEXT_NODE case

Listing 19 shows the modified TEXT_NODE case in the switch statement, and the end of the switch statement.

      //Case code deleted for brevity

      case Node.TEXT_NODE:{
        System.out.println(node.getNodeValue());
        break;
      }//end case Node.TEXT_NODE

      //default case code deleted for brevity

    }//end switch

Listing 19

The version of this case in the program named DomTree01 simply displayed the text TEXT_NODE each time the case was invoked.

This version invokes the method named getNodeValue on the node and displays the String that is returned by that method. This code produced the green text values for the text nodes represented in Figure 7.

(Recall that the word #text in Figure 7 was displayed by code that invoked the getNodeName method prior to control entering the switch statement. This is the same in both programs. Only the red and green text in Figure 7 is new.)
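Because some text values contain invisible newline characters, it can be helpful to make those characters visible before displaying the value. The following is a minimal sketch of one way to do that. It is not part of DomTree02, and the class name VisibleTextSketch and the methods visible and firstText are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch; not part of DomTree02.
class VisibleTextSketch {

  // Replaces carriage return, newline, and tab
  // characters with visible escape sequences.
  static String visible(String value) {
    return value.replace("\r", "\\r")
                .replace("\n", "\\n")
                .replace("\t", "\\t");
  }

  // Returns the value of the first child of the
  // root element, which is assumed to be a text
  // node.
  static String firstText(String xml) {
    try {
      Node text =
          DocumentBuilderFactory.newInstance()
              .newDocumentBuilder()
              .parse(new InputSource(
                  new StringReader(xml)))
              .getDocumentElement()
              .getFirstChild();
      return text.getNodeValue();
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    //The text in this element ends with a
    // newline, similar to the text each in
    // Figure 7.
    String value =
        firstText("<price>each\n</price>");
    System.out.println("#text " + visible(value));
  }
}
```

With this change, a text node that contains only a newline would be displayed as \n instead of producing an apparently empty line in the output.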

Beyond this point, both programs are the same

The remainder of this program is the same as DomTree01, and therefore, doesn't merit further discussion.

Run the Programs

I encourage you to copy the code and XML data from Listings 20 through 23 into your text editor. Compile and execute the programs. Experiment with them, making changes, and observing the results of your changes.

Summary

In this lesson, I showed you how to write a program to display a DOM tree on the screen in a format that is much easier to interpret than raw XML code. I explained two different versions of the program. One version simply identifies text nodes in the output tree. The other version displays the value of text nodes in the output tree. Also, the first version ignores attributes in the output tree, while the second version includes attributes in the output tree.

What's Next?

In the next lesson, I will explain default XSLT behavior and show you how to write Java code that mimics that behavior. The resulting Java code will serve as a skeleton for more advanced transformation programs.

Complete Program Listings

Complete listings of the two Java classes and the two XML documents discussed in this lesson are shown in Listings 20 through 23 below.

/*File DomTree01.java
Copyright 2003 R.G.Baldwin

This program produces a text-based output on
the screen that represents the tree structure
of an XML file.

Not tested for DOCUMENT_FRAGMENT_NODE.
Not tested for ENTITY_NODE.
Not tested for ENTITY_REFERENCE_NODE.
Not tested for NOTATION_NODE.

The following output was produced by testing
this program with the XML file named
DomTree01.xml

#document DOCUMENT_NODE
A DOCUMENT_TYPE_NODE
#comment COMMENT_NODE
xml-stylesheet PROCESSING_INSTRUCTION_NODE
processor PROCESSING_INSTRUCTION_NODE
A ELEMENT_NODE
Q ELEMENT_NODE
#text TEXT_NODE
B ELEMENT_NODE
C ELEMENT_NODE
#text TEXT_NODE
#cdata-section CDATA_SECTION_NODE
R ELEMENT_NODE
#text TEXT_NODE
C ELEMENT_NODE
#text TEXT_NODE
S ELEMENT_NODE
#text TEXT_NODE
B ELEMENT_NODE
C ELEMENT_NODE
#text TEXT_NODE
S ELEMENT_NODE
#text TEXT_NODE
B ELEMENT_NODE
C ELEMENT_NODE
#text TEXT_NODE
T ELEMENT_NODE
#text TEXT_NODE
B ELEMENT_NODE
C ELEMENT_NODE
#text TEXT_NODE
D ELEMENT_NODE
E ELEMENT_NODE
#text TEXT_NODE
G ELEMENT_NODE
#text TEXT_NODE
#text TEXT_NODE
F ELEMENT_NODE
#text TEXT_NODE
E ELEMENT_NODE
#text TEXT_NODE
F ELEMENT_NODE
#text TEXT_NODE
E ELEMENT_NODE
#text TEXT_NODE
F ELEMENT_NODE
#text TEXT_NODE
C ELEMENT_NODE
#text TEXT_NODE
C ELEMENT_NODE
#text TEXT_NODE
R ELEMENT_NODE
#text TEXT_NODE
C ELEMENT_NODE
#text TEXT_NODE
B ELEMENT_NODE
R ELEMENT_NODE
#text TEXT_NODE
C ELEMENT_NODE
#text TEXT_NODE

Note. No effort was made to provide meaningful
information about errors and exceptions.

Tested using SDK 1.4.2 under WinXP.
************************************************/

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.DocumentBuilder;
import java.io.File;
import java.io.FileOutputStream;
import org.w3c.dom.*;

public class DomTree01{

  int indent = -1;//Indentation level for display

  public static void main(String argv[]){
    if (argv.length != 2){
      System.err.println(
        "usage: java DomTree01 fileIn validate");
      System.err.println(
        "validate = n for no, y for yes");
      System.exit(0);
    }//end if

    try{
      //Get a factory object for DocumentBuilder
      // objects
      DocumentBuilderFactory factory =
        DocumentBuilderFactory.newInstance();

      //Configure the factory object setting
      // validating true or false based on user
      // input.
      if(argv[1].equals("n")){
        factory.setValidating(false);
      }else{
        factory.setValidating(true);
      }//end if/else

      factory.setNamespaceAware(false);
      //Set to ignore cosmetic white space
      // between elements.
      factory.
        setIgnoringElementContentWhitespace(true);

      //Get a DocumentBuilder (parser) object
      DocumentBuilder builder =
        factory.newDocumentBuilder();

      //Parse the XML input file to create a
      // Document object that represents the
      // input XML file.
      Document document = builder.parse(
        new File(argv[0]));

      //Instantiate an object of this class
      DomTree01 thisObj = new DomTree01();

      thisObj.processNode(document);

    }catch(Exception e){
      e.printStackTrace(System.err);
    }//end catch

  }// end main()
  //-------------------------------------------//

  //This method is used recursively to identify
  // and display node structure.
  private void processNode(Node node){
    if (node == null){
      System.err.println(
        "Nothing to do, node is null");
      return;
    }//end if

    //Increase indentation level for display.
    indent++;
    //Get name and type of node. Some types of
    // nodes have generic names, such as #text.
    // Other nodes have actual names.
    String nodeName = node.getNodeName();
    int type = node.getNodeType();

    //Indent to the correct level and display the
    // name of the node.
    doIndent();
    System.out.print(nodeName + " ");

    //Use the type to display the type of the
    // node on the same line following the name
    // of the node.
    switch(type){
      case Node.CDATA_SECTION_NODE:{
        System.out.println("CDATA_SECTION_NODE");
        break;
      }//end case Node.CDATA_SECTION_NODE

      case Node.COMMENT_NODE:{
        System.out.println("COMMENT_NODE");
        break;
      }//end case

      case Node.DOCUMENT_FRAGMENT_NODE:{
        System.out.println(
          "DOCUMENT_FRAGMENT_NODE");
        break;
      }//end case

      case Node.DOCUMENT_NODE:{
        System.out.println("DOCUMENT_NODE");
        break;
      }//end case Node.DOCUMENT_NODE

      case Node.DOCUMENT_TYPE_NODE:{
        System.out.println("DOCUMENT_TYPE_NODE");
        break;
      }//end case

      case Node.ELEMENT_NODE:{
        System.out.println("ELEMENT_NODE");
        break;
      }//end case Node.ELEMENT_NODE

      case Node.ENTITY_NODE:{
        System.out.println("ENTITY_NODE");
        break;
      }//end case

      case Node.ENTITY_REFERENCE_NODE:{
        System.out.println(
          "ENTITY_REFERENCE_NODE");
        break;
      }//end case Node.ENTITY_REFERENCE_NODE

      case Node.NOTATION_NODE:{
        System.out.println("NOTATION_NODE");
        break;
      }//end case

      case Node.PROCESSING_INSTRUCTION_NODE:{
        System.out.println(
          "PROCESSING_INSTRUCTION_NODE");
        break;
      }//end case

      //Handle text nodes
      case Node.TEXT_NODE:{
        System.out.println("TEXT_NODE");
        break;
      }//end case Node.TEXT_NODE

      default:{
        System.out.println("Unknown Node Type");
      }//end default case
    }//end switch

    //This method is first called on the node
    // that represents the root node of the DOM
    // tree. The following code recursively
    // processes the entire tree.
    NodeList children = node.getChildNodes();
    if (children != null){
      int len = children.getLength();
      //Iterate on NodeList of child nodes.
      for (int i = 0; i < len; i++){
        //Process each of the nested elements
        // recursively.
        processNode(children.item(i));
      }//end for loop
    }//end if children

    //Decrease indentation level for display
    indent--;
  }//end processNode(Node)
  //-------------------------------------------//

  //This method displays two spaces for each
  // level of indentation.
  private void doIndent(){
    for(int cnt = 0; cnt < indent; cnt++){
      System.out.print("  ");
    }//end for loop
  }//end doIndent

}//end class DomTree01

Listing 20

<?xml version="1.0"?>

<!DOCTYPE A [
<!ELEMENT A (Q,B,B)*>
<!ELEMENT B (B | C | D | R | S | T)*>
<!ELEMENT C (#PCDATA)>
<!ELEMENT D (E | F)*>
<!ELEMENT E (#PCDATA | G)*>
<!ELEMENT F (#PCDATA)>
<!ELEMENT G (#PCDATA)>
<!ELEMENT Q (#PCDATA)>
<!ELEMENT R (#PCDATA)>
<!ELEMENT S (#PCDATA)>
<!ELEMENT T (#PCDATA)>
]>



<?xml-stylesheet type="text/xsl"
href="Dom03.xsl"?>
<?processor ProcInstr="Dummy"?>

<A>
<Q>A Big Header</Q>


<C>Level 0. This is the beginning of a B.
This text is in the Introduction section.
<![CDATA[This is CDATA < > " &]]></C>

<R>A Mid Header</R>

<C>Text block 1.</C>

<S>A Small Header</S>

<C>Text block 2.</C>


<S>Another Small Header</S>

<C>Text block 3.</C>

<T>A Smallest Header</T>

<C>Text block 4.</C>

<D>
<E>First list item in E
<G>Nested G text element</G>
</E>
<F>First list item in F</F>
<E>Second list item in E</E>
<F>Second list item in F</F>
<E>Third list item in E</E>
<F>Third list item in F</F>
</D>

<C>Text block 5.</C>

<C>Text block 6.</C>


<R>Another Mid Header</R>
<C>Text block 7.</C>



<R>Another Mid Header in Another B</R>
<C>Text block 8.</C>

</A>

Listing 21

/*File DomTree02.java
Copyright 2003 R.G.Baldwin

This program is an upgraded version of DomTree01.
This version shows the actual text belonging to
text nodes instead of simply showing the type
of node.

This version also displays attribute names and
values, which was not the case with DomTree01.

This program produces a text-based output on
the screen that represents the tree structure
of an XML file.

Not tested for DOCUMENT_FRAGMENT_NODE.
Not tested for ENTITY_NODE.
Not tested for ENTITY_REFERENCE_NODE.
Not tested for NOTATION_NODE.

The following output was produced by testing
this program with the XML file named
DomTree02.xml

#document DOCUMENT_NODE
  top DOCUMENT_TYPE_NODE
  #comment COMMENT_NODE
  xml-stylesheet PROCESSING_INSTRUCTION_NODE
  top ELEMENT_NODE
    #comment COMMENT_NODE
    theData ELEMENT_NODE
        Attribute: type=Programming
        Attribute: test_attr=Testing
      title ELEMENT_NODE
        #text Java
      author ELEMENT_NODE
        #text R.Baldwin
      price ELEMENT_NODE
        #text $9.95
        uvw ELEMENT_NODE
          #text abc-
          xyz ELEMENT_NODE
            #text def-
          #text

        uvw ELEMENT_NODE
          #text ghi-
        #text each

    theData ELEMENT_NODE
        Attribute: type=Pets
      title ELEMENT_NODE
          Attribute: another_test_attr=More Test
        #text Dogs
      author ELEMENT_NODE
        #text R.U.Barking
      price ELEMENT_NODE
        #text $19.95

Note. No effort was made to provide meaningful
information about errors and exceptions.

Tested using SDK 1.4.2 under WinXP.
************************************************/

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.DocumentBuilder;
import java.io.File;
import java.io.FileOutputStream;
import org.w3c.dom.*;

public class DomTree02{

  int indent = -1;//Indentation level for display

  public static void main(String argv[]){
    if (argv.length != 2){
      System.err.println(
        "usage: java DomTree02 fileIn validate");
      System.err.println(
        "validate = n for no, y for yes");
      System.exit(0);
    }//end if

    try{
      //Get a factory object for DocumentBuilder
      // objects
      DocumentBuilderFactory factory =
        DocumentBuilderFactory.newInstance();

      //Configure the factory object setting
      // validating true or false based on user
      // input.
      if(argv[1].equals("n")){
        factory.setValidating(false);
      }else{
        factory.setValidating(true);
      }//end if/else

      factory.setNamespaceAware(false);
      //Set to ignore cosmetic white space
      // between elements.
      factory.
        setIgnoringElementContentWhitespace(true);

      //Get a DocumentBuilder (parser) object
      DocumentBuilder builder =
        factory.newDocumentBuilder();

      //Parse the XML input file to create a
      // Document object that represents the
      // input XML file.
      Document document = builder.parse(
        new File(argv[0]));

      //Instantiate an object of this class
      DomTree02 thisObj = new DomTree02();

      thisObj.processNode(document);

    }catch(Exception e){
      e.printStackTrace(System.err);
    }//end catch

  }// end main()
  //-------------------------------------------//

  //This method is used recursively to identify
  // and display node structure.
  private void processNode(Node node){
    if (node == null){
      System.err.println(
        "Nothing to do, node is null");
      return;
    }//end if

    //Increase indentation level for display.
    indent++;
    //Get name and type of node. Some types of
    // nodes have generic names, such as #text.
    // Other nodes have actual names.
    String nodeName = node.getNodeName();
    int type = node.getNodeType();

    //Indent to the correct level and display the
    // name of the node.
    doIndent();
    System.out.print(nodeName + " ");

    //Use the type to display the type of the
    // node on the same line following the name
    // of the node.
    switch(type){
      case Node.CDATA_SECTION_NODE:{
        System.out.println("CDATA_SECTION_NODE");
        break;
      }//end case Node.CDATA_SECTION_NODE

      case Node.COMMENT_NODE:{
        System.out.println("COMMENT_NODE");
        break;
      }//end case

      case Node.DOCUMENT_FRAGMENT_NODE:{
        System.out.println(
          "DOCUMENT_FRAGMENT_NODE");
        break;
      }//end case

      case Node.DOCUMENT_NODE:{
        System.out.println("DOCUMENT_NODE");
        break;
      }//end case Node.DOCUMENT_NODE

      case Node.DOCUMENT_TYPE_NODE:{
        System.out.println("DOCUMENT_TYPE_NODE");
        break;
      }//end case

      case Node.ELEMENT_NODE:{
        System.out.println("ELEMENT_NODE");
        //Get and display attributes if any
        NamedNodeMap attrList =
          node.getAttributes();
        int attrLen = 0;
        if(attrList != null){
          attrLen = attrList.getLength();
        }//end if

        for(int i = 0; i < attrLen; i++){
          Node attrNode = attrList.item(i);
          doIndent();
          System.out.println("    Attribute: "
            + attrNode.getNodeName()
            + "="
            + attrNode.getNodeValue());
        }//end for loop
        break;
      }//end case Node.ELEMENT_NODE

      case Node.ENTITY_NODE:{
        System.out.println("ENTITY_NODE");
        break;
      }//end case

      case Node.ENTITY_REFERENCE_NODE:{
        System.out.println(
          "ENTITY_REFERENCE_NODE");
        break;
      }//end case Node.ENTITY_REFERENCE_NODE

      case Node.NOTATION_NODE:{
        System.out.println("NOTATION_NODE");
        break;
      }//end case

      case Node.PROCESSING_INSTRUCTION_NODE:{
        System.out.println(
          "PROCESSING_INSTRUCTION_NODE");
        break;
      }//end case

      //Handle text nodes
      case Node.TEXT_NODE:{
        System.out.println(node.getNodeValue());
        break;
      }//end case Node.TEXT_NODE

      default:{
        System.out.println("Unknown Node Type");
      }//end default case
    }//end switch

    //This method is first called on the node
    // that represents the root node of the DOM
    // tree. The following code recursively
    // processes the entire tree.
    NodeList children = node.getChildNodes();
    if (children != null){
      int len = children.getLength();
      //Iterate on NodeList of child nodes.
      for (int i = 0; i < len; i++){
        //Process each of the nested elements
        // recursively.
        processNode(children.item(i));
      }//end for loop
    }//end if children

    //Decrease indentation level for display
    indent--;
  }//end processNode(Node)
  //-------------------------------------------//

  //This method displays two spaces for each
  // level of indentation.
  private void doIndent(){
    for(int cnt = 0; cnt < indent; cnt++){
      System.out.print("  ");
    }//end for loop
  }//end doIndent

}//end class DomTree02

Listing 22

<?xml version="1.0"?>

<!DOCTYPE top [
<!ELEMENT top (theData)*>
<!ELEMENT theData (title,author,price)*>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT price (#PCDATA | uvw)*>
<!ELEMENT uvw (#PCDATA | xyz)*>
<!ELEMENT xyz ANY>
<!ATTLIST theData type CDATA #REQUIRED>
<!ATTLIST theData test_attr CDATA #IMPLIED>
<!ATTLIST title another_test_attr CDATA #IMPLIED>
]>



<?xml-stylesheet type="text/xsl"
href="Dom07.xsl"?>

<top>


<theData type="Programming" test_attr="Testing">
<title>Java</title>
<author>R.Baldwin</author>
<price>$9.95<uvw>abc-<xyz>def-</xyz>
</uvw><uvw>ghi-</uvw>each
</price>
</theData>

<theData type="Pets">
<title another_test_attr="More Test">Dogs</title>
<author>R.U.Barking</author>
<price>$19.95</price>
</theData>

</top>

Listing 23

About the author

Richard Baldwin is a college professor (at Austin Community College in Austin, TX) and private consultant whose primary focus is a combination of Java, C#, and XML. In addition to the many platform and/or language independent benefits of Java and C# applications, he believes that a combination of Java, C#, and XML will become the primary driving force in the delivery of structured information on the Web.

Richard has participated in numerous consulting projects, and he frequently provides onsite training at the high-tech companies located in and around Austin, Texas. He is the author of Baldwin's Programming Tutorials, which has gained a worldwide following among experienced and aspiring programmers. He has also published articles in JavaPro magazine.

Richard holds an MSEE degree from Southern Methodist University and has many years of experience in the application of computer technology to real-world problems.

Baldwin@DickBaldwin.com

-end-