"Seminar Informationsmanagement für Wirtschaftsinformatik SS 99"
Klaus Hammermüller
9025298
 klaus@ifs.tuwien.ac.at

XML-parsing and XML-APIs

Abstract

This work gives a short overview of different XML parsers and APIs, concentrating on Java implementations. The focus is an implemented example that generates a language specification from an XML document, producing a DTD (Document Type Definition) and some Java classes related to that DTD. The use of the XML API on the parse tree of the original XML document is demonstrated.

Contents

Introduction
  What is this paper
  What is XML
Overview on XML-parser
  Parsing APIs
    Document Object Model (DOM)
    The Simple API for XML (SAX)
  Parsing Tools
    Aelfred Version: 1.2a
    IBM XML Parser for Java 2.0
    Lark and Larval parser
    Microsoft-DataChannel XML Java parser
    P3P Parser
    Silfide XML Parser (XSP) - validating
    SiRPAC: RDF/XML compiler
    xmlproc validating parser
    XP
  Related Terms
    Schema for XML
    Extensible Stylesheet Language (XSL)
    Resource Description Framework (RDF)
    Document Style Semantics and Specification Language (DSSSL)
Example: Processing a Document
  Building a Meta-Model
  Generating DTD and Java-Classes
  Processing a Document
  Generating Java Code
Conclusion

1  Introduction

1.1  What is this paper

This work introduces the use of XML documents with a parser and a Java API. Although there are implementations of these APIs for Python and C as well, most of the references given here concentrate on (free) Java implementations. To understand this paper, knowledge of basic Java syntax is necessary.

Section 2 gives a short overview of the different ways to access an XML document using the DOM or SAX API, followed by a collection of available parsers which support these APIs. At the end, some related concepts are explained.

Section 3 gives a simple example which demonstrates the power of combining Java and XML: Java code is generated from a definition given in XML (by parsing it and using the API).

1.2  What is XML

Extensible Markup Language, abbreviated XML [XML Specification: http://www.w3.org/XML], describes a class of data objects called XML documents and partially describes the behavior of computer programs which process them. XML is an application profile or restricted form of SGML, the Standard Generalized Markup Language [ISO 8879]. By construction, XML documents are conforming SGML documents.

XML documents are made up of storage units called entities, which contain either parsed or unparsed data. Parsed data is made up of characters, some of which form character data, and some of which form markup. Markup encodes a description of the document's storage layout and logical structure. XML provides a mechanism to impose constraints on the storage layout and logical structure.

A software module called an XML processor is used to read XML documents and provide access to their content and structure. It is assumed that an XML processor is doing its work on behalf of another module, called the application. This specification describes the required behavior of an XML processor in terms of how it must read XML data and the information it must provide to the application.

2  Overview on XML-parser

2.1  Parsing APIs

Figure 1 shows the parsing of an XML document. Two different APIs are defined: DOM (Document Object Model), which generates a hierarchical parse tree, and SAX (Simple API for XML), which processes a document event-based without generating a parse tree. DOM is very useful for navigating a document, SAX for processing a very large document with little memory.

Figure 1: parsing XML

Both APIs are available within freely distributed pure-Java XML parsers like XML for Java from IBM. A good introduction to the use of Java and XML is given in [Chang, Dan and Harkey, Dan Client-Server data access with Java and XML 1998, Wiley, New York].

2.1.1  Document Object Model (DOM)

The DOM Level 1 Specification [Document Object Model: http://www.w3.org/DOM] is a W3C Recommendation. The Document Object Model is a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents. The document can be further processed and the results of that processing can be incorporated back into the presented page.

It builds up a tree of a document or document fragment which holds all elements (objects) of the document in nodes or leaves. All objects can be accessed directly through this logical structure, without searching the document sequentially. On the other hand, the whole document is held in memory, so its size is limited by the amount of memory available.



Example [Example using DOM: http://developerlife.com/xmljavatutorial1/default.htm]: You first need a well-formed XML document and a validating XML parser in order to read information into your programs. Sun and IBM, for example, both provide validating XML parsers in Java [IBM XML developing zone: http://www.ibm.com/developer/xml].

Java interfaces for DOM have been defined by the W3C and these are available in the  org.w3c.dom package. The code that is required to instantiate a DOM object is different depending on which parser you use. Code for instantiating DOM objects using IBM's parser looks like:

  import com.ibm.xml.parser.*;
  import org.w3c.dom.*;
  import java.io.InputStream;
  import java.net.URL;

  public class DOMApp {

    public static void main (String args[]) throws Exception {
      URL u = new URL("http://beanfactory.com/xml/AddressBook.xml");
      InputStream i = u.openStream();
      // The constructor argument is only used as a name in messages.
      Parser ps = new Parser("just put any string here");
      Document doc = ps.readStream(i);
    }
  }
Now that the DOM (org.w3c.dom.Document) object has been created using either parser, it is time to extract information from the document. Let's talk about the AddressBook.xml file. Here is the DTD for this XML file:

  <?xml version="1.0"?>
  <!DOCTYPE ADDRESSBOOK [
  <!ELEMENT ADDRESSBOOK (PERSON)*>
  <!ELEMENT PERSON (LASTNAME, FIRSTNAME, COMPANY, EMAIL)>
  <!ELEMENT LASTNAME (#PCDATA)>
  <!ELEMENT FIRSTNAME (#PCDATA)>
  <!ELEMENT COMPANY (#PCDATA)>
  <!ELEMENT EMAIL (#PCDATA)>
  ]>

Figure 2: Illustration of this DTD

DOM creates a tree-based (or hierarchical) object model from the XML document. The Document (created from an XML file) contains a tree of Nodes. Methods in the Node interface allow you to find out whether a Node has children, what the type of the Node is, and what its value is (if any). There are many types of Nodes, but we are interested in the following: TEXT_NODE (=3) and ELEMENT_NODE (=1). These types are static int values defined in the org.w3c.dom.Node interface created by the W3C.

So a Document object is a simple container of Nodes. But in our DTD we have Elements, not Nodes. It just so happens that there is an interface called Element (which extends Node), and a Node of type ELEMENT_NODE is also an Element. Nodes of type ELEMENT_NODE (or Elements) can also have children. How do you access these children? Through the NodeList interface, of course; the NodeList interface defines two methods to allow the iteration of a list of Nodes. These NodeList objects are generated by Node objects of type ELEMENT_NODE (or Element objects). The Document interface has a method called getElementsByTagName(String tagname) which returns a NodeList of all the Elements with that tag name.

So here is how we can extract information from our Document object. We first ask the document object for all the Element objects that have the tag name "PERSON". This should return all the Element objects that are PERSONs; all the Element objects with this tag name are returned in a  NodeList object. We can use the  getLength() method on this  NodeList to determine how many PERSON elements are in the NodeList. Here is some code to do this:

  Document doc = ... //create DOM from AddressBook.xml
  NodeList listOfPersons =
    doc.getElementsByTagName( "PERSON" );
  int numberOfPersons = listOfPersons.getLength();
Now that we have the NodeList object containing all the PERSON Elements (which are also Nodes), we can iterate it to extract information from each PERSON Element (Node). The method item(int index) in NodeList returns a Node object. Remember that when the type of a Node is ELEMENT_NODE, it is actually an Element. So here is the code to get the first person from our NodeList (assuming there is at least one person in the AddressBook.xml file):

  if (numberOfPersons > 0 ){
    Node firstPersonNode = listOfPersons.item( 0 );
    if( firstPersonNode.getNodeType() == Node.ELEMENT_NODE ){
      Element firstPersonElement = (Element)firstPersonNode;
    }
  }
Now we have a reference to the  firstPersonElement, which we can use to find out the FIRSTNAME, LASTNAME, COMPANY and EMAIL information of this PERSON element. Since the   firstPersonElement is an Element, we can use   getElementsByTagName(String) again to get the FIRSTNAME, LASTNAME, COMPANY and EMAIL elements in it. Here is the code to do get the FIRSTNAME of the  firstPersonElement:

  NodeList list =
    firstPersonElement.getElementsByTagName( "FIRSTNAME" );
Now, this list does not contain other elements, because FIRSTNAME does not contain any other Elements. The FIRSTNAME element does, however, contain a TEXT_NODE that holds the first name of this person. So the NodeList contains at least one Node which has the name of the person in it. Along with this text, the NodeList also contains other Nodes whose text is useless to us, because it consists of whitespace, carriage returns and line feeds (CRLF). This is NOT intuitive: we expect only the name of the person to be in the NodeList, but instead there are a bunch of nodes containing whitespace, CRLFs and the String that we really want. So how do we extract the first name from this mess? We have to iterate the NodeList and ask each Node in it for its value using the getNodeValue() method. Then we have to trim() the String value and make sure that it is not "" or "\r". When we have found a value that is not whitespace or CRLF, we can assume that it is the first name of the person. Here is the code to do this parsing:

  String firstName = null;
  for (int i = 0 ; i < list.getLength() ; i ++ ){
    String value = list.item( i ).getNodeValue().trim();
    if( value.equals("") || value.equals("\r") ){
      continue; //keep iterating
    }
    else{
      firstName = value;
      break; //found the firstName!
    }
  }
Now, this procedure must be repeated for the LASTNAME, COMPANY and EMAIL elements: firstPersonElement must be asked to getElementsByTagName("LASTNAME"), then "COMPANY" and "EMAIL". Then each of the NodeLists returned must be iterated to get a non-whitespace, non-CRLF String value. You might consider putting this parsing of the NodeList to get a text value in a utility method (in an XML utility class that you can write).
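Such a helper could be sketched as follows; the class name XMLUtil and the method name getTextValue are made up for illustration and are not part of the DOM API:

```java
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class XMLUtil {

  // Returns the first non-whitespace text found in child elements
  // named tagName below the given element, or null if there is none.
  public static String getTextValue(Element element, String tagName) {
    NodeList elements = element.getElementsByTagName(tagName);
    for (int i = 0; i < elements.getLength(); i++) {
      // The text of an element lives in its child TEXT_NODEs.
      NodeList children = elements.item(i).getChildNodes();
      for (int j = 0; j < children.getLength(); j++) {
        String value = children.item(j).getNodeValue();
        if (value != null && !value.trim().equals("")) {
          return value.trim();
        }
      }
    }
    return null; // no text value found
  }
}
```

With such a method, extracting the FIRSTNAME, LASTNAME, COMPANY and EMAIL values becomes a one-line call each.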

In the end, not very much code is necessary to operate on an XML document.

2.1.2  The Simple API for XML (SAX)

SAX 1.0: a free API for event-based XML parsing [Simple API for XML: http://www.megginson.com/SAX]. SAX is a standard interface for event-based XML parsing, developed collaboratively by the members of the XML-DEV mailing list. SAX 1.0 was released on Monday 11 May 1998, and is free for both commercial and non-commercial use.

SAX implementations are currently available in Java and Python, with more to come. SAX 1.0 support in both parsers and applications is growing fast.

Like DOM, SAX allows access to XML documents, but it does not need to build up a tree model of the whole document because it is event-triggered. The document is processed sequentially, and occurrences of the searched patterns produce events which can be used for further operation. Thus SAX can handle very large documents; the amount of available memory has no impact on its performance.



Example [Example using SAX: http://www.megginson.com/SAX/quickstart.html]: To create a very simple Java-based SAX application, you first need to install at least two Java libraries, making certain that you add all of them to your  CLASSPATH:

  1. the SAX interfaces and classes;
  2. at least one XML parser that supports SAX.

Before you begin, make a note of the full classname of the SAX driver for the parser (for Aelfred, it's   com.microstar.xml.SAXDriver). Next, you will usually want to create at least one event handler to receive information about the document. The most important type of handler is the   DocumentHandler, which receives events for the start and end of elements, character data, processing instructions, and other basic XML structure.

Rather than implementing the entire interface, you can create a class that extends  HandlerBase, and then fill in the methods that you need. The following example (  MyHandler.java) prints a message each time an element starts or ends:

  import org.xml.sax.HandlerBase;
  import org.xml.sax.AttributeList;

  public class MyHandler extends HandlerBase {

    public void startElement (String name, AttributeList atts) {
      System.out.println("Start element: " + name);
    }
    public void endElement (String name) {
      System.out.println("End element: " + name);
    }
  }



Now, you can create a simple application ( SAXApp.java) to invoke SAX and parse a document using your handler:

  import org.xml.sax.Parser;
  import org.xml.sax.DocumentHandler;
  import org.xml.sax.helpers.ParserFactory;

  public class SAXApp {

    static final String parserClass = "com.microstar.xml.SAXDriver";

    public static void main (String args[]) throws Exception {
      Parser parser = ParserFactory.makeParser(parserClass);
      DocumentHandler handler = new MyHandler();
      parser.setDocumentHandler(handler);
      for (int i = 0; i < args.length; i++) {
        parser.parse(args[i]);
      }
    }
  }
This example creates a Parser object by supplying a class name to the  ParserFactory, instantiates your  MyHandler class, registers the handler with the parser, then parses all URLs supplied on the command line (note that the URLs must be absolute).



For example, consider the following very simple XML document ( roses.xml):

  <?xml version="1.0"?>

  <poem>
    <line>Roses are red,</line>
    <line>Violets are blue.</line>
    <line>Sugar is sweet,</line>
    <line>and I love you.</line>
  </poem>
To parse this with your  SAXApp application, you would supply the absolute URL of the document on the command line:

java SAXApp file://localhost/tmp/roses.xml
The output should be as follows:

  Start element: poem
  Start element: line
  End element: line
  Start element: line
  End element: line
  Start element: line
  End element: line
  Start element: line
  End element: line
  End element: poem
This is parsing XML! Now something more interesting can be done with these event handlers, e.g. manipulating the content of the elements. With the SAX API there is a little more code to write, but on big documents the performance is as fast as or faster than DOM, because the required resources are modest.
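For example, a handler can also receive the text inside elements by overriding the characters() callback of HandlerBase. The following sketch (the class name LineHandler and the lastLine field are made up for illustration) collects and prints the text of each line element of the poem document:

```java
import org.xml.sax.HandlerBase;
import org.xml.sax.AttributeList;

public class LineHandler extends HandlerBase {

  private StringBuffer text = new StringBuffer();
  public String lastLine = null; // text of the last complete line element

  public void startElement (String name, AttributeList atts) {
    text.setLength(0); // forget character data seen before this element
  }

  // SAX may deliver the text of one element in several chunks.
  public void characters (char ch[], int start, int length) {
    text.append(ch, start, length);
  }

  public void endElement (String name) {
    if (name.equals("line")) {
      lastLine = text.toString();
      System.out.println("Line text: " + lastLine);
    }
  }
}
```

Registered with the parser in place of MyHandler, this would print the four lines of roses.xml instead of the start/end messages.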

2.2  Parsing Tools

There are many free tools available, and most of the global firms are investing in XML technology. Interesting is the trend to establish free software, in contrast to the exclusive (and expensive) way SGML went. The most complete collection of different tools can be found at the IBM developer zone [http://www.ibm.com/developer/xml/], [IBM Alpha-Factory: http://www.alphaWorks.ibm.com/tech].

2.2.1  Aelfred Version: 1.2a

Aelfred is a free parser from the Microstar company

 http://www.microstar.com/aelfred.html. Aelfred is designed for Java programmers who want to add XML support to their applets and applications without doubling their size: Aelfred consists of only two core class files, with a total size of about 26K, and requires very little memory to run. There is also a complete SAX (Simple API for XML) driver available in this distribution for interoperability.

2.2.2  IBM XML Parser for Java 2.0

This is a free parser from IBM for non-commercial use (earlier versions were completely free) and the most widely used parser in the community

 http://www.alphaWorks.ibm.com/aw.nsf/xmltechnology/xml+parser+for+java. A validating XML parser written in pure Java that contains classes and methods for parsing, generating, manipulating, and validating XML documents. Version 2.0 improves native DOM performance and enhances the functionality of XML Parser for Java version 1.

2.2.3  Lark and Larval parser

Tim Bray's XML parsers. Lark is a non-validating parser; Larval is a validating parser, intended mainly for learning about using Java and XML in non-profit and teaching environments.

 http://www.textuality.com/Lark/.

2.2.4  Microsoft-DataChannel XML Java parser

A server-side parser co-created by DataChannel and Microsoft.

 http://www.datachannel.com/xml/developers/parser.shtml.

This release brings the promise of XSL and XSL pattern-matching capabilities to a Java-based XML parser for the first time. This parser release includes significant enhancements over the Beta 1 version, including a validating XML engine, XSL support, and transformations of data.

We have not tested how the standards have been implemented yet.

2.2.5  P3P Parser

A P3P protocol parser and constructor written in pure Java containing classes and methods which is distributed by IBM.

 http://www.alphaWorks.ibm.com/aw.nsf/xmltechnology/p3p+parser.

Platform for Privacy Preferences (P3P) is a protocol that enables the private exchange of personal information on the web. "The goal of P3P is to enable Web sites to express their privacy practices and enable users to exercise preferences over those practices. P3P products will allow users to be informed of site practices (in both machine and human readable formats), to delegate decisions to their computer when appropriate, and allow users to tailor their relationship to specific sites" (W3C).

2.2.6  Silfide XML Parser (XSP) - validating

This parser is part of XSilfide, a client/server-based environment. XSilfide includes SIL, the Silfide Interface Language.

 http://www.loria.fr/projets/XSilfide/EN//.

SILFIDE is a project of CNRS and AUPELF-UREF. The server is hosted at LORIA. The SILFIDE server, as an interactive server, aims to offer the whole French-speaking university community working with language (linguists, teachers, data-processing specialists, ...) a user-friendly and well-organized tool for the handling of electronic resources. The basic language is French.

2.2.7  SiRPAC: RDF/XML compiler

This program compiles RDF/XML documents into the 3-tuples of the corresponding RDF data model. This tool is a reference implementation by the W3C.

 http://www.w3.org/RDF/Implementations/SiRPAC/.

The documents can reside on the local file system or at a URI on the Web. Also, the parser can be configured to automatically fetch corresponding RDF schemas from the declared namespaces. This version is suitable for embedded use as well as command-line use. SiRPAC builds on top of the Simple API for XML (SAX).

2.2.8  xmlproc validating parser

A validating XML parser written in Python.

 http://www.stud.ifi.uio.no/~larsga/download/python/xml/xmlproc.html.

xmlproc is an XML parser written in Python. It is a nearly complete validating parser, with only minor deviations from the specification. It supports both SGML Open Catalogs and XCatalog 0.1, as well as error messages in different languages. xmlproc also supports namespaces. Access to DTD information is provided, as is a separate DTD parser. SAX drivers are provided with the parser.

2.2.9  XP

James Clark's non-validating parser written in Java. This is a famous and early parser, which was widely used in many references.

 http://www.jclark.com/xml/xp/index.html.

XP is an XML 1.0 parser written in Java. It is fully conforming: it detects all non well-formed documents. It is currently not a validating XML processor. However it can parse all external entities: external DTD subsets, external parameter entities and external general entities.

2.3  Related Terms

This section gives an overview of the related terms found during the literature research.

2.3.1  Schema for XML

There is much work in the area of defining schemas with XML; some working drafts have been published by the W3C as Technical Reports, which document the current status of the work.

Part 1, Structures, is part one of a two-part draft of the specification for the XML Schema definition language. The document proposes facilities for describing the structure and constraining the contents of XML 1.0 documents. The schema language, which is itself represented in XML 1.0, provides a superset of the capabilities found in XML 1.0 document type definitions (DTDs).

Part 2, Datatypes, specifies a language for defining datatypes to be used in XML Schemas and, possibly, elsewhere.

2.3.2  Extensible Stylesheet Language (XSL)

Extensible Stylesheet Language (XSL) is a language for expressing stylesheets. (W3C, work in progress) It consists of two parts:

  1. a language for transforming XML documents, and
  2. an XML vocabulary for specifying formatting semantics

An XSL stylesheet specifies the presentation of a class of XML documents by describing how an instance of the class is transformed into an XML document that uses the formatting vocabulary.
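As a minimal sketch of the transformation part, the following stylesheet would turn the poem document from section 2.1.2 into HTML; note that XSL is work in progress, so the namespace URI below (taken from the 1999 draft) is an assumption that may differ between draft versions:

```xml
<?xml version="1.0"?>
<!-- Sketch only: the xsl namespace URI changed between drafts. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Wrap the whole poem in a minimal HTML page. -->
  <xsl:template match="poem">
    <html><body><xsl:apply-templates/></body></html>
  </xsl:template>
  <!-- Render each line element as a paragraph. -->
  <xsl:template match="line">
    <p><xsl:apply-templates/></p>
  </xsl:template>
</xsl:stylesheet>
```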

Will XSL replace CSS?

No. They are likely to co-exist since they meet different needs. XSL is intended for complex formatting where the content of the document might be displayed in multiple places; for example the text of a heading might also appear in a dynamically generated table of contents. CSS is intended for dynamic formatting of online documents for multiple media; its strictly declarative nature limits its capabilities but also makes it efficient and easy to generate and modify in the content-generation workflow.

2.3.3  Resource Description Framework (RDF)

The Resource Description Framework (RDF) [Resource Definition Framework: http://www.w3.org/RDF] will soon become a W3C recommendation and is a specification currently under development within the W3C Metadata activity. RDF is designed to provide an infrastructure to support metadata across many web-based activities. RDF is the result of a number of metadata communities bringing together their needs to provide a robust and flexible architecture for supporting metadata on the Internet and WWW. Example applications include sitemaps, content ratings, stream channel definitions, search engine data collection (web crawling), digital library collections, and distributed authoring.

RDF allows different application communities to define the metadata property set that best serves the needs of each community. RDF provides a uniform and interoperable means to exchange the metadata between programs and across the Web. Furthermore, RDF provides a means for publishing both a human-readable and a machine-understandable definition of the property set itself.
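Since RDF metadata is itself written as XML, a minimal sketch of such a description might look like the following; the resource URI, the property name Creator and the Dublin Core namespace used here are illustrative assumptions, not prescribed by RDF:

```xml
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.0/">
  <!-- One metadata statement: the named resource has a creator. -->
  <rdf:Description about="http://www.example.org/index.html">
    <dc:Creator>Jane Doe</dc:Creator>
  </rdf:Description>
</rdf:RDF>
```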

2.3.4  Document Style Semantics and Specification Language (DSSSL)

DSSSL is a more general and older standard describing style-definition languages like XSL. XSL can be converted to extended DSSSL, which can be understood by Jade, James Clark's DSSSL engine, which in turn can create formatted output using any of its back-end formatters (RTF, TeX, SGML, and HTML with CSS). Both the DSSSL and the 'HTML/CSS' flow objects from the original XSL submission are supported; xslj is available with source code for the experimentally inclined.

How is XSL different from DSSSL? From DSSSL-O?

DSSSL is an International Standard style sheet language. It is particularly used for formatting of print documents. DSSSL-O is a profile of DSSSL which removes some functionality and adds capabilities to make it more suited for online documentation. XSL draws on DSSSL and the DSSSL-O work and continues the trend towards a Web-oriented style sheet language by integrating experience with CSS.

Will XSL replace DSSSL?

DSSSL has capabilities that XSL does not, and continues in use in the print publishing industry. Experience with XSL might be used in a future revision of DSSSL, but it is too early to say.

3  Example: Processing a Document

What we can do with XML is limited by the abilities of the XML tools used. XML itself is a definition for structuring information, and it carries some meta-information about this structure, but it does not carry any processing information, no semantics for a document processor. In the end someone has to implement code to process the document.

At a very low level CSS or XSL can be used for document processing; on the next level something like JavaScript may also allow access to computer semantics (i.e. executable code). These possibilities are discussed elsewhere. In this example we focus on connecting an XML document, specified by a language definition, to executable code.

The use of XML is demonstrated by generating Java classes from a definition embedded in XML. This is technically part of the Asgaard project [http://www.ifs.tuwien.ac.at/asgaard], [Miksch, S. and Shahar, Y. and Johnson, P. Medizinische Leitlinien und Protokolle: das Asgaard/Asbru Projekt KI-Journal, Themenheft MEDIZIN Mai 1997], which stresses the support of time-oriented planning processes, e.g. in the medical domain. The target language is called Asbru, which allows plans and guidelines to be defined.

Therefore we want to write plans in the XML format and generate some Java classes and objects out of this document to do more complex operations to support the therapy process.

Figure 3: processing XML

The definition could be given by the Schema specification [Schemata in XML: http://www.w3.org/XML/Activity.html] from the W3C. To keep the example easy, this specification is replaced by a simpler one which only contains the necessary parts.



Figure 3 outlines a 4-step process, which may be implemented as a client-server application or as a stand-alone tool, mapping the language elements to class instances.

  1. Defining a meta-model of the language (in the figure named Asbru) using some schema-definition DTD;
  2. Extending an existing parser (in this case from IBM) using the DOM API to create a language-specific DTD corresponding to some existing or new defined Java-classes automatically;
  3. Writing some documents (in this case called Plans) using the produced language-definition;
  4. Extending another parser using the SAX API to connect the set of classes to the elements (tags) of the document, producing running bytecode automatically.

3.1  Building a Meta-Model

The first step is to build a meta-model which includes machine-readable semantics, e.g. in Java code. As mentioned above, it could also be some script language or another target language like CLIPS interpreting knowledge rules. This Java-based example was chosen to stay close to familiar concepts.

    <?xml version="1.0"?>

    <!DOCTYPE spec [
    <!ELEMENT spec (class | reflect)*>
    <!ATTLIST spec name     CDATA  #REQUIRED>

    <!ELEMENT class (attr*, method*)>
    <!ATTLIST class     name CDATA  #REQUIRED>

    <!ELEMENT method (#PCDATA)>
    <!ATTLIST method    modifyer CDATA  #REQUIRED
                        type     CDATA  #REQUIRED
                        name     CDATA  #REQUIRED>

    <!ELEMENT attr EMPTY>
    <!ATTLIST attr      modifyer CDATA  #REQUIRED
                        type     CDATA  #REQUIRED
                        name     CDATA  #REQUIRED>

    <!ELEMENT reflect EMPTY>
    <!ATTLIST reflect   name     CDATA #REQUIRED
                        synonym  CDATA  #REQUIRED>
    ]>
The first part is the definition of the Schema used, the meta-meta description of the original document. This document defines two basic kinds of elements: class, which defines a new Java class, and reflect, which references an already existing one.

For simple definitions, a  class may also define a set of  attributes as well as  methods, each consisting of a modifier, a type and a name. (Those element attributes could also be defined as tags, but in this example it is done this way.)

<spec name="test.xml">
  <class name="Person">
      <attr modifyer="private" type="String" name="vorname"/>
      <attr modifyer="private" type="String" name="nachname"/>
      <attr modifyer="private" type="float" name="gewicht"/>
      <method modifyer="public" type="void" name="Vorname">
         vorname = content;
      </method>
      <method modifyer="public" type="void" name="Nachname">
         nachname = content;
      </method>
      <method modifyer="public" type="void" name="Gewicht">
         gewicht = Float.valueOf(content).floatValue();
      </method>
      <method modifyer="public" type="String" name="toString">
         return vorname + " " + nachname;
      </method>
  </class>
  <reflect name="java.util.Date" synonym="NOW"/>
</spec>
The second part defines a Java class called  Person with the three attributes " vorname", " nachname" and " gewicht" and corresponding methods. A second class, " java.util.Date", which is already implemented, is referenced under the synonym "NOW". (Of course this is a very poor language, but I tried to keep the example as simple as possible for demonstration!)
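For illustration, the class that the generator of the next section would emit for this definition looks roughly like the following sketch (the package declaration is omitted here so the class stands alone; note that every generated method, including toString, takes the element content as a String parameter):

```java
// Hand-written sketch of the generator's expected output for "Person".
public class Person {

  private String vorname;
  private String nachname;
  private float gewicht;

  // Each generated method receives the element content as a String.
  public void Vorname(String content) {
    vorname = content;
  }
  public void Nachname(String content) {
    nachname = content;
  }
  public void Gewicht(String content) {
    gewicht = Float.valueOf(content).floatValue();
  }
  public String toString(String content) {
    return vorname + " " + nachname;
  }
}
```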

3.2  Generating DTD and Java-Classes

The second step is completely automatic: a parser is extended using the DOM API to process the meta-model and produce Java code from the model as well as a corresponding DTD:



  import com.ibm.xml.parser.*;
  import org.w3c.dom.*;
  import java.io.*;

  public class ClassParser {

    private FileWriter classWriter = null;
    private FileWriter dtdWriter = null;

    public static void main (String args[]) throws Exception {
      Parser ps = new Parser("Java Class & DTD Creator");
      ClassParser app = new ClassParser();
      app.doit(ps.readStream(new FileInputStream("ClassSpec.xml")));
    }



The first, parser-specific part instantiates the parser itself and reads the model from the file  ClassSpec.xml. The next step is to extract all the names of the defined classes, to list them in the root element of the new language-specific DTD (which has the same name as the  DOCTYPE), and to generate the head of the DTD file.



  private void doit(Document doc) throws Exception {

    // Get the name of the root-element
    NodeList root = doc.getElementsByTagName("spec");
    Element o = (Element) root.item( 0 );
    String packageName = o.getAttribute("name");

    // Get the list with all class-definitions
    NodeList classList = doc.getElementsByTagName("class");
    String classNames = "#PCDATA";
    for (int i=0;i<classList.getLength();i++) {
      Element c = (Element) classList.item( i );
      classNames += " | " + c.getAttribute("name");
    }
    NodeList reflectList = doc.getElementsByTagName("reflect");
    for (int h=0;h<reflectList.getLength();h++) {
      Element r = (Element) reflectList.item( h );
      classNames += " | " + r.getAttribute("name");
    }

    // Writing the Header of the DTD file
    dtdWriter = new FileWriter(packageName+".dtd");
    dtdWriter.write("<?xml version=\"1.0\"?>\n");
    dtdWriter.write("<!DOCTYPE "+packageName+" [\n");
    dtdWriter.write("<!ELEMENT "+packageName+" ("+classNames+")*>\n");
Next, each class definition is looked at separately, generating one Java file per class and adding a header and the defined attributes of the class.

    // Create all java-class-files
    for (int i=0;i<classList.getLength();i++)
      if (classList.item( i ).getNodeType() == Node.ELEMENT_NODE) {
        Element c = (Element) classList.item( i );
        String className = c.getAttribute("name");

        // produce a new java-class-file
        classWriter = new FileWriter(className+".java");
        classWriter.write("package "+packageName+";\n\n");
        classWriter.write("public class "+className+" {\n\n");

        // produce the definition-java-code
        NodeList attrList = c.getElementsByTagName("attr");
        for (int j=0;j<attrList.getLength();j++)
          if (attrList.item( j ).getNodeType() == Node.ELEMENT_NODE) {
            Element a = (Element) attrList.item( j );
            classWriter.write("  " + a.getAttribute("modifyer") + " " +
                a.getAttribute("type") + " " +
                a.getAttribute("name") + "; \n");
          }
        classWriter.write("\n");



Almost in parallel, the DTD definition of this class is written to the DTD file, including the list of methods offered by the class.

        // adds element-definitions for the class to the dtd-file
        NodeList methodList = c.getElementsByTagName("method");
        String methodNames = "#PCDATA";
        for (int k=0;k<methodList.getLength();k++)
          if (methodList.item( k ).getNodeType() == Node.ELEMENT_NODE) {
            Element m = (Element) methodList.item( k );
            if (m.getAttribute("modifyer").equalsIgnoreCase("PUBLIC"))
              methodNames += " | " + m.getAttribute("name");
          }
        dtdWriter.write("<!ELEMENT "+className+" ("+methodNames+")*>\n");
After the class definition, each method is processed, generating the method header and the wrapped Java code that was placed in the content of the  method-tag.

        // produces the method-java-code
        for (int k=0;k<methodList.getLength();k++)
          if (methodList.item( k ).getNodeType() == Node.ELEMENT_NODE) {
            Element m = (Element) methodList.item( k );
            classWriter.write("  " + m.getAttribute("modifyer") + " " +
                m.getAttribute("type") + " " +
                m.getAttribute("name") + "(String content) { \n");

            // produces the method-body code out of the tag content
            NodeList codeList = m.getChildNodes();
            for (int l=0;l<codeList.getLength();l++)
              if (codeList.item( l ).getNodeType() == Node.TEXT_NODE) {
                classWriter.write( codeList.item( l ).getNodeValue());
              }
            classWriter.write("\n  }");
The last step for the newly defined classes is the creation of the corresponding method tags in the DTD.

            // adds element-definitions for the method to the dtd-file
            if (m.getAttribute("modifyer").equalsIgnoreCase("PUBLIC"))
              dtdWriter.write("<!ELEMENT "+m.getAttribute("name")+" (#CDATA)>\n");
          }
        classWriter.write("\n}");
        classWriter.close();
      }
The processing ends with the simple mapping of the existing (reflected) classes to the DTD file.

    // adds element-definitions for reflected classes to the dtd-file
    for (int h=0;h<reflectList.getLength();h++) {
      Element r = (Element) reflectList.item( h );
      dtdWriter.write("<!ELEMENT "+r.getAttribute("synonym")+" EMPTY>\n");
    }
    // Close
    dtdWriter.write("]>");
    dtdWriter.close();
  }
}
Running this processor produces a consistent pair of DTD and Java classes.

3.3  Processing a Document

Using XML namespaces, different DTDs can be mixed, so different (partial) language definitions could be merged. In our example we stay with one definition, using the output of the previous step:

    <?xml version="1.0"?>
    <!DOCTYPE xml.test [
    <!ELEMENT xml.test (#PCDATA | Person | java.util.Date)*>
    <!ELEMENT Person (#PCDATA | Vorname | Nachname | Gewicht)*>
    <!ELEMENT Vorname (#CDATA)>
    <!ELEMENT Nachname (#CDATA)>
    <!ELEMENT Gewicht (#CDATA)>
    <!ELEMENT NOW EMPTY> ]>
This is the extracted language. It is easy to see that these definitions are very lightweight and easy to use. In the next processing step, the content can be connected to a form automatically, and free text can be analyzed as well.

<xml.test>
  Today morning at <NOW/> a new patient
  <Person><Nachname>Mayer</Nachname> <Vorname>Hans</Vorname> arrived.
  His Body weight was <Gewicht>82.3</Gewicht> kg.</Person>
</xml.test>
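The namespace mixing mentioned above is not used in this example, but a hypothetical sketch (prefixes and URIs are invented for illustration) could look roughly like this:

```xml
<!-- hypothetical sketch: mixing two (partial) language definitions
     via XML namespaces; the prefixes and URIs are made up -->
<xml.test xmlns:pat="http://example.org/patient"
          xmlns:plan="http://example.org/plan">
  <pat:Person><pat:Nachname>Mayer</pat:Nachname></pat:Person>
  <plan:NOW/>
</xml.test>
```

Each prefix binds its elements to one of the merged language definitions, so identically named elements from different DTDs would not collide.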
At first sight this may not seem to gain much, because the same could be done without the overhead of extending the parser. But imagine that java.util.Date is replaced by a given class that can handle more complex time annotations, as in the Asgaard project, connecting the arrival of a patient to documented events. (We would need some more definitions, but to keep the example simple they are skipped.)

public class Person {
  private String vorname;
  private String nachname;
  private float gewicht;

  public void Vorname(String content) {
    vorname = content;
  }
  public void Nachname(String content) {
    nachname = content;
  }
  public void Gewicht(String content) {
    gewicht = Float.valueOf(content).floatValue();
  }
  public String toString() {
    return vorname + nachname;
  }
}
For completeness: this is the class that was generated automatically.

3.4  Generating Java Code

The processing of the plan document is event-driven, using IBM's SAX driver:

  import org.xml.sax.*;
  import org.xml.sax.helpers.*;
  import java.io.*;
  import java.util.Date;

public class PlanParser extends HandlerBase {

  private String tempStr = "";
  private Date d;
  private Person p = new Person();

  public static void main(String args[]) throws Exception {
    Parser p = ParserFactory.makeParser("com.ibm.xml.parsers.SAXParser");
    PlanParser demo = new PlanParser();
    p.setDocumentHandler(demo);
    FileInputStream is = new FileInputStream("Plan.xml");
    InputSource source = new InputSource(is);
    source.setSystemId("Doing it with SAX");
    p.parse(source);
  }
First the parser is instantiated and connected to the input file  Plan.xml, which contains the language-specific DTD as well as a "plan" statement.

    // characters() may be called more than once per text node,
    // so the content is accumulated and reset at each start-tag
    public void startElement(String name, AttributeList atts) {
        tempStr = "";
    }

    public void characters(char ch[], int start, int length) {
        tempStr += new String(ch, start, length);
    }

    public void endElement(String name) {
      if (name.equalsIgnoreCase("NOW")) {
        d = new Date();
      } else if (name.equalsIgnoreCase("NACHNAME")) {
        p.Nachname(tempStr);
      } else if (name.equalsIgnoreCase("VORNAME")) {
        p.Vorname(tempStr);
      } else if (name.equalsIgnoreCase("GEWICHT")) {
        p.Gewicht(tempStr);
      }
    }
While processing the document, the occurrence of tags is supervised. At this point, exception handling could easily extend the standard capabilities of the XML parser by checking the validity of the document's content, which a normal XML parser does not do.
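Such a content check is not part of the example above, but a minimal sketch (the class name and error message are invented) of what could be hooked into  endElement might look like this:

```java
import org.xml.sax.SAXException;

// Hypothetical helper, not part of the generated code: checks that
// the character content collected for a <Gewicht> element really is
// a number - a plain XML parser only validates the markup, not this.
public class ContentCheck {

    public static float checkGewicht(String content) throws SAXException {
        try {
            return Float.parseFloat(content.trim());
        } catch (NumberFormatException e) {
            // turn the content error into a SAX-style exception
            throw new SAXException("Gewicht is not a number: " + content);
        }
    }

    public static void main(String[] args) throws SAXException {
        System.out.println(checkGewicht("82.3"));
    }
}
```

The handler would simply call such a check before passing the value on, turning a content error into a regular SAX parse error.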

    public void endDocument() {
      System.out.println(p.toString() + " processed at " + d.toString());
    }
Using another event, some action may be triggered.

  AppAccelerator(tm) 1.1.036 for Java (JDK 1.1), x86 version.
  Copyright (c) 1998 Inprise Corp. All Rights Reserved.
  Martin Mayer processed at Mon Jun 28 16:23:19 CEST 1999
The output of this short program looks like the above. Imagine that even the event handler itself could easily be generated from the domain model (step 1) - the document would then take care of its own processing completely (see section 2.1.2).

4  Conclusion

We have discussed the two major APIs, DOM and SAX, for manipulating XML documents in different examples. Using the parse tree is easy and takes few lines of code - but it is often necessary when low-level methods such as CSS or some scripting languages lack the requested functionality.

XML itself is a structured document carrying meta-information about its content. This may be used for processing free text, as given today in HTML, with the support of automated processing by a computer, up to integration into a database application in which the user sees nothing of the XML code but only some abstract representation.

In the example in section 3, which is the main focus of this paper, a generic domain model defines the whole process of automated document processing in four steps: building a meta-model, generating the DTD and Java classes, processing a document, and generating Java code.

This example uses only freely available tools - there is no need for any proprietary or commercial tool. All this can be done within the existing standards and code.

