This work introduces the use of XML documents with a parser and a Java API. Although there are implementations of the APIs for Python and C as well, most of the available references concentrate on (free) Java implementations. To understand this paper, knowledge of some basic Java syntax is necessary.
In section a short overview of the different ways to access an XML document using the DOM or SAX API is given, followed by a collection of available parsers which support these APIs. At the end, some related concepts are explained.
In section a simple example is given which demonstrates the power of combining Java and XML by generating Java code from a definition given in XML (by parsing it and using the API).
Extensible Markup Language, abbreviated XML [XML Specification: http://www.w3.org/XML], describes a class of data objects called XML documents and partially describes the behavior of computer programs which process them. XML is an application profile or restricted form of SGML, the Standard Generalized Markup Language [ISO 8879]. By construction, XML documents are conforming SGML documents.
XML documents are made up of storage units called entities, which contain either parsed or unparsed data. Parsed data is made up of characters, some of which form character data, and some of which form markup. Markup encodes a description of the document's storage layout and logical structure. XML provides a mechanism to impose constraints on the storage layout and logical structure.
A software module called an XML processor is used to read XML documents and provide access to their content and structure. It is assumed that an XML processor is doing its work on behalf of another module, called the application. This specification describes the required behavior of an XML processor in terms of how it must read XML data and the information it must provide to the application.
Figure shows the parsing of an XML document. Two different APIs are defined: DOM (Document Object Model), which generates a hierarchical parse tree, and SAX (Simple API for XML), which processes a document event-based without generating a parse tree. DOM is very useful for navigating a document, SAX for processing a very large document with little memory.
Both APIs are available in freely distributed pure-Java XML parsers such as XML for Java from IBM. A good introduction to the use of Java and XML is given in [Chang, Dan and Harkey, Dan: Client-Server Data Access with Java and XML, Wiley, New York, 1998].
The DOM Level 1 Specification [Document Object Model: http://www.w3.org/DOM] is a W3C Recommendation. The Document Object Model is a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents. The document can be further processed and the results of that processing can be incorporated back into the presented page.
DOM builds up a tree of a document or document fragment which holds all elements (objects) of the document in nodes or leaves. All objects can be accessed directly via this logical structure without the need to search the document sequentially. On the other hand, the whole document is held in memory, so its size is limited by the amount of memory available.
Example [Example using DOM: http://developerlife.com/xmljavatutorial1/default.htm]: You first need a well-formed XML document and a validating XML parser in order to read information into your programs. Sun and IBM, for example, both make validating XML parsers in Java [IBM XML developer zone: http://www.ibm.com/developer/xml].
Java interfaces for DOM have been defined by the W3C and these are available in the org.w3c.dom package. The code that is required to instantiate a DOM object is different depending on which parser you use. Code for instantiating DOM objects using IBM's parser looks like:
import com.ibm.xml.parser.*;
import org.w3c.dom.*;
import java.io.InputStream;
import java.net.URL;

public class DOMApp {
    public static void main (String args[]) throws Exception {
        URL u = new URL("http://beanfactory.com/xml/AddressBook.xml");
        InputStream i = u.openStream();
        Parser ps = new Parser("just put any string here");
        Document doc = ps.readStream(i);
    }
}

Now that the DOM (org.w3c.dom.Document) object has been created using either parser, it is time to extract information from the document. Let's talk about the AddressBook.xml file. Here is the DTD for this XML file:
<?xml version="1.0"?>
<!DOCTYPE ADDRESSBOOK [
<!ELEMENT ADDRESSBOOK (PERSON)*>
<!ELEMENT PERSON (LASTNAME, FIRSTNAME, COMPANY, EMAIL)>
<!ELEMENT LASTNAME (#PCDATA)>
<!ELEMENT FIRSTNAME (#PCDATA)>
<!ELEMENT COMPANY (#PCDATA)>
<!ELEMENT EMAIL (#PCDATA)>
]>
DOM creates a tree-based (or hierarchical) object model from the XML document. The Document (created from an XML file) contains a tree of Nodes. Methods in the Node interface allow you to find out whether a Node has children or not, and also what the type of the Node is and what its value is (if any). There are many types of Nodes, but we are interested in the following types: TEXT_NODE (=3) and ELEMENT_NODE (=1). These types are static int values which are defined in the org.w3c.dom.Node interface defined by the W3C. So a Document object is a simple container of Nodes. But, in our DTD, we have Elements, not Nodes. It just so happens that there is an interface called Element (which extends Node). It also turns out that a Node which is of type ELEMENT_NODE is also an Element. Nodes of type ELEMENT_NODE (or Elements) can also have children. How do you access these children? Through the NodeList interface of course; the NodeList interface defines two methods to allow the iteration of a list of Nodes. These NodeList objects are generated by Node objects of type ELEMENT_NODE (or Element objects). The Document interface has a method called getElementsByTagName(String tagname) which returns a NodeList of all the Elements with that tag name.
So here is how we can extract information from our Document object. We first ask the document object for all the Element objects that have the tag name "PERSON". This should return all the Element objects that are PERSONs; all the Element objects with this tag name are returned in a NodeList object. We can use the getLength() method on this NodeList to determine how many PERSON elements are in the NodeList. Here is some code to do this:
Document doc = ... // create DOM from AddressBook.xml
NodeList listOfPersons = doc.getElementsByTagName("PERSON");
int numberOfPersons = listOfPersons.getLength();

Now that we have the NodeList object containing all the PERSON Elements (which are also Nodes), we can iterate over it to extract information from each PERSON Element (Node). The method item(int index) in NodeList returns a Node object. Remember that when the type of a Node is ELEMENT_NODE, it is actually an Element. So here is the code to get the first person from our NodeList (assuming there is at least one person in the AddressBook.xml file):
if (numberOfPersons > 0) {
    Node firstPersonNode = listOfPersons.item(0);
    if (firstPersonNode.getNodeType() == Node.ELEMENT_NODE) {
        Element firstPersonElement = (Element) firstPersonNode;
    }
}

Now we have a reference to the firstPersonElement, which we can use to find out the FIRSTNAME, LASTNAME, COMPANY and EMAIL information of this PERSON element. Since the firstPersonElement is an Element, we can use getElementsByTagName(String) again to get the FIRSTNAME, LASTNAME, COMPANY and EMAIL elements in it. Here is the code to get the FIRSTNAME of the firstPersonElement:
NodeList list = firstPersonElement.getElementsByTagName("FIRSTNAME");

Now, this list does not contain other elements, because FIRSTNAME does not contain any other Elements. The FIRSTNAME element does however contain a TEXT_NODE that is the first name of this person. So the NodeList list contains at least one Node which has the name of the person in it. Along with the text (which is the first name of the person), this NodeList also contains other Nodes which also contain text, but this text is useless to us, because it consists of whitespace and carriage returns and line feeds (crlf). This is NOT intuitive, because we expect only the name of the person to be in the NodeList; instead there are a bunch of nodes in this NodeList which contain whitespace, crlfs and the String that we really want. So how do we extract the first name from this mess? We have to iterate the NodeList, and ask each Node in it for its value by using the getNodeValue() method. Then we have to trim() the String value and make sure that it is not "" or "\r".
String firstName = null;
for (int i = 0; i < list.getLength(); i++) {
    String value = list.item(i).getNodeValue().trim();
    if (value.equals("") || value.equals("\r")) {
        continue; // keep iterating
    } else {
        firstName = value;
        break; // found the firstName!
    }
}

Now this procedure must be repeated for the LASTNAME, COMPANY and EMAIL elements.
In the end, not much code is necessary to operate on an XML document.
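The fragments above can be condensed into one self-contained program. This is a minimal sketch, assuming the JAXP factory classes (javax.xml.parsers), which later standardized parser instantiation across vendors and are not part of the paper's original IBM-specific setup; the class and method names (DomSketch, firstNameOfFirstPerson) are illustrative only, and the AddressBook document is inlined for demonstration:

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class DomSketch {

    // Parses the given XML and returns the trimmed text of the first
    // FIRSTNAME element of the first PERSON, skipping whitespace-only
    // text nodes as described in the text above.
    public static String firstNameOfFirstPerson(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        NodeList persons = doc.getElementsByTagName("PERSON");
        if (persons.getLength() == 0) return null;

        Element person = (Element) persons.item(0);
        NodeList names = person.getElementsByTagName("FIRSTNAME");
        if (names.getLength() == 0) return null;

        // The name lives in the TEXT_NODE children of the FIRSTNAME element.
        NodeList children = names.item(0).getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node n = children.item(i);
            if (n.getNodeType() == Node.TEXT_NODE) {
                String value = n.getNodeValue().trim();
                if (!value.equals("")) return value;
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<ADDRESSBOOK><PERSON>"
                   + "<LASTNAME>Mayer</LASTNAME><FIRSTNAME>Hans</FIRSTNAME>"
                   + "<COMPANY>Example</COMPANY><EMAIL>hm@example.org</EMAIL>"
                   + "</PERSON></ADDRESSBOOK>";
        System.out.println(firstNameOfFirstPerson(xml)); // prints Hans
    }
}
```

The only parser-specific difference from the IBM code above is the factory call; the extraction logic against org.w3c.dom is identical.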
SAX 1.0: a free API for event-based XML parsing [Simple API for XML: http://www.megginson.com/SAX]. SAX is a standard interface for event-based XML parsing, developed collaboratively by the members of the XML-DEV mailing list. SAX 1.0 was released on Monday 11 May 1998, and is free for both commercial and non-commercial use.
SAX implementations are currently available in Java and Python, with more to come. SAX 1.0 support in both parsers and applications is growing fast.
Like DOM, SAX allows access to XML documents, but it does not need to build up a whole tree model of the document because it works event-triggered. The document is processed sequentially and the occurrence of searched patterns produces events which can be used for further operation. So SAX can handle very large documents; the amount of available memory has no impact on its performance.
Example [Example using SAX: http://www.megginson.com/SAX/quickstart.html]: To create a very simple Java-based SAX application, you first need to install at least two Java libraries, making certain that you add all of them to your CLASSPATH.
Before you begin, make a note of the full classname of the SAX driver for the parser (for Aelfred, it's com.microstar.xml.SAXDriver). Next, you will usually want to create at least one event handler to receive information about the document. The most important type of handler is the DocumentHandler, which receives events for the start and end of elements, character data, processing instructions, and other basic XML structure.
Rather than implementing the entire interface, you can create a class that extends HandlerBase, and then fill in the methods that you need. The following example ( MyHandler.java) prints a message each time an element starts or ends:
import org.xml.sax.HandlerBase;
import org.xml.sax.AttributeList;

public class MyHandler extends HandlerBase {
    public void startElement (String name, AttributeList atts) {
        System.out.println("Start element: " + name);
    }
    public void endElement (String name) {
        System.out.println("End element: " + name);
    }
}
Now, you can create a simple application (SAXApp.java) to invoke SAX and parse a document using your handler:
import org.xml.sax.Parser;
import org.xml.sax.DocumentHandler;
import org.xml.sax.helpers.ParserFactory;

public class SAXApp {
    static final String parserClass = "com.microstar.xml.SAXDriver";

    public static void main (String args[]) throws Exception {
        Parser parser = ParserFactory.makeParser(parserClass);
        DocumentHandler handler = new MyHandler();
        parser.setDocumentHandler(handler);
        for (int i = 0; i < args.length; i++) {
            parser.parse(args[i]);
        }
    }
}

This example creates a Parser object by supplying a class name to the ParserFactory, instantiates your MyHandler class, registers the handler with the parser, then parses all URLs supplied on the command line (note that the URLs must be absolute).
For example, consider the following very simple XML document (roses.xml):
<?xml version="1.0"?>
<poem>
<line>Roses are red,</line>
<line>Violets are blue.</line>
<line>Sugar is sweet,</line>
<line>and I love you.</line>
</poem>

To parse this with your SAXApp application, you would supply the absolute URL of the document on the command line:
java SAXApp file://localhost/tmp/roses.xml

The output should be as follows:
Start element: poem
Start element: line
End element: line
Start element: line
End element: line
Start element: line
End element: line
Start element: line
End element: line
End element: poem

This is parsing XML! Now something more interesting can be done with these event handlers, e.g. manipulating the content of the elements. Using the SAX API requires a little more code to write, but with big documents the performance is as fast as or faster than DOM, because the requested resources are modest.
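As an illustration of manipulating element content rather than just printing events, here is a minimal sketch that collects the text of each line of the poem. It assumes the later SAX2 interfaces (DefaultHandler, Attributes) and the JAXP factory rather than the org.xml.sax.Parser/HandlerBase pair used above, which these interfaces eventually superseded; the class name LineCollector is illustrative only:

```java
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class LineCollector extends DefaultHandler {
    private final List<String> lines = new ArrayList<String>();
    private StringBuilder buffer;

    public void startElement(String uri, String local, String qName,
                             Attributes atts) {
        // Start buffering when a <line> element opens.
        if (qName.equals("line")) buffer = new StringBuilder();
    }

    public void characters(char[] ch, int start, int length) {
        // characters() may be called several times per element,
        // so accumulate rather than overwrite.
        if (buffer != null) buffer.append(ch, start, length);
    }

    public void endElement(String uri, String local, String qName) {
        if (qName.equals("line")) {
            lines.add(buffer.toString());
            buffer = null;
        }
    }

    public List<String> getLines() { return lines; }

    // Parses the given XML string and returns the collected line texts.
    public static List<String> collect(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        LineCollector handler = new LineCollector();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.getLines();
    }

    public static void main(String[] args) throws Exception {
        String poem = "<poem><line>Roses are red,</line>"
                    + "<line>Violets are blue.</line></poem>";
        System.out.println(collect(poem)); // prints the collected lines
    }
}
```

Note that the whole document never resides in memory at once; only the text of the element currently being processed is buffered.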
There are many free tools available, with most of the global firms investing in XML technology. Interesting is the trend to establish free software instead of going the exclusive (and expensive) way of SGML. The most complete collection of different tools can be found at the IBM developer zone [http://www.ibm.com/developer/xml/], [IBM alphaWorks: http://www.alphaWorks.ibm.com/tech].
A free parser from the Microstar company, http://www.microstar.com/aelfred.html. Aelfred is designed for Java programmers who want to add XML support to their applets and applications without doubling their size: Aelfred consists of only two core class files, with a total size of about 26K, and requires very little memory to run. There is also a complete SAX (Simple API for XML) driver available in this distribution for interoperability.
A free parser from IBM for non-commercial use (earlier versions were completely free) and the most widely used parser in the community, http://www.alphaWorks.ibm.com/aw.nsf/xmltechnology/xml+parser+for+java. A validating XML parser written in pure Java that contains classes and methods for parsing, generating, manipulating, and validating XML documents. It has improved native DOM performance and supports
These enhance the functionality of XML Parser for Java version 1:
Tim Bray's XML parsers. Lark is a non-validating parser; Larval is a validating parser. Both are meant for learning about using Java and XML in non-profit and teaching environments.
http://www.textuality.com/Lark/.
A server-side parser co-created by DataChannel and Microsoft.
http://www.datachannel.com/xml/developers/parser.shtml.
This release brings the promise of XSL and XSL pattern matching capabilities to a Java-based XML parser for the first time. This parser release includes significant enhancements from the Beta 1 version of the parser including: a validating XML engine, XSL support, and transformations of data. Feature list:
We have not tested how the standards have been implemented yet.
A P3P protocol parser and constructor written in pure Java, containing classes and methods, distributed by IBM.
http://www.alphaWorks.ibm.com/aw.nsf/xmltechnology/p3p+parser.
Platform for Privacy Preferences (P3P) is a protocol that enables the private exchange of personal information on the web. "The goal of P3P is to enable Web sites to express their privacy practices and enable users to exercise preferences over those practices. P3P products will allow users to be informed of site practices (in both machine and human readable formats), to delegate decisions to their computer when appropriate, and allow users to tailor their relationship to specific sites" (W3C).
This parser is part of XSilfide, a client/server-based environment. XSilfide includes SIL, the Silfide Interface Language.
http://www.loria.fr/projets/XSilfide/EN//.
SILFIDE is a project of CNRS and AUPELF-UREF. The server is hosted at LORIA. As an interactive server, SILFIDE aims to offer the whole French-speaking university community working with language (linguists, teachers, computer scientists, ...) a user-friendly and well-organized tool for handling electronic resources. The basic language is French.
This program compiles RDF/XML documents into the 3-tuples of the corresponding RDF data model. This tool is a reference implementation by the W3C.
http://www.w3.org/RDF/Implementations/SiRPAC/.
The documents can reside on a local file system or at a URI on the Web. Also, the parser can be configured to automatically fetch corresponding RDF schemas from the declared namespaces. This version is suitable for embedded use as well as command line use. SiRPAC builds on top of the Simple API for XML (SAX).
A validating XML parser written in Python.
http://www.stud.ifi.uio.no/~larsga/download/python/xml/xmlproc.html.
xmlproc is an XML parser written in Python. It is a nearly complete validating parser, with only minor deviations from the specification. It supports both SGML Open Catalogs and XCatalog 0.1, as well as error messages in different languages. xmlproc also supports namespaces. Access to DTD information is provided, as is a separate DTD parser. SAX drivers are provided with the parser.
James Clark's non-validating parser written in Java. This is a famous and early parser, which was widely used in many references.
http://www.jclark.com/xml/xp/index.html.
XP is an XML 1.0 parser written in Java. It is fully conforming: it detects all non well-formed documents. It is currently not a validating XML processor. However it can parse all external entities: external DTD subsets, external parameter entities and external general entities.
This section gives an overview of related terms found during the research.
There is much work in the area of defining schemas with XML; some working drafts have been published by the W3C as technical reports. The current status of the work is published in
Part 1, Structures, is part one of a two-part draft of the specification for the XML Schema definition language. The document proposes facilities for describing the structure and constraining the contents of XML 1.0 documents. The schema language, which is itself represented in XML 1.0, provides a superset of the capabilities found in XML 1.0 document type definitions (DTDs).
Part 2 specifies a language for defining datatypes to be used in XML Schemas and, possibly, elsewhere.
Extensible Stylesheet Language (XSL) is a language for expressing stylesheets. (W3C, work in progress) It consists of two parts:
An XSL stylesheet specifies the presentation of a class of XML documents by describing how an instance of the class is transformed into an XML document that uses the formatting vocabulary.
Will XSL replace CSS?
No. They are likely to co-exist since they meet different needs. XSL is intended for complex formatting where the content of the document might be displayed in multiple places; for example the text of a heading might also appear in a dynamically generated table of contents. CSS is intended for dynamic formatting of online documents for multiple media; its strictly declarative nature limits its capabilities but also makes it efficient and easy to generate and modify in the content-generation workflow.
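The stylesheet-driven transformation described above can also be invoked programmatically. The following is a minimal sketch, assuming the javax.xml.transform (TrAX) API, which was standardized after the toolset discussed in this paper and is not part of it; the inlined XSLT 1.0 stylesheet, which extracts the text of each line element of the poem document from the SAX example, is illustrative only:

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.StringReader;
import java.io.StringWriter;

public class XslSketch {
    // An XSLT 1.0 stylesheet that emits each <line> element's text,
    // one per output line (text output method, no XML markup).
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' "
      + "xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='line'>"
      + "<xsl:value-of select='.'/><xsl:text>&#10;</xsl:text>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to the given XML and returns the result.
    public static String transform(String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)),
                    new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(transform(
            "<poem><line>Roses are red,</line>"
          + "<line>Violets are blue.</line></poem>"));
    }
}
```

The same separation applies as with CSS: the document carries only structure, and the stylesheet alone decides how it is rendered.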
The Resource Description Framework (RDF) [Resource Description Framework: http://www.w3.org/RDF] will soon become a W3C Recommendation and is a specification currently under development within the W3C Metadata activity. RDF is designed to provide an infrastructure to support metadata across many web-based activities. RDF is the result of a number of metadata communities bringing together their needs to provide a robust and flexible architecture for supporting metadata on the Internet and WWW. Example applications include sitemaps, content ratings, stream channel definitions, search engine data collection (web crawling), digital library collections, and distributed authoring.
RDF allows different application communities to define the metadata property set that best serves the needs of each community. RDF provides a uniform and interoperable means to exchange the metadata between programs and across the Web. Furthermore, RDF provides a means for publishing both a human-readable and a machine-understandable definition of the property set itself.
DSSSL is a more general and older standard describing style-definition languages like XSL. XSL can be converted to extended DSSSL, which can be understood by Jade, James Clark's DSSSL engine, which in turn can create formatted output using any of its back-end formatters (RTF, TeX, SGML, and HTML with CSS). Both the DSSSL and the 'HTML/CSS' flow objects from the original XSL submission are supported; xslj is available with source code for the experimentally inclined.
How is XSL different from DSSSL? From DSSSL-O?
DSSSL is an International Standard style sheet language. It is particularly used for formatting of print documents. DSSSL-O is a profile of DSSSL which removes some functionality and adds capabilities to make it more suited for online documentation. XSL draws on DSSSL and the DSSSL-O work and continues the trend towards a Web-oriented style sheet language by integrating experience with CSS.
Will XSL replace DSSSL?
DSSSL has capabilities that XSL does not, and continues in use in the print publishing industry. Experience with XSL might be used in a future revision of DSSSL, but it is too early to say.
3 Example: Processing a Document
What we can do with XML is limited by the abilities of the XML tools used. XML itself is a definition for structuring information and it carries some meta-information about this structure, but it does not carry any processing information, no semantics for a document processor. In the end, someone has to implement some code to process the document.
At a very low level, CSS or XSL can be used for document processing; on the next level something like JavaScript may also allow access to computer semantics (meaning executable code). These possibilities are discussed elsewhere. In this example we focus on connecting an XML document, specified by a language definition, to executable code.
The use of XML is demonstrated by generating Java classes from a definition embedded in XML. This is technically part of the Asgaard project [http://www.ifs.tuwien.ac.at/asgaard], [Miksch, S. and Shahar, Y. and Johnson, P.: Medizinische Leitlinien und Protokolle: das Asgaard/Asbru Projekt, KI-Journal, Themenheft MEDIZIN, Mai 1997], which stresses the support of time-oriented planning processes, e.g. in the medical domain. The target language is called Asbru, which allows plans and guidelines to be defined.
Therefore we want to write plans in the XML format and generate some Java classes and objects out of this document to do some more complex operations to support the therapy process.
The definition may be given by the Schema specification [Schemata in XML: http://www.w3.org/XML/Activity.html] from the W3C. To keep the example simple, this specification is replaced by a simpler one which only contains the necessary parts.
In figure 3 a four-step process is outlined which may be implemented as a client-server application or as a stand-alone tool to map the language elements to class instances.
The first step is to build a meta model which includes machine-readable semantics, e.g. in Java code. As mentioned above it could also be some script language or another language target like CLIPS interpreting knowledge rules. Java was chosen for this example to stay close to familiar concepts.
<?xml version="1.0"?>
<!DOCTYPE spec [
<!ELEMENT spec (class | reflect)*>
<!ATTLIST spec name CDATA #REQUIRED>
<!ELEMENT class (attr*, method*)>
<!ATTLIST class name CDATA #REQUIRED>
<!ELEMENT method (#PCDATA)>
<!ATTLIST method modifyer CDATA #REQUIRED
                 type CDATA #REQUIRED
                 name CDATA #REQUIRED>
<!ELEMENT attr EMPTY>
<!ATTLIST attr modifyer CDATA #REQUIRED
               type CDATA #REQUIRED
               name CDATA #REQUIRED>
<!ELEMENT reflect EMPTY>
<!ATTLIST reflect name CDATA #REQUIRED
                  synonym CDATA #REQUIRED>
]>

The first part is the definition of the Schema used, the meta-meta description of the origin document. This document defines two basic kinds of elements:
For some simple definitions a class may also define a set of attributes as well as methods, each consisting of a modifier, a type and a name. (These element attributes could also have been defined as tags, but in this example it is done this way.)
<spec name="test.xml">
  <class name="Person">
    <attr modifyer="private" type="String" name="vorname"/>
    <attr modifyer="private" type="String" name="nachname"/>
    <attr modifyer="private" type="float" name="gewicht"/>
    <method modifyer="public" type="void" name="Vorname">
      vorname = content;
    </method>
    <method modifyer="public" type="void" name="Nachname">
      nachname = content;
    </method>
    <method modifyer="public" type="void" name="Gewicht">
      gewicht = Float.valueOf(content).floatValue();
    </method>
    <method modifyer="public" type="String" name="toString">
      return vorname + " " + nachname;
    </method>
  </class>
  <reflect name="java.util.Date" synonym="NOW"/>
</spec>

The second part defines a Java class called Person with three properties "Vorname", "Nachname" and "Gewicht". A second class, "java.util.Date", which is already implemented, is referenced. (Of course this is a very poor language, but I tried to keep the example as simple as possible for demonstration!)
The second step is completely automatic: a parser is extended using the DOM API to process the meta model and produce Java code from it, as well as a corresponding DTD:
import com.ibm.xml.parser.*;
import org.w3c.dom.*;
import java.io.*;

public class ClassParser {
    private FileWriter classWriter = null;
    private FileWriter dtdWriter = null;

    public static void main (String args[]) throws Exception {
        Parser ps = new Parser("Java Class & DTD Creator");
        ClassParser app = new ClassParser();
        app.doit(ps.readStream(new FileInputStream("ClassSpec.xml")));
    }
First comes the parser-specific part, instantiating the parser itself and reading the model from the file ClassSpec.xml. The next step is to extract the names of all defined classes to list them in the root element of the new language-specific DTD (which has the same name as the DOCTYPE) and to generate the head of the DTD file.
    private void doit(Document doc) throws Exception {
        // Get the name of the root-element
        NodeList root = doc.getElementsByTagName("spec");
        Element o = (Element) root.item(0);
        String packageName = o.getAttribute("name");

        // Get the list with all class-definitions
        NodeList classList = doc.getElementsByTagName("class");
        String classNames = "#PCDATA";
        for (int i = 0; i < classList.getLength(); i++) {
            Element c = (Element) classList.item(i);
            classNames += " | " + c.getAttribute("name");
        }
        NodeList reflectList = doc.getElementsByTagName("reflect");
        for (int h = 0; h < reflectList.getLength(); h++) {
            Element r = (Element) reflectList.item(h);
            classNames += " | " + r.getAttribute("name");
        }

        // Write the header of the DTD file
        dtdWriter = new FileWriter(packageName + ".dtd");
        dtdWriter.write("<?xml version=\"1.0\"?>\n");
        dtdWriter.write("<!DOCTYPE " + packageName + " [\n");
        dtdWriter.write("<!ELEMENT " + packageName + " (" + classNames + ")*>\n");

Next, each class definition is looked at separately, generating one Java file per class and adding a header and the defined attributes of the class.
        // Create all java-class-files
        for (int i = 0; i < classList.getLength(); i++)
            if (classList.item(i).getNodeType() == Node.ELEMENT_NODE) {
                Element c = (Element) classList.item(i);
                String className = c.getAttribute("name");

                // produce a new java-class-file
                classWriter = new FileWriter(className + ".java");
                classWriter.write("package " + packageName + ";\n\n");
                classWriter.write("public class " + className + " {\n\n");

                // produce the definition-java-code
                NodeList attrList = c.getElementsByTagName("attr");
                for (int j = 0; j < attrList.getLength(); j++)
                    if (attrList.item(j).getNodeType() == Node.ELEMENT_NODE) {
                        Element a = (Element) attrList.item(j);
                        classWriter.write("    " + a.getAttribute("modifyer")
                            + " " + a.getAttribute("type")
                            + " " + a.getAttribute("name") + ";\n");
                    }
                classWriter.write("\n");
Almost in parallel, the DTD definition of this class is written to the DTD file, including the list of methods offered by the class.
                // add element-definitions for the class to the dtd-file
                NodeList methodList = c.getElementsByTagName("method");
                String methodNames = "#PCDATA";
                for (int k = 0; k < methodList.getLength(); k++)
                    if (methodList.item(k).getNodeType() == Node.ELEMENT_NODE) {
                        Element m = (Element) methodList.item(k);
                        if (m.getAttribute("modifyer").equalsIgnoreCase("PUBLIC"))
                            methodNames += " | " + m.getAttribute("name");
                    }
                dtdWriter.write("<!ELEMENT " + className + " (" + methodNames + ")*>\n");

After the class definition each method is processed, generating the method header and the wrapped Java code which was placed in the content of the method tag.
                // produce the method-java-code
                for (int k = 0; k < methodList.getLength(); k++)
                    if (methodList.item(k).getNodeType() == Node.ELEMENT_NODE) {
                        Element m = (Element) methodList.item(k);
                        classWriter.write("    " + m.getAttribute("modifyer")
                            + " " + m.getAttribute("type")
                            + " " + m.getAttribute("name") + "(String content) {\n");

                        // produce the method-body-code out of the tag-content
                        NodeList codeList = m.getChildNodes();
                        for (int l = 0; l < codeList.getLength(); l++)
                            if (codeList.item(l).getNodeType() == Node.TEXT_NODE) {
                                classWriter.write(codeList.item(l).getNodeValue());
                            }
                        classWriter.write("\n    }");

The last step for the newly defined classes is the creation of the corresponding method tags in the DTD.
                        // add element-definitions for the method to the dtd-file
                        if (m.getAttribute("modifyer").equalsIgnoreCase("PUBLIC"))
                            dtdWriter.write("<!ELEMENT " + m.getAttribute("name") + " (#CDATA)>\n");
                    }
                classWriter.write("\n}");
                classWriter.close();
            }

It ends with the simple mapping from the existing class to the DTD file.
        // add element-definitions for reflected classes to the dtd-file
        for (int h = 0; h < reflectList.getLength(); h++) {
            Element r = (Element) reflectList.item(h);
            dtdWriter.write("<!ELEMENT " + r.getAttribute("synonym") + " EMPTY>\n");
        }
        // Close
        dtdWriter.write("]>");
        dtdWriter.close();
    }
}

Running this processor produces a consistent pair of DTD and class files.
Using XML namespaces, different DTDs can be mixed, so different (partial) language definitions could be merged. In our example we stay with one definition, using the output of the previous step:
<?xml version="1.0"?>
<!DOCTYPE xml.test [
<!ELEMENT xml.test (#PCDATA | Person | java.util.Date)*>
<!ELEMENT Person (#PCDATA | Vorname | Nachname | Gewicht)*>
<!ELEMENT Vorname (#CDATA)>
<!ELEMENT Nachname (#CDATA)>
<!ELEMENT Gewicht (#CDATA)>
<!ELEMENT NOW EMPTY>
]>

This is the extraction of the language. It is easy to see that these definitions are very lightweight and also easy to use. In the next process step the content can be connected to a form automatically, and free text can be analyzed as well.
<xml.test>
Today morning at <NOW/> a new patient
<Person><Nachname>Mayer</Nachname> <Vorname>Hans</Vorname>
arrived. His body weight was <Gewicht>82.3</Gewicht> kg.</Person>
</xml.test>

Maybe this doesn't seem to make much sense, because it is what we could do without all the overhead of extending parsers and so on. But imagine java.util.Date being replaced by a given class which can do more complex time annotation, like we have in the Asgaard project, connecting the arrival of a patient to documented events. (We would need some more definitions, but to keep the example simple they are skipped.)
public class Person {
    private String vorname;
    private String nachname;
    private float gewicht;

    public void Vorname(String content) {
        vorname = content;
    }
    public void Nachname(String content) {
        nachname = content;
    }
    public void Gewicht(String content) {
        gewicht = Float.valueOf(content).floatValue();
    }
    public String toString() {
        return vorname + " " + nachname;
    }
}

For completeness: this is the class which was generated automatically.
The processing of the plan document is event-triggered, using IBM's SAX driver:
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.io.*;
import java.util.Date;

public class PlanParser extends HandlerBase {
    private String tempStr = "";
    private Date d;
    private Person p = new Person();

    public static void main(String args[]) throws Exception {
        Parser p = ParserFactory.makeParser("com.ibm.xml.parsers.SAXParser");
        PlanParser demo = new PlanParser();
        p.setDocumentHandler(demo);
        FileInputStream is = new FileInputStream("Plan.xml");
        InputSource source = new InputSource(is);
        source.setSystemId("Doing it with SAX");
        p.parse(source);
    }

First the parser is instantiated and connected to the input file Plan.xml, which contains the language-specific DTD as well as a "plan" statement.
    public void characters(char ch[], int start, int length) {
        tempStr = new String(ch, start, length);
    }

    public void endElement(String name) {
        if (name.equalsIgnoreCase("NOW")) {
            d = new Date();
        } else if (name.equalsIgnoreCase("NACHNAME")) {
            p.Nachname(tempStr);
        } else if (name.equalsIgnoreCase("VORNAME")) {
            p.Vorname(tempStr);
        } else if (name.equalsIgnoreCase("GEWICHT")) {
            p.Gewicht(tempStr);
        }
    }

While the document is being processed, the occurrence of tags is supervised. At this point some exception handling could easily extend the XML parser's standard capabilities by checking the validity of the document's content (not only the markup), which a normal XML parser does not do.
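Such a content-validity check can be sketched as follows. This is a minimal illustration, assuming the later SAX2 DefaultHandler and JAXP factory rather than IBM's driver used above; the rule that a Gewicht element must contain a positive number is a hypothetical example, not part of the Asbru definition:

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
import java.io.StringReader;

public class ContentValidator extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    public void startElement(String uri, String local, String qName,
                             Attributes atts) {
        text.setLength(0); // reset the buffer at each element start
    }

    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    public void endElement(String uri, String local, String qName)
            throws SAXException {
        // A body weight must be a positive number; ordinary XML parsers
        // never check this kind of content constraint.
        if (qName.equalsIgnoreCase("GEWICHT")) {
            try {
                if (Float.parseFloat(text.toString().trim()) <= 0)
                    throw new SAXException("Gewicht must be positive");
            } catch (NumberFormatException e) {
                throw new SAXException("Gewicht is not a number: " + text);
            }
        }
    }

    // Returns true if every Gewicht element in the document holds a
    // positive number; a thrown SAXException signals invalid content.
    public static boolean isValid(String xml) throws Exception {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new ContentValidator());
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isValid("<p><Gewicht>82.3</Gewicht></p>"));  // true
        System.out.println(isValid("<p><Gewicht>heavy</Gewicht></p>")); // false
    }
}
```

Because SAX delivers events as the document streams by, the check rejects an invalid document without ever building a tree for it.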
    public void endDocument() {
        System.out.println(p.toString() + " processed at " + d.toString());
    }

Another event can be used to trigger some action.
AppAccelerator(tm) 1.1.036 for Java (JDK 1.1), x86 version.
Copyright (c) 1998 Inprise Corp. All Rights Reserved.
Martin Mayer processed at Mon Jun 28 16:23:19 CEST 1999

The output of this short program looks like this. Imagine that even the event handler itself could easily be generated from the domain model (step 1): the document would then take care of its own processing completely (see section 2.1.2).
We have discussed the two major APIs, DOM and SAX, for manipulating XML documents using different examples. With the parse tree, much can be done with few lines of code; this is often necessary when low-level methods like CSS or some script languages lack the requested functionality.
XML itself is a structured document carrying meta-information about its content. This may be used for processing free text as given today in HTML, combined with the support of automated processing by a computer, up to the integration into a database application where the user sees nothing of the XML code but only some abstract representation.
In the example in section 3, which is the main focus of this paper, a generic domain model defines the whole process of automated document processing in four steps:
This example uses only available tools; therefore there is no need for proprietary or commercial tools. All this can be done within the existing standards and code.