[Lispweb] An XML parser in a purely functional Scheme

oleg@pobox.com
Sun Jun 25 17:23:05 CDT 2000


A previous article described SSAX, a Scheme framework for making sense
of XML documents. That package implements a set of low- to
medium-level parsers for various productions defined in the XML
Recommendation. It _was_ not a parser per se. Rather, it was a
framework, a set of "Lego blocks" one can use to build a SAX or a DOM
parser -- or a specialized lightweight parser for a particular
document type. It was surprising to realize how little one needs to
add to the framework to get an XML parser. This article describes a
new version of SSAX, which can serve as a full-fledged parser.

The key addition to version 2 of SSAX is the function
SSAX:element->SXML. It converts an XML document, or a well-formed part
of it, into the corresponding SXML form. For an XML document within a
single default namespace and with no external entities, the result
conforms to the XML Information Set. SXML is formally defined in the
comments to the function SSAX:element->SXML. SXML is a "relative" of
DOM, whose data model is another instance of the XML Infoset.

DOM is also an abstract query and manipulation interface. DOM,
however, is not the only way to access components of an XML document.
W3C defines an XML query language, XPath, which underlies both the
XPointer and XSLT specifications. It is XPath that is to be used with
SXML. To be more precise, the SXML query language is SXPath.
Abbreviated SXPath expressions are identical to the corresponding
XPath location paths modulo parentheses and path separators. XPath has
a syntax different from that of XML. On the other hand, both SXPath
and SXML are represented by s-expressions. They can be manipulated and
composed like any other Scheme lists. Depending on the point of view,
SXPath and SXML are either data structures or Scheme code to be
evaluated as it is.
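
To make the "ordinary Scheme lists" point concrete, here is a minimal
sketch that walks an SXML tree with nothing but car, cdr and assq. The
helper names (attr-list?, children, attr-value, select-children,
select-path) are made up for this illustration and are not part of
SSAX or SXPath; the real SXPath is considerably more general. The
sample tree is abridged from the Forecasts regression test quoted
further down.

```scheme
;; Illustrative helpers only; none of these names come from SSAX or SXPath.

;; Is X an attribute list of the form (@ (name "value") ...)?
(define (attr-list? x)
  (and (pair? x) (eq? (car x) '@)))

;; Child nodes of an element, skipping the optional (@ ...) attribute list.
(define (children elem)
  (let ((rest (cdr elem)))
    (if (and (pair? rest) (attr-list? (car rest)))
        (cdr rest)
        rest)))

;; Value of attribute NAME of ELEM, or #f when the attribute is absent.
(define (attr-value elem name)
  (let ((rest (cdr elem)))
    (and (pair? rest)
         (attr-list? (car rest))
         (let ((entry (assq name (cdr (car rest)))))
           (and entry (cadr entry))))))

;; Child elements of ELEM whose tag is NAME, in document order.
(define (select-children name elem)
  (let loop ((kids (children elem)) (acc '()))
    (cond ((null? kids) (reverse acc))
          ((and (pair? (car kids)) (eq? (car (car kids)) name))
           (loop (cdr kids) (cons (car kids) acc)))
          (else (loop (cdr kids) acc)))))

;; Follow a list of tag names downward from ELEM. The path (TAF PERIOD)
;; plays the role of the XPath location path TAF/PERIOD.
(define (select-path path elem)
  (let loop ((path path) (nodes (list elem)))
    (if (null? path)
        nodes
        (loop (cdr path)
              (apply append
                     (map (lambda (n) (select-children (car path) n))
                          nodes))))))

(define doc
  '(Forecasts (@ (TStamp "958082142"))
     (TAF (@ (BId "724915"))
       (PERIOD (@ (TRange "958068000, 958078800"))
         (PREVAILING "31010KT P6SM FEW030"))
       (PERIOD (@ (Title "FM2100") (TRange "958078800, 958104000"))
         (PREVAILING "29016KT P6SM FEW040")))))

;; Titles of all PERIOD elements under TAF; #f marks a missing Title.
(define period-titles
  (map (lambda (p) (attr-value p 'Title))
       (select-path '(TAF PERIOD) doc)))
;; period-titles => (#f "FM2100")
```

Because the path is itself a list, it can be built, spliced and passed
around like any other datum, which is the property the paragraph above
attributes to SXPath expressions.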

SSAX is a purely functional lexing and parsing framework, with an
input port used as a linear, read-once variable. The function
SSAX:read-CDATA-body is worth noting in this respect: it is a 'fold'
combinator over the sequence of lines that constitute the body of a
CDATA section.
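
The 'fold' shape can be sketched as follows. This is not the SSAX
code: fold-lines and read-one-line are hypothetical names, the handler
here takes a single line rather than whatever arguments
SSAX:read-CDATA-body passes to its handler, only #\newline is treated
as a line terminator, and open-input-string (SRFI 6 or R7RS) is
assumed to be available. The point is only that the port is consumed
once, front to back, while all accumulated state travels in the seed.

```scheme
;; Read one line from PORT, without its terminating newline.
;; Returns the eof object when no characters remain.
(define (read-one-line port)
  (let ((c (read-char port)))
    (if (eof-object? c)
        c
        (let loop ((c c) (chars '()))
          (if (or (eof-object? c) (char=? c #\newline))
              (list->string (reverse chars))
              (loop (read-char port) (cons c chars)))))))

;; Fold HANDLER over the lines of PORT. HANDLER takes a line and the
;; current seed and returns the new seed; no variable is ever mutated.
(define (fold-lines port handler seed)
  (let ((line (read-one-line port)))
    (if (eof-object? line)
        seed
        (fold-lines port handler (handler line seed)))))

;; Example: collect the lines of a string, in order.
(define (lines-of str)
  (reverse (fold-lines (open-input-string str) cons '())))

;; Example: count the lines using a different handler and seed.
(define (count-lines str)
  (fold-lines (open-input-string str)
              (lambda (line n) (+ n 1))
              0))

;; (lines-of "first\nsecond\nthird")   => ("first" "second" "third")
;; (count-lines "first\nsecond\nthird") => 3
```

The same traversal serves both examples; only the handler and the
initial seed differ.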

The following is a demonstration of the parser and SXML. It is an
excerpt from SSAX's built-in regression tests. The tests print an XML
document or a well-formed fragment, and the corresponding SXML
expression. The latter is the result of applying the following
expression to the input string str:

(call-with-input-string str
  (lambda (port)
     (pp (SSAX:element->SXML (SSAX:read-content-norm-ws port) port))
     (nl)))

Regression test output:

input: " <BR/>"
Result: (BR)

input: "<BR></BR>"
Result: (BR)

input: " <BR CLEAR='ALL'\nCLASS='Class1'/>"
Result: (BR (@ (CLASS "Class1") (CLEAR "ALL")))

input: "   <A HREF='URL'> link <I>itlink</I> &amp;amp;</A>"
Result: (A (@ (HREF "URL")) "link " (I "itlink") "&amp;")

input: " <P><?pi1  p1 content ?>?<?pi2 pi2? content? ??></P>"
Result: (P (*PI* pi1 "p1 content ") "?" (*PI* pi2 "pi2? content? ?"))

input: " <P><![CDATA[<]]></P>"
Result: (P "<")

input: " <P>some text <![CDATA[<]]>1\n&quot;<B>strong</B>&quot;\n</P>"
Result: (P "some text <1 \"" (B "strong") "\"")

input: " <P><![CDATA[<BR>\n<![CDATA[<BR>]]&gt;]]></P>"
Result: (P "<BR>\n<![CDATA[<BR>]]>")

input: "<T1><T2>it&apos;s\nand   that\n</T2>\r\n\r\n\n</T1>"
Result: (T1 (T2 "it's and that "))

input: "<Forecasts TStamp="958082142">
<TAF TStamp='958066200' LatLon='36.583, -121.850' BId='724915'
SName='KMRY, MONTEREY PENINSULA'>
<VALID TRange='958068000, 958154400'>111730Z 111818</VALID>
<PERIOD TRange='958068000, 958078800'>
<PREVAILING>31010KT P6SM FEW030</PREVAILING>
</PERIOD>
<PERIOD TRange='958078800, 958104000' Title='FM2100'>
<PREVAILING>29016KT P6SM FEW040</PREVAILING>
</PERIOD>
<PERIOD TRange='958104000, 958154400' Title='FM0400'>
<PREVAILING>29010KT P6SM SCT200</PREVAILING>
<VAR Title='BECMG 0708' TRange='958114800, 958118400'>VRB05KT</VAR>
</PERIOD></TAF>
</Forecasts>"

Result: (Forecasts
  (@ (TStamp "958082142"))
  (TAF (@ (SName "KMRY, MONTEREY PENINSULA")
          (BId "724915")
          (LatLon "36.583, -121.850")
          (TStamp "958066200"))
       (VALID (@ (TRange "958068000, 958154400")) "111730Z 111818")
       (PERIOD (@ (TRange "958068000, 958078800"))
               (PREVAILING "31010KT P6SM FEW030"))
       (PERIOD (@ (Title "FM2100") (TRange "958078800, 958104000"))
               (PREVAILING "29016KT P6SM FEW040"))
       (PERIOD (@ (Title "FM0400") (TRange "958104000, 958154400"))
               (PREVAILING "29010KT P6SM SCT200")
               (VAR (@ (TRange "958114800, 958118400") (Title "BECMG 0708"))
                    "VRB05KT"))))



References:

        http://pobox.com/~oleg/ftp/Scheme/SSAX.scm
The code has built-in regression tests, which aim to validate all the
major functions. I don't parse DTDs yet, as the documents I handle are
valid by construction (if they are well-formed, that is, if the
generator finished successfully).

        http://pobox.com/~oleg/ftp/Scheme/SSAX.scm
Announcement of version 1 of SSAX

        http://pobox.com/~oleg/ftp/Scheme/SXPath.scm
SXPath, SXML query language

        http://pobox.com/~oleg/ftp/Scheme/parsing.html
Parsing library



