Home page: Specification of the XStream language
Sections:
XStream scripts
An XStream source program is made of one or several text files (with .xst as the suggested extension) and contains a sequence of phrases, optionally separated by double semi-colons (;;).
P := Phrase | include "other.xst" Include statement | declare f(...) Declare statement | caml << ... >> Caml prelude | pat -> exp Rewriting rule
Include statements are plain textual inclusion.
Terms, values
Conceptually, the program manipulates terms, which encode both XML data (the input document, the output document, intermediate results), other kinds of data (such as booleans) and intentional computation. These terms are subject to the rewriting rules specified by the script.
A value is either a basic value (e.g. an integer, a string) or a term. A term is made of a constructor name and zero, one, or several arguments which are values. Each constructor name is either built-in, implicitly declared by its occurrence in a rule, or explicitly declared by a declare statement.
A declare statement has the following form:
declare f(arg1,...,argn)
where f is an identifier and each argi is one of: _, int, bool, string, <<...>> (where ... is an OCaml type expression). This declaration specifies the signature of the constructor f. When argi is _, the i-th argument of f must be a term. Otherwise, it must be an integer, a boolean, a string, or a value of the specified OCaml type.
If a constructor is used (in an expression or a pattern) before being declared, the number of arguments is taken from the application site and they are all supposed to be terms.
There are some built-in declarations for constructors which are used to encode XML fragments:
declare elt(string,<< (string*string) list >>_,_) declare str(string,_) declare nil()
The elt constructor encodes an XML fragment which begins with an XML element. It takes one string, an OCaml list of pair of strings, and two terms as argument. The string is the tag of the XML element. The list of pairs of strings is the sequence of XML attributes. The first term describes the content of the XML element. The second term describes the rest of the XML fragment, after the first XML element.
Similarly, the str constructor encodes an XML fragment which begins with some text (the first argument), followed by another XML fragment (the second argument).
The nil constructor, which does not take any argument, encodes an empty XML fragment.
As an example, the term:
elt("root", << [] >>, elt("a", << [] >>, nil(), elt("b", << [] >>, nil(), nil())), nil())
represents the XML fragment
<root><a/><b/></root>.
Some syntactic sugar is provided to help write XML fragments.
A term
elt("tag",<< [] >>,t1,t2) can be written:
tag[t1] t2.
When t1 or t2 is
nil(), it can be omitted.
A term
str("text",t) can be written: "text" t.
The term nil() can be written ().
The term above can thus be
written root[ a[] b[] ].
There is also an built-in declaration for the entry point of the program:
declare main(_)
Patterns
The left-hand sides of rewriting rules are called pattern. A pattern matches a term and bound sub-terms to capture variables. The syntax of patterns is given by the following productions:
pat := Pattern | _ Wildcard | x Capture variable | pat as x As-pattern | pat | pat Or-pattern | f(pat1,...,patn) Constructor pattern | "..." String literal | i Integer literal | <<...>> OCaml pattern | (pat)
The wildcard pattern matches anything and bind no variable. An as-pattern (pat as x) performs the same matching as pat and additionally bind x to the current term. A capture variable x is equivalent to (_ as x). An or-pattern pat1|pat2 matches values which are matched either by pat1 or by pat2; the two patterns must have the same set of capture variable. A constructor pattern f(pat1,...,patn) matches terms of the form f(t1,...,tn) where each ti is matched against pati. In a pattern <<...>>, the dots must be an OCaml pattern without capture variable.
The pattern used in a rewriting rule can only be (at toplevel) a constructor pattern or a union of such patterns. This means in particular that you cannot capture the root in a variable.
A pattern elt("tag",_,pat1,pat2) can be written
tag[pat1] pat2.
A pattern elt(x,_,pat1,pat2) can be written %x[pat1] pat2.
A pattern elt(_,_,pat1,pat2) can be written _[pat1] pat2.
A pattern elt("tag",a,pat1,pat2) can be written tag[@a pat1] pat2.
A pattern elt(x,a,pat1,pat2) can be written %x[@a pat1] pat2.
A pattern elt(_,a,pat1,pat2) can be written _[@a pat1] pat2.
When pat1 or pat2 is nil(), it can be omitted.
A pattern str("text",pat) can be written "text" pat.
A pattern str(x,pat) can be written %x pat.
A pattern str(_,pat) can be written _ pat.
The pattern nil() can be written ().
Expressions
The right-hand sides of rewriting rules are expressions.
exp := Expression | f(exp1,...,expn) Constructor expression | "..." String literal | i Integer literal | <<...>> OCaml expression | x Variable | let x = exp1 in exp2 Local binding | if exp1 then exp2 else exp3 If-then-else | fun [ x -> exp ] Local function | match exp with [ pat1 -> exp1 | ... | patn -> expn ] Pattern matching
An expression elt("tag",<< [] >>, exp1,exp2) can be written tag[exp1] exp2.
An expression elt(x,<< [] >>, exp1,exp2) can be written %x[exp1] exp2.
An expression elt("tag",a, exp1,exp2) can be written tag[@a exp1] exp2.
An expression elt(x,a, exp1,exp2) can be written %x[@a exp1] exp2.
When exp1 or exp2 is nil(), it can be omitted.
An expression str("text",exp) can be written "text" exp. An expression str(x,exp) can be written %x exp.
The expression nil() can be written ().
The scoping rules are the same as for OCaml. All the free variables of the right-hand side of a rewriting rule must be capture variables in the left-hand side of this rewriting rule.
In the if...then...else expression, the condition expression must be of type bool.
The expression fun [ x -> exp ] behaves as f(x1,...,xn) where x1,...,xn are the free variables of the expression and f is a fresh symbol; the expression also registers an implicit rewriting rule:
apply(f,x) -> exp
where apply is a built-in constructor defined by:
declare apply(_,_)
The expression match exp with [ pat1 -> exp1 | ... | patk -> expk ] behaves as f(exp,x1,...,xn) where x1,...,xn are the free variables of the expression and f is a fresh symbol; the expression also registers implicit rewriting rules:
f(pat1,x1,...,xn) -> exp1 ... f(patk,x1,...,xn) -> expk
Behavior of an XStream program
An XStream program parses the input XML document incrementally into a variable x0 and evaluates the term main(x0). Between each XML parsing event, the term is reduced and normalized using all the rewriting rules. Parts of the result which are made available are immediately sent to the output and discarded from memory.
The order in which nodes of the term are considered for rewriting is not specified. What is specified is that for a given node, all the rewriting rules are tried in the order they appear in the program, with a first match policy. Similarly, an or-pattern is evaluated from left-to-right.
Concatenation
The standard data model for XML in XStream does not support concatenation. Of course, it is possible to define it:
cat((),q) -> q cat(%s x,q) -> %s cat(x,q) cat(%t[@a x1] x2,q) -> %t[@a x1] cat(x2,q)
XStream pre-defines the following constructors:
declare concat(_,_) declare elt1(string,(string*string) list,_) declare str1(string)
and the pretty-printer knows how to deal with them. The elt1 constructor is a variant of the elt constructor used to represent a single element (without its following siblings) and similarly for str1. Of course, if you produce terms made of these constructors and you need to rewrite them, you must be prepared to deal with them in patterns.
OCaml extensions
In order to work with basic values (integers, strings, booleans), XStream allows the programmer to put OCaml expressions in right-hand sides of rewriting rules. The OCaml expression must be enclosed by << ... >> and it must evaluate to a basic value, using only variables which are in scope and whose type are basic (not terms).
For instance, here is how to extract the n first XML elements of a sequence:
declare first(int,_) first(0, _) -> () first(n, %t x) -> first(n, x) first(n, %t[x1] x2) -> %t[x1] first(<< n - 1 >>, x2)
It is also possible to use << ... >> to write patterns. E.g.
declare bool(bool) bool(<< true >>) -> true() bool(<< false >>) -> false()
The true() and false() expressions denote constructed terms, not basic values.
Here is another example:
declare eq_str(string,string) eq_str(x,y) -> bool(<< x = y >>)
It is possible to add guards to rewriting rules:
pat when << ... >> -> exp
Example:
f(%t1[x1] y1, %t2[x2] y2) when << t1 <> "a" || t2 <> "b" >> -> ...
In a phrase of the form
caml << ... >>
the dots contain toplevel OCaml phrases (structure items). These phrases are just copied at the beginning of the produced OCaml program.
Examples
The following XStream programs simply copy the input:
main(x) -> x
This can also be achieved, though somewhat less efficiently, by the following:
main(x) -> copy(x) copy(%t[@a x1] x2) -> %t[@a copy(x1)] copy(x2) copy(%t x) -> %t copy(x) copy(()) -> ()
The following program removes all the elements with tag <a> and their content:
main(a[x1] x2) -> main(x2) main(%t[@a x1] x2) -> %t[@a main(x1)] main(x2) main(%t x) -> %t main(x) main(()) -> ()
The following program removes all the elements with tag <a> but keep their content:
main(x) -> filter(x,()) filter(a[x1] x2,q) -> filter(x1,filter(x2,q)) filter(%t[@a x1] x2,q) -> %t1[@a filter(x1,())] filter(x2,q) filter(%t x,q) -> %t filter(x,q) filter((),q) -> q
Other examples are available in the examples/ subdirectory.
Limitations
Currently, XStream does not work well with OCaml expressions used within a local abstraction (fun [ x -> ...]) or pattern matching (match ... with [ .... ]). The problem is that free variables are not inferred properly.
XML namespaces are not supported.
Textual content is not properly escaped by the pretty-printer.
There is no well-formedness check on attributes (e.g. to check that the same attribute is not defined twice in the same element).
Home page: Specification of the XStream language