Fixin' your automata
- December 5, 2018
There are several ways of compiling a regular expression (RE) down to a deterministic finite-state automaton (DFA). One such way is based on Brzozowski derivatives of regular expressions. In this post, I describe a concise OCaml implementation of this transformation. This is an opportunity to illustrate the use of fix, a library that offers facilities for constructing (recursive) memoized functions and for performing least fixed point computations.
From REs to DFAs, via Brzozowski derivatives
Suppose e denotes a set of words. Then, its derivative delta a e is the set of words obtained by keeping only the words that begin with a and by crossing out, in each such word, the initial letter a. For instance, the derivative of the set { ace, amid, bar } with respect to a is the set { ce, mid }.
A regular expression is a syntactic description of a set of words. If the set e is described by a regular expression, then its derivative delta a e is also described by a regular expression, which can be effectively computed.
Now, suppose that I am a machine and I am scanning a text, searching for a certain pattern. At each point in time, my current state of mind is described by a regular expression e: this expression represents the set of words that I am hoping to read, and that I am willing to accept. After I read one character, say a, my current state changes to delta a e, because I have restricted my attention to the words of e that begin with a, and I am now hoping to recognize the remainder of such a word.

Thus, the idea, in a nutshell, is to build a deterministic automaton whose states are regular expressions and whose transition function is delta.
The main nontrivial aspect of this apparently simple-minded approach is the fact that only a finite number of states arise when one starts with a regular expression e and explores its descendants through delta. In other words, a regular expression e only has a finite number of iterated derivatives, up to a certain equational theory. Thanks to this property, which I won’t prove here, the construction terminates, and yields a finite-state automaton.
For more details, please consult the paper Regular-expression derivatives re-examined by Scott Owens, John Reppy and Aaron Turon. In particular, Definition 4.1 in that paper gives a number of equations that must be exploited when deciding whether two regular expressions are equal. In the following, I refer to these equations collectively as EQTH, for equational theory. Among other things, these equations state that disjunction is associative, commutative, and idempotent. In other words, a disjunction must be viewed as a set of disjuncts. The empty regular expression can be viewed as an empty disjunction.
An alphabet
Throughout, I assume that the alphabet is given by a module Char whose signature is as follows:

Char : sig
  type t
  val equal: t -> t -> bool
  val hash: t -> int
  val foreach: (t -> unit) -> unit
  val print: t -> string
end
The fact that this alphabet is finite is witnessed by the existence of the function Char.foreach, which enumerates all characters.
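For concreteness, here is one possible instantiation of this signature, under the assumption that the alphabet is the 26 lowercase ASCII letters and that Char.t is OCaml's built-in char type. This module is only an illustration, not part of the original code.

module Char = struct
  type t = char
  let equal (a : t) (b : t) = (a = b)
  let hash = Hashtbl.hash
  let foreach (yield : t -> unit) =
    (* Enumerate the lowercase letters 'a' .. 'z'. Inside this definition,
       Char still refers to the standard library's Char module. *)
    for i = Char.code 'a' to Char.code 'z' do
      yield (Char.chr i)
    done
  let print (a : t) = String.make 1 a
end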
As an exercise for the reader, this can be used to define an auxiliary function exists_char of type (Char.t -> bool) -> bool.
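One possible solution to this exercise, sketched here (the original code may well be written differently), enumerates the characters with Char.foreach and stops as soon as a witness is found, using a local exception:

exception Found

let exists_char (p : Char.t -> bool) : bool =
  try
    (* Stop the enumeration as soon as some character satisfies [p]. *)
    Char.foreach (fun a -> if p a then raise Found);
    false
  with Found ->
    true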
Regular expressions, hash-consed
The syntax of regular expressions (expressions, for short) is naturally described by an algebraic data type regexp. It is the same syntax as in Owens et al.’s paper, except I use n-ary disjunctions and conjunctions. As explained above, this is dictated by the need to take EQTH into account.

The data constructors EDisj and EConj carry a list of subexpressions. This list is normalized in such a way that, if two lists are equal as sets, then they are equal as lists, too. This list is never a singleton list. It can be empty: EDisj [] is the empty expression zero, while EConj [] is the universal expression one.
type regexp =
  skeleton HashCons.cell

and skeleton =
  | EEpsilon
  | EChar of Char.t
  | ECat of regexp * regexp
  | EStar of regexp
  | EDisj of regexp list
  | EConj of regexp list
  | ENeg of regexp
A slight twist is that expressions are hash-consed. That is, every expression is decorated with an integer identifier. (A record of type skeleton HashCons.cell is a pair of an integer identifier and a skeleton.) These identifiers are unique: two expressions are equal if and only if they carry the same identifier. (This notion of equality takes EQTH into account.) This allows testing very quickly whether two expressions are equal. This also allows building efficient dictionaries whose keys are expressions, or in other words, efficient memoized functions of type regexp -> ... . This is heavily exploited in the code that follows: the functions nullable, delta, and nonempty are three examples, and there are more.
The module HashCons, which is part of fix, provides facilities for hash-consing. (The reader is encouraged to take a look at its implementation: it is very short.) It is used as follows:
let make : skeleton -> regexp =
  let module H = HashCons.ForHashedType(struct
    (* Define equality and hashing of skeletons. *)
    type t = skeleton
    let equal = ...
    let hash = ...
  end)
  in H.make

let skeleton : regexp -> skeleton =
  HashCons.data
The function make is where the magic takes place: whenever one wishes to construct a new expression, one constructs just a skeleton and passes it to make, which takes care of determining whether this skeleton is already known (in which case an existing integer identity is re-used) or is new (in which case a fresh integer identity is allocated). The function skeleton, which converts between an expression and a skeleton in the reverse direction, is just a pair projection.
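To give an idea of what the elided equality and hash functions might look like, here is a sketch of mine, not the author's code. Because the subexpressions of a skeleton are themselves hash-consed, it suffices to compare and hash them by their unique identifiers; I assume here that the HashCons module offers an accessor HashCons.id, symmetric to HashCons.data.

(* A sketch of shallow equality and hashing on skeletons, under the
   assumption that HashCons.id returns an expression's unique identifier. *)
let eq (e1 : regexp) (e2 : regexp) : bool =
  HashCons.id e1 = HashCons.id e2

let equal (s1 : skeleton) (s2 : skeleton) : bool =
  match s1, s2 with
  | EEpsilon, EEpsilon -> true
  | EChar a, EChar b -> Char.equal a b
  | ECat (e1, e2), ECat (f1, f2) -> eq e1 f1 && eq e2 f2
  | EStar e, EStar f -> eq e f
  | ENeg e, ENeg f -> eq e f
  | EDisj es, EDisj fs -> List.length es = List.length fs && List.for_all2 eq es fs
  | EConj es, EConj fs -> List.length es = List.length fs && List.for_all2 eq es fs
  | _, _ -> false

let hash (s : skeleton) : int =
  match s with
  | EEpsilon -> Hashtbl.hash 0
  | EChar a -> Hashtbl.hash (1, Char.hash a)
  | ECat (e1, e2) -> Hashtbl.hash (2, HashCons.id e1, HashCons.id e2)
  | EStar e -> Hashtbl.hash (3, HashCons.id e)
  | EDisj es -> Hashtbl.hash (4, List.map HashCons.id es)
  | EConj es -> Hashtbl.hash (5, List.map HashCons.id es)
  | ENeg e -> Hashtbl.hash (6, HashCons.id e)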
Here are two basic examples of the use of make:

let epsilon : regexp =
  make EEpsilon

let zero : regexp =
  make (EDisj [])
Still using make, one can define several smart constructors that build expressions: concatenation @@, disjunction, conjunction, iteration, negation. These smart constructors reduce expressions to a normal form with respect to the equational theory EQTH. In particular, disjunction flattens nested disjunctions, sorts the disjuncts, and removes duplicate disjuncts, so that two disjunctions that are equal according to EQTH are indeed recognized as equal.
let (@@) : regexp -> regexp -> regexp = ...
let star : regexp -> regexp = ...
let disjunction : regexp list -> regexp = ...
let conjunction : regexp list -> regexp = ...
let neg : regexp -> regexp = ...
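As an illustration of the normalization just described, here is one possible sketch of the disjunction smart constructor. It is my own reconstruction, not the author's code, and it assumes that unique identifiers can be obtained via HashCons.id.

let disjunction (es : regexp list) : regexp =
  (* Flatten nested disjunctions. *)
  let es =
    List.concat_map (fun e ->
      match skeleton e with EDisj es -> es | _ -> [e]
    ) es
  in
  (* Sort and remove duplicates, comparing expressions by their identifiers. *)
  let compare e1 e2 = Int.compare (HashCons.id e1) (HashCons.id e2) in
  let es = List.sort_uniq compare es in
  (* Never build a singleton disjunction; EDisj [] is the expression zero. *)
  match es with
  | [e] -> e
  | es -> make (EDisj es)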
Nullability
An expression is nullable if and only if it accepts the empty word. Determining whether an expression is nullable is a simple matter of writing a recursive function nullable of type regexp -> bool.

I memoize this function, so the nullability of an expression is computed at most once and can be retrieved immediately if requested again. (This is not mandatory, but it is convenient to be able to call nullable without worrying about its cost. Another approach, by the way, would be to store nullability information inside each expression.)
The module Memoize, which is part of fix, provides facilities for memoization. To use it, I apply the functor Memoize.ForHashedType to a little module R (not shown) which equips the type regexp with equality, comparison, and hashing functions. (Because expressions carry unique integer identifiers, the definition of R is trivial.) This functor application yields a module M which offers a memoizing fixed-point combinator M.fix.

Then, instead of defining nullable directly as a recursive function, I define it as an application of M.fix, as follows.
let nullable : regexp -> bool =
  let module M = Memoize.ForHashedType(R) in
  M.fix (fun nullable e ->
    match skeleton e with
    | EChar _ ->
        false
    | EEpsilon
    | EStar _ ->
        true
    | ECat (e1, e2) ->
        nullable e1 && nullable e2
    | EDisj es ->
        exists nullable es
    | EConj es ->
        forall nullable es
    | ENeg e ->
        not (nullable e)
  )
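As a quick sanity check (my own, not part of the original code), this definition gives the expected answers on a few basic expressions:

let () =
  assert (nullable epsilon);        (* epsilon accepts the empty word *)
  assert (not (nullable zero));     (* zero accepts no word at all *)
  assert (nullable (star zero))     (* a starred expression is always nullable *)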
Derivation
It is now time to define a key operation: computing the Brzozowski derivative of an expression. If a is a character and e is an expression, then delta a e is the derivative of e with respect to a.

Implementing delta is a textbook exercise. A key remark, though, is that this function must be memoized in order to ensure good complexity. A naive, non-memoizing version of it would have exponential cost, due to the duplication that takes place in the cases of concatenation and iteration.
In order to memoize a function of two arguments, one possible approach would be to rewrite it as a function that takes a pair as an argument. Here, instead, I rely on currying. For every character a, delta a is a function of type regexp -> regexp. I memoize each such function independently using fix, and I memoize the function delta itself using memoize, a nonrecursive memoization combinator.
let delta : Char.t -> regexp -> regexp =
  let module C = Memoize.ForHashedType(Char) in
  let module M = Memoize.ForHashedType(R) in
  C.memoize (fun a ->
    M.fix (fun delta e ->
      match skeleton e with
      | EEpsilon ->
          zero
      | EChar b ->
          if Char.equal a b then epsilon else zero
      | ECat (e1, e2) ->
          (* [e1 ||| e2] denotes the binary disjunction of [e1] and [e2]. *)
          delta e1 @@ e2 |||
          if nullable e1 then delta e2 else zero
      | EStar e ->
          delta e @@ star e
      | EDisj es ->
          disjunction (map delta es)
      | EConj es ->
          conjunction (map delta es)
      | ENeg e ->
          neg (delta e)
    )
  )
(Non)Emptiness
During the construction of the DFA (which comes up next), it is convenient to be able to (efficiently) decide whether an expression is nonempty (that is, whether it accepts at least one word).
Following Owens et al., I have included all three Boolean operators (disjunction, conjunction, negation) in the syntax of expressions. The presence of conjunction makes it apparently nontrivial to determine whether an expression is nonempty.
Fortunately, by exploiting nullable and delta, and with the help of fix, I am able to define an efficient nonemptiness test in four lines of code:
let nonempty : regexp -> bool =
  let module F = Fix.ForHashedType(R)(Prop.Boolean) in
  F.lfp (fun e nonempty ->
    nullable e || exists_char (fun a -> nonempty (delta a e))
  )
This test is based on the following fact: e is nonempty if and only if either e is nullable or there exists a character a such that delta a e is nonempty.
This statement, however, cannot be used directly as a recursive definition of nonempty, because such a definition would not be well-founded. An attempt to define nonempty using let rec or M.fix would construct a possibly nonterminating function.

Instead, the above statement must be viewed as an inductive characterization of nonemptiness. That is, nonempty is the least fixed point in the Boolean lattice (where false is less than true) of the equation:

nonempty e = nullable e || exists_char (fun a -> nonempty (delta a e))
The main module of the fix library, named Fix, is an algorithm for computing a least fixed point of type variable -> property, where the type property forms a lattice. The functor application Fix.ForHashedType(R)(Prop.Boolean) instantiates this algorithm so that the type variable is regexp and the type property is bool. Then, the fixed point combinator F.lfp allows us to give a straightforward definition of nonempty.
The function nonempty
thus defined is incremental.
Nothing is computed when the function is constructed; computation takes
place only when the function is invoked. It is also memoized: the
nonemptiness of an expression, once computed, is recorded and is never
recomputed. Thus, one can invoke this function without worrying about
its cost.
The reader might note that nonemptiness could also be computed via a graph traversal: an expression is nonempty if and only if there exists a path from this expression to a nullable expression in the graph whose vertices are expressions and whose edges are determined by delta. What I have just done is exploit the fact that co-accessibility is easily expressed as a least fixed point.
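For illustration (again, a check of mine, not part of the original code), here is how nonempty behaves on a small example where conjunction and negation matter: the expression epsilon ∧ ¬epsilon accepts no word at all, even though both conjuncts are nonempty.

let () =
  assert (nonempty epsilon);
  assert (not (nonempty zero));
  (* Both conjuncts are nonempty, yet their intersection is empty: *)
  assert (not (nonempty (conjunction [epsilon; neg epsilon])))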
Constructing a DFA
The tools are now at hand to convert an expression to a deterministic finite-state automaton.
I must first settle on a representation of such an automaton as a data structure in memory. I choose to represent a state as an integer in the range of 0 to n-1, where n is the number of states. An automaton can then be described as follows:
type state =
  int

type dfa = {
  n: int;
  init: state option;
  decode: state -> regexp;
  transition: state -> Char.t -> state option;
}
init is the initial state. If it is absent, then the automaton rejects every input.

The function decode maps every state to the expression that this state accepts. This expression is guaranteed to be nonempty. This state is a final state if and only if this expression is nullable.

The function transition maps every state and character to an optional target state.
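To make the intended use of this record concrete, here is a hedged sketch (not part of the original code) of a function that runs the automaton on a list of characters and decides whether the word is accepted, relying only on the conventions just described:

let accepts (d : dfa) (w : Char.t list) : bool =
  let rec run (q : state option) (w : Char.t list) : bool =
    match q, w with
    | None, _ ->
        (* A dead end: no continuation of the input can be accepted. *)
        false
    | Some q, [] ->
        (* The input is exhausted: accept iff this state is final,
           that is, iff the expression it stands for is nullable. *)
        nullable (d.decode q)
    | Some q, a :: w ->
        run (d.transition q a) w
  in
  run d.init w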
Now, how does one construct a DFA for an expression e? The answer is simple, really. Consider the infinite graph whose vertices are the nonempty expressions and whose edges are determined by delta. The fragment of this graph that is reachable from e is guaranteed to be finite, and is exactly the desired automaton.

There are several ways of approaching the construction of this finite graph fragment. I choose to first perform a forward graph traversal during which I discover the vertices that are reachable from e and number them from 0 to n-1. Once this is done, completing the construction of a data structure of type dfa is easy.
let dfa (e : regexp) : dfa =
  let module G = struct
    type t = regexp
    let foreach_successor e yield =
      Char.foreach (fun a ->
        let e' = delta a e in
        if nonempty e' then yield e'
      )
    let foreach_root yield =
      if nonempty e then yield e
  end in
  let module N = Number.ForHashedType(R)(G) in
  let n, decode = N.n, N.decode in
  let encode e = if nonempty e then Some (N.encode e) else None in
  let init = encode e in
  let transition q a = encode (delta a (decode q)) in
  { n; init; decode; transition }
In the above code, the module G
is a description of the
graph that I wish to traverse.
The module Number,
which is part of fix, helps explore this
graph. The functor application Number.ForHashedType(R)(G)
performs a traversal of the graph G
and constructs a
numbering N
of its vertices. The module N
contains the following data:
val n: int
val encode: regexp -> int
val decode: int -> regexp
Here, n
is the number of vertices that have been
discovered. The functions encode
and decode
record the correspondence between vertices and numbers. In other words,
decode
has type state -> regexp
and maps a
state to the nonempty expression that this state stands for. Conversely,
encode
has type regexp -> state
and maps a
nonempty expression to a state.
I extend encode to a function of type regexp -> state option, which can be applied to a possibly empty expression. Once this is done, the initial state init is obtained by transporting the expression e through the encoding, while the transition function transition is obtained by transporting the function delta through the encoding.
That’s all!
The automaton thus obtained is not necessarily minimal.
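As a small usage sketch of mine (assuming that Char.t is OCaml's char and that single-character expressions are built with make (EChar ...)), one can construct the automaton for the expression 1dead discussed in the examples below:

let search_dead : dfa =
  let one = conjunction [] in         (* the universal expression, EConj [] *)
  let char a = make (EChar a) in      (* hypothetical single-character constructor *)
  dfa (one @@ char 'd' @@ char 'e' @@ char 'a' @@ char 'd')

let () =
  Printf.printf "The automaton for 1dead has %d states.\n" search_dead.n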
Examples
The expression dead
gives rise, quite naturally, to the
following automaton. As usual, every transition is labelled with a
character. In addition, every state is labelled with the expression that
it represents.
Naturally, this automaton is not very useful, as it merely tests whether the input begins with dead. To search the input for the word dead, one should use the expression 1dead. (Here, 1 denotes the universal expression one.) This expression gives rise to the following automaton:
Here is another example of searching for a nontrivial pattern. This automaton corresponds to the expression 1a(a|b)*(bc)*:
The reader may notice that a final state can have outgoing transitions. If one is interested in finding all matches or in finding a longest match, then this is useful.
However, if one is interested only in searching for a first match and a shortest match, then there is no need for a final state to have outgoing transitions. The above construction can easily be modified so that final states do not carry any outgoing transitions. This leads to slightly simpler automata. For instance, the expression 1dead leads to the following first-match automaton:
Feeding an input text into this automaton is essentially equivalent
to searching for the word dead
using the Knuth-Morris-Pratt
algorithm.
The expression 1(add|dead), which is used to search a text for one of the words add and dead, gives rise to the following first-match automaton:
Feeding an input text into this automaton is essentially equivalent
to searching for one of the words add
and dead
using Aho and Corasick’s algorithm.
Conclusion
Memoization, hash-consing, and fixed points are powerful tools. The fix library makes these tools easy to use. The conversion of regular expressions to deterministic finite-state automata by Brzozowski’s method is a good illustration of their application.
For more details, please look at the full source code for this demo.