Comments on Matías Giovannini's "Merge Right"
- August 18, 2012
A relatively common story: a comment I started writing in reaction to Matías Giovannini’s Merge right blog post turned out lengthier than I initially expected. It feels a bit off-topic on Gagallium, but I know there are people interested in reading about Matías’ perfectionist code snippets (I am), so there might also be an audience for perfectionist comments about perfectionist blog posts…
If you haven’t read Matías’ blog post yet, you should have a look at it now – or skip this post altogether. He defines a rather generic merging operation on sorted collections, parametrized by functions deciding what to do when the next item of the left collection is smaller, when the next item of the right collection is smaller, when both items are equal, or when one of the two collections has no more elements.
He then enumerates all reasonable choices for those function parameters, to get a list of 32 different "merge operators" that are instances of this general function. It’s amusing to see this classification carried out in full.
The code for the generic merge function is essentially this one:
```ocaml
let merge ln rn lt eq gt : 'a list -> 'a list -> 'a list =
  let rec go l r = match l, r with
    | [], ys -> ln ys
    | xs, [] -> rn xs
    | x :: xs, y :: ys ->
      match compare x y with
      | -1 -> lt x (go xs r)
      |  0 -> eq x (go xs ys)
      |  1 -> gt y (go l ys)
      |  _ -> assert false
  in go
```
With the following operators used as parameters, in two groups, one for `ln`/`rn` and one for `lt`/`eq`/`gt`:

```ocaml
let self xs = xs and null _ = []
let cons x xs = x :: xs and tail _ xs = xs
```

Matías shows, for example, that the intersection of two collections may be defined as `merge null null tail cons tail`, while the symmetric difference is `merge self self cons tail cons`.
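As a quick sanity check, these two instances behave as expected on small examples (the names `inter` and `symdiff` below are mine, not Matías’):

```ocaml
(* Matías' generic merge and the four combinators, reproduced verbatim. *)
let merge ln rn lt eq gt : 'a list -> 'a list -> 'a list =
  let rec go l r = match l, r with
    | [], ys -> ln ys
    | xs, [] -> rn xs
    | x :: xs, y :: ys ->
      match compare x y with
      | -1 -> lt x (go xs r)
      |  0 -> eq x (go xs ys)
      |  1 -> gt y (go l ys)
      |  _ -> assert false
  in go

let self xs = xs and null _ = []
let cons x xs = x :: xs and tail _ xs = xs

(* Two of the 32 instances: intersection and symmetric difference. *)
let inter l r = merge null null tail cons tail l r
let symdiff l r = merge self self cons tail cons l r

let () =
  assert (inter [1; 2; 3; 5] [2; 3; 4] = [2; 3]);
  assert (symdiff [1; 2; 3; 5] [2; 3; 4] = [1; 4; 5])
```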
Parametricity: a bit too good to be true
I think it is not exactly true that "provided the output sequence must be increasing, these four functions [self, null, cons and tail] exhaust the possibilities by parametricity": it seems that removing some elements from the sequence `xs` (but not all, as `null` and `tail` do) would preserve this sorting criterion.
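To make the objection concrete, here is a sketch of such a combinator (the name `every_other` is mine): it is perfectly parametric and returns a sorted sublist of its input, yet it is neither `self` nor `null`:

```ocaml
(* A parametric function on 'a list that keeps every other element:
   it preserves sortedness of its input, yet is neither [self] nor
   [null], so parametricity alone does not pin down the combinators. *)
let rec every_other = function
  | [] -> []
  | [x] -> [x]
  | x :: _ :: rest -> x :: every_other rest

let () = assert (every_other [1; 2; 3; 4; 5] = [1; 3; 5])
```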
I’m not sure of the best way to reduce the degrees of freedom of this problem to capture the intended behavior. Here are some ideas, step by step.
One may notice that all `lt`/`eq`/`gt` operators pass the `xs` list along unchanged: the only question is whether or not they add their first parameter in front of it. This suggests the following change in code, using new combinators:

```ocaml
let merge ln rn lt eq gt : 'a list -> 'a list -> 'a list =
  let rec go l r = match l, r with
    | [], ys -> ln ys
    | xs, [] -> rn xs
    | x :: xs, y :: ys ->
      match compare x y with
      | -1 -> lt x @ go xs r
      |  0 -> eq x @ go xs ys
      |  1 -> gt y @ go l ys
      |  _ -> assert false
  in go

let keep x = [x]
let drop _ = []
```
Those new combinators are still not uniquely defined by parametricity: `let duplicate x = [x; x]` would preserve well-ordering of the result. One solution could be to require those combinators to have an "affine" type, where "affine" is a weak notion of linearity meaning that they can either ignore their argument or use it only once. One may also take as input a boolean `keep`, and use `filter (const keep) [x]` – a sophisticated way to say that we use a boolean to choose between the only two combinators of interest, `keep` or `drop`, which feels a bit like cheating.
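The boolean encoding can be sketched as follows (`const` is a small helper assumed here, not part of the original code):

```ocaml
(* The boolean trick: a single combinator family indexed by a boolean
   [keep], implemented by filtering the singleton list [x]. *)
let const k _ = k
let combinator keep x = List.filter (const keep) [x]

let () =
  assert (combinator true 42 = [42]);
  assert (combinator false 42 = [])
```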
Another remark on this change: it makes it possible to define a tail-recursive variant of `merge` without changing the interface exposed to the user: instead of writing `lt x @ go xs r`, one would write `List.rev_append (lt x) acc`, for `acc` an accumulator variable that gets reversed at the end. This is not possible with the previous definition of `lt`/`eq`/`gt`, which allows them to inspect the whole list of results of future calls.
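For concreteness, here is a sketch of that tail-recursive variant (`merge_tr` is a hypothetical name), using the `keep`/`drop`-style combinators of type `'a -> 'a list`:

```ocaml
(* Tail-recursive merge: instead of prepending to the result of a
   future call, each step reverse-prepends onto an accumulator [acc],
   which is reversed back at the end with List.rev_append. *)
let merge_tr ln rn lt eq gt : 'a list -> 'a list -> 'a list =
  let rec go acc l r = match l, r with
    | [], ys -> List.rev_append acc (ln ys)
    | xs, [] -> List.rev_append acc (rn xs)
    | x :: xs, y :: ys ->
      match compare x y with
      | -1 -> go (List.rev_append (lt x) acc) xs r
      |  0 -> go (List.rev_append (eq x) acc) xs ys
      |  1 -> go (List.rev_append (gt y) acc) l ys
      |  _ -> assert false
  in go []

let keep x = [x]
let drop _ = []
let self xs = xs and null _ = []

let () =
  (* intersection: keep only the equal elements *)
  assert (merge_tr null null drop keep drop [1; 2; 3] [2; 3; 4] = [2; 3]);
  (* union: keep everything, equal elements once *)
  assert (merge_tr self self keep keep keep [1; 3] [2; 3] = [1; 2; 3])
```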
Regarding `ln`/`rn`, one may avoid the pitfall of removing elements from the lists by the same trick: not giving the combinators access to the whole list at once.

```ocaml
let merge ln rn lt eq gt : 'a list -> 'a list -> 'a list =
  let rec go l r = match l, r with
    | [], ys -> flatten (map ln ys)
    | xs, [] -> flatten (map rn xs)
    | x :: xs, y :: ys ->
      match compare x y with
      | -1 -> lt x @ go xs r
      |  0 -> eq x @ go xs ys
      |  1 -> gt y @ go l ys
      |  _ -> assert false
  in go
```
With this definition (`flatten : 'a list list -> 'a list` may as well be named `join`), `ln`/`rn` use the same combinators as `lt`/`eq`/`gt`, at type `'a -> 'a list`, which helps simplify matters. I think that using the affine-type restriction mentioned above, one could then say that parametricity guarantees that all the possibilities are the ones Matías describes.
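With all five parameters now drawn from the same `keep`/`drop` pair, the familiar operations read uniformly. A runnable sketch, using `List.flatten`/`List.map` for `flatten`/`map`:

```ocaml
(* The flatten/map variant: ln/rn are applied element by element to the
   leftover list, so they have the same type 'a -> 'a list as lt/eq/gt. *)
let merge ln rn lt eq gt : 'a list -> 'a list -> 'a list =
  let rec go l r = match l, r with
    | [], ys -> List.flatten (List.map ln ys)
    | xs, [] -> List.flatten (List.map rn xs)
    | x :: xs, y :: ys ->
      match compare x y with
      | -1 -> lt x @ go xs r
      |  0 -> eq x @ go xs ys
      |  1 -> gt y @ go l ys
      |  _ -> assert false
  in go

let keep x = [x]
let drop _ = []

let () =
  (* symmetric difference: keep unmatched elements on both sides *)
  assert (merge keep keep keep drop keep [1; 2; 3; 5] [2; 3; 4] = [1; 4; 5]);
  (* intersection: keep only the equal elements *)
  assert (merge drop drop drop keep drop [1; 2; 3; 5] [2; 3; 4] = [2; 3])
```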
I have one last idea that is even stricter: I think the `ln` and `rn` combinators may be an unnecessary degree of freedom here. From the point of view of a merging operator, there should not be much of a difference between an empty list `[]` and a list beginning with an infinite value, `∞::_`. Indeed, if one imagines that `merge` picks the smallest element of both lists, such an infinite value will force all the (non-infinite) elements of the other list to be considered in turn, a bit like the current code path for empty lists. We could therefore handle the pattern `| [], ys -> ...` as if it actually was something like `| (∞::_), y::ys -> ...`, which suggests the following right-hand side (knowing that `compare ∞ y > 0`): `| (∞::_), y::ys -> gt y @ go l ys`.
This results in the following function:
```ocaml
let merge lt eq gt : 'a list -> 'a list -> 'a list =
  let rec go l r = match l, r with
    | [], [] -> []
    | x :: xs, [] -> lt x @ go xs r
    | [], y :: ys -> gt y @ go l ys
    | x :: xs, y :: ys ->
      match compare x y with
      | -1 -> lt x @ go xs r
      |  0 -> eq x @ go xs ys
      |  1 -> gt y @ go l ys
      |  _ -> assert false
  in go
```
This new definition is strictly less expressive, as it only has 2^3 instead of 2^5 parametrizations. It is also less efficient, as we handle the elements of the remaining list one by one (which can be disastrous in applications where `ys`, a diff list, can be expected to be much smaller than `xs`, a data list), but we already gave up on performance when moving to the `flatten (map ...)` style – a price we pay to look nice.
The eight remaining behaviors are the following:
- `keep keep keep` (`self self`, `cons cons cons`): A∪B
- `keep keep drop` (`null self`, `cons cons tail`): A
- `keep drop keep` (`self self`, `cons tail cons`): AΔB
- `keep drop drop` (`null self`, `cons tail tail`): A - B
- `drop keep keep` (`self null`, `tail cons cons`): B
- `drop keep drop` (`null null`, `tail cons tail`): A∩B
- `drop drop keep` (`self null`, `tail tail cons`): B - A
- `drop drop drop` (`null null`, `tail tail tail`): ∅
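These eight cases can be checked mechanically against the three-parameter `merge` (a self-contained sketch, with `keep`/`drop` as before):

```ocaml
(* The three-parameter merge and all 2^3 = 8 of its instantiations,
   checked on a small pair of sorted lists. *)
let merge lt eq gt : 'a list -> 'a list -> 'a list =
  let rec go l r = match l, r with
    | [], [] -> []
    | x :: xs, [] -> lt x @ go xs r
    | [], y :: ys -> gt y @ go l ys
    | x :: xs, y :: ys ->
      match compare x y with
      | -1 -> lt x @ go xs r
      |  0 -> eq x @ go xs ys
      |  1 -> gt y @ go l ys
      |  _ -> assert false
  in go

let keep x = [x]
let drop _ = []

let a = [1; 2; 3; 5] and b = [2; 3; 4]

let () =
  assert (merge keep keep keep a b = [1; 2; 3; 4; 5]); (* A ∪ B *)
  assert (merge keep keep drop a b = [1; 2; 3; 5]);    (* A     *)
  assert (merge keep drop keep a b = [1; 4; 5]);       (* A Δ B *)
  assert (merge keep drop drop a b = [1; 5]);          (* A - B *)
  assert (merge drop keep keep a b = [2; 3; 4]);       (* B     *)
  assert (merge drop keep drop a b = [2; 3]);          (* A ∩ B *)
  assert (merge drop drop keep a b = [4]);             (* B - A *)
  assert (merge drop drop drop a b = [])               (* ∅     *)
```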
This captures all the operations Matías found most useful, but drops (bad pun intended) the “remainders” B/A and A/B. My intuition is that this restriction on the behavior of `ln`/`rn` by “prolongation to ∞” enforces a kind of monotonicity property: all operators O(A,B) so defined satisfy the property that O(A₁,B₁) @ O(A₂,B₂) is the same as O(A₁@A₂, B₁@B₂) whenever the biggest elements of A₁ and B₁ are smaller than the smallest elements of A₂ and B₂ (A₁ < A₂, for example, is necessary for A₁@A₂ to be a valid sorted sequence). This property does not hold for the A/B operation.
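The monotonicity property is easy to test on an example, here with the symmetric difference, on a split where every element of A₁ and B₁ is smaller than every element of A₂ and B₂:

```ocaml
(* Checking the monotonicity claim on the symmetric difference:
   when A1@A2 and B1@B2 are valid sorted concatenations,
   O(A1,B1) @ O(A2,B2) coincides with O(A1@A2, B1@B2). *)
let merge lt eq gt : 'a list -> 'a list -> 'a list =
  let rec go l r = match l, r with
    | [], [] -> []
    | x :: xs, [] -> lt x @ go xs r
    | [], y :: ys -> gt y @ go l ys
    | x :: xs, y :: ys ->
      match compare x y with
      | -1 -> lt x @ go xs r
      |  0 -> eq x @ go xs ys
      |  1 -> gt y @ go l ys
      |  _ -> assert false
  in go

let keep x = [x]
let drop _ = []
let symdiff l r = merge keep drop keep l r

let a1 = [1; 2] and a2 = [5; 6]
let b1 = [2]    and b2 = [5; 7]

let () =
  assert (symdiff a1 b1 @ symdiff a2 b2 = symdiff (a1 @ a2) (b1 @ b2))
```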
Reasoning on the parametrization
I had one other remark on Matías’ post. In fact, it is the only remark I had after reading it, and the one that got me motivated to write a reddit comment; I only caught the parametricity "bug" while re-reading the post before commenting. It is nice to have a parametrization of set merges fully worked out, but it would be even better if this parametrization were actually helpful to reason about programs: can we express transformations on the results of those operators in terms of transformations on the parameters?
This is indeed the case. One can check that, if I call O{lt,eq,gt} the operator parametrized by the three `lt`/`eq`/`gt` combinators, then O{l₁,e₁,g₁}(A,B) ∪ O{l₂,e₂,g₂}(A,B) is equal to O{l₁∨l₂,e₁∨e₂,g₁∨g₂}(A,B), where ∨ is the symmetric operator defined by:

    keep ∨ _    := keep
    _    ∨ keep := keep
    drop ∨ drop := drop
If one defines a ∧ operator by swapping the roles of keep and drop in this definition, one gets O{l₁∧l₂,e₁∧e₂,g₁∧g₂}(A,B) equal to O{l₁,e₁,g₁}(A,B) ∩ O{l₂,e₂,g₂}(A,B). There might be a general definition of what O{l,e,g}(O{l₁,e₁,g₁}(A,B), O{l₂,e₂,g₂}(A,B)) is, but this looks exhausting to sort out. Of course, the five-parameter version of Matías certainly leads to related functional reasoning, more expressive but also more tedious.
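The ∨ law can be verified exhaustively on a sample input. In the sketch below (the names `comb`, `op` and `triples` are mine), each combinator is encoded by a boolean – true for keep, false for drop – so that ∨ is simply `||`, and all 64 pairs of parametrizations are checked:

```ocaml
(* Exhaustive check of the ∨ law on one sample input: the union of two
   merge results equals the merge parametrized by the pointwise
   disjunction of the two parametrizations. *)
let merge lt eq gt : 'a list -> 'a list -> 'a list =
  let rec go l r = match l, r with
    | [], [] -> []
    | x :: xs, [] -> lt x @ go xs r
    | [], y :: ys -> gt y @ go l ys
    | x :: xs, y :: ys ->
      match compare x y with
      | -1 -> lt x @ go xs r
      |  0 -> eq x @ go xs ys
      |  1 -> gt y @ go l ys
      |  _ -> assert false
  in go

let comb b x = if b then [x] else []       (* true = keep, false = drop *)
let op (l, e, g) = merge (comb l) (comb e) (comb g)
let union l r = op (true, true, true) l r  (* O{keep,keep,keep} is ∪ *)

let a = [1; 2; 3; 5] and b = [2; 3; 4]
let bools = [true; false]
let triples =
  List.concat (List.map (fun l ->
    List.concat (List.map (fun e ->
      List.map (fun g -> (l, e, g)) bools) bools)) bools)

let () =
  List.iter (fun ((l1, e1, g1) as c1) ->
    List.iter (fun ((l2, e2, g2) as c2) ->
      let lhs = union (op c1 a b) (op c2 a b) in
      let rhs = op (l1 || l2, e1 || e2, g1 || g2) a b in
      assert (lhs = rhs))
      triples)
    triples
```

The ∧ law can be checked the same way by replacing `union` with `op (false, true, false)` applied to the two results and `||` with `&&`.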
That’s it! Thanks for your patience, and stay tuned for more content on Gagallium, after a guilty summer break.