Comments on Matías Giovannini's "Merge Right"
- August 18, 2012
A relatively common story: a comment I started writing in reaction to Matías Giovannini's Merge right blog post turned out lengthier than I initially expected. It feels a bit off-topic on Gagallium, but I know there are people interested in reading about Matías' perfectionist code snippets (I am), so there might also be an audience for perfectionist comments about perfectionist blog posts...
If you haven't read Matías' blog post yet, you should have a look at it now – or skip this post altogether. He defines a rather generic merging operation on sorted collections, parametrized by functions deciding what to do when the next item on the left collection is smaller, or on the right collection, when both items are equal, or when one of the two collections has no more elements.
He then enumerates all reasonable choices for those function parameters, to get a list of 32 different "merge operators" that are instance of this general function. It's amusing to see this classification carried out in full.
The code for the generic merge function is essentially this one:
let merge ln rn lt eq gt : 'a list -> 'a list -> 'a list = let rec go l r = match l, r with | [], ys -> ln ys | xs, [] -> rn xs | x :: xs, y :: ys -> match compare x y with | -1 -> lt x (go xs r) | 0 -> eq x (go xs ys) | 1 -> gt y (go l ys) | _ -> assert false in go
With the following operators used as parameters, in two groups, one
for ln/rn and one for lt/eq/gt:
let self xs = xs and null _ = [] let cons x xs = x :: xs and tail _ xs = xs
Matías shows, for example, that intersection of two collection may be
defined as merge null null tail cons tail, while symmetric
difference is merge self self cons tail cons.
Parametricity : a bit too good to be true
I think it is not exactly true that "provided the output sequence must
be increasing, these four functions [self, null, cons and tail]
exhaust the possibilities by parametricity": it seems that removing
some elements from the sequence xs (but not all as null and
tail do) would preserve this sorting criterion.
I'm not sure which is the best way to reduce the number of dimensions of freedom of this problem to capture the behavior you want. Here are some ideas, step by step.
One may notice that all lt/eq/gt operators return the xs list
unchanged. The only question is whether or not they add their input
parameter. This suggests the following change in code, using new
combinators:
let merge ln rn lt eq gt : 'a list -> 'a list -> 'a list = let rec go l r = match l, r with | [], ys -> ln ys | xs, [] -> rn xs | x :: xs, y :: ys -> match compare x y with | -1 -> lt x @ go xs r | 0 -> eq x @ go xs ys | 1 -> gt y @ go l ys | _ -> assert false in go let keep x = [x] let drop x = []
Those new combinators are still not uniquely defined by parametricity:
let duplicate x = [x; x] would preserve well-ordering of the
result. One solution could be to request those combinators to have an
"affine" type, where "affine" is a weak notion of linearity meaning
that they can either ignore their output, or use it only once. One may
also take as input a boolean keep, and use filter (const keep) [x]
– a sophisticated way to say that we use a boolean to choose between
the only two combinators of interest, keep or drop, which feels
a bit like cheating.
Another remark on this change: it makes it possible to define
a tail-recursive variant of merge without changing the interface
exposed to the user: instead of writing lt x @ go xs r, one would
write List.rev_append (lt x) acc, for acc an accumulator variable
that gets reversed at the end. This is not possible with the previous
definition of lt/eq/gt that allows them to inspect the whole list of
results of future calls.
Regarding ln/rn, one may avoid the pitfall of removing elements from
the lists by the same trick: not giving the combinators access to the
whole list at once.
let merge ln rn lt eq gt : 'a list -> 'a list -> 'a list = let rec go l r = match l, r with | [], ys -> flatten (map ln ys) | xs, [] -> flatten (map rn xs) | x :: xs, y :: ys -> match compare x y with | -1 -> lt x @ go xs r | 0 -> eq x @ go xs ys | 1 -> gt y @ go l ys | _ -> assert false in go
With this definition (flatten : 'a list list -> 'a list
may as well be named join), ln/rn use the same combinators as
lt/eq/gt, at type 'a -> 'a list, which helps
simplifying matters. I think that using the affine-type restriction
mentioned above, one could then say that parametricity guarantees that
all the possibilities are the ones Matías describe.
I have one last idea that is even stricter: I think the ln and rn
combinators may be an unnecessary level of freedom here. From the
point of view of a merging operator, there should be not much of
a difference between an empty list [] and a list beginning with an
infinite value, ∞::_. Indeed, if one imagines that merge picks
the smallest element of both lists, such an infinity value will force
all the (non-infinite) elements of the other lists to be considered in
turn, a bit like the current code path for empty lists. We could
therefore handle the pattern
| [], ys -> ....
as if it actually was something like
| (∞::_), y::ys -> ....
which suggests the following right-hand-side
(knowing that compare ∞ y > 0)
| (∞::_), y::ys -> gt y @ go l ys
This results in the following function:
let merge lt eq gt : 'a list -> 'a list -> 'a list = let rec go l r = match l, r with | [], [] -> [] | x :: xs, [] -> lt x @ go xs r | [], y :: ys -> gt y @ go l ys | x :: xs, y :: ys -> match compare x y with | -1 -> lt x @ go xs r | 0 -> eq x @ go xs ys | 1 -> gt y @ go l ys | _ -> assert false in go
This new definition is strictly less expressive as it only has 2^3
instead of 2^5 parametrizations. It is also less efficient as we
handle elements of the remaining list one by one (which can be
disastrous in particular in applications where ys, a diff list, can
be expected to be much smaller than xs, a data list), but we already
gave up on performances when moving to the flatten (map ...) style –
a price we pay to look nice.
The eight remaining behaviors are the following:
keep keep keep (self self, cons cons cons) : A∪B
keep keep drop (null self, cons cons tail) : A
keep drop keep (self self, cons tail cons) : AΔB
keep drop drop (null self, cons tail tail) : A - B
drop keep keep (self null, tail cons cons) : B
drop keep drop (null null, tail cons tail) : A∩B
drop drop keep (self null, tail tail cons) : B - A
drop drop drop (null null, tail tail tail) : ∅
This captures all the operations Matías found most useful, but drops
(bad pun intended) the "remainders" B/A and A/B. My intuition is
that this restriction on the behavior of ln/rn by "prolongation to
∞" enforces a kind of monotonicity property: all operators O(A,B) so
defined satisfy the property that O(A₁,B₁) @ O(A₂,B₂) is the same as
O(A₁@A₂, B₁@B₂) if the biggest elements of A₁ and B₁ are smaller
than the smallest elements of B₁ and B₂ (A₁ < A₂ for example is
necessary for A₁@A₂ to be a valid sorted sequence). This property
does not hold for the A/B operation.
Reasoning on the parametrization
I had one other remark on Matías' post. In fact, it is the only remark I had after reading it, that got me motivated to write a reddit comment; I caught the parametricity "bug" while re-reading the post before commenting. It is nice to have a parametrization of set merges fully worked on, but it would be even better if this parametrization was actually helpful to reason about programs: can we express transformation on the result of those operators in term of transformations on the parameters?
This is indeed the case. One can check that, if I call O{lt,eq,gt}
the operator parametrized by the three lt/eq/gt combinators, then
O{l₁,e₁,g₁}(A,B) ∪ O{l₂,e₂,g₂}(A,B) is equal to
O{l₁∨l₂,e₁∨e₂,g₁∨g₂}(A,B), where ∨ is the symmetrical operator
defined by:
keep ∨ _ := keep
_ ∨ keep := keep
drop ∨ drop := drop
If one defines a ∧ operator by swapping the role of keep and
drop in this definition, one get O{l₁∧l₂,e₁∧e₂,g₁∧g₂}(A,B) equal
to O{l₁,e₁,g₁}(A,B) ∩ O{l₂,e₂,g₂}(A,B). There might be a general
definition of what O{l,e,g}(O{l₁,e₁,g₁}(A,B), O{l₂,e₂,g₂}(A,B)) is,
but this looks exhausting to sort out. Of course, the five-parameter
version of Matías certainly leads to related functional reasoning,
more expressive but also more tedious.
That's it! Thanks for your patience, and stay tuned for more content on Gagallium, after a guilty summer break.