Gagallium : Comments on Matías Giovannini's "Merge Right"

A relatively common story: a comment I started writing in reaction to Matías Giovannini’s Merge right blog post turned out lengthier than I initially expected. It feels a bit off-topic on Gagallium, but I know there are people interested in reading about Matías’ perfectionist code snippets (I am), so there might also be an audience for perfectionist comments about perfectionist blog posts…

If you haven’t read Matías’ blog post yet, you should have a look at it now – or skip this post altogether. He defines a rather generic merging operation on sorted collections, parametrized by functions deciding what to do when the next item on the left collection is smaller, or on the right collection, when both items are equal, or when one of the two collections has no more elements.

He then enumerates all reasonable choices for those function parameters, to get a list of 32 different "merge operators" that are instance of this general function. It’s amusing to see this classification carried out in full.

The code for the generic merge function is essentially this one:

let merge ln rn lt eq gt : 'a list -> 'a list -> 'a list =
  let rec go l r = match l, r with
  | [], ys -> ln ys
  | xs, [] -> rn xs
  | x :: xs, y :: ys ->
    match compare x y with
    | -1 -> lt x (go xs r)
    |  0 -> eq x (go xs ys)
    |  1 -> gt y (go l  ys)
    |  _ -> assert false
  in go

With the following operators used as parameters, in two groups, one for ln/rn and one for lt/eq/gt:

let self xs =      xs
and null _  =      []

let cons x xs = x :: xs
and tail _ xs =      xs

Matías shows, for example, that intersection of two collection may be defined as merge null null tail cons tail, while symmetric difference is merge self self cons tail cons.

Parametricity : a bit too good to be true

I think it is not exactly true that "provided the output sequence must be increasing, these four functions [self, null, cons and tail] exhaust the possibilities by parametricity": it seems that removing some elements from the sequence xs (but not all as null and tail do) would preserve this sorting criterion.

I’m not sure which is the best way to reduce the number of dimensions of freedom of this problem to capture the behavior you want. Here are some ideas, step by step.

One may notice that all lt/eq/gt operators return the xs list unchanged. The only question is whether or not they add their input parameter. This suggests the following change in code, using new combinators:

let merge ln rn lt eq gt : 'a list -> 'a list -> 'a list =
  let rec go l r = match l, r with
  | [], ys -> ln ys
  | xs, [] -> rn xs
  | x :: xs, y :: ys ->
    match compare x y with
    | -1 -> lt x @ go xs r
    |  0 -> eq x @ go xs ys
    |  1 -> gt y @ go l  ys
    |  _ -> assert false
  in go

let keep x = [x]
let drop x = []

Those new combinators are still not uniquely defined by parametricity: let duplicate x = [x; x] would preserve well-ordering of the result. One solution could be to request those combinators to have an "affine" type, where "affine" is a weak notion of linearity meaning that they can either ignore their output, or use it only once. One may also take as input a boolean keep, and use filter (const keep) [x] – a sophisticated way to say that we use a boolean to choose between the only two combinators of interest, keep or drop, which feels a bit like cheating.

Another remark on this change: it makes it possible to define a tail-recursive variant of merge without changing the interface exposed to the user: instead of writing lt x @ go xs r, one would write List.rev_append (lt x) acc, for acc an accumulator variable that gets reversed at the end. This is not possible with the previous definition of lt/eq/gt that allows them to inspect the whole list of results of future calls.

Regarding ln/rn, one may avoid the pitfall of removing elements from the lists by the same trick: not giving the combinators access to the whole list at once.

let merge ln rn lt eq gt : 'a list -> 'a list -> 'a list =
  let rec go l r = match l, r with
  | [], ys -> flatten (map ln ys)
  | xs, [] -> flatten (map rn xs)
  | x :: xs, y :: ys ->
    match compare x y with
    | -1 -> lt x @ go xs r
    |  0 -> eq x @ go xs ys
    |  1 -> gt y @ go l  ys
    |  _ -> assert false
  in go

With this definition (flatten : ’a list list -> ’a list may as well be named join), ln/rn use the same combinators as lt/eq/gt, at type ’a -> ’a list, which helps simplifying matters. I think that using the affine-type restriction mentioned above, one could then say that parametricity guarantees that all the possibilities are the ones Matías describe.

I have one last idea that is even stricter: I think the ln and rn combinators may be an unnecessary level of freedom here. From the point of view of a merging operator, there should be not much of a difference between an empty list [] and a list beginning with an infinite value, ∞::_. Indeed, if one imagines that merge picks the smallest element of both lists, such an infinity value will force all the (non-infinite) elements of the other lists to be considered in turn, a bit like the current code path for empty lists. We could therefore handle the pattern

  | [], ys -> ....

as if it actually was something like

  | (∞::_), y::ys -> ....

which suggests the following right-hand-side (knowing that compare ∞ y > 0)

  | (∞::_), y::ys -> gt y @ go l ys

This results in the following function:

let merge lt eq gt : 'a list -> 'a list -> 'a list =
  let rec go l r = match l, r with
  | [],      []      -> []
  | x :: xs, []      -> lt x @ go xs r
  | [],      y :: ys -> gt y @ go l  ys
  | x :: xs, y :: ys ->
    match compare x y with
    | -1 -> lt x @ go xs r
    |  0 -> eq x @ go xs ys
    |  1 -> gt y @ go l  ys
    |  _ -> assert false
  in go

This new definition is strictly less expressive as it only has 2^3 instead of 2^5 parametrizations. It is also less efficient as we handle elements of the remaining list one by one (which can be disastrous in particular in applications where ys, a diff list, can be expected to be much smaller than xs, a data list), but we already gave up on performances when moving to the flatten (map ...) style – a price we pay to look nice.

The eight remaining behaviors are the following:

keep keep keep (self self, cons cons cons) : A∪B
keep keep drop (null self, cons cons tail) : A
keep drop keep (self self, cons tail cons) : AΔB
keep drop drop (null self, cons tail tail) : A - B
drop keep keep (self null, tail cons cons) : B
drop keep drop (null null, tail cons tail) : A∩B
drop drop keep (self null, tail tail cons) : B - A
drop drop drop (null null, tail tail tail) : ∅

This captures all the operations Matías found most useful, but drops (bad pun intended) the “remainders” B/A and A/B. My intuition is that this restriction on the behavior of ln/rn by “prolongation to ∞” enforces a kind of monotonicity property: all operators O(A,B) so defined satisfy the property that O(A₁,B₁) @ O(A₂,B₂) is the same as O(A₁@A₂, B₁@B₂) if the biggest elements of A₁ and B₁ are smaller than the smallest elements of B₁ and B₂ (A₁ < A₂ for example is necessary for A₁@A₂ to be a valid sorted sequence). This property does not hold for the A/B operation.

Reasoning on the parametrization

I had one other remark on Matías’ post. In fact, it is the only remark I had after reading it, that got me motivated to write a reddit comment; I caught the parametricity "bug" while re-reading the post before commenting. It is nice to have a parametrization of set merges fully worked on, but it would be even better if this parametrization was actually helpful to reason about programs: can we express transformation on the result of those operators in term of transformations on the parameters?

This is indeed the case. One can check that, if I call O{lt,eq,gt} the operator parametrized by the three lt/eq/gt combinators, then O{l₁,e₁,g₁}(A,B) ∪ O{l₂,e₂,g₂}(A,B) is equal to O{l₁∨l₂,e₁∨e₂,g₁∨g₂}(A,B), where ∨ is the symmetrical operator defined by:

keep ∨ _    := keep
_    ∨ keep := keep
drop ∨ drop := drop

If one defines a ∧ operator by swapping the role of keep and drop in this definition, one get O{l₁∧l₂,e₁∧e₂,g₁∧g₂}(A,B) equal to O{l₁,e₁,g₁}(A,B) ∩ O{l₂,e₂,g₂}(A,B). There might be a general definition of what O{l,e,g}(O{l₁,e₁,g₁}(A,B), O{l₂,e₂,g₂}(A,B)) is, but this looks exhausting to sort out. Of course, the five-parameter version of Matías certainly leads to related functional reasoning, more expressive but also more tedious.

That’s it! Thanks for your patience, and stay tuned for more content on Gagallium, after a guilty summer break.