This series of blog posts aims to give a short weekly glimpse into my (Florian Angeletti) work on the OCaml compiler. This week's subject is some data analysis of the OCaml changelog.

OCaml 5.1.0 was released two weeks ago, so last week was a good time for a retrospective on this development cycle.

In particular, the OCaml compiler has accumulated a good amount of history in its changelog file. However, since this is a hand-written file with some human-enforced rules, it is not so straightforward to parse it and extract data.

I have thus spent some time mining information from this changelog, with a bunch of pretty ad-hoc rules. The result of this effort is a cleaned-up JSON change file, available along with a few analysis tools and plot scripts at https://github.com/Octachron/ocaml-changelog-analyzer.

Starting from this file, it is possible to extract some interesting titbits about the history of OCaml 4 and 5. For instance, in terms of raw numbers, from OCaml 4.00.0 to OCaml 5.1.0, the changelog grew by 2930 entries. Filtering out patch releases, we have an average of 151.5 contributions per release, with a standard deviation of 48.84.
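As a rough illustration of this computation (a sketch, not the code from the repository), here is how the per-release statistics could be obtained in Python. The input shape, a mapping from version strings to change counts, and the rule that a patch release is a version with a non-zero third component are both assumptions made for this example. The counts are made up, and I use the sample standard deviation here, which may or may not match the convention behind the numbers above.

```python
import statistics


def is_patch_release(version: str) -> bool:
    """Assumption: 'major.minor.patch' with a non-zero patch number is a patch release."""
    parts = version.split(".")
    return len(parts) >= 3 and int(parts[2]) != 0


def release_stats(changes_per_release):
    """Return (mean, sample standard deviation) over non-patch releases."""
    counts = [
        count
        for version, count in changes_per_release.items()
        if not is_patch_release(version)
    ]
    return statistics.mean(counts), statistics.stdev(counts)


# Made-up counts, only here to exercise the functions.
example_counts = {"4.00.0": 100, "4.00.1": 12, "4.01.0": 140, "4.02.0": 180}
example_mean, example_std = release_stats(example_counts)
print(example_mean, example_std)
```

In this toy data set, the patch release 4.00.1 is ignored and the statistics are computed over the three remaining releases.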

With 208 changes, the OCaml 5.1 release ranks third in terms of contributions, behind OCaml 4.03 (277 changes, with a one-year release cycle and the introduction of flambda) and OCaml 4.08 (226 changes, with many standard library changes after the introduction of the Stdlib namespacing in OCaml 4.07.0).

Similarly, OCaml 5.1.0, with 78 authors, has the largest number of authors, followed by OCaml 4.03.0 and OCaml 4.08:

We have an average of 49 authors per release, with a standard deviation around 11.8. Maybe more importantly, we can see that we still have an influx of a dozen new contributors with every release. Unfortunately, I am not tracking contributors from before OCaml 4.01, which biases the computation of the number of new authors. Thus it is hard to tell if the record number of new authors in 4.02 and 4.03 is due to the switch to the GitHub workflow, since it could be a data artifact.
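To make this bias concrete, here is a minimal Python sketch of how new authors could be counted release by release. The input format, a chronological list of (version, set of author names) pairs, is an assumption for this example and not the layout of the actual JSON file. The names are invented.

```python
def new_authors_per_release(releases):
    """Count authors never seen in an earlier tracked release.

    `releases` is assumed to be a chronological list of
    (version, set of author names) pairs.
    """
    seen = set()
    result = []
    for version, authors in releases:
        newcomers = authors - seen
        result.append((version, len(newcomers)))
        seen |= authors
    return result


# Invented data: every author of the first tracked release counts as "new".
example_releases = [
    ("4.01.0", {"alice", "bob"}),
    ("4.02.0", {"alice", "carol"}),
    ("4.03.0", {"bob", "carol", "dave"}),
]
example_new_authors = new_authors_per_release(example_releases)
print(example_new_authors)
```

Since the set of known authors starts empty, anyone who contributed before the first tracked release is counted as a newcomer the first time they show up, which inflates the numbers for the earliest releases.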

Moreover, focusing only on the number of authors or contributions hides the diversity of contributions by authors. For instance, counting the number of authored change entries since OCaml 4.01, we have:

  • 9 authors with more than 100 contributions
  • 6 authors with a number of contributions between 40 and 90
  • 16 authors with a number of contributions between 15 and 35
  • 119 authors with more than 1 contribution but fewer than 13
  • 207 authors with one contribution
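
A breakdown like the one above can be sketched in Python as follows. This is only an illustration under assumptions: the `authors` field name is hypothetical, and the bucket boundaries are contiguous lower bounds that I chose for the example, so they only approximate the ranges listed above.

```python
from bisect import bisect_right
from collections import Counter


def contributions_per_author(entries):
    """Count change entries per author (the 'authors' field is a hypothetical name)."""
    counts = Counter()
    for entry in entries:
        # Use a set so an author listed twice on one entry is counted once.
        counts.update(set(entry["authors"]))
    return counts


def bucket_authors(counts, lower_bounds=(1, 2, 15, 40, 101)):
    """Map each lower bound to the number of authors whose count falls in its bucket.

    Each author goes into the bucket with the largest lower bound that does
    not exceed their count. Counts are assumed to be at least 1.
    """
    buckets = {bound: 0 for bound in lower_bounds}
    for n in counts.values():
        index = bisect_right(lower_bounds, n) - 1
        buckets[lower_bounds[index]] += 1
    return buckets


# Invented entries, only here to exercise the functions.
example_entries = [
    {"authors": ["alice", "bob"]},
    {"authors": ["alice"]},
    {"authors": ["alice", "carol"]},
]
example_buckets = bucket_authors(contributions_per_author(example_entries))
print(example_buckets)
```

With this toy input, two authors land in the single-contribution bucket and one in the next bucket, while the higher buckets stay empty.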

If we represent each contributor by the number of authored contributions and the number of reviews on a two-dimensional log plot, we get quite a wide cloud of points:

It is reassuring to see that there are contributors spending a significant amount of their time on reviews.

It is also fun to visualize the influx of new contributors by plotting the contributions of each person release by release, with some linear interpolation in-between:

This movie is quite a bit harder to read. Trying to extract a few key points, I can see:

  • a vast group of ephemeral contributors,
  • a lot of fluctuation in the activity of active contributors from release to release,
  • a small group of prolific and constant contributors.