A recent breakthrough by Boris Konev and Alexei Lisitsa has been described as “too big for humans to check” due to its use of a SAT solver on a large instance. I disagree with this characterization, and will try to explain why.

A surprising, exciting paper published last month on arxiv.org has been making the scientific news. It is a breakthrough in an old conjecture of Erdős (of the infuriating “simple to state, impossible to prove” kind) that can be summed up as “it turns out we can throw a SAT solver at the problem”. The authors first prove that the theorem they want is equivalent to unsatisfiability of a large SAT formula, then run state-of-the-art SAT solver (Lingeling) on it, and get an UNSAT answer in about two hours. To improve confidence in the result, they then used a SAT solver (Glucose) which produces unsatisfiability traces (proof witnesses, instead of a simple boolean answer), and after 10 hours of chugging they got a 13Gio file that can be independently re-checked.

The whole idea is excellent, surprising, and rather amusing. SAT solvers are often described as being very good at solving very large but very boring problems. This is possibly one of the most “interesting” use of this technology – for mathematicians. However, I have been unsatisfied with the presentation of this proof technique in the news coverage of the article, focusing on the size of the UNSAT trace: “a proof that no human can check”. Already in the original article one can read the following:

The negative witness, that is, the DRUP unsatisfiability certificate, is probably one of longest proofs of a non-trivial mathematical result ever produced. Its gigantic size is comparable, for example, with the size of the whole Wikipedia, so one may have doubts about to which degree this can be accepted as a proof of a mathematical statement.

I think the insistence that this eight pages paper has “a large proof” is partly misguided. In this post I try to justify why this can (and, in my opinion, should) also be understood as a relatively short proof – maybe more than eight pages, but still something experts can check in a few weeks or months.

I should clarify what I call a “proof” here: it is a text that mathematicians (or scientists in other formal sciences) share for two purposes. One, it convinces them that the claimed result is valid (the statement is a theorem). Two, it lets them understand why the result is valid, and possibly reproduce the proof technique to prove other results. In a sense, my remark is tautological: I’m just saying that if the mathematicians that checked the arxiv article did not need to read the 13Gio file, line-by-line, to be convinced the result is correct, then this file is not part of the proof. But one could claim that, in absence of this file, the arxiv paper contains no valid proof, and does not meet the expectation of high trust we have in mathematical results. Let me try to convince you there exists such a proof, and that it’s not as large as claimed.

I think it’s unfair to count each byte of the UNSAT trace as part of “the proof”, because they correspond to incredibly atomic and dumb steps of reasoning. If we decided to translate an existing mathematical proof into “atomic reasoning steps”, we would probably also obtain huge proofs – perhaps not as large, as humans may be better than SAT solvers at optimizing proofs for size rather than speed of discovery. Mathematicians sometimes claim that their proofs could be transformed into such a sequence of atomic steps, but people don’t actually check those sequences, they check the higher-level proof as it was written. Looking at the present work from a different viewpoint, could we see it as a much shorter proof?

We could consider “running the SAT solver” a single reasoning step we trust. If we trust the SAT solver, and it returns UNSAT, then we have a rigorous proof of length one. But do we have the same trust in the SAT solver used in the article than in, say, theorems of the mathematical literature? Probably not, because it’s a computer programs, and those tend to have bugs – more so than proofs accepted by the mathematical community. But then we could build trust in the particular SAT solver used by (mathematically) proving it correct.

If the particular SAT solver used was not designed to be amenable to a correctness proof, trying to prove it correct would certainly be a Herculean task resulting in a large, messy, and very boring proof (or rather a counter-example denoting a bug in a corner case of the program). But there are SAT solvers designed for formal correctness; in fact, several SAT solvers are available that have been formally proven correct, as described for example in the following publications:

All those articles sketch a correctness proof for runnable SAT solving programs (not just a pseudocode algorithm for SAT solving), and are accompanied by very detailed proofs that filll absolutely detail one may enquire about; I’ll say more about what “formal verification” happens to mean in these work later, let’s just assume that their are detailed in a way that should satisfy any mathematician, sometimes at the cost of some readability.

If you, as a mathematician, check the correctness proof of one of these solvers, then run the SAT problem of the Erdős proof through it, and it says UNSAT, you have a perfectly correct proof that the SAT problem is indeed unsatisfiable: the arxiv paper, plus the correctness proof, plus the sentence “This correct program says UNSAT on this problem”. You need to understand the relation between the proved program and the act of running some binary blob on a physical computer, but that’s not more or less problematic than any other gap between the formal and physical world.

Now it may happen that none of those solvers says “UNSAT”: maybe the actual prover used in the proof has much better heuristics, that have not been added to those “certified solvers” as their correctness proof is too complicated, and are essential in checking UNSAT in a reasonable time – your certified provers may just time out. There is another option at your disposal: decide that running a SAT solver is not part of the proof, but that running a program that checks the UNSAT trace (the 13Gio file the article is talking about) is. It’s an achievement to prove a SAT solver correct; it’s already non-trivial to write a checker of UNSAT traces that is proved correct, but it has been done several times, and lets you use any SAT solver that can produce traces, without having to trust them. See for example A Modular Integration of SAT/SMT Solvers to Coq through Proof Witnesses (Michel Armand , Germain Faure, Benjamin Grégoire, Chantal Keller, Laurent Théry, Benjamin Werner, 2011), and the other works cited in their “Related Work” section. If you check the proof that the trace verifier is correct, and you run it on the 13Gio trace and it says “UNSAT”, you have a valid proof.

These two proof techniques (checking the proof of a SAT solver, and checking the proof of a SAT trace verifier) are completely in reach of humans – it’s been done several times already. Claiming that “the proof” of the Erdős result cannot be verified by human is therefore not accurate; there would be several ways to gain mathematical confidence in the result, some reasonable, some not.

Two closing references

  • The proof assistants community has a name for the idea that some programs are too complex to prove correct, but that we can instead prove correct a program that checks their output: validation. It was introduced by George C. Necula in his 2000 article Translation Validation for an Optimizing Compiler. If you are into complexity theory, you’ll be familiar with the related (but different) idea that some problems are very hard to solve, but that a given answer may be easy to verify.

  • If you are interested in the relation between proof of correctness of computer programs and usual mathematical proofs, the following article is a must-read: Social Processes and Proofs of Theorems and Programs, by Richad A. De Millo, Richard J. Lipton, and Alan J. Perlis (1979). This article has a view of mathematical proofs that is controversial among mathematicians – and that I personally agree with. And its claims on program verification have, in the thirty years since the publication of this article, been largely proven wrong. It still makes some very good points; in particular, not all programs have a reasonable specification; luckily it’s the case of SAT solvers or UNSAT trace checkers.

Post Scriptum

It happens to be the case that the correctness proofs in all the articles I’ve cited above have been checked by humans and by computers, using proof assistants, tools that provide both a language to express mathematical proofs and a program to check them. I have not emphasized this point above: a proof checked by a computer is still a proof, and you are is free to re-do the work of checking it, to build your own understanding of the result. But you could also decide that “checking those SAT programs with a proof assistant” is part of the proof of the Erdős result, freeing you from verifying the SAT programs as part of the proof; you would instead have to look for the formal proof of those assistants correctness.

This may not be a win. You would find out that a correctness proof for the proof assistants used in the works above is notably larger and more difficult than the direct proof of correctness of the UNSAT trace checkers, or even of the SAT solver themselves. If you’re only interested in a self-contained proof of the Erdős result, the detour through a correctness proof of a proof assistant is not interesting. But if you’re interested in checking many different proofs, that all have some parts formalized in some proof assistant, then checking this proof assistant may become worthwhile in terms of total proof size.

(You may wonder why all those proofs of SAT solvers have been written with a proof assistant, while statistically none of the mathematical proofs are – today. I think the reason is that while proof assistants are harder to use than pen-and-paper proofs, they help keep track of proof complexity and manage the “boring” aspects of one’s proof without introducing mistakes. So they happen to be a very good fit for verification of runnable programs, that are often larger and have more “boring” parts than idealized algorithms or your usual article-grained math problem.)