So now everybody knows why I was posting more and more rarely. I recently posted a proof of the ternary Goldbach conjecture: it was split into two articles – a new one on major arcs, and a new and improved version of the article on minor arcs I wrote last year.
Of course, I’m expected to write a summary of the main ideas here – not just emphasizing the new points, as I do when giving a talk to an audience of specialists, but giving an overall picture of the proof and all its parts, old and new. Let me do that.
I know that the audience here is very mixed – if you don’t have the background to follow a paragraph, just keep on reading.
Leonhard Euler – one of the greatest mathematicians of the eighteenth century, and indeed of all time – and his close friend, the amateur and polymath Christian Goldbach, kept a regular and copious correspondence. Goldbach made a guess about prime numbers, and Euler quickly reduced it to the following conjecture, which, he said, Goldbach had already stated to him: every positive integer can be written as the sum of at most three prime numbers.
We would now say “every positive integer greater than 5”, since we no longer think of 1 as a prime number. Moreover, the conjecture is nowadays split into two: the weak, or ternary, Goldbach conjecture states that every odd integer greater than 5 can be written as the sum of three primes; the strong, or binary, Goldbach conjecture states that every even integer greater than 2 can be written as the sum of two primes. As their names indicate, the strong conjecture implies the weak one (easily: subtract 3 from your odd number , then express as the sum of two primes).
See Dickson, History of the theory of numbers, Vol I., Ch. XVIII, for the early history of the subject. In summary — Waring seems to have come up with the weak conjecture again in the late eighteenth century; the nineteenth century saw some computational work (checking the conjecture for small integers – by hand!) but little real progress.
The strong conjecture remains out of reach. A few weeks ago — my preprint appeared on May 13, 2013 — I proved the weak Goldbach conjecture.
The proof builds on the progress made in the early 20th century by Hardy, Littlewood and Vinogradov. Vinogradov proved (1937) that the conjecture is true for all odd numbers larger than some constant . (Hardy and Littlewood had shown the same under the assumption of the Generalized Riemann Hypothesis; more on that later.) Since Vinogradov’s day, the constant has been specified and gradually improved, but the best (i.e., smallest) available value for was (Liu-Wang), which was much too large. Even would be too large: since is larger than the estimated number of subatomic particles in the universe times the number of seconds since the Big Bang, there wouldn’t be any hope of checking every case of by computer (even if the computer were really highly parallel, and you had really high cosmic priority)!
I brought down to (and could bring it farther down if needed). D. Platt and I had checked the conjecture for all odd numbers up to by computer (and could have gone farther), so that was the end of the story.
What goes into the proof? Let us first step back and look at the general framework of the circle method, introduced by Hardy and Littlewood.
The circle method: Fourier analysis on the integers
Fourier analysis is something we do every time we tune to a radio station: there is a signal, and we decompose it into the contributions from different frequencies. In mathematical terms – we are given a function (i.e., a function on a single real variable; in the case of a radio, the variable is time) and we define the Fourier transform by , where we write for . Then, as we learn in any Fourier analysis course, , provided that decays rapidly enough and is otherwise well-behaved. (This is the “Fourier inversion formula”.)
In other words, has been decomposed as a sum of (complex) exponential functions, with the (complex) exponential function present with “strength” . (This is equivalent to a decomposition into “sine waves” and , since .) To go back to the example of a radio: is large when is close to the frequency of some radio station, and small otherwise. (What your radio receives is a superposition of what all stations transmit; your radio receiver’s business is precisely to figure out the contribution of frequencies around a given .)
We can do the same if is a function from the integers to . In fact, things are now simpler — we get to define by a sum rather than an integral: . A funny thing here is that doesn’t change at all if we add , or any other integer , to . This is so because, for an integer,
(Thanks again, Euler.) Thus, we may restrict to the interval — or, more abstractly, we can think of as living in the quotient . Topologically, is a circle; this is just the same as saying that, since it doesn’t matter whether we add or substract to our frequency, we might as well have the little frequency marker in our radio go around a circle marked with numbers from up to , rather than have it slide back and forth (a segment of) the real line (as in the actual radio on my table). This is where the phrase circle method comes from.
The decomposition of now looks as follows: provided that decays rapidly enough.
Why do we care? The Fourier transform is immediately useful if we are dealing with additive problems, such as the Goldbach conjectures. The reason behind this is that the transform of a convolution equals a product of transforms: . Recall that the (additive) convolution of is defined by .
We can see right away from this that can be non-zero only if can be written as for some , such that and are non-zero. Similarly, can be non-zero only if can be written as for some such that , and are all non-zero. This suggests that, to study the ternary Goldbach problem, we define , , so that they take non-zero values only at the primes.
Hardy and Littlewood defined for composite (or zero or negative), and for prime (where is a parameter to be fixed later). Here the factor is there to provide “fast decay”, so that everything converges; as we will see later, Hardy and Littlewood’s choice of (rather than some other function of fast decay) is actually very clever, though not quite the best possible. The term is there for technical reasons (basically, it turns out that it makes sense to weigh a prime by because roughly one out of every integers of size about is a prime).
We see that if and only if can be written as the sum of three primes. Our task is then to show that (i.e., ) is non-zero for every larger than a constant. Since the transform of a convolution equals a product of transforms,
Our task is thus to show that the integral is non-zero.
As it happens, is particularly large when is close to a rational with small denominator; it is as if there were really radio stations transmitting at the (small-denominator) frequencies marked in the “tuning” drawing above – when the dial is close to one of them, there is a loud, clear , and when we are away from all of them, we can hear only a low hum. This suggests the following strategy: estimate for all within small arcs around the rationals with small denominators (the major arcs — so called because they make a major contribution, in spite of being small); bound for outside the major arcs (everything outside the major arcs is called minor arcs); then show that the contribution of the minor arcs to the integral is smaller in absolute value than the contribution of the major arcs, thereby forcing the integral to be non-zero.
It is this general strategy that gets called the circle method. Hardy and Littlewood introduced it to deal with a wide variety of additive problems; for example, this was also part of their approach to Waring’s problem (yes, same Waring), on integers that are sums of th powers of integers. This was taken over by Vinogradov, who was the first to give good, unconditional bounds on for in the minor arcs (something considered very remarkable at the time). The circle method is also my general strategy: what I have done is to give much better estimates on the major and minor arcs than were previously available, for , and chosen with great care.
(Incidentally: while we can start to treat the binary, or strong, Goldbach conjecture with the circle method, we soon hit a wall: the “noise” from the minor arcs overwhelms the contribution from the major arcs. This is well explained in this post by Tao.)
Dirichlet L-functions and their zeros
Before we can start working on the major arcs, we need to discuss L-functions. First, there is the zeta function , first studied for complex by Riemann, after whom it is now named. This is given by when the real part of is greater than . For , the series diverges, but the function can be defined (uniquely) by analytic continuation (and this can be done explicitly by, e.g., Euler-Maclaurin, as in Davenport, Multiplicative Number Theory, 2nd. ed, p. 32), with a pole at .
Analogously, there are Dirichlet L-functions, defined by for , and by analytic continuation for . Here is any Dirichlet character; for every
given , is a function of . A Dirichlet character (of modulus ) is just a function of period (i.e. for all ), with the additional properties that it be multiplicative ( for all , ) and that whenever and are not coprime. (The sophisticated way to put it is that it is a character of lifted to .) Dirichlet characters and Dirichlet L-functions were introduced by, um, Dirichlet, in order to study primes in arithmetic progressions.
A zero of a function is just an such that . A non-trivial zero of , or of , is a zero of , or of , such that . (The other zeros are called trivial because it is easy to tell where they are (namely, at negative integers and sometimes at ).) The Riemann hypothesis states that all non-trivial zeros of the Riemann zeta function “lie in the critical line”, meaning that they satisfy . The Generalized Riemann hypothesis for Dirichlet L-functions states that, for every Dirichlet character , every non-trivial zero of satisfies .
Since both the Riemann Hypothesis (RH) and the Generalized Riemann Hypothesis (GRH) remain unproven, any result proven using them will be conditional; we want to prove unconditional results. What can indeed be proven, and used, are partial results in the direction of GRH. Such results are of two kinds:
– Zero-free regions. Ever since the late nineteenth century (de la Vallée-Poussin) we have known that there are hour-glass shaped regions (more precisely, of the shape , where is a constant and where we write ) outside which non-trivial zeroes cannot lie;
– Finite verifications of GRH. It is possible to (ask the computer to) prove small, finite chunks of GRH, in the sense of verifying that all non-trivial zeros of a given function with imaginary part less than some constant lie on the critical line .
Most work up to date follows the first alternative. I chose the latter, and this had consequences on the precise way in which I defined the major and minor arcs: I got very precise results on the major arcs, but I had to define them to be few and very narrow,
or else the method wouldn’t work. This meant that the minor arc methods had to be particularly potent, since a particularly large part of the circle was left for them to deal with.
Let us look more closely at how one can deal with major arcs using partial results towards GRH, and, in particular, finite verifications of GRH.
Recall that we want to estimate sums of the type , where is something like (say) for equal to a prime, and otherwise. Let us modify this just a little – we will actually estimate , where is the von Mangoldt function: if is either a prime power , , and otherwise. (The use of rather than is just a bow to tradition, as is the use of the letter (for “sum”); however, the use of rather than just plain does actually simplify matters when you deal with so-called explicit formulas, which we will see in a minute.) Here is some function of fast decay; it can be , as in Hardy and Littlewood’s work, or (as in my work) something else. (It could even be just the “brutal truncation” , defined to be when and otherwise; that would be fine for the minor arcs, but, as we will see, it is a bad idea as far as the major arcs are concerned.)
Assume is on a major arc, meaning that we can write for some ( small) and some (with small). We can express as a linear combination (that is, a sum of multiples) of terms of the form , where and runs over Dirichlet characters of modulus .
Why are these sums nicer than the plainer sums ? The argument has become , whereas before it was . Here is small — smaller than a constant, in our setup. In other words, will go around the circle a bounded number of times as goes from up to a constant times (by which time has become small). This makes the sum much easier to estimate.
It is a standard fact that we can express by an explicit formula (yes, the phrase has a technical meaning, just like Jugendtraum):
Here the term between brackets appears only for . In the sum, goes over all non-trivial zeros of , and is the Mellin transform of . We win if we manage to show that the sum over is small.
The point is this – if we check GRH up to imaginary part , then we know that all with satisfy , and thus . In other words, is then very small (compared to ). However, for any whose imaginary part has absolute value greater than , we know nothing about its real part, other than . (All right, we could use a zero-free region, but known zero free regions are notoriously weak for large – meaning they tell us little in practice.) Hence, our only chance is to make sure that is small when .
This, mind you, has to be true both for tiny (including: ) and for not so tiny ( between and a constant). If you toy around with the method of stationary phase, we get that behaves like for tiny (here is the Mellin transform of ) and like for not so tiny (where ). Thus, we are in a classical dilemma, often called the uncertainty principle because it is the mathematical fact underlying the physical principle of the same name: you cannot have a function that decreases extremely rapidly and whose Fourier transform (or, in this case, its Mellin transform) also decays extremely rapidly.
What does “extremely rapidly” mean? It means “faster than any exponential “. Thus, Hardy and Littlewood’s choice seems essentially optimal.
Not so fast! What we can do is choose so that decreases exponentially (with a constant a bit worse than before), but decreases faster than exponentially. This is a rather good idea, as it is (and not so much ) that risks being fairly small.
One choice obeying this description is the Gaussian . The Mellin transform turns out to be a parabolic cylinder function, with imaginary values for one of the parameters. Parabolic cylinder functions seem to be much-loved and much-studied by applied people — but mostly for real values of the said parameter. There are some asymptotic expansions of in the literature for general parameters (notably by F. W. J. Olver), but none that were explicit enough for my purposes. Thus, I had to go and provide fully explicit estimates myself, using the saddle-point method. This took me a while, but the results should be of general applicability – hi there, engineers – and the Gaussian smoothing will hopefully become a little more popular in explicit work in number theory.
Ah, by the way, these estimates on parabolic cylinder functions allow us to take not just , but also, more generally, , here is any band-limited function, meaning, in this context, any function whose Mellin transform restricted to the axis has compact support. We will want to optimize the choice of — more on that later.
The minor arcs
How do you bound when is not close to any rational of small denominator? That this is at all possible was Vinogradov’s great achievement. Progress since then has been gradual. My own take on things is the subject of my minor arcs paper. Let me just discuss a few of the ideas behind my improvements.
Vinogradov’s proof was greatly simplified in the 70s by Vaughan, who introduced the identity that now bears his name. Basically, Vaughan’s identity is a gambit: it gives you a great deal of flexibility, but at a cost — here, a cost of two logs, rather than, say, two pawns. The problem is that, if we are to reach our goal, we cannot afford to waste logs. The only way is to recover these logs, finding cancellation in the different sums that arise from Vaughan’s identity. This I had to do, mind you, without using -functions, since I could no longer assume that was small.
Here is another theme of this part of the proof. Every has an approximation the fact that is in the minor arcs just tells us that is not small. In fact, we are looking for bounds that decrease with ; the bound I obtain is proportional to . What is the effect of ?
Something I realized early on was that, if is not tiny, it can actually be used to our advantage. One reason is that there are terms of the form , and the Fourier transforms of smooth functions decay as the argument grows. There are other issues, however. Something we can use is the following: by basic results on Diophantine approximation, every has very good approximations by rationals of non-huge denominator. If is not tiny, then the approximation is good, but not very good; hence, there must be another, better approximation with non-huge (meaning: considerably smaller than ). We can go back and forth between the approximations and , depending on which one is more useful in context. This turns out to be better than using a single approximation , however good it may be.
Another way in which large gets used is to scatter the inputs to a large sieve. The large sieve can be seen as an approximate form of the Plancherel identity, recast as an inequality: whereas the Plancherel identity tells us that the -norm of a Fourier transform of a function defined on the integers (for instance; groups other than the integers are also valid) equals the -norm of the function itself, the large sieve tells us that the total of for a well-spaced sample of points is bounded by (a multiple of) . Now, in our case, the points are multiples of our angle . If , the spacing of the points is , which is nice — but we may have to apply the large sieve many times, since we have to apply it afresh for each chunk of points. However, if and is not tiny, we can go around the circle many times, and rely on rather than on to give us the spacing. Yes, the spacing may be smaller, but the effect of this is more than compensated by the fact that we have to invoke the large sieve far fewer times (perhaps only once). What is more, this scattering can be combined with a more traditional kind of scattering (Montgomery’s lemma; see Montgomery’s “Topics in multiplicative number theory”, or else the exposition in Iwaniec-Kowalski, section 7.4) so as to take advantage of the fact that we are dealing with sums on the primes.
Putting it all together
I’ve been telling you what goes into bounding for within the minor arcs , but what we really want to do is bound the integral . One easy – and traditional – way to do this is to just use the trivial inequality . Unfortunately, this wastes a factor of .
Since our bounds for , , are given in terms of , it makes sense to combine them with estimates for integrals of the type , where is a union of arcs around rationals with denominator greater than a constant but less than . How do you estimate such integrals? It turns out that this is very closely related to a question on the large sieve: what bounds can you get for samples , , where is of moderate size?
There was an answer in the literature (based on Montgomery’s lemma; the link with the circle method was already noticed by Heath-Brown) but it was sub-optimal by a factor of at least (or in fact more). There was a newer estimate for the large sieve due to Ramaré, but it had not been made fully explicit. I had to work that out, and then adapted the new large-sieve result to estimate the integral over above. As was expected, the spurious factor of (or really a bit more) disappeared.
It remains to look at the main term vs. the error term. It turns out that we have some room to choose what the main term will be, since it depends on the smoothings we choose. The main term is proportional to , where and are the two smoothings we choose to work with, is the odd number we want to express as the sum of three primes, and is again a parameter of our choosing. For comparison, the error term is proportional to . Thus, we have an optimization problem (“maximize the size of the double integral divided by “). It is best to choose symmetric or close to symmetric (), making sure, moreover, that for . This is not too hard to achieve while keeping of the form , where is band-limited.
What about ? The solution to the optimization problem tells us that it should be of small support, or at least concentrated near the origin. Other than that – there is, so to speak, a political problem: , unlike , gets used both in the major and the minor arcs; the major arcs really want it to be of the form or , whereas the minor arcs would prefer it to be something simple, like or like (as in Tao’s five-prime paper or my own minor-arcs paper).
The solution is simple: define , where is a large constant, and . For and pretty much arbitrary, if you know how to compute (or estimate) for some , and you also know how to estimate for the other , then you know how to estimate for all : just write it out and you will see!
(Perhaps you will also see that it helps if and are very small near .)
The moral is that different problems, and different parts of the same problem, require different smoothings. At least in the context of exponential sums, there turns out to be a simple trick for combining them, as we have just seen.
Some final remarks on computing
An analytic proof usually gives a proof valid for all larger than a constant . The reason is simple: say that we want to show that a quantity is positive. Typically, at the end of a great deal of analytic work, you will have proven that the quantity is of the form , where the absolute value of the error term is at most (for example; this is obviously simplified). This certainly shows that the quantity is positive — provided that . The task, then, is to sharpen the proof to such an extent that the constant becomes small enough that all cases can be checked by hand (meaning either literally your hand or a computer). This is what my work was largely about; checking the conjecture up to (and in fact up to ) was a bit of a side task – as we are about to see, it wasn’t even the main computational effort involved.
First, let me say a few more words about analytic results. There are results of the type “the statement is true for all larger than a constant , but this proof can tell you nothing about , other than that it exists”. This is called an ineffective estimate; many proofs of Vinogradov’s result in textbooks are of this kind. (The reason behind this is the possibility of so-called Siegel zeros.) A result can also say “the statement is true for all , and you should in principle be able to determine some value of by using ideas from the proof, but the author would much rather go drink coffee”. This is an effective, non-explicit statement; Vinogradov’s definitive version of his own proof was of this sort (as are many other results in mathematics, including some of my own past results). If you do give an explicit value of , then the result is called, well, explicit. Then comes the fourth stage: making reasonable, i.e., low enough that the case can be checked by hand. It was clear from the beginning that, in the case of the ternary Goldbach conjecture, “reasonable” meant roughly , even though nobody had actually checked it that far.
I said before that D. Platt and I had checked the conjecture for all odd numbers up to . Here is how we proceeded. It was already known (thanks to a major computational effort by Oliveira e Silva, Herzog and Pardi) that the binary Goldbach conjecture is true up to — that is, every even number up to is the sum of two primes. Given that, all we had to do was to construct a “prime ladder”, that is, a list of primes from up to such that the difference between any two consecutive primes in the list is at most . Thus, if anybody gives you an odd integer up to , you know that there is a prime in the list such that is positive and at most . By assumption, we can write for some primes , , and so .
Constructing such a ladder does not take that much time. (In fact, getting a ladder up to is probably something you can do on your old personal computer in the basement, over a few weeks — though storing it is another matter.) It’s all integer arithmetic, and we use deterministic primality checking (which is fast for primes of special form), so there is really no room for concern.
The major computation consists of verifying that, for every -function of conductor up to about (or twice that for even), all zeroes of the -function with imaginary part bounded by lie on the critical line. This was entirely Platt’s work; my sole contribution was to run around asking for computer time at different places (see the acknowledgements section of the major arcs paper). In fact, he went up to conductor (or twice that for even); he had already gone up to conductor in his thesis. The verification took, in total, about core-hours (i.e., the total number of processor cores used times the number of hours they ran equals ; nowadays, a top-of-the-line processor — such as the ones in the MesoPSL machine — typically has eight cores). In the end, as I said, I used only (or twice that for even), so the number of hours actually needed was more like ; since I could have made do with about , you could say that, in retrospect, only about core-hours were needed. The computers and I were digging on opposite ends of the mountain, and we met down the middle. The fact that the computers ran for longer than needed is hardly something to be regretted: the computation is of general use, and so it’s so much the better that it not be too tightly fitted to my needs; moreover, with proofs of this length, you want to “build like a Roman”, i.e., overcalculate in case you (not the computer!) have made a small mistake somewhere. (Why did you think those walls were so thick?)
Checking zeros of -functions computationally is something as old as Riemann (who did it by hand); it is also one of the things that were tried on electronic computers already in their early days (Turing had a paper on that). One of the main issues to be careful about arises whenever one manipulates real numbers: honestly speaking, a computer cannot store ; moreover, while a computer can handle rationals, it is really most comfortable handling just those rationals whose denominators are powers of two. Thus, you cannot really say: “computer, give me the sine of that number” and expect a precise result. What you should do, if you really want to prove something (as is the case here!), is to say: “computer, I am giving you an interval ; give me an interval , preferrably very short, such that “. This is called interval arithmetic; it is really the easiest way to do floating-point computations rigorously.
Now, processors don’t do this natively, and if you do this purely on software, you can slow things down by a factor of . Fortunately, there are ways of doing this halfways on hardware, halfways on software. Platt has his own library, but there are others online (e.g. PROFIL/BIAS).
(Oh, by the way, don’t use the function on an Intel processor if you want the result to be correct up to the last bit. Just what were they thinking? Use the crlibm library instead.)
Lastly, there were several rather minor computations that I did myself; you’ll find them mentioned in my papers. A typical computation was a rigorous version of a “proof by graph” (the maximum of a function is clearly less than because gnuplot told me so). You’ll find algorithms for this in any textbook on “validated computing” – basically, it’s enough to combine the bisection method with interval arithmetic.
Finally, let me point out that there is an elementary inequality in the minor arcs paper (namely, (4.24), in the proof of Lemma 4.2) that got proven in part by a human (me) and in part by a quantifier-elimination program. In other words, there are now computer programs out there (in this case, QEPCAD) that can actually prove useful things! Now, I have no doubt that the same inequality can be proven purely by the use of human beings, but it is nice to know that our computer friends can (pretend to) do something other than munch on numbers…