The ternary Goldbach conjecture

So now everybody knows why I was posting more and more rarely. I recently posted a proof of the ternary Goldbach conjecture: it was split into two articles – a new one on major arcs, and a new and improved version of the article on minor arcs I wrote last year.

Of course, I’m expected to write a summary of the main ideas here – not just emphasizing the new points, as I do when giving a talk to an audience of specialists, but giving an overall picture of the proof and all its parts, old and new. Let me do that.

I know that the audience here is very mixed – if you don’t have the background to follow a paragraph, just keep on reading.


History

Leonhard Euler – one of the greatest mathematicians of the eighteenth century, and indeed of all time – and his close friend, the amateur and polymath Christian Goldbach, kept a regular and copious correspondence. Goldbach made a guess about prime numbers, and Euler quickly reduced it to the following conjecture, which, he said, Goldbach had already stated to him: every positive integer can be written as the sum of at most three prime numbers.

We would now say “every positive integer greater than 5”, since we no longer think of 1 as a prime number. Moreover, the conjecture is nowadays split into two: the weak, or ternary, Goldbach conjecture states that every odd integer greater than 5 can be written as the sum of three primes; the strong, or binary, Goldbach conjecture states that every even integer greater than 2 can be written as the sum of two primes. As their names indicate, the strong conjecture implies the weak one (easily: subtract 3 from your odd number n, then express n-3 as the sum of two primes).

See Dickson, History of the theory of numbers, Vol I., Ch. XVIII, for the early history of the subject. In summary — Waring seems to have come up with the weak conjecture again in the late eighteenth century; the nineteenth century saw some computational work (checking the conjecture for small integers – by hand!) but little real progress.

The strong conjecture remains out of reach. A few weeks ago — my preprint appeared on May 13, 2013 — I proved the weak Goldbach conjecture.

The proof builds on the progress made in the early 20th century by Hardy, Littlewood and Vinogradov. Vinogradov proved (1937) that the conjecture is true for all odd numbers larger than some constant C. (Hardy and Littlewood had shown the same under the assumption of the Generalized Riemann Hypothesis; more on that later.) Since Vinogradov’s day, the constant C has been specified and gradually improved, but the best (i.e., smallest) available value for C was C = e^{3100} > 10^{1346} (Liu-Wang), which was much too large. Even C = 10^{100} would be too large: since 10^{100} is larger than the estimated number of subatomic particles in the universe times the number of seconds since the Big Bang, there wouldn’t be any hope of checking every case up to 10^{100} by computer (even if the computer were really highly parallel, and you had really high cosmic priority)!

I brought C down to 10^{29} (and could bring it farther down if needed). D. Platt and I had checked the conjecture for all odd numbers up to 8.8\cdot 10^{30} by computer (and could have gone farther), so that was the end of the story.

What goes into the proof? Let us first step back and look at the general framework of the circle method, introduced by Hardy and Littlewood.

The circle method: Fourier analysis on the integers

Fourier analysis is something we do every time we tune to a radio station: there is a signal, and we decompose it into the contributions from different frequencies. In mathematical terms – we are given a function f:\mathbb{R}\to \mathbb{C} (i.e., a function on a single real variable; in the case of a radio, the variable is time) and we define the Fourier transform \widehat{f}:\mathbb{R}\to \mathbb{C} by \widehat{f}(r) = \int_\mathbb{R} f(x) e(- x r) dx, where we write e(t) for e^{2\pi i t}. Then, as we learn in any Fourier analysis course, f(x) = \int_\mathbb{R} \widehat{f}(r) e(x r) dr, provided that f decays rapidly enough and is otherwise well-behaved. (This is the “Fourier inversion formula”.)

In other words, x\mapsto f(x) has been decomposed as a sum of (complex) exponential functions, with the (complex) exponential function x\mapsto e(x r) present with “strength” \widehat{f}(r). (This is equivalent to a decomposition into “sine waves” \sin(2\pi x r) and \cos(2\pi x r), since e^{i z} = \cos z + i \sin z.) To go back to the example of a radio: \widehat{f}(r) is large when r is close to the frequency of some radio station, and small otherwise. (What your radio receives is a superposition f of what all stations transmit; your radio receiver’s business is precisely to figure out the contribution of frequencies r around a given r_0.)

We can do the same if f is a function from the integers \mathbb{Z} to \mathbb{C}. In fact, things are now simpler — we get to define \widehat{f} by a sum rather than an integral: \widehat{f}(\alpha) = \sum_n f(n) e(-\alpha n). A funny thing here is that \widehat{f}(\alpha) doesn’t change at all if we add 1, or any other integer m, to \alpha. This is so because, for m an integer, e(-(\alpha+m) n) = e(-\alpha n) \left(e^{- 2\pi i}\right)^{m n} = e(-\alpha n) \cdot 1^{m n} = e(-\alpha n).
(Thanks again, Euler.) Thus, we may restrict \alpha to the interval \lbrack 0,1\rbrack — or, more abstractly, we can think of \alpha as living in the quotient \mathbb{R}/\mathbb{Z}. Topologically, \mathbb{R}/\mathbb{Z} is a circle; this is just the same as saying that, since it doesn’t matter whether we add or subtract 1 to our frequency, we might as well have the little frequency marker in our radio go around a circle marked with numbers from 0 up to 1, rather than have it slide back and forth (a segment of) the real line (as in the actual radio on my table). This is where the phrase circle method comes from.
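
(If you like to see such things numerically: here is a five-line sketch in Python, with f supported on the primes up to 20 just as an example, checking that adding 1 to \alpha leaves \widehat{f}(\alpha) untouched.)

    import cmath, math

    def e(t):
        return cmath.exp(2j * math.pi * t)

    # f supported on finitely many integers (here: the primes up to 20, each with weight 1)
    f = {2: 1, 3: 1, 5: 1, 7: 1, 11: 1, 13: 1, 17: 1, 19: 1}

    def fhat(alpha):
        return sum(c * e(-alpha * n) for n, c in f.items())

    alpha = 0.2357
    print(abs(fhat(alpha) - fhat(alpha + 1)))   # essentially zero: shifting alpha by an integer changes nothing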

(Figure: the number-theorist’s radio – a tuning dial that goes around a circle marked from 0 to 1.)

The decomposition of f now looks as follows: f(n) = \int_0^1 \widehat{f}(\alpha) e(\alpha n) d\alpha, provided that f decays rapidly enough.

Why do we care? The Fourier transform is immediately useful if we are dealing with additive problems, such as the Goldbach conjectures. The reason behind this is that the transform of a convolution equals a product of transforms: \widehat{f\ast g} = \widehat{f}\cdot \widehat{g}. Recall that the (additive) convolution of f,g:\mathbb{Z}\to \mathbb{C} is defined by (f\ast g)(n) = \sum_{m\in \mathbb{Z}} f(m) g(n-m).

We can see right away from this that (f\ast g)(n) can be non-zero only if n can be written as n=m_1+m_2 for some m_1, m_2 such that f(m_1) and g(m_2) are non-zero. Similarly, (f\ast g\ast h)(n) can be non-zero only if n can be written as n=m_1+m_2+m_3 for some m_1, m_2, m_3 such that f(m_1), g(m_2) and h(m_3) are all non-zero. This suggests that, to study the ternary Goldbach problem, we define f, g, h so that they take non-zero values only at the primes.
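
(Here is a toy check of this support property, a sketch in Python that uses sympy only to test primality: the triple convolution of the indicator function of the primes counts the ordered ways of writing n as a sum of three primes, and it is visibly non-zero for the small odd n below.)

    from sympy import isprime

    N = 60
    f = [1 if isprime(n) else 0 for n in range(N)]   # indicator function of the primes (no weights yet)

    def triple_conv(n):
        # (f*f*f)(n) = number of ordered triples (m1, m2, m3) with m1+m2+m3 = n and all three prime
        return sum(f[m1] * f[m2] * f[n - m1 - m2]
                   for m1 in range(min(n, N - 1) + 1)
                   for m2 in range(min(n - m1, N - 1) + 1)
                   if n - m1 - m2 < N)

    for n in range(7, 30, 2):
        print(n, triple_conv(n))   # non-zero: each of these odd n is a sum of three primes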

Hardy and Littlewood defined f(n) = g(n) = h(n) = 0 for n composite (or n zero or n negative), and f(n) = g(n) = h(n) = (\log n) e^{-n/N} for n prime (where N is a parameter to be fixed later). Here the factor e^{-n/N} is there to provide “fast decay”, so that everything converges; as we will see later, Hardy and Littlewood’s choice of e^{-n/N} (rather than some other function of fast decay) is actually very clever, though not quite the best possible. The term \log n is there for technical reasons (basically, it turns out that it makes sense to weight a prime p by \log p because roughly one out of every \log p integers of size about p is a prime).

We see that (f\ast g\ast h)(n) \ne 0 if and only if n can be written as the sum of three primes. Our task is then to show that (f\ast g\ast h)(n) (that is, (f\ast f\ast f)(n), since f=g=h) is non-zero for every odd n larger than a constant. Since the transform of a convolution equals a product of transforms,
(f\ast g\ast h)(n) = \int_0^1 \widehat{f\ast g\ast h}(\alpha) e(\alpha n) d\alpha =   \int_0^1 ((\widehat{f} \widehat{g} \widehat{h})(\alpha)) e(\alpha n) d\alpha. Our task is thus to show that the integral \int_0^1 ((\widehat{f} \widehat{g} \widehat{h})(\alpha)) e(\alpha n) d\alpha =    \int_0^1 (\widehat{f}(\alpha))^3 e(\alpha n) d\alpha is non-zero.
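
(Since all the sums here are finite, this identity can even be verified mechanically: replace the integral over \lbrack 0,1\rbrack by an average over the points k/L, which is exact once L is large enough compared with the support of f, and the cube of the transform recovers the representation counts. A sketch in Python with numpy, using the weight \log n but no smoothing:)

    import numpy as np
    from sympy import isprime

    N = 50
    f = np.array([np.log(n) if isprime(n) else 0.0 for n in range(N)])   # log-weighted primes below N

    L = 4 * N                        # long enough that nothing wraps around mod L
    fhat = np.fft.fft(f, n=L)        # fhat[k] = sum_n f(n) e(-(k/L) n)
    triple = np.fft.ifft(fhat ** 3)  # (1/L) sum_k fhat[k]^3 e((k/L) n) = (f*f*f)(n)

    n = 41
    direct = sum(f[a] * f[b] * f[n - a - b]
                 for a in range(N) for b in range(N) if 0 <= n - a - b < N)
    print(triple[n].real, direct)    # the two agree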

As it happens, \widehat{f}(\alpha) is particularly large when \alpha is close to a rational with small denominator; it is as if there were really radio stations transmitting at the (small-denominator) frequencies marked in the “tuning” drawing above – when the dial is close to one of them, there is a loud, clear \widehat{f}(\alpha), and when we are away from all of them, we can hear only a low hum. This suggests the following strategy: estimate \widehat{f}(\alpha) for all \alpha within small arcs around the rationals with small denominators (the major arcs — so called because they make a major contribution, in spite of being small); bound \widehat{f}(\alpha) for \alpha outside the major arcs (everything outside the major arcs is called minor arcs); then show that the contribution of the minor arcs to the integral is smaller in absolute value than the contribution of the major arcs, thereby forcing the integral \int_0^1 (\widehat{f}(\alpha))^3 e(\alpha n) d\alpha to be non-zero.

It is this general strategy that gets called the circle method. Hardy and Littlewood introduced it to deal with a wide variety of additive problems; for example, this was also part of their approach to Waring’s problem (yes, same Waring), on integers that are sums of kth powers of integers. The approach was then taken up by Vinogradov, who was the first to give good, unconditional bounds on \widehat{f}(\alpha) for \alpha in the minor arcs (something considered very remarkable at the time). The circle method is also my general strategy: what I have done is to give much better estimates on the major and minor arcs than were previously available, for f, g and h chosen with great care.

(Incidentally: while we can start to treat the binary, or strong, Goldbach conjecture with the circle method, we soon hit a wall: the “noise” from the minor arcs overwhelms the contribution from the major arcs. This is well explained in this post by Tao.)

Dirichlet L-functions and their zeros

Before we can start working on the major arcs, we need to discuss L-functions. First, there is the zeta function \zeta(s), first studied for s complex by Riemann, after whom it is now named. This is given by \zeta(s) = \sum_{n=1}^\infty \frac{1}{n^s} when the real part \Re(s) of s is greater than 1. For \Re(s)\leq 1, the series diverges, but the function can be defined (uniquely) by analytic continuation (and this can be done explicitly by, e.g., Euler-Maclaurin, as in Davenport, Multiplicative Number Theory, 2nd. ed, p. 32), with a pole at s=1.

Analogously, there are Dirichlet L-functions, defined by L(s,\chi) = \sum_{n=1}^\infty \frac{\chi(n)}{n^s} for \Re(s)>1, and by analytic continuation for \Re(s)\leq 1. Here \chi is any Dirichlet character; for every
given \chi, L(s,\chi) is a function of s. A Dirichlet character \chi (of modulus q) is just a function \chi:\mathbb{Z}\to \mathbb{C} of period q (i.e. \chi(n) = \chi(n+q) for all n), with the additional properties that it be multiplicative (\chi(ab) = \chi(a) \chi(b) for all a, b) and that \chi(n)=0 whenever n and q are not coprime. (The sophisticated way to put it is that it is a character of (\mathbb{Z}/q\mathbb{Z})^* lifted to \mathbb{Z}.) Dirichlet characters and Dirichlet L-functions were introduced by, um, Dirichlet, in order to study primes in arithmetic progressions.
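
(A concrete example, for the curious: mpmath can evaluate L(s,\chi) given the values of \chi over one period. The character below is the non-trivial character modulo 4, and L(2,\chi) is Catalan’s constant. This is only an illustration, not something used in the proof.)

    import mpmath

    chi = [0, 1, 0, -1]              # chi(n) for n = 0, 1, 2, 3 (mod 4): period 4, multiplicative, 0 when gcd(n, 4) > 1
    print(mpmath.dirichlet(2, chi))  # L(2, chi)
    print(mpmath.catalan)            # ... which equals Catalan's constant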

A zero of a function f is just an s\in \mathbb{C} such that f(s)=0. A non-trivial zero of \zeta(s), or of L(s,\chi), is a zero of \zeta(s), or of L(s,\chi), such that 0<\Re(s)<1. (The other zeros are called trivial because it is easy to tell where they are (namely, at negative integers of a fixed parity, depending on the function, and sometimes at 0).) The Riemann hypothesis states that all non-trivial zeros of the Riemann zeta function "lie on the critical line", meaning that they satisfy \Re(s)=1/2. The Generalized Riemann hypothesis for Dirichlet L-functions states that, for every Dirichlet character \chi, every non-trivial zero of L(s,\chi) satisfies \Re(s)=1/2.

Since both the Riemann Hypothesis (RH) and the Generalized Riemann Hypothesis (GRH) remain unproven, any result proven using them will be conditional; we want to prove unconditional results. What can indeed be proven, and used, are partial results in the direction of GRH. Such results are of two kinds:

- Zero-free regions. Ever since the late nineteenth century (de la Vallée-Poussin) we have known that there are hour-glass shaped regions (more precisely, of the shape c/\log t \leq \sigma \leq 1-c/\log t, where c is a constant and where we write s = \sigma + i t) outside which non-trivial zeroes cannot lie;

- Finite verifications of GRH. It is possible to (ask the computer to) prove small, finite chunks of GRH, in the sense of verifying that all non-trivial zeros of a given L function with imaginary part less than some constant H lie on the critical line \Re(s)=1/2.

Most work to date follows the first alternative. I chose the latter, and this had consequences for the precise way in which I defined the major and minor arcs: I got very precise results on the major arcs, but I had to define them to be few and very narrow,
or else the method wouldn’t work. This meant that the minor arc methods had to be particularly potent, since a particularly large part of the circle was left for them to deal with.

Let us look more closely at how one can deal with major arcs using partial results towards GRH, and, in particular, finite verifications of GRH.

Major-arc estimates

Recall that we want to estimate sums of the type \widehat{f}(\alpha) = \sum f(n) e(-\alpha n), where f(n) is something like (\log n) e^{-n/x} (say) for n equal to a prime, and 0 otherwise. Let us modify this just a little – we will actually estimate S_\eta(\alpha,x) = \sum \Lambda(n) e(\alpha n) \eta(n/x), where \Lambda is the von Mangoldt function: \Lambda(n) = \log p if n is a prime power p^k, k\geq 1, and \Lambda(n) = 0 otherwise. (The use of \alpha rather than -\alpha is just a bow to tradition, as is the use of the letter S (for “sum”); however, the use of \Lambda(n) rather than just plain \log p does actually simplify matters when you deal with so-called explicit formulas, which we will see in a minute.) Here \eta(t) is some function of fast decay; it can be e^{-t}, as in Hardy and Littlewood’s work, or (as in my work) something else. (It could even be just the “brutal truncation” 1_{\lbrack 0,1\rbrack}(t), defined to be 1 when t\in \lbrack 0,1\rbrack and 0 otherwise; that would be fine for the minor arcs, but, as we will see, it is a bad idea as far as the major arcs are concerned.)
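
(If you want to see S_\eta(\alpha,x) in the wild, here is a small sketch in Python, with the Gaussian smoothing \eta(t)=e^{-t^2/2} and sympy used only to factor n when computing \Lambda(n). It also shows, numerically, the phenomenon described earlier: the sum is far larger at a rational with small denominator than at a generic \alpha.)

    import cmath, math
    from sympy import factorint

    def mangoldt(n):
        fac = factorint(n)              # Lambda(n) = log p if n = p^k (k >= 1), and 0 otherwise
        return math.log(next(iter(fac))) if len(fac) == 1 else 0.0

    def S(alpha, x):
        eta = lambda t: math.exp(-t * t / 2)
        total = 0j
        for n in range(2, 6 * x):       # the weight eta(n/x) is negligible beyond 6x
            total += mangoldt(n) * cmath.exp(2j * math.pi * alpha * n) * eta(n / x)
        return total

    x = 1000
    print(abs(S(0.5, x)))                  # alpha = 1/2, a rational with small denominator: large
    print(abs(S(math.sqrt(2) - 1, x)))     # a "generic" alpha: much smaller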

Assume \alpha is on a major arc, meaning that we can write \alpha = a/q + \delta/x for some a/q (q small) and some \delta (with |\delta| small). We can express S_\eta(\alpha,x) as a linear combination (that is, a sum of multiples) of terms of the form S_{\eta,\chi}(\delta/x,x), where S_{\eta,\chi}(\delta/x,x) = \sum \Lambda(n) \chi(n) e(\delta n/x) \eta(n/x) and \chi runs over Dirichlet characters of modulus q.

Why are these sums S_{\eta,\chi} nicer than the plainer sums S_\eta? The argument has become \delta/x, whereas before it was \alpha. Here \delta is small — smaller than a constant, in our setup. In other words, e(\delta n/x) will go around the circle a bounded number of times as n goes from 1 up to a constant times x (by which time \eta(n/x) has become small). This makes the sum much easier to estimate.

It is a standard fact that we can express S_{\eta,\chi} by an explicit formula (yes, the phrase has a technical meaning, just like Jugendtraum):
S_{\eta,\chi}(\delta/x,x) = [\widehat{\eta}(-\delta)] x - \sum_\rho F_\delta(\rho) x^\rho + \text{small error}.
Here the term between brackets appears only for q=1. In the sum, \rho goes over all non-trivial zeros of L(s,\chi), and F_\delta is the Mellin transform of e(\delta t) \eta(t). We win if we manage to show that the sum over \rho is small.

The point is this – if we check GRH up to imaginary part H, then we know that all \rho with |\Im(\rho)|\leq H satisfy \Re(\rho)=1/2, and thus |x^\rho| = \sqrt{x}. In other words, x^\rho is then very small (compared to x). However, for any \rho whose imaginary part has absolute value greater than H, we know nothing about its real part, other than 0\leq \Re(\rho)\leq 1. (All right, we could use a zero-free region, but known zero free regions are notoriously weak for \Im(\rho) large – meaning they tell us little in practice.) Hence, our only chance is to make sure that F_\delta(\rho) is small when |\Im(\rho)|\geq H.
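
(To get a feeling for the sizes involved, here is a sketch with mpmath, for q=1, that is, for \zeta itself: mpmath can compute non-trivial zeros, the ones it finds have real part 1/2, and so the corresponding terms x^\rho have absolute value \sqrt{x}, minuscule next to x.)

    import mpmath

    x = mpmath.mpf(10) ** 27
    for k in range(1, 4):
        rho = mpmath.zetazero(k)          # kth non-trivial zero of zeta, on the critical line
        print(rho, abs(x ** rho) / x)     # each ratio equals x^(Re(rho) - 1) = 1/sqrt(x)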

This, mind you, has to be true both for \delta tiny (including: \delta=0) and for \delta not so tiny (\delta between 1 and a constant). If you toy around with the method of stationary phase, you get that F_\delta(\rho) behaves like M\eta(\rho) for \delta tiny (here M\eta is the Mellin transform of \eta) and like \eta(t/|\delta|) for \delta not so tiny (where t = \Im(\rho)). Thus, we are in a classical dilemma, often called the uncertainty principle because it is the mathematical fact underlying the physical principle of the same name: you cannot have a function \eta that decreases extremely rapidly and whose Fourier transform (or, in this case, its Mellin transform) also decays extremely rapidly.

What does “extremely rapidly” mean? It means “faster than any exponential e^{-C t}”. Thus, Hardy and Littlewood’s choice \eta(t)=e^{-t} seems essentially optimal.

Not so fast! What we can do is choose \eta so that M\eta decreases exponentially (with a constant C a bit worse than before), but \eta decreases faster than exponentially. This is a rather good idea, as it is t/|\delta| (and not so much t) that risks being fairly small.
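
(Both Mellin transforms can be written down in closed form: \Gamma(s) for e^{-t}, and 2^{s/2-1}\Gamma(s/2) for the Gaussian, by the substitution u=t^2/2. The sketch below, using mpmath, lets you see the two decay rates side by side: the Gaussian’s Mellin transform still decays exponentially along vertical lines, just with a somewhat smaller constant in the exponent, which is exactly the trade-off being described.)

    import mpmath

    sigma = mpmath.mpf(1)
    for tau in [10, 20, 40, 80]:
        s = sigma + 1j * tau
        m_exp = abs(mpmath.gamma(s))                            # Mellin transform of e^{-t}: decays like e^{-pi*tau/2}
        m_gauss = abs(2 ** (s / 2 - 1) * mpmath.gamma(s / 2))   # Mellin transform of e^{-t^2/2}: decays like e^{-pi*tau/4}
        print(tau, mpmath.nstr(m_exp, 5), mpmath.nstr(m_gauss, 5))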

One choice \eta obeying this description is the Gaussian \eta(t) = e^{-t^2/2}. The Mellin transform F_\delta turns out to be a parabolic cylinder function, with imaginary values for one of the parameters. Parabolic cylinder functions seem to be much-loved and much-studied by applied people — but mostly for real values of the said parameter. There are some asymptotic expansions of F_\delta in the literature for general parameters (notably by F. W. J. Olver), but none that were explicit enough for my purposes. Thus, I had to go and provide fully explicit estimates myself, using the saddle-point method. This took me a while, but the results should be of general applicability – hi there, engineers – and the Gaussian smoothing will hopefully become a little more popular in explicit work in number theory.
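
(If you ever need these functions numerically yourself: mpmath implements the standard parabolic cylinder functions as pcfd, pcfu, pcfv and pcfw, and, as far as I can tell, pcfu accepts complex values of the parameter, which is the regime relevant here; this is just a pointer, not the code used in the proof.)

    import mpmath

    a = 1j * mpmath.mpf(8)                # purely imaginary parameter (the case that shows up here)
    for z in [1, 4, 16]:
        print(z, mpmath.pcfu(a, z))       # parabolic cylinder function U(a, z)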

Ah, by the way, these estimates on parabolic cylinder functions allow us to take not just \eta(t) = e^{-t^2/2}, but also, more generally, \eta(t) = h(t) e^{-t^2/2}, where h is any band-limited function, meaning, in this context, any function whose Mellin transform restricted to the imaginary axis has compact support. We will want to optimize the choice of h(t) — more on that later.

The minor arcs

How do you bound |S(\alpha)| when \alpha is not close to any rational a/q of small denominator? That this is at all possible was Vinogradov’s great achievement. Progress since then has been gradual. My own take on things is the subject of my minor arcs paper. Let me just discuss a few of the ideas behind my improvements.

Vinogradov’s proof was greatly simplified in the 70s by Vaughan, who introduced the identity that now bears his name. Basically, Vaughan’s identity is a gambit: it gives you a great deal of flexibility, but at a cost — here, a cost of two logs, rather than, say, two pawns. The problem is that, if we are to reach our goal, we cannot afford to waste logs. The only way is to recover these logs, finding cancellation in the different sums that arise from Vaughan’s identity. This I had to do, mind you, without using L-functions, since I could no longer assume that q was small.

Here is another theme of this part of the proof. Every \alpha has an approximation \alpha = a/q + \delta/x; the fact that \alpha is in the minor arcs just tells us that q is not small. In fact, we are looking for bounds that decrease with q; the bound I obtain is proportional to (\log q)/\sqrt{\phi(q)}. What is the effect of \delta?

Something I realized early on was that, if \delta is not tiny, it can actually be used to our advantage. One reason is that there are terms of the form \widehat{\eta}(\delta), and the Fourier transforms of smooth functions decay as the argument grows. There are other issues, however. Something we can use is the following: by basic results on Diophantine approximation, every \alpha has very good approximations by rationals of non-huge denominator. If \delta is not tiny, then the approximation \alpha = a/q + \delta/x is good, but not very good; hence, there must be another, better approximation \alpha \sim a'/q' with q' non-huge (meaning: considerably smaller than x). We can go back and forth between the approximations a/q and a'/q', depending on which one is more useful in context. This turns out to be better than using a single approximation a/q, however good it may be.
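
(The Diophantine approximation step is just continued fractions; in Python you can play with it through Fraction.limit_denominator, which returns the best rational approximation with denominator up to a given bound. The numbers in the sketch below are only meant to illustrate the “good, but not very good” situation.)

    from fractions import Fraction
    import math

    alpha = math.sqrt(2) - 1              # some frequency on the circle
    x = 10 ** 12

    for Q in [50, 10 ** 6]:               # a "small" and a "non-huge" bound on the denominator
        approx = Fraction(alpha).limit_denominator(Q)
        a, q = approx.numerator, approx.denominator
        delta = (alpha - a / q) * x       # so that alpha = a/q + delta/x
        print(q, a, delta)
    # With q forced to be small, |delta| is enormous (good, but not very good); allowing a larger,
    # still modest, denominator gives a far better approximation a'/q', and one can switch between
    # the two approximations as described above.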

Another way in which \delta large gets used is to scatter the inputs to a large sieve. The large sieve can be seen as an approximate form of the Plancherel identity, recast as an inequality: whereas the Plancherel identity tells us that the \ell_2-norm |\widehat{f}|_2 of the Fourier transform \widehat{f}:\mathbb{R}/\mathbb{Z}\to \mathbb{C} of a function f defined on the integers (for instance; groups other than the integers are also valid) equals the \ell_2-norm |f|_2 of the function f itself, the large sieve tells us that the total of |\widehat{f}(\alpha_i)|^2 for a well-spaced sample of points \alpha_i\in \mathbb{R}/\mathbb{Z} is bounded by (a multiple of) |f|_2^2. Now, in our case, the points \alpha_i are multiples of our angle \alpha. If \alpha = a/q, the spacing of the points \alpha_i is 1/q, which is nice — but we may have to apply the large sieve many times, since we have to apply it afresh for each chunk of q points. However, if \alpha = a/q + \delta/x and \delta is not tiny, we can go around the circle many times, and rely on \delta/x rather than on 1/q to give us the spacing. Yes, the spacing may be smaller, but the effect of this is more than compensated by the fact that we have to invoke the large sieve far fewer times (perhaps only once). What is more, this scattering can be combined with a more traditional kind of scattering (Montgomery’s lemma; see Montgomery’s “Topics in multiplicative number theory”, or else the exposition in Iwaniec-Kowalski, section 7.4) so as to take advantage of the fact that we are dealing with sums over the primes.
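
(For reference, the inequality I keep alluding to has the following shape. If f is supported on an interval of N consecutive integers and the sample points \alpha_1,\dotsc,\alpha_R \in \mathbb{R}/\mathbb{Z} are \delta-separated, then \sum_{i=1}^R |\widehat{f}(\alpha_i)|^2 \leq (N + \delta^{-1}) \sum_n |f(n)|^2. This is the classical large sieve; the sharp constant N-1+\delta^{-1} is due to Montgomery–Vaughan and Selberg, but any constant of this general shape would do for the present discussion.)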

Putting it all together

I’ve been telling you what goes into bounding S(\alpha) for \alpha within the minor arcs \mathfrak{m}, but what we really want to do is bound the integral \int_{\mathfrak{m}} |S(\alpha)|^3 d\alpha. One easy – and traditional – way to do this is to just use the trivial inequality \int_{\mathfrak{m}} |S(\alpha)|^3 d\alpha \leq \max_{\alpha \in \mathfrak{m}} |S(\alpha)|\cdot \int_{\mathfrak{m}} |S(\alpha)|^2 d\alpha. Unfortunately, this wastes a factor of \log.

Since our bounds for S(\alpha), \alpha\sim a/q, are given in terms of q, it makes sense to combine them with estimates for integrals of the type \int_{\mathfrak{m}_r} |S(\alpha)|^2 d\alpha, where \mathfrak{m}_r is a union of arcs around rationals with denominator greater than a constant but less than r. How do you estimate such integrals? It turns out that this is very closely related to a question on the large sieve: what bounds can you get for samples \alpha_i = a/q, q\leq r, where r is of moderate size?

There was an answer in the literature (based on Montgomery’s lemma; the link with the circle method was already noticed by Heath-Brown) but it was sub-optimal by a factor of at least e^\gamma (or in fact more). There was a newer estimate for the large sieve due to Ramaré, but it had not been made fully explicit. I had to work that out, and then adapted the new large-sieve result to estimate the integral over \mathfrak{m}_r above. As was expected, the spurious factor of e^\gamma (or really a bit more) disappeared.

It remains to look at the main term vs. the error term. It turns out that we have some room to choose what the main term will be, since it depends on the smoothings we choose. The main term is proportional to \int_0^\infty \int_0^\infty \eta_+(t_1) \eta_+(t_2) \eta_*(N/x - (t_1+t_2)) dt_1 dt_2, where \eta_+ and \eta_* are the two smoothings we choose to work with, N is the odd number we want to express as the sum of three primes, and x is again a parameter of our choosing. For comparison, the error term is proportional to |\eta_+|_2^2 |\eta_*|_1. Thus, we have an optimization problem (“maximize the size of the double integral divided by |\eta_+|_2^2 |\eta_*|_1”). It is best to choose \eta_+ symmetric or close to symmetric (\eta_+(t)\sim \eta_+(2-t)), making sure, moreover, that \eta_+(t)\sim 0 for t\geq 2. This is not too hard to achieve while keeping \eta_+ of the form \eta_+(t) = h(t) e^{-t^2/2}, where h is band-limited.

What about \eta_*? The solution to the optimization problem tells us that it should be of small support, or at least concentrated near the origin. Other than that – there is, so to speak, a political problem: \eta_*, unlike \eta_+, gets used both in the major and the minor arcs; the major arcs really want it to be of the form e^{-t^2/2} or t^k e^{-t^2/2}, whereas the minor arcs would prefer it to be something simple, like 1_{\lbrack 0,1\rbrack} or like \eta_2 = (2 \cdot 1_{\lbrack 1/2,1\rbrack}) \ast (2 \cdot 1_{\lbrack 1/2,1\rbrack}) (as in Tao’s five-prime paper or my own minor-arcs paper).

The solution is simple: define \eta_*(t) = (f\ast g)(\kappa t), where \kappa is a large constant, f(t) = \eta_2(t) and g(t) = t^2 e^{-t^2/2}. For f and g pretty much arbitrary, if you know how to compute (or estimate) S_f(\alpha) for some \alpha, and you also know how to estimate S_g(\alpha) for the other \alpha, then you know how to estimate S_{f\ast g}(\alpha) for all \alpha: just write it out and you will see!

(Perhaps you will also see that it helps if f(t) and g(t) are very small near t=0.)

The moral is that different problems, and different parts of the same problem, require different smoothings. At least in the context of exponential sums, there turns out to be a simple trick for combining them, as we have just seen.

Some final remarks on computing

An analytic proof usually gives a proof valid for all n larger than a constant C. The reason is simple: say that we want to show that a quantity is positive. Typically, at the end of a great deal of analytic work, you will have proven that the quantity is of the form 1 + \text{error term}, where the absolute value of the error term is at most C/n (for example; this is obviously simplified). This certainly shows that the quantity is positive — provided that n\geq C. The task, then, is to sharpen the proof to such an extent that the constant C becomes small enough that all cases n\leq C can be checked by hand (meaning either literally your hand or a computer). This is what my work was largely about; checking the conjecture up to C=10^{29} (and in fact up to 8.8\cdot 10^{30}) was a bit of a side task – as we are about to see, it wasn’t even the main computational effort involved.

First, let me say a few more words about analytic results. There are results of the type “the statement is true for all n larger than a constant C, but this proof can tell you nothing about C, other than that it exists”. This is called an ineffective estimate; many proofs of Vinogradov’s result in textbooks are of this kind. (The reason behind this is the possibility of so-called Siegel zeros.) A result can also say “the statement is true for all n>C, and you should in principle be able to determine some value of C by using ideas from the proof, but the author would much rather go drink coffee”. This is an effective, non-explicit statement; Vinogradov’s definitive version of his own proof was of this sort (as are many other results in mathematics, including some of my own past results). If you do give an explicit value of C, then the result is called, well, explicit. Then comes the fourth stage: making C reasonable, i.e., low enough that the case n\leq C can be checked by hand. It was clear from the beginning that, in the case of the ternary Goldbach conjecture, “reasonable” meant roughly C\sim 10^{30}, even though nobody had actually checked it that far.

I said before that D. Platt and I had checked the conjecture for all odd numbers up to 8.8\cdot 10^{30}. Here is how we proceeded. It was already known (thanks to a major computational effort by Oliveira e Silva, Herzog and Pardi) that the binary Goldbach conjecture is true up to 4\cdot 10^{18} — that is, every even number up to 4\cdot 10^{18} is the sum of two primes. Given that, all we had to do was to construct a “prime ladder”, that is, a list of primes from 2 up to 8.8\cdot 10^{30} such that the difference between any two consecutive primes in the list is at most 4\cdot 10^{18}. Thus, if anybody gives you an odd integer n up to 8.8\cdot 10^{30}, you know that there is a prime p in the list such that n-p is positive and at most 4\cdot 10^{18}. By assumption, we can write n-p = p_1 + p_2 for some primes p_1, p_2, and so n = p + p_1 + p_2.

Constructing such a ladder does not take that much time. (In fact, getting a ladder up to 10^{29} is probably something you can do on your old personal computer in the basement, over a few weeks — though storing it is another matter.) It’s all integer arithmetic, and we use deterministic primality checking (which is fast for primes of special form), so there is really no room for concern.
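
(Here is what a miniature version of the ladder looks like, as a Python sketch with toy parameters; sympy’s prevprime stands in for the deterministic primality proofs used in the real computation.)

    from sympy import prevprime

    CEILING, GAP = 10 ** 6, 1000          # toy values; the real thing has CEILING ~ 8.8e30 and GAP = 4e18

    ladder = [3]
    while ladder[-1] < CEILING:
        # largest prime <= previous rung + GAP; it is > the previous rung, so consecutive rungs differ by at most GAP
        ladder.append(prevprime(ladder[-1] + GAP + 1))

    n = 999_999                           # an odd number below CEILING
    p = max(q for q in ladder if q < n)   # the rung just below n
    print(p, n - p)                       # n - p is even, positive and at most GAP (then binary Goldbach takes over)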

The major computation consists of verifying that, for every L-function of conductor q up to about 150000 (or twice that for q even), all zeroes of the L-function with imaginary part bounded by 10^8/q lie on the critical line. This was entirely Platt’s work; my sole contribution was to run around asking for computer time at different places (see the acknowledgements section of the major arcs paper). In fact, he went up to conductor 200000 (or twice that for q even); he had already gone up to conductor 100000 in his thesis. The verification took, in total, about 400000 core-hours (i.e., the total number of processor cores used times the number of hours they ran equals 400000; nowadays, a top-of-the-line processor — such as the ones in the MesoPSL machine — typically has eight cores). In the end, as I said, I used only q\leq 150000 (or twice that for q even), so the number of hours actually needed was more like 160000; since I could have made do with about q\leq 120000, you could say that, in retrospect, only about 80000 core-hours were needed. The computers and I were digging on opposite ends of the mountain, and we met down the middle. The fact that the computers ran for longer than needed is hardly something to be regretted: the computation is of general use, and so it’s so much the better that it not be too tightly fitted to my needs; moreover, with proofs of this length, you want to “build like a Roman”, i.e., overcalculate in case you (not the computer!) have made a small mistake somewhere. (Why did you think those walls were so thick?)

Checking zeros of L-functions computationally is something as old as Riemann (who did it by hand); it is also one of the things that were tried on electronic computers already in their early days (Turing had a paper on that). One of the main issues to be careful about arises whenever one manipulates real numbers: honestly speaking, a computer cannot store \pi; moreover, while a computer can handle rationals, it is really most comfortable handling just those rationals whose denominators are powers of two. Thus, you cannot really say: “computer, give me the sine of that number” and expect a precise result. What you should do, if you really want to prove something (as is the case here!), is to say: “computer, I am giving you an interval I=[a/2^k,b/2^k]; give me an interval I'=[c/2^\ell,d/2^\ell], preferably very short, such that \sin(I) \subset I'”. This is called interval arithmetic; it is really the easiest way to do floating-point computations rigorously.
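
(mpmath ships a basic interval type, the iv context, that does precisely this kind of thing; a two-line illustration, and of course only a sketch rather than Platt’s library:)

    from mpmath import iv

    I = iv.mpf([0.785398163397448, 0.785398163397449])   # a tiny interval containing pi/4
    print(iv.sin(I))                                     # an interval guaranteed to contain sin(t) for every t in I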

Now, processors don’t do this natively, and if you do this purely in software, you can slow things down by a factor of 100. Fortunately, there are ways of doing it halfway in hardware, halfway in software. Platt has his own library, but there are others online (e.g. PROFIL/BIAS).

(Oh, by the way, don’t use the \sin function on an Intel processor if you want the result to be correct up to the last bit. Just what were they thinking? Use the crlibm library instead.)

Lastly, there were several rather minor computations that I did myself; you’ll find them mentioned in my papers. A typical computation was a rigorous version of a “proof by graph” (the maximum of a function f is clearly less than 4 because gnuplot told me so). You’ll find algorithms for this in any textbook on “validated computing” – basically, it’s enough to combine the bisection method with interval arithmetic.
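
(For concreteness, here is roughly what such a rigorous “proof by graph” looks like, as a sketch using mpmath’s interval arithmetic; the function and the bound below are just an example, not one of the inequalities from the papers.)

    from mpmath import iv

    def F(x):
        # interval enclosure of f(t) = t * exp(-t^2/2) over the interval x
        return x * iv.exp(-(x * x) / 2)

    def bounded_by(a, b, bound, depth=0):
        # try to prove that f(t) <= bound for every t in [a, b], by bisection plus interval arithmetic
        enclosure = F(iv.mpf([a, b]))
        if enclosure.b <= bound:          # .b is the upper endpoint of the enclosure
            return True
        if depth > 40:
            return False                  # give up (in practice: tighten the bound or increase precision)
        m = (a + b) / 2
        return bounded_by(a, m, bound, depth + 1) and bounded_by(m, b, bound, depth + 1)

    print(bounded_by(0.0, 3.0, 0.61))     # True: the maximum of t*exp(-t^2/2) is e^{-1/2} = 0.6065...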

Finally, let me point out that there is an elementary inequality in the minor arcs paper (namely, (4.24), in the proof of Lemma 4.2) that got proven in part by a human (me) and in part by a quantifier-elimination program. In other words, there are now computer programs out there (in this case, QEPCAD) that can actually prove useful things! Now, I have no doubt that the same inequality can be proven purely by the use of human beings, but it is nice to know that our computer friends can (pretend to) do something other than munch on numbers…


11 Responses to The ternary Goldbach conjecture

  1. Very nice article (at least the part I’ve gotten through).

    e^z = cos z + i sin z looks like a typo. e^{iz}, right?

  2. Richard says:

    Dear Harald,

    I know this sort of elementary expository writing is hard work, and perhaps not very rewarding, but there’s at least one person out on the internet who appreciates it.

  3. Willie Wong says:

    A small correction: first line of the section on Major arcs should be \hat{f}(\alpha) and not \hat{f}(n).

  4. Ricardo Menares says:

    Small typo: the roles of $\alpha$ and $n$ should be reversed in the LHS of the equation displayed just before “(Thanks again, Euler.)”

  5. Pingback: En Passant II | Persiflage

  6. Pingback: Harald Andrés Helfgott nos habla sobre su demostración de la conjetura débil de Goldbach - Gaussianos | Gaussianos

  7. Dear Mr. Helfgott,

    My warmest congratulations.

    Yours very truly.

  8. Richard Webb says:

    Thank you for this beautifully clear explanation, but I have a question. Why does the use of the Von Mangoldt function not undermine the argument? It seems to me that this proves only that an odd number must be the sum of 3 prime powers. Please tell me what I am missing. Thanks, Richard Webb

    • valuevar says:

      Ah! Well, the proof actually gives a lower bound for the number of ways of expressing an odd integer N as a sum of three prime powers. Now, the number of ways of expressing N as a sum of three prime powers at least one of which is not a prime (or, for that matter, as a sum n_1^k+n_2+n_3, where k>1) is pitifully small – much smaller than the lower bound I just mentioned – simply because squares (and cubes, and fourth powers, etc.) are so rare: at most \sqrt{N} numbers up to N are squares! So, you see, you can subtract that pitifully small number from the lower bound and still get something positive. This is, in fact, the last step of the proof.

      • Richard Webb says:

        Thank you so much for taking the time to get back to me, and congratulations on your achievement.
