Saturday, May 28, 2016

Rounding Error

I recently experienced the eight-bazillionth iteration of the following scenario. Someone posts a request for help on a forum. They are doing some sort of MIP model, or possibly something even more advanced (Benders decomposition, column generation, ...), and things are not going well. In some cases they think they have made a coding error; more commonly they think the solver (CPLEX in my personal experience, since that's the only one whose forums I watch) has a bug in it. After substantial back and forth, the key problem emerges: their binary variable came out 1.0e-9 or 0.99999999, which is "clearly" neither 0 nor 1.
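For the record, the cure is a one-liner: round the reported value, using a tolerance. Here is a minimal sketch in plain Python (no solver required); the tolerance value below is my own assumption for illustration, and real solvers expose their own integrality tolerance parameter that you should use instead.

```python
# Interpreting solver output like 1e-9 or 0.99999999 for a binary variable.
INT_TOL = 1e-6  # assumed integrality tolerance, purely for illustration

def as_binary(x, tol=INT_TOL):
    """Round a solver-reported 'binary' value to 0 or 1, or complain if it
    is not within tolerance of either."""
    nearest = round(x)
    if nearest in (0, 1) and abs(x - nearest) <= tol:
        return nearest
    raise ValueError(f"{x} is not within {tol} of 0 or 1")

print(as_binary(1.0e-9))      # the solver means 0
print(as_binary(0.99999999))  # the solver means 1
```

The point is that the solver is behaving correctly; the burden of interpreting "0.99999999" as 1 falls on the user.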

At this point, my reaction is always the same (add your favorite expletives after every few words): "Don't they know about rounding error???"

In fairness, though, they may not, and perhaps the blame lies on the shoulders of faculty (possibly including me, in my former life). One of the courses I took during my first term of graduate school (in math) was numerical analysis. That was back when mainframes were just making the transition from vacuum tubes to semiconductors, and "user input" was a euphemism for "deck of punch cards". It was also a time when memory was at a premium and, thus, single precision ruled the realm.

(BEGIN TANGENT) I suspect that some readers who are (much) younger than I, and who program, will have noticed that pretty much everything of a floating-point nature is automatically done in double precision these days, and may be wondering why single-precision float types still exist. They're a holdover from my youth, and I suspect they exist almost exclusively for backward compatibility. (END TANGENT)
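If you want to see what single precision costs, here is a quick Python illustration (my addition): forcing a value through an IEEE-754 single-precision representation keeps only about 7 significant decimal digits, versus roughly 16 for a double.

```python
# Round-trip 0.1 through 32-bit (single-precision) storage and compare.
import struct

x = 0.1
single = struct.unpack('f', struct.pack('f', x))[0]  # forced through 32 bits
print(single)           # e.g. 0.10000000149011612
print(abs(single - x))  # the precision lost in the round trip
```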

I don't recall why I picked numerical analysis as an elective -- whimsy, I think -- but it ended up being one of the most useful courses I took in grad school. The first, and most important, lesson from it was that computers do math inexactly. This should be immediately obvious if you stop to think about it. I learned in childhood that 0.3333... = 1/3, where the ... means "repeated an infinite number of times". Stop the decimal expansion after a finite number of digits, and you get something slightly less than 1/3. Now pretend for the moment that your computer encoded numbers in decimal rather than binary, and that you used the expression 1/3 in a program. The compiler would convert 1/3 to a decimal representation, but not 0.3333... -- because that would require an infinite amount of memory to store, and the computer only has a finite amount. So the computer has to truncate the decimal expansion, and as a result what the computer stores for 1/3 is not actually 1/3. (Now go back to using binary. The same logic applies.)
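You can watch this happen, in binary, from a Python prompt (a sketch of my point above, not anything from that long-ago course): the stored double for 1/3 is not exactly one third, and the standard library's Fraction type will show you the exact rational number the machine actually holds.

```python
# The stored value of 1/3 is a nearby rational number, not one third.
from fractions import Fraction

x = 1 / 3                              # what the computer actually stores
print(Fraction(x))                     # the exact rational value of that double
print(Fraction(x) == Fraction(1, 3))   # False: it is not one third
print(0.1 + 0.2 == 0.3)                # False: the classic consequence
```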

I'm not going to repeat an entire course in numerical analysis here (too much typing), but another lesson I learned was that, when programming, I should never do strict equality comparisons on anything that was not stored as an integer. Here's a small example that I just ran in R (not to pick on R -- it was just handy):
> sqrt(9) - 2.99999999999999999 == 0
[1] TRUE

To a mathematician, that relation is patently false. To the computer, though, the literal 2.99999999999999999 contains more digits than a double can store, so it gets rounded to exactly 3, the same value sqrt(9) returns, and the difference comes out exactly zero.
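The practical fix I was taught, sketched here in Python rather than R: compare floating-point values with a tolerance rather than with strict equality. The standard library's math.isclose does exactly that.

```python
# Tolerance-based comparison instead of == for floating-point values.
import math

a = math.sqrt(9)              # 3.0, but pretend we did not know that
b = 0.1 + 0.2                 # not exactly 0.3

print(b == 0.3)               # False: strict equality fails
print(math.isclose(b, 0.3))   # True: within relative tolerance
print(math.isclose(a, 3.0))   # True
```

R users can do the same thing with all.equal() or an explicit abs(x - y) < tol test.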

My prof shared one anecdote that has stuck with me for almost half a century now. He told of an engineer who coded his own routine for inverting a matrix and was not particularly careful about dealing with rounding errors (and ill-conditioned, or "stiff", matrices, a topic for another time). The engineer then published a paper which included, at one point, the inverse of a particular matrix, computed by his program, with each entry in the matrix printed to some absurd number of decimal places (five, I think). All the digits to the right of the decimal points were wrong. So were the first couple of digits immediately left of the decimal points. You can ignore the limitations of computer arithmetic precision, but (cue sinister background music) those limitations will not ignore you.
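A toy version of that disaster, in Python (my construction, not the engineer's matrix): invert a nearly singular 2x2 matrix with the textbook formula, once in floating point and once exactly with Fractions, and compare an entry. Even double precision loses several significant digits here; an ill-conditioned matrix in single precision can lose everything.

```python
# Floating-point vs. exact inversion of a nearly singular 2x2 matrix
# A = [[1, 1], [1, 1 + eps]], whose determinant is eps.
from fractions import Fraction

eps = 1e-12

det_float = 1.0 * (1.0 + eps) - 1.0 * 1.0
inv00_float = (1.0 + eps) / det_float       # (1,1) entry of the inverse, in floats

e = Fraction(eps)                            # exact rational value of the double eps
det_exact = 1 * (1 + e) - 1 * 1
inv00_exact = (1 + e) / det_exact            # same entry, in exact arithmetic

rel_err = abs(inv00_float - float(inv00_exact)) / float(inv00_exact)
print(rel_err)   # far larger than machine precision
```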

Where I'm going with this is that I learned about the limitations of computer math somewhat accidentally, because I just happened to take numerical analysis. I suspect that some OR/IE programs require it, and I'm pretty sure others (most?) don't. In the interests of reducing frustration for a lot of people (of whom I'm the only one I care about), and perhaps avoiding incorrect results being published, I think we should pressure anyone teaching computational topics in OR (particularly, but not exclusively, integer programming) to provide some minimal coverage of rounding and truncation errors, the IEEE standards for single- and double-precision values, etc. If they cannot (or will not) teach it themselves, they can at least require that the student go learn it as a prerequisite.
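A possible starting point for that minimal coverage, again sketched in Python: the standard library reports the IEEE-754 double-precision parameters directly, including machine epsilon (the gap between 1.0 and the next representable double).

```python
# The IEEE-754 double-precision parameters, straight from the library.
import sys

print(sys.float_info.epsilon)  # 2.220446049250313e-16, the gap just above 1.0
print(sys.float_info.dig)      # 15 reliably representable decimal digits
print(1.0 + sys.float_info.epsilon / 2 == 1.0)  # True: the half-gap rounds away
```

Five minutes with output like this would spare students (and forum regulars) a great deal of grief.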

Please! I'm begging you!