Изменить стиль страницы

Seibel: So are there language features that make programmers—folks who have mastered this unnatural act—more productive? You’re designing a language right now so you’ve obviously got some opinions about this.

Steele: I said earlier that I think you can’t afford to neglect correctness. On the other hand, I think we can design tools to make it easier to achieve that. We can’t make it trivial, but I think we can make it easier to avoid mistakes of various kinds. A good example is overflow detection on arithmetic, or providing bignums instead of just letting 32-bit integers wrap around. Now, implementing those is more expensive but I believe that providing full-blown bignums is a little less error-prone for some kinds of programming.

A trap that I find systems programmers and designers of operating-systems algorithms constantly falling into is they say, “Well, we need to synchronize some phases here so we’re going to use a take-a-number strategy. Every time we enter a new phase of the computation we’ll increment some variable and that’ll be the new number and then the different participants will make sure they’re all working on the same phase number before a certain operation happens.” And that works pretty well in practice, but if you use a 32-bit integer it doesn’t take that long to count to four billion anymore. What happens if that number wraps around? Will you still be OK or not? It turns out that a lot of such algorithms in the literature have that lurking bug. What if some thread stalls for 2 to the 32nd iterations? That’s highly unlikely in practice, but it’s a possibility. And one should either mitigate that correctness problem or else do the calculation to show that, yeah, it’s sufficiently unlikely that I don’t want to worry about it. Or maybe you’re willing to accept one glitch every day. But the point is you should do the analysis rather than simply ignoring the issue. And the fact that counters can wrap around is a lurking pitfall that doesn’t hurt most programmers but for a very few it lays traps in their algorithms.

Seibel: Speaking of glitches, what was the worst bug you’ve ever had to track down?

Steele: I’m not sure I can dig up a worst one, but I can tell you a few stories. Certainly, dealing with parallel processes has produced the most difficult-to-deal-with bugs.

When I was a teenager programming on the IBM 1130 was the one time in my life when the solution to a bug came to me in a dream. Or as I woke up. I had been puzzling over a bug for a couple days, couldn’t figure it out. And then suddenly sat bolt upright in the middle of the night and realized I knew what the problem was. And it was because I had overlooked something in an interface specification.

It had to do with concurrent processes. I was writing a decompiler so that I could study the IBM disk operating system by decompiling it. And to this end it would take binary data on the disk and print it in a variety of formats including as instructions, as character codes, as numbers, and so on. And in order to convert the characters, I was feeding the data to various character–conversion routines, one of which was designed for use after reading card codes from a card reader. And I had overlooked the tiny footnote in the specification that said, “We assume that before you call the procedure, the buffer in which the card data will be read has all over the low order bits cleared.” Or maybe had them set.

Anyway, the 12 bits from the card-code column were going into the high 12 bits of the 16-bit word and they were using the low bit for a clever trick whereby you could call the card-reader routine asynchronously and it would asynchronously load the buffer and the conversion routine would follow behind, using that low bit to determine whether the next card column had been read in or not. And if it had been read it would then convert that character. Thereby, as soon as the card was read, very shortly thereafter the conversion would finish—they were overlapping the cost of the card conversion with the time it took to read the card. And I was feeding it raw binary data that wasn’t obeying this constraint. And I had just overlooked this—I thought it was yet another card-conversion routine but it turned out this one had something special about its interface—it was relying on those low-order bits, which ordinarily you wouldn’t think about. It was interpreting what was in the buffer as saying, “Oh, the data has not yet arrived from the card reader.” In principle I knew this, but it wasn’t occurring to me. And then, as I say, it came to me while I was asleep. So that was an odd case.

The other really interesting story I can think of was while I was the maintainer of the Maclisp system and Maclisp supported bignums—integers of arbitrary precision. They’d been around for several years and were considered pretty well debugged and shaken out. They’d been used for all kinds of stuff in Macsyma, and Macsyma users were using them all the time. And yet a report came in from Bill Gosper saying, “The quotient of these two integers is wrong.” And he could tell because the quotient was supposed to be very near a decimal multiple of pi.

These were numbers about a hundred digits each and it really wasn’t feasible to trace through the entire thing by hand because the division routine was fairly complicated and these were big numbers. So I stared at the code and didn’t see anything obviously wrong. But one thing that caught my eye was a conditional step that I didn’t quite understand.

These routines were based on algorithms from Knuth, so I got Knuth off the shelf and read the specification and proceeded to map the steps of Knuth’s algorithm onto the assembly-language code. And what caught my eye in Knuth was a comment that this step happens rarely—with a probability of roughly only one in two to the size of the word. So we expected this to happen only once in every four billion times or something.

From that I reasoned to myself, “These routines are thought to be well shaken out, thus this must be a rare bug; therefore the problem is likely to be in the rarely executed code.” That was enough to focus my attention on that code and realize that a data structure wasn’t being properly copied. As a result, later down the line there was a side-effect bug where something was getting clobbered. And so I repaired that and then fed the numbers through and got the right answer and Gosper seemed to be satisfied.

A week later he came back with two somewhat larger numbers and said, “These don’t divide properly either.” This time, properly prepared, I returned to that same little stretch of ten instructions and discovered there was a second bug of the same kind in that code. So then I scoured the code thoroughly, made sure everything was copied in the right ways, and no errors were ever reported after that.

Seibel: That’s always the trick—that there’s going to be more than one.

Steele: So I guess there’s lessons there—the lesson I should have drawn is there may be more than one bug here and I should have looked harder the first time. But another lesson is that if a bug is thought to be rare, then looking at rarely executed paths may be fruitful. And a third thing is, having good documentation about what the algorithm is trying to do, namely a reference back to Knuth, was just great.

Seibel: When it’s not just a matter of waking up in the middle of the night and realizing what the problem is, what are your preferred debugging techniques? Do you use symbolic debuggers, print statements, assertions, formal proofs, all of the above?

Steele: I admit to being lazy—the first thing I will try is dropping in print statements to see if it will help me, even though that is probably the least effective for dealing with a complicated bug. But it does such a good job of grabbing the simple bugs that it’s worth a try. On the other hand, one of the great revelations in the evolution of my thinking about programming was when I was working on that one project in Haskell. Because it was a purely functional language, I couldn’t just drop in print statements.