Изменить стиль страницы

Seibel: What is the worst bug you’ve ever had to track down?

Norvig: Well, I guess the most consequential bugs I was involved with were not mine, but the ones I had to clean up after: the Mars program failures in ’98. One was foot-pounds vs. newtons. And the other was, we think, though we’re not 100 percent sure, prematurely shutting off the engines due to a software problem.

Seibel: I read one of the reports on the Mars Climate Orbiter—that was the one that was the foot-pounds vs. newtons problem—and you were the only computer scientist on that panel. Were you involved in talking to the software guys to figure out what the problem was?

Norvig: That was pretty easy, post hoc, because they knew the failure mode. From that they were able to back it out and it didn’t take long to figure that one out. Then there was this postmortem of why did it happen? And I think it was a combination of things. One was outsourcing. It was a joint effort between JPL in Pasadena and Lockheed-Martin in Colorado. There were two people on two different teams and they just weren’t sitting down and having lunch together. I’m convinced that if they had, they would have solved this problem. But instead, one guy sent an email saying, you know, “Something not quite right with these measurements, seems like we’re off by a little bit. It’s not very much, it’s probably OK, but—”

Seibel: That was all during the flight?

Norvig: Right. During the flight they had chance and chance to catch it. They knew something was wrong and they sent this email but they did not put it into the bug-tracking system. If they had, NASA has very good controls for bug tracking and at later points in the flight somebody would have had to OK it. Instead it was just an informal email that never got an answer back, and JPL said, “Oh, I guess Lockheed-Martin must have solved this problem.” And Lockheed says, “Oh, JPL’s not asking anymore—they must not be concerned.”

So it was this communications problem. It was also a software reuse problem. They have extremely good checks for the stuff that’s missioncritical, and on the previous mission the stuff that was recorded in footpounds was non–mission-critical—it was just a log that wasn’t used for navigation. So it had been classified as non–mission-critical. In the new mission they reused most of the stuff but they changed the navigation so that what was formerly a log file now became an input to the navigation.

Seibel: So the actual problem was that one side generated a data file in foot-pounds and that data file was fed into a piece of software that was calculating inputs to the actual navigation and was expecting newtons?

Norvig: Right. So essentially the other root cause was too many particles from the sun. The spacecraft is asymmetrical and it’s got these solar panels. Particles twist the spacecraft a little bit so you’ve got to fire the rockets to twist it back. So this new hire at Lockheed went to the rocket manufacturer, and they had all their specifications in foot-pounds, so he just said, “I’ll go with that,” and he recorded them that way, not knowing that NASA wanted them in metric.

Seibel: I was struck, reading that report, by the NASA attitude of, “Well, the problem was due to this software bug but we had so many other chances to notice that the ship wasn’t where we expected, and we should have. We should have fixed it anyway even though the numbers we were dealing with were totally bogus because of some stupid software glitch.” I thought that was admirable.

Norvig: Yeah, they were looking at the process.

Seibel: Is it actually common for there to be software bugs of that magnitude, which we never hear about because all the other processes keep everything online?

Norvig: Yeah, I think so. Look at all the software bugs on your computer. There are millions of them. But most of the time it doesn’t crash.

Seibel: Yet you hear about how the shuttle flight software costs $1,500 a line or something because of the care with which they write it and which is allegedly bug-free. Is that just a lie?

Norvig: No, that’s probably true. But I don’t know if it’s optimal. I think they might be better off with buggy software.

Seibel: With cheaper software and better operations?

Norvig: Yeah, because of the amount of training they have to give to these astronauts to be able to deal with the things the software just can’t do. They put these astronauts in simulators and give them all these situations and when things go bad you’ve got this screen and stuff is scrolling through it and you can’t pause the screen, you can’t go back, you can’t get a summary of what the important things are. The astronauts just have to be trained to know, “When I see this happening, here’s what’s really going on.” There are a hundred messages in a row saying, “This electrical thing has faulted,” and you train them to say, “OK, that must be because this one original one faulted and then there was a cascade downstream and all the other ones are reported.” Why can’t you do that in software rather than train the astronaut? They don’t try because they don’t want to mess with it.

Seibel: On a different topic, what are your preferred debugging techniques and tools? Print statements? Formal proofs? Symbolic debuggers?

Norvig: I think it’s a mix and it depends on where I am. Sometimes I’m using an IDE that has good tracing capability and sometimes I’m just using Emacs and don’t have all that. Certainly tracing and printing. And thinking. Writing smaller test cases and watching them go, and breaking the functionality down to see where the test case failed. And I’ve got to admit, I often end up rewriting. Sometimes I do that without ever finding the bug. I get to the point where I can just feel that it’s in this part here. I’m just not very comfortable about this part. It’s a mess. It really shouldn’t be that way. Rather than tweak it a little bit at a time, I’ll just throw away a couple hundred lines of code, rewrite it from scratch, and often then the bug is gone.

Sometimes I feel guilty about that. Is that a failure on my part? I didn’t understand what the bug was. I didn’t find the bug. I just dropped a bomb on the house and blew up all the bugs and built a new house. In some sense, the bug eluded me. But if it becomes the right solution, maybe it’s OK. You’ve done it faster than you would have by finding it.

Seibel: What about things like assertions or invariants? How formally do you think about those kinds of things when you’re coding?

Norvig: I guess I’m more on the informal side. I haven’t used languages where that’s a big part of the formal mechanism, other than just type declarations. Like loop invariants: I’ve always thought that was more trouble than it was worth. I occasionally have a problem where this loop doesn’t terminate, but mostly you don’t, and I just feel like it slows you down to do the formal part. And if you do have a problem, the debugger will tell you what loop you’re stuck inside. I guess if you’re writing high-dependability software that’s embedded in something that it’s really important that it doesn’t fail, then you really want to prove everything. But just in terms of getting the first version of the program running or debugging it, I’d rather move fast towards that than worry about the degree of formal specification you need later.

Seibel: Have you ever done anything explicit to try and learn from the bugs that you’ve created?

Norvig: Yeah, I think that’s pretty interesting and I wish I could do more with that. I’m actually in a discussion now to see if I can do an experiment, company-wide and then maybe for the world at large, to understand more some of these issues. How do you classify bugs, but also what are some factors in terms of productivity? How do you know? Is there a certain type of person? What are the factors of that person that makes them more productive? And I think it’s more interesting what the controllable factors are that make somebody do better. If giving them a bigger monitor increases productivity by such and such a percent, then you should probably do it.