Изменить стиль страницы

The ability to go study something in a systematic, and maybe even leisurely, way is attractive. The go-to-market, ride Moore’s law, and compete and deal with fast product cycles and sometimes throwaway software—seems like a shame if that’s all everybody does. So there’s a role for people who want to get PhDs, who have the skills for it. And there is interesting research to do. One of the things that we’re pushing at Mozilla is in between what’s respected in academic research circles and what’s already practice in the industry. That’s compilers and VM stuff, debuggers even—things like Valgrind—profiling tools. Underinvested-in and not sexy for researchers, maybe not novel enough, too much engineering, but there’s room for breakthroughs. We’re working with Andreas Gal and he gets these papers rejected because they’re too practical.

Of course, we need researchers who are inclined that way, but we also need programmers who do research. We need to have the programming discipline not be just this sort of blue-collar thing that’s cut off from the people in the ivory towers.

Seibel: How do you feel about proofs?

Eich: Proofs are hard. Most people are lazy. Larry Wall is right. Laziness should be a virtue. So that’s why I prefer automation. Proofs are something that academics love and most programmers hate. Writing assertions can be useful. In spite of bad assertions that should’ve been warnings, we’ve had more good assertions over time in Mozilla. From that we’ve had some illumination on what the invariants are that you’d like to express in some dream type system.

I think thinking about assertions as proof points helps. But not requiring anything that pretends to be a complete proof—there are enough proofs that are published in academic papers that are full of holes.

Seibel: On a completely different topic, what’s the worst bug you ever had to track down?

Eich: Oh, man. The worst bugs are the multithreaded ones. The work I did at Silicon Graphics involved the Unix kernel. The kernel originally started out, like all Unix kernels of the day, as a monolithic monitor that ran to completion once you entered the kernel through a system call. Except for interrupts, you could be sure you could run to completion, so no locks for your own data structure. That was cool. Pretty straightforward.

But at SGI the bright young things from HP came in. They sold symmetric multiprocessing to SGI. And they really rocked the old kernel group. They came in with some of their new guys and they did it. They stepped right up and they kept swinging until they knocked the ball pretty far out of the field. But they didn’t do it with anything better than C and semaphores and spin locks and maybe monitors, condition variables. All hand-coded. So there were tons of bugs. It was a real nightmare.

I got a free trip to Australia and New Zealand that I blogged about. We actually fixed the bug in the field but it was hellish to find and fix because it was one of these bugs where we’d taken some single-threaded kernel code and put it in this symmetric multiprocessing multithreaded kernel and we hadn’t worried about a particular race condition. So first of all we had to produce a test case to find it, and that was hard enough. Then under time pressure, because the customer wanted the fix while we were in the field, we had to actually come up with a fix.

Diagnosing it was hard because it was timing-sensitive. It had to do with these machines being abused by terminal concentrators. People were hooking up a bunch of PTYs to real terminals. Students in a lab or a bunch of people in a mining software company in Brisbane, Australia in this sort of ’70s sea of cubes with a glass wall at the end, behind which was a bunch of machines including the SGI two-processor machine. That was hard and I’m glad we found it.

These bugs generally don’t linger for years but they are really hard to find. And you have to sort of suspend your life and think about them all the time and dream about them and so on. You end up doing very basic stuff, though. It’s like a lot of other bugs. You end up bisecting—you know “wolf fence.” You try to figure out by monitoring execution and the state of memory and try to bound the extent of the bug and control flow and data that can be addressed. If it’s a wild pointer store then you’re kinda screwed and you have to really start looking at harder-to-use tools, which have only come to the fore recently, thanks to those gigahertz processors, like Valgrind and Purify.

Instrumenting and having a checked model of the entire memory hierarchy is big. Robert O’Callahan, our big brain in New Zealand, did his own debugger based on the Valgrind framework, which efficiently logs every instruction so he can re-create the entire program state at any point. It’s not just a time-traveling debugger. It’s a full database so you see a data structure and there’s a field with a scrogged value and you can say, “Who wrote to that last?” and you get the full stack. You can reason from effects back to causes. Which is the whole game in debugging. So it’s very slow. It’s like a hundred times slower than real time, but there’s hope.

Or you can use one of these faster recording VMs—they checkpoint only at system call and I/O boundaries. They can re-create corrupt program states at any boundary but to go in between those is harder. But if you use that you can probably close in quickly at near real time and then once you get to that stage you can transfer it into Rob’s Chronomancer and run it much slower and get all the program states and find the bug.

Debugging technology has been sadly underresearched. That’s another example where there’s a big gulf between industry and academia: the academics are doing proofs, sometimes by hand, more and more mechanized thanks to the POPLmark challenge and things like that. But in the real world we’re all in debuggers and they’re pieces of shit from the ’70s like GDB.

Seibel: In the real world one big split is between people who use symbolic debuggers and people who use print statements.

Eich: Yeah. So I use GDB, and I’m glad GDB, at least on the Mac, has a watch-point facility that mostly works. So I can watch an address and I can catch it changing from good bits to bad bits. That’s pretty helpful. Otherwise I’m using printfs to bisect. Once I get close enough usually I can just try things inside GDB or use some amount of command scripting. But it’s incredibly weak. The scripting language itself is weak. I think Van Jacobson added loops and I don’t even know if those made it into the real GDB, past the FSF hall monitors.

But there’s so much more debugging can do for you and these attempts, like Chronomancer and Replay, are good. They certainly changed the game for me recently. But I don’t know about multithreading. There’s Helgrind and there are other sort of dynamic race detectors that we’re using. Those are producing some false positives we have to weed through, trying to train the tools or to fix our code not to trigger them. The jury is still out on those.

The multithreaded stuff, frankly, scares me because before I was married and had kids it took a lot of my life. And not everybody was ready to think about concurrency and all the possible combinations of orders that are out there for even small scenarios. Once you combine code with other people’s code it just gets out of control. You can’t possibly model the state space in your head. Most people aren’t up to it. I could be like one of these chestthumpers on Slashdot—when I blogged about “Threads suck” someone was saying, “Oh he doesn’t know anything. He’s not a real man.” Come on, you idiot. I got a trip to New Zealand and Australia. I got some perks. But it was definitely painful and it takes too long. As Oscar Wilde said of socialism, “It takes too many evenings.”