Using the master control center, a phone engineer can test local and trunk lines for malfunctions. He (rarely she) can check various alarm displays, measure traffic on the lines, examine the records of telephone usage and the charges for those calls, and change the programming.
And, of course, anybody else who gets into the master control center by remote control can also do these things, if he (rarely she) has managed to figure them out, or, more likely, has somehow swiped the knowledge from people who already know.
In 1989 and 1990, one particular RBOC, BellSouth, which felt particularly troubled, spent a purported $1.2 million on computer security. Some think it spent as much as two million, if you count all the associated costs. Two million dollars is still very little compared to the great cost-saving utility of telephonic computer systems. Unfortunately, computers are also stupid. Unlike human beings, computers possess the truly profound stupidity of the inanimate.
In the 1960s, in the first shocks of spreading computerization, there was much easy talk about the stupidity of computers -- how they could "only follow the program" and were rigidly required to do "only what they were told." There has been rather less talk about the stupidity of computers since they began to achieve grandmaster status in chess tournaments, and to manifest many other impressive forms of apparent cleverness.
Nevertheless, computers *still* are profoundly brittle and stupid; they are simply vastly more subtle in their stupidity and brittleness. The computers of the 1990s are much more reliable in their components than earlier computer systems, but they are also called upon to do far more complex things, under far more challenging conditions.
On a basic mathematical level, every single line of a software program offers a chance for some possible screwup. Software does not sit still when it works; it "runs," it interacts with itself and with its own inputs and outputs. By analogy, it stretches like putty into millions of possible shapes and conditions, so many shapes that they can never all be successfully tested, not even in the lifespan of the universe. Sometimes the putty snaps.
The stuff we call "software" is not like anything that human society is used to thinking about. Software is something like a machine, and something like mathematics, and something like language, and something like thought, and art, and information.... but software is not in fact any of those other things. The protean quality of software is one of the great sources of its fascination. It also makes software very powerful, very subtle, very unpredictable, and very risky.
Some software is bad and buggy. Some is "robust," even "bulletproof." The best software is that which has been tested by thousands of users under thousands of different conditions, over years. It is then known as "stable." This does *not* mean that the software is now flawless, free of bugs. It generally means that there are plenty of bugs in it, but the bugs are well-identified and fairly well understood.
There is simply no way to assure that software is free of flaws. Though software is mathematical in nature, it cannot by "proven" like a mathematical theorem; software is more like language, with inherent ambiguities, with different definitions, different assumptions, different levels of meaning that can conflict.
Human beings can manage, more or less, with human language because we can catch the gist of it. Computers, despite years of effort in "artificial intelligence," have proven spectacularly bad in "catching the gist" of anything at all. The tiniest bit of semantic grit may still bring the mightiest computer tumbling down. One of the most hazardous things you can do to a computer program is try to improve it -- to try to make it safer. Software "patches" represent new, untried un- "stable" software, which is by definition riskier.
The modern telephone system has come to depend, utterly and irretrievably, upon software. And the System Crash of January 15, 1990, was caused by an *improvement* in software. Or rather, an *attempted* improvement. As it happened, the problem itself -- the problem per se -- took this form. A piece of telco software had been written in C language, a standard language of the telco field. Within the C software was a long "do... while" construct. The "do... while" construct contained a "switch" statement. The "switch" statement contained an "if" clause. The "if" clause contained a "break." The "break" was *supposed* to "break" the "if clause." Instead, the "break" broke the "switch" statement.
That was the problem, the actual reason why people picking up phones on January 15, 1990, could not talk to one another.
Or at least, that was the subtle, abstract, cyberspatial seed of the problem. This is how the problem manifested itself from the realm of programming into the realm of real life. The System 7 software for AT&T's 4ESS switching station, the "Generic 44E14 Central Office Switch Software," had been extensively tested, and was considered very stable. By the end of 1989, eighty of AT&T's switching systems nationwide had been programmed with the new software. Cautiously, thirty- four stations were left to run the slower, less-capable System 6, because AT&T suspected there might be shakedown problems with the new and unprecedently sophisticated System 7 network.
The stations with System 7 were programmed to switch over to a backup net in case of any problems. In mid-December 1989, however, a new high-velocity, high- security software patch was distributed to each of the 4ESS switches that would enable them to switch over even more quickly, making the System 7 network that much more secure. Unfortunately, every one of these 4ESS switches was now in possession of a small but deadly flaw.
In order to maintain the network, switches must monitor the condition of other switches -- whether they are up and running, whether they have temporarily shut down, whether they are overloaded and in need of assistance, and so forth. The new software helped control this bookkeeping function by monitoring the status calls from other switches. It only takes four to six seconds for a troubled 4ESS switch to rid itself of all its calls, drop everything temporarily, and re-boot its software from scratch. Starting over from scratch will generally rid the switch of any software problems that may have developed in the course of running the system. Bugs that arise will be simply wiped out by this process. It is a clever idea. This process of automatically re-booting from scratch is known as the "normal fault recovery routine." Since AT&T's software is in fact exceptionally stable, systems rarely have to go into "fault recovery" in the first place; but AT&T has always boasted of its "real world" reliability, and this tactic is a belt-and-suspenders routine.
The 4ESS switch used its new software to monitor its fellow switches as they recovered from faults. As other switches came back on line after recovery, they would send their "OK" signals to the switch. The switch would make a little note to that effect in its "status map," recognizing that the fellow switch was back and ready to go, and should be sent some calls and put back to regular work. Unfortunately, while it was busy bookkeeping with the status map, the tiny flaw in the brand-new software came into play. The flaw caused the 4ESS switch to interacted, subtly but drastically, with incoming telephone calls from human users. If -- and only if -- two incoming phone-calls happened to hit the switch within a hundredth of a second, then a small patch of data would be garbled by the flaw. But the switch had been programmed to monitor itself constantly for any possible damage to its data. When the switch perceived that its data had been somehow garbled, then it too would go down, for swift repairs to its software. It would signal its fellow switches not to send any more work. It would go into the fault- recovery mode for four to six seconds. And then the switch would be fine again, and would send out its "OK, ready for work" signal.