by Matt Parker
Having assumed the error report was navigation information, the best interpretation the onboard computer could come up with was that the rocket had suddenly swerved off to the side. So it did the logical thing in that situation and executed the rocket equivalent of steering wildly in the opposite direction. There was nothing wrong with the link between the onboard computer and the pistons which aimed its thrusters, so this command was followed, ironically making the rocket veer abruptly off to the side.
This was enough to spell doom for the Ariane 5 rocket. It would have hit the ground before too long. But, in the end, the high-speed manoeuvre partially ripped the booster rockets off the main rocket body, which is universally considered rather a bad thing. And so the onboard computer correctly decided to call it a day and deployed the self-destruct system, raining fragments of the four Cluster satellites all over the mangrove swamp below.
The final hole in the cheese is that the horizontal velocity sensor was not even needed during the launch. It was actually used to calibrate the rocket’s position pre-launch and not required at all during take-off. Except, when Ariane 4 launches were aborted before lift-off, it was a real pain to reset everything once the sensors were off. So it was decided to wait about fifty seconds into the flight before turning them off to make sure it had definitely launched. This was no longer required for the Ariane 5, but it lived on as a piece of vestigial code.
In general, reusing code without retesting can cause all sorts of problems. Remember the Therac-25 radiation therapy machine, which had a 256-roll-over problem and accidentally overdosed people? During the course of the resulting investigation it was found that its predecessor, the Therac-20, had the same issues in its software, but it had physical safety locks to stop overdoses, so no one ever noticed the programming error. The Therac-25 reused code but did not have those physical checks, so the roll-over error was able to manifest itself in disaster.
If there is any moral to this story, it’s that, when you are writing code, remember that someone may have to comb through it and check everything when it is being repurposed in the future. It could even be you, long after you have forgotten the original logic behind the code. For this reason, programmers can leave ‘comments’ in their code, which are little messages to anyone else who has to read their code. The programmer mantra should be ‘Always comment on your code.’ And make the comments helpful. I’ve reviewed dense code I wrote years before, to find the only comment is ‘Good luck, future Matt.’
Invaders of space
Programming is such a great combination of complexity and absolute certainty. Any one line of code is completely defined: a computer will do exactly what the code says. But determining the end result of a lot of code interacting is rather difficult, and this can make debugging code an emotional experience.
At the very bottom are what I call ‘level zero’ programming mistakes. This is where the line of code itself is wrong. Something as seemingly inconsequential as a forgotten semicolon can bring a whole program grinding to a halt. Languages use things like semicolons, brackets and line breaks to indicate the beginnings and ends of statements and will freak out if they are missing. Many a programmer has spent hours yelling at their screen because their code refuses to work at all, only to later discover they were missing an invisible tab.
These mistakes are the programming equivalent of typos. In 2006 a group of molecular biologists had to retract five research papers, including publications in Science and one in Nature, because of a mistake in their code. They had written their own program to analyse data about the structure of biological molecules. However, it was accidentally flipping some positive values to be negative, and vice versa, and this meant that part of the structure they published was the mirror image of the correct arrangement.
This program, which was not part of a conventional data processing package, converted the anomalous pairs (I+ and I-) to (F- and F+), thereby introducing a sign change.
– Retraction of ‘Structure of MsbA from E. coli’
A typo in a single line of code can do enormous damage. In 2014 a programmer was doing some maintenance on their server and wanted to delete an old back-up directory called something like /docs/mybackup/, but they accidentally typed it as /docs/mybackup / with an extra space. Below is what the full line they typed into their computer looked like. I cannot overstress this enough: do not type anything even remotely like this into your computer, as it can delete everything you love and hold dear.
sudo rm -rf --no-preserve-root /docs/mybackup /
sudo = super user do: tells the computer you are a superuser and it should do whatever you say without question
rm = remove: synonymous with ‘delete’
-rf = recursive, force: forces the command to run recursively across a whole directory
--no-preserve-root = nothing is sacred
So now, instead of deleting one directory called /docs/mybackup/, it was going to delete two of them: /docs/mybackup and /. The funny story about / is that it represents the root directory of the computer system; the absolute base-level directory which contains all other folders: / is basically the whole computer. There are several rm -rf stories online about people who have deleted everything on their computer or, in some cases, everything on an entire company’s computers. All because of a single typo.
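To see why one stray space matters, here is a harmless sketch using Python's standard shlex module, which splits a command line roughly the way a POSIX shell would. Nothing is ever run; the strings are only chopped into arguments, and I have left sudo and --no-preserve-root out on purpose.

```python
import shlex

# Splitting only: these strings are never executed as commands.
intended = shlex.split("rm -rf /docs/mybackup/")
typo = shlex.split("rm -rf /docs/mybackup /")

print(intended)  # the back-up directory arrives as a single argument
print(typo)      # the stray space leaves "/" as an argument of its own
```

The second list has one more entry than the first, and that extra entry is the root directory.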
I also consider mistakes to be level zero which are not true typos as such but more like translation issues. A programmer has in their head the steps they want the computer to do, but they need to translate them from human thought into a programming language the computer can understand. Mistakes in translation can render a statement incomprehensible. Like the Szechuan dish which sometimes appears translated on menus as ‘saliva chicken’. No one is going to order that. The original meaning of ‘mouth-watering chicken’ has been broken.
The concept of ‘equals’ can be translated into computer language as either = or ==. In many computer languages, = is a command to make things equal, whereas == is a question about whether things are equal. Something like cat_name = Angus will name your cat Angus, but cat_name == Angus will return True or False, depending on what the cat’s name already is. Use the wrong one and the code will break.
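A minimal Python sketch of the difference, using the cat from the paragraph above. The variable names is_angus and is_boris are mine, purely for illustration.

```python
cat_name = "Angus"               # one equals sign: name the cat Angus

is_angus = cat_name == "Angus"   # two equals signs: ask a question
is_boris = cat_name == "Boris"

print(is_angus, is_boris)
```

In Python, writing `if cat_name = "Angus":` is rejected as a syntax error before the program even starts. Some other languages will accept the equivalent line and quietly rename the cat instead.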
Some computer languages try to make your life as easy as possible by meeting you halfway and putting in some effort to understand what you were trying to say. Which is why, as a hobbyist programmer, I use Python: the friendliest of all the languages. After that are the languages which don’t make any concession if the coder makes mistakes, but at least they’re not malicious about it. These are the vast majority of your coding options: C++, Java, Ruby, PHP … and so on.
Then, of course, there are the languages which hate the very concept of humans. These are born because programmers think they are hilarious and that making deliberately unwieldy programming languages is almost a sport. The classic is a language called brainf_ck, which I’ve slightly censored here. I feel its official, polite-company name of ‘BF’ does not do it justice. In brainf_ck there are only eight possible symbols (> < + - [ ] . and ,), which means even the simplest programs look like this:
++++[>+++++<-]>-[>++++++>+++++>++<<<-]>-----.>++.<+++++++..+++.>.<----.>>+.<++++.<-.>.
While brainf_ck is often written off as a joke language, I think it is actually worth learning because it deals directly with the way a programming language stores and manipulates data. It is like interacting directly with the hard drive. Imagine a computer program looking at one single byte in the memory at a time: < and > move the point of focus left and right; + and - increase or decrease the current value; [ and ] are used to run loops; and . and , are the write and read commands. Which is all any computer program is ever doing; it’s just hidden behind other layers of translation.
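The eight symbols are simple enough that a whole interpreter fits on a page. What follows is a minimal sketch in Python, not a reference implementation: the function name run_bf, the tape length of 30,000 cells, the wrap-around at 256 and the choice to read a zero when the input runs out are all assumptions of mine (common conventions, but conventions all the same).

```python
def run_bf(code: str, input_text: str = "") -> str:
    """Run a brainf_ck program and return whatever it prints."""
    # Pair up the brackets first so loops can jump straight to their partner.
    jumps, stack = {}, []
    for pos, symbol in enumerate(code):
        if symbol == "[":
            stack.append(pos)
        elif symbol == "]":
            start = stack.pop()
            jumps[start], jumps[pos] = pos, start

    tape = [0] * 30000        # assumed tape length
    pointer = 0               # the one cell currently in focus
    pc = 0                    # position within the program text
    inputs = iter(input_text)
    output = []

    while pc < len(code):
        symbol = code[pc]
        if symbol == ">":
            pointer += 1
        elif symbol == "<":
            pointer -= 1
        elif symbol == "+":
            tape[pointer] = (tape[pointer] + 1) % 256   # cells roll over like bytes
        elif symbol == "-":
            tape[pointer] = (tape[pointer] - 1) % 256
        elif symbol == ".":
            output.append(chr(tape[pointer]))           # write the cell as a character
        elif symbol == ",":
            tape[pointer] = ord(next(inputs, "\0")) % 256   # read; zero when input runs out
        elif symbol == "[" and tape[pointer] == 0:
            pc = jumps[pc]                              # cell is zero: skip the loop
        elif symbol == "]" and tape[pointer] != 0:
            pc = jumps[pc]                              # cell is non-zero: go round again
        pc += 1                                         # any other character is ignored
    return "".join(output)


# Eight lots of eight is 64, plus one is 65, which is the character code for "A".
print(run_bf("++++++++[>++++++++<-]>+."))
```

Any brainf_ck program, including the one printed earlier, can be passed to run_bf as a string, provided the minus signs are plain hyphens.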
If you want a language which is just obfuscating for the hell of it, then Whitespace is your best bet. It ignores any visible characters in the code and processes only the invisible ones. So to code in Whitespace you can only use combinations of spaces, tabs and returns. And that is before we get to programming languages in which: you’re only allowed to use the word ‘chicken’; the code needs to be formatted like you’re ordering at a drive-through window; or everything is written as sheet music. I think, due to survivor bias, programmers tend to be a sadistic bunch who enjoy frustration.
Ignoring typos and languages which are deliberately out to hurt you, there is a whole class of programming errors which I consider ‘classic’ coding mistakes. They are easiest to spot in older programs, which were deliberately super-efficient to run on limited-power hardware. This caused the coders to get a bit creative, and that then led to some unexpected knock-on effects.
The people programming the Space Invaders arcade game were so worried about saving space in the limited ROM on the chip they tried to cut as many corners as they could. The efficiencies in Space Invaders led to a number of quirks, which were exploited by players, but some are so niche I don’t think any players even know about them, let alone utilize them. These lie in the grey area between outright programming error and unintended consequences.
During a game of Space Invaders the player could shoot at: the descending aliens, the occasional mystery ship that would fly across the top of the screen, and their own protective shields. The program would need to check if a shot fired hit anything important. Collision detection can be a difficult bit of code to write, and the programmers behind Space Invaders were looking for ways to simplify the process. They realized that all shots either hit something or go off the top of the screen.
So after each shot is fired the program waits to see if the bullet hits a mystery ship or goes off the screen. If neither of those happens, then it checks the y coordinate of the collision to see how high it was. If it is higher than the lowest alien, then it must have hit an alien: there are no other options. Only now does the ‘Which alien was hit?’ part of the code start up. It’s a bit like the SRI processor on the Ariane rockets: assumptions are made about what kind of data can reach it, and checks are only run when really needed.
The aliens are arranged in a grid with five rows of eleven aliens. To keep track of all fifty-five aliens, the program numbers them 0 to 54 and uses the formula of 11 × ROW + COLUMN = ALIEN to take the collision row (0 to 4) and column (0 to 10) and convert it into the number of the alien which was hit.
This all worked fine unless the player strategically shot all the aliens except the upper-left one. This is the alien in row 4, column 0, which means it is alien number 11 × 4 + 0 = 44. The player then watches alien 44 move from side to side, slowly descending, until it is about to hit the left side of the screen on its final pass, just above the player’s shields. At that moment the player shoots their own shield on the right-most side of the screen.
The game registers this as a hit within the grid of aliens and assumes an alien must have been hit. The shield is so far to the right it is where the twelfth column would have been, but the code doesn’t stop to check. It dutifully converts the collision’s horizontal coordinate into a column number and gets 11, outside the normal range of columns 0 to 10. Putting this incorrect column number into the formula gives 11 × 3 + 11 = 44 and the alien on the far side of the screen explodes.
Five-by-eleven grid overlay on the starting formation of aliens. A well-timed shot hits the shield where a twelfth column would be.
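The arithmetic behind the phantom hit can be sketched in a few lines of Python. This is only an illustration of the formula, not the original arcade code; the function names are hypothetical, and the checked version is my guess at what a range check might have looked like.

```python
COLUMNS = 11   # aliens per row
ROWS = 5       # rows of aliens


def alien_number(row: int, column: int) -> int:
    """The formula from the text: 11 x ROW + COLUMN, with no range check."""
    return COLUMNS * row + column


def alien_number_checked(row: int, column: int) -> int:
    """Hypothetical safer variant: refuse anything outside the grid."""
    if not (0 <= row < ROWS and 0 <= column < COLUMNS):
        raise ValueError(f"row {row}, column {column} is outside the grid")
    return COLUMNS * row + column


upper_left = alien_number(4, 0)    # the last surviving alien
phantom = alien_number(3, 11)      # a shield hit where a twelfth column would be

print(upper_left, phantom)         # both come out as 44
```

Two different positions produce the same alien number because column 11 of one row spills over into column 0 of the next row up. The check costs a couple of comparisons, which is exactly the sort of thing that gets trimmed when ROM space is tight.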
Okay, so that is not a groundbreaking mistake, but it shows you how even a system as simple as Space Invaders can end up in situations the programmers did not see coming. The original Space Invaders code was not commented, but there is an online project at computerarcheology.com to go through and comment about it all with modern notes. It’s a fun read. I enjoy any code which has comments like ‘Get alien status flag; Is the alien alive?’ I mean, any comments which are not a past version of me being a jerk are a bonus.
The 500-mile email
Being a system administrator, or sysadmin, for a large computer network is a daunting enough task without it being a computer network at a university in the late nineties. University departments can be a little touchy about their autonomy and, throw in the Wild West feel of the early web in the nineties, and it’s a recipe for complex disaster.
Thus it was with some trepidation that Trey Harris, a sysadmin for the University of North Carolina, took a phone call from the head of the statistics department sometime around 1996. They had a problem with their email. Some departments had decided to run their own email servers, including the statistics department, and Trey informally helped them out with keeping them going. Which meant that this was now, informally, his problem.
‘We’re having a problem sending email out of the department.’
‘What’s the problem?’
‘We can’t send mail more than 500 miles.’
‘Come again?’
The head of statistics explained that no one in the department could send email more than about 520 miles. Some emails sent to people within that distance still failed, but all emails going further than 520 miles definitely failed. This had apparently been going on for a few days, but they didn’t report it sooner because they were still gathering enough data to establish the exact distance. One of their geostatisticians was apparently making a very nice map of where email could and could not be sent to.
In disbelief, Trey logged in to their system and sent some test emails via their servers. Local emails and ones sent to Washington DC (240 miles), Atlanta (340 miles) and Princeton (400 miles) were all delivered fine. But emails to Providence (580 miles), Memphis (600 miles) and Boston (620 miles) all failed.
He nervously sent an email to a friend of his who he knew lived nearby, in North Carolina, but whose email server was in Seattle (2,340 miles). Thankfully, it failed. If the emails somehow knew the geographic location of their recipient, then Trey would have broken down in tears. At least the problem had something to do with the distance to the receiving server. But nothing in email protocols depended on how far the signal needed to go.
He cracked open the sendmail.cf file, a file which contains all the details and rules which govern how email is sent. Whenever an email is sent, it checks in with this file to get the instructions required to then be passed on to the actual email system responsible for the sending. It looked familiar because Trey had written it himself. Nothing was out of order; it should have worked nicely with the main Sendmail system.
So he checked the main department system (telnetted into the SMTP port, for those of you who want to follow along in excruciating detail) and was greeted by the Sun operating system. A bit of digging revealed that the statistics department had recently had their server’s copy of SunOS upgraded, and the upgrade came with a default version of Sendmail 5. Previously, Trey had set up the system to use Sendmail 8, but now the new version of SunOS had come barging in and downgraded it to Sendmail 5. Trey had written the sendmail.cf file assuming it would only ever be read by Sendmail 8.
Okay, if you glossed over during that, you can tune back in now. The short version is that the instructions for sending email had been written for a newer system and, when it was fed into an older system, it caused that classic problem yet again: a computer program trying to digest data that was not intended for it. One part of that data was the ‘timeout’ time, and in Sendmail 5’s indigestion it set it to the default value of zero.
If a computer server sends out an email and does not hear back, it needs to decide when to stop waiting and call it quits, accepting that that email is forever lost. This wait time was now set to be zero. The server would send the email and then immediately give up on it. Like parents who have converted their kid’s bedroom into a sewing room before they’ve even finished the journey to university.
Well, in practice, it would not be exactly zero. There would still be a processing delay within the program of a few milliseconds between the sending of the email and the system being able to officially abandon it. Trey grabbed some paper and did a few rough calculations. The college itself was directly connected to the internet, so emails could leave the system super-quick. The first delay in the signal would be hitting the router at the far end of the journey and a response being sent back.
If the receiving server was not under heavy load and could send the response back fast enough, the only remaining limit was the speed of the signal. Trey factored in the speed of light in fibre optics for the return journey, along with router delays, and it dropped out at just over 500 miles one-way. The emails were being limited by the finite nature of the speed of light.
This also explained why some emails were failing within the 500-mile radius: the receiving servers were too slow to get a signal back before the sending system stopped listening. A simple reinstall of Sendmail 8, and the sendmail.cf config file was once again being read correctly by the mail server.
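Trey’s back-of-the-envelope sums can be roughed out in Python. This is a sketch under stated assumptions rather than his actual working: I am assuming light in fibre travels at about two-thirds of its speed in a vacuum, the overhead_s parameter is a stand-in for router and server delays of unknown size, and both function names are made up for the occasion.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458   # in a vacuum
KM_PER_MILE = 1.609344
FIBRE_FRACTION = 2 / 3              # assumption: light in fibre is roughly this much slower


def max_one_way_miles(wait_s: float, overhead_s: float = 0.0,
                      fibre_fraction: float = FIBRE_FRACTION) -> float:
    """Furthest server whose reply can get back before the sender stops listening."""
    travel_s = max(wait_s - overhead_s, 0.0)            # time left for the signal itself
    round_trip_km = SPEED_OF_LIGHT_KM_S * fibre_fraction * travel_s
    return round_trip_km / 2 / KM_PER_MILE              # halve it: there and back


def min_wait_seconds(one_way_miles: float, overhead_s: float = 0.0,
                     fibre_fraction: float = FIBRE_FRACTION) -> float:
    """Shortest wait that still lets a reply arrive from that far away."""
    round_trip_km = 2 * one_way_miles * KM_PER_MILE
    return round_trip_km / (SPEED_OF_LIGHT_KM_S * fibre_fraction) + overhead_s


# With no overhead at all, a 500-mile reply needs roughly 8 milliseconds.
print(min_wait_seconds(500) * 1000, "ms")
```

Any extra delay at the far end eats into the time available for travel, which is why a slow server inside the radius could still miss the deadline.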
This goes to show that, even though some sysadmins see themselves as gods on Earth, they still have to obey the laws of physics.
Human interactions
In 2001 I was turning on my cobbled-together Windows machine which had almost got me through my university years, and (on the BIOS load screen) there it was, in white, chunky text on the black background:
Keyboard error or no keyboard present
Press F1 to continue, DEL to enter SETUP.
I had heard about the family of ‘No keyboard detected, press any key to continue’ error messages, but had never seen one in the wild. I ran to get my housemate so he could come and see it as well. It was the talk of the house for days to come (okay, my memory may have inflated the experience slightly). Error messages are a constant source of entertainment in the tech world.