Humble Pi
Page 6
The LAPD fixed this with a time-honoured method for taking care of a sudden glut of criminals: it transported them to a distant island. Null Island.
Null Island is a small but proud island nation off the west coast of Africa. It’s located about 600 kilometres south of Ghana, and you can find it by putting its latitude and longitude into any mapping software of your choice: 0,0. Fun fact: its coordinates look like the facial expression of anyone deported there. For, you see, outside of databases, Null Island does not exist. It really does live up to its slogan: ‘Like No Place on Earth!’
Bad data is the scourge of databases, in particular when the data has been originally written down by fallible humans in their imprecise handwriting. Add to this the ambiguity of place names (for example, I had an office on Borough Road, and there are forty-two Borough Roads in the UK alone, not to mention two Borough Road Easts), and you have a roadmap to disaster. Whenever a computer cannot decipher a location, it still has to fill something in, and so 0,0 became the default location. The island where bad data goes to die.
Except cartographers took this seriously. Cartography was a rather antiquated and fusty old discipline until it was swept up by the modern tech revolution. Now, they have an audience for their own brand of humour. For generations, cartographers have been sneaking fictitious places into real maps (often as a way to expose people plagiarizing their work), and it was inevitable that Null Island would take on a life of its own. So, they literally put it on the map.fn2 If you believe their marketing material, Null Island has a thriving population, a flag, a department of tourism and the world’s highest per capita Segway ownership.
Even when the data has made it into a database, it is not safe … which brings us, finally, to Microsoft Excel. My opinion of Excel is a matter of public record: I’m a big fan. It’s a fantastic way to do a lot of calculations at the same time and it’s normally the first thing I reach for when I need to do a quick burst of arithmetic. But there is one thing Excel is not, and that is a database system. Yet it is frequently used as one.
There is something alluring about the up-front, easy-to-see rows of a spreadsheet that draws people in to use it to store data. I’m as guilty as anyone; many of my small projects that involve a bit of data are kept in spreadsheets. It’s just so easy. Superficially, Excel makes for a great data management system. But so many things can go wrong.
For a start, just because something walks like a number and quacks like a number does not mean it is a number. Some things which look like numbers are just not. Phone numbers are a perfect example: despite being made from digits, they are not actually numbers. When have you ever added two phone numbers together? Or found the prime factors of a phone number? (Mine has eight, four distinct.) The rule of thumb should be: if you’re not going to do any maths with it, don’t store it as a number.
In some countries phone numbers start with a zero. By default, normal numbers don’t have a leading zero on the front. Open up a spreadsheet and type ‘097’ and hit Enter. Instantly, the lead zero vanishes. That is quite a personal example, because several years ago I had a credit card where the three security digits on the back were 097 (the important words in this sentence are ‘several years ago’; I’m not falling for that one). Many a website would remove the leading zero as soon as I entered it then claim my card details did not match.
It gets worse for a phone number. If you enter the phone number 0141 404 2559, not only does the zero disappear but 1,414,042,559 is a really big number, well over a billion. So if you put it in Excel, it might switch the number over to a different way of writing numbers: scientific notation. I just pasted that number into a spreadsheet and now all I can see is 1.414E+9. Widening the column in this case will reveal the hidden digits. But if you do this with a longer number, those digits could be lost for ever.
Scientific notation separates the size of the number from what its specific digits are. Normally, the size of a number is indicated by how many digits it has (before a decimal point), but when we don’t know all the digits, or the digits simply are not important, then the number ends up being mostly zeros. In normal language, we already split the digits from a number’s size. For example: the universe is currently 13.8 billion years old. The important digits are 13.8 and ‘billion’ tells you how big the number is. Much better than writing out 13,800,000,000 and relying on the zeros to indicate the size.
Scientific notation just takes that a step further. In normal language, we like round multiples of millions and billions but, in science, the decimal point is moved all the way to the front and then the number of digits is specified. So the age of the universe is 1.38E+10. That E is actually a lazy way of writing an exponential: the universe is 1.38 × 1010 years old. For a very small measurement, a negative number is used for the size. A proton has a mass of 1.67E−27 kg. That’s much neater than writing 0.00000000000000000000000000167kg.
But a phone number has important digits all the way down. I would be happier if we called them ‘phone digits’ instead of ‘phone numbers’, because, I repeat, I don’t think they are numbers. If you’re ever not sure if something is a number or not, my test is to imagine asking someone for half of it. If you asked for half the height of someone 180 centimetres tall, they would say 90 centimetres. Height is a number. Ask for half of someone’s phone number, and they will give you the first half of the digits. If the response is not to divide it but rather to split it, it’s not a number.
As well as converting non-numbers into numbers, Excel will sometimes take things which are numbers and convert them into text. This mainly applies to unusual numbers which go beyond our normal base-10 counting system. Things like the base-2 binary of computers use a reduced set of digits; binary only needs 0 and 1. A number system needs as many digits as its base, which is why base-10 has the ten digits: 0 to 9. But once you go past base-10, there are no more ‘normal’ digits so letters get called into service.
BASE-10
POSITION VALUE: 10,000s 1,000s 100s 10s 1s
DIGIT: 1 9 5 2 7
BASE-16
POSITION VALUE: 4,096s 256s 16s 1s
DIGIT: 4 C 4 7
A quick check confirms:
4 × 4,096 + C (aka ‘12’) × 256 + 4 × 16 + 7 × 1 = 19,527
If you convert the base-10 number 19,527 to base-16, you get 4C47. Here, the C is not a letter, despite still looking like one; it is a digit. Specifically, it is the digit that represents a value of twelve. Just like 7 represents a value of, well, seven. When they ran out of digits, mathematicians realized that letters were a perfect source of more symbols and already have an agreed order. So they conscripted them into number service, much to the confusion of everyone else. Including Excel. If you try to use a letter as a digit in Excel, it reasonably assumes you’re typing a word, not a number.
The problem is that higher-base numbers are not just the plaything of mathematicians. If computers are obsessed with binary numbers, their next love is base-16 numbers. It’s really easy to convert between binary and base-16 hexadecimal numbers, which is why hexadecimal is used to make computer binary a bit more human-friendly; the hexadecimal 4C47 represents the full binary number 0100110001000111 but is much easier to read. You can think of hexadecimal as binary in disguise. It was used in the SQL injection example before, to hide computer code in plain sight.
The mistake is to try to store computer data which uses hexadecimal values in Excel, a mistake I’m as guilty of as anyone. I had to store hexadecimal values in a spreadsheet of people who had crowdfunded my online videos, and Excel immediately turned them all to text. Which will teach me for not using a real database, like a grown-up.
To be fair: it is halfway there. Excel has a built-in function that converts between base-10 and base-16 called DEC2HEX (if I ever start a boy band, I’m calling it Dec2Hex). If you type in DEC2HEX(19527), it will spit out 4C47 then immediately forget it is a number. If you want to add hexadecimal numbers together in Excel, or divide them, or do anything mathematical, you
need to convert back to base-10, do your maths work, then reconvert back.
If you really want to nerd out (too late! we’re doing it!), there is one special niche case where Excel completely breaks hexadecimal by, ironically, actually treating it as a number. But it’s the wrong type of number. For example, the number 489 in base-10 becomes 1E9 in hexadecimal, but when you enter 1E9 into Excel it sees the letter E between two numbers and realizes it’s seen that before. It’s scientific notation! Suddenly, your 1E9 has been replaced by 1.00E+09. From 489 to a billion in the blink of a format.
The same problem occurs in all instances with ‘hexadecimal numbers containing no alpha digits other than a single “e”, where the “e” is not the first or last digit’. That is the official definition at the On-Line Encyclopedia of Integer Sequences, which lists the first 99,999 such cases.fn3 Spoiler: the 100,000th is 3,019,017.
This is not only limited to us nerds using hexadecimal. I spoke off the record to a database consultant who was working with a company in Italy. They had a lot of clients and their database would generate a client ID for each one by using something like the current year, the first letter of the client company name and then an index number to make sure each ID was unique. For some reason their database was losing companies which started with the letter E. It was because they were using Excel and it was converting those client IDs to be a scientific notation number which was no longer recognized as an ID.
At the time of writing, there is no way to turn off scientific notation by default in Excel. Some would argue it is a simple change that would solve some major database problems for a lot of people. Even more people would argue that, realistically, it’s not Excel’s fault, as it should not be used as a database in the first place. But, if we’re being honest: it will be. And it can cause problems with far more prosaic features than scientific notation. Spell check is bad enough.
Gene and gone
I’m no biologist, but a light bit of online research has convinced me that my body needs the enzyme ‘E3 ubiquitin-protein ligase MARCH5’. Reading biology texts reminds me what it is like for other people reading maths: a string of words and symbols that looks like normal language yet transmits no new information to my brain as it parses them. If I fight back the need to process the actual content and just focus on the overall syntax, I can kinda get the gist of what the author is trying to say.
Collectively, our data indicate that the lack of MARCH5 results in mitochondrial elongation, which promotes cellular senescence by blocking Drp1 activity and/or promoting accumulation of Mfn1 at the mitochondria.
– From the 2010 research publication Journal of Cell Science. Roughly translated: you need this stuff.
Thankfully, on the tenth human chromosome is the gene which encodes for the production of this enzyme. The gene has the catchy name of MARCH5, and if you think that looks a lot like a date, then you can already see where this is going. Over on your first chromosome, the gene SEP15 is busy making some other important protein. Type those gene names into Excel and they’ll transform into Mar-05 and Sep-15. Encoded as 01/03/2005 and 01/09/2015 (if you dig into the Excel data on a UK version), all mention of MARCH5 and SEP15 has been obliterated.
Do biologists use Excel much to process their data? Is the phosphoglycan C-terminal?! Yes! (Well, I think it is. It was that or ‘Do BPIFB1 genes secrete in the woods?!’ but I wasn’t confident about that one either. I’m way beyond my limit of biological knowledge trying to look up even obvious microbiology things.) Look: the point is yes. Cell biologists use Excel a lot.
In 2016 three intrepid researchers in Melbourne analysed eighteen journals which had published genome research between 2005 and 2015 and found a total of 35,175 publicly available Excel files associated with 3,597 different research papers. They wrote a program to auto-download the Excel files then scan them for lists of gene names, keeping an eye out for where they had been ‘autocorrected’ by Excel into something else.
After checking through the offending files manually to remove false positives, the researchers were left with 987 Excel spreadsheets (from 704 different pieces of research) which had gene name errors introduced by Excel. In their sample, they found that 19.6 per cent of gene research crunched in Excel contained errors. I’m not sure of the exact impact of having gene names disappearing from your database, but I think we can safely assume it’s not good.
These problems generally boil down to figuring out what kind of thing a piece of data is. For example, 22/12 could be a number (22 ÷ 12 = 1.833 …), a date (22 December) or a piece of text (just the characters 22/12). So a database has to store not only the data but also metadata, that is, data about the data. As well as every entry having a value, it is also defined as a type. Which is why I can say – again – that phone numbers should not be stored as a number.
In Excel, similar distinctions are possible, but they’re far from intuitive and far from easy to work with. The default settings for a new spreadsheet are not fit for purpose when it comes to scientific research. When the gene autocorrect research came out, Microsoft was asked for comment and a spokesperson said, ‘Excel is able to display data and text in many different ways. Default settings are intended to work in most day-to-day scenarios.’
Such a great quote. It comes with a heavily implied ‘gene research is not a day-to-day scenario’. (Like an exasperated nurse explaining to someone missing several fingers that opening a bottle of beer is not a day-to-day-scenario use for an axe.) I like to imagine the Microsoft spokesperson delivering their reply in a press conference while someone behind the scenes has to physically restrain the Microsoft Access team, which is an actual database system. Through the walls can be heard muffled cries of ‘Tell them to use a real database LIKE AN ADULT.’
The end of the spreadsheet
Another limitation of spreadsheets as databases is that they eventually run out. Much like computers having trouble keeping track of time in a 32-digit binary number, Excel has difficulty keeping track of how many rows are in a spreadsheet.
In 2010 WikiLeaks presented the Guardian and The New York Times with 92,000 leaked field reports from the war in Afghanistan. Julian Assange delivered them in person to the Guardian offices in London. The journalists quickly confirmed that they seemed to be real but, to their surprise, the reports ended abruptly in April 2009, when they should have gone through to the end of that year.
You guessed it: Excel counted its rows as a 16-bit number, so there was a maximum of 2^16= 65,536 rows available. So when the journalists opened the data in Excel, all the data after the first 65,536 entries vanished. New York Times journalist Bill Keller described the secret meeting where this was noticed and how ‘Assange, slipping naturally into the role of office geek, explained that they had hit the limits of Excel.’
Excel has since been expanded to a maximum 2^20 = 1,048,576 rows. But that is still a limit! Scrolling down in Excel can feel like it goes on for ever, but if you drag down for long enough you will eventually hit the end of the spreadsheet. If you’d like to go to visit the void at the end of the rows, I can confirm it takes about ten minutes of maximum-speed scrolling.
Behold the empty void at the edge of the Excel world.
When the spreadsheet hits the fan
On the whole, doing any kind of important work in a spreadsheet is not a good idea. They are the perfect environment for mistakes to spawn and grow unchecked. The European Spreadsheet Risks Interest Group (yes, that is a real organization, one dedicated to examining the moments when spreadsheets go wrong) estimates that over 90 per cent of all spreadsheets contain errors. Around 24 per cent of spreadsheets that use formulas contain a direct maths error in their computations.
They are able to arrive at such an oddly specific percentage because, occasionally, an entire company’s worth of spreadsheets escape at once. Dr Felienne Hermans is an assistant professor at Delft University of Technology, where she runs their Spreadsheet Lab. I love the idea of a spreadsheet lab: columns instead of coll
imators; if-statements instead of incubators. She was able to analyse one of the biggest corpuses of real-world spreadsheets ever captured.
In the aftermath of the Enron scandal of 2001, the Federal Energy Regulatory Commission published the results of its investigation into the corporation and the evidence behind it – which included around 0.5 million emails from within the company. There were some concerns about publishing the emails of employees who had nothing to do with the scandal, so a sanitized version taking employee concerns into account is now available online. It provides a fantastic insight into how email is used within such a large company. And, of course, they were emailing a lot of spreadsheets as attachments.
Hermans and her colleagues searched through the email archive and were able to assemble a corpus of 15,770 real-world spreadsheets as well as 68,979 emails pertaining to spreadsheets. There is some selection bias because these spreadsheets were from a company being investigated for poor financial reporting, which is a shame. But it was still an incredible snapshot of how spreadsheets are actually used in the real world, as well as the way in which emails showing those spreadsheets were discussed, passed around and updated. Here is what Hermans discovered:
The average spreadsheet was 113.4 Kilobytes.
The biggest spreadsheet was an impressive 41 Megabytes. (I bet it was a birthday invite with embedded sound files and animated gifs. Makes me shudder just thinking about it.)
On average, each spreadsheet had 5.1 worksheets within it.
One spreadsheet had 175 worksheets! Even I think that is too much and that it needs an SQL.
The spreadsheets had an average of 6,191 non-empty cells each, of which 1,286 were formulas (so 20.8 per cent of cells used formulas to do a calculation or move data around).
6,650 spreadsheets (42.2 per cent) did not contain a single formula. Come on. Why even use a spreadsheet?