Some days ago this article from BBC fell into my attention (https://www.bbc.com/news/uk-54422505), which explains how data of positive identified COVID-19 infections, in Britain, were lost into oblivion, due to the fact that the format of “.xls” was used to gather the data. Don’t get me wrong here, this article is not a roast for the “.xls” format.
Let me get into details on what happened in this case. The Public Health England (PHE) was gathering the result logs of the COVID-19 swab tests performed by commercial firms in a text format. The PHE developers had set-up an automated procedure to import those logs in a Microsoft Excel template, in order to later on post those records to a central database for the various government experts and organizations to have access and process the data. The problem is, that the PHE developers used the “.xls” format. One could say “what is wrong with that? I ‘ve been working with .xls files my entire life and never had a problem.” Well, I tend to agree, but things are more technical here, than it is just all Excel files “.xls” or “.xlsx”. That little “x” in the end makes all the difference. The “.xls” format is the default format used in Excel versions 97 up to 2003. The .xlsx format was introduced in Excel versions 2007 and is the default format until now. Apart from the low-level differences of the .xls being a binary file and the .xlsx being in its core a text file in XML format, the key difference that caused our big problem here is the actual limitations of the .xls format. The older standard has a limit of 65,536 rows, whereas the .xlsx format has a nominal limitation of 1,048,576, which in our case is a big difference. For the most people working with Excel Spreadsheets the above limits do not mean much. But when we are talking about data collected from an entire country, this is another story. So, what actually happened, is that when the files were reaching their limits, no more data could be read in. This resulted in 15,841 cases between 25 September and 2 October to be left out of the overall procedure of posting and processing. The data were not entirely lost, as they could be re-inserted after checking which were missing. The problem is -in such cases when we are dealing with the spread of a pandemic- that the data are time-critical. It is of utmost importance to have the data available at the time that they are useful (eg. same day) and thus figure out after the epidemiologists process, what are the spread rates and what measures should be taken.
The problem that we want to stress out here, is not that the .xls format is flawed and should be banned for life, but as every tool (electronic or physical), it is not suited for all the cases. People tend to use what they know better, even for cases that it is not the best selection or in many cases the second from worst. Yes, you can tighten a screw with a knife, but the torque applied, is merely a fraction of the torque of a screwdriver and if we are talking about security, that screw will fall-off on the road and someone might eventually get hurt. In many cases the selection of the right tool might demand a time-consuming research, but depending on the seriousness of the project it is worth the cost. If the PHE developers had taken the extra time and effort to craft a different method for performing the same task, such a serious problem would have been avoided. In greek, there is a saying that goes roughly like “Better secure the donkey, than go looking for it afterwards”, which applies to many things in life and perhaps in the most cases where IT projects and Software developing is involved. If you invest more in the design phase you will worry much less in the debugging phase.
So, there is no need to rush into the implementation of the first solution that comes to mind. Ask yourself some questions before deciding: “Are there any problems with this solution? Are there any limitations? Do these limitations apply in my case?”. Then you can make a so-called informed decision of which action to take.