However in science there is a vast difference between an ‘implication’ and enough evidence to convince skeptical peers, especially if you are intending to re-write a significant part of Darwin’s history. So we decided to track down all Henslow’s herbarium sheets, read the data off them and store it in a database so that we could quantify the way he built the collection.
Given the fragile nature of the specimens, we decided to collect all the possible data in one go, thus reducing the handling of the specimens to a minimum. So, in addition to the obvious (plant name, collector’s name, location etc.), we also collect apparently tangential data such as the paper maker’s name, the date in the paper’s watermark, whether the name of the herbarium was printed or hand written and so on. This decision later proved to be crucial.
Choosing the database engine
As most developers will be aware the choice of database engine is often a religious rather than a rational decision. In addition, academics have a history of using open source engines. Not only are there some great open source engines out there, they are often highly cost-effective. However, best of all, they allow academics to retain, nay, demonstrate, their autonomy from the commercial world. So we elected to use Microsoft’s SQL Server.
Yes, I know it sounds weird. And it wasn’t really for the engine itself. After all, the data set we were collecting, whilst difficult to collect and very important to us, was completely trivial in database terms. We collected 45 data items in total (10 from each plant and 35 from each sheet). There were about 12,000 plants and 4,000 sheets, so certainly less than a third of a million data items in total. So it wasn’t the raw data processing power that influenced us, it was the BI tools that come with the engine.
Put it this way. We collected the data but we had no real idea where the research was going to take us. In the team we had a world-class Darwin scholar (Prof. David Kohn) and an equally eminent botanist/geneticist (Prof. John Parker). Their entire focus was on trying to understand how and why Henslow had put this herbarium together. Which meant that they could (and did) ask for any of those 45 data items to be correlated against any of the others. Questions like “Can we see the species that Henslow personally collected in 1828” and “How did the total number of species collected vary with time?” through to “Who was collecting with Henslow in 1831?”.
Formulating such queries in SQL is perfectly possible, but takes time and, crucially, breaks the flow: in order to allow a stream-of-thought analysis, we need to get the answers back within 5-10 seconds. If we can do that, the mind (academic or commercial) is free to follow the line of reasoning. So we collected the data in SQL Server and then created an OLAP cube in Analysis Services. We added a graphical user-interface (ProClarity) and presented that to John and David. It worked like a charm. John Parker: “Actually extracting the data took an immense amount of time and we were expecting the analysis to be equally time consuming. We are used to posing a question and then working for 3 to 4 hours in order to calculate the answer. Instead, once the cubes had been set up in SQL Server 2005, we were getting the answers back in real time. That was a transformation. Essentially the mechanics of querying the data disappeared, which meant that we could concentrate on the lines of thought that we wanted to follow.”