6 December 2016

Bringing Kew's Archive Alive

How digital data produced by Kew's Directors' Correspondence team is being brought to life and can be used to visualise the British Empire's 19th Century trade networks

Letter from the Natal manuscript papers

TRADING CONSEQUENCES is a Digging into Data project that analyses how automatic text mining of large quantities of historical text can assist environmental historians in researching the effects of 19th century trade in the British Empire. The text mining technology recognises mentions of commodities, locations, diseases, disasters and dates in historical text. It also enriches this information, for example by geo-referencing the extracted locations and identifying which commodity mentions are related to which location mentions. Visualising the mined information in different ways lets us provide new views of historical collections which, so far, historians have mostly been able to access only through keyword search.

One of the collections we are processing in TRADING CONSEQUENCES is the Directors’ Correspondence Collection from the Archives at Kew Gardens. It contains hand-written scientific letters and memoranda received by Kew’s Directors and senior staff from the 1840s to 1928, as well as correspondence received by Sir William Jackson Hooker prior to 1841. It provides first-hand accounts and observations on botany, ethnobotany, history, natural history, science and politics around the world. In TRADING CONSEQUENCES, we are working with letters specifically relevant to Africa, Asia and Latin America. We are not processing the letters themselves but the metadata attached to each document, in particular a written summary of the content of each piece of correspondence.

This collection contains metadata files for more than 24,000 letters and is accessible via JSTOR Global Plants. Other historical text collections which we process in TRADING CONSEQUENCES include the House of Commons Parliamentary Papers from ProQuest, the Early Canadiana Online data archive, Adam Matthew’s Confidential Print collections, a sub-part of the Foreign and Commonwealth Office Collection from JSTOR, and a number of books relevant to trading in the 19th century.

Text Mining

The text mining is developed by computer scientists at the School of Informatics at the University of Edinburgh. We first convert the metadata from Excel into an in-house XML format, creating one XML file per letter. We treat the title and description of each letter as textual information and retain all other information, including creator (i.e. the author of the letter) and date of creation (i.e. when the letter was written), as metadata. Each file is then processed in a series of steps. First, the stream of text is automatically split into words and sentences. Then several syntactic processing steps are carried out, for example to determine the lexical category of each word (noun for cinnamon, verb for imported, preposition for through, adjective for fresh etc.) or to determine the canonical form of each word (e.g. export for exported or exports). Subsequently, we extract all commodity, location, date, disease and disaster mentions from the text. This is done in various ways, depending on the type of entity mention. In the case of commodities, we use a manually created commodity ontology and combine it with an automated bootstrapping technique to identify other commodity mentions in the text. We also geo-reference each extracted location mention with an adapted version of the Edinburgh Geoparser, which links it to a latitude and longitude. Finally, we extract commodity-location relations whenever a commodity is associated in some way with a location. All this information is stored in the TRADING CONSEQUENCES database.
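To make these steps more concrete, here is a minimal Python sketch of the general idea, not the project's actual code. It assumes a tiny hand-made commodity list and gazetteer in place of the commodity ontology, the bootstrapping step and the Edinburgh Geoparser, and it treats a commodity and a location as related whenever they occur in the same sentence. The function names, the coordinates and the example summary are all invented for illustration.

```python
import re

# Hypothetical toy resources. The real project uses a commodity ontology with
# bootstrapping and an adapted Edinburgh Geoparser instead of these lookups.
COMMODITIES = {"rubber", "coffee", "cotton", "bamboo", "cinchona"}
GAZETTEER = {
    "ceylon": (7.9, 80.8),    # approximate coordinates, for illustration only
    "natal": (-29.0, 30.5),   # approximate coordinates, for illustration only
}


def split_sentences(text):
    """Split text into sentences after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Lower-case a sentence and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def extract_relations(text):
    """Return (commodity, location, (lat, long)) for mentions sharing a sentence."""
    relations = []
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        commodities = [t for t in tokens if t in COMMODITIES]
        locations = [t for t in tokens if t in GAZETTEER]
        for commodity in commodities:
            for location in locations:
                relations.append((commodity, location, GAZETTEER[location]))
    return relations


# Invented summary text, not taken from the collection.
summary = "Cinchona seeds have been sent to Ceylon. Rubber is doing well."
print(extract_relations(summary))
```

In this sketch the second sentence yields no relation, because it mentions a commodity but no known location. A simple lookup like this would miss multi-word names such as Liberian coffee and any commodity not already on the list, which is the kind of gap the ontology and bootstrapping described above are meant to address.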

Visualising the data

The database allows us to query for all commodities that were associated with different locations as mentioned in the historical collections analysed. We can also search for a particular commodity with respect to dates or locations, or for all commodities mentioned in relation to a specific location. For the following analysis, we extracted all commodities mentioned in the Directors’ Correspondence Collection and identified a subset of frequently mentioned ones (rubber, palm, coffee, cotton, bamboo, Liberian coffee). For each commodity in this subset, we extracted all commodity-location relations along with the year of the letter they occur in and the latitude and longitude of each location. The result is a list of “year,commodity,location[lat,long]” triples which can be visualised on a timeline or map. We identified 360 triples for rubber, 276 for coffee, 176 for palm, 164 for cotton, 63 for Liberian coffee and 51 for bamboo. A further step counts identical triples, allowing us to display the more frequent occurrences with larger symbols.
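The counting step can be sketched in a few lines of Python. This is an illustrative assumption about how such a step might look, not the project's code: the triples below are invented sample values, and the `symbol_size` scaling rule (a base size plus a fixed increment per repeat) is a hypothetical choice.

```python
from collections import Counter


def count_triples(triples):
    """Count how often each identical (year, commodity, location, (lat, long)) occurs."""
    return Counter(triples)


def symbol_size(count, base=4.0, step=2.0):
    """Hypothetical scaling rule: a larger symbol for more frequent triples."""
    return base + step * (count - 1)


# Invented sample triples, not real data from the collection.
triples = [
    (1875, "Liberian coffee", "Ceylon", (7.9, 80.8)),
    (1875, "Liberian coffee", "Ceylon", (7.9, 80.8)),
    (1880, "rubber", "Natal", (-29.0, 30.5)),
]

for triple, count in count_triples(triples).items():
    year, commodity, location, (lat, long) = triple
    print(year, commodity, location, lat, long, "size:", symbol_size(count))
```

Each triple is stored as a tuple so that it can be used as a dictionary key, which is what makes counting identical triples straightforward.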

The following video shows all the locations with which each of the six commodities is associated in the Directors’ Correspondence Collection over time. The yellow dots represent all locations mentioned in this collection over time, irrespective of whether they are related to any commodity. These yellow dots provide an interesting mapping of the British Empire during the 19th century and show how the reach of Kew Gardens expanded well beyond the formal empire. Look, for example, at the particular interest in South America during the first few decades. We know economic botanists helped identify and transfer numerous South American plants, such as cinchona and rubber, so they could be grown on British plantations in places like Sri Lanka (Ceylon). Visualising locations from 24,000 letters, however, provides new insights into the scale of this project. (It will look best if you expand the video.)

Liberian Coffee

The second video focuses on coffee and Liberian coffee. When coffee rust disease started to spread between the world's coffee-growing regions during the second half of the 19th century, economic botanists worked to find alternative crops. In this video we see letters mentioning Liberian coffee appear frequently after 1873, following the identification of this alternative type of coffee. While this example only confirms the history of coffee production we already know, it does demonstrate the potential of using text mining to explore large collections of documents.

Future developments

In the near future historians and interested members of the public will be able to explore the TRADING CONSEQUENCES database through a dynamic visualisation website. The following screenshot is a sneak preview for this website, which is currently being developed by visualisation experts at the University of St. Andrews. In TRADING CONSEQUENCES, we process a number of different historical collections. The visualisation shown in the image below is limited to the Kew Gardens’ Directors’ Correspondence Collection. The image shows a map with bubbles in locations associated with the commodity Liberian coffee. The Seychelles and Sri Lanka are the most significant locations for this commodity. A timeline with the distribution of relevant documents per decade is shown underneath the map.

As in the video, the commodity Liberian coffee appears around 1870. Any commodities related to Liberian coffee, i.e. ones that appear in the same summary of the original letter, are listed on the right-hand side of the page. The titles of the 50 most relevant documents containing mentions of the commodity Liberian coffee are listed in order of relevance at the bottom of the screen. Each document title links back to the original images on JSTOR Global Plants.

- The Trading Consequences team: Bea Alex, Jim Clifford and Uta Hinrichs -   
