The Biodiversity Heritage Library (BHL), a consortium of natural history and botanical libraries from around the world, has a collection of over 40 million pages of text describing and organizing life on earth. It’s currently the only freely available, authoritative information source for the majority of species we know about on our planet.
The problem: most of it isn’t all that accessible digitally.
“BHL search capabilities available on the site are currently limited, and are only useful for finding information about a small number of known species,” says Evangelos Milios of Dal’s Faculty of Computer Science. “These manuscripts are important for biologists because they capture the state of life on earth in history and are the only sources we have about species from 200 or 300 years ago.”
Together with Anatoliy Gruzd, director of Dal’s Social Media Lab, Dr. Milios is co-lead of a Dal team that’s helping rebuild this history of biodiversity for the digital age.
The project is called “Mining Biodiversity” (MiBio), and Dal’s team is one of 14 from around the world named winners of the third Digging into Data Challenge, a competition to develop new insights and tools that show how big data is changing research in the humanities and social sciences. Each team represents a collaboration among scholars, scientists, and information professionals from leading universities and libraries.
Extracting the next generation of information
In collaboration with the National Centre for Text Mining at the University of Manchester and the Biodiversity Heritage Library (BHL), MiBio will aim to transform the BHL into a next-generation social digital library resource. It’ll also provide the online library with a semantic search system to help researchers and the public study scientific documents on biodiversity.
Dr. Milios says the team will work together to enrich a large-scale digital library, improve access to biodiversity-related written artifacts via an enhanced search engine, and stimulate increased collaboration and exchange of information amongst BHL users via a social media environment. The project will integrate text mining, visualization, crowdsourcing and social media into the BHL.
“Dal has capacities in supporting technologies of this project,” says Dr. Milios. “Our expertise in text mining and visualization is complementary to that of our partner, the National Centre for Text Mining (NaCTeM) in the UK.”
Other members of the Dal team include faculty members Stan Matwin, Vlado Keselj and Stephen Brooks, all from the Faculty of Computer Science.
A key challenge of the project will be the way descriptions of biodiversity, as recorded in old natural history books, have shifted over time. This brings terminology issues to the fore, as scientists have changed the meaning of taxonomic names over the years. For example, the word “reptile” originally included both what we think of today as reptiles (e.g., turtles, lizards, snakes) and also what we now call amphibians (e.g., frogs, toads, salamanders).
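To make the “reptile” example concrete, here is a minimal Python sketch of how a search tool might account for a term whose meaning changed. It is not the MiBio team’s method. The names `Sense`, `SENSES` and `expand_query` are hypothetical, and the cutoff years are placeholders chosen for illustration, not historical dates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sense:
    """One meaning of a term, valid for a range of publication years."""
    term: str
    start_year: int
    end_year: int          # inclusive
    modern_groups: tuple   # modern categories the term covered in that period


# Placeholder data: the year boundaries are illustrative assumptions only.
SENSES = [
    Sense("reptile", 1700, 1850, ("reptiles", "amphibians")),
    Sense("reptile", 1851, 9999, ("reptiles",)),
]


def expand_query(modern_group, year):
    """Return the terms to search for in a book published in `year`
    when the reader is interested in `modern_group`."""
    terms = {modern_group}
    for sense in SENSES:
        in_period = sense.start_year <= year <= sense.end_year
        if in_period and modern_group in sense.modern_groups:
            terms.add(sense.term)
    return sorted(terms)


# A search for amphibians in an older book also looks for "reptile".
print(expand_query("amphibians", 1800))
# In a later book, "reptile" no longer covers amphibians.
print(expand_query("amphibians", 1900))
```

Under these assumptions, a query about amphibians in an early volume is widened to include the older, broader word, while the same query against a later volume is left alone.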
“In a digital archive such as BHL, tracking terminology evolution over time is crucial for search,” says Dr. Milios. “Another aspect of the project will include the correction of the text we obtain from the digitization and optical character recognition of the old books and how to link the terminology extracted from the text with standard taxonomic resources in the field.”
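The two tasks Dr. Milios describes, correcting text from optical character recognition and linking names to a standard taxonomic resource, can be sketched together in a few lines of Python. This is an illustration under stated assumptions, not the project’s actual pipeline: the `TAXON_IDS` table, its identifiers and the `link_name` helper are hypothetical, and the fuzzy matching uses Python’s standard `difflib` module rather than any tool named in the article.

```python
import difflib

# Hypothetical stand-in for a standard taxonomic resource.
# The identifiers are invented for illustration.
TAXON_IDS = {
    "Rana temporaria": "taxon:0001",
    "Bufo bufo": "taxon:0002",
    "Testudo graeca": "taxon:0003",
}


def link_name(ocr_name, cutoff=0.8):
    """Map a possibly misread name to the closest known taxon.

    Returns (corrected_name, taxon_id), or None when nothing in the
    reference list is similar enough to be a plausible match.
    """
    matches = difflib.get_close_matches(
        ocr_name, list(TAXON_IDS), n=1, cutoff=cutoff
    )
    if not matches:
        return None
    name = matches[0]
    return name, TAXON_IDS[name]


# OCR often misreads "m" as "rn"; the fuzzy match recovers the name.
print(link_name("Rana ternporaria"))
# A name absent from the reference list is left unlinked.
print(link_name("Quercus robur"))
```

The `cutoff` value controls how tolerant the match is. A lower value would catch more scanning errors but risks linking a name to the wrong species.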
Social sharing
A key component of the project is the development of a social media environment, to allow BHL users to discuss, link and share digital artifacts posted to social media sites linked to the BHL search portal. The outcome would transform the BHL from a traditional digital library into a social digital library.
“The exciting part for my students and I at the Social Media Lab is to try and figure out how to turn legacy science documents into ‘social’ digital objects that can be easily shared among researchers and the public via social media,” says Dr. Gruzd. “Our expectation is that this will help make these earlier biodiversity documents and artifacts more accessible and will help raise the public awareness of how our planet’s biodiversity has changed over time.”
Before the social media environment is launched, Drs. Gruzd and Milios and the rest of their team will spend the next eighteen months analyzing thousands and thousands of pieces of digitized literature.
“This project really demonstrates the strength and advanced research at Dal in the areas of social media, text mining and visualization,” says Dr. Gruzd. “It also demonstrates that our research is being recognized internationally.”
The 14 teams that prevailed in the Digging into Data competition have received a total of $5.1 million in grants from a group of 10 international research-funding agencies from Canada, the Netherlands, the United Kingdom and the United States. The Mining Biodiversity project will receive approximately $125,000 in funding from the Social Sciences and Humanities Research Council of Canada, and $125,000 from the Natural Sciences and Engineering Research Council of Canada.