Taxa Project

Extracting species occurrences from a corpus of student papers

Motivations for Project

Climate change is driving rapid changes in our biosphere on local and global scales. Our capacity to understand these shifts depends on two things: long-term biological and environmental observations, and the ability to discover and access them. Species occurrence records are foundational to understanding biodiversity in ecosystems, and they help researchers track adaptation and the effects of climate change.

Knowledge bases that gather observations from around the world, including the species, location, and time of each event, provide a more complete picture of historical and geographic changes in biodiversity. Not surprisingly, past observations recorded on paper are often missing from these knowledge bases because they are hard to come by. Libraries up and down the Pacific coast hold collections of undergraduate student papers with observations of marine plants and animals “hidden” in the text. Reading and extracting those observations by hand is an effort libraries cannot afford to undertake.

Goals for Project

The goal of this project is to employ natural language processing, machine learning, and data visualization to amplify the work of librarians in identifying and verifying these observations.

To develop a dataset of species occurrences (a species at a given place at a specific time), we need to identify and extract scientifically relevant entities from the corpus. These entities include:

Taxonomic names
Location(s) (named; “Hopkins Beach”, “Fisherman’s Cove”) and/or latitude/longitude
A date when the species was observed

Optionally, as supporting information, we can extract habitat type(s) (e.g., “subtidal”) from the text, too.
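As a minimal sketch of what one extracted record might look like, the entities above can be grouped into a single structure. The class name, field names, and completeness rule below are assumptions made for illustration, not the project's actual schema, and the example values are invented.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class SpeciesOccurrence:
    """One species observed at a place and time (hypothetical schema)."""

    taxon_name: str                       # taxonomic name as written in the paper
    locality: Optional[str] = None        # named place, e.g. "Hopkins Beach"
    latitude: Optional[float] = None      # decimal degrees, if the paper gives them
    longitude: Optional[float] = None
    observed_on: Optional[date] = None    # date the species was observed
    habitats: List[str] = field(default_factory=list)  # optional, e.g. ["subtidal"]

    def has_location(self) -> bool:
        """True if a named place or a full coordinate pair is present."""
        has_coords = self.latitude is not None and self.longitude is not None
        return bool(self.locality) or has_coords

    def is_complete(self) -> bool:
        """True if the record has a taxon, a location, and a date."""
        return (
            bool(self.taxon_name)
            and self.has_location()
            and self.observed_on is not None
        )


# Invented example values, for illustration only.
record = SpeciesOccurrence(
    taxon_name="Pisaster ochraceus",
    locality="Hopkins Beach",
    observed_on=date(1975, 5, 12),
    habitats=["subtidal"],
)
```

A check such as `record.is_complete()` could help a librarian separate records that are ready for verification from those that still lack a place or a date.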

Data Inputs

Student Papers spanning the West Coast of North America

The Stanford Taxa Project is part of a larger effort (called Data Over Decades) with librarian partners up and down the western coast of North America. We each hold large collections of student research reports that contain observations of environmental conditions, species, and populations recorded over a span of at least nine decades. As we seek funding to support the digitization and preservation of collections from partner institutions, the Taxa Project explores the potential of computational methods to amplify the work of librarians in identifying and verifying species observations in student papers from Hopkins Marine Station of Stanford University (HMS). We hold 746 HMS student papers that were created between 1963 and 2011. Of these, 672 are open access, and 74 are restricted to Stanford affiliates.

Potential corpus

The undergraduate student research papers available to the Data Over Decades project will number in the thousands once more of these collections have been fully digitized (see Table 1). These papers were created annually, over several decades, as the capstone project for students taking an applied research course at their respective research station. For the Taxa Project, we are using the set of open access papers from Hopkins Marine Station as our initial corpus.

Institution	Approx. papers	Date Range	Digital	OCR’d	Open Access
Bamfield Marine Sciences Centre (consortium)	5,426 papers	1973 - present	No	No	No
Friday Harbor Labs, University of Washington	5,000 papers; 2011-2020, 480 papers in hand	1946 - present	Pre-2011 not scanned; 2011+ born-digital	2011-2020, 480 papers	Pre-2011, No; 2011+ Yes
Hatfield Marine Science Center, Oregon State University	140 + 170 papers	1966 - present	Yes	No	Yes
Bodega Marine Laboratory, UC Davis	3,800 papers	1928 - present	No	No	No
Moss Landing Marine Labs, San Jose State University	690 theses	1968 - present	Yes	Yes	Yes
UC Berkeley (research conducted at Hopkins)	239 papers	1947 - 1952	No	No	No
Hopkins Marine Station, Stanford University	778 papers	1963 - 2011	Yes	Yes	672 Yes; 106 No
Wrigley Marine Science Center, USC, UCLA, Carleton College (corpus held at HMS Library)	300 papers	1970s - 2000	Yes	Yes	No
UC Santa Cruz (corpus held at HMS Library)	59 papers	1973 - 2000	Yes	Yes	No

Table 1. A list of project partners in the Data Over Decades project, with an approximate number of papers at each site.

Google Map of library partners in the Data Over Decades project.

Student Papers - From Paper to Digital

We partnered with the Stanford Libraries Digital Production Group (DPG) to scan all of the Hopkins Marine Station student papers. The DPG is staffed by highly skilled professional imaging specialists and student assistants, and it has well-established workflows for capturing and processing digital content and accessioning it into the Stanford Digital Repository (SDR). For each page, the scanning process generated a high-resolution JPEG 2000 image for preservation, a medium-resolution JPEG for the catalog media viewer interface, and an ALTO XML file. For each paper, it also generated a text-searchable PDF. These files are all stored in the SDR for preservation and access.
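ALTO XML stores recognized words in `String` elements (in the `CONTENT` attribute), grouped under `TextLine` elements. The following is a minimal sketch of pulling plain text out of one such file using only the Python standard library. The function names and the sample document are assumptions made for illustration. Because we do not state which ALTO version the scans use, the sketch ignores XML namespaces, and it does not handle hyphenation (`HYP`) elements.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip an XML namespace, so '{ns}String' becomes 'String'."""
    return tag.rsplit("}", 1)[-1]


def alto_to_text(alto_xml: str) -> str:
    """Return page text with one output line per ALTO TextLine."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Each recognized word is a String child with a CONTENT attribute.
        words = [
            child.attrib.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(words))
    return "\n".join(lines)


# Tiny hand-written sample, not taken from the real corpus.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Collected"/><SP/><String CONTENT="at"/>
    </TextLine>
    <TextLine>
      <String CONTENT="Hopkins"/><SP/><String CONTENT="Beach"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

print(alto_to_text(SAMPLE))
```

Working from ALTO rather than the PDF keeps the original line structure and leaves room to use word coordinates later, at the cost of reading one file per page.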

For this phase of the project, we will work with the PDFs, but we plan to incorporate an open-source image OCR step into our workflow so that it is more directly portable to our community of practice.