Taxa Project
Extracting species occurrences from a corpus of student papers
Motivations for Project
Climate change is driving rapid changes in our biosphere on local and global scales. Our capacity to understand these shifts relies entirely upon two critical things: long-term biological and environmental observations, and an ability to discover and access them. Species Occurrence records are foundational to understanding biodiversity in ecosystems and help researchers track adaptation and the effects of climate change.
Knowledge bases that gather observations from around the world including species, location and time of the event provide a more complete picture of historical and geographic changes in biodiversity. Not surprisingly, observations from the past recorded on paper are often missing from these knowledge bases because they are hard to come by. Libraries up and down the Pacific coast hold collections of undergraduate student papers with observations of marine plants and animals “hidden” in the text. Reading and extracting those observations by hand is an effort libraries cannot afford to undertake.
Goals for Project
The goal of this project is to employ natural language processing, machine learning, and data visualization to amplify the work of librarians in identifying and verifying these observations.
To develop a dataset of species occurrences (a species at a given place at a specific time), we need to identify and extract scientifically relevant entities from the corpus. These entities include:
- Taxonomic names
- Location(s) (named; “Hopkins Beach”, “Fisherman’s Cove”) and/or latitude/longitude
- A date when the species was observed
Optionally, as supporting information, we can extract habitat type(s) (e.g., “subtidal”) from the text, too.
Data Inputs
Student Papers spanning the West Coast of North America
The Stanford Taxa Project is part of a larger effort (called Data Over Decades) with librarian partners up and down the western coast of North America. We each hold large collections of student research reports that contain observations of environmental conditions, species, and populations recorded over a span of at least nine decades. As we seek funding to support the digitization and preservation of collections from partner institutions, the Taxa Project explores the potential of computational methods to amplify the work of librarians in identifying and verifying species observations in student papers from Hopkins Marine Station of Stanford University (HMS). We hold 746 HMS student papers that were created from 1963 - 2011. Of these, 672 are open access, and 74 are restricted to use by Stanford-affiliated folks only.
Potential corpus
The undergraduate student research papers available for the Data Over Decades project number into the thousands once more papers in these collections have been fully digitized (see Table 1). These papers were created annually, over several decades, as the capstone project for students taking an applied research course at their respective research station. For the Taxa Project, we are using the set of open access papers from Hopkins Marine Station as our initial corpus.
Institution | Approx. papers | Date Range | Digital | OCR’d | Open Access |
---|---|---|---|---|---|
Bamfield Marine Sciences Centre (consortium) | 5,426 papers | 1973 - present | No | No | No |
Friday Harbor Labs, University of Washington | 5,000 papers; 2011-2020, 480 papers in hand | 1946 - present | Pre-2011 not scanned; 2011+ born-digital | 2011-2020, 480 papers | Pre-2011, No; 2011+ Yes |
Hatfield Marine Science Center, Oregon State University | 140 + 170 papers | 1966 - present | Yes | No | Yes |
Bodega Marine Laboratory, UC Davis | 3,800 papers | 1928 - present | No | No | No |
Moss Landing Marine Labs, San Jose State University | 690 theses | 1968 - present | Yes | Yes | Yes |
UC Berkeley (research conducted at Hopkins) | 239 papers | 1947 - 1952 | No | No | No |
Hopkins Marine Station, Stanford University | 778 papers | 1963 - 2011 | Yes | Yes | 672 Yes; 106 No |
Wrigley Marine Science Center, USC, UCLA, Carleton College (corpus held at HMS Library) | 300 papers | 1970s - 2000 | Yes | Yes | No |
UC Santa Cruz (corpus held at HMS Library) | 59 papers | 1973 - 2000 | Yes | Yes | No |
Table 1. A list of project partners in the Data Over Decades project, with an approximate number of papers at each site.
Google Map of library partners in the Data Over Decades project.
Student Papers - From Paper to Digital
We partnered with the Stanford Libraries Digital Production Group (DPG) to scan all of the Hopkins Marine Station student papers. The DPG is staffed by highly skilled professional imaging specialists and student assistants, and they have well-established workflows for capturing and processing digital content, and accessioning it into the Stanford Digital Repository (SDR). Their scanning process generated, for each page, a high-resolution JPEG 2000 for preservation, a medium resolution jpg for the catalog media viewer interface, an ALTO XML file, and a text-searchable PDF of the full paper. These files are all stored in Stanford Digital Repository (SDR) for preservation and access purposes.
For this phase of the project we will be working with the PDFs, but we plan to incorporate an open software image OCR step into our workflow so it will be more directly portable to our community of practice.