experiments

pip install pandas

Getting to know LLMs and LangChain

A foundational element of using LangChain as a wrapper for large language models (LLMs) like GPT-4 is the prompt. A prompt in the LLM context is just like a prompt in the usual sense: a thoughtfully designed question meant to elicit a response. If you want to use an LLM to explore text, it is critically important to design an effective prompt that helps the model generate accurate and helpful responses. We are exploring the use of LLMs to help us “read” undergraduate student research papers in marine science and determine whether a paper contains a species occurrence. That is, did the student observe or collect a given species at a given place during the course of their research? If they did, that information is a species occurrence (species + place + date).
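To make the idea of a species-occurrence prompt concrete, here is a minimal sketch of a reusable prompt template. The wording of the prompt, the name `OCCURRENCE_PROMPT`, and the helper `build_prompt` are illustrative assumptions, not the prompts used in these experiments; the sketch uses only the Python standard library rather than LangChain's own prompt classes.

```python
from string import Template

# Hypothetical prompt wording, written for illustration only.
OCCURRENCE_PROMPT = Template(
    "A species occurrence is a record of a species observed or collected "
    "at a given place on a given date.\n"
    "Using only the excerpt below, answer:\n"
    "1. Does the paper report a species occurrence? (yes/no)\n"
    "2. If yes, list each species, place, and date. Write 'not stated' "
    "for anything the excerpt does not give.\n\n"
    "Excerpt:\n$excerpt\n"
)


def build_prompt(excerpt: str) -> str:
    """Fill the template with one excerpt of paper text."""
    return OCCURRENCE_PROMPT.substitute(excerpt=excerpt.strip())


# Made-up placeholder text, not taken from any student paper.
print(build_prompt("We collected Species X at Site Y on 1 July 2020."))
```

Asking for "not stated" when a field is missing is one way to discourage the model from guessing a place or date; whether it helps in practice would need to be checked against real papers.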

To explore the potential for using LLMs in this work, we selected a few online tools that are designed to help the user ask questions about text provided to the application. We picked a few student papers at random (all open access) and iterated on a series of questions to learn how to engineer prompts that might give us the information we need to determine if a paper includes a species occurrence. This was our process and what we found.

Tools we tried

Chat My Data, ChatPDF, Ask My PDF

The first question we gave to each chat tool was, “What is this paper about?”

Results

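import pandas as pd
from IPython.display import display

# First three rows, first four columns of the results file
df = pd.read_csv('data.csv', nrows=3, usecols=[0, 1, 2, 3])
# Chain the style calls so the header width cap is not discarded
styled_df = (
    df.style
    .set_table_styles([dict(selector="th", props=[('max-width', '100px')])])
    .set_properties(**{'text-align': 'left'})
)
display(styled_df)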
[0] Model: PDF GPT
Paper: fhl_2014_Charifson_34622.pdf; Snail Predation by Hemigrapsus nudus and Hemigrapsus oregonensis: Sex and Species Differences in Chela Size
What is this paper about?
This paper is about the potential for character displacement in the chela size of two sympatric shore crab species, Hemigrapsus nudus and Hemigrapsus oregonensis. It was found that the two-way ANOVA with factors sex and species did not show a significant species effect, suggesting that the claws of both shore crabs produce roughly the same force and are functionally equivalent (Behrens Yamada and Boulding 1998). Additionally, the consumption rates of H. nudus for Littorina sitkana and Lacuna vincta were found to be 1.8 and 13.03/individual crab/day respectively, and there was no potential for character displacement of chela size in the sympatric species. There were also no differences in consumption rates between male and female H. nudus. [Page no. 8], [Page no. 10], [Page no. 14].
Summarize the paper concisely with reference to materials and methods.
The paper discussed the morphometrics and consumption rates of Hemigrapsus oregonensis and Hemigrapsus nudus, two closely related crabs that can be distinguished by external features. Male and female crabs were measured with digital calipers for carapace width, propal height and propal width [Page no. 5]. Consumption rates of 3 female (Fe1 to Fe3) and 3 male (Ma1 to Ma3) crabs were also calculated [Page no. 21]. SMA regressions of carapace width and propus measures were also presented in a Table [Page no. 11].

[1] Model: Chat Your Data
Paper: fhl_2014_Charifson_34622.pdf; Snail Predation by Hemigrapsus nudus and Hemigrapsus oregonensis: Sex and Species Differences in Chela Size
What is this paper about?
The paper is about an experimental study that investigates the role of highly mobile crab predators in the intertidal zonation of their gastropod prey. It also examines claw morphology, prey size selection, and foraging efficiency in generalist and specialist shell-breaking crabs. The study received financial support from various organizations and was conducted at Friday Harbor Laboratories with permission to collect organisms
Summarize the paper concisely with reference to materials and methods.
The paper acknowledges the support and facilities provided by Friday Harbor Laboratories, The Libbie Hyman Scholarship, Society for Integrative and Comparative Biology, and the Stony Brook Department of Ecology and Evolution. However, the specific materials and methods used in the study are not mentioned in the provided context

[2] Model: Ask My PDF
Paper: fhl_2014_Charifson_34622.pdf; Snail Predation by Hemigrapsus nudus and Hemigrapsus oregonensis: Sex and Species Differences in Chela Size
What is this paper about?
The paper is about the relationship between propus size and the use of snails as prey in two species of crabs, Hemigrapsus nudus and Hemigrapsus oregonensis. It also investigates the potential for character displacement in chela size between the two species. (Page 2)
Summarize the paper concisely with reference to materials and methods.
The paper investigates the relationship between propus size and snail predation in Hemigrapsus nudus and Hemigrapsus oregonensis. The study uses standardized major axis regression to analyze the morphometric parameters of the crabs. The researchers also analyze prey selection by H. nudus using G-tests and one-way ANOVA. The results show that carapace width is a good predictor of propal height and width in both male and female crabs of both species. There is sexual dimorphism in chela size, but no detectable difference in chela size between the two species. The study suggests that there is no potential for character displacement in chela size between H. nudus and H. oregonensis.

Comparing results for initial questions across the three tools.
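A comparison like this is easier to extend when every tool's answers are recorded in the same shape. The sketch below shows one way a results file with the columns seen above (Model, Paper, then one column per question) could be assembled. The function `responses_to_csv` and the record keys `model`, `paper`, and `answers` are hypothetical names chosen for illustration, and the sample record holds placeholder text rather than real tool output.

```python
import csv
import io

QUESTIONS = [
    "What is this paper about?",
    "Summarize the paper concisely with reference to materials and methods.",
]


def responses_to_csv(records, questions=QUESTIONS):
    """Return CSV text with one row per (tool, paper) and one column per question."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["Model", "Paper", *questions])
    writer.writeheader()
    for record in records:
        row = {"Model": record["model"], "Paper": record["paper"]}
        for question in questions:
            # Leave the cell empty if a tool was not asked this question
            row[question] = record["answers"].get(question, "")
        writer.writerow(row)
    return buffer.getvalue()


# Placeholder record, not real output from any of the tools.
example = [{
    "model": "Tool A",
    "paper": "paper.pdf",
    "answers": {QUESTIONS[0]: "placeholder answer"},
}]
print(responses_to_csv(example))
```

Keeping one column per question means a new prompt variant becomes a new column, so answers to the same question stay side by side across tools.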

Testing Architectures

Having experimented with prompts across some pre-made conversational tools, we next explore different combinations of tools and methods for five pipeline stages: (1) load, (2) transform (text splitting), (3) embed, (4) store, and (5) retrieve (vector store query). We came up with four main options (below), with some possible variations (see the yellow arrows).
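For readers new to these stages, the following toy sketch walks through all five using only the Python standard library. It is a stand-in for illustration, not any of the options compared here: the bag-of-words "embedding", the in-memory list used as a store, and every function name are assumptions made for the example, and the sample sentences are invented.

```python
import math
import re
from collections import Counter

# Invented sample text, standing in for text extracted from a PDF.
SAMPLE = (
    "Crabs were measured with calipers. "
    "Snails were offered as prey. "
    "Funding came from a scholarship."
)


def load(text):
    # Stage 1, Load: a real pipeline would extract text from a PDF here
    return text


def split(text, chunk_size=2):
    # Stage 2, Transform: group sentences into chunks of `chunk_size`
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [" ".join(sentences[i:i + chunk_size])
            for i in range(0, len(sentences), chunk_size)]


def embed(chunk):
    # Stage 3, Embed: toy word-count vector, a stand-in for a real embedding model
    return Counter(re.findall(r"[a-z]+", chunk.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def build_store(chunks):
    # Stage 4, Store: keep (vector, chunk) pairs in a list instead of a vector database
    return [(embed(c), c) for c in chunks]


def retrieve(store, query, k=1):
    # Stage 5, Retrieve: rank stored chunks by cosine similarity to the query
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]


store = build_store(split(load(SAMPLE), chunk_size=1))
print(retrieve(store, "Which prey were offered?"))
```

Each option compared here swaps a real component into one or more of these slots, which is why the stages are worth keeping separate even in a sketch.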

neato`
digraph {
    nodesep = 0.5;
    labelloc = "b"
    fontname = Arial
    node [
        shape = rectangle
        width = 1.5
        color = lightgray
        style = filled
        fontname = "Helvetica,Arial,sans-serif"
    ]
    edge [
        len = 2
        penwidth = 1.5
        arrowhead = open
        color = darkgray
    ]
    start = regular
    normalize = 0

    subgraph cluster_0 {
        style = filled;
        color = deepskyblue;
        node [style=filled,color=white];
        SentenceTransformers -> SentenceTransformerEmbeddings -> Annoy -> MultiQueryRetriever;
        label = "Option #1";
    }

    subgraph cluster_1 {
        style = filled;
        color = yellowgreen;
        node [style=filled,color=white];
        TikToken -> OpenAIEmbeddings -> FAISS -> ContextualCompression;
        label = "Option #2";
    }

    subgraph cluster_2 {
        style = filled;
        color = orange;
        node [style=filled,color=white];
        NLTK -> LlamaCpp -> Chroma -> ChromaSelfQuerying;
        label = "Option #3";
    }

    subgraph cluster_3 {
        style = filled;
        color = deeppink;
        node [style=filled,color=white];
        SpaCY -> SpaCYEmbeddings -> Chroma2 -> ChromaSelfQuerying2;
        label = "Option #4";
    }

    subgraph cluster_4 {
        margin = 40
        style = filled;
        color = gray;
        node [style=filled,color=white];
        rankdir = LR;
        Stuffing, Refine, MapReduce, MapReRank;
    }

    // Every option starts from the same loader
    source -> PyMuPDF;
    PyMuPDF -> SentenceTransformers;
    PyMuPDF -> TikToken;
    PyMuPDF -> NLTK;
    PyMuPDF -> SpaCY;

    // Every retriever can feed every node in the gray cluster
    MultiQueryRetriever -> Stuffing [color = lightblue]
    MultiQueryRetriever -> Refine [color = lightblue]
    MultiQueryRetriever -> MapReduce [color = lightblue]
    MultiQueryRetriever -> MapReRank [color = lightblue]
    ContextualCompression -> Stuffing [color = lightblue]
    ContextualCompression -> Refine [color = lightblue]
    ContextualCompression -> MapReduce [color = lightblue]
    ContextualCompression -> MapReRank [color = lightblue]
    ChromaSelfQuerying -> Stuffing [color = lightblue]
    ChromaSelfQuerying -> Refine [color = lightblue]
    ChromaSelfQuerying -> MapReduce [color = lightblue]
    ChromaSelfQuerying -> MapReRank [color = lightblue]
    ChromaSelfQuerying2 -> Stuffing [color = lightblue]
    ChromaSelfQuerying2 -> Refine [color = lightblue]
    ChromaSelfQuerying2 -> MapReduce [color = lightblue]
    ChromaSelfQuerying2 -> MapReRank [color = lightblue]

    // Possible variations (yellow arrows)
    FAISS -> MultiQueryRetriever [color = yellow]
    Annoy -> ContextualCompression [color = yellow]
    LlamaCpp -> FAISS [color = yellow]
    OpenAIEmbeddings -> Annoy [color = yellow]

    source [shape=Msquare];
}
`
neato = require("@observablehq/graphviz@0.2")
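The diagram implies a grid of runs: each of the four options can be paired with each node in the gray cluster, which presumably correspond to LangChain's document-combining chain types. A minimal sketch of enumerating that grid follows. The stage labels are transcribed from the diagram, with the llama.cpp embedding node written as LlamaCpp and the diagram's "2" suffixes dropped, since they only keep the Option #4 nodes distinct. The names `OPTIONS`, `CHAIN_TYPES`, and `experiment_grid` are hypothetical, and the yellow-arrow variations are left out.

```python
from itertools import product

# (splitter, embeddings, vector store, retriever), read off the diagram
OPTIONS = {
    "Option #1": ("SentenceTransformers", "SentenceTransformerEmbeddings",
                  "Annoy", "MultiQueryRetriever"),
    "Option #2": ("TikToken", "OpenAIEmbeddings", "FAISS", "ContextualCompression"),
    "Option #3": ("NLTK", "LlamaCpp", "Chroma", "ChromaSelfQuerying"),
    "Option #4": ("SpaCY", "SpaCYEmbeddings", "Chroma", "ChromaSelfQuerying"),
}
CHAIN_TYPES = ("Stuffing", "Refine", "MapReduce", "MapReRank")


def experiment_grid(options=OPTIONS, chain_types=CHAIN_TYPES):
    """Return one dict per (option, chain type) pairing."""
    grid = []
    for (name, stages), chain in product(options.items(), chain_types):
        splitter, embeddings, store, retriever = stages
        grid.append({
            "option": name,
            "splitter": splitter,
            "embeddings": embeddings,
            "store": store,
            "retriever": retriever,
            "chain": chain,
        })
    return grid


grid = experiment_grid()
print(len(grid))  # 4 options x 4 chain types = 16
```

Listing the runs up front makes it easier to record which combinations have been tried and to add the yellow-arrow variations as extra rows later.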