During the news gathering process, reporters often struggle to understand complex, jargon-heavy documents, particularly in fields like science and technology. For instance, a tech reporter writing a story about a new AI study may have difficulty making sense of specialized terms and concepts from the field. This can create friction as the reporter reads the document to determine what might be a newsworthy angle to cover about the study.
To address this challenge, we propose using large language models (LLMs) to identify and define jargon terms within scientific abstracts. These models can be leveraged to wrangle and transform textual input into different formats (e.g., news headlines or summaries based on news article text), and may lend themselves well to the use case of simplifying complex terms, especially if a definition is available but needs rewriting into more easily understood words. Such systems could also support the contextualization of news articles for readers, as demonstrated by BBC News Labs a few years ago.
This blog post walks through how we built and conducted a preliminary evaluation of a prototype to test out this idea, and describes what we learned in the process. We hope this can help others engaged in similar projects, and extend to systems that support journalists’ sense-making of documents in other domains, such as complex legal or medical documents.
Retrieval-augmented generation (RAG) is an approach that can enable this kind of simplification of complex terms. It allows a user to input a “query” (i.e., a prompt, a question, or even simply a jargon term), which is then matched against text in a reference document that can help derive a simplified definition. For instance, a jargon term like “precision metric” from a scientific abstract about a novel AI model will likely appear in several sentences within the full text of the scientific article. (Note: in this post we use “article” to refer to a scientific article rather than a news article.) RAG relies on finding these matching sentences or text snippets and supplying them to an LLM, along with a prompt instructing how the snippets should be used, e.g., to create a summary or to generate a readable definition.
We made two assumptions when designing this prototype with a RAG approach: (1) that the retrieved snippets would actually be informative for creating a definition of the jargon term, and (2) that even if the RAG output had minor errors, a human would verify the supplied definition if it caught their interest.
By employing such a RAG approach with GPT-4, we designed a prototype system to provide reporters with clear, concise, and accurate definitions of complex terms. We also designed the prototype to personalize the identification of jargon terms based on a reader’s knowledge level, making it easier for journalists with differing levels of scientific knowledge to parse these articles (e.g., a general-interest reporter might need something different from a seasoned science reporter working her specific beat). This prototype was constructed and evaluated in the lab using a sample of 64 peer-reviewed articles published on arXiv Computer Science in March 2024.
The prototype is built as a web app with a list of scientific articles, displaying each article’s metadata and abstract. Jargon terms are highlighted within the abstracts, allowing users to hover over them for instant definitions. Additionally, users can click to access a comprehensive list of all jargon terms from the abstract, along with their definitions. A search bar allows users to find articles of interest, and filters allow users to sift through specific categories of the articles as well (we focus on topics in AI, Human Computer Interaction, Computing and Society for now).
Building this prototype entailed work on two main problems:
We tackled the challenge of identifying jargon terms in scientific articles using GPT-4, with a prompt template that allowed users to specify their level of scientific expertise. We used this prompt to generate a tailored list of jargon terms for that user.
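To make this concrete, here is a minimal sketch of how such a prompt template can be wired up with the OpenAI Python client. The prompt wording and the identify_jargon helper are illustrative assumptions rather than our exact template.

```python
# Minimal sketch of personalized jargon identification (illustrative, not our exact prompt).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_TEMPLATE = """You are helping a reporter read a scientific abstract.
The reporter describes their background as: {expertise}

List the terms in the abstract below that this reporter would likely consider jargon,
one term per line.

Abstract:
{abstract}"""

def identify_jargon(abstract: str, expertise: str) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(
            expertise=expertise, abstract=abstract)}],
        temperature=0,
    )
    # The prompt asks for one term per line, so split the reply accordingly.
    return [t.strip() for t in response.choices[0].message.content.splitlines() if t.strip()]
```

A general-interest reporter might describe their expertise as "I cover local news and have no technical background," while a science reporter could describe their specific beat in more detail; the same template serves both.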
We evaluated this with two annotators who manually identified jargon terms in the articles in our sample dataset, and who also provided their own levels of expertise to the prompt template by describing their knowledge in natural language. The overlap between the resulting sets of jargon terms (for each individual article and each individual annotator) was used to understand how well GPT-4 performs at this task. Worth noting here is that the two annotators had differing levels of scientific expertise, and this was visible in how one annotator consistently identified more jargon terms per abstract than the other.
We find that GPT-4 shows promise in identifying jargon terms, but it tends to identify more terms than the human annotators. It captured most human-identified terms, but it also labeled many words as jargon that the annotators did not consider jargon. In information retrieval jargon, this equates to high recall but low precision.
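For readers less familiar with these metrics, here is a small worked example (with made-up terms) showing how recall and precision follow from the overlap between model-identified and human-identified jargon.

```python
# Made-up example: recall and precision for one abstract.
model_terms = {"precision metric", "transformer", "ablation study", "fine-tuning"}
human_terms = {"ablation study", "fine-tuning"}

true_positives = model_terms & human_terms
recall = len(true_positives) / len(human_terms)     # 2/2 = 1.0 -> the model caught every human-identified term
precision = len(true_positives) / len(model_terms)  # 2/4 = 0.5 -> half the model's terms were not jargon to the human
print(f"recall={recall:.2f}, precision={precision:.2f}")
```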
Interestingly, GPT-4 does maintain the relative difference in expertise between the two annotators, identifying more terms as jargon for the less expert annotator. This suggests potential for personalization based on readers’ knowledge levels, and aligns with similar recent findings as well.
Once the jargon terms were identified, the next challenge was to provide clear, concise, and accurate definitions. To support our RAG approach, we sourced relevant snippets of text for a given jargon term from the complete text of the source article. These were obtained by computing the similarity of the jargon term to individual snippets of text from the article, and returning the snippets that exceeded a certain similarity threshold. We used cosine similarity to capture semantic relatedness, and chose a threshold of 0.3 (low to medium) to capture a wide range of relevant snippets while excluding irrelevant ones.
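Here is a rough sketch of that retrieval step. The embedding model (text-embedding-3-small) and the snippet list are illustrative assumptions; the 0.3 cutoff mirrors the threshold described above.

```python
# Sketch of snippet retrieval by cosine similarity (embedding model is an assumption).
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

def retrieve_snippets(term: str, snippets: list[str], threshold: float = 0.3) -> list[str]:
    vectors = embed([term] + snippets)
    term_vec, snippet_vecs = vectors[0], vectors[1:]
    # Cosine similarity between the jargon term and each snippet of article text.
    sims = snippet_vecs @ term_vec / (
        np.linalg.norm(snippet_vecs, axis=1) * np.linalg.norm(term_vec))
    return [s for s, sim in zip(snippets, sims) if sim >= threshold]
```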
We then used GPT-4 with a query prompt to generate simple and understandable definitions from the retrieved snippets. We also generated another set of definitions for comparison, solely based on providing the abstract to GPT-4, and asking it to infer a definition of a given jargon term based on the text of the abstract. Annotators rated both definitions for a given term on their accuracy, and then recorded a preference for one or the other (or a tie), based on their clarity and informativeness. Pairwise preferences such as these are often used in LLM evaluations.
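The two conditions can be sketched as follows. The prompt wording is a paraphrase of our instructions rather than the verbatim prompt, and the placeholder variables stand in for the retrieved snippets and the abstract text.

```python
# Sketch of the two definition conditions (prompt wording is a paraphrase).
from openai import OpenAI

client = OpenAI()

DEFINE_PROMPT = """Define the term "{term}" in one or two plain-language sentences
that a non-expert reader can understand, using only the context below.

Context:
{context}"""

def define_term(term: str, context: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": DEFINE_PROMPT.format(term=term, context=context)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# Placeholders: in the prototype these come from the retrieval step and the article metadata.
retrieved_snippets = ["...snippets returned by the retrieval step..."]
abstract_text = "...the article's abstract..."

rag_definition = define_term("precision metric", "\n".join(retrieved_snippets))  # RAG condition
abstract_definition = define_term("precision metric", abstract_text)             # Abstract-only condition
```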
Accuracy is calculated as the percentage of definitions rated accurate out of all definitions generated under an approach. Preferences are measured as a win percentage, based on the number of times one approach (Abstract vs. RAG) is preferred over the other.
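As a concrete illustration, both metrics can be computed from a table of per-definition ratings like this (the ratings below are made up, not our actual data):

```python
# Made-up ratings illustrating how accuracy and win percentage are computed.
ratings = [
    {"abstract_accurate": True, "rag_accurate": True,  "preference": "abstract"},
    {"abstract_accurate": True, "rag_accurate": False, "preference": "tie"},
    {"abstract_accurate": True, "rag_accurate": True,  "preference": "rag"},
    {"abstract_accurate": True, "rag_accurate": True,  "preference": "tie"},
]

n = len(ratings)
abstract_accuracy = 100 * sum(r["abstract_accurate"] for r in ratings) / n
rag_accuracy = 100 * sum(r["rag_accurate"] for r in ratings) / n
abstract_wins = 100 * sum(r["preference"] == "abstract" for r in ratings) / n
rag_wins = 100 * sum(r["preference"] == "rag" for r in ratings) / n  # ties make up the remainder
print(abstract_accuracy, rag_accuracy, abstract_wins, rag_wins)
```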
Surprisingly, GPT-4 with the abstracts performs slightly better than GPT-4 with RAG over the article text, both in terms of accuracy (96.6% vs. 93.5%) and win percentage (29.2% vs. 27.8%, with the rest being ties). This suggests that more context from the scientific article did not necessarily lead to higher accuracy or better understandability (i.e., our first assumption about using RAG, described above, was not met). A deeper investigation of the retrieved snippets and their actual relevance to the jargon terms may help us understand whether this is an issue with the quality of the context, or whether there are other causes, such as the similarity threshold we used. To apply RAG successfully, it is essential to explore and test different parameters; without this kind of careful experimentation, RAG on its own might not provide the desired results.
It may also be the case that the large size of GPT-4’s pre-training dataset enables it to draw on other sources to generate definitions. This can be as much a concern as a benefit, though: it can make it harder to override irrelevant information from the pre-training data, or to escape limitations stemming from the model’s training cutoff date.
We also found that the effectiveness of these approaches varied based on the reader’s expertise. For instance, the less experienced annotator found similar value in both methods (higher tie percentage), while the more expert reader noticed more differences. Further evaluation with a larger set of annotators may help to replicate and understand these differences.
This blog post describes a small, scoped experiment in using GPT-4 with RAG to generate definitions of jargon terms, in service of improving the experience of reading complex documents during the news gathering process. We find that GPT-4 performs fairly well at identifying jargon based on a reader’s expertise, although it does tend to over-predict a bit.
Contrary to expectations, we also find that GPT-4 with RAG over an article’s text performs a little worse, in terms of both the accuracy and the clarity/informativeness of generated definitions, than GPT-4 with just the context of the article’s abstract. This finding raises further questions worth considering when using LLMs to support text generation and transformation in the newsroom, including how to evaluate RAG-oriented systems. In addition, our exploration of including expertise in the prompt to identify jargon suggests a potentially valuable pattern for journalists seeking to emulate this experience in their own workflow: providing a natural-language description of expertise and knowledge in the domain can help steer the model to be more helpful in identifying jargon.
Ultimately, the utility of such a prototype is contingent on how users actually incorporate it into their workflows, and if it actually saves time and improves readers’ comprehension in practice. We would love to know if you have tested out such RAG-based tools in your own newsroom, and how they have been perceived, received, and used!
Sachita Nishal is a Ph.D. student in human-computer interaction and AI at Northwestern University. Eric Lee, an undergraduate computer science student at Northwestern, also contributed to this article. This piece originally ran on Generative AI in the Newsroom.