In the New England Journal of Medicine, members of a team from Microsoft Research and OpenAI provide a high-level overview of GPT-4’s performance on medical question banks, but also explore additional avenues for LLM applications.3 In one example, a GPT-4-based chatbot answers a simulated patient’s questions regarding medications for diabetes. Unsurprisingly, considering the large body of available content regarding the diabetes medication metformin, GPT-4 provides a coherent and accurate informational response to a simple initial query. However, the example also demonstrates a common problem with LLMs termed “hallucination,” in which the output is grossly incorrect. In subsequent follow-up prompts, the GPT-4-based chatbot claims to have a master’s degree in public health, as well as personal family experience with diabetes. These spurious generations illustrate a fundamental limitation of LLMs: their outputs are not grounded in underlying reasoning and insight.
ACEP Now: Vol 42 – No 06 – June 2023
The effects of hallucinations become more apparent when these models are prompted to add rigor and explanation to their answers. A project based on a previous version of GPT, GPT-3.5, prompted the LLM to answer a set of 20 medical questions and submitted the responses to experts for content review.4 Across the 20 responses, the experts identified five major and seven minor factual errors. More interestingly, the authors also asked GPT-3.5 to supply references for its medical explanations. The LLM generated 59 supporting references, 70 percent of which the authors determined were outright fabrications. These fabrications were usually composed in proper journal citation format, sometimes borrowed author names from published articles relevant to the question prompt, and otherwise appeared legitimate, yet simple verification steps revealed them to be fiction. Although this experiment was performed using GPT-3.5, these errors persist in GPT-4.
Another proposed application for LLMs involves transcription and summarization services to reduce administrative overhead. The aforementioned New England Journal of Medicine article was also authored in part by representatives from Nuance Communications, in Burlington, Mass., highlighting this future product offering. In their example, the software passively listens to, and creates a transcript of, a conversation between a clinician and a patient. Following the encounter, the software summarizes the conversation in the form of a physician’s note. Generally speaking, the LLM creates a narrative matching the input, but it also hallucinates details not present in the original encounter. The example illustrates both the promise and the current limitations of the technology, which presently requires careful review to ensure spurious extrapolations have not been inserted into the medical documentation.