Can AI Critically Appraise Medical Research?

By Ken Milne, MD | December 31, 2024
Skeptics' Guide to EM

The collaborative human–AI approach outperformed individual LLMs, reaching accuracies of 96 percent for PRISMA and 95 percent for AMSTAR when humans and LLMs worked together.


Key Results


LLMs alone performed worse than humans; however, a collaborative approach between humans and LLMs showed potential for reducing the workload for human raters by identifying high-certainty ratings, especially for PRISMA and AMSTAR (see table).
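To make the workload-reduction idea concrete, here is a minimal sketch in Python of such a triage workflow. Everything in it is illustrative: the `ItemRating` structure, the model-reported confidence score, and the 0.90 cutoff are assumptions for the sketch, not the certainty measure the study actually used.

```python
from dataclasses import dataclass

# Hypothetical structure for one checklist-item rating produced by an LLM.
@dataclass
class ItemRating:
    item: str          # e.g., a PRISMA or AMSTAR checklist item
    verdict: str       # "yes" / "no" / "partial"
    confidence: float  # model-reported certainty, 0.0-1.0 (assumed)

# Assumed cutoff: ratings at or above this certainty skip human review.
HIGH_CERTAINTY = 0.90

def triage(ratings):
    """Split LLM ratings into auto-accepted and human-review queues."""
    auto_accepted = [r for r in ratings if r.confidence >= HIGH_CERTAINTY]
    needs_human = [r for r in ratings if r.confidence < HIGH_CERTAINTY]
    return auto_accepted, needs_human

# Toy example: only the low-certainty item is routed to a human rater.
ratings = [
    ItemRating("PRISMA 7: full search strategy reported", "yes", 0.97),
    ItemRating("AMSTAR 2: adequate risk-of-bias assessment", "partial", 0.62),
]
accepted, queued = triage(ratings)
print(f"auto-accepted: {len(accepted)}, routed to human: {len(queued)}")
```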

Talk Nerdy to Me

  1. Bias in training data and prompts: LLMs rely on the data they were trained on, which may introduce unseen biases. A model's behavior is also shaped by the information it is fed, i.e., its prompts. For example, when the LLMs were instructed to pull relevant quotes, they often failed to follow the instructions, pulling too many quotes or pulling them from analyses they had already performed rather than from the source material (see the prompt sketch after this list).
  2. Limited contextual understanding: LLMs lack the nuanced judgment required to assess the methodological quality of complex trials. This was illustrated by the low accuracy of LLMs working alone, compared with the high accuracy achieved once a human rater was added. LLMs still do not process information the way humans do, but when humans are the gold standard, is that really what we want the models to match? Or is there a benefit to re-evaluating our own responses after the LLM's pass, to see whether we are the ones making errors?
  3. Lack of transparency in LLM decision processes: Transparency in LLM decision-making presents significant challenges. A key issue is the "black box" nature of these systems, which makes it difficult, even for experts, to explain how they reach their decisions. LLMs can generate sophisticated outputs from simple prompts, but the underlying reasoning is opaque. AI also often misunderstands or oversimplifies tasks, producing outputs that are unpredictable and difficult to interpret, which further complicates transparency and raises concerns about trust in LLM results.
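As noted in item 1, one practical mitigation is to constrain the model with a structured prompt. The sketch below asks for a verdict on a single PRISMA item, exactly one verbatim supporting quote, and an explicit certainty label that a human–AI workflow could triage on. It is a hypothetical illustration, not the study's protocol: the prompt wording, the `gpt-4o` model name, and the JSON schema are all assumptions, and running it requires an OpenAI API key.

```python
import json
from openai import OpenAI  # assumes the openai Python package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Assumed prompt: one checklist item at a time, one verbatim quote,
# and an explicit certainty label that downstream triage can act on.
PROMPT = """You are appraising a systematic review against PRISMA.
Checklist item: "Did the authors report the full search strategy?"
Manuscript excerpt:
---
{excerpt}
---
Respond with JSON only: {{"verdict": "yes|no|partial",
"quote": "<exactly one verbatim quote from the excerpt>",
"certainty": "high|low"}}"""

excerpt = "We searched MEDLINE and Embase; the full strategy is in Appendix A."

response = client.chat.completions.create(
    model="gpt-4o",  # assumed model name
    messages=[{"role": "user", "content": PROMPT.format(excerpt=excerpt)}],
    response_format={"type": "json_object"},  # ask for parseable JSON
)
rating = json.loads(response.choices[0].message.content)
print(rating["verdict"], rating["certainty"])
```

Even with a schema like this, the returned quote still has to be checked against the source text, since, as item 1 notes, models may pull quotations from the wrong place or fail to follow the constraint at all.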

SGEM Bottom Line

LLMs alone are not yet accurate enough to replace human raters for complex critical appraisals. However, a human–AI collaboration strategy shows promise for reducing the workload for human raters without sacrificing accuracy.

Case Resolution

You start playing around with LLMs to evaluate an SRMA's adherence to the PRISMA guidelines while verifying the accuracy of the AI-generated answers yourself (one way to quantify that agreement is sketched below).
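One simple way to do that verification is to score the LLM's verdicts against your own ratings. This minimal sketch computes percent agreement and Cohen's kappa for yes/no verdicts; the rating lists are hypothetical, and the study itself may have used different agreement statistics.

```python
from collections import Counter

def percent_agreement(human, llm):
    """Fraction of checklist items where human and LLM verdicts match."""
    return sum(h == m for h, m in zip(human, llm)) / len(human)

def cohens_kappa(human, llm):
    """Chance-corrected agreement for categorical verdicts."""
    n = len(human)
    p_o = percent_agreement(human, llm)
    # Expected agreement if both raters labeled independently
    # at their observed marginal rates.
    h_counts, m_counts = Counter(human), Counter(llm)
    labels = set(human) | set(llm)
    p_e = sum((h_counts[l] / n) * (m_counts[l] / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical verdicts on ten PRISMA items ("y" = adherent, "n" = not).
human = ["y", "y", "n", "y", "n", "y", "y", "n", "y", "y"]
llm   = ["y", "y", "n", "n", "n", "y", "y", "y", "y", "y"]
print(f"agreement: {percent_agreement(human, llm):.0%}")
print(f"kappa: {cohens_kappa(human, llm):.2f}")
```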

Remember to be skeptical of anything you learn, even if you heard it on the Skeptics’ Guide to Emergency Medicine.

Thank you to Dr. Laura Walker, associate professor of emergency medicine and vice chair for digital emergency medicine at the Mayo Clinic, for her help with this review.


Topics: Artificial Intelligence, Research
