Can AI Critically Appraise Medical Research?

By Ken Milne, MD | on December 31, 2024
Skeptics' Guide to EM

I am preparing to review another systematic review and meta-analysis (SRMA). Although it is enjoyable to critically appraise these publications, I need to verify whether the authors followed the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) guidelines. There has been a lot of enthusiasm for publicly available large language models (LLMs), such as ChatGPT 3.5 from OpenAI, and I wonder whether artificial intelligence (AI) could do this task quickly and accurately.

Background

LLMs such as ChatGPT and Claude have shown remarkable potential in automating and improving various aspects of medical research. One interesting area is their ability to assist in critical appraisal, one of the core aspects of evidence-based medicine. Critical appraisal involves evaluating the quality, validity, and applicability of studies using structured tools such as PRISMA, AMSTAR (A MeaSurement Tool to Assess systematic Reviews), and PRECIS-2 (PRagmatic Explanatory Continuum Indicator Summary 2).

Confirming adherence to quality checklist guidelines often requires significant expertise and time. However, LLMs have evolved to interpret and analyze complex textual data and therefore represent a unique opportunity to enhance the efficiency of these appraisals. Research into the accuracy of LLMs for these tasks is still in its early stages.
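
To make this concrete, here is a minimal sketch of how a checklist query might be posed to an LLM programmatically. It assumes the OpenAI Python client and an API key in the OPENAI_API_KEY environment variable; the model name, prompt wording, and abbreviated item list are illustrative assumptions, not the prompts or instruments used in the study discussed below.

# Minimal sketch: asking an LLM whether a manuscript reports selected PRISMA items.
# The abbreviated item list and model name are illustrative, not the full instrument.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PRISMA_ITEMS = [
    "Title identifies the report as a systematic review",
    "Abstract provides a structured summary",
    "Eligibility criteria for included studies are specified",
    "Methods for assessing risk of bias are described",
]

def appraise(manuscript_text: str) -> str:
    """Ask the model to rate each item as reported or not reported."""
    items = "\n".join(f"- {item}" for item in PRISMA_ITEMS)
    prompt = (
        "For each item below, answer 'reported' or 'not reported' with a "
        "one-sentence justification, using only the manuscript text provided.\n\n"
        f"Items:\n{items}\n\nManuscript:\n{manuscript_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",    # illustrative model choice
        temperature=0,     # keep the appraisal as reproducible as possible
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    with open("manuscript.txt") as f:
        print(appraise(f.read()))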

Clinical Question

Can LLMs accurately apply critical appraisal tools when evaluating systematic reviews and randomized controlled trials (RCTs)?

Reference

Woelfle T, Hirt J, Janiaud P, et al. Benchmarking human–AI collaboration for common evidence appraisal tools. J Clin Epidemiol. 2024;175:111533.

Population: Systematic reviews and RCTs that were evaluated by critical appraisal tools (PRISMA, AMSTAR, and PRECIS-2).

Intervention: Five LLMs (Claude 3 Opus, Claude 2, ChatGPT 4, ChatGPT 3.5, and Mixtral-8x22B) assessing these studies.

Comparison: Comparisons were made against individual human raters, human consensus ratings, and human–AI collaboration.

Outcome: Accuracy and identification of potential areas for improving efficiency via human–AI collaboration.
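
As a rough illustration of what accuracy against a human consensus rating means in practice, the toy tally below counts per-item agreement by appraisal tool. The ratings are invented for the example; the study's actual data and scoring rules are described in the paper.

# Illustrative only: tallying a model's per-item agreement with a human
# consensus rating, broken down by appraisal tool. The ratings are invented.
from collections import defaultdict

# (tool, item) -> rating
consensus = {("PRISMA", 1): "yes", ("PRISMA", 2): "no",  ("AMSTAR", 1): "yes"}
model     = {("PRISMA", 1): "yes", ("PRISMA", 2): "yes", ("AMSTAR", 1): "yes"}

correct, total = defaultdict(int), defaultdict(int)
for (tool, item), truth in consensus.items():
    total[tool] += 1
    correct[tool] += int(model.get((tool, item)) == truth)

for tool in total:
    print(f"{tool}: {correct[tool]}/{total[tool]} = {correct[tool] / total[tool]:.0%}")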

Authors’ Conclusions

“Current LLMs alone appraised evidence worse than humans. Human–AI collaboration may reduce workload for the second human rater for the assessment of reporting (PRISMA) and methodological rigor (AMSTAR), but not for complex tasks such as PRECIS-2.”

Results

The authors assessed 113 SRMAs and 56 RCTs. Humans had the highest accuracy for all three assessment tools. Of the LLMs, Claude 3 Opus consistently performed the best across PRISMA and AMSTAR, indicating that it may be the most reliable LLM for these tasks.

The older, smaller model, ChatGPT 3.5, performed better than newer LLMs like ChatGPT 4 and Claude 3 Opus on the more complex PRECIS-2 tasks.
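
One way to picture the human–AI collaboration the authors describe, with the model standing in for the second human rater, is the toy triage below: items where the first human and the model agree are accepted, and only disagreements are escalated to a second human. This is a sketch of the general idea under assumed data structures, not the protocol used in the study.

# Sketch of one possible human-AI rating workflow: the model acts as a
# provisional second rater, and only items where it disagrees with the first
# human rater are escalated to a second human. Illustrative only.
from typing import Dict, List, Tuple

def triage(human: Dict[str, str], model: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Return ratings accepted on agreement and items needing second-human review."""
    accepted: Dict[str, str] = {}
    needs_review: List[str] = []
    for item, rating in human.items():
        if model.get(item) == rating:
            accepted[item] = rating       # human and model agree
        else:
            needs_review.append(item)     # disagreement: escalate to a second human
    return accepted, needs_review

accepted, needs_review = triage(
    human={"PRISMA-1": "yes", "PRISMA-2": "no", "PRISMA-3": "yes"},
    model={"PRISMA-1": "yes", "PRISMA-2": "yes", "PRISMA-3": "yes"},
)
print("Accepted on agreement:", accepted)
print("Needs a second human rater:", needs_review)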

Topics: Artificial Intelligence, Research
