Comparative Performance Analysis of Large Language Models: ChatGPT, Google Gemini, and DeepSeek in Approach to Resistant Hypertension

Publication year: 1404 SH (2025)
Document type: Conference paper
Language: English

The full text of this article has not been published; only the abstract (or extended abstract) is available in the database.



National scientific document ID: AIMS02_626

Indexing date: 29 Tir 1404 (July 2025)

Abstract:

Background: Large language models (LLMs) show promise for enhancing medical decision-making, but their performance in specialized domains like hypertension management remains understudied. We evaluated three LLMs (ChatGPT-4, Google Gemini, and DeepSeek) using guideline-based clinical questions to assess their accuracy and reliability.

Methods: Thirty clinical questions (15 simple, 10 moderate, 5 hard difficulty) were extracted from the 2024 ESC Guidelines for the Management of Elevated Blood Pressure and Hypertension, with emphasis on resistant hypertension. A senior cardiologist scored responses using a 4-point scale: (1) completely incorrect answer, (2) correct answer containing incorrect information, (3) correct but inadequate answer, and (4) correct and adequate answer. We summarized non-normally distributed data as median with interquartile range (IQR) and performed nonparametric analyses using Stata 18 (StataCorp LLC), including Friedman tests with post hoc Wilcoxon signed-rank tests (Bonferroni-corrected) to compare model performance across difficulty levels.

Results: ChatGPT-4 achieved the highest median scores for all question types (hard: 4 [IQR 4-4]; moderate: 4 [3-4]; simple: 4 [3.5-4]), with statistically superior performance over Gemini and DeepSeek for moderate questions (p=0.03). Differences were nonsignificant for hard (p=0.3) and simple questions (p=0.62), though ChatGPT showed marginally better overall performance (p=0.059). Gemini and DeepSeek performed comparably but scored lower than ChatGPT (median range: 3-3.5). Response consistency (IQRs) was similar across models.

Conclusion: While ChatGPT-4 outperformed its peers on moderate-difficulty questions, the absence of significant differences on hard questions, coupled with the small sample of hard queries, highlights the need for larger, more balanced datasets to robustly evaluate LLMs in complex clinical scenarios. These findings underscore the importance of context-specific validation when deploying LLMs in medicine.
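For readers who want to reproduce this style of analysis, the following is a minimal sketch in Python with SciPy (the abstract reports a Stata 18 workflow) covering the three steps named in Methods: median/IQR summaries, a Friedman test across the three related samples, and Bonferroni-corrected post hoc Wilcoxon signed-rank tests. The score arrays below are hypothetical placeholders, not the study's data.

import numpy as np
from itertools import combinations
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical 4-point ratings for 30 questions per model
# (placeholder values on the paper's 1-4 scale, not the actual scores).
scores = {
    "ChatGPT-4": rng.integers(3, 5, size=30),
    "Gemini": rng.integers(2, 5, size=30),
    "DeepSeek": rng.integers(2, 5, size=30),
}

# Summarize each model as median with interquartile range (IQR).
for model, s in scores.items():
    q1, med, q3 = np.percentile(s, [25, 50, 75])
    print(f"{model}: median {med:g} (IQR {q1:g}-{q3:g})")

# Friedman test: three related samples (the same 30 questions posed to each model).
chi2, p = stats.friedmanchisquare(*scores.values())
print(f"Friedman: chi2={chi2:.2f}, p={p:.3f}")

# Post hoc pairwise Wilcoxon signed-rank tests, Bonferroni-corrected
# for the three pairwise comparisons.
pairs = list(combinations(scores, 2))
for a, b in pairs:
    _, p = stats.wilcoxon(scores[a], scores[b])
    print(f"{a} vs {b}: corrected p={min(p * len(pairs), 1.0):.3f}")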

Keywords:

Authors

Fatemeh Zahra Seyed-Kolbadi

Cardiovascular Research Center, Hormozgan University of Medical Sciences, Bandar Abbas, Iran

Sara Ghazizadeh

Cardiovascular Research Center, Hormozgan University of Medical Sciences, Bandar Abbas, Iran

Ali Khorram

Faculty of Medicine, Hormozgan University of Medical Sciences, Bandar Abbas, Iran

Ghazal Rezaee

Faculty of Medicine, Hormozgan University of Medical Sciences, Bandar Abbas, Iran

Fatemeh Ardali

Cardiovascular Research Center, Hormozgan University of Medical Sciences, Bandar Abbas, Iran

Shahin Abbaszadeh

Cardiovascular Research Center, Hormozgan University of Medical Sciences, Bandar Abbas, Iran