
Original Research

Open Access

Benchmark evaluation of large language models for clinical decision support in headache management

  • Shi Chen1,2,†
  • Dong Liang1,2,†
  • Xu Qiu1,2,†
  • Chengqi Dong1,2
  • Jiayi Deng3
  • Li Xu1,2
  • Xiaoxue Dong4,5
  • Yonglei Zhao6
  • Xuemei Fan7
  • Xiaoyu Liu8
  • Yali Wu1,2
  • Jianliang Sun1,2
  • Feifang He9
  • Ke Ma10
  • Liang Yu1,2,*
  • Hanbin Wang1,2,*

1Department of Pain, Affiliated Hangzhou First People’s Hospital, School of Medicine, Westlake University, 310006 Hangzhou, Zhejiang, China

2The Fourth Clinical School of Medicine, Hangzhou First People’s Hospital, Zhejiang Chinese Medical University, 310006 Hangzhou, Zhejiang, China

3Department of Pain, Wuxi Xishan People’s Hospital, 214000 Wuxi, Jiangsu, China

4National Neuroscience Institute of Singapore, 308433 Singapore, Singapore

5Department of Neurology, Shanghai General Hospital, School of Medicine, Shanghai Jiao Tong University, 200080 Shanghai, China

6Department of Radiology, Sir Run Run Shaw Hospital, School of Medicine, Zhejiang University, 310016 Hangzhou, Zhejiang, China

7Department of Neurology, Affiliated Hangzhou First People’s Hospital, School of Medicine, Westlake University, 310006 Hangzhou, Zhejiang, China

8Affiliated Mental Health Center & Hangzhou Seventh People’s Hospital, School of Medicine, Zhejiang University, 310016 Hangzhou, Zhejiang, China

9Department of Pain Management, Center for Intracranial Hypotension Management, Sir Run Run Shaw Hospital, School of Medicine, Zhejiang University, 310016 Hangzhou, Zhejiang, China

10Department of Pain, Shanghai Ninth People’s Hospital, School of Medicine, Shanghai Jiao Tong University, 200011 Shanghai, China

DOI: 10.22514/jofph.2026.029 | Vol. 40, Issue 2, March 2026, pp. 140–150

Submitted: 27 October 2025 Accepted: 15 December 2025

Published: 12 March 2026

*Corresponding Authors: Liang Yu, E-mail: yuliang@hospital.westlake.edu.cn; Hanbin Wang, E-mail: wanghanbin@hospital.westlake.edu.cn

† These authors contributed equally.

Abstract

Background: Headache disorders are a major cause of disability worldwide. In routine practice, diagnosis and guideline-based management are difficult because symptoms overlap between primary and secondary headaches, and clinicians must integrate clinical, imaging, and pathological information. Large language models (LLMs) have been proposed to assist clinical reasoning, but their performance on headache cases and their sensitivity to prompting have not been systematically assessed.

Methods: We evaluated seven leading LLMs on 13 headache cases from the New England Journal of Medicine (NEJM), comparing two prompting strategies: ask-in-sequence (AS) and ask-at-once (AO). Using a 5-point Likert rubric, three headache specialists independently scored six dimensions: rationality of diagnostic thinking, comprehensiveness of differential diagnosis, diagnostic accuracy, completeness of pathological diagnosis, clinical management, and supplementary value. Readability was measured with Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL). We analyzed differences across models, prompting strategies, and cases.

Results: Diagnostic accuracy differed by model: under the AS strategy, ChatGPT-4o outperformed Grok-3. Supplementary value also varied: under AS, Grok-3 outperformed ChatGPT-5 and Hunyuan-T1; under AO, DeepSeek-R1 outperformed ChatGPT-5. Overall, supplementary value was generally higher with AS, while strategy-related differences in diagnostic accuracy were observed only for Grok-3. Performance also depended on the case: cases C8 and C11 consistently received very low scores, suggesting difficulty integrating psychiatric symptoms or warning signs with pathological findings. Readability differed significantly: Gemini 2.5 Pro had the highest FRE (best readability) across both strategies, and AS outputs generally had higher FRE. Within AS, ChatGPT-4o had the highest FKGL (worst readability). No significant model differences were found for the other four clinical dimensions.

Conclusions: This study provides a structured, reproducible evaluation of LLMs on headache case analysis. While some models showed advantages in supplementary value, diagnostic accuracy, or readability, overall clinical accuracy remains below expert performance and is not sufficient for unsupervised clinical use.
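The two readability metrics used in this study are computed from average sentence length and average syllables per word using the standard Flesch formulas. The sketch below is a minimal illustration of those formulas, assuming a naive vowel-group syllable heuristic; it is not the tool the authors used.

```python
import re

def count_syllables(word: str) -> int:
    """Rough heuristic: count contiguous vowel groups (at least 1 per word)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> tuple[float, float]:
    """Return (FRE, FKGL) for a passage of English text.

    FRE  = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
    FKGL = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
    Higher FRE = easier to read; higher FKGL = higher grade level.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    asl = len(words) / max(1, len(sentences))   # average sentence length
    asw = syllables / max(1, len(words))        # average syllables per word
    fre = 206.835 - 1.015 * asl - 84.6 * asw
    fkgl = 0.39 * asl + 11.8 * asw - 15.59
    return fre, fkgl
```

Short, monosyllabic sentences score a high FRE and a low (even negative) FKGL, while long polysyllabic clinical prose moves both metrics in the opposite direction, which is what the model-to-model FRE/FKGL differences in the Results reflect.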


Keywords

Headache disorders; Large language models; Clinical reasoning; Artificial intelligence


Cite and Share

Shi Chen, Dong Liang, Xu Qiu, Chengqi Dong, Jiayi Deng, Li Xu, Xiaoxue Dong, Yonglei Zhao, Xuemei Fan, Xiaoyu Liu, Yali Wu, Jianliang Sun, Feifang He, Ke Ma, Liang Yu, Hanbin Wang. Benchmark evaluation of large language models for clinical decision support in headache management. Journal of Oral & Facial Pain and Headache. 2026; 40(2): 140–150.


