Evaluating Large Language Models for Public Health Classification and Extraction Tasks

20 February 2025
Joshua Harris
Timothy Laurence
Leo Loman
Fan Grayson
Toby Nonnenmacher
Harry Long
Loes WalsGriffith
Amy Douglas
Holly Fountain
Stelios Georgiou
Jo Hardstaff
Kathryn Hopkins
Y-Ling Chi
Galena Kuyumdzhieva
Lesley Larkin
Samuel Collins
Hamish Mohammed
Thomas Finnie
Luke Hounsome
Michael Borowitz
Steven Riley
    LM&MA
    AI4MH
Abstract

Advances in Large Language Models (LLMs) have led to significant interest in their potential to support human experts across a range of domains, including public health. In this work, we present automated evaluations of LLMs for public health tasks involving the classification and extraction of free text. We combine six externally annotated datasets with seven new internally annotated datasets to evaluate LLMs for processing text related to health burden, epidemiological risk factors, and public health interventions. We evaluate eleven open-weight LLMs (7 to 123 billion parameters) across all tasks using zero-shot in-context learning. We find that Llama-3.3-70B-Instruct is the highest performing model, achieving the best results on 8/16 tasks (using micro-F1 scores). We see significant variation across tasks: all open-weight LLMs score below 60% micro-F1 on some challenging tasks, such as Contact Classification, while all LLMs achieve greater than 80% micro-F1 on others, such as GI Illness Classification. For a subset of 11 tasks, we also evaluate three GPT-4 and GPT-4o series models and find results comparable to Llama-3.3-70B-Instruct. Overall, based on these initial results, we find promising signs that LLMs may be useful tools for public health experts to extract information from a wide variety of free text sources and to support public health surveillance, research, and interventions.
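The abstract reports results as micro-F1 scores. As a rough illustration of what that metric measures, the sketch below pools true positives, false positives, and false negatives across all examples and labels before computing F1. This is not the authors' evaluation code: the function name `micro_f1` and the label strings are hypothetical, and the sketch assumes each example carries a set of gold labels and a set of predicted labels.

```python
from typing import Iterable, Set


def micro_f1(gold: Iterable[Set[str]], pred: Iterable[Set[str]]) -> float:
    """Micro-averaged F1 over per-example label sets (illustrative sketch)."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)  # labels predicted and present in the gold set
        fp += len(p - g)  # labels predicted but absent from the gold set
        fn += len(g - p)  # gold labels the prediction missed
    denom = 2 * tp + fp + fn
    # Convention assumed here: return 0.0 when there is nothing to score.
    return 2 * tp / denom if denom else 0.0


# Hypothetical labels, chosen only for illustration.
gold = [{"gi"}, {"respiratory"}, {"gi"}, {"other"}]
pred = [{"gi"}, {"gi"}, {"gi"}, {"other"}]
score = micro_f1(gold, pred)  # 3 TP, 1 FP, 1 FN -> 6 / 8 = 0.75
```

Because counts are pooled before averaging, frequent labels weigh more heavily than rare ones; for single-label classification, where every example has exactly one gold and one predicted label, micro-F1 coincides with accuracy.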

@article{harris2025_2405.14766,
  title={Evaluating Large Language Models for Public Health Classification and Extraction Tasks},
  author={Joshua Harris and Timothy Laurence and Leo Loman and Fan Grayson and Toby Nonnenmacher and Harry Long and Loes WalsGriffith and Amy Douglas and Holly Fountain and Stelios Georgiou and Jo Hardstaff and Kathryn Hopkins and Y-Ling Chi and Galena Kuyumdzhieva and Lesley Larkin and Samuel Collins and Hamish Mohammed and Thomas Finnie and Luke Hounsome and Michael Borowitz and Steven Riley},
  journal={arXiv preprint arXiv:2405.14766},
  year={2025}
}