ResearchTrend.AI


arXiv:2403.14578
RAmBLA: A Framework for Evaluating the Reliability of LLMs as Assistants in the Biomedical Domain

21 March 2024
William James Bolton
Rafael Poyiadzi
Edward R. Morrell
Gabriela van Bergen Gonzalez Bueno
Lea Goetz

Papers citing "RAmBLA: A Framework for Evaluating the Reliability of LLMs as Assistants in the Biomedical Domain"

6 papers

- Trusting CHATGPT: how minor tweaks in the prompts lead to major differences in sentiment classification. Jaime E. Cuellar, Oscar Moreno-Martinez, Paula Sofia Torres-Rodriguez, Jaime Andres Pavlich-Mariscal, Andres Felipe Mican-Castiblanco, Juan Guillermo Torres-Hurtado. 16 Apr 2025.
- The Need for Guardrails with Large Language Models in Medical Safety-Critical Settings: An Artificial Intelligence Application in the Pharmacovigilance Ecosystem. Joe B Hakim, Jeffery L. Painter, D. Ramcharran, V. Kara, Greg Powell, Paulina Sobczak, Chiho Sato, Andrew Bate, Andrew Beam. 01 Jul 2024.
- MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation. Zexue He, Yu-Xiang Wang, An Yan, Yao Liu, Eric Y. Chang, Amilcare Gentili, Julian McAuley, Chun-Nan Hsu. 21 Oct 2023.
- Can Large Language Models Be an Alternative to Human Evaluations? Cheng-Han Chiang, Hung-yi Lee. 03 May 2023.
- Evaluating Attribution in Dialogue Systems: The BEGIN Benchmark. Nouha Dziri, Hannah Rashkin, Tal Linzen, David Reitter. 30 Apr 2021.
- PubMedQA: A Dataset for Biomedical Research Question Answering. Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W. Cohen, Xinghua Lu. 13 Sep 2019.