ResearchTrend.AI


arXiv:2403.14578
RAmBLA: A Framework for Evaluating the Reliability of LLMs as Assistants in the Biomedical Domain

21 March 2024
William James Bolton
Rafael Poyiadzi
Edward R. Morrell
Gabriela van Bergen Gonzalez Bueno
Lea Goetz

Papers citing "RAmBLA: A Framework for Evaluating the Reliability of LLMs as Assistants in the Biomedical Domain"

6 papers

- Trusting CHATGPT: how minor tweaks in the prompts lead to major differences in sentiment classification. Jaime E. Cuellar, Oscar Moreno-Martinez, Paula Sofia Torres-Rodriguez, Jaime Andres Pavlich-Mariscal, Andres Felipe Mican-Castiblanco, Juan Guillermo Torres-Hurtado. 16 Apr 2025.
- The Need for Guardrails with Large Language Models in Medical Safety-Critical Settings: An Artificial Intelligence Application in the Pharmacovigilance Ecosystem. Joe B Hakim, Jeffery L. Painter, D. Ramcharran, V. Kara, Greg Powell, Paulina Sobczak, Chiho Sato, Andrew Bate, Andrew Beam. 01 Jul 2024.
- MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation. Zexue He, Yu-Xiang Wang, An Yan, Yao Liu, Eric Y. Chang, Amilcare Gentili, Julian McAuley, Chun-Nan Hsu. 21 Oct 2023.
- Can Large Language Models Be an Alternative to Human Evaluations? Cheng-Han Chiang, Hung-yi Lee. 03 May 2023.
- Evaluating Attribution in Dialogue Systems: The BEGIN Benchmark. Nouha Dziri, Hannah Rashkin, Tal Linzen, David Reitter. 30 Apr 2021.
- PubMedQA: A Dataset for Biomedical Research Question Answering. Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W. Cohen, Xinghua Lu. 13 Sep 2019.