The Challenges of Evaluating LLM Applications: An Analysis of Automated, Human, and LLM-Based Approaches

5 June 2024

Bhashithe Abeysinghe

Ruhan Circi

ELM

Papers citing "The Challenges of Evaluating LLM Applications: An Analysis of Automated, Human, and LLM-Based Approaches"

Advancing Academic Chatbots: Evaluation of Non Traditional Outputs

Nicole Favero

Francesca Salute

Daniel Hardt

30 Nov 2025

Poison Once, Refuse Forever: Weaponizing Alignment for Injecting Bias in LLMs

Md Abdullah Al Mamun

Ihsen Alouani

Nael B. Abu-Ghazaleh

118

28 Aug 2025

CASE: An Agentic AI Framework for Enhancing Scam Intelligence in Digital Payments

116

27 Aug 2025

Metric assessment protocol in the context of answer fluctuation on MCQ tasks

210

21 Jul 2025

Safer or Luckier? LLMs as Safety Evaluators Are Not Robust to Artifacts

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

Hongyu Chen

Seraphina Goldfarb-Tarrant

755

12 Mar 2025

A review of faithfulness metrics for hallucination assessment in Large Language Models

IEEE Journal on Selected Topics in Signal Processing (JSTSP), 2024

450

03 Jan 2025

Generating a Low-code Complete Workflow via Task Decomposition and RAG

Orlando Marquez Ayala

Patrice Béchard

305

29 Nov 2024

From Cool Demos to Production-Ready FMware: Core Challenges and a Technology Roadmap

Gopi Krishnan Rajbahadur

410

28 Oct 2024

Is artificial intelligence still intelligence? LLMs generalize to novel adjective-noun pairs, but don't mimic the full human distribution

Hayley Ross

Kathryn Davidson

Najoung Kim

304

23 Oct 2024

A Cross-Lingual Statutory Article Retrieval Dataset for Taiwan Legal Studies

165

15 Oct 2024

Conversate: Supporting Reflective Learning in Interview Practice Through Interactive Simulation and Dialogic Feedback

405

08 Oct 2024

Comparing Criteria Development Across Domain Experts, Lay Users, and Models in Large Language Model Evaluation

Annalisa Szymanski

Simret Araya Gebreegziabher

256

02 Oct 2024

Retrospective Comparative Analysis of Prostate Cancer In-Basket Messages: Responses from Closed-Domain LLM vs. Clinical Teams

Yuexing Hao

...

Mark R. Waddle

Wei Liu

LM&MA

194

26 Sep 2024

Real or Robotic? Assessing Whether LLMs Accurately Simulate Qualities of Human Responses in Dialogue

Jonathan Ivey

...

Anders Giovanni Møller

Lechen Zhang

David Jurgens

356

12 Sep 2024

Language agents achieve superhuman synthesis of scientific knowledge

Michael D. Skarlinski

506

107

10 Sep 2024

"Mango Mango, How to Let The Lettuce Dry Without A Spinner?": Exploring User Perceptions of Using An LLM-Based Conversational Assistant Toward Cooking Partner

Proceedings of the ACM on Human-Computer Interaction (PACMHCI), 2023

Bingsheng Yao

257

09 Oct 2023