The Challenges of Evaluating LLM Applications: An Analysis of Automated, Human, and LLM-Based Approaches

5 June 2024
Bhashithe Abeysinghe
Ruhan Circi
    ELM

Papers citing "The Challenges of Evaluating LLM Applications: An Analysis of Automated, Human, and LLM-Based Approaches"

16 / 16 papers shown
Advancing Academic Chatbots: Evaluation of Non Traditional Outputs
Nicole Favero
Francesca Salute
Daniel Hardt
89
0
0
30 Nov 2025
Poison Once, Refuse Forever: Weaponizing Alignment for Injecting Bias in LLMs
Md Abdullah Al Mamun
Ihsen Alouani
Nael B. Abu-Ghazaleh
118
1
0
28 Aug 2025
CASE: An Agentic AI Framework for Enhancing Scam Intelligence in Digital Payments
Nitish Jaipuria
Lorenzo Gatto
Zijun Kan
Shankey Poddar
Bill Cheung
Diksha Bansal
Ramanan Balakrishnan
Aviral Suri
Jose Estevez
116
0
0
27 Aug 2025
Metric assessment protocol in the context of answer fluctuation on MCQ tasks
Ekaterina Goliakova
X. Renard
Marie-Jeanne Lesot
Thibault Laugel
Christophe Marsala
Marcin Detyniecki
210
0
0
21 Jul 2025
Safer or Luckier? LLMs as Safety Evaluators Are Not Robust to Artifacts
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Hongyu Chen
Seraphina Goldfarb-Tarrant
755
14
0
12 Mar 2025
A review of faithfulness metrics for hallucination assessment in Large Language Models
IEEE Journal on Selected Topics in Signal Processing (JSTSP), 2024
Ben Malin
Tatiana Kalganova
Nikolaos Boulgouris
HILM
450
17
0
03 Jan 2025
Generating a Low-code Complete Workflow via Task Decomposition and RAG
Orlando Marquez Ayala
Patrice Béchard
305
2
0
29 Nov 2024
From Cool Demos to Production-Ready FMware: Core Challenges and a Technology Roadmap
Gopi Krishnan Rajbahadur
G. Oliva
Dayi Lin
Ahmed E. Hassan
ALM
410
3
0
28 Oct 2024
Is artificial intelligence still intelligence? LLMs generalize to novel adjective-noun pairs, but don't mimic the full human distribution
Hayley Ross
Kathryn Davidson
Najoung Kim
304
6
0
23 Oct 2024
A Cross-Lingual Statutory Article Retrieval Dataset for Taiwan Legal Studies
Yen-Hsiang Wang
Feng-Dian Su
Tzu-Yu Yeh
Yao-Chung Fan
RALM, AILaw
165
0
0
15 Oct 2024
Conversate: Supporting Reflective Learning in Interview Practice Through Interactive Simulation and Dialogic Feedback
Taufiq Daryanto
Xiaohan Ding
Lance T Wilhelm
Sophia Stil
Kirk McInnis Knutsen
Eugenia H Rho
405
22
0
08 Oct 2024
Comparing Criteria Development Across Domain Experts, Lay Users, and Models in Large Language Model Evaluation
Annalisa Szymanski
Simret Araya Gebreegziabher
Oghenemaro Anuyah
Ronald A Metoyer
Tao Li
ALM, ELM
256
15
0
02 Oct 2024
Retrospective Comparative Analysis of Prostate Cancer In-Basket Messages: Responses from Closed-Domain LLM vs. Clinical Teams
Yuexing Hao
J. Holmes
Jared Hobson
Alexandra Bennett
Daniel K. Ebner
...
N. Yu
Chris L. Hallemeier
Brooke E. Ball
Mark R. Waddle
Wei Liu
LM&MA
194
4
0
26 Sep 2024
Real or Robotic? Assessing Whether LLMs Accurately Simulate Qualities of Human Responses in Dialogue
Jonathan Ivey
Shivani Kumar
Jiayu Liu
Hua Shen
Sushrita Rakshit
...
Dustin Wright
Abraham Israeli
Anders Giovanni Møller
Lechen Zhang
David Jurgens
356
13
0
12 Sep 2024
Language agents achieve superhuman synthesis of scientific knowledge
Michael D. Skarlinski
Sam Cox
Jon M. Laurent
James D. Braza
Michaela M. Hinks
M. Hammerling
Manvitha Ponnapati
Samuel G. Rodriques
Andrew D. White
ELM, HILM, ALM
506
107
0
10 Sep 2024
"Mango Mango, How to Let The Lettuce Dry Without A Spinner?'': Exploring
  User Perceptions of Using An LLM-Based Conversational Assistant Toward
  Cooking Partner
Proceedings of the ACM on Human-Computer Interaction (PACMHCI), 2023
Szeyi Chan
Jiachen Li
Bingsheng Yao
Amama Mahmood
Chien-Ming Huang
Holly Jimison
Elizabeth D. Mynatt
Dakuo Wang
257
15
0
09 Oct 2023