FELM: Benchmarking Factuality Evaluation of Large Language Models

1 October 2023
Shiqi Chen
Yiran Zhao
Jinghan Zhang
Ethan Chern
Siyang Gao
Pengfei Liu
Junxian He
HILM

Papers citing "FELM: Benchmarking Factuality Evaluation of Large Language Models"

32 / 32 papers shown
Hallucination Detection in LLMs via Topological Divergence on Attention Graphs
Alexandra Bazarova
Aleksandr Yugay
Andrey Shulga
A. Ermilova
Andrei Volodichev
...
Dmitry Simakov
M. Savchenko
Andrey Savchenko
Serguei Barannikov
Alexey Zaytsev
HILM
28
0
0
14 Apr 2025
Beyond Progress Measures: Theoretical Insights into the Mechanism of Grokking
Zihan Gu
Ruoyu Chen
Hua Zhang
Yue Hu
Xiaochun Cao
32
0
0
04 Apr 2025
FindTheFlaws: Annotated Errors for Detecting Flawed Reasoning and Scalable Oversight Research
Gabriel Recchia
Chatrik Singh Mangat
Issac Li
Gayatri Krishnakumar
ALM
77
0
0
29 Mar 2025
ECKGBench: Benchmarking Large Language Models in E-commerce Leveraging Knowledge Graph
Langming Liu
Haibin Chen
Yuhao Wang
Yujin Yuan
Shilei Liu
Wenbo Su
Xiangyu Zhao
Bo Zheng
RALM
58
0
0
20 Mar 2025
SciHorizon: Benchmarking AI-for-Science Readiness from Scientific Data to Large Language Models
Chuan Qin
X. Chen
Chengrui Wang
Pengmin Wu
Xi Chen
...
Han Wu
C. Li
Yuanchun Zhou
H. Xiong
Hengshu Zhu
ELM
57
1
0
12 Mar 2025
HuDEx: Integrating Hallucination Detection and Explainability for Enhancing the Reliability of LLM responses
Sujeong Lee
Hayoung Lee
Seongsoo Heo
Wonik Choi
HILM
90
0
0
12 Feb 2025
Iterative Tree Analysis for Medical Critics
Zenan Huang
Mingwei Li
Zheng Zhou
Youxin Jiang
83
0
0
18 Jan 2025
The FACTS Grounding Leaderboard: Benchmarking LLMs' Ability to Ground Responses to Long-Form Input
Alon Jacovi
Andrew Wang
Chris Alberti
Connie Tao
Jon Lipovetz
...
Rachana Fellinger
Rui Wang
Zizhao Zhang
Sasha Goldshtein
Dipanjan Das
HILM
ALM
82
12
0
06 Jan 2025
Measuring short-form factuality in large language models
Jason W. Wei
Nguyen Karina
Hyung Won Chung
Yunxin Joy Jiao
Spencer Papay
Amelia Glaese
John Schulman
W. Fedus
ELM
KELM
HILM
38
38
0
07 Nov 2024
Coarse-to-Fine Highlighting: Reducing Knowledge Hallucination in Large Language Models
Qitan Lv
Jie Wang
Hanzhu Chen
Bin Li
Yongdong Zhang
Feng Wu
HILM
17
3
0
19 Oct 2024
Collu-Bench: A Benchmark for Predicting Language Model Hallucinations in Code
Nan Jiang
Qi Li
Lin Tan
Tianyi Zhang
HILM
29
1
0
13 Oct 2024
PersoBench: Benchmarking Personalized Response Generation in Large Language Models
Saleh Afzoon
Usman Naseem
Amin Beheshti
Zahra Jamali
31
1
0
04 Oct 2024
Concise Thoughts: Impact of Output Length on LLM Reasoning and Cost
Sania Nayab
Giulio Rossolini
Giorgio Buttazzo
Nicolamaria Manes
Fabrizio Giacomelli
LRM
49
23
0
29 Jul 2024
AgentPeerTalk: Empowering Students through Agentic-AI-Driven Discernment of Bullying and Joking in Peer Interactions in Schools
Aditya Paul
Chi Lok Yu
Eva Adelina Susanto
Nicholas Wai Long Lau
Gwenyth Isobel Meadows
LLMAG
35
3
0
27 Jul 2024
WildHallucinations: Evaluating Long-form Factuality in LLMs with Real-World Entity Queries
Wenting Zhao
Tanya Goyal
Yu Ying Chiu
Liwei Jiang
Benjamin Newman
...
Khyathi Raghavi Chandu
Ronan Le Bras
Claire Cardie
Yuntian Deng
Yejin Choi
HILM
38
7
0
24 Jul 2024
Composable Interventions for Language Models
Arinbjorn Kolbeinsson
Kyle O'Brien
Tianjin Huang
Shanghua Gao
Shiwei Liu
...
Anurag J. Vaidya
Faisal Mahmood
Marinka Zitnik
Tianlong Chen
Thomas Hartvigsen
KELM
MU
82
5
0
09 Jul 2024
NutriBench: A Dataset for Evaluating Large Language Models on Nutrition Estimation from Meal Descriptions
Andong Hua
Mehak Preet Dhaliwal
Ryan Burke
Yao Qin
23
1
0
04 Jul 2024
WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia
Yufang Hou
Alessandra Pascale
Javier Carnerero-Cano
T. Tchrakian
Radu Marinescu
Elizabeth M. Daly
Inkit Padhi
P. Sattigeri
41
6
0
19 Jun 2024
DefAn: Definitive Answer Dataset for LLMs Hallucination Evaluation
A. B. M. A. Rahman
Saeed Anwar
Muhammad Usman
Ajmal Mian
HILM
39
2
0
13 Jun 2024
A Probabilistic Framework for LLM Hallucination Detection via Belief Tree Propagation
Bairu Hou
Yang Zhang
Jacob Andreas
Shiyu Chang
69
5
0
11 Jun 2024
Evaluating the Factuality of Large Language Models using Large-Scale Knowledge Graphs
Xiaoze Liu
Feijie Wu
Tianyang Xu
Zhuo Chen
Yichi Zhang
Xiaoqian Wang
Jing Gao
HILM
33
8
0
01 Apr 2024
ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models
Jio Oh
Soyeon Kim
Junseok Seo
Jindong Wang
Ruochen Xu
Xing Xie
Steven Euijong Whang
36
1
0
08 Mar 2024
In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation
Shiqi Chen
Miao Xiong
Junteng Liu
Zhengxuan Wu
Teng Xiao
Siyang Gao
Junxian He
HILM
51
21
0
03 Mar 2024
Factuality of Large Language Models in the Year 2024
Yuxia Wang
Minghan Wang
Muhammad Arslan Manzoor
Fei Liu
Georgi Georgiev
Rocktim Jyoti Das
Preslav Nakov
LRM
HILM
30
7
0
04 Feb 2024
Benchmarking LLMs via Uncertainty Quantification
Fanghua Ye
Mingming Yang
Jianhui Pang
Longyue Wang
Derek F. Wong
Emine Yilmaz
Shuming Shi
Zhaopeng Tu
ELM
15
47
0
23 Jan 2024
KGQuiz: Evaluating the Generalization of Encoded Knowledge in Large Language Models
Yuyang Bai
Shangbin Feng
Vidhisha Balachandran
Zhaoxuan Tan
Shiqi Lou
Tianxing He
Yulia Tsvetkov
ELM
40
2
0
15 Oct 2023
KCTS: Knowledge-Constrained Tree Search Decoding with Token-Level Hallucination Detection
Sehyun Choi
Tianqing Fang
Zhaowei Wang
Yangqiu Song
30
32
0
13 Oct 2023
Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations
Jaehun Jung
Lianhui Qin
Sean Welleck
Faeze Brahman
Chandra Bhagavatula
Ronan Le Bras
Yejin Choi
ReLM
LRM
218
189
0
24 May 2022
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Xuezhi Wang
Jason W. Wei
Dale Schuurmans
Quoc Le
Ed H. Chi
Sharan Narang
Aakanksha Chowdhery
Denny Zhou
ReLM
BDL
LRM
AI4CE
297
3,236
0
21 Mar 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason W. Wei
Xuezhi Wang
Dale Schuurmans
Maarten Bosma
Brian Ichter
F. Xia
Ed H. Chi
Quoc Le
Denny Zhou
LM&Ro
LRM
AI4CE
ReLM
315
8,448
0
28 Jan 2022
Evaluating Attribution in Dialogue Systems: The BEGIN Benchmark
Nouha Dziri
Hannah Rashkin
Tal Linzen
David Reitter
ALM
185
79
0
30 Apr 2021
Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics
Artidoro Pagnoni
Vidhisha Balachandran
Yulia Tsvetkov
HILM
215
305
0
27 Apr 2021