A Critical Evaluation of Evaluations for Long-form Question Answering


29 May 2023
Fangyuan Xu
Yixiao Song
Mohit Iyyer
Eunsol Choi
    ELM

Papers citing "A Critical Evaluation of Evaluations for Long-form Question Answering"

50 / 77 papers shown
How well do LLMs reason over tabular data, really?
Cornelius Wolff
Madelon Hulsebos
LMTD
ELM
LRM
37
0
0
12 May 2025
An Empirical Study of Evaluating Long-form Question Answering
Ning Xian
Yixing Fan
Ruqing Zhang
Maarten de Rijke
Jiafeng Guo
ELM
27
0
0
25 Apr 2025
The Great Nugget Recall: Automating Fact Extraction and RAG Evaluation with Large Language Models
Ronak Pradeep
Nandan Thakur
Shivani Upadhyay
Daniel Fernando Campos
Nick Craswell
Jimmy Lin
22
0
0
21 Apr 2025
Extract, Match, and Score: An Evaluation Paradigm for Long Question-context-answer Triplets in Financial Analysis
Bo Hu
Han Yuan
Vlad Pandelea
Wuqiong Luo
Yingzhu Zhao
Zheng Ma
53
0
0
20 Mar 2025
MAMM-Refine: A Recipe for Improving Faithfulness in Generation with Multi-Agent Collaboration
David Wan
Justin Chih-Yao Chen
Elias Stengel-Eskin
Mohit Bansal
LLMAG
LRM
60
1
0
19 Mar 2025
Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings
Austin Xu
Srijan Bansal
Yifei Ming
Semih Yavuz
Shafiq R. Joty
ELM
89
2
0
19 Mar 2025
Large Language Models Are Effective Human Annotation Assistants, But Not Good Independent Annotators
Feng Gu
Zongxia Li
Carlos Rafael Colon
Benjamin Evans
Ishani Mondal
Jordan Boyd-Graber
39
1
0
09 Mar 2025
Generate, Discriminate, Evolve: Enhancing Context Faithfulness via Fine-Grained Sentence-Level Self-Evolution
K. Li
Tianhua Zhang
Yunxiang Li
Hongyin Luo
Abdalla Moustafa
Xixin Wu
James Glass
H. Meng
59
0
0
03 Mar 2025
DeepSolution: Boosting Complex Engineering Solution Design via Tree-based Exploration and Bi-point Thinking
Zhuoqun Li
Haiyang Yu
Xuanang Chen
Hongyu Lin
Y. Lu
Fei Huang
Xianpei Han
Y. Li
Le Sun
40
3
0
28 Feb 2025
FactReasoner: A Probabilistic Approach to Long-Form Factuality Assessment for Large Language Models
Radu Marinescu
D. Bhattacharjya
Junkyu Lee
T. Tchrakian
Javier Carnerero-Cano
Yufang Hou
Elizabeth M. Daly
Alessandra Pascale
HILM
LRM
56
0
0
25 Feb 2025
How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild
Saad Obaid ul Islam
Anne Lauscher
Goran Glavas
HILM
LRM
112
1
0
21 Feb 2025
Prompt-based Depth Pruning of Large Language Models
Juyun Wee
Minjae Park
Jaeho Lee
VLM
84
0
0
17 Feb 2025
EvidenceMap: Learning Evidence Analysis to Unleash the Power of Small Language Models for Biomedical Question Answering
Chang Zong
Jian Wan
Siliang Tang
Lei Zhang
73
0
0
17 Feb 2025
Fostering Appropriate Reliance on Large Language Models: The Role of Explanations, Sources, and Inconsistencies
Sunnie S. Y. Kim
J. Vaughan
Q. V. Liao
Tania Lombrozo
Olga Russakovsky
89
5
0
12 Feb 2025
Context-Aware Hierarchical Merging for Long Document Summarization
Litu Ou
Mirella Lapata
MoMe
101
1
0
03 Feb 2025
LLM-powered Multi-agent Framework for Goal-oriented Learning in Intelligent Tutoring System
Tianfu Wang
Yi Zhan
Jianxun Lian
Zhengyu Hu
N. Yuan
Qi Zhang
Xing Xie
Hui Xiong
29
1
0
28 Jan 2025
Learning to Explore and Select for Coverage-Conditioned Retrieval-Augmented Generation
Takyoung Kim
Kyungjae Lee
Y. Jang
Ji Yong Cho
Gangwoo Kim
Minseok Cho
Moontae Lee
83
0
0
28 Jan 2025
PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations
Ruosen Li
Teerth Patel
Xinya Du
LLMAG
ALM
52
94
0
03 Jan 2025
Evaluate Summarization in Fine-Granularity: Auto Evaluation with LLM
Dong Yuan
Eti Rastogi
Fen Zhao
Sagar Goyal
Gautam Naik
Sree Prasanna Rajagopal
29
0
0
31 Dec 2024
Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning
Yu Fu
Zefan Cai
Abedelkadir Asi
Wayne Xiong
Yue Dong
Wen Xiao
36
14
0
25 Oct 2024
Quebec Automobile Insurance Question-Answering With Retrieval-Augmented Generation
David Beauchemin
Zachary Gagnon
Richard Khoury
AILaw
18
1
0
12 Oct 2024
Integrating Planning into Single-Turn Long-Form Text Generation
Yi Liang
You Wu
Honglei Zhuang
Li Chen
Jiaming Shen
...
Zhen Qin
Sumit Sanghai
Xuanhui Wang
Carl Yang
Michael Bendersky
48
3
0
08 Oct 2024
CALF: Benchmarking Evaluation of LFQA Using Chinese Examinations
Yuchen Fan
Xin Zhong
Heng Zhou
Yuchen Zhang
Mingyu Liang
Chengxing Xie
Ermo Hua
Ning Ding
Bowen Zhou
ALM
ELM
16
0
0
02 Oct 2024
Direct Judgement Preference Optimization
Peifeng Wang
Austin Xu
Yilun Zhou
Caiming Xiong
Shafiq Joty
ELM
37
11
0
23 Sep 2024
Schrodinger's Memory: Large Language Models
Wei Wang
Qing Li
24
1
0
16 Sep 2024
Into the Unknown Unknowns: Engaged Human Learning through Participation in Language Model Agent Conversations
Yucheng Jiang
Yijia Shao
Dekun Ma
Sina J. Semnani
Monica S. Lam
LLMAG
32
14
0
27 Aug 2024
Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates
Hui Wei
Shenghua He
Tian Xia
Andy H. Wong
Jingyang Lin
Mei Han
ALM
ELM
59
22
0
23 Aug 2024
Ancient Wisdom, Modern Tools: Exploring Retrieval-Augmented LLMs for Ancient Indian Philosophy
Priyanka Mandikal
RALM
VLM
33
0
0
21 Aug 2024
How Susceptible are LLMs to Influence in Prompts?
Sotiris Anagnostidis
Jannis Bulian
LRM
25
16
0
17 Aug 2024
DebateQA: Evaluating Question Answering on Debatable Knowledge
Rongwu Xu
Xuan Qi
Zehan Qi
Wei Xu
Zhijiang Guo
ELM
36
5
0
02 Aug 2024
Localizing and Mitigating Errors in Long-form Question Answering
Rachneet Sachdeva
Yixiao Song
Mohit Iyyer
Iryna Gurevych
HILM
36
0
0
16 Jul 2024
Suri: Multi-constraint Instruction Following for Long-form Text Generation
Chau Minh Pham
Simeng Sun
Mohit Iyyer
ALM
LRM
34
15
0
27 Jun 2024
VERISCORE: Evaluating the factuality of verifiable claims in long-form text generation
Yixiao Song
Yekyung Kim
Mohit Iyyer
HILM
19
23
0
27 Jun 2024
Assessing "Implicit" Retrieval Robustness of Large Language Models
Xiaoyu Shen
Rexhina Blloshmi
Dawei Zhu
Jiahuan Pei
Wei Zhang
RALM
KELM
33
0
0
26 Jun 2024
CaLMQA: Exploring culturally specific long-form question answering across 23 languages
Shane Arora
Marzena Karpinska
Hung-Ting Chen
Ipsita Bhattacharjee
Mohit Iyyer
Eunsol Choi
HILM
40
11
0
25 Jun 2024
Depth $F_1$: Improving Evaluation of Cross-Domain Text Classification by Measuring Semantic Generalizability
Parker Seegmiller
Joseph Gatto
S. Preum
VLM
14
0
0
20 Jun 2024
Model Internals-based Answer Attribution for Trustworthy Retrieval-Augmented Generation
Jirui Qi
Gabriele Sarti
Raquel Fernández
Arianna Bisazza
RALM
37
5
0
19 Jun 2024
Learning to Generate Answers with Citations via Factual Consistency Models
Rami Aly
Zhiqiang Tang
Samson Tan
George Karypis
HILM
21
4
0
19 Jun 2024
CRAG -- Comprehensive RAG Benchmark
Xiao Yang
Kai Sun
Hao Xin
Yushi Sun
Nikita Bhalla
...
Nirav Shah
Rakesh Wanga
Anuj Kumar
Wen-tau Yih
Xin Luna Dong
18
22
0
07 Jun 2024
Inverse Constitutional AI: Compressing Preferences into Principles
Arduin Findeis
Timo Kaufmann
Eyke Hüllermeier
Samuel Albanie
Robert Mullins
SyDa
41
8
0
02 Jun 2024
Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories
Tianlong Wang
Xianfeng Jiao
Yifan He
Zhongzhi Chen
Yinghao Zhu
Xu Chu
Junyi Gao
Yasha Wang
Liantao Ma
LLMSV
59
7
0
26 May 2024
Lessons from the Trenches on Reproducible Evaluation of Language Models
Stella Biderman
Hailey Schoelkopf
Lintang Sutawika
Leo Gao
J. Tow
...
Xiangru Tang
Kevin A. Wang
Genta Indra Winata
François Yvon
Andy Zou
ELM
ALM
120
16
3
23 May 2024
OLAPH: Improving Factuality in Biomedical Long-form Question Answering
Minbyul Jeong
Hyeon Hwang
Chanwoong Yoon
Taewhoo Lee
Jaewoo Kang
MedIm
HILM
LM&MA
25
11
0
21 May 2024
LLMs can learn self-restraint through iterative self-reflection
Alexandre Piché
Aristides Milios
Dzmitry Bahdanau
Chris Pal
31
5
0
15 May 2024
On the Evaluation of Machine-Generated Reports
James Mayfield
Eugene Yang
Dawn J Lawrie
Sean MacAvaney
Paul McNamee
...
Orion Weller
Efsun Kayi
Kate Sanders
Marc Mason
Noah Hibbler
ALM
60
12
0
02 May 2024
ir_explain: a Python Library of Explainable IR Methods
S.
Harsh Agarwal
Venktesh V
Avishek Anand
Swastik Mohanty
Debapriyo Majumdar
Mandar Mitra
XAI
60
1
0
29 Apr 2024
Reuse Your Rewards: Reward Model Transfer for Zero-Shot Cross-Lingual Alignment
Zhaofeng Wu
Ananth Balashankar
Yoon Kim
Jacob Eisenstein
Ahmad Beirami
43
8
0
18 Apr 2024
AdvisorQA: Towards Helpful and Harmless Advice-seeking Question Answering with Collective Intelligence
Minbeom Kim
Hwanhee Lee
Joonsuk Park
Hwaran Lee
Kyomin Jung
24
1
0
18 Apr 2024
Groundedness in Retrieval-augmented Long-form Generation: An Empirical Study
Alessandro Stolfo
RALM
HILM
19
0
0
10 Apr 2024
WebCiteS: Attributed Query-Focused Summarization on Chinese Web Search Results with Citations
Haolin Deng
Chang Wang
Xin Li
Dezhang Yuan
Junlang Zhan
Tianhua Zhou
Jin Ma
Jun Gao
Ruifeng Xu
HILM
45
2
0
04 Mar 2024