Large Language Models Still Face Challenges in Multi-Hop Reasoning with External Knowledge

11 December 2024

Haotong Zhang

LRM

ELM

Main:11 Pages

3 Figures

Bibliography:4 Pages

8 Tables

Appendix:9 Pages

Abstract

We carry out a series of experiments to test large language models' multi-hop reasoning ability from three aspects: selecting and combining external knowledge, dealing with non-sequential reasoning tasks, and generalising to data samples with larger numbers of hops. We test the GPT-3.5 model on four reasoning benchmarks with Chain-of-Thought prompting (and its variations). Our results reveal that, despite the impressive performance large language models achieve on various reasoning tasks, they still suffer from severe drawbacks, which indicates a large gap with humans.
