NLCTables: A Dataset for Marrying Natural Language Conditions with Table Discovery

Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2025

22 April 2025


Main: 9 Pages

7 Figures

Bibliography: 1 Page

14 Tables

Abstract

With the growing abundance of repositories containing tabular data, discovering relevant tables for in-depth analysis remains a challenging task. Existing table discovery methods primarily retrieve desired tables based on a query table or several vague keywords, leaving users to manually filter large result sets. To address this limitation, we propose a new task: NL-conditional table discovery (nlcTD), where users combine a query table with natural language (NL) requirements to refine search results. To advance research in this area, we present nlcTables, a comprehensive benchmark dataset comprising 627 diverse queries spanning NL-only, union, join, and fuzzy conditions, 22,080 candidate tables, and 21,200 relevance annotations. Our evaluation of six state-of-the-art table discovery methods on nlcTables reveals substantial performance gaps, highlighting the need for advanced techniques to tackle this challenging nlcTD scenario. The dataset, construction framework, and baseline implementations are publicly available at this https URL to foster future research.
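
To make the nlcTD setting concrete, the following is a minimal illustrative sketch of a query that pairs an optional query table with an NL condition, scored against candidate tables. It is not the paper's method or any of the evaluated baselines: the names Table, NlcQuery, column_overlap, keyword_match, and rank are hypothetical, and the scoring (column-name Jaccard overlap plus naive keyword matching, weighted equally) is an assumption chosen only to show the shape of the task.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Table:
    """Hypothetical candidate or query table, reduced to its schema."""
    name: str
    columns: List[str]


@dataclass
class NlcQuery:
    """Hypothetical nlcTD query: an NL condition plus an optional query table."""
    nl_condition: str
    query_table: Optional[Table] = None  # None models an NL-only query


def column_overlap(a: Table, b: Table) -> float:
    """Jaccard similarity of lower-cased column names (toy unionability proxy)."""
    sa = {c.lower() for c in a.columns}
    sb = {c.lower() for c in b.columns}
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def keyword_match(condition: str, table: Table) -> float:
    """Fraction of condition words found in the table name or column names."""
    words = {w.lower() for w in condition.split()}
    vocab = {part.lower() for c in table.columns for part in c.split("_")}
    vocab |= {part.lower() for part in table.name.split("_")}
    return len(words & vocab) / len(words) if words else 0.0


def rank(query: NlcQuery, candidates: List[Table]) -> List[Tuple[float, str]]:
    """Score every candidate and return (score, table name), best first."""
    scored = []
    for t in candidates:
        score = keyword_match(query.nl_condition, t)
        if query.query_table is not None:
            # Assumed equal weighting of the NL signal and the table signal.
            score = 0.5 * score + 0.5 * column_overlap(query.query_table, t)
        scored.append((score, t.name))
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


query = NlcQuery(
    nl_condition="population year",
    query_table=Table("cities", ["city", "population"]),
)
candidates = [
    Table("movie_list", ["title", "director"]),
    Table("city_stats", ["city", "population", "year"]),
]
results = rank(query, candidates)
```

In this toy example, the table that both shares columns with the query table and matches the NL condition ranks first. A realistic system would replace both scoring functions with learned or semantic matching.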
