
ACCORD: Closing the Commonsense Measurability Gap

4 June 2024
François Roewer-Després, Jinyue Feng, Zining Zhu, Frank Rudzicz
Abstract

We present ACCORD, a framework and benchmark suite for disentangling the commonsense grounding and reasoning abilities of large language models (LLMs) through controlled, multi-hop counterfactuals. ACCORD introduces formal elements to commonsense reasoning to explicitly control and quantify reasoning complexity beyond the typical 1 or 2 hops. Uniquely, ACCORD can automatically generate benchmarks of arbitrary reasoning complexity, and so it scales with future LLM improvements. Benchmarking state-of-the-art LLMs -- including GPT-4o (2024-05-13), Llama-3-70B-Instruct, and Mixtral-8x22B-Instruct-v0.1 -- shows performance degrading to random chance with only moderate scaling, leaving substantial headroom for improvement. We release a leaderboard of the benchmark suite tested in this work, as well as code for automatically generating more complex benchmarks.
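
The abstract does not spell out ACCORD's generation procedure, but the core idea of composing controlled multi-hop counterfactuals can be sketched. The Python below is a minimal illustration under assumptions: the Fact schema, the toy rule bank, and make_instance are hypothetical stand-ins, not ACCORD's actual format. It only shows how hop count (reasoning complexity) and counterfactual flips can be varied independently, so that benchmarks of arbitrary depth can be generated automatically.

# Illustrative sketch only: the Fact schema, RULES bank, and make_instance
# are assumptions for exposition, not ACCORD's actual data format.
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    """One hop: an if-then commonsense rule, possibly flipped to a counterfactual."""
    premise: str
    conclusion: str
    counterfactual: bool = False

# Toy rule bank; a real generator would draw from a commonsense knowledge base.
RULES = [
    ("it is raining", "the ground is wet"),
    ("the ground is wet", "shoes get muddy"),
    ("shoes get muddy", "the floor needs cleaning"),
    ("the floor needs cleaning", "someone fetches a mop"),
]

def make_instance(hops: int, flip: bool, seed: int = 0):
    """Chain `hops` rules into one instance. If `flip`, negate one hop so the
    correct answer depends on the stated (counterfactual) rules, not on
    the model's default world knowledge."""
    rng = random.Random(seed)
    chain = [Fact(p, c) for p, c in RULES[:hops]]
    goal = chain[-1].conclusion  # ask about the original endpoint of the chain
    if flip:
        i = rng.randrange(hops)
        f = chain[i]
        chain[i] = Fact(f.premise,
                        f"it is not the case that {f.conclusion}",
                        counterfactual=True)
    context = " ".join(f"If {f.premise}, then {f.conclusion}." for f in chain)
    question = f"Suppose {chain[0].premise}. Does it follow that {goal}?"
    # The answer follows the rules as written: a flipped hop breaks entailment.
    answer = "no" if any(f.counterfactual for f in chain) else "yes"
    return context, question, answer

if __name__ == "__main__":
    for hops in (2, 4):  # reasoning complexity scales with hop count
        ctx, q, a = make_instance(hops, flip=True, seed=hops)
        print(f"[{hops} hops] {ctx}\n  Q: {q}\n  A: {a}\n")

In this toy scheme, increasing hops lengthens the entailment chain without changing the surface vocabulary, which mirrors the abstract's claim that reasoning complexity can be controlled and scaled automatically as models improve.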

View on arXiv: https://arxiv.org/abs/2406.02804
@article{roewer-despres2024_2406.02804,
  title={ACCORD: Closing the Commonsense Measurability Gap},
  author={François Roewer-Després and Jinyue Feng and Zining Zhu and Frank Rudzicz},
  journal={arXiv preprint arXiv:2406.02804},
  year={2024}
}