Accuracy is Not Agreement: Expert-Aligned Evaluation of Crash Narrative Classification Models

17 April 2025
Sudesh Ramesh Bhagat
Ibne Farabi Shihab
Anuj Sharma
Abstract

This study investigates the relationship between deep learning (DL) model accuracy and expert agreement in classifying crash narratives. We evaluate five DL models -- including BERT variants, USE, and a zero-shot classifier -- against expert labels and narratives, and extend the analysis to four large language models (LLMs): GPT-4, LLaMA 3, Qwen, and Claude. Our findings reveal an inverse relationship: models with higher technical accuracy often show lower agreement with human experts, while LLMs demonstrate stronger expert alignment despite lower accuracy. We use Cohen's Kappa and Principal Component Analysis (PCA) to quantify and visualize model-expert agreement, and employ SHAP analysis to explain misclassifications. Results show that expert-aligned models rely more on contextual and temporal cues than location-specific keywords. These findings suggest that accuracy alone is insufficient for safety-critical NLP tasks. We argue for incorporating expert agreement into model evaluation frameworks and highlight the potential of LLMs as interpretable tools in crash analysis pipelines.
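The agreement analysis described in the abstract can be made concrete with a short sketch: compute pairwise Cohen's Kappa between the expert and each model, then project the resulting agreement profiles with PCA so that raters who agree with one another land close together. The labels, model names, and the use of scikit-learn below are illustrative assumptions for a minimal example, not details taken from the paper.

# Minimal sketch: model-expert agreement via pairwise Cohen's Kappa,
# visualized with a PCA projection of the agreement matrix.
# All data and rater names here are hypothetical placeholders.
import numpy as np
from itertools import combinations
from sklearn.metrics import cohen_kappa_score
from sklearn.decomposition import PCA

# Hypothetical categorical labels assigned to the same crash narratives.
raters = {
    "expert":    np.array([0, 1, 2, 1, 0, 2, 1, 0]),
    "bert":      np.array([0, 1, 2, 2, 0, 2, 0, 0]),
    "use":       np.array([0, 2, 2, 1, 0, 1, 1, 0]),
    "zero_shot": np.array([0, 1, 1, 1, 0, 2, 1, 1]),
}

names = list(raters)
n = len(names)

# Cohen's Kappa is chance-corrected agreement between two raters,
# so a model can have high raw accuracy yet low kappa against experts.
kappa = np.eye(n)
for i, j in combinations(range(n), 2):
    k = cohen_kappa_score(raters[names[i]], raters[names[j]])
    kappa[i, j] = kappa[j, i] = k

# Embed each rater in 2D; (1 - kappa) serves as a rough disagreement
# distance, so similar agreement profiles cluster together.
coords = PCA(n_components=2).fit_transform(1.0 - kappa)
for name, (x, y) in zip(names, coords):
    print(f"{name:10s} PC1={x:+.3f} PC2={y:+.3f}")

Because Kappa subtracts the agreement expected by chance, it can diverge sharply from accuracy when label distributions are skewed, which is consistent with the paper's central finding that the two metrics rank models differently.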

@article{bhagat2025_2504.13068,
  title={Accuracy is Not Agreement: Expert-Aligned Evaluation of Crash Narrative Classification Models},
  author={Sudesh Ramesh Bhagat and Ibne Farabi Shihab and Anuj Sharma},
  journal={arXiv preprint arXiv:2504.13068},
  year={2025}
}