This study investigates the relationship between deep learning (DL) model accuracy and expert agreement in classifying crash narratives. We evaluate five DL models (BERT variants, the Universal Sentence Encoder (USE), and a zero-shot classifier) against expert-annotated labels on crash narratives, and extend the analysis to four large language models (LLMs): GPT-4, LLaMA 3, Qwen, and Claude. Our findings reveal an inverse relationship: models with higher technical accuracy often show lower agreement with human experts, whereas LLMs demonstrate stronger expert alignment despite lower accuracy. We use Cohen's Kappa and Principal Component Analysis (PCA) to quantify and visualize model-expert agreement, and employ SHAP analysis to explain misclassifications. Results show that expert-aligned models rely more on contextual and temporal cues than on location-specific keywords. These findings suggest that accuracy alone is insufficient for evaluating models in safety-critical NLP tasks. We argue for incorporating expert agreement into model evaluation frameworks and highlight the potential of LLMs as interpretable tools in crash analysis pipelines.
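As a minimal sketch of how the accuracy-versus-agreement distinction could be measured, the snippet below computes both accuracy and Cohen's Kappa between expert labels and a model's predictions. The label set, the example arrays, and the use of scikit-learn are illustrative assumptions; the paper's actual data and code are not part of this abstract.

```python
# Minimal sketch: quantifying model-expert agreement with Cohen's Kappa.
# The label values and prediction arrays below are hypothetical.
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Hypothetical crash-narrative labels assigned by human experts
expert_labels     = ["intersection", "work_zone", "intersection", "other", "work_zone"]
# Hypothetical predictions from one of the evaluated classifiers
model_predictions = ["intersection", "other",     "intersection", "other", "work_zone"]

accuracy = accuracy_score(expert_labels, model_predictions)
kappa = cohen_kappa_score(expert_labels, model_predictions)

# A model can score reasonably on accuracy yet show weaker chance-corrected
# agreement (kappa), which is the distinction the abstract emphasizes.
print(f"Accuracy: {accuracy:.2f}  Cohen's Kappa: {kappa:.2f}")
```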
@article{bhagat2025_2504.13068,
  title={Accuracy is Not Agreement: Expert-Aligned Evaluation of Crash Narrative Classification Models},
  author={Sudesh Ramesh Bhagat and Ibne Farabi Shihab and Anuj Sharma},
  journal={arXiv preprint arXiv:2504.13068},
  year={2025}
}