A MISMATCHED Benchmark for Scientific Natural Language Inference
Scientific Natural Language Inference (NLI) is the task of predicting the semantic relation between a pair of sentences extracted from research articles. Existing datasets for this task are derived from various computer science (CS) domains, whereas non-CS domains are completely ignored. In this paper, we introduce a novel evaluation benchmark for scientific NLI, called MISMATCHED. The new MISMATCHED benchmark covers three non-CS domains: PSYCHOLOGY, ENGINEERING, and PUBLIC HEALTH, and contains 2,700 human-annotated sentence pairs. We establish strong baselines on MISMATCHED using both pre-trained Small Language Models (SLMs) and Large Language Models (LLMs). Our best-performing baseline achieves a Macro F1 of only 78.17%, illustrating the substantial headroom for future improvements. In addition to introducing the MISMATCHED benchmark, we show that incorporating sentence pairs with an implicit scientific NLI relation into model training improves performance on scientific NLI. We make our dataset and code publicly available on GitHub.
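To make the evaluation setup concrete, the following is a minimal sketch of how a baseline could be scored on MISMATCHED with Macro F1. The file name (mismatched_test.csv), its column layout, and the model checkpoint (my-org/scinli-roberta) are placeholders for illustration, not the paper's actual pipeline; the released dataset and code live on the authors' GitHub.

import pandas as pd
from sklearn.metrics import f1_score
from transformers import pipeline

# Hypothetical file layout: one sentence pair per row plus its gold relation label.
pairs = pd.read_csv("mismatched_test.csv")  # columns: premise, hypothesis, label

# Placeholder checkpoint: any SLM fine-tuned for the scientific NLI label set
# would slot in here.
clf = pipeline("text-classification", model="my-org/scinli-roberta")

# Classify each sentence pair; the pipeline handles premise/hypothesis joining.
preds = [
    clf({"text": p, "text_pair": h})[0]["label"]
    for p, h in zip(pairs["premise"], pairs["hypothesis"])
]

# Macro F1 averages per-class F1, so rare relation classes count as much as
# frequent ones -- the metric behind the 78.17% figure reported above.
print(f"Macro F1: {f1_score(pairs['label'], preds, average='macro'):.4f}")

Macro averaging is the natural choice here because benchmark-wide accuracy could mask poor performance on infrequent relation classes.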
@article{shaik2025_2506.04603,
  title   = {A MISMATCHED Benchmark for Scientific Natural Language Inference},
  author  = {Firoz Shaik and Mobashir Sadat and Nikita Gautam and Doina Caragea and Cornelia Caragea},
  journal = {arXiv preprint arXiv:2506.04603},
  year    = {2025}
}