arXiv: 2406.10421

SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

14 June 2024

Leonard Bärmann

Jianfeng Gao

Tobias Röddiger

Alexander Waibel

Rainer Stiefelhagen

Carsten Dachsbacher

Jan Niehues

Papers citing "SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading"

8 / 8 papers shown

CLINB: A Climate Intelligence Benchmark for Foundational Models

Michelle Chen Huebscher

Aleksandar Stanić

Markus Leippold

...

Massimiliano Ciaramita

Lierni Sestorain Saralegui

369

0

0

29 Oct 2025

Mechanisms of Matter: Language Inferential Benchmark on Physicochemical Hypothesis in Materials Synthesis

191

0

0

29 Sep 2025

Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons

Isik Baran Sandan

382

3

0

04 Jun 2025

Testing Low-Resource Language Support in LLMs Using Language Proficiency Exams: the Case of Luxembourgish

Cedric Lothritz

464

2

0

02 Apr 2025

AtmosSci-Bench: Evaluating the Recent Advance of Large Language Model for Atmospheric Science

652

5

0

03 Feb 2025

ProcBench: Benchmark for Multi-Step Reasoning and Following Procedure

Yoshiaki Uchida

339

10

0

04 Oct 2024

SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models

404

12

0

13 Jun 2024

The Invalsi Benchmarks: measuring Linguistic and Mathematical understanding of Large Language Models in Italian

Giovanni Puccetti

333

7

0

27 Mar 2024
