
Market-Bench: Evaluating Large Language Models on Introductory Quantitative Trading and Market Dynamics

Abhay Srivastava
Sam Jung
Spencer Mateega
Main: 8 pages
Bibliography: 2 pages
4 tables
Abstract

We introduce MARKET-BENCH, a benchmark that evaluates large language models (LLMs) on introductory quantitative trading tasks by asking them to construct executable backtesters from natural language strategy descriptions and market assumptions. Each instance specifies one of three canonical strategies: scheduled trading on Microsoft (NASDAQ: MSFT), pairs trading on Coca-Cola (NYSE: KO) and Pepsi (NASDAQ: PEP), or delta hedging on MSFT. Models must produce code whose profit and loss (P&L), drawdown, and position paths match a verifiable reference implementation. We assess thirteen state-of-the-art models using a multi-round evaluation that separates structural reliability (whether the backtest runs) from numerical accuracy (mean absolute error, MAE, of the backtest metrics), assigning failed outputs a duplicated-metrics baseline MAE. While most models reliably execute the simplest strategy (an average of 4.08 executable passes out of 5 rounds), errors vary by orders of magnitude across models and tasks. Gemini 3 Pro and Claude 4.5 Sonnet combine strong reliability with low error on simpler strategies. GPT-5.2 achieves strong overall performance with perfect executability. GPT-5.1 Codex-Max achieves the lowest best-run error on the easiest task. Qwen3 Max attains perfect executability yet sometimes produces inaccurate P&L paths. These results show that current LLMs can scaffold basic trading infrastructure but still struggle to reason robustly about prices, inventory, and risk. We release MARKET-BENCH and a public leaderboard at this https URL.
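To make the scoring concrete: the benchmark compares a candidate backtest's metric paths against a verifiable reference. The sketch below is a minimal illustration in plain NumPy, not the MARKET-BENCH reference implementation; the function names, the frictionless one-asset P&L convention, and the example arrays are all assumptions introduced here for exposition.

```python
import numpy as np

def backtest_metrics(prices: np.ndarray, positions: np.ndarray):
    """Compute a cumulative P&L path and max drawdown from a position path.

    prices and positions are aligned 1-D arrays; the position held over
    each interval earns that interval's price change (no costs or slippage,
    an assumption for this sketch).
    """
    pnl_steps = positions[:-1] * np.diff(prices)          # per-interval P&L
    pnl_path = np.concatenate([[0.0], np.cumsum(pnl_steps)])
    drawdown = np.max(np.maximum.accumulate(pnl_path) - pnl_path)
    return pnl_path, drawdown

def path_mae(candidate: np.ndarray, reference: np.ndarray) -> float:
    """Mean absolute error between a candidate metric path and the reference."""
    return float(np.mean(np.abs(candidate - reference)))

# Hypothetical example: score a model-produced position path against a reference.
prices = np.array([100.0, 101.0, 99.5, 102.0, 103.0])
ref_positions = np.array([1, 1, 1, 1, 1])    # reference: hold one share throughout
cand_positions = np.array([1, 0, 1, 1, 1])   # candidate: flat for one interval

ref_pnl, _ = backtest_metrics(prices, ref_positions)
cand_pnl, _ = backtest_metrics(prices, cand_positions)
print(path_mae(cand_pnl, ref_pnl))           # 0.9 for these toy paths
```

On this toy data the candidate's missed interval shifts its P&L path upward, and the MAE over the path captures that divergence; the paper's duplicated-metrics baseline for non-executable outputs is a separate scoring rule not reproduced here.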
