MacroBench: A Novel Testbed for Web Automation Scripts via Large Language Models

5 October 2025

Main: 6 Pages

Bibliography: 2 Pages

4 Tables

Appendix: 4 Pages

Abstract

We introduce MacroBench, a code-first benchmark that evaluates whether LLMs can synthesize reusable browser-automation programs (macros) from natural-language goals by reading HTML/DOM and emitting Selenium code. MacroBench instantiates seven self-hosted sites covering 681 tasks that vary in interaction complexity and targeting difficulty. Our end-to-end protocol validates generated code via static checks, sandboxed execution, and outcome verification (DOM assertions, database snapshots), and includes a safety suite for scraping, spam/abuse, and credential/privacy prompts. Across 2,636 model-task runs, we observe stratified success: GPT-4o-mini (96.8%), GPT-4o (95.3%), Gemini (89.0%), DeepSeek (83.4%). Models handle simple tasks reliably (91.7%) but fail on complex workflows (0.0%), and none of them meets production-quality coding standards despite functional completion. We release our complete benchmark pipeline, evaluation framework, and experimental results at this https URL to enable reproducible assessment of macro synthesis for web automation.
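To make the first stage of the validation protocol concrete, the following is a minimal sketch of what a static check on a generated macro could look like. It is not the benchmark's implementation: it assumes the generated macros are Python source, and the function name `static_check`, the helper `_dotted_name`, and the deny-list `FORBIDDEN_CALLS` are hypothetical names chosen for illustration. The abstract does not list the actual static rules.

```python
import ast

# Hypothetical deny-list; the benchmark's real static rules are not given in the abstract.
FORBIDDEN_CALLS = {"eval", "exec", "os.system"}


def _dotted_name(node):
    """Return 'a.b.c' for a Name/Attribute chain, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def static_check(source):
    """Inspect generated code without executing it; return a list of problems."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    problems = []
    imports_selenium = False
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # Matches `import selenium` and `import selenium.webdriver`.
            if any(alias.name.split(".")[0] == "selenium" for alias in node.names):
                imports_selenium = True
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for `from . import x`.
            if (node.module or "").split(".")[0] == "selenium":
                imports_selenium = True
        elif isinstance(node, ast.Call):
            name = _dotted_name(node.func)
            if name in FORBIDDEN_CALLS:
                problems.append(f"forbidden call: {name}")

    if not imports_selenium:
        problems.append("selenium is never imported")
    return problems
```

In a pipeline like the one described, a macro would presumably proceed to sandboxed execution and outcome verification only if a check of this kind returns an empty list. Parsing with `ast` rather than importing the module keeps this stage free of side effects, which matters when the code under test is untrusted model output.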
