GraphPilot: Grounded Scene Graph Conditioning for Language-Based Autonomous Driving
Vision-language models have recently emerged as promising planners for autonomous driving, where success hinges on topology-aware reasoning over spatial structure and dynamic interactions from multimodal input. However, existing models are typically trained without supervision that explicitly encodes these relational dependencies, limiting their ability to infer how agents and other traffic entities influence one another from raw sensor data. In this work, we bridge this gap with a novel model-agnostic method that conditions language-based driving models on structured relational context in the form of traffic scene graphs. We serialize scene graphs at multiple abstraction levels and in multiple formats, and incorporate them into models via structured prompt templates, enabling systematic analysis of when and how relational supervision is most beneficial and computationally efficient. Extensive evaluations on the LangAuto and Bench2Drive benchmarks show that scene graph conditioning yields large and persistent improvements: our approach substantially increases Driving Score over competitive LMDrive, BEVDriver, and SimLingo baselines. These results indicate that diverse architectures can effectively internalize and ground relational priors through scene graph-conditioned training, even without requiring scene graph input at test time. Code, fine-tuned models, and our scene graph dataset are publicly available at this https URL.
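The serialization-and-prompting step described above can be sketched as follows. This is a hypothetical illustration, not the paper's actual schema: the entity names, relation labels, and prompt template are assumptions, and the two formats stand in for the abstract's "various abstraction levels and formats".

```python
# Hypothetical sketch of scene graph conditioning: a traffic scene graph
# is serialized to text and injected into a structured prompt for a
# language-based driving model. All names and labels are illustrative.

def serialize_scene_graph(edges, fmt="triples"):
    """Flatten a list of (subject, relation, object) edges into text."""
    if fmt == "triples":
        # Compact symbolic form: one "(subject, relation, object)" per line.
        return "\n".join(f"({s}, {r}, {o})" for s, r, o in edges)
    if fmt == "natural":
        # Natural-language form, a coarser abstraction level.
        return " ".join(f"{s} is {r.replace('_', ' ')} {o}." for s, r, o in edges)
    raise ValueError(f"unknown format: {fmt}")

# Assumed prompt template; the paper's actual templates may differ.
PROMPT_TEMPLATE = (
    "You are a driving agent.\n"
    "Scene graph:\n{graph}\n"
    "Instruction: {instruction}\n"
    "Plan the next maneuver."
)

edges = [
    ("pedestrian_1", "crossing_in_front_of", "ego"),
    ("traffic_light_3", "red_for", "ego"),
]

prompt = PROMPT_TEMPLATE.format(
    graph=serialize_scene_graph(edges, fmt="triples"),
    instruction="Continue straight through the intersection.",
)
```

Because the graph is injected purely as text, the same conditioning can be applied to any language-based driving model without architectural changes, which is consistent with the method being model-agnostic.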