Enhancing the reasoning capabilities of large language models (LLMs) typically relies on massive computational resources and extensive datasets, limiting accessibility for resource-constrained settings. Our study investigates the potential of reinforcement learning (RL) to improve reasoning in small LLMs, focusing on a 1.5-billion-parameter model, DeepSeek-R1-Distill-Qwen-1.5B, under strict constraints: training on 4 NVIDIA A40 GPUs (48 GB VRAM each) within 24 hours. Adapting the Group Relative Policy Optimization (GRPO) algorithm and curating a compact, high-quality mathematical reasoning dataset, we conducted three experiments to explore model behavior and performance. Our results demonstrate rapid reasoning gains - e.g., AMC23 accuracy rising from 63% to 80% and AIME24 reaching 46.7%, surpassing o1-preview - using only 7,000 samples and a $42 training cost, compared to thousands of dollars for baseline models. However, challenges such as optimization instability and length constraints emerged with prolonged training. These findings highlight the efficacy of RL-based fine-tuning for small LLMs, offering a cost-effective alternative to large-scale approaches. We release our code and datasets as open-source resources, providing insights into trade-offs and laying a foundation for scalable, reasoning-capable LLMs in resource-limited environments. All are available at this https URL.
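For context, GRPO replaces the learned value critic of PPO-style methods with group-relative reward normalization: for each prompt, several completions are sampled, and each completion's advantage is its reward standardized against the group's mean and standard deviation. The sketch below illustrates only that normalization step; it is a minimal, assumed illustration (binary correctness rewards, a group size of 8), not the implementation released with the paper.

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: shape (G,), one scalar reward per sampled completion of the same prompt
    # (hypothetical scheme: 1.0 if the final answer is correct, 0.0 otherwise).
    # Each completion's advantage is its reward standardized against the group,
    # which removes the need for a separate learned value function.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical example: 8 sampled solutions to one math problem, 3 of them correct.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0])
print(group_relative_advantages(rewards))

These per-completion advantages would then weight a clipped policy-gradient update, typically with a KL penalty toward a reference model; those pieces are omitted here for brevity.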
@article{dang2025_2503.16219,
  title={Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't},
  author={Quy-Anh Dang and Chris Ngo},
  journal={arXiv preprint arXiv:2503.16219},
  year={2025}
}