ScriptoriumWS: A Code Generation Assistant for Weak Supervision

Abstract

Weak supervision is a popular framework for overcoming the labeled data bottleneck: the need to obtain labels for training data. In weak supervision, multiple noisy-but-cheap sources are used to provide guesses of the label and are aggregated to produce high-quality pseudolabels. These sources are often expressed as small programs written by domain experts, and so are expensive to obtain. Instead, we argue for using code-generation models to act as coding assistants for crafting weak supervision sources. We study prompting strategies to maximize the quality of the generated sources, settling on a multi-tier strategy that incorporates multiple types of information. We explore how to best combine hand-written and generated sources. Using these insights, we introduce ScriptoriumWS, a weak supervision system that, when compared to hand-crafted sources, maintains accuracy and greatly improves coverage.
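
To make the aggregation idea in the abstract concrete, here is a minimal Python sketch of weak supervision sources written as small programs (often called labeling functions) and combined into pseudolabels. Everything in it is an assumption for illustration: the toy spam task, the function names, and especially the aggregation rule, which is a plain majority vote rather than whatever label model ScriptoriumWS uses. It also computes coverage, meaning the fraction of examples that receive a pseudolabel at all.

```python
from collections import Counter

ABSTAIN = -1  # a source may decline to vote on an example

# Hypothetical labeling functions for a toy spam task (1 = spam, 0 = not spam).
# In practice these would be written by experts or by a code-generation model.
def lf_contains_link(text):
    return 1 if "http" in text.lower() else ABSTAIN

def lf_mentions_prize(text):
    return 1 if "prize" in text.lower() else ABSTAIN

def lf_short_greeting(text):
    return 0 if len(text.split()) <= 3 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_greeting]

def majority_vote(text, lfs=LABELING_FUNCTIONS):
    """Aggregate source votes into one pseudolabel; abstain on no votes or ties."""
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes)
    top_label, top_count = counts.most_common(1)[0]
    # If more than one label shares the highest count, the vote is tied.
    if sum(1 for c in counts.values() if c == top_count) > 1:
        return ABSTAIN
    return top_label

def coverage(texts, lfs=LABELING_FUNCTIONS):
    """Fraction of examples that receive a non-abstain pseudolabel."""
    if not texts:
        return 0.0
    labeled = sum(1 for t in texts if majority_vote(t, lfs) != ABSTAIN)
    return labeled / len(texts)
```

Adding more sources, whether hand-written or generated, can only help coverage under this rule when they vote on examples the existing sources abstain on; sources that disagree with existing votes can instead create ties.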

@article{huang2025_2502.12366,
  title={ScriptoriumWS: A Code Generation Assistant for Weak Supervision},
  author={Tzu-Heng Huang and Catherine Cao and Spencer Schoenberg and Harit Vishwakarma and Nicholas Roberts and Frederic Sala},
  journal={arXiv preprint arXiv:2502.12366},
  year={2025}
}