Domain Pre-training Impact on Representations

Main: 4 pages; Bibliography: 5 pages; Appendix: 5 pages
8 figures; 7 tables
Abstract
This empirical study analyzes the effects of the pre-training corpus on the quality of learned transformer representations. We focus on the representation quality induced solely through pre-training. Our experiments show that pre-training on a small, specialized corpus can yield effective representations, and that the success of combining a generic and a specialized corpus depends on the distributional similarity between the target task and the specialized corpus.
@article{gonzalez-gutierrez2025_2505.24455,
  title={Domain Pre-training Impact on Representations},
  author={Cesar Gonzalez-Gutierrez and Ariadna Quattoni},
  journal={arXiv preprint arXiv:2505.24455},
  year={2025}
}