
PART: Pre-trained Authorship Representation Transformer

Abstract

Authors writing documents imprint identifying information within their texts: vocabulary, register, punctuation, misspellings, or even emoji usage. Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors. Using stylometric representations is more suitable, but learning them is itself an open research challenge. In this paper, we propose PART, a contrastively trained model that learns authorship embeddings instead of semantics. We train our model on ~1.5M texts belonging to 1162 literary authors, 17287 blog posters, and 135 corporate email accounts, a heterogeneous set with identifiable writing styles. We evaluate the model on current authorship challenges, achieving competitive performance. We also evaluate our model on test splits of these datasets, achieving 72.39% zero-shot accuracy when bounded to 250 authors, 54% and 56% higher than RoBERTa embeddings. We qualitatively assess the representations with different data visualizations on the available datasets, observing features such as gender, age, or occupation of the author.
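The abstract names two ingredients, a contrastive training objective over authorship embeddings and zero-shot attribution among a bounded set of authors, without spelling out the exact formulation. A minimal sketch is given below, assuming an InfoNCE-style loss where two texts by the same author form a positive pair and all other pairings in the batch are negatives; the encoder, batch construction, temperature, and the centroid-based attribution helper are illustrative placeholders, not the authors' implementation.

# Minimal sketch of a contrastive objective over authorship embeddings.
# Assumptions (not from the abstract): an InfoNCE-style loss where two texts
# written by the same author are a positive pair; the encoder, batching and
# hyperparameters are placeholders.
import torch
import torch.nn.functional as F

def authorship_contrastive_loss(emb_a: torch.Tensor,
                                emb_b: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """emb_a[i] and emb_b[i] embed two different texts by the same (i-th)
    author; all off-diagonal pairings are treated as negatives."""
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.T / temperature            # pairwise cosine similarities
    targets = torch.arange(emb_a.size(0), device=emb_a.device)
    # Symmetric cross-entropy: pull matching pairs onto the diagonal.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# Zero-shot attribution sketch: embed a query text and pick the closest of the
# candidate author centroids (e.g. 250 authors) by cosine similarity.
def attribute(query_emb: torch.Tensor, author_centroids: torch.Tensor) -> int:
    sims = F.normalize(query_emb, dim=-1) @ F.normalize(author_centroids, dim=-1).T
    return int(sims.argmax())

Under this formulation no classifier head is trained per author: attribution reduces to nearest-neighbour search in embedding space, which is what makes the zero-shot evaluation over unseen authors possible.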

@article{huertas-tato2025_2209.15373,
  title={PART: Pre-trained Authorship Representation Transformer},
  author={Javier Huertas-Tato and Alejandro Martin and David Camacho},
  journal={arXiv preprint arXiv:2209.15373},
  year={2025}
}