Native Language Identification in Turkish: L1 Influence of Arabic, Persian, and Albanian

27 July 2023

Ahmet Uluslu

Gerold Schneider

Main: 2 pages; 4 figures; 3 tables; Appendix: 5 pages

Abstract

This paper presents the first application of Native Language Identification (NLI) for the Turkish language. NLI is the task of automatically identifying an individual's native language (L1) based on their writing or speech in a non-native language (L2). While most NLI research has focused on L2 English, our study extends this scope to L2 Turkish by analyzing a corpus of texts written by native speakers of Albanian, Arabic and Persian. We leverage a cleaned version of the Turkish Learner Corpus and demonstrate the effectiveness of syntactic features, comparing a structural Part-of-Speech n-gram model to a hybrid model that retains function words. Our models achieve promising results, and we analyze the most predictive features to reveal L1-specific transfer effects. We make our data and code publicly available for further study.
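To make the contrast between the two feature sets concrete, the following is a minimal sketch, not the authors' implementation. It assumes tokens arrive already POS-tagged with Universal Dependencies style tags; the set `FUNCTION_TAGS`, the helper names, and the hand-tagged toy sentence are all illustrative assumptions and are not taken from the paper or the Turkish Learner Corpus.

```python
from collections import Counter

# Assumption: UD-style tags treated as function-word classes. The paper's
# actual definition of "function word" may differ.
FUNCTION_TAGS = {"ADP", "AUX", "CCONJ", "DET", "PART", "PRON", "SCONJ"}


def structural_sequence(tagged):
    """Structural view: replace every token with its POS tag."""
    return [tag for _, tag in tagged]


def hybrid_sequence(tagged):
    """Hybrid view: keep function words as surface forms, and replace all
    other tokens with their POS tags.

    Note: str.lower() is not Turkish-aware (I/İ casing), so a real pipeline
    would need locale-specific lowercasing.
    """
    return [word.lower() if tag in FUNCTION_TAGS else tag for word, tag in tagged]


def ngram_counts(sequence, n):
    """Count contiguous n-grams (as tuples) in a sequence of symbols."""
    return Counter(tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1))


# Toy sentence with hand-assigned tags (hypothetical, not corpus data):
# "Ben ve sen okula gittik" ("You and I went to school").
sentence = [
    ("Ben", "PRON"),
    ("ve", "CCONJ"),
    ("sen", "PRON"),
    ("okula", "NOUN"),
    ("gittik", "VERB"),
]

structural = structural_sequence(sentence)
hybrid = hybrid_sequence(sentence)

structural_bigrams = ngram_counts(structural, 2)
hybrid_bigrams = ngram_counts(hybrid, 2)
```

In this sketch the structural view discards all lexical information, while the hybrid view keeps words such as "ve" that may carry L1-specific usage patterns. Either set of n-gram counts could then be passed to a standard text classifier; which classifier the paper uses is not stated in the abstract.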
