Contextualising Levels of Language Resourcedness that affect NLP tasks

29 September 2023

Ramin Barati

Reza Safabakhsh

Main: 27 Pages

2 Tables

Abstract

Many widely used software applications process natural language in some form, with tasks ranging from digitising hardcopies and text processing to speech generation. A variety of language resources are used to develop software systems for a wide range of natural language processing (NLP) tasks, such as the ubiquitous spellcheckers and chatbots. Languages are typically characterised as either low-resourced (LRL) or high-resourced (HRL), with African languages characterised as resource-scarce and English as by far the most well-resourced language. But what lies in between? We argue that the dichotomous typology of LRL and HRL for all languages is problematic. Through a clear understanding of language resources situated in a society, we develop a matrix that characterises languages as Very LRL, LRL, RL, HRL and Very HRL. The characterisation is based on a typology of contextual features for each category, rather than on counting tools. We motivate each feature and each characterisation. Contextualising resourcedness, with a focus on African languages in this paper, and better understanding where on the scale the language used in a project lies may assist in, among other things, better planning of research and implementation projects. We thus argue that characterising language resources within a given scale is an indispensable component of a project, particularly for languages in the lower half of the scale.
