ResearchTrend.AI

Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs

28 February 2025
Fakhraddin Alwajih
Abdellah El Mekki
Samar Magdy
AbdelRahim Elmadany
Omer Nacar
El Moatez Billah Nagoudi
Reem Abdel-Salam
Hanin Atwany
Youssef Nafea
Abdulfattah Mohammed Yahya
Rahaf Alhamouri
Hamzah A. Alsayadi
Hiba Zayed
Sara Shatnawi
Serry Sibaee
Yasir Ech-Chammakhy
Walid Al-Dhabyani
Marwa Mohamed Ali
Imen Jarraya
Ahmed Oumar El-Shangiti
Aisha Alraeesi
Mohammed Anwar Al-Ghrawi
Abdulrahman S. Al-Batati
Elgizouli Mohamed
Noha Taha Elgindi
Muhammed Saeed
Houdaifa Atou
Issam Ait Yahia
Abdelhak Bouayad
Mohammed Machrouh
Amal Makouar
Dania Alkawi
Mukhtar Mohamed
Safaa Taher Abdelfadil
Amine Ziad Ounnoughene
Rouabhia Anfel
Rwaa Assi
Ahmed Sorkatti
Mohamedou Cheikh Tourad
Anis Koubaa
Ismail Berrada
Mustafa Jarrar
Shady Shehata
Muhammad Abdul-Mageed
Abstract

As large language models (LLMs) become increasingly integrated into daily life, ensuring their cultural sensitivity and inclusivity is paramount. We introduce Palm, a dataset built through a year-long community-driven project covering all 22 Arab countries. The dataset consists of instructions (input-response pairs) in both Modern Standard Arabic (MSA) and dialectal Arabic (DA), spanning 20 diverse topics. Built by a team of 44 researchers across the Arab world, all of whom are authors of this paper, Palm offers a broad, inclusive perspective. We use Palm to evaluate the cultural and dialectal capabilities of several frontier LLMs, revealing notable limitations. For instance, while closed-source LLMs generally exhibit strong performance, they are not without flaws, and smaller open-source models face greater challenges. Moreover, certain countries (e.g., Egypt, the UAE) appear better represented than others (e.g., Iraq, Mauritania, Yemen). Our annotation guidelines, code, and data are publicly available for reproducibility.

@article{alwajih2025_2503.00151,
  title={ Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs },
  author={ Fakhraddin Alwajih and Abdellah El Mekki and Samar Mohamed Magdy and Abdelrahim A. Elmadany and Omer Nacar and El Moatez Billah Nagoudi and Reem Abdel-Salam and Hanin Atwany and Youssef Nafea and Abdulfattah Mohammed Yahya and Rahaf Alhamouri and Hamzah A. Alsayadi and Hiba Zayed and Sara Shatnawi and Serry Sibaee and Yasir Ech-Chammakhy and Walid Al-Dhabyani and Marwa Mohamed Ali and Imen Jarraya and Ahmed Oumar El-Shangiti and Aisha Alraeesi and Mohammed Anwar Al-Ghrawi and Abdulrahman S. Al-Batati and Elgizouli Mohamed and Noha Taha Elgindi and Muhammed Saeed and Houdaifa Atou and Issam Ait Yahia and Abdelhak Bouayad and Mohammed Machrouh and Amal Makouar and Dania Alkawi and Mukhtar Mohamed and Safaa Taher Abdelfadil and Amine Ziad Ounnoughene and Rouabhia Anfel and Rwaa Assi and Ahmed Sorkatti and Mohamedou Cheikh Tourad and Anis Koubaa and Ismail Berrada and Mustafa Jarrar and Shady Shehata and Muhammad Abdul-Mageed },
  journal={arXiv preprint arXiv:2503.00151},
  year={ 2025 }
}