As large language models (LLMs) become increasingly integrated into daily life, ensuring their cultural sensitivity and inclusivity is paramount. We introduce Palm, a dataset built through a year-long community-driven project covering all 22 Arab countries. The dataset consists of instructions (input-response pairs) in both Modern Standard Arabic (MSA) and dialectal Arabic (DA), spanning 20 diverse topics. Built by a team of 44 researchers across the Arab world, all of whom are authors of this paper, Palm offers a broad, inclusive perspective. We use Palm to evaluate the cultural and dialectal capabilities of several frontier LLMs, revealing notable limitations. For instance, while closed-source LLMs generally perform strongly, they are not without flaws, and smaller open-source models face greater challenges. Moreover, certain countries (e.g., Egypt, the UAE) appear better represented than others (e.g., Iraq, Mauritania, Yemen). Our annotation guidelines, code, and data are publicly available for reproducibility.
@article{alwajih2025_2503.00151,
  title={Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs},
  author={Fakhraddin Alwajih and Abdellah El Mekki and Samar Mohamed Magdy and Abdelrahim A. Elmadany and Omer Nacar and El Moatez Billah Nagoudi and Reem Abdel-Salam and Hanin Atwany and Youssef Nafea and Abdulfattah Mohammed Yahya and Rahaf Alhamouri and Hamzah A. Alsayadi and Hiba Zayed and Sara Shatnawi and Serry Sibaee and Yasir Ech-Chammakhy and Walid Al-Dhabyani and Marwa Mohamed Ali and Imen Jarraya and Ahmed Oumar El-Shangiti and Aisha Alraeesi and Mohammed Anwar Al-Ghrawi and Abdulrahman S. Al-Batati and Elgizouli Mohamed and Noha Taha Elgindi and Muhammed Saeed and Houdaifa Atou and Issam Ait Yahia and Abdelhak Bouayad and Mohammed Machrouh and Amal Makouar and Dania Alkawi and Mukhtar Mohamed and Safaa Taher Abdelfadil and Amine Ziad Ounnoughene and Rouabhia Anfel and Rwaa Assi and Ahmed Sorkatti and Mohamedou Cheikh Tourad and Anis Koubaa and Ismail Berrada and Mustafa Jarrar and Shady Shehata and Muhammad Abdul-Mageed},
  journal={arXiv preprint arXiv:2503.00151},
  year={2025}
}