Spanish is not just one: a Spanish dialect dataset for LLMs by IPTC researchers

Researchers from the Information Processing and Telecommunications Center (IPTC) at the Universidad Politécnica de Madrid, in collaboration with linguists from other institutions, present a novel dataset designed to evaluate how Large Language Models (LLMs) handle the diversity of Spanish dialects. While multilingualism is a major focus in Natural Language Processing (NLP), dialectal variation has often been overlooked. Given that Spanish is spoken by over 600 million people across regions with rich lexical, morphological, and syntactic diversity, ensuring fair representation of its dialects is essential.
The proposed dataset comprises 30 expert-curated multiple-choice questions that test LLMs’ ability to identify and reproduce dialectal features across seven major Spanish varieties: Andean, Antillean, Chilean, Continental Caribbean, Mexican and Central American, European Peninsular, and Rioplatense. Each question was meticulously developed and reviewed by linguistic specialists to guarantee accuracy, cultural appropriateness, and clarity.

This resource allows researchers to assess whether LLMs exhibit biases toward specific Spanish varieties or default to one dominant form. Beyond AI benchmarking, the dataset can be applied to sociolinguistic studies, language education, and dialect recognition in human evaluations, promoting awareness of linguistic diversity.
By addressing dialectal bias, this work contributes to the development of more inclusive and equitable AI systems that better reflect the full spectrum of Spanish linguistic and cultural variation, marking an important step toward fairer multilingual NLP evaluation frameworks.
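As a rough illustration of how such an assessment might be tallied, the sketch below computes per-variety accuracy from a model's multiple-choice answers. The record layout (target variety, expected option, chosen option), the function name `accuracy_by_variety`, and the sample answers are all hypothetical. They are not the dataset's actual schema or results from the paper.

```python
from collections import defaultdict


def accuracy_by_variety(results):
    """Return the fraction of correct answers for each Spanish variety.

    `results` is an iterable of (variety, expected_option, chosen_option)
    tuples. This layout is an assumption made for illustration only.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for variety, expected, chosen in results:
        totals[variety] += 1
        if chosen == expected:
            hits[variety] += 1
    return {variety: hits[variety] / totals[variety] for variety in totals}


# Made-up answers for illustration; not data from the paper.
sample = [
    ("Rioplatense", "B", "B"),
    ("Rioplatense", "A", "C"),
    ("European Peninsular", "D", "D"),
    ("Chilean", "A", "B"),
]

scores = accuracy_by_variety(sample)
for variety, score in sorted(scores.items()):
    print(f"{variety}: {score:.2f}")
```

A large gap between varieties in a table like this would be one possible sign that a model favors a dominant form, although with only 30 questions any such gap should be read cautiously.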
Related information:
Our colleagues at the UPM participated in the 41st International Congress of the Spanish Society for Natural Language Processing, held in Zaragoza from September 23 to 26, 2025. The congress brings together the scientific community and industry to share knowledge, discuss language technologies, and present the latest research and developments in NLP.

Marina Mayor Rocher and Gonzalo Martínez presented the article "It’s the same but not the same: Do LLMs distinguish Spanish varieties?", produced in collaboration with SomosNLP (https://somosnlp.org/), the international community of Spanish speakers passionate about NLP. The article uses the dataset to evaluate how much a set of LLMs know about the Spanish varieties.
Bibliographic reference:
Martínez, G., Mayor-Rocher, M., Pozo-Huertas, C., Melero, N., Grandury, M., & Reviriego, P. Spanish is not just one: A dataset of Spanish dialect recognition for LLMs. Data in Brief, 63, 112088. https://doi.org/10.1016/j.dib.2025.112088
For more information: www.iptc.upm.es
LinkedIn: https://www.linkedin.com/company/iptc-upm/
Source of the maps: Moreno Fernández, F., & Otero Roth, J. (2007). Atlas de la lengua española en el mundo. Ariel / Fundación Telefónica.