
arXiv:2511.04914

MERaLiON-SER: Robust Speech Emotion Recognition Model for English and SEA Languages

7 November 2025
Hardik B. Sailor
Aw Ai Ti
Chen Fang Yih Nancy
Chiu Ying Lay
Ding Yang
He Yingxu
Jiang Ridong
Li Jingtao
Liao Jingyi
Liu Zhuohan
Lu Yanfeng
Ma Yi
Manas Gupta
Muhammad Huzaifah Bin Md Shahrin
Nabilah Binte Md Johan
Nattadaporn Lertcheva
Pan Chunlei
Pham Minh Duc
Siti Maryam Binte Ahmad Subaidi
Siti Umairah Binte Mohammad Salleh
Sun Shuo
Tarun K. Vangani
Wang Qiongqiong
Won Cheng Yi Lewis
Wong Heng Meng Jeremy
Wu Jinyang
Zhang Huayun
Zhang Longyin
Zou Xunlong
Abstract

We present MERaLiON-SER, a robust speech emotion recognition model designed for English and Southeast Asian languages. The model is trained using a hybrid objective combining weighted categorical cross-entropy and Concordance Correlation Coefficient (CCC) losses for joint discrete and dimensional emotion modelling. This dual approach enables the model to capture both the distinct categories of emotion (like happy or angry) and the fine-grained dimensional attributes, such as arousal (intensity), valence (positivity/negativity), and dominance (sense of control), leading to a more comprehensive and robust representation of human affect. Extensive evaluations across multilingual Singaporean languages (English, Chinese, Malay, and Tamil) and other public benchmarks show that MERaLiON-SER consistently surpasses both open-source speech encoders and large Audio-LLMs. These results underscore the importance of specialised speech-only models for accurate paralinguistic understanding and cross-lingual generalisation. Furthermore, the proposed framework provides a foundation for integrating emotion-aware perception into future agentic audio systems, enabling more empathetic and contextually adaptive multimodal reasoning.
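The abstract describes a hybrid training objective that mixes weighted categorical cross-entropy over discrete emotion classes with CCC losses over the dimensional attributes (arousal, valence, dominance). As a rough illustration only, the sketch below shows one way such a loss could be composed in PyTorch; the function names, the mixing weights `alpha`/`beta`, and the tensor layout are assumptions made for this example, not details taken from the paper.

```python
import torch
import torch.nn.functional as F


def ccc_loss(pred, target, eps=1e-8):
    """1 - CCC between predicted and target dimensional scores
    (e.g. arousal, valence, or dominance) for one attribute."""
    pred_mean, target_mean = pred.mean(), target.mean()
    pred_var = pred.var(unbiased=False)
    target_var = target.var(unbiased=False)
    covariance = ((pred - pred_mean) * (target - target_mean)).mean()
    ccc = (2 * covariance) / (
        pred_var + target_var + (pred_mean - target_mean) ** 2 + eps
    )
    return 1.0 - ccc


def hybrid_ser_loss(class_logits, class_labels, dim_preds, dim_targets,
                    class_weights=None, alpha=1.0, beta=1.0):
    """Hypothetical hybrid objective: weighted categorical cross-entropy
    over discrete emotion classes plus an average CCC loss over the
    dimensional outputs. `alpha` and `beta` are illustrative mixing
    weights, not values reported in the paper."""
    ce = F.cross_entropy(class_logits, class_labels, weight=class_weights)
    num_dims = dim_preds.shape[1]  # e.g. 3 for arousal, valence, dominance
    ccc = sum(
        ccc_loss(dim_preds[:, i], dim_targets[:, i]) for i in range(num_dims)
    ) / num_dims
    return alpha * ce + beta * ccc
```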
