HERTS: Human Emotion Recognition Through Speech

Intro Machine Learning

Aug 2025 - Dec 2025

This project introduces HERTS: Human Emotion Recognition Through Speech, a hybrid representation framework that combines interpretable prosodic features with rich multilingual embeddings from the wav2vec 2.0 XLSR-53 model. Using strict actor-disjoint splits of the CREMA-D dataset, each audio clip was transformed into a unified hybrid feature vector and evaluated with several classifiers, including Logistic Regression, Random Forest, and a Multilayer Perceptron (MLP), on a six-emotion classification task. The framework shows strong speaker-independent performance on CREMA-D: the hybrid MLP achieved the best result, 68% accuracy, while requiring only a frozen XLSR encoder and a lightweight classifier. View the journal paper here.
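The hybrid feature vector described above can be illustrated with a minimal, dependency-free sketch. It assumes each utterance already has a list of prosodic statistics and a pooled encoder embedding. The helper names (`fit_scaler`, `transform`, `hybrid_vectors`), the toy values, and the per-family z-scoring are illustrative assumptions, not the project's actual code.

```python
from statistics import fmean, pstdev

def fit_scaler(rows):
    """Per-column mean and standard deviation, estimated on training rows only."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Guard against constant columns so the division below is always defined.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds

def transform(rows, scaler):
    """Z-score every row with a previously fitted scaler."""
    means, stds = scaler
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def hybrid_vectors(prosodic_rows, embedding_rows):
    """Concatenate each utterance's prosodic features with its embedding."""
    return [p + e for p, e in zip(prosodic_rows, embedding_rows)]

# Toy values (assumed): 2 prosodic features and a 3-dim stand-in "embedding".
prosodic = [[180.0, 0.02], [220.0, 0.05], [200.0, 0.035]]
embeddings = [[0.1, -0.3, 0.7], [0.4, 0.0, 0.2], [-0.2, 0.3, 0.3]]

prosodic_z = transform(prosodic, fit_scaler(prosodic))
embeddings_z = transform(embeddings, fit_scaler(embeddings))
hybrid = hybrid_vectors(prosodic_z, embeddings_z)
print(len(hybrid), len(hybrid[0]))
```

Normalizing each family before concatenation keeps a large-magnitude feature such as pitch in Hz from dominating the much smaller embedding values.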

Goals

  • Investigate hybrid speech emotion representations by combining interpretable prosodic features with powerful self-supervised XLSR embeddings to better capture emotional cues in speech.
  • Evaluate robustness and generalization through strict actor-disjoint training, ablation study of the feature families, and zero-shot testing across datasets, languages, and recording conditions.
  • Design a lightweight, deployable SER system suitable for real-time human–robot interaction, balancing performance, interpretability, and practical usability.
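The actor-disjoint evaluation named in the goals can be sketched as follows. The sketch relies on the CREMA-D file naming pattern, in which the first underscore-separated field is the actor ID. The function names, split fractions, and seed are assumptions chosen for illustration.

```python
import random
from collections import defaultdict

def actor_id(filename):
    """CREMA-D clips are named like '1001_DFA_ANG_XX.wav'; the first field is the actor."""
    return filename.split("_")[0]

def actor_disjoint_split(filenames, val_frac=0.1, test_frac=0.1, seed=0):
    """Assign whole actors to train/val/test so no speaker spans two splits."""
    by_actor = defaultdict(list)
    for name in filenames:
        by_actor[actor_id(name)].append(name)

    actors = sorted(by_actor)
    random.Random(seed).shuffle(actors)
    n_test = max(1, round(len(actors) * test_frac))
    n_val = max(1, round(len(actors) * val_frac))

    groups = {
        "test": actors[:n_test],
        "val": actors[n_test:n_test + n_val],
        "train": actors[n_test + n_val:],
    }
    return {split: [f for a in ids for f in by_actor[a]] for split, ids in groups.items()}

# Toy file list (assumed): 10 actors with 3 clips each.
files = [f"{1001 + a}_DFA_{emo}_XX.wav" for a in range(10) for emo in ("ANG", "HAP", "SAD")]
splits = actor_disjoint_split(files, val_frac=0.2, test_frac=0.2)
print({name: len(items) for name, items in splits.items()})
```

Splitting by actor rather than by clip means reported accuracy reflects performance on voices the model has never heard.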

Features

  • Full-stack speech emotion recognition pipeline built in Python using PyTorch, combining handcrafted acoustic features with self-supervised wav2vec 2.0 XLSR-53 embeddings.
  • Robust training and evaluation framework with actor-disjoint data splits, feature normalization, and zero-shot cross-dataset testing to ensure generalization across unseen speakers, languages, and recording conditions.
  • Model development and optimization using scikit-learn and fine-tuned MLPs, with early stopping, macro-F1–based validation, and detailed performance analysis with confusion matrices to support deployment in real-time interactive systems.
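Macro-F1 validation with early stopping, as listed above, can be sketched in plain Python. The `EarlyStopping` class, the patience value, and the toy validation curve are hypothetical. In practice a library routine such as scikit-learn's `f1_score` with `average="macro"` would likely replace the hand-written metric.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1, so rare emotions count as much as common ones."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

class EarlyStopping:
    """Signal a stop once the validation score has not improved for `patience` epochs."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, score):
        if score > self.best:
            self.best = score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Toy validation curve (assumed) standing in for per-epoch macro-F1.
stopper = EarlyStopping(patience=2)
history = [0.41, 0.52, 0.55, 0.54, 0.55, 0.53]
stopped_at = None
for epoch, score in enumerate(history):
    if stopper.step(score):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)
```

Selecting the checkpoint by macro-F1 rather than accuracy avoids favoring a model that does well only on the most frequent emotion classes.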

Specific Contributions

  • Independently designed and executed the entire project, from dataset selection and experimental design to implementation, evaluation, and final analysis, with no external collaborators.
  • Implemented all data handling and modeling components, including audio preprocessing, feature extraction, embedding integration, model training, and evaluation logic, using Librosa, PyTorch, scikit-learn, and Hugging Face.
  • Conducted ablation studies to measure the contribution of each feature family and evaluated zero-shot generalization of the CREMA-D-trained model on non-English datasets.
  • Deployed the framework on a physical robotic system with emotion-informed actuation of LX-16A servos.
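The feature-family ablation mentioned above amounts to retraining on every subset of families and comparing scores. A minimal sketch of the bookkeeping follows. The family names and column ranges in `FAMILIES` are hypothetical placeholders, since the actual layout of the hybrid vector is not given here.

```python
from itertools import combinations

# Hypothetical column layout of the hybrid vector: family name -> (start, stop).
FAMILIES = {"pitch": (0, 2), "energy": (2, 4), "xlsr": (4, 8)}

def family_subsets(families):
    """Yield every non-empty combination of feature families."""
    names = sorted(families)
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            yield combo

def select_columns(row, combo, families=FAMILIES):
    """Keep only the columns that belong to the chosen families."""
    kept = []
    for name in combo:
        start, stop = families[name]
        kept.extend(row[start:stop])
    return kept

row = [1, 2, 3, 4, 5, 6, 7, 8]
configs = list(family_subsets(FAMILIES))
for combo in configs:
    # In a real run, a classifier would be trained and scored on these columns.
    print(combo, select_columns(row, combo))
```

Each configuration would then be trained and scored on the same actor-disjoint split, so that differences in score can be attributed to the features and not to the data.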