This project introduces HERTS (Human Emotion Recognition Through Speech), a hybrid representation framework that combines interpretable prosodic features with multilingual embeddings from the wav2vec 2.0 XLSR-53 model. Using strict actor-disjoint splits of the CREMA-D dataset, each audio clip was transformed into a unified hybrid feature vector and evaluated with several classifiers (Logistic Regression, Random Forest, and a Multilayer Perceptron) on a six-emotion classification task. The framework shows strong speaker-independent performance on CREMA-D: the hybrid MLP performed best, reaching 68% accuracy with only a frozen XLSR encoder and a lightweight classifier. View the journal paper here.

Goals

Investigate hybrid speech emotion representations by combining interpretable prosodic features with powerful self-supervised XLSR embeddings to better capture emotional cues in speech.
Evaluate robustness and generalization through strict actor-disjoint training, an ablation study of the feature families, and zero-shot testing across datasets, languages, and recording conditions.
Design a lightweight, deployable SER system suitable for real-time human–robot interaction, balancing performance, interpretability, and practical usability.
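
The actor-disjoint protocol named in the goals can be sketched as follows. This is a minimal illustration, not the project's actual split code: it relies on the CREMA-D file naming convention (for example 1001_DFA_ANG_XX.wav, where the first field is the actor ID), and the function names, split fractions, and seed are assumptions chosen for the example.

```python
import random
from collections import defaultdict
from pathlib import Path


def actor_id(path):
    # CREMA-D clips are named like 1001_DFA_ANG_XX.wav; the first field is the actor ID.
    return Path(path).name.split("_")[0]


def actor_disjoint_split(paths, val_frac=0.1, test_frac=0.1, seed=0):
    # val_frac, test_frac and seed are illustrative defaults, not the project's settings.
    by_actor = defaultdict(list)
    for p in paths:
        by_actor[actor_id(p)].append(p)

    # Shuffle actors (not clips), so every clip of an actor lands in exactly one split.
    actors = sorted(by_actor)
    random.Random(seed).shuffle(actors)
    n_test = max(1, round(len(actors) * test_frac))
    n_val = max(1, round(len(actors) * val_frac))
    groups = {
        "test": actors[:n_test],
        "val": actors[n_test:n_test + n_val],
        "train": actors[n_test + n_val:],
    }
    return {name: [p for a in ids for p in by_actor[a]] for name, ids in groups.items()}


# Synthetic file names standing in for the real dataset listing.
files = [f"{1001 + a}_DFA_{emo}_XX.wav" for a in range(10) for emo in ("ANG", "HAP", "SAD")]
splits = actor_disjoint_split(files)
```

Splitting at the actor level rather than the clip level is what makes the reported accuracy speaker-independent: no voice heard during training appears in validation or test.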

Features

Full-stack speech emotion recognition pipeline built in Python using PyTorch, combining handcrafted acoustic features with self-supervised wav2vec 2.0 XLSR-53 embeddings.
Robust training and evaluation framework with actor-disjoint data splits, feature normalization, and zero-shot cross-dataset testing to ensure generalization across unseen speakers, languages, and recording conditions.
Model development and optimization using scikit-learn and fine-tuned MLPs, with early stopping, macro-F1–based validation, and detailed performance analysis with confusion matrices to support deployment in real-time interactive systems.
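
As a rough sketch of how the hybrid feature vector and normalization step described above might fit together, the snippet below concatenates a prosodic vector with a pooled encoder embedding and fits z-score statistics on the training set only. Several details are assumptions rather than the project's confirmed design: mean pooling over time, the prosodic dimension of 8, and all function names. Random arrays stand in for real Librosa features and XLSR outputs; the 1024-dimensional embedding size matches the XLSR-53 encoder.

```python
import numpy as np

XLSR_DIM = 1024   # hidden size of the wav2vec 2.0 XLSR-53 encoder
PROSODIC_DIM = 8  # hypothetical; the actual prosodic feature count is not stated


def pool_embedding(frames):
    # Assumption: mean-pool frame-level outputs (T, D) into one utterance vector (D,).
    return frames.mean(axis=0)


def hybrid_vector(prosodic, frames):
    # Concatenate handcrafted features with the pooled self-supervised embedding.
    return np.concatenate([prosodic, pool_embedding(frames)])


def fit_zscore(train):
    # Statistics come from training data only, so no test information leaks in.
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # guard against constant columns
    return mean, std


def apply_zscore(x, mean, std):
    return (x - mean) / std


# Random stand-ins for 16 training utterances of varying length.
rng = np.random.default_rng(0)
train = np.stack([
    hybrid_vector(
        rng.normal(size=PROSODIC_DIM),
        rng.normal(size=(int(rng.integers(50, 150)), XLSR_DIM)),
    )
    for _ in range(16)
])
mean, std = fit_zscore(train)
train_norm = apply_zscore(train, mean, std)
```

The same mean and std would then be reused unchanged for validation, test, and zero-shot data, which keeps the two feature families on a comparable scale before they reach the classifier.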

Specific Contributions

Independently designed and executed the entire project, from dataset selection and experimental design to implementation, evaluation, and final analysis, with no external collaborators.
Implemented all data handling and modeling components, including audio preprocessing, feature extraction, embedding integration, model training, and evaluation logic, using Librosa, PyTorch, scikit-learn, and Hugging Face.
Conducted ablation studies to measure the contribution of each feature family and tested zero-shot generalization of the CREMA-D-trained model on non-English datasets.
Deployed the framework on a physical robotic system, using predicted emotions to drive actuation of LX-16A servos.
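
The feature-family ablation mentioned above could be organized along these lines. This is a hedged sketch rather than the project's evaluation code: the column layout in FAMILIES, the helper names, and the fit_predict callable are all hypothetical, and a majority-class baseline stands in for the real classifiers. Macro-F1 is computed by hand here as the unweighted mean of per-class F1 over every label seen in the references or predictions.

```python
from itertools import combinations

# Hypothetical column layout of the hybrid vector: family name -> (start, stop).
FAMILIES = {"prosodic": (0, 2), "xlsr": (2, 5)}


def macro_f1(y_true, y_pred):
    # Unweighted mean of per-class F1 over all labels in y_true or y_pred.
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


def keep_families(rows, kept):
    # Keep only the columns that belong to the listed feature families.
    return [[v for name in kept for v in row[slice(*FAMILIES[name])]] for row in rows]


def ablation_configs():
    # Every non-empty subset of feature families, smallest subsets first.
    names = list(FAMILIES)
    return [combo for r in range(1, len(names) + 1) for combo in combinations(names, r)]


def run_ablation(train_x, train_y, test_x, test_y, fit_predict):
    # fit_predict(train_x, train_y, test_x) -> predictions; any classifier can be wrapped this way.
    results = {}
    for kept in ablation_configs():
        preds = fit_predict(keep_families(train_x, kept), train_y, keep_families(test_x, kept))
        results["+".join(kept)] = macro_f1(test_y, preds)
    return results


def majority_baseline(train_x, train_y, test_x):
    # Placeholder model: always predicts the most frequent training label.
    top = max(set(train_y), key=train_y.count)
    return [top] * len(test_x)
```

Macro-F1 weights all emotion classes equally, so a model that ignores a rare class is penalized even if its overall accuracy looks acceptable; comparing the score per family subset shows how much each family adds.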