- Title
- Speech emotion recognition using deep neural networks
- Creator
- Bakhshi, Ali
- Relation
- University of Newcastle Research Higher Degree Thesis
- Resource Type
- thesis
- Date
- 2021
- Description
- Research Doctorate - Doctor of Philosophy (PhD)
- Description
- Emotion recognition is an interdisciplinary research area spanning psychology, social science, signal processing, and image processing. From a machine learning point of view, emotion recognition is a challenging task due to the different modalities used to express emotions. This Ph.D. thesis proposes several speech emotion recognition frameworks, most of them built on deep neural networks (DNNs) using end-to-end learning. A multimodal model combining speech and physiological signals is also used to recognise real emotions. As a first step, given the importance of deep neural networks in different applications, evolutionary algorithms were used to find the best architecture and hyperparameters for DNNs designed for image classification tasks. The thesis relies mainly on speech signals for emotion recognition, as speech is the simplest means of communication between humans and a rich source of emotional information. Hence, the first speech emotion recognition architecture was designed around a hierarchical classifier that used cepstral coefficients based on evolutionary filterbanks as the emotional features. Classifiers using these optimised features outperformed those using conventional Mel-frequency cepstral coefficients (MFCCs) in overall emotion classification accuracy. Next, an end-to-end speech emotion recognition model is proposed that trains a moderately deep model from scratch on a relatively small training set. Using almost one-third of the RECOLA dataset, the proposed deep model predicted the arousal and valence states comparably to models that used the whole RECOLA dataset. A combination of audio and physiological signals available in the RECOLA dataset was then used in an end-to-end deep multimodal system to predict valid labels for different emotional dimensions.
The multimodal model improved on the predictions of the unimodal models, especially for the valence state. As a real-life application of emotion recognition, we used speech signals extracted from surveillance cameras to detect violence in real situations. Two different DNN frameworks were proposed for violence detection, based on raw speech signals and on Mel-spectrograms of speech signals. Considering the lack of sufficient pre-trained deep models for speech signals, I proposed two different speech-to-image transforms, the CyTex and PhaSion transforms, which are the main contributions of my thesis. The images generated by the CyTex and PhaSion transforms can be used as inputs to pre-trained image-based DNN models that have shown promising performance in various applications. These two speech-to-image transforms are reversible, computationally efficient, and lossless, which ensures that no emotion-related features of the speech signals are lost during the speech-to-image transformation. Using the CyTex and PhaSion images and pre-trained DNN models, we achieved promising results for emotion classification on two popular emotion datasets, EmoDB and IEMOCAP.
- Subject
- speech emotion recognition; multimodal emotion recognition; deep neural networks; evolutionary algorithms
- Identifier
- http://hdl.handle.net/1959.13/1430839
- Identifier
- uon:38885
- Rights
- Copyright 2021 Ali Bakhshi
- Language
- eng
- Full Text
File | Description | Size | Format
---|---|---|---
ATTACHMENT01 | Thesis | 3 MB | Adobe Acrobat PDF
ATTACHMENT02 | Abstract | 206 KB | Adobe Acrobat PDF