This project evaluates multiple audio classification architectures on the BanglaSER dataset to develop an effective Bangla Speech Emotion Recognition (SER) model. Several deep learning models, including a CNN, a Bi-LSTM, a hybrid CNN–BiLSTM, and pretrained YAMNet-based models, were implemented on extracted acoustic features such as MFCCs and Mel-spectrograms. In addition, a Support Vector Machine (SVM) classifier trained on handcrafted features served as a classical baseline. Hyperparameter optimization techniques (Grid Search, Random Search, and Bayesian Optimization), data augmentation, and alternative loss functions were explored to improve performance. Experimental results show that the CNN-based and Bi-LSTM architectures benefit from careful tuning, while the optimized SVM achieves the highest accuracy, highlighting the importance of feature engineering and model selection for Bangla SER.
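As a concrete illustration of the SVM baseline with grid-searched hyperparameters, the sketch below trains a scikit-learn `SVC` inside a `GridSearchCV` over `C` and `gamma`. The feature matrix here is random data standing in for per-utterance handcrafted features (e.g. mean MFCC vectors), and the five-class label set mirrors the emotion categories in BanglaSER; the feature dimensionality, grid values, and split sizes are illustrative assumptions, not the project's actual settings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical stand-in for handcrafted per-utterance features
# (e.g. 40-dim mean MFCC vectors); labels mimic 5 emotion classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 40))
y = rng.integers(0, 5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# SVM baseline: standardize features, then grid-search C and gamma
# (one of the tuning strategies mentioned above).
pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.001]}
grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X_tr, y_tr)

print("best params:", grid.best_params_)
print("test accuracy:", grid.score(X_te, y_te))
```

In practice the random `X` would be replaced by features extracted from the BanglaSER recordings (for example, MFCCs averaged over time with a library such as librosa), with the same pipeline left unchanged.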