TriadNet: A Multimodal Architecture for Big Five Personality Traits Assessment

Soham Pahari*, Sandeep Chand Kumain*, Lalit K Awasthi†

*School of Computer Science, UPES, Dehradun, India
†SPU Mandi, Himachal Pradesh, India

Official implementation of TriadNet, a multimodal deep-learning framework for assessing the Big Five personality traits from video, audio, and text.
- State-of-the-art Performance: Achieves 97.8% cluster accuracy and 96.9% mean trait accuracy on ChaLearn LAP 2017
- Multimodal Fusion: Jointly models visual, auditory, and textual behavioral cues
- Novel Architecture Components:
- Dynamic Face Graph Network (DFGN): Captures spatial-temporal facial dynamics via landmark graphs
- Prosody-Aware Capsule Network (PACN): Models prosodic and acoustic-semantic patterns
- Contextual Sentiment Flow (CSF): Tracks emotional and linguistic shifts in text
- Iterative Modality Dialogue (IMD): Cross-attention-based fusion mechanism
- Trait-Specific Attention Gate (TSAG): Individualized reasoning for each personality trait
Clone the repository from GitHub:
git clone https://github.com/sohampahari/TriadNet.git
cd TriadNet
Python 3.10 is recommended. Create a virtual environment first, then install the dependencies from requirements.txt.
Example:
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
If you run on a GPU, install the CUDA drivers and a PyTorch build that matches your CUDA version.
Create a data folder before running the scripts:
# create data folder
mkdir -p data
Place all data under the data/ folder. We use the First Impressions V2 dataset from the ChaLearn LAP CVPR 2017 challenge, available at:
https://chalearnlap.cvc.uab.cat/
Dataset Statistics:
- ~10,000 video clips total (6,000 train / 2,000 validation / 2,000 test)
- Average duration: 15 seconds
- Big Five trait annotations: [0, 1] fractional scores
For fault tolerance, the pipeline saves embeddings to CSV after each step, so an interrupted run can be resumed without recomputing earlier stages. For each split you run, the output files follow this naming pattern:
Video embeddings:
- video_embeddings_train.csv
- video_embeddings_vali.csv
- video_embeddings_test.csv

Audio embeddings:
- audio_embeddings_train.csv
- audio_embeddings_vali.csv
- audio_embeddings_test.csv

Text embeddings:
- text_embeddings_train.csv
- text_embeddings_vali.csv
- text_embeddings_test.csv

Transcriptions:
- transcribed_videos.csv

Fused features:
- fused_av.csv (video + audio, 1024 dims)
- fused_avt.csv (video + audio + text, 1536 dims)
You can change the file names in the code if you prefer another naming scheme.
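The resume-on-failure behavior described above can be sketched as a simple check for existing output files. This is illustrative only: `needs_run` is a hypothetical helper, not part of the repository's code, and it assumes the default file-naming pattern listed above.

```python
from pathlib import Path

# Hypothetical helper (not in the repo): skip an extraction step when its
# output CSV already exists, so an interrupted run resumes where it stopped.
def needs_run(step: str, split: str, data_dir: str = "data") -> bool:
    out = Path(data_dir) / f"{step}_embeddings_{split}.csv"
    return not out.exists()

# Walk all splits and modalities; only missing outputs would be recomputed.
for split in ("train", "vali", "test"):
    for step in ("video", "audio", "text"):
        if needs_run(step, split):
            print(f"would run {step} extraction for split '{split}'")
```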
1. Prepare the dataset files under data/.
2. Run the video embedding script. This creates video_embeddings_*.csv.
3. Run the audio embedding script. This creates audio_embeddings_*.csv.
4. Run the text transcriber. This creates transcribed_videos.csv and text_embeddings_*.csv.
5. Run the fusion script. This creates fused_av.csv or fused_avt.csv (use fused_avt.csv).
6. Run traingate.py to train models using the fused features.
We recommend computing all embeddings for every split before starting training.
- video_emb.py: video feature extractor and GCN.
- audio_emb.py: audio extractor and PACN.
- text_emb.py: Whisper transcriber and BERT + CSF.
- fusion.py: IMD and iterative fusion. Use mode av or avt.
- traingate.py: training script.
- requirements.txt: Python packages.
Run for each split (train, validation, test):
python video_emb.py --split train
python video_emb.py --split validation
python video_emb.py --split test
What it does:
- Detects faces using MTCNN
- Extracts 68 facial landmarks per frame
- Applies Dynamic Face Graph Network (DFGN)
- Outputs 512-D video embeddings
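To illustrate the landmark-graph idea behind DFGN, the sketch below builds a k-nearest-neighbour adjacency over 2-D landmark points, the kind of graph a landmark-based GCN could operate on. This is not the repository's code; `knn_adjacency`, the toy point set, and `k=4` are assumptions for illustration (a real run would use the 68 landmarks per frame).

```python
import math

# Illustrative sketch: connect each landmark to its k nearest neighbours.
# Ties are broken by node index (sorting on (distance, index) pairs).
def knn_adjacency(landmarks, k=4):
    """landmarks: list of (x, y) points; returns dict node -> k nearest nodes."""
    adj = {}
    for i, (xi, yi) in enumerate(landmarks):
        dists = [
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(landmarks) if j != i
        ]
        dists.sort()
        adj[i] = [j for _, j in dists[:k]]
    return adj

# Toy example with 5 points standing in for the 68 facial landmarks
points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5)]
print(knn_adjacency(points, k=2))
```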
Run for each split:
python audio_emb.py --split train
python audio_emb.py --split validation
python audio_emb.py --split test
What it does:
- Segments audio into windows
- Extracts MFCCs + Wav2Vec2 embeddings
- Applies Prosody-Aware Capsule Network (PACN)
- Outputs 512-D audio embeddings
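The windowing step can be sketched as fixed-length segmentation with overlap, the usual precursor to extracting MFCC or Wav2Vec2 features per window. The `segment` function and the window/hop sizes below are assumptions for illustration, not the repository's exact parameters.

```python
# Illustrative sketch: split a 1-D signal into overlapping fixed-length
# windows; a trailing partial window is dropped.
def segment(signal, window, hop):
    return [signal[i:i + window] for i in range(0, len(signal) - window + 1, hop)]

samples = list(range(10))  # stand-in for raw audio samples
print(segment(samples, window=4, hop=2))
# -> [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```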
Run for each split:
python csf_emb.py --split train
python csf_emb.py --split validation
python csf_emb.py --split test
What it does:
- Transcribes videos using Whisper
- Generates BERT contextual embeddings
- Applies Contextual Sentiment Flow (CSF)
- Outputs 512-D text embeddings
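As a rough intuition for what "sentiment flow" means, the sketch below computes simple trajectory features (mean, range, sign flips) over assumed per-sentence sentiment scores in [-1, 1]. It is a stand-in only; the actual CSF module uses BERT embeddings with an LSTM and attention, and `sentiment_flow_features` is a hypothetical name.

```python
# Illustrative only: summarize how sentiment shifts across an utterance.
def sentiment_flow_features(scores):
    mean = sum(scores) / len(scores)
    rng = max(scores) - min(scores)
    # count transitions between negative and non-negative sentiment
    flips = sum(1 for a, b in zip(scores, scores[1:]) if (a < 0) != (b < 0))
    return {"mean": mean, "range": rng, "sign_flips": flips}

print(sentiment_flow_features([0.3, 0.5, -0.2, -0.4, 0.1]))
```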
Choose fusion mode:
# Audio-Visual fusion (1024-D)
python fusion.py --mode av
# Audio-Visual-Text fusion (1536-D)
python fusion.py --mode avt
What it does:
- Applies Iterative Modality Dialogue (IMD)
- Uses cross-attention for multimodal fusion
- Creates fused representations
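The cross-attention at the heart of the fusion step can be sketched with a minimal single-head version in pure Python: queries from one modality attend over keys/values from another. This is illustrative only; the actual IMD module uses multi-head cross-attention over 512-D features for 3 iterations, and the tiny 2-D vectors below are made up.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

# Single-head scaled dot-product cross-attention (illustrative sketch).
def cross_attention(queries, keys, values):
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

# Tiny example: 2 "video" query vectors attend over 3 "audio" key/value vectors
q = [[1.0, 0.0], [0.0, 1.0]]
kv = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(cross_attention(q, kv, kv))
```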
python traingate.py
Training details:
- Uses fused features from Step 4
- Applies Trait-Specific Attention Gate (TSAG)
- Hierarchical prediction: clusters → traits
- Loss:
L = 0.3 * L_cluster + 0.7 * L_traits
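The combined objective above can be sketched in plain Python. The specific loss choices here are assumptions for illustration (cross-entropy for the cluster head, MSE for the five trait regressions); the paper's exact formulations may differ.

```python
import math

def cross_entropy(probs, target_idx):
    # negative log-likelihood of the true cluster under predicted probabilities
    return -math.log(probs[target_idx])

def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

# L = 0.3 * L_cluster + 0.7 * L_traits
def triadnet_loss(cluster_probs, cluster_target, trait_pred, trait_target):
    return 0.3 * cross_entropy(cluster_probs, cluster_target) \
         + 0.7 * mse(trait_pred, trait_target)

# Made-up predictions for one sample: 3 cluster probs, 5 trait scores in [0, 1]
loss = triadnet_loss([0.7, 0.2, 0.1], 0,
                     [0.52, 0.61, 0.48, 0.55, 0.40],
                     [0.50, 0.60, 0.50, 0.50, 0.45])
print(round(loss, 4))
```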
- DFGN: Graph convolution + LSTM for facial dynamics + EmotionNet for appearance
- PACN: Capsule routing for prosodic hierarchies
- CSF: BERT + LSTM + attention for sentiment flow
- IMD: Multi-head cross-attention fusion (3 iterations)
- TSAG: Per-trait attention gates
- Hierarchical Predictor: Cluster classification + trait regression
| Metric | Value |
|---|---|
| Cluster Accuracy | 97.80% |
| Mean Trait Accuracy | 96.92% |
| Mean Squared Error | 0.0116 |
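For context, the ChaLearn First Impressions benchmark conventionally reports per-trait "accuracy" as 1 minus the mean absolute error over the [0, 1] trait scores. The sketch below computes that metric on made-up predictions (the numbers are not from our experiments).

```python
# ChaLearn-style trait accuracy: 1 - mean absolute error over [0, 1] scores.
def trait_accuracy(pred, target):
    mae = sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)
    return 1.0 - mae

pred   = [0.52, 0.61, 0.48, 0.55, 0.40]  # made-up predicted trait scores
target = [0.50, 0.60, 0.50, 0.50, 0.45]  # made-up ground-truth scores
print(round(trait_accuracy(pred, target), 4))
```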
- MTCNN face detection runs on CPU or GPU.
- Whisper models are large; use a GPU for faster transcription.
- If memory is tight, reduce max_frames or segment_length.
- If you want one node per landmark, change the feature design in the video code.
Lead Developer: Soham Pahari
📧 Email: paharisoham@gmail.com
This project is licensed under the MIT License.
Last Updated: November 2025