
TriadNet: A Multimodal Architecture for Big Five Personality Traits Assessment

Official implementation of TriadNet, a multimodal deep learning framework for Big Five personality trait assessment from video, audio, and text.

Soham Pahari*, Sandeep Chand Kumain*, Lalit K Awasthi†
*School of Computer Science, UPES, Dehradun, India
†SPU Mandi, Himachal Pradesh, India


🔥 Highlights

  • State-of-the-art Performance: Achieves 97.8% cluster accuracy and 96.9% mean trait accuracy on ChaLearn LAP 2017
  • Multimodal Fusion: Jointly models visual, auditory, and textual behavioral cues
  • Novel Architecture Components:
    • Dynamic Face Graph Network (DFGN): Captures spatial-temporal facial dynamics via landmark graphs
    • Prosody-Aware Capsule Network (PACN): Models prosodic and acoustic-semantic patterns
    • Contextual Sentiment Flow (CSF): Tracks emotional and linguistic shifts in text
    • Iterative Modality Dialogue (IMD): Cross-attention-based fusion mechanism
    • Trait-Specific Attention Gate (TSAG): Individualized reasoning for each personality trait

🚀 Installation

Clone the repo

Clone the repository from GitHub:

git clone https://github.com/sohampahari/TriadNet.git
cd TriadNet

Setup

Python 3.10 is recommended. Create a virtual environment first, then install the dependencies from requirements.txt.

Example:

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

For GPU training, install the CUDA drivers and a PyTorch build that matches your CUDA version.


📦 Dataset Preparation

Prepare data folder

Create a data/ folder before running the scripts:

# create data folder
mkdir -p data

Data

Place all data under the data/ folder. We use the First Impressions V2 dataset from ChaLearn (CVPR 2017), available at:

https://chalearnlap.cvc.uab.cat/

Dataset Statistics:

  • ~10,000 video clips total (6,000 train / 2,000 validation / 2,000 test)
  • Average duration: 15 seconds
  • Big Five trait annotations: [0, 1] fractional scores

🔄 Training Pipeline

Fault tolerance

For fault tolerance, the pipeline saves embeddings to disk after each step, so an interrupted run can be resumed without recomputing finished work.
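The resume pattern can be sketched as follows. This is a minimal illustration, not the repo's actual implementation; the function names (`append_embedding`, `finished_ids`) and the feature values are made up for the example.

```python
import csv
import os
import tempfile

def append_embedding(path, video_id, embedding):
    """Append one row (id + features); write a header only for a new file."""
    is_new = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["video_id"] + [f"f{i}" for i in range(len(embedding))])
        writer.writerow([video_id] + list(embedding))

def finished_ids(path):
    """Return the ids already on disk, so a restarted run can skip them."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    return {row[0] for row in rows[1:]}

# Usage: write two embeddings, then check what a resumed run would skip
out = os.path.join(tempfile.mkdtemp(), "video_embeddings_train.csv")
append_embedding(out, "video_0001", [0.12, 0.34, 0.56])
append_embedding(out, "video_0002", [0.78, 0.90, 0.11])
done = finished_ids(out)  # {"video_0001", "video_0002"}
```

A restarted extraction loop would simply `continue` past any id already in `done`.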

CSV files produced

The pipeline writes the following CSV files for each split you run.

Video embeddings

  • video_embeddings_train.csv
  • video_embeddings_vali.csv
  • video_embeddings_test.csv

Audio embeddings

  • audio_embeddings_train.csv
  • audio_embeddings_vali.csv
  • audio_embeddings_test.csv

Text embeddings

  • text_embeddings_train.csv
  • text_embeddings_vali.csv
  • text_embeddings_test.csv

Transcriptions

  • transcribed_videos.csv

Fused features

  • fused_av.csv (video + audio, 1024 dims)
  • fused_avt.csv (video + audio + text, 1536 dims)

You can change the file names in the code if you prefer another naming scheme.
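The fused-file dimensions follow from concatenating the three 512-D per-modality embeddings. The sketch below only illustrates where 1024 and 1536 come from; the actual fusion in the repo goes through the IMD cross-attention module, not plain concatenation.

```python
def concat_fuse(*embeddings):
    """Late fusion by concatenation: the fused width is the sum of the parts."""
    fused = []
    for e in embeddings:
        fused.extend(e)
    return fused

# Each modality produces a 512-D embedding (dummy zeros for illustration)
video = [0.0] * 512
audio = [0.0] * 512
text = [0.0] * 512

av = concat_fuse(video, audio)         # 1024-D, matching fused_av.csv
avt = concat_fuse(video, audio, text)  # 1536-D, matching fused_avt.csv
```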

Workflow

  1. Prepare the dataset files under data/.
  2. Run the video embedding script. This creates video_embeddings_*.csv.
  3. Run the audio embedding script. This creates audio_embeddings_*.csv.
  4. Run the text transcriber. This creates transcribed_videos.csv and text_embeddings_*.csv.
  5. Run the fusion script. This creates fused_av.csv or fused_avt.csv; the full model uses fused_avt.csv.
  6. Run traingate.py to train models using the fused features.

We recommend computing all embeddings for all splits first. Then run training.

Files of interest

  • video_emb.py: video feature extractor and GCN (DFGN).
  • audio_emb.py: audio extractor and PACN.
  • text_emb.py: Whisper transcriber and BERT + CSF.
  • fusion.py: IMD iterative fusion; use --mode av or --mode avt.
  • traingate.py: training script.
  • requirements.txt: Python dependencies.

💻 Usage

Step 1: Extract Video Features

Run for each split (train, validation, test):

python video_emb.py --split train
python video_emb.py --split validation
python video_emb.py --split test

What it does:

  • Detects faces using MTCNN
  • Extracts 68 facial landmarks per frame
  • Applies Dynamic Face Graph Network (DFGN)
  • Outputs 512-D video embeddings
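The graph step at the heart of DFGN can be sketched in miniature. This is a toy single message-passing step over a landmark graph, assuming mean aggregation over neighbours; the real module adds learned weights, an LSTM over time, and EmotionNet features.

```python
def graph_conv_step(features, adjacency):
    """One message-passing step: each landmark node averages its own
    feature vector with those of its graph neighbours."""
    out = []
    for i, feat in enumerate(features):
        group = [features[j] for j in adjacency[i]] + [feat]
        out.append([sum(vals) / len(group) for vals in zip(*group)])
    return out

# Toy graph: 3 landmarks with 2-D features, chained 0 - 1 - 2
# (the real model uses 68 landmarks per frame)
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adj = [[1], [0, 2], [1]]
updated = graph_conv_step(feats, adj)
```

Stacking such steps lets information propagate across the face graph before the temporal model sees it.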

Step 2: Extract Audio Features

Run for each split:

python audio_emb.py --split train
python audio_emb.py --split validation
python audio_emb.py --split test

What it does:

  • Segments audio into windows
  • Extracts MFCCs + Wav2Vec2 embeddings
  • Applies Prosody-Aware Capsule Network (PACN)
  • Outputs 512-D audio embeddings
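The windowing step can be illustrated with a minimal slicer. This is a generic fixed-window/hop segmentation sketch, not the repo's code; the window and hop values here are arbitrary.

```python
def segment_windows(samples, window, hop):
    """Split a 1-D signal into fixed-length windows advanced by `hop` samples.
    Trailing samples that do not fill a full window are dropped."""
    return [samples[i:i + window]
            for i in range(0, len(samples) - window + 1, hop)]

# Toy signal of 10 samples, window of 4, 50% overlap
wave = list(range(10))
wins = segment_windows(wave, window=4, hop=2)  # 4 overlapping windows
```

Each window would then feed the MFCC and Wav2Vec2 extractors before PACN.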

Step 3: Process Text

Run for each split:

python csf_emb.py --split train
python csf_emb.py --split validation
python csf_emb.py --split test

What it does:

  • Transcribes videos using Whisper
  • Generates BERT contextual embeddings
  • Applies Contextual Sentiment Flow (CSF)
  • Outputs 512-D text embeddings

Step 4: Fuse Modalities

Choose fusion mode:

# Audio-Visual fusion (1024-D)
python fusion.py --mode av

# Audio-Visual-Text fusion (1536-D)
python fusion.py --mode avt

What it does:

  • Applies Iterative Modality Dialogue (IMD)
  • Uses cross-attention for multimodal fusion
  • Creates fused representations
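The cross-attention primitive inside IMD can be sketched as single-head scaled dot-product attention, where queries come from one modality and keys/values from another. This is a bare illustration under that assumption; the real module is multi-head, learned, and iterated three times.

```python
import math

def cross_attention(queries, keys, values):
    """Single-head cross-attention: each query scores every key (scaled dot
    product), softmaxes the scores, and mixes the values accordingly."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        m = max(scores)                      # subtract max for stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]  # softmax over keys
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# A query aligned with the first key attends almost entirely to it
attended = cross_attention(queries=[[10.0, 0.0]],
                           keys=[[1.0, 0.0], [0.0, 1.0]],
                           values=[[1.0, 0.0], [0.0, 1.0]])
```

In IMD, each modality takes turns as the query side so the representations refine each other across iterations.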

Step 5: Train Model

python traingate.py

Training details:

  • Uses fused features from Step 4
  • Applies Trait-Specific Attention Gate (TSAG)
  • Hierarchical prediction: clusters → traits
  • Loss: L = 0.3 * L_cluster + 0.7 * L_traits
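The combined objective above is a straightforward weighted sum; a minimal sketch, with the per-term losses assumed to be precomputed scalars:

```python
def triadnet_loss(l_cluster, l_traits, w_cluster=0.3, w_traits=0.7):
    """Total loss L = 0.3 * L_cluster + 0.7 * L_traits: cluster
    classification is weighted lower than trait regression."""
    return w_cluster * l_cluster + w_traits * l_traits

total = triadnet_loss(l_cluster=0.5, l_traits=0.2)  # 0.3*0.5 + 0.7*0.2 = 0.29
```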

Component Details

  • DFGN: Graph convolution + LSTM for facial dynamics + EmotionNet for appearance
  • PACN: Capsule routing for prosodic hierarchies
  • CSF: BERT + LSTM + attention for sentiment flow
  • IMD: Multi-head cross-attention fusion (3 iterations)
  • TSAG: Per-trait attention gates
  • Hierarchical Predictor: Cluster classification + trait regression

Results

Performance on ChaLearn LAP 2017 Test Set

| Metric              | Value  |
|---------------------|--------|
| Cluster Accuracy    | 97.80% |
| Mean Trait Accuracy | 96.92% |
| Mean Squared Error  | 0.0116 |

📝 Notes and Tips

  • The MTCNN face detector runs on CPU or GPU.
  • Whisper models are large; use a GPU for faster transcription.
  • If memory is tight, reduce max_frames or segment_length.
  • To use one node per landmark, change the feature design in the video code.

📧 Contact


Lead Developer: Soham Pahari
📧 Email: paharisoham@gmail.com


📄 License

This project is licensed under the MIT License.


Last Updated: November 2025
