A state-of-the-art multimodal pipeline that analyzes earnings call audio and transcripts to detect executive pressure and generate high-alpha trading signals.
Important
Key Results:
- 75% Directional Accuracy on real stock price reactions.
- +10.3% Alpha Spread over the market benchmark in backtests.
- End-to-End Pipeline: From raw audio (MP3) to BUY/SELL signals.
- Live Dashboard: Interactive Analyst Terminal for signal monitoring.
- Phase 1 (Data Foundation): Processed 35 calls (Earnings-22) with 15k+ aligned segments.
- Phase 2 (Feature Engineering): Extracted 3,000+ multimodal features (Prosody, wav2vec2, FinBERT Sentiment).
- Phase 3 (Interaction Layer): Implemented Divergence Scores and Q&A Pressure Metrics.
- Phase 4 (Advanced Modeling): Trained Cross-Attention Fusion Networks and LightGBM + PCA baselines.
- Phase 5 (Deployment): Built production inference pipeline and a Streamlit-based analyst dashboard.
The system treats earnings calls as pressure-sensitive interaction systems. Instead of just looking at sentiment, it identifies "stress cracks" where managerial wording and vocal delivery diverge.
Raw Audio + Transcript
↓
Speaker Diarization + Transcript Alignment
↓
Feature Extraction (Text + Audio + Interaction)
↓
Cross-Attention Fusion Network
↓
Inference Pipeline (BUY/SELL/HOLD)
↓
Streamlit Analyst Terminal
Our system outperformed the market benchmark by identifying stress-driven underreactions:
| Metric | Result |
|---|---|
| Strategy Return | +5.20% |
| Market Return | -5.18% |
| Alpha Spread | +10.38% |
| Directional Acc | 75.0% |
We provide a professional-grade terminal for quantitative analysts.
- Signal Monitor: Real-time ticker tracking and directional confidence.
- Pressure Sensor: Gauge visualization of executive stress during Q&A.
- Divergence Heatmaps: Pinpoints exactly where the CEO's "voice" didn't match their "words."
To launch:
.venv/bin/streamlit run src/dashboard/app.pyThis project is built as a modular framework and can be extended in several high-value directions:
- Multi-lingual Support: Swap the WhisperX model for a large-v3-distil model to handle international earnings calls (JP, EU, HK).
- Sector-Specific Tuning: Fine-tune the fusion network on specific sectors (e.g., Biotech vs. Consumer Staples) where interaction styles vary.
- Contextual Reasoning: Use GPT-4o or Claude 3.5 to "explain" the detected pressure cracks (e.g., "The CEO hesitated when asked about Q4 margins due to supply chain concerns").
- Autonomous Research: An agent can automatically cross-reference "stress spikes" in the audio with SEC Filings (10-K/10-Q) for deeper verification.
While the Streamlit dashboard provides rapid visualization, a future Production UI would include:
- Web-Based Audio Player: Highlight stress segments on the waveform in real-time.
- Alert System: Telegram/Slack bot integration for instant alerts when high-confidence "Sell" signals are generated during live calls.
- Historical Benchmarking: Comparing current CEO stress levels against their previous 4 quarterly calls.
- ML/DL: PyTorch, LightGBM, Scikit-Learn.
- Audio/NLP: wav2vec2, openSMILE, WhisperX, FinBERT.
- Data Engine: Polars, DuckDB, Parquet.
- Frontend: Streamlit, Plotly.
- Sourcing: Yahoo Finance (Market), Earnings-22 (Audio).
"The strongest signals appear when a manager’s narrative breaks under pressure." This project proves that multimodal interaction analysis is a viable frontier for quantitative finance, delivering measurable alpha over traditional text-only sentiment models.