SMLE - Unified Social Media Listening Engine

In the world of social media analytics, "fragmentation" is the enemy. Data lives in silos. If you want to track a brand’s reputation, you’re usually toggling between a Twitter dashboard, a LinkedIn search, and a Reddit scraper, trying to mentally merge three different data formats into one coherent picture.

We decided to solve this engineering challenge by building SMLE (Social Media Listening Engine).

The goal was ambitious but clear: Create a single, unified pipeline that can listen, aggregate, and analyze conversations across Instagram, TikTok, Twitter/X, Reddit, Facebook, YouTube, and LinkedIn simultaneously.

Here’s a look at how we architected the solution and the tech stack that powers it.

The Architecture: A Unified Pipeline

graph TD
    User((User/Dashboard)) <--> API[API Server Express]
    
    subgraph Orchestration [Orchestrator]
        API --> Pipeline[Run Pipeline]
        Pipeline --> S1[Search Discovery]
        Pipeline --> S2[Scraping Engine]
        Pipeline --> S3[Analysis Engine]
        Pipeline --> S4[Analytics Engine]
    end

    subgraph External [External Services]
        S1 -- SERP/Direct --> BD[Bright Data]
        S2 -- Fetch Results --> BD
        BD -- raw data --> S2
    end

    subgraph AIPipeline [AI & Semantic Layer]
        S3 --> LLM[LLM Provider: Ollama/Gemini]
        LLM -- Sentiment/Topics --> S3
        S3 --> Vector[Embedding Provider]
        Vector -- nomic-embed-text --> S3
    end

    subgraph Storage [Database Layer]
        S1 --> DB
        S2 --> DB[(Unified Storage)]
        S3 --> DB
        S4 --> DB
        
        DBAdapter[Database Adapter] --> CBC[Couchbase Capella]
        DBAdapter --> CDB[CrateDB Cloud]
        DBAdapter --> PG[PostgreSQL + pgvector]
    end

    User -- Semantic Query --> API
    API --> Vector
    Vector -- query vector --> DB
    DB -- Vector Similarity Search --> User

The core philosophy behind SMLE is "One Campaign, Any Platform."

Instead of building seven distinct tools, we built a modular pipeline. When you initiate a search for "Generative AI," the engine spins up parallel processes. Whether the data comes from a TikTok viral video or a LinkedIn thought leadership article, it flows through the same normalization and analysis funnel.
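The "same funnel" idea can be sketched as a set of small per-platform mappers that all emit one shared record type. This is only an illustration: the `UnifiedPost` fields, the raw field names, and the `normalize` helper are assumptions made for the example, not SMLE's actual schema.

```typescript
type Platform =
  | "instagram" | "tiktok" | "twitter" | "reddit"
  | "facebook" | "youtube" | "linkedin";

// Hypothetical unified shape that every platform is mapped into.
interface UnifiedPost {
  platform: Platform;
  url: string;
  author: string;
  text: string;
  likes: number;
  comments: number;
}

type RawRecord = Record<string, unknown>;
type Normalizer = (raw: RawRecord) => UnifiedPost;

// Defensive readers: missing or wrongly typed fields fall back to safe defaults.
const str = (v: unknown): string => (typeof v === "string" ? v : "");
const num = (v: unknown): number => (typeof v === "number" && isFinite(v) ? v : 0);

// Only two mappers are shown; the raw field names are invented for this sketch.
const normalizers: Partial<Record<Platform, Normalizer>> = {
  tiktok: (raw) => ({
    platform: "tiktok",
    url: str(raw["video_url"]),
    author: str(raw["creator"]),
    text: str(raw["caption"]),
    likes: num(raw["like_count"]),
    comments: num(raw["comment_count"]),
  }),
  reddit: (raw) => ({
    platform: "reddit",
    url: str(raw["permalink"]),
    author: str(raw["author"]),
    text: str(raw["title"]),
    likes: num(raw["ups"]),
    comments: num(raw["num_comments"]),
  }),
};

function normalize(platform: Platform, raw: RawRecord): UnifiedPost {
  const mapper = normalizers[platform];
  if (!mapper) throw new Error(`No normalizer registered for ${platform}`);
  return mapper(raw);
}
```

Everything downstream of `normalize` (analysis, storage, search) then only needs to understand one shape, which is the point of a unified pipeline.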

1. Hybrid Data Collection Strategy

One of the biggest hurdles in social scraping is that every platform behaves differently. A "one size fits all" approach doesn't work. We implemented a hybrid strategy using Bright Data’s infrastructure:

  • SERP-Based Discovery: For platforms that are notoriously hard to search directly (like Instagram, Facebook, and LinkedIn), we leverage advanced Google SERP scraping. We construct complex search operators (e.g., site:linkedin.com "keyword") to find relevant post URLs first, and then target those specific URLs for extraction.
  • Direct Keyword Discovery: For platforms with more open discovery mechanics (like TikTok, Reddit, and YouTube), we hit the discovery APIs directly. This is faster and yields richer initial metadata.
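
As a rough sketch of the two strategies above, the routing can be expressed as a lookup table plus a query builder for the `site:` operator. The platform-to-strategy mapping follows the two bullets; Twitter/X is not assigned to either group in this README, so treating it as SERP-based, along with the domain names and function names, is an assumption.

```typescript
type Platform =
  | "instagram" | "tiktok" | "twitter" | "reddit"
  | "facebook" | "youtube" | "linkedin";
type Strategy = "serp" | "direct";

const strategyByPlatform: Record<Platform, Strategy> = {
  instagram: "serp",
  facebook: "serp",
  linkedin: "serp",
  twitter: "serp", // assumption: not stated in the text
  tiktok: "direct",
  reddit: "direct",
  youtube: "direct",
};

// Domains used for the site: operator (assumed values).
const siteByPlatform: Record<Platform, string> = {
  instagram: "instagram.com",
  facebook: "facebook.com",
  linkedin: "linkedin.com",
  twitter: "x.com",
  tiktok: "tiktok.com",
  reddit: "reddit.com",
  youtube: "youtube.com",
};

// Builds a Google query such as: site:linkedin.com "keyword"
function buildSerpQuery(platform: Platform, keyword: string): string {
  const cleaned = keyword.replace(/"/g, "").trim(); // avoid breaking the quoted phrase
  return `site:${siteByPlatform[platform]} "${cleaned}"`;
}

interface DiscoveryPlan {
  platform: Platform;
  strategy: Strategy;
  query: string;
}

function planDiscovery(platforms: Platform[], keyword: string): DiscoveryPlan[] {
  return platforms.map((platform) => {
    const strategy = strategyByPlatform[platform];
    const query = strategy === "serp" ? buildSerpQuery(platform, keyword) : keyword.trim();
    return { platform, strategy, query };
  });
}
```

Calling `planDiscovery(["linkedin", "tiktok"], "Generative AI")` yields one SERP query and one plain keyword query, which a scraping layer could then dispatch in parallel.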

2. The Brain: Local & Cloud LLMs

Raw social data is messy. Hashtags are spammy, descriptions are full of emojis, and sentiment is hard to parse with traditional regex.

We integrated Ollama (Local) and Google Gemini (Cloud) directly into the ingestion pipeline. Every single post passes through an LLM analysis layer that:

  • Scores Sentiment (1-10): Not just "positive/negative," but a nuanced score based on the narrative.
  • Extracts Topics: It reads comments and captions to generate semantic tags (e.g., categorizing a post about "broken screens" under "hardware quality" automatically).
  • Sanitizes Data: It cleans up the noise, leaving us with structured, queryable JSON.
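
LLM output should not be trusted blindly, so a validation step between the model and the database is useful. The sketch below assumes the model was asked to reply with `sentiment` and `topics` fields; those field names, the `PostAnalysis` type, and the neutral fallback of 5 are assumptions for illustration.

```typescript
interface PostAnalysis {
  sentiment: number; // integer in the 1-10 range
  topics: string[];  // lower-cased, de-duplicated tags
}

// Converts an already-parsed LLM reply into a record that is safe to store.
function toPostAnalysis(parsed: unknown): PostAnalysis {
  const obj = (typeof parsed === "object" && parsed !== null ? parsed : {}) as Record<string, unknown>;

  // Accept numbers or numeric strings; anything else becomes NaN.
  const v = obj["sentiment"];
  const rawScore = typeof v === "number" ? v : typeof v === "string" ? parseFloat(v) : NaN;
  // Clamp into 1-10; fall back to 5 when the model gave no usable score (assumed default).
  const sentiment = isFinite(rawScore) ? Math.min(10, Math.max(1, Math.round(rawScore))) : 5;

  const candidate = obj["topics"];
  const rawTopics: unknown[] = Array.isArray(candidate) ? candidate : [];
  const cleaned = rawTopics
    .filter((t): t is string => typeof t === "string")
    .map((t) => t.trim().toLowerCase())
    .filter((t) => t.length > 0);

  return { sentiment, topics: Array.from(new Set(cleaned)) };
}
```

With this in place, a reply such as `{"sentiment": 12, "topics": ["Hardware Quality", " hardware quality "]}` is stored as a score of 10 with a single `hardware quality` tag.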

3. Smart Deduplication & Engagement Tracking

Social media isn't static. A post scraped today might have 10 likes; tomorrow it might have 10,000.

We built a smart deduplication system: instead of ignoring duplicate URLs, the system recognizes them. If a campaign runs and finds a post we’ve already seen:

  1. It skips the heavy re-analysis (saving compute costs).
  2. It updates the engagement metrics (likes, shares, comments).
  3. It logs a history of that post’s growth.

This allows users to track velocity—not just seeing what’s popular, but what’s becoming popular right now.
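
A minimal sketch of that flow, using an in-memory `Map` as a stand-in for the database. The type names, the `recordSighting` and `likeVelocity` helpers, and the choice of likes per hour as the velocity metric are assumptions for illustration.

```typescript
interface Engagement {
  likes: number;
  shares: number;
  comments: number;
}

interface Snapshot extends Engagement {
  at: number; // epoch milliseconds
}

interface StoredPost {
  url: string;
  history: Snapshot[];
}

const store = new Map<string, StoredPost>(); // stand-in for the real database

// Returns true when the URL is new and still needs the (expensive) LLM analysis.
// For a known URL it only appends a metrics snapshot and returns false.
function recordSighting(url: string, metrics: Engagement, at: number): boolean {
  const existing = store.get(url);
  if (existing) {
    existing.history.push({ ...metrics, at });
    return false;
  }
  store.set(url, { url, history: [{ ...metrics, at }] });
  return true;
}

// Likes gained per hour between the two most recent snapshots.
function likeVelocity(url: string): number {
  const history = store.get(url)?.history ?? [];
  if (history.length < 2) return 0;
  const prev = history[history.length - 2];
  const last = history[history.length - 1];
  if (!prev || !last) return 0;
  const hours = (last.at - prev.at) / 3600000;
  return hours > 0 ? (last.likes - prev.likes) / hours : 0;
}
```

Sorting posts by `likeVelocity` rather than by raw like count is what surfaces content that is gaining traction right now.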

The "Killer Feature": Semantic Search

This is where the tech stack really shines. Because we generate vector embeddings for every post during the analysis phase, we aren't limited to keyword searching.

We built a Natural Language Search interface.

Users don't have to search for "customer support" AND "fail" AND "angry." They can simply type: "Find posts where people are complaining about shipping delays."

The engine performs a vector similarity search against the stored embeddings across all 7 platforms. It returns posts that match the intent of the query, even if they don't share a single keyword.
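
To make the idea concrete, here is a small in-memory version of vector similarity ranking using cosine similarity. In practice the database performs this search (for example via pgvector); the helper names and the choice of cosine similarity as the metric are assumptions for the sketch.

```typescript
interface EmbeddedPost {
  url: string;
  embedding: number[];
}

// Cosine similarity: 1 means same direction, 0 means unrelated (orthogonal).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vectors must have the same dimension");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Ranks posts by similarity to the query embedding and keeps the best k.
function topK(query: number[], posts: EmbeddedPost[], k: number): { url: string; score: number }[] {
  return posts
    .map((p) => ({ url: p.url, score: cosineSimilarity(query, p.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

The query text is embedded with the same model as the posts (nomic-embed-text in the diagram above), so "complaining about shipping delays" lands near posts about late parcels even when no keyword overlaps.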

SMLE Vision: Deep Video Intelligence

Beyond text analysis, SMLE Vision provides AI-powered video content analysis for TikTok, Instagram Reels, and YouTube videos.

How It Works

  1. Video Download: Automatically downloads videos using platform-specific downloaders with session-based proxying via Bright Data's Scraping Browser and Web Unlocker
  2. Frame Extraction: Extracts key frames at 1fps using FFmpeg
  3. Visual Analysis: Each frame is analyzed using a vision-capable LLM (llava:latest via Ollama)
  4. Strategic Summary: Aggregates frame analyses into executive summaries with:
    • Overall sentiment (positive/neutral/negative)
    • Visual themes and topics
    • Product insights and brand appearance
    • Strategic recommendations
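
Steps 2 and 4 can be sketched as follows. The FFmpeg arguments use the standard `fps=1` video filter; the output file pattern, the `FrameAnalysis` shape, and the majority-vote aggregation are assumptions for illustration, not necessarily how SMLE Vision aggregates results.

```typescript
type Sentiment = "positive" | "neutral" | "negative";

interface FrameAnalysis {
  second: number;      // timestamp of the frame in the video
  sentiment: Sentiment;
  themes: string[];
}

// Arguments for: ffmpeg -i <input> -vf fps=1 <outDir>/frame_%04d.jpg
function ffmpegFrameArgs(inputPath: string, outDir: string): string[] {
  return ["-i", inputPath, "-vf", "fps=1", `${outDir}/frame_%04d.jpg`];
}

// Majority vote for sentiment plus the three most frequent themes.
function summarizeFrames(frames: FrameAnalysis[]): { sentiment: Sentiment; themes: string[] } {
  const votes: Record<Sentiment, number> = { positive: 0, neutral: 0, negative: 0 };
  const themeCounts = new Map<string, number>();
  for (const frame of frames) {
    votes[frame.sentiment] += 1;
    for (const theme of frame.themes) {
      themeCounts.set(theme, (themeCounts.get(theme) ?? 0) + 1);
    }
  }
  // Start from neutral so that ties (and empty input) resolve to neutral.
  let sentiment: Sentiment = "neutral";
  const order: Sentiment[] = ["positive", "negative"];
  for (const s of order) {
    if (votes[s] > votes[sentiment]) sentiment = s;
  }
  const themes = Array.from(themeCounts.entries())
    .sort((a, b) => b[1] - a[1])
    .slice(0, 3)
    .map(([theme]) => theme);
  return { sentiment, themes };
}
```

The argument array can be passed to a process spawner such as Node's `child_process.spawn("ffmpeg", args)`; a real summary step would likely hand the per-frame results to the LLM rather than rely on counting alone.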

Key Features

  • Real-time Progress: Terminal-style log viewer shows download and analysis progress
  • Robust Downloads:
    • TikTok & Instagram: Uses Scraping Browser with human-like interactions to evade bot detection
    • YouTube: Enforces single-threaded, non-chunked downloads with rate limiting
  • Smart JSON Parsing: Automatically repairs malformed LLM responses
  • Session Persistence: Maintains browser sessions between scraping and downloading for reliability
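
As an illustration of what repairing a malformed LLM response can involve, the sketch below handles three common problems: Markdown code fences around the JSON, chatter before or after the object, and trailing commas. It is a simplified stand-in, not SMLE's actual repair logic, and `parseLlmJson` is a hypothetical name.

```typescript
// Three backticks, built at runtime so the marker never appears literally in this snippet.
const FENCE = "`".repeat(3);

function parseLlmJson(reply: string): unknown {
  // 1. Drop Markdown code fence markers (with or without a "json" label).
  let text = reply.split(FENCE + "json").join("").split(FENCE).join("");

  // 2. Keep only the outermost {...} span, discarding any surrounding prose.
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start === -1 || end < start) throw new Error("No JSON object found in reply");
  text = text.slice(start, end + 1);

  // 3. Remove trailing commas before } or ].
  //    Limitation: this would also alter a string value that contains ",}" or ",]".
  text = text.replace(/,\s*([}\]])/g, "$1");

  return JSON.parse(text);
}
```

If the text is still not valid JSON after these steps, `JSON.parse` throws, and the caller can decide whether to retry the LLM call or skip the frame.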

Requirements

  • FFmpeg: For video frame extraction
  • yt-dlp: For YouTube downloads (included in project)
  • Ollama with llava: Vision-capable model for frame analysis
  • Bright Data Credentials:
    • Scraping Browser (for TikTok/Instagram): SBR_USERNAME and SBR_PASSWORD
    • Web Unlocker (for YouTube): UNLOCKER_USERNAME and UNLOCKER_PASSWORD

The Bright Data credentials are required for downloading videos from TikTok, Instagram, and YouTube.

Interactive Network Graph

Go beyond simple lists with our new Force-Directed Graph visualization. This tool allows you to see the "shape" of the conversation.

Three Distinct Views

  • Influencer Network (Blue): Visualizes who is talking to whom. Node size represents influence score, derived from post volume and topic diversity.
  • Topic Clusters (Indigo): Shows the semantic relationships between themes. See how concepts like "AI" and "Ethics" naturally group together.
  • Community Tribes (Emerald): Automatically detects and groups authors into sub-communities based on shared interests and interaction patterns.
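
The README says only that the influence score is derived from post volume and topic diversity. One way such a score could be combined is shown below; the exact weighting (a logarithm of post count multiplied by the number of distinct topics plus one) is purely an assumption for illustration.

```typescript
interface AuthorActivity {
  author: string;
  posts: { topics: string[] }[];
}

// Hypothetical formula: log2(1 + posts) * (1 + distinct topics).
// The log dampens very prolific authors; diversity acts as a multiplier.
function influenceScore(activity: AuthorActivity): number {
  const volume = activity.posts.length;
  const distinct = new Set<string>();
  for (const post of activity.posts) {
    for (const topic of post.topics) distinct.add(topic);
  }
  return Math.log2(1 + volume) * (1 + distinct.size);
}
```

A graph renderer could then map this score onto node radius, typically after scaling it into a fixed pixel range.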

Narrative Pathfinding (Magic Wand)

Discover how two seemingly unconnected people are linked.

  • Magic Wand: Click the Sparkles icon to instantly find a guaranteed connection in the current network. The system calculates the shortest "Narrative Bridge" between two agents.
  • Interactive Mode: Manually select any Start node and Target node to query the engine for a path.
  • Visual Feedback: The path is highlighted in gold with animated particles flowing between the nodes, proving the chain of influence.
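
Finding the shortest chain between two nodes in an unweighted network is a classic breadth-first search. The sketch below assumes an undirected graph built from author-to-author edges; the function names are hypothetical, and the real engine may instead delegate this to the graph database.

```typescript
type Graph = Map<string, string[]>;

// Builds an undirected adjacency list from pairs of connected nodes.
function buildGraph(edges: [string, string][]): Graph {
  const graph: Graph = new Map();
  const link = (from: string, to: string): void => {
    const neighbours = graph.get(from);
    if (neighbours) neighbours.push(to);
    else graph.set(from, [to]);
  };
  for (const [a, b] of edges) {
    link(a, b);
    link(b, a);
  }
  return graph;
}

// Breadth-first search: returns the shortest path as a list of nodes, or null if none exists.
function shortestPath(graph: Graph, start: string, target: string): string[] | null {
  if (!graph.has(start) || !graph.has(target)) return null;
  const visited = new Set<string>([start]);
  const cameFrom = new Map<string, string>();
  const queue: string[] = [start];

  for (let head = 0; head < queue.length; head++) {
    const node = queue[head] as string;
    if (node === target) {
      // Walk the cameFrom links back to the start to rebuild the path.
      const path: string[] = [];
      let cur: string | undefined = target;
      while (cur !== undefined) {
        path.unshift(cur);
        cur = cameFrom.get(cur);
      }
      return path;
    }
    for (const next of graph.get(node) ?? []) {
      if (!visited.has(next)) {
        visited.add(next);
        cameFrom.set(next, node);
        queue.push(next);
      }
    }
  }
  return null;
}
```

The returned node list is exactly what a front end needs in order to highlight the path edge by edge.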

Why This Matters

Most tools force you to choose between depth (deep analytics on one platform) or breadth (shallow metrics on many). SMLE proves that with the right architecture—combining SERP discovery, targeted scraping, and LLM processing—you can have both.

We can now spin up a campaign in seconds, walk away for coffee, and return to a comprehensive, AI-analyzed report on exactly what the world is saying, everywhere at once.

Features

  • Multi-Platform Tracking: Monitor campaigns on Instagram, TikTok, Reddit, YouTube, and more.
  • Sentiment Analysis: Automated sentiment scoring for posts.
  • Interactive Network Graph: Visual exploration of influencer nodes and narrative paths.
  • SMLE Vision: AI-powered video content analysis with frame-by-frame insights.
  • Semantic Search: Natural language queries across all platforms using vector embeddings.
  • Real-time Dashboard: Visualize campaign performance and trends.
  • Self-Healing: Automatic cleanup of stuck jobs.
  • Secure Authentication: JWT-based auth with protected routes.
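
As an illustration of the self-healing idea, a cleanup pass can mark jobs that have been "running" for too long without an update as failed. The `Job` shape, status names, and timeout approach are assumptions for this sketch, not SMLE's actual job model.

```typescript
type JobStatus = "pending" | "running" | "completed" | "failed";

interface Job {
  id: string;
  status: JobStatus;
  updatedAt: number; // epoch milliseconds of the last progress update
}

// Marks running jobs with no update for longer than timeoutMs as failed.
// Returns the ids of the jobs that were cleaned up.
function cleanupStuckJobs(jobs: Job[], now: number, timeoutMs: number): string[] {
  const cleaned: string[] = [];
  for (const job of jobs) {
    if (job.status === "running" && now - job.updatedAt > timeoutMs) {
      job.status = "failed";
      job.updatedAt = now;
      cleaned.push(job.id);
    }
  }
  return cleaned;
}
```

Running such a pass on a timer, or at server start, keeps the dashboard from showing campaigns that appear to be in progress forever.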

Prerequisites

  • Node.js: v18+
  • Database Server: Right now we support Couchbase, CrateDB, PostgreSQL (with pgvector), and Neo4j.
  • Docker: For running PostgreSQL/Neo4j locally.
  • Bright Data Account: For SERP and scraping capabilities.

Installation

  1. Clone the repository

    git clone https://github.com/mhirschberg/smle
    cd smle
  2. Install Backend Dependencies

    npm install
  3. Install Frontend Dependencies

    cd frontend
    npm install
    cd ..

Infrastructure Start

Before running the application, you must start your chosen database.
We recommend using a cloud instance of Couchbase Capella or CrateDB Cloud for the easiest setup. Neo4j Aura is also a good option for the network graph, instead of a local instance.

If you prefer running PostgreSQL locally via Docker:

docker-compose up -d postgres

If you prefer running Neo4j locally via Docker:

docker-compose up -d neo4j

Configuration

Environment Variables

Copy the example file:

cp .env.example .env

Update .env with your:

  • Database connection string and credentials
  • Bright Data API Key
  • JWT Secret
  • ADMIN_USERNAME and ADMIN_PASSWORD (Your initial login credentials)
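
A small startup check can catch a half-filled `.env` early. The helper below is a hypothetical sketch: only `ADMIN_USERNAME` and `ADMIN_PASSWORD` are variable names taken from this README; check `.env.example` for the actual names of the remaining settings.

```typescript
// Returns the names of required variables that are missing or blank.
function missingEnv(env: Record<string, string | undefined>, required: string[]): string[] {
  return required.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}

// Example with a plain object standing in for process.env.
const required = ["ADMIN_USERNAME", "ADMIN_PASSWORD"];
const exampleEnv = { ADMIN_USERNAME: "admin", ADMIN_PASSWORD: "" };
const missing = missingEnv(exampleEnv, required); // ["ADMIN_PASSWORD"]
```

In a Node.js process the first argument would be `process.env`, and a non-empty result is a good reason to exit with a clear error message before any setup script runs.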

Tip

You can point to a specific environment file (e.g., for switching between local and cloud DBs) by using:
DOTENV_CONFIG_PATH=.env.cb npm run setup:auth

Database Initialization

For Couchbase:

npm run setup:couchbase

For CrateDB:

npm run setup:cratedb

For Postgres:

npm run setup:postgres

This will:

  • Create the necessary database structure and indexes.
  • Create a default application user.

Optional local LLM setup

Install Ollama and run it locally:

ollama serve

Now pull the required models:

ollama pull llama3.2:1b
ollama pull nomic-embed-text

SMLE Vision Setup (Optional)

For video analysis capabilities, install additional dependencies:

1. Install FFmpeg

# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt-get install ffmpeg

# Windows
# Download from https://ffmpeg.org/download.html

2. Pull Vision Model

ollama pull llava:latest

3. Configure Bright Data Proxies

Update your .env with:

  • SBR_USERNAME and SBR_PASSWORD - Scraping Browser credentials
  • UNLOCKER_USERNAME and UNLOCKER_PASSWORD - Web Unlocker credentials

These are required for downloading videos from TikTok, Instagram, and YouTube.

1. Start the Backend API

In the root directory:

npm run dev

Server will start on http://localhost:3001.

2. Start the Frontend Dashboard

In a new terminal, navigate to frontend:

cd frontend
npm run dev

Access the dashboard at http://localhost:5173

Usage

  1. Log in using the credentials created during setup (or register a new user).
  2. Create a Campaign: Enter keywords and select platforms.
  3. View Results: The dashboard will update as data is fetched and analyzed.

Tech Stack

  • Backend: Node.js/Express with a Repository Pattern.
  • Database: Couchbase, CrateDB, or PostgreSQL (with pgvector); Neo4j for the network graph.
  • LLM: Ollama or Google Gemini.
  • Frontend: React + Vite + TailwindCSS.
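
The Repository Pattern mentioned above is what lets the same pipeline run against different databases: the application codes against an interface, and each backend supplies an adapter. The sketch below shows the shape of that idea with an in-memory adapter only; the interface, method names, and factory are assumptions, and a real adapter would use asynchronous database calls.

```typescript
interface Post {
  url: string;
  text: string;
}

// The contract the rest of the application depends on.
interface PostRepository {
  upsert(post: Post): void;
  findByUrl(url: string): Post | undefined;
  count(): number;
}

// Simplest possible adapter; useful for tests and local experiments.
class InMemoryPostRepository implements PostRepository {
  private readonly posts = new Map<string, Post>();

  upsert(post: Post): void {
    this.posts.set(post.url, post);
  }

  findByUrl(url: string): Post | undefined {
    return this.posts.get(url);
  }

  count(): number {
    return this.posts.size;
  }
}

type Backend = "couchbase" | "cratedb" | "postgres" | "memory";

// Factory that picks an adapter by name; only "memory" is implemented in this sketch.
function createRepository(backend: Backend): PostRepository {
  switch (backend) {
    case "memory":
      return new InMemoryPostRepository();
    default:
      throw new Error(`Adapter for ${backend} is not part of this sketch`);
  }
}
```

Because callers only see `PostRepository`, switching from one database to another becomes a configuration change rather than a code change.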
