-- 02-sessionization.sql
-- Problem 02: Session Reconstruction
--
-- Scenario
-- --------
-- A streaming service's playback platform emits a stream of heartbeat events while a
-- title is being watched. The consumer-insights team needs to reconstruct
-- viewing sessions from that stream: a session is a run of heartbeats from
-- the same profile with no gap longer than 30 minutes. Session boundaries
-- feed downstream metrics like average session length, sessions per active
-- day, and bounce-before-first-minute rates.
--
-- Prompt
-- ------
-- Given `page_views (user_id, viewed_at)`, assign a session index to each
-- row such that a new session starts whenever two consecutive events from
-- the same user are more than 30 minutes apart.
--
-- Why this problem matters
-- ------------------------
-- Business relevance:  Sessionisation is the foundation of engagement
--                      reporting for any playback, browsing, or messaging
--                      product. Most top-line retention metrics are defined
--                      on top of it.
-- Skill demonstrated:  Fluency with the gaps-and-islands pattern:
--                      identifying group boundaries with LAG and converting
--                      them into group ids with a running sum.
-- Business impact:     Miscalibrated session timeouts silently shift every
--                      engagement KPI. A 30-minute boundary defined
--                      inconsistently across pipelines is a common cause
--                      of DAU/session-count discrepancies between teams.
-- Schema
-- ------
-- CREATE TABLE page_views (
-- user_id BIGINT,
-- viewed_at TIMESTAMP
-- );
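-- A minimal worked example (hypothetical data) makes the 30-minute rule
-- concrete:
--
--   INSERT INTO page_views (user_id, viewed_at) VALUES
--     (1, '2024-01-01 10:00:00'),
--     (1, '2024-01-01 10:10:00'),  -- 10-minute gap: same session
--     (1, '2024-01-01 11:00:00'),  -- 50-minute gap: new session
--     (2, '2024-01-01 09:00:00');
--
--   Expected result: user 1 gets session indices 1, 1, 2; user 2 gets 1.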
-- ============================================================================
-- Approach
-- ============================================================================
-- Step 1: For each user, look at the previous event's timestamp with LAG.
-- Step 2: Flag a row as "new session" (1) when there is no previous event,
--         or when the gap exceeds 30 minutes; otherwise 0.
-- Step 3: Running-sum that flag within the user; the cumulative total is
--         the session index.
WITH flagged AS (
    SELECT
        user_id,
        viewed_at,
        -- Flag = 1 when the gap from previous view exceeds 30 minutes
        -- (or when there is no previous view at all).
        CASE
            WHEN LAG(viewed_at) OVER w IS NULL
              OR viewed_at - LAG(viewed_at) OVER w > INTERVAL '30 minutes'
            THEN 1
            ELSE 0
        END AS is_new_session
    FROM page_views
    WINDOW w AS (PARTITION BY user_id ORDER BY viewed_at)
)
SELECT
    user_id,
    viewed_at,
    -- Cumulative sum of the boundary flag = session index within the user
    SUM(is_new_session) OVER (
        PARTITION BY user_id
        ORDER BY viewed_at
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS session_num
FROM flagged
ORDER BY user_id, viewed_at;
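-- Downstream use (a sketch, not part of the problem statement): assuming
-- the result above is saved as a view named `user_sessions` (an assumed
-- name), the scenario's "average session length" metric falls out of a
-- GROUP BY on the session index:
--
--   SELECT user_id, session_num,
--          MAX(viewed_at) - MIN(viewed_at) AS session_length
--   FROM user_sessions
--   GROUP BY user_id, session_num;
--
-- Note that single-event sessions report a zero-length interval, which
-- downstream metrics may want to floor at some minimum.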
-- ============================================================================
-- Why this pattern
-- ============================================================================
-- Gaps-and-islands generalises: anywhere you need to group consecutive rows
-- by a condition (session timeout, consecutive wins, continuous
-- subscription), the two-step pattern is:
--   1. Mark the *starts* of groups with a 0/1 flag.
--   2. Running-sum the flag to get a group id.
--
-- Vendor-specific alternatives exist (MATCH_RECOGNIZE in Oracle and
-- Snowflake, session windows in some streaming engines) but are not
-- portable. The LAG + running-sum version works in any dialect with
-- window functions, including Spark SQL. In production pipelines with
-- billions of events, partition pruning on user_id and a sort-merge plan
-- on viewed_at keep this cheap; watch for skew on power users with
-- outsized event counts.