# Finding Gaps in Time-Series Data (Gaps and Islands) Using SQL

A common problem when analyzing application log data is building sessions from logged user activity. A user may use the app for a couple of hours one day and then come back to it the next day. The log records each user action, but it does not tell us when one session ended and the next one started. Typically, a session consists of _m_ minutes of activity followed by _n_ minutes of inactivity, where _n_ is a threshold you choose (this article uses 10 minutes). This is also known as the [Gaps-and-Islands](sessionization.md) problem.

## Sample log data

| username | log_timestamp           |
|----------|-------------------------|
| Angela   | 2020-08-07 20:10:00.000 |
| Scott    | 2020-08-07 20:10:00.000 |
| Bob      | 2020-08-07 20:10:00.000 |
| Bob      | 2020-08-07 20:20:00.000 |
| Angela   | 2020-08-07 20:20:00.000 |
| Scott    | 2020-08-07 20:20:00.000 |
| Bob      | 2020-08-07 20:30:00.000 |
| Angela   | 2020-08-07 20:30:00.000 |
| Scott    | 2020-08-07 20:30:00.000 |
| Angela   | 2020-08-07 20:40:00.000 |
| Scott    | 2020-08-07 20:40:00.000 |
| Bob      | 2020-08-07 20:50:00.000 |
| Angela   | 2020-08-07 20:50:00.000 |
| Scott    | 2020-08-07 20:50:00.000 |
| Bob      | 2020-08-07 21:00:00.000 |
| Bob      | 2020-08-07 21:10:00.000 |
| Scott    | 2020-08-07 22:00:00.000 |
| Scott    | 2020-08-07 22:07:00.000 |
| Scott    | 2020-08-07 22:20:00.000 |
| Scott    | 2020-08-07 22:30:00.000 |

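
To follow along, the sample data can be loaded into a table. The sketch below is one way to do it, assuming a table named `app_log` (the name the queries in this article select from) with a `VARCHAR` username and a `TIMESTAMP` column; the column types are assumptions, not something the article specifies. The rows mirror the sample data, including the `Scott` 22:07 event that the query outputs later in the article rely on.

```sql
-- Hypothetical setup: column types are assumed for illustration.
CREATE TABLE app_log (
    username      VARCHAR,
    log_timestamp TIMESTAMP
);

INSERT INTO app_log (username, log_timestamp) VALUES
    ('Angela', '2020-08-07 20:10:00'),
    ('Scott',  '2020-08-07 20:10:00'),
    ('Bob',    '2020-08-07 20:10:00'),
    ('Bob',    '2020-08-07 20:20:00'),
    ('Angela', '2020-08-07 20:20:00'),
    ('Scott',  '2020-08-07 20:20:00'),
    ('Bob',    '2020-08-07 20:30:00'),
    ('Angela', '2020-08-07 20:30:00'),
    ('Scott',  '2020-08-07 20:30:00'),
    ('Angela', '2020-08-07 20:40:00'),
    ('Scott',  '2020-08-07 20:40:00'),
    ('Bob',    '2020-08-07 20:50:00'),
    ('Angela', '2020-08-07 20:50:00'),
    ('Scott',  '2020-08-07 20:50:00'),
    ('Bob',    '2020-08-07 21:00:00'),
    ('Bob',    '2020-08-07 21:10:00'),
    ('Scott',  '2020-08-07 22:00:00'),
    ('Scott',  '2020-08-07 22:07:00'),
    ('Scott',  '2020-08-07 22:20:00'),
    ('Scott',  '2020-08-07 22:30:00');
```
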
### Desired output

From the app log above, let's say we need to display:

1. username
2. begin_timestamp (start of the session)
3. end_timestamp (end of the session)

We define a _session_ by a 10-minute inactivity threshold: if more than 10 minutes pass with no activity from a user, that user's session is considered ended, and the next event starts a new session.

### MATCH_RECOGNIZE query to create sessions

Below we use SQL's [MATCH_RECOGNIZE](applied-overview-of-MATCH_RECOGNIZE-clause.md) clause to _sessionize_ this data.

```sql
SELECT username,
       session_start,
       session_end
FROM app_log
    MATCH_RECOGNIZE (
        PARTITION BY username
        ORDER BY log_timestamp
        MEASURES
            -- first and last event timestamps within each matched session
            first_value(log_timestamp) AS session_start,
            last_value(log_timestamp)  AS session_end
        -- session_start has no DEFINE entry, so it matches any row;
        -- it is followed by zero or more rows of continuous activity
        PATTERN (session_start continuous_activity*)
        DEFINE
            -- a row continues the session if it falls within 10 minutes of the previous row
            continuous_activity AS log_timestamp <= dateadd('minute', 10, lag(log_timestamp))
    );
```

### Query output

| username | session_start           | session_end             |
|----------|-------------------------|-------------------------|
| Scott    | 2020-08-07 20:10:00.000 | 2020-08-07 20:50:00.000 |
| Scott    | 2020-08-07 22:00:00.000 | 2020-08-07 22:07:00.000 |
| Scott    | 2020-08-07 22:20:00.000 | 2020-08-07 22:30:00.000 |
| Bob      | 2020-08-07 20:10:00.000 | 2020-08-07 20:30:00.000 |
| Bob      | 2020-08-07 20:50:00.000 | 2020-08-07 21:10:00.000 |
| Angela   | 2020-08-07 20:10:00.000 | 2020-08-07 20:50:00.000 |

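
For engines without `MATCH_RECOGNIZE`, the same gaps-and-islands grouping is commonly expressed with `LAG` and a running `SUM`. The following is a minimal sketch rather than a definitive implementation: it assumes the same `app_log` table, keeps the `datediff(minute, ...)` syntax used elsewhere in this article (other dialects spell this differently), and the names `flagged`, `numbered`, `is_new_session`, and `session_id` are illustrative only.

```sql
WITH flagged AS (
    SELECT username,
           log_timestamp,
           -- 1 when the gap to the user's previous event exceeds 10 minutes, else 0.
           -- For a user's first event LAG returns NULL, the comparison is NULL,
           -- and the CASE falls through to 0.
           CASE
               WHEN datediff(
                        minute,
                        lag(log_timestamp) OVER (PARTITION BY username ORDER BY log_timestamp),
                        log_timestamp
                    ) > 10
               THEN 1
               ELSE 0
           END AS is_new_session
    FROM app_log
),
numbered AS (
    SELECT username,
           log_timestamp,
           -- Running total of session starts gives each island its own id per user.
           SUM(is_new_session) OVER (
               PARTITION BY username
               ORDER BY log_timestamp
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS session_id
    FROM flagged
)
SELECT username,
       MIN(log_timestamp) AS session_start,
       MAX(log_timestamp) AS session_end
FROM numbered
GROUP BY username, session_id
ORDER BY username, session_start;
```

The design mirrors the `DEFINE` condition above: a gap of exactly 10 minutes continues the session, and only a gap of more than 10 minutes starts a new one.
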
### CONDITIONAL_TRUE_EVENT query to create sessions

Another way to _sessionize_ this app log is to use the [CONDITIONAL_TRUE_EVENT](conditional_true_event.md) window function.

```sql
select
    username
    , log_timestamp
    -- minutes elapsed since this user's previous event (NULL for the first event)
    , datediff(
        minute
        , lag(log_timestamp) over (partition by username order by log_timestamp asc)
        , log_timestamp
      ) as minutes_since_last_action
    -- counter that increments each time the gap exceeds 10 minutes, i.e. a new session begins
    , conditional_true_event(minutes_since_last_action > 10)
        over (partition by username order by log_timestamp asc)
      as session_count
from app_log;
```

### Query output

| username | log_timestamp           | minutes_since_last_action | session_count |
|----------|-------------------------|---------------------------|---------------|
| Scott    | 2020-08-07 20:10:00.000 |                           | 0             |
| Scott    | 2020-08-07 20:20:00.000 | 10                        | 0             |
| Scott    | 2020-08-07 20:30:00.000 | 10                        | 0             |
| Scott    | 2020-08-07 20:40:00.000 | 10                        | 0             |
| Scott    | 2020-08-07 20:50:00.000 | 10                        | 0             |
| Scott    | 2020-08-07 22:00:00.000 | 70                        | 1             |
| Scott    | 2020-08-07 22:07:00.000 | 7                         | 1             |
| Scott    | 2020-08-07 22:20:00.000 | 13                        | 2             |
| Scott    | 2020-08-07 22:30:00.000 | 10                        | 2             |
| Bob      | 2020-08-07 20:10:00.000 |                           | 0             |
| Bob      | 2020-08-07 20:20:00.000 | 10                        | 0             |
| Bob      | 2020-08-07 20:30:00.000 | 10                        | 0             |
| Bob      | 2020-08-07 20:50:00.000 | 20                        | 1             |
| Bob      | 2020-08-07 21:00:00.000 | 10                        | 1             |
| Bob      | 2020-08-07 21:10:00.000 | 10                        | 1             |
| Angela   | 2020-08-07 20:10:00.000 |                           | 0             |
| Angela   | 2020-08-07 20:20:00.000 | 10                        | 0             |
| Angela   | 2020-08-07 20:30:00.000 | 10                        | 0             |
| Angela   | 2020-08-07 20:40:00.000 | 10                        | 0             |
| Angela   | 2020-08-07 20:50:00.000 | 10                        | 0             |
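
The output above labels each row with a `session_count` but does not yet give one row per session with its start and end. One possible way to get there, sketched under the assumption that the same `app_log` table is used, is to wrap the query in common table expressions and aggregate per user and session. The CTE names `gaps` and `sessions` are illustrative only.

```sql
with gaps as (
    select
        username
        , log_timestamp
        -- minutes since this user's previous event (NULL for the first event)
        , datediff(
            minute
            , lag(log_timestamp) over (partition by username order by log_timestamp asc)
            , log_timestamp
          ) as minutes_since_last_action
    from app_log
),
sessions as (
    select
        username
        , log_timestamp
        -- same session counter as in the query above
        , conditional_true_event(minutes_since_last_action > 10)
            over (partition by username order by log_timestamp asc)
          as session_count
    from gaps
)
select
    username
    , min(log_timestamp) as session_start
    , max(log_timestamp) as session_end
from sessions
group by username, session_count
order by username, session_start;
```

Grouping by both `username` and `session_count` matters because the counter restarts at 0 for every user, so `session_count` alone does not identify a session.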