We discussed with @nrllh a cost-efficient way to get all of a website's resources from the historical crawls. The `crawl.requests` table has no partitioning on `page`, so querying even a single month would cost thousands of dollars.

@pmeenan confirmed the approach to construct the HAR file URLs:
```sql
WITH hars AS (
  SELECT
    -- e.g. 2024-11-01 becomes 'Nov_1_2024' ('%e' pads with a space, which is stripped)
    REPLACE(FORMAT_DATE('%b_%e_%Y', date), ' ', '') AS date,
    wptid,
    page,
    -- crawl directories use 'android' for mobile and 'chrome' for desktop
    IF(client = 'mobile', 'android', 'chrome') AS platform
  FROM crawl.pages
  WHERE date >= '2016-01-01' AND
    is_root_page AND
    page = 'https://www.google.com/'
)
SELECT
  page,
  CONCAT('https://storage.googleapis.com/httparchive/crawls/', platform, '-', date, '/', wptid, '.har.gz') AS url
FROM hars
```
This could be a way to retrospectively analyse metrics that are not available in the crawl tables.
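
For illustration, here is a minimal Python sketch of consuming one of the generated URLs. The `wptid` in the URL is invented, and the explicit gunzip step is an assumption; depending on how the objects are served, the HTTP layer may have decompressed them already.

```python
import gzip
import json
import urllib.request

# Hypothetical URL of the shape produced by the query above (the wptid is made up).
url = ('https://storage.googleapis.com/httparchive/crawls/'
       'chrome-Jan_1_2024/240101_Dx1_ABC123.har.gz')

with urllib.request.urlopen(url) as resp:
    body = resp.read()

# Assumed to be gzipped JSON; skip decompression if the bytes are already plain.
if body[:2] == b'\x1f\x8b':
    body = gzip.decompress(body)
har = json.loads(body)

# Standard HAR 1.2 layout: log -> entries, one entry per request.
for entry in har['log']['entries']:
    print(entry['response']['status'], entry['request']['url'])
```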
To remove the complexity of downloading and parsing HARs, we could offer a custom BigQuery routine that makes them available for analysis directly within a query.
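
One possible shape for such a routine, sketched under the assumption that it would be a BigQuery remote function backed by a Cloud Function; the name `fetch_har` and everything else here is hypothetical, not an existing HTTP Archive API:

```python
import gzip
import json

import functions_framework
import requests


@functions_framework.http
def fetch_har(request):
    """Hypothetical remote function: HAR URL in, HAR JSON string out.

    BigQuery remote functions POST {"calls": [[arg, ...], ...]} and expect
    {"replies": [...]} back, with one result per call, in the same order.
    """
    calls = request.get_json()['calls']
    replies = []
    for (url,) in calls:
        resp = requests.get(url, timeout=60)
        resp.raise_for_status()
        body = resp.content
        if body[:2] == b'\x1f\x8b':  # gunzip if the object arrives compressed
            body = gzip.decompress(body)
        replies.append(body.decode('utf-8'))
    return json.dumps({'replies': replies})
```

On the SQL side this would be registered with `CREATE FUNCTION ... REMOTE WITH CONNECTION` and could then be joined against the `hars` CTE above. Whole HARs can be tens of megabytes, though, so a real routine would more likely extract a specific metric server-side rather than return the full payload.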
P.S. There is also a related idea for scalable access to old HARs: restoring the retrospective BigQuery data. #942