
Simplified access to raw HARs #1011

Open
@max-ostapenko

@nrllh and I discussed a cost-efficient way to get all website resources from historical crawls.
The crawl.requests table is not partitioned by page, so querying even one month would cost thousands.

@pmeenan confirmed the approach to construct HAR file URLs:

-- Build public HAR file URLs for one root page across historical crawls.
WITH hars AS (
  SELECT
    -- '%e' pads single-digit days with a space, so strip spaces: 'Jan_1_2016'
    REPLACE(FORMAT_DATE('%b_%e_%Y', date), ' ', '') AS date,
    wptid,
    page,
    IF(client = 'mobile', 'android', 'chrome') AS platform
  FROM crawl.pages
  WHERE date >= '2016-01-01'
    AND is_root_page
    AND page = 'https://www.google.com/'
)

SELECT
  page,
  CONCAT('https://storage.googleapis.com/httparchive/crawls/', platform, '-', date, '/', wptid, '.har.gz') AS url
FROM hars
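
For anyone building these URLs outside BigQuery, the same construction can be sketched in Python. This is only a mirror of the query above, not an official helper: the function name `har_url` and the `wptid` value in the usage line are hypothetical, and the month names are hard-coded so the result does not depend on the local locale.

```python
from datetime import date

# English month abbreviations, matching what FORMAT_DATE('%b', ...) yields.
_MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

_BASE = "https://storage.googleapis.com/httparchive/crawls/"


def har_url(crawl_date: date, wptid: str, client: str) -> str:
    """Build a HAR file URL the same way the SQL query does (hypothetical helper)."""
    # Same mapping as IF(client = 'mobile', 'android', 'chrome').
    platform = "android" if client == "mobile" else "chrome"
    # Same as '%b_%e_%Y' with the padding space removed, e.g. 'Jan_1_2016'.
    folder_date = f"{_MONTHS[crawl_date.month - 1]}_{crawl_date.day}_{crawl_date.year}"
    return f"{_BASE}{platform}-{folder_date}/{wptid}.har.gz"


# Usage with a made-up wptid:
example = har_url(date(2016, 1, 1), "EXAMPLE_WPTID", "mobile")
```

Here `example` is `https://storage.googleapis.com/httparchive/crawls/android-Jan_1_2016/EXAMPLE_WPTID.har.gz`.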

This could be a way to retrospectively analyse metrics that are not available in the crawls.

To remove the complexity of downloading and parsing HARs, we could offer a BigQuery custom routine that makes them available for analysis directly within a query.
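
To illustrate the downloading and parsing step that such a routine would hide, here is a minimal Python sketch using only the standard library. It assumes the HAR objects are publicly readable and contain a JSON document in the standard HAR layout (`log.entries[].request` / `response`). The function names `fetch_har_bytes` and `parse_har` are hypothetical, and this is not the proposed BigQuery routine itself.

```python
import gzip
import json
from urllib.request import urlopen


def fetch_har_bytes(url: str, timeout: float = 60.0) -> bytes:
    """Download the raw bytes of one HAR file (assumes it is publicly readable)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def parse_har(data: bytes) -> list[dict]:
    """Return one row per request recorded in a (possibly gzipped) HAR document."""
    # Gzip streams start with the magic bytes 0x1f 0x8b; decompress only then,
    # in case the server already delivered the payload decompressed.
    if data[:2] == b"\x1f\x8b":
        data = gzip.decompress(data)
    har = json.loads(data)
    rows = []
    for entry in har.get("log", {}).get("entries", []):
        request = entry.get("request", {})
        response = entry.get("response", {})
        rows.append({
            "url": request.get("url"),
            "method": request.get("method"),
            "status": response.get("status"),
            "mime_type": response.get("content", {}).get("mimeType"),
        })
    return rows
```

A caller would pass the result of `fetch_har_bytes(url)` to `parse_har`, with `url` taken from the query output above.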

P.S. There is also a related idea for scalable access to old HARs: restoring retrospective BigQuery data. #942
