We discussed with @nrllh a cost-efficient way to get all of a website's resources from the historical crawls. The `crawl.requests` table has no partitioning on `page`, so querying even a single month would cost thousands of dollars.

@pmeenan confirmed the approach to construct the HAR file URLs:
```sql
WITH hars AS (
  SELECT
    -- e.g. 2024-11-01 becomes 'Nov_1_2024' ('%e' pads with a space, which is stripped)
    REPLACE(FORMAT_DATE('%b_%e_%Y', date), ' ', '') AS date,
    wptid,
    page,
    -- crawl directories use 'android' for mobile and 'chrome' for desktop
    IF(client = 'mobile', 'android', 'chrome') AS platform
  FROM crawl.pages
  WHERE date >= '2016-01-01' AND
    is_root_page AND
    page = 'https://www.google.com/'
)
SELECT
  page,
  CONCAT('https://storage.googleapis.com/httparchive/crawls/', platform, '-', date, '/', wptid, '.har.gz') AS url
FROM hars
```
This could be a way to retrospectively analyse metrics that are not available in the crawl tables.
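
For illustration, here is a minimal Python sketch of consuming one of the generated URLs. The `wptid` in the URL is invented, and the explicit gunzip step is an assumption; depending on how the objects are served, the HTTP layer may have decompressed them already.

```python
import gzip
import json
import urllib.request

# Hypothetical URL of the shape produced by the query above (the wptid is made up).
url = ('https://storage.googleapis.com/httparchive/crawls/'
       'chrome-Jan_1_2024/240101_Dx1_ABC123.har.gz')

with urllib.request.urlopen(url) as resp:
    body = resp.read()

# Assumed to be gzipped JSON; skip decompression if the bytes are already plain.
if body[:2] == b'\x1f\x8b':
    body = gzip.decompress(body)
har = json.loads(body)

# Standard HAR 1.2 layout: log -> entries, one entry per request.
for entry in har['log']['entries']:
    print(entry['response']['status'], entry['request']['url'])
```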
To remove the complexity of downloading and parsing HARs, we could offer a custom BigQuery routine that makes them available for analysis directly within a query.
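
One possible shape for such a routine, sketched under the assumption that it would be a BigQuery remote function backed by a Cloud Function; the name `fetch_har` and everything else here is hypothetical, not an existing HTTP Archive API:

```python
import gzip
import json

import functions_framework
import requests


@functions_framework.http
def fetch_har(request):
    """Hypothetical remote function: HAR URL in, HAR JSON string out.

    BigQuery remote functions POST {"calls": [[arg, ...], ...]} and expect
    {"replies": [...]} back, with one result per call, in the same order.
    """
    calls = request.get_json()['calls']
    replies = []
    for (url,) in calls:
        resp = requests.get(url, timeout=60)
        resp.raise_for_status()
        body = resp.content
        if body[:2] == b'\x1f\x8b':  # gunzip if the object arrives compressed
            body = gzip.decompress(body)
        replies.append(body.decode('utf-8'))
    return json.dumps({'replies': replies})
```

On the SQL side this would be registered with `CREATE FUNCTION ... REMOTE WITH CONNECTION` and could then be joined against the `hars` CTE above. Whole HARs can be tens of megabytes, though, so a real routine would more likely extract a specific metric server-side rather than return the full payload.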
P.S. There is also a related idea for scalable access to old HARs: restoring the retrospective BigQuery data. #942