Investigate persistent crawler timeouts on some domains and clean up associated records #413

@bolinocroustibat

There are specific domains where the crawler and harvesters encounter near-systematic timeouts:

Image

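SQL query to list the 30 domains with the highest timeout rate over the last 30 days (domains with at least 100 checks only):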

SELECT
  domain,
  COUNT(*) FILTER (WHERE timeout) AS timeouts,
  COUNT(*) AS total_checks,
  ROUND(100.0 * COUNT(*) FILTER (WHERE timeout) / COUNT(*), 2) AS pct_timeout
FROM checks
WHERE created_at >= now() - interval '30 days'
  AND domain IS NOT NULL
  AND domain <> ''
GROUP BY domain
HAVING COUNT(*) >= 100
ORDER BY pct_timeout DESC, timeouts DESC
LIMIT 30;

To do

[ ] Clean up or archive the records/entries associated with these unreachable domains?

[ ] Contact those domains' administrators?
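
As a possible starting point for the first to-do item, here is a minimal Python sketch of how rows returned by the query above could be narrowed down to a list of cleanup candidates. The `DomainStats` type, the `cleanup_candidates` function, the 90% threshold, and the sample domains are all assumptions made for illustration; none of them come from the crawler codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainStats:
    """One row of the timeout query (hypothetical container type)."""
    domain: str
    timeouts: int
    total_checks: int

    @property
    def pct_timeout(self) -> float:
        # Mirrors the SQL expression: 100 * timeouts / total_checks, 2 decimals.
        return round(100.0 * self.timeouts / self.total_checks, 2)


def cleanup_candidates(rows, min_pct=90.0, min_checks=100):
    """Return the domains whose timeout rate is at or above min_pct.

    min_pct is an assumed threshold to be tuned; min_checks mirrors the
    HAVING COUNT(*) >= 100 clause of the query.
    """
    selected = [
        r for r in rows
        if r.total_checks >= min_checks and r.pct_timeout >= min_pct
    ]
    # Worst offenders first, like the ORDER BY of the query.
    selected.sort(key=lambda r: (-r.pct_timeout, -r.timeouts))
    return [r.domain for r in selected]


# Made-up sample rows, not real results.
rows = [
    DomainStats("a.example", 198, 200),
    DomainStats("b.example", 50, 500),
    DomainStats("c.example", 120, 120),
]
print(cleanup_candidates(rows))
```

Keeping the threshold as a parameter leaves open the decision of what counts as "unreachable", which this issue has not settled yet.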
