Spike: Add Guidance on Safe Use of External Monitoring Tools #690

@wisyhambolu

Description

We’ve recently seen customers implement external monitoring tools (e.g. Route53 health checks, Google Stackdriver Monitoring) in ways that generate expensive, uncached requests to /, /feed, or other dynamic pages. During traffic spikes, these requests can place significant and unnecessary load on the backend and negatively impact site performance.
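To put a rough number on that load, here is a minimal back-of-the-envelope sketch. The function name and all figures are hypothetical and chosen only for illustration; they are not measurements from any customer site or documented values for any monitoring provider.

```python
def monitor_requests_per_minute(checkers: int, interval_seconds: float, paths: int = 1) -> float:
    """Estimate uncached backend requests per minute generated by external monitors.

    checkers:         number of probe locations polling the site (assumed value)
    interval_seconds: seconds between polls from each location (assumed value)
    paths:            number of dynamic URLs each location requests
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return checkers * paths * 60.0 / interval_seconds


if __name__ == "__main__":
    # Hypothetical setup: 15 probe locations polling / and /feed every 30 seconds.
    print(monitor_requests_per_minute(15, 30, paths=2))  # 60.0
    # The same setup with a 10-second interval.
    print(monitor_requests_per_minute(15, 10, paths=2))  # 180.0
```

Because none of these requests are served from cache, each one costs a full page render, and that cost stays constant while real traffic is spiking.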

To avoid this, clients should be encouraged to expose a lightweight, static endpoint (e.g. /health) that reliably indicates site availability in a cheap and safe way.
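As a starting point for discussion, a lightweight endpoint could look something like the sketch below. It is a generic, standard-library Python illustration of the idea and is not the Altis healthcheck API; the `/health` path comes from the example above, while `health_response`, `HealthHandler`, the JSON body, and the port are assumptions made for this sketch.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

HEALTH_PATH = "/health"  # path taken from the example in this issue


def health_response():
    """Build a constant response: no database, no templates, no upstream calls."""
    body = json.dumps({"status": "ok"}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Content-Length": str(len(body)),
    }
    return 200, headers, body


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Ignore any query string so monitors cannot vary the request cost.
        if self.path.split("?", 1)[0] != HEALTH_PATH:
            self.send_error(404)
            return
        status, headers, body = health_response()
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep frequent monitor hits out of the request log


if __name__ == "__main__":
    # Hypothetical bind address and port for local experimentation.
    HTTPServer(("127.0.0.1", 8080), HealthHandler).serve_forever()
```

The point of the sketch is that the handler does a fixed, tiny amount of work per request, so polling it frequently costs almost nothing compared with rendering / or /feed.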

We should create documentation that provides clear guidance and best practices for implementing external monitoring without causing performance or stability issues on the sites being monitored.

This issue is to investigate what kind of guidance and best practices we can give.

Acceptance Criteria

  • Identify clear risks and constraints of external monitoring
  • Define recommended best-practice guidance
  • Document do’s and don’ts for client use
    Depending on the results above, and on whether any approval or discussion is needed, the following criteria may be moved to another ticket:
  • Update the documentation: https://docs.altis-dxp.com/cloud/healthchecks/
  • Write the FAQ
  • Write a blog post reminding users about our healthcheck API, linking to the documentation and the newly created FAQ

Ready for Work Checklist

Is this ticket ready to be worked on? See the Play Book Definition of Ready.

  • Is the title clear?
  • Is the description clear and detailed enough?
  • Are acceptance criteria listed?
  • Have any dependencies been identified? (Optional)
  • Have any documentation/playbook changes been identified? (Optional)
  • Is an estimate or time box assigned?
  • Is a priority label assigned?
  • Is this ticket added to a milestone?
  • Is this ticket added to an epic? (Optional)

Completion Checklist

Is this ticket done? See the Play Book Definition of Done.

  • Have the acceptance criteria been met?
  • Is the documentation updated (including README)?
  • Do any code/documentation changes meet project standards?
  • Are automatic tests in place to verify the fix or new functionality?
    • Or are manual tests documented (at least on this ticket)?
  • Are any Playbook/Handbook pages updated?
  • Has a new module release (patch/minor) been created/scheduled?
  • Have the appropriate backport labels been added to the PR?
  • Is there a roll-out (and roll-back) plan if required?

Metadata

Assignees

No one assigned

    Labels

    should have: Should be done, medium priority for now
    spike: An investigation needed in order to refine & estimate a story

    Projects

    No projects

    Milestone

    No milestone
