Ponyhof LLC
Site monitoring currently switches to disabled status after 10 missed checks, so the user must regularly monitor the monitor to make sure it is still monitoring. The message ("Seems unreachable from this location") is also no longer accurate once the system has gone on to complete many successful checks over a long period.
Proposed: at minimum, add an auto-reset so the user does not have to intervene as often, and then other improvements if possible.
(1) AUTO-RESET: Add a rule such as "reset the missed-check count after XX clean checks (XX as a count or a time period)".
(2) RATE LIMITS: 1 message per 5 minutes (per site and notification channel) should be sufficient for most cases. If people need finer granularity, then this is not the right tool for the job.
(3) DISABLE NOTIFICATIONS, NOT CHECKS: We would prefer that checks continue, with notifications reduced to an hourly or daily frequency (except after many days of downtime), followed by a "service restored" notification once the site recovers.
(4) NOISE FILTER: Is there a recheck/validation before sending a notification?
(5) COMBINE NOTIFICATIONS: For our needs, it would be best to combine all locations into a single check and require more than 1 location to fail or show instability before issuing notifications (or send an "info" level message when only a single location fails). If the message includes a string with the number of failing locations, then I can match it with a regex and have OpsGenie/PagerDuty wake me from sleep when appropriate. "Unreachable from 1/3 location(s)" would not wake me, but "2/3" or "3/3" can, for a confirmed outage.
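To make proposals (1), (2), (3) and (5) concrete, here is a minimal Python sketch. It is not how the monitoring service is implemented: the class `SiteMonitor`, the functions `summarize` and `should_page`, and the values `reset_after_clean=5` and `quorum=2` are all hypothetical and chosen for illustration. Only the 10 missed checks, the 5 minute window, and the "Unreachable from N/M location(s)" wording come from the request above.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class SiteMonitor:
    """Hypothetical per-site state; names and thresholds are illustrative."""
    disable_after: int = 10       # missed checks before notifications pause
    reset_after_clean: int = 5    # assumed value for the "XX" in proposal (1)
    rate_limit_s: float = 300.0   # 1 message per 5 minutes, proposal (2)
    missed: int = 0
    clean_streak: int = 0
    last_sent: dict = field(default_factory=dict)  # channel -> last send time

    def record_check(self, ok: bool) -> None:
        """Proposal (1): a run of clean checks resets the missed-check count."""
        if ok:
            self.clean_streak += 1
            if self.clean_streak >= self.reset_after_clean:
                self.missed = 0
        else:
            self.clean_streak = 0
            self.missed += 1

    @property
    def notifications_paused(self) -> bool:
        """Proposal (3): checks keep running; only notifications pause."""
        return self.missed >= self.disable_after

    def may_notify(self, channel: str, now: float) -> bool:
        """Proposal (2): at most one message per channel per time window."""
        if self.notifications_paused:
            return False
        last = self.last_sent.get(channel)
        if last is not None and now - last < self.rate_limit_s:
            return False
        self.last_sent[channel] = now
        return True


def summarize(results: dict[str, bool], quorum: int = 2) -> tuple[str, str]:
    """Proposal (5): combine all locations into one level and one message."""
    failing = sum(1 for ok in results.values() if not ok)
    total = len(results)
    if failing == 0:
        return "ok", f"Reachable from {total}/{total} location(s)"
    level = "alert" if failing >= quorum else "info"
    return level, f"Unreachable from {failing}/{total} location(s)"


def should_page(message: str, min_failing: int = 2) -> bool:
    """Receiving side: page only when enough locations report a failure."""
    m = re.search(r"Unreachable from (\d+)/(\d+) location", message)
    return bool(m) and int(m.group(1)) >= min_failing
```

With these assumed defaults, a single failing location yields an "info" level message that `should_page` ignores, while two or more failing locations yield an "alert" that would page. The same regex idea could be applied in an OpsGenie or PagerDuty rule, provided the message format is stable.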
Site Monitor Notifications: Auto-Reset, Other Limits
Dennis moved item to board Closed
2 years ago
Dennis moved item to board Under review
2 years ago
Ponyhof LLC moved item to project Site Level Requests
2 years ago
Ponyhof LLC created the item
2 years ago