
Implement alerting when spikes occur in Growth team dashboard in Logstash
Open, Needs Triage, Public

Description

We have a chores process that involves engineers reviewing logs in Logstash on a daily basis. Sometimes we miss a day, though. Or, maybe we look at the dashboard at 22:00 UTC but there is a flood of errors beginning at 01:00 UTC. As a team, we would like to know about these errors before someone else tells us about them.

We should implement email alerts when there is a spike of messages, for example more than 10 (to be determined) per hour. We should first complete T328128: Reduce noise in Growth team's Logstash dashboard, so we have a better handle on expected log volume for our dashboard.

It seems like we'd need the OpenSearch Alerting plugin to implement this. I've asked about that in T293694: Alert RelEng when mw-client-error editing dashboard shows errors at a rate of over 1000 errors in a 12 hr period.

Event Timeline

Is it possible to use the OpenSearch Alerting plugin for this? For context, I have arrived at this task from T328129: Implement alerting when spikes occur in Growth team dashboard in Logstash which proposes to do something similar, but for the Growth team's dashboard.

The alerting plugin is not recommended, but we have a way to feed query results into our alerting infrastructure.

Alerts on logs need a few things:

  1. A query that selects and filters down to what you want to monitor
  2. A defined threshold that, when out of bounds, triggers the alert to fire
  3. Infrastructure that knows when to generate the alert and forward it to the right recipients

For (1), a dashboard gets us oriented in the right direction for developing the needed queries. For (2), we need threshold(s), weighing the pros and cons of codifying them in VCS against how often they need to change. For (3), we use a Prometheus exporter that turns Dashboards queries into metrics, Prometheus or Grafana evaluates those metrics against the threshold(s), and AlertManager routes alerts to recipients.
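
To make (3) a little more concrete, here is a rough sketch (not our actual setup) of what such an exporter could look like, assuming a small Python service built on prometheus_client and the OpenSearch _count API. The endpoint, index pattern, query and metric name below are placeholder values for illustration.

```python
# Illustrative sketch only: expose the hit count of an OpenSearch query
# as a Prometheus gauge. All URLs, index names and the query itself are
# placeholders, not real production configuration.
import time

import requests
from prometheus_client import Gauge, start_http_server

OPENSEARCH_URL = "https://logstash.example.org:9200"  # placeholder endpoint
INDEX_PATTERN = "logstash-*"                          # placeholder index pattern
QUERY = {
    "query": {
        "bool": {
            # Placeholder query standing in for the dashboard's filters.
            "must": [{"match": {"channel": "GrowthExperiments"}}],
            "filter": [{"range": {"@timestamp": {"gte": "now-1h"}}}],
        }
    }
}

growth_log_events = Gauge(
    "growth_dashboard_log_events_hourly",
    "Log events matching the Growth team dashboard query in the last hour",
)


def poll() -> None:
    # The _count API returns {"count": N, ...} for the given query.
    resp = requests.post(
        f"{OPENSEARCH_URL}/{INDEX_PATTERN}/_count", json=QUERY, timeout=10
    )
    resp.raise_for_status()
    growth_log_events.set(resp.json()["count"])


if __name__ == "__main__":
    start_http_server(9500)  # arbitrary example port for Prometheus to scrape
    while True:
        poll()
        time.sleep(60)
```

Prometheus or Grafana would then evaluate an expression like `growth_dashboard_log_events_hourly > 10` against the scraped metric, and AlertManager would route the resulting alert to the team's email.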

> For (1), a dashboard gets us oriented in the right direction for developing the needed queries.

Can you use an actual dashboard, or is it more like the snapshot of a dashboard? Being able to update the dashboard's filters and have the same change automatically apply to the alerts as well would be nice. (We use the filters to ignore irrelevant known bugs. Those are usually not particularly high-volume so the filters might not be that important, but it would be a nice capability.)

> For (2), we need threshold(s), weighing the pros and cons of codifying them in VCS against how often they need to change. For (3), we use a Prometheus exporter that turns Dashboards queries into metrics, Prometheus or Grafana evaluates those metrics against the threshold(s), and AlertManager routes alerts to recipients.

If the process uses Grafana anyway, I wonder if we could just use Grafana's built-in email alerts? The goal of this task is more to give us a heads-up about potential issues than emergency response, so IMO the self-serve configurability via the Grafana UI might be more useful than the robustness of having the alert rules be managed in git.

> Can you use an actual dashboard, or is it more like the snapshot of a dashboard? Being able to update the dashboard's filters and have the same change automatically apply to the alerts as well would be nice. (We use the filters to ignore irrelevant known bugs. Those are usually not particularly high-volume so the filters might not be that important, but it would be a nice capability.)

Currently there is no way to synchronize a set of queries from a dashboard without human intervention. The query that generates the metrics would live in Puppet and require a patch to update.

RelEng has also expressed the desire to tag certain logs as "known issues" and change views based on that property through the logstash interface: T302041: Provide way to enhance Logstash logs with information from the Train Log Triage process.

> If the process uses Grafana anyway, I wonder if we could just use Grafana's built-in email alerts? The goal of this task is more to give us a heads-up about potential issues than emergency response, so IMO the self-serve configurability via the Grafana UI might be more useful than the robustness of having the alert rules be managed in git.

Yes, the ability to adjust a threshold without a Gerrit patch is one advantage of this approach. SMTP is not configured for Grafana, but alerts routed through AlertManager are supported.

Sgs triaged this task as Low priority. Nov 20 2023, 4:17 PM
Michael raised the priority of this task from Low to Needs Triage. Jun 3 2024, 8:47 AM

>> Can you use an actual dashboard, or is it more like the snapshot of a dashboard? Being able to update the dashboard's filters and have the same change automatically apply to the alerts as well would be nice. (We use the filters to ignore irrelevant known bugs. Those are usually not particularly high-volume so the filters might not be that important, but it would be a nice capability.)

> Currently there is no way to synchronize a set of queries from a dashboard without human intervention. The query that generates the metrics would live in Puppet and require a patch to update.

Can we somehow get to a state where we can make progress on this?

As far as I can see, our minimal requirements would be (a rough sketch follows the list):

  • something that runs daily
  • gets the number of events in Logstash for a given query for the last 24 hours
  • pings a URL (statsd) with that number
  • is self-service by growth-team engineers
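
For illustration only, a minimal daily job along those lines might look like the Python sketch below. The OpenSearch endpoint, index pattern, query, statsd address and metric name are invented placeholders, not values from our infrastructure.

```python
# Rough sketch: count the last 24 hours of events matching a Logstash/
# OpenSearch query and report the number to statsd as a gauge.
# Every constant below is a placeholder.
import socket

import requests

OPENSEARCH_URL = "https://logstash.example.org:9200"   # placeholder
INDEX_PATTERN = "logstash-*"                           # placeholder
STATSD_HOST, STATSD_PORT = "statsd.example.org", 8125  # placeholder
QUERY = {
    "query": {
        "bool": {
            # Placeholder query standing in for the dashboard's filters.
            "must": [{"match": {"channel": "GrowthExperiments"}}],
            "filter": [{"range": {"@timestamp": {"gte": "now-24h"}}}],
        }
    }
}


def main() -> None:
    resp = requests.post(
        f"{OPENSEARCH_URL}/{INDEX_PATTERN}/_count", json=QUERY, timeout=30
    )
    resp.raise_for_status()
    count = resp.json()["count"]
    # statsd gauge line format is "<metric>:<value>|g", sent over plain UDP.
    payload = f"growth.logstash.dashboard_errors_24h:{count}|g".encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (STATSD_HOST, STATSD_PORT))


if __name__ == "__main__":
    main()
```

Run from cron (or a systemd timer) once a day, that would cover the "runs daily, counts events, pings statsd" part; whether it can be made self-service for Growth engineers is the open question.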

Our current best workaround would seem to be to create two dashboards:

  • One with all known issues filtered out, so that we are able to see new issues.
  • And one with all known issues showing as well, with a human engineer eyeballing whether they are getting worse.

I think that's not exactly ideal and it would be great if we could do better than that.

(I guess one alternative would be to highly streamline the process for making changes to Puppet for that use case. But that feels very high-friction and permanent compared to the somewhat more ephemeral alerting we actually need.)

@Tgr Is what Michael is suggesting a feasible way to go about the implementation of alerts?