Implement alerting when spikes occur in Growth team dashboard in Logstash
Open, Needs Triage, Public

Assigned To

None

Authored By

	kostajh
	Jan 27 2023, 1:34 PM

Description

We have a chores process that involves engineers reviewing logs in Logstash on a daily basis. Sometimes we miss a day, though. Or maybe we look at the dashboard at 22:00 UTC, but a flood of errors begins at 01:00 UTC. As a team, we would like to know about these errors before someone else tells us about them.

We should implement email alerts when there is a spike of messages, for example more than 10 (to be determined) per hour. We should first complete T328128: Reduce noise in Growth team's Logstash dashboard, so we have a better handle on expected log volume for our dashboard.
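As a rough illustration of the "more than 10 per hour" idea, the sketch below buckets message timestamps by hour and reports the hours that exceed a threshold. The function name `hours_over_threshold` and the sample timestamps are hypothetical, and the threshold of 10 is only the placeholder value from the description, still to be determined.

```python
from collections import Counter
from datetime import datetime, timezone


def hours_over_threshold(timestamps, threshold=10):
    """Return sorted (hour_start, count) pairs for hours with more than `threshold` messages."""
    # Truncate each timestamp to the start of its hour and count per bucket.
    per_hour = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )
    return sorted((hour, n) for hour, n in per_hour.items() if n > threshold)


# Hypothetical sample: 12 messages shortly after 01:00 UTC, one at 22:00 UTC.
events = [datetime(2023, 1, 27, 1, m, tzinfo=timezone.utc) for m in range(12)]
events.append(datetime(2023, 1, 27, 22, 0, tzinfo=timezone.utc))
spikes = hours_over_threshold(events, threshold=10)
```

With this sample, only the 01:00 UTC hour is reported, which is the kind of overnight flood a 22:00 UTC manual check would miss.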

It seems like we'd need the OpenSearch Alerting plugin to implement this. I've asked about that in T293694: Alert RelEng when mw-client-error editing dashboard shows errors at a rate of over 1000 errors in a 12 hr period.

Related Objects

Status	Assigned	Task
Resolved	None	T340453 [Epic] FY 2023-24 Growth Maintenance Work
Open	None	T325082 Eslint: configure browser compatibility
Open	None	T367429 [Epic] FY 2024-25 Growth Maintenance Work
Open	None	T345202 Implement alerting for Growth-consumed or Growth-managed services/pipelines
Open	None	T328129 Implement alerting when spikes occur in Growth team dashboard in Logstash

Event Timeline

kostajh created this task. Jan 27 2023, 1:34 PM

Restricted Application added a subscriber: Aklapper. Jan 27 2023, 1:34 PM

kostajh mentioned this in T293694: Alert RelEng when mw-client-error editing dashboard shows errors at a rate of over 1000 errors in a 12 hr period. Jan 27 2023, 1:37 PM

kostajh updated the task description.

kostajh moved this task from Inbox to Triaged on the Growth-Team board. Jan 27 2023, 2:23 PM

kostajh added a parent task: T323132: [Epic] Q3 FY 2022-23 Growth Maintenance Work.

In T293694#8564483, @kostajh wrote:

Is it possible to use the OpenSearch Alerting plugin for this? For context, I have arrived at this task from T328129: Implement alerting when spikes occur in Growth team dashboard in Logstash which proposes to do something similar, but for the Growth team's dashboard.

The alerting plugin is not recommended, but we have a way to feed query results into our alerting infrastructure.

Alerts on logs need a few things:

1. A query that selects and filters down to what you want to monitor
2. A defined threshold that, when out of bounds, triggers the alert to fire
3. Infrastructure that knows when to generate the alert and forward it to the right recipients

For (1), a dashboard gets us oriented in the right direction for developing the needed queries. For (2), we need threshold(s) and a look at the pros and cons of codifying the thresholds in VCS or the need to change it often. For (3), we use a Prometheus exporter that turns Dashboards queries into metrics, Prometheus or Grafana evaluates those metrics against the threshold(s), and AlertManager routes alerts to recipients.
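To make parts (1) and (2) concrete, here is a minimal sketch of what such a query and threshold check could look like in OpenSearch query DSL, built as a plain Python dict. It is not the exporter's actual configuration: the field names `@timestamp` and `message`, the excluded phrase, and the helper names are all assumptions for illustration. Nothing is sent over the network.

```python
import json


def build_count_query(excluded_phrases, window="now-1h", time_field="@timestamp"):
    """Build a query body that counts recent log events, minus known issues.

    `time_field` and the `message` field are assumed names, not taken from
    the real Logstash index mapping.
    """
    return {
        "query": {
            "bool": {
                # (1) Select only events inside the time window.
                "filter": [{"range": {time_field: {"gte": window}}}],
                # (1) Filter out known, irrelevant issues.
                "must_not": [
                    {"match_phrase": {"message": phrase}}
                    for phrase in excluded_phrases
                ],
            }
        }
    }


def should_alert(count, threshold):
    """(2) Fire only when the observed count is above the threshold."""
    return count > threshold


# Hypothetical known issue to ignore.
body = build_count_query(["Known harmless warning"])
print(json.dumps(body, indent=2))
```

In the setup described above, part (3) would not live in a script like this: the exporter would publish the count as a metric, and Prometheus or Grafana plus AlertManager would do the evaluation and routing.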

In T328129#8565247, @colewhite wrote:

For (1), a dashboard gets us oriented in the right direction for developing the needed queries.

Can you use an actual dashboard, or is it more like the snapshot of a dashboard? Being able to update the dashboard's filters and have the same change automatically apply to the alerts as well would be nice. (We use the filters to ignore irrelevant known bugs. Those are usually not particularly high-volume so the filters might not be that important, but it would be a nice capability.)

For (2), we need threshold(s) and a look at the pros and cons of codifying the thresholds in VCS or the need to change it often. For (3), we use a Prometheus exporter that turns Dashboards queries into metrics, Prometheus or Grafana evaluates those metrics against the threshold(s), and AlertManager routes alerts to recipients.

If the process uses Grafana anyway, I wonder if we could just use Grafana's built-in email alerts? The goal of this task is more about giving us a heads-up about potential issues than emergency response, so IMO the self-serve configurability via the Grafana UI might be more useful than the robustness of having the alert rules be managed in git.

In T328129#8571676, @Tgr wrote:

Can you use an actual dashboard, or is it more like the snapshot of a dashboard? Being able to update the dashboard's filters and have the same change automatically apply to the alerts as well would be nice. (We use the filters to ignore irrelevant known bugs. Those are usually not particularly high-volume so the filters might not be that important, but it would be a nice capability.)

Currently there is no way to synchronize a set of queries from a dashboard without human intervention. The query that generates the metrics would live in Puppet and require a patch to update.

RelEng has also expressed the desire to tag certain logs as "known issues" and change views based on that property through the Logstash interface: T302041: Provide way to enhance Logstash logs with information from the Train Log Triage process.

If the process uses Grafana anyway, I wonder if we could just use Grafana's built-in email alerts? The goal of this task is more about giving us a heads-up about potential issues than emergency response, so IMO the self-serve configurability via the Grafana UI might be more useful than the robustness of having the alert rules be managed in git.

Yes, the ability to adjust a threshold without a Gerrit patch is one advantage of this approach. SMTP is not configured for Grafana, but alerts routed through AlertManager are supported.

DMburugu moved this task from Triaged to Current Maintenance Focus on the Growth-Team board. Jan 31 2023, 3:23 PM

DMburugu edited parent tasks, added: T333335: [Epic] Q4 FY 2022-23 Growth Maintenance Work; removed: T323132: [Epic] Q3 FY 2022-23 Growth Maintenance Work. Mar 28 2023, 2:25 PM

DMburugu edited parent tasks, added: T340455: [Epic] Q1 FY 2023-24 Growth Maintenance Work; removed: T333335: [Epic] Q4 FY 2022-23 Growth Maintenance Work. Jun 26 2023, 5:06 PM

Urbanecm_WMF added a parent task: T345202: Implement alerting for Growth-consumed or Growth-managed services/pipelines. Aug 29 2023, 7:23 PM

Urbanecm_WMF removed a parent task: T340455: [Epic] Q1 FY 2023-24 Growth Maintenance Work.

Sgs triaged this task as Low priority. Nov 20 2023, 4:17 PM

Michael subscribed. May 28 2024, 8:12 AM

In T328129#8571808, @colewhite wrote:

In T328129#8571676, @Tgr wrote:

Can you use an actual dashboard, or is it more like the snapshot of a dashboard? Being able to update the dashboard's filters and have the same change automatically apply to the alerts as well would be nice. (We use the filters to ignore irrelevant known bugs. Those are usually not particularly high-volume so the filters might not be that important, but it would be a nice capability.)

Currently there is no way to synchronize a set of queries from a dashboard without human intervention. The query that generates the metrics would live in Puppet and require a patch to update.

Can we somehow get to a state where we can make progress on this?

As far as I can see it, our minimal requirements would be:

something that runs daily
gets the number of events in Logstash for a given query over the last 24 hours
pings a URL (statsd) with that number
is self-service by growth-team engineers
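The reporting step in the list above could be as small as the following sketch, which formats a daily count as a StatsD gauge line and sends it over UDP. The metric name, host, and port are hypothetical placeholders, and how the count is obtained from Logstash is deliberately left out. The `name:value|g` shape is the standard StatsD gauge format.

```python
import socket


def statsd_gauge_line(metric, value):
    """Format a StatsD gauge datagram: '<metric>:<value>|g'."""
    return f"{metric}:{value}|g"


def send_gauge(metric, value, host="localhost", port=8125):
    """Send one gauge over UDP. Host and port are assumed placeholders."""
    payload = statsd_gauge_line(metric, value).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


# Hypothetical metric name for the 24-hour event count.
line = statsd_gauge_line("growth.logstash.events_24h", 37)
```

A daily scheduled job would call `send_gauge` once with the count for the last 24 hours. Where the alert threshold is then evaluated is still open in this discussion.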

Our current best workaround would seem to be to create two dashboards:

One with all known issues filtered out, so that we are able to see new issues.
And one that also shows all known issues, with an engineer eyeballing whether they are getting worse.

I think that's not exactly ideal and it would be great if we could do better than that.

(I guess one alternative would be to highly streamline the process of making changes to Puppet for that use case. But that feels very high-friction and permanent compared to the somewhat more ephemeral alerting we actually need.)

Michael mentioned this in T367211: Log unactionable errors to statslib/prometheus and set alert instead of using logstash. Jun 11 2024, 5:30 PM

@Tgr Is what Michael is suggesting a feasible way to go about the implementation of alerts?
