There are several discussions about occasions where EventBus fails to produce events to EventGate:
- T120242: Eventually Consistent MediaWiki State Change Events
- T249745: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable"
- T362977: WDQS updater missed some updates
etc.
We have some metrics from the envoy mesh local proxies and from EventGate for when things fail. (See the Errors section of the EventGate Grafana dashboard.)
However, we do not have these metrics from EventBus itself. We do have failure logs in logstash.
The client is in the best position to know when failures happen, especially for 5xx errors.
Now that MediaWiki has Prometheus support (T350592: EPIC: migrate in use metrics and dashboards to statslib), we should instrument EventBus and add metrics around event production and whatever else might be nice to have.
https://www.mediawiki.org/wiki/Manual:Stats has instructions for how to use the MW Stats library to do this.
Doing this will help us quantify when we fail to produce events, which will help us with defining SLOs and documentation for T120242: Eventually Consistent MediaWiki State Change Events.
Done is
- EventBus emits counters for produced and failed events, with informative labels. Labels should probably include
- stream name
- event service name (eventgate name)
- HTTP status code (?)
- timing metrics: send() function timing & HTTP request timing (will do this in a different task)
- maybe $schema if it isn't hard to get? (not really that useful)
- etc.
- Any other easy/useful/relevant EventBus metrics are emitted.
- EventBus metrics are shown in a Grafana dashboard, either in the existing EventGate one, or a new one for EventBus.
- Temporary feature flag logic removed from EventBus
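
To make the proposed label set concrete, here is a minimal sketch of how the counters might be shaped. It is written in plain Python with an in-memory stand-in, not the MediaWiki Stats library, and everything in it is an assumption for illustration: the class LabelledCounter, the function record_send, the metric names, the stream and service names, and the choice to treat any non-2xx status as a failure.

```python
from collections import Counter


class LabelledCounter:
    """Hypothetical in-memory counter keyed by label values (not the MW Stats API)."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self.samples = Counter()

    def increment(self, **labels):
        # Require exactly the declared labels so every series has the same shape.
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected labels {self.label_names}, got {tuple(labels)}")
        key = tuple(str(labels[name]) for name in self.label_names)
        self.samples[key] += 1

    def render(self):
        # Render each series in a Prometheus-like text form.
        lines = []
        for key, value in sorted(self.samples.items()):
            pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{pairs}}} {value}")
        return lines


def record_send(outgoing, failed, stream, event_service, status_code):
    """Count one produce attempt, and count it as failed if the status is not 2xx."""
    outgoing.increment(stream=stream, event_service=event_service)
    if not 200 <= status_code < 300:  # assumption: any non-2xx response is a failure
        failed.increment(
            stream=stream, event_service=event_service, status_code=status_code
        )


# Hypothetical metric names and label sets.
outgoing = LabelledCounter("eventbus_outgoing_events_total", ["stream", "event_service"])
failed = LabelledCounter(
    "eventbus_failed_events_total", ["stream", "event_service", "status_code"]
)

record_send(outgoing, failed, "example.stream", "eventgate-example", 201)
record_send(outgoing, failed, "example.stream", "eventgate-example", 503)

for line in outgoing.render() + failed.render():
    print(line)
```

Keeping the status code only on the failure counter is one possible design choice: the outgoing counter stays low-cardinality, and a failure rate per stream and event service can still be derived by dividing the two.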