The concurrency of RecordLintJob has contributed to the outages of the past couple of weeks (see the parent ticket, T370304). It should be reduced to avoid overwhelming the primary database with too many writes.
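As a rough illustration of capping concurrency per database section, here is a minimal Python sketch. It is not how the MediaWiki job queue or change-propagation works. The `SectionLimiter` class, the wiki-to-section mapping (apart from commonswiki on s4, which the parent ticket names), and the limit are all hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical mapping of wikis to database sections (illustrative only).
WIKI_TO_SECTION = {"commonswiki": "s4", "enwiki": "s1", "dewiki": "s5"}


class SectionLimiter:
    """Caps how many jobs may run against the same section at once."""

    def __init__(self, max_per_section):
        self._max = max_per_section
        self._lock = threading.Lock()
        self._semaphores = {}

    def _semaphore_for(self, section):
        # Lazily create one bounded semaphore per section.
        with self._lock:
            if section not in self._semaphores:
                self._semaphores[section] = threading.BoundedSemaphore(self._max)
            return self._semaphores[section]

    def run(self, wiki, job):
        # Wikis missing from the mapping share a single "default" bucket.
        section = WIKI_TO_SECTION.get(wiki, "default")
        with self._semaphore_for(section):
            return job()


if __name__ == "__main__":
    limiter = SectionLimiter(max_per_section=2)
    with ThreadPoolExecutor(max_workers=8) as pool:
        # At most two of these run at once, because they all target s4.
        list(pool.map(lambda _: limiter.run("commonswiki", lambda: None), range(16)))
```

With a limit like this, a burst of jobs for one busy section waits its turn without blocking jobs for the other sections.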
Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | Security | Ladsgroup | T370304 Bursts of occasional severe contention on s4 (commonswiki) primary mariadb causing recurrent user-facing outages on all wikis
Open | | None | T370624 Reduce concurrency of RecordLintJobs or shard it per section
Event Timeline
Should this be done in the job queue? Or is there something we can do inside RecordLintJob? Is there an example of other jobs that are sharded by section?
OK, I found https://gerrit.wikimedia.org/g/mediawiki/services/change-propagation/jobqueue-deploy/+/05420ad000caa34a9351de4774d0196a860ca869/scap/vars.yaml#88, and I think this is a bit past what I feel comfortable doing, so I'll leave it for someone else.
I'll note that T330036#9791309 will also address it by moving the updates into refreshLinks rather than having a separate job.
This seems harder to do given the joins we're already doing in queries. I don't want to make it more difficult for editors to get access to the linter data :/