Wikitech issues for datacentre switchover (March 2023)
Closed, Resolved, Public

Description

Wikitech databases have been migrated from m5 to s6 (T167973#7359504), but Wikitech itself is still hosted only on cloudweb* (formerly labweb*) servers, in eqiad only.

12:52 <taavi> claime: I just realized that we moved wikitech databases from m5 to s6 since the last switchover, but it's still hosted on labweb* servers in eqiad only. although the multi-dc work means that it should Just Work, but you might want to double-check that beforehand

Event Timeline

Clement_Goubert created this task.
Clement_Goubert moved this task from Incoming 🐫 to this.quarter 🍕 on the serviceops board.
jcrespo subscribed.

CC DBAs and cloud: this worries me. While labswiki won't have a lot of queries, there is no way to migrate the user to the other datacenter, as in the other sections. This could have performance and security implications or, in the worst-case scenario, wikitech will become read-only?

Unfortunately, we can't do anything with the DB. So we might have cross-DC queries unless cloud-services-team can set up a labweb host in codfw before the DC switch.

Marostegui removed a project: DBA.

Leaving Data-Persistence tag instead of DBA. We can support cloud-services-team as much as needed, but we can't really do anything DB wise.

👍 Note that I added both teams for awareness, not expecting any concrete action item.

jijiki renamed this task from Check wikitech switchover from labweb eqiad to Wikitech issues for datacentre switchover (March 2023). Feb 7 2023, 10:25 AM
jijiki updated the task description.

We've discussed this internally within the team. We realize that it's not possible to exclude wikitech from the switchover. The reason is that the master database for wikitech (labswiki, in s6) will be read-only in eqiad and read-write in codfw. That's how the switchover process works, and given that read-only is a system-level variable of MySQL/MariaDB, we wouldn't want to alter it.

Possible paths forward:

  1. We postpone the Switchover
  2. We could push forward with putting wikitech in mw-on-k8s.
  3. Move labswiki back to m5
  4. cloudweb/labweb hosts get set up for Wikitech in codfw prior to the switchover
  5. Wikitech runs in eqiad, using codfw database(s) for the duration of the switchover (8 weeks)

If anyone has any other ideas, please share.

Some considerations we already had as a team regarding the paths outlined above:

  • Paths 1 and 2 are a no-go for our team. We've already substantially delayed a switchover due to team capacity issues, and delaying it even more risks the robustness of a process that it is important to have. Moving wikitech to Kubernetes is also a no-go, simply because we don't have the capacity to do it in the next 3 weeks given all the other work for the Switchover plus annual planning.
  • Path 3 is a regression from T167973, into which a lot of work has gone. That task was also one more step in the overall direction we want to go regarding Wikitech; such a rollback would set us back on that goal. Furthermore, it would require substantial work from the teams involved in that task in a short time frame, needing extensive reprioritization.
  • Path 4 requires hardware, as well as work from the team responsible for cloudweb/labweb, again requiring reprioritization in a short time frame.
  • Path 5 introduces a performance regression, albeit, in our understanding, a limited one. The cross-DC latency is currently 40ms, and it is bound to cause performance issues for the wikitech MediaWiki application talking to codfw. However, while database write requests (SQL updates/inserts/deletes) will flow to codfw, where the master database is, queries to the replicas (SQL selects) will stay in the eqiad DC. This greatly mitigates the performance regression for almost all use cases, with the exception of edits.
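
For a rough sense of scale on path 5, here is a minimal back-of-the-envelope sketch in Python. Only the 40ms cross-DC figure comes from this task; the function name, the 0.5ms local round trip and the query counts are illustrative assumptions, and the model assumes one sequential round trip per query while ignoring connection setup and TLS handshakes.

```python
# Back-of-the-envelope model of database time per request under path 5.
# Only CROSS_DC_RTT_MS comes from the task; everything else is an assumption.

CROSS_DC_RTT_MS = 40.0  # eqiad <-> codfw latency quoted in this task
LOCAL_RTT_MS = 0.5      # hypothetical same-DC round trip, for illustration only


def request_db_latency_ms(reads, writes,
                          local_rtt_ms=LOCAL_RTT_MS,
                          cross_dc_rtt_ms=CROSS_DC_RTT_MS):
    """Estimate time spent on database round trips for one request.

    Reads (SELECTs) go to eqiad replicas, so they pay the local round trip.
    Writes (UPDATE/INSERT/DELETE) go to the codfw master, so they pay the
    cross-DC round trip. Queries are assumed to run one after another.
    """
    return reads * local_rtt_ms + writes * cross_dc_rtt_ms


# Hypothetical page view: 20 reads, no writes.
page_view_ms = request_db_latency_ms(reads=20, writes=0)

# Hypothetical edit: 20 reads plus 5 writes to the codfw master.
edit_ms = request_db_latency_ms(reads=20, writes=5)

print(f"page view: {page_view_ms} ms, edit: {edit_ms} ms")
```

Under these assumed numbers, page views are unaffected and only requests that write pay the cross-DC cost, which matches the reasoning in the last bullet.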

Proposed plan

We propose going with path 5. Given the limited time we have before the Switchover (3 weeks) and the fact that MultiDC greatly mitigates the effects of the performance regression, we feel it's the least bad of all the options.

@nskaggs, @bd808 (feel free to add others), let me know what you think.

Option 5 sounds like a solution that requires limited time from everyone to accomplish, which makes me a fan of it. I also doubt that the edit rate on wikitech will rise to a level where the performance of writes over the cross-DC link is likely to be noticed by many.

Of course my ideal solution is getting T237773: Move Wikitech onto the production MW cluster done so that nobody has to worry about the snowflake wikitech deployment again, but that is an unreasonable blocker to place on T327920: March 2023 Datacenter Switchover.

Apologies if I am wrong, which I probably am, but...

Wikitech read requests will flow to eqiad, and write requests (the results of HTTP POSTs) will flow to codfw

How is this possible if there are no codfw app servers serving wikitech? As I understand it (and hopefully I am the one misunderstanding, please correct me in that case), it is at the query level where that can/will happen, leading to...

Path 5 introduces a performance regression, albeit, in our understanding, a limited one.

Not too worried about the latency, but what about having plain-text communication (including passwords) exposed on the internet? Again, please correct me if I am misunderstanding the issue (cloudweb1* app servers, formerly labweb1*, will connect to db2* codfw databases?). Plain-text queries are usually a no-go for cross-DC traffic.

The server will continue being cloudweb*. I meant MySQL update/insert/delete/etc. traffic in that comment. Now that I read it again, I could have been clearer. I'll amend.

May I ask you to please check with platform whether primary traffic is encrypted automagically? I know remote traffic was TLS-encrypted through some patches when multi-DC was set up, but it may require some change (even if trivial) on a server that is not explicitly remote. That is the only blocker I see, and it could be solved by imposing the same restriction on cloudweb1* (I don't see performance as worrying).

We already synced up on IRC, but for posterity's sake, see P43759#178003. The TL;DR is that cross-DC traffic will be encrypted with TLS 1.2.

No blocker on my side, then. Supporting path 5 (security worried me more than performance).

Option #5 sounds good. We'd need to do a switchover for that master whenever we reach the row A eqiad switch maintenance, though (I will check and coordinate with the team).

Sounds like there's a plan in place here. Thank you! I did also want to add my support for T237773: Move Wikitech onto the production MW cluster to avoid this type of pain moving forward.

Option 5 sounds good to me too. I think we can also reuse this solution for toolhub (T329319).