User Details
- User Since
- Jan 5 2016, 9:54 PM (454 w, 2 d)
- Availability
- Available
- LDAP User
- Unknown
- MediaWiki User
- LToscano (WMF)
Yesterday
Last step remaining is to decommission the old VMs!
This time the issue is with the sign step, since a certificate is already there. I verified with manual commands that gencert works fine.
The move was done and everything seems to work as expected!
puppetserver1001 is also working with the new settings; it was rebooted today after thrashing.
Wed, Sep 18
All poolcounter IPs for MediaWiki/Thumbor are now on Bookworm!
puppetserver1002 is now running with 35 JRuby workers instead of 48; let's see how it goes at steady state. If everything looks good, we can roll out the change to the rest of the cluster.
I tried to generate a heap dump with jmap but it is very large and I'd need to copy it to my local laptop to inspect it via VisualVM. There is also an option in jmap to generate a live breakdown in plaintext, but it is full of jruby objects (as expected).
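The two jmap modes mentioned above can be sketched as follows. This is a minimal illustration, not the exact commands that were run: the helper names, the PID and the output path are hypothetical, and the "live breakdown in plaintext" is assumed to be jmap's class histogram (`-histo:live`). The functions only build the argument lists, so nothing is executed against a JVM.

```python
import shlex


def jmap_heap_dump_cmd(pid, path):
    """Build the argv for a binary heap dump of live objects (can be very large)."""
    return ["jmap", f"-dump:live,format=b,file={path}", str(pid)]


def jmap_histogram_cmd(pid):
    """Build the argv for a plaintext per-class histogram of live objects."""
    return ["jmap", "-histo:live", str(pid)]


# Hypothetical PID and path, for illustration only.
print(shlex.join(jmap_heap_dump_cmd(12345, "/tmp/puppetserver.hprof")))
print(shlex.join(jmap_histogram_cmd(12345)))
```

The histogram avoids copying a multi-gigabyte file off the host, at the cost of showing only per-class counts and sizes rather than the reference graph that VisualVM can explore.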
I had a chat with Filippo, the keyholder-proxy is not the daemon that needs re-arming when restarted, so it can be done anytime without extra manual steps.
Tue, Sep 17
registry2005 is now running Bookworm, up and running:
VM is up and running :)
On the infrastructure side we now have:
sudo cookbook sre.ganeti.makevm --os bookworm --network private -p 7 --cluster codfw --group B --memory 6 --vcpus 2 --disk 20 registry2005
Mon, Sep 16
Probably not needed anymore :)
Cross-posting from T365167#10148384, where I am testing a reimage for sretest2001.
I checked for RSC in the dump that I made from Redfish, and I see the following:
Thumbor has been migrated to the new poolcounter VMs, and the MW network policies support the new VM's IPs.
The reimage of 2001 went fine, I just repooled it. Let's wait for a day before moving to 1001 so if anything weird comes up, we'll have a quick way to fix (depool 2001).
@Jhancock.wm you are totally right, thanks a lot! I was able to force PXE on a 10G port by setting the first RSC-W-66G4 option to Legacy. I hope to find an option in Redfish to enable it from the provision cookbook.
Due to https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1035854, the VM's RAM was bumped to 2G.
Fri, Sep 13
All new VMs created!
+-------+-------+-----------+----------+-----------+---------+-----------+
| Group | Nodes | Instances | MFree    | MFree avg | DFree   | DFree avg |
+-------+-------+-----------+----------+-----------+---------+-----------+
| A     | 8     | 30        | 331.2GiB | 41.4GiB   | 16.9TiB | 2.1TiB    |
| B     | 7     | 33        | 242.3GiB | 34.6GiB   | 12.1TiB | 1.7TiB    |
| C     | 8     | 30        | 327.5GiB | 40.9GiB   | 15.9TiB | 2.0TiB    |
| D     | 6     | 32        | 207.3GiB | 34.6GiB   | 10.9TiB | 1.8TiB    |
+-------+-------+-----------+----------+-----------+---------+-----------+
Thu, Sep 12
Nasty issue found for sretest2001: T365167#10140713
Something not really great: on sretest2001 one of the 10G interfaces has a link up, that I can confirm via BIOS, but not via Redfish.
Using this task to create another VM, poolcounter2006.
Moved thumbor codfw to poolcounter2005, everything worked nicely.
Updated https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1071553 and tested it; it seems to work. I kicked off a reimage of sretest2001, and I ended up with:
It happened again, this time to puppetserver1001. Amir was in the middle of a puppet-merge and it got stuck. OOM killer acting on the puppetserver's JVM :(
Thanks! I created a diff from the settings dumped before your fix(es) and after, from the Redfish point of view.
Wed, Sep 11
@klausman I was reviewing the status of the migration with Janis, and I think that some steps are missing. Please check https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/PSP_replacement
Filed a PR for the upstream jaeger chart: https://github.com/jaegertracing/helm-charts/pull/600
AUX migrated to PSS!
The poolcounter2005 host is up with Bookworm, and as far as I can see it is working fine.
@MoritzMuehlenhoff I'd proceed with the creation of poolcounter2005 in row A if you are ok, using sre.ganeti.makevm.
+-------+-------+-----------+----------+-----------+---------+-----------+
| Group | Nodes | Instances | MFree    | MFree avg | DFree   | DFree avg |
+-------+-------+-----------+----------+-----------+---------+-----------+
| A     | 6     | 21        | 265.9GiB | 44.3GiB   | 13.4TiB | 2.2TiB    |
| B     | 6     | 22        | 250.8GiB | 41.8GiB   | 13.3TiB | 2.2TiB    |
| C     | 6     | 23        | 247.8GiB | 41.3GiB   | 10.7TiB | 1.8TiB    |
| D     | 6     | 24        | 256.7GiB | 42.8GiB   | 11.7TiB | 2.0TiB    |
+-------+-------+-----------+----------+-----------+---------+-----------+
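As a rough illustration of how a capacity table like the one above can be read programmatically, here is a minimal sketch that parses the rows, recomputes "MFree avg" as MFree divided by Nodes, and picks the group with the most free memory per node. The function names are hypothetical and the table text is simply copied from the output above; this is not part of any cookbook.

```python
# Table text copied from the output above; everything else is illustrative.
TABLE = """\
+-------+-------+-----------+----------+-----------+---------+-----------+
| Group | Nodes | Instances | MFree    | MFree avg | DFree   | DFree avg |
+-------+-------+-----------+----------+-----------+---------+-----------+
| A     | 6     | 21        | 265.9GiB | 44.3GiB   | 13.4TiB | 2.2TiB    |
| B     | 6     | 22        | 250.8GiB | 41.8GiB   | 13.3TiB | 2.2TiB    |
| C     | 6     | 23        | 247.8GiB | 41.3GiB   | 10.7TiB | 1.8TiB    |
| D     | 6     | 24        | 256.7GiB | 42.8GiB   | 11.7TiB | 2.0TiB    |
+-------+-------+-----------+----------+-----------+---------+-----------+
"""


def parse_table(text):
    """Turn the ASCII table into a list of dicts keyed by column header."""
    header, rows = None, []
    for line in text.splitlines():
        if not line.startswith("|"):
            continue  # skip the +----+ border lines
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def gib(value):
    """Convert a string such as '265.9GiB' to a float number of GiB."""
    return float(value.replace("GiB", ""))


def mfree_avg(row):
    """Free memory per node, rounded to one decimal like the table."""
    return round(gib(row["MFree"]) / int(row["Nodes"]), 1)


rows = parse_table(TABLE)
best = max(rows, key=mfree_avg)
print(best["Group"], mfree_avg(best))
```

For these numbers the recomputed averages match the "MFree avg" column, and group A has the most free memory per node.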
Found a violation:
Both nodes on Bookworm!
Tue, Sep 10
elukey@config-master1001:~$ curl https://puppetserver1001.eqiad.wmnet/puppet-sha1.txt
68278f7164f8b827af56282c0ac8664010886b8d
@Jhancock.wm @Papaul Hi! If you have time I have another strange thing to figure out.
Mon, Sep 9
I'd also add the knative images:
Better procedure after chatting with Moritz:
Hey folks, as far as I can tell both poolcounter (Debian upstream) and poolcounter-prometheus-exporter (bookworm-wikimedia) are already good to go, so could we attempt a reimage of one of the nodes? I can give it a try.
First node reimaged! Everything looks good afaics.
Tried to file a patch but I realized that we don't have the helm package for Bookworm/Bullseye, so the build fails. I am wondering if the current version of chartmuseum requires Helm 2 or if we could use Helm 3, but maybe we need to upgrade.
Fri, Sep 6
Great news, the first version of the Supermicro support in provision is live on cumin nodes (namely the cookbook now supports it).
My bad, it was because my factory reset for some reason didn't restore the ADMIN password to its original state. Thanks for the follow up!
Next and last step - wait for the new conftool release, and then close!
Tried to update Wikitech and https://wikitech.wikimedia.org/wiki/Puppet#Private_puppet, the documentation should be relatively good now.
I've released spicerack 8.13.0, which collects the latest changes for the redfish module, and installed it on cumin2002. The cookbook seems ready to go (https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/10378060) but I'd like to test it on sretest2001 first. I have factory-reset it, but now I think it is missing the Redfish license, so I need to wait for DCops to redeploy it.
Next steps:
@Jhancock.wm Hi! I tried to factory reset sretest2001's BMC, and now I am getting some errors when using the Redfish API (unauthorized, etc.). I am wondering if the factory reset deleted the license to use Redfish too. If so, could you please re-add it? Thanks in advance!
Deployed :)
Next steps:
- reimage codfw outside the deployment window
- let it bake for some days
- do the same for eqiad
FYI, I have been taking care of deploying new versions of Proton. A new announcement went out yesterday and I filed https://gerrit.wikimedia.org/r/c/mediawiki/services/chromium-render/+/1071133.
root@deploy1003:~# kube-env admin aux-k8s-eqiad
Thu, Sep 5
Important bit after a discussion with Riccardo: the debmonitor DB is already replicated (eqiad -> codfw at the moment) since it is hosted on M2-Master, and the replication/backup is handled by Data Persistence. They also handle which DC is "active", and this is transparent to us since we resolve the m2-master DNS record (which points to whichever M2 master is currently active).
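The DNS indirection described above can be illustrated with a small sketch. Everything here is an assumption made for illustration: the record name is a placeholder (the comment only says "the m2-master DNS record"), the IPs come from the documentation range, and the fake resolver stands in for a real switchover performed by Data Persistence.

```python
import socket

# Placeholder name: the real FQDN is not given in the comment above.
DB_RECORD = "m2-master.example.invalid"


def resolve_db_host(record=DB_RECORD, resolver=socket.gethostbyname):
    """Resolve the service record at connection time instead of pinning a host."""
    return resolver(record)


class FakeDNS:
    """Stand-in resolver that returns whichever IP is currently 'active'."""

    def __init__(self, active_ip):
        self.active_ip = active_ip

    def __call__(self, name):
        return self.active_ip


dns = FakeDNS("192.0.2.10")        # pretend the eqiad master is active
before = resolve_db_host(resolver=dns)
dns.active_ip = "192.0.2.20"       # pretend the record now points at codfw
after = resolve_db_host(resolver=dns)
print(before, after)
```

The client code is identical before and after the switch, since only the record's target changes.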
Wed, Sep 4
We have now https://gitlab.wikimedia.org/repos/sre/python-release that basically documents how to release Spicerack and other similar projects. Please always ping the SRE Infra Foundations team before doing anything :)