
VMs on cloudvirt1015 crashing - bad mainboard/memory
Closed, Resolved (Public)

Description

I put cloudvirt1015 into service on Monday the 8th. Yesterday (the 11th) tools-prometheus-01 crashed with a kernel panic. On Friday 12th, tools-worker-1023 also crashed.
On Saturday 13th, @Zppix's puppet-lta.lta-tracker.eqiad.wmflabs crashed.

We've replaced lots of parts in this box, to no avail:

Event Timeline


@Cmjohnson - are those errors for DIMM A3 enough info to get Dell to RMA a part to us? If not, let me know, and I'll bring it up during my next sync up meeting with them.

Thanks,
Willy

This should be enough. I'm going to run the memtest, and also put in a self dispatch for new memory to (hopefully) arrive this week.

Mentioned in SAL (#wikimedia-operations) [2019-07-24T13:25:39Z] <robh> rebooting cloudvirt1015 into memtest for dell support repair via T220853

Ok, this failed with another memory error in the SEL for dimm A3 (the one in question this entire time). I've entered self dispatch SR995043467 with Dell to get a new dimm dispatched.

It should arrive on Thursday or Friday and I can swap it out.

Mentioned in SAL (#wikimedia-operations) [2019-07-24T13:49:08Z] <robh> rebooting cloudvirt1015 into OS, memory error confirmed. new memory replacement dispatch entered via T220853

Dear Rob Halsell,

Your dispatch shipped on 7/24/2019 7:50 PM

What's Next?

If you need to make any changes to the dispatch contact information, please visit our Support Center or Click Here to chat with a live support representative.
For expedited service to our premium tech agents please use Express Service Code when calling Dell. The Express Service Code is located under your Portables or on the back of desktop.
You may also check for updates via our Online Status page.

Please see below for important information.

Dispatch Number: 713921885
Work Order Number: SR995043467
Waybill Number: 109793257685
Service Tag: 31R9KH2
PO/Reference: T220853

Parts are due to arrive Thursday; EQ inbound shipment ticket: 1-191287024247

Mentioned in SAL (#wikimedia-operations) [2019-07-25T13:35:24Z] <robh> cloudvirt1015 offline for ram swap via T220853

copying the SEL to this task before I erase it

Record:      1
Date/Time:   07/24/2019 13:23:07
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   07/24/2019 13:32:47
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_A3.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   07/24/2019 13:33:15
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_A3.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   07/25/2019 13:47:03
Source:      system
Severity:    Critical
Description: The chassis is open while the power is off.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   07/25/2019 13:47:08
Source:      system
Severity:    Ok
Description: The chassis is closed while the power is off.
-------------------------------------------------------------------------------
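For anyone who wants to triage a SEL dump like the one above without reading it by eye, here is a minimal sketch. It assumes the `Key: value` layout and dashed separators shown in this paste; the function names `parse_sel` and `flagged` and the `SAMPLE` string are made up for illustration and are not part of racadm or any Dell tooling.

```python
import re

# First three records copied from the SEL paste above.
SAMPLE = """\
Record:      1
Date/Time:   07/24/2019 13:23:07
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   07/24/2019 13:32:47
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_A3.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   07/24/2019 13:33:15
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_A3.
-------------------------------------------------------------------------------
"""


def parse_sel(text):
    """Split SEL text into a list of dicts, one per record (hypothetical helper)."""
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if line and set(line) == {"-"}:
            # A line made only of dashes closes the current record.
            if current:
                records.append(current)
                current = {}
            continue
        match = re.match(r"^([^:]+):\s+(.*)$", line)
        if match:
            current[match.group(1)] = match.group(2)
    if current:
        records.append(current)
    return records


def flagged(records):
    """Keep only records whose severity is something other than 'Ok'."""
    return [r for r in records if r.get("Severity") != "Ok"]


records = parse_sel(SAMPLE)
for rec in flagged(records):
    print(rec["Date/Time"], rec["Severity"], rec["Description"])
```

On the sample this prints the two DIMM_A3 entries and skips the "Log cleared" record.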

@Andrew:

We've swapped out the failed memory dimm on this system and the new one hasn't reported any errors (as of yet.)

Can you return this to service (perhaps with those test vms you mention) and see if any other issues crop up?

Please note: We will either resolve this task soon, or remove the ops-eqiad tag. Either way, we want to clear it off our workboard for onsite tasks. If you want to keep this task open for your own reference, please just remove ops-eqiad.

Thanks @RobH. I'll spin up some stress testing VMs on that host and let them run until Andrew gets back from vacation next week.

Mentioned in SAL (#wikimedia-cloud) [2019-07-25T14:06:58Z] <jeh> create 4 testing VMs on cloudvirt1015 T220853

Created these VMs

openstack server list --project testlabs --long -c ID -c Name -c Host| grep cv1015
| 30f17a94-252e-46d2-aa28-e6f24c9c457e | cv1015-testing03                  | cloudvirt1015   |
| d1b13075-ace4-44ba-8f26-c9c12a360184 | cv1015-testing02                  | cloudvirt1015   |
| b99a2376-1bb1-48f9-9889-00d3aedb9a43 | cv1015-testing01                  | cloudvirt1015   |
| e65ff310-f0ef-451c-956c-8d21b21cc12a | cv1015-testing04                  | cloudvirt1015   |
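If the list of test VMs needs to be fed into another script, the table rows above can be turned into records with a few lines of Python. This is only a sketch: it assumes the three-column `ID | Name | Host` layout produced by the command above, and `parse_rows` is a hypothetical helper, not part of the OpenStack client.

```python
# Rows copied from the output above.
TABLE = """\
| 30f17a94-252e-46d2-aa28-e6f24c9c457e | cv1015-testing03                  | cloudvirt1015   |
| d1b13075-ace4-44ba-8f26-c9c12a360184 | cv1015-testing02                  | cloudvirt1015   |
| b99a2376-1bb1-48f9-9889-00d3aedb9a43 | cv1015-testing01                  | cloudvirt1015   |
| e65ff310-f0ef-451c-956c-8d21b21cc12a | cv1015-testing04                  | cloudvirt1015   |
"""


def parse_rows(table_text):
    """Turn '| id | name | host |' rows into dicts (hypothetical helper)."""
    rows = []
    for line in table_text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip anything that is not a complete three-column row.
        if len(cells) == 3 and all(cells):
            rows.append({"id": cells[0], "name": cells[1], "host": cells[2]})
    return rows


vms = parse_rows(TABLE)
for vm in sorted(vms, key=lambda v: v["name"]):
    print(vm["name"], vm["id"])
```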

Each VM has stress-ng running with the command

sudo screen -d -m /usr/bin/stress-ng --timeout 600 --fork 4 --cpu 4 --vm 30 --vm-bytes 1G --vm-method all --verify

Mentioned in SAL (#wikimedia-cloud) [2019-07-25T14:49:50Z] <jeh> running cpu and ram stress tests on cloudvirt1015 T220853

@Andrew Resolving this task (again). If the same issue returns, please reopen. If it's a different issue, please create a new task.

Andrew reassigned this task from Andrew to wiki_willy.

I put this system under a realistic load today (running ~80 VMs) and it crashed after not all that long. I had to reboot in order to get access. I don't see anything in the syslog that presaged a crash...

Jul 27 07:57:36 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:57:36.364 2075 INFO nova.compute.resource_tracker [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] Final resource view: name=cloudvirt1015.eqiad.wmnet phys_ram=515916MB used_ram=184832MB phys_disk=5864GB used_disk=1800GB total_vcpus=72 used_vcpus=90 pci_stats=[]
Jul 27 07:57:36 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:57:36.376 2075 WARNING nova.rpc [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] compute.metrics.update is not a versioned notification and not whitelisted. See ./doc/source/notification.rst
Jul 27 07:57:36 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:57:36.434 2075 INFO nova.compute.resource_tracker [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] Compute_service record updated for cloudvirt1015:cloudvirt1015.eqiad.wmnet
Jul 27 07:58:01 cloudvirt1015 CRON[41276]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jul 27 07:58:20 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:58:20.504 2075 INFO nova.compute.resource_tracker [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] Auditing locally available compute resources for node cloudvirt1015.eqiad.wmnet
Jul 27 07:58:36 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:58:36.264 2075 INFO nova.compute.resource_tracker [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] Total usable vcpus: 72, total allocated vcpus: 90
Jul 27 07:58:36 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:58:36.265 2075 INFO nova.compute.resource_tracker [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] Final resource view: name=cloudvirt1015.eqiad.wmnet phys_ram=515916MB used_ram=184832MB phys_disk=5864GB used_disk=1800GB total_vcpus=72 used_vcpus=90 pci_stats=[]
Jul 27 07:58:36 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:58:36.277 2075 WARNING nova.rpc [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] compute.metrics.update is not a versioned notification and not whitelisted. See ./doc/source/notification.rst
Jul 27 07:58:36 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:58:36.343 2075 INFO nova.compute.resource_tracker [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] Compute_service record updated for cloudvirt1015:cloudvirt1015.eqiad.wmnet
Jul 27 07:59:01 cloudvirt1015 CRON[42583]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jul 27 07:59:20 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:59:20.501 2075 INFO nova.compute.resource_tracker [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] Auditing locally available compute resources for node cloudvirt1015.eqiad.wmnet
Jul 27 07:59:35 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:59:35.670 2075 INFO nova.compute.resource_tracker [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] Total usable vcpus: 72, total allocated vcpus: 90
Jul 27 07:59:35 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:59:35.671 2075 INFO nova.compute.resource_tracker [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] Final resource view: name=cloudvirt1015.eqiad.wmnet phys_ram=515916MB used_ram=184832MB phys_disk=5864GB used_disk=1800GB total_vcpus=72 used_vcpus=90 pci_stats=[]
Jul 27 07:59:35 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:59:35.681 2075 WARNING nova.rpc [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] compute.metrics.update is not a versioned notification and not whitelisted. See ./doc/source/notification.rst
Jul 27 07:59:35 cloudvirt1015 nova-compute[2075]: 2019-07-27 07:59:35.742 2075 INFO nova.compute.resource_tracker [req-0246ab3b-ff8d-4a3d-b939-bdf71349a05f - - - - -] Compute_service record updated for cloudvirt1015:cloudvirt1015.eqiad.wmnet
Jul 27 08:00:01 cloudvirt1015 CRON[43916]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Jul 27 22:21:31 cloudvirt1015 systemd-modules-load[1412]: Inserted module 'br_netfilter'
Jul 27 22:21:31 cloudvirt1015 systemd-modules-load[1412]: Inserted module 'ipmi_devintf'
Jul 27 22:21:31 cloudvirt1015 systemd-modules-load[1412]: Inserted module 'nbd'
Jul 27 22:21:31 cloudvirt1015 systemd-modules-load[1412]: Inserted module 'iscsi_tcp'
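Since nothing in the log announces the crash, the only visible trace is the jump in timestamps between the last cron entry and the first boot message. A rough sketch for locating that window in a longer syslog follows; it assumes the classic `Mon DD HH:MM:SS` syslog prefix (which carries no year, so the year is passed in), and `largest_gap` is a hypothetical helper written for this note.

```python
from datetime import datetime

# Two lines from the excerpt above, on either side of the silent period.
SAMPLE_LINES = [
    "Jul 27 08:00:01 cloudvirt1015 CRON[43916]: (prometheus) CMD (...)",
    "Jul 27 22:21:31 cloudvirt1015 systemd-modules-load[1412]: Inserted module 'br_netfilter'",
]


def largest_gap(lines, year=2019):
    """Return (gap, before, after) for the biggest jump between consecutive timestamps."""
    stamps = []
    for line in lines:
        try:
            # The first 15 characters hold the syslog timestamp.
            stamps.append(datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S"))
        except ValueError:
            continue  # not a timestamped line
    best = None
    for prev, cur in zip(stamps, stamps[1:]):
        gap = cur - prev
        if best is None or gap > best[0]:
            best = (gap, prev, cur)
    return best


gap, before, after = largest_gap(SAMPLE_LINES)
print(f"silent for {gap} between {before} and {after}")
```

The gap only bounds the crash time: it covers both the time the host was down and the time before someone rebooted it.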

I haven't dug up much else on account of it being Saturday :)

I don't see any errors in the System Event Log:

/admin1-> racadm getsel
Record:      1
Date/Time:   07/25/2019 13:49:05
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------

It just has the entry of me clearing it of the last error after the replacement memory. I don't really see any kind of errors in the above comment either, so I'm going to reboot it (it appears locked up at this time) into the dell hardware test suite.

ePSA Pre-boot System Assessment is now running, will update task with results

Assigning to @RobH for results from ePSA pre-boot system assessment, before determining the next steps.

I neglected to update this, but it passed all Dell ePSA tests without a crash.

If all we have is the log from T220853#5371114, then it really isn't much to go on. I suppose we can insist to our Dell team they send us a new mainboard since we've tried everything else.

Moving back to @Cmjohnson - can you try getting Dell to RMA you a motherboard? If they give you push back, let me know and I can try escalating with our account manager.

Thanks,
Willy

Submitted the ticket with Dell. We will see what happens.

You have successfully submitted request SR996138617.

Dell approved my ticket. I talked to the technician today and he will be out Monday morning to replace the motherboard.

Thanks Chris, hopefully this will solve things.

Did the technician replace the mainboard?

Board arrived DOA...need another one

The haunting extends to replacement parts too. Maybe we need to consult an exorcist. ;)

Motherboard replaced; set up iDRAC and password.

Finished the iDRAC setup. On-site work is complete.

I'll see if I can make it crash again!

Mentioned in SAL (#wikimedia-operations) [2019-09-04T14:51:57Z] <andrewbogott> reimaging cloudvirt1015 for T220853

btw, @Cmjohnson, did you restore BIOS settings after replacing the board?

(I just now enabled virtualization in the bios)

Andrew reassigned this task from Andrew to wiki_willy.

I can still make this crash -- my process is scheduling 80 VMs on the host, and then getting them all busy, like this:

andrew@labpuppetmaster1001:~$ sudo cumin --force --timeout 500 -o json  "name:stresstest1015" "/usr/bin/stress-ng --fork 4 --cpu 1 --vm 30 --vm-bytes 1G --vm-method all --verify"

I was tailing the syslog during the last crash; it looks like this:

https://phabricator.wikimedia.org/P9042

Meanwhile, the console is very busy (even after the system became unreachable):

https://phabricator.wikimedia.org/P9043

Hi @Andrew - I mentioned the ongoing issues with this machine to our Dell account rep last week, since we've basically replaced every CPU/DIMM/MB on this box. They mentioned we could install Live Optics to evaluate load, but I'm not sure this is something we want to run on our hardware. Do you have another cloudvirt machine up and running right now on the same hardware specs? Essentially running at the same CPU usage...mainly so we can compare and try to isolate any other type of config differences between them.

Thanks,
Willy

@wiki_willy, the parent task of this task is the procurement for four identical systems: cloudvirt1015, 1016, 1017, 1018. 1018 has had some problems as well, but I don't see a lot of issues for 1016 or 1017 in phab history.

Thanks @Andrew - I'll reach out to our Account Rep, to see if something else can be done.

Emailed our Dell account rep, who responded that they will look into what our options are and get back to us. Thanks, Willy

Here's the response I got from Dell (pasted below). @Cmjohnson or @Jclark-ctr : can one of you guys call Dell at 1-800-456-3355, explain to them the numerous parts we've already replaced (and that it continues to crash on load) and get them to analyze the logs for the system? Let me know how it goes.

Thanks,
Willy

Here are the cases that were created on behalf of ST 31R9KH2:

SR 996138617 Created 8/14/19
SR 995043467 Created 7/24/19
SR 986941687 Created 2/25/19
SR 955632952 Created 10/23/17
SR 953656459 Created 9/11/17

None of these cases had case owners because they were parts dispatches through our Tech Direct system.

I had a person in our Tech Support team analyze these cases and there is not much to go on because at Dell we didn’t receive logs. Tech Direct system has its advantages and disadvantages. Getting parts such as drives, Psu, Dimms and such quickly are the advantage. The disadvantage is proper troubleshooting doesn’t always occur and some issues get parts thrown at them.

Tech support's suggestion is to open a case with an actual person in tech support and have them analyze the logs for the system. This system does have a Basic warranty so your techs would need to call 1-800-456-3355, Monday through Friday 7am to 7pm CST (5am to 5pm PST).

What's the status, was there a reply from Dell?

@wiki_willy, is there any update on this issue? We're still a bit short on capacity due to missing this host and cloudvirt1024.

Hi @Andrew - apologies for the delay. Chris has been out, but @Jclark-ctr is going to follow up on this. Thanks, Willy

Dell EMC SR # 1000122167 || Service Tag: 31R9KH2 || Server Crashes under Load

Opened an SR with Dell and forwarded the TSR report for further diagnostics. The rep advised that this host has only the basic warranty; we do not have pro-support. This might require further diagnostics on Dell's part.

Dell Support has not been able to find any hardware errors in the TSR report. Their reply:

Hi John,

I did check with a Senior Engineer who checked the Linux logs.

He said the BIOS needs to be updated to latest.
And the System Profile needs to be set to "Performance". To do that, press F2 while the server is posting, then click System BIOS and then "System Profile Settings" and set it to "Performance".

And then please run a CPU stress test using this Support live image. Please make a Bootable DVD and run the CPU stress test after booting in the live image .

Here is the link for the live image https://downloads.dell.com/FOLDER04967352M/1/SLI_3.0.0_A00.iso?uid=b74c0825-8823-43da-ce23-bed894301dd6&fn=SLI_3.0.0_A00.iso

Do let me know the outcome.

Regards,

Kapeel Pawaskar

Pointed this task out to our Dell account rep today. @Jclark-ctr - let me know if the steps they provided don't work, and then I'll forward our case number over to them...to see if we can just get a new server.

Thanks,
Willy

@Andrew I want to confirm this box is not in use right now. I need to perform additional tests for Dell.

It's still out of service awaiting a fix.

Verified performance mode in the BIOS. Loaded the stress test, which showed multiple errors on startup. Sent the errors to Dell and requested a replacement from the Support Engineer.

Followed up with Dell regarding the results and spoke with Madhusudan.Rao@dell.com on the phone. He will follow up with Kapeel Pawaskar.

Resent the TSR report; waiting on Dell.

Sent a TSR report after running onboard diagnostics, which reported faults for memory and for psu1 and psu2. The TSR report showed no errors. Running more tests.

Mentioned in SAL (#wikimedia-cloud) [2019-12-12T21:24:38Z] <jeh> schedule downtime until Jan 6th 2020 on cloudvirt1015 (bad hardware) T220853

Firmware and BIOS updated. @JHedden can your team test to see if it still fails?

Mentioned in SAL (#wikimedia-cloud) [2020-02-06T14:28:18Z] <jeh> run hardware tests on cloudvirt1015 T220853

Mentioned in SAL (#wikimedia-cloud) [2020-02-06T14:44:20Z] <jeh> update apt packages on cloudvirt1015 T220853

Closing this for now, I'll open it back up if it fails again.

I just stress-tested this and it crashed again. Stress test was:

sudo cumin --force --timeout 500 -o json  "name:stresstest1015" "/usr/bin/stress-ng --fork 1 --cpu 1 --vm 30 --vm-bytes 512M --vm-method all --verify"

on 79 VMs.

cloudvirt1015 has crashed again using @Andrew's stress test.

Paste with all the kernel oops and panics prior to the crash at P10788

@wiki_willy what should we do about this server? At this point going back to Dell feels like throwing good money after bad; they'll just put @Jclark-ctr on hold for half a day and then tell him to upgrade the firmware. Should we just unplug it and throw it in the garbage?

This host has been broken since the first week we put it into service more than a year ago. It has never provided us any value, and has cost us countless hours in SRE and dc-ops time.

Hi @Andrew - I'll sync up with @Jclark-ctr tomorrow to get a summary of the interactions that have taken place with Dell, along with a list of components that have been replaced up to this point, then make one last attempt with our Dell account rep to see if we can get the entire server swapped. If unsuccessful, then yeah, if decommissioning the server is a doable option for your team, that's probably the last resort. Will update you in a few days with the outcome of talking to Dell. Thanks, Willy

Thanks @wiki_willy! I wouldn't love to decom that host, but if thinking about this is stealing DC-Ops's time away from racking our new hardware I definitely vote for the new stuff. There's no workload on 1015 now, so decom wouldn't make things any worse than they are now.

@wiki_willy Dell will not do anything further unless we renew/upgrade the warranty to pro-support, because this box did not have pro-support. The warranty only covered reviewing the TSR report, which showed no hardware errors.

@Andrew and @Jclark-ctr - I met with our Dell account rep today, to try and push for a new replacement server...or at minimum, allow us to ship the server back to Dell for stress testing and fixing themselves, without having to rely solely on TSR reports. There's a couple hoops that we'll still have to get by, but he's going to dig around and see what he can do internally to get around it. @Andrew - are you ok if I forward them the kernel dump from P10788? Thanks, Willy

That's definitely fine!

@Andrew - just wanted to keep you posted with the latest update on this from my bi-weekly meeting with Dell today. They're going to try and replace cloudvirt1015 with a seed server. They no longer carry the same R630 servers, so they're looking at getting us a seed server with equivalent specs. The two options they're looking at are a) an Intel seed server (which would have an Intel card) or b) an AMD seed server, which would have a Broadcom card. I'll get you the exact specs once they're provided to us, which should be around 3-4 business days. Thanks, Willy

We currently only have Intel based CPUs across the fleet. I'm not opposed to also evaluate/procure servers with AMD CPUs in the future, but please let's avoid a unicorn server with an AMD CPU as a replacement for cloudvirt1015. This server has caused already quite a bit of trouble and we might run into random issues with virtualisation down the road (e.g. we pass down CPU flags into the virtual instances which could differ on AMD).

@wiki_willy honestly at this point the best outcome is probably getting 'store credit' towards future purchases. Having a replacement server (exactly like the old one) would be great but 1) I agree with @MoritzMuehlenhoff that having one new, oddball server in the middle of our cluster sounds bad, and 2) Any new servers that we're buying for this workload are 'thinvirts' which have radically different specs.

@Andrew and @MoritzMuehlenhoff - based on your feedback, I talked to our Dell rep again today, to figure out a different option. There's no alternative of getting credit back, but one thing they can do is to postpone that "seed server" for later on. So for example, if we're procuring some new hardware for WMCS next fiscal, we can take 1 or 2 of the servers in that order (depending on the cost), and Dell will get them to us as "seed servers" at no cost. Does that work for you? Thanks, Willy

Cmjohnson claimed this task.

Resolving this task, the server has been sent for decommissioning and @wiki_willy is working with Dell.

@Andrew - just a heads up, I talked to our Dell rep about using that credit/seed server for you guys in Q3, since that's the next time I see you have hw budgeted for the following:

eqiad: cloudvirt10[12-14] refresh
eqiad: cloudvirt expansion (3+1 nodes)
eqiad: ceph expansion (6+3 nodes)

Thanks,
Willy

wiki_willy mentioned this in Unknown Object (Task). Jan 8 2021, 6:03 PM