New Question
0

Ceph for Windows driver crashes.

asked 2024-01-03 19:19:17 +0300

mstidham gravatar image

Reaching out to see if anyone has experienced this or if anyone has the ability to reproduce.

We’ve been troubleshooting what appears to be random WNBD driver crashes. Each time a crash occurs; the first entry in the Event log is WNBD (129) and the output of “rbd device list” shows -1 for the disk numbers of all the volumes. We have roughly 50 volumes connected via the driver.

We’ve found that if we schedule a reboot of the “Ceph / Windows” service once a week; we do not experience the crash.

We’ve found that if we lower the amount of connected volumes in half (25); the issue has not occurred in over a month so far. Before they were occurring roughly every two weeks.

We setup a test server with 50 volumes connected with constant fio tests against the volumes; we’ve been able to get the driver to crash in less than 5 days.

When the driver crashes; the only way we’ve found to restore connectivity is to restart the “Ceph / Windows” service. If we try to “unmap / remap” the volume without restarting the service it fails with:

asok(0000018B874B7F00) AdminSocketConfigObs::init: failed: AdminSocket::bindandlisten: failed to bind the UNIX domain socket to 'c:/programdata/ceph/client.admin.9000.asok': (17) File exists

We have a DebugView output from our latest crash in the test environment.

01021899 12786.95214844 DrainDeviceQueue:95 Stale connection detected. Time since last IO reply (ms): 1019125.Time since the aborted request was issued (ms): 64546. 01021900 12786.95214844 DrainDeviceQueue:104 Removing stale connection. 01021901 12786.95214844 WnbdDeviceMonitorThread:102 Cleaning up device connection: rbd/my-cephtest-01-c. 01021902 12786.95214844 WnbdDeviceMonitorThread:111 Waiting for pending device requests: rbd/my-cephtest-01-c. 01021903 12786.95214844 WnbdDeviceMonitorThread:114 Finished waiting for pending device requests: rbd/my-cephtest-01-c. 01021904 12786.95312500 WnbdParseUserIOCTL:578 Disconnecting disk: rbd/my-cephtest-01-c. 01021905 12786.95312500 WnbdDeleteConnection:345 Could not find connection to delete 01022002 12840.17187500 Possible deadlock. Use !locks FFFF800F0B12E450 to determine the resource owner 01022003 12840.17187500 WERKERNELHOST: WerpCheckPolicy: Requested Policy is 2 01022004 12840.17187500 WERKERNELHOST: ZwOpenKey failed with scode 0xc0000034 01022005 12840.17187500 WERKERNELHOST: System memory threshold is not met. Memory Threshold 274877906944 bytes, SystemMemory 549354266624 bytes 01022006 12840.17187500 WERKERNELHOST: WerpCheckPolicy: Requested Policy 2 is higher than granted 0 01022007 12840.17187500 WERKERNELHOST: CheckPolicy throttled dump creation for Component 01022008 12840.17187500 DBGK: DbgkWerCaptureLiveKernelDump: WerLiveKernelCreateReport failed, status 0xc0000022. 01022009 12840.17187500
01022010 12840.17187500 DBGK: DbgkpWerCleanupContext: Context 0xFFFFDD0297647DF0

......

edit retag flag offensive close merge delete

2 answers

Sort by » oldest newest most voted
0

answered 2024-03-04 17:53:20 +0300

mstidham gravatar image

This was resolved as a part of https://github.com/ceph/ceph/pull/55413. Thanks

edit flag offensive delete link more
0

answered 2024-01-05 10:36:40 +0300

lpetrut gravatar image

updated 2024-02-02 12:43:10 +0300

Hi,

Is it really a driver crash, as in a blue screen, or is it just that the driver disconnects the disks after detecting stale connections? We need to make this important distinction in order to properly understand the issue.

For what is worth, I've tried to reproduce this but it didn't occur in 10 days. I'll increase the disc count and try again.

Thanks, Lucian

Later edit: the transmission id counter was flipping after about 2B requests due to the wrong type being used on the Ceph side: https://github.com/ceph/ceph/pull/55413. At that point, OSD replies were getting dropped because of transmission id mismatches, so basically all IO requests were hanging.

The fix has been merged in Ceph and backported to the downstream Reef branch: https://github.com/cloudbase/ceph/rel.... The latest Reef MSI includes the fix.

A big thank you to @mstidham for helping identify this bug.

edit flag offensive delete link more

Comments

What we see as a catalyst is the WNBD (129) error in eventlog. Then 60 seconds later (our timeout settings for the WNBD driver) then logs that it's seeing stale connections and disconnecting volumes. The stale connections seem to be caused by whatever triggers the WNBD error in event log.

mstidham gravatar imagemstidham ( 2024-01-05 19:47:41 +0300 )edit

Whatever the WNBD (129) error is; it's not being caught by debugview or any other method we've tried to figure it out. That is why I describe this as an driver crash.

mstidham gravatar imagemstidham ( 2024-01-05 19:48:49 +0300 )edit

Coupled with the fact that restarting the service is the only thing that gets the volumes back online. Restarting the service preemptively prevents the problem from occurring and lowering the amount of connected volumes prevents the problem from occurring. May not technically be a driver crash.

mstidham gravatar imagemstidham ( 2024-01-05 19:51:44 +0300 )edit

I have found that its not the number of volumes. It's seems to be related to the total io requests. I have a script that's been tracking the wnbd-client stats (per volume) every five minutes until it crashes.

mstidham gravatar imagemstidham ( 2024-01-12 15:34:25 +0300 )edit

What I noticed is that if I total the TotalReceivedIORequests from all the volumes; the last recorded values are always around 2,145,xxx,xxx. It doesn't matter if I use 25 drives or 50 drives.

mstidham gravatar imagemstidham ( 2024-01-12 15:34:36 +0300 )edit

That is an interesting number because its real close to INT size 2147483648. Since it's every five minutes; I don't know what the exact number is when it crashes. Can you test this theory?

mstidham gravatar imagemstidham ( 2024-01-12 15:36:15 +0300 )edit

That’s a very useful observation, it might imply a counter overflow. I’ll look into it, thanks a lot!

lpetrut gravatar imagelpetrut ( 2024-01-14 14:05:15 +0300 )edit

Thanks; I'm anxious to hear if you are able to reproduce it.

mstidham gravatar imagemstidham ( 2024-01-18 18:23:25 +0300 )edit

I need one more day to reach the 2 billion request mark :) https://paste.opendev.org/raw/bZK5NTBpxg441n8N3USi/. Once it enters that state, it should be relatively easy to pinpoint the problem.

lpetrut gravatar imagelpetrut ( 2024-01-19 09:36:08 +0300 )edit

Good news, the issue reproduced after 5 days and 3.9B requests: https://paste.openstack.org/raw/b5xYFsPbZLXvftyyFlDf/. I'm looking at a memory dump and I think there's an issue with the ceph connection. I see a bunch of hanging requests, including the image close call.

lpetrut gravatar imagelpetrut ( 2024-01-22 13:38:10 +0300 )edit

Got a few leads from the stack trace but it will need further investigation. It could be ceph messenger related or maybe an issue with the socket poll handler or the ceph timers.

lpetrut gravatar imagelpetrut ( 2024-01-23 11:45:54 +0300 )edit

Something that you mind find useful, I'm using WSL to interact with the running daemons: https://paste.openstack.org/raw/bsDBGA5lsLYvilLHlaH5/. We couldn't port the "ceph" tool yet because of the Cython dependency but WSL can be used as an alternative.

lpetrut gravatar imagelpetrut ( 2024-01-23 11:46:01 +0300 )edit

I've prepared a new test build and I'm attempting to reproduce the problem. I'm using memstore so hopefully it will show up sooner. The only downside with memstore is that it's rather unstable.

lpetrut gravatar imagelpetrut ( 2024-01-23 11:46:10 +0300 )edit

It's real good to hear that we're finally are to a point where it's reproducible. I wasn't expecting that you would be able to reach 3.9b before the crash. I did not know that about WSL but I'll check it out.

mstidham gravatar imagemstidham ( 2024-01-23 15:27:52 +0300 )edit

A small update: hit the issue again after 3.5B requests, however we gathered more data this time. The good news is that the ceph connection is still alive, which allows us to narrow the search.

lpetrut gravatar imagelpetrut ( 2024-01-26 13:08:28 +0300 )edit

Your Answer

Please start posting anonymously - your entry will be published after you log in or create a new account.

Add Answer

Question Tools

1 follower

Stats

Asked: 2024-01-03 19:19:17 +0300

Seen: 519 times

Last updated: Mar 04