New Question
0

Ceph for Windows driver crashes.

asked 2024-01-03 19:19:17 +0200

mstidham gravatar image

Reaching out to see if anyone has experienced this or if anyone has the ability to reproduce.

We’ve been troubleshooting what appears to be random WNBD driver crashes. Each time a crash occurs; the first entry in the Event log is WNBD (129) and the output of “rbd device list” shows -1 for the disk numbers of all the volumes. We have roughly 50 volumes connected via the driver.

We’ve found that if we schedule a reboot of the “Ceph / Windows” service once a week; we do not experience the crash.

We’ve found that if we lower the amount of connected volumes in half (25); the issue has not occurred in over a month so far. Before they were occurring roughly every two weeks.

We setup a test server with 50 volumes connected with constant fio tests against the volumes; we’ve been able to get the driver to crash in less than 5 days.

When the driver crashes; the only way we’ve found to restore connectivity is to restart the “Ceph / Windows” service. If we try to “unmap / remap” the volume without restarting the service it fails with:

asok(0000018B874B7F00) AdminSocketConfigObs::init: failed: AdminSocket::bindandlisten: failed to bind the UNIX domain socket to 'c:/programdata/ceph/client.admin.9000.asok': (17) File exists

We have a DebugView output from our latest crash in the test environment.

01021899 12786.95214844 DrainDeviceQueue:95 Stale connection detected. Time since last IO reply (ms): 1019125.Time since the aborted request was issued (ms): 64546. 01021900 12786.95214844 DrainDeviceQueue:104 Removing stale connection. 01021901 12786.95214844 WnbdDeviceMonitorThread:102 Cleaning up device connection: rbd/my-cephtest-01-c. 01021902 12786.95214844 WnbdDeviceMonitorThread:111 Waiting for pending device requests: rbd/my-cephtest-01-c. 01021903 12786.95214844 WnbdDeviceMonitorThread:114 Finished waiting for pending device requests: rbd/my-cephtest-01-c. 01021904 12786.95312500 WnbdParseUserIOCTL:578 Disconnecting disk: rbd/my-cephtest-01-c. 01021905 12786.95312500 WnbdDeleteConnection:345 Could not find connection to delete 01022002 12840.17187500 Possible deadlock. Use !locks FFFF800F0B12E450 to determine the resource owner 01022003 12840.17187500 WERKERNELHOST: WerpCheckPolicy: Requested Policy is 2 01022004 12840.17187500 WERKERNELHOST: ZwOpenKey failed with scode 0xc0000034 01022005 12840.17187500 WERKERNELHOST: System memory threshold is not met. Memory Threshold 274877906944 bytes, SystemMemory 549354266624 bytes 01022006 12840.17187500 WERKERNELHOST: WerpCheckPolicy: Requested Policy 2 is higher than granted 0 01022007 12840.17187500 WERKERNELHOST: CheckPolicy throttled dump creation for Component 01022008 12840.17187500 DBGK: DbgkWerCaptureLiveKernelDump: WerLiveKernelCreateReport failed, status 0xc0000022. 01022009 12840.17187500
01022010 12840.17187500 DBGK: DbgkpWerCleanupContext: Context 0xFFFFDD0297647DF0

......

edit retag flag offensive close merge delete

3 answers

Sort by » oldest newest most voted
0

answered 2024-10-13 14:05:01 +0200

this post is marked as community wiki

This post is a wiki. Anyone with karma >75 is welcome to improve it.

I am having a similar error. In the Event Viewer I can see

DrainDeviceQueue:95 Stale connection detected. Time since last IO reply (ms): 20616181.Time since the aborted request was issued (ms): 58687.

Followed by DrainDeviceQueue:104 Removing stale connection.

and I only have 5 servers with 5 volumes each.

Currently using Squid Driver on Windows 2019 server

We had the same issue before on a test cluster we were building, and we thought it may be related to sensitivity with latency. But recently we acquired a brand new set of 5 x Dell R7615 servers with 12xOSD each and 4x10G cards and deployed ceph with it. no latencies are detected. but the disconnections keep happening.

How can we further research on this issue?

Thanks Luis

edit flag offensive delete link more

Comments

Hi, the other issue was caused by a counter that was flipping after 2 billion requests. Could you please paste the exact ceph version as returned by "rbd-wnbd --version"? Also, how reproducible is it? Is the Ceph cluster still accessible when the connections are dropped (e.g. rbd ls)?

lpetrut gravatar imagelpetrut ( 2024-10-24 10:42:18 +0200 )edit

Thanks for the response, ceph version 19.1.0 (9025b9024baf597d63005552b5ee004013630404) squid (rc) In this particular case I went through the documentation, and executed the following command and it is working now, no disconnect.: wnbd-client.exe set-opt RemoveStaleConnections false --persistent

laragones gravatar imagelaragones ( 2024-10-25 08:18:35 +0200 )edit
0

answered 2024-03-04 17:53:20 +0200

mstidham gravatar image

This was resolved as a part of https://github.com/ceph/ceph/pull/55413. Thanks

edit flag offensive delete link more
0

answered 2024-01-05 10:36:40 +0200

lpetrut gravatar image

updated 2024-02-02 12:43:10 +0200

Hi,

Is it really a driver crash, as in a blue screen, or is it just that the driver disconnects the disks after detecting stale connections? We need to make this important distinction in order to properly understand the issue.

For what is worth, I've tried to reproduce this but it didn't occur in 10 days. I'll increase the disc count and try again.

Thanks, Lucian

Later edit: the transmission id counter was flipping after about 2B requests due to the wrong type being used on the Ceph side: https://github.com/ceph/ceph/pull/55413. At that point, OSD replies were getting dropped because of transmission id mismatches, so basically all IO requests were hanging.

The fix has been merged in Ceph and backported to the downstream Reef branch: https://github.com/cloudbase/ceph/rel.... The latest Reef MSI includes the fix.

A big thank you to @mstidham for helping identify this bug.

edit flag offensive delete link more

Comments

What we see as a catalyst is the WNBD (129) error in eventlog. Then 60 seconds later (our timeout settings for the WNBD driver) then logs that it's seeing stale connections and disconnecting volumes. The stale connections seem to be caused by whatever triggers the WNBD error in event log.

mstidham gravatar imagemstidham ( 2024-01-05 19:47:41 +0200 )edit

Whatever the WNBD (129) error is; it's not being caught by debugview or any other method we've tried to figure it out. That is why I describe this as an driver crash.

mstidham gravatar imagemstidham ( 2024-01-05 19:48:49 +0200 )edit

Coupled with the fact that restarting the service is the only thing that gets the volumes back online. Restarting the service preemptively prevents the problem from occurring and lowering the amount of connected volumes prevents the problem from occurring. May not technically be a driver crash.

mstidham gravatar imagemstidham ( 2024-01-05 19:51:44 +0200 )edit

I have found that its not the number of volumes. It's seems to be related to the total io requests. I have a script that's been tracking the wnbd-client stats (per volume) every five minutes until it crashes.

mstidham gravatar imagemstidham ( 2024-01-12 15:34:25 +0200 )edit

What I noticed is that if I total the TotalReceivedIORequests from all the volumes; the last recorded values are always around 2,145,xxx,xxx. It doesn't matter if I use 25 drives or 50 drives.

mstidham gravatar imagemstidham ( 2024-01-12 15:34:36 +0200 )edit

Your Answer

Please start posting anonymously - your entry will be published after you log in or create a new account.

Add Answer

Question Tools

1 follower

Stats

Asked: 2024-01-03 19:19:17 +0200

Seen: 723 times

Last updated: Oct 13 '24