Reaching out to see if anyone has experienced this or is able to reproduce it.
We’ve been troubleshooting what appear to be random WNBD driver crashes. Each time a crash occurs, the first entry in the Event log is WNBD (129), and the output of “rbd device list” shows -1 for the disk numbers of all the volumes. We have roughly 50 volumes connected via the driver.
We’ve found that if we schedule a restart of the “Ceph / Windows” service once a week, we do not experience the crash.
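In case it helps anyone reproduce the workaround, the weekly restart is nothing more than a scheduled stop/start of the service. A minimal sketch is below; the service name is a placeholder for whatever your Ceph for Windows install actually registers, and the polling is there because sc.exe returns before the service has finished stopping:

    import subprocess
    import time

    # Placeholder -- substitute the service name your Ceph for Windows install registers.
    SERVICE_NAME = "ceph-rbd"

    def restart_service(name: str) -> None:
        # sc.exe stop returns immediately, so poll the service state before starting again.
        subprocess.run(["sc.exe", "stop", name], check=False)
        for _ in range(60):
            query = subprocess.run(["sc.exe", "query", name],
                                   capture_output=True, text=True)
            if "STOPPED" in query.stdout:
                break
            time.sleep(1)
        subprocess.run(["sc.exe", "start", name], check=True)

    if __name__ == "__main__":
        restart_service(SERVICE_NAME)

Any scheduler that can run it weekly (Task Scheduler, etc.) will do.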
We’ve found that if we cut the number of connected volumes in half (to 25), the issue has not occurred in over a month so far; before that, the crashes were occurring roughly every two weeks.
We set up a test server with 50 volumes connected and constant fio tests running against them, and with that we’ve been able to get the driver to crash in less than 5 days.
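The load generator is roughly this shape: one long-running fio job per mapped disk. The drive numbers and job parameters below are illustrative rather than our exact jobs (the real disk numbers come from “rbd device list”):

    import subprocess

    # Illustrative only: drive numbers and fio parameters are placeholders,
    # not the exact jobs from our test server.
    procs = []
    for n in range(1, 51):
        drive = rf"\\.\PhysicalDrive{n}"
        procs.append(subprocess.Popen([
            "fio",
            f"--name=wnbd{n}",
            f"--filename={drive}",
            "--ioengine=windowsaio",
            "--rw=randrw",
            "--bs=4k",
            "--iodepth=8",
            "--time_based",
            "--runtime=86400",  # keep each job running for 24 hours
        ]))

    for p in procs:
        p.wait()

The point is simply sustained, parallel IO against all 50 disks at once.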
When the driver crashes, the only way we’ve found to restore connectivity is to restart the “Ceph / Windows” service. If we try to unmap/remap a volume without restarting the service, it fails with:
asok(0000018B874B7F00) AdminSocketConfigObs::init: failed: AdminSocket::bindandlisten: failed to bind the UNIX domain socket to 'c:/programdata/ceph/client.admin.9000.asok': (17) File exists
Below is the DebugView output from our latest crash in the test environment:
01021899 12786.95214844 DrainDeviceQueue:95 Stale connection detected. Time since last IO reply (ms): 1019125.Time since the aborted request was issued (ms): 64546.
01021900 12786.95214844 DrainDeviceQueue:104 Removing stale connection.
01021901 12786.95214844 WnbdDeviceMonitorThread:102 Cleaning up device connection: rbd/my-cephtest-01-c.
01021902 12786.95214844 WnbdDeviceMonitorThread:111 Waiting for pending device requests: rbd/my-cephtest-01-c.
01021903 12786.95214844 WnbdDeviceMonitorThread:114 Finished waiting for pending device requests: rbd/my-cephtest-01-c.
01021904 12786.95312500 WnbdParseUserIOCTL:578 Disconnecting disk: rbd/my-cephtest-01-c.
01021905 12786.95312500 WnbdDeleteConnection:345 Could not find connection to delete
01022002 12840.17187500 Possible deadlock. Use !locks FFFF800F0B12E450 to determine the resource owner
01022003 12840.17187500 WERKERNELHOST: WerpCheckPolicy: Requested Policy is 2
01022004 12840.17187500 WERKERNELHOST: ZwOpenKey failed with scode 0xc0000034
01022005 12840.17187500 WERKERNELHOST: System memory threshold is not met. Memory Threshold 274877906944 bytes, SystemMemory 549354266624 bytes
01022006 12840.17187500 WERKERNELHOST: WerpCheckPolicy: Requested Policy 2 is higher than granted 0
01022007 12840.17187500 WERKERNELHOST: CheckPolicy throttled dump creation for Component
01022008 12840.17187500 DBGK: DbgkWerCaptureLiveKernelDump: WerLiveKernelCreateReport failed, status 0xc0000022.
01022009 12840.17187500
01022010 12840.17187500 DBGK: DbgkpWerCleanupContext: Context 0xFFFFDD0297647DF0
......