Ceph for Windows driver crashes.
Reaching out to see if anyone has experienced this or if anyone has the ability to reproduce.
We’ve been troubleshooting what appears to be random WNBD driver crashes. Each time a crash occurs; the first entry in the Event log is WNBD (129) and the output of “rbd device list” shows -1 for the disk numbers of all the volumes. We have roughly 50 volumes connected via the driver.
We’ve found that if we schedule a reboot of the “Ceph / Windows” service once a week; we do not experience the crash.
We’ve found that if we lower the amount of connected volumes in half (25); the issue has not occurred in over a month so far. Before they were occurring roughly every two weeks.
We setup a test server with 50 volumes connected with constant fio tests against the volumes; we’ve been able to get the driver to crash in less than 5 days.
When the driver crashes; the only way we’ve found to restore connectivity is to restart the “Ceph / Windows” service. If we try to “unmap / remap” the volume without restarting the service it fails with:
asok(0000018B874B7F00) AdminSocketConfigObs::init: failed: AdminSocket::bindandlisten: failed to bind the UNIX domain socket to 'c:/programdata/ceph/client.admin.9000.asok': (17) File exists
We have a DebugView output from our latest crash in the test environment.
01021899 12786.95214844 DrainDeviceQueue:95 Stale connection detected. Time since last IO reply (ms): 1019125.Time since the aborted request was issued (ms): 64546.
01021900 12786.95214844 DrainDeviceQueue:104 Removing stale connection.
01021901 12786.95214844 WnbdDeviceMonitorThread:102 Cleaning up device connection: rbd/my-cephtest-01-c.
01021902 12786.95214844 WnbdDeviceMonitorThread:111 Waiting for pending device requests: rbd/my-cephtest-01-c.
01021903 12786.95214844 WnbdDeviceMonitorThread:114 Finished waiting for pending device requests: rbd/my-cephtest-01-c.
01021904 12786.95312500 WnbdParseUserIOCTL:578 Disconnecting disk: rbd/my-cephtest-01-c.
01021905 12786.95312500 WnbdDeleteConnection:345 Could not find connection to delete
01022002 12840.17187500 Possible deadlock. Use !locks FFFF800F0B12E450 to determine the resource owner
01022003 12840.17187500 WERKERNELHOST: WerpCheckPolicy: Requested Policy is 2
01022004 12840.17187500 WERKERNELHOST: ZwOpenKey failed with scode 0xc0000034
01022005 12840.17187500 WERKERNELHOST: System memory threshold is not met. Memory Threshold 274877906944 bytes, SystemMemory 549354266624 bytes
01022006 12840.17187500 WERKERNELHOST: WerpCheckPolicy: Requested Policy 2 is higher than granted 0
01022007 12840.17187500 WERKERNELHOST: CheckPolicy throttled dump creation for Component
01022008 12840.17187500 DBGK: DbgkWerCaptureLiveKernelDump: WerLiveKernelCreateReport failed, status 0xc0000022.
01022009 12840.17187500
01022010 12840.17187500 DBGK: DbgkpWerCleanupContext: Context 0xFFFFDD0297647DF0
......