Hi,

Is it really a driver crash, as in a blue screen, or is it just that the driver disconnects the disks after detecting stale connections? We need to make this important distinction in order to properly understand the issue.

For what it's worth, I've tried to reproduce this, but it didn't occur in 10 days. I'll increase the disk count and try again.

Thanks, Lucian

Later edit: the transmission ID counter was wrapping around after about 2 billion requests because the wrong type was being used on the Ceph side: https://github.com/ceph/ceph/pull/55413. From that point on, OSD replies were dropped because of transmission ID mismatches, so essentially all IO requests were hanging.
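
To illustrate the kind of wraparound described above, here is a minimal Python sketch. It is not the Ceph code: it assumes the counter passed through a signed 32-bit field (the "about 2 billion" figure suggests this, but see the linked PR for the actual types), and the names `truncate_to_int32` and `reply_matches` are made up for this example.

```python
import ctypes

INT32_MAX = 2**31 - 1  # 2147483647, roughly "2B"


def truncate_to_int32(tid: int) -> int:
    """Emulate storing a wider ID in a signed 32-bit field (assumed type)."""
    # ctypes integer types do not check for overflow; they keep the low 32 bits.
    return ctypes.c_int32(tid).value


def reply_matches(sent_tid: int) -> bool:
    """Hypothetical check: does the ID that came back through the narrow
    field still equal the ID the client is waiting for?"""
    return truncate_to_int32(sent_tid) == sent_tid


if __name__ == "__main__":
    for tid in (1, INT32_MAX, INT32_MAX + 1):
        print(tid, truncate_to_int32(tid), reply_matches(tid))
```

In this sketch, every ID up to `INT32_MAX` round-trips unchanged, while the next one comes back as a negative number and no longer matches, which is consistent with replies being dropped once the counter passes that threshold.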

The fix has been merged in Ceph and backported to the downstream Reef branch: https://github.com/cloudbase/ceph/releases/tag/v18.5.4. The latest Reef MSI includes the fix.

A big thank you to @mstidham for helping identify this bug.
