Ceph for Windows - TCP Session count

asked 2023-06-08 19:09:30 +0300

mstidham

We've been testing the Ceph for Windows driver (WNBD driver) for a while now and have run into an interesting situation. Every rbd-wnbd connected volume appears to open an abnormally large number of TCP sessions to the Ceph hosts. This could be a coincidence, but some quick math suggests that every RBD connected volume holds a TCP session to every OSD on every Ceph node.

If I have a Hyper-V host with, let's say, 60 connected RBD volumes and I run the following netstat command, the number of sessions is roughly the number of RBD volumes * the number of servers * the number of OSDs per server. For example:

netstat -nao | find /i "ip of a single ceph node" /c will show around 1440 TCP sessions to that node. 1440 / 60 = 24, which matches the number of OSDs on that node.
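For anyone who wants the per-node breakdown in one pass instead of running find once per IP, here is a minimal Python sketch that tallies TCP rows by remote address. It assumes the usual five-column layout of `netstat -nao` TCP rows (protocol, local address, foreign address, state, PID); the function name `count_tcp_sessions` is made up for illustration. Unlike find, it matches only on the foreign address column, so the numbers can differ slightly.

```python
from collections import Counter


def count_tcp_sessions(netstat_output: str) -> Counter:
    """Count TCP rows per remote host in `netstat -nao` style text.

    Assumption: TCP rows have exactly five whitespace-separated fields
    (proto, local address, foreign address, state, PID). Header lines and
    UDP rows have a different field count and are skipped.
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) != 5 or fields[0].upper() != "TCP":
            continue
        remote = fields[2]
        # Drop the port. rpartition splits on the last colon, so bracketed
        # IPv6 addresses such as [fe80::1]:6800 are handled as well.
        host = remote.rpartition(":")[0].strip("[]")
        counts[host] += 1
    return counts
```

The input would be the captured text of `netstat -nao`, for example redirected to a file on the Hyper-V host and read back in. Dividing each node's count by the number of mapped volumes should then give something close to the OSD count per node if the pattern described above holds.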

So with 60 connected RBD volumes and 13 Ceph nodes with 24 OSDs each, the number of TCP sessions is around 18720 (60 * 13 * 24).

By default, Windows caps the number of dynamic source ports at 16384. It is around this mark that we started having trouble with RBD volumes randomly being disconnected or having issues. That, combined with the TCP/IP port exhaustion errors we were seeing in the event log, is how we discovered this issue. To get around it we expanded the port range in Windows to 32768. However, we see this as a potential problem, because we expect the number of RBD volumes connected to a single Hyper-V host to increase AND we expect to continue adding nodes to the Ceph cluster. It won't take long until we're butting up against this limitation again.
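To put rough numbers on the headroom, here is a small Python sketch of the arithmetic. It assumes the pattern observed above (one TCP session per volume per OSD) holds exactly, that 32768 is the size of the expanded port range, and that nothing else on the host consumes dynamic ports, so the results are upper bounds. Both function names are hypothetical.

```python
def estimated_sessions(volumes: int, nodes: int, osds_per_node: int) -> int:
    """Sessions expected if every volume connects to every OSD once."""
    return volumes * nodes * osds_per_node


def max_volumes(port_count: int, nodes: int, osds_per_node: int) -> int:
    """Largest volume count whose sessions still fit in port_count ports.

    Assumption: each session uses one dynamic source port and no other
    traffic competes for the range.
    """
    return port_count // (nodes * osds_per_node)


print(estimated_sessions(60, 13, 24))   # 18720, already above 16384
print(max_volumes(16384, 13, 24))       # 52 volumes with the default range
print(max_volumes(32768, 13, 24))       # 105 volumes with the expanded range
```

Under these assumptions the default range runs out at 52 volumes on the current cluster, and every node added lowers the ceiling further, which is why expanding the range only postpones the problem.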

As an added measure, I decided to spin up a Proxmox cluster to see whether it opens the same number of connections per LUN, and it does not appear to. The connection count does not seem to correlate with the number of nodes or the number of OSDs per node at all. It was surprisingly low.

I'm hoping with this post that maybe someone can share some insight into this. Is this something you're already aware of? Is this expected behavior and why? Have we misconfigured something? Just looking for some guidance. Thank you


Comments

Just wanted to check back on this to see what the outcome of your experiments is. I haven't seen any updates to the WNBD driver page. Thanks

mstidham ( 2023-06-30 15:48:59 +0300 )

The good news is that the OSD connections are shared when using a single process, so the number of connections is significantly reduced. The bad news is that some RBD operations (e.g. mapping images using the same RADOS context) don't seem to be thread safe and often crash.

lpetrut ( 2023-07-05 10:17:44 +0300 )

I'll need to discuss it with the RBD team and see if it's a known issue.

lpetrut ( 2023-07-05 10:18:46 +0300 )

Thanks for the update.

mstidham ( 2023-07-05 17:52:45 +0300 )

I've fixed the crashes and will have a PR soon. We'll most probably include this in the Reef MSI.

lpetrut ( 2023-07-12 11:05:40 +0300 )
