SO_REUSEPORT and socket affinity in DragonFlyBSD


Post by dragonflybsd » 30 Jul 2016, 23:48

DragonFly hashes input packets and sockets using the Toeplitz hash. The
Toeplitz key is generated by repeating 2 bytes, so the hash result is
commutative over the TCP 4-tuple and the UDP 2-tuple, i.e. swapping the
source and destination address/port yields the same hash. This commutativity
of the hash result is important.
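
For illustration, here is a minimal software Toeplitz sketch (my code and
names, not DragonFly's kernel code). Since the key repeats every 16 bits and
the address/port fields sit on 16-bit boundaries, hashing
(saddr, daddr, sport, dport) and (daddr, saddr, dport, sport) gives the same
value:

#include <stdint.h>
#include <stddef.h>

/*
 * Toeplitz hash over 'datalen' bytes of input; 'key' must be at least
 * datalen + 4 bytes long.
 */
static uint32_t
toeplitz_hash(const uint8_t *key, const uint8_t *data, size_t datalen)
{
	uint32_t hash = 0, win;
	size_t i;
	int b;

	/* 32-bit window sliding over the key, one bit per input bit. */
	win = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
	    ((uint32_t)key[2] << 8) | key[3];
	for (i = 0; i < datalen; ++i) {
		for (b = 0; b < 8; ++b) {
			if (data[i] & (0x80 >> b))
				hash ^= win;
			/* Shift in key bit (i * 8 + b + 32). */
			win = (win << 1) | ((key[i + 4] >> (7 - b)) & 1);
		}
	}
	return hash;
}

/* Build the key by repeating 2 bytes; 0x6d 0x5a is just an example seed. */
static void
toeplitz_make_key(uint8_t *key, size_t keylen)
{
	size_t i;

	for (i = 0; i < keylen; ++i)
		key[i] = (i & 1) ? 0x5a : 0x6d;
}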

In DragonFly, there is one NETISR kernel thread for each CPU, and it is bound
to that CPU. The TCP syncache, the connected sockets' inpcbs, the inpcb
connection hash table and the inpcb wildcard hash table are per-CPU. However,
the TCP listen socket's completion queue and incomplete-connection queue are
shared across NETISRs, i.e. CPUs. The inpcb wildcard hash and the TCP listen
socket:

CPU0 CPU1

inpwild_hash inpwild_hash
: : : :
: : +-----------+ : :
+----+ | so_comp | +----+
| |------>|-----------|<------| |
+----+ | so_incomp | +----+
: : +-----------+ : :
: : A A : :
A | | A
| | | |
| | | |
SYNCACHE0------+ +------SYNCACHE1
A A
| |
| |
NETISR0 NETISR1

Access to so_comp and so_incomp is protected by a pooled token. In the nginx
case, this is what happens with the traditional listen-socket inheritance.
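
For reference, the traditional inheritance that picture corresponds to is the
usual pre-fork pattern, roughly as below (my sketch, not nginx source; error
handling omitted): one listen socket is created before fork(2) and every
worker accept(2)s on it, so all workers contend on the same so_comp/so_incomp.

#include <sys/socket.h>
#include <netinet/in.h>
#include <string.h>
#include <unistd.h>

#define NWORKERS	2		/* hypothetical worker count */

int
main(void)
{
	struct sockaddr_in sin;
	int fd, i;

	fd = socket(AF_INET, SOCK_STREAM, 0);
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(80);
	sin.sin_addr.s_addr = htonl(INADDR_ANY);
	bind(fd, (struct sockaddr *)&sin, sizeof(sin));
	listen(fd, 128);

	for (i = 0; i < NWORKERS; ++i) {
		if (fork() == 0) {
			/* Worker: accept on the single inherited listen socket. */
			for (;;) {
				int cfd = accept(fd, NULL, NULL);

				/* ... serve cfd ... */
				close(cfd);
			}
		}
	}
	/* The parent would wait for the workers here. */
	return 0;
}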

The inpcb connection hash and TCP connected socket:

CPU0 CPU1

inpconn_hash inpconn_hash
: : : :
: : : :
+----+ +---------+ +---------+ +----+
| |-->| TCP inp | | TCP inp |<--| |
+----+ +---------+ +---------+ +----+
: : A A : :
: : | | : :
A | | A
| | | |
| | | |
NETISR0-------+ +--------NETISR1

Operations on a TCP connected socket do not need protection and are
CPU-localized. The connected socket's TCP 4-tuple Toeplitz hash is masked
with ncpus2_mask to pick the CPU on which the TCP connected socket should
be processed. In the nginx case, this is the picture once the sockets have
been accepted.
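
Put as code, CPU selection for a connected socket is just a mask over the
4-tuple hash; a tiny sketch using the toeplitz_hash() sketch above (ncpus2 is
the largest power of two not exceeding the CPU count, ncpus2_mask = ncpus2 - 1):

/*
 * Pick the CPU (and thus the NETISR) owning a connected TCP socket.
 * 'tuple' is the packed (laddr, faddr, lport, fport); the kernel keeps
 * the hash in the inpcb instead of recomputing it per packet.
 */
static int
tcp_conn_cpu(const uint8_t *key, const uint8_t *tuple, size_t tuplelen,
    uint32_t ncpus2_mask)
{
	return (toeplitz_hash(key, tuple, tuplelen) & ncpus2_mask);
}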

SO_REUSEPORT introduces a per-CPU inpcb group hash; the figure below shows
the case where the number of TCP listen sockets equals the number of CPUs:

accept(2) accept(2)
UTHREAD0 ----------------+ UTHREAD1 ----------+
(TCP listen sock0) | (TCP listen sock1) |
| |
CPU0 | CPU1 |
V |
inpgroup_hash entry +-------------+ inpgroup_hash entry |
+->| comp/incomp |<-+ |
+-----+ | +-------------+ | +-----+ |
| 0 |---------+ A +---------| 0 | |
+-----+ +--------------+ +-----+ |
| 1 |----|----+ +---------| 1 | |
+-----+ | | +-------------+ | +-----+ |
A | +->| comp/incomp |<-+ A |
| | +-------------+ | |
| | A A | |
| | | +------------------------------------+
| | | |
SYNCACHE0---+ +--------------------SYNCACHE1
A A
| |
| |
NETISR0 NETISR1

The incoming TCP SYN's Toeplitz hash is taken modulo the number of
SO_REUSEPORT TCP listen sockets on the same port/addr to generate the index
into the inpgroup_hash entry, so access to so_comp and so_incomp, i.e.
accept(2), is CPU-localized.
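
From userland, taking advantage of this only requires each worker to create
and bind its own listen socket with SO_REUSEPORT instead of inheriting one;
a minimal sketch of one worker's setup (my names, error handling omitted):

#include <sys/socket.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/* Each worker calls this to get its own SO_REUSEPORT listen socket. */
static int
worker_listen_socket(uint16_t port)
{
	struct sockaddr_in sin;
	int fd, on = 1;

	fd = socket(AF_INET, SOCK_STREAM, 0);
	setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &on, sizeof(on));

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(port);
	sin.sin_addr.s_addr = htonl(INADDR_ANY);
	bind(fd, (struct sockaddr *)&sin, sizeof(sin));
	listen(fd, 128);
	/* This worker then accept(2)s only on this socket. */
	return fd;
}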

The data output path:

CPU0 UTHREAD0 (TCP listen sock0)
|
| mbuf to-be-send on inp0
NETISR0<------+
|
| mbuf
V
inp0 so_snd

As long as UTHREAD0 handles only sockets accepted from SO_REUSEPORT TCP
listen sock0 and it stays on CPU0 (I have not provided a CPU binding hint
yet; however, the DragonFly scheduler helps a lot here), send(2) is
CPU-localized.
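
Until such a hint is wired up, a worker can pin itself explicitly; a sketch
assuming DragonFly's usched_set(2) with USCHED_SET_CPU (see the man page for
the exact contract, this is my illustration rather than anything the
implementation requires):

#include <sys/types.h>
#include <sys/usched.h>
#include <unistd.h>

/* Pin the calling worker process to 'cpuid' so its sockets stay local. */
static int
worker_pin_cpu(int cpuid)
{
	return usched_set(getpid(), USCHED_SET_CPU, &cpuid, sizeof(cpuid));
}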

The data input path:

CPU0 UTHREAD0 (TCP listen sock0)
|
inpconn_hash |
: : | extract the rcvd mbuf
: : |
+----+ V
| |--->inp0 so_rcv
+----+ A
: : |
: : | rcvd mbuf
|
NETISR0----------+

As long as UTHREAD0 handles only sockets accepted from SO_REUSEPORT TCP
listen sock0 and it stays on CPU0, recv(2) is CPU-localized.

The socket close path:

CPU0 UTHREAD0 (TCP listen sock0)
|
| close inp0 msg
NETISR0<------+
:
soclose(inp0)

As long as UTHREAD0 handles only sockets accepted from SO_REUSEPORT TCP
listen sock0 and it stays on CPU0, close(2) is CPU-localized.



THE REST OF THE RELATED PARTS:

If the network hardware does not support RSS, using MSI or line interrupt:

CPU0 CPU1

NETISR0 (ETHERNET/IP/TCP proc) NETISR1 (ETHERNET/IP/TCP proc)
A A
| |
| pkt(hash==0) | pkt(hash==1)
| |
hash=(SW Toeplitz hash & ncpus2_mask)-----------+
|
NIC_ITHREAD---->RX ring

If the network hardware does not support RSS, using polling(4):

CPU0 CPU1

NETISR0 (ETHERNET/IP/TCP proc) NETISR1 (ETHERNET/IP/TCP proc)
| A A
| | |
| | pkt(hash==0) | pkt(hash==1)
| | |
hash=(SW Toeplitz hash & ncpus2_mask)-----------+
|
V
RX ring

If the network hardware supports RSS, unlike other OSes, we program the same
key used by the software into the hardware, and configure the redirect table
in the following fashion: (hash & ring_cnt_mask) == rdr_table[hash & rdr_table_mask].
In DragonFly, the MSI-X ithread is bound to a specific CPU by the driver,
which knows which CPU should process the input packets from the specific
RX ring.
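
One way to build a redirect table that satisfies that invariant, assuming
both the RX ring count and the table size are powers of two with table size
>= ring count (variable names are mine): fill every slot with its own index
masked by the ring count, so rdr_table[hash & rdr_table_mask] collapses to
hash & ring_cnt_mask.

#include <stdint.h>

/*
 * Fill the RSS redirect table such that
 *   rdr_table[hash & rdr_table_mask] == (hash & ring_cnt_mask)
 * holds for every hash value.
 */
static void
rss_fill_rdr_table(uint32_t *rdr_table, uint32_t rdr_table_size,
    uint32_t ring_cnt)
{
	uint32_t ring_cnt_mask = ring_cnt - 1;
	uint32_t i;

	for (i = 0; i < rdr_table_size; ++i)
		rdr_table[i] = i & ring_cnt_mask;
}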

If the network hardware supports RSS, using MSI-X:

CPU0 CPU1

NETISR0 (ETHERNET/IP/TCP proc) NETISR1 (ETHERNET/IP/TCP proc)
A A
| |
| pkt(hash==0) | pkt(hash==1)
| |
hash=(HW Toeplitz hash & ncpus2_mask) hash=(HW Toeplitz hash & ncpus2_mask)
| |
MSIX_ITHREAD0 MSIX_ITHREAD1
| |
V V
RX ring0 RX ring1

If the network hardware supports RSS, using polling(4):

CPU0 CPU1

NETISR0 (ETHERNET/IP/TCP proc) NETISR1 (ETHERNET/IP/TCP proc)
| A | A
| | | |
| | pkt(hash==0) | | pkt(hash==1)
| | | |
hash=(HW Toeplitz hash & ncpus2_mask) hash=(HW Toeplitz hash & ncpus2_mask)
| |
V V
RX ring0 RX ring1

The transmission path, when the number of hardware TX rings is less than the
number of CPUs or the ALTQ packet scheduler is enabled (assume CPU0 processes
the hardware TXEOF):

CPU0 CPU1

NETISR0 NETISR1
| |
| mbuf | mbuf
| |
V V
+----------------------------+
| if_subq0 |
+----------------------------+
| |
| |
+........................\
| TX ring contended :
| : TX ring not contended
| :
V V
+----------------------------+
| TX ring |
+----------------------------+

The transmission path, when the number of hardware TX rings is equal to or
greater than the number of CPUs:

CPU0 CPU1

NETISR0 NETISR1
| |
| mbuf | mbuf
| |
V V
if_subq0 if_subq1
| |
| |
V V
TX ring0 TX ring1
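
A rough sketch of the subqueue choice the two diagrams above imply (my names;
the kernel's if_subq dispatch is more involved than this): with ALTQ enabled
or fewer TX rings than CPUs, every NETISR feeds subqueue 0 and contends on
the single ring; otherwise each NETISR feeds its own subqueue, which maps 1:1
onto a TX ring.

/* Pick the if_subq index the NETISR on 'cpuid' should enqueue to. */
static int
ifq_pick_subq(int cpuid, int tx_ring_cnt, int ncpus, int altq_enabled)
{
	if (altq_enabled || tx_ring_cnt < ncpus)
		return 0;	/* shared subqueue, possible ring contention */
	return cpuid;		/* per-CPU subqueue -> per-CPU TX ring */
}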



MY ORIGINAL PLAN, BEFORE SO_REUSEPORT:

CPU0 CPU1

UTHREAD0 UTHREAD1
| : : |
| :................................. : |
| : : |
| .....................................: |
| : : |
| : : |
V V V V
+-------------+ +-----------------+ +-------------+
| so_comphdr0 |---| TCP listen sock |---| so_comphdr1 |
+-------------+ +-----------------+ +-------------+
A A
| |
| |
SYNCACHE0 SYNCACHE1
A A
| |
| |
NETISR0 NETISR1

UTHREAD0 would try dequeuing from so_comphdr0; if that is empty, it would
then go to so_comphdr1. The same applies to UTHREAD1.
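
A sketch of that dequeue order under the abandoned plan (entirely
hypothetical types and helpers; none of this exists in the kernel): try the
CPU-local completion queue header first, then fall back to the other CPU's.

struct so_comphdr;	/* hypothetical per-CPU completed-connection queue */
struct socket;

struct socket	*so_comphdr_dequeue(struct so_comphdr *);

static struct socket *
listen_dequeue(struct so_comphdr *local, struct so_comphdr *remote)
{
	struct socket *so;

	so = so_comphdr_dequeue(local);		/* CPU-local, no contention */
	if (so == NULL)
		so = so_comphdr_dequeue(remote);	/* fall back to the peer CPU */
	return so;
}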

However, this could run into trouble when waking up the waiters on the TCP
listen socket, since they actually wait on the same socket. Additionally, it
could introduce too many changes to the kernel, which might affect other
kinds of sockets (AF_LOCAL, SOCK_STREAM/SOCK_SEQPACKET). Comparatively
speaking, the implementation of SO_REUSEPORT is much simpler, much less
invasive and straightforward, and the end result is good.






The DragonFly BSD open-source organization provides professional BSD
consulting services. If you run into BSD system-level problems in the future,
you can mail the dedicated BSD mailbox sephe@freebsd.org, or visit the
DragonFly BSD official website www.dragonflybsd.org to discuss.
