
author	Sagi Grimberg <sagi@grimberg.me>
	Mon, 5 Sep 2022 15:07:06 +0000 (18:07 +0300)
committer	Christoph Hellwig <hch@lst.de>
	Tue, 6 Sep 2022 04:40:44 +0000 (06:40 +0200)
commit	3770a42bb8ceb856877699257a43c0585a5d2996
tree	c40968338fe0f850c45e3c4b792d5d3dafaa2182
parent	160f3549a907a50e51a8518678ba2dcf2541abea

nvme-tcp: fix regression that causes sporadic requests to time out

When we queue requests, we strive to batch as much as possible and also
signal the network stack that more data is about to be sent over a socket
with MSG_SENDPAGE_NOTLAST. Whether this flag is set depends on the pending
requests queued as well as on queue->more_requests, which is derived from
the block layer last-in-batch indication.

We set more_requests=true when we flush the request directly from
.queue_rq submission context (in nvme_tcp_send_all), however this
wrongly assumes that no other requests may be queued during the
execution of nvme_tcp_send_all.

Due to this, a race condition may happen where:

1. request X is queued as !last-in-batch
2. request X submission context calls nvme_tcp_send_all directly
3. nvme_tcp_send_all is preempted and rescheduled to a different CPU
4. request Y is queued as last-in-batch
5. nvme_tcp_send_all context sends requests X and Y, however it signals
MSG_SENDPAGE_NOTLAST for both because queue->more_requests=true.

==> neither request is pushed down to the wire, as the network
stack is waiting for more data, and both requests time out.
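
The sequence above can be illustrated with a small userspace model. This
is only a sketch, not the driver code: struct toy_queue, its fields and
both functions are hypothetical names, and the "pending" counter stands
in for the req_list and send_list contents. It assumes the decision to
flag a send as "not last" is taken after the request being sent has been
removed from the pending count.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy stand-in for the queue state; not the kernel struct. */
struct toy_queue {
	size_t pending;      /* requests still queued (req_list + send_list) */
	bool more_requests;  /* hint derived from the !last-in-batch indication */
};

/* Old-style predicate: pending requests OR the batch hint. */
static bool toy_queue_more_old(const struct toy_queue *q)
{
	return q->pending > 0 || q->more_requests;
}

/*
 * Drain the queue and return how many sends were flagged "not last".
 * The hint stays constant for the whole drain, as it does while a
 * single submission context is running the send loop.
 */
static size_t toy_send_all_old(struct toy_queue *q)
{
	size_t notlast = 0;

	while (q->pending > 0) {
		q->pending--;	/* this request is the one being sent */
		if (toy_queue_more_old(q))
			notlast++;
	}
	return notlast;
}
```

With pending = 2 (requests X and Y) and more_requests = true (set on
behalf of X), both sends are flagged "not last", so nothing tells the
network stack to push the data out.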

To fix this, we eliminate queue->more_requests and rely only on
whether the queue req_list and send_list are non-empty.
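
A minimal userspace sketch of a predicate that depends only on list
occupancy, under the same caveats as any toy model: struct
toy_fixed_queue, its length counters and both functions are hypothetical
and merely echo the req_list and send_list names; the real driver uses
kernel list types and is not reproduced here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model; the counters stand in for the two lists. */
struct toy_fixed_queue {
	size_t req_list_len;   /* requests queued but not yet picked up */
	size_t send_list_len;  /* requests picked up and waiting to be sent */
};

/* "More data follows" is true only if something is actually queued. */
static bool toy_fixed_queue_more(const struct toy_fixed_queue *q)
{
	return q->req_list_len > 0 || q->send_list_len > 0;
}

/* Send everything and return how many sends were flagged "not last". */
static size_t toy_fixed_send_all(struct toy_fixed_queue *q)
{
	size_t notlast = 0;

	/* Move newly queued requests onto the send list. */
	q->send_list_len += q->req_list_len;
	q->req_list_len = 0;

	while (q->send_list_len > 0) {
		q->send_list_len--;	/* this request is the one being sent */
		if (toy_fixed_queue_more(q))
			notlast++;
	}
	return notlast;
}
```

In this model, with X on the send list and Y on the request list, only
the first send is flagged "not last"; the final send is unflagged
regardless of which context queued which request.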

Fixes: 122e5b9f3d37 ("nvme-tcp: optimize network stack with setting msg flags according to batch size")
Reported-by: Jonathan Nicklin <jnicklin@blockbridge.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Jonathan Nicklin <jnicklin@blockbridge.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>