OSDN Git Service

uclinux-h8/linux.git
2 years agonvme: remove nvme_alloc_request and nvme_alloc_request_qid
Christoph Hellwig [Tue, 15 Mar 2022 14:53:59 +0000 (15:53 +0100)]
nvme: remove nvme_alloc_request and nvme_alloc_request_qid

Just open code the allocation + initialization in the callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2 years agonvme: cleanup how disk->disk_name is assigned
Christoph Hellwig [Tue, 15 Mar 2022 11:58:06 +0000 (12:58 +0100)]
nvme: cleanup how disk->disk_name is assigned

The way the disk name assignment, and the comments on why it is done,
are split over core.c and multipath.c is rather confusing.

Now that ns_head->disk always exists we can do all the work in core.c
and have a single big comment explaining the issues.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2 years agonvmet: move the call to nvmet_ns_changed out of nvmet_ns_revalidate
Christoph Hellwig [Tue, 15 Mar 2022 07:13:04 +0000 (08:13 +0100)]
nvmet: move the call to nvmet_ns_changed out of nvmet_ns_revalidate

nvmet_ns_changed states via lockdep that the ns->subsys->lock must be
held. The only caller of nvmet_ns_changed which does not acquire that
lock is nvmet_ns_revalidate. nvmet_ns_revalidate has 3 callers,
of which 2 do not acquire that lock: nvmet_execute_identify_cns_cs_ns
and nvmet_execute_identify_ns. The other caller
nvmet_ns_revalidate_size_store does acquire the lock.

Move the call to nvmet_ns_changed from nvmet_ns_revalidate to the callers
so that they can perform the correct locking as needed.

This issue was found using a static type-based analyser and manually
verified.

Reported-by: Niels Dossche <dossche.niels@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2 years agonvmet: use snprintf() with PAGE_SIZE in configfs
Chaitanya Kulkarni [Tue, 8 Mar 2022 22:56:58 +0000 (14:56 -0800)]
nvmet: use snprintf() with PAGE_SIZE in configfs

Instead of using sprintf, use snprintf with buffer size limited to
PAGE_SIZE just like what we have for the rest of the file.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
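
The bounded formatting described above can be sketched in user space as
follows. This is only an illustration, not the nvmet code: DEMO_PAGE_SIZE
stands in for the kernel's PAGE_SIZE and demo_attr_show() is a hypothetical
handler name.

```c
#include <assert.h>
#include <stdio.h>

/* Assumption: 4096 stands in for the kernel's PAGE_SIZE. */
#define DEMO_PAGE_SIZE 4096

/*
 * Hypothetical "show" handler: formats a value into a page-sized buffer
 * and returns the number of bytes stored (excluding the trailing NUL).
 */
static int demo_attr_show(char *page, const char *value)
{
	int len;

	assert(page != NULL && value != NULL);
	/* Unlike sprintf(), snprintf() never writes past the given size. */
	len = snprintf(page, DEMO_PAGE_SIZE, "%s\n", value);
	/* snprintf() returns the untruncated length; clamp to what fit. */
	if (len >= DEMO_PAGE_SIZE)
		len = DEMO_PAGE_SIZE - 1;
	return len;
}
```

The clamp matters only when the value would not fit; in that case the
buffer still holds a NUL-terminated, truncated string.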
2 years agonvmet: don't fold lines
Chaitanya Kulkarni [Tue, 8 Mar 2022 22:56:57 +0000 (14:56 -0800)]
nvmet: don't fold lines

Don't fold lines that fit within the 80-character limit. No functional
change in this patch.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvmet-rdma: fix kernel-doc warning for nvmet_rdma_device_removal
Chaitanya Kulkarni [Wed, 9 Mar 2022 05:59:27 +0000 (21:59 -0800)]
nvmet-rdma: fix kernel-doc warning for nvmet_rdma_device_removal

This fixes the following kernel-doc warning:

drivers/nvme/target/rdma.c:1722: warning: expecting prototype for nvme_rdma_device_removal(). Prototype was for nvmet_rdma_device_removal() instead

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvmet-fc: fix kernel-doc warning for nvmet_fc_unregister_targetport
Chaitanya Kulkarni [Wed, 9 Mar 2022 05:59:26 +0000 (21:59 -0800)]
nvmet-fc: fix kernel-doc warning for nvmet_fc_unregister_targetport

This fixes the following kernel-doc warning:

drivers/nvme/target/fc.c:1619: warning: expecting prototype for nvme_fc_unregister_targetport(). Prototype was for nvmet_fc_unregister_targetport() instead

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvmet-fc: fix kernel-doc warning for nvmet_fc_register_targetport
Chaitanya Kulkarni [Wed, 9 Mar 2022 05:59:25 +0000 (21:59 -0800)]
nvmet-fc: fix kernel-doc warning for nvmet_fc_register_targetport

This fixes the following kernel-doc warning:

drivers/nvme/target/fc.c:1365: warning: expecting prototype for nvme_fc_register_targetport(). Prototype was for nvmet_fc_register_targetport() instead

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvme-tcp: lockdep: annotate in-kernel sockets
Chris Leech [Wed, 16 Feb 2022 02:22:49 +0000 (18:22 -0800)]
nvme-tcp: lockdep: annotate in-kernel sockets

Put NVMe/TCP sockets in their own class to avoid some lockdep warnings.
Sockets created by nvme-tcp are not exposed to user-space, and will not
trigger certain code paths that the general socket API exposes.

Lockdep complains about a circular dependency between the socket and
filesystem locks, because setsockopt can trigger a page fault with a
socket lock held, but nvme-tcp sends requests on the socket while file
system locks are held.

  ======================================================
  WARNING: possible circular locking dependency detected
  5.15.0-rc3 #1 Not tainted
  ------------------------------------------------------
  fio/1496 is trying to acquire lock:
  (sk_lock-AF_INET){+.+.}-{0:0}, at: tcp_sendpage+0x23/0x80

  but task is already holding lock:
  (&xfs_dir_ilock_class/5){+.+.}-{3:3}, at: xfs_ilock+0xcf/0x290 [xfs]

  which lock already depends on the new lock.

  other info that might help us debug this:

  chain exists of:
   sk_lock-AF_INET --> sb_internal --> &xfs_dir_ilock_class/5

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(&xfs_dir_ilock_class/5);
                                lock(sb_internal);
                                lock(&xfs_dir_ilock_class/5);
   lock(sk_lock-AF_INET);

  *** DEADLOCK ***

  6 locks held by fio/1496:
   #0: (sb_writers#13){.+.+}-{0:0}, at: path_openat+0x9fc/0xa20
   #1: (&inode->i_sb->s_type->i_mutex_dir_key){++++}-{3:3}, at: path_openat+0x296/0xa20
   #2: (sb_internal){.+.+}-{0:0}, at: xfs_trans_alloc_icreate+0x41/0xd0 [xfs]
   #3: (&xfs_dir_ilock_class/5){+.+.}-{3:3}, at: xfs_ilock+0xcf/0x290 [xfs]
   #4: (hctx->srcu){....}-{0:0}, at: hctx_lock+0x51/0xd0
   #5: (&queue->send_mutex){+.+.}-{3:3}, at: nvme_tcp_queue_rq+0x33e/0x380 [nvme_tcp]

This annotation lets lockdep analyze nvme-tcp controlled sockets
independently of what the user-space sockets API does.

Link: https://lore.kernel.org/linux-nvme/CAHj4cs9MDYLJ+q+2_GXUK9HxFizv2pxUryUR0toX974M040z7g@mail.gmail.com/
Signed-off-by: Chris Leech <cleech@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvme-tcp: don't fold the line
Chaitanya Kulkarni [Wed, 23 Feb 2022 03:36:57 +0000 (19:36 -0800)]
nvme-tcp: don't fold the line

The call to nvme_tcp_alloc_queue() fits perfectly in one line without
exceeding 80 char limit for the line.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvme-tcp: don't initialize ret variable
Chaitanya Kulkarni [Wed, 23 Feb 2022 03:36:56 +0000 (19:36 -0800)]
nvme-tcp: don't initialize ret variable

No point in initializing ret variable to 0 in nvme_tcp_start_io_queue()
since it gets overwritten by a call to nvme_tcp_start_queue().

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvme-multipath: call bio_io_error in nvme_ns_head_submit_bio
Guoqing Jiang [Wed, 9 Mar 2022 06:02:28 +0000 (14:02 +0800)]
nvme-multipath: call bio_io_error in nvme_ns_head_submit_bio

Use bio_io_error() here since it does the same thing as the existing code.

Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvme-multipath: use vmalloc for ANA log buffer
Hannes Reinecke [Wed, 9 Mar 2022 13:29:00 +0000 (14:29 +0100)]
nvme-multipath: use vmalloc for ANA log buffer

The ANA log buffer can get really large, as it depends on the
controller configuration. So to avoid an out-of-memory issue
during scanning use kvmalloc() instead of kmalloc().

Signed-off-by: Hannes Reinecke <hare@suse.de>
Tested-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agoxen/blkfront: speed up purge_persistent_grants()
Juergen Gross [Fri, 11 Mar 2022 10:35:27 +0000 (11:35 +0100)]
xen/blkfront: speed up purge_persistent_grants()

purge_persistent_grants() scans the grants list for persistent grants
that are no longer in use by the backend. When it finds such a grant,
it marks the grant "invalid" and pushes it to the tail of the list.

Instead of pushing it directly to the end of the list, add it to a
temporary list first, which avoids scanning those entries again in the
main list traversal. After the scan has finished, append the temporary
list to the grant list.

Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Link: https://lore.kernel.org/r/20220311103527.12931-1-jgross@suse.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
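
The temporary-list approach can be sketched with a small user-space list.
This is not the blkfront code: the list helpers, struct grant and
purge_demo() are hypothetical stand-ins, and the in_use flag replaces the
real check against the backend.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct node {
	struct node *prev, *next;
};

struct grant {
	struct node link;	/* first member, so a node pointer is a grant pointer */
	bool in_use;		/* stand-in for "still used by the backend" */
	bool valid;
};

static void list_init(struct node *head)
{
	head->prev = head->next = head;
}

static void list_del_node(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

static void list_add_tail_node(struct node *n, struct node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Append all entries of 'from' to the tail of 'to'; 'from' becomes empty. */
static void list_splice_tail_nodes(struct node *from, struct node *to)
{
	if (from->next == from)
		return;
	from->next->prev = to->prev;
	to->prev->next = from->next;
	from->prev->next = to;
	to->prev = from->prev;
	list_init(from);
}

/* Invalidate unused grants; returns how many entries the scan visited. */
static int purge_demo(struct node *grants)
{
	struct node done, *pos, *next;
	int visited = 0;

	assert(grants != NULL);
	list_init(&done);
	for (pos = grants->next; pos != grants; pos = next) {
		struct grant *g = (struct grant *)pos;

		next = pos->next;	/* saved before pos may be unlinked */
		visited++;
		if (g->in_use || !g->valid)
			continue;
		g->valid = false;
		list_del_node(pos);
		/* Park on the side list so the scan does not meet it again. */
		list_add_tail_node(pos, &done);
	}
	list_splice_tail_nodes(&done, grants);
	return visited;
}
```

Each entry is visited exactly once, and the invalidated entries still end
up at the tail of the main list, as they did before the change.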
2 years agoMerge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md...
Jens Axboe [Thu, 10 Mar 2022 23:04:03 +0000 (16:04 -0700)]
Merge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-5.18/drivers

Pull MD updates from Song:

"This set contains raid5 bio handling cleanups for raid5."

* 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
  raid5: initialize the stripe_head embeeded bios as needed
  raid5-cache: statically allocate the recovery ra bio
  raid5-cache: fully initialize flush_bio when needed
  raid5-ppl: fully initialize the bio in ppl_new_iounit

2 years agoraid5: initialize the stripe_head embeeded bios as needed
Christoph Hellwig [Mon, 28 Feb 2022 11:25:03 +0000 (13:25 +0200)]
raid5: initialize the stripe_head embeeded bios as needed

Use bio_init to initialize the bios when needed to the full state
instead of a partial initialization plus later setting of dev and op
and bio_reset.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
2 years agoraid5-cache: statically allocate the recovery ra bio
Christoph Hellwig [Mon, 28 Feb 2022 11:25:02 +0000 (13:25 +0200)]
raid5-cache: statically allocate the recovery ra bio

There is no need to preallocate the bio and reset it on each use.  Just
allocate it on-stack and use a bvec placed next to the pages used for
it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
2 years agoraid5-cache: fully initialize flush_bio when needed
Christoph Hellwig [Mon, 28 Feb 2022 11:25:01 +0000 (13:25 +0200)]
raid5-cache: fully initialize flush_bio when needed

Stop using bio_reset and just initialize the bio fully when needed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
2 years agoraid5-ppl: fully initialize the bio in ppl_new_iounit
Christoph Hellwig [Mon, 28 Feb 2022 11:25:00 +0000 (13:25 +0200)]
raid5-ppl: fully initialize the bio in ppl_new_iounit

We have all the information to pass the bdev and op directly to bio_init,
so do that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
2 years agoMerge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md...
Jens Axboe [Tue, 8 Mar 2022 23:40:12 +0000 (16:40 -0700)]
Merge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-5.18/drivers

Pull MD fixes from Song:

"Most of these changes are minor fixes and clean-ups."

* 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
  md: use msleep() in md_notify_reboot()
  lib/raid6: Include <asm/ppc-opcode.h> for VPERMXOR
  lib/raid6/test/Makefile: Use $(pound) instead of \# for Make 4.3
  lib/raid6/test: fix multiple definition linking error
  md: raid1/raid10: drop pending_cnt

2 years agomd: use msleep() in md_notify_reboot()
Eric Dumazet [Thu, 3 Mar 2022 23:19:33 +0000 (15:19 -0800)]
md: use msleep() in md_notify_reboot()

Calling mdelay(1000) from process context, even while a reboot
is in progress, does not make sense.

Using msleep() allows other threads to make progress.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: linux-raid@vger.kernel.org
Signed-off-by: Song Liu <song@kernel.org>
2 years agolib/raid6: Include <asm/ppc-opcode.h> for VPERMXOR
Paul Menzel [Tue, 8 Feb 2022 15:21:50 +0000 (16:21 +0100)]
lib/raid6: Include <asm/ppc-opcode.h> for VPERMXOR

On Ubuntu 21.10 (ppc64le) building raid6test with gcc (Ubuntu
11.2.0-7ubuntu2) 11.2.0 fails with the error below.

    gcc -I.. -I ../../../include -g -O2                       \
             -I../../../arch/powerpc/include -DCONFIG_ALTIVEC \
             -c -o vpermxor1.o vpermxor1.c
    vpermxor1.c: In function ‘raid6_vpermxor1_gen_syndrome_real’:
    vpermxor1.c:64:29: error: expected string literal before ‘VPERMXOR’
       64 |   asm(VPERMXOR(%0,%1,%2,%3):"=v"(wq0):"v"(gf_high), "v"(gf_low), "v"(wq0));
          |       ^~~~~~~~
    make: *** [Makefile:58: vpermxor1.o] Error 1

So, include the header asm/ppc-opcode.h, which defines this macro, also
when building only this tool rather than the Linux kernel.

Cc: Matt Brown <matthew.brown.dev@gmail.com>
Signed-off-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Song Liu <song@kernel.org>
2 years agolib/raid6/test/Makefile: Use $(pound) instead of \# for Make 4.3
Paul Menzel [Tue, 8 Feb 2022 15:21:48 +0000 (16:21 +0100)]
lib/raid6/test/Makefile: Use $(pound) instead of \# for Make 4.3

Building raid6test on Ubuntu 21.10 (ppc64le) with GNU Make 4.3 shows the
errors below:

    $ cd lib/raid6/test/
    $ make
    <stdin>:1:1: error: stray ‘\’ in program
    <stdin>:1:2: error: stray ‘#’ in program
    <stdin>:1:11: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ \
        before ‘<’ token

    [...]

The errors come from the HAS_ALTIVEC test, which fails, so the POWER
optimized versions are not built. That’s also the reason nobody noticed on
the other architectures.

GNU Make 4.3 does not remove the backslash anymore. From the 4.3 release
announcement:

> * WARNING: Backward-incompatibility!
>   Number signs (#) appearing inside a macro reference or function invocation
>   no longer introduce comments and should not be escaped with backslashes:
>   thus a call such as:
>     foo := $(shell echo '#')
>   is legal.  Previously the number sign needed to be escaped, for example:
>     foo := $(shell echo '\#')
>   Now this latter will resolve to "\#".  If you want to write makefiles
>   portable to both versions, assign the number sign to a variable:
>     H := \#
>     foo := $(shell echo '$H')
>   This was claimed to be fixed in 3.81, but wasn't, for some reason.
>   To detect this change search for 'nocomment' in the .FEATURES variable.

So, do the same as commit 9564a8cf422d ("Kbuild: fix # escaping in .cmd
files for future Make") and commit 929bef467771 ("bpf: Use $(pound) instead
of \# in Makefiles") and define and use a $(pound) variable.

Reference for the change in make:
https://git.savannah.gnu.org/cgit/make.git/commit/?id=c6966b323811c37acedff05b57

Cc: Matt Brown <matthew.brown.dev@gmail.com>
Signed-off-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Song Liu <song@kernel.org>
2 years agolib/raid6/test: fix multiple definition linking error
Dirk Müller [Tue, 8 Feb 2022 16:50:50 +0000 (17:50 +0100)]
lib/raid6/test: fix multiple definition linking error

GCC 10+ defaults to -fno-common, which enforces proper declaration of
external references using "extern". Without this change a link would
fail with:

  lib/raid6/test/algos.c:28: multiple definition of `raid6_call';
  lib/raid6/test/test.c:22: first defined here

The pq.h header that is included already contains an extern declaration,
so we can just remove the redundant one here.

Cc: <stable@vger.kernel.org>
Signed-off-by: Dirk Müller <dmueller@suse.de>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Song Liu <song@kernel.org>
2 years agomd: raid1/raid10: drop pending_cnt
Mariusz Tkaczyk [Mon, 17 Jan 2022 11:38:47 +0000 (12:38 +0100)]
md: raid1/raid10: drop pending_cnt

Those counters are not necessary after commit 11bb45e8aaf6 ("md: drop queue
limitation for RAID1 and RAID10"). Remove them from all code (conf and
plug structs). raid1_plug_cb and raid10_plug_cb are identical, so move
definition of raid1_plug_cb to common raid1-10 definitions and use it for
RAID10 too.

Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Signed-off-by: Song Liu <song@kernel.org>
2 years agoMerge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/colyli/linux...
Jens Axboe [Sun, 6 Mar 2022 15:13:09 +0000 (08:13 -0700)]
Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/colyli/linux-bcache into for-5.18/drivers

Pull bcache updates from Coly:

"We have 2 patches for Linux v5.18, both of them are from Mingzhe Zou.
 The first patch improves bcache initialization speed by avoiding the
 unnecessary cost of cache consistency, the second one fixes a potential
 NULL pointer dereference at bcache initialization time."

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/colyli/linux-bcache:
  bcache: fixup multiple threads crash
  bcache: fixup bcache_dev_sectors_dirty_add() multithreaded CPU false sharing

2 years agobcache: fixup multiple threads crash
Mingzhe Zou [Fri, 11 Feb 2022 06:39:15 +0000 (14:39 +0800)]
bcache: fixup multiple threads crash

When multiple threads check btree nodes in parallel, the main thread
waits for all threads to stop or for the CACHE_SET_IO_DISABLE flag:

wait_event_interruptible(check_state->wait,
                         atomic_read(&check_state->started) == 0 ||
                         test_bit(CACHE_SET_IO_DISABLE, &c->flags));

However, bch_btree_node_read and bch_btree_node_read_done may
call bch_cache_set_error, which sets CACHE_SET_IO_DISABLE. Once
the flag is set, the main thread returns an error. At the same
time, some threads may still be running and dereference a NULL
pointer, which crashes the kernel.

This patch changes the wait condition so that the main thread
always waits for all threads to stop.

Fixes: 8e7102273f597 ("bcache: make bch_btree_check() to be multithreaded")
Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Coly Li <colyli@suse.de>
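
The difference between the two wait conditions can be written down as two
predicates. This is a user-space sketch based only on the description
above, not on the patch itself: struct check_demo and both function names
are hypothetical, and io_disabled stands in for the CACHE_SET_IO_DISABLE
flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct check_demo {
	int started;		/* checker threads that are still running */
	bool io_disabled;	/* stand-in for CACHE_SET_IO_DISABLE */
};

/* Old condition: the main thread may proceed while workers still run. */
static bool may_proceed_old(const struct check_demo *s)
{
	assert(s != NULL);
	return s->started == 0 || s->io_disabled;
}

/* New condition: the main thread proceeds only once all workers stopped. */
static bool may_proceed_new(const struct check_demo *s)
{
	assert(s != NULL);
	return s->started == 0;
}
```

With two workers still running and the flag set, the old predicate is true
and the new one is false, which is the window the crash falls into.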
2 years agobcache: fixup bcache_dev_sectors_dirty_add() multithreaded CPU false sharing
Mingzhe Zou [Fri, 7 Jan 2022 08:21:13 +0000 (16:21 +0800)]
bcache: fixup bcache_dev_sectors_dirty_add() multithreaded CPU false sharing

When attaching a cached device (a.k.a backing device) to a cache
device, bch_sectors_dirty_init() is called to count dirty sectors
and stripes (see what bcache_dev_sectors_dirty_add() does) on the
cache device.

When bcache_dev_sectors_dirty_add() is called, a set_bit(stripe,
d->full_dirty_stripes) or clear_bit(stripe, d->full_dirty_stripes)
operation is always performed. In full_dirty_stripes, each bit
represents stripe_size (8192) sectors of 512B, so 1 bit = 4MB (8192*512),
and each CPU cache line = 64B = 512 bits = 2048MB. When 20 threads process
a cached disk with 100G of dirty data, a single thread processes about
23M at a time, so 20 threads cover about 460M. The full_dirty_stripes
bits corresponding to that 460M of data are likely to fall in the same
CPU cache line. When one of these threads performs a set_bit or clear_bit
operation, the copies of that cache line held by the other threads' CPUs
become invalid and full_dirty_stripes must be read from main memory again.
Compared with a single thread, the time of a
bcache_dev_sectors_dirty_add() call increased by about 50 times in our
test (100G dirty data, 20 threads, bcache_dev_sectors_dirty_add() called
more than 20 million times).

This patch calls test_bit before the set_bit or clear_bit operation.
As a result, many redundant set and clear operations are avoided,
and most bcache_dev_sectors_dirty_add() calls only read the CPU
cache line.

Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Signed-off-by: Coly Li <colyli@suse.de>
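
The test-before-write pattern, together with the cache line arithmetic from
the message, can be sketched with C11 atomics. This is a user-space
illustration, not the bcache code: all demo_* names are hypothetical, and
the 64-byte cache line, 8192-sector stripe and 512-byte sector are the
figures quoted above.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Read-only check: does not take the cache line exclusively. */
static bool demo_test_bit(size_t nr, _Atomic unsigned long *map)
{
	unsigned long word;

	assert(map != NULL);
	word = atomic_load_explicit(&map[nr / DEMO_BITS_PER_WORD],
				    memory_order_relaxed);
	return (word >> (nr % DEMO_BITS_PER_WORD)) & 1UL;
}

/* Set the bit only if it is clear; returns true if a write was issued. */
static bool demo_set_bit_if_clear(size_t nr, _Atomic unsigned long *map)
{
	if (demo_test_bit(nr, map))
		return false;
	atomic_fetch_or(&map[nr / DEMO_BITS_PER_WORD],
			1UL << (nr % DEMO_BITS_PER_WORD));
	return true;
}

/* Clear the bit only if it is set; returns true if a write was issued. */
static bool demo_clear_bit_if_set(size_t nr, _Atomic unsigned long *map)
{
	if (!demo_test_bit(nr, map))
		return false;
	atomic_fetch_and(&map[nr / DEMO_BITS_PER_WORD],
			 ~(1UL << (nr % DEMO_BITS_PER_WORD)));
	return true;
}

/* MiB of data tracked by one 64-byte cache line of the bitmap. */
static unsigned long long demo_mib_per_cache_line(void)
{
	const unsigned long long bits = 64ULL * 8;		/* 512 bits */
	const unsigned long long bytes_per_bit = 8192ULL * 512;	/* 4 MiB */

	return bits * bytes_per_bit / (1024ULL * 1024ULL);
}
```

The second call for a bit that is already in the wanted state returns
without writing, so other CPUs keep their copy of the cache line.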
2 years agofloppy: use memcpy_{to,from}_bvec
Christoph Hellwig [Thu, 3 Mar 2022 11:19:05 +0000 (14:19 +0300)]
floppy: use memcpy_{to,from}_bvec

Use the helpers instead of open coding them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/20220303111905.321089-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agodrbd: use bvec_kmap_local in recv_dless_read
Christoph Hellwig [Thu, 3 Mar 2022 11:19:04 +0000 (14:19 +0300)]
drbd: use bvec_kmap_local in recv_dless_read

Using local kmaps slightly reduces the chances of stray writes, and
the bvec interface cleans up the code a little bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/20220303111905.321089-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agodrbd: use bvec_kmap_local in drbd_csum_bio
Christoph Hellwig [Thu, 3 Mar 2022 11:19:03 +0000 (14:19 +0300)]
drbd: use bvec_kmap_local in drbd_csum_bio

Using local kmaps slightly reduces the chances of stray writes, and
the bvec interface cleans up the code a little bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/20220303111905.321089-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agobcache: use bvec_kmap_local in bio_csum
Christoph Hellwig [Thu, 3 Mar 2022 11:19:02 +0000 (14:19 +0300)]
bcache: use bvec_kmap_local in bio_csum

Using local kmaps slightly reduces the chances of stray writes, and
the bvec interface cleans up the code a little bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/20220303111905.321089-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agonvdimm-btt: use bvec_kmap_local in btt_rw_integrity
Christoph Hellwig [Thu, 3 Mar 2022 11:19:01 +0000 (14:19 +0300)]
nvdimm-btt: use bvec_kmap_local in btt_rw_integrity

Using local kmaps slightly reduces the chances of stray writes, and
the bvec interface cleans up the code a little bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/20220303111905.321089-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agonvdimm-blk: use bvec_kmap_local in nd_blk_rw_integrity
Christoph Hellwig [Thu, 3 Mar 2022 11:19:00 +0000 (14:19 +0300)]
nvdimm-blk: use bvec_kmap_local in nd_blk_rw_integrity

Using local kmaps slightly reduces the chances of stray writes, and
the bvec interface cleans up the code a little bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/20220303111905.321089-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agozram: use memcpy_from_bvec in zram_bvec_write
Christoph Hellwig [Thu, 3 Mar 2022 11:18:59 +0000 (14:18 +0300)]
zram: use memcpy_from_bvec in zram_bvec_write

Use memcpy_from_bvec instead of open coding the logic.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/20220303111905.321089-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agozram: use memcpy_to_bvec in zram_bvec_read
Christoph Hellwig [Thu, 3 Mar 2022 11:18:58 +0000 (14:18 +0300)]
zram: use memcpy_to_bvec in zram_bvec_read

Use the proper helper instead of open coding the copy.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/20220303111905.321089-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoaoe: use bvec_kmap_local in bvcpy
Christoph Hellwig [Thu, 3 Mar 2022 11:18:57 +0000 (14:18 +0300)]
aoe: use bvec_kmap_local in bvcpy

Using local kmaps slightly reduces the chances of stray writes, and
the bvec interface cleans up the code a little bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220303111905.321089-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoiss-simdisk: use bvec_kmap_local in simdisk_submit_bio
Christoph Hellwig [Thu, 3 Mar 2022 11:18:56 +0000 (14:18 +0300)]
iss-simdisk: use bvec_kmap_local in simdisk_submit_bio

Using local kmaps slightly reduces the chances of stray writes, and
the bvec interface cleans up the code a little bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Max Filippov <jcmvbkbc@gmail.com>
Link: https://lore.kernel.org/r/20220303111905.321089-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoMerge tag 'nvme-5.18-2022-03-03' of git://git.infradead.org/nvme into for-5.18/drivers
Jens Axboe [Thu, 3 Mar 2022 15:51:48 +0000 (08:51 -0700)]
Merge tag 'nvme-5.18-2022-03-03' of git://git.infradead.org/nvme into for-5.18/drivers

Pull NVMe updates from Christoph:

"nvme updates for Linux 5.18

 - add vectored-io support for user-passthrough (Kanchan Joshi)
 - add verbose error logging (Alan Adamson)
 - support buffered I/O on block devices in nvmet (Chaitanya Kulkarni)
 - central discovery controller support (Martin Belanger)
 - fix and extend the globally unique identifier validation (me)
 - move away from the deprecated IDA APIs (Sagi Grimberg)
 - misc code cleanup (Keith Busch, Max Gurtovoy, Qinghua Jin,
   Chaitanya Kulkarni)"

* tag 'nvme-5.18-2022-03-03' of git://git.infradead.org/nvme: (27 commits)
  nvme: check that EUI/GUID/UUID are globally unique
  nvme: check for duplicate identifiers earlier
  nvme: fix the check for duplicate unique identifiers
  nvme: cleanup __nvme_check_ids
  nvme: remove nssa from struct nvme_ctrl
  nvme: explicitly set non-error for directives
  nvme: expose cntrltype and dctype through sysfs
  nvme: send uevent on connection up
  nvme: add vectored-io support for user-passthrough
  nvme: add verbose error logging
  nvme: add a helper to initialize connect_q
  nvme-rdma: add helpers for mapping/unmapping request
  nvmet-tcp: replace ida_simple[get|remove] with the simler ida_[alloc|free]
  nvmet-rdma: replace ida_simple[get|remove] with the simler ida_[alloc|free]
  nvmet-fc: replace ida_simple[get|remove] with the simler ida_[alloc|free]
  nvmet: replace ida_simple[get|remove] with the simler ida_[alloc|free]
  nvme-fc: replace ida_simple[get|remove] with the simler ida_[alloc|free]
  nvme: replace ida_simple[get|remove] with the simler ida_[alloc|free]
  nvmet: allow bdev in buffered_io mode
  nvmet: use i_size_read() to set size for file-ns
  ...

2 years agonvme: check that EUI/GUID/UUID are globally unique
Christoph Hellwig [Thu, 24 Feb 2022 16:48:32 +0000 (17:48 +0100)]
nvme: check that EUI/GUID/UUID are globally unique

Add a check to verify that the unique identifiers are unique globally
in addition to the existing check that verifies that they are unique
inside a single subsystem.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2 years agonvme: check for duplicate identifiers earlier
Christoph Hellwig [Thu, 24 Feb 2022 16:46:50 +0000 (17:46 +0100)]
nvme: check for duplicate identifiers earlier

Lift the check for duplicate identifiers into nvme_init_ns_head, which
avoids pointless error unwinding in case they don't match, and also
matches where we check identifier validity for the multipath case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2 years agonvme: fix the check for duplicate unique identifiers
Christoph Hellwig [Thu, 24 Feb 2022 10:32:58 +0000 (11:32 +0100)]
nvme: fix the check for duplicate unique identifiers

nvme_subsys_check_duplicate_ids needs to return an error if any of
the identifiers matches, not just if all of them match.  But it does not
need to and should not look at the CSI value for this sanity check.

Rewrite the logic to be separate from nvme_ns_ids_equal and optimize it
by reducing duplicate checks for non-present identifiers.

Fixes: ed754e5deeb1 ("nvme: track shared namespaces")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
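
The "any identifier matches" rule can be sketched as below. This is a
user-space illustration under stated assumptions, not the driver code:
struct demo_ns_ids and the helper names are hypothetical, and an all-zero
identifier is assumed to mean "not reported", so it is skipped rather than
compared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a namespace's set of unique identifiers. */
struct demo_ns_ids {
	uint8_t eui64[8];
	uint8_t nguid[16];
	uint8_t uuid[16];
};

/* Assumption: an all-zero identifier was not reported by the device. */
static bool demo_id_present(const uint8_t *id, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if (id[i])
			return true;
	return false;
}

/* True if ANY identifier present in 'a' is identical in 'b'. */
static bool demo_ids_conflict(const struct demo_ns_ids *a,
			      const struct demo_ns_ids *b)
{
	assert(a != NULL && b != NULL);
	if (demo_id_present(a->uuid, sizeof(a->uuid)) &&
	    memcmp(a->uuid, b->uuid, sizeof(a->uuid)) == 0)
		return true;
	if (demo_id_present(a->nguid, sizeof(a->nguid)) &&
	    memcmp(a->nguid, b->nguid, sizeof(a->nguid)) == 0)
		return true;
	if (demo_id_present(a->eui64, sizeof(a->eui64)) &&
	    memcmp(a->eui64, b->eui64, sizeof(a->eui64)) == 0)
		return true;
	return false;
}
```

Two namespaces that share only an EUI-64 are reported as conflicting, while
two namespaces that report no identifiers at all are not.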
2 years agonvme: cleanup __nvme_check_ids
Christoph Hellwig [Thu, 24 Feb 2022 09:57:15 +0000 (10:57 +0100)]
nvme: cleanup __nvme_check_ids

Pass the actual nvme_ns_ids used for the comparison instead of the
ns_head that isn't needed and use a more descriptive function name.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2 years agonvme: remove nssa from struct nvme_ctrl
Keith Busch [Tue, 15 Feb 2022 15:37:19 +0000 (07:37 -0800)]
nvme: remove nssa from struct nvme_ctrl

The reported number of streams is not used outside the function that
gets it, so no need to stash it in the controller structure. Use a local
variable instead.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvme: explicitly set non-error for directives
Keith Busch [Tue, 15 Feb 2022 15:03:08 +0000 (07:03 -0800)]
nvme: explicitly set non-error for directives

Stream directives are an optional feature. It is not an error if a
controller doesn't support as many as the kernel can optionally use.
Explicitly set the non-error return value on this condition with a
comment explaining why.

Note, the return value was already 0 in this condition, so the setting
is redundant. This patch should just silence bots that falsely believe
the condition contains an error omission.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvme: expose cntrltype and dctype through sysfs
Martin Belanger [Tue, 8 Feb 2022 19:33:46 +0000 (14:33 -0500)]
nvme: expose cntrltype and dctype through sysfs

TP8010 introduces the Discovery Controller Type attribute (dctype).
The dctype is returned in the response to the Identify command. This
patch exposes the dctype through the sysfs. Since the dctype depends on
the Controller Type (cntrltype), another attribute of the Identify
response, the patch also exposes the cntrltype as well. The dctype will
only be displayed for discovery controllers.

A note about the naming of this attribute:
Although TP8010 calls this attribute the Discovery Controller Type,
note that the dctype is now part of the response to the Identify
command for all controller types. I/O, Discovery, and Admin controllers
all share the same Identify response PDU structure. Non-discovery
controllers as well as pre-TP8010 discovery controllers will continue
to set this field to 0 (which has always been the default for reserved
bytes). Per TP8010, the value 0 now means "Discovery controller type is
not reported" instead of "Reserved". One could argue that this
definition is correct even for non-discovery controllers, and by
extension, exposing it in the sysfs for non-discovery controllers is
appropriate.

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: John Meneghini <jmeneghi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvme: send uevent on connection up
Martin Belanger [Tue, 8 Feb 2022 19:33:45 +0000 (14:33 -0500)]
nvme: send uevent on connection up

When connectivity with a controller is lost, the driver will keep
trying to reconnect once every 10 sec. When connection is restored,
user-space apps need to be informed so that they can take proper
action. For example, TP8010 introduces the DIM PDU, which is used to
register with a discovery controller (DC). The DIM PDU is sent from
user-space.  The DIM PDU must be sent every time a connection is
established with a DC. Therefore, the kernel must tell user-space apps
when connection is restored so that registration can happen.

The uevent sent is a "change" uevent with environmental data
set to: "NVME_EVENT=connected".
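
A user-space listener might check a received uevent for this variable
roughly as follows. This is a sketch under the assumption that the
payload is a sequence of NUL-terminated "KEY=value" strings packed
back to back; uevent_has() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Scan buf[0..len) for an exact match of `wanted`, where the buffer
 * holds NUL-terminated "KEY=value" strings packed back to back.
 */
static bool uevent_has(const char *buf, size_t len, const char *wanted)
{
	size_t wlen = strlen(wanted);
	size_t pos = 0;

	while (pos < len) {
		const char *end = memchr(buf + pos, '\0', len - pos);
		size_t n = end ? (size_t)(end - (buf + pos)) : len - pos;

		if (n == wlen && memcmp(buf + pos, wanted, n) == 0)
			return true;
		pos += n + 1;	/* skip the string and its terminator */
	}
	return false;
}
```

An application would then trigger its registration step when
uevent_has(buf, len, "NVME_EVENT=connected") returns true.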

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: John Meneghini <jmeneghi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvme: add vectored-io support for user-passthrough
Kanchan Joshi [Thu, 10 Feb 2022 05:37:55 +0000 (11:07 +0530)]
nvme: add vectored-io support for user-passthrough

Add a new NVME_IOCTL_IO64_CMD_VEC ioctl that works like the existing
NVME_IOCTL_IO64_CMD ioctl except that it takes an array of iovecs
and thus supports vectored I/O.

  - cmd.addr is base address of user iovec array
  - cmd.vec_cnt is count of iovec array elements

This patch does not include vectored-variant for admin-commands as most
of them are light on buffers and likely to have low invocation frequency.
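
The two fields listed above could be filled in along these lines. The
struct below is a hypothetical stand-in that carries only addr and
vec_cnt; the real passthrough command structure has more members and
is not reproduced here.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/uio.h>

/*
 * Hypothetical stand-in holding only the two fields the commit
 * message describes; not the real command structure.
 */
struct demo_vec_cmd {
	uint64_t addr;		/* base address of the user iovec array */
	uint32_t vec_cnt;	/* number of elements in that array */
};

/* Point the command at an iovec array instead of one flat buffer. */
static void demo_set_iovecs(struct demo_vec_cmd *cmd,
			    const struct iovec *iov, uint32_t cnt)
{
	cmd->addr = (uint64_t)(uintptr_t)iov;
	cmd->vec_cnt = cnt;
}
```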

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvme: add verbose error logging
Alan Adamson [Thu, 3 Feb 2022 08:11:53 +0000 (00:11 -0800)]
nvme: add verbose error logging

Improve logging of NVMe errors. If NVME_VERBOSE_ERRORS is configured,
a verbose description of the error is logged; otherwise only the status
codes/bits are logged.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
[kch]: fix several nits, cosmetics, and trim down code.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Alan Adamson <alan.adamson@oracle.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvme: add a helper to initialize connect_q
Chaitanya Kulkarni [Thu, 10 Feb 2022 19:12:36 +0000 (11:12 -0800)]
nvme: add a helper to initialize connect_q

Add and use helper to remove duplicate code for fabrics connect_q
initialization and error handling for all the transports.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvme-rdma: add helpers for mapping/unmapping request
Max Gurtovoy [Wed, 9 Feb 2022 08:54:49 +0000 (10:54 +0200)]
nvme-rdma: add helpers for mapping/unmapping request

Introduce nvme_rdma_dma_map_req/nvme_rdma_dma_unmap_req helper functions
to improve code readability and simplify the error flow.

Reviewed-by: Israel Rukshin <israelr@nvidia.com>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvmet-tcp: replace ida_simple[get|remove] with the simler ida_[alloc|free]
Sagi Grimberg [Mon, 14 Feb 2022 09:07:32 +0000 (11:07 +0200)]
nvmet-tcp: replace ida_simple[get|remove] with the simler ida_[alloc|free]

ida_simple_[get|remove] are wrappers anyways.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvmet-rdma: replace ida_simple[get|remove] with the simler ida_[alloc|free]
Sagi Grimberg [Mon, 14 Feb 2022 09:07:31 +0000 (11:07 +0200)]
nvmet-rdma: replace ida_simple[get|remove] with the simler ida_[alloc|free]

ida_simple_[get|remove] are wrappers anyways.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvmet-fc: replace ida_simple[get|remove] with the simler ida_[alloc|free]
Sagi Grimberg [Mon, 14 Feb 2022 09:07:30 +0000 (11:07 +0200)]
nvmet-fc: replace ida_simple[get|remove] with the simler ida_[alloc|free]

ida_simple_[get|remove] are wrappers anyways.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvmet: replace ida_simple[get|remove] with the simler ida_[alloc|free]
Sagi Grimberg [Mon, 14 Feb 2022 09:07:29 +0000 (11:07 +0200)]
nvmet: replace ida_simple[get|remove] with the simler ida_[alloc|free]

ida_simple_[get|remove] are wrappers anyways.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvme-fc: replace ida_simple[get|remove] with the simler ida_[alloc|free]
Sagi Grimberg [Mon, 14 Feb 2022 09:07:28 +0000 (11:07 +0200)]
nvme-fc: replace ida_simple[get|remove] with the simler ida_[alloc|free]

ida_simple_[get|remove] are wrappers anyways.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvme: replace ida_simple[get|remove] with the simler ida_[alloc|free]
Sagi Grimberg [Mon, 14 Feb 2022 09:07:27 +0000 (11:07 +0200)]
nvme: replace ida_simple[get|remove] with the simler ida_[alloc|free]

ida_simple_[get|remove] are wrappers anyways.

Also, use ida_alloc_min with the ns_ida as namespace
enumeration starts with 1.
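
The effect of allocating with a minimum can be shown with a toy
user-space allocator. This is only a model of the idea (lowest free ID
at or above a floor); it is not the kernel IDA and all names are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_IDS 64

/* Toy user-space model of an ID allocator; not the kernel IDA. */
struct toy_ida {
	bool used[TOY_IDS];
};

/* Hand out the lowest free ID that is >= min, or -1 if none is left. */
static int toy_alloc_min(struct toy_ida *ida, int min)
{
	for (int id = min; id < TOY_IDS; id++) {
		if (!ida->used[id]) {
			ida->used[id] = true;
			return id;
		}
	}
	return -1;
}

/* Return an ID to the pool so that it can be handed out again. */
static void toy_free(struct toy_ida *ida, int id)
{
	ida->used[id] = false;
}
```

With min set to 1, ID 0 is never handed out, which matches an
enumeration that starts with 1.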

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvmet: allow bdev in buffered_io mode
Chaitanya Kulkarni [Wed, 2 Feb 2022 09:04:45 +0000 (01:04 -0800)]
nvmet: allow bdev in buffered_io mode

Allow a block device to be configured in buffered I/O mode by using
the file backend. This lets the block device namespace use the cache,
which shows a significant performance improvement.

Update the block device ns enable function to return early when the
buffered_io flag is set.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvmet: use i_size_read() to set size for file-ns
Chaitanya Kulkarni [Wed, 2 Feb 2022 09:04:44 +0000 (01:04 -0800)]
nvmet: use i_size_read() to set size for file-ns

Instead of calling vfs_getattr(), use i_size_read() to read the size,
so that a single call covers not only the file type but also the block
type. This is needed to implement buffered_io support for the
NVMeOF block device backend.

We also change the return type of nvmet_file_ns_revalidate() from
int to void, since this function does not return any meaningful value.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvme-fabrics: remove unnecessary braces for case
Chaitanya Kulkarni [Wed, 12 Jan 2022 06:20:58 +0000 (22:20 -0800)]
nvme-fabrics: remove unnecessary braces for case

Braces are not required for the enum value NVME_SC_CONNECT_INVALID_PARAM
when it is used in the switch-case statement, so remove them.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvme-fabrics: use consistent zeroout pattern
Chaitanya Kulkarni [Wed, 12 Jan 2022 06:20:57 +0000 (22:20 -0800)]
nvme-fabrics: use consistent zeroout pattern

Remove the memset() call that zeroes the local variable cmd and zero it
at the time of declaration in nvmf_reg_read32() instead, similar to what
we have done in nvmf_reg_read64(), nvmf_reg_write32(),
nvmf_connect_admin_queue(), and nvmf_connect_io_queue().
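
The two patterns look roughly like this. The struct is a hypothetical
stand-in, and the sketch uses the portable "= { 0 }" initializer as an
assumption about how the zeroing is spelled.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical command layout, used only to illustrate the pattern. */
struct demo_cmd {
	unsigned char opcode;
	unsigned int offset;
};

/* Old pattern: declare first, then clear with a separate memset(). */
static struct demo_cmd make_cmd_memset(unsigned char opcode)
{
	struct demo_cmd cmd;

	memset(&cmd, 0, sizeof(cmd));
	cmd.opcode = opcode;
	return cmd;
}

/* New pattern: zero every member at the point of declaration. */
static struct demo_cmd make_cmd_init(unsigned char opcode)
{
	struct demo_cmd cmd = { 0 };

	cmd.opcode = opcode;
	return cmd;
}
```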

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvme-fabrics: use unsigned int type
Chaitanya Kulkarni [Wed, 12 Jan 2022 06:21:00 +0000 (22:21 -0800)]
nvme-fabrics: use unsigned int type

Loop variable i will never have a negative value, so use
unsigned int type instead of int.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvme-fabrics: use unsigned int type
Chaitanya Kulkarni [Wed, 12 Jan 2022 06:20:59 +0000 (22:20 -0800)]
nvme-fabrics: use unsigned int type

Loop variable i will never have a negative value, so use
unsigned int type instead of int.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvme-core: remove unnecessary function parameter
Chaitanya Kulkarni [Sat, 22 Jan 2022 05:05:39 +0000 (21:05 -0800)]
nvme-core: remove unnecessary function parameter

In function nvme_execute_rq() we don't use the gendisk parameter at all.
Remove the unused parameter and adjust the calls.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvme-core: remove unnecessary semicolon
Chaitanya Kulkarni [Wed, 19 Jan 2022 07:49:54 +0000 (23:49 -0800)]
nvme-core: remove unnecessary semicolon

It is not a good practice to have a semicolon at the end of the
function definition. Remove it from nvme_pr_type().

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvme-fc: fix a typo
Qinghua Jin [Fri, 7 Jan 2022 02:22:58 +0000 (10:22 +0800)]
nvme-fc: fix a typo

subsytem -> subsystem

Signed-off-by: Qinghua Jin <qhjin.dev@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonull_blk: null_alloc_page() cleanup
Chaitanya Kulkarni [Tue, 22 Feb 2022 15:28:52 +0000 (07:28 -0800)]
null_blk: null_alloc_page() cleanup

Remove the goto labels and use direct returns. The error unwinding code
only needs to free the t_page variable if the alloc_pages() call fails,
so having two labels for one kfree() can be avoided easily.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220222152852.26043-3-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agonull_blk: remove hardcoded null_alloc_page() param
Chaitanya Kulkarni [Tue, 22 Feb 2022 15:28:51 +0000 (07:28 -0800)]
null_blk: remove hardcoded null_alloc_page() param

The only caller of null_alloc_page() is null_insert_page(), which
unconditionally sets the only parameter to GFP_NOIO, a value that is
statically hard-coded in null_blk. There is no point in having a
statically hard-coded function parameter.

Remove the unnecessary parameter gfp_flags and adjust the code so that
null_alloc_page() retains its existing behavior with GFP_NOIO.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220222152852.26043-2-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agonull_blk: remove hardcoded alloc_cmd() parameter
Chaitanya Kulkarni [Wed, 16 Feb 2022 17:29:45 +0000 (09:29 -0800)]
null_blk: remove hardcoded alloc_cmd() parameter

The only caller of alloc_cmd() is null_submit_bio(), which
unconditionally sets the second parameter to true, a value that is
statically hard-coded in null_blk. There is no point in having a
statically hard-coded function parameter.

Remove the unnecessary parameter can_wait and adjust the code so it
can retain the existing behavior of waiting when we don't get a valid
nullb_cmd from __alloc_cmd() in alloc_cmd().

The restructured code avoids multiple return statements and multiple
calls to __alloc_cmd(), and results in a fast path call to
prepare_to_wait() due to the removal of the first alloc_cmd() call.

Follow the pattern that we have in bio_alloc() to set the structure
members in the structure allocation function in alloc_cmd() and pass
bio to initialize newly allocated cmd->bio member.

Follow the pattern in copy_to_nullb(), where the result of one function
call (null_cache_active()) is used as a parameter to another function
call (null_insert_page()): use the result of alloc_cmd() as the first
parameter to null_handle_cmd() in null_submit_bio(). This allows us to
remove the on-stack local variable cmd in null_submit_bio(), which is in
the fast path.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Link: https://lore.kernel.org/r/20220216172945.31124-2-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoloop: allow user to set the queue depth
Chaitanya Kulkarni [Tue, 15 Feb 2022 21:33:10 +0000 (13:33 -0800)]
loop: allow user to set the queue depth

Instead of hardcoding the queue depth, allow the user to set the hw
queue depth using a module parameter. Set the default value to 128 to
retain the existing behavior.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Link: https://lore.kernel.org/r/20220215213310.7264-5-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoloop: remove extra variable in lo_req_flush
Chaitanya Kulkarni [Tue, 15 Feb 2022 21:33:09 +0000 (13:33 -0800)]
loop: remove extra variable in lo_req_flush

The local variable file is used to pass it to the vfs_fsync(). We can
get away with using lo->lo_backing_file instead of storing in a local
variable which is not used anywhere else.

No functional change in this patch.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Link: https://lore.kernel.org/r/20220215213310.7264-4-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoloop: remove extra variable in lo_fallocate()
Chaitanya Kulkarni [Tue, 15 Feb 2022 21:33:08 +0000 (13:33 -0800)]
loop: remove extra variable in lo_fallocate()

The local variable q is used to pass it to the blk_queue_discard(). We
can get away with using lo->lo_queue instead of storing in a local
variable which is not used anywhere else.

No functional change in this patch.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Link: https://lore.kernel.org/r/20220215213310.7264-3-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoloop: use sysfs_emit() in the sysfs xxx show()
Chaitanya Kulkarni [Tue, 15 Feb 2022 21:33:07 +0000 (13:33 -0800)]
loop: use sysfs_emit() in the sysfs xxx show()

sprintf does not know the PAGE_SIZE maximum of the temporary buffer
used for outputting sysfs content and it's possible to overrun the
PAGE_SIZE buffer length.

Use the generic sysfs_emit() function, which knows the size of the
temporary buffer and ensures that no overrun occurs, in the
loop_attr_[offset|sizelimit|autoclear|partscan|dio]_show() callbacks.
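
The difference from a plain sprintf can be modelled in user space with
a size-aware formatter. This is an analogue for illustration, not the
kernel helper: demo_emit() and DEMO_PAGE_SIZE are hypothetical names,
and the 4096-byte page size is an assumption.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

#define DEMO_PAGE_SIZE 4096	/* assumed page size for this sketch */

/*
 * Format into a buffer known to be DEMO_PAGE_SIZE bytes and never
 * write past its end. Returns the number of characters stored,
 * excluding the terminator.
 */
static int demo_emit(char *buf, const char *fmt, ...)
{
	va_list args;
	int len;

	va_start(args, fmt);
	len = vsnprintf(buf, DEMO_PAGE_SIZE, fmt, args);
	va_end(args);

	if (len < 0)
		return 0;
	if (len >= DEMO_PAGE_SIZE)
		len = DEMO_PAGE_SIZE - 1;	/* output was truncated */
	return len;
}
```

A show-style callback would return demo_emit(buf, "%llu\n", value)
directly, so the caller never sees a length larger than the buffer.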

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Link: https://lore.kernel.org/r/20220215213310.7264-2-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agonull_blk: fix return value from null_add_dev()
Chaitanya Kulkarni [Tue, 15 Feb 2022 11:59:51 +0000 (03:59 -0800)]
null_blk: fix return value from null_add_dev()

The function nullb_device_power_store() returns -ENOMEM when
null_add_dev() fails. null_add_dev() can fail with return value
other than -ENOMEM such as -EINVAL when Zoned Block Device option
is used, see :

nullb_device_power_store()
 null_add_dev()
  null_init_zoned_dev()
   return -EINVAL;

When trying to load the module, having the -ENOMEM value returned on
the command line creates confusion when plenty of memory is free on the
machine.

Instead of hardcoding -ENOMEM return the value of null_add_dev()
function.
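
The change in error handling can be sketched with stand-in functions.
The names and the failure conditions below are hypothetical; only the
before/after shape of the return path is taken from the description.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a device-add step that can fail two ways. */
static int demo_add_dev(int zoned_config_ok, int have_memory)
{
	if (!have_memory)
		return -ENOMEM;
	if (!zoned_config_ok)
		return -EINVAL;
	return 0;
}

/* Before: any failure is reported as -ENOMEM. */
static int demo_power_store_old(int zoned_config_ok, int have_memory)
{
	if (demo_add_dev(zoned_config_ok, have_memory))
		return -ENOMEM;
	return 0;
}

/* After: the callee's own error code reaches the caller. */
static int demo_power_store_new(int zoned_config_ok, int have_memory)
{
	int ret = demo_add_dev(zoned_config_ok, have_memory);

	if (ret)
		return ret;
	return 0;
}
```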

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220215115951.15945-1-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoloop: clean up grammar in warning message
Colin Ian King [Tue, 8 Feb 2022 11:46:56 +0000 (11:46 +0000)]
loop: clean up grammar in warning message

The phrase "has still" should be "still has" to clean up the grammar.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://lore.kernel.org/r/20220208114656.61629-1-colin.i.king@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock/rnbd: Remove a useless mutex
Christophe JAILLET [Mon, 7 Feb 2022 20:48:19 +0000 (21:48 +0100)]
block/rnbd: Remove a useless mutex

According to lib/idr.c,
   The IDA handles its own locking.  It is safe to call any of the IDA
   functions without synchronisation in your code.

so the 'ida_lock' mutex can just be removed.
It is here only to protect some ida_simple_get()/ida_simple_remove() calls.

While at it, switch to ida_alloc_XXX()/ida_free() instead of
ida_simple_get()/ida_simple_remove().
The latter is deprecated and more verbose.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Link: https://lore.kernel.org/r/7f9eccd8b1fce1bac45ac9b01a78cf72f54c0a61.1644266862.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock/rnbd: client device does not care queue/rotational
Gioh Kim [Fri, 14 Jan 2022 15:58:55 +0000 (16:58 +0100)]
block/rnbd: client device does not care queue/rotational

On the client side, the device is a network device. There is no reason
to set rotational even if the target device on the server is rotational.

Signed-off-by: Gioh Kim <gi-oh.kim@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Link: https://lore.kernel.org/r/20220114155855.984144-3-haris.iqbal@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock/rnbd-clt: fix CHECK:BRACES warning
Gioh Kim [Fri, 14 Jan 2022 15:58:54 +0000 (16:58 +0100)]
block/rnbd-clt: fix CHECK:BRACES warning

This patch fixes the "CHECK:BRACES: braces {} should be used on all
arms of this statement" warning from checkpatch.

Signed-off-by: Gioh Kim <gi-oh.kim@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Link: https://lore.kernel.org/r/20220114155855.984144-2-haris.iqbal@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: default BLOCK_LEGACY_AUTOLOAD to y
Christoph Hellwig [Fri, 25 Feb 2022 18:14:40 +0000 (19:14 +0100)]
block: default BLOCK_LEGACY_AUTOLOAD to y

As Luis reported, losetup currently doesn't properly create the loop
device without this if the device node already exists because old
scripts created it manually.  So default to y for now and remove the
aggressive removal schedule.

Reported-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220225181440.1351591-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: update io_ticks when io hang
Zhang Wensheng [Thu, 17 Feb 2022 06:42:47 +0000 (14:42 +0800)]
block: update io_ticks when io hang

When the inflight IOs are slow and no new IOs are issued, we expect
iostat to show the IO hang problem. However, after
commit 5b18b5a73760 ("block: delete part_round_stats and switch to less
precise counting"), io_ticks and time_in_queue are not updated until
the end of IO, and the avgqu-sz and %util columns of iostat will be zero.

Because time_in_queue is expressed by accumulating stat.nsecs, it is not
suitable to change, and %util may express the status better when an IO
hang occurs. To fix io_ticks, we use update_io_ticks and inflight to
update io_ticks when diskstats_show and part_stat_show are called.

Fixes: 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting")
Signed-off-by: Zhang Wensheng <zhangwensheng5@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220217064247.4041435-1-zhangwensheng5@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock, bfq: don't move oom_bfqq
Yu Kuai [Sat, 29 Jan 2022 01:59:24 +0000 (09:59 +0800)]
block, bfq: don't move oom_bfqq

Our test reports a UAF:

[ 2073.019181] ==================================================================
[ 2073.019188] BUG: KASAN: use-after-free in __bfq_put_async_bfqq+0xa0/0x168
[ 2073.019191] Write of size 8 at addr ffff8000ccf64128 by task rmmod/72584
[ 2073.019192]
[ 2073.019196] CPU: 0 PID: 72584 Comm: rmmod Kdump: loaded Not tainted 4.19.90-yk #5
[ 2073.019198] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[ 2073.019200] Call trace:
[ 2073.019203]  dump_backtrace+0x0/0x310
[ 2073.019206]  show_stack+0x28/0x38
[ 2073.019210]  dump_stack+0xec/0x15c
[ 2073.019216]  print_address_description+0x68/0x2d0
[ 2073.019220]  kasan_report+0x238/0x2f0
[ 2073.019224]  __asan_store8+0x88/0xb0
[ 2073.019229]  __bfq_put_async_bfqq+0xa0/0x168
[ 2073.019233]  bfq_put_async_queues+0xbc/0x208
[ 2073.019236]  bfq_pd_offline+0x178/0x238
[ 2073.019240]  blkcg_deactivate_policy+0x1f0/0x420
[ 2073.019244]  bfq_exit_queue+0x128/0x178
[ 2073.019249]  blk_mq_exit_sched+0x12c/0x160
[ 2073.019252]  elevator_exit+0xc8/0xd0
[ 2073.019256]  blk_exit_queue+0x50/0x88
[ 2073.019259]  blk_cleanup_queue+0x228/0x3d8
[ 2073.019267]  null_del_dev+0xfc/0x1e0 [null_blk]
[ 2073.019274]  null_exit+0x90/0x114 [null_blk]
[ 2073.019278]  __arm64_sys_delete_module+0x358/0x5a0
[ 2073.019282]  el0_svc_common+0xc8/0x320
[ 2073.019287]  el0_svc_handler+0xf8/0x160
[ 2073.019290]  el0_svc+0x10/0x218
[ 2073.019291]
[ 2073.019294] Allocated by task 14163:
[ 2073.019301]  kasan_kmalloc+0xe0/0x190
[ 2073.019305]  kmem_cache_alloc_node_trace+0x1cc/0x418
[ 2073.019308]  bfq_pd_alloc+0x54/0x118
[ 2073.019313]  blkcg_activate_policy+0x250/0x460
[ 2073.019317]  bfq_create_group_hierarchy+0x38/0x110
[ 2073.019321]  bfq_init_queue+0x6d0/0x948
[ 2073.019325]  blk_mq_init_sched+0x1d8/0x390
[ 2073.019330]  elevator_switch_mq+0x88/0x170
[ 2073.019334]  elevator_switch+0x140/0x270
[ 2073.019338]  elv_iosched_store+0x1a4/0x2a0
[ 2073.019342]  queue_attr_store+0x90/0xe0
[ 2073.019348]  sysfs_kf_write+0xa8/0xe8
[ 2073.019351]  kernfs_fop_write+0x1f8/0x378
[ 2073.019359]  __vfs_write+0xe0/0x360
[ 2073.019363]  vfs_write+0xf0/0x270
[ 2073.019367]  ksys_write+0xdc/0x1b8
[ 2073.019371]  __arm64_sys_write+0x50/0x60
[ 2073.019375]  el0_svc_common+0xc8/0x320
[ 2073.019380]  el0_svc_handler+0xf8/0x160
[ 2073.019383]  el0_svc+0x10/0x218
[ 2073.019385]
[ 2073.019387] Freed by task 72584:
[ 2073.019391]  __kasan_slab_free+0x120/0x228
[ 2073.019394]  kasan_slab_free+0x10/0x18
[ 2073.019397]  kfree+0x94/0x368
[ 2073.019400]  bfqg_put+0x64/0xb0
[ 2073.019404]  bfqg_and_blkg_put+0x90/0xb0
[ 2073.019408]  bfq_put_queue+0x220/0x228
[ 2073.019413]  __bfq_put_async_bfqq+0x98/0x168
[ 2073.019416]  bfq_put_async_queues+0xbc/0x208
[ 2073.019420]  bfq_pd_offline+0x178/0x238
[ 2073.019424]  blkcg_deactivate_policy+0x1f0/0x420
[ 2073.019429]  bfq_exit_queue+0x128/0x178
[ 2073.019433]  blk_mq_exit_sched+0x12c/0x160
[ 2073.019437]  elevator_exit+0xc8/0xd0
[ 2073.019440]  blk_exit_queue+0x50/0x88
[ 2073.019443]  blk_cleanup_queue+0x228/0x3d8
[ 2073.019451]  null_del_dev+0xfc/0x1e0 [null_blk]
[ 2073.019459]  null_exit+0x90/0x114 [null_blk]
[ 2073.019462]  __arm64_sys_delete_module+0x358/0x5a0
[ 2073.019467]  el0_svc_common+0xc8/0x320
[ 2073.019471]  el0_svc_handler+0xf8/0x160
[ 2073.019474]  el0_svc+0x10/0x218
[ 2073.019475]
[ 2073.019479] The buggy address belongs to the object at ffff8000ccf63f00
 which belongs to the cache kmalloc-1024 of size 1024
[ 2073.019484] The buggy address is located 552 bytes inside of
 1024-byte region [ffff8000ccf63f00, ffff8000ccf64300)
[ 2073.019486] The buggy address belongs to the page:
[ 2073.019492] page:ffff7e000333d800 count:1 mapcount:0 mapping:ffff8000c0003a00 index:0x0 compound_mapcount: 0
[ 2073.020123] flags: 0x7ffff0000008100(slab|head)
[ 2073.020403] raw: 07ffff0000008100 ffff7e0003334c08 ffff7e00001f5a08 ffff8000c0003a00
[ 2073.020409] raw: 0000000000000000 00000000001c001c 00000001ffffffff 0000000000000000
[ 2073.020411] page dumped because: kasan: bad access detected
[ 2073.020412]
[ 2073.020414] Memory state around the buggy address:
[ 2073.020420]  ffff8000ccf64000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 2073.020424]  ffff8000ccf64080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 2073.020428] >ffff8000ccf64100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 2073.020430]                                   ^
[ 2073.020434]  ffff8000ccf64180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 2073.020438]  ffff8000ccf64200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 2073.020439] ==================================================================

The same problem exists in mainline as well.

This is because oom_bfqq is moved to a non-root group, thus root_group
is freed earlier.

Fix the problem by not moving oom_bfqq.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20220129015924.3958918-4-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock, bfq: avoid moving bfqq to it's parent bfqg
Yu Kuai [Sat, 29 Jan 2022 01:59:23 +0000 (09:59 +0800)]
block, bfq: avoid moving bfqq to it's parent bfqg

Moving bfqq to its parent bfqg is pointless.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20220129015924.3958918-3-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock, bfq: cleanup bfq_bfqq_to_bfqg()
Yu Kuai [Sat, 29 Jan 2022 01:59:22 +0000 (09:59 +0800)]
block, bfq: cleanup bfq_bfqq_to_bfqg()

Use bfq_group() instead, which does the same thing.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20220129015924.3958918-2-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock/bfq_wf2q: correct weight to ioprio
Yahu Gao [Fri, 7 Jan 2022 06:58:59 +0000 (14:58 +0800)]
block/bfq_wf2q: correct weight to ioprio

The return value is ioprio * BFQ_WEIGHT_CONVERSION_COEFF or 0.
What we want is ioprio or 0.
Correct this by changing the calculation.
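
The arithmetic can be worked through with a small model. The formulas
and the constants (8 levels, coefficient 10) are assumptions chosen to
reproduce the behavior described above; they are not copied from the
patch.

```c
#include <assert.h>

#define DEMO_NR_LEVELS	8	/* assumed number of ioprio levels */
#define DEMO_COEFF	10	/* assumed weight conversion coefficient */

/* Forward mapping: a lower ioprio value means a larger weight. */
static int demo_ioprio_to_weight(int ioprio)
{
	return (DEMO_NR_LEVELS - ioprio) * DEMO_COEFF;
}

/* Buggy inverse: the result is still scaled by DEMO_COEFF. */
static int demo_weight_to_ioprio_old(int weight)
{
	int v = DEMO_NR_LEVELS * DEMO_COEFF - weight;

	return v > 0 ? v : 0;
}

/* Corrected inverse: scale the weight back down before subtracting. */
static int demo_weight_to_ioprio_new(int weight)
{
	int v = DEMO_NR_LEVELS - weight / DEMO_COEFF;

	return v > 0 ? v : 0;
}
```

Under these assumptions, ioprio 3 maps to weight 50; the old inverse
turns 50 into 30 (ioprio times the coefficient), while the corrected
one returns 3.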

Signed-off-by: Yahu Gao <gaoyahu19@gmail.com>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20220107065859.25689-1-gaoyahu19@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblk-mq: avoid extending delays of active hctx from blk_mq_delay_run_hw_queues
David Jeffery [Mon, 31 Jan 2022 20:33:37 +0000 (15:33 -0500)]
blk-mq: avoid extending delays of active hctx from blk_mq_delay_run_hw_queues

When blk_mq_delay_run_hw_queues sets an hctx to run in the future, it can
reset the delay length for an already pending delayed work run_work. This
creates a scenario where multiple hctx may have their queues set to run,
but if one runs first and finds nothing to do, it can reset the delay of
another hctx and stall the other hctx's ability to run requests.

To avoid this I/O stall when an hctx's run_work is already pending,
leave it untouched to run at its current designated time rather than
extending its delay. The work will still run, which keeps closed the race
that calling blk_mq_delay_run_hw_queues is needed for, while also
avoiding the I/O stall.
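
The rule can be modelled with a toy structure. This is not kernel
code: struct demo_hctx and both functions are hypothetical, and time
is just an integer.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one hardware queue's delayed run; not kernel code. */
struct demo_hctx {
	bool pending;		/* a delayed run is already scheduled */
	unsigned long run_at;	/* time at which that run will fire */
};

/* Old behavior: always re-arm, pushing back an already pending run. */
static void demo_delay_run_old(struct demo_hctx *h, unsigned long now,
			       unsigned long delay)
{
	h->pending = true;
	h->run_at = now + delay;
}

/* New behavior: leave a pending run at its current designated time. */
static void demo_delay_run_new(struct demo_hctx *h, unsigned long now,
			       unsigned long delay)
{
	if (h->pending)
		return;
	h->pending = true;
	h->run_at = now + delay;
}
```

With a run pending for time 100, a request at time 90 with delay 50
moves it to 140 in the old model and leaves it at 100 in the new one.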

Signed-off-by: David Jeffery <djeffery@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220131203337.GA17666@redhat
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agovirtio_blk: simplify refcounting
Christoph Hellwig [Tue, 15 Feb 2022 09:45:14 +0000 (10:45 +0100)]
virtio_blk: simplify refcounting

Implement the ->free_disk method to free the virtio_blk structure only
once the last gendisk reference goes away instead of keeping a local
refcount.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/r/20220215094514.3828912-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agomemstick/mspro_block: simplify refcounting
Christoph Hellwig [Tue, 15 Feb 2022 09:45:13 +0000 (10:45 +0100)]
memstick/mspro_block: simplify refcounting

Implement the ->free_disk method to free the msb_data structure only once
the last gendisk reference goes away instead of keeping a local
refcount.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220215094514.3828912-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agomemstick/mspro_block: fix handling of read-only devices
Christoph Hellwig [Tue, 15 Feb 2022 09:45:12 +0000 (10:45 +0100)]
memstick/mspro_block: fix handling of read-only devices

Use set_disk_ro to propagate the read-only state to the block layer
instead of checking for it in ->open and leaking a reference in case
of a read-only device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220215094514.3828912-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agomemstick/ms_block: simplify refcounting
Christoph Hellwig [Tue, 15 Feb 2022 09:45:11 +0000 (10:45 +0100)]
memstick/ms_block: simplify refcounting

Implement the ->free_disk method to free the msb_data structure only once
the last gendisk reference goes away instead of keeping a local refcount.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220215094514.3828912-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: add a ->free_disk method
Christoph Hellwig [Tue, 15 Feb 2022 09:45:10 +0000 (10:45 +0100)]
block: add a ->free_disk method

Add a method to notify the driver that the gendisk is about to be freed.
This allows drivers to tie the lifetime of their private data to that of
the gendisk and thus deal with device removal races without expensive
synchronization and boilerplate code.

A new flag is added so that ->free_disk is only called after a successful
call to add_disk, which significantly simplifies the error handling path
during probing.
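
The lifetime rule can be sketched with a toy refcounted object. This
is a user-space model with hypothetical names; it only illustrates a
release hook that runs on the last put and only after a successful
add.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model of a refcounted disk object with a driver release hook. */
struct demo_disk {
	int refs;
	int added;				/* set once "add" succeeded */
	void (*free_disk)(struct demo_disk *);	/* optional driver hook */
	void *private_data;
};

static int demo_frees;	/* counts hook invocations, for demonstration */

/* Drop one reference; on the last one, let the driver free its data. */
static void demo_put_disk(struct demo_disk *disk)
{
	if (--disk->refs > 0)
		return;
	if (disk->added && disk->free_disk)
		disk->free_disk(disk);
	free(disk);
}

/* Example hook: private data lives exactly as long as the disk. */
static void demo_free_private(struct demo_disk *disk)
{
	free(disk->private_data);
	demo_frees++;
}
```

Because the hook is skipped when the add step never succeeded, a
probe error path can free its private data itself without worrying
about a second release.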

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220215094514.3828912-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: revert 4f1e9630afe6 ("blk-throtl: optimize IOPS throttle for large IO scenarios")
Ming Lei [Wed, 16 Feb 2022 04:45:14 +0000 (12:45 +0800)]
block: revert 4f1e9630afe6 ("blk-throtl: optimize IOPS throttle for large IO scenarios")

Revert commit 4f1e9630afe6 ("blk-throtl: optimize IOPS throttle for large
IO scenarios") since we have another easier way to address this issue and
get better iops throttling result.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220216044514.2903784-9-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: don't try to throttle split bio if iops limit isn't set
Ming Lei [Wed, 16 Feb 2022 04:45:13 +0000 (12:45 +0800)]
block: don't try to throttle split bio if iops limit isn't set

We need to throttle a split bio when an IOPS limit is set, even though
the split bio has been marked as BIO_THROTTLED, since the block layer
actually accounts for each split bio.

If only a throughput throttle is set up, there is no need to throttle
any more if BIO_THROTTLED is set, since we have already accounted for
and considered the whole bio's bytes.

Add a THROTL_TG_HAS_IOPS_LIMIT flag to serve this purpose.
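
The resulting decision reduces to a small predicate. The function and
parameter names below are hypothetical; the two booleans stand in for
the BIO_THROTTLED mark and the THROTL_TG_HAS_IOPS_LIMIT flag.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide whether a bio still has to go through throttling.
 * already_throttled models the BIO_THROTTLED mark; has_iops_limit
 * models the THROTL_TG_HAS_IOPS_LIMIT flag.
 */
static bool demo_needs_throttle(bool already_throttled, bool has_iops_limit)
{
	if (!already_throttled)
		return true;	/* first pass: always charge */
	return has_iops_limit;	/* split bio: recharge only for IOPS */
}
```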

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220216044514.2903784-8-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: throttle split bio in case of iops limit
Ming Lei [Wed, 16 Feb 2022 04:45:12 +0000 (12:45 +0800)]
block: throttle split bio in case of iops limit

Commit 111be8839817 ("block-throttle: avoid double charge") unconditionally
marks a bio as BIO_THROTTLED once __blk_throtl_bio() has been called on it,
so the bio never enters __blk_throtl_bio() again. This avoids a double
charge when the bio is split. That is reasonable for the read/write
throughput limit, but not for the IOPS limit, because the block layer
accounts I/O against each split bio.

Chunguang Xu has already observed this issue and fixed it in commit
4f1e9630afe6 ("blk-throtl: optimize IOPS throttle for large IO scenarios").
However, that patch only covers bio splitting in __blk_queue_split(), and
there are other kinds of bio splitting, such as bio_split() combined with
submit_bio_noacct(), among others.

This patch tries to fix the issue in a generic way by always charging the
bio against the IOPS limit in blk_throtl_bio(). This is reasonable: a
re-submitted or fast-cloned bio is charged if it is submitted to the same
disk/queue, and BIO_THROTTLED is cleared if bio->bi_bdev changes.

The new approach gives a much smoother and more stable IOPS limit than
commit 4f1e9630afe6 ("blk-throtl: optimize IOPS throttle for large IO
scenarios"), since that commit cannot actually throttle the current split
bios.

This approach also does not introduce a new double IOPS charge in
blk_throtl_dispatch_work_fn(), because blk_throtl_bio() is no longer
called from there.
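
The charging rule described above (bytes charged once, I/O count charged on every submission) can be sketched as a user-space model. The `model_*` types and helpers are hypothetical, and the assumption that a split fragment inherits the throttled mark from its parent is part of the model, not a statement about the kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

struct model_bio {
	uint64_t bytes;
	bool throttled; /* stands in for BIO_THROTTLED */
};

struct model_tg {
	uint64_t bytes_disp;  /* bytes charged against the throughput limit */
	unsigned int io_disp; /* I/Os charged against the IOPS limit */
};

/*
 * Charge one submission: bytes are charged only the first time the data
 * is seen, while the I/O count is charged for every bio, split or not.
 */
void model_charge_bio(struct model_tg *tg, struct model_bio *bio)
{
	if (!bio->throttled)
		tg->bytes_disp += bio->bytes;
	tg->io_disp++;
	bio->throttled = true;
}

/* Split 'bytes' off the front; the fragment inherits the throttled mark. */
struct model_bio model_split(struct model_bio *bio, uint64_t bytes)
{
	struct model_bio front = {
		.bytes = bytes,
		.throttled = bio->throttled,
	};

	bio->bytes -= bytes;
	return front;
}
```

Charging a 1 MiB bio and then a 512 KiB fragment split from it leaves `bytes_disp` at 1 MiB but raises `io_disp` to 2 in this model, which is the asymmetry between the two limits that the commit message describes.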

Reported-by: Ning Li <lining2020x@163.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Chunguang Xu <brookxu@tencent.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220216044514.2903784-7-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years ago block: merge submit_bio_checks() into submit_bio_noacct
Ming Lei [Wed, 16 Feb 2022 04:45:11 +0000 (12:45 +0800)]
block: merge submit_bio_checks() into submit_bio_noacct

Now submit_bio_checks() is only called by submit_bio_noacct(), so merge
it into submit_bio_noacct().

Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220216044514.2903784-6-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years ago block: don't check bio in blk_throtl_dispatch_work_fn
Ming Lei [Wed, 16 Feb 2022 04:45:10 +0000 (12:45 +0800)]
block: don't check bio in blk_throtl_dispatch_work_fn

The bio has already been checked before throttling, so there is no need
to check it again when dispatching it from the throttle queue.

Add a submit_bio_noacct_nocheck() helper for this purpose.
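
The split between a checked entry point and an unchecked helper can be sketched as a user-space model. All `model_*` names, the validity rule, and the counters are hypothetical and exist only to make the call pattern visible; they are not the block layer's functions.

```c
#include <stdbool.h>
#include <stddef.h>

struct model_bio {
	size_t size;
	bool valid;
};

/* Counters that make the difference between the two paths observable. */
unsigned int model_checks_run;
unsigned int model_bios_issued;

/* Hypothetical validation step, counted each time it runs. */
static bool model_bio_checks(const struct model_bio *bio)
{
	model_checks_run++;
	return bio->valid && bio->size > 0;
}

/* Unchecked helper: the caller guarantees the bio was validated before. */
void model_submit_nocheck(struct model_bio *bio)
{
	(void)bio;
	model_bios_issued++;
}

/* Checked entry point: validate first, then share the unchecked path. */
bool model_submit(struct model_bio *bio)
{
	if (!model_bio_checks(bio))
		return false;
	model_submit_nocheck(bio);
	return true;
}

/* Dispatch worker: queued bios were checked earlier, so skip the checks. */
void model_dispatch_throttled(struct model_bio **queue, size_t n)
{
	for (size_t i = 0; i < n; i++)
		model_submit_nocheck(queue[i]);
}
```

In this model the dispatch path issues bios without running the checks a second time, while new submissions still go through `model_submit()` and are validated exactly once.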

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220216044514.2903784-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years ago block: don't declare submit_bio_checks in local header
Ming Lei [Wed, 16 Feb 2022 04:45:09 +0000 (12:45 +0800)]
block: don't declare submit_bio_checks in local header

submit_bio_checks() is no longer called outside of block/blk-core.c
since commit 9d497e2941c3 ("block: don't protect submit_bio_checks by
q_usage_counter"), so mark it as a local helper.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220216044514.2903784-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years ago block: move blk_crypto_bio_prep() out of blk-mq.c
Ming Lei [Wed, 16 Feb 2022 04:45:08 +0000 (12:45 +0800)]
block: move blk_crypto_bio_prep() out of blk-mq.c

blk_crypto_bio_prep() is called for both bio based and blk-mq drivers,
so move it out of blk-mq.c, then we can unify this kind of handling.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220216044514.2903784-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years ago block: move submit_bio_checks() into submit_bio_noacct
Ming Lei [Wed, 16 Feb 2022 04:45:07 +0000 (12:45 +0800)]
block: move submit_bio_checks() into submit_bio_noacct

It is cleaner and more readable to check the bio when starting to submit
it, instead of just before calling ->submit_bio() or blk_mq_submit_bio().

It also gives us a chance to optimize bio submission in cases where the
bio does not need to be checked.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220216044514.2903784-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years ago dm: remove dm_dispatch_clone_request
Christoph Hellwig [Tue, 15 Feb 2022 10:05:40 +0000 (11:05 +0100)]
dm: remove dm_dispatch_clone_request

Fold dm_dispatch_clone_request into its only caller, and use a single
switch statement to handle the different return values from
blk_insert_cloned_request.
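
The single-switch dispatch can be sketched as a user-space model. The status and action enums below are hypothetical, and the mapping from status to action is an assumption chosen for illustration; it should not be read as the exact behavior of the dm code.

```c
/* Hypothetical stand-ins for the status returned by the insert call. */
enum model_blk_status {
	MODEL_STS_OK,
	MODEL_STS_RESOURCE,
	MODEL_STS_DEV_RESOURCE,
	MODEL_STS_IOERR,
};

/* Hypothetical follow-up actions the caller takes for each status. */
enum model_action {
	MODEL_ACT_NONE,                /* clone was queued, nothing to do */
	MODEL_ACT_REQUEUE,             /* release the clone and retry later */
	MODEL_ACT_COMPLETE_WITH_ERROR, /* finish the original request */
};

/* One switch covers every return value instead of chained if/else. */
enum model_action model_handle_insert(enum model_blk_status ret)
{
	switch (ret) {
	case MODEL_STS_OK:
		return MODEL_ACT_NONE;
	case MODEL_STS_RESOURCE:
	case MODEL_STS_DEV_RESOURCE:
		/* Temporary shortage: both resource errors share one path. */
		return MODEL_ACT_REQUEUE;
	default:
		/* Any other status is treated as a hard failure. */
		return MODEL_ACT_COMPLETE_WITH_ERROR;
	}
}
```

Grouping the two resource statuses under one case and leaving everything else to `default` keeps each outcome in a single place, which is the readability gain a switch offers over a helper plus nested conditionals.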

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220215100540.3892965-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years ago dm: remove useless code from dm_dispatch_clone_request
Christoph Hellwig [Tue, 15 Feb 2022 10:05:39 +0000 (11:05 +0100)]
dm: remove useless code from dm_dispatch_clone_request

Both ->start_time_ns and RQF_IO_STAT are already set in
blk_mq_rq_ctx_init when dm-mpath allocates the request using
blk_mq_alloc_request. The block layer also ensures that ->start_time_ns
is only set when it is actually needed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220215100540.3892965-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>