OSDN Git Service

tomoyo/tomoyo-test1.git
4 years agobtrfs: Call btrfs_pin_reserved_extent only during active transaction
Nikolay Borisov [Mon, 20 Jan 2020 14:09:11 +0000 (16:09 +0200)]
btrfs: Call btrfs_pin_reserved_extent only during active transaction

Calling btrfs_pin_reserved_extent makes sense only with a valid
transaction since pinned extents are processed from transaction commit
in btrfs_finish_extent_commit. In case of error it's sufficient to
adjust the reserved counter to account for log tree extents allocated in
the last transaction.

This commit moves btrfs_pin_reserved_extent to be called only with valid
transaction handle and otherwise uses the newly introduced
unaccount_log_buffer to adjust "reserved". If this is not done if a
failure occurs before transaction is committed WARN_ON are going to be
triggered on unmount. This was especially pronounced with generic/475
test.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: Introduce unaccount_log_buffer
Nikolay Borisov [Mon, 20 Jan 2020 14:09:10 +0000 (16:09 +0200)]
btrfs: Introduce unaccount_log_buffer

This function correctly adjusts the reserved bytes occupied by a log
tree extent buffer. It will be used instead of calling
btrfs_pin_reserved_extent.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: Make btrfs_pin_extent take trans handle
Nikolay Borisov [Mon, 20 Jan 2020 14:09:09 +0000 (16:09 +0200)]
btrfs: Make btrfs_pin_extent take trans handle

Preparation for switching pinned extent tracking to a per-transaction
basis.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: Perform pinned cleanup directly in btrfs_destroy_delayed_refs
Nikolay Borisov [Mon, 20 Jan 2020 14:09:08 +0000 (16:09 +0200)]
btrfs: Perform pinned cleanup directly in btrfs_destroy_delayed_refs

Having btrfs_destroy_delayed_refs call btrfs_pin_extent is problematic
for making pinned extents tracking per-transaction since
btrfs_trans_handle cannot be passed to btrfs_pin_extent in this context.
Additionally delayed refs heads pinned in btrfs_destroy_delayed_refs
are going to be handled very closely, in btrfs_destroy_pinned_extent.

To enable btrfs_pin_extent to take btrfs_trans_handle simply open code
it in btrfs_destroy_delayed_refs and call btrfs_error_unpin_extent_range
on the range. This enables us to do less work in
btrfs_destroy_pinned_extent and leaves btrfs_pin_extent being called in
contexts which have a valid btrfs_trans_handle.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: sysfs, unify handler name of devinfo/missing
Anand Jain [Thu, 13 Feb 2020 08:40:53 +0000 (16:40 +0800)]
btrfs: sysfs, unify handler name of devinfo/missing

The devinfo attribute handlers were added in 668e48af7a94 ("btrfs:
sysfs, add devid/dev_state kobject and device attributes") and the name
should contain _devinfo_, there's one that does not conform, so unify it
with the rest.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: sysfs, rename device_link add/remove functions
Anand Jain [Wed, 12 Feb 2020 09:28:13 +0000 (17:28 +0800)]
btrfs: sysfs, rename device_link add/remove functions

Since commit 668e48af7a94 ("btrfs: sysfs, add devid/dev_state kobject and
device attributes"), the functions btrfs_sysfs_add_device_link() and
btrfs_sysfs_rm_device_link() do more than just adding and removing the
device link as its name indicated. Rename them to be more specific
that's about the directory with the attirbutes

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: sysfs, use btrfs_sysfs_remove_fsid to celanup errors in add_fsid
Anand Jain [Wed, 12 Feb 2020 09:28:12 +0000 (17:28 +0800)]
btrfs: sysfs, use btrfs_sysfs_remove_fsid to celanup errors in add_fsid

We have one simple function btrfs_sysfs_remove_fsid() to undo
btrfs_sysfs_add_fsid(), which also does proper checks before releasing
objects.

One difference, if btrfs_sysfs_remove_fsid is used that now we also call
kobject_del() which was missing before. This was tested (with kobject
debug turned on) and no change in behaviour was found.

This is a cleanup patch.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: sink argument tree to __do_readpage
David Sterba [Wed, 5 Feb 2020 18:09:42 +0000 (19:09 +0100)]
btrfs: sink argument tree to __do_readpage

The tree pointer can be safely read from the inode, use it and drop the
redundant argument.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: sink arugment tree to contiguous_readpages
David Sterba [Wed, 5 Feb 2020 18:09:40 +0000 (19:09 +0100)]
btrfs: sink arugment tree to contiguous_readpages

The tree pointer can be safely read from the inode, use it and drop the
redundant argument.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: sink argument tree to __extent_read_full_page
David Sterba [Wed, 5 Feb 2020 18:09:37 +0000 (19:09 +0100)]
btrfs: sink argument tree to __extent_read_full_page

The tree pointer can be safely read from the inode, use it and drop the
redundant argument.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: sink argument tree to extent_read_full_page
David Sterba [Wed, 5 Feb 2020 18:09:35 +0000 (19:09 +0100)]
btrfs: sink argument tree to extent_read_full_page

The tree pointer can be safely read from the page's inode, use it and
drop the redundant argument.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: drop argument tree from btrfs_lock_and_flush_ordered_range
David Sterba [Wed, 5 Feb 2020 18:09:33 +0000 (19:09 +0100)]
btrfs: drop argument tree from btrfs_lock_and_flush_ordered_range

The tree pointer can be safely read from the inode so we can drop the
redundant argument from btrfs_lock_and_flush_ordered_range.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: add assertions for tree == inode->io_tree to extent IO helpers
David Sterba [Wed, 5 Feb 2020 18:09:30 +0000 (19:09 +0100)]
btrfs: add assertions for tree == inode->io_tree to extent IO helpers

Add assertions to all helpers that get tree as argument and verify that
it's the same that can be obtained from the inode or from its pages. In
followup patches the redundant arguments and assertions will be removed
one by one.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: drop argument tree from submit_extent_page
David Sterba [Wed, 5 Feb 2020 18:09:28 +0000 (19:09 +0100)]
btrfs: drop argument tree from submit_extent_page

Now that we're sure the tree from argument is same as the one we can get
from the page's inode io_tree, drop the redundant argument.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: remove extent_page_data::tree
David Sterba [Wed, 5 Feb 2020 18:09:26 +0000 (19:09 +0100)]
btrfs: remove extent_page_data::tree

All functions that set up extent_page_data::tree set it to the inode
io_tree. That's passed down the callstack that accesses either the same
inode or its pages. In the end submit_extent_page can pull the tree out
of the page and we don't have to store it in the structure.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: add wrapper for transaction abort predicate
David Sterba [Wed, 5 Feb 2020 16:34:34 +0000 (17:34 +0100)]
btrfs: add wrapper for transaction abort predicate

The status of aborted transaction can change between calls and it needs
to be accessed by READ_ONCE. Add a helper that also wraps the unlikely
hint.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: move root node locking helpers to locking.c
David Sterba [Wed, 5 Feb 2020 16:26:51 +0000 (17:26 +0100)]
btrfs: move root node locking helpers to locking.c

The helpers are related to locking so move them there, update comments.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: rename btrfs_put_fs_root and btrfs_grab_fs_root
Josef Bacik [Fri, 24 Jan 2020 14:33:01 +0000 (09:33 -0500)]
btrfs: rename btrfs_put_fs_root and btrfs_grab_fs_root

We are now using these for all roots, rename them to btrfs_put_root()
and btrfs_grab_root();

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: add a leak check for roots
Josef Bacik [Fri, 24 Jan 2020 14:33:00 +0000 (09:33 -0500)]
btrfs: add a leak check for roots

Now that we're going to start relying on getting ref counting right for
roots, add a list to track allocated roots and print out any roots that
aren't freed up at free_fs_info time.

Hide this behind CONFIG_BTRFS_DEBUG because this will just be used for
developers to verify they aren't breaking things.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: make the init of static elements in fs_info separate
Josef Bacik [Fri, 24 Jan 2020 14:32:59 +0000 (09:32 -0500)]
btrfs: make the init of static elements in fs_info separate

In adding things like eb leak checking and root leak checking there were
a lot of weird corner cases that come from the fact that

  1) We do not init the fs_info until we get to open_ctree time in the
     normal case and

  2) The test infrastructure half-init's the fs_info for things that it
     needs.

This makes it really annoying to make changes because you have to add
init in two different places, have special cases for testing fs_info's
that may not have certain things initialized, and cases for fs_info's
that didn't make it to open_ctree and thus are not fully set up.

Fix this by extracting out the non-allocating init of the fs info into
it's own public function and use that to make sure we're all getting
consistent views of an allocated fs_info.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: move fs_info init work into it's own helper function
Josef Bacik [Fri, 24 Jan 2020 14:32:58 +0000 (09:32 -0500)]
btrfs: move fs_info init work into it's own helper function

open_ctree mixes initialization of fs stuff and fs_info stuff, which
makes it confusing when doing things like adding the root leak
detection.  Make a separate function that inits all the static
structures inside of the fs_info needed for the fs to operate, and then
call that before we start setting up the fs_info to be mounted.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: free more things in btrfs_free_fs_info
Josef Bacik [Fri, 24 Jan 2020 14:32:57 +0000 (09:32 -0500)]
btrfs: free more things in btrfs_free_fs_info

Things like the percpu_counters, the mapping_tree, and the csum hash can
all be freed at btrfs_free_fs_info time, since the helpers all check if
the structure has been initialized already.  This significantly cleans
up the error cases in open_ctree.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: push btrfs_grab_fs_root into btrfs_get_fs_root
Josef Bacik [Fri, 24 Jan 2020 14:32:56 +0000 (09:32 -0500)]
btrfs: push btrfs_grab_fs_root into btrfs_get_fs_root

Now that all callers of btrfs_get_fs_root are subsequently calling
btrfs_grab_fs_root and handling dropping the ref when they are done
appropriately, go ahead and push btrfs_grab_fs_root up into
btrfs_get_fs_root.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: use btrfs_put_fs_root to free roots always
Josef Bacik [Fri, 24 Jan 2020 14:32:55 +0000 (09:32 -0500)]
btrfs: use btrfs_put_fs_root to free roots always

If we are going to track leaked roots we need to free them all the same
way, so don't kfree() roots directly, use btrfs_put_fs_root.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: hold a ref on the root in open_ctree
Josef Bacik [Fri, 24 Jan 2020 14:32:54 +0000 (09:32 -0500)]
btrfs: hold a ref on the root in open_ctree

We lookup the fs_root and put it in our fs_info directly, we should hold
a ref on this root for the lifetime of the fs_info.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: export and rename free_fs_info
Josef Bacik [Fri, 24 Jan 2020 14:32:53 +0000 (09:32 -0500)]
btrfs: export and rename free_fs_info

We're going to start freeing roots and doing other complicated things in
free_fs_info, so we need to move it to disk-io.c and export it in order
to use things lik btrfs_put_fs_root().

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: hold a ref on the root in btrfs_check_uuid_tree_entry
Josef Bacik [Fri, 24 Jan 2020 14:32:52 +0000 (09:32 -0500)]
btrfs: hold a ref on the root in btrfs_check_uuid_tree_entry

We lookup the uuid of arbitrary subvolumes, hold a ref on the root while
we're doing this.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: hold a ref on the root in btrfs_recover_log_trees
Josef Bacik [Fri, 24 Jan 2020 14:32:51 +0000 (09:32 -0500)]
btrfs: hold a ref on the root in btrfs_recover_log_trees

We replay the log into arbitrary fs roots, hold a ref on the root while
we're doing this.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: hold a ref on the root in create_pending_snapshot
Josef Bacik [Fri, 24 Jan 2020 14:32:50 +0000 (09:32 -0500)]
btrfs: hold a ref on the root in create_pending_snapshot

We create the snapshot and then use it for a bunch of things, we need to
hold a ref on it while we're messing with it.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: hold a ref on the root in get_subvol_name_from_objectid
Josef Bacik [Thu, 6 Feb 2020 15:24:26 +0000 (10:24 -0500)]
btrfs: hold a ref on the root in get_subvol_name_from_objectid

We lookup the name of a subvol which means we'll cross into different
roots.  Hold a ref while we're doing the look ups in the fs_root we're
searching.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: hold a ref on the root in btrfs_ioctl_send
Josef Bacik [Fri, 24 Jan 2020 14:32:48 +0000 (09:32 -0500)]
btrfs: hold a ref on the root in btrfs_ioctl_send

We lookup all the clone roots and the parent root for send, so we need
to hold refs on all of these roots while we're processing them.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: hold a ref on the root in scrub_print_warning_inode
Josef Bacik [Fri, 24 Jan 2020 14:32:47 +0000 (09:32 -0500)]
btrfs: hold a ref on the root in scrub_print_warning_inode

We look up the root for the bytenr that is failing, so we need to hold a
ref on the root for that operation.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: hold a ref for the root in btrfs_find_orphan_roots
Josef Bacik [Fri, 24 Jan 2020 14:32:46 +0000 (09:32 -0500)]
btrfs: hold a ref for the root in btrfs_find_orphan_roots

We lookup roots for every orphan item we have, we need to hold a ref on
the root while we're doing this work.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: push grab_fs_root into read_fs_root
Josef Bacik [Fri, 24 Jan 2020 14:32:45 +0000 (09:32 -0500)]
btrfs: push grab_fs_root into read_fs_root

All of relocation uses read_fs_root to lookup fs roots, so push the
btrfs_grab_fs_root() up into that helper and remove the individual
calls.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: hold a ref on the root in btrfs_recover_relocation
Josef Bacik [Fri, 24 Jan 2020 14:32:44 +0000 (09:32 -0500)]
btrfs: hold a ref on the root in btrfs_recover_relocation

We look up the fs root in various places in here when recovering from a
crashed relcoation.  Make sure we hold a ref on the root whenever we
look them up.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: hold a ref on the root in create_reloc_inode
Josef Bacik [Fri, 24 Jan 2020 14:32:43 +0000 (09:32 -0500)]
btrfs: hold a ref on the root in create_reloc_inode

We're creating a reloc inode in the data reloc tree, we need to hold a
ref on the root while we're doing that.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: hold a ref on the root in find_data_references
Josef Bacik [Fri, 24 Jan 2020 14:32:42 +0000 (09:32 -0500)]
btrfs: hold a ref on the root in find_data_references

We're looking up the data references for the bytenr in a root, we need
to hold a ref on that root while we're doing that.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: hold a ref on the root in record_reloc_root_in_trans
Josef Bacik [Fri, 24 Jan 2020 14:32:41 +0000 (09:32 -0500)]
btrfs: hold a ref on the root in record_reloc_root_in_trans

We are recording this root in the transaction, so we need to hold a ref
on it until we do that.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: hold a ref on the root in merge_reloc_roots
Josef Bacik [Fri, 24 Jan 2020 14:32:40 +0000 (09:32 -0500)]
btrfs: hold a ref on the root in merge_reloc_roots

We look up the corresponding root for the reloc root, we need to hold a
ref while we're messing with it.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: hold a ref on the root in prepare_to_merge
Josef Bacik [Fri, 24 Jan 2020 14:32:39 +0000 (09:32 -0500)]
btrfs: hold a ref on the root in prepare_to_merge

We look up the reloc roots corresponding root, we need to hold a ref on
that root.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: hold a ref on the root in build_backref_tree
Josef Bacik [Fri, 24 Jan 2020 14:32:38 +0000 (09:32 -0500)]
btrfs: hold a ref on the root in build_backref_tree

This is trickier than the previous conversions.  We have backref_node's
that need to hold onto their root for their lifetime.  Do the read of
the root and grab the ref.  If at any point we don't use the root we
discard it, however if we use it in our backref node we don't free it
until we free the backref node.  Any time we switch the root's for the
backref node we need to drop our ref on the old root and grab the ref on
the new root, and if we dupe a node we need to get a ref on the root
there as well.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: hold ref on root in btrfs_ioctl_default_subvol
Josef Bacik [Fri, 24 Jan 2020 14:32:37 +0000 (09:32 -0500)]
btrfs: hold ref on root in btrfs_ioctl_default_subvol

We look up an arbitrary fs root here, we need to hold a ref on the root
for the duration.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: hold a ref on the root in btrfs_ioctl_get_subvol_info
Josef Bacik [Fri, 24 Jan 2020 14:32:36 +0000 (09:32 -0500)]
btrfs: hold a ref on the root in btrfs_ioctl_get_subvol_info

We look up whatever root userspace has given us, we need to hold a ref
throughout this operation. Use 'root' only for the on fs root and not as
a temporary variable elsewhere.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: hold a ref on the root in btrfs_search_path_in_tree_user
Josef Bacik [Fri, 24 Jan 2020 14:32:35 +0000 (09:32 -0500)]
btrfs: hold a ref on the root in btrfs_search_path_in_tree_user

We can wander into a different root, so grab a ref on the root we look
up.  Later on we make root = fs_info->tree_root so we need this separate
out label to make sure we do the right cleanup only in the case we're
looking up a different root.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: hold a ref on the root in btrfs_search_path_in_tree
Josef Bacik [Fri, 24 Jan 2020 14:32:34 +0000 (09:32 -0500)]
btrfs: hold a ref on the root in btrfs_search_path_in_tree

We look up an arbitrary fs root, we need to hold a ref on it while we're
doing our search.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: hold a ref on the root in search_ioctl
Josef Bacik [Fri, 24 Jan 2020 14:32:33 +0000 (09:32 -0500)]
btrfs: hold a ref on the root in search_ioctl

We lookup a arbitrary fs root, we need to hold a ref on that root.  If
we're using our own inodes root then grab a ref on that as well to make
the cleanup easier.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: hold a ref on the root in create_subvol
Josef Bacik [Fri, 24 Jan 2020 14:32:32 +0000 (09:32 -0500)]
btrfs: hold a ref on the root in create_subvol

We're creating the new root here, but we should hold the ref until after
we've initialized the inode for it.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: hold a ref on the root in fixup_tree_root_location
Josef Bacik [Fri, 24 Jan 2020 14:32:31 +0000 (09:32 -0500)]
btrfs: hold a ref on the root in fixup_tree_root_location

Looking up the inode from an arbitrary tree means we need to hold a ref
on that root.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: hold a ref on the root in __btrfs_run_defrag_inode
Josef Bacik [Fri, 24 Jan 2020 14:32:30 +0000 (09:32 -0500)]
btrfs: hold a ref on the root in __btrfs_run_defrag_inode

We are looking up an arbitrary inode, we need to hold a ref on the root
while we're doing this.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: hold a root ref in btrfs_get_dentry
Josef Bacik [Fri, 24 Jan 2020 14:32:29 +0000 (09:32 -0500)]
btrfs: hold a root ref in btrfs_get_dentry

Looking up the inode we need to search the root, make sure we hold a
reference on that root while we're doing the lookup.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: hold a ref on the root in resolve_indirect_ref
Josef Bacik [Fri, 24 Jan 2020 14:32:28 +0000 (09:32 -0500)]
btrfs: hold a ref on the root in resolve_indirect_ref

We're looking up a random root, we need to hold a ref on it while we're
using it.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: hold a ref on fs roots while they're in the radix tree
Josef Bacik [Fri, 24 Jan 2020 14:32:27 +0000 (09:32 -0500)]
btrfs: hold a ref on fs roots while they're in the radix tree

If the root is sitting in the radix tree, we should probably have a ref
for the radix tree.  Grab a ref on the root when we insert it, and drop
it when it gets deleted.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: describe the space reservation system in general
Josef Bacik [Tue, 4 Feb 2020 18:18:56 +0000 (13:18 -0500)]
btrfs: describe the space reservation system in general

Add another comment to cover how the space reservation system works
generally.  This covers the actual reservation flow, as well as how
flushing is handled.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: add a comment describing delalloc space reservation
Josef Bacik [Tue, 4 Feb 2020 18:18:55 +0000 (13:18 -0500)]
btrfs: add a comment describing delalloc space reservation

delalloc space reservation is tricky because it encompasses both data
and metadata.  Make it clear what each side does, the general flow of
how space is moved throughout the lifetime of a write, and what goes
into the calculations.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: add a comment describing block reserves
Josef Bacik [Tue, 4 Feb 2020 18:18:54 +0000 (13:18 -0500)]
btrfs: add a comment describing block reserves

This is a giant comment at the top of block-rsv.c describing generally
how block reserves work.  It is purely about the block reserves
themselves, and nothing to do with how the actual reservation system
works.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: handle NULL roots in btrfs_put/btrfs_grab_fs_root
Josef Bacik [Fri, 24 Jan 2020 14:32:26 +0000 (09:32 -0500)]
btrfs: handle NULL roots in btrfs_put/btrfs_grab_fs_root

We want to use this for dropping all roots, and in some error cases we
may not have a root, so handle this to make the cleanup code easier.
Make btrfs_grab_fs_root the same so we can use it in cases where the
root may not exist (like the quota root).

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: make the fs root init functions static
Josef Bacik [Fri, 24 Jan 2020 14:32:25 +0000 (09:32 -0500)]
btrfs: make the fs root init functions static

Now that the orphan cleanup stuff doesn't use this directly we can just
make them static.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: open code btrfs_read_fs_root_no_name
Josef Bacik [Fri, 24 Jan 2020 14:32:24 +0000 (09:32 -0500)]
btrfs: open code btrfs_read_fs_root_no_name

All this does is call btrfs_get_fs_root() with check_ref == true.  Just
use btrfs_get_fs_root() so we don't have a bunch of different helpers
that do the same thing.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: remove btrfs_read_fs_root, not used anymore
Josef Bacik [Fri, 24 Jan 2020 14:32:23 +0000 (09:32 -0500)]
btrfs: remove btrfs_read_fs_root, not used anymore

All helpers should either be using btrfs_get_fs_root() or
btrfs_read_tree_root().

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: make relocation use btrfs_read_tree_root()
Josef Bacik [Fri, 24 Jan 2020 14:32:22 +0000 (09:32 -0500)]
btrfs: make relocation use btrfs_read_tree_root()

Relocation has it's special roots, we don't want to save these in the
root cache either, so swap it to use btrfs_read_tree_root().  However
the reloc root does need REF_COWS set, so make sure we set it everywhere
we use this helper, as it no longer does the REF_COWS setting.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: export and use btrfs_read_tree_root for tree-log
Josef Bacik [Fri, 24 Jan 2020 14:32:21 +0000 (09:32 -0500)]
btrfs: export and use btrfs_read_tree_root for tree-log

Tree-log uses btrfs_read_fs_root to load its log, but this just calls
btrfs_read_tree_root.  We don't save the log roots in our root cache, so
just export this helper and use it in the logging code.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: make btrfs_find_orphan_roots use btrfs_get_fs_root
Josef Bacik [Fri, 24 Jan 2020 14:32:20 +0000 (09:32 -0500)]
btrfs: make btrfs_find_orphan_roots use btrfs_get_fs_root

btrfs_find_orphan_roots has this weird thing where it looks up the root
in cache to see if it is there before just reading the root.  But the
read it uses just reads the root, it doesn't do any of the init work, we
do that by hand here.  But this is unnecessary, all we really want is to
see if the root still exists and add it to the dead roots list to be
cleaned up, otherwise we delete the orphan item.

Fix this by just using btrfs_get_fs_root directly with check_ref set to
false so we get the orphan root items.  Then we just handle in cache and
out of cache roots the same, add them to the dead roots list and carry
on.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: move fs root init stuff into btrfs_init_fs_root
Josef Bacik [Fri, 24 Jan 2020 14:32:19 +0000 (09:32 -0500)]
btrfs: move fs root init stuff into btrfs_init_fs_root

We have a helper for reading fs roots that just reads the fs root off
the disk and then sets REF_COWS and init's the inheritable flags.  Move
this into btrfs_init_fs_root so we can later get rid of this helper and
consolidate all of the fs root reading into one helper.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: push __setup_root into btrfs_alloc_root
Josef Bacik [Fri, 24 Jan 2020 14:32:18 +0000 (09:32 -0500)]
btrfs: push __setup_root into btrfs_alloc_root

There's no reason to not init the root at alloc time, and with later
patches it actually causes problems if we error out mounting the fs
before the tree_root is init'ed because we expect it to have a valid ref
count.  Fix this by pushing __setup_root into btrfs_alloc_root.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: delete the ordered isize update code
Josef Bacik [Fri, 17 Jan 2020 14:02:24 +0000 (09:02 -0500)]
btrfs: delete the ordered isize update code

Now that we have a safe way to update the isize, remove all of this code
as it's no longer needed.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: replace all uses of btrfs_ordered_update_i_size
Josef Bacik [Fri, 17 Jan 2020 14:02:23 +0000 (09:02 -0500)]
btrfs: replace all uses of btrfs_ordered_update_i_size

Now that we have a safe way to update the i_size, replace all uses of
btrfs_ordered_update_i_size with btrfs_inode_safe_disk_i_size_write.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: use the file extent tree infrastructure
Josef Bacik [Fri, 17 Jan 2020 14:02:22 +0000 (09:02 -0500)]
btrfs: use the file extent tree infrastructure

We want to use this everywhere we modify the file extent items
permanently.  These include:

  1) Inserting new file extents for writes and prealloc extents.
  2) Truncating inode items.
  3) btrfs_cont_expand().
  4) Insert inline extents.
  5) Insert new extents from log replay.
  6) Insert a new extent for clone, as it could be past i_size.
  7) Hole punching

For hole punching in particular it might seem it's not necessary because
anybody extending would use btrfs_cont_expand, however there is a corner
that still can give us trouble.  Start with an empty file and

fallocate KEEP_SIZE 1M-2M

We now have a 0 length file, and a hole file extent from 0-1M, and a
prealloc extent from 1M-2M.  Now

punch 1M-1.5M

Because this is past i_size we have

[HOLE EXTENT][ NOTHING ][PREALLOC]
[0        1M][1M   1.5M][1.5M  2M]

with an i_size of 0.  Now if we pwrite 0-1.5M we'll increas our i_size
to 1.5M, but our disk_i_size is still 0 until the ordered extent
completes.

However if we now immediately truncate 2M on the file we'll just call
btrfs_cont_expand(inode, 1.5M, 2M), since our old i_size is 1.5M.  If we
commit the transaction here and crash we'll expose the gap.

To fix this we need to clear the file extent mapping for the range that
we punched but didn't insert a corresponding file extent for.  This will
mean the truncate will only get an disk_i_size set to 1M if we crash
before the finish ordered io happens.

I've written an xfstest to reproduce the problem and validate this fix.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: introduce per-inode file extent tree
Josef Bacik [Fri, 17 Jan 2020 14:02:21 +0000 (09:02 -0500)]
btrfs: introduce per-inode file extent tree

In order to keep track of where we have file extents on disk, and thus
where it is safe to adjust the i_size to, we need to have a tree in
place to keep track of the contiguous areas we have file extents for.

Add helpers to use this tree, as it's not required for NO_HOLES file
systems.  We will use this by setting DIRTY for areas we know we have
file extent item's set, and clearing it when we remove file extent items
for truncation.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: use btrfs_ordered_update_i_size in clone_finish_inode_update
Josef Bacik [Fri, 17 Jan 2020 14:02:19 +0000 (09:02 -0500)]
btrfs: use btrfs_ordered_update_i_size in clone_finish_inode_update

We were using btrfs_i_size_write(), which unconditionally jacks up
inode->disk_i_size.  However since clone can operate on ranges we could
have pending ordered extents for a range prior to the start of our clone
operation and thus increase disk_i_size too far and have a hole with no
file extent.

Fix this by using the btrfs_ordered_update_i_size helper which will do
the right thing in the face of pending ordered extents outside of our
clone range.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: update the comment of btrfs_control_ioctl()
Su Yue [Tue, 4 Feb 2020 04:51:56 +0000 (12:51 +0800)]
btrfs: update the comment of btrfs_control_ioctl()

Btrfsctl was removed in 2012, now the function btrfs_control_ioctl()
is only used for devices ioctls. So update the comment.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Su Yue <Damenly_Su@gmx.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: relocation: Add introduction of how relocation works
Qu Wenruo [Thu, 16 Jan 2020 05:04:07 +0000 (13:04 +0800)]
btrfs: relocation: Add introduction of how relocation works

Relocation is one of the most complex part of btrfs, while it's also the
foundation stone for online resizing, profile converting.

For such a complex facility, we should at least have some introduction
to it.

This patch will add an basic introduction at pretty a high level,
explaining:

- What relocation does
- How relocation is done
  Only mentioning how data reloc tree and reloc tree are involved in the
  operation.
  No details like the backref cache, or the data reloc tree contents.
- Which function to refer.

More detailed comments will be added for reloc tree creation, data reloc
tree creation and backref cache.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agoBtrfs: don't iterate mod seq list when putting a tree mod seq
Filipe Manana [Wed, 22 Jan 2020 12:23:54 +0000 (12:23 +0000)]
Btrfs: don't iterate mod seq list when putting a tree mod seq

Each new element added to the mod seq list is always appended to the list,
and each one gets a sequence number coming from a counter which gets
incremented everytime a new element is added to the list (or a new node
is added to the tree mod log rbtree). Therefore the element with the
lowest sequence number is always the first element in the list.

So just remove the list iteration at btrfs_put_tree_mod_seq() that
computes the minimum sequence number in the list and replace it with
a check for the first element's sequence number.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: Add overview of device replace
Qu Wenruo [Thu, 23 Jan 2020 07:44:50 +0000 (15:44 +0800)]
btrfs: Add overview of device replace

The overview of btrfs dev-replace.  It mentions some corner cases caused
by the write duplication and scrub based data copy.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ adjust wording ]
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agoLinux 5.6-rc7 v5.6-rc7
Linus Torvalds [Mon, 23 Mar 2020 01:31:56 +0000 (18:31 -0700)]
Linux 5.6-rc7

4 years agoMerge tag 'for-5.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave...
Linus Torvalds [Sun, 22 Mar 2020 18:35:33 +0000 (11:35 -0700)]
Merge tag 'for-5.6-rc6-tag' of git://git./linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "Two fixes.

  The first is a regression: when dropping some incompat bits the
  conditions were reversed. The other is a fix for rename whiteout
  potentially leaving stack memory linked to a list"

* tag 'for-5.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix removal of raid[56|1c34} incompat flags after removing block group
  btrfs: fix log context list corruption after rename whiteout error

4 years agoMerge branch 'akpm' (patches from Andrew)
Linus Torvalds [Sun, 22 Mar 2020 17:46:50 +0000 (10:46 -0700)]
Merge branch 'akpm' (patches from Andrew)

Merge misc fixes from Andrew Morton:
 "10 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  x86/mm: split vmalloc_sync_all()
  mm, slub: prevent kmalloc_node crashes and memory leaks
  mm/mmu_notifier: silence PROVE_RCU_LIST warnings
  epoll: fix possible lost wakeup on epoll_ctl() path
  mm: do not allow MADV_PAGEOUT for CoW pages
  mm, memcg: throttle allocators based on ancestral memory.high
  mm, memcg: fix corruption on 64-bit divisor in memory.high throttling
  page-flags: fix a crash at SetPageError(THP_SWAP)
  mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case
  memcg: fix NULL pointer dereference in __mem_cgroup_usage_unregister_event

4 years agox86/mm: split vmalloc_sync_all()
Joerg Roedel [Sun, 22 Mar 2020 01:22:41 +0000 (18:22 -0700)]
x86/mm: split vmalloc_sync_all()

Commit 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in
__purge_vmap_area_lazy()") introduced a call to vmalloc_sync_all() in
the vunmap() code-path.  While this change was necessary to maintain
correctness on x86-32-pae kernels, it also adds additional cycles for
architectures that don't need it.

Specifically on x86-64 with CONFIG_VMAP_STACK=y some people reported
severe performance regressions in micro-benchmarks because it now also
calls the x86-64 implementation of vmalloc_sync_all() on vunmap().  But
the vmalloc_sync_all() implementation on x86-64 is only needed for newly
created mappings.

To avoid the unnecessary work on x86-64 and to gain the performance
back, split up vmalloc_sync_all() into two functions:

* vmalloc_sync_mappings(), and
* vmalloc_sync_unmappings()

Most call-sites to vmalloc_sync_all() only care about new mappings being
synchronized.  The only exception is the new call-site added in the
above mentioned commit.

Shile Zhang directed us to a report of an 80% regression in reaim
throughput.

Fixes: 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()")
Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Borislav Petkov <bp@suse.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [GHES]
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20191009124418.8286-1-joro@8bytes.org
Link: https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK2KEPC6KGKS6J25AIDB/
Link: http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agomm, slub: prevent kmalloc_node crashes and memory leaks
Vlastimil Babka [Sun, 22 Mar 2020 01:22:37 +0000 (18:22 -0700)]
mm, slub: prevent kmalloc_node crashes and memory leaks

Sachin reports [1] a crash in SLUB __slab_alloc():

  BUG: Kernel NULL pointer dereference on read at 0x000073b0
  Faulting instruction address: 0xc0000000003d55f4
  Oops: Kernel access of bad area, sig: 11 [#1]
  LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
  Modules linked in:
  CPU: 19 PID: 1 Comm: systemd Not tainted 5.6.0-rc2-next-20200218-autotest #1
  NIP:  c0000000003d55f4 LR: c0000000003d5b94 CTR: 0000000000000000
  REGS: c0000008b37836d0 TRAP: 0300   Not tainted  (5.6.0-rc2-next-20200218-autotest)
  MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24004844  XER: 00000000
  CFAR: c00000000000dec4 DAR: 00000000000073b0 DSISR: 40000000 IRQMASK: 1
  GPR00: c0000000003d5b94 c0000008b3783960 c00000000155d400 c0000008b301f500
  GPR04: 0000000000000dc0 0000000000000002 c0000000003443d8 c0000008bb398620
  GPR08: 00000008ba2f0000 0000000000000001 0000000000000000 0000000000000000
  GPR12: 0000000024004844 c00000001ec52a00 0000000000000000 0000000000000000
  GPR16: c0000008a1b20048 c000000001595898 c000000001750c18 0000000000000002
  GPR20: c000000001750c28 c000000001624470 0000000fffffffe0 5deadbeef0000122
  GPR24: 0000000000000001 0000000000000dc0 0000000000000002 c0000000003443d8
  GPR28: c0000008b301f500 c0000008bb398620 0000000000000000 c00c000002287180
  NIP ___slab_alloc+0x1f4/0x760
  LR __slab_alloc+0x34/0x60
  Call Trace:
    ___slab_alloc+0x334/0x760 (unreliable)
    __slab_alloc+0x34/0x60
    __kmalloc_node+0x110/0x490
    kvmalloc_node+0x58/0x110
    mem_cgroup_css_online+0x108/0x270
    online_css+0x48/0xd0
    cgroup_apply_control_enable+0x2ec/0x4d0
    cgroup_mkdir+0x228/0x5f0
    kernfs_iop_mkdir+0x90/0xf0
    vfs_mkdir+0x110/0x230
    do_mkdirat+0xb0/0x1a0
    system_call+0x5c/0x68

This is a PowerPC platform with following NUMA topology:

  available: 2 nodes (0-1)
  node 0 cpus:
  node 0 size: 0 MB
  node 0 free: 0 MB
  node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
  node 1 size: 35247 MB
  node 1 free: 30907 MB
  node distances:
  node   0   1
    0:  10  40
    1:  40  10

  possible numa nodes: 0-31

This only happens with a mmotm patch "mm/memcontrol.c: allocate
shrinker_map on appropriate NUMA node" [2] which effectively calls
kmalloc_node for each possible node.  SLUB however only allocates
kmem_cache_node on online N_NORMAL_MEMORY nodes, and relies on
node_to_mem_node to return such valid node for other nodes since commit
a561ce00b09e ("slub: fall back to node_to_mem_node() node if allocating
on memoryless node").  This is however not true in this configuration
where the _node_numa_mem_ array is not initialized for nodes 0 and 2-31,
thus it contains zeroes and get_partial() ends up accessing
non-allocated kmem_cache_node.

A related issue was reported by Bharata (originally by Ramachandran) [3]
where a similar PowerPC configuration, but with mainline kernel without
patch [2] ends up allocating large amounts of pages by kmalloc-1k
kmalloc-512.  This seems to have the same underlying issue with
node_to_mem_node() not behaving as expected, and might probably also
lead to an infinite loop with CONFIG_SLUB_CPU_PARTIAL [4].

This patch should fix both issues by not relying on node_to_mem_node()
anymore and instead simply falling back to NUMA_NO_NODE, when
kmalloc_node(node) is attempted for a node that's not online, or has no
usable memory.  The "usable memory" condition is also changed from
node_present_pages() to N_NORMAL_MEMORY node state, as that is exactly
the condition that SLUB uses to allocate kmem_cache_node structures.
The check in get_partial() is removed completely, as the checks in
___slab_alloc() are now sufficient to prevent get_partial() being
reached with an invalid node.

[1] https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A63968@linux.vnet.ibm.com/
[2] https://lore.kernel.org/linux-mm/fff0e636-4c36-ed10-281c-8cdb0687c839@virtuozzo.com/
[3] https://lore.kernel.org/linux-mm/20200317092624.GB22538@in.ibm.com/
[4] https://lore.kernel.org/linux-mm/088b5996-faae-8a56-ef9c-5b567125ae54@suse.cz/

Fixes: a561ce00b09e ("slub: fall back to node_to_mem_node() node if allocating on memoryless node")
Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Reported-by: PUVICHAKRAVARTHY RAMACHANDRAN <puvichakravarthy@in.ibm.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Tested-by: Bharata B Rao <bharata@linux.ibm.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200320115533.9604-1-vbabka@suse.cz
Debugged-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agomm/mmu_notifier: silence PROVE_RCU_LIST warnings
Qian Cai [Sun, 22 Mar 2020 01:22:34 +0000 (18:22 -0700)]
mm/mmu_notifier: silence PROVE_RCU_LIST warnings

It is safe to traverse mm->notifier_subscriptions->list either under
SRCU read lock or mm->notifier_subscriptions->lock using
hlist_for_each_entry_rcu().  Silence the PROVE_RCU_LIST false positives,
for example,

  WARNING: suspicious RCU usage
  -----------------------------
  mm/mmu_notifier.c:484 RCU-list traversed in non-reader section!!

  other info that might help us debug this:

  rcu_scheduler_active = 2, debug_locks = 1
  3 locks held by libvirtd/802:
   #0: ffff9321e3f58148 (&mm->mmap_sem#2){++++}, at: do_mprotect_pkey+0xe1/0x3e0
   #1: ffffffff91ae6160 (mmu_notifier_invalidate_range_start){+.+.}, at: change_p4d_range+0x5fa/0x800
   #2: ffffffff91ae6e08 (srcu){....}, at: __mmu_notifier_invalidate_range_start+0x178/0x460

  stack backtrace:
  CPU: 7 PID: 802 Comm: libvirtd Tainted: G          I       5.6.0-rc6-next-20200317+ #2
  Hardware name: HP ProLiant BL460c Gen8, BIOS I31 11/02/2014
  Call Trace:
    dump_stack+0xa4/0xfe
    lockdep_rcu_suspicious+0xeb/0xf5
    __mmu_notifier_invalidate_range_start+0x3ff/0x460
    change_p4d_range+0x746/0x800
    change_protection+0x1df/0x300
    mprotect_fixup+0x245/0x3e0
    do_mprotect_pkey+0x23b/0x3e0
    __x64_sys_mprotect+0x51/0x70
    do_syscall_64+0x91/0xae8
    entry_SYSCALL_64_after_hwframe+0x49/0xb3

Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Link: http://lkml.kernel.org/r/20200317175640.2047-1-cai@lca.pw
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agoepoll: fix possible lost wakeup on epoll_ctl() path
Roman Penyaev [Sun, 22 Mar 2020 01:22:30 +0000 (18:22 -0700)]
epoll: fix possible lost wakeup on epoll_ctl() path

This fixes possible lost wakeup introduced by commit a218cc491420.
Originally modifications to ep->wq were serialized by ep->wq.lock, but
in commit a218cc491420 ("epoll: use rwlock in order to reduce
ep_poll_callback() contention") a new rw lock was introduced in order to
relax fd event path, i.e. callers of ep_poll_callback() function.

After the change ep_modify and ep_insert (both are called on epoll_ctl()
path) were switched to ep->lock, but ep_poll (epoll_wait) was using
ep->wq.lock on wqueue list modification.

The bug doesn't lead to any wqueue list corruptions, because wake up
path and list modifications were serialized by ep->wq.lock internally,
but actual waitqueue_active() check prior wake_up() call can be
reordered with modifications of ep ready list, thus wake up can be lost.

And yes, can be healed by explicit smp_mb():

  list_add_tail(&epi->rdlink, &ep->rdllist);
  smp_mb();
  if (waitqueue_active(&ep->wq))
wake_up(&ep->wp);

But let's make it simple, thus current patch replaces ep->wq.lock with
the ep->lock for wqueue modifications, thus wake up path always observes
activeness of the wqueue correcty.

Fixes: a218cc491420 ("epoll: use rwlock in order to reduce ep_poll_callback() contention")
Reported-by: Max Neunhoeffer <max@arangodb.com>
Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Max Neunhoeffer <max@arangodb.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Christopher Kohlhoff <chris.kohlhoff@clearpool.io>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Jes Sorensen <jes.sorensen@gmail.com>
Cc: <stable@vger.kernel.org> [5.1+]
Link: http://lkml.kernel.org/r/20200214170211.561524-1-rpenyaev@suse.de
References: https://bugzilla.kernel.org/show_bug.cgi?id=205933
Bisected-by: Max Neunhoeffer <max@arangodb.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agomm: do not allow MADV_PAGEOUT for CoW pages
Michal Hocko [Sun, 22 Mar 2020 01:22:26 +0000 (18:22 -0700)]
mm: do not allow MADV_PAGEOUT for CoW pages

Jann has brought up a very interesting point [1].  While shared pages
are excluded from MADV_PAGEOUT normally, CoW pages can be easily
reclaimed that way.  This can lead to all sorts of hard to debug
problems.  E.g.  performance problems outlined by Daniel [2].

There are runtime environments where there is a substantial memory
shared among security domains via CoW memory and a easy to reclaim way
of that memory, which MADV_{COLD,PAGEOUT} offers, can lead to either
performance degradation in for the parent process which might be more
privileged or even open side channel attacks.

The feasibility of the latter is not really clear to me TBH but there is
no real reason for exposure at this stage.  It seems there is no real
use case to depend on reclaiming CoW memory via madvise at this stage so
it is much easier to simply disallow it and this is what this patch
does.  Put it simply MADV_{PAGEOUT,COLD} can operate only on the
exclusively owned memory which is a straightforward semantic.

[1] http://lkml.kernel.org/r/CAG48ez0G3JkMq61gUmyQAaCq=_TwHbi1XKzWRooxZkv08PQKuw@mail.gmail.com
[2] http://lkml.kernel.org/r/CAKOZueua_v8jHCpmEtTB6f3i9e2YnmX4mqdYVWhV4E=Z-n+zRQ@mail.gmail.com

Fixes: 9c276cc65a58 ("mm: introduce MADV_COLD")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Daniel Colascione <dancol@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200312082248.GS23944@dhcp22.suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agomm, memcg: throttle allocators based on ancestral memory.high
Chris Down [Sun, 22 Mar 2020 01:22:23 +0000 (18:22 -0700)]
mm, memcg: throttle allocators based on ancestral memory.high

Prior to this commit, we only directly check the affected cgroup's
memory.high against its usage.  However, it's possible that we are being
reclaimed as a result of hitting an ancestor memory.high and should be
penalised based on that, instead.

This patch changes memory.high overage throttling to use the largest
overage in its ancestors when considering how many penalty jiffies to
charge.  This makes sure that we penalise poorly behaving cgroups in the
same way regardless of at what level of the hierarchy memory.high was
breached.

Fixes: 0e4b01df8659 ("mm, memcg: throttle allocators when failing reclaim over memory.high")
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Chris Down <chris@chrisdown.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nathan Chancellor <natechancellor@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: <stable@vger.kernel.org> [5.4.x+]
Link: http://lkml.kernel.org/r/8cd132f84bd7e16cdb8fde3378cdbf05ba00d387.1584036142.git.chris@chrisdown.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agomm, memcg: fix corruption on 64-bit divisor in memory.high throttling
Chris Down [Sun, 22 Mar 2020 01:22:20 +0000 (18:22 -0700)]
mm, memcg: fix corruption on 64-bit divisor in memory.high throttling

Commit 0e4b01df8659 had a bunch of fixups to use the right division
method.  However, it seems that after all that it still wasn't right --
div_u64 takes a 32-bit divisor.

The headroom is still large (2^32 pages), so on mundane systems you
won't hit this, but this should definitely be fixed.

Fixes: 0e4b01df8659 ("mm, memcg: throttle allocators when failing reclaim over memory.high")
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Chris Down <chris@chrisdown.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nathan Chancellor <natechancellor@gmail.com>
Cc: <stable@vger.kernel.org> [5.4.x+]
Link: http://lkml.kernel.org/r/80780887060514967d414b3cd91f9a316a16ab98.1584036142.git.chris@chrisdown.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agopage-flags: fix a crash at SetPageError(THP_SWAP)
Qian Cai [Sun, 22 Mar 2020 01:22:17 +0000 (18:22 -0700)]
page-flags: fix a crash at SetPageError(THP_SWAP)

Commit bd4c82c22c36 ("mm, THP, swap: delay splitting THP after swapped
out") supported writing THP to a swap device but forgot to upgrade an
older commit df8c94d13c7e ("page-flags: define behavior of FS/IO-related
flags on compound pages") which could trigger a crash during THP
swapping out with DEBUG_VM_PGFLAGS=y,

  kernel BUG at include/linux/page-flags.h:317!

  page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page))
  page:fffff3b2ec3a8000 refcount:512 mapcount:0 mapping:000000009eb0338c index:0x7f6e58200 head:fffff3b2ec3a8000 order:9 compound_mapcount:0 compound_pincount:0
  anon flags: 0x45fffe0000d8454(uptodate|lru|workingset|owner_priv_1|writeback|head|reclaim|swapbacked)

  end_swap_bio_write()
    SetPageError(page)
      VM_BUG_ON_PAGE(1 && PageCompound(page))

  <IRQ>
  bio_endio+0x297/0x560
  dec_pending+0x218/0x430 [dm_mod]
  clone_endio+0xe4/0x2c0 [dm_mod]
  bio_endio+0x297/0x560
  blk_update_request+0x201/0x920
  scsi_end_request+0x6b/0x4b0
  scsi_io_completion+0x509/0x7e0
  scsi_finish_command+0x1ed/0x2a0
  scsi_softirq_done+0x1c9/0x1d0
  __blk_mqnterrupt+0xf/0x20
  </IRQ>

Fix by checking PF_NO_TAIL in those places instead.

Fixes: bd4c82c22c36 ("mm, THP, swap: delay splitting THP after swapped out")
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200310235846.1319-1-cai@lca.pw
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agomm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case
Baoquan He [Sun, 22 Mar 2020 01:22:13 +0000 (18:22 -0700)]
mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case

In section_deactivate(), pfn_to_page() doesn't work any more after
ms->section_mem_map is resetting to NULL in SPARSEMEM|!VMEMMAP case.  It
causes a hot remove failure:

  kernel BUG at mm/page_alloc.c:4806!
  invalid opcode: 0000 [#1] SMP PTI
  CPU: 3 PID: 8 Comm: kworker/u16:0 Tainted: G        W         5.5.0-next-20200205+ #340
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
  Workqueue: kacpi_hotplug acpi_hotplug_work_fn
  RIP: 0010:free_pages+0x85/0xa0
  Call Trace:
   __remove_pages+0x99/0xc0
   arch_remove_memory+0x23/0x4d
   try_remove_memory+0xc8/0x130
   __remove_memory+0xa/0x11
   acpi_memory_device_remove+0x72/0x100
   acpi_bus_trim+0x55/0x90
   acpi_device_hotplug+0x2eb/0x3d0
   acpi_hotplug_work_fn+0x1a/0x30
   process_one_work+0x1a7/0x370
   worker_thread+0x30/0x380
   kthread+0x112/0x130
   ret_from_fork+0x35/0x40

Let's move the ->section_mem_map resetting after
depopulate_section_memmap() to fix it.

[akpm@linux-foundation.org: remove unneeded initialization, per David]
Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200307084229.28251-2-bhe@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agomemcg: fix NULL pointer dereference in __mem_cgroup_usage_unregister_event
Chunguang Xu [Sun, 22 Mar 2020 01:22:10 +0000 (18:22 -0700)]
memcg: fix NULL pointer dereference in __mem_cgroup_usage_unregister_event

An eventfd monitors multiple memory thresholds of the cgroup, closes them,
the kernel deletes all events related to this eventfd.  Before all events
are deleted, another eventfd monitors the memory threshold of this cgroup,
leading to a crash:

  BUG: kernel NULL pointer dereference, address: 0000000000000004
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  PGD 800000033058e067 P4D 800000033058e067 PUD 3355ce067 PMD 0
  Oops: 0002 [#1] SMP PTI
  CPU: 2 PID: 14012 Comm: kworker/2:6 Kdump: loaded Not tainted 5.6.0-rc4 #3
  Hardware name: LENOVO 20AWS01K00/20AWS01K00, BIOS GLET70WW (2.24 ) 05/21/2014
  Workqueue: events memcg_event_remove
  RIP: 0010:__mem_cgroup_usage_unregister_event+0xb3/0x190
  RSP: 0018:ffffb47e01c4fe18 EFLAGS: 00010202
  RAX: 0000000000000001 RBX: ffff8bb223a8a000 RCX: 0000000000000001
  RDX: 0000000000000001 RSI: ffff8bb22fb83540 RDI: 0000000000000001
  RBP: ffffb47e01c4fe48 R08: 0000000000000000 R09: 0000000000000010
  R10: 000000000000000c R11: 071c71c71c71c71c R12: ffff8bb226aba880
  R13: ffff8bb223a8a480 R14: 0000000000000000 R15: 0000000000000000
  FS:  0000000000000000(0000) GS:ffff8bb242680000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000004 CR3: 000000032c29c003 CR4: 00000000001606e0
  Call Trace:
    memcg_event_remove+0x32/0x90
    process_one_work+0x172/0x380
    worker_thread+0x49/0x3f0
    kthread+0xf8/0x130
    ret_from_fork+0x35/0x40
  CR2: 0000000000000004

We can reproduce this problem in the following ways:

1. We create a new cgroup subdirectory and a new eventfd, and then we
   monitor multiple memory thresholds of the cgroup through this eventfd.

2.  closing this eventfd, and __mem_cgroup_usage_unregister_event ()
   will be called multiple times to delete all events related to this
   eventfd.

The first time __mem_cgroup_usage_unregister_event() is called, the
kernel will clear all items related to this eventfd in thresholds->
primary.

Since there is currently only one eventfd, thresholds-> primary becomes
empty, so the kernel will set thresholds-> primary and hresholds-> spare
to NULL.  If at this time, the user creates a new eventfd and monitor
the memory threshold of this cgroup, kernel will re-initialize
thresholds-> primary.

Then when __mem_cgroup_usage_unregister_event () is called for the
second time, because thresholds-> primary is not empty, the system will
access thresholds-> spare, but thresholds-> spare is NULL, which will
trigger a crash.

In general, the longer it takes to delete all events related to this
eventfd, the easier it is to trigger this problem.

The solution is to check whether the thresholds associated with the
eventfd has been cleared when deleting the event.  If so, we do nothing.

[akpm@linux-foundation.org: fix comment, per Kirill]
Fixes: 907860ed381a ("cgroups: make cftype.unregister_event() void-returning")
Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/077a6f67-aefa-4591-efec-f2f3af2b0b02@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agoMerge tag 'block-5.6-20200320' of git://git.kernel.dk/linux-block
Linus Torvalds [Sat, 21 Mar 2020 19:08:26 +0000 (12:08 -0700)]
Merge tag 'block-5.6-20200320' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "Just two NVMe fabrics fixes that should go into 5.6"

* tag 'block-5.6-20200320' of git://git.kernel.dk/linux-block:
  nvmet-tcp: set MSG_MORE only if we actually have more to send
  nvme-rdma: Avoid double freeing of async event data

4 years agoMerge tag 'io_uring-5.6-20200320' of git://git.kernel.dk/linux-block
Linus Torvalds [Sat, 21 Mar 2020 18:54:47 +0000 (11:54 -0700)]
Merge tag 'io_uring-5.6-20200320' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "Two different fixes in here:

   - Fix for a potential NULL pointer deref for links with async or
     drain marked (Pavel)

   - Fix for not properly checking RLIMIT_NOFILE for async punted
     operations.

     This affects openat/openat2, which were added this cycle, and
     accept4. I did a full audit of other cases where we might check
     current->signal->rlim[] and found only RLIMIT_FSIZE for buffered
     writes and fallocate. That one is fixed and queued for 5.7 and
     marked stable"

* tag 'io_uring-5.6-20200320' of git://git.kernel.dk/linux-block:
  io_uring: make sure accept honor rlimit nofile
  io_uring: make sure openat/openat2 honor rlimit nofile
  io_uring: NULL-deref for IOSQE_{ASYNC,DRAIN}

4 years agoMerge branch 'turbostat' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux
Linus Torvalds [Sat, 21 Mar 2020 18:50:36 +0000 (11:50 -0700)]
Merge branch 'turbostat' of git://git./linux/kernel/git/lenb/linux

Pull turbostat updates from Len Brown:
 "Update to turbostat v20.03.20.

  These patches unlock the full turbostat features for some new
  machines, plus a couple other minor tweaks"

* 'turbostat' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
  tools/power turbostat: update version
  tools/power turbostat: Print cpuidle information
  tools/power turbostat: Fix 32-bit capabilities warning
  tools/power turbostat: Fix missing SYS_LPI counter on some Chromebooks
  tools/power turbostat: Support Elkhart Lake
  tools/power turbostat: Support Jasper Lake
  tools/power turbostat: Support Ice Lake server
  tools/power turbostat: Support Tiger Lake
  tools/power turbostat: Fix gcc build warnings
  tools/power turbostat: Support Cometlake

4 years agoMerge tag 'powerpc-5.6-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc...
Linus Torvalds [Sat, 21 Mar 2020 15:51:45 +0000 (08:51 -0700)]
Merge tag 'powerpc-5.6-5' of git://git./linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "Two fixes for bugs introduced this cycle:

   - fix a crash when shutting down a KVM PR guest (our original style
     of KVM which doesn't use hypervisor mode)

   - fix for the recently added 32-bit KASAN_VMALLOC support

  Thanks to: Christophe Leroy, Greg Kurz, Sean Christopherson"

* tag 'powerpc-5.6-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  KVM: PPC: Fix kernel crash with PR KVM
  powerpc/kasan: Fix shadow memory protection with CONFIG_KASAN_VMALLOC

4 years agotools/power turbostat: update version
Len Brown [Mon, 18 Nov 2019 02:18:31 +0000 (21:18 -0500)]
tools/power turbostat: update version

A stitch in time saves nine.

Signed-off-by: Len Brown <len.brown@intel.com>
4 years agotools/power turbostat: Print cpuidle information
Len Brown [Sat, 21 Mar 2020 04:47:47 +0000 (00:47 -0400)]
tools/power turbostat: Print cpuidle information

Print cpuidle driver and governor.

Originally-by: Antti Laakso <antti.laakso@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
4 years agoMerge branch 'nvme-5.6-rc6' of git://git.infradead.org/nvme into block-5.6
Jens Axboe [Sat, 21 Mar 2020 01:14:16 +0000 (19:14 -0600)]
Merge branch 'nvme-5.6-rc6' of git://git.infradead.org/nvme into block-5.6

Pull NVMe fixes from Keith:

"Two late nvme fabrics fixes for 5.6: a double free with the rdma
 transport, and a regression fix for tcp; please pull."

* 'nvme-5.6-rc6' of git://git.infradead.org/nvme:
  nvmet-tcp: set MSG_MORE only if we actually have more to send
  nvme-rdma: Avoid double freeing of async event data

4 years agobtrfs: fix removal of raid[56|1c34} incompat flags after removing block group
Filipe Manana [Fri, 20 Mar 2020 18:43:48 +0000 (18:43 +0000)]
btrfs: fix removal of raid[56|1c34} incompat flags after removing block group

We are incorrectly dropping the raid56 and raid1c34 incompat flags when
there are still raid56 and raid1c34 block groups, not when we do not any
of those anymore. The logic just got unintentionally broken after adding
the support for the raid1c34 modes.

Fix this by clear the flags only if we do not have block groups with the
respective profiles.

Fixes: 9c907446dce3 ("btrfs: drop incompat bit for raid1c34 after last block group is gone")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agonvmet-tcp: set MSG_MORE only if we actually have more to send
Sagi Grimberg [Thu, 12 Mar 2020 23:06:38 +0000 (16:06 -0700)]
nvmet-tcp: set MSG_MORE only if we actually have more to send

When we send PDU data, we want to optimize the tcp stack
operation if we have more data to send. So when we set MSG_MORE
when:
- We have more fragments coming in the batch, or
- We have a more data to send in this PDU
- We don't have a data digest trailer
- We optimize with the SUCCESS flag and omit the NVMe completion
  (used if sq_head pointer update is disabled)

This addresses a regression in QD=1 with SUCCESS flag optimization
as we unconditionally set MSG_MORE when we didn't actually have
more data to send.

Fixes: 70583295388a ("nvmet-tcp: implement C2HData SUCCESS optimization")
Reported-by: Mark Wunderlich <mark.wunderlich@intel.com>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
4 years agoMerge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Linus Torvalds [Fri, 20 Mar 2020 16:28:25 +0000 (09:28 -0700)]
Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux

Pull arm64 fixes from Will Deacon:

 - Fix panic() when it occurs during secondary CPU startup

 - Fix "kpti=off" when KASLR is enabled

 - Fix howler in compat syscall table for vDSO clock_getres() fallback

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64: compat: Fix syscall number of compat_clock_getres
  arm64: kpti: Fix "kpti=off" when KASLR is enabled
  arm64: smp: fix crash_smp_send_stop() behaviour
  arm64: smp: fix smp_send_stop() behaviour

4 years agoMerge tag 'char-misc-5.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh...
Linus Torvalds [Fri, 20 Mar 2020 16:24:22 +0000 (09:24 -0700)]
Merge tag 'char-misc-5.6-rc7' of git://git./linux/kernel/git/gregkh/char-misc

Pull char/misc driver fixes from Greg KH:
 "Here are some small different driver fixes for 5.6-rc7:

   - binderfs fix, yet again

   - slimbus new device id added

   - hwtracing bugfixes for reported issues and a new device id

  All of these have been in linux-next with no reported issues"

* tag 'char-misc-5.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
  intel_th: pci: Add Elkhart Lake CPU support
  intel_th: Fix user-visible error codes
  intel_th: msu: Fix the unexpected state warning
  stm class: sys-t: Fix the use of time_after()
  slimbus: ngd: add v2.1.0 compatible
  binderfs: use refcount for binder control devices too

4 years agoMerge tag 'staging-5.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh...
Linus Torvalds [Fri, 20 Mar 2020 16:20:38 +0000 (09:20 -0700)]
Merge tag 'staging-5.6-rc7' of git://git./linux/kernel/git/gregkh/staging

Pull staging/IIO fixes from Greg KH:
 "Here are a number of small staging and IIO driver fixes for 5.6-rc7

  Nothing major here, just resolutions for some reported problems:
   - iio bugfixes for a number of different drivers
   - greybus loopback_test fixes
   - wfx driver fixes

  All of these have been in linux-next with no reported issues"

* tag 'staging-5.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
  staging: rtl8188eu: Add device id for MERCUSYS MW150US v2
  staging: greybus: loopback_test: fix potential path truncations
  staging: greybus: loopback_test: fix potential path truncation
  staging: greybus: loopback_test: fix poll-mask build breakage
  staging: wfx: fix RCU usage between hif_join() and ieee80211_bss_get_ie()
  staging: wfx: fix RCU usage in wfx_join_finalize()
  staging: wfx: make warning about pending frame less scary
  staging: wfx: fix lines ending with a comma instead of a semicolon
  staging: wfx: fix warning about freeing in-use mutex during device unregister
  staging/speakup: fix get_word non-space look-ahead
  iio: ping: set pa_laser_ping_cfg in of_ping_match
  iio: chemical: sps30: fix missing triggered buffer dependency
  iio: st_sensors: remap SMO8840 to LIS2DH12
  iio: light: vcnl4000: update sampling periods for vcnl4040
  iio: light: vcnl4000: update sampling periods for vcnl4200
  iio: accel: adxl372: Set iio_chan BE
  iio: magnetometer: ak8974: Fix negative raw values in sysfs
  iio: trigger: stm32-timer: disable master mode when stopping
  iio: adc: stm32-dfsdm: fix sleep in atomic context
  iio: adc: at91-sama5d2_adc: fix differential channels in triggered mode

4 years agoMerge tag 'usb-5.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Linus Torvalds [Fri, 20 Mar 2020 16:16:35 +0000 (09:16 -0700)]
Merge tag 'usb-5.6-rc7' of git://git./linux/kernel/git/gregkh/usb

Pull USB fixes from Greg KH:
 "Here are some small USB fixes for 5.6-rc7. And there's a thunderbolt
  driver fix thrown in for good measure as well.

  These fixes are:
   - new device ids for usb-serial drivers
   - thunderbolt error code fix
   - xhci driver fixes
   - typec fixes
   - cdc-acm driver fixes
   - chipidea driver fix
   - more USB quirks added for devices that need them.

  All of these have been in linux-next with no reported issues"

* tag 'usb-5.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
  USB: cdc-acm: fix rounding error in TIOCSSERIAL
  USB: cdc-acm: fix close_delay and closing_wait units in TIOCSSERIAL
  usb: quirks: add NO_LPM quirk for RTL8153 based ethernet adapters
  usb: chipidea: udc: fix sleeping function called from invalid context
  USB: serial: pl2303: add device-id for HP LD381
  USB: serial: option: add ME910G1 ECM composition 0x110b
  usb: host: xhci-plat: add a shutdown
  usb: typec: ucsi: displayport: Fix a potential race during registration
  usb: typec: ucsi: displayport: Fix NULL pointer dereference
  USB: Disable LPM on WD19's Realtek Hub
  usb: xhci: apply XHCI_SUSPEND_DELAY to AMD XHCI controller 1022:145c
  xhci: Do not open code __print_symbolic() in xhci trace events
  thunderbolt: Fix error code in tb_port_is_width_supported()

4 years agoMerge tag 'tty-5.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Linus Torvalds [Fri, 20 Mar 2020 16:13:35 +0000 (09:13 -0700)]
Merge tag 'tty-5.6-rc7' of git://git./linux/kernel/git/gregkh/tty

Pull tty fixes from Greg KH:
 "Here are three small tty_io bugfixes for reported issues that Eric has
  resolved for 5.6-rc7

  All of these have been in linux-next with no reported issues"

* tag 'tty-5.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  tty: fix compat TIOCGSERIAL checking wrong function ptr
  tty: fix compat TIOCGSERIAL leaking uninitialized memory
  tty: drop outdated comments about release_tty() locking