git.osdn.net Git - qmiga/qemu.git/log

hw/block/nvme: allow cmb and pmr to coexist

With BAR 4 now free to use, allow PMR and CMB to be enabled
simultaneously.

Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

hw/block/nvme: move msix table and pba to BAR 0

In the interest of supporting both CMB and PMR to be enabled on the same
device, move the MSI-X table and pending bit array out of BAR 4 and into
BAR 0.

This is a simplified version of the patch contributed by Andrzej
Jakowski (see [1]). Leaving the CMB at offset 0 removes the need for
changes to CMB address mapping code.

[1]: https://lore.kernel.org/qemu-devel/20200729220107.37758-3-andrzej.jakowski@linux.intel.com/

Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Tested-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

hw/block/nvme: indicate CMB support through controller capabilities register

This patch sets CMBS bit in controller capabilities register when user
configures NVMe driver with CMB support, so capabilites are correctly
reported to guest OS.

Signed-off-by: Andrzej Jakowski <andrzej.jakowski@linux.intel.com>
Reviewed-by: Maxim Levitsky <mlevitsky@gmail.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

hw/block/nvme: fix 64 bit register hi/lo split writes

64 bit registers like ASQ and ACQ should be writable by both a hi/lo 32
bit write combination as well as a plain 64 bit write. The spec does not
define ordering on the hi/lo split, but the code currently assumes that
the low order bits are written first. Additionally, the code does not
consider that another address might already have been written into the
register, causing the OR'ing to result in a bad address.

Fix this by explicitly overwriting only the low or high order bits for
32 bit writes.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>

hw/block/nvme: add size to mmio read/write trace events

Add the size of the mmio read/write to the trace event.

Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

hw/block/nvme: trigger async event during injecting smart warning

During smart critical warning injection by setting property from QMP
command, also try to trigger asynchronous event.

Suggested by Keith, if a event has already been raised, there is no
need to enqueue the duplicate event any more.

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
[k.jensen: fix typo in commit message]
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

hw/block/nvme: add smart_critical_warning property

There is a very low probability that hitting physical NVMe disk
hardware critical warning case, it's hard to write & test a monitor
agent service.

For debugging purposes, add a new 'smart_critical_warning' property
to emulate this situation.

The orignal version of this change is implemented by adding a fixed
property which could be initialized by QEMU command line. Suggested
by Philippe & Klaus, rework like current version.

Test with this patch:
1, change smart_critical_warning property for a running VM:
#virsh qemu-monitor-command nvme-upstream '{ "execute": "qom-set",
  "arguments": { "path": "/machine/peripheral-anon/device[0]",
  "property": "smart_critical_warning", "value":16 } }'
2, run smartctl in guest
#smartctl -H -l error /dev/nvme0n1

  === START OF SMART DATA SECTION ===
  SMART overall-health self-assessment test result: FAILED!
  - volatile memory backup device has failed

Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

nvme: introduce bit 5 for critical warning

According to NVM Express v1.4, Section 5.14.1.2 ("SMART / Health
Information"), introduce bit 5 for "Persistent Memory Region has become
read-only or unreliable".

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
[k.jensen: minor brush ups in commit message]
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

hw/block/nvme: fix zone write finalize

The zone write pointer is unconditionally advanced, even for write
faults. Make sure that the zone is always transitioned to Full if the
write pointer reaches zone capacity.

Cc: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

hw/block/nvme: remove unused argument in nvme_ns_setup

nvme_ns_setup() finally does not have nothing to do with NvmeCtrl
instance.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

hw/block/nvme: split setup and register for namespace

In NVMe, namespace is being attached to process I/O.  We register NVMe
namespace to a controller via nvme_register_namespace() during
nvme_ns_setup().  This is main reason of receiving NvmeCtrl object
instance to this function to map the namespace to a controller.

To make namespace instance more independent, it should be split into two
parts: setup and register.  This patch split them into two differnt
parts, and finally nvme_ns_setup() does not have nothing to do with
NvmeCtrl instance at all.

This patch is a former patch to introduce NVMe subsystem scheme to the
existing design especially for multi-path.  In that case, it should be
split into two to make namespace independent from a controller.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

hw/block/nvme: remove unused argument in nvme_ns_init_blk

Removed no longer used aregument NvmeCtrl object in nvme_ns_init_blk().

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

hw/block/nvme: open code for volatile write cache

Volatile Write Cache(VWC) feature is set in nvme_ns_setup() in the
initial time. This feature is related to block device backed, but this
feature is controlled in controller level via Set/Get Features command.

This patch removed dependency between nvme and nvme-ns to manage the VWC
flag value. Also, it open coded the Get Features for VWC to check all
namespaces attached to the controller, and if false detected, return
directly false.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
[k.jensen: report write cache preset if present on ANY namespace]
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

hw/block/nvme: remove unused argument in nvme_ns_init_zoned

nvme_ns_init_zoned() has no use for given NvmeCtrl object.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

hw/block/nvme: Correct error status for unaligned ZA

TP 4053 says (in section 2.3.1.1) -
... if a Zone Append command specifies a ZSLBA that is not the lowest
logical block address in that zone, then the controller shall abort
that command with a status code of Invalid Field In Command.

In the code, Zone Invalid Write is returned instead, fix this.

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

hw/block/nvme: remove unnecessary check for append

nvme_io_cmd already checks if the namespace supports the Zone Append
command, so the removed check is dead code.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Tested-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>

hw/block/nvme: add missing string representations for commands

Add missing string representations for a couple of new commands.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Tested-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>

hw/block/nvme: zero out zones on reset

The zoned command set specification states that "All logical blocks in a
zone *shall* be marked as deallocated when [the zone is reset]". Since
the device guarantees 0x00 to be read from deallocated blocks we have to
issue a pwrite_zeroes since we cannot be sure that a discard will do
anything. But typically, this will be achieved with an efficient
unmap/discard operation.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Tested-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>

hw/block/nvme: enum style fix

Align with existing style and use a typedef for header-file enums.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Tested-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>

hw/block/nvme: merge implicitly/explicitly opened processing masks

Implicitly and explicitly opended zones are always bulk processed
together, so merge the two processing masks.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Tested-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>

hw/block/nvme: fix shutdown/reset logic

A shutdown is only about flushing stuff. It is the host that should
delete any queues, so do not perform a reset here.

Also, on shutdown, make sure that the PMR is flushed if in use.

Fixes: 368f4e752cf9 ("hw/block/nvme: Process controller reset and shutdown differently")
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Tested-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>

hw/block/nvme: conditionally enable DULBE for zoned namespaces

The device uses the BDRV_BLOCK_ZERO flag to determine the "deallocated"
status of logical blocks. Since the zoned namespaces command set
specification defines that logical blocks SHALL be marked as deallocated
when the zone is in the Empty or Offline states, DULBE can only be
supported if the zone size is a multiple of the calculated deallocation
granularity (reported in NPDG) which depends on the underlying block
device cluster size (if applicable) or the configured
discard_granularity.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

hw/block/nvme: fix for non-msix machines

Commit 1c0c2163aa08 ("hw/block/nvme: verify msix_init_exclusive_bar()
return value") had the unintended effect of breaking support on
several platforms not supporting MSI-X.

Still check for errors, but only report that MSI-X is unsupported
instead of bailing out.

Fixes: 1c0c2163aa08 ("hw/block/nvme: verify msix_init_exclusive_bar() return value")
Fixes: fbf2e5375e33 ("hw/block/nvme: Verify msix_vector_use() returned value")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

hw/block/nvme: Document zoned parameters in usage text

Added brief descriptions of the new device properties that are
now available to users to configure features of Zoned Namespace
Command Set in the emulator.

This patch is for documentation only, no functionality change.

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Niklas Cassel <Niklas.Cassel@wdc.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

hw/block/nvme: Support Zone Descriptor Extensions

Zone Descriptor Extension is a label that can be assigned to a zone.
It can be set to an Empty zone and it stays assigned until the zone
is reset.

This commit adds a new optional module property,
"zoned.descr_ext_size". Its value must be a multiple of 64 bytes.
If this value is non-zero, it becomes possible to assign extensions
of that size to any Empty zones. The default value for this property
is 0, therefore setting extensions is disabled by default.

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Niklas Cassel <Niklas.Cassel@wdc.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

hw/block/nvme: Introduce max active and open zone limits

Add two module properties, "zoned.max_active" and "zoned.max_open"
to control the maximum number of zones that can be active or open.
Once these variables are set to non-default values, these limits are
checked during I/O and Too Many Active or Too Many Open command status
is returned if they are exceeded.

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Niklas Cassel <Niklas.Cassel@wdc.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

hw/block/nvme: Support Zoned Namespace Command Set

The emulation code has been changed to advertise NVM Command Set when
"zoned" device property is not set (default) and Zoned Namespace
Command Set otherwise.

Define values and structures that are needed to support Zoned
Namespace Command Set (NVMe TP 4053) in PCI NVMe controller emulator.
Define trace events where needed in newly introduced code.

In order to improve scalability, all open, closed and full zones
are organized in separate linked lists. Consequently, almost all
zone operations don't require scanning of the entire zone array
(which potentially can be quite large) - it is only necessary to
enumerate one or more zone lists.

Handlers for three new NVMe commands introduced in Zoned Namespace
Command Set specification are added, namely for Zone Management
Receive, Zone Management Send and Zone Append.

Device initialization code has been extended to create a proper
configuration for zoned operation using device properties.

Read/Write command handler is modified to only allow writes at the
write pointer if the namespace is zoned. For Zone Append command,
writes implicitly happen at the write pointer and the starting write
pointer value is returned as the result of the command. Write Zeroes
handler is modified to add zoned checks that are identical to those
done as a part of Write flow.

Subsequent commits in this series add ZDE support and checks for
active and open zone limits.

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com>
Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Niklas Cassel <Niklas.Cassel@wdc.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

nvme: Make ZNS-related definitions

Define values and structures that are needed to support Zoned
Namespace Command Set (NVMe TP 4053).

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

hw/block/nvme: Support allocated CNS command variants

Many CNS commands have "allocated" command variants. These include
a namespace as long as it is allocated, that is a namespace is
included regardless if it is active (attached) or not.

While these commands are optional (they are mandatory for controllers
supporting the namespace attachment command), our QEMU implementation
is more complete by actually providing support for these CNS values.

However, since our QEMU model currently does not support the namespace
attachment command, these new allocated CNS commands will return the
same result as the active CNS command variants.

The reason for not hooking up this command completely is because the
NVMe specification requires the namespace management command to be
supported if the namespace attachment command is supported.

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

hw/block/nvme: Add support for Namespace Types

Define the structures and constants required to implement
Namespace Types support.

Namespace Types introduce a new command set, "I/O Command Sets",
that allows the host to retrieve the command sets associated with
a namespace. Introduce support for the command set and enable
detection for the NVM Command Set.

The new workflows for identify commands rely heavily on zero-filled
identify structs. E.g., certain CNS commands are defined to return
a zero-filled identify struct when an inactive namespace NSID
is supplied.

Add a helper function in order to avoid code duplication when
reporting zero-filled identify structures.

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

hw/block/nvme: Add Commands Supported and Effects log

This log page becomes necessary to implement to allow checking for
Zone Append command support in Zoned Namespace Command Set.

This commit adds the code to report this log page for NVM Command
Set only. The parts that are specific to zoned operation will be
added later in the series.

All incoming admin and i/o commands are now only processed if their
corresponding support bits are set in this log. This provides an
easy way to control what commands to support and what not to
depending on set CC.CSS.

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Niklas Cassel <Niklas.Cassel@wdc.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

hw/block/nvme: Combine nvme_write_zeroes() and nvme_write()

Move write processing to nvme_do_write() that now handles both WRITE
and WRITE ZEROES. Both nvme_write() and nvme_write_zeroes() become
inline helper functions.

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Niklas Cassel <Niklas.Cassel@wdc.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

hw/block/nvme: Separate read and write handlers

The majority of code in nvme_rw() is becoming read- or write-specific.
Move these parts to two separate handlers, nvme_read() and nvme_write()
to make the code more readable and to remove multiple is_write checks
that has been present in the i/o path.

This is a refactoring patch, no change in functionality.

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Niklas Cassel <Niklas.Cassel@wdc.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

hw/block/nvme: Generate namespace UUIDs

In NVMe 1.4, a namespace must report an ID descriptor of UUID type
if it doesn't support EUI64 or NGUID. Add a new namespace property,
"uuid", that provides the user the option to either specify the UUID
explicitly or have a UUID generated automatically every time a
namespace is initialized.

Suggested-by: Klaus Jensen <k.jensen@samsung.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Niklas Cassel <Niklas.Cassel@wdc.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

hw/block/nvme: Process controller reset and shutdown differently

Controller reset ans subsystem shutdown are handled very much the same
in the current code, but some of the steps should be different in these
two cases.

Introduce two new functions, nvme_reset_ctrl() and nvme_shutdown_ctrl(),
to separate some portions of the code from nvme_clear_ctrl(). The steps
that are made different between reset and shutdown are that BAR.CC is not
reset to zero upon the shutdown and namespace data is flushed to
backing storage as a part of shutdown handling, but not upon reset.

Suggested-by: Klaus Jensen <k.jensen@samsung.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

hw/block/nvme: fix bad clearing of CAP

Commit 37712e00b1f0 ("hw/block/nvme: factor out pmr setup") changed the
control flow such that the CAP register is erronously cleared after
nvme_init_pmr() has configured it. Since the entire NvmeCtrl structure
is zero-filled initially, there is no need for the explicit clearing, so
just remove it.

Fixes: 37712e00b1f0 ("hw/block/nvme: factor out pmr setup")
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>

hw/block/nvme: add compare command

Add the Compare command.

This implementation uses a bounce buffer to read in the data from
storage and then compare with the host supplied buffer.

Signed-off-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
[k.jensen: rebased]
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>

hw/block/nvme: add the dataset management command

Add support for the Dataset Management command and the Deallocate
attribute. Deallocation results in discards being sent to the underlying
block device. Whether of not the blocks are actually deallocated is
affected by the same factors as Write Zeroes (see previous commit).

     format | discard | dsm (512B)  dsm (4KiB)  dsm (64KiB)
    --------------------------------------------------------
      qcow2    ignore   n           n           n
      qcow2    unmap    n           n           y
      raw      ignore   n           n           n
      raw      unmap    n           y           y

Again, a raw format and 4KiB LBAs are preferable.

In order to set the Namespace Preferred Deallocate Granularity and
Alignment fields (NPDG and NPDA), choose a sane minimum discard
granularity of 4KiB. If we are using a passthru device supporting
discard at a 512B granularity, user should set the discard_granularity
property explicitly. NPDG and NPDA will also account for the
cluster_size of the block driver if required (i.e. for QCOW2).

See NVM Express 1.3d, Section 6.7 ("Dataset Management command").

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>

nvme: add namespace I/O optimization fields to shared header

This adds the NPWG, NPWA, NPDG, NPDA and NOWS family of fields to the
shared nvme.h header for use by later patches.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Fam Zheng <fam@euphon.net>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>

hw/block/nvme: add dulbe support

Add support for reporting the Deallocated or Unwritten Logical Block
Error (DULBE).

Rely on the block status flags reported by the block layer and consider
any block with the BDRV_BLOCK_ZERO flag to be deallocated.

Multiple factors affect when a Write Zeroes command result in
deallocation of blocks.

  * the underlying file system block size
  * the blockdev format
  * the 'discard' and 'logical_block_size' parameters

     format | discard | wz (512B)  wz (4KiB)  wz (64KiB)
    -----------------------------------------------------
      qcow2    ignore   n          n          y
      qcow2    unmap    n          n          y
      raw      ignore   n          y          y
      raw      unmap    n          y          y

So, this works best with an image in raw format and 4KiB LBAs, since
holes can then be punched on a per-block basis (this assumes a file
system with a 4kb block size, YMMV). A qcow2 image, uses a cluster size
of 64KiB by default and blocks will only be marked deallocated if a full
cluster is zeroed or discarded. However, this *is* consistent with the
spec since Write Zeroes "should" deallocate the block if the Deallocate
attribute is set and "may" deallocate if the Deallocate attribute is not
set. Thus, we always try to deallocate (the BDRV_REQ_MAY_UNMAP flag is
always set).

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>

hw/block/nvme: pull aio error handling

Add a new function, nvme_aio_err, to handle errors resulting from AIOs
and use this from the callbacks.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

hw/block/nvme: remove superfluous NvmeCtrl parameter

nvme_check_bounds has no use of the NvmeCtrl parameter; remove it.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>

Merge remote-tracking branch 'remotes/armbru/tags/pull-qapi-2021-02-08' into staging

QAPI patches patches for 2021-02-08

# gpg: Signature made Mon 08 Feb 2021 13:54:26 GMT
# gpg:                using RSA key 354BC8B3D7EB2A6B68674E5F3870B400EB918653
# gpg:                issuer "armbru@redhat.com"
# gpg: Good signature from "Markus Armbruster <armbru@redhat.com>" [full]
# gpg:                 aka "Markus Armbruster <armbru@pond.sub.org>" [full]
# Primary key fingerprint: 354B C8B3 D7EB 2A6B 6867  4E5F 3870 B400 EB91 8653

* remotes/armbru/tags/pull-qapi-2021-02-08:
  qapi: enable strict-optional checks
  qapi: type 'info' as Optional[QAPISourceInfo]
  qapi/gen: Drop support for QAPIGen without a file name
  qapi/commands: Simplify command registry generation
  qapi/gen: Support switching to another module temporarily
  qapi/gen: write _genc/_genh access shims
  qapi: centralize the built-in module name definition
  qapi/gen: Combine ._add_[user|system]_module
  qapi: use './builtin' as the built-in module name
  qapi: use explicitly internal module names
  qapi/gen: Replace ._begin_system_module()
  qapi: centralize is_[user|system|builtin]_module methods
  qapi/gen: inline _wrap_ifcond into end_if()
  qapi/main: handle theoretical None-return from re.match()
  qapi/events: fix visit_event typing
  qapi/commands: assert arg_type is not None

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

qapi: enable strict-optional checks

In the modules that we are checking so far, we can be stricter about the
difference between Optional[T] and T types. Enable that check.

Enabling it now will assist review on further typing and cleanup work.

Signed-off-by: John Snow <jsnow@redhat.com>
Message-Id: <20210201193747.2169670-17-jsnow@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>

qapi: type 'info' as Optional[QAPISourceInfo]

For everything typed so far, type this parameter as
Optional[QAPISourceInfo].

In the most generic case, QAPISchemaEntity's info field may be None to
represent types that come from built-in definitions. Although some
Entity types may not currently have any built-in definitions, it is not
easily possible to constrain the type except on an ad-hoc basis using
assertions.

It's easier and simpler, then, to just say it's always an Optional type.

Signed-off-by: John Snow <jsnow@redhat.com>
Message-Id: <20210201193747.2169670-16-jsnow@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>

qapi/gen: Drop support for QAPIGen without a file name

The previous commit removed the only user of QAPIGen(None). Tighten
the type hint.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Message-Id: <20210201193747.2169670-15-jsnow@redhat.com>

qapi/commands: Simplify command registry generation

QAPISchemaGenCommandVisitor.visit_command() needs to generate the
marshalling function into the current module, and also generate its
registration into the ./init system module. The latter is done
somewhat awkwardly: .__init__() creates a QAPIGenCCode that will not
be written out, each .visit_command() adds its registration to it, and
.visit_end() copies its contents into the ./init module it creates.

Instead provide the means to temporarily switch to another module.
Create the ./init module in .visit_begin(), and generate its initial
part. Add registrations to it in .visit_command(). Finish it in
.visit_end().

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Message-Id: <20210201193747.2169670-14-jsnow@redhat.com>

qapi/gen: Support switching to another module temporarily

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Message-Id: <20210201193747.2169670-13-jsnow@redhat.com>
[Commit message tweaked]

qapi/gen: write _genc/_genh access shims

Many places assume they can access these fields without checking them
first to ensure they are defined. Eliminating the _genc and _genh fields
and replacing them with functional properties that check for correct
state can ease the typing overhead by eliminating the Optional[T] return
type.

Signed-off-by: John Snow <jsnow@redhat.com>
Message-Id: <20210201193747.2169670-12-jsnow@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>

qapi: centralize the built-in module name definition

Use a constant to make it obvious we're referring to a very specific thing.

Signed-off-by: John Snow <jsnow@redhat.com>
Message-Id: <20210201193747.2169670-11-jsnow@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>

qapi/gen: Combine ._add_[user|system]_module

With callers to _add_system_module now explicitly using the './' prefix
to indicate a system module, there is no longer any reason to have
separate interfaces for adding system vs user modules; use a unified
interface that differentiates based on the name.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Message-Id: <20210201193747.2169670-10-jsnow@redhat.com>

qapi: use './builtin' as the built-in module name

Use './builtin' as the built-in module name instead of
None. Clarify the typing that this is now always a string.

Signed-off-by: John Snow <jsnow@redhat.com>
Message-Id: <20210201193747.2169670-9-jsnow@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>

qapi: use explicitly internal module names

QAPISchemaModularCVisitor._add_system_module() prefixes './' to its name
argument to make it a module name. Pass the module name instead. This
will allow us to coalesce the methods to add modules later on.

Signed-off-by: John Snow <jsnow@redhat.com>
Message-Id: <20210201193747.2169670-8-jsnow@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
[Commit message reworded]
Signed-off-by: Markus Armbruster <armbru@redhat.com>

qapi/gen: Replace ._begin_system_module()

QAPISchemaModularCVisitor._begin_system_module() is actually just for
the builtin module. Rename it to ._begin_builtin_module() and drop
its useless @name parameter.

Clarify conditionals in visit_module to make this clear.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Message-Id: <20210201193747.2169670-7-jsnow@redhat.com>

qapi: centralize is_[user|system|builtin]_module methods

Define what a module is and define what kind of a module it is once and
for all, in one place.

Signed-off-by: John Snow <jsnow@redhat.com>
Message-Id: <20210201193747.2169670-6-jsnow@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>

qapi/gen: inline _wrap_ifcond into end_if()

We assert _start_if is not None in end_if, but that's opaque to mypy.
By inlining _wrap_ifcond, that constraint becomes provable to mypy.

Signed-off-by: John Snow <jsnow@redhat.com>
Message-Id: <20210201193747.2169670-5-jsnow@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>

qapi/main: handle theoretical None-return from re.match()

Mypy cannot understand that this match can never be None, so help it
along.

Signed-off-by: John Snow <jsnow@redhat.com>
Message-Id: <20210201193747.2169670-4-jsnow@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>

qapi/events: fix visit_event typing

Actually, the arg_type can indeed be Optional.

Signed-off-by: John Snow <jsnow@redhat.com>
Message-Id: <20210201193747.2169670-3-jsnow@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>

qapi/commands: assert arg_type is not None

When boxed is True, expr.py asserts that we must have
arguments. Ultimately, this should mean that if boxed is True that
arg_type should be defined. Mypy cannot infer this, and does not support
'stateful' type inference, e.g.:

```
if x:
assert y is not None

...

if x:
y.etc()
```

does not work, because mypy does not statefully remember the conditional
assertion in the second block. Help mypy out by creating a new local
that it can track more easily.

Signed-off-by: John Snow <jsnow@redhat.com>
Message-Id: <20210201193747.2169670-2-jsnow@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>

Merge remote-tracking branch 'remotes/stsquad/tags/pull-testing-gdbstub-docs-080221-1' into staging

Testing, gdbstub and doc tweaks:

  - increase timeout on replay kernel acceptance test
  - fixes for binfmt_misc docker images
  - better gdb version detection
  - don't silently skip gdb tests
  - fix for gdbstub auxv handling
  - cleaner handling of check-tcg on tcg disabled builds
  - expand vexpress/versitile docs with examples

# gpg: Signature made Mon 08 Feb 2021 11:12:03 GMT
# gpg:                using RSA key 6685AE99E75167BCAFC8DF35FBD0DB095A9E2A44
# gpg: Good signature from "Alex Bennée (Master Work Key) <alex.bennee@linaro.org>" [full]
# Primary key fingerprint: 6685 AE99 E751 67BC AFC8  DF35 FBD0 DB09 5A9E 2A44

* remotes/stsquad/tags/pull-testing-gdbstub-docs-080221-1:
  docs/system: document an example booting the versatilepb machine
  docs/system: document an example vexpress-a15 invocation
  tests/Makefile.include: don't use TARGET_DIRS for check-tcg
  scripts/mtest2make.py: export all-%s-targets variable and use it
  tests/tcg: Replace /bin/true by true (required on macOS)
  gdbstub: Fix handle_query_xfer_auxv
  tests/tcg: don't silently skip the gdb tests
  configure: bump the minimum gdb version for check-tcg to 9.1
  configure: make version_ge more tolerant of shady version input
  tests/docker: add a docker-exec-copy-test
  tests/docker: alias docker-help target for consistency
  tests/docker: preserve original name when copying libs
  tests/docker: make _copy_with_mkdir accept missing files
  tests/docker: Fix typo in help message
  tests/docker: Fix _get_so_libs() for docker-binfmt-image
  tests/acceptance: Increase the timeout in the replay tests

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

Merge remote-tracking branch 'remotes/dg-gitlab/tags/cgs-pull-request' into staging

Generalize memory encryption models

A number of hardware platforms are implementing mechanisms whereby the
hypervisor does not have unfettered access to guest memory, in order
to mitigate the security impact of a compromised hypervisor.

AMD's SEV implements this with in-cpu memory encryption, and Intel has
its own memory encryption mechanism.  POWER has an upcoming mechanism
to accomplish this in a different way, using a new memory protection
level plus a small trusted ultravisor.  s390 also has a protected
execution environment.

The current code (committed or draft) for these features has each
platform's version configured entirely differently.  That doesn't seem
ideal for users, or particularly for management layers.

AMD SEV introduces a notionally generic machine option
"machine-encryption", but it doesn't actually cover any cases other
than SEV.

This series is a proposal to at least partially unify configuration
for these mechanisms, by renaming and generalizing AMD's
"memory-encryption" property.  It is replaced by a
"confidential-guest-support" property pointing to a platform specific
object which configures and manages the specific details.

Note to Ram Pai: the documentation I've included for PEF is very
minimal.  If you could send a patch expanding on that, it would be
very helpful.

Changes since v8:
* Rebase
* Fixed some cosmetic typos
Changes since v7:
* Tweaked and clarified meaning of the 'ready' flag
* Polished the interface to the PEF internals
* Shifted initialization for s390 PV later (I hope I've finally got
   this after apply_cpu_model() where it needs to be)
Changes since v6:
* Moved to using OBJECT_DECLARE_TYPE and OBJECT_DEFINE_TYPE macros
* Assorted minor fixes
Changes since v5:
* Renamed from "securable guest memory" to "confidential guest
   support"
* Simpler reworking of x86 boot time flash encryption
* Added a bunch of documentation
* Fixed some compile errors on POWER
Changes since v4:
* Renamed from "host trust limitation" to "securable guest memory",
   which I think is marginally more descriptive
* Re-organized initialization, because the previous model called at
   kvm_init didn't work for s390
* Assorted fixes to the s390 implementation; rudimentary testing
   (gitlab CI) only
Changes since v3:
* Rebased
* Added first cut at handling of s390 protected virtualization
Changes since RFCv2:
* Rebased
* Removed preliminary SEV cleanups (they've been merged)
* Changed name to "host trust limitation"
* Added migration blocker to the PEF code (based on SEV's version)
Changes since RFCv1:
* Rebased
* Fixed some errors pointed out by Dave Gilbert

# gpg: Signature made Mon 08 Feb 2021 06:07:27 GMT
# gpg:                using RSA key 75F46586AE61A66CC44E87DC6C38CACA20D9B392
# gpg: Good signature from "David Gibson <david@gibson.dropbear.id.au>" [full]
# gpg:                 aka "David Gibson (Red Hat) <dgibson@redhat.com>" [full]
# gpg:                 aka "David Gibson (ozlabs.org) <dgibson@ozlabs.org>" [full]
# gpg:                 aka "David Gibson (kernel.org) <dwg@kernel.org>" [unknown]
# Primary key fingerprint: 75F4 6586 AE61 A66C C44E  87DC 6C38 CACA 20D9 B392

* remotes/dg-gitlab/tags/cgs-pull-request:
  s390: Recognize confidential-guest-support option
  confidential guest support: Alter virtio default properties for protected guests
  spapr: PEF: prevent migration
  spapr: Add PEF based confidential guest support
  confidential guest support: Update documentation
  confidential guest support: Move SEV initialization into arch specific code
  confidential guest support: Introduce cgs "ready" flag
  sev: Add Error ** to sev_kvm_init()
  confidential guest support: Rework the "memory-encryption" property
  confidential guest support: Move side effect out of machine_set_memory_encryption()
  sev: Remove false abstraction of flash encryption
  confidential guest support: Introduce new confidential guest support class
  qom: Allow optional sugar props

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

docs/system: document an example booting the versatilepb machine

There is a bit more out there including Aurelien's excellent write up
and older Debian images here:

https://www.aurel32.net/info/debian_arm_qemu.php
https://people.debian.org/~aurel32/qemu/armel/

However the web is transitory and git is forever so lets add something
to the fine manual.

Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Cc: Anders Roxell <anders.roxell@linaro.org>
Cc: Aurelien Jarno <aurelien@aurel32.net>
Message-Id: <20210202134001.25738-16-alex.bennee@linaro.org>

docs/system: document an example vexpress-a15 invocation

The wiki and the web are curiously absent of the right runes to boot a
vexpress model so I had to work from first principles to work it out.
Use the more modern -drive notation so alternative backends can be
used (unlike the hardwired -sd mode).

Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Cc: Anders Roxell <anders.roxell@linaro.org>
Message-Id: <20210202134001.25738-15-alex.bennee@linaro.org>

tests/Makefile.include: don't use TARGET_DIRS for check-tcg

TARGET_DIRS reflects what we wanted to configure which in the normal
case is all our targets. However once meson has pared-down our target
list due to missing features we need to check the final list of
ninja-targets. This prevents check-tcg barfing on a --disable-tcg
build.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210202134001.25738-14-alex.bennee@linaro.org>

scripts/mtest2make.py: export all-%s-targets variable and use it

There are some places where the conditional makefile support is the
simplest solution. Now we don't expose CONFIG_TCG as a variable create
a new one that can be checked for the check-help output.

As check-tcg is a PHONY target we re-use check-softfloat to gate that
as well.

Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210202134001.25738-13-alex.bennee@linaro.org>

tests/tcg: Replace /bin/true by true (required on macOS)

/bin/true is missing on macOS, but simply "true" is available as a shell builtin.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Message-Id: <20210128135627.2067003-1-sw@weilnetz.de>
Message-Id: <20210202134001.25738-12-alex.bennee@linaro.org>

gdbstub: Fix handle_query_xfer_auxv

The main problem was that we were treating a guest address
as a host address with a mere cast.

Use the correct interface for accessing guest memory. Do not
allow offset == auxv_len, which would result in an empty packet.

Fixes: 51c623b0de1 ("gdbstub: add support to Xfer:auxv:read: packet")
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20210128201831.534033-1-richard.henderson@linaro.org>
Message-Id: <20210202134001.25738-11-alex.bennee@linaro.org>

tests/tcg: don't silently skip the gdb tests

Otherwise people won't know what they are missing.

Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20210202134001.25738-10-alex.bennee@linaro.org>

configure: bump the minimum gdb version for check-tcg to 9.1

For SVE, currently the bulk of the GDB TCG tests, we need at least GDB
9.1 to support the "ieee_half" data type we report. This only affects
when GDB tests are run; users can still use lower versions of gdb as
long as they aren't talking to an SVE enabled model. The work around
is to either get a newer gdb or disable SVE for their CPU model.

Reported-by: Claudio Fontana <cfontana@suse.de>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Cc: Luis Machado <luis.machado@linaro.org>
Message-Id: <20210202134001.25738-9-alex.bennee@linaro.org>

configure: make version_ge more tolerant of shady version input

When checking GDB versions we have to tolerate all sorts of random
distro extensions to the version string. While we already attempt to
do some of that before we call version_ge is makes sense to try and
regularise the first input by stripping extraneous -'s. While we at it
convert the old-style shell quoting into a cleaner form t shut up my
editors linter lest it confuse me by underlining the whole line.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Tested-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210202134001.25738-8-alex.bennee@linaro.org>

tests/docker: add a docker-exec-copy-test

This provides test machinery for checking the QEMU copying logic works
properly. It takes considerably less time to run than starting a
debootstrap only for it to fail later. I considered adding a remove
command to docker.py but figured that might be gold plating given the
relative size of the containers compared to the ones with actual stuff
in them.

Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20210202134001.25738-7-alex.bennee@linaro.org>

tests/docker: alias docker-help target for consistency

We have a bunch of -help targets so this will save some cognitive
dissonance. Keep the original for those with muscle memory.

Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Message-Id: <20210202134001.25738-6-alex.bennee@linaro.org>

tests/docker: preserve original name when copying libs

While it is important we chase down the symlinks to copy the correct
data we can confuse the kernel by renaming the interpreter to what is
in the binary. Extend _copy_with_mkdir to preserve the original name
of the file when asked.

Fixes: 5e33f7fead ("tests/docker: better handle symlinked libs")
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20210202134001.25738-5-alex.bennee@linaro.org>

tests/docker: make _copy_with_mkdir accept missing files

Depending on the linker/ldd setup we might get a file with no path.
Typically this is the psuedo library linux-vdso.so which doesn't
actually exist on the disk. Rather than try and catch these distro
specific edge cases just shout about it and try and continue.

Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Tested-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Message-Id: <20210202134001.25738-4-alex.bennee@linaro.org>

tests/docker: Fix typo in help message

To have the variable properly passed, we need to set it,
ie. NOUSER=1. Fix the message displayed by 'make docker'.

Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20210119052120.522069-1-f4bug@amsat.org>
Message-Id: <20210202134001.25738-3-alex.bennee@linaro.org>

tests/docker: Fix _get_so_libs() for docker-binfmt-image

Fix a variable rename mistake from commit 5e33f7fead5:

  Traceback (most recent call last):
    File "./tests/docker/docker.py", line 710, in <module>
      sys.exit(main())
    File "./tests/docker/docker.py", line 706, in main
      return args.cmdobj.run(args, argv)
    File "./tests/docker/docker.py", line 489, in run
      _copy_binary_with_libs(args.include_executable,
    File "./tests/docker/docker.py", line 149, in _copy_binary_with_libs
      libs = _get_so_libs(src)
    File "./tests/docker/docker.py", line 123, in _get_so_libs
      libs.append(s.group(1))
  NameError: name 's' is not defined

Fixes: 5e33f7fead5 ("tests/docker: better handle symlinked libs")
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20210119050149.516910-1-f4bug@amsat.org>
Message-Id: <20210202134001.25738-2-alex.bennee@linaro.org>

tests/acceptance: Increase the timeout in the replay tests

Our gitlab-CI just showed a failed test_ppc_mac99 since it was apparently
killed some few seconds before the test finished. Allow it some more
time to complete.

Signed-off-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Wainer dos Santos Moschetta <wainersm@redhat.com>
Acked-by: Pavel Dovgalyuk <pavel.dovgalyuk@ispras.ru>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20210127065222.48650-1-thuth@redhat.com>

Merge remote-tracking branch 'remotes/mcayland/tags/qemu-sparc-20210207' into staging

qemu-sparc queue

# gpg: Signature made Sun 07 Feb 2021 22:09:12 GMT
# gpg:                using RSA key CC621AB98E82200D915CC9C45BC2C56FAE0F321F
# gpg:                issuer "mark.cave-ayland@ilande.co.uk"
# gpg: Good signature from "Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>" [full]
# Primary key fingerprint: CC62 1AB9 8E82 200D 915C  C9C4 5BC2 C56F AE0F 321F

* remotes/mcayland/tags/qemu-sparc-20210207:
  utils/fifo8: add VMSTATE_FIFO8_TEST macro
  utils/fifo8: change fatal errors from abort() to assert()

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

s390: Recognize confidential-guest-support option

At least some s390 cpu models support "Protected Virtualization" (PV),
a mechanism to protect guests from eavesdropping by a compromised
hypervisor.

This is similar in function to other mechanisms like AMD's SEV and
POWER's PEF, which are controlled by the "confidential-guest-support"
machine option.  s390 is a slightly special case, because we already
supported PV, simply by using a CPU model with the required feature
(S390_FEAT_UNPACK).

To integrate this with the option used by other platforms, we
implement the following compromise:

- When the confidential-guest-support option is set, s390 will
   recognize it, verify that the CPU can support PV (failing if not)
   and set virtio default options necessary for encrypted or protected
   guests, as on other platforms.  i.e. if confidential-guest-support
   is set, we will either create a guest capable of entering PV mode,
   or fail outright.

- If confidential-guest-support is not set, guests might still be
   able to enter PV mode, if the CPU has the right model.  This may be
   a little surprising, but shouldn't actually be harmful.

To start a guest supporting Protected Virtualization using the new
option use the command line arguments:
    -object s390-pv-guest,id=pv0 -machine confidential-guest-support=pv0

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>

confidential guest support: Alter virtio default properties for protected guests

The default behaviour for virtio devices is not to use the platforms normal
DMA paths, but instead to use the fact that it's running in a hypervisor
to directly access guest memory. That doesn't work if the guest's memory
is protected from hypervisor access, such as with AMD's SEV or POWER's PEF.

So, if a confidential guest mechanism is enabled, then apply the
iommu_platform=on option so it will go through normal DMA mechanisms.
Those will presumably have some way of marking memory as shared with
the hypervisor or hardware so that DMA will work.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Greg Kurz <groug@kaod.org>

spapr: PEF: prevent migration

We haven't yet implemented the fairly involved handshaking that will be
needed to migrate PEF protected guests. For now, just use a migration
blocker so we get a meaningful error if someone attempts this (this is the
same approach used by AMD SEV).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Greg Kurz <groug@kaod.org>

spapr: Add PEF based confidential guest support

Some upcoming POWER machines have a system called PEF (Protected
Execution Facility) which uses a small ultravisor to allow guests to
run in a way that they can't be eavesdropped by the hypervisor.  The
effect is roughly similar to AMD SEV, although the mechanisms are
quite different.

Most of the work of this is done between the guest, KVM and the
ultravisor, with little need for involvement by qemu.  However qemu
does need to tell KVM to allow secure VMs.

Because the availability of secure mode is a guest visible difference
which depends on having the right hardware and firmware, we don't
enable this by default.  In order to run a secure guest you need to
create a "pef-guest" object and set the confidential-guest-support
property to point to it.

Note that this just *allows* secure guests, the architecture of PEF is
such that the guest still needs to talk to the ultravisor to enter
secure mode.  Qemu has no direct way of knowing if the guest is in
secure mode, and certainly can't know until well after machine
creation time.

To start a PEF-capable guest, use the command line options:
    -object pef-guest,id=pef0 -machine confidential-guest-support=pef0

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Greg Kurz <groug@kaod.org>

confidential guest support: Update documentation

Now that we've implemented a generic machine option for configuring various
confidential guest support mechanisms:
  1. Update docs/amd-memory-encryption.txt to reference this rather than
     the earlier SEV specific option
  2. Add a docs/confidential-guest-support.txt to cover the generalities of
     the confidential guest support scheme

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Greg Kurz <groug@kaod.org>

confidential guest support: Move SEV initialization into arch specific code

While we've abstracted some (potential) differences between mechanisms for
securing guest memory, the initialization is still specific to SEV. Given
that, move it into x86's kvm_arch_init() code, rather than the generic
kvm_init() code.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Greg Kurz <groug@kaod.org>

confidential guest support: Introduce cgs "ready" flag

The platform specific details of mechanisms for implementing
confidential guest support may require setup at various points during
initialization. Thus, it's not really feasible to have a single cgs
initialization hook, but instead each mechanism needs its own
initialization calls in arch or machine specific code.

However, to make it harder to have a bug where a mechanism isn't
properly initialized under some circumstances, we want to have a
common place, late in boot, where we verify that cgs has been
initialized if it was requested.

This patch introduces a ready flag to the ConfidentialGuestSupport
base type to accomplish this, which we verify in
qemu_machine_creation_done().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Greg Kurz <groug@kaod.org>

sev: Add Error ** to sev_kvm_init()

This allows failures to be reported richly and idiomatically.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>

confidential guest support: Rework the "memory-encryption" property

Currently the "memory-encryption" property is only looked at once we
get to kvm_init(). Although protection of guest memory from the
hypervisor isn't something that could really ever work with TCG, it's
not conceptually tied to the KVM accelerator.

In addition, the way the string property is resolved to an object is
almost identical to how a QOM link property is handled.

So, create a new "confidential-guest-support" link property which sets
this QOM interface link directly in the machine. For compatibility we
keep the "memory-encryption" property, but now implemented in terms of
the new property.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>

confidential guest support: Move side effect out of machine_set_memory_encryption()

When the "memory-encryption" property is set, we also disable KSM
merging for the guest, since it won't accomplish anything.

We want that, but doing it in the property set function itself is
thereoretically incorrect, in the unlikely event of some configuration
environment that set the property then cleared it again before
constructing the guest.

More importantly, it makes some other cleanups we want more difficult.
So, instead move this logic to machine_run_board_init() conditional on
the final value of the property.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>

sev: Remove false abstraction of flash encryption

When AMD's SEV memory encryption is in use, flash memory banks (which are
initialed by pc_system_flash_map()) need to be encrypted with the guest's
key, so that the guest can read them.

That's abstracted via the kvm_memcrypt_encrypt_data() callback in the KVM
state.. except, that it doesn't really abstract much at all.

For starters, the only call site is in code specific to the 'pc'
family of machine types, so it's obviously specific to those and to
x86 to begin with.  But it makes a bunch of further assumptions that
need not be true about an arbitrary confidential guest system based on
memory encryption, let alone one based on other mechanisms:

* it assumes that the flash memory is defined to be encrypted with the
   guest key, rather than being shared with hypervisor
* it assumes that that hypervisor has some mechanism to encrypt data into
   the guest, even though it can't decrypt it out, since that's the whole
   point
* the interface assumes that this encrypt can be done in place, which
   implies that the hypervisor can write into a confidential guests's
   memory, even if what it writes isn't meaningful

So really, this "abstraction" is actually pretty specific to the way SEV
works.  So, this patch removes it and instead has the PC flash
initialization code call into a SEV specific callback.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>

confidential guest support: Introduce new confidential guest support class

Several architectures have mechanisms which are designed to protect
guest memory from interference or eavesdropping by a compromised
hypervisor.  AMD SEV does this with in-chip memory encryption and
Intel's TDX can do similar things.  POWER's Protected Execution
Framework (PEF) accomplishes a similar goal using an ultravisor and
new memory protection features, instead of encryption.

To (partially) unify handling for these, this introduces a new
ConfidentialGuestSupport QOM base class.  "Confidential" is kind of vague,
but "confidential computing" seems to be the buzzword about these schemes,
and "secure" or "protected" are often used in connection to unrelated
things (such as hypervisor-from-guest or guest-from-guest security).

The "support" in the name is significant because in at least some of the
cases it requires the guest to take specific actions in order to protect
itself from hypervisor eavesdropping.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>

qom: Allow optional sugar props

Global properties have an @optional field, which allows to apply a given
property to a given type even if one of its subclasses doesn't support
it. This is especially used in the compat code when dealing with the
"disable-modern" and "disable-legacy" properties and the "virtio-pci"
type.

Allow object_register_sugar_prop() to set this field as well.

Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <159738953558.377274.16617742952571083440.stgit@bahia.lan>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>

utils/fifo8: add VMSTATE_FIFO8_TEST macro

Rewrite the existing VMSTATE_FIFO8 macro to use VMSTATE_FIFO8_TEST as per the
standard pattern in include/migration/vmstate.h.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-Id: <20210128221728.14887-3-mark.cave-ayland@ilande.co.uk>

utils/fifo8: change fatal errors from abort() to assert()

Developer errors are better represented with assert() rather than abort(). Also
improve the strictness of the checks by using range checks within the assert()
rather than converting the existing equality checks to inequality checks.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Reviewed-by: Claudio Fontana <cfontana@suse.de>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20210121102518.20112-1-mark.cave-ayland@ilande.co.uk>

Merge remote-tracking branch 'remotes/rth-gitlab/tags/pull-tcg-20210205' into staging

TCGCPUOps cleanups (claudio)
tcg/s390 compare fix (phil)
tcg/aarch64 rotli_vec fix
tcg/tci cleanups and fixes

# gpg: Signature made Fri 05 Feb 2021 22:55:10 GMT
# gpg:                using RSA key 7A481E78868B4DB6A85A05C064DF38E8AF7E215F
# gpg:                issuer "richard.henderson@linaro.org"
# gpg: Good signature from "Richard Henderson <richard.henderson@linaro.org>" [full]
# Primary key fingerprint: 7A48 1E78 868B 4DB6 A85A  05C0 64DF 38E8 AF7E 215F

* remotes/rth-gitlab/tags/pull-tcg-20210205: (46 commits)
  accel: introduce AccelCPUClass extending CPUClass
  accel: replace struct CpusAccel with AccelOpsClass
  accel: extend AccelState and AccelClass to user-mode
  cpu: tcg_ops: move to tcg-cpu-ops.h, keep a pointer in CPUClass
  cpu: move debug_check_watchpoint to tcg_ops
  cpu: move adjust_watchpoint_address to tcg_ops
  physmem: make watchpoint checking code TCG-only
  cpu: move do_unaligned_access to tcg_ops
  cpu: move cc->transaction_failed to tcg_ops
  cpu: move cc->do_interrupt to tcg_ops
  target/arm: do not use cc->do_interrupt for KVM directly
  cpu: Move debug_excp_handler to tcg_ops
  cpu: Move tlb_fill to tcg_ops
  cpu: Move cpu_exec_* to tcg_ops
  cpu: Move synchronize_from_tb() to tcg_ops
  accel/tcg: split TCG-only code from cpu_exec_realizefn
  target/riscv: remove CONFIG_TCG, as it is always TCG
  cpu: Introduce TCGCpuOperations struct
  tcg/tci: Remove TCG_CONST
  tcg/tci: Fix TCG_REG_R4 misusage
  ...

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

accel: introduce AccelCPUClass extending CPUClass

add a new optional interface to CPUClass, which allows accelerators
to extend the CPUClass with additional accelerator-specific
initializations.

This will allow to separate the target cpu code that is specific
to each accelerator, and register it automatically with object
hierarchy lookup depending on accelerator code availability,
as part of the accel_init_interfaces() initialization step.

Signed-off-by: Claudio Fontana <cfontana@suse.de>
Message-Id: <20210204163931.7358-19-cfontana@suse.de>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

accel: replace struct CpusAccel with AccelOpsClass

This will allow us to centralize the registration of
the cpus.c module accelerator operations (in accel/accel-softmmu.c),
and trigger it automatically using object hierarchy lookup from the
new accel_init_interfaces() initialization step, depending just on
which accelerators are available in the code.

Rename all tcg-cpus.c, kvm-cpus.c, etc to tcg-accel-ops.c,
kvm-accel-ops.c, etc, matching the object type names.

Signed-off-by: Claudio Fontana <cfontana@suse.de>
Message-Id: <20210204163931.7358-18-cfontana@suse.de>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

accel: extend AccelState and AccelClass to user-mode

Signed-off-by: Claudio Fontana <cfontana@suse.de>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
[claudio: rebased on Richard's splitwx work]

Signed-off-by: Claudio Fontana <cfontana@suse.de>
Message-Id: <20210204163931.7358-17-cfontana@suse.de>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

cpu: tcg_ops: move to tcg-cpu-ops.h, keep a pointer in CPUClass

we cannot in principle make the TCG Operations field definitions
conditional on CONFIG_TCG in code that is included by both common_ss
and specific_ss modules.

Therefore, what we can do safely to restrict the TCG fields to TCG-only
builds, is to move all tcg cpu operations into a separate header file,
which is only included by TCG, target-specific code.

This leaves just a NULL pointer in the cpu.h for the non-TCG builds.

This also tidies up the code in all targets a bit, having all TCG cpu
operations neatly contained by a dedicated data struct.

Signed-off-by: Claudio Fontana <cfontana@suse.de>
Message-Id: <20210204163931.7358-16-cfontana@suse.de>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

cpu: move debug_check_watchpoint to tcg_ops

commit 568496c0c0f1 ("cpu: Add callback to check architectural") and
commit 3826121d9298 ("target-arm: Implement checking of fired")
introduced an ARM-specific hack for cpu_check_watchpoint.

Make debug_check_watchpoint optional, and move it to tcg_ops.

Signed-off-by: Claudio Fontana <cfontana@suse.de>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20210204163931.7358-15-cfontana@suse.de>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

cpu: move adjust_watchpoint_address to tcg_ops

commit 40612000599e ("arm: Correctly handle watchpoints for BE32 CPUs")

introduced this ARM-specific, TCG-specific hack to adjust the address,
before checking it with cpu_check_watchpoint.

Make adjust_watchpoint_address optional and move it to tcg_ops.

Signed-off-by: Claudio Fontana <cfontana@suse.de>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20210204163931.7358-14-cfontana@suse.de>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>