git.osdn.net Git - qmiga/qemu.git/log

block: introduce BDRV_REQUEST_MAX_SECTORS

we check and adjust request sizes at several places with
sometimes inconsistent checks or default values:
INT_MAX
INT_MAX >> BDRV_SECTOR_BITS
UINT_MAX >> BDRV_SECTOR_BITS
SIZE_MAX >> BDRV_SECTOR_BITS

This patches introdocues a macro for the maximal allowed sectors
per request and uses it at several places.

Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

nbd: Improve error messages

This patch makes use of the Error object for nbd_receive_negotiate() so
that errors during negotiation look nicer.

Furthermore, this patch adds an additional error message if the received
magic was wrong, but would be correct for the other protocol version,
respectively: So if an export name was specified, but the NBD server
magic corresponds to an old handshake, this condition is explicitly
signaled to the user, and vice versa.

As these messages are now part of the "Could not open image" error
message, additional filtering has to be employed in iotest 083, which
this patch does as well.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

iotests: Fix 104 for NBD

_make_test_img sets up an NBD server, _cleanup_test_img shuts it down;
thus, _cleanup_test_img has to be called before _make_test_img is
invoked another time.

Furthermore, the pipe through _filter_test_img was unnecessary;
_make_test_img already takes care of that.

And finally, a filter is added to _filter_img_info to replace
"nbd://127.0.0.1:10810" by "TEST_DIR/t.IMGFMT", since the former is the
way to express the full image path (normally the latter) for NBD tests.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

iotests: Fix 100 for nbd

In case of NBD, _make_test_img starts a new NBD server. Therefore,
_cleanup_test_img (which shuts that server down) has to be invoked
before the next _make_test_img call in order to make 100 work for NBD.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

iotests: Fix 083

As of 8f9e835fd2e687d2bfe936819c3494af4343614d, probing should be
disabled in the qemu-iotests (at least when using qemu-io). This broke
083's reference output (which consisted mostly of "Could not read image
for determining its format").

This patch fixes it.

Note that one case which failed before is now successful: Disconnect
after data. This is due to qemu having read twice before (once for
probing, once for the qemu-io read command), but only once now (the
qemu-io read command). Therefore, reading is successful (which is
correct).

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

block: fix off-by-one error in qcow and qcow2

This fixes an off-by-one error introduced in 9a29e18. Both qcow and
qcow2 need to make sure to leave room for string terminator '\0' for
the backing file, so the max length of the non-terminated string is
either 1023 or PATH_MAX - 1.

Reported-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

qemu-iotests: add 116 invalid QED input file tests

These tests exercise error code paths in the QED image format. The
tests are very simple, they just prove that the error path exits
cleanly.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1421065893-18875-3-git-send-email-stefanha@redhat.com
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

qed: check for header size overflow

Header size is denoted in clusters. The maximum cluster size is 64 MB
but there is no limit on header size. Check for uint32_t overflow in
case the header size field has a whacky value.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1421065893-18875-2-git-send-email-stefanha@redhat.com
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

block/dmg: improve zeroes handling

Disk images may contain large all-zeroes gaps (1.66k sectors or 812 MiB
is seen in the real world). These blocks (type 2) do not need to be
extracted into a temporary buffer, there is no need to allocate memory
for these blocks nor to check its length.

(For the test image, the maximum uncompressed size is 1054371 bytes,
probably for a bzip2-compressed block.)

Signed-off-by: Peter Wu <peter@lekensteyn.nl>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 1420566495-13284-13-git-send-email-peter@lekensteyn.nl
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

block/dmg: support bzip2 block entry types

This patch adds support for bzip2-compressed block entries as introduced
with OS X 10.4 (source: https://en.wikipedia.org/wiki/Apple_Disk_Image).

It was tested against a 5.2G "OS X Yosemite" installation image which
stores the BLXX block in the XML property list (instead of resource
forks) and has over 5k chunks.

New configure entries are added (--enable-bzip2 / --disable-bzip2) to
control inclusion of bzip2 functionality (which requires linking against
libbz2). The help message suggests that this option is needed for DMG
files, but the tests are generic enough that other parts of QEMU can use
bzip2 if needed.

The identifiers are based on http://newosxbook.com/DMG.html.

The decompression routines are based on the zlib case, but as there is
no way to reset the decompression state (unlike zlib), memory is
allocated and deallocated for every decompression. This should not be
problematic as the decompression takes most of the time and as blocks
are typically about/over 1 MiB in size, only one allocation is done
every 2000 sectors.

Signed-off-by: Peter Wu <peter@lekensteyn.nl>
Reviewed-by: John Snow <jsnow@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 1420566495-13284-12-git-send-email-peter@lekensteyn.nl
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

block/dmg: factor out block type check

In preparation for adding bzip2 support, split the type check into a
separate function. Make all offsets relative to the begin of a chunk
such that it is easier to recognize the position without having to
add up all offsets. Some comments are added to describe the fields.

There is no functional change.

Signed-off-by: Peter Wu <peter@lekensteyn.nl>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 1420566495-13284-11-git-send-email-peter@lekensteyn.nl
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

block/dmg: use SectorNumber from BLKX header

Previously the sector table parsing relied on the previous offset of
the DMG file. Now it uses the sector number from the BLKX header
(see http://newosxbook.com/DMG.html).

The implementation of dmg2img (from vu1tur) does not base the output
sector on the location of the terminator (0xffffffff) either so it
should be safe to drop this dependency on the previous state.

(It makes somehow makes sense, a terminator should halt further
processing of a block and is perhaps used to preallocate some space.)

Signed-off-by: Peter Wu <peter@lekensteyn.nl>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 1420566495-13284-10-git-send-email-peter@lekensteyn.nl
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

block/dmg: fix sector data offset calculation

This patch addresses two issues:

- The data fork offset was not taken into account, resulting in failure
   to read an InstallESD.dmg file (5164763151 bytes) which had a
   non-zero DataForkOffset field.
- The offset of the previous block ("partition") was unconditionally
   added to the current block because older files would start the input
   offset of a new block at zero. Newer files (including vlc-2.1.5.dmg,
   tuxpaint-0.9.15-macosx.dmg and OS X Yosemite [MAS].dmg) failed in
   reads because these files have chunk offsets, relative to the begin
   of a data fork.

Now the data offset of the mish is taken into account. While we could
check that the data_offset is within the data fork, let's not do that
here as it would only result in parse failures on invalid files (rather
than gracefully handling such bad files). dmg_read will error out if
the offset is incorrect.

Signed-off-by: Peter Wu <peter@lekensteyn.nl>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 1420566495-13284-9-git-send-email-peter@lekensteyn.nl
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

block/dmg: set virtual size to a non-zero value

Right now the virtual size is always reported as zero which makes it
impossible to convert between formats.

After this patch, the number of sectors will be read from the trailer
("koly" block).

To verify the behavior, the output of `dmg2img foo.dmg foo.img` was
compared against `qemu-img convert -f dmg -O raw foo.dmg foo.raw`. The
tests showed that the file contents are exactly the same, except that
QEMU creates a slightly larger file (it matches the total sectors
count).

Signed-off-by: Peter Wu <peter@lekensteyn.nl>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 1420566495-13284-8-git-send-email-peter@lekensteyn.nl
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

block/dmg: process XML plists

The format is simple enough to avoid using a full-blown XML parser. It
assumes that all BLKX items begin with the "mish" magic word, therefore
it is not a problem if other values get matched which are not a BLKX
block.

The offsets are based on the description at
http://newosxbook.com/DMG.html

For compatibility with glib 2.12, use g_base64_decode (which
additionally requires an extra buffer allocation) instead of
g_base64_decode_inplace (which is only available since glib 2.20).

Signed-off-by: Peter Wu <peter@lekensteyn.nl>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 1420566495-13284-7-git-send-email-peter@lekensteyn.nl
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

block/dmg: validate chunk size to avoid overflow

Previously the chunk size was not checked, allowing for a large memory
allocation. This patch checks whether the chunks size is within the
resource fork length, and whether the resource fork is below the
trailer of the dmg file.

Signed-off-by: Peter Wu <peter@lekensteyn.nl>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 1420566495-13284-6-git-send-email-peter@lekensteyn.nl
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

block/dmg: process a buffer instead of reading ints

As the decoded plist XML is not a pointer in the file,
dmg_read_mish_block must be able to process a buffer instead of a file
pointer. Since the full buffer must be processed, let's change the
return value again to just a success flag.

Signed-off-by: Peter Wu <peter@lekensteyn.nl>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 1420566495-13284-5-git-send-email-peter@lekensteyn.nl
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

block/dmg: extract processing of resource forks

Besides the offset, also read the resource length. This length is now
used in the extracted function to verify the end of the resource fork
against "count" from the resource fork.

Instead of relying on the value of offset to conclude whether the
resource fork is available or not (info_begin==0), check the
rsrc_fork_length instead. This would allow a dmg file to begin with a
resource fork. This seemingly unnecessary restriction was found while
trying to craft a DMG file by hand.

Other changes:

- Do not require resource data offset to be 0x100 (but check that it
   is within bounds though).
- Further improve boundary checking (resource data must be within
   the resource fork).
- Use correct value for resource data length (spotted by John Snow)
- Consider the resource data offset when determining info_end.
   This fixes an EINVAL on the tuxpaint dmg example.

The resource fork format is documented at
https://developer.apple.com/legacy/library/documentation/mac/pdf/MoreMacintoshToolbox.pdf#page=151

Signed-off-by: Peter Wu <peter@lekensteyn.nl>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 1420566495-13284-4-git-send-email-peter@lekensteyn.nl
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

block/dmg: extract mish block decoding functionality

Extract the mish block decoder such that this can be used for other
formats in the future. A new DmgHeaderState struct is introduced to
share state while decoding.

The code is kept unchanged as much as possible, a "fail" label is added
for example where a simple return would probably do. In dmg_open, the
variable "tmp" is renamed to "rsrc_data_offset" for clarity and comments
have been added explaining various data.

Note that this patch has one subtle difference with the previous
version which should not affect functionality. In the previous code,
the end of a resource was inferred from the mish block (the offsets
would be increased by the fields). In this patch, the resource length
is used instead to avoid the need to rely on the previous offsets.

Signed-off-by: Peter Wu <peter@lekensteyn.nl>
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1420566495-13284-3-git-send-email-peter@lekensteyn.nl
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

block/dmg: properly detect the UDIF trailer

DMG files have a variable length with a UDIF trailer at the end of a
file. This UDIF trailer is essential as it describes the contents of
the image. At the moment however, the start of this trailer is almost
always incorrect as bdrv_getlength() returns a multiple of the block
size (rounded up). This results in a failure to recognize DMG files,
resulting in Invalid argument (EINVAL) errors.

As there is no API to retrieve the real file size, look for the magic
header in the last two sectors to find the start of this 512-byte UDIF
trailer (the "koly" block).

The resource fork offset ("info_begin") has its offset adjusted as the
initial value of offset does not mean "end of file" anymore, but "begin
of UDIF trailer".

[Replaced error_set(errp, ERROR_CLASS_GENERIC_ERROR, ...) with
error_setg(errp, ...) as discussed with Peter.
--Stefan]

Signed-off-by: Peter Wu <peter@lekensteyn.nl>
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1420566495-13284-2-git-send-email-peter@lekensteyn.nl
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

block: add event when disk usage exceeds threshold

Managing applications, like oVirt (http://www.ovirt.org), make extensive
use of thin-provisioned disk images.
To let the guest run smoothly and be not unnecessarily paused, oVirt sets
a disk usage threshold (so called 'high water mark') based on the occupation
of the device,  and automatically extends the image once the threshold
is reached or exceeded.

In order to detect the crossing of the threshold, oVirt has no choice but
aggressively polling the QEMU monitor using the query-blockstats command.
This lead to unnecessary system load, and is made even worse under scale:
deployments with hundreds of VMs are no longer rare.

To fix this, this patch adds:
* A new monitor command `block-set-write-threshold', to set a mark for
  a given block device.
* A new event `BLOCK_WRITE_THRESHOLD', to report if a block device
  usage exceeds the threshold.
* A new `write_threshold' field into the `BlockDeviceInfo' structure,
  to report the configured threshold.

This will allow the managing application to use smarter and more
efficient monitoring, greatly reducing the need of polling.

[Updated qemu-iotests 067 output to add the new 'write_threshold'
property. --Stefan]
[Changed g_assert_false() to !g_assert() to fix the build on older glib
versions. --Kevin]

Signed-off-by: Francesco Romani <fromani@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 1421068273-692-1-git-send-email-fromani@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

iotests: Specify format for qemu-nbd

This patch is necessary to suppress the "probed raw" warning when
running raw over nbd tests.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

qemu-iotests: Fix supported_oses check

There is a bug in the recently added sys.platform test, and we no longer
run python tests, because "linux2" is the value to compare here. So do a
prefix match. According to python doc [1], the way to use sys.platform
is "unless you want to test for a specific system version, it is
therefore recommended to use the following idiom":

if sys.platform.startswith('freebsd'):
# FreeBSD-specific code here...
elif sys.platform.startswith('linux'):
# Linux-specific code here...

[1]: https://docs.python.org/2.7/library/sys.html#sys.platform

Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

virtio-blk: add a knob to disable request merging

this adds a knob to disable request merging for debugging or benchmarks if dedired.

Signed-off-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

virtio-blk: introduce multiread

this patch finally introduces multiread support to virtio-blk. While
multiwrite support was there for a long time, read support was missing.

The complete merge logic is moved into virtio-blk.c which has
been the only user of request merging ever since. This is required
to be able to merge chunks of requests and immediately invoke callbacks
for those requests. Secondly, this is required to switch to
direct invocation of coroutines which is planned at a later stage.

The following benchmarks show the performance of running fio with
4 worker threads on a local ram disk. The numbers show the average
of 10 test runs after 1 run as warmup phase.

              |        4k        |       64k        |        4k
MB/s          | rd seq | rd rand | rd seq | rd rand | wr seq | wr rand
--------------+--------+---------+--------+---------+--------+--------
master        | 1221   | 1187    | 4178   | 4114    | 1745   | 1213
multiread     | 1829   | 1189    | 4639   | 4110    | 1894   | 1216

Signed-off-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

block-backend: expose bs->bl.max_transfer_length

Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

hw/virtio-blk: add a constant for max number of merged requests

As it was not obvious (at least for me) where the 32 comes from;
add a constant for it.

Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

block: add accounting for merged requests

Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

qed: Really remove unused field QEDAIOCB.finished

The commit 533ffb17a that removed qed_aiocb_info.cancel said to remove
this but didn't do it.

Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

qemu-img: Add QEMU_PKGVERSION to QEMU_IMG_VERSION

This is the same way vl.c handles this.

Signed-off-by: Don Slutz <dslutz@verizon.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

block: change default for discard and write zeroes to INT_MAX

do not trim requests if the driver does not supply a limit
through BlockLimits. For write zeroes we still keep a limit
for the unsupported path to avoid allocating a big bounce buffer.

Suggested-by: Kevin Wolf <kwolf@redhat.com>
Suggested-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes

This sequence works efficiently if FALLOC_FL_ZERO_RANGE is not supported.
Unfortunately, FALLOC_FL_ZERO_RANGE is supported on really modern systems
and only for a couple of filesystems. FALLOC_FL_PUNCH_HOLE is much more
mature.

The sequence of 2 operations FALLOC_FL_PUNCH_HOLE and 0 is necessary due
to the following reasons:
- FALLOC_FL_PUNCH_HOLE creates a hole in the file, the file becomes
  sparse. In order to retain original functionality we must allocate
  disk space afterwards. This is done using fallocate(0) call
- fallocate(0) without preceeding FALLOC_FL_PUNCH_HOLE will do nothing
  if called above already allocated areas of the file, i.e. the content
  will not be zeroed

This should increase the performance a bit for not-so-modern kernels.

CC: Max Reitz <mreitz@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Peter Lieven <pl@kamp.de>
CC: Fam Zheng <famz@redhat.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes

There is a possibility that we are extending our image and thus writing
zeroes beyond the end of the file. In this case we do not need to care
about the hole to make sure that there is no data in the file under
this offset (pre-condition to fallocate(0) to work). We could simply call
fallocate(0).

This improves the performance of writing zeroes even on really old
platforms which do not have even FALLOC_FL_PUNCH_HOLE.

Before the patch do_fallocate was used when either
CONFIG_FALLOCATE_PUNCH_HOLE or CONFIG_FALLOCATE_ZERO_RANGE are defined.
Now the story is different. CONFIG_FALLOCATE is defined when Linux
fallocate is defined, posix_fallocate is completely different story
(CONFIG_POSIX_FALLOCATE). CONFIG_FALLOCATE is mandatory prerequite
for both CONFIG_FALLOCATE_PUNCH_HOLE and CONFIG_FALLOCATE_ZERO_RANGE
thus we are on the safe side.

CC: Max Reitz <mreitz@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Peter Lieven <pl@kamp.de>
CC: Fam Zheng <famz@redhat.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

block: use fallocate(FALLOC_FL_ZERO_RANGE) in handle_aiocb_write_zeroes

This efficiently writes zeroes on Linux if the kernel is capable enough.
FALLOC_FL_ZERO_RANGE correctly handles all cases, including and not
including file expansion.

CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Peter Lieven <pl@kamp.de>
CC: Fam Zheng <famz@redhat.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

block/raw-posix: refactor handle_aiocb_write_zeroes a bit

move code dealing with a block device to a separate function. This will
allow to implement additional processing for ordinary files.

Please note, that xfs_code has been moved before checking for
s->has_write_zeroes as xfs_write_zeroes does not touch this flag inside.
This makes code a bit more consistent.

CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Peter Lieven <pl@kamp.de>
CC: Fam Zheng <famz@redhat.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

block/raw-posix: create do_fallocate helper

The pattern
    do {
        if (fallocate(s->fd, mode, offset, len) == 0) {
            return 0;
        }
    } while (errno == EINTR);
    ret = translate_err(-errno);
will be commonly useful in next patches. Create helper for it.

CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Peter Lieven <pl@kamp.de>
CC: Fam Zheng <famz@redhat.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

block/raw-posix: create translate_err helper to merge errno values

actually the code
    if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP ||
        ret == -ENOTTY) {
        ret = -ENOTSUP;
    }
is present twice and will be added a couple more times. Create helper
for this.

CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Peter Lieven <pl@kamp.de>
CC: Fam Zheng <famz@redhat.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

atapi migration: Throw recoverable error to avoid recovery

(With the previous atapi_dma flag recovery)
If migration happens between the ATAPI command being written and the
bmdma being started, the DMA is dropped. Eventually the guest times
out and recovers, but that can take many seconds.
(This is rare, on a pingpong reading the CD continuously I hit
this about ~1/30-1/50 migrates)

I don't think we've got enough state to be able to recover safely
at this point, so I throw a 'medium error, no seek complete'
that I'm assuming guests will try and recover from an apparently
dirty CD.

OK, it's a hack, the real solution is probably to push a lot of
ATAPI state into the migration stream, but this is a fix that
works with no stream changes. Tested only on Linux (both RHEL5
(pre-libata) and RHEL7).

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

Restore atapi_dma flag across migration

If a migration happens just after the guest has kicked
off an ATAPI command and kicked off DMA, we lose the atapi_dma
flag, and the destination tries to complete the command as PIO
rather than DMA. This upsets Linux; modern libata based kernels
stumble and recover OK, older kernels end up passing bad data
to userspace.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

Merge remote-tracking branch 'remotes/stefanha/tags/net-pull-request' into staging

# gpg: Signature made Fri 06 Feb 2015 14:10:40 GMT using RSA key ID 81AB73C8
# gpg: Good signature from "Stefan Hajnoczi <stefanha@redhat.com>"
# gpg:                 aka "Stefan Hajnoczi <stefanha@gmail.com>"

* remotes/stefanha/tags/net-pull-request:
  monitor: more accurate completion for host_net_remove()
  net: del hub port when peer is deleted
  net: remove the wrong comment in net_init_hubport()
  monitor: print hub port name during info network
  rtl8139: simplify timer logic
  MAINTAINERS: add Jason Wang as net subsystem maintainer

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

monitor: more accurate completion for host_net_remove()

Current completion for host_net_remove will show hub ports and clients
that were not peered with hub ports. Fix this.

Cc: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Message-id: 1422860798-17495-4-git-send-email-jasowang@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

net: del hub port when peer is deleted

We should del hub port when peer is deleted since it will not be reused
and will only be freed during exit.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Message-id: 1422860798-17495-3-git-send-email-jasowang@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

net: remove the wrong comment in net_init_hubport()

Not only nic could be the one to peer.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Message-id: 1422860798-17495-2-git-send-email-jasowang@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

monitor: print hub port name during info network

Signed-off-by: Jason Wang <jasowang@redhat.com>
Message-id: 1422860798-17495-1-git-send-email-jasowang@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

rtl8139: simplify timer logic

Pavel Dovgalyuk reports that TimerExpire and the timer are not restored
correctly on the receiving end of migration.

It is not clear to me whether this is really the case, but we can take
the occasion to get rid of the complicated code that computes PCSTimeout
on the fly upon changes to IntrStatus/IntrMask. Just always keep a
timer running, it will fire every ~130 seconds at most if the interrupt
is masked with TimerInt != 0.

This makes rtl8139_set_next_tctr_time idempotent (when the virtual clock
is stopped between two calls, as is the case during migration).

Tested with Frediano's qtest.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 1421765099-26190-1-git-send-email-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

Merge remote-tracking branch 'remotes/stefanha/tags/tracing-pull-request' into staging

# gpg: Signature made Fri 06 Feb 2015 13:45:06 GMT using RSA key ID 81AB73C8
# gpg: Good signature from "Stefan Hajnoczi <stefanha@redhat.com>"
# gpg: aka "Stefan Hajnoczi <stefanha@gmail.com>"

* remotes/stefanha/tags/tracing-pull-request:
trace: Print PID and time in stderr traces

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

trace: Print PID and time in stderr traces

When debugging migration it's useful to know the PID of
each trace message so you can figure out if it came from the source
or the destination.

Printing the time makes it easy to do latency measurements or timings
between trace points.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-id: 1421746875-9962-1-git-send-email-dgilbert@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

Merge remote-tracking branch 'remotes/juanquintela/tags/migration/20150205' into staging

migration/next for 20150205

# gpg: Signature made Thu 05 Feb 2015 16:17:08 GMT using RSA key ID 5872D723
# gpg: Can't check signature: public key not found

* remotes/juanquintela/tags/migration/20150205:
  fix mc146818rtc wrong subsection name to avoid vmstate_subsection_load() fail
  Tracify migration/rdma.c
  Add migration stream analyzation script
  migration: Append JSON description of migration stream
  qemu-file: Add fast ftell code path
  QJSON: Add JSON writer
  Print errors in some of the early migration failure cases.
  Migration: Add lots of trace events
  savevm: Convert fprintf to error_report
  vmstate-static-checker: update whitelist

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

Merge remote-tracking branch 'remotes/armbru/tags/pull-cov-model-2015-02-05' into staging

coverity: Improve and extend model

# gpg: Signature made Thu 05 Feb 2015 16:20:49 GMT using RSA key ID EB918653
# gpg: Good signature from "Markus Armbruster <armbru@redhat.com>"
# gpg:                 aka "Markus Armbruster <armbru@pond.sub.org>"

* remotes/armbru/tags/pull-cov-model-2015-02-05:
  MAINTAINERS: Add myself as Coverity model maintainer
  coverity: Model g_free() isn't necessarily free()
  coverity: Model GLib string allocation partially
  coverity: Improve model for GLib memory allocation

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

fix mc146818rtc wrong subsection name to avoid vmstate_subsection_load() fail

fix mc146818rtc wrong subsection name to avoid vmstate_subsection_load() fail
during incoming migration or loadvm.

Signed-off-by: Zhang Haoyu <zhanghy@sangfor.com.cn>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>

MAINTAINERS: Add myself as Coverity model maintainer

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>

Tracify migration/rdma.c

Turn all the D/DD/DDDPRINTFs into trace events
Turn most of the fprintf(stderr, into error_report

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>

Add migration stream analyzation script

This patch adds a python tool to the scripts directory that can read
a dumped migration stream if it contains the JSON description of the
device states. I constructs a human readable JSON stream out of it.

It's very simple to use:

  $ qemu-system-x86_64
    (qemu) migrate "exec:cat > mig"
  $ ./scripts/analyze_migration.py -f mig

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>

migration: Append JSON description of migration stream

One of the annoyances of the current migration format is the fact that
it's not self-describing. In fact, it's not properly describing at all.
Some code randomly scattered throughout QEMU elaborates roughly how to
read and write a stream of bytes.

We discussed an idea during KVM Forum 2013 to add a JSON description of
the migration protocol itself to the migration stream. This patch
adds a section after the VM_END migration end marker that contains
description data on what the device sections of the stream are composed of.

This approach is backwards compatible with any QEMU version reading the
stream, because QEMU just stops reading after the VM_END marker and ignores
any data following it.

With an additional external program this allows us to decipher the
contents of any migration stream and hopefully make migration bugs easier
to track down.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>

qemu-file: Add fast ftell code path

For ftell we flush the output buffer to ensure that we don't have anything
lingering in our internal buffers. This is a very safe thing to do.

However, with the dynamic size measurement that the dynamic vmstate
description will bring this would turn out quite slow.

Instead, we can fast path this specific measurement and just take the
internal buffers into account when telling the kernel our position.

I'm sure I overlooked some corner cases where this doesn't work, so
instead of tuning the safe, existing version, this patch adds a fast
variant of ftell that gets used by the dynamic vmstate description code
which isn't critical when it fails.

Signed-off-by: Alexander Graf <agraf@suse.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>

QJSON: Add JSON writer

To support programmatic JSON assembly while keeping the code that generates it
readable, this patch introduces a simple JSON writer. It emits JSON serially
into a buffer in memory.

The nice thing about this writer is its simplicity and low memory overhead.
Unlike the QMP JSON writer, this one does not need to spawn QObjects for every
element it wants to represent.

This is a prerequisite for the migration stream format description generator.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>

Print errors in some of the early migration failure cases.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>

Migration: Add lots of trace events

Mostly on the load side, so that when we get a complaint about
a migration failure we can figure out what it didn't like.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>

savevm: Convert fprintf to error_report

Convert a bunch of fprintfs to error_reports

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>

vmstate-static-checker: update whitelist

Commit 22382bb96c8bd88370c1ff0cb28c3ee6bee79ed3 renamed the
'hw_cursor_x' and 'hw_cursor_y' fields in cirrus_vga. Update the static
checker's whitelist to allow matching against the old and new names.

Signed-off-by: Amit Shah <amit.shah@redhat.com>
Reviewed-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>

coverity: Model g_free() isn't necessarily free()

Memory allocated with GLib needs to be freed with GLib.  Freeing it
with free() instead of g_free() is a common error.  Harmless when
g_free() is a trivial wrapper around free(), which is commonly the
case.  But model the difference anyway.

In a local scan, this flags four ALLOC_FREE_MISMATCH.  Requires
--enable ALLOC_FREE_MISMATCH, because the checker is still preview.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>

coverity: Model GLib string allocation partially

Without a model, Coverity can't know that the result of g_strdup()
needs to be fed to g_free().

One way to get such a model is to scan GLib, build a derived model
file with cov-collect-models, and use that when scanning QEMU.
Unfortunately, the Coverity Scan service we use doesn't support that.

Thus, we're stuck with the other way: write a user model.  Doing that
for all of GLib is hardly practical.  I'm doing it for the "String
Utility Functions" we actually use that return dynamically allocated
strings.

In a local scan, this flags 20 additional RESOURCE_LEAKs.  The ones I
checked look genuine.

It also loses a NULL_RETURNS about ppce500_init() using
qemu_find_file() without error checking.  I don't understand why.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>

coverity: Improve model for GLib memory allocation

In current versions of GLib, g_new() may expand into g_malloc_n().
When it does, Coverity can't see the memory allocation, because we
don't model g_malloc_n().  Similarly for g_new0(), g_renew(),
g_try_new(), g_try_new0(), g_try_renew().

Model g_malloc_n(), g_malloc0_n(), g_realloc_n().  Model
g_try_malloc_n(), g_try_malloc0_n(), g_try_realloc_n() by adding
indeterminate out of memory conditions on top.

To avoid undue duplication, replace the existing models for g_malloc()
& friends by trivial wrappers around g_malloc_n() & friends.

In a local scan, this flags four additional RESOURCE_LEAKs and one
NULL_RETURNS.

The NULL_RETURNS is a false positive: Coverity can now see that
g_try_malloc(l1_sz * sizeof(uint64_t)) in
qcow2_check_metadata_overlap() may return NULL, but is too stupid to
recognize that a loop executing l1_sz times won't be entered then.

Three out of the four RESOURCE_LEAKs appear genuine.  The false
positive is in ppce500_prep_device_tree(): the pointer dies, but a
pointer to a struct member escapes, and we get the pointer back for
freeing with container_of().  Too funky for Coverity.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>

Merge remote-tracking branch 'remotes/pmaydell/tags/pull-target-arm-20150205' into staging

target-arm queue:
* refactor/clean up armv7m_init()
* some initial cleanup in the direction of supporting 64-bit EL3
* fix broken synchronization of registers between QEMU and KVM
   for 32-bit ARM hosts (which among other things broke memory
   access via gdbstub)
* fix flush-to-zero handling in FMULX, FRECPS, FRSQRTS and FRECPE
* don't crash QEMU for UNPREDICTABLE BFI insns in A32 encoding
* explain why virt board's device-to-transport mapping code is
   the way it is
* implement mmu_idx values which match the architectural
   distinctions, and introduce the concept of a translation
   regime to get_phys_addr() rather than incorrectly looking
   at the current CPU state
* update to upstream VIXL 1.7 (gives us correct code addresses
   when dissassembling pc-relative references)
* sync system register state between KVM and QEMU for 64-bit ARM
* support virtio on big-endian guests by implementing the
   "which endian is the guest now?" CPU method

# gpg: Signature made Thu 05 Feb 2015 14:02:16 GMT using RSA key ID 14360CDE
# gpg: Good signature from "Peter Maydell <peter.maydell@linaro.org>"

* remotes/pmaydell/tags/pull-target-arm-20150205: (28 commits)
  target-arm: fix for exponent comparison in recpe_f64
  target-arm: Guest cpu endianness determination for virtio KVM ARM/ARM64
  target-arm: KVM64: Get and Sync up guest register state like kvm32.
  disas/arm-a64.cc: Tell libvixl correct code addresses
  disas/libvixl: Update to upstream VIXL 1.7
  target-arm: Fix brace style in reindented code
  target-arm: Reindent ancient page-table-walk code
  target-arm: Use mmu_idx in get_phys_addr()
  target-arm: Pass mmu_idx to get_phys_addr()
  target-arm: Split AArch64 cases out of ats_write()
  target-arm: Don't define any MMU_MODE*_SUFFIXes
  target-arm: Use correct mmu_idx for unprivileged loads and stores
  target-arm: Define correct mmu_idx values and pass them in TB flags
  target-arm/translate-a64: Fix wrong mmu_idx usage for LDT/STT
  target-arm: Make arm_current_el() return sensible values for M profile
  cpu_ldst.h: Allow NB_MMU_MODES to be 7
  hw/arm/virt: explain device-to-transport mapping in create_virtio_devices()
  target-arm: check that LSB <= MSB in BFI instruction
  target-arm: Squash input denormals in FRECPS and FRSQRTS
  Fix FMULX not squashing denormalized inputs when FZ is set.
  ...

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

target-arm: fix for exponent comparison in recpe_f64

f64 exponent in HELPER(recpe_f64) should be compared to 2045 rather than 1023
(FPRecipEstimate in ARMV8 spec). This fixes incorrect underflow handling when
flushing denormals to zero in the FRECPE instructions operating on 64-bit
values.

Signed-off-by: Ildar Isaev <ild@inbox.ru>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

target-arm: Guest cpu endianness determination for virtio KVM ARM/ARM64

This patch implements a fucntion pointer "virtio_is_big_endian"
from "CPUClass" structure for arm/arm64.
Function arm_cpu_is_big_endian() is added to determine and
return the guest cpu endianness to virtio.
This is required for running cross endian guests with virtio on ARM/ARM64.

Signed-off-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org>
Message-id: 1423130382-18640-3-git-send-email-pranavkumar@linaro.org
[PMM: check CPSR_E in env->cpsr_uncached, not env->pstate.]
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

target-arm: KVM64: Get and Sync up guest register state like kvm32.

This patch adds:
1. Call write_kvmstate_to_list() and write_list_to_cpustate()
in kvm_arch_get_registers() to sync guest register state.
2. Call write_list_to_kvmstate() in kvm_arch_put_registers()
to sync guest register state.

These changes are already there for kvm32 in target-arm/kvm32.c.

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org>
Message-id: 1423130382-18640-2-git-send-email-pranavkumar@linaro.org
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

disas/arm-a64.cc: Tell libvixl correct code addresses

disassembling relative branches in code which doesn't reside at
what the guest CPU would think its execution address is. Use
the new MapCodeAddress() API to tell libvixl where the code is
from the guest CPU's point of view so it can get the target
addresses right.

Previous disassembly:

0x0000000040000000:  580000c0      ldr x0, pc+24 (addr 0x7f6cb7020434)
0x0000000040000004:  aa1f03e1      mov x1, xzr
0x0000000040000008:  aa1f03e2      mov x2, xzr
0x000000004000000c:  aa1f03e3      mov x3, xzr
0x0000000040000010:  58000084      ldr x4, pc+16 (addr 0x7f6cb702042c)
0x0000000040000014:  d61f0080      br x4

Fixed disassembly:
0x0000000040000000:  580000c0      ldr x0, pc+24 (addr 0x40000018)
0x0000000040000004:  aa1f03e1      mov x1, xzr
0x0000000040000008:  aa1f03e2      mov x2, xzr
0x000000004000000c:  aa1f03e3      mov x3, xzr
0x0000000040000010:  58000084      ldr x4, pc+16 (addr 0x40000020)
0x0000000040000014:  d61f0080      br x4

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Message-id: 1422274779-13359-3-git-send-email-peter.maydell@linaro.org

disas/libvixl: Update to upstream VIXL 1.7

Update our copy of libvixl to upstream's 1.7 release.
This includes upstream's fix for the issue we had a local
patch for in commit 94cc44a9e.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Message-id: 1422274779-13359-2-git-send-email-peter.maydell@linaro.org

target-arm: Fix brace style in reindented code

This patch fixes the brace style in the code reindented in the
previous commit.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Greg Bellows <greg.bellows@linaro.org>
Reviewed-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com>

target-arm: Reindent ancient page-table-walk code

A few of the oldest parts of the page-table-walk code have broken indent
(either hardcoded tabs or two-spaces). Reindent these sections.

For ease of review, this patch does not touch the brace style and
so is a whitespace-only change.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Greg Bellows <greg.bellows@linaro.org>
Reviewed-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com>

target-arm: Use mmu_idx in get_phys_addr()

Now we have the mmu_idx in get_phys_addr(), use it correctly to
determine the behaviour of virtual to physical address translations,
rather than using just an is_user flag and the current CPU state.

Some TODO comments have been added to indicate where changes will
need to be made to add EL2 and 64-bit EL3 support.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Greg Bellows <greg.bellows@linaro.org>

target-arm: Pass mmu_idx to get_phys_addr()

Make all the callers of get_phys_addr() pass it the correct
mmu_idx rather than just a simple "is_user" flag. This includes
properly decoding the AT/ATS system instructions; we include the
logic for handling all the opc1/opc2 cases because we'll need
them later for supporting EL2/EL3, even if we don't have the
regdef stanzas yet.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Greg Bellows <greg.bellows@linaro.org>
Reviewed-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com>

target-arm: Split AArch64 cases out of ats_write()

Instead of simply reusing ats_write() as the handler for both AArch32
and AArch64 address translation operations, use a different function
for each with the common code in a third function. This is necessary
because the semantics for selecting the right translation regime are
different; we are only getting away with sharing currently because
we don't support EL2 and only support EL3 in AArch32.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Greg Bellows <greg.bellows@linaro.org>
Reviewed-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com>

target-arm: Don't define any MMU_MODE*_SUFFIXes

target-arm doesn't use any of the MMU-mode specific cpu ldst
accessor functions. Suppress their generation by not defining
any of the MMU_MODE*_SUFFIX macros. ("user" and "kernel" are
too simplistic as descriptions of indexes 0 and 1 anyway.)

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Greg Bellows <greg.bellows@linaro.org>
Reviewed-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com>

target-arm: Use correct mmu_idx for unprivileged loads and stores

The MMU index to use for unprivileged loads and stores is more
complicated than we currently implement:
* for A64, it should be "if at EL1, access as if EL0; otherwise
access at current EL"
* for A32/T32, it should be "if EL2, UNPREDICTABLE; otherwise
access as if at EL0".

In both cases, if we want to make the access for Secure EL0
this is not the same mmu_idx as for Non-Secure EL0.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Greg Bellows <greg.bellows@linaro.org>

target-arm: Define correct mmu_idx values and pass them in TB flags

We currently claim that for ARM the mmu_idx should simply be the current
exception level. However this isn't actually correct -- secure EL0 and EL1
should have separate indexes from non-secure EL0 and EL1 since their
VA->PA mappings may differ. We also will want an index for stage 2
translations when we properly support EL2.

Define and document all seven mmu index values that we require, and
pass the mmu index in the TB flags rather than exception level or
priv/user bit.

This change doesn't update the get_phys_addr() code, so our page
table walking still assumes a simplistic "user or priv?" model for
the moment.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Greg Bellows <greg.bellows@linaro.org>
---
This leaves some odd gaps in the TB flags usage. I will circle
back and clean this up later (including moving the other common
flags like the singlestep ones to the top of the flags word),
but I didn't want to bloat this patchseries further.

target-arm/translate-a64: Fix wrong mmu_idx usage for LDT/STT

The LDT/STT (load/store unprivileged) instruction decode was using
the wrong MMU index value. This meant that instead of these insns
being "always access as if user-mode regardless of current privilege"
they were "always access as if kernel-mode regardless of current
privilege". This went unnoticed because AArch64 Linux doesn't use
these instructions.

Cc: qemu-stable@nongnu.org
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Greg Bellows <greg.bellows@linaro.org>
Reviewed-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com>
---
I'm not counting this as a security issue because I'm assuming
nobody treats TCG guests as a security boundary (certainly I
would not recommend doing so...)

target-arm: Make arm_current_el() return sensible values for M profile

Although M profile doesn't have the same concept of exception level
as A profile, it does have a notion of privileged versus not, which
we currently track in the privmode TB flag. Support returning this
information if arm_current_el() is called on an M profile core, so
that we can identify the correct MMU index to use (and put the MMU
index in the TB flags) without having to special-case M profile.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Greg Bellows <greg.bellows@linaro.org>

cpu_ldst.h: Allow NB_MMU_MODES to be 7

Support guest CPUs which need 7 MMU index values.
Add a comment about what would be required to raise the limit
further (trivial for 8, TCG backend rework for 9 or more).

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Greg Bellows <greg.bellows@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>

hw/arm/virt: explain device-to-transport mapping in create_virtio_devices()

Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Message-id: 1422592273-4432-1-git-send-email-lersek@redhat.com
[PMM: added note recommending UUIDs]
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

target-arm: check that LSB <= MSB in BFI instruction

The documentation states that if LSB > MSB in BFI instruction behaviour
is unpredictable. Currently QEMU crashes because of assertion failure in
this case:

tcg/tcg-op.h:2061: tcg_gen_deposit_i32: Assertion `len <= 32' failed.

While assertion failure may meet the "unpredictable" definition this
behaviour is undesirable because it allows an unprivileged guest program
to crash the emulator with the OS and other programs.

This patch addresses the issue by throwing illegal instruction exception
if LSB > MSB. Only ARM decoder is affected because Thumb decoder already
has this check in place.

To reproduce issue run the following program

int main(void) {
    asm volatile (".long 0x07c00c12" :: );
    return 0;
}

compiled with
  gcc -marm -static badop_arm.c -o badop_arm

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

target-arm: Squash input denormals in FRECPS and FRSQRTS

The helper functions for FRECPS and FRSQRTS have special case
handling that includes checks for zero inputs, so squash input
denormals if necessary before those checks. This fixes incorrect
output when the FPCR DZ bit is set to enable squashing of input
denormals.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Tested-by: Laurent Desnogues <laurent.desnogues@gmail.com>

Fix FMULX not squashing denormalized inputs when FZ is set.

While FMULX returns a 2.0f float when two operators are infinity and
zero, those operators should be unpacked from raw inputs first. Inconsistent
cases would occur when operators are denormalized floats in flush-to-zero
mode. A wrong codepath will be entered and 2.0f will not be returned
without this patch.
Fix by checking whether inputs need to be flushed before running into
different codepaths.

Signed-off-by: Xiangyu Hu <libhu.so@gmail.com>
Message-id: 1422459650-12490-1-git-send-email-libhu.so@gmail.com
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

target-arm: Add checks that cpreg raw accesses are handled

Add assertion checking when cpreg structures are registered that they
either forbid raw-access attempts or at least make an attempt at
handling them. Also add an assert in the raw-accessor-of-last-resort,
to avoid silently doing a read or write from offset zero, which is
actually AArch32 CPU register r0.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Message-id: 1422282372-13735-3-git-send-email-peter.maydell@linaro.org
Reviewed-by: Greg Bellows <greg.bellows@linaro.org>

target-arm: Split NO_MIGRATE into ALIAS and NO_RAW

We currently mark ARM coprocessor/system register definitions with
the flag ARM_CP_NO_MIGRATE for two different reasons:
1) register is an alias on to state that's also visible via
   some other register, and that other register is the one
   responsible for migrating the state
2) register is not actually state at all (for instance the TLB
   or cache maintenance operation "registers") and it makes no
   sense to attempt to migrate it or otherwise access the raw state

This works fine for identifying which registers should be ignored
when performing migration, but we also use the same functions for
synchronizing system register state between QEMU and the kernel
when using KVM. In this case we don't want to try to sync state
into registers in category 2, but we do want to sync into registers
in category 1, because the kernel might have picked a different
one of the aliases as its choice for which one to expose for
migration. (In particular, on 32 bit hosts the kernel will
expose the state in the AArch32 version of the register, but
TCG's convention is to mark the AArch64 version as the version
to migrate, even if the CPU being emulated happens to be 32 bit,
so almost all system registers will hit this issue now that we've
added AArch64 system emulation.)

Fix this by splitting the NO_MIGRATE flag in two (ALIAS and NO_RAW)
corresponding to the two different reasons we might not want to
migrate a register. When setting up the TCG list of registers to
migrate we honour both flags; when populating the list from KVM,
only ignore registers which are NO_RAW.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Greg Bellows <greg.bellows@linaro.org>
Message-id: 1422282372-13735-2-git-send-email-peter.maydell@linaro.org
[PMM: changed ARM_CP_NO_MIGRATE to ARM_CP_ALIAS on new SP_EL1 and
SP_EL2 reginfo stanzas since there was a (semantic) merge conflict
with the patchset that added those]

target-arm: Add missing SP_ELx register definition

Added CP register definitions for SP_EL1 and SP_EL2.

Signed-off-by: Greg Bellows <greg.bellows@linaro.org>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-id: 1422029835-4696-5-git-send-email-greg.bellows@linaro.org
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

target-arm: Change reset to highest available EL

Update to arm_cpu_reset() to reset into the highest available exception level
based on the set ARM features.

Signed-off-by: Greg Bellows <greg.bellows@linaro.org>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-id: 1422029835-4696-4-git-send-email-greg.bellows@linaro.org
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

target-arm: Add extended RVBAR support

Added RVBAR_EL2 and RVBAR_EL3 CP register support. All RVBAR_EL# registers
point to the same location and only the highest EL version exists at any one
time.

Signed-off-by: Greg Bellows <greg.bellows@linaro.org>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-id: 1422029835-4696-3-git-send-email-greg.bellows@linaro.org
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

target-arm: Fix RVBAR_EL1 register encoding

Fix the RVBAR_EL1 CP register opc2 encoding from 2 to 1

Signed-off-by: Greg Bellows <greg.bellows@linaro.org>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Message-id: 1422029835-4696-2-git-send-email-greg.bellows@linaro.org
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

target_arm: Parameterise the irq lines for armv7m_init

This patch allows the board to specifiy the number of NVIC interrupt
lines when using armv7m_init.

Signed-off-by: Alistair Francis <alistair23@gmail.com>
Reviewed-by: Peter Crosthwaite <peter.crosthwaite@xilinx.com>
Message-id: 5a0b0fcc778df0340899f488053acc9493679e03.1422077994.git.alistair23@gmail.com
[PMM: removed stale FIXME comment]
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

target_arm: Remove memory region init from armv7m_init

This patch moves the memory region init code from the
armv7m_init function to the stellaris_init function

Signed-off-by: Alistair Francis <alistair23@gmail.com>
Reviewed-by: Peter Crosthwaite <peter.crosthwaite@xilinx.com>
Message-id: 4836be7e1d708554d6eb0bc639dc2fbf7dac0458.1422077994.git.alistair23@gmail.com
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

Merge remote-tracking branch 'remotes/armbru/tags/pull-error-2015-02-05' into staging

qmp hmp balloon: Cleanups around error reporting

# gpg: Signature made Thu 05 Feb 2015 07:15:11 GMT using RSA key ID EB918653
# gpg: Good signature from "Markus Armbruster <armbru@redhat.com>"
# gpg:                 aka "Markus Armbruster <armbru@pond.sub.org>"

* remotes/armbru/tags/pull-error-2015-02-05:
  balloon: Eliminate silly QERR_ macros
  balloon: Factor out common "is balloon active" test
  balloon: Inline qemu_balloon(), qemu_balloon_status()
  qmp: Eliminate silly QERR_COMMAND_NOT_FOUND macro
  qmp: Simplify recognition of capability negotiation command
  qmp: Clean up qmp_query_spice() #ifndef !CONFIG_SPICE dummy
  hmp: Compile hmp_info_spice() only with CONFIG_SPICE
  qmp hmp: Improve error messages when SPICE is not in use
  qmp hmp: Factor out common "using spice" test

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

MAINTAINERS: add Jason Wang as net subsystem maintainer

Jason Wang will be co-maintaining the QEMU net subsystem with me. He
has contributed improvements and reviewed patches over the past years as
part of working on virtio-net and virtualized networking.

Jason has already been backing me up with patch reviews. For the time
being I will continue to submit pull requests.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>

Merge remote-tracking branch 'remotes/awilliam/tags/vfio-update-20150204.0' into staging

VFIO fixes:
- Fix wrong initializer (Chen Fan)
- Add missing object_unparent (Alex Williamson)

# gpg: Signature made Wed 04 Feb 2015 18:49:24 GMT using RSA key ID 3BB08B22
# gpg: Good signature from "Alex Williamson <alex.williamson@redhat.com>"
# gpg:                 aka "Alex Williamson <alex@shazbot.org>"
# gpg:                 aka "Alex Williamson <alwillia@redhat.com>"
# gpg:                 aka "Alex Williamson <alex.l.williamson@gmail.com>"

* remotes/awilliam/tags/vfio-update-20150204.0:
  vfio-pci: Fix missing unparent of dynamically allocated MemoryRegion
  vfio: fix wrong initialize vfio_group_list

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

vfio-pci: Fix missing unparent of dynamically allocated MemoryRegion

Commit d8d95814609e added explicit object_unparent() calls for
dynamically allocated MemoryRegions. The VFIOMSIXInfo structure also
contains such a MemoryRegion, covering the mmap'd region of a PCI BAR
above the MSI-X table. This structure is freed as part of the class
exit function and therefore also needs an explicit object_unparent().
Failing to do this results in random segfaults due to fields within
the structure, often the class pointer, being reclaimed and corrupted
by the time object_finalize_child_property() is called for the object.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: qemu-stable@nongnu.org # 2.2

vfio: fix wrong initialize vfio_group_list

Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

Merge remote-tracking branch 'remotes/rth/tags/pull-tg-s390-20150203' into staging

s390 translator bug fixes

# gpg: Signature made Tue 03 Feb 2015 20:39:15 GMT using RSA key ID 4DD0279B
# gpg: Good signature from "Richard Henderson <rth7680@gmail.com>"
# gpg:                 aka "Richard Henderson <rth@redhat.com>"
# gpg:                 aka "Richard Henderson <rth@twiddle.net>"

* remotes/rth/tags/pull-tg-s390-20150203:
  target-s390x: fix and optimize slb* and slbg* computation of carry/borrow flag
  target-s390x: support OC and NC in the EX instruction
  disas/s390.c: Remove unused variables
  target-s390x: Mark check_privileged() as !CONFIG_USER_ONLY
  target-s390: Implement ECAG
  target-s390: Implement LURA, LURAG, STURG
  target-s390: Fix STURA
  target-s390: Fix STIDP
  target-s390: Implement EPSW
  target-s390: Implement SAM specification exception

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

target-s390x: fix and optimize slb* and slbg* computation of carry/borrow flag

This patch fixes the bug with borrow_in being set incorrectly, but it
also simplifies the logic to be much more plain, improving speed.  It
fixes both the 32-bit SLB* and 64-bit SLBG*.

The SLBG* change has been well-tested.  I haven't tested the SLB* change
explicitly, but the code was copy-pasted from the tested code.

The error of these functions' current implementations would not likely
be triggered by compiler-generated code, since the only error was in the
state of the carry/borrow flag.  Compilers rarely generate an
instruction sequence such as carry-set -> carry-set-and-use ->
carry-use.

(With Paolo's fix and mine, there are still a couple of failures from
GMP's testsuite, but they are almost surely due to incorrect code
generation from gcc 4.9.  But since this gcc is running under qemu, it
might be qemu bugs.  I intend to investigate this.)

Signed-off-by: Torbjorn Granlund <torbjorng@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Richard Henderson <rth@twiddle.net>

target-s390x: support OC and NC in the EX instruction

This is needed to run the GMP testsuite.

Reported-by: Torbjorn Granlund <torbjorng@google.com>
Tested-by: Torbjorn Granlund <torbjorng@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Richard Henderson <rth@twiddle.net>