git.osdn.net Git - android-x86/external-ffmpeg.git/commit

author	Martin Storsjö <martin@martin.st>
	Mon, 9 Jan 2017 22:15:16 +0000 (00:15 +0200)
committer	Michael Niedermayer <michael@niedermayer.cc>
	Sat, 14 Jan 2017 20:13:32 +0000 (21:13 +0100)
commit	8b11a89c06b94632d545f67ca508bd9c05c435ac
tree	723636e32014c76b820b49b6d4c5b5c64c11837e	tree \| snapshot
parent	388f6e6715ff6f84efc858f69530867a235fe909	commit \| diff

aarch64: vp9itxfm: Skip empty slices in the first pass of idct_idct 16x16 and 32x32

This work is sponsored by, and copyright, Google.

Previously all subpartitions except the eob=1 (DC) case ran with
the same runtime:

vp9_inv_dct_dct_16x16_sub16_add_neon:   1373.2
vp9_inv_dct_dct_32x32_sub32_add_neon:   8089.0

By skipping individual 8x16 or 8x32 pixel slices in the first pass,
we reduce the runtime of these functions like this:

vp9_inv_dct_dct_16x16_sub1_add_neon:     235.3
vp9_inv_dct_dct_16x16_sub2_add_neon:    1036.7
vp9_inv_dct_dct_16x16_sub4_add_neon:    1036.7
vp9_inv_dct_dct_16x16_sub8_add_neon:    1036.7
vp9_inv_dct_dct_16x16_sub12_add_neon:   1372.1
vp9_inv_dct_dct_16x16_sub16_add_neon:   1372.1
vp9_inv_dct_dct_32x32_sub1_add_neon:     555.1
vp9_inv_dct_dct_32x32_sub2_add_neon:    5190.2
vp9_inv_dct_dct_32x32_sub4_add_neon:    5180.0
vp9_inv_dct_dct_32x32_sub8_add_neon:    5183.1
vp9_inv_dct_dct_32x32_sub12_add_neon:   6161.5
vp9_inv_dct_dct_32x32_sub16_add_neon:   6155.5
vp9_inv_dct_dct_32x32_sub20_add_neon:   7136.3
vp9_inv_dct_dct_32x32_sub24_add_neon:   7128.4
vp9_inv_dct_dct_32x32_sub28_add_neon:   8098.9
vp9_inv_dct_dct_32x32_sub32_add_neon:   8098.8

I.e. in general a very minor overhead for the full subpartition case due
to the additional cmps, but a significant speedup for the cases when we
only need to process a small part of the actual input data.

This is cherrypicked from libav commits
cad42fadcd2c2ae1b3676bb398844a1f521a2d7b and
a0c443a3980dc22eb02b067ac4cb9ffa2f9b04d2.

Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>