
MPEG video compression technique

April 3, 2018 / General

From: http://vsr.informatik.tu-chemnitz.de/~jan/MPEG/HTML/mpeg_tech.html

a brief discussion

The MPEG compression technique is described here only as far as is necessary to understand the problems of the Java implementation.

An MPEG "film" is a sequence of three kinds of frames:

frametype.gif

The I-frames are intra coded, i.e. they can be reconstructed without any reference to other frames. The P-frames are forward predicted from the last I-frame or P-frame, i.e. it is impossible to reconstruct them without the data of another frame (I or P). The B-frames are both forward predicted and backward predicted, from the last and the next I-frame or P-frame, i.e. two other frames are necessary to reconstruct them. P-frames and B-frames are referred to as inter coded frames.

This means a Java program must buffer at least three frames: one for forward prediction, one for backward prediction, and a third for the frame currently being reconstructed. As the figure shows, the frame used for backward prediction follows the predicted frame in display order. That would require suspending the decoding of B-frames until the next I- or P-frame appears. Fortunately, the display order is not the coding order: the frames appear in the MPEG data stream in such an order that the referenced frames precede the frames that refer to them.

For example, the frame sequence above is transferred in the following order: I P B B B P B B B. The only additional task of the decoder is to reorder the reconstructed frames into display order. To support this, each frame carries an ascending frame number (modulo 1024).
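The reordering can be sketched in a few lines of Python. This is a minimal illustration, not the article's Java decoder. The `Frame` class and `to_display_order` function are hypothetical names, and the frame numbers in the example are assumed values. Instead of sorting by the modulo-1024 frame number, the sketch uses the common rule of holding back the most recent I- or P-frame until the next I- or P-frame (or the end of the stream) arrives, while B-frames are output as soon as they are decoded. The frame numbers serve only to check the result.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    kind: str    # 'I', 'P' or 'B'
    number: int  # display-order frame number (modulo 1024 in a real stream)


def to_display_order(coded):
    """Reorder frames from coding (stream) order into display order.

    A B-frame is displayed as soon as it is decoded; each I- or P-frame is
    held back until the next I- or P-frame (or the end of the stream).
    """
    displayed = []
    held = None  # most recent reference frame that is not yet displayed
    for frame in coded:
        if frame.kind == 'B':
            displayed.append(frame)
        else:
            if held is not None:
                displayed.append(held)
            held = frame
    if held is not None:
        displayed.append(held)
    return displayed


# Coding order from the text: I P B B B P B B B (frame numbers assumed).
stream = [Frame('I', 0), Frame('P', 4), Frame('B', 1), Frame('B', 2),
          Frame('B', 3), Frame('P', 8), Frame('B', 5), Frame('B', 6),
          Frame('B', 7)]
order = to_display_order(stream)
print(''.join(f.kind for f in order))  # IBBBPBBBP
print([f.number for f in order])       # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

At any time the sketch holds one reference frame back while a B-frame passes through. This fits the three-buffer requirement described above.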

What does "prediction" mean?

motion.gif

Imagine an I-frame showing a triangle on a white background. A following P-frame shows the same triangle, but at another position. Prediction means supplying a motion vector that declares how to move the triangle of the I-frame to obtain the triangle of the P-frame. This motion vector is part of the MPEG stream and is divided into a horizontal and a vertical part. These parts can be positive or negative: a positive value means motion to the right or downwards, respectively, and a negative value means motion to the left or upwards, respectively.

The parts of the motion vector lie in the range -64 ... +63, so the referred area can be up to 64 pixels away horizontally and vertically.
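Applying a motion vector amounts to copying a block from the reference frame at a displaced position. The sketch below makes several assumptions. Vectors are whole-pixel only; real MPEG also allows half-pixel vectors, which are ignored here. A frame is a list of rows of brightness values. As in MPEG-1 decoding, the vector is added to the block's position to locate the referred area in the reference frame. `predict_block` is a hypothetical name.

```python
def predict_block(reference, x, y, dx, dy, size=16):
    """Motion-compensated prediction of one block (whole-pixel vectors only).

    The block whose top-left corner is (x, y) in the frame being decoded is
    copied from the area at (x + dx, y + dy) in the reference frame.
    Positive dx points right, positive dy points down.
    """
    if not (-64 <= dx <= 63 and -64 <= dy <= 63):
        raise ValueError("motion vector component outside -64..+63")
    rx, ry = x + dx, y + dy  # top-left corner of the referred area
    if not (0 <= rx and rx + size <= len(reference[0])
            and 0 <= ry and ry + size <= len(reference)):
        raise ValueError("referred area lies outside the reference frame")
    return [row[rx:rx + size] for row in reference[ry:ry + size]]


ref = [[0] * 8 for _ in range(8)]
ref[3][5] = ref[3][6] = ref[4][5] = ref[4][6] = 255  # bright 2x2 square
print(predict_block(ref, 1, 2, 4, 1, size=2))  # [[255, 255], [255, 255]]
```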



However, this model assumes that every change between frames can be expressed as a simple displacement of pixels, and the figure to the right shows that this isn't true: the red rectangle is shifted and also rotated by 5° to the right. A simple displacement of the red rectangle therefore causes a prediction error, so the MPEG stream contains a matrix that compensates for this prediction error.
predict.gif

Thus, the reconstruction of inter coded frames proceeds in two steps:

  1. Application of the motion vector to the referred frame;
  2. Adding the prediction error compensation to the result.

reconstruct.gif

Note that the prediction error compensation requires fewer bytes than the whole frame, because the white parts are zero and can be discarded from the MPEG stream. Furthermore, the DCT compression (see later in this chapter) is applied to the prediction error, which decreases its size further.

Note also the different meanings of the two + signs. The first means adding the motion vector to the x and y coordinates of each pixel. The second means adding an error value to the color value of the corresponding pixel.
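The second step, adding the prediction error, can be sketched as a pixel-by-pixel sum. Two assumptions apply. The helper name `add_prediction_error` is hypothetical. The result is clipped to 0..255, as an 8-bit decoder would do. The motion-compensated block from step 1 is taken as an input.

```python
def add_prediction_error(predicted, error):
    """Step 2: add the decoded prediction error to the motion-compensated
    prediction pixel by pixel, clipping to the 0..255 range of 8-bit samples."""
    return [[min(255, max(0, p + e)) for p, e in zip(row_p, row_e)]
            for row_p, row_e in zip(predicted, error)]


predicted = [[100, 100], [100, 250]]  # result of step 1 (tiny 2x2 example)
error = [[0, -20], [5, 10]]           # decoded prediction error
print(add_prediction_error(predicted, error))  # [[100, 80], [105, 255]]
```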

But what if some parts move to the left and others to the right?

The motion vector isn't valid for the whole frame. Instead, the frame is divided into macro blocks of 16x16 pixels, and every macro block has its own motion vector. Of course, this does not avoid contradictory motion, but it minimizes its probability.
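Prediction with one vector per macro block can be sketched as follows. The sketch makes several assumptions. Frame dimensions are multiples of 16. Vectors are whole-pixel. Each vector is added to the macro block's position to find the referred area. The names `macroblock_origins` and `predict_frame` are hypothetical. The example shows two neighbouring macro blocks whose contents move in opposite directions.

```python
MB = 16  # macro block size in pixels


def macroblock_origins(width, height):
    """Yield (col, row, x, y) for every macro block; width and height are
    assumed to be multiples of 16."""
    for row in range(height // MB):
        for col in range(width // MB):
            yield col, row, col * MB, row * MB


def predict_frame(reference, vectors):
    """Predict a whole frame macro block by macro block.

    vectors[(col, row)] is that macro block's own whole-pixel motion vector
    (dx, dy); macro blocks without an entry use (0, 0).
    """
    height, width = len(reference), len(reference[0])
    predicted = [[0] * width for _ in range(height)]
    for col, row, x, y in macroblock_origins(width, height):
        dx, dy = vectors.get((col, row), (0, 0))
        sx, sy = x + dx, y + dy  # top-left corner of the referred area
        if not (0 <= sx <= width - MB and 0 <= sy <= height - MB):
            raise ValueError(f"vector of macro block {(col, row)} leaves the frame")
        for j in range(MB):
            predicted[y + j][x:x + MB] = reference[sy + j][sx:sx + MB]
    return predicted


print(len(list(macroblock_origins(352, 288))))  # 396 (22 x 18 macro blocks)

# Two macro blocks whose contents move in opposite directions.
reference = [[0] * 16 + [200] * 16 for _ in range(16)]
predicted = predict_frame(reference, {(0, 0): (16, 0), (1, 0): (-16, 0)})
print(predicted[0][0], predicted[0][31])  # 200 0
```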

And if contradictory motion does occur? One of the greatest misunderstandings of the MPEG compression technique is to assume that all macro blocks of P-frames are predicted. If the prediction error is too big, the coder can decide to intra-code a macro block. Similarly, the macro blocks in B-frames can be forward predicted, backward predicted, forward and backward predicted, or intra-coded.
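The decision between these modes is the coder's business. MPEG specifies how to decode each mode, not how to choose one. The following is therefore only a hypothetical encoder heuristic. It measures each candidate prediction by the sum of absolute differences (SAD) and falls back to intra coding when even the best prediction error exceeds an assumed threshold `intra_threshold`. The averaging of forward and backward predictions, with halves rounded up, is also an assumption of this sketch. All names are illustrative.

```python
def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(p - q) for row_a, row_b in zip(a, b) for p, q in zip(row_a, row_b))


def choose_mode(block, forward=None, backward=None, intra_threshold=2000):
    """Hypothetical encoder heuristic for one macro block.

    forward / backward: motion-compensated predictions from the past / future
    reference frame, or None if not available (a P-frame has only forward).
    Returns the mode with the smallest prediction error, or 'intra' if even
    that error exceeds intra_threshold.
    """
    errors = {}
    if forward is not None:
        errors['forward'] = sad(block, forward)
    if backward is not None:
        errors['backward'] = sad(block, backward)
    if forward is not None and backward is not None:
        # Interpolated prediction: average of both references, halves rounded up.
        average = [[(f + b + 1) // 2 for f, b in zip(row_f, row_b)]
                   for row_f, row_b in zip(forward, backward)]
        errors['forward+backward'] = sad(block, average)
    if not errors:
        return 'intra'
    mode, error = min(errors.items(), key=lambda item: item[1])
    return mode if error <= intra_threshold else 'intra'


def flat(value):
    """A 16x16 block filled with one value."""
    return [[value] * 16 for _ in range(16)]


print(choose_mode(flat(10), forward=flat(10)))                    # forward
print(choose_mode(flat(10), forward=flat(0), backward=flat(20)))  # forward+backward
print(choose_mode(flat(10), forward=flat(250)))                   # intra
```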

macroblocks.gif
blocks.gif

Every macro block contains 4 luminance blocks and 2 chrominance blocks, and every block has a dimension of 8x8 values. The luminance blocks contain the brightness of every pixel in the macro block. The chrominance blocks contain color information. Because of some properties of the human eye, it isn't necessary to give color information for every pixel; instead, 4 pixels (a 2x2 square) are related to one color value. This color value is divided into two parts: the first is in the Cb color block, the second in the Cr color block. The color information is applied as shown in the picture to the left.
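The layout of a macro block can be sketched in Python. The sketch assumes a 16x16 luminance array and two 8x8 chrominance arrays per macro block, with the luminance blocks ordered top-left, top-right, bottom-left, bottom-right, as in MPEG-1. The function names are hypothetical. Because each Cb/Cr value covers a 2x2 square, a pixel's color is found by halving its coordinates.

```python
def split_macroblock(y, cb, cr):
    """Return the six 8x8 blocks of one macro block: Y0, Y1, Y2, Y3, Cb, Cr.

    y is 16x16 (one luminance value per pixel); cb and cr are 8x8 (one
    value per 2x2 square of pixels). The luminance blocks are ordered
    top-left, top-right, bottom-left, bottom-right.
    """
    def quarter(top, left):
        return [row[left:left + 8] for row in y[top:top + 8]]

    return [quarter(0, 0), quarter(0, 8), quarter(8, 0), quarter(8, 8), cb, cr]


def color_of(cb, cr, px, py):
    """(Cb, Cr) pair that applies to pixel (px, py) of the macro block."""
    return cb[py // 2][px // 2], cr[py // 2][px // 2]


y = [[16 * r + c for c in range(16)] for r in range(16)]
cb = [[c for c in range(8)] for r in range(8)]
cr = [[r for c in range(8)] for r in range(8)]

blocks = split_macroblock(y, cb, cr)
print(len(blocks), blocks[1][0][0], blocks[2][0][0])   # 6 8 128
print(color_of(cb, cr, 5, 3), color_of(cb, cr, 4, 2))  # (2, 1) (2, 1)
```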

Depending on the kind of macro block, the blocks contain pixel information or prediction error information, as mentioned above. In any case, the information is compressed using the discrete cosine transform (DCT).



The discrete cosine transform (DCT)

As mentioned above, the 8x8 block values are coded by means of the discrete cosine transform. To explain this, consider the figure to the right. It is meant to be an enlargement of an 8x8 area of a black-and-white frame. p_anf.gif
120 108 90 75 69 73 82 89
127 115 97 81 75 79 88 95
134 122 105 89 83 87 96 103
137 125 107 92 86 90 99 106
131 119 101 86 80 83 93 100
117 105 87 72 65 69 78 85
100 88 70 55 49 53 62 69
89 77 59 44 38 42 51 58
The normal way is to determine the brightness of each of the 64 pixels and to scale it to some range, say from 0 to 255*, where "0" means "black" and "255" means "white". We can also represent the values as an 8x8 bar diagram. Normally the values are processed line by line, which requires 64 bytes of storage.
anf.gif

*(In MPEG a range from -256 to 255 is used.)

But you can describe all 64 values, up to small rounding errors, by only 5 integers if you apply the following formula, called the discrete cosine transform (DCT):

dct.png

where f(x,y) is the brightness of the pixel at position [x,y]. The result F is an 8x8 array, too:

700 90 100 0 0 0 0 0
90 0 0 0 0 0 0 0
-89 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
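Since dct.png is only an image placeholder, here is the standard 8x8 DCT written out as a deliberately slow Python sketch. It computes F(u,v) = 1/4 C(u) C(v) Σx Σy f(x,y) cos((2x+1)uπ/16) cos((2y+1)vπ/16), with C(0) = 1/√2 and C(k) = 1 otherwise. This textbook form is assumed to match the missing image. Real decoders use fast factorized transforms instead. The sketch stores F with the row index as the vertical frequency v and the column index as the horizontal frequency u, which matches the tables in this article. For the sample block it gives a DC value of 699.875. The table presumably shows rounded values, hence 700.

```python
import math

N = 8


def c(k):
    """Normalisation factor C(k): 1/sqrt(2) for k == 0, otherwise 1."""
    return 1 / math.sqrt(2) if k == 0 else 1.0


def dct2(block):
    """Forward 2-D DCT of an 8x8 block given as block[y][x].

    Returns F[v][u]: row v is the vertical and column u the horizontal
    frequency, so F[0][0] is the DC value (the layout used in the tables).
    """
    F = [[0.0] * N for _ in range(N)]
    for v in range(N):
        for u in range(N):
            total = 0.0
            for y in range(N):
                for x in range(N):
                    total += (block[y][x]
                              * math.cos((2 * x + 1) * u * math.pi / 16)
                              * math.cos((2 * y + 1) * v * math.pi / 16))
            F[v][u] = 0.25 * c(u) * c(v) * total
    return F


block = [
    [120, 108, 90, 75, 69, 73, 82, 89],
    [127, 115, 97, 81, 75, 79, 88, 95],
    [134, 122, 105, 89, 83, 87, 96, 103],
    [137, 125, 107, 92, 86, 90, 99, 106],
    [131, 119, 101, 86, 80, 83, 93, 100],
    [117, 105, 87, 72, 65, 69, 78, 85],
    [100, 88, 70, 55, 49, 53, 62, 69],
    [89, 77, 59, 44, 38, 42, 51, 58],
]
F = dct2(block)
print(round(F[0][0], 3))  # 699.875, the 700 in the table
```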
But as you can see, almost all values are equal to zero. Because the non-zero values are concentrated in the upper left corner, the matrix is transferred to the receiver in zigzag scan order. That results in:
700 90 90 -89 0 100 0 0 0 .... 0
Of course, the trailing zeros are not transferred; an End-Of-Block code is sent instead.
zigzag.gif
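The zigzag scan and the End-Of-Block idea can be sketched as follows. One point goes beyond the text: MPEG-1 actually codes the AC values as (run of zeros, level) pairs with variable-length codes, and it codes the DC value of intra blocks separately. This sketch keeps only the ordering and the run/level pairing. The names `ZIGZAG`, `zigzag_scan` and `run_level` are hypothetical.

```python
N = 8

# Zigzag order: walk the anti-diagonals (row + col constant) starting at the
# DC position; on odd diagonals the row index rises, on even ones it falls.
ZIGZAG = sorted(((r, c) for r in range(N) for c in range(N)),
                key=lambda rc: (rc[0] + rc[1],
                                rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]))


def zigzag_scan(F):
    """Read an 8x8 matrix F[row][col] into a list of 64 values in zigzag order."""
    return [F[r][c] for r, c in ZIGZAG]


def run_level(scanned):
    """Code the AC values (everything after the DC value) as (zero run, level)
    pairs; trailing zeros are dropped and replaced by an 'EOB' marker."""
    pairs, run = [], 0
    for value in scanned[1:]:
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    return pairs + ['EOB']


F = [[0] * N for _ in range(N)]
F[0][0], F[0][1], F[0][2], F[1][0], F[2][0] = 700, 90, 100, 90, -89

scanned = zigzag_scan(F)
print(scanned[:6])         # [700, 90, 90, -89, 0, 100]
print(run_level(scanned))  # [(0, 90), (0, 90), (0, -89), (1, 100), 'EOB']
```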

The decoder can reconstruct the pixel values using the following formula, called the inverse discrete cosine transform (IDCT):

idct.png

where F(u,v) is the transform matrix value at position [u,v]. The results are exactly the original pixel values, so MPEG compression could be regarded as lossless. But that isn't true, because the transformed values are quantized: they are (integer) divided by a certain value greater than or equal to 8. The DCT supplies values up to 2047, and dividing by at least 8 reduces them to byte size. The decoder multiplies the result by the same value. Of course, the result differs from the original value, but again, because of some properties of the human eye, the error isn't visible. In MPEG there is a quantization matrix that defines a different quantization value for every transform value, depending on its position.
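The loss introduced by quantization can be made concrete. The sketch writes out the standard IDCT, f(x,y) = 1/4 Σu Σv C(u) C(v) F(u,v) cos((2x+1)uπ/16) cos((2y+1)vπ/16), and quantizes the example matrix. It makes several simplifying assumptions. A single quantization value of 8 is used for every position instead of MPEG's quantization matrix. The integer division truncates toward zero. The helper names are hypothetical.

```python
import math

N = 8


def c(k):
    """Normalisation factor C(k): 1/sqrt(2) for k == 0, otherwise 1."""
    return 1 / math.sqrt(2) if k == 0 else 1.0


def idct2(F):
    """Inverse 2-D DCT of an 8x8 matrix F[v][u]; returns pixels f[y][x]."""
    return [[0.25 * sum(c(u) * c(v) * F[v][u]
                        * math.cos((2 * x + 1) * u * math.pi / 16)
                        * math.cos((2 * y + 1) * v * math.pi / 16)
                        for v in range(N) for u in range(N))
             for x in range(N)]
            for y in range(N)]


def quantize(F, q):
    """Encoder: integer-divide every coefficient by q (truncating toward zero)."""
    return [[int(value / q) for value in row] for row in F]


def dequantize(Q, q):
    """Decoder: multiply every quantized coefficient by the same q."""
    return [[value * q for value in row] for row in Q]


# The transform matrix from the example above.
F = [[0] * N for _ in range(N)]
F[0][0], F[0][1], F[0][2], F[1][0], F[2][0] = 700, 90, 100, 90, -89

restored = dequantize(quantize(F, 8), 8)
print(restored[0][:3], restored[1][0], restored[2][0])  # [696, 88, 96] 88 -88

exact, lossy = idct2(F), idct2(restored)
error = max(abs(a - b) for ra, rb in zip(exact, lossy) for a, b in zip(ra, rb))
print(f"largest pixel error: {error:.2f}")
```

The coefficients change by a few units. Because the IDCT divides each contribution among all 64 pixels, the pixel values move by only about one or two brightness levels.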

Was it a well-chosen example?

No, it wasn't! The DCT generally tends to compute zeros, and quantization, which turns small values into zeros, assists this effect. To understand this, one must recognize the essence of the DCT. To do so, let's begin from the opposite side.

Let us apply the IDCT to a matrix containing only one value, 700, in the upper left corner:

700 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0

The result of the IDCT is:

87 87 87 87 87 87 87 87
87 87 87 87 87 87 87 87
87 87 87 87 87 87 87 87
87 87 87 87 87 87 87 87
87 87 87 87 87 87 87 87
87 87 87 87 87 87 87 87
87 87 87 87 87 87 87 87
87 87 87 87 87 87 87 87

And the bar diagram looks like this:

const.gif

Of course, the picture is a grey square. The value in the upper left corner is called the DC value. This is the abbreviation for "direct current" and refers to a similar phenomenon in the theory of alternating current, where an alternating current can have a direct component. In the DCT, the DC value determines the average brightness of the block. All other values describe the variation around this DC value; therefore they are sometimes referred to as AC values (from "alternating current").
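The flat result can be checked by hand. Since idct.png is only an image placeholder, the standard 8x8 IDCT is written out here; this textbook form is assumed to match the missing image. With only F(0,0) non-zero, both cosines equal cos 0 = 1 and the double sum collapses to one term:

```latex
\begin{aligned}
f(x,y) &= \frac{1}{4}\sum_{u=0}^{7}\sum_{v=0}^{7} C(u)\,C(v)\,F(u,v)\,
          \cos\frac{(2x+1)u\pi}{16}\,\cos\frac{(2y+1)v\pi}{16},
          \qquad C(0)=\frac{1}{\sqrt{2}},\ C(k)=1 \text{ for } k>0\\
f(x,y) &= \frac{1}{4}\cdot\frac{1}{\sqrt{2}}\cdot\frac{1}{\sqrt{2}}\cdot 700
        = \frac{700}{8} = 87.5
\end{aligned}
```

The exact value is 87.5, which the table shows as 87. In general, F(0,0)/8 is added to every pixel. This is exactly the average brightness of the block.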

Now let's add an AC value of 100:

700 100 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0

The result of the IDCT is:

105 102 97 91 84 78 73 70
105 102 97 91 84 78 73 70
105 102 97 91 84 78 73 70
105 102 97 91 84 78 73 70
105 102 97 91 84 78 73 70
105 102 97 91 84 78 73 70
105 102 97 91 84 78 73 70
105 102 97 91 84 78 73 70

And the bar diagram looks like this:

first.gif

The resulting picture looks like p_first.gif. As you can see, the values vary around the DC value of 87. Furthermore, if you look at the shape of the bar diagram, you'll see a curve like half a cosine wave. We say the picture has a frequency of 1 in X direction. Imagine a car that drives at constant speed from left to right along the "bar diagram street", parallel to the X axis. In contrast to the DC example, the car is shaken up and down at a certain frequency, but only while it follows the X axis; if it drove in Y direction instead, it would stay at the same height the whole way. The transform matrix documents this behaviour: its only AC value, 100, lies in X direction.
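Because only F(0,0) and its right-hand neighbour are non-zero, every row of the IDCT reduces to the DC level plus a single cosine. F(0,0) contributes 1/4 · C(0)² · 700 = 700/8 = 87.5 to every pixel. The AC value of 100 contributes 1/4 · C(1) · C(0) · 100 = 100/(4√2) ≈ 17.68 times cos((2x+1)π/16). The short sketch below assumes the standard IDCT formula and rounds to the nearest integer. It reproduces the row in the table.

```python
import math

dc, ac = 700, 100
row = [dc / 8 + ac / (4 * math.sqrt(2)) * math.cos((2 * x + 1) * math.pi / 16)
       for x in range(8)]
print([round(v) for v in row])  # [105, 102, 97, 91, 84, 78, 73, 70]
```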

Now let's consider what happens if we place the AC value of 100 at the next position:

700 0 100 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0

The result of the IDCT is:

104 94 81 71 71 81 94 104
104 94 81 71 71 81 94 104
104 94 81 71 71 81 94 104
104 94 81 71 71 81 94 104
104 94 81 71 71 81 94 104
104 94 81 71 71 81 94 104
104 94 81 71 71 81 94 104
104 94 81 71 71 81 94 104

And the bar diagram looks like this:

second.gif

The resulting picture looks like p_second.gif. The bar diagram again shows a cosine wave, but now a full period, i.e. the frequency is twice as high as in the first example. This behaviour would continue if we moved the AC value step by step to the right: every step increases the frequency of the cosine wave.
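One way to see the frequency rise step by step is to count how often each cosine row changes sign. The helper `sign_changes` is a hypothetical name. It evaluates the horizontal DCT basis cos((2x+1)uπ/16) for x = 0..7 and counts the sign changes. The count is exactly u, so each step to the right adds one more half period.

```python
import math


def sign_changes(u):
    """Number of sign changes of the 8-point DCT basis row for frequency u."""
    values = [math.cos((2 * x + 1) * u * math.pi / 16) for x in range(8)]
    return sum(1 for a, b in zip(values, values[1:]) if (a < 0) != (b < 0))


print([sign_changes(u) for u in range(8)])  # [0, 1, 2, 3, 4, 5, 6, 7]
```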

But what happens if we place both AC values?

700 100 100 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0

The result of the IDCT is:

121 109 91 75 68 71 80 86
121 109 91 75 68 71 80 86
121 109 91 75 68 71 80 86
121 109 91 75 68 71 80 86
121 109 91 75 68 71 80 86
121 109 91 75 68 71 80 86
121 109 91 75 68 71 80 86
121 109 91 75 68 71 80 86

And the bar diagram looks like this:

f_s.gif

The resulting picture looks like p_f_s.gif. The shape of the bar diagram is a mix of the first and the second cosine wave. Indeed, the resulting variation is simply the sum of the two cosine waves.
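The sum can be checked by hand with the standard IDCT formula. Each AC value contributes its own cosine term, and the terms are simply added to the DC level. Evaluating two columns reproduces the table:

```latex
\begin{aligned}
f(x,y) &= \frac{700}{8}
        + \frac{100}{4\sqrt{2}}\cos\frac{(2x+1)\pi}{16}
        + \frac{100}{4\sqrt{2}}\cos\frac{2(2x+1)\pi}{16}\\
f(0,y) &\approx 87.50 + 17.34 + 16.33 = 121.17 \approx 121\\
f(4,y) &\approx 87.50 - 3.45 - 16.33 = 67.72 \approx 68
\end{aligned}
```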

Now let's add an AC value in the other direction:

700 100 100 0 0 0 0 0
200 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0

The result of the IDCT is:

156 144 125 109 102 106 114 121
151 138 120 104 97 100 109 116
141 129 110 94 87 91 99 106
128 116 97 82 75 78 86 93
114 102 84 68 61 64 73 80
102 89 71 55 48 51 60 67
92 80 61 45 38 42 50 57
86 74 56 40 33 36 45 52

And the bar diagram looks like this:

fi_fi.gif

The resulting picture looks like p_fi_fi.gif. Now the values vary in Y direction, too. The principle is: the higher the index of the AC value, the higher the frequency.
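The DCT basis functions are products of a horizontal and a vertical cosine. The new value 200 in the first column therefore adds a term that depends only on y, while the values in the first row contribute terms that depend only on x. The sketch below assumes the standard IDCT formula; the helper names `horizontal` and `vertical` are hypothetical. It rebuilds the table from these two parts and rounds to the nearest integer.

```python
import math

S = 1 / (4 * math.sqrt(2))  # 1/4 * C(0) * C(k) for one zero and one non-zero index


def horizontal(x):
    """Contribution of 700, 100, 100 in the first row (varies with x only)."""
    return (700 / 8
            + S * 100 * math.cos((2 * x + 1) * math.pi / 16)
            + S * 100 * math.cos((2 * x + 1) * 2 * math.pi / 16))


def vertical(y):
    """Contribution of 200 in the first column (varies with y only)."""
    return S * 200 * math.cos((2 * y + 1) * math.pi / 16)


pixels = [[round(horizontal(x) + vertical(y)) for x in range(8)] for y in range(8)]
print(pixels[0])                   # [156, 144, 125, 109, 102, 106, 114, 121]
print([row[0] for row in pixels])  # [156, 151, 141, 128, 114, 102, 92, 86]
```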

Now, as a last example, let's place an AC value in the corner opposite the DC value (the DC value is also raised to 950). We already know what that means: the highest possible frequency, at index 7, is applied in both the X and the Y direction.

What is to be expected?

950 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 500

The result of the IDCT is:

124 105 139 95 143 98 132 114
105 157 61 187 51 176 80 132
139 61 205 17 221 32 176 98
95 187 17 239 0 221 51 143
143 51 221 0 239 17 187 95
98 176 32 221 17 205 61 139
132 80 176 51 187 61 157 105
114 132 98 143 95 139 105 124

And the bar diagram looks like this:

heigh.gif

Because of the high frequency, neighbouring values differ enormously, and the picture has a checkerboard-like appearance: p_heigh.gif. Note that this is supposed to be an 8x8 pixel enlargement of a real picture! How often does that happen? We can hope that such a case is very rare, and that's why, in almost every case, the DCT computes zeros (or values close to zero) for the higher frequencies.
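The standard IDCT formula explains the extremes in this table (row y, column x, counting from 0). Since C(7) = 1, the AC value enters with the factor 500/4 = 125:

```latex
\begin{aligned}
f(x,y) &= \frac{950}{8} + \frac{500}{4}\cos\frac{7(2x+1)\pi}{16}\cos\frac{7(2y+1)\pi}{16}
        = 118.75 + 125\cos\frac{7(2x+1)\pi}{16}\cos\frac{7(2y+1)\pi}{16}\\
f(0,0) &= 118.75 + 125\cos^2\frac{7\pi}{16} \approx 118.75 + 4.76 = 123.51 \approx 124\\
f(3,3) &= 118.75 + 125\cos^2\frac{49\pi}{16} \approx 118.75 + 120.24 = 238.99 \approx 239\\
f(4,3) &= 118.75 + 125\cos\frac{63\pi}{16}\cos\frac{49\pi}{16} \approx 118.75 - 120.24 = -1.49
\end{aligned}
```

The last value is negative, yet the table shows 0 at row 3, column 4. The program that produced the table therefore apparently clips its results to the range 0..255.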
