
Greedy Algorithms (3): The Huffman Coding Algorithm

November 7, 2018

Algorithm Description

Excerpted from the article "A simple example of Huffman coding on a string" (Chinese title: 哈弗曼编码算法).

You’ve probably heard about David Huffman and his popular compression algorithm. If you haven’t, you’ll find that information on the Internet. I will not bore you with history or math lessons in this article. Instead, I’m going to show you a practical example of this algorithm applied to a character string. The application will only print to the console the code values for the input symbols and reconstruct the original symbols from a given code.

The source code attached to this article shows how Huffman coding works, so you can get a basic understanding of it. This is for people who have difficulty with the mathematics behind it. In a future article (I hope) we’ll talk about how to apply this to arbitrary files to produce their compressed form (a simple file archiver like WinZip or WinRAR).

The idea behind Huffman coding is based on the frequency of each symbol in a sequence. The most frequent symbol in the sequence gets a very short code, and the least frequent symbol gets a very long code. When we encode the input, the most frequent symbols therefore take less space than they used to, and the least frequent symbols take more space; because they are rare, that costs little overall. For this application I chose the symbol to be 8 bits long, so that a symbol is a character (char).

We could just as easily have chosen the symbol to be 16 bits long, grouping two characters together as one symbol, or 10 bits, or 20, and so on. Depending on the input we expect, we choose the size of the symbol and the way we use it. For example, if I expected to encode raw video files, I would choose the symbol to be the size of a pixel. Keep in mind that increasing or decreasing the size of the symbol affects the size of the code for each symbol, because the bigger the size, the more distinct symbols of that size there can be: there are fewer ways to write ones and zeroes in 8 bits than in 16 bits. You’ll want to adjust the symbol size to how the ones and zeroes are likely to repeat themselves in the sequence.

For this algorithm you need a basic understanding of the binary tree and priority queue data structures. The source code reuses the priority queue code available in a previous article.

Let’s say we have the string “beep boop beer!”, which in its raw form occupies 1 byte of memory per character, i.e. 15 * 8 = 120 bits in total. After encoding, the string will occupy 40 bits. (Theoretically; in this application we output to the console a string of 40 char elements of ‘0’ and ‘1’ representing the encoded string bit by bit. For it to actually occupy 40 bits we would need to pack that string into real bits using bitwise operations, which we won’t discuss here.)

To better understand this, let’s apply the algorithm to a concrete example; the string “beep boop beer!” illustrates it well. To obtain the code for each element based on its frequency, we need to build a binary tree whose leaves each contain a symbol (a character from the string). The tree is built from the leaves toward the root, so that the least frequent elements end up farther from the root than the more frequent ones. You’ll soon see why we chose to do this.

To build the tree this way, we use a priority queue with one twist: the element with the lowest priority is the most important, so the least frequent elements are the first ones we remove from the queue. This lets us build the tree from the leaves up to the root.

First, we calculate the frequency of each character:

Character Frequency
‘b’ 3
‘e’ 4
‘p’ 2
‘ ‘ 2
‘o’ 2
‘r’ 1
‘!’ 1

After calculating the frequencies, we create a binary tree node for each character and insert them all into the priority queue, with the frequency as the priority.

We now remove the first two elements from the queue and link them by creating a new binary tree node that has both as children, so the two characters become siblings, and we add up their priorities. We then insert the new node into the queue, with the sum of its children’s priorities as its priority. (The numbers represent the priorities, i.e. the frequencies.)

We repeat the same steps, each time merging the two trees of least priority.

After we link the last two elements, we obtain the final tree.

Now, to obtain the code for each symbol we just traverse the tree until we reach that symbol, appending a 0 to the code for each step to the left and a 1 for each step to the right.

If we do this, we get the following codes:

Character Code
‘b’ 00
‘e’ 11
‘p’ 101
‘ ‘ 011
‘o’ 010
‘r’ 1000
‘!’ 1001

To decode a string of bits, we traverse the tree for each bit: if the bit is 0 we step left, and if it is 1 we step right, until we hit a leaf, which is the symbol we are looking for. For example, given the bit string “101 11 101 11” and our tree, decoding yields the string “pepe”.

It’s very important to observe that no code is a prefix of another symbol’s code. In our example, if 00 is the code for ‘b’, 000 cannot be the code for any other symbol, because there would be a conflict: after following the first two bits we would already reach ‘b’, so we could never reach a symbol at 000.

A practical aspect of implementing this algorithm is to build a Huffman table as soon as we have the tree. The table is basically a linked list or an array that holds each symbol together with its code, which makes encoding more efficient: it is awkward to search for a symbol by traversing the tree while simultaneously computing its code, because we don’t know where in the tree that symbol is located. As a rule, we use a Huffman table for encoding and the Huffman tree for decoding.

The input string : beep boop beer!

The input string in binary : 0110 0010 0110 0101 0110 0101 0111 0000 0010 0000 0110 0010 0110 1111 0110 1111 0111 0000 0010 0000 0110 0010 0110 0101 0110 0101 0111 0010 0010 0001

The encoded string : 0011 1110 1011 0001 0010 1010 1100 1111 1000 1001

As you can see, there is a major difference between the ASCII version of the string and the Huffman-coded version.

The source code behaves as described above. You’ll find more details in the comments in the code.

All the sources have been compiled and verified using the C99 standard. Happy programming.

Download the source files.

To be clearer, since there have been comments about this: this article only illustrates how the algorithm works. To use it in a real scenario, you need to attach the Huffman tree you created to the start or end of your encoded string, and the receiver must know how to interpret it in order to decode the message. A good way to do this is to build a string of bits by traversing the tree in any order you like (I prefer post-order), concatenating a 0 for each node that is not a leaf and a 1 for each leaf, followed by the bits of the original symbol (in our case, the 8-bit ASCII value of the char). Placing this bit string at the start of the encoded message is ideal. Once the receiver has rebuilt the tree, they will know how to decode the message and recover the original.

To read about a more advanced version of this algorithm, you can read our article on Adaptive Huffman Coding.

Implementing Huffman encoding and decoding

(Excerpted from "Data Structure and Algorithm in JAVA")

How to build the Huffman coding tree

   Here is the algorithm for constructing the tree (initialization: create a tree object for each character and insert these trees into a priority queue):
1. Make a Node object for each character used in the message. Each node has two data items: the character and that character’s frequency in the message.
2. Make a tree object for each of these nodes. The node becomes the root of the tree.
3. Insert these trees in a priority queue. They are ordered by frequency, with the smallest frequency having the highest priority. That is, when you remove a tree, it’s always the one with the least-used character.
    Now do the following (building the Huffman coding tree):
1. Remove from the priority queue. Remove two trees from the priority queue, and make them into children of a new node. The new node has a frequency that is the sum of the children’s frequencies; its character field can be left blank.
2. Insert into the priority queue. Insert this new, larger tree back into the priority queue.
3. Keep repeating steps 1 and 2. The trees will get larger and larger, and there will be fewer and fewer of them. When there is only one tree left in the queue, it is the Huffman tree and you’re done.
Encoding
   How do we create the Huffman code to put into the code table? The process is like decoding a message: we start at the root of the Huffman tree and follow every possible path to a leaf node. As we go along a path, we remember the sequence of left and right choices, recording a 0 for a left edge and a 1 for a right edge. When we arrive at the leaf node for a character, the sequence of 0s and 1s is the Huffman code for that character. We put this code into the code table at the appropriate index number.

  This process can be handled by calling a method that starts at the root and then calls itself recursively for each child. Eventually, the paths to all the leaf nodes will be explored and the code table will be complete.
Decoding
Start at the root: for a 0 bit move to the left child, for a 1 bit move to the right child, until you reach a leaf node, which holds the decoded symbol.
