Huffman Tree Summary
In everyday encoding, computers usually mark symbols with fixed binary bit strings, such as ASCII codes. However, since different symbols occur with different frequencies in a given text, the 8-bit ASCII code often wastes a lot of space. To reduce this waste, we can store the symbols in another way: the Huffman coding discussed below.
The formal definition is easy to find online (e.g. on Baidu), so it is not repeated here. Instead, this note focuses on how a Huffman code is obtained, that is, how a Huffman tree is built.
In essence, a Huffman tree is a full k-ary tree (the value of k depends on the radix used to write the code): every node has degree either 0 or k. Among all trees on a given set of leaves with fixed weights, the Huffman tree is the one with the smallest weighted path length. (Every non-leaf node's weight equals the sum of its children's weights; a leaf's weight is the number of times the character it represents occurs.) The length of the string after re-encoding it with the Huffman code is exactly this weighted path length, i.e. the sum over all leaves of leaf weight × leaf depth, which also equals the sum of the weights of all non-leaf nodes. To make this length as small as possible, nodes with large weights should sit at small depths.
As an example, take the string "AAAAABCD": A occurs 5 times, and B, C, D each occur once, so the weights of A, B, C, D are 5, 1, 1, 1 respectively. Several trees can be built from these leaves:
The weighted path length of the first tree is 21.
The weighted path length of the second tree is 17.
The weighted path length of the third tree is 13.
Notice that the higher the heavy nodes sit in the tree, the smaller the whole tree's weighted path length becomes.
Based on this principle, a Huffman tree is built as follows:
⑴ Initialization: from the given n weights {w1, w2, …, wn}, build n binary trees that each consist of a single root node, obtaining a forest F = {T1, T2, …, Tn};
⑵ Select and merge: take the two trees in F whose roots have the smallest weights and use them as the left and right subtrees of a new binary tree; the weight of the new root is the sum of the weights of the two chosen roots;
⑶ Delete and insert: remove the two chosen trees from F and add the newly built tree to F;
⑷ Repeat steps ⑵ and ⑶ until only one tree remains in F; that tree is the Huffman tree.
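Running these steps on the earlier weights 5, 1, 1, 1: first merge 1 + 1 = 2, then 2 + 1 = 3, then 3 + 5 = 8. The total encoded length is the sum of the merged weights, 2 + 3 + 8 = 13, which matches the best tree above.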
To implement this in code, we maintain the forest with a priority queue. Sample code:
priority_queue<int,vector<int>,greater<int> > q;   // min-heap of tree weights
// push the weight of every leaf node into q here
int ans=0;
if(q.size()==1) ans=q.top();   // a single distinct symbol still costs 1 bit per occurrence
while(q.size()>1)
{
    int a=q.top();q.pop();     // the two smallest weights
    int b=q.top();q.pop();
    ans+=(a+b);                // each merged weight contributes to the total length
    q.push(a+b);
}
The final value of ans is the total length of the new encoding.
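As a quick check of the snippet above, here is a minimal self-contained sketch (the hard-coded weights 5, 1, 1, 1 are just the earlier "AAAAABCD" example); it should print 13:
#include<functional>
#include<cstdio>
#include<vector>
#include<queue>
using namespace std;
int main()
{
    priority_queue<int,vector<int>,greater<int> > q;
    int w[]={5,1,1,1};                 // leaf weights of A, B, C, D
    for(int i=0;i<4;i++) q.push(w[i]);
    int ans=0;
    if(q.size()==1) ans=q.top();
    while(q.size()>1)
    {
        int a=q.top();q.pop();
        int b=q.top();q.pop();
        ans+=(a+b);
        q.push(a+b);
    }
    printf("%d\n",ans);                // prints 13
    return 0;
}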
Sometimes we also need the length of the longest code word (that is, the depth of the tree). In that case we store each node's information (weight and height) in a struct. Since trees of different heights may be merged, the height of a newly created tree equals the largest height among its subtrees plus 1. When two trees have equal weight, merging the shallower one first keeps the maximum depth as small as possible, which is what the comparison function below does.
Code (a bit ugly):
typedef long long ll;
struct node{
    ll num;     // weight of the subtree
    ll depth;   // height of the subtree
    friend bool operator>(node a,node b)
    {
        if(a.num!=b.num)return a.num>b.num;
        return a.depth>b.depth;   // among equal weights, prefer the shallower tree
    }
};
priority_queue<node,vector<node>,greater<node> > q;
// read the leaf weights and push them into q with depth set to 0
ll ans=0;
if(q.size()==1) ans=q.top().num;
while(q.size()>1)
{
    node s;
    ll maxh=0;
    ll u=0;
    for(ll i=0;i<k;i++)   // k is the arity of the tree
    {
        u+=q.top().num;
        maxh=max(q.top().depth,maxh);
        q.pop();
    }
    ans+=u;               // merged weight contributes to the total length
    s.depth=maxh+1;
    s.num=u;
    q.push(s);
}
cout<<ans<<endl;              // total encoded length
cout<<(q.top().depth)<<endl;  // length of the longest code word
One thing to note: for a k-ary tree, the given leaves may not be enough to form a full k-ary tree, and merging naively would then give a wrong answer. A full k-ary tree always has a number of leaves congruent to 1 modulo k − 1, so we add dummy leaves (nodes with weight 0) until (n − 1) is divisible by (k − 1); the zero-weight leaves push the real nodes into better positions.
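As a small sketch (assuming the leaf count n and the arity k have already been read in), the number of dummy leaves can be computed like this:
// a full k-ary tree has a leaf count ≡ 1 (mod k-1),
// so pad with zero-weight leaves until (n-1) % (k-1) == 0
int pad=0;
if((n-1)%(k-1)!=0)
    pad=k-1-(n-1)%(k-1);
// push `pad` nodes with weight 0 (and depth 0) into the queue before merging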
Here are two example problems:
http://poj.org/problem?id=1521
poj1521
Entropy
Time Limit: 1000MS    Memory Limit: 10000K
Description
An entropy encoder is a data encoding method that achieves lossless data compression by encoding a message with "wasted" or "extra" information removed. In other words, entropy encoding removes information that was not necessary in the first place to accurately encode the message. A high degree of entropy implies a message with a great deal of wasted information; english text encoded in ASCII is an example of a message type that has very high entropy. Already compressed messages, such as JPEG graphics or ZIP archives, have very little entropy and do not benefit from further attempts at entropy encoding.
English text encoded in ASCII has a high degree of entropy because all characters are encoded using the same number of bits, eight. It is a known fact that the letters E, L, N, R, S and T occur at a considerably higher frequency than do most other letters in english text. If a way could be found to encode just these letters with four bits, then the new encoding would be smaller, would contain all the original information, and would have less entropy. ASCII uses a fixed number of bits for a reason, however: it’s easy, since one is always dealing with a fixed number of bits to represent each possible glyph or character. How would an encoding scheme that used four bits for the above letters be able to distinguish between the four-bit codes and eight-bit codes? This seemingly difficult problem is solved using what is known as a "prefix-free variable-length" encoding.
In such an encoding, any number of bits can be used to represent any glyph, and glyphs not present in the message are simply not encoded. However, in order to be able to recover the information, no bit pattern that encodes a glyph is allowed to be the prefix of any other encoding bit pattern. This allows the encoded bitstream to be read bit by bit, and whenever a set of bits is encountered that represents a glyph, that glyph can be decoded. If the prefix-free constraint was not enforced, then such a decoding would be impossible.
Consider the text "AAAAABCD". Using ASCII, encoding this would require 64 bits. If, instead, we encode "A" with the bit pattern "00", "B" with "01", "C" with "10", and "D" with "11" then we can encode this text in only 16 bits; the resulting bit pattern would be "0000000000011011". This is still a fixed-length encoding, however; we’re using two bits per glyph instead of eight. Since the glyph "A" occurs with greater frequency, could we do better by encoding it with fewer bits? In fact we can, but in order to maintain a prefix-free encoding, some of the other bit patterns will become longer than two bits. An optimal encoding is to encode "A" with "0", "B" with "10", "C" with "110", and "D" with "111". (This is clearly not the only optimal encoding, as it is obvious that the encodings for B, C and D could be interchanged freely for any given encoding without increasing the size of the final encoded message.) Using this encoding, the message encodes in only 13 bits to "0000010110111", a compression ratio of 4.9 to 1 (that is, each bit in the final encoded message represents as much information as did 4.9 bits in the original encoding). Read through this bit pattern from left to right and you’ll see that the prefix-free encoding makes it simple to decode this into the original text even though the codes have varying bit lengths.
As a second example, consider the text "THE CAT IN THE HAT". In this text, the letter "T" and the space character both occur with the highest frequency, so they will clearly have the shortest encoding bit patterns in an optimal encoding. The letters "C", "I" and "N" only occur once, however, so they will have the longest codes.
There are many possible sets of prefix-free variable-length bit patterns that would yield the optimal encoding, that is, that would allow the text to be encoded in the fewest number of bits. One such optimal encoding is to encode spaces with "00", "A" with "100", "C" with "1110", "E" with "1111", "H" with "110", "I" with "1010", "N" with "1011" and "T" with "01". The optimal encoding therefore requires only 51 bits compared to the 144 that would be necessary to encode the message with 8-bit ASCII encoding, a compression ratio of 2.8 to 1.
Input
The input file will contain a list of text strings, one per line. The text strings will consist only of uppercase alphanumeric characters and underscores (which are used in place of spaces). The end of the input will be signalled by a line containing only the word “END” as the text string. This line should not be processed.
Output
For each text string in the input, output the length in bits of the 8-bit ASCII encoding, the length in bits of an optimal prefix-free variable-length encoding, and the compression ratio accurate to one decimal point.
Sample Input
AAAAABCD
THE_CAT_IN_THE_HAT
END
Sample Output
64 13 4.9
144 51 2.8
The task is straightforward: given a string, build its Huffman code and report the length of the re-encoded string (along with the 8-bit ASCII length and the compression ratio).
Simply count each character's weight, build the Huffman tree, and compute the weighted path length.
Code:
#include<algorithm>
#include<iostream>
#include<cstring>
#include<string>
#include<cstdio>
#include<vector>
#include<stack>
#include<cmath>
#include<queue>
#include<set>
#include<map>
using namespace std;
priority_queue<int,vector<int>,greater<int> > q;
int main()
{
    string str;
    while(cin>>str&&str!="END")
    {
        int num[27];                       // counts for 'A'-'Z' plus '_'
        memset(num,0,sizeof(num));
        while(!q.empty()) q.pop();
        int len=str.length();
        int i;
        for(i=0;i<len;i++)
        {
            if(str[i]>='A'&&str[i]<='Z')
            {
                ++num[str[i]-'A'];
            }
            else ++num[26];                // underscore
        }
        for(int i=0;i<27;i++)
        {
            if(num[i]!=0)
            {
                q.push(num[i]);            // each distinct character becomes a leaf
            }
        }
        int ans=0;
        if(q.size()==1) ans=q.top();       // single distinct character: 1 bit per occurrence
        while(q.size()>1)
        {
            int a=q.top();q.pop();
            int b=q.top();q.pop();
            ans+=(a+b);
            q.push(a+b);
        }
        int oth=len*8;                     // length with 8-bit ASCII
        cout<<oth<<' '<<ans;
        printf(" %.1f\n",oth/(float)ans);
    }
    return 0;
}
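Run on the sample input, this should print the sample output, one line per string: 64 13 4.9 for "AAAAABCD" and 144 51 2.8 for "THE_CAT_IN_THE_HAT".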
https://www.luogu.org/problem/show?pid=P2168
Luogu P2168
Description
He who chases shadows is himself a shadow. — Homer
Allison has recently become obsessed with literature. She likes nothing better than to spend a lazy afternoon slowly sipping a cappuccino while quietly reading her beloved Homeric epics. But the epics, made up of the Odyssey and the Iliad, are simply too long, and Allison wants to make them shorter by re-encoding them.
The epics contain n distinct words, numbered from 1 to n; the i-th word occurs wi times in total. Allison wants to replace the i-th word with a k-ary string si such that:
for any 1 ≤ i, j ≤ n with i ≠ j, si is not a prefix of sj.
Allison now wants to know how to choose the si so that the total length of the re-encoded epics is as small as possible. Given that the total length is minimized, she also wants to know the smallest possible length of the longest si.
A string is called a k-ary string if and only if each of its characters is an integer between 0 and k − 1 (inclusive).
A string str1 is called a prefix of str2 if and only if there exists 1 ≤ t ≤ m such that str1 = str2[1..t], where m is the length of str2 and str2[1..t] denotes the string formed by the first t characters of str2.
Input and Output Format
Input:
The first line contains two positive integers n and k, separated by a single space: there are n distinct words, to be replaced with k-ary strings.
Each of the next n lines contains one non-negative integer; line i + 1 gives wi, the number of occurrences of the i-th word.
Output:
The output consists of 2 lines.
The first line contains one integer: the minimum total length of the re-encoded epics.
The second line contains one integer: the minimum possible length of the longest string si, under the condition that the total length is minimized.
Sample Input and Output
Sample Input #1:
4 2
1
1
2
2
Sample Output #1:
12
2
Sample Input #2:
6 3
1
1
3
3
9
9
Sample Output #2:
36
3
Since the tree is k-ary, the given leaves may not form a full k-ary tree, so when (n − 1) % (k − 1) ≠ 0 we first pad with k − 1 − (n − 1) % (k − 1) dummy nodes; then we build the Huffman tree on the given weights and compute the weighted path length and the tree height.
Code:
#include<algorithm>
#include<iostream>
#include<cstring>
#include<string>
#include<cstdio>
#include<vector>
#include<stack>
#include<cmath>
#include<queue>
#include<set>
#include<map>
using namespace std;
typedef unsigned long long ll;
struct node{
    ll num;     // weight of the subtree
    ll depth;   // height of the subtree
    friend bool operator>(node a,node b)
    {
        if(a.num!=b.num)return a.num>b.num;
        return a.depth>b.depth;   // among equal weights, prefer the shallower tree
    }
};
priority_queue<node,vector<node>,greater<node> > q;
int main()
{
    ll n,k,i;
    scanf("%llu%llu",&n,&k);               // ll is unsigned long long, so read with %llu
    for(i=0;i<n;i++)
    {
        ll t;
        node s;
        s.depth=0;
        scanf("%llu",&t);
        s.num=t;
        q.push(s);
    }
    if((n-1)%(k-1))                        // pad with dummy leaves so a full k-ary tree exists
    {
        for(i=0;i<k-1-(n-1)%(k-1);i++)
        {
            node s;
            s.num=0;
            s.depth=0;
            q.push(s);
        }
    }
    ll ans=0;
    if(q.size()==1) ans=q.top().num;
    while(q.size()>1)
    {
        node s;
        ll maxh=0;
        ll u=0;
        for(i=0;i<k;i++)                   // merge the k smallest trees
        {
            u+=q.top().num;
            maxh=max(q.top().depth,maxh);
            q.pop();
        }
        ans+=u;
        s.depth=maxh+1;
        s.num=u;
        q.push(s);
    }
    cout<<ans<<endl;                       // minimum total length of the re-encoded text
    cout<<(q.top().depth)<<endl;           // minimum length of the longest code word
    return 0;
}
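On the two samples above, this should print 12 and 2, then 36 and 3; the second sample is exactly the case where the dummy-leaf padding matters, since (6 − 1) % (3 − 1) ≠ 0.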