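Huffman Coding

Overview

1. Huffman coding is both an encoding scheme and the algorithm that constructs it

2. It is a variable-length code and a prefix code: every symbol (with its weight) sits on a leaf node of the tree, so decoding is never ambiguous

3. Prefix code: when a character set is encoded, no character's code may be a prefix of any other character's code

4. Depending on how ties are ordered during sorting, the resulting Huffman codes may differ, but the weighted path length (WPL) of the tree is always the same and always minimal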
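The prefix property from point 3 can be checked mechanically. The following is a minimal sketch; the class name PrefixCodeCheck and the method isPrefixFree are hypothetical names chosen for illustration and are not part of the implementation later in these notes.

```java
import java.util.Arrays;
import java.util.List;

class PrefixCodeCheck {
    // Returns true if no code in the list is a prefix of a different entry.
    static boolean isPrefixFree(List<String> codes) {
        for (int i = 0; i < codes.size(); i++) {
            for (int j = 0; j < codes.size(); j++) {
                if (i != j && codes.get(j).startsWith(codes.get(i))) {
                    return false; // codes[i] is a prefix of codes[j]
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // "0", "10", "110", "111" can be read off the leaves of a binary tree.
        System.out.println(isPrefixFree(Arrays.asList("0", "10", "110", "111"))); // true
        // "0" is a prefix of "01", so "01..." could not be decoded unambiguously.
        System.out.println(isPrefixFree(Arrays.asList("0", "01", "11")));          // false
    }
}
```

Codes read off the leaves of a binary tree always pass this check, because the path to one leaf can never continue on to another leaf.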

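Data Compression with Huffman Coding

1. Each newly built subtree is placed after all existing subtrees of equal weight, which minimizes the variance of the code lengths

2. Implementation approach

(1) Node class: data holds the byte (character), weight is the total number of times it occurs, left is the left child, right is the right child

(2) Assuming the data is a string, call getBytes() to obtain its byte[] array; for ASCII text each element is the ASCII code of one character

(3) Traverse the byte[] array and count the occurrences of each byte in a HashMap<Byte, Integer>; then create a Node from each key-value pair and add it to a List

(4) Build the Huffman tree from the List; non-leaf nodes hold no data (data is null). Derive the Huffman code table from the tree

(5) Map each byte of the original byte[] array through the code table to get its variable-length binary Huffman code, concatenate the codes, treat every 8 bits as a two's complement value, convert it to a byte, and store it in a new byte[] array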
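Steps (3) and (5) are the two mechanical parts of compression, and they can be sketched in isolation. The class name CompressSteps and the methods countBytes and pack are hypothetical; Map.merge is used here as a shorter alternative to the get/put pattern in the full listing.

```java
import java.util.HashMap;
import java.util.Map;

class CompressSteps {
    // Step (3): count how often each byte value occurs.
    static Map<Byte, Integer> countBytes(byte[] data) {
        Map<Byte, Integer> freq = new HashMap<>();
        for (byte b : data) {
            freq.merge(b, 1, Integer::sum); // insert 1, or add 1 to the existing count
        }
        return freq;
    }

    // Step (5): pack a string of '0'/'1' characters into bytes, 8 bits per byte.
    // The last chunk may be shorter than 8 bits.
    static byte[] pack(String bits) {
        byte[] out = new byte[(bits.length() + 7) / 8];
        for (int i = 0, k = 0; i < bits.length(); i += 8, k++) {
            String chunk = bits.substring(i, Math.min(i + 8, bits.length()));
            // parseInt yields 0..255; the cast keeps the low 8 bits (two's complement).
            out[k] = (byte) Integer.parseInt(chunk, 2);
        }
        return out;
    }
}
```

For example, pack("1111111100000001") produces the two bytes -1 and 1, while pack("0111") produces the single byte 7, which no longer records that the chunk was four bits long.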

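Data Decompression with Huffman Coding

1. Implementation approach

(1) Convert the compressed byte[] array back into a binary (two's complement) string

(2) Create a new HashMap that inverts the Huffman code table (code -> byte)

(3) Using the inverted table, decode the binary string back into the original byte[] array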
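Steps (2) and (3) can be illustrated on a hand-written code table. This is a sketch under the assumption that the bit string has already been recovered; DecodeSketch and decode are hypothetical names, and the loop grows a candidate code one bit at a time instead of using substring as the full listing does.

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;

class DecodeSketch {
    static byte[] decode(String bits, Map<Byte, String> codeTable) {
        // Step (2): invert the table so codes can be looked up directly.
        Map<String, Byte> reverse = new HashMap<>();
        for (Map.Entry<Byte, String> e : codeTable.entrySet()) {
            reverse.put(e.getValue(), e.getKey());
        }
        // Step (3): extend the candidate code bit by bit until it matches.
        // The prefix property guarantees the first match is the right one.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < bits.length(); i++) {
            current.append(bits.charAt(i));
            Byte b = reverse.get(current.toString());
            if (b != null) {
                out.write(b.byteValue());
                current.setLength(0); // start the next code
            }
        }
        return out.toByteArray();
    }
}
```

With the table a -> 0, b -> 10, c -> 11, the bit string 010110 decodes to the bytes for "abca".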

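Padding Non-Final Bytes

1. Example: 11111111

(1) Compression treats 11111111 as a two's complement value and stores it as the byte -1 (sign-magnitude: 10000001)

(2) Decompression: the 1-byte (8-bit) value 10000001 (sign-magnitude) is widened to a 4-byte (32-bit) int, 10000000 00000000 00000000 00000001 (sign-magnitude), with the added bits filled with 0

(3) Because the number is negative, its ones' complement is 11111111 11111111 11111111 11111110 and its two's complement is 11111111 11111111 11111111 11111111

(4) Byte has no toBinaryString method; Integer.toBinaryString turns an int into its two's complement binary string, which is why the byte has to be widened to an int first

(5) A negative number has sign bit 1, so the string keeps all 32 digits and nothing is dropped; a negative number therefore needs no padding

(6) Take the last eight characters to get the Huffman bit string: 11111111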

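2. Example: 00000001

(1) Compression treats 00000001 as a two's complement value and stores it as the byte 1

(2) Decompression: the 1-byte (8-bit) value 00000001 is widened to a 4-byte (32-bit) int, 00000000 00000000 00000000 00000001, with the high bits filled with 0

(3) Because the number is positive, its sign-magnitude, ones' complement, and two's complement forms are identical: 00000000 00000000 00000000 00000001

(4) A positive number has sign bit 0, and toBinaryString drops leading zeros, so the string is just: 1

(5) The value must be ORed with 256 to pad it. 256 in two's complement is 00000000 00000000 00000001 00000000, so the result is 00000000 00000000 00000001 00000001

(6) Take the last eight characters to get the Huffman bit string: 00000001

3. Summary

(1) For a non-negative value, ORing with 256 sets the ninth bit from the right to 1, so the low eight bits all survive in the string

(2) For a negative value, ORing with 256 does not change the low eight bits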
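The padding rule summarized above fits in a few lines. This is a minimal sketch; PadByte and toEightBits are hypothetical names used only for this illustration.

```java
class PadByte {
    // Returns the 8-character two's complement bit string of b.
    static String toEightBits(byte b) {
        int widened = b;   // sign-extended to 32 bits
        widened |= 256;    // set bit 8 so the string always has at least 9 digits
        String s = Integer.toBinaryString(widened);
        return s.substring(s.length() - 8); // keep only the low eight bits
    }

    public static void main(String[] args) {
        System.out.println(toEightBits((byte) -1)); // 11111111
        System.out.println(toEightBits((byte) 1));  // 00000001
    }
}
```

Without the OR, Integer.toBinaryString(1) would return "1" and the substring call would fail because the string is shorter than eight characters.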

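Padding the Final Byte

1. Example: 0111

(1) Compression treats 0111 as a two's complement value and stores it as the byte 7

(2) Decompression: the 1-byte (8-bit) value 00000111 is widened to a 4-byte (32-bit) int, 00000000 00000000 00000000 00000111, with the high bits filled with 0

(3) Because the number is positive, its sign-magnitude, ones' complement, and two's complement forms are identical: 00000000 00000000 00000000 00000111

(4) A positive number has sign bit 0, and toBinaryString drops leading zeros, so the string is: 111

(5) Suppose the value is ORed with 256 as before. 256 in two's complement is 00000000 00000000 00000001 00000000, so the result is 00000000 00000000 00000001 00000111

(6) Taking the last eight characters gives 00000111, which clearly differs from the original data 0111

2. Summary

(1) When the final value is non-negative, compression must record how many leading 0s the final chunk has, and decompression must put those 0s back

(2) When the final value is negative, the sign bit is 1, so the chunk must have been a full eight bits and no zeros need to be restored; only the low eight bits of the string are kept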
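One way to make the final byte round-trip is sketched below. It is an assumption-laden illustration, not necessarily how the listing that follows does it: LastChunk, droppedZeros, and restore are hypothetical names, and masking with 0xFF is a substitute for treating negative and non-negative bytes separately.

```java
class LastChunk {
    // Compression side: how many leading zeros Integer.toBinaryString will drop.
    // Defined as a length difference so that an all-zero chunk also round-trips.
    static int droppedZeros(String chunk) {
        int value = Integer.parseInt(chunk, 2);
        return chunk.length() - Integer.toBinaryString(value).length();
    }

    // Decompression side: rebuild the original chunk from the stored byte.
    static String restore(byte b, int droppedZeros) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < droppedZeros; i++) {
            sb.append('0');
        }
        // b & 0xFF maps the byte to 0..255, so a negative byte gives 8 digits, not 32.
        sb.append(Integer.toBinaryString(b & 0xFF));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(restore((byte) 7, droppedZeros("0111")));      // 0111
        System.out.println(restore((byte) -1, droppedZeros("11111111"))); // 11111111
    }
}
```

The count has to be stored alongside the compressed bytes; if it only lives in memory, a file decompressed in a later run cannot recover it.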

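Notes

1. For files that are already compressed, or that contain little repetition, Huffman compression gains little

2. Huffman coding here operates on bytes, so it can process any kind of file

Implementation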

import java.io.*;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HuffmanCode {
    public static Map<Byte, String> huffmanCodeTable = new HashMap<>();//Huffman code table: byte -> code string
    public static int count = 0;//number of leading zeros dropped from the final chunk; needed for decompression

    //Convenience wrapper: compress a byte array in one call
    public static byte[] huffmanZip(byte[] bytes) {
        //Turn the byte array into a list of Nodes
        List<Node> nodes = getList(bytes);
        //Build the Huffman tree
        Node huffmanTreeRoot = createHuffmanTree(nodes);
        //Generate the Huffman code table from the tree
        Map<Byte, String> huffmanCodeTable = huffmanCodeTable(huffmanTreeRoot);
        //Compress the original bytes using the code table
        return huffmanZip(bytes, huffmanCodeTable);
    }

    //Create one Node per distinct byte and collect the Nodes in a List
    public static List<Node> getList(byte[] bytes) {
        List<Node> nodes = new ArrayList<>();
        //Count the occurrences of each byte: key is the byte, value is the number of occurrences
        HashMap<Byte, Integer> map = new HashMap<>();
        for (byte b : bytes) {
            Integer occurrences = map.get(b);
            if (occurrences == null) {//the byte is not in the map yet
                map.put(b, 1);//first occurrence
            } else {//the byte is already in the map
                map.put(b, occurrences + 1);
            }
        }
        //Move the entries of the map into the list
        for (Map.Entry<Byte, Integer> entry : map.entrySet()) {
            nodes.add(new Node(entry.getKey(), entry.getValue()));
        }
        return nodes;
    }

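    //Build the Huffman tree from the List
    public static Node createHuffmanTree(List<Node> nodes) {
        while (nodes.size() > 1) {
            //Sort ascending by weight until one element remains. List.sort is stable, so a new subtree
            //(appended at the end of the list) stays after existing subtrees of equal weight.
            //The comparator must return 0 for equal weights to honor the Comparator contract.
            nodes.sort((o1, o2) -> Integer.compare(o1.weight, o2.weight));
            //Take the node with the smallest weight
            Node leftNode = nodes.get(0);
            //Take the node with the second smallest weight
            Node rightNode = nodes.get(1);
            //Their parent carries the sum of both weights and holds no data
            Node root = new Node(null, (leftNode.weight + rightNode.weight));
            root.left = leftNode;
            root.right = rightNode;
            //Remove the two processed subtrees from the List
            nodes.remove(leftNode);
            nodes.remove(rightNode);
            //Add the new parent to nodes
            nodes.add(root);
        }
        return nodes.get(0);
    }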

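    //Given the root of the Huffman tree, return the Huffman code table
    public static Map<Byte, String> huffmanCodeTable(Node root) {
        StringBuilder stringBuilder = new StringBuilder();//accumulates the path to a leaf
        if (root == null) {
            return null;
        }
        huffmanCodeTable.clear();//drop codes left over from an earlier call
        if (root.data != null) {//only one distinct byte: give it the one-bit code "0" instead of an empty code
            huffmanCodeTable.put(root.data, "0");
            return huffmanCodeTable;
        }
        huffmanCodeTable(root, "", stringBuilder);
        return huffmanCodeTable;
    }

    //Compute the Huffman code of every leaf below node and store it in huffmanCodeTable
    //node: the root on the first call; code: "0" for a left branch, "1" for a right branch; stringBuilder: the path so far
    public static void huffmanCodeTable(Node node, String code, StringBuilder stringBuilder) {
        StringBuilder temp = new StringBuilder(stringBuilder);//copy the path so far, then extend it
        temp.append(code);
        if (node != null) {
            if (node.data == null) {//non-leaf node: keep descending until a leaf is reached
                //descend to the left
                huffmanCodeTable(node.left, "0", temp);
                //descend to the right
                huffmanCodeTable(node.right, "1", temp);
            } else {//leaf node: the code for this byte is complete, store it in the map
                huffmanCodeTable.put(node.data, temp.toString());
            }
        }
    }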

    //Compress with the Huffman code table
    public static byte[] huffmanZip(byte[] bytes, Map<Byte, String> huffmanCodeTable) {
        //Map every original byte to its variable-length Huffman code and concatenate the codes as a bit string
        StringBuilder stringBuilder = new StringBuilder();
        for (byte b : bytes) {
            stringBuilder.append(huffmanCodeTable.get(b));
        }
        //Length of the returned byte[]: 8 bits per byte, rounded up
        int length = (stringBuilder.length() + 7) / 8;
        byte[] compressBytes = new byte[length];
        int index = 0;
        count = 0;//reset the value left by an earlier call
        for (int i = 0; i < stringBuilder.length(); i += 8) {
            String temp = stringBuilder.substring(i, Math.min(i + 8, stringBuilder.length()));
            //Parse the chunk as a binary number into an int, then narrow it to a byte
            int value = Integer.parseInt(temp, 2);
            if (i + 8 >= stringBuilder.length()) {//final chunk: record the leading zeros that toBinaryString will drop
                count = temp.length() - Integer.toBinaryString(value).length();
            }
            compressBytes[index++] = (byte) value;
        }
        return compressBytes;
    }

    //Decompress and return the original data as a byte[]
    public static byte[] huffmanUnzip(byte[] compressBytes, Map<Byte, String> huffmanCodeTable) {
        //Recover the binary (two's complement) string of the compressed data
        StringBuilder binaryString = bytesToBinaryString(compressBytes);
        //Invert the Huffman code table so that codes can be looked up and decoded
        Map<String, Byte> reverseHuffmanCodeTable = new HashMap<>();
        for (Map.Entry<Byte, String> entry : huffmanCodeTable.entrySet()) {
            reverseHuffmanCodeTable.put(entry.getValue(), entry.getKey());
        }
        //List that collects the decoded bytes
        List<Byte> list = new ArrayList<>();
        for (int i = 0; i < binaryString.length(); ) {
            int len = 1;//length of the candidate code that starts at i
            Byte b = null;
            while (i + len <= binaryString.length()) {
                String key = binaryString.substring(i, i + len);
                b = reverseHuffmanCodeTable.get(key);
                if (b != null) {
                    break;//matched a code
                }
                len++;//no match yet: extend the candidate by one bit
            }
            if (b == null) {//the trailing bits match no code: stop instead of reading past the end
                break;
            }
            list.add(b);
            i += len;//continue right after the matched code
        }
        byte[] bytes = new byte[list.size()];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = list.get(i);//auto-unboxing
        }
        return bytes;
    }

    //Convert the given byte[] into its binary (two's complement) string
    public static StringBuilder bytesToBinaryString(byte[] compressBytes) {
        StringBuilder binaryString = new StringBuilder();//holds the bit string for compressBytes
        for (int i = 0; i < compressBytes.length; i++) {//visit every compressed byte
            int temp = compressBytes[i];
            //Non-final bytes are padded to eight bits; the final byte gets its dropped zeros back
            if (i < compressBytes.length - 1) {//not the final byte
                temp |= 256;//non-negative: sets the ninth bit so the low eight bits survive; negative: low eight bits unchanged
                String tempString = Integer.toBinaryString(temp);//two's complement string of temp
                String substring = tempString.substring(tempString.length() - 8);//keep the last eight characters
                binaryString.append(substring);
            } else {//final byte
                for (int j = 0; j < count; j++) {
                    binaryString.append("0");//restore the leading zeros dropped during compression
                }
                //Mask to the low eight bits so that a negative byte yields 8 digits instead of 32
                binaryString.append(Integer.toBinaryString(temp & 0xFF));
            }
        }
        return binaryString;//the complete bit string
    }

    //Compress a file. srcFile: full path of the file to compress; dstFile: path of the compressed file to write
    public static void zipFile(String srcFile, String dstFile) {
        try {
            //Read the whole source file; available() is only an estimate and read() may return early
            byte[] bytes = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(srcFile));
            byte[] zipBytes = huffmanZip(bytes);//compress the file contents
            //try-with-resources closes the streams even when an exception is thrown
            try (ObjectOutputStream oos = new ObjectOutputStream(
                    new BufferedOutputStream(new FileOutputStream(dstFile)))) {
                oos.writeObject(zipBytes);//the Huffman-encoded bytes
                oos.writeObject(huffmanCodeTable);//the Huffman code table
                oos.writeInt(count);//leading zeros of the final chunk, needed for decompression
            }
        } catch (Exception e) {
            System.out.println(e.getMessage());
        }
    }

    //Decompress a file. zipFile: path of the compressed file; dstFile: path of the restored file to write
    @SuppressWarnings("unchecked")
    public static void unzipFile(String zipFile, String dstFile) {
        try (ObjectInputStream ois = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(zipFile)))) {
            byte[] zipBytes = (byte[]) ois.readObject();//the Huffman-encoded bytes
            huffmanCodeTable = (Map<Byte, String>) ois.readObject();//the Huffman code table
            count = ois.readInt();//must be read in the same order zipFile wrote it
            byte[] sourceBytes = huffmanUnzip(zipBytes, huffmanCodeTable);//decompress
            try (BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream(dstFile))) {
                bos.write(sourceBytes);//write the restored data to dstFile
            }
        } catch (Exception e) {
            System.out.println(e.getMessage());
        }
    }
}

class Node {
    Byte data;//the byte stored in this node; null for non-leaf nodes
    int weight;//weight: number of occurrences
    Node left;
    Node right;

    public Node(Byte data, int weight) {
        this.data = data;
        this.weight = weight;
    }

    @Override
    public String toString() {
        return "Node{" +
                "data=" + data +
                ", weight=" + weight +
                '}';
    }
}