1亿个数引发的思考(二)之 开拓视野 BitMap Index 和 布隆过滤器
java实现:
java.util.BitSet
谷歌实现
<dependency> <groupId>com.googlecode.javaewah</groupId> <artifactId>JavaEWAH</artifactId> <version>1.2.3</version> </dependency>
资料:
https://encyclopedia.thefreedictionary.com/Bitmap+index
理解:
1bit可以表示两种状态,0不存在,1存在
下标可以表示某个值(通过hash放到篮子里)
如何表示N个数的状态呢,byte=1字节=8位,long=8字节=64位,可以通过m个连续的long即long[m] ,m = N/64 + 1
把N个数add到bit set 中,即找到下标,bit位值为1
public void set(int bitIndex) {...}
public boolean get(int bitIndex) {...}
所以需求变成了小内存(压缩到BitSet后可以存放这些数据),流读取数据,先get,后set,如果get返回true则这个数重复,可以记录重复次数
Hash算法是散列算法,有很多种,有最出名的md5,sha256,还有很多其他的
验证1:
import java.util.ArrayList; import java.util.BitSet; import java.util.HashMap; import java.util.List; import java.util.Map; import cn.hutool.core.util.RandomUtil; public class BitSetTest { public static void main(String[] args) { BitSet bitSet = new BitSet(100); int[] randomInts = RandomUtil.randomInts(100); List<Integer> list = new ArrayList<>(); for (int r : randomInts) { list.add(r); } list.add(RandomUtil.randomInt(0, 100)); list.add(2); list.add(2); Map<Integer, Integer> occur = new HashMap<>(); // 内存不够用时,这里要换成流形式加载 for (int i : list) { boolean b = bitSet.get(i); bitSet.set(i); if (b) { Integer integer = occur.get(i); if (integer == null) { occur.put(i, 2); } else { occur.put(i, integer + 1); } } } // 最终输出哪些数字出现了多次 System.out.println(occur); } }
验证2:
import java.io.BufferedInputStream; import java.io.BufferedReader; import java.io.File; import java.io.FileReader; import java.io.FileWriter; import java.io.IOException; import java.io.InputStream; import java.lang.management.ManagementFactory; import java.util.BitSet; import java.util.HashMap; import java.util.Map; import cn.hutool.core.util.RandomUtil; public class BitSetTest2 { public static void main(String[] args) throws IOException { // 1) 从0开始生成1亿个数字,随机两个位置插入两个数字 666、888 generate1yi(); // 2) 通过BitSet 查找重复出现的数字 find(); } private static void generate1yi() throws IOException { int max = 10000 * 10000; int i1 = RandomUtil.randomInt(0, max); int i2 = RandomUtil.randomInt(0, max); int[] randomInts = RandomUtil.randomInts(max); File file = new File("D://out//1yi.txt"); FileWriter fileWriter = new FileWriter(file); for (int i : randomInts) { fileWriter.append(i + "\r\n"); if (i == i1) { fileWriter.append(666 + "\r\n"); }else if(i == i2) { fileWriter.append(888 + "\r\n"); } } fileWriter.close(); } private static void find() throws NumberFormatException, IOException { usage(); BitSet bitSet = new BitSet(10000 * 10000 + 1); usage(); Map<Integer, Integer> occur = new HashMap<>(); BufferedReader br = new BufferedReader(new FileReader(new File("D://out/1yi.txt"))); String line = null; while ((line = br.readLine()) != null) { int i = Integer.valueOf(line); boolean b = bitSet.get(i); bitSet.set(i); if (b) { Integer integer = occur.get(i); if (integer == null) { occur.put(i, 2); } else { occur.put(i, integer + 1); } } } br.close(); usage(); System.out.println(occur); } public static String toStr(InputStream in) throws IOException { BufferedInputStream b = new BufferedInputStream(in); byte[] buffer = new byte[1024]; int l = 0; StringBuffer sb = new StringBuffer(); while ((l = b.read(buffer)) != -1) { sb.append(new String(buffer, 0, l)); } return sb.toString(); } private static void usage() { System.out.println(ManagementFactory.getMemoryMXBean().getHeapMemoryUsage()); } }
// 执行1和2 init = 536870912(524288K) used = 738080064(720781K) committed = 2633498624(2571776K) max = 7615283200(7436800K) init = 536870912(524288K) used = 750580088(732988K) committed = 2633498624(2571776K) max = 7615283200(7436800K) init = 536870912(524288K) used = 1033007456(1008796K) committed = 2310012928(2255872K) max = 7615283200(7436800K) {888=2, 666=2}
// 仅执行2 init = 536870912(524288K) used = 2694856(2631K) committed = 514850816(502784K) max = 7615283200(7436800K) init = 536870912(524288K) used = 15194880(14838K) committed = 514850816(502784K) max = 7615283200(7436800K) init = 536870912(524288K) used = 265467672(259245K) committed = 1259864064(1230336K) max = 7615283200(7436800K) {888=2, 666=2}
结论:执行1 时一个1亿长度的int数组占用内存大约7百兆,仅执行2时初始一个1亿长度的BitSet 内存占用大约12兆;该方案时可以通过使用较少内存来判读某些数字重复出现的
思考:BitSet 压缩问题,比如两个数1,11111111111