Processing Massive Data: Frequency Counting over a Huge Set of IPs

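Abstract

This post looks at the massive-IP query problem: how to process large-scale data when memory is limited. It presents the main idea behind massive data processing, divide and conquer, gives a Java implementation, and discusses some of the design decisions in the program.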

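📝Problem Statement📝

Suppose a massive number of IP addresses are stored on disk. The data might come from web logs, for example the IP addresses of visitors collected from some website's servers. We want to run some statistics on this IP data, such as finding the K most frequent visitors. With a small data set this would be a very simple problem, but the log file is now very large, about 100 GB, while the machine has only 4 GB of memory (or less). How can we complete the task under this memory limit?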

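🖊️Problem Analysis🖊️

Suppose we can split the IPs into smaller parts in some way that meets these requirements:

The resulting groups are as evenly sized as possible
Identical IPs always land in the same group (different IPs may share a group, but one IP never appears in two groups)

With such a grouping we can split the large file into 1000 or more groups, so each group is well under 1 GB (about 100 MB on average for 1000 groups). Each group is then processed on its own: the file is now small enough to be loaded into memory. Once the frequency results of every group are available, they are merged into the final answer.

This process is the divide-and-conquer idea at work. Splitting into groups is the divide step; counting each group and merging the results is the conquer step. The key question is how to group the IPs, and anyone familiar with hash tables will see the answer immediately. We can treat an IP as a string, compute its hash value with Java's built-in hashCode, and assign it to a bucket by taking the hash value modulo the number of buckets. Alternatively, we can treat an IP address as an unsigned 32-bit integer, compute a hash value by hand, and then take the modulus to pick the group.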
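Both bucketing options can be sketched as follows. This is only an illustration, not the post's actual `calculateBucketId`: the class name `IpBuckets` and its method names are hypothetical, and the second variant assumes dotted-quad IPv4 input. Note that `String.hashCode()` can be negative, so the sketch uses `Math.floorMod` rather than `%` to keep the bucket id in range.

```java
final class IpBuckets {
    // Option 1: treat the IP as a string and bucket it by its hash code.
    // floorMod keeps the result in [0, buckets) even for negative hash codes.
    static int bucketByHashCode(String ip, int buckets) {
        return Math.floorMod(ip.hashCode(), buckets);
    }

    // Parse "a.b.c.d" into its unsigned 32-bit value, held in a long
    // because Java's int is signed.
    static long toUnsignedInt(String ip) {
        String[] parts = ip.trim().split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + ip);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("octet out of range: " + ip);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    // Option 2: treat the IP as a number and bucket it by plain modulo.
    static int bucketByValue(String ip, int buckets) {
        return (int) (toUnsignedInt(ip) % buckets);
    }

    public static void main(String[] args) {
        System.out.println(bucketByHashCode("192.168.1.1", 1000));
        System.out.println(bucketByValue("192.168.1.1", 1000));
    }
}
```

Either function is deterministic, so the same IP always maps to the same bucket, which is the property the grouping needs. How evenly the buckets fill depends on the data and on the bucket count.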

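💻Implementation💻

The implementation uses the best language in the world: Java. The idea above is simple, but the implementation has to settle one question: at what point during grouping should the groups be written to disk and the memory be cleared? After some analysis, I chose to have the reader thread hand the grouped data off to be written to disk every 10,000 records. While that happens, the reader thread waits for the writer threads to finish and the in-memory groups are cleared, which keeps memory usage within an acceptable range at all times (in practice you can tune the batch size to your own machine). The reader and writer threads can be abstracted as a producer-consumer pattern (this program uses a single producer and multiple consumers, synchronized with CountDownLatch). The flowchart of the program design is shown below:

[Figure: program design flowchart for the massive-IP data processing]

After grouping, the frequency counting is done with multiple threads. Note that a thread pool manages these threads so that they can be reused.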
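The "submit a batch, then block until every writer is done" handshake can be sketched in isolation like this. It is a simplified illustration under my own assumptions, not the program's code: `BatchFlushDemo` and `flushAll` are hypothetical names, and adding to a list stands in for appending a bucket to its file.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class BatchFlushDemo {
    // Submits one writer task per bucket and blocks until all of them finish.
    static List<String> flushAll(List<String> buckets, ExecutorService pool) {
        List<String> written = new CopyOnWriteArrayList<>();
        CountDownLatch latch = new CountDownLatch(buckets.size());
        for (String bucket : buckets) {
            pool.execute(() -> {
                try {
                    written.add(bucket); // stands in for "append this bucket to its file"
                } finally {
                    latch.countDown();   // count down even if the write fails
                }
            });
        }
        try {
            latch.await();               // the producer blocks here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for writers", e);
        }
        return written;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        try {
            List<String> done = flushAll(Arrays.asList("b0", "b1", "b2", "b3"), pool);
            System.out.println(done.size() + " buckets flushed");
        } finally {
            pool.shutdown();
        }
    }
}
```

A fresh latch is created for every batch because a CountDownLatch cannot be reset once it reaches zero.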

package ipsearch;
import java.io.File;
import java.util.*;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
public class IpSearch {
	public static void main(String[] args) {
		ConcurrentHashMap<Integer, List<String>> bucket = new ConcurrentHashMap<>();
		ExecutorService pool = Executors.newFixedThreadPool(3);
		String fileName = "./data/hugeIpFile.txt";
		Thread producer = new Thread(new Producer(fileName,bucket,pool));
		// split the huge IP file into small files, one per bucket
		producer.start();
		// wait until the splitting (producer) thread has finished
		try {
			producer.join();
		}
		catch(InterruptedException e) {
			e.printStackTrace();
		}
		List<Future<List<IpwithFreq>>> ans = new ArrayList<>();
		File rootDir = new File("./data/smallipfile");
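		// listFiles() returns null if the directory is missing or unreadable
		File[] smallFiles = rootDir.listFiles();
		if(smallFiles != null) {
			for(File file : smallFiles){
				// count the frequencies of each small file on the thread pool
				ans.add(pool.submit(new FrequencyCounter(file)));
			}
		}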
		pool.shutdown();
		List<IpwithFreq> topFreqList = new ArrayList<>();
		for(Future<List<IpwithFreq>> res : ans) {
			try {
				topFreqList.addAll(res.get());
			}
			catch (InterruptedException e) {
				e.printStackTrace();
			}	
			catch(ExecutionException e) {
				e.printStackTrace();
			}
		}
		
		// Sort by frequency, highest first. Long.compare avoids the overflow
		// that casting a subtraction to int can cause.
		Comparator<IpwithFreq> com = new Comparator<IpwithFreq>() {
			@Override
			public int compare(IpwithFreq o1, IpwithFreq o2) {
				return Long.compare(o2.freq, o1.freq);
			}
		};
		topFreqList.sort(com);
		// print the top 100, or fewer if there are not that many entries
		for(int i = 0; i < Math.min(100, topFreqList.size()); i++) {
			System.out.println(topFreqList.get(i));
		}
		System.out.println("all counts have finished");
	}
}

The pseudocode for the code above is:

def main():
    split_huge_file_into_small_bucket_groups
    wait until the split is done
    count the frequencies in each small_bucket_group
    merge the answers

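Because the main thread has to wait for the splitting thread to finish, we use the join() method here: the main thread blocks until the thread on which join() was called has finished.

The core code of the reader thread is: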
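As a small standalone illustration of that blocking behaviour (the class `JoinDemo` and its method are hypothetical and not part of the program), the caller below only reads the result after `join()` returns, at which point the worker's write is guaranteed to be visible:

```java
final class JoinDemo {
    // Starts a worker thread, waits for it with join(), then reads its result.
    static int runAndWait() {
        int[] result = new int[1];
        Thread worker = new Thread(() -> result[0] = 42); // stands in for the split phase
        worker.start();
        try {
            worker.join(); // blocks the calling thread until the worker has finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println(runAndWait());
    }
}
```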

public void run() {
		// try-with-resources closes the Scanner even if an exception is thrown
		try (Scanner sc = init()) {
			while(sc.hasNextLine()) {
				String line = sc.nextLine();
				if(!isValidIp(line))continue;
				// the same IP always gets the same bucket id
				int bucketId = calculateBucketId(line);
				bucket.computeIfAbsent(bucketId, k -> new ArrayList<String>()).add(line);
				cnt++;
				if(cnt == 10000) {
					batch++;
					// one count per bucket; block until every writer task is done
					latch = new CountDownLatch(bucket.size());
					submitTask(bucket);
					latch.await();
					System.out.printf("batch %d has finished%n",batch);
					cnt = 0;
				}
			}
			// flush the last partial batch, and wait for it as well so the
			// counting phase never reads a file that is still being written
			latch = new CountDownLatch(bucket.size());
			submitTask(bucket);
			latch.await();
		}
		catch (InterruptedException e) {
			e.printStackTrace();
		}
		catch(IOException e) {
			e.printStackTrace();
		}
	}

The core code of the writer thread is:

public void run() {
		try {
			saveIntoFile();
		}
		catch (IOException e) {
			e.printStackTrace();
		}
		finally {
			// once the lines have been written, release the memory held by this bucket
			entry.getValue().clear();
			// always count down, even on failure, so the reader thread is not blocked forever
			latch.countDown();
		}
	}
	public void saveIntoFile() throws IOException{
		int num = entry.getKey();
		String fileName = String.format("./data/smallipfile/ipmod%d.txt",num);
		File file = new File(fileName);
		// append mode: the same bucket file is written once per batch;
		// try-with-resources flushes and closes the writer, even on error
		try (OutputStreamWriter osw = new OutputStreamWriter(
				new BufferedOutputStream(new FileOutputStream(file,true)))) {
			for (String line : entry.getValue()) {
				osw.write(line);
				osw.write("\n");
			}
		}
	}
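The counting task submitted from main (FrequencyCounter) is not shown in this post. A minimal sketch of what the per-group counting and top-K selection might look like is given below, under my own assumptions: `TopKCounter` and its methods are hypothetical names, the input is an in-memory list rather than a file, and a size-K min-heap is used instead of a full sort. Because every IP lives in exactly one group, the global top K is always contained in the union of the per-group top K lists, so merging those lists and sorting them is sufficient.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class TopKCounter {
    // Count how many times each IP occurs in one group.
    static Map<String, Long> count(Iterable<String> ips) {
        Map<String, Long> freq = new HashMap<>();
        for (String ip : ips) {
            freq.merge(ip, 1L, Long::sum);
        }
        return freq;
    }

    // Keep the k most frequent entries using a min-heap of at most k elements.
    static List<Map.Entry<String, Long>> topK(Map<String, Long> freq, int k) {
        PriorityQueue<Map.Entry<String, Long>> heap =
                new PriorityQueue<>(Map.Entry.<String, Long>comparingByValue());
        for (Map.Entry<String, Long> e : freq.entrySet()) {
            heap.offer(e);
            if (heap.size() > k) {
                heap.poll(); // evict the current least frequent entry
            }
        }
        List<Map.Entry<String, Long>> result = new ArrayList<>(heap);
        // most frequent first
        result.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        return result;
    }

    public static void main(String[] args) {
        List<String> ips = Arrays.asList(
                "1.1.1.1", "2.2.2.2", "1.1.1.1", "3.3.3.3", "1.1.1.1", "2.2.2.2");
        System.out.println(topK(count(ips), 2));
    }
}
```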

The full code is available on my GitHub.

posted @ 2020-08-31 21:52 smalllll
