WordCount (Java, Scala, Python)

Using the basic APIs of several common data-processing languages to implement a word count.

Read a file, count the occurrences of each word (converted to uppercase), sort the counts, and take the top-K entries.
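For reference, the whole pipeline (uppercase -> count -> sort -> take top-K) can be sketched in a few lines of Python with `collections.Counter`; the sample text below mirrors the exercise's input file:

```python
import re
from collections import Counter

def top_k(text: str, k: int) -> list:
    """Uppercase the text, count each word, and return the k most frequent."""
    words = re.findall(r"\w+", text.upper())
    return Counter(words).most_common(k)

sample = "hadoop Spark hive\nSpark Flink hadoop\njava scala hadoop\nSpark Hadoop Java"
print(top_k(sample, 3))  # [('HADOOP', 4), ('SPARK', 3), ('JAVA', 2)]
```

The hand-rolled versions below follow the same steps explicitly, which makes the mechanics easier to see.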

Scala

  import scala.io.{BufferedSource, Source}

  def main(args: Array[String]): Unit = {

    // read the file
    val source: BufferedSource = Source.fromFile("dir/wordcount.txt")

    /*
      hadoop Spark hive
      Spark Flink hadoop
      java scala hadoop
      Spark Hadoop Java
    */
    val text: String = source.mkString
    
    // split the string into an array of words
    val strings: Array[String] = text.split("\\W+")

    // process the data
    strings.map(_.toUpperCase).map((_, 1)) // uppercase -> pair each word with 1
      .groupBy(_._1).map(k => (k._1, k._2.length)) // group by word -> count
      .toArray.sortBy(_._2).reverse // sort ascending by count, then reverse for descending
      .foreach(println)
    
    /*
    (HADOOP,4)
    (SPARK,3)
    (JAVA,2)
    (HIVE,1)
    (SCALA,1)
    (FLINK,1)
     */
    source.close()

  }

Java

In Java, a collection must first be converted to a Stream before higher-order functions can be applied to it.

        // read the file (at most 1024 characters -- enough for this small sample)
        FileReader reader = new FileReader(new File("dir/wordcount.txt"));
        char[] chars = new char[1024];
        int len = reader.read(chars);

        Stream<String> stream = Stream.of(new String(chars, 0, len).split("\\W+")); // split the string (only the chars actually read) and convert to a stream

        java.util.Map<String, Long> collect = stream.map(String::toUpperCase).collect(Collectors.groupingBy(s -> s, Collectors.counting())); // uppercase -> group by word -> count

        // word count done
        System.out.println(collect); // {JAVA=3, HIVE=1, HADOOP=4, SCALA=1, HDFS=1, SPARK=3, HBASE=1, YARN=1, FLINK=1}

        reader.close();


        // sort ascending by count
        List<java.util.Map.Entry<String, Long>> entryList = collect.entrySet().stream()
                .sorted((e1, e2) -> Long.compare(e1.getValue(), e2.getValue()))
                .collect(Collectors.toList());

        Collections.reverse(entryList); // reverse so the list is descending by count
		
        // take the top 5 entries; top-K done
        entryList.stream().limit(5).forEach(System.out::println);
        /*
        HADOOP=4
        SPARK=3
        JAVA=3
        FLINK=1
        YARN=1
        */

Python

Python also offers some functional-programming support, but writing in this style is still rather clumsy.

import re
import copy

# read the file (the with block closes it automatically)
with open('wordcount.txt', 'r') as file:
    text = file.readlines()

# join the lines into a single uppercase string
lines = ''.join(text).upper()

""" lines:
HADOOP SPARK HIVE YARN HDFS
SPARK FLINK HADOOP JAVA
JAVA SCALA HADOOP HBASE
SPARK HADOOP JAVA
"""


# split the string on whitespace (strip first so a trailing newline does not produce an empty token)
word = re.split(r"\s+", lines.strip())
"""word:
['HADOOP', 'SPARK', 'HIVE', 'YARN', 'HDFS', 'SPARK', 'FLINK', 'HADOOP', 'JAVA', 'JAVA', 'SCALA', 'HADOOP', 'HBASE', 'SPARK', 'HADOOP', 'JAVA']
"""

data = list(map(lambda x: (x, 1), word))
"""pair each word with 1, data:
[('HADOOP', 1), ('SPARK', 1), ('HIVE', 1), ('YARN', 1), ('HDFS', 1), ('SPARK', 1), ('FLINK', 1), ('HADOOP', 1), 
('JAVA', 1), ('JAVA', 1), ('SCALA', 1), ('HADOOP', 1), ('HBASE', 1), ('SPARK', 1), ('HADOOP', 1), ('JAVA', 1)]
"""

# make a copy to build the dict from (a shallow copy suffices, since the tuples are immutable)
new_data = copy.copy(data)

new_list = list(dict(new_data))
"""the unique words, duplicates removed:
['HADOOP', 'SPARK', 'HIVE', 'YARN', 'HDFS', 'FLINK', 'JAVA', 'SCALA', 'HBASE']
"""

topK = []  # (count, word) pairs, deduplicated but not yet sorted
c = 0  # counter

# count each unique word and build the new list
for i in new_list:
    for j in data:
        if i == j[0]:
            c += 1
    topK.append((c, i))
    c = 0

# sort descending
topK.sort(reverse=True)


# take the top 5 (renamed loop variable so it does not shadow the word list above)
for entry in topK[0:5]:
    print(entry)

"""
(4, 'HADOOP')
(3, 'SPARK')
(3, 'JAVA')
(1, 'YARN')
(1, 'SCALA')
"""


posted @ 2020-12-16 11:25  cgl_dong