WordCount (Java, Scala, Python)
2021-03-10 20:29
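This post implements a word count using only the basic APIs of several languages commonly used for data processing: read a file, count how often each word (converted to upper case) occurs, sort the counts, and take the top K entries. Java collections must first be converted to a Stream before higher-order functions can be applied to them. Python also offers some support for functional programming, but it is more laborious to use for this task.
Original article: https://www.cnblogs.com/cgl-dong/p/14142966.html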
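The remark about Java collections can be illustrated with a small sketch. `java.util.List` has no `map` method of its own, so the list is first turned into a Stream and collected back afterwards. The class name `StreamDemo` and the method `toUpper` are hypothetical names chosen for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class StreamDemo {

    // Upper-case every word. List has no map(), so go through stream().
    static List<String> toUpper(List<String> words) {
        return words.stream()
                .map(String::toUpperCase)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("hadoop", "Spark", "hive");
        System.out.println(toUpper(words)); // prints [HADOOP, SPARK, HIVE]
    }
}
```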
Scala
import scala.io.{BufferedSource, Source}

def main(args: Array[String]): Unit = {
  // Read the file
  val source: BufferedSource = Source.fromFile("dir/wordcount.txt")
  /*
  hadoop Spark hive
  Spark Flink hadoop
  java scala hadoop
  Spark Hadoop Java
  */
  val text: String = source.mkString
  // Split the string into an array of words
  val strings: Array[String] = text.split("\\W+")
  // Process the data: upper-case -> tuple -> group -> aggregate
  strings.map(_.toUpperCase).map((_, 1)).groupBy(_._1).map(k => (k._1, k._2.length))
    .toArray.sortBy(_._2).reverse // sort -> reverse -> print each entry
    .foreach(println)
  /*
  Entries with equal counts may appear in any order.
  (HADOOP,4)
  (SPARK,3)
  (JAVA,2)
  (HIVE,1)
  (SCALA,1)
  (FLINK,1)
  */
  source.close()
}
Java
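import java.io.File;
import java.io.FileReader;

// Read the file (at most the first 1024 characters)
FileReader reader = new FileReader(new File("dir/wordcount.txt"));
char[] chars = new char[1024];
int len = reader.read(chars);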
Stream
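The Java listing above stops after reading the file, at the word `Stream`. As a rough illustration only, and not the original author's code, the following sketch shows one way the same steps (upper-case, group, count, sort, take the top K) might be written with the Java Stream API. The class name `WordCountSketch`, the method `topK`, the built-in sample text, and the tie-breaking by word are all assumptions made for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class WordCountSketch {

    // Count upper-cased words and return the k most frequent entries.
    // Ties are broken alphabetically by word (an assumption of this sketch).
    static List<Map.Entry<String, Long>> topK(String text, int k) {
        Map<String, Long> counts = Arrays.stream(text.split("\\W+"))
                .filter(w -> !w.isEmpty())          // split can yield a leading empty string
                .map(String::toUpperCase)
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .limit(k)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // Sample text mirrors the data in the Scala comment; pass a file path to count a real file.
        String text = "hadoop Spark hive\nSpark Flink hadoop\njava scala hadoop\nSpark Hadoop Java";
        if (args.length > 0) {
            text = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        }
        topK(text, 5).forEach(e -> System.out.println(e.getKey() + "=" + e.getValue()));
    }
}
```

Run without arguments, this prints HADOOP=4, SPARK=3, JAVA=2, FLINK=1, HIVE=1 for the sample text. Passing a path such as dir/wordcount.txt as the first argument counts that file instead.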
Python
import re
import copy

# Read the file (the with block closes it automatically)
with open('wordcount.txt', 'r') as file:
    text = file.readlines()
# Join the lines into one upper-cased string
lines = ''.join(text).upper()
""" lines:
HADOOP SPARK HIVE YARN HDFS
SPARK FLINK HADOOP JAVA
JAVA SCALA HADOOP HBASE
SPARK HADOOP JAVA
"""
# Strip first so a trailing newline does not produce an empty string
word = re.split(r"\s+", lines.strip())
"""Split the string:
['HADOOP', 'SPARK', 'HIVE', 'YARN', 'HDFS', 'SPARK', 'FLINK', 'HADOOP', 'JAVA', 'JAVA', 'SCALA', 'HADOOP', 'HBASE', 'SPARK', 'HADOOP', 'JAVA']
"""
data = list(map(lambda x: (x, 1), word))
"""Build (word, 1) tuples; data:
[('HADOOP', 1), ('SPARK', 1), ('HIVE', 1), ('YARN', 1), ('HDFS', 1), ('SPARK', 1), ('FLINK', 1), ('HADOOP', 1),
('JAVA', 1), ('JAVA', 1), ('SCALA', 1), ('HADOOP', 1), ('HBASE', 1), ('SPARK', 1), ('HADOOP', 1), ('JAVA', 1)]
"""
# Make a copy (copy.copy is a shallow copy) and use a dict to deduplicate the words
new_data = copy.copy(data)
new_list = list(dict(new_data))
""" Unique words, no duplicates:
['HADOOP', 'SPARK', 'HIVE', 'YARN', 'HDFS', 'FLINK', 'JAVA', 'SCALA', 'HBASE']
"""
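topK = []  # (count, word) pairs, one per unique word, not yet sorted
c = 0  # running counter
# Count the occurrences of each unique word and build the new list
for i in new_list:
    for j in data:
        if i == j[0]:
            c += 1
    topK.append((c, i))
    c = 0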
# Sort in descending order
topK.sort(reverse=True)
# Take the top 5
for item in topK[0:5]:
    print(item)
"""
(4, 'HADOOP')
(3, 'SPARK')
(3, 'JAVA')
(1, 'YARN')
(1, 'SCALA')
"""
Article link: http://soscw.com/essay/62915.html