文本分析 - 聚类分析 (数据挖掘)
文本分析,在数据挖掘,甚至是深度学习中很重要的分支研究领域。如下运用R语言,通过采用文本相似度算法Jaro-Winkler Distance,能实现:
在题库中查找出相似度高的题并输出自动聚类的结果,从而提炼出练习重点,提高阅读效率。
## 寻找练习重点 library('xlsx') library('DBI') library('RSQLite') library('ff') library('bit') library('RecordLinkage') library('stringr') library('plyr') # 读取指定题目文件 file <- "D:/data/Q_1.xlsx" Q <- read.xlsx(file, 1, encoding = "UTF-8") # 按照规则寻找相似度等于或者高于80%的题 PickOutGroup <- function() { { #NO_B <- list() #PickingList_B <- list() i = 1 for (i in 1:length(Q$题号)) { Q_Main1 <- Q$题干[i] %>% as.character() Q_Branches1 <- Q$选项[i] %>% as.character() Q_Main_len <- Q$题干长度[i] %>% as.numeric() Q_list <- list() Q_list[i] <- Q$题号[i] %>% as.numeric() a = 1 for (a in 1:length(Q$题号)) { b = a + 1 Q_list_Pick <- Q$题号[b] %>% as.numeric() # 题干 Q_Main2 <- Q$题干[b] Q_Main_scores <- jarowinkler(Q_Main1, Q_Main2) %>% as.numeric() # 选项 Q_Branches2 <- Q$选项[b] Q_Branches_scores <- jarowinkler(Q_Branches1, Q_Branches2) %>% as.numeric() # 题干长度 Q_Main_Len <- Q$题干长度[b] %>% as.numeric() Q_Main_length_Con1 <- if (is.na((Q_Main_len >= as.numeric(Q_Main_Len - 10)) %>% as.logical())) { FALSE } else { TRUE } Q_Main_length_Con2 <- if (is.na((Q_Main_len <= as.numeric(Q_Main_Len + 10)) %>% as.logical())) { FALSE } else { TRUE } Q_Main_length <- tryCatch(if ((Q_Main_length_Con1) & (Q_Main_length_Con2)) { "Yes" } else { "No" }, error = function(e) { cat("ERROR:", conditionMessage((e))) }) #将相似选项加入列表 Q_list_Con1 <- (if (as.numeric(length(Q_Main_scores)) == 0) { FALSE } else { Q_Main_scores >= 0.8 }) %>% as.logical() Q_list_Con2 <- (if (as.numeric(length(Q_Branches_scores)) == 0) { FALSE } else { Q_Branches_scores >= 0.8 }) %>% as.logical() Q_list_Con3 <- (Q_Main_length == "Yes") %>% as.logical() Q_list[b] <- tryCatch(if ((Q_list_Con1) & (Q_list_Con2) & (Q_list_Con3)) { Q_list_Pick } else { 0 }, error = function(e) { cat("ERROR:", conditionMessage((e))) }) a = a + 1 } NO <- Q$题号[i] %>% as.numeric() Q_list <- str_c(Q_list, sep = "", collapse = ";") %>% as.character() %>% gsub(pattern = ";0", replacement = "", .) %>% gsub(pattern = "NULL;", replacement = "", .) PickingList <- data.frame(NO = NO, PickingList = Q_list) unique(write.csv(PickingList, "D:/data/Q_2.csv", append = T)) } i = i + 1 } } # 计算代码运行时间 system.time(PickOutGroup())
参考:
1. 实际操作视频:https://v.kuaishou.com/70L8Jg
2. “文本相似度算法Jaro-Winkler Distance” 介绍
Jaro-Winkler Distance是一个度量两个字符序列之间的编辑距离的字符串度量标准,是由William E. Winkler在1990年提出的Jaro Distance度量标准的一种变体。Jaro Distance是两个单词之间由一个转换为另一个所需的单字符转换的最小数量。Jaro-Winkler Distance通过前缀因子使Jaro Distance相同时共同前缀长度越大的相似度越高。Jaro–Winkler Distance越小,两个字符串越相似。如果分数是0,则表示完全不同,分数为1则表示完全匹配。Jaro–Winkler相似度是1 - Jaro–Winkler Distance。其公式如下:
3. 如上代码所整理“题库”的整理过程(前传) - 文本分析整理
都在这个视频里:https://v.kuaishou.com/7r280H
记录有用的信息和数据,并分享!