Research project: statistics on Codeforces problem algorithm tags, Apare_xzc

Project part: clustering, by xzc

Earlier we crawled Codeforces and collected the algorithm tags of every problem, stored as an Excel sheet. Now we want to cluster the problems.

The resulting table looks like this (one problem per row, with its name and tags):

There are 3205 problems in total.

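Notice that a problem may carry more than one algorithm tag. So before clustering the problems, we first need to find out how many distinct algorithm tags the site actually uses.

We keep the algorithm tags (Tags) of all problems in a plain text file: copy the column straight from Excel and paste it into TagsName.txt.

Then a short piece of Python code collects all the distinct algorithm tags: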

all_names = set()  # set holding the distinct tags
with open('TagsName.txt', 'r', encoding='utf-8') as f:  # open the pasted tag column
    for line in f:  # one line per problem
        # split the line on ',' and strip whitespace around each piece
        tags = [t.strip() for t in line.split(',')]
        all_names.update(t for t in tags if t)  # add the non-empty tags to the set

for name in all_names:  # iterate over the set
    print(name)         # print each distinct tag once

The output is as follows:

divide and conquer
math
expression parsing
flows
brute force
chinese remainder theorem
bitmasks
combinatorics
constructive algorithms
dfs and similar
greedy
games
ternary search
fft
data structures
number theory
2-sat
geometry
strings
dp
sortings
graphs
probabilities
meet-in-the-middle
string suffix structures
matrices
binary search
graph matchings
trees
hashing
implementation
shortest paths
dsu
two pointers
schedules

Having obtained all the algorithm tags, we put them into Excel, annotate each tag with its Chinese name, and assign each one an ID by hand.

Tag	ID	Chinese name
2-sat	1	2-sat(可满足性问题)
binary search	2	二分查找
bitmasks	3	位操作
brute force	4	暴力
chinese remainder theorem	5	中国剩余定理
combinatorics	6	组合数学
constructive algorithms	7	构造
data structures	8	数据结构
dfs and similar	9	深度优先搜索及类似算法
divide and conquer	10	分治法
dp	11	动态规划
dsu	12	并查集
expression parsing	13	表达式解析
fft	14	快速傅里叶变换
flows	15	(网络)流
games	16	博弈
geometry	17	几何
graph matchings	18	图匹配
graphs	19	图论
greedy	20	贪心
hashing	21	哈希
implementation	22	模拟
math	23	数学
matrices	24	矩阵
meet-in-the-middle	25	折半搜索
number theory	26	数论
probabilities	27	概率
schedules	28	安排问题
shortest paths	29	最短路
sortings	30	排序
string suffix structures	31	字符串后缀结构
strings	32	字符串
ternary search	33	三分搜索
trees	34	树
two pointers	35	双指针
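
The IDs in the table coincide with plain alphabetical order of the tag strings, so the numbering can also be reproduced programmatically rather than by hand. A minimal sketch, assuming the tags are already held in a Python collection; `assign_tag_ids` is a hypothetical helper name, not part of the original scripts:

```python
def assign_tag_ids(tags):
    """Map each distinct tag to a 1-based ID in sorted (alphabetical) order."""
    return {tag: idx for idx, tag in enumerate(sorted(set(tags)), start=1)}

# A few tags from the list above, deliberately out of order.
sample_tags = ["greedy", "dp", "2-sat", "dsu", "binary search"]
tag_ids = assign_tag_ids(sample_tags)
for tag, idx in tag_ids.items():
    print(idx, tag)
```

Writing the sorted tags to a file, one per line, would give a file in the shape that the counting script reads as TagList.txt.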

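With all the tags in hand, the next step is to count how many problems are associated with each algorithm.

We first copy the crawled problem-tag Excel sheet into the text file Problems.txt.

Then we use Python dictionaries to count the number of problems for each algorithm. The script reads the tags from TagList.txt, which is expected to list the 35 tags one per line in ID order: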

Dict_tag = {}  # tag -> ID
Dict_num = {}  # ID -> tag
cnt = 0
with open('TagList.txt', 'r', encoding='utf-8') as f:  # one tag per line, in ID order
    for line in f:
        s = line.strip()
        if not s:  # skip blank lines, which would otherwise match every problem
            continue
        cnt += 1
        Dict_tag[s] = cnt  # build both lookup tables
        Dict_num[cnt] = s

Dict_cnt = {i: 0 for i in range(1, cnt + 1)}  # problems per tag, initially zero

with open('Problems.txt', 'r', encoding='utf-8') as f:
    for line in f:  # each line holds one problem's name and tags
        for tag, idx in Dict_tag.items():  # check every tag against this problem
            # Substring test: it also matches a tag that occurs inside the problem name.
            if tag in line:
                Dict_cnt[idx] += 1  # one more problem for this tag

mat = "{:<10}\t{:<25}\t{:<10}"
print(mat.format('ID', 'Tag', 'Problems'))
for idx, n in Dict_cnt.items():  # formatted output of the counts
    print(mat.format(idx, Dict_num[idx], n))
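
The substring test in the script above also fires when a tag happens to occur inside a problem name, which can inflate the counts. A stricter variant, sketched here, compares whole tags only. It assumes each line has the form `name<TAB>tag1, tag2, ...`; the real layout of Problems.txt is not shown in this post, so the separator and the helper name `count_tags` are assumptions.

```python
from collections import Counter

def count_tags(lines, known_tags, sep="\t"):
    """Count problems per tag using exact matches on the tags field."""
    known = set(known_tags)
    counts = Counter({tag: 0 for tag in known})
    for line in lines:
        # Everything after the last separator is taken to be the tag list.
        _, _, tag_field = line.rstrip("\n").rpartition(sep)
        tags = {t.strip() for t in tag_field.split(",")}
        for tag in tags & known:
            counts[tag] += 1
    return counts

# Made-up rows for illustration, not real Codeforces data.
rows = [
    "A math puzzle\tgreedy, implementation\n",
    "Two trees\tdp, trees\n",
    "Sum\tmath\n",
]
result = count_tags(rows, ["math", "greedy", "dp", "trees", "implementation"])
print(result["math"])  # the title "A math puzzle" does not count as a math tag
```

Using a set of stripped tags per line means a tag is counted at most once per problem, and tags such as `graphs` and `graph matchings` never interfere with each other.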

We store the resulting counts in Excel:

Tag	ID	Chinese name	Number of problems
2-sat	1	2-sat(可满足性问题)	7
binary search	2	二分查找	252
bitmasks	3	位操作	98
brute force	4	暴力	445
chinese remainder theorem	5	中国剩余定理	6
combinatorics	6	组合数学	147
constructive algorithms	7	构造	323
data structures	8	数据结构	426
dfs and similar	9	深度优先搜索及类似算法	277
divide and conquer	10	分治法	58
dp	11	动态规划	583
dsu	12	并查集	104
expression parsing	13	表达式解析	28
fft	14	快速傅里叶变换	9
flows	15	(网络)流	40
games	16	博弈	52
geometry	17	几何	153
graph matchings	18	图匹配	23
graphs	19	图论	218
greedy	20	贪心	557
hashing	21	哈希	67
implementation	22	模拟	990
math	23	数学	564
matrices	24	矩阵	46
meet-in-the-middle	25	折半搜索	13
number theory	26	数论	148
probabilities	27	概率	80
schedules	28	安排问题	5
shortest paths	29	最短路	70
sortings	30	排序	268
string suffix structures	31	字符串后缀结构	36
strings	32	字符串	184
ternary search	33	三分搜索	16
trees	34	树	176
two pointers	35	双指针	118
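
With the counts in hand, ranking the tags takes only a sort. The sketch below copies the numbers from the table above into a dictionary (the name `tag_counts` is only illustrative) and prints the five most frequent tags. Because a problem can carry several tags, the counts add up to more than the 3205 problems.

```python
# Counts copied from the table above (tag -> number of problems).
tag_counts = {
    "2-sat": 7, "binary search": 252, "bitmasks": 98, "brute force": 445,
    "chinese remainder theorem": 6, "combinatorics": 147,
    "constructive algorithms": 323, "data structures": 426,
    "dfs and similar": 277, "divide and conquer": 58, "dp": 583, "dsu": 104,
    "expression parsing": 28, "fft": 9, "flows": 40, "games": 52,
    "geometry": 153, "graph matchings": 23, "graphs": 218, "greedy": 557,
    "hashing": 67, "implementation": 990, "math": 564, "matrices": 46,
    "meet-in-the-middle": 13, "number theory": 148, "probabilities": 80,
    "schedules": 5, "shortest paths": 70, "sortings": 268,
    "string suffix structures": 36, "strings": 184, "ternary search": 16,
    "trees": 176, "two pointers": 118,
}

# Sort by count, largest first, and keep the top five.
top5 = sorted(tag_counts.items(), key=lambda kv: kv[1], reverse=True)[:5]
for tag, n in top5:
    print(f"{tag:<20}{n}")

# Total number of tag assignments across all problems.
print("total tag assignments:", sum(tag_counts.values()))
```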

posted @ 2019-12-02 22:06 Apare
