pandas Notes (7) -- Counting Word Occurrences in Text (Word Frequency)


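Problem Description

Write a solution that finds the number of files in which the words 'bull' and 'bear' each appear as a standalone word. Ignore any occurrence that does not have a space on both sides (for example, 'bullet', 'bears', 'bull.', or 'bear' at the very beginning or end of a sentence are not counted).
Return the words 'bull' and 'bear' together with the number of files each one appears in, in any order.

Test Case

Input Files table: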

file_name	content
draft1.txt	The stock exchange predicts a bull market which would make many investors happy.
draft2.txt	The stock exchange predicts a bull market which would make many investors happy, but analysts warn of possibility of too much optimism and that in fact we are awaiting a bear market.
final.txt	The stock exchange predicts a bull market which would make many investors happy, but analysts warn of possibility of too much optimism and that in fact we are awaiting a bear market. As always predicting the future market is an uncertain game and all investors should follow their instincts and best practices.

Output:

word	count
bull	3
bear	2

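Analysis

In pandas, any task that involves string operations should start with the methods available under the str accessor. An earlier post covered combining regular expressions with pandas; here we use the str.contains() method, which returns one boolean per row indicating whether the pattern was found in that row's text.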
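To make the matching rule concrete, here is a minimal sketch of how str.contains() behaves with the pattern used in this post. The sample strings are made up for illustration and are not part of the original test case.

```python
import pandas as pd

# Hypothetical sample strings, chosen to cover the edge cases in the problem.
content = pd.Series([
    "a bull market",      # standalone word, whitespace on both sides
    "a bullet point",     # part of a longer word
    "bull at the start",  # nothing before it
    "ends with bull.",    # followed by a period
    None,                 # missing value
])

# na=False makes missing values count as "no match" instead of NaN.
mask = content.str.contains(r"\sbull\s", na=False)

print(mask.tolist())    # [True, False, False, False, False]
print(int(mask.sum()))  # 1
```

Summing the boolean mask counts the matching rows, which is the number of files containing the word.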

Code

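import pandas as pd

def count_occurrences(files: pd.DataFrame) -> pd.DataFrame:
    # \s matches any whitespace character, so the word must have whitespace on both sides.
    # na=False treats missing content as "no match"; summing the mask counts matching files.
    content = files["content"]
    bull_count = int(content.str.contains(r"\sbull\s", na=False).sum())
    bear_count = int(content.str.contains(r"\sbear\s", na=False).sum())
    return pd.DataFrame({"word": ["bull", "bear"], "count": [bull_count, bear_count]})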
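As a quick check, the sketch below rebuilds the Files table from the test case and runs the same idea over a list of words. The helper name count_word_files and its words parameter are assumptions introduced here for illustration. They are not part of the original solution, which hard-codes the two words.

```python
import re
import pandas as pd

def count_word_files(files: pd.DataFrame, words: list[str]) -> pd.DataFrame:
    # Hypothetical generalization: one whitespace-delimited pattern per word.
    # re.escape guards against words that contain regex metacharacters.
    counts = [
        int(files["content"].str.contains(rf"\s{re.escape(w)}\s", na=False).sum())
        for w in words
    ]
    return pd.DataFrame({"word": words, "count": counts})

# The Files table from the test case above.
files = pd.DataFrame({
    "file_name": ["draft1.txt", "draft2.txt", "final.txt"],
    "content": [
        "The stock exchange predicts a bull market which would make many "
        "investors happy.",
        "The stock exchange predicts a bull market which would make many "
        "investors happy, but analysts warn of possibility of too much "
        "optimism and that in fact we are awaiting a bear market.",
        "The stock exchange predicts a bull market which would make many "
        "investors happy, but analysts warn of possibility of too much "
        "optimism and that in fact we are awaiting a bear market. As always "
        "predicting the future market is an uncertain game and all investors "
        "should follow their instincts and best practices.",
    ],
})

result = count_word_files(files, ["bull", "bear"])
print(result)
```

This yields a count of 3 for bull and 2 for bear, matching the expected output.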

posted @ 2024-10-17 13:43 KevinScott0582
