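pandas Notes (7): Counting Word Occurrences in Text (Word Frequency)
Problem Description
Write a solution to find the number of files in which the words 'bull' and 'bear' each appear as a standalone word. Ignore any occurrence that does not have a space on both sides (for example, 'bullet', 'bears', 'bull.', or 'bear' at the start or end of a sentence are not counted).
Return the words 'bull' and 'bear' together with the number of files each one appears in. The order of the rows does not matter.
Test Case
Input, the Files table: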
file_name | content |
---|---|
draft1.txt | The stock exchange predicts a bull market which would make many investors happy. |
draft2.txt | The stock exchange predicts a bull market which would make many investors happy, but analysts warn of possibility of too much optimism and that in fact we are awaiting a bear market. |
final.txt | The stock exchange predicts a bull market which would make many investors happy, but analysts warn of possibility of too much optimism and that in fact we are awaiting a bear market. As always predicting the future market is an uncertain game and all investors should follow their instincts and best practices. |
Output:
word | count |
---|---|
bull | 3 |
bear | 2 |
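Analysis
In pandas, whenever a task involves string operations, the first place to look is the set of methods under the str accessor. An earlier post in this series covered combining regular expressions with pandas; this one uses the str.contains() method. The pattern \sbull\s requires a whitespace character on both sides of the word, which is what excludes 'bullet', 'bull.', and a word sitting at the very start or end of the text.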
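To see how the pattern behaves on the edge cases from the problem statement, here is a minimal sketch. The sample strings are made up for illustration and are not part of the test data; `na=False` is an optional argument that makes missing values count as "no match".

```python
import pandas as pd

# Hypothetical sample strings covering the edge cases in the problem statement.
samples = pd.Series([
    "a bull market",      # standalone word with whitespace on both sides
    "a bullet point",     # 'bull' is only a prefix of a longer word
    "beware the bull.",   # followed by punctuation, not whitespace
    "bull run ahead",     # at the start of the string, no leading whitespace
    None,                 # missing content
])

# str.contains treats the pattern as a regular expression by default.
mask = samples.str.contains(r"\sbull\s", na=False)
print(mask.tolist())  # [True, False, False, False, False]
```

Note that \s matches any whitespace (tabs and newlines included), which is slightly broader than the literal space the problem statement mentions.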
Code
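import pandas as pd

def count_occurrences(files: pd.DataFrame) -> pd.DataFrame:
    # \s on both sides requires whitespace around the word, so 'bullet', 'bull.',
    # and a word at the very start or end of the text are not matched.
    # na=False treats missing content as "no match"; summing the mask counts matching files.
    bull_count = int(files["content"].str.contains(r"\sbull\s", na=False).sum())
    bear_count = int(files["content"].str.contains(r"\sbear\s", na=False).sum())
    return pd.DataFrame({"word": ["bull", "bear"], "count": [bull_count, bear_count]})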
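
As a quick check, the sketch below rebuilds the Files table from the test case and runs the same approach on it. The function is repeated in a slightly more compact form so the snippet runs on its own; the helper variables `base` and `warning` are only there to avoid retyping the long sentences.

```python
import pandas as pd

def count_occurrences(files: pd.DataFrame) -> pd.DataFrame:
    # Same str.contains approach, written as a loop over the two target words.
    words = ["bull", "bear"]
    counts = [
        int(files["content"].str.contains(rf"\s{w}\s", na=False).sum())
        for w in words
    ]
    return pd.DataFrame({"word": words, "count": counts})

# The three file contents from the test case, assembled from shared pieces.
base = "The stock exchange predicts a bull market which would make many investors happy"
warning = (", but analysts warn of possibility of too much optimism "
           "and that in fact we are awaiting a bear market.")
closing = (" As always predicting the future market is an uncertain game "
           "and all investors should follow their instincts and best practices.")

files = pd.DataFrame({
    "file_name": ["draft1.txt", "draft2.txt", "final.txt"],
    "content": [base + ".", base + warning, base + warning + closing],
})

result = count_occurrences(files)
print(result)  # bull appears in 3 files, bear in 2
```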