pandas Notes (7): Counting Word Occurrences in Text (Word Frequency Statistics)

Problem description

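Write a solution that finds the number of files in which the words 'bull' and 'bear' each appear as a standalone word. Occurrences without a space on both sides are ignored (for example, 'bullet', 'bears', 'bull.', or 'bear' at the beginning or end of a sentence are not counted).
Return the words 'bull' and 'bear' together with the number of files each appears in, in any order.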

Test case

Input, Files table:

file_name content
draft1.txt The stock exchange predicts a bull market which would make many investors happy.
draft2.txt The stock exchange predicts a bull market which would make many investors happy, but analysts warn of possibility of too much optimism and that in fact we are awaiting a bear market.
final.txt The stock exchange predicts a bull market which would make many investors happy, but analysts warn of possibility of too much optimism and that in fact we are awaiting a bear market. As always predicting the future market is an uncertain game and all investors should follow their instincts and best practices.

Output:

word count
bull 3
bear 2

Analysis

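In pandas, any task involving string operations should first be approached through the methods under the str accessor. An earlier post covered combining regular expressions with pandas; here we use str.contains(), with a pattern such as \sbull\s that requires a whitespace character on each side of the word.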
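To see how a whitespace-delimited pattern behaves, here is a minimal sketch on a hypothetical Series. The sample strings are made up for illustration and are not taken from the problem data. It suggests that requiring \s on both sides rejects substrings such as 'bullet' as well as occurrences at the very start or end of the text.

```python
import pandas as pd

# Hypothetical sample strings, chosen to exercise each edge case.
samples = pd.Series([
    "a bull market",       # standalone word: whitespace on both sides
    "a bullet point",      # 'bull' is only a prefix of 'bullet'
    "bull market ahead",   # at the start: no whitespace before it
    "beware of the bull",  # at the end: no whitespace after it
    "the bull. It ran",    # followed by a period, not whitespace
    None,                  # missing content
])

# str.contains treats the pattern as a regular expression by default;
# na=False makes missing values count as "no match" instead of NaN.
mask = samples.str.contains(r"\sbull\s", na=False)
print(mask.tolist())  # [True, False, False, False, False, False]
```

Only the first string matches, which is the behavior the problem statement asks for.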

Code

import pandas as pd

def count_occurrences(files: pd.DataFrame) -> pd.DataFrame:
    # Count the files whose content contains each word with whitespace on both sides;
    # na=False treats missing content as "no match".
    bull_count = files["content"].str.contains(r"\sbull\s", na=False).sum()
    bear_count = files["content"].str.contains(r"\sbear\s", na=False).sum()
    return pd.DataFrame({"word": ["bull", "bear"], "count": [bull_count, bear_count]})
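The function above hard-codes the two words. As a rough sketch of how the same idea might generalize, the version below takes an arbitrary word list. The name count_files_containing and the shortened sample contents are assumptions made for illustration, not part of the original problem.

```python
import re
import pandas as pd

def count_files_containing(files: pd.DataFrame, words: list[str]) -> pd.DataFrame:
    """For each word, count the files whose content contains it between whitespace."""
    counts = []
    for word in words:
        # re.escape guards against words that contain regex metacharacters.
        pattern = rf"\s{re.escape(word)}\s"
        matches = files["content"].str.contains(pattern, na=False)
        counts.append(int(matches.sum()))
    return pd.DataFrame({"word": words, "count": counts})

# Shortened, made-up stand-ins for the three sample files.
files = pd.DataFrame({
    "file_name": ["draft1.txt", "draft2.txt", "final.txt"],
    "content": [
        "predicts a bull market which",
        "a bull market but awaiting a bear market.",
        "a bull market, a bear market. As always",
    ],
})
result = count_files_containing(files, ["bull", "bear"])
print(result)
```

On these stand-in rows the counts come out as 3 for 'bull' and 2 for 'bear', matching the expected output of the test case.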
posted @ 2024-10-17 13:43  KevinScott0582