pandas DataFrame filtering: apply is the most flexible!

Filtering by the length of a particular string field:

import pandas as pd
 
df = pd.read_csv('filex.csv')
df['A'] = df['A'].astype('str')
df['B'] = df['B'].astype('str')
mask = (df['A'].str.len() == 10) & (df['B'].str.len() == 10)
df = df.loc[mask]
print(df)

  

Applied to filex.csv:

A,B
123,abc
1234,abcd
1234567890,abcdefghij

the code above prints

            A           B
2  1234567890  abcdefghij
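
To try the length filter without a file on disk, the same CSV content can be read from an in-memory string. The following is only a minimal sketch: `filter_by_length` and its `columns` and `length` parameters are names made up here for illustration and are not part of the original code.

```python
import io

import pandas as pd

# Same content as filex.csv above, kept in memory for the example.
CSV_TEXT = """A,B
123,abc
1234,abcd
1234567890,abcdefghij
"""


def filter_by_length(df, columns, length):
    """Keep rows where every listed column, cast to str, has exactly `length` characters.

    Hypothetical helper, written for this example only.
    """
    mask = pd.Series(True, index=df.index)
    for col in columns:
        # Cast on the fly so the original column dtypes are left untouched.
        mask &= df[col].astype(str).str.len() == length
    return df.loc[mask]


df = pd.read_csv(io.StringIO(CSV_TEXT))
filtered = filter_by_length(df, ["A", "B"], 10)
print(filtered)
```

Building the mask in a loop keeps the condition in one place when more than two columns have to satisfy the same length rule.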

Alternatively:

import pandas as pd

data = {"names": ["Alice", "Zac", "Anna", "O"],
        "cars": ["Civic", "BMW", "Mitsubishi", "Benz"],
        "age": ["1", "4", "2", "0"]}

df = pd.DataFrame(data)
"""
df (recent pandas versions keep the dict's insertion order for columns):
   names        cars  age
0  Alice       Civic    1
1    Zac         BMW    4
2   Anna  Mitsubishi    2
3      O        Benz    0
Then:
"""

# Each apply() returns a boolean Series; & combines them row by row.
df[
    df['names'].apply(lambda x: len(x) > 1) &
    df['cars'].apply(lambda x: "i" in x) &
    df['age'].apply(lambda x: int(x) < 2)
]
"""
We will have:
   names   cars  age
0  Alice  Civic    1
"""
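
For simple conditions like these three, the element-wise `apply` calls can usually be replaced with pandas' vectorized methods. The sketch below builds both masks on the same sample data so they can be compared; the variable names `apply_mask` and `vector_mask` are chosen here for illustration.

```python
import pandas as pd

data = {
    "names": ["Alice", "Zac", "Anna", "O"],
    "cars": ["Civic", "BMW", "Mitsubishi", "Benz"],
    "age": ["1", "4", "2", "0"],
}
df = pd.DataFrame(data)

# Element-wise apply: one Python call per value.
apply_mask = (
    df["names"].apply(lambda x: len(x) > 1)
    & df["cars"].apply(lambda x: "i" in x)
    & df["age"].apply(lambda x: int(x) < 2)
)

# The same three conditions written with vectorized methods.
vector_mask = (
    (df["names"].str.len() > 1)
    & df["cars"].str.contains("i", regex=False)  # plain substring, not a regex
    & (df["age"].astype(int) < 2)
)

result = df[vector_mask]
print(result)
```

Both masks select only the "Alice" row. `apply` remains the more general tool when a condition cannot be expressed with the built-in `.str` or comparison methods.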

  

The most flexible approach is to use apply:

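import os

import pandas as pd

# MetaIndex and is_l34_tcp_metadata are not defined in this snippet;
# they come from the surrounding project.


def load_metadata(dir_name):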
    columns_index_list = [
        MetaIndex.M_METADATA_ID_INDEX,
        MetaIndex.M_SRC_IP_INDEX,
        MetaIndex.M_DST_IP_INDEX,
        MetaIndex.M_SRC_PORT_INDEX,
        MetaIndex.M_DST_PORT_INDEX,
        MetaIndex.M_PROTOCOL_INDEX,
        MetaIndex.M_HEADER_H,
        MetaIndex.M_PAYLOAD_H,
        MetaIndex.M_TCP_FLAG_H,
        MetaIndex.M_FLOW_FIRST_PKT_TIME,
        MetaIndex.M_FLOW_LAST_PKT_TIME,
        MetaIndex.M_OCTET_DELTA_COUNT_FROM_TOTAL_LEN,
    ]
    columns_name_list = [
        "M_METADATA_ID_INDEX",
        "M_SRC_IP_INDEX",
        "M_DST_IP_INDEX",
        "M_SRC_PORT_INDEX",
        "M_DST_PORT_INDEX",
        "M_PROTOCOL_INDEX",
        "M_HEADER_H",
        "M_PAYLOAD_H",
        "M_TCP_FLAG_H",
        "M_FLOW_FIRST_PKT_TIME",
        "M_FLOW_LAST_PKT_TIME",
        "M_OCTET_DELTA_COUNT_FROM_TOTAL_LEN",
    ]
 
    def metadata_parse_filter(row):
        try:
            if row['M_PROTOCOL_INDEX'] != 6:
                return False
            if len(row['M_HEADER_H']) < 2 or len(row['M_PAYLOAD_H']) < 2 or not is_l34_tcp_metadata(row['M_METADATA_ID_INDEX']):
                return False
            first_time = row['M_FLOW_FIRST_PKT_TIME'].split('-')
            last_time = row['M_FLOW_LAST_PKT_TIME'].split('-')
 
            flow_first_pkt_time = int(first_time[0])
            rev_flow_first_pkt_time = int(first_time[1])
 
            flow_last_pkt_time = int(last_time[0])
            rev_flow_last_pkt_time = int(last_time[1])
            if flow_first_pkt_time > flow_last_pkt_time or rev_flow_first_pkt_time > rev_flow_last_pkt_time:
                return False
            return True
        except Exception:
            # Any parsing problem (missing field, malformed number) rejects the row.
            return False
 
    for root, dirs, files in os.walk(dir_name):
        for filename in files:
            file_path = os.path.join(root, filename)
            # on_bad_lines='warn' (pandas >= 1.3) replaces the removed
            # error_bad_lines=False / warn_bad_lines=True pair.
            df = pd.read_csv(
                file_path, delimiter='^', usecols=columns_index_list,
                names=columns_name_list, encoding='utf-8',
                on_bad_lines='warn', header=0, lineterminator="\n")
            filter_df = df.loc[df.apply(metadata_parse_filter, axis=1)]
            yield filter_df

This filters directly by row!
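
Because `load_metadata` depends on project-specific names and on files on disk, the row-wise pattern is easier to try on a small in-memory frame. The following is a reduced sketch, not the author's code: the column names `protocol`, `first_time` and `last_time`, the sample values, and the function `row_filter` are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data: a protocol number plus "forward-reverse" timestamp pairs.
df = pd.DataFrame({
    "protocol": [6, 17, 6, 6],
    "first_time": ["100-110", "100-110", "300-110", "bad"],
    "last_time": ["200-210", "200-210", "200-210", "200-210"],
})


def row_filter(row):
    """Return True only for protocol-6 rows whose first timestamps do not exceed the last ones."""
    try:
        if row["protocol"] != 6:
            return False
        first_fwd, first_rev = (int(t) for t in row["first_time"].split("-"))
        last_fwd, last_rev = (int(t) for t in row["last_time"].split("-"))
        return first_fwd <= last_fwd and first_rev <= last_rev
    except (ValueError, AttributeError):
        # Malformed or missing fields: drop the row instead of raising.
        return False


# axis=1 passes each row to row_filter as a Series; the result is a boolean mask.
mask = df.apply(row_filter, axis=1)
filtered = df.loc[mask]
print(filtered)
```

Here only the first row survives: the second has a different protocol, the third starts after it ends, and the fourth cannot be parsed. Returning `False` from the `except` branch is what lets a single function both validate and filter.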

posted @ bonelee