Filtering a pandas DataFrame: apply is the most flexible!
Filter by the length of specific string columns:
    import pandas as pd

    df = pd.read_csv('filex.csv')

    # Cast both columns to str so the .str accessor works even on numeric data.
    df['A'] = df['A'].astype('str')
    df['B'] = df['B'].astype('str')

    # Keep only the rows where both A and B are exactly 10 characters long.
    mask = (df['A'].str.len() == 10) & (df['B'].str.len() == 10)
    df = df.loc[mask]
    print(df)
Applied to filex.csv:
A,B
123,abc
1234,abcd
1234567890,abcdefghij
the code above prints
            A           B
2  1234567890  abcdefghij
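The same length filter can be tried without a file on disk. The sketch below is only an illustration: it loads the sample rows from an in-memory string through `io.StringIO` (an assumption made here for convenience, not part of the original post) and keeps the lengths in separate variables, named `a_len` and `b_len` for this example.

```python
import io

import pandas as pd

# Same contents as filex.csv in the post, held in memory so the sketch runs anywhere.
csv_text = "A,B\n123,abc\n1234,abcd\n1234567890,abcdefghij\n"

df = pd.read_csv(io.StringIO(csv_text))

# Column A is parsed as integers, so convert to str before using the .str accessor.
a_len = df["A"].astype(str).str.len()
b_len = df["B"].astype(str).str.len()

# Parentheses are required around each comparison: & binds tighter than ==.
mask = (a_len == 10) & (b_len == 10)
filtered = df.loc[mask]
print(filtered)
```

Building the mask first and indexing with `df.loc[mask]` afterwards makes it easy to inspect which rows will be dropped before committing to the filter.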
Alternatively:
    import pandas as pd

    data = {
        "names": ["Alice", "Zac", "Anna", "O"],
        "cars": ["Civic", "BMW", "Mitsubishi", "Benz"],
        "age": ["1", "4", "2", "0"],
    }
    df = pd.DataFrame(data)

    # df (column order as printed by older pandas; recent versions
    # keep the dict's insertion order: names, cars, age):
    #   age        cars  names
    # 0   1       Civic  Alice
    # 1   4         BMW    Zac
    # 2   2  Mitsubishi   Anna
    # 3   0        Benz      O

    # Combine one boolean Series per column with & to filter on all three.
    result = df[df['names'].apply(lambda x: len(x) > 1) &
                df['cars'].apply(lambda x: "i" in x) &
                df['age'].apply(lambda x: int(x) < 2)]
    print(result)

    # We will have:
    #   age   cars  names
    # 0   1  Civic  Alice
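For simple conditions like these, the per-column lambdas can usually be swapped for vectorized string and comparison operations. The following is a sketch of that alternative rather than the approach the post uses; the names `apply_mask` and `vector_mask` are introduced here only for comparison.

```python
import pandas as pd

df = pd.DataFrame({
    "names": ["Alice", "Zac", "Anna", "O"],
    "cars": ["Civic", "BMW", "Mitsubishi", "Benz"],
    "age": ["1", "4", "2", "0"],
})

# The apply-based mask from the post: one Python-level call per cell.
apply_mask = (
    df["names"].apply(lambda x: len(x) > 1)
    & df["cars"].apply(lambda x: "i" in x)
    & df["age"].apply(lambda x: int(x) < 2)
)

# A vectorized equivalent built from the .str accessor and astype.
vector_mask = (
    (df["names"].str.len() > 1)
    & df["cars"].str.contains("i", regex=False)  # plain substring test, no regex
    & (df["age"].astype(int) < 2)
)

result = df[vector_mask]
print(result)
```

Both masks select the same rows on this data. The vectorized form tends to be faster on large frames, while `apply` remains the fallback when the condition cannot be expressed with built-in column operations.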
The most flexible approach is to use apply:
    import os

    import pandas as pd

    # MetaIndex and is_l34_tcp_metadata are defined elsewhere in the project.


    def load_metadata(dir_name):
        columns_index_list = [
            MetaIndex.M_METADATA_ID_INDEX,
            MetaIndex.M_SRC_IP_INDEX,
            MetaIndex.M_DST_IP_INDEX,
            MetaIndex.M_SRC_PORT_INDEX,
            MetaIndex.M_DST_PORT_INDEX,
            MetaIndex.M_PROTOCOL_INDEX,
            MetaIndex.M_HEADER_H,
            MetaIndex.M_PAYLOAD_H,
            MetaIndex.M_TCP_FLAG_H,
            MetaIndex.M_FLOW_FIRST_PKT_TIME,
            MetaIndex.M_FLOW_LAST_PKT_TIME,
            MetaIndex.M_OCTET_DELTA_COUNT_FROM_TOTAL_LEN,
        ]
        columns_name_list = [
            "M_METADATA_ID_INDEX",
            "M_SRC_IP_INDEX",
            "M_DST_IP_INDEX",
            "M_SRC_PORT_INDEX",
            "M_DST_PORT_INDEX",
            "M_PROTOCOL_INDEX",
            "M_HEADER_H",
            "M_PAYLOAD_H",
            "M_TCP_FLAG_H",
            "M_FLOW_FIRST_PKT_TIME",
            "M_FLOW_LAST_PKT_TIME",
            "M_OCTET_DELTA_COUNT_FROM_TOTAL_LEN",
        ]

        def metadata_parse_filter(row):
            """Return True if the row should be kept, False otherwise."""
            try:
                # 6 is the IP protocol number for TCP.
                if row['M_PROTOCOL_INDEX'] != 6:
                    return False
                if (len(row['M_HEADER_H']) < 2
                        or len(row['M_PAYLOAD_H']) < 2
                        or not is_l34_tcp_metadata(row['M_METADATA_ID_INDEX'])):
                    return False
                # Each time field holds two values joined by '-':
                # the flow's time and the reverse flow's time.
                first_time = row['M_FLOW_FIRST_PKT_TIME'].split('-')
                last_time = row['M_FLOW_LAST_PKT_TIME'].split('-')
                flow_first_pkt_time = int(first_time[0])
                rev_flow_first_pkt_time = int(first_time[1])
                flow_last_pkt_time = int(last_time[0])
                rev_flow_last_pkt_time = int(last_time[1])
                # A first-packet time later than the last-packet time is invalid.
                if (flow_first_pkt_time > flow_last_pkt_time
                        or rev_flow_first_pkt_time > rev_flow_last_pkt_time):
                    return False
                return True
            except Exception:
                # Any row that cannot be parsed is dropped.
                return False

        for root, dirs, files in os.walk(dir_name):
            for filename in files:
                file_path = os.path.join(root, filename)
                # on_bad_lines="warn" (pandas >= 1.3) replaces the older
                # error_bad_lines=False, warn_bad_lines=True pair.
                df = pd.read_csv(file_path, delimiter='^',
                                 usecols=columns_index_list,
                                 names=columns_name_list,
                                 encoding='utf-8',
                                 on_bad_lines="warn",
                                 header=0,
                                 lineterminator="\n")
                # axis=1 passes each row to the filter; the boolean result selects rows.
                filter_df = df.loc[df.apply(metadata_parse_filter, axis=1)]
                yield filter_df
This filters directly by row!
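The row-wise pattern can be tried on a small frame without the project-specific helpers. The sketch below uses hypothetical column names (`protocol`, `first_time`, `last_time`) and made-up values, and it catches only `ValueError` and `AttributeError` instead of every exception as the code above does; treat it as an illustration of `df.apply(..., axis=1)` rather than a reproduction of the original filter.

```python
import pandas as pd

# Hypothetical flow records; column names and values are illustrative only.
df = pd.DataFrame({
    "protocol": [6, 17, 6, 6],
    "first_time": ["100-110", "100-110", "300-310", "bad"],
    "last_time": ["200-210", "200-210", "200-310", "200-210"],
})


def keep_row(row):
    """Return True for TCP rows whose first times do not exceed their last times."""
    try:
        if row["protocol"] != 6:
            return False
        first_fwd, first_rev = (int(t) for t in row["first_time"].split("-"))
        last_fwd, last_rev = (int(t) for t in row["last_time"].split("-"))
        return first_fwd <= last_fwd and first_rev <= last_rev
    except (ValueError, AttributeError):
        # Malformed timestamps (or non-string cells such as NaN) drop the row.
        return False


# axis=1 calls keep_row once per row; the resulting boolean Series is the mask.
mask = df.apply(keep_row, axis=1)
filtered = df.loc[mask]
print(filtered)
```

Because the filter is an ordinary Python function, it can mix checks across several columns and handle parse failures per row, which is what makes this approach flexible. The cost is one Python call per row, so it is likely to be slower than vectorized masks on large inputs.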
Tags: python