Getting started with featuretools
A first try at featuretools
Introduction
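Feature engineering is often the step that improves a baseline the most, and exploratory data analysis (EDA) and feature construction are an indispensable part of it. In Python, hand-written feature engineering usually relies on pandas `groupby` and `agg` to compute aggregate statistics.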
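As a point of reference, the manual `groupby`/`agg` approach can be sketched as follows. The toy table and its values are invented for illustration and are not part of the featuretools demo data.

```python
import pandas as pd

# Hypothetical toy transactions, invented for illustration only.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [10.0, 30.0, 5.0, 15.0, 40.0],
})

# One row per customer, one column per aggregate statistic.
manual_features = (
    transactions
    .groupby("customer_id")["amount"]
    .agg(["sum", "mean", "max", "count"])
)
print(manual_features)
```

Every statistic and every grouping key has to be spelled out by hand here, which is the part featuretools aims to automate.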
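First, we need to understand the three basic building blocks of featuretools:
- EntitySet: each two-dimensional table is treated as one entity, and an entity set is a collection of one or more such tables
- Feature primitives: the operations used to build new features, divided into aggregation primitives and transform primitives
- Deep Feature Synthesis (DFS): the algorithm that creates new features from the entities in an entity set and the chosen feature primitives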
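First try
featuretools is an automated feature engineering framework that can quickly generate a large number of features based on aggregate statistics.
For date-time columns, featuretools can also break the values into components (year, month, day, weekday) automatically.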
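To make the date-time decomposition concrete, here is a minimal pandas sketch of what such components look like when computed by hand. The two timestamps are taken from the `join_date` column shown further down; the column names merely imitate the feature names featuretools prints, and the assumption is that its date primitives behave like the pandas `.dt` accessors.

```python
import pandas as pd

# Two join_date values copied from the demo customers table.
join_date = pd.Series(
    pd.to_datetime(["2011-04-17 10:48:33", "2012-04-15 23:31:04"]),
    name="join_date",
)

# Hand-written counterparts of the date components.
parts = pd.DataFrame({
    "YEAR(join_date)": join_date.dt.year,
    "MONTH(join_date)": join_date.dt.month,
    "DAY(join_date)": join_date.dt.day,
    "WEEKDAY(join_date)": join_date.dt.weekday,  # Monday is 0, Sunday is 6
})
print(parts)
```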
Let's start with the getting-started tutorial from the official documentation:
import featuretools as ft
data = ft.demo.load_mock_customer()
customers_df = data["customers"]
customers_df
| | customer_id | zip_code | join_date | date_of_birth |
|---|---|---|---|---|
| 0 | 1 | 60091 | 2011-04-17 10:48:33 | 1994-07-18 |
| 1 | 2 | 13244 | 2012-04-15 23:31:04 | 1986-08-18 |
| 2 | 3 | 13244 | 2011-08-13 15:42:34 | 2003-11-21 |
| 3 | 4 | 60091 | 2011-04-08 20:08:14 | 2006-08-15 |
| 4 | 5 | 60091 | 2010-07-17 05:27:50 | 1984-07-28 |
sessions_df = data["sessions"]
sessions_df.head()
| | session_id | customer_id | device | session_start |
|---|---|---|---|---|
| 0 | 1 | 2 | desktop | 2014-01-01 00:00:00 |
| 1 | 2 | 5 | mobile | 2014-01-01 00:17:20 |
| 2 | 3 | 4 | mobile | 2014-01-01 00:28:10 |
| 3 | 4 | 1 | mobile | 2014-01-01 00:44:25 |
| 4 | 5 | 4 | mobile | 2014-01-01 01:11:30 |
transactions_df = data["transactions"]
transactions_df.head()
| | transaction_id | session_id | transaction_time | product_id | amount |
|---|---|---|---|---|---|
| 0 | 298 | 1 | 2014-01-01 00:00:00 | 5 | 127.64 |
| 1 | 2 | 1 | 2014-01-01 00:01:05 | 2 | 109.48 |
| 2 | 308 | 1 | 2014-01-01 00:02:10 | 3 | 95.06 |
| 3 | 116 | 1 | 2014-01-01 00:03:15 | 4 | 78.92 |
| 4 | 371 | 1 | 2014-01-01 00:04:20 | 3 | 31.54 |
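These are three demo tables that ship with featuretools, and they are linked to each other by foreign keys: `sessions.customer_id` points to `customers`, and `transactions.session_id` points to `sessions`. The usual manual approach is to `merge` the tables on `session_id` and `customer_id` and then do the feature engineering on the resulting wide table.
When the tables are large, the merge itself can take a long time, and the cost of multi-table joins grows as more tables are added.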
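For comparison, the merge-then-aggregate workflow described above might look like the sketch below. The three small tables are hypothetical stand-ins that only mirror the key columns of the demo data; the output column names are made up for this example.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables (values invented).
customers = pd.DataFrame({"customer_id": [1, 2]})
sessions = pd.DataFrame({"session_id": [10, 11, 12], "customer_id": [1, 1, 2]})
transactions = pd.DataFrame({
    "transaction_id": [100, 101, 102, 103],
    "session_id": [10, 10, 11, 12],
    "amount": [20.0, 30.0, 50.0, 7.5],
})

# Step 1: join everything into one wide table via the foreign keys.
wide = (
    transactions
    .merge(sessions, on="session_id", how="left")
    .merge(customers, on="customer_id", how="left")
)

# Step 2: aggregate the wide table back to one row per customer.
per_customer = wide.groupby("customer_id")["amount"].agg(
    total_amount="sum",
    mean_amount="mean",
    n_transactions="count",
)
print(per_customer)
```

Each extra table adds another `merge` call and another set of hand-picked aggregations, which is where the manual approach becomes slow and repetitive.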
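Next, we use featuretools to assemble the three tables into an entity set. Start with a dictionary that maps each entity name to a tuple of (DataFrame, index column, optional time index column).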
entities = {
"customers" : (customers_df, "customer_id"),
"sessions" : (sessions_df, "session_id", "session_start"),
"transactions" : (transactions_df, "transaction_id", "transaction_time")
}
The tables in `entities` are not linked to each other yet, so the next step is to define the relationships between them.
# Each tuple is (parent table, parent key, child table, child key).
relationships = [
    ("sessions", "session_id", "transactions", "session_id"),
    ("customers", "customer_id", "sessions", "customer_id"),
]
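Before handing relationships like these to a feature generator, it can be worth checking that every foreign key in a child table actually exists in its parent. The following is a small pandas sketch under stated assumptions: `orphan_count` is a hypothetical helper written for this post, not a featuretools function, and the tables are invented miniatures.

```python
import pandas as pd

# Hypothetical miniature tables (values invented for illustration).
frames = {
    "customers": pd.DataFrame({"customer_id": [1, 2]}),
    "sessions": pd.DataFrame({"session_id": [10, 11, 12], "customer_id": [1, 1, 2]}),
    "transactions": pd.DataFrame({"transaction_id": [100, 101, 102], "session_id": [10, 11, 12]}),
}

# Same tuple layout as above: (parent table, parent key, child table, child key).
relationships = [
    ("sessions", "session_id", "transactions", "session_id"),
    ("customers", "customer_id", "sessions", "customer_id"),
]

def orphan_count(frames, relationship):
    """Count child rows whose foreign key has no match in the parent table."""
    parent, parent_key, child, child_key = relationship
    matched = frames[child][child_key].isin(frames[parent][parent_key])
    return int((~matched).sum())

orphans = {rel: orphan_count(frames, rel) for rel in relationships}
print(orphans)
```

A non-zero count means some child rows cannot be attributed to any parent row, so they would not contribute to that parent's aggregated features.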
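Once the relationships are defined, `ft.dfs` can be used to generate features.
Here the target entity is set to `customers`. Note that `entities` and `target_entity` are the argument names used by featuretools releases before 1.0; from 1.0 onward they are called `dataframes` and `target_dataframe_name`.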
feature_matrix_customers, features_defs = ft.dfs(
    entities=entities,
    relationships=relationships,
    target_entity="customers",
)
The generated feature table can be inspected with:
feature_matrix_customers.head()
To see which features were generated:
features_defs
[<Feature: zip_code>,
<Feature: COUNT(sessions)>,
<Feature: NUM_UNIQUE(sessions.device)>,
<Feature: MODE(sessions.device)>,
<Feature: SUM(transactions.amount)>,
<Feature: STD(transactions.amount)>,
<Feature: MAX(transactions.amount)>,
<Feature: SKEW(transactions.amount)>,
<Feature: MIN(transactions.amount)>,
<Feature: MEAN(transactions.amount)>,
<Feature: COUNT(transactions)>,
<Feature: NUM_UNIQUE(transactions.product_id)>,
<Feature: MODE(transactions.product_id)>,
<Feature: DAY(date_of_birth)>,
<Feature: DAY(join_date)>,
<Feature: YEAR(date_of_birth)>,
<Feature: YEAR(join_date)>,
<Feature: MONTH(date_of_birth)>,
<Feature: MONTH(join_date)>,
<Feature: WEEKDAY(date_of_birth)>,
<Feature: WEEKDAY(join_date)>,
<Feature: SUM(sessions.NUM_UNIQUE(transactions.product_id))>,
<Feature: SUM(sessions.MIN(transactions.amount))>,
<Feature: SUM(sessions.MEAN(transactions.amount))>,
<Feature: SUM(sessions.SKEW(transactions.amount))>,
<Feature: SUM(sessions.MAX(transactions.amount))>,
<Feature: SUM(sessions.STD(transactions.amount))>,
<Feature: STD(sessions.NUM_UNIQUE(transactions.product_id))>,
<Feature: STD(sessions.COUNT(transactions))>,
<Feature: STD(sessions.MIN(transactions.amount))>,
<Feature: STD(sessions.MEAN(transactions.amount))>,
<Feature: STD(sessions.SKEW(transactions.amount))>,
<Feature: STD(sessions.MAX(transactions.amount))>,
<Feature: STD(sessions.SUM(transactions.amount))>,
<Feature: MAX(sessions.NUM_UNIQUE(transactions.product_id))>,
<Feature: MAX(sessions.COUNT(transactions))>,
<Feature: MAX(sessions.MIN(transactions.amount))>,
<Feature: MAX(sessions.MEAN(transactions.amount))>,
<Feature: MAX(sessions.SKEW(transactions.amount))>,
<Feature: MAX(sessions.SUM(transactions.amount))>,
<Feature: MAX(sessions.STD(transactions.amount))>,
<Feature: SKEW(sessions.NUM_UNIQUE(transactions.product_id))>,
<Feature: SKEW(sessions.COUNT(transactions))>,
<Feature: SKEW(sessions.MIN(transactions.amount))>,
<Feature: SKEW(sessions.MEAN(transactions.amount))>,
<Feature: SKEW(sessions.MAX(transactions.amount))>,
<Feature: SKEW(sessions.SUM(transactions.amount))>,
<Feature: SKEW(sessions.STD(transactions.amount))>,
<Feature: MIN(sessions.NUM_UNIQUE(transactions.product_id))>,
<Feature: MIN(sessions.COUNT(transactions))>,
<Feature: MIN(sessions.MEAN(transactions.amount))>,
<Feature: MIN(sessions.SKEW(transactions.amount))>,
<Feature: MIN(sessions.MAX(transactions.amount))>,
<Feature: MIN(sessions.SUM(transactions.amount))>,
<Feature: MIN(sessions.STD(transactions.amount))>,
<Feature: MEAN(sessions.NUM_UNIQUE(transactions.product_id))>,
<Feature: MEAN(sessions.COUNT(transactions))>,
<Feature: MEAN(sessions.MIN(transactions.amount))>,
<Feature: MEAN(sessions.MEAN(transactions.amount))>,
<Feature: MEAN(sessions.SKEW(transactions.amount))>,
<Feature: MEAN(sessions.MAX(transactions.amount))>,
<Feature: MEAN(sessions.SUM(transactions.amount))>,
<Feature: MEAN(sessions.STD(transactions.amount))>,
<Feature: NUM_UNIQUE(sessions.MONTH(session_start))>,
<Feature: NUM_UNIQUE(sessions.WEEKDAY(session_start))>,
<Feature: NUM_UNIQUE(sessions.YEAR(session_start))>,
<Feature: NUM_UNIQUE(sessions.MODE(transactions.product_id))>,
<Feature: NUM_UNIQUE(sessions.DAY(session_start))>,
<Feature: MODE(sessions.MONTH(session_start))>,
<Feature: MODE(sessions.WEEKDAY(session_start))>,
<Feature: MODE(sessions.YEAR(session_start))>,
<Feature: MODE(sessions.MODE(transactions.product_id))>,
<Feature: MODE(sessions.DAY(session_start))>,
<Feature: NUM_UNIQUE(transactions.sessions.customer_id)>,
<Feature: NUM_UNIQUE(transactions.sessions.device)>,
<Feature: MODE(transactions.sessions.customer_id)>,
<Feature: MODE(transactions.sessions.device)>]
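The nested names in this list read from the inside out. As a rough illustration, `MEAN(sessions.SUM(transactions.amount))` can be reproduced by hand with two stacked `groupby` calls. This is a sketch on invented toy data, assuming the SUM and MEAN primitives behave like the plain pandas `sum` and `mean`.

```python
import pandas as pd

# Hypothetical miniature tables (values invented for illustration).
sessions = pd.DataFrame({"session_id": [10, 11, 12], "customer_id": [1, 1, 2]})
transactions = pd.DataFrame({
    "session_id": [10, 10, 11, 12],
    "amount": [20.0, 30.0, 60.0, 7.5],
})

# Depth 1: total amount per session, i.e. SUM(transactions.amount).
session_sum = (
    transactions.groupby("session_id")["amount"]
    .sum()
    .rename("SUM(transactions.amount)")
)
sessions_feat = sessions.join(session_sum, on="session_id")

# Depth 2: average of those per-session totals for each customer,
# i.e. MEAN(sessions.SUM(transactions.amount)).
customer_mean = sessions_feat.groupby("customer_id")["SUM(transactions.amount)"].mean()
print(customer_mean)
```

Customer 1 has session totals of 50.0 and 60.0, so the depth-2 feature is their average; DFS enumerates combinations like this for every compatible pair of primitives.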
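Summary:
- The common aggregate statistics such as std and mean are all generated automatically, and features are stacked to depth 2 (an aggregation of an aggregation, for example `MEAN(sessions.SUM(transactions.amount))`), which removes much of the manual feature construction. Date-time columns are also expanded automatically into features such as weekday and day of month, which saves the simple but time-consuming part of routine feature engineering.
- When a new table needs to be added, it is enough to add the new entity and its relationship to the existing ones.
- There are drawbacks, however. With `max_depth` greater than 2 the number of columns can grow very quickly, the generated features are hard to interpret, and many of them may be useless, adding noise that can hurt model performance.
- Common dimensionality reduction and feature selection methods, such as PCA, SVD, random-forest-based feature selection, or representations learned by a neural network, can be applied to reduce the dimensionality of the generated feature matrix.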
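As one illustration of the dimensionality reduction mentioned in the last point, the sketch below performs PCA through a singular value decomposition with plain NumPy. The feature matrix is synthetic and deliberately redundant (six columns built from three independent ones); it stands in for a generated feature matrix and is not output of featuretools. The choice of `k = 3` is an assumption that matches how the toy data was built.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a wide feature matrix: 3 independent columns
# plus 3 columns that are linear combinations of them (pure redundancy).
base = rng.normal(size=(100, 3))
X = np.hstack([base, base @ rng.normal(size=(3, 3))])

# PCA via SVD: center the columns, then decompose.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Share of total variance carried by each principal component.
explained = S**2 / np.sum(S**2)

# Keep the first k components and project the data onto them.
k = 3
X_reduced = Xc @ Vt[:k].T
print(X_reduced.shape, explained.round(3))
```

On real DFS output the columns would first need to be numeric, free of missing values, and usually scaled, since PCA is sensitive to the units of each feature.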