Getting Started with featuretools

A First Try at featuretools

Introduction

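Feature engineering is one of the most important steps in a modeling workflow and is often the step that improves a baseline the most; exploring the data (EDA) and constructing features are indispensable parts of it. In Python, feature engineering is commonly done by hand with groupby and agg to compute aggregate statistics.
First, we need to understand the three basic building blocks of featuretools: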

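  • Entity set (EntitySet): each two-dimensional table is treated as an entity, and an entity set is a collection of one or more such tables
  • Feature primitives: the methods used to construct new features, divided into two kinds, aggregation and transform
  • Deep Feature Synthesis (DFS): creates new features from the entities in the entity set and the feature primitives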
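
To make the difference between the two kinds of primitives concrete, here is a minimal sketch in plain pandas rather than featuretools itself. The toy table and its values are made up for illustration; only the idea (a transform works row by row, an aggregation collapses many child rows into one value per parent) comes from the description above.

```python
import pandas as pd

# Hypothetical toy data: one row per transaction.
transactions = pd.DataFrame({
    "session_id": [1, 1, 2],
    "transaction_time": pd.to_datetime([
        "2014-01-01 00:00:00",
        "2014-01-01 00:01:05",
        "2014-01-02 10:00:00",
    ]),
    "amount": [10.0, 30.0, 5.0],
})

# Transform-style feature: one output value per input row (same table),
# comparable in spirit to WEEKDAY(transaction_time). Monday is 0.
weekday = transactions["transaction_time"].dt.weekday

# Aggregation-style feature: many child rows collapse into one value per
# parent key, comparable in spirit to MEAN(transactions.amount) per session.
mean_amount = transactions.groupby("session_id")["amount"].mean()

print(weekday.tolist())
print(mean_amount.to_dict())
```

The transform result has as many rows as the transactions table, while the aggregation result has one row per session.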

First Try

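featuretools is a feature engineering framework for automated feature generation; it can quickly produce a large number of features based on aggregate statistics.
For datetime columns, featuretools can also split the values into components (such as year, month, day, and weekday) automatically.
Let's start with the getting-started tutorial from the official documentation: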

import featuretools as ft
data = ft.demo.load_mock_customer()
customers_df = data["customers"]
customers_df
   customer_id  zip_code            join_date  date_of_birth
0            1     60091  2011-04-17 10:48:33     1994-07-18
1            2     13244  2012-04-15 23:31:04     1986-08-18
2            3     13244  2011-08-13 15:42:34     2003-11-21
3            4     60091  2011-04-08 20:08:14     2006-08-15
4            5     60091  2010-07-17 05:27:50     1984-07-28
sessions_df = data["sessions"]
sessions_df.head()
   session_id  customer_id   device        session_start
0           1            2  desktop  2014-01-01 00:00:00
1           2            5   mobile  2014-01-01 00:17:20
2           3            4   mobile  2014-01-01 00:28:10
3           4            1   mobile  2014-01-01 00:44:25
4           5            4   mobile  2014-01-01 01:11:30
transactions_df = data["transactions"]
transactions_df.head()
   transaction_id  session_id     transaction_time  product_id  amount
0             298           1  2014-01-01 00:00:00           5  127.64
1               2           1  2014-01-01 00:01:05           2  109.48
2             308           1  2014-01-01 00:02:10           3   95.06
3             116           1  2014-01-01 00:03:15           4   78.92
4             371           1  2014-01-01 00:04:20           3   31.54

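These are three demo tables that ship with featuretools, and they can be related to one another through foreign keys. The usual manual approach is to merge the tables on session_id and customer_id and then do the feature engineering on the resulting single wide table.
When the tables are large, the merge itself can take a long time, and the cost keeps growing as more tables have to be joined.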
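
For comparison, the manual merge-then-aggregate approach described above might look like the following sketch. The three small DataFrames are hypothetical stand-ins for the demo tables (same key columns, made-up values), and the output column names are chosen only for this example.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables.
customers = pd.DataFrame({"customer_id": [1, 2]})
sessions = pd.DataFrame({
    "session_id": [10, 11, 12],
    "customer_id": [1, 1, 2],
})
transactions = pd.DataFrame({
    "transaction_id": [100, 101, 102, 103],
    "session_id": [10, 10, 11, 12],
    "amount": [5.0, 15.0, 20.0, 7.0],
})

# Join the child tables up to the customer level through the foreign keys.
wide = (
    transactions
    .merge(sessions, on="session_id", how="left")
    .merge(customers, on="customer_id", how="left")
)

# Aggregate per customer with groupby + agg (named aggregation).
features = wide.groupby("customer_id").agg(
    amount_sum=("amount", "sum"),
    amount_mean=("amount", "mean"),
    n_transactions=("transaction_id", "count"),
    n_sessions=("session_id", "nunique"),
)
print(features)
```

Every additional statistic or table means another line of hand-written merge or agg code, which is the repetitive work that DFS is meant to take over.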

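Next, use featuretools to assemble the three tables into an entity set. Build a dictionary that maps each entity name to a tuple of (dataframe, index column), with an optional third element naming the time index column.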

entities = {
    "customers": (customers_df, "customer_id"),
    "sessions": (sessions_df, "session_id", "session_start"),
    "transactions": (transactions_df, "transaction_id", "transaction_time"),
}

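The entities in this dictionary are not yet linked to one another, so the next step is to relate the three tables.
Create the relationships list; each entry is a tuple of (parent entity, parent variable, child entity, child variable).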

relationships = [
    ("sessions", "session_id", "transactions", "session_id"),
    ("customers", "customer_id", "sessions", "customer_id"),
]

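Once the relationships are defined, ft.dfs can be used to generate features.
Set the target entity to "customers". (The entities and target_entity arguments belong to the featuretools API as of this post; releases from 1.0 onward renamed them to dataframes and target_dataframe_name.)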

feature_matrix_customers, features_defs = ft.dfs(
    entities=entities,
    relationships=relationships,
    target_entity="customers",
)

The generated feature matrix can be inspected with:

feature_matrix_customers.head()

List the features that were generated:

features_defs
[<Feature: zip_code>,
 <Feature: COUNT(sessions)>,
 <Feature: NUM_UNIQUE(sessions.device)>,
 <Feature: MODE(sessions.device)>,
 <Feature: SUM(transactions.amount)>,
 <Feature: STD(transactions.amount)>,
 <Feature: MAX(transactions.amount)>,
 <Feature: SKEW(transactions.amount)>,
 <Feature: MIN(transactions.amount)>,
 <Feature: MEAN(transactions.amount)>,
 <Feature: COUNT(transactions)>,
 <Feature: NUM_UNIQUE(transactions.product_id)>,
 <Feature: MODE(transactions.product_id)>,
 <Feature: DAY(date_of_birth)>,
 <Feature: DAY(join_date)>,
 <Feature: YEAR(date_of_birth)>,
 <Feature: YEAR(join_date)>,
 <Feature: MONTH(date_of_birth)>,
 <Feature: MONTH(join_date)>,
 <Feature: WEEKDAY(date_of_birth)>,
 <Feature: WEEKDAY(join_date)>,
 <Feature: SUM(sessions.NUM_UNIQUE(transactions.product_id))>,
 <Feature: SUM(sessions.MIN(transactions.amount))>,
 <Feature: SUM(sessions.MEAN(transactions.amount))>,
 <Feature: SUM(sessions.SKEW(transactions.amount))>,
 <Feature: SUM(sessions.MAX(transactions.amount))>,
 <Feature: SUM(sessions.STD(transactions.amount))>,
 <Feature: STD(sessions.NUM_UNIQUE(transactions.product_id))>,
 <Feature: STD(sessions.COUNT(transactions))>,
 <Feature: STD(sessions.MIN(transactions.amount))>,
 <Feature: STD(sessions.MEAN(transactions.amount))>,
 <Feature: STD(sessions.SKEW(transactions.amount))>,
 <Feature: STD(sessions.MAX(transactions.amount))>,
 <Feature: STD(sessions.SUM(transactions.amount))>,
 <Feature: MAX(sessions.NUM_UNIQUE(transactions.product_id))>,
 <Feature: MAX(sessions.COUNT(transactions))>,
 <Feature: MAX(sessions.MIN(transactions.amount))>,
 <Feature: MAX(sessions.MEAN(transactions.amount))>,
 <Feature: MAX(sessions.SKEW(transactions.amount))>,
 <Feature: MAX(sessions.SUM(transactions.amount))>,
 <Feature: MAX(sessions.STD(transactions.amount))>,
 <Feature: SKEW(sessions.NUM_UNIQUE(transactions.product_id))>,
 <Feature: SKEW(sessions.COUNT(transactions))>,
 <Feature: SKEW(sessions.MIN(transactions.amount))>,
 <Feature: SKEW(sessions.MEAN(transactions.amount))>,
 <Feature: SKEW(sessions.MAX(transactions.amount))>,
 <Feature: SKEW(sessions.SUM(transactions.amount))>,
 <Feature: SKEW(sessions.STD(transactions.amount))>,
 <Feature: MIN(sessions.NUM_UNIQUE(transactions.product_id))>,
 <Feature: MIN(sessions.COUNT(transactions))>,
 <Feature: MIN(sessions.MEAN(transactions.amount))>,
 <Feature: MIN(sessions.SKEW(transactions.amount))>,
 <Feature: MIN(sessions.MAX(transactions.amount))>,
 <Feature: MIN(sessions.SUM(transactions.amount))>,
 <Feature: MIN(sessions.STD(transactions.amount))>,
 <Feature: MEAN(sessions.NUM_UNIQUE(transactions.product_id))>,
 <Feature: MEAN(sessions.COUNT(transactions))>,
 <Feature: MEAN(sessions.MIN(transactions.amount))>,
 <Feature: MEAN(sessions.MEAN(transactions.amount))>,
 <Feature: MEAN(sessions.SKEW(transactions.amount))>,
 <Feature: MEAN(sessions.MAX(transactions.amount))>,
 <Feature: MEAN(sessions.SUM(transactions.amount))>,
 <Feature: MEAN(sessions.STD(transactions.amount))>,
 <Feature: NUM_UNIQUE(sessions.MONTH(session_start))>,
 <Feature: NUM_UNIQUE(sessions.WEEKDAY(session_start))>,
 <Feature: NUM_UNIQUE(sessions.YEAR(session_start))>,
 <Feature: NUM_UNIQUE(sessions.MODE(transactions.product_id))>,
 <Feature: NUM_UNIQUE(sessions.DAY(session_start))>,
 <Feature: MODE(sessions.MONTH(session_start))>,
 <Feature: MODE(sessions.WEEKDAY(session_start))>,
 <Feature: MODE(sessions.YEAR(session_start))>,
 <Feature: MODE(sessions.MODE(transactions.product_id))>,
 <Feature: MODE(sessions.DAY(session_start))>,
 <Feature: NUM_UNIQUE(transactions.sessions.customer_id)>,
 <Feature: NUM_UNIQUE(transactions.sessions.device)>,
 <Feature: MODE(transactions.sessions.customer_id)>,
 <Feature: MODE(transactions.sessions.device)>]
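
Nested names such as MEAN(sessions.SUM(transactions.amount)) are read from the inside out: first sum the transaction amounts within each session, then average those per-session sums over each customer's sessions. The sketch below reproduces that reading by hand in pandas on made-up data. It is meant only to show what the name means and is not a description of how featuretools computes the value internally.

```python
import pandas as pd

# Hypothetical toy data with the same key columns as the demo tables.
sessions = pd.DataFrame({
    "session_id": [10, 11, 12],
    "customer_id": [1, 1, 2],
})
transactions = pd.DataFrame({
    "session_id": [10, 10, 11, 12],
    "amount": [5.0, 15.0, 30.0, 7.0],
})

# Depth 1, on sessions: SUM(transactions.amount) for each session.
session_sum = (
    transactions.groupby("session_id")["amount"]
    .sum()
    .rename("SUM(transactions.amount)")
    .reset_index()
)

# Depth 2, on customers: MEAN over each customer's sessions of that sum.
per_session = sessions.merge(session_sum, on="session_id", how="left")
mean_of_sums = per_session.groupby("customer_id")["SUM(transactions.amount)"].mean()

# For contrast, the depth-1 feature MEAN(transactions.amount) per customer.
flat_mean = (
    transactions.merge(sessions, on="session_id", how="left")
    .groupby("customer_id")["amount"]
    .mean()
)

print(mean_of_sums.to_dict())
print(flat_mean.to_dict())
```

For customer 1 the stacked feature is the average session total (25.0), while the flat mean is the average single transaction (about 16.67), so the two features carry different information.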

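Summary:

  • DFS generates the basic aggregate statistics (std, mean, and other common ones) and also stacks them to second order, for example the mean over sessions of a per-session sum, which saves the effort of constructing such features by hand. Time features such as the weekday or the day of the month are also derived automatically, saving feature engineering work that is simple but time-consuming.
  • When a new table needs to be included, you only need to add a new entity and a new relationship to the existing ones.
  • There are drawbacks, however. With max_depth > 2 the number of features can grow very quickly, the generated columns are hard to interpret, and many of them may be useless, adding noise that can hurt model performance.
  • Common ways to reduce the resulting dimensionality include PCA, SVD, random-forest-based feature selection, and representations learned by a neural network.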
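
As one illustration of the dimensionality reduction mentioned in the last point, the following is a minimal sketch of PCA computed through an SVD with NumPy. The helper name pca_reduce and the synthetic matrix are assumptions made for this example. A real DFS feature matrix would first need its categorical columns encoded, missing values filled, and usually its columns standardized, none of which is shown here.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components (PCA via SVD)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # center each column
    # Rows of Vt are the principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    explained_ratio = (S ** 2) / np.sum(S ** 2)
    return Xc @ components.T, explained_ratio[:n_components]

# Hypothetical "feature matrix": 6 columns that are noisy linear mixes of
# only 2 underlying signals, mimicking redundant generated features.
rng = np.random.default_rng(0)
signals = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 6))
X = signals @ mixing + 0.01 * rng.normal(size=(100, 6))

Z, ratio = pca_reduce(X, n_components=2)
print(Z.shape)
print(ratio.sum())
```

Because the six columns are almost entirely determined by two signals, two components retain nearly all of the variance. Whether that holds for a given DFS output has to be checked on the actual data.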
Posted 2020-06-08 10:21 by 搞材料的小周