Feature platform: Feast

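Feast is an open-source feature platform, originally developed by Gojek together with Google Cloud. It provides feature registration and management, plus an SDK for interacting with the feature store, the offline store, and the online store. Official documentation: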

https://docs.feast.dev/

The latest release at the time of writing, v0.24, supports the following offline stores: File, Snowflake, BigQuery, Redshift, Spark, PostgreSQL, Trino, Azure Synapse, and others. See:

https://docs.feast.dev/reference/offline-stores

Online stores: SQLite, Snowflake, Redis, Datastore, DynamoDB, PostgreSQL, Cassandra, and others. See:

https://docs.feast.dev/reference/online-stores

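A provider defines the environment Feast runs in and supplies the feature store implementation for that platform's components. There are currently four: local, gcp, aws, and azure.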

provider   supported offline stores                supported online stores
local      BigQuery, file                          Redis, Datastore, SQLite
gcp        BigQuery, file                          Datastore, SQLite
aws        Redshift, file                          DynamoDB, SQLite
azure      MsSQL (AzureSynapse+AzureSQL), file     Redis, SQLite

See:

https://docs.feast.dev/getting-started/architecture-and-components/provider

A data source defines where feature data comes from. Each batch data source is tied to one offline store; for example, SnowflakeSource can only be used with the Snowflake offline store.

Data source types include: File, Snowflake, BigQuery, Redshift, Push, Kafka, Kinesis, Spark, PostgreSQL, Trino, and AzureSynapse+AzureSQL.

data source          offline store            notes
FileSource           file
SnowflakeSource      Snowflake
BigQuerySource       BigQuery
RedshiftSource       Redshift
PushSource           -                        can write features to both the online and offline store
KafkaSource          -                        still experimental
KinesisSource        -                        still experimental
SparkSource          Spark                    supports Hive tables and Parquet files
PostgreSQLSource     PostgreSQL
TrinoSource          Trino
MsSqlServerSource    AzureSynapse+AzureSQL
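The pairing rule in the table above can be written down as a small lookup. This is only an illustrative sketch: the dictionary is copied from the table, and the helper names `required_offline_store` and `is_compatible` are hypothetical, not part of the Feast SDK.

```python
from typing import Optional

# Batch source class -> offline store, copied from the table above.
# Push/stream sources are left out because the table lists no offline store for them.
BATCH_SOURCE_TO_OFFLINE_STORE = {
    "FileSource": "file",
    "SnowflakeSource": "Snowflake",
    "BigQuerySource": "BigQuery",
    "RedshiftSource": "Redshift",
    "SparkSource": "Spark",
    "PostgreSQLSource": "PostgreSQL",
    "TrinoSource": "Trino",
    "MsSqlServerSource": "AzureSynapse+AzureSQL",
}


def required_offline_store(source_class: str) -> Optional[str]:
    """Return the offline store a batch source needs, or None if the table lists none."""
    return BATCH_SOURCE_TO_OFFLINE_STORE.get(source_class)


def is_compatible(source_class: str, offline_store: str) -> bool:
    """Check (case-insensitively) that a batch source matches the given offline store."""
    required = required_offline_store(source_class)
    return required is not None and required.lower() == offline_store.lower()
```

For instance, `is_compatible("SnowflakeSource", "BigQuery")` is false, which mirrors the rule that a SnowflakeSource only works with the Snowflake offline store.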

 

Batch materialization engines load data from the offline store into the online store. They are configured under batch_engine in feature_store.yaml.

The default implementation is LocalMaterializationEngine; there is also LambdaMaterializationEngine, which is based on AWS Lambda.

https://docs.feast.dev/getting-started/architecture-and-components/batch-materialization-engine

Bytewax (used together with Kubernetes) and Snowflake (when using SnowflakeSource) can also serve as the batch materialization engine.

You can also implement your own engine. See:

https://docs.feast.dev/how-to-guides/customizing-feast/creating-a-custom-materialization-engine
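To make "loading offline data into the online store" concrete, here is a toy sketch of the idea in plain Python. It is not Feast's implementation and uses no Feast API: the row layout, the inclusive time window, and the `materialize` function are all assumptions made for illustration. The point is only that the online store ends up holding the newest row per entity key.

```python
from datetime import datetime
from typing import Dict, Iterable, Tuple

# Hypothetical row layout: (driver_id, event_timestamp, conv_rate)
Row = Tuple[int, datetime, float]


def materialize(
    offline_rows: Iterable[Row],
    start: datetime,
    end: datetime,
    online_store: Dict[int, Tuple[datetime, float]],
) -> None:
    """Copy the newest row per entity key within [start, end] into online_store."""
    for driver_id, ts, conv_rate in offline_rows:
        if not (start <= ts <= end):
            continue  # outside the materialization window
        current = online_store.get(driver_id)
        if current is None or ts > current[0]:
            online_store[driver_id] = (ts, conv_rate)


rows = [
    (1001, datetime(2022, 1, 1, 10), 0.5),
    (1001, datetime(2022, 1, 1, 11), 0.7),
    (1002, datetime(2022, 1, 1, 9), 0.2),
    (1002, datetime(2022, 1, 2, 9), 0.9),  # falls outside the window used below
]
online: Dict[int, Tuple[datetime, float]] = {}
materialize(rows, datetime(2022, 1, 1), datetime(2022, 1, 1, 23, 59), online)
```

After the call, `online` holds one entry per driver: the 11:00 row for 1001 and the 09:00 row for 1002.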

  

 

1. Installing Feast

https://docs.feast.dev/getting-started/quickstart

The installation below uses v0.23 as the example. Python 3.8 is recommended for v0.23, and Python 3.7 for v0.22.

pip install feast==0.23.0

Because the offline store chosen here is Hive and the online store is Cassandra, the offline and online store plugins are needed as well:

pip install feast-cassandra==0.1.3
pip install feast-hive==0.17.0

If thriftpy fails to install while installing feast-hive, install cython first:

pip install cython
pip install thriftpy
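After installing, it can help to confirm that the pinned versions are the ones actually present in the environment. A minimal sketch using only the standard library (`importlib.metadata`, available from Python 3.8); the `PINNED` dictionary simply repeats the versions from the commands above, and the helper names are made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Optional

# Versions taken from the pip commands above.
PINNED = {"feast": "0.23.0", "feast-cassandra": "0.1.3", "feast-hive": "0.17.0"}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def check_pins(pins: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Return {package: found_version} for every package that does not match its pin."""
    mismatches: Dict[str, Optional[str]] = {}
    for package, wanted in pins.items():
        found = installed_version(package)
        if found != wanted:
            mismatches[package] = found
    return mismatches


if __name__ == "__main__":
    for package, found in check_pins(PINNED).items():
        print(f"{package}: expected {PINNED[package]}, found {found}")
```

An empty result from `check_pins(PINNED)` means all three packages match the versions used in this post.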

  

2. Creating a Feast project

feast init my_project


Creating a new Feast repository in /Users/lintong/coding/python/my_project.

$ tree -L 3 my_project
my_project
├── __init__.py
├── data
│   └── driver_stats.parquet
├── example.py
└── feature_store.yaml

1 directory, 4 files

feature_store.yaml is where the offline store and online store are configured. The file must sit in the project's root directory. See:

https://docs.feast.dev/reference/feature-repository

For example:

project: my_project
registry: data/registry.db
provider: local
online_store:
    path: data/online_store.db
entity_key_serialization_version: 2

example.py defines the Feast pipeline: the data source for the features, the entity, the feature view registration, and the feature service, as follows:

# This is an example feature definition file

from datetime import timedelta

from feast import Entity, FeatureService, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# Read data from parquet files. Parquet is convenient for local development mode. For
# production, you can use your favorite DWH, such as BigQuery. See Feast documentation
# for more info.
driver_hourly_stats = FileSource(
    name="driver_hourly_stats_source",
    path="/Users/lintong/coding/python/my_project/data/driver_stats.parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)

# Define an entity for the driver. You can think of entity as a primary key used to
# fetch features.
driver = Entity(name="driver", join_keys=["driver_id"])

# Our parquet files contain sample data that includes a driver_id column, timestamps and
# three feature columns. Here we define a Feature View that will allow us to serve this
# data to our model online.
driver_hourly_stats_view = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    ttl=timedelta(days=1),
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="acc_rate", dtype=Float32),
        Field(name="avg_daily_trips", dtype=Int64),
    ],
    online=True,
    source=driver_hourly_stats,
    tags={},
)

driver_stats_fs = FeatureService(
    name="driver_activity", features=[driver_hourly_stats_view]
)
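Once these definitions have been registered and the online store has been populated, the features can be read back through the Python SDK. The sketch below assumes a repository that has already been applied and materialized (for example with `feast materialize-incremental`, which this post does not cover). `FeatureStore` and `get_online_features` are Feast SDK calls; the helpers `feature_refs` and `fetch_driver_features` are hypothetical names introduced here.

```python
from typing import Dict, List


def feature_refs(view_name: str, feature_names: List[str]) -> List[str]:
    """Build "<feature_view>:<feature>" reference strings for a feature view."""
    return [f"{view_name}:{name}" for name in feature_names]


def fetch_driver_features(repo_path: str, driver_id: int) -> Dict[str, list]:
    """Read the latest online feature values for one driver."""
    # Imported lazily so feature_refs() can be used without Feast installed.
    from feast import FeatureStore

    store = FeatureStore(repo_path=repo_path)
    response = store.get_online_features(
        features=feature_refs(
            "driver_hourly_stats", ["conv_rate", "acc_rate", "avg_daily_trips"]
        ),
        entity_rows=[{"driver_id": driver_id}],
    )
    return response.to_dict()
```

Calling `fetch_driver_features("my_project", 1001)` would then return a dictionary keyed by `driver_id` and the three feature names; 1001 is assumed to be one of the driver ids in the generated sample data.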

  

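3. Configuring and registering stores and features

The feature store configuration file is feature_store.yaml by default; you can also add your own.

The feature definition file is example.py by default; you can also add your own.

Once the configuration files are written, run feast apply to register the stores and features. A .feastignore file can be used to exclude files, and the features defined in them, from registration.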

 

If feast apply fails with the following error:

ImportError: cannot import name 'soft_unicode' from 'markupsafe'

the fix is to pin markupsafe to an older version:

pip install markupsafe==2.0.1
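The error occurs because `soft_unicode` was removed in MarkupSafe 2.1.0, while older dependencies still import it. A quick way to check whether the current environment is affected, using only the standard library (the function name `has_soft_unicode` is made up for this sketch):

```python
import importlib
from typing import Optional


def has_soft_unicode() -> Optional[bool]:
    """Return True/False if markupsafe is importable, or None if it is not installed."""
    try:
        markupsafe = importlib.import_module("markupsafe")
    except ImportError:
        return None
    # Present in 2.0.x, removed in 2.1.0.
    return hasattr(markupsafe, "soft_unicode")


if __name__ == "__main__":
    print("markupsafe.soft_unicode available:", has_soft_unicode())
```

If this prints `False`, the pin above should bring the attribute back.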

  

 

posted @ 2016-04-24 22:06  tonglin0325