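Feature platform: feast
feast is an open-source feature platform, originally developed by Gojek in collaboration with Google Cloud. It provides feature registration and management, plus an SDK for interacting with the feature store, the offline store, and the online store. Official documentation: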
https://docs.feast.dev/
As of v0.24, the latest version at the time of writing, the supported offline stores include File, Snowflake, BigQuery, Redshift, Spark, PostgreSQL, Trino, and Azure Synapse. See:
https://docs.feast.dev/reference/offline-stores
The supported online stores include SQLite, Snowflake, Redis, Datastore, DynamoDB, PostgreSQL, and Cassandra. See:
https://docs.feast.dev/reference/online-stores
A provider defines the environment feast runs in, supplying implementations of the feature store on the components of a given platform. There are currently four: local, gcp, aws, and azure.
provider | supported offline stores | supported online stores |
local | BigQuery, file | Redis, Datastore, SQLite |
gcp | BigQuery, file | Datastore, SQLite |
aws | Redshift, file | DynamoDB, SQLite |
azure | MySQL, file | Redis, SQLite |
See:
https://docs.feast.dev/getting-started/architecture-and-components/provider
A data source defines where feature data comes from. Each batch data source is tied to one offline store; for example, SnowflakeSource can only be used with the Snowflake offline store.
Data source types include: file, Snowflake, BigQuery, Redshift, push, Kafka, Kinesis, Spark, PostgreSQL, Trino, and Azure Synapse + Azure SQL.
data source | offline store |
FileSource | file |
SnowflakeSource | Snowflake |
BigQuerySource | BigQuery |
RedshiftSource | Redshift |
PushSource (can write features to both the online and offline store) | |
KafkaSource (still experimental) | |
KinesisSource (still experimental) | |
SparkSource (supports Hive tables and Parquet files) | Spark |
PostgreSQLSource | PostgreSQL |
TrinoSource | Trino |
MsSqlServerSource | AzureSynapse+AzureSQL |
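The pairing rule described above can be written down as a small lookup. This is only a sketch built from the table in this post; `SOURCE_TO_OFFLINE_STORE`, `required_offline_store`, and `is_compatible` are hypothetical helper names and are not part of the feast SDK.

```python
from typing import Optional

# Batch data source -> offline store, copied from the table in this post.
# Push and stream sources have no offline store listed there, so they are omitted.
SOURCE_TO_OFFLINE_STORE = {
    "FileSource": "file",
    "SnowflakeSource": "Snowflake",
    "BigQuerySource": "BigQuery",
    "RedshiftSource": "Redshift",
    "SparkSource": "Spark",
    "PostgreSQLSource": "PostgreSQL",
    "TrinoSource": "Trino",
    "MsSqlServerSource": "AzureSynapse+AzureSQL",
}


def required_offline_store(source_type: str) -> Optional[str]:
    """Return the offline store a batch source is tied to, or None if the table lists none."""
    return SOURCE_TO_OFFLINE_STORE.get(source_type)


def is_compatible(source_type: str, offline_store: str) -> bool:
    """True only when the source is a batch source tied to exactly this offline store."""
    required = required_offline_store(source_type)
    return required is not None and required.lower() == offline_store.lower()


if __name__ == "__main__":
    print(is_compatible("SnowflakeSource", "Snowflake"))
    print(is_compatible("SnowflakeSource", "BigQuery"))
```

A check like this could run before `feast apply` to catch a mismatched source and offline store early, although feast itself also reports such mismatches.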
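Batch materialization engines load data from the offline store into the online store. They are configured under batch_engine in feature_store.yaml.
The default implementation is LocalMaterializationEngine; there is also LambdaMaterializationEngine, which is based on AWS Lambda.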
https://docs.feast.dev/getting-started/architecture-and-components/batch-materialization-engine
Bytewax (used together with k8s) and Snowflake (when using SnowflakeSource) can also serve as the batch materialization engine.
You can also implement your own engine. See:
https://docs.feast.dev/how-to-guides/customizing-feast/creating-a-custom-materialization-engine
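To make the idea of materialization concrete, here is a conceptual sketch in plain Python. It is not feast's implementation and uses no feast API: the offline store is modeled as a list of rows, the online store as a dict, and the function name `materialize` and the row layout are assumptions made for illustration. It keeps the newest row per entity inside a time window and writes it to the online store.

```python
from datetime import datetime
from typing import Dict, Iterable, Tuple

# One offline row: (entity key, event timestamp, feature values).
OfflineRow = Tuple[int, datetime, Dict[str, float]]


def materialize(
    offline_rows: Iterable[OfflineRow],
    online_store: Dict[int, Dict[str, float]],
    start: datetime,
    end: datetime,
) -> None:
    """Copy the newest row per entity within [start, end] into the online store."""
    latest: Dict[int, Tuple[datetime, Dict[str, float]]] = {}
    for entity_key, event_ts, features in offline_rows:
        if not (start <= event_ts <= end):
            continue  # outside the materialization window
        seen = latest.get(entity_key)
        if seen is None or event_ts > seen[0]:
            latest[entity_key] = (event_ts, features)
    for entity_key, (_, features) in latest.items():
        online_store[entity_key] = dict(features)


if __name__ == "__main__":
    rows = [
        (1001, datetime(2022, 9, 1, 10), {"conv_rate": 0.10}),
        (1001, datetime(2022, 9, 1, 12), {"conv_rate": 0.25}),
        (1002, datetime(2022, 8, 1, 9), {"conv_rate": 0.40}),
    ]
    online: Dict[int, Dict[str, float]] = {}
    materialize(rows, online, datetime(2022, 9, 1), datetime(2022, 9, 2))
    print(online)
```

A real engine does the same job at scale, which is why the work can be handed to Lambda, Bytewax, or Snowflake instead of the local process.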
1. Installing feast
https://docs.feast.dev/getting-started/quickstart
The installation below uses v0.23 as an example. Python 3.8 is recommended for v0.23, and Python 3.7 for v0.22.
pip install feast==0.23.0
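Because the offline store chosen here is Hive and the online store is Cassandra, the corresponding offline and online store plugins must be installed as well:
pip install feast-cassandra==0.1.3
pip install feast-hive==0.17.0
If thriftpy fails to install while installing feast-hive, install cython first:
pip install cython
pip install thriftpy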
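After installing, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library (`importlib.metadata`, available from Python 3.8); `installed_version` is a hypothetical helper name, and the package names are the ones used in the pip commands in this post.

```python
import sys
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Print the interpreter version next to the feast-related packages.
    print("python", "%d.%d" % sys.version_info[:2])
    for name in ("feast", "feast-cassandra", "feast-hive"):
        print(name, installed_version(name) or "not installed")
```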
2. Creating a feast project
$ feast init my_project
Creating a new Feast repository in /Users/lintong/coding/python/my_project.

$ tree -L 3 my_project
my_project
├── __init__.py
├── data
│   └── driver_stats.parquet
├── example.py
└── feature_store.yaml

1 directory, 4 files
feature_store.yaml is where the offline store and online store are configured. The file must be located in the project root. See:
https://docs.feast.dev/reference/feature-repository
For example:
project: my_project
registry: data/registry.db
provider: local
online_store:
    path: data/online_store.db
entity_key_serialization_version: 2
example.py defines the feast pipeline: the data source for the features, the feature entity, the feature view registration, and the feature service. For example:
# This is an example feature definition file

from datetime import timedelta

from feast import Entity, FeatureService, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# Read data from parquet files. Parquet is convenient for local development mode. For
# production, you can use your favorite DWH, such as BigQuery. See Feast documentation
# for more info.
driver_hourly_stats = FileSource(
    name="driver_hourly_stats_source",
    path="/Users/lintong/coding/python/my_project/data/driver_stats.parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)

# Define an entity for the driver. You can think of entity as a primary key used to
# fetch features.
driver = Entity(name="driver", join_keys=["driver_id"])

# Our parquet files contain sample data that includes a driver_id column, timestamps and
# three feature column. Here we define a Feature View that will allow us to serve this
# data to our model online.
driver_hourly_stats_view = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    ttl=timedelta(days=1),
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="acc_rate", dtype=Float32),
        Field(name="avg_daily_trips", dtype=Int64),
    ],
    online=True,
    source=driver_hourly_stats,
    tags={},
)

driver_stats_fs = FeatureService(
    name="driver_activity", features=[driver_hourly_stats_view]
)
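Once the definitions in example.py have been registered and the data has been materialized to the online store, the features can be read back through the SDK. The sketch below assumes the feature view name and field names from example.py; `feature_refs` and `fetch_online` are hypothetical helper names, and the repo path and driver id passed in are placeholders. `FeatureStore` is imported lazily so that the pure-Python helper works even where feast is not installed.

```python
from typing import Dict, List


def feature_refs(view_name: str, field_names: List[str]) -> List[str]:
    """Build "<feature view>:<field>" references as used by feast retrieval calls."""
    return [f"{view_name}:{field}" for field in field_names]


def fetch_online(repo_path: str, driver_id: int) -> Dict[str, list]:
    """Read the three driver features from the online store for one driver."""
    # Lazy import: requires feast to be installed and the repo to be applied.
    from feast import FeatureStore

    store = FeatureStore(repo_path=repo_path)
    response = store.get_online_features(
        features=feature_refs(
            "driver_hourly_stats", ["conv_rate", "acc_rate", "avg_daily_trips"]
        ),
        entity_rows=[{"driver_id": driver_id}],
    )
    return response.to_dict()
```

A call such as `fetch_online("my_project", 1001)` would return a dict keyed by `driver_id` and the feature names, provided that the driver id exists in the sample data (an assumption here) and that the online store has been populated.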
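3. Configuring and registering the store and features
The feature store configuration file defaults to feature_store.yaml; you can also add your own.
The feature definition file defaults to example.py; you can also add your own.
Once the configuration files are written, run feast apply to register the store and features. A .feastignore file can be used to exclude paths, and therefore the stores and features defined in them, from registration.
If feast apply fails with the following error:
ImportError: cannot import name 'soft_unicode' from 'markupsafe'
the fix is:
pip install markupsafe==2.0.1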
This article is published only on cnblogs (博客园) and tonglin0325's blog. Author: tonglin0325. When reposting, please cite the original link: https://www.cnblogs.com/tonglin0325/p/5428513.html