tf.feature_column.input_layer 特征顺序问题
先说结论
- tf.feature_column.input_layer()的api,会对传入的feature_columns进行排序,并不是按照输入顺序进行组织,排序依据基于feature_column的name(tf生成的,类似于
'u_wu211_indicator', 'u_wu215_indicator', 'r_rsp113_indicator', 'u_wu211_X_u_wu215_indicator'
这种。 - 关键代码:
for column in sorted(feature_columns, key=lambda x: x.name):
ordered_columns.append(column)
- 代码验证:
In [31]: [x.name for x in sorted( fcs, key=lambda x: x.name)]
Out[31]:
['r_rsp113_indicator',
'u_wu211_X_u_wu215_indicator',
'u_wu211_indicator',
'u_wu215_indicator']
表现
In [24]: u_wu211 = tf.feature_column.categorical_column_with_vocabulary_list(key='u_wu211', vocabulary_list=['0','1','2'])
...: u_wu215 = tf.feature_column.categorical_column_with_vocabulary_list(key='u_wu215', vocabulary_list=['00s','10s','90s'])
...: r_rsp113 = tf.feature_column.categorical_column_with_vocabulary_list(key='r_rsp113', vocabulary_list=['0','-1','1'])
...: u_wu211_u_wu215_cross = tf.feature_column.crossed_column(keys = [u_wu211, u_wu215], hash_bucket_size=3)
...: print(tf.feature_column.input_layer(tfeatures, [tf.feature_column.indicator_column(u_wu211)]))
...: print(tf.feature_column.input_layer(tfeatures, [tf.feature_column.indicator_column(u_wu215)]))
...: print(tf.feature_column.input_layer(tfeatures, [tf.feature_column.indicator_column(r_rsp113)]))
...: print(tf.feature_column.input_layer(tfeatures, [tf.feature_column.indicator_column(u_wu211_u_wu215_cross)]))
...: print(tf.feature_column.input_layer(tfeatures, [tf.feature_column.indicator_column(u_wu211),
...: tf.feature_column.indicator_column(u_wu215),
...: tf.feature_column.indicator_column(r_rsp113),
...: tf.feature_column.indicator_column(u_wu211_u_wu215_cross)
...: ]))
...:
tf.Tensor(
[[1. 0. 0.]
[0. 0. 1.]], shape=(2, 3), dtype=float32)
tf.Tensor(
[[0. 0. 0.]
[1. 0. 0.]], shape=(2, 3), dtype=float32)
tf.Tensor(
[[0. 1. 0.]
[0. 1. 0.]], shape=(2, 3), dtype=float32)
tf.Tensor(
[[0. 0. 1.]
[0. 0. 1.]], shape=(2, 3), dtype=float32)
tf.Tensor(
[[0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 1. 0. 0. 1. 1. 0. 0.]], shape=(2, 12), dtype=float32)
- 由第一条sample举例:期望得到的是
u_wu211 + u_wu215 + r_rsp113 + u_wu211_u_wu215_cross
- 即:[1. 0. 0.] + [0. 0. 0.] + [0. 1. 0.] + [0. 0. 1.]
- 但得到的却是:[0. 1. 0.] + [0. 0. 1.] + [1. 0. 0.] + [0. 0. 0.],也就是
['r_rsp113', 'u_wu211_u_wu215_cross', 'u_wu211', 'u_wu215']
文档描述
feature_columns: An iterable containing the FeatureColumns to use as inputs
to your model. All items should be instances of classes derived from
`_DenseColumn` such as `numeric_column`, `embedding_column`,
`bucketized_column`, `indicator_column`. If you have categorical features,
you can wrap them with an `embedding_column` or `indicator_column`.
- feature_columns参数接收一个:包含模型中使用到的FeatureColumns的一个迭代器,列表中的项目都应该是
_DenseColumn
类的实例化对象,例如numeric_column
,embedding_column
,bucketized_column
,indicator_column
.如果是标签类别的特征,需要用embedding_column
orindicator_column
转换一下。 - 其中并未解释特征顺序相关问题。
源码探究
- tf.feature_column.input_layer
@tf_export(v1=['feature_column.input_layer'])
def input_layer(features,
feature_columns,
weight_collections=None,
trainable=True,
cols_to_vars=None,
cols_to_output_tensors=None):
"""Returns a dense `Tensor` as input layer based on given `feature_columns`.
Generally a single example in training data is described with FeatureColumns.
At the first layer of the model, this column oriented data should be converted
to a single `Tensor`.
Example:
``python
price = numeric_column('price')
keywords_embedded = embedding_column(
categorical_column_with_hash_bucket("keywords", 10K), dimensions=16)
columns = [price, keywords_embedded, ...]
features = tf.io.parse_example(..., features=make_parse_example_spec(columns))
dense_tensor = input_layer(features, columns)
for units in [128, 64, 32]:
dense_tensor = tf.compat.v1.layers.dense(dense_tensor, units, tf.nn.relu)
prediction = tf.compat.v1.layers.dense(dense_tensor, 1)
``
Args:
features: A mapping from key to tensors. `_FeatureColumn`s look up via these
keys. For example `numeric_column('price')` will look at 'price' key in
this dict. Values can be a `SparseTensor` or a `Tensor` depends on
corresponding `_FeatureColumn`.
feature_columns: An iterable containing the FeatureColumns to use as inputs
to your model. All items should be instances of classes derived from
`_DenseColumn` such as `numeric_column`, `embedding_column`,
`bucketized_column`, `indicator_column`. If you have categorical features,
you can wrap them with an `embedding_column` or `indicator_column`.
weight_collections: A list of collection names to which the Variable will be
added. Note that variables will also be added to collections
`tf.GraphKeys.GLOBAL_VARIABLES` and `ops.GraphKeys.MODEL_VARIABLES`.
trainable: If `True` also add the variable to the graph collection
`GraphKeys.TRAINABLE_VARIABLES` (see `tf.Variable`).
cols_to_vars: If not `None`, must be a dictionary that will be filled with a
mapping from `_FeatureColumn` to list of `Variable`s. For example, after
the call, we might have cols_to_vars =
{_EmbeddingColumn(
categorical_column=_HashedCategoricalColumn(
key='sparse_feature', hash_bucket_size=5, dtype=tf.string),
dimension=10): [<tf.Variable 'some_variable:0' shape=(5, 10),
<tf.Variable 'some_variable:1' shape=(5, 10)]}
If a column creates no variables, its value will be an empty list.
cols_to_output_tensors: If not `None`, must be a dictionary that will be
filled with a mapping from '_FeatureColumn' to the associated
output `Tensor`s.
Returns:
A `Tensor` which represents input layer of a model. Its shape
is (batch_size, first_layer_dimension) and its dtype is `float32`.
first_layer_dimension is determined based on given `feature_columns`.
Raises:
ValueError: if an item in `feature_columns` is not a `_DenseColumn`.
"""
return _internal_input_layer(
features,
feature_columns,
weight_collections=weight_collections,
trainable=trainable,
cols_to_vars=cols_to_vars,
cols_to_output_tensors=cols_to_output_tensors)
- _internal_input_layer
def _internal_input_layer(features,
feature_columns,
weight_collections=None,
trainable=True,
cols_to_vars=None,
scope=None,
cols_to_output_tensors=None,
from_template=False):
"""See input_layer. `scope` is a name or variable scope to use."""
feature_columns = _normalize_feature_columns(feature_columns)
for column in feature_columns:
if not isinstance(column, _DenseColumn):
raise ValueError(
'Items of feature_columns must be a _DenseColumn. '
'You can wrap a categorical column with an '
'embedding_column or indicator_column. Given: {}'.format(column))
weight_collections = list(weight_collections or [])
if ops.GraphKeys.GLOBAL_VARIABLES not in weight_collections:
weight_collections.append(ops.GraphKeys.GLOBAL_VARIABLES)
if ops.GraphKeys.MODEL_VARIABLES not in weight_collections:
weight_collections.append(ops.GraphKeys.MODEL_VARIABLES)
def _get_logits(): # pylint: disable=missing-docstring
builder = _LazyBuilder(features)
output_tensors = []
ordered_columns = []
for column in sorted(feature_columns, key=lambda x: x.name):
ordered_columns.append(column)
with variable_scope.variable_scope(
None, default_name=column._var_scope_name): # pylint: disable=protected-access
tensor = column._get_dense_tensor( # pylint: disable=protected-access
builder,
weight_collections=weight_collections,
trainable=trainable)
num_elements = column._variable_shape.num_elements() # pylint: disable=protected-access
batch_size = array_ops.shape(tensor)[0]
output_tensor = array_ops.reshape(
tensor, shape=(batch_size, num_elements))
output_tensors.append(output_tensor)
if cols_to_vars is not None:
# Retrieve any variables created (some _DenseColumn's don't create
# variables, in which case an empty list is returned).
cols_to_vars[column] = ops.get_collection(
ops.GraphKeys.GLOBAL_VARIABLES,
scope=variable_scope.get_variable_scope().name)
if cols_to_output_tensors is not None:
cols_to_output_tensors[column] = output_tensor
_verify_static_batch_size_equality(output_tensors, ordered_columns)
return array_ops.concat(output_tensors, 1)
# If we're constructing from the `make_template`, that by default adds a
# variable scope with the name of the layer. In that case, we dont want to
# add another `variable_scope` as that would break checkpoints.
if from_template:
return _get_logits()
else:
with variable_scope.variable_scope(
scope, default_name='input_layer', values=features.values()):
return _get_logits()
- 两处需要注意:
- 在_get_logits中,_LazyBuilder对重复引用的特征做了去重,并且延迟初始化
- 另外在添加特征中,引入了一个排序,基于feature_column的name(tf生成的,类似于
'u_wu211_indicator', 'u_wu215_indicator', 'r_rsp113_indicator', 'u_wu211_X_u_wu215_indicator'
这种。 - 代码如下:
def _get_logits(): # pylint: disable=missing-docstring
builder = _LazyBuilder(features)
output_tensors = []
ordered_columns = []
for column in sorted(feature_columns, key=lambda x: x.name):
ordered_columns.append(column)
with variable_scope.variable_scope(
None, default_name=column._var_scope_name): # pylint: disable=protected-access
结论验证
In [29]: fcs = [tf.feature_column.indicator_column(u_wu211),
...: tf.feature_column.indicator_column(u_wu215),
...: tf.feature_column.indicator_column(r_rsp113),
...: tf.feature_column.indicator_column(u_wu211_u_wu215_cross)
...: ]
In [30]: sorted( fcs, key=lambda x: x.name)
Out[30]:
[IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='r_rsp113', vocabulary_list=('0', '-1', '1'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
IndicatorColumn(categorical_column=CrossedColumn(keys=(VocabularyListCategoricalColumn(key='u_wu211', vocabulary_list=('0', '1', '2'), dtype=tf.string, default_value=-1, num_oov_buckets=0), VocabularyListCategoricalColumn(key='u_wu215', vocabulary_list=('00s', '10s', '90s'), dtype=tf.string, default_value=-1, num_oov_buckets=0)), hash_bucket_size=3, hash_key=None)),
IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='u_wu211', vocabulary_list=('0', '1', '2'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='u_wu215', vocabulary_list=('00s', '10s', '90s'), dtype=tf.string, default_value=-1, num_oov_buckets=0))]
In [31]: [x.name for x in sorted( fcs, key=lambda x: x.name)]
Out[31]:
['r_rsp113_indicator',
'u_wu211_X_u_wu215_indicator',
'u_wu211_indicator',
'u_wu215_indicator']
- 期望结果:
- ['r_rsp113_indicator', 'u_wu211_X_u_wu215_indicator', 'u_wu211_indicator', 'u_wu215_indicator']
- 即: [0. 1. 0.] + [0. 0. 1.] + [1. 0. 0.] + [0. 0. 0.]
- [0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0.]
- 与预期一致。
澄轶: suanec -
http://www.cnblogs.com/suanec/
友链:marsggbo
声援博主:如果您觉得文章对您有帮助,可以点击文章右下角【推荐】一下。您的鼓励是博主的最大动力!
点个关注吧~
http://www.cnblogs.com/suanec/
友链:marsggbo
声援博主:如果您觉得文章对您有帮助,可以点击文章右下角【推荐】一下。您的鼓励是博主的最大动力!
点个关注吧~