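Purpose: convert dictionary data into numeric features
API: from sklearn.feature_extraction import DictVectorizer
- fit_transform(x): x is a dict or an iterable of dicts; returns a sparse matrix by default
- inverse_transform(x): x is a sparse matrix or an array; returns the data in its pre-transform format (a list of dicts)
- transform(x): converts x using the mapping learned during fitting
- get_feature_names(): returns the feature names (deprecated in scikit-learn 1.0 and removed in 1.2; use get_feature_names_out() on those versions)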

from sklearn.feature_extraction import DictVectorizer

alist = [
    {'city': 'BeiJing', 'temp': 33},
    {'city': 'GZ', 'temp': 42},
    {'city': 'SH', 'temp': 40},
]
# Instantiate the vectorizer
d = DictVectorizer()
# With the default sparse=True, the result is a sparse matrix
feature = d.fit_transform(alist)
print(feature)
print(d.inverse_transform(feature))

print("============================================")

# Instantiate the vectorizer with dense output
d = DictVectorizer(sparse=False)
# The result is a 2-D NumPy array
feature = d.fit_transform(alist)
print("One-hot encoded output:")
print(feature)
print(d.inverse_transform(feature))


  (0, 0)    1.0
  (0, 3)    33.0
  (1, 1)    1.0
  (1, 3)    42.0
  (2, 2)    1.0
  (2, 3)    40.0
[{'city=BeiJing': 1.0, 'temp': 33.0}, {'city=GZ': 1.0, 'temp': 42.0}, {'city=SH': 1.0, 'temp': 40.0}]
============================================
One-hot encoded output:
[[ 1.  0.  0. 33.]
 [ 0.  1.  0. 42.]
 [ 0.  0.  1. 40.]]
[{'city=BeiJing': 1.0, 'temp': 33.0}, {'city=GZ': 1.0, 'temp': 42.0}, {'city=SH': 1.0, 'temp': 40.0}]
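The transform(x) method listed in the API is not exercised in the example, so here is a minimal sketch of how it might be used on new rows after fitting. The new_rows data (including the unseen city 'SZ') is made up for illustration. The sketch reads the column order from the feature_names_ attribute, which avoids depending on which feature-name method a given scikit-learn version provides.

```python
from sklearn.feature_extraction import DictVectorizer

train = [
    {'city': 'BeiJing', 'temp': 33},
    {'city': 'GZ', 'temp': 42},
    {'city': 'SH', 'temp': 40},
]

d = DictVectorizer(sparse=False)
d.fit(train)

# feature_names_ holds the column order learned during fit
print(d.feature_names_)

# Hypothetical new rows: 'SZ' was never seen during fit,
# so no city column is set for the second row
new_rows = [{'city': 'GZ', 'temp': 30}, {'city': 'SZ', 'temp': 28}]
encoded = d.transform(new_rows)
print(encoded)
```

Features that were not encountered during fitting are ignored by transform, which is why the second row has zeros in all three city columns.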

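Why one-hot encoding is needed:

The main purpose of feature extraction is to turn non-numeric data into numeric features. If text categories are mapped to numbers by hand, for example 1 and 4, the model can read an order or a difference in weight into 1 and 4 that does not exist in the data, and that distorts learning.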
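The point about hand-assigned numbers can be made concrete with a small sketch. The integer codes below (1, 2, 4) are hypothetical and chosen only for illustration; the one-hot vectors follow the column layout DictVectorizer produced for the three cities.

```python
import math

# Hypothetical integer codes assigned by hand to three cities
int_codes = {'BeiJing': 1, 'GZ': 2, 'SH': 4}

# One-hot vectors for the same three cities
one_hot = {
    'BeiJing': (1.0, 0.0, 0.0),
    'GZ': (0.0, 1.0, 0.0),
    'SH': (0.0, 0.0, 1.0),
}

pairs = [('BeiJing', 'GZ'), ('BeiJing', 'SH'), ('GZ', 'SH')]

# Distances between integer codes differ from pair to pair
int_dists = [abs(int_codes[a] - int_codes[b]) for a, b in pairs]

# Euclidean distances between one-hot vectors are all the same
one_hot_dists = [math.dist(one_hot[a], one_hot[b]) for a, b in pairs]

print(int_dists)
print(one_hot_dists)
```

With integer codes, 'BeiJing' looks three times farther from 'SH' than from 'GZ', although the cities have no such relationship. With one-hot vectors every pair is equally far apart, so a distance-based or linear model gets no spurious ordering.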

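Text feature extraction

Purpose: convert text data into numeric features
API: from sklearn.feature_extraction.text import CountVectorizer
fit_transform(x): x is an iterable of text strings (a single bare string is rejected); returns a sparse matrix
inverse_transform(x): x is an array or a sparse matrix; returns, for each document, the terms with nonzero counts
English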

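from sklearn.feature_extraction.text import CountVectorizer

vector = CountVectorizer()
res = vector.fit_transform(['lift is short,i love python', 'lift is too long,i hate python'])
print(res)  # sparse matrix
# On scikit-learn >= 1.2, call get_feature_names_out() instead
print(vector.get_feature_names())
print(res.toarray())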

  (0, 2)    1
  (0, 1)    1
  (0, 6)    1
  (0, 4)    1
  (0, 5)    1
  (1, 2)    1
  (1, 1)    1
  (1, 5)    1
  (1, 7)    1
  (1, 3)    1
  (1, 0)    1
['hate', 'is', 'lift', 'long', 'love', 'python', 'short', 'too']
[[0 1 1 0 1 1 1 0]
 [1 1 1 1 0 1 0 1]]
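The inverse_transform method from the API list is not shown in the example, so the following sketch illustrates what it might return for the same two sentences. It reads the vocabulary from the vocabulary_ attribute (a term-to-column mapping) rather than from a feature-name method, so it does not depend on the scikit-learn version.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['lift is short,i love python', 'lift is too long,i hate python']
vector = CountVectorizer()
res = vector.fit_transform(docs)

# vocabulary_ maps each term to its column index; sort terms by that index
vocab = sorted(vector.vocabulary_, key=vector.vocabulary_.get)
print(vocab)

# inverse_transform returns, per document, the terms with nonzero counts
terms_per_doc = [
    sorted(str(term) for term in terms)
    for terms in vector.inverse_transform(res)
]
print(terms_per_doc)
```

Note that 'i' appears in neither the vocabulary nor the recovered terms: with default settings, CountVectorizer only keeps tokens of two or more characters, and inverse_transform returns terms rather than the original sentences.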

Chinese
Feature extraction on Chinese text that is delimited by punctuation or spaces

res = vector.fit_transform(['人生苦短，我用python', '人生漫长，不用python'])
# print(res)  # sparse
print(vector.get_feature_names())
print(res.toarray())

['不用python', '人生漫长', '人生苦短', '我用python']
[[0 0 1 1]
 [1 1 0 0]]

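By default, CountVectorizer only splits text at punctuation and separators such as spaces, so an unbroken run of Chinese characters is treated as a single token. That clearly does not meet everyday needs:
- In natural language processing, we need to extract the individual words, idioms, adjectives, and so on from a passage of Chinese text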
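To see why unsegmented Chinese ends up as one token per clause, the sketch below applies CountVectorizer's default token_pattern, the regular expression `(?u)\b\w\w+\b`, directly with Python's re module. The helper default_tokens is a hypothetical stand-in that only approximates the default behavior (lowercasing plus this pattern); it is not part of scikit-learn.

```python
import re

# Default token_pattern of CountVectorizer: runs of 2+ word characters
TOKEN_PATTERN = r'(?u)\b\w\w+\b'

def default_tokens(text):
    """Hypothetical helper: lowercase, then apply the default pattern."""
    return re.findall(TOKEN_PATTERN, text.lower())

chinese_tokens = default_tokens('人生苦短，我用python')
english_tokens = default_tokens('lift is short,i love python')

print(chinese_tokens)
print(english_tokens)
```

Chinese characters count as word characters, so the only boundary in the first sentence is the full-width comma, and each clause becomes one token. In the English sentence the spaces provide the boundaries, and the single-character 'i' is dropped because the pattern requires at least two characters.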
jieba word segmentation:
- Segments Chinese text into words
- pip install jieba
- import jieba

import jieba
from sklearn.feature_extraction.text import CountVectorizer

vector = CountVectorizer()
res = vector.fit_transform(['人生苦短，我用python', '人生漫长，不用python'])
# print(res)  # sparse
print(vector.get_feature_names())
print(res.toarray())

print("============================================")

# Segment the text with jieba
jb = jieba.cut('人生苦短，我用python，人生漫长，不用python')
content = list(jb)
print(content)
ct = ' '.join(content)
print(ct)  # the words joined with spaces
print("============================================")
jieba_res = vector.fit_transform([ct])
print(vector.get_feature_names())
print(jieba_res.toarray())
print("============================================")


['不用python', '人生漫长', '人生苦短', '我用python']
[[0 0 1 1]
 [1 1 0 0]]
============================================
['人生', '苦短', '，', '我用', 'python', '，', '人生', '漫长', '，', '不用', 'python']
人生 苦短 ， 我用 python ， 人生 漫长 ， 不用 python
============================================
['python', '不用', '人生', '我用', '漫长', '苦短']
[[2 1 2 1 1 1]]
============================================
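The run above joins both sentences into one document, so the result has a single row. If one row per sentence is wanted, each sentence could be segmented separately before vectorizing. The sketch below hard-codes the segmentation shown in the jieba output above instead of calling jieba, so the two strings in segmented_docs are an assumption based on that output.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Segmentation copied from the jieba output above, one string per sentence
segmented_docs = [
    '人生 苦短 ， 我用 python',
    '人生 漫长 ， 不用 python',
]

vector = CountVectorizer()
counts = vector.fit_transform(segmented_docs).toarray()

# vocabulary_ maps each term to its column index; sort terms by that index
vocab = sorted(vector.vocabulary_, key=vector.vocabulary_.get)
print(vocab)
print(counts)
```

With jieba installed, each string could instead be built with ' '.join(jieba.cut(sentence)), though the exact segmentation may vary with the jieba version and dictionary.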

Posted on 2022-07-26 09:34 by xxdd123321
