Feature Models 1

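1. Feature extraction approaches for text classification tasks

1. Word-set model

The set of distinct words in a text. It only records whether each word is present, regardless of how many times it occurs.

Use case: can this model be used with a custom word blacklist, and how exactly would it be applied (TODO1)?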
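One possible answer to TODO1, as a minimal pure-Python sketch rather than a definitive method: treat the blacklist as a fixed vocabulary and emit a 0/1 flag per blacklisted word. The `tokenize` and `word_set_features` helpers and the blacklist contents are hypothetical, made up for illustration. In sklearn the closest equivalent is probably `CountVectorizer(vocabulary=..., binary=True)`, both of which are parameters in the signature quoted later in this post.

```python
import re


def tokenize(text):
    """Lowercase the text and return tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def word_set_features(text, vocabulary):
    """Return a 0/1 vector: 1 if the vocabulary word occurs in the text at all."""
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocabulary]


# Hypothetical blacklist, used only for illustration.
blacklist = ["select", "union", "script"]
features = word_set_features(
    "SELECT name FROM users UNION SELECT password, select", blacklist
)
print(features)
```

"select" occurs three times in the input but still contributes a single 1, which is exactly what distinguishes the word-set model from the bag-of-words model.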

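2. Bag-of-words model

Compared with the word-set model, it also counts how many times each word occurs (its frequency).

Implementation: see sklearn (TODO2)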
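Before reading the sklearn source, the idea can be sketched in a few lines of standard-library Python. This is only an illustration under my own assumptions: the two sample documents, the `tokenize` helper, and the alphabetical column order are made up, and the result is a dense list of lists rather than the sparse matrix sklearn returns.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def bag_of_words(text, vocabulary):
    """Return the occurrence count of every vocabulary word in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


docs = ["the cat sat on the mat", "the dog sat"]

# One column per distinct word across all documents, in alphabetical order.
vocabulary = sorted({token for doc in docs for token in tokenize(doc)})
matrix = [bag_of_words(doc, vocabulary) for doc in docs]

print(vocabulary)
for row in matrix:
    print(row)
```

Replacing every nonzero count with 1 turns this matrix back into the word-set model.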

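3. TF-IDF (term frequency-inverse document frequency model)

As the name suggests, this model considers both how often a word occurs in the current document and how often it occurs in other documents: term frequency x inverse document frequency. The main idea: the more often a word occurs in the current document and the fewer other documents it occurs in, the larger the overall value, meaning the word discriminates well between documents.

The bag-of-words model and TF-IDF are usually used together. Can TF-IDF be understood as a normalization of the bag-of-words model? Answering this requires settling two sub-questions first: 1. What exactly does TF-IDF do? (TODO3) 2. What does normalization do, and is it the same thing as what TF-IDF does? (TODO4)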

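4. Vocabulary model

The first three models do not express relationships between words, which is why the vocabulary model exists. Building on the bag-of-words idea, it outputs features in the order in which the words appear in the sentence.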
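A small sketch of why word order matters, using two made-up sentences: their bag-of-words counts are identical, while their ordered id sequences differ. The id assignment (alphabetical, starting at 1) is an assumption chosen for the example, not what any particular library does.

```python
from collections import Counter

sentence_a = "dog bites man".split()
sentence_b = "man bites dog".split()

# Assumed id assignment: alphabetical order, starting at 1.
vocab = {word: i + 1 for i, word in enumerate(sorted(set(sentence_a + sentence_b)))}

# Bag-of-words view: only the counts survive, so both sentences look the same.
bag_a, bag_b = Counter(sentence_a), Counter(sentence_b)

# Vocabulary-model view: one id per position, so the order is preserved.
seq_a = [vocab[word] for word in sentence_a]
seq_b = [vocab[word] for word in sentence_b]

print(bag_a == bag_b)
print(seq_a, seq_b)
```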

Questions about the models above

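1. Bag-of-words implementation (sklearn)

a. CountVectorizer initialization

def __init__(self, input='content', encoding='utf-8',
             decode_error='strict', strip_accents=None,
             lowercase=True, preprocessor=None, tokenizer=None,
             stop_words=None, token_pattern=r"(?u)\b\w\w+\b",
             ngram_range=(1, 1), analyzer='word',
             max_df=1.0, min_df=1, max_features=None,
             vocabulary=None, binary=False, dtype=np.int64):

Key parameters:

lowercase: whether to convert all characters to lowercase before tokenizing (default True, so case is not distinguished)

preprocessor: custom preprocessing (string transformation) function (default None, which uses the built-in one)

analyzer: feature granularity; one of 'word', 'char', 'char_wb', default 'word'

min_df, max_df: ignore terms whose document frequency is below min_df or above max_df. Each may be a float or an integer: a float is the proportion of documents that contain the term, an integer is the absolute number of documents that contain it

max_features: keep only the top max_features terms ordered by term frequency across the corpus; an integer (or None for no limit)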
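The min_df / max_df rule can be sketched in pure Python as follows. This is an approximation of the semantics described above, not sklearn's code: `document_frequency` and `limit_vocabulary` are hypothetical helper names, and the documents are assumed to be tokenized already.

```python
from collections import Counter


def document_frequency(tokenized_docs):
    """Count, for every term, the number of documents that contain it."""
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))  # each document counts at most once per term
    return df


def limit_vocabulary(tokenized_docs, min_df=1, max_df=1.0):
    """Keep the terms whose document frequency lies within [min_df, max_df]."""
    n_docs = len(tokenized_docs)
    # A float is a proportion of documents, an int is an absolute document count.
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    df = document_frequency(tokenized_docs)
    return sorted(term for term, count in df.items() if low <= count <= high)


docs = [["the", "cat"], ["the", "dog"], ["the", "cat", "fish"]]
kept = limit_vocabulary(docs, min_df=2, max_df=0.9)
print(kept)
```

Here "the" appears in all 3 documents (above 0.9 * 3 = 2.7) and "dog" and "fish" appear in only 1 (below 2), so only "cat" survives.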

b. Computing the feature matrix

def fit_transform(self, raw_documents, y=None):

    ......

    # Main routine that computes the features
    vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary_)

    if self.binary:
        X.data.fill(1)

    if not self.fixed_vocabulary_:
        # Sort the vocabulary and the feature columns; the columns are remapped
        # through the indices attribute (column indices of the nonzero entries)
        X = self._sort_features(X, vocabulary)

    ................
    # Filter the features by document frequency and cap their number
    X, self.stop_words_ = self._limit_features(X, vocabulary,
                                               max_doc_count,
                                               min_doc_count,
                                               max_features)
    self.vocabulary_ = vocabulary
    return X

def _count_vocab(self, raw_documents, fixed_vocab):
    .........
    for doc in raw_documents:  # iterate over the documents
        # Per-document counter, of the form {feature index: count}
        feature_counter = {}
        for feature in analyze(doc):  # iterate over the features (words or characters) of the document
            try:
                # vocabulary is the dictionary of all features (feature -> column index).
                # When it is not fixed it behaves as a defaultdict (see "disable
                # defaultdict behaviour" below), so an unseen feature gets a new index
                feature_idx = vocabulary[feature]
                if feature_idx not in feature_counter:
                    feature_counter[feature_idx] = 1
                else:
                    feature_counter[feature_idx] += 1
            except KeyError:
                # Ignore out-of-vocabulary items for fixed_vocab=True
                continue

        # j_indices has the same meaning as in csr_matrix: the column indices of the nonzero elements
        j_indices.extend(feature_counter.keys())
        # values holds the counts for all documents in order, aligned one-to-one with j_indices
        values.extend(feature_counter.values())
        # indptr has the same meaning as in csr_matrix: the first element is 0 and each later
        # element is the cumulative number of nonzero elements up to the end of that row;
        # together with j_indices and values it recovers the features of every document
        indptr.append(len(j_indices))

    if not fixed_vocab:
        # disable defaultdict behaviour
        vocabulary = dict(vocabulary)
        if not vocabulary:
            raise ValueError("empty vocabulary; perhaps the documents only"
                             " contain stop words")

    if indptr[-1] > 2147483648:  # = 2**31 - 1
        if _IS_32BIT:
            raise ValueError(('sparse CSR array has {} non-zero '
                              'elements and requires 64 bit indexing, '
                              'which is unsupported with 32 bit Python.')
                             .format(indptr[-1]))
        indices_dtype = np.int64

    else:
        indices_dtype = np.int32
    # Convert the three arrays to the format csr_matrix expects
    j_indices = np.asarray(j_indices, dtype=indices_dtype)
    indptr = np.asarray(indptr, dtype=indices_dtype)
    values = np.frombuffer(values, dtype=np.intc)

    X = sp.csr_matrix((values, j_indices, indptr),
                      shape=(len(indptr) - 1, len(vocabulary)),
                      dtype=self.dtype)
    X.sort_indices()
    return vocabulary, X
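To make the role of values, j_indices and indptr concrete, here is a small pure-Python sketch that rebuilds the dense rows from the three arrays. The numbers are made up for illustration and `csr_to_dense` is a hypothetical helper; in practice scipy's csr_matrix does this work.

```python
def csr_to_dense(values, j_indices, indptr, n_features):
    """Rebuild dense rows from CSR-style (values, column indices, row pointers)."""
    rows = []
    # Row i owns the slice [indptr[i], indptr[i + 1]) of values and j_indices.
    for start, end in zip(indptr[:-1], indptr[1:]):
        row = [0] * n_features
        for col, val in zip(j_indices[start:end], values[start:end]):
            row[col] = val
        rows.append(row)
    return rows


# Two documents over a three-word vocabulary (illustrative numbers only):
# document 0 has feature 0 once and feature 2 twice,
# document 1 has features 1 and 2 once each.
values = [1, 2, 1, 1]
j_indices = [0, 2, 1, 2]
indptr = [0, 2, 4]

dense = csr_to_dense(values, j_indices, indptr, n_features=3)
print(dense)
```

The number of rows is len(indptr) - 1, which matches the shape passed to csr_matrix in the source above.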

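2. TF-IDF implementation (sklearn)

tf-idf(t, d) = tf(t, d) * idf(t)

If ``smooth_idf=False``, idf(t) = log [ n / df(t) ] + 1, where n is the total number of documents and df(t) is the number of documents that contain term t. The +1 is there because when every document contains the term, log [ n / df(t) ] equals 0, and without it such a term would be suppressed entirely.

If ``smooth_idf=True`` (the default), idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1, i.e. 1 is added to both the numerator and the denominator, as if one extra document contained every term exactly once.

If ``sublinear_tf=False`` (the default), tf(t, d) = c, where c is the number of times term t occurs in d.

If ``sublinear_tf=True``, tf(t, d) = 1 + log(c), where c is the number of times term t occurs in d.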
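The formulas above can be checked numerically with a short sketch. The function names `idf` and `tf` are my own and mirror only the formulas as written here; they are not sklearn functions.

```python
import math


def idf(n_docs, df, smooth_idf=True):
    """Inverse document frequency of a term contained in df of n_docs documents."""
    if smooth_idf:
        return math.log((1 + n_docs) / (1 + df)) + 1
    return math.log(n_docs / df) + 1


def tf(count, sublinear_tf=False):
    """Term frequency; sublinear scaling applies only to nonzero counts."""
    if sublinear_tf and count > 0:
        return 1 + math.log(count)
    return count


# A term contained in all 4 documents: the log part is 0, so idf is exactly 1.
print(idf(4, 4, smooth_idf=False), idf(4, 4, smooth_idf=True))

# A term contained in only 1 of 4 documents gets a larger weight.
print(idf(4, 1, smooth_idf=False), idf(4, 1, smooth_idf=True))

# Sublinear scaling dampens large counts.
print(tf(100), tf(100, sublinear_tf=True))
```

The first line shows why the trailing +1 matters: without it, a term present in every document would get weight 0 regardless of how often it occurs.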

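a. TF-IDF initialization

def __init__(self, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False):

    self.norm = norm
    self.use_idf = use_idf
    self.smooth_idf = smooth_idf
    self.sublinear_tf = sublinear_tf

Key parameters:
norm: 'l1' (L1 norm), 'l2' (L2 norm), or None
With the L2 norm, the squares of the vector's elements sum to 1; with the L1 norm, the absolute values of the elements sum to 1.
smooth_idf and sublinear_tf: control which formulas are used to compute tf-idf; see above for details

b. Computing the model values

fit_transform calls fit first and then transform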

def fit_transform(self, X, y=None, **fit_params):
    if y is None:
        # fit method of arity 1 (unsupervised transformation)
        return self.fit(X, **fit_params).transform(X)
    else:
        # fit method of arity 2 (supervised transformation)
        return self.fit(X, y, **fit_params).transform(X)
First, the fit function

def fit(self, X, y=None):  # computes idf
    ........
    if self.use_idf:  # only when the use_idf attribute is set
        # Shape of the input X: n_samples rows, n_features columns
        n_samples, n_features = X.shape
        # Number of nonzero entries in each column, i.e. the number of documents containing
        # each term; the result is a single row:
        # np.bincount(X.indices, minlength=X.shape[1])
        df = _document_frequency(X)

        df = df.astype(dtype, **_astype_copy_false(df))  # cast to the target dtype

        # perform idf smoothing if required
        # When smooth_idf is True, add 1 to both the numerator and the denominator
        df += int(self.smooth_idf)
        n_samples += int(self.smooth_idf)

        # log+1 instead of log makes sure terms with zero idf don't get
        # suppressed entirely.
        idf = np.log(n_samples / df) + 1  # the idf formula
        # Put the 1-D idf values on the diagonal of an n_features x n_features matrix,
        # which is later multiplied with the tf matrix to obtain the tf-idf values
        self._idf_diag = sp.diags(idf, offsets=0,
                                  shape=(n_features, n_features),
                                  format='csr',
                                  dtype=dtype)
    return self

def transform(self, X, copy=True):  # computes tf or tf-idf for the matrix
    n_samples, n_features = X.shape  # rows and columns of the input matrix
    if self.sublinear_tf:  # when sublinear_tf is True, replace tf with 1 + log(tf)
        np.log(X.data, X.data)
        X.data += 1

    if self.use_idf:  # apply idf when use_idf is set

        check_is_fitted(self, '_idf_diag', 'idf vector is not fitted')
        expected_n_features = self._idf_diag.shape[0]
        if n_features != expected_n_features:
            raise ValueError("Input has n_features=%d while the model"
                             " has been trained with n_features=%d" % (
                                 n_features, expected_n_features))
        # *= doesn't work
        # Multiplying X by the diagonal idf matrix scales every tf value by the idf of its
        # column (only the diagonal entries are nonzero)
        X = X * self._idf_diag

    if self.norm:
        # L1 or L2 normalization of each row of the resulting tf-idf matrix
        X = normalize(X, norm=self.norm, copy=False)

    return X
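Putting fit and transform together, the whole computation can be approximated in pure Python on a small dense count matrix. This is a sketch under stated assumptions, not sklearn's implementation: it covers only the default settings (smoothed idf, raw counts as tf, L2 norm), `tfidf_rows` is a hypothetical name, and dense lists stand in for the sparse matrix.

```python
import math


def tfidf_rows(count_rows):
    """Weight a dense count matrix by smoothed idf, then L2-normalize each row."""
    n_docs = len(count_rows)
    n_features = len(count_rows[0])

    # Document frequency: number of rows with a nonzero count in each column.
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_features)]

    # Smoothed idf, as with smooth_idf=True.
    idf = [math.log((1 + n_docs) / (1 + df[j])) + 1 for j in range(n_features)]

    result = []
    for row in count_rows:
        # Same effect as multiplying by the diagonal idf matrix.
        weighted = [count * weight for count, weight in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result


# Counts for "the cat sat on the mat" and "the dog sat" over the
# alphabetical vocabulary [cat, dog, mat, on, sat, the].
counts = [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
rows = tfidf_rows(counts)
for row in rows:
    print([round(v, 3) for v in row])
```

This also speaks to the earlier question about normalization: the idf step reweights the columns, while the final step separately rescales each row to unit length, so they are two distinct operations.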

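3. Vocabulary model implementation (tensorflow)
a. Vocabulary model initialization
def __init__(self, max_document_length, min_frequency=0, vocabulary=None, tokenizer_fn=None):

    self.max_document_length = max_document_length
    self.min_frequency = min_frequency
    if vocabulary:
        self.vocabulary_ = vocabulary
    else:
        self.vocabulary_ = CategoricalVocabulary()
    if tokenizer_fn:
        self._tokenizer = tokenizer_fn
    else:
        self._tokenizer = tokenizer
Key parameters:
max_document_length: length of the output feature vector

min_frequency: minimum frequency

vocabulary: an existing vocabulary object (a new CategoricalVocabulary is created when it is omitted); how to use it is not yet clear
tokenizer_fn: custom tokenizer function; the default tokenizer is used when it is omitted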
b. fit_transform
def fit_transform(self, raw_documents, unused_y=None):

    self.fit(raw_documents)

    return self.transform(raw_documents)

def fit(self, raw_documents, unused_y=None):
    # Iterate over the documents; tokens is the token list of one document
    for tokens in self._tokenizer(raw_documents):
        for token in tokens:  # each word (token) in the document's token list
            # vocabulary_ is the CategoricalVocabulary of all words; it holds _mapping
            # ({word: index}), _reverse_mapping (list of all words) and _freq ({word: total count})
            self.vocabulary_.add(token)

    if self.min_frequency > 0:
        self.vocabulary_.trim(self.min_frequency)
    self.vocabulary_.freeze()  # set the freeze attribute of the CategoricalVocabulary

    return self

def transform(self, raw_documents):

    # Iterate over the documents; tokens is the token list of one document
    for tokens in self._tokenizer(raw_documents):
        word_ids = np.zeros(self.max_document_length, np.int64)  # result vector for this document
        for idx, token in enumerate(tokens):  # each token together with its position idx
            if idx >= self.max_document_length:  # past the maximum length: stop and drop the rest
                break
            # Store the word's id from vocabulary_ at the word's position in the sentence
            word_ids[idx] = self.vocabulary_.get(token)
        yield word_ids  # the encoded sentence
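The same fit/transform flow can be imitated without tensorflow. The class below is a simplified stand-in written for this note, not the library's API: `SimpleVocabulary` is a hypothetical name, documents are assumed to be tokenized already, min_frequency trimming is left out, and id 0 is assumed to be reserved for both padding and unknown words (consistent with the np.zeros initialization above).

```python
class SimpleVocabulary:
    """Map each document to a fixed-length list of word ids, keeping word order."""

    def __init__(self, max_document_length):
        self.max_document_length = max_document_length
        self.mapping = {}  # word -> id; id 0 is reserved for padding/unknown

    def fit(self, tokenized_docs):
        for tokens in tokenized_docs:
            for token in tokens:
                # First occurrence gets the next free id, starting at 1.
                self.mapping.setdefault(token, len(self.mapping) + 1)
        return self

    def transform(self, tokenized_docs):
        for tokens in tokenized_docs:
            word_ids = [0] * self.max_document_length
            # Tokens beyond max_document_length are dropped.
            for idx, token in enumerate(tokens[: self.max_document_length]):
                word_ids[idx] = self.mapping.get(token, 0)
            yield word_ids


docs = [["dog", "bites", "man"], ["man", "bites", "dog", "again", "today"]]
vocab = SimpleVocabulary(max_document_length=4).fit(docs)
encoded = list(vocab.transform(docs))
unseen = list(vocab.transform([["cat", "bites"]]))
print(encoded)
print(unseen)
```

The short document is padded with 0, the long one is truncated to four ids, and the word "cat", which was never seen during fit, also maps to 0.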

posted @ 2019-11-09 19:37 哈哈哈喽喽喽
