逻辑回归

本文参考了很多网页，主要有：

http://blog.csdn.net/zouxy09/article/details/20319673

http://www.wbrecom.com/?p=394

http://www.cnblogs.com/EE-NovRain/p/3810737.html

下面是我收集的一些资料：

http://pan.baidu.com/s/1o72JoJg

逻辑回归

什么是逻辑回归？
逻辑回归求解
梯度下降法
正则化
在线梯度下降
FOBOS
RDA
FTRL

什么是逻辑回归？

　　某一天，计院的屌丝小明走在路上，看到了音乐学院的一个美女，内心无比荡漾。然后小明就四处打探，发现这个姑娘没有对象，哈哈。但是小明对自己没啥信心，不知道这个妹子喜欢什么样的男生，万一悲剧了，多伤心。不过小明在打探过程中知道这个姑娘曾经有过好几个对象，并且知道了这些对象的一些特点，个子都要超过1米7，而且比较有钱，而且曾经有个学霸追求她，但是被她拒绝了等等。这时候小明会去搞了一个分类算法，来预测下妹子是不是喜欢他。当然屌丝小明一开始不敢拿自己做实验，就用了实验室的几个小伙伴做预测，这时候小明发现，结果只有喜欢和不喜欢，这个太粗矿了。如果能够得到一个值，得到妹子喜欢的程度，岂不更好。于是乎，小明就用了逻辑回归，将妹子的喜欢程度变为了0到1之间的某个值，同时小明同学发现利用自己的预测值为0.7，而实验室的其它伙伴只有0.5不到。吆西，是时候组织一波进攻了。

　　在这里，姑娘前任的个子，有没有钱，是不是学霸，有没有车等等都可以表示为特征 $x_{1},x_{2},...,x_{n}$ ,同时姑娘要求身高一定要高过1米7，那么身高这个特征就很重要，在这里我们用 $w _{i}$ 表示升高在妹子心中的重要程度。

　　当然前面都是扯淡，主要是为了通过例子带出逻辑回归。在生活中，假设一件事情发生的概率为p，不可能发生的概率为1-p，那么这件事情发生的几率（odds）表示为 $\frac{p}{1-p}$ , log odds 表示为

$\log \frac{p}{1-p}=w_{1}x_{1}+...+w_{n}x_{x}=WX$

　　通过计算，可以得到一件事情发生的概率 $p(y=1|x)$ 为

$p(y=1|x)=\frac{1}{1+e^{-wx}}=\sigma(wx)$

　　这就是逻辑回归( sigmoid function)，

逻辑回归求解

　　假设现在有n个独立的训练样本 $\{\mathbf{x}_{i},y_{i}\},y_{i}=0,1$ ,其中X表示为变量，现在有一个逻辑回归模型， $p(\mathbf{x}_{i},y_{i})$ 这个样本出现的概率。

$p(\mathbf{x}_{i},y_{i})=p(y_{i}=1|\mathbf{x}_{i})^{y_{i}} + (1-p(y_{i}=1|\mathbf{x}_{i}))^{1-y_{i}}$

　　因为这些训练样本是独立的，所以这n个样本发生的概率为

$\prod p(\mathbf{x}_{i},y_{i})=p(y_{i}=1|\mathbf{x}_{i})^{y_{i}} + (1-p(y_{i}=1|\mathbf{x}_{i}))^{1-y_{i}}$

　　上面这个公式就是似然函数，直观上理解，如果这个似然函数最大，就表明这些样本发生的概率越大，这个模型越符合要求。我们一般用对数似然函数

$L(w)=\log (\prod p(\mathbf{x}_{i},y_{i})=p(y_{i}=1|\mathbf{x}_{i})^{y_{i}} + (1-p(y_{i}=1|\mathbf{x}_{i}))^{1-y_{i}})$

　　然后一顿噼里啪啦的求导,得到

$\frac{\partial L}{\partial w}=\sum_{i=1}^{n}(y_{i}-\sigma (\theta ^{T}x_{i}))x_{i}$

　　这里设 $l(w,z) = -\(L(w,z)$ ，题目就转化为一个最优化问题：

$w = \arg \min_{w}l(w,z)$

　　另该导数公式为0，然后求得对应的w，但是这是无法直接求解，可以通过不断迭代来逼近最优解。

随机梯度下降：

　　梯度下降法的意义在于函数的值沿着导数的反方向下降的最快。梯度下降法有批量梯度下降和随机梯度下降。

批量梯度下降：

　　 $w_{j}=w_{j-1} +\alpha \sum_{i=1}^{n}(y_{i}-\sigma (w^{T}x_{i}))x_{i}$

公式中 $\alpha$ 表示学习率（步长）。大了，可以快速到达最优值附近，但是无法逼近最优值。小了，收敛速度太慢了，等的时间太长。

梯度下降法每次迭代都需要对样本集进行遍历，如果样本数量太大，耗费时间就比较长。假如每次迭代只使用一个样本，就可以节约时间，这就是随机梯度上升。

w_{j}=w_{j-1} +\alpha(y_{i}-\sigma (w^{T}x_{i}))x_{i}

随机梯度上升与梯度上升相比较：

如果存在一个最优解，梯度上升可以求得最优解，随机梯度上升大部分都在最优解附近。
如果存在多个解，梯度上升很可能陷入局部最优解，随机梯度上升有可能跳出局部最优解。
随机梯度上升的步长小于随着迭代的次数逐渐减小，走的步子越来越小，可以越来越逼近最优解，避免步子太大了导致在最优解附近一直跳来跳去。
随机梯度上升适合增量计算(online)。

正则化

　　如果模型在训练的时候拟合度很高，但是在泛化（预测）的时候效果却很差，这就说明出现了过拟合问题了。当数据维度很高，但是数据量比较小的时候

就很容易出现过拟合问题。解决过拟合问题的方法就是正则化，逻辑回归常用的正则化有两种 $l1$ 正则和 $l2$ 正则。

　　 $l1$ 正则: $\varphi (w) =\left \| w \right \| _{1}= \sum |w|$

　　 $l2$ 正则: $\varphi (w) =\left \| w \right \| _{2}= \sum w^{2} = W^{T}W$

同时最优化问题变为了

　　　　 $w = \arg \min_{w}[l(w,z) + \lambda \varphi(W)])$

$l1$ 正则和 $l2$ 正则的作用都是为了防止W产生一些绝对值很大的值，避免过拟合。但是 $l1$ 正则和 $l2$ 正则存在一些区别：

$l1$ 正则在批量梯度下降的时候，能够产生稀疏解，就是W有很多维度的值为0，这些为0的维度通常都是相关度不高的维度，这样在计算高维度的向量时能够起到特征选择的作用。

稀疏解是在进行逻辑回归计算的时候一个很重要的目标，在广告CTR预估的时候，每个样本的维度都特别高，稀疏解可以大大节省计算时间。

但是前面说过批量梯度下降的计算时间很长，而随机梯度下降即使使用了 $l1$ 正则，仍然很难得到一个稀疏解。

有一种解决办法，就是当W中某个维度的值低于一定阈值的时候，直接将这个维度的值设置为0。这种方式叫做截断法，但是这种方法太简单暴力，容易将一些实际中需要的维度干掉。那么有没有一种办法能够温柔一点呢，那就是截断梯度法[1]。

FOBOS

FOBOS[2]叫做前向后向切分法，它不仅与迭代前的W相关，同时考虑到了迭代后的W。

第二个公式中左面部分确保了新的解不会距离上一个解太远，同时第二部分保证能够产生稀疏解。

RDA

RDA算法叫做正则对偶平均(Regularized Dual Averaging)，该算法的迭代方式如下：

公式中第1部分是指梯度和的平均值，第2部分是指正则项，第3部分是指额外的一个正则项。

与FOBOS相比较，RDA算法能够产生更加稀疏的解，但是精度会比较差。

FTRL

FTRL将FOBOS和RDA算法进行结合，这样既兼顾了精度，又兼顾了稀疏性。迭代公式如下：

经过不断的推导，最终结果变为：

其中

这个算法过程如下

这几部分描述的比较简陋，主要还是需要参考作者的论文，FTRL算法应用在CTR预估方面有很好的效果，kaggle上面也有对应的项目[3]。项目代码如下：

'''
           DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
                   Version 2, December 2004

Copyright (C) 2004 Sam Hocevar <sam@hocevar.net>

Everyone is permitted to copy and distribute verbatim or modified
copies of this license document, and changing it is allowed as long
as the name is changed.

           DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
  TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION

 0. You just DO WHAT THE FUCK YOU WANT TO.
'''


from datetime import datetime
from csv import DictReader
from math import exp, log, sqrt


# TL; DR, the main training process starts on line: 250,
# you may want to start reading the code from there


##############################################################################
# parameters #################################################################
##############################################################################

# A, paths
train = 'train_rev2'               # path to training file
test = 'test_rev2'                 # path to testing file
submission = 'submission1234.csv'  # path of to be outputted submission file

# B, model
alpha = .1  # learning rate
beta = 1.   # smoothing parameter for adaptive learning rate
L1 = 1.     # L1 regularization, larger value means more regularized
L2 = 1.     # L2 regularization, larger value means more regularized

# C, feature/hash trick
D = 2 ** 20             # number of weights to use
interaction = False     # whether to enable poly2 feature interactions

# D, training/validation
epoch = 1       # learn training data for N passes
holdafter = 9   # data after date N (exclusive) are used as validation
holdout = None  # use every N training instance for holdout validation


##############################################################################
# class, function, generator definitions #####################################
##############################################################################

class ftrl_proximal(object):
    ''' Our main algorithm: Follow the regularized leader - proximal

        In short,
        this is an adaptive-learning-rate sparse logistic-regression with
        efficient L1-L2-regularization

        Reference:
        http://www.eecs.tufts.edu/~dsculley/papers/ad-click-prediction.pdf
    '''

    def __init__(self, alpha, beta, L1, L2, D, interaction):
        # parameters
        self.alpha = alpha
        self.beta = beta
        self.L1 = L1
        self.L2 = L2

        # feature related parameters
        self.D = D
        self.interaction = interaction

        # model
        # n: squared sum of past gradients
        # z: weights
        # w: lazy weights
        self.n = [0.] * D
        self.z = [0.] * D
        self.w = {}

    def _indices(self, x):
        ''' A helper generator that yields the indices in x

            The purpose of this generator is to make the following
            code a bit cleaner when doing feature interaction.
        '''

        # first yield index of the bias term
        yield 0

        # then yield the normal indices
        for index in x:
            yield index

        # now yield interactions (if applicable)
        if self.interaction:
            D = self.D
            L = len(x)

            x = sorted(x)
            for i in xrange(L):
                for j in xrange(i+1, L):
                    # one-hot encode interactions with hash trick
                    yield abs(hash(str(x[i]) + '_' + str(x[j]))) % D

    def predict(self, x):
        ''' Get probability estimation on x

            INPUT:
                x: features

            OUTPUT:
                probability of p(y = 1 | x; w)
        '''

        # parameters
        alpha = self.alpha
        beta = self.beta
        L1 = self.L1
        L2 = self.L2

        # model
        n = self.n
        z = self.z
        w = {}

        # wTx is the inner product of w and x
        wTx = 0.
        for i in self._indices(x):
            sign = -1. if z[i] < 0 else 1.  # get sign of z[i]

            # build w on the fly using z and n, hence the name - lazy weights
            # we are doing this at prediction instead of update time is because
            # this allows us for not storing the complete w
            if sign * z[i] <= L1:
                # w[i] vanishes due to L1 regularization
                w[i] = 0.
            else:
                # apply prediction time L1, L2 regularization to z and get w
                w[i] = (sign * L1 - z[i]) / ((beta + sqrt(n[i])) / alpha + L2)

            wTx += w[i]

        # cache the current w for update stage
        self.w = w

        # bounded sigmoid function, this is the probability estimation
        return 1. / (1. + exp(-max(min(wTx, 35.), -35.)))

    def update(self, x, p, y):
        ''' Update model using x, p, y

            INPUT:
                x: feature, a list of indices
                p: click probability prediction of our model
                y: answer

            MODIFIES:
                self.n: increase by squared gradient
                self.z: weights
        '''

        # parameter
        alpha = self.alpha

        # model
        n = self.n
        z = self.z
        w = self.w

        # gradient under logloss
        g = p - y

        # update z and n
        for i in self._indices(x):
            sigma = (sqrt(n[i] + g * g) - sqrt(n[i])) / alpha
            z[i] += g - sigma * w[i]
            n[i] += g * g


def logloss(p, y):
    ''' FUNCTION: Bounded logloss

        INPUT:
            p: our prediction
            y: real answer

        OUTPUT:
            logarithmic loss of p given y
    '''

    p = max(min(p, 1. - 10e-15), 10e-15)
    return -log(p) if y == 1. else -log(1. - p)


def data(path, D):
    ''' GENERATOR: Apply hash-trick to the original csv row
                   and for simplicity, we one-hot-encode everything

        INPUT:
            path: path to training or testing file
            D: the max index that we can hash to

        YIELDS:
            ID: id of the instance, mainly useless
            x: a list of hashed and one-hot-encoded 'indices'
               we only need the index since all values are either 0 or 1
            y: y = 1 if we have a click, else we have y = 0
    '''

    for t, row in enumerate(DictReader(open(path))):
        # process id
        ID = row['id']
        del row['id']

        # process clicks
        y = 0.
        if 'click' in row:
            if row['click'] == '1':
                y = 1.
            del row['click']

        # extract date
        date = int(row['hour'][4:6])

        # turn hour really into hour, it was originally YYMMDDHH
        row['hour'] = row['hour'][6:]

        # build x
        x = []
        for key in row:
            value = row[key]

            # one-hot encode everything with hash trick
            index = abs(hash(key + '_' + value)) % D
            x.append(index)

        yield t, date, ID, x, y


##############################################################################
# start training #############################################################
##############################################################################

start = datetime.now()

# initialize ourselves a learner
learner = ftrl_proximal(alpha, beta, L1, L2, D, interaction)

# start training
for e in xrange(epoch):
    loss = 0.
    count = 0

    for t, date, ID, x, y in data(train, D):  # data is a generator
        #    t: just a instance counter
        # date: you know what this is
        #   ID: id provided in original data
        #    x: features
        #    y: label (click)

        # step 1, get prediction from learner
        p = learner.predict(x)

        if (holdafter and date > holdafter) or (holdout and t % holdout == 0):
            # step 2-1, calculate validation loss
            #           we do not train with the validation data so that our
            #           validation loss is an accurate estimation
            #
            # holdafter: train instances from day 1 to day N
            #            validate with instances from day N + 1 and after
            #
            # holdout: validate with every N instance, train with others
            loss += logloss(p, y)
            count += 1
        else:
            # step 2-2, update learner with label (click) information
            learner.update(x, p, y)

    print('Epoch %d finished, validation logloss: %f, elapsed time: %s' % (
        e, loss/count, str(datetime.now() - start)))


##############################################################################
# start testing, and build Kaggle's submission file ##########################
##############################################################################

with open(submission, 'w') as outfile:
    outfile.write('id,click\n')
    for t, date, ID, x, y in data(test, D):
        p = learner.predict(x)
        outfile.write('%s,%s\n' % (ID, str(p)))