精通 Python 机器学习的六个步骤（全）

原文：Mastering Machine Learning with Python in Six Steps

协议：CC BY-NC-SA 4.0

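1. Getting Started with Python 3

In this chapter, you will get a high-level overview of the Python language and its core philosophy, learn how to set up a Python 3 development environment, and meet the key concepts of Python programming that you need to get started. This chapter is an additional or prerequisite step for readers who are new to Python. If you are already comfortable with Python, I recommend skimming the contents to make sure you know all the key concepts.

The Best Things in Life Are Free

It is said that "the best things in life are free!" Python is an open source, high-level, object-oriented, interpreted, general-purpose dynamic programming language. It has a community-based development model. Its core design philosophy emphasizes code readability, and its coding structure lets programmers express computing concepts in fewer lines of code than other high-level languages such as Java, C, or C++.

The document "The Zen of Python" (Python Enhancement Proposal 20, an informational entry) summarizes Python's design philosophy well. It includes aphorisms such as the following: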

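Beautiful is better than ugly: be consistent.
Complex is better than complicated: use existing libraries.
Simple is better than complex: keep it simple, stupid (KISS).
Flat is better than nested: avoid nested ifs.
Explicit is better than implicit: be clear.
Sparse is better than dense: separate code into modules.
Readability counts: indent for easy readability.
Special cases aren't special enough to break the rules: everything is an object.
Errors should never pass silently: use good exception handling.
Although practicality beats purity: if required, break the rules.
Unless explicitly silenced: use error logging and traceability.
In the face of ambiguity, refuse the temptation to guess: Python syntax is simpler; however, many times we may take longer to decipher it.
Although that way may not be obvious at first: there is not only one way of achieving something.
There should be preferably only one obvious way to do it: use existing libraries.
If the implementation is hard to explain, it's a bad idea: if you can't explain it in simple terms, you don't understand it well enough.
Now is better than never: there are quick and dirty ways to get the job done rather than trying too hard to optimize.
Although never is often better than right now: although there is a quick and dirty way, don't head down a path that offers no graceful way back.
Namespaces are one honking great idea, so let's do more of those: be specific.
If the implementation is easy to explain, it may be a good idea: simplicity is good.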

冉冉升起的明星

Python 于 1991 年 2 月 20 日正式诞生，版本号为 0.9.0。它的应用横跨各个领域，如网站开发、移动应用程序开发、科学和数字计算、桌面 GUI 和复杂软件开发。尽管 Python 是一种更通用的编程和脚本语言，但在过去几年中，它已经在数据工程师、科学家和机器学习(ML)爱好者中流行起来。

有一些设计良好的开发环境，如 Jupyter Notebook 和 Spyder，它们允许快速检查数据，并能够交互式地开发 ML 模型。

NumPy 和 Pandas 等强大的模块可以有效地使用数字数据。科学计算通过 SciPy 软件包变得很容易。许多主要的 ML 算法已经在 scikit-learn(也称为 sklearn)中有效地实现。HadooPy 和 PySpark 通过大数据技术堆栈提供无缝的工作体验。Cython 和 Numba 模块允许以 C 代码的速度执行 Python 代码。nosetest 等模块强调高质量、持续集成测试和自动部署。

将所有这些结合起来，使得许多 ML 工程师选择 Python 作为语言来探索数据、识别模式、构建模型并将其部署到生产环境中。最重要的是，各种关键 Python 包的商业友好许可证鼓励了商业和开源社区的合作，这对双方都有利。总的来说，Python 编程生态系统允许快速的结果和快乐的程序员。我们已经看到了这样一种趋势，即开发人员成为开源社区的一部分，为全球社区提供 bug 修复和新算法，同时保护他们所在公司的核心知识产权。

选择 Python 2.x 或 Python 3.x

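Python 3.0, released in December 2008, is backward incompatible. The development team was under strong pressure to separate binary data from text data and to make all text data support Unicode automatically, so that project teams could work with multiple languages easily. As a result, migrating any project from 2.x to 3.x requires substantial changes. Python 2.x was originally scheduled for end of life in 2015, but this was extended by five years to 2020.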

Python 3 是一种前沿、更好、更一致的语言。这是 Python 语言的未来，它修复了 Python 2 中存在的许多问题。表 1-1 显示了一些主要差异。

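Table 1-1

Python 2 vs. Python 3

| Python 2 | Python 3 |
| --- | --- |
| Retires in 2020; until then it receives security updates and bug fixes. | Widely adopted over the last two years; currently 99.7% of key packages support Python 3. |
| print is a statement: print "Hello World!" | print is a function: print("Hello World!") |
| Strings are stored as ASCII by default. | Strings are stored as Unicode by default. |
| Dividing two integers with / discards the fractional part (5 / 2 gives 2). | Dividing two integers with / returns the exact value as a float (5 / 2 gives 2.5); use // for floor division. |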

截至目前，Python 3 readiness ( http://py3readiness.org/ )显示，Python 的 360 个顶级包中有 360 个支持 3.x，强烈建议我们使用 Python 3.x 进行开发工作。
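The Python 3 behaviors summarized in Table 1-1 can be checked in a few lines. The sketch below is illustrative only; the sample text and numbers are assumptions chosen for the demonstration, not values from the book.

```python
# Python 3 behaviours from Table 1-1 (illustrative values only)

# print is a function, so its arguments go in parentheses
print("Hello World!")

# Strings are Unicode text by default; bytes are a separate type
text = "naïve café"
encoded = text.encode("utf-8")
print(type(text).__name__, type(encoded).__name__)

# "/" is true division and "//" is floor division
true_quotient = 7 / 2
floor_quotient = 7 // 2
print(true_quotient, floor_quotient)
```

Running the same lines under Python 2 would fail or behave differently, which is one reason migration between the two versions needs care.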

我推荐 Anaconda (Python 发行版)，BSD 许可的，它允许您将它用于商业和再发行。它有大约 474 个包，包括对大多数科学应用、数据分析和 ML 最重要的包，如 NumPy、SciPy、Pandas、Jupyter Notebook、matplotlib 和 scikit-learn。它还提供了一个优秀的环境工具 conda，允许您轻松地在环境之间切换——甚至在 Python 2 和 3 之间切换(如果需要的话)。当一个包的新版本发布时，它也更新得非常快；你可以直接做conda update <packagename>来更新它。

你可以从他们的官方网站 https://www.anaconda.com/distribution/ 下载最新版本的 Anaconda，并按照安装说明进行操作。

要安装 Python，请参考以下章节。

Windows 操作系统

根据您的系统配置(32 或 64 位)，下载安装程序。
双击。exe 文件来安装 Anaconda，并按照屏幕上的安装向导进行操作。

Mac OS

对于 Mac OS，您可以通过图形安装程序或命令行进行安装。

图形安装程序

下载图形安装程序。
双击下载的。pkg 文件，然后按照屏幕上的安装向导说明进行操作。

命令行安装程序

下载命令行安装程序
在你的终端窗口中，输入并遵循指令:bash

Linux 操作系统

根据您的系统配置，下载安装程序。
In your terminal window, type the following and follow the instructions: bash Anaconda3-x.x.x-Linux-x86_xx.sh

来自官方网站

如果你不想使用 Anaconda build pack，你可以去 Python 的官方网站 www.python.org/downloads/ 浏览到合适的 OS 部分并下载安装程序。注意，OSX 和大多数 Linux 都预装了 Python，所以不需要额外的配置。

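When setting up Python on Windows, make sure the "Add Python to PATH" option is checked when you run the installer. This lets you invoke the Python interpreter from any directory.

If you did not check the "Add Python to PATH" option, follow these steps:

Right-click "My Computer."
Click "Properties."
Click "Advanced system settings" in the side panel.
Click "Environment Variables."
Click "New" below the system variables.
In the name field, enter pythonexe (or any name you like).
In the value field, enter the path to your Python installation (for example, C:\Python32).
Now edit the Path variable (in the system section) and append %pythonexe% to the end of what is already there.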

运行 Python

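At the command line, type python to open the interactive interpreter. A Python script can be executed from the command line with the following syntax:

python <scriptname.py>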

关键概念

Python 中有许多基本概念，理解它们对于您的入门至关重要。本章的其余部分对它们进行了简要的介绍。

Python 标识符

顾名思义，标识符帮助我们区分一个实体和另一个实体。类、函数和变量等 Python 实体被称为标识符。

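General rules for writing identifiers in Python:
- An identifier can be a combination of lowercase or uppercase letters (a to z or A to Z), digits (0 to 9), and underscores (_).
- It cannot start with a digit. For example, 1variable is invalid, whereas variable1 is valid.
- Python reserved keywords (see Table 1-2) cannot be used as identifiers.
- Apart from the underscore (_), special symbols such as !, @, #, $, and % cannot be part of an identifier.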
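The identifier rules above can be tested programmatically. This is a minimal sketch that uses the standard `str.isidentifier()` method and the `keyword` module; the helper name `is_valid_identifier` and the candidate names are hypothetical, picked only to exercise each rule.

```python
import keyword

# Hypothetical candidate names, chosen only to exercise each rule
candidates = ["variable1", "1variable", "_private", "total$", "class"]

def is_valid_identifier(name):
    """Return True if name is syntactically valid and not a reserved keyword."""
    return name.isidentifier() and not keyword.iskeyword(name)

results = {name: is_valid_identifier(name) for name in candidates}
for name, ok in results.items():
    print(name, "->", ok)
```

Note that `"class".isidentifier()` alone returns True, which is why the keyword check is needed as a second step.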

关键词

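Table 1-2 lists the set of reserved words that Python uses to define the syntax and structure of the language. Keywords are case sensitive, and all of them are lowercase except True, False, and None.

Table 1-2

Python Keywords

| | | | | |
| --- | --- | --- | --- | --- |
| False | class | finally | is | return |
| None | continue | for | lambda | try |
| True | def | from | nonlocal | while |
| and | del | global | not | with |
| as | elif | if | or | yield |
| assert | else | import | pass | |
| break | except | in | raise | |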

我的第一个 Python 程序

与其他编程语言相比，使用 Python 要容易得多(图 1-1 )。让我们看看如何在一行代码中执行一个简单的 print 语句。您可以在命令提示符下启动 Python 交互式，输入以下文本，然后按 Enter 键。

图 1-1

Python 与其他

>>> print ("Hello, Python World!")

代码块

理解如何用 Python 编写代码块是非常重要的。让我们来看看关于代码块的两个关键概念:缩进和套件。

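Indentation

One of Python's most distinctive features is its use of indentation to mark blocks of code. Every line in a block must be indented by the same amount. Unlike in most other programming languages, indentation is not there just to make the code look pretty; it is required to show which block a statement belongs to (see the examples in Listings 1-1 and 1-2).

Suites

In Python, a collection of individual statements that makes up a single code block is called a suite. Compound statements such as if, while, def, and class require a header line followed by a suite (these statements are covered in detail in later sections). The header line begins with a keyword and ends with a colon (:), and is followed by one or more lines that make up the suite.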

# incorrect indentation, program will generate a syntax error
# due to the space character inserted at the beginning of the second line
print ("Programming is an important skill for Data Science")
 print ("Statistics is an important skill for Data Science")
print ("Business domain knowledge is an important skill for Data Science")
# incorrect indentation, program will generate a syntax error
# due to the wrong indentation in the else statement
x = 1
if x == 1:
    print ('x has a value of 1')
else:
 print ('x does NOT have a value of 1')
-------Output-----------
    print ("Statistics is an important skill for Data Science")
    ^
IndentationError: unexpected indent

Listing 1-2Example of Incorrect Indentation

# Correct indentation
print ("Programming is an important skill for Data Science")
print ("Statistics is an important skill for Data Science")
print ("Business domain knowledge is an important skill for Data Science")

# Correct indentation, note that if statement here is an example of suites
x = 1
if x == 1:
    print ('x has a value of 1')
else:
    print ('x does NOT have a value of 1')

Listing 1-1Example of Correct Indentation

基本对象类型

表 1-3 列出了 Python 对象类型。根据 Python 数据模型参考，对象是 Python 的数据概念。Python 程序中的所有数据都由对象或对象之间的关系来表示。在某种意义上，与冯·诺依曼的“存储程序计算机”模型一致，代码也是由对象表示的。

每个对象都有标识、类型和值。清单 1-3 提供了理解对象类型的示例代码。
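The three properties of an object (identity, type, and value) can be inspected with the built-in `id()` and `type()` functions and the `==` operator. The variable names below are assumptions made for this sketch.

```python
value = [1, 2, 3]
alias = value          # a second name for the same object
clone = list(value)    # a new object with an equal value

print(type(value))             # the object's type
print(id(value) == id(alias))  # same identity
print(id(value) == id(clone))  # different identity
print(value == clone)          # equal value
```

Two names can therefore share one object, while two distinct objects can still compare equal.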

表 1-3

Python 对象类型

| Type | Examples | Comments |
| --- | --- | --- |
| None | None | # singleton null object |
| Boolean | True, False | |
| Integer | -1, 0, 1, sys.maxsize | |
| Long | 1L, 9787L | # Python 2 only; Python 3 has a single int type |
| Float | 3.141592654 | |
| | inf, float('inf') | # infinity |
| | -inf | # negative infinity |
| | nan, float('nan') | # not a number |
| Complex | 2+8j | # note the use of j |
| String | 'this is a string', "also me" | # use single or double quotes |
| | r'raw string', u'unicode string' | |
| Tuple | empty = () | # empty tuple |
| | (1, True, 'ML') | # immutable list |
| List | empty = [] | # empty list |
| | [1, True, 'ML'] | # mutable list |
| Set | empty = set() | # empty set |
| | set([1, True, 'ML']) | # mutable |
| Dictionary | empty = {} | # mutable object |
| | {'1': 'A', '2': 'AA', True: 1, False: 0} | |
| File | f = open('filename', 'rb') | |

none = None           #singleton null object
boolean = bool(True)
integer = 1
Long = 3.14

# float
Float = 3.14
Float_inf = float('inf')
Float_nan = float('nan')

# complex object type, note the usage of letter j
Complex = 2+8j

# string can be enclosed in single or double quote
string = 'this is a string'
me_also_string = "also me"

List = [1, True, 'ML'] # Values can be changed

Tuple = (1, True, 'ML') # Values can not be changed

Set = set([1,2,2,2,3,4,5,5]) # Duplicates will not be stored

# Use a dictionary when you have a set of unique keys that map to values
Dictionary = {'a':'A', 2:'AA', True:1, False:0}

# lets print the object type and the value
print (type(none), none)
print (type(boolean), boolean)
print (type(integer), integer)
print (type(Long), Long)
print (type(Float), Float)
print (type(Float_inf), Float_inf)
print (type(Float_nan), Float_nan)
print (type(Complex), Complex)
print (type(string), string)
print (type(me_also_string), me_also_string)
print (type(Tuple), Tuple)
print (type(List), List)
print (type(Set), Set)
print (type(Dictionary), Dictionary)

----- output ------

<class 'NoneType'> None
<class 'bool'> True
<class 'int'> 1
<class 'float'> 3.14
<class 'float'> 3.14
<class 'float'> inf
<class 'float'> nan
<class 'complex'> (2+8j)
<class 'str'> this is a string
<class 'str'> also me
<class 'tuple'> (1, True, 'ML')
<class 'list'> [1, True, 'ML']
<class 'set'> {1, 2, 3, 4, 5}
<class 'dict'> {'a': 'A', 2: 'AA', True: 1, False: 0}

Listing 1-3Code for Basic Object Types

何时使用列表、元组、集合或字典

四个关键的、常用的 Python 对象是列表、元组、集合和字典。理解什么时候使用这些很重要，这样才能写出高效的代码。

列表 : 当您需要一个有序的同质集合序列，其值可以在程序中稍后更改时使用。
Tuple: 当您需要一个异构集合的有序序列时使用，这些集合的值不需要在程序的后面进行更改。
Set: 当您不必存储重复项，并且不关心项目的顺序时，它是理想的选择。你只想知道一个特定的值是否已经存在。
Dictionary: 当您需要将值与键相关联，以便使用键高效地查找它们时，它是理想的选择。
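The guidance above on choosing among a list, tuple, set, and dictionary can be sketched as follows. All data values are hypothetical and serve only to show each container in its typical role.

```python
# Hypothetical data, for illustration only
scores = [72, 88, 95]                 # list: ordered and changeable
scores.append(64)

point = (3.5, "north", True)          # tuple: ordered, fixed after creation

seen_tags = {"ml", "python", "ml"}    # set: duplicates collapse
has_python = "python" in seen_tags    # fast membership test

capitals = {"France": "Paris", "Japan": "Tokyo"}   # dict: key -> value lookup
capital_of_japan = capitals["Japan"]

print(scores, point, len(seen_tags), has_python, capital_of_japan)
```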

Python 中的注释

单行注释:任何跟在#(散列)后面直到行尾的字符都被认为是注释的一部分，Python 解释器会忽略它们。

多行注释:字符串"""(称为多行字符串)之间的任何字符，即注释开头和结尾的字符，将被 Python 解释器忽略。请参考清单 1-4 中的注释代码示例。

# This is a single line comment in Python
print("Hello Python World") # This is also a single line comment in Python

""" This is an example of a multi-line
the comment that runs into multiple lines.
Everything that is in between is considered as comments
"""

Listing 1-4Example Code for Comments

多行语句

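Implicit line continuation inside parentheses, brackets, and braces is the preferred way to wrap long lines in Python. A backslash can also mark an explicit line continuation, but wrapping the expression in an extra pair of parentheses is usually the more readable choice. It is important to indent the continued lines appropriately. The examples here break a line after a binary operator; note that current PEP 8 guidance favors breaking before the operator in new code. See Listing 1-5 for Python code examples.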

# Example of implicit line continuation
x = ('1' + '2' +
    '3' + '4')

# Example of explicit line continuation
y = '1' + '2' + \
    '11' + '12'

weekdays = ['Monday', 'Tuesday', 'Wednesday',
'Thursday', 'Friday']

weekend = {'Saturday',
           'Sunday'}

print ('x has a value of', x)
print ('y has a value of', y)
print (weekdays)
print (weekend)

------ output -------
x has a value of 1234
y has a value of 121112
['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
{'Sunday', 'Saturday'}

Listing 1-5Example Code for Multiline Statements

单行上的多条语句

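Python also allows multiple statements on a single line, separated by semicolons (;), provided that no statement starts a new code block. Listing 1-6 provides a code example.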

import os; x = 'Hello'; print (x)

Listing 1-6Code Example for Multiple Statements on a Single Line

基本运算符

在 Python 中，运算符是可以操作操作数值的特殊符号。例如，让我们考虑表达式 1 + 2 = 3。这里 1 和 2 称为操作数，是运算符所运算的值，符号+称为运算符。

Python 语言支持以下类型的运算符:

算术运算符
比较或关系运算符
赋值运算符
按位运算符
逻辑运算符
成员运算符
标识运算符

让我们通过例子来学习所有的运算符，一个一个来。

算术运算符

算术运算符(列在表 1-4 中)对于对数字执行加、减、乘、除等数学运算很有用。代码示例请参考清单 1-7 。

表 1-4

算术运算符

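| Operator | Description | Example (x = 10, y = 20) |
| --- | --- | --- |
| + | Addition | x + y = 30 |
| - | Subtraction | x - y = -10 |
| * | Multiplication | x * y = 200 |
| / | Division | y / x = 2.0 |
| % | Modulus | y % x = 0 |
| ** | Exponent | x ** y = 10 to the power 20 |
| // | Floor division: the quotient is rounded toward minus infinity | 9 // 2 = 4 and 9.0 // 2.0 = 4.0, -11 // 3 = -4, -11.0/ |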

# Variable x holds 10 and variable y holds 5
x = 10
y = 5

# Addition
print ("Addition, x(10) + y(5) = ", x + y)

# Subtraction
print ("Subtraction, x(10) - y(5) = ", x - y)

# Multiplication
print ("Multiplication, x(10) * y(5) = ", x * y)

# Division
print ("Division, x(10) / y(5) = ",x / y)

# Modulus
print ("Modulus, x(10) % y(5) = ", x % y)

# Exponent
print ("Exponent, x(10)**y(5) = ", x**y)

# Integer division rounded towards minus infinity
print ("Floor Division, x(10)//y(5) = ", x//y)

-------- output --------

Addition, x(10) + y(5) =  15
Subtraction, x(10) - y(5) =  5
Multiplication, x(10) * y(5) =  50
Division, x(10) / y(5) =  2.0
Modulus, x(10) % y(5) =  0
Exponent, x(10)**y(5) =  100000
Floor Division, x(10)//y(5) =  2

Listing 1-7Example Code for Arithmetic Operators
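Floor division is the operator most likely to surprise with negative operands, because it rounds toward minus infinity rather than toward zero. The short sketch below contrasts the two behaviors and also shows the built-in `divmod()`; the operand values are taken from Table 1-4 or chosen for illustration.

```python
# Floor division rounds toward minus infinity, so results for
# negative operands differ from simple truncation toward zero
floor_positive = 9 // 2
floor_negative = -11 // 3
truncated = int(-11 / 3)   # truncation toward zero, for comparison
print(floor_positive, floor_negative, truncated)

# divmod returns the floor quotient and the remainder together;
# they always satisfy: quotient * divisor + remainder == dividend
quotient, remainder = divmod(-11, 3)
print(quotient, remainder)
```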

比较或关系运算符

顾名思义，表 1-5 中列出的比较或关系运算符对于比较值很有用。对于给定的条件，它们将返回 True 或 False。代码示例参见清单 1-8 。

表 1-5

比较或关系运算符

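| Operator | Description | Example |
| --- | --- | --- |
| == | The condition is true if the values of the two operands are equal. | (10 == 5) is not true. |
| != | The condition is true if the values of the two operands are not equal. | (10 != 5) is true. |
| > | The condition is true if the value of the left operand is greater than the value of the right operand. | (10 > 5) is true. |
| < | The condition is true if the value of the left operand is less than the value of the right operand. | (10 < 5) is not true. |
| >= | The condition is true if the value of the left operand is greater than or equal to the value of the right operand. | (10 >= 5) is true. |
| <= | The condition is true if the value of the left operand is less than or equal to the value of the right operand. | (10 <= 5) is not true. |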

# Variable x holds 10 and variable y holds 5
x = 10
y = 5

# Equal check operation
print ("Equal check, x(10) == y(5) ", x == y)

# Not Equal check operation
print ("Not Equal check, x(10) != y(5) ", x != y)

# Less than check operation
print ("Less than check, x(10) <y(5) ", x<y)

# Greater check operation
print ("Greater than check, x(10) >y(5) ", x>y)

# Less than or equal check operation
print ("Less than or equal to check, x(10) <= y(5) ", x<= y)

# Greater than or equal to check operation
print ("Greater than or equal to check, x(10) >= y(5) ", x>= y)

-------- output --------
Equal check, x(10) == y(5)  False
Not Equal check, x(10) != y(5)  True
Less than check, x(10) <y(5)  False
Greater than check, x(10) >y(5)  True
Less than or equal to check, x(10) <= y(5)  False
Greater than or equal to check, x(10) >= y(5)  True

Listing 1-8. Example Code for Comparison/Relational Operators

赋值运算符

在 Python 中，表 1-6 中列出的赋值运算符用于给变量赋值。例如，考虑 x = 5；这是一个简单的赋值运算符，它将运算符右侧的数值 5 赋给左侧的变量 x。Python 中有一系列复合操作符，比如 x += 5，它们会添加到变量中，然后对变量赋值。它和 x = x + 5 一样好。代码示例参见清单 1-9 。

表 1-6

赋值运算符

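| Operator | Description | Example |
| --- | --- | --- |
| = | Assigns the value of the right operand to the left operand. | z = x + y assigns the value of x + y to z |
| += (add AND) | Adds the right operand to the left operand and assigns the result to the left operand. | z += x is equivalent to z = z + x |
| -= (subtract AND) | Subtracts the right operand from the left operand and assigns the result to the left operand. | z -= x is equivalent to z = z - x |
| *= (multiply AND) | Multiplies the left operand by the right operand and assigns the result to the left operand. | z *= x is equivalent to z = z * x |
| /= (divide AND) | Divides the left operand by the right operand and assigns the result to the left operand. | z /= x is equivalent to z = z / x |
| %= (modulus AND) | Takes the modulus of the two operands and assigns the result to the left operand. | z %= x is equivalent to z = z % x |
| **= (exponent AND) | Raises the left operand to the power of the right operand and assigns the result to the left operand. | z **= x is equivalent to z = z ** x |
| //= (floor division) | Performs floor division on the operands and assigns the result to the left operand. | z //= x is equivalent to z = z // x |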

# Variable x holds 5 and variable y holds 10
x = 5
y = 10

x += y
print ("Value of a post x+=y is ", x)

x *= y
print ("Value of a post x*=y is ", x)

x /= y
print ("Value of a post x/=y is ", x)

x %= y
print ("Value of a post x%=y is ", x)

x **= y
print ("Value of x post x**=y is ", x)

x //= y
print ("Value of a post x//=y is ", x)
-------- output --------
Value of a post x+=y is  15
Value of a post x*=y is  150
Value of a post x/=y is  15.0
Value of a post x%=y is  5.0
Value of x post x**=y is  9765625.0
Value of a post x//=y is  976562.0

Listing 1-9Example Code for Assignment Operators

按位运算符

你可能知道，计算机中的一切都是用比特来表示的，即一系列 0 和 1。表 1-7 中列出的位运算符使我们能够直接操作或操纵位。让我们了解一下基本的位运算。按位运算符的一个主要用途是解析十六进制颜色。

众所周知，按位运算符会让 Python 编程新手感到困惑，所以如果你一开始不理解可用性，也不要着急。事实是，在你的日常 ML 编程中，你不会真的看到按位操作符。但是，知道这些操作符是有好处的。

比如我们假设 x = 10(二进制 0000 1010)，y = 4(二进制 0000 0100)。代码示例请参考清单 1-10 。

表 1-7

按位运算符

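| Operator | Description | Example |
| --- | --- | --- |
| & (binary AND) | Copies a bit to the result if it is set in both operands. | (x & y) = 0 (0000 0000) |
| \| (binary OR) | Copies a bit if it is set in either operand. | (x \| y) = 14 (0000 1110) |
| ^ (binary XOR) | Copies a bit if it is set in one operand but not in both. | (x ^ y) = 14 (0000 1110) |
| ~ (binary ones' complement) | A unary operator that flips the bits. | (~x) = -11 (1111 0101) |
| << (binary left shift) | Shifts the value of the left operand left by the number of bits given by the right operand. | x << 2 = 40 (0010 1000) |
| >> (binary right shift) | Shifts the value of the left operand right by the number of bits given by the right operand. | x >> 2 = 2 (0000 0010) |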
# Basic six bitwise operations
# Let x = 10 (0000 1010 in binary) and y = 4 (0000 0100 in binary)
x = 10
y = 4

print (x >> y)  # Right Shift
print (x << y)  # Left Shift
print (x & y)   # Bitwise AND
print (x | y)   # Bitwise OR
print (x ^ y) # Bitwise XOR
print (~x)    # Bitwise NOT
-------- output --------

0
160
0
14
14
-11

Listing 1-10Example Code for Bitwise Operators
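The text mentions parsing hexadecimal colors as a typical use of bitwise operators. A minimal sketch of that idea follows; the colour value and variable names are assumptions for illustration, and the 0xRRGGBB layout is the usual packed RGB convention rather than anything specific to this book.

```python
# Hypothetical colour value in 0xRRGGBB form
colour = 0x1E90FF

red = (colour >> 16) & 0xFF    # shift the top byte down, then mask it
green = (colour >> 8) & 0xFF   # middle byte
blue = colour & 0xFF           # lowest byte

print(red, green, blue)

# Rebuild the original value with left shifts and bitwise OR
rebuilt = (red << 16) | (green << 8) | blue
print(hex(rebuilt))
```

Right shifts move the wanted byte into the lowest position, and the `& 0xFF` mask discards everything above it.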

逻辑运算符

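The and, or, and not operators are called logical operators and are listed in Table 1-8. They are useful for checking two variables against a given condition, and the result is True or False as appropriate. See Listing 1-11 for a code example.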

表 1-8

逻辑运算符

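| Operator | Description | Example (var1 and var2 both True) |
| --- | --- | --- |
| and (logical AND) | The condition is true if both operands are true. | (var1 and var2) is true. |
| or (logical OR) | The condition is true if either of the two operands is true (non-zero). | (var1 or var2) is true. |
| not (logical NOT) | Reverses the logical state of its operand. | not (var1 and var2) is false. |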

var1 = True
var2 = False
print('var1 and var2 is',var1 and var2)
print('var1 or var2 is',var1 or var2)
print('not var1 is',not var1)
-------- output --------

var1 and var2 is False
var1 or var2 is True
not var1 is False

Listing 1-11Example Code for Logical Operators

成员运算符

表 1-9 中列出的成员运算符对于测试是否在一个序列中找到一个值很有用，即字符串、列表、元组、集合或字典。Python 中有两个成员运算符:“in”和“not in”注意，在字典的情况下，我们只能测试键(而不是值)的存在。代码示例参见清单 1-12 。

表 1-9

成员运算符

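| Operator | Description | Example |
| --- | --- | --- |
| in | The result is True if the value is in the specified sequence, and False otherwise. | var1 in var2 |
| not in | The result is True if the value is not in the specified sequence, and False otherwise. | var1 not in var2 |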
var1 = 'Hello world'          # string
var2 = {1:'a',2:'b'}          # dictionary
print('H' in var1)
print('hello' not in var1)
print(1 in var2)
print('a' in var2)
-------- output --------
True
True
True
False

Listing 1-12Example Code for Membership Operators
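Because `in` on a dictionary tests keys only, checking for a value or a key-value pair needs the dictionary's `values()` or `items()` view. The sketch below reuses the dictionary from Listing 1-12; the result variable names are assumptions.

```python
var2 = {1: 'a', 2: 'b'}

key_found = 1 in var2                   # tests the keys
value_found = 'a' in var2.values()      # tests the values
pair_found = (1, 'a') in var2.items()   # tests key-value pairs

print(key_found, value_found, pair_found)
```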

标识运算符

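The identity operators listed in Table 1-10 test whether two variables refer to the same object in memory. Python has two identity operators: is and is not. Note that two variables holding equal values are not necessarily identical. See Listing 1-13 for a code example.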

表 1-10

标识运算符

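| Operator | Description | Example |
| --- | --- | --- |
| is | The result is True if the variables on both sides of the operator point to the same object, and False otherwise. | var1 is var2 |
| is not | The result is False if the variables on both sides of the operator point to the same object, and True otherwise. | var1 is not var2 |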

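# CPython usually caches small integers and short strings, so equal
# values may share one object; two separately built lists never do
var1 = 5
var1_copy = 5
var2 = 'Hello'
var2_copy = 'Hello'
var3 = [1,2,3]
var3_copy = [1,2,3]
print(var1 is not var1_copy)
print(var2 is var2_copy)
print(var3 is var3_copy)
-------- output --------
False
True
False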

Listing 1-13Example Code for Identity Operators
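The difference between equality (`==`) and identity (`is`) is easiest to see with mutable objects. The following sketch uses hypothetical variable names and also shows the common idiom of testing for `None` by identity.

```python
first = [1, 2, 3]
second = [1, 2, 3]   # equal contents, separate object
alias = first        # another name for the same object

print(first == second)   # compares values
print(first is second)   # compares identity
print(first is alias)

# None is a singleton, so identity is the usual way to test for it
result = None
print(result is None)
```

In everyday code, `==` is almost always the right comparison for values; `is` is mainly reserved for singletons such as None.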

控制结构

控制结构是编程中的基本选择或决策过程。它是一段分析变量值并根据给定条件决定前进方向的代码。在 Python 中，主要有两种类型的控制结构:选择和迭代。

选择

选择语句允许程序员检查条件，并根据结果执行不同的操作。这个有用的结构有两个版本:1) if 和 2) if…else。代码示例请参考清单 1-14 至 1-16 。

var = -1
if var < 0:
    print (var)
    print ("the value of var is negative")

# If the suite of an if clause consists only of a single line, it may go on the same line as the header statement
if ( var  == -1 ) : print ("the value of var is negative")
-------- output --------
-1
the value of var is negative
the value of var is negative

Listing 1-14. Example Code for a Simple “if” Statement

var = 1

if var < 0:
    print ("the value of var is negative")
    print (var)
else:
    print ("the value of var is positive")
    print (var)
-------- output --------
the value of var is positive
1

Listing 1-15. Example Code for “if else” Statement

score = 95

if score >= 99:
    print('A')
elif score >=75:
    print('B')
elif score >= 60:
    print('C')
elif score >= 35:
    print('D')
else:
    print('F')
-------- output --------
B

Listing 1-16. Example Code for Nested “if else” Statements

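Iteration

Loop control statements let us execute a single statement or a group of statements repeatedly, for as long as a given condition holds. Python provides two basic loop statements: 1) for and 2) while.

For loop: It lets us execute a block of code once for each item in a sequence, or a specific number of times. See Listings 1-17 to 1-19 for code examples.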

# First Example
print ("First Example")
for item in [1,2,3,4,5]:
    print ('item :', item)

# Second Example
print ("Second Example")
letters = ['A', 'B', 'C']
for letter in letters:
    print ('First loop letter :', letter)

# Third Example - Iterating by sequency index
print ("Third Example")
for index in range(len(letters)):
    print ('First loop letter :', letters[index])

# Fourth Example - Using else statement
print ("Fourth Example")
for item in [1,2,3,4,5]:
    print ('item :', item)
else:
    print ('looping over item complete!')
----- output ------
First Example
item : 1
item : 2
item : 3
item : 4
item : 5
Second Example
First loop letter : A
First loop letter : B
First loop letter : C
Third Example
First loop letter : A
First loop letter : B
First loop letter : C
Fourth Example
item : 1
item : 2
item : 3
item : 4
item : 5
looping over item complete!

Listing 1-17Example Code for a “for” Loop Statement
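The third example in Listing 1-17 iterates by index with `range(len(letters))`. When both the index and the item are needed, the built-in `enumerate()` is a common alternative. The sketch below is an illustrative variant, not part of the original listing; the `pairs` list is an assumption added so the result can be inspected.

```python
letters = ['A', 'B', 'C']

# enumerate yields (index, item) pairs, so no manual indexing is needed
pairs = []
for index, letter in enumerate(letters):
    pairs.append((index, letter))
    print(index, letter)
```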

While loop: The while statement repeats a block of code as long as its condition is true.

count = 0
while (count < 5):
    print ('The count is:', count)
    count = count + 1
----- output ------
The count is: 0
The count is: 1
The count is: 2
The count is: 3
The count is: 4

Listing 1-18Example Code for a “while” Loop Statement

警告

如果一个条件永远不会变为假，那么这个循环就变成了一个无限循环。

else 语句可以与 while 循环一起使用，当条件变为 false 时将执行 else 语句。

count = 0
while count < 5:
    print (count, " is  less than 5")
    count = count + 1
else:
    print (count, " is not less than 5")
----- output ------
0  is  less than 5
1  is  less than 5
2  is  less than 5
3  is  less than 5
4  is  less than 5
5  is not less than 5

Listing 1-19. Example Code for a “while” with an “else” Statement
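One detail worth knowing about the else clause of a loop is that it runs only when the loop condition becomes false; leaving the loop through `break` skips it. The function below is a hypothetical helper written for this sketch to show both paths.

```python
def find_first_multiple(limit, divisor):
    """Return the first positive multiple of divisor below limit, or None."""
    count = 1
    while count < limit:
        if count % divisor == 0:
            found = count
            break            # leaving via break skips the else suite
        count = count + 1
    else:
        found = None         # runs only when the condition became false
    return found

print(find_first_multiple(10, 4))
print(find_first_multiple(3, 4))
```

The first call exits through `break`, so the else suite is skipped; the second call exhausts the loop, so the else suite sets the result to None.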

列表

Python 的列表是最灵活的数据类型。它们可以通过在方括号之间写一列逗号分隔的值来创建。请注意，列表中的项目不必是相同的数据类型。表 1-11 汇总了列表操作；代码示例请参考清单 1-20 至 1-24 。

表 1-11

Python 列表操作

| Description | Python Expression | Example | Result |
| --- | --- | --- | --- |
| Create a list of items | [item1, item2, …] | list = ['a', 'b', 'c', 'd'] | ['a', 'b', 'c', 'd'] |
| Access an item in a list | list[index] | list = ['a', 'b', 'c', 'd']; list[2] | c |
| Length | len(list) | len([1, 2, 3]) | 3 |
| Concatenation | list_1 + list_2 | [1, 2, 3] + [4, 5, 6] | [1, 2, 3, 4, 5, 6] |
| Repetition | list * int | ['Hello'] * 3 | ['Hello', 'Hello', 'Hello'] |
| Membership | item in list | 3 in [1, 2, 3] | True |
| Iteration | for x in list: print(x) | for x in [1, 2, 3]: print(x) | 1 2 3 |
| Count from the right | list[-index] | list = [1, 2, 3]; list[-2] | 2 |
| Slicing fetches sections | list[index:] | list = [1, 2, 3]; list[1:] | [2, 3] |
| Return the max item | max(list) | max([1, 2, 3, 4, 5]) | 5 |
| Return the min item | min(list) | min([1, 2, 3, 4, 5]) | 1 |
| Append an object to a list | list.append(obj) | [1, 2, 3, 4].append(5) | [1, 2, 3, 4, 5] |
| Count occurrences of an item | list.count(obj) | [1, 1, 2, 3, 4].count(1) | 2 |
| Append the contents of a sequence to a list | list.extend(seq) | ['a', 1].extend(['b', 2]) | ['a', 1, 'b', 2] |
| Return the first index position of an item | list.index(obj) | ['a', 'b', 'c', 1, 2, 3].index('c') | 2 |
| Insert an object into a list at the given index | list.insert(index, obj) | ['a', 'b', 'c', 1, 2, 3].insert(3, 'd') | ['a', 'b', 'c', 'd', 1, 2, 3] |
| Remove and return the last object, or the object at the given index | list.pop([index]) | ['a', 'b', 'c', 1, 2, 3].pop(); ['a', 'b', 'c', 1, 2, 3].pop(2) | 3; c |
| Remove an object from a list | list.remove(obj) | ['a', 'b', 'c', 1, 2, 3].remove('c') | ['a', 'b', 1, 2, 3] |
| Reverse the objects of a list in place | list.reverse() | ['a', 'b', 'c', 1, 2, 3].reverse() | [3, 2, 1, 'c', 'b', 'a'] |
| Sort the objects of a list (items must be mutually comparable in Python 3) | list.sort() | ['b', 'c', 'a'].sort(); ['b', 'c', 'a'].sort(reverse=True) | ['a', 'b', 'c']; ['c', 'b', 'a'] |

list_1 = ['Statistics', 'Programming', 2016, 2017, 2018]
list_2 = ['a', 'b', 1, 2, 3, 4, 5, 6, 7]

# Accessing values in lists
print ("list_1[0]: ", list_1[0])
print ("list2_[1:5]: ", list_2[1:5])
---- output ----

list_1[0]:  Statistics
list2_[1:5]:  ['b', 1, 2, 3]

Listing 1-20. Example Code for Accessing Lists

print ("list_1 values: ", list_1)
list_1.append(2019)
print ("list_1 values post append: ", list_1)
---- output ----
list_1 values:  ['Statistics', 'Programming', 2016, 2017, 2018]
list_1 values post append:  ['Statistics', 'Programming', 2016, 2017, 2018, 2019]

Listing 1-21. Example Code for Adding New Values to Lists

# Updating existing values of list
print ("Values of list_1: ", list_1)
print ("Value available at index 2 : ", list_1[2])
list_1[2] = 2015
print ("New value available at index 2 : ", list_1[2])
---- output ----
Values of list_1:  ['Statistics', 'Programming', 2016, 2017, 2018, 2019]
Value available at index 2 :  2016
New value available at index 2 :  2015

Listing 1-22. Example Code for Updating Existing Values of Lists

# Deleting list elements
print ("list_1 values: ", list_1)
del list_1[5]
print ("After deleting value at index 5 : ", list_1)
---- output ----
list_1 values:  ['Statistics', 'Programming', 2015, 2017, 2018, 2019]
After deleting value at index 5 :  ['Statistics', 'Programming', 2015, 2017, 2018]

Listing 1-23. Example Code for Deleting a List Element

# Basic Operations
print ("Length: ", len(list_1))
print ("Concatenation: ", [1,2,3] + [4, 5, 6])
print ("Repetition :", ['Hello'] * 4)
print ("Membership :", 3 in [1,2,3])
print ("Iteration :" )
for x in [1,2,3]: print (x)

# Negative sign will count from the right
print ("slicing :", list_1[-2])
# If you don't specify the end explicitly, all elements from the specified start index will be printed
print ("slicing range: ", list_1[1:])

# Comparing elements of lists
# cmp function is only available in Python 2 and not 3, so if you still need it you could use the below custom function
def cmp(a, b):
    return (a > b) - (a < b)

print ("Compare two lists: ", cmp([1,2,3, 4], [1,2,3]))
print ("Max of list: ", max([1,2,3,4,5]))
print ("Min of list: ", min([1,2,3,4,5]))
print ("Count number of 1 in list: ", [1,1,2,3,4,5,].count(1))
list_1.extend(list_2)
print ("Extended :", list_1)
print ("Index for Programming : ", list_1.index( 'Programming'))
print (list_1)
print ("pop last item in list: ", list_1.pop( ))
print ("pop the item with index 2: ", list_1.pop(2))
list_1.remove('b')
print ("removed b from list: ", list_1)
list_1.reverse( )
print ("Reverse: ", list_1)
list_1 = ['a','b','c']
list_1.sort( )
print ("Sort ascending: ", list_1)
list_1.sort(reverse = True)
print ("Sort descending: ", list_1)
---- output ----

Length:  5
Concatenation:  [1, 2, 3, 4, 5, 6]
Repetition : ['Hello', 'Hello', 'Hello', 'Hello']
Membership : True
Iteration :
1
2
3
slicing : 2017
slicing range:  ['Programming', 2015, 2017, 2018]
Compare two lists:  1
Max of list:  5
Min of list:  1
Count number of 1 in list:  2
Extended : ['Statistics', 'Programming', 2015, 2017, 2018, 'a', 'b', 1, 2, 3, 4, 5, 6, 7]
Index for Programming :  1
['Statistics', 'Programming', 2015, 2017, 2018, 'a', 'b', 1, 2, 3, 4, 5, 6, 7]
pop last item in list:  7
pop the item with index 2:  2015
removed b from list:  ['Statistics', 'Programming', 2017, 2018, 'a', 1, 2, 3, 4, 5, 6]
Reverse:  [6, 5, 4, 3, 2, 1, 'a', 2018, 2017, 'Programming', 'Statistics']
Sort ascending:  ['a', 'b', 'c']
Sort descending:  ['c', 'b', 'a']

Listing 1-24. Example Code for Basic Operations on Lists

元组

Python 元组是一系列不可变的 Python 对象，非常类似于列表。然而，列表和元组之间存在一些本质的区别:

与列表不同，元组的对象不能改变。
元组是用括号定义的，而列表是用方括号定义的。

表 1-12 总结了元组操作；代码示例参见清单 1-25 至 1-28 。
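The immutability difference between lists and tuples can be demonstrated directly: item assignment works on a list but raises a TypeError on a tuple. The names below are assumptions made for this sketch.

```python
sample_tuple = ('a', 'b', 'c')
sample_list = ['a', 'b', 'c']

sample_list[0] = 'z'          # lists accept item assignment

try:
    sample_tuple[0] = 'z'     # tuples do not
    tuple_changed = True
except TypeError as error:
    tuple_changed = False
    print("TypeError:", error)

# Concatenation builds a new tuple and leaves the original untouched
longer = sample_tuple + ('d',)
print(sample_tuple, longer)
```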

表 1-12

Python 元组操作

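| Description | Python Expression | Example | Result |
| --- | --- | --- | --- |
| Create a tuple | (item1, item2, …); () # empty tuple; (item1,) # tuple with one item, note the required comma | tuple = ('a', 'b', 'c', 'd', 1, 2, 3); tuple = (); tuple = (1,) | ('a', 'b', 'c', 'd', 1, 2, 3); (); (1,) |
| Access items in a tuple | tuple[index]; tuple[start_index:end_index] | tuple = ('a', 'b', 'c', 'd', 1, 2, 3); tuple[2]; tuple[0:2] | c; ('a', 'b') |
| Delete a tuple | del tuple_name | del tuple | |
| Length | len(tuple) | len((1, 2, 3)) | 3 |
| Concatenation | tuple_1 + tuple_2 | (1, 2, 3) + (4, 5, 6) | (1, 2, 3, 4, 5, 6) |
| Repetition | tuple * int | ('Hello',) * 4 | ('Hello', 'Hello', 'Hello', 'Hello') |
| Membership | item in tuple | 3 in (1, 2, 3) | True |
| Iteration | for x in tuple: print(x) | for x in (1, 2, 3): print(x) | 1 2 3 |
| Count from the right | tuple[-index] | tuple = (1, 2, 3); tuple[-2] | 2 |
| Slicing fetches sections | tuple[index:] | tuple = (1, 2, 3); tuple[1:] | (2, 3) |
| Return the max item | max(tuple) | max((1, 2, 3, 4, 5)) | 5 |
| Return the min item | min(tuple) | min((1, 2, 3, 4, 5)) | 1 |
| Convert a list to a tuple | tuple(seq) | tuple([1, 2, 3, 4]) | (1, 2, 3, 4) |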

Tuple = ()
print ("Empty Tuple: ", Tuple)

Tuple = (1,)
print ("Tuple with a single item: ", Tuple)

Tuple = ('a','b','c','d',1,2,3)
print ("Sample Tuple :", Tuple)
---- output ----
Empty Tuple:  ()
Tuple with a single item:  (1,)
Sample Tuple : ('a', 'b', 'c', 'd', 1, 2, 3)

Listing 1-25. Example Code for Creating a Tuple

Tuple = ('a', 'b', 'c', 'd', 1, 2, 3)

print ("3rd item of Tuple:", Tuple[2])
print ("First 3 items of Tuple", Tuple[0:3])
---- output ----
3rd item of Tuple: c
First 3 items of Tuple ('a', 'b', 'c')

Listing 1-26. Example Code for Accessing a Tuple

print ("Sample Tuple: ",Tuple)
del Tuple
print (Tuple) # Will throw an error message as the tuple does not exist

---- output ----

Sample Tuple:  ('a', 'b', 'c', 'd', 1, 2, 3)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-6-002eefa7c22f> in <module>
      4 print ("Sample Tuple: ",Tuple)
      5 del Tuple
----> 6 print (Tuple) # Will throw an error message as the tuple does not exist

NameError: name 'Tuple' is not defined

Listing 1-27. Example Code for Deleting a Tuple

# Basic Tuple operations
Tuple = ('a','b','c','d',1,2,3)

print ("Length of Tuple: ", len(Tuple))

Tuple_Concat = Tuple + (7,8,9)
print ("Concatenated Tuple: ", Tuple_Concat)

print ("Repetition: ", (1, 'a',2, 'b') * 3)
print ("Membership check: ", 3 in (1,2,3))

# Iteration
for x in (1, 2, 3): print (x)

print ("Negative sign will retrieve item from right: ", Tuple_Concat[-2])
print ("Sliced Tuple [2:] ", Tuple_Concat[2:])

# Find max
print ("Max of the Tuple (1,2,3,4,5,6,7,8,9,10): ", max((1,2,3,4,5,6,7,8,9,10)))
print ("Min of the Tuple (1,2,3,4,5,6,7,8,9,10): ", min((1,2,3,4,5,6,7,8,9,10)))
print ("List [1,2,3,4] converted to tuple: ", type(tuple([1,2,3,4])))
---- output ----
Length of Tuple:  7
Concatenated Tuple:  ('a', 'b', 'c', 'd', 1, 2, 3, 7, 8, 9)
Repetition:  (1, 'a', 2, 'b', 1, 'a', 2, 'b', 1, 'a', 2, 'b')
Membership check:  True
1
2
3
Negative sign will retrieve item from right:  8
Sliced Tuple [2:]  ('c', 'd', 1, 2, 3, 7, 8, 9)
Max of the Tuple (1,2,3,4,5,6,7,8,9,10):  10
Min of the Tuple (1,2,3,4,5,6,7,8,9,10):  1
List [1,2,3,4] converted to tuple:  <class 'tuple'>

Listing 1-28. Example Code for Basic Tuple Operations (Not Exhaustive)

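Sets

As the name implies, sets are an implementation of mathematical sets. Their key characteristics are as follows:

The collection of items is unordered.
No duplicate items are stored, which means that every item is unique.
Sets are mutable, which means that the contents of a set can be changed.

Items can be added to or removed from a set. Mathematical set operations such as union and intersection can be performed on Python sets. Table 1-13 summarizes the Python set operations, Listing 1-29 shows example code for creating sets, and Listing 1-30 shows example code for accessing set elements.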

表 1-13

Python 集合运算

Examples that use A and B assume A = {1, 2, 3, 4, 5} and B = {4, 5, 6, 7, 8}. The order in which set items are displayed is arbitrary.

| Description | Python Expression | Example | Result |
| --- | --- | --- | --- |
| Create a set | {item1, item2, …}; set() # empty set | languages = set(['Python', 'R', 'SAS', 'Julia']) | {'SAS', 'Python', 'R', 'Julia'} |
| Add an item/element to a set | add() | languages.add('SPSS') | {'SAS', 'SPSS', 'Python', 'R', 'Julia'} |
| Remove all items/elements from a set | clear() | languages.clear() | set() |
| Return a copy of a set | copy() | lang = languages.copy(); print(lang) | {'SAS', 'SPSS', 'Python', 'R', 'Julia'} |
| Remove an item/element from a set if it is a member (do nothing if it is not in the set) | discard() | languages = {'C', 'Java', 'Python', 'Data Science', 'Julia', 'SPSS', 'AI', 'R', 'SAS', 'Machine Learning'}; languages.discard('AI') | {'C', 'Java', 'Python', 'Data Science', 'Julia', 'SPSS', 'R', 'SAS', 'Machine Learning'} |
| Remove an item/element from a set; raises a KeyError if the element is not a member | remove() | languages.remove('AI') (starting from the same set as in the discard() row) | {'C', 'Java', 'Python', 'Data Science', 'Julia', 'SPSS', 'R', 'SAS', 'Machine Learning'} |
| Remove and return an arbitrary set element; raises a KeyError if the set is empty | pop() | print("Removed:", languages.pop()) | Removed: C (the element chosen is arbitrary) |
| Return the difference of two or more sets as a new set | difference() | A.difference(B) | {1, 2, 3} |
| Remove all items/elements of another set from this set | difference_update() | A.difference_update(B); print(A) | {1, 2, 3} |
| Return the intersection of two sets as a new set | intersection() | A.intersection(B) | {4, 5} |
| Update the set with the intersection of itself and another | intersection_update() | A.intersection_update(B); print(A) | {4, 5} |
| Return True if two sets have a null intersection | isdisjoint() | A.isdisjoint(B) | False |
| Return True if another set contains this set | issubset() | print(A.issubset(B)) | False |
| Return True if this set contains another set | issuperset() | print(A.issuperset(B)) | False |
| Return the symmetric difference of two sets as a new set | symmetric_difference() | A.symmetric_difference(B) | {1, 2, 3, 6, 7, 8} |
| Update a set with the symmetric difference of itself and another | symmetric_difference_update() | A.symmetric_difference_update(B); print(A) | {1, 2, 3, 6, 7, 8} |
| Return the union of sets in a new set | union() | print(A.union(B)) | {1, 2, 3, 4, 5, 6, 7, 8} |
| Update a set with the union of itself and others | update() | A.update(B); print(A) | {1, 2, 3, 4, 5, 6, 7, 8} |
| Return the length (number of items) of a set | len() | len(A) | 5 |
| Return the largest item in a set | max() | max(A) | 5 |
| Return the smallest item in a set | min() | min(A) | 1 |
| Return a new sorted list from the elements of the set; does not sort the set itself | sorted() | sorted(A) | [1, 2, 3, 4, 5] |
| Return the sum of all items/elements in a set | sum() | sum(A) | 15 |

# Creating an empty set
languages = set()
print (type(languages), languages)

languages = {'Python', 'R', 'SAS', 'Julia'}
print (type(languages), languages)

# set of mixed datatypes
mixed_set = {"Python", (2.7, 3.4)}
print (type(mixed_set), mixed_set)
---- output ----
<class 'set'> set()
<class 'set'> {'R', 'Python', 'SAS', 'Julia'}
<class 'set'> {'Python', (2.7, 3.4)}

Listing 1-29. Example Code for Creating Sets

# Sets do not support indexing, so convert to a list first;
# the order of the items is arbitrary
print (list(languages)[0])
print (list(languages)[0:3])
---- output ----
R
['R', 'Python', 'SAS']

Listing 1-30. Example Code for Accessing Set Elements

在 Python 中更改集合

尽管集合是可变的，但是由于它们是无序的，因此对它们进行索引没有意义。所以集合不支持使用索引或切片来访问或更改项目/元素。add()方法可用于添加单个元素，update()方法可用于添加多个元素。请注意，update()方法可以接受元组、列表、字符串或其他集合格式的参数。但是，在所有情况下，重复项都会被忽略。请参考清单 1-31 中更改集合元素的代码示例。

# initialize a set
languages = {'Python', 'R'}
print(languages)

# add an element
languages.add('SAS')
print(languages)

# add multiple elements
languages.update(['Julia','SPSS'])
print(languages)

# add list and set
languages.update(['Java','C'], {'Machine Learning','Data Science','AI'})
print(languages)
---- output ----
{'R', 'Python'}
{'R', 'Python', 'SAS'}
{'Julia', 'R', 'Python', 'SAS', 'SPSS'}
{'Julia', 'Machine Learning', 'R', 'Python', 'SAS', 'Java', 'C', 'Data Science', 'AI', 'SPSS'}

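Listing 1-31. Example Code for Changing Set Elements

Removing Items from a Set

The discard() or remove() method can be used to remove a specific item from a set. The fundamental difference between them is that discard() takes no action if the item is not present in the set, whereas remove() raises a KeyError in that case. Listing 1-32 gives example code for removing items from a set.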

# remove an element
languages.remove('AI')
print(languages)

# discard an element, although AI has already been removed discard will not throw an error
languages.discard('AI')
print(languages)

# pop( ) removes and returns an arbitrary item from the set
print ("Removed:", (languages.pop( )), "from", languages)
---- output ----
{'Julia', 'Machine Learning', 'R', 'Python', 'SAS', 'Java', 'C', 'Data Science', 'SPSS'}
{'Julia', 'Machine Learning', 'R', 'Python', 'SAS', 'Java', 'C', 'Data Science', 'SPSS'}
Removed: Julia from {'Machine Learning', 'R', 'Python', 'SAS', 'Java', 'C', 'Data Science', 'SPSS'}

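Listing 1-32. Example Code for Removing Items from a Set

Set Operations

As discussed earlier, sets let us use mathematical set operations such as union, intersection, difference, and symmetric difference. We can perform these with either operators or methods.

Set Union

The union of two sets A and B is the set of all items from both sets. There are two ways to perform a union: 1) the | operator and 2) the union() method. Refer to Listing 1-33 for example code for the union operation.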

# initialize A and B
A = {1, 2, 3, 4, 5}
B = {4, 5, 6, 7, 8}

# use | operator
print ("Union of A | B", A|B)

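# alternatively, we can use union( )
print ("Union of A | B", A.union(B))
---- output ----
Union of A | B {1, 2, 3, 4, 5, 6, 7, 8}
Union of A | B {1, 2, 3, 4, 5, 6, 7, 8}

Listing 1-33. Example Code for Set Union Operation

Set Intersection

The intersection of two sets A and B is the set of items present in both sets. There are two ways to perform an intersection: 1) the & operator and 2) the intersection() method. Refer to Listing 1-34 for example code for the set intersection operation.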

# use & operator
print ("Intersection of A & B", A & B)

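# alternatively, we can use intersection( )
print ("Intersection of A & B", A.intersection(B))
---- output ----
Intersection of A & B {4, 5}
Intersection of A & B {4, 5}

Listing 1-34. Example Code for Set Intersection Operation

Set Difference

The difference of two sets A and B (that is, A - B) is the set of items present in A but not in B. There are two ways to perform a difference: 1) the - operator and 2) the difference() method. Refer to Listing 1-35 for example code for the set difference operation.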

# use - operator on A
print ("Difference of A - B", A - B)

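# alternatively, we can use difference( )
print ("Difference of A - B", A.difference(B))
---- output ----
Difference of A - B {1, 2, 3}
Difference of A - B {1, 2, 3}

Listing 1-35. Example Code for Set Difference Operation

Set Symmetric Difference

The symmetric difference of two sets A and B is the set of items that appear in either set but not in both. There are two ways to perform a symmetric difference: 1) the ^ operator and 2) the symmetric_difference() method. Refer to Listing 1-36 for example code for the set symmetric difference operation.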

# use ^ operator
print ("Symmetric difference of A ^ B", A ^ B)

# alternatively, we can use symmetric_difference( )
print ("Symmetric difference of A ^ B", A.symmetric_difference(B))
---- output ----
Symmetric difference of A ^ B {1, 2, 3, 6, 7, 8}
Symmetric difference of A ^ B {1, 2, 3, 6, 7, 8}

Listing 1-36. Example Code for Set Symmetric Difference Operation
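The operator and method forms shown in the set operation listings are not fully interchangeable. The sketch below, which uses an illustrative list named items that is not part of the original listings, suggests one practical difference: the methods accept any iterable, whereas the operators expect both operands to be sets.

```python
# A matches the set used in the listings; `items` is an illustrative list.
A = {1, 2, 3, 4, 5}
items = [4, 5, 6]  # a list, not a set

# Methods such as union() accept any iterable argument
union_via_method = A.union(items)

# Operators such as | require both operands to be sets
try:
    A | items
    operator_worked = True
except TypeError:
    operator_worked = False

print(union_via_method)
print(operator_worked)
```

When the second operand may be a list or tuple, the method form avoids an explicit set() conversion.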

Basic Operations

Let's look at the basic operations that can be performed on Python sets in the Listing 1-37 code example.

# Return a shallow copy of a set
lang = languages.copy( )
print (languages)
print (lang)

# initialize A and B
A = {1, 2, 3, 4, 5}
B = {4, 5, 6, 7, 8}

print (A.isdisjoint(B))   # True when the two sets have no items in common
print (A.issubset(B))     # True when another set contains this set
print (A.issuperset(B))   # True when this set contains another set
sorted(B)                 # Return a new sorted list (result is discarded here, so nothing is printed)
print (sum(A))            # Return the sum of all items
print (len(A))            # Return the number of items
print (min(A))            # Return the smallest item
print (max(A))            # Return the largest item
---- output ----
{'Machine Learning', 'R', 'Python', 'SAS', 'Java', 'C', 'Data Science', 'SPSS'}
{'Machine Learning', 'R', 'Python', 'SAS', 'Java', 'C', 'Data Science', 'SPSS'}
False
False
False
15
5
1
5

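Listing 1-37. Example Code for Basic Operations on Sets

Dictionaries

Each item in a Python dictionary is a key-value pair. The key-value pairs are enclosed in curly braces. Each key is separated from its value by a colon (:), and items are separated by commas (,). Note that keys are unique within a dictionary and must be of an immutable data type such as strings, numbers, or tuples, whereas values can be of any type and may repeat. Table 1-14 summarizes Python dictionary operations; refer to Listings 1-38 through 1-42 for example code.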

Table 1-14

Python Dictionary Operations

| Description | Python Expression | Example | Result |
| --- | --- | --- | --- |
| Create a dictionary | dict = {'key1': 'value1', 'key2': 'value2', ...} | dict = {'Name': 'Jivin', 'Age': 8, 'Class': 'Three'} | {'Name': 'Jivin', 'Age': 8, 'Class': 'Three'} |
| Access an item in the dictionary | dict['key'] | dict['Name'] | dict['Name']: Jivin |
| Delete from or delete a dictionary | del dict['key']; dict.clear(); del dict | del dict['Name']; dict.clear(); del dict | {'Age': 8, 'Class': 'Three'}; {}; dictionary deleted |
| Update a dictionary | dict['key'] = new_value | dict['Age'] = 8.5 | dict['Age']: 8.5 |
| Length | len(dict) | len({'Name': 'Jivin', 'Age': 8, 'Class': 'Three'}) | 3 |
| String representation of a dictionary | str(dict) | dict = {'Name': 'Jivin', 'Age': 8}; print("Equivalent String: ", str(dict)) | Equivalent String:  {'Name': 'Jivin', 'Age': 8} |
| Return a shallow copy of the dictionary | dict.copy() | dict = {'Name': 'Jivin', 'Age': 8}; dict1 = dict.copy(); print(dict1) | {'Name': 'Jivin', 'Age': 8} |
| Create a new dictionary with keys from seq and all values set to value | dict.fromkeys(seq, value) | seq = ('name', 'age', 'sex'); dict = dict.fromkeys(seq); print("New Dictionary: ", str(dict)); dict = dict.fromkeys(seq, 10); print("New Dictionary: ", str(dict)) | New Dictionary:  {'name': None, 'age': None, 'sex': None}; New Dictionary:  {'name': 10, 'age': 10, 'sex': 10} |
| For a given key, return its value, or the default if the key is not in the dictionary | dict.get(key, default=None) | dict = {'Name': 'Jivin', 'Age': 8}; print("Value for Age: ", dict.get('Age')); print("Value for Education: ", dict.get('Education', "Third Grade")) | Value for Age:  8; Value for Education:  Third Grade |
| Return True if the key is in the dictionary, otherwise False | key in dict (dict.has_key(key) exists only in Python 2) | dict = {'Name': 'Jivin', 'Age': 8}; print("Age exists? ", 'Age' in dict); print("Sex exists? ", 'Sex' in dict) | Age exists?  True; Sex exists?  False |
| Return a view of the dictionary's (key, value) tuple pairs | dict.items() | dict = {'Name': 'Jivin', 'Age': 8}; print("dict items: ", dict.items()) | dict items:  dict_items([('Name', 'Jivin'), ('Age', 8)]) |
| Return a view of the dictionary's keys | dict.keys() | dict = {'Name': 'Jivin', 'Age': 8}; print("dict keys: ", dict.keys()) | dict keys:  dict_keys(['Name', 'Age']) |
| Similar to get(), but sets dict[key] = default if the key is not already in the dictionary | dict.setdefault(key, default=None) | dict = {'Name': 'Jivin', 'Age': 8}; print("Value for Age: ", dict.setdefault('Age', None)); print("Value for Sex: ", dict.setdefault('Sex', None)) | Value for Age:  8; Value for Sex:  None |
| Add the key-value pairs of dictionary dict2 to dict | dict.update(dict2) | dict = {'Name': 'Jivin', 'Age': 8}; dict2 = {'Sex': 'male'}; dict.update(dict2); print("dict.update(dict2) = ", dict) | dict.update(dict2) =  {'Name': 'Jivin', 'Age': 8, 'Sex': 'male'} |
| Return a view of the dictionary's values | dict.values() | dict = {'Name': 'Jivin', 'Age': 8}; print("Value: ", dict.values()) | Value:  dict_values(['Jivin', 8]) |

# Basic operations

dict = {'Name': 'Jivin', 'Age': 8, 'Class': 'Three'}
print ("Length of dict: ", len(dict))

dict1 = {'Name': 'Jivin', 'Age': 8};
dict2 = {'Name': 'Pratham', 'Age': 9};
dict3 = {'Name': 'Pranuth', 'Age': 7};
dict4 = {'Name': 'Jivin', 'Age': 8};

# String representation of dictionary
dict = {'Name': 'Jivin', 'Age': 8}
print ("Equivalent String: ", str (dict))

# Copy the dict
dict1 = dict.copy( )
print (dict1)

# Create new dictionary with keys from tuple and values to set value
seq = ('name', 'age', 'sex')

dict = dict.fromkeys(seq)
print ("New Dictionary: ", str(dict))

dict = dict.fromkeys(seq, 10)
print ("New Dictionary: ", str(dict))

# Retrieve value for a given key
dict = {'Name': 'Jivin', 'Age': 8};
print ("Value for Age: ", dict.get('Age'))
# Since the key Education does not exist, the second argument will be returned
print ("Value for Education: ", dict.get('Education', "First Grade"))

# Check if key in dictionary
print ("Age exists? ", 'Age' in dict)
print ("Sex exists? ", 'Sex' in dict)

# Return items of dictionary
print ("dict items: ", dict.items( ))

# Return items of keys
print ("dict keys: ", dict.keys( ))

# return values of dict
print ("Value of dict: ",  dict.values( ))

# if key does not exists, then the arguments will be added to dict and returned
print ("Value for Age : ", dict.setdefault('Age', None))
print ("Value for Sex: ", dict.setdefault('Sex', None))

# Concatenate dicts
dict = {'Name': 'Jivin', 'Age': 8}
dict2 = {'Sex': 'male' }

dict.update(dict2)
print ("dict.update(dict2) = ",  dict)
---- output ----
Length of dict:  3
Equivalent String:  {'Name': 'Jivin', 'Age': 8}
{'Name': 'Jivin', 'Age': 8}
New Dictionary:  {'name': None, 'age': None, 'sex': None}
New Dictionary:  {'name': 10, 'age': 10, 'sex': 10}
Value for Age:  8
Value for Education:  First Grade
Age exists?  True
Sex exists?  False
dict items:  dict_items([('Name', 'Jivin'), ('Age', 8)])
dict keys:  dict_keys(['Name', 'Age'])
Value of dict:  dict_values(['Jivin', 8])
Value for Age :  8
Value for Sex:  None
dict.update(dict2) =  {'Name': 'Jivin', 'Age': 8, 'Sex': 'male'}

Listing 1-42. Example Code for Basic Operations on the Dictionary

# Updating a dictionary

dict = {'Name': 'Jivin', 'Age': 8, 'Class': 'Three'}
print ("Sample dictionary: ", dict)
dict['Age'] = 8.5

print ("Dictionary post age value update: ", dict)
---- output ----
Sample dictionary:  {'Name': 'Jivin', 'Age': 8, 'Class': 'Three'}
Dictionary post age value update:  {'Name': 'Jivin', 'Age': 8.5, 'Class': 'Three'}

Listing 1-41. Example Code for Updating the Dictionary

# Deleting a dictionary
dict = {'Name': 'Jivin', 'Age': 8, 'Class': 'Three'}
print ("Sample dictionary: ", dict)
del dict['Name'] # Delete specific item
print ("Sample dictionary post deletion of item Name:", dict)

dict = {'Name': 'Jivin', 'Age': 8, 'Class': 'Three'}
dict.clear( ) # Clear all the contents of dictionary
print ("dict post dict.clear( ):", dict)

dict = {'Name': 'Jivin', 'Age': 8, 'Class': 'Three'}
del dict # Delete the dictionary
---- output ----

Sample dictionary:  {'Name': 'Jivin', 'Age': 8, 'Class': 'Three'}
Sample dictionary post deletion of item Name: {'Age': 8, 'Class': 'Three'}
dict post dict.clear( ): {}

Listing 1-40. Example Code for Deleting a Dictionary

print ("Value of key Name, from sample dictionary:", dict['Name'])
---- output ----
Value of key Name, from sample dictionary: Jivin

Listing 1-39. Example Code for Accessing the Dictionary

# Creating a dictionary
dict = {'Name': 'Jivin', 'Age': 8, 'Class': 'Three'}

print ("Sample dictionary: ", dict)
---- output ----
Sample dictionary:  {'Name': 'Jivin', 'Age': 8, 'Class': 'Three'}

Listing 1-38. Example Code for Creating a Dictionary
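The get() and setdefault() methods summarized in Table 1-14 look alike but differ in one side effect. The following is a minimal sketch, using a hypothetical dictionary named student (chosen to avoid shadowing the built-in dict type), of how get() leaves the dictionary untouched while setdefault() inserts the missing key.

```python
# `student` is an illustrative name; the values mirror the chapter's examples.
student = {'Name': 'Jivin', 'Age': 8}

# get() returns the default but does not modify the dictionary
grade = student.get('Class', 'Unknown')
has_class_after_get = 'Class' in student

# setdefault() returns the default and also stores it under the missing key
grade2 = student.setdefault('Class', 'Three')
has_class_after_setdefault = 'Class' in student

# items() yields (key, value) pairs, which is the usual way to loop over a dictionary
for key, value in student.items():
    print(key, value)
```

Choosing a name other than dict also keeps the built-in dict() constructor available later in the same program.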

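User-Defined Functions

A user-defined function is a block of related code statements organized to perform a single, related action. A key objective of user-defined functions is to encourage modularity and enable code reuse.

Defining a Function

Functions must be defined before they are used. The following rules apply when defining a function in Python.

The keyword def marks the start of a function block, followed by the function name and opening and closing parentheses. A colon (:) after that marks the end of the function header.
A function can accept arguments or parameters. Any such input arguments or parameters are placed inside the parentheses of the function header.
The main code statements go below the function header and must be indented, which indicates that the code is part of the same function.
A function can return an expression to the caller with the return statement. A function that ends without a return statement behaves like a subprocedure: it performs its work and returns None rather than a useful value.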
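The last rule above can be checked directly. In this small sketch, the function names greet and make_greeting are illustrative and not from the original text; the first has no return statement and the second does.

```python
def greet(name):
    # No return statement: the function acts like a subprocedure
    print("Hello", name)


def make_greeting(name):
    # The return statement hands a value back to the caller
    return "Hello " + name


result_without_return = greet("Jivin")
result_with_return = make_greeting("Jivin")

print(result_without_return)
print(result_with_return)
```

Python still returns an object in the first case, namely None, which is why assigning the result of a print-only function rarely does what was intended.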

Syntax for creating a function without arguments:

def function_name( ):
    1st block line
    2nd block line
    ...

Refer to Listings 1-43 and 1-44 for examples of user-defined functions.

# Simple function
def someFunction( ):
    print ("Hello World")

# Call the function
someFunction( )
----- output -----
Hello World

Listing 1-43. Example Code for Creating Functions Without Arguments

The following is the syntax for creating a function with arguments:

def function_name(parameters):
    1st block line
    2nd block line
    ...
    return [expression]

# simple function to add two numbers
def sum_two_numbers(a, b):
    return a + b

# after this line x will hold the value 3!
x = sum_two_numbers(1,2)
print (x)

# You can also set default value for argument(s) in a function. In the below example value of b is set to 10 as default
def sum_two_numbers(a, b = 10):
    return a + b

print (sum_two_numbers(10))
print (sum_two_numbers(10, 5))
----- output -----
3
20
15

Listing 1-44. Example Code for Creating Functions with Arguments

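Scope of Variables

The scope of a variable determines where in a program the variable or identifier is available during and after execution. Python has two basic variable scopes:

Global variables
Local variables

Refer to Listing 1-45 for example code that defines variable scopes.

Note that Python lets a function read a global variable without you explicitly declaring it as global.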

# Global variable
a = 10

# Simple function to add two numbers
def sum_two_numbers(b):
    return a + b

# Call the function and print result
print (sum_two_numbers(10)) 

----- output -----
20

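Listing 1-45. Example Code for Defining Variable Scopes

Default Arguments

You can define default values for a function's parameters, which means the function uses the default value if no value is supplied for that parameter in the function call. Refer to Listing 1-46 for example code.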

# Simple function to add two number with b having default value of 10
def sum_two_numbers(a, b = 10):
    return a + b
# Call the function and print result
print (sum_two_numbers(10))
print (sum_two_numbers(10, 5))
----- output -----
20
15

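Listing 1-46. Example Code for Function with Default Argument

Variable-Length Arguments

There are cases where you do not know the exact number of arguments when defining a function and want to handle all of them dynamically. Python's answer is variable-length arguments, which let you process more arguments than you specified when defining the function. *args and **kwargs are the common idioms for accepting a dynamic number of arguments.

*args provides all positional function arguments as a tuple. Refer to Listings 1-47 and 1-48 for example code.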

# Simple function to loop through arguments and print them
def foo(*args):
    for a in args:
        print (a)

# Call the function
foo(1,2,3)
----- output -----
1
2
3

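Listing 1-47. Example Code for Passing Arguments as *args

**kwargs lets you handle named or keyword arguments that were not defined in advance; they arrive in the function as a dictionary.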

# Simple function to loop through arguments and print them
def foo(**kwargs):
    for a in kwargs:
        print (a, kwargs[a])

# Call the function
foo(name='Jivin', age=8)
----- output -----
name Jivin
age 8

Listing 1-48. Example Code for Passing Arguments as **kwargs
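The two idioms can also be combined in one signature. The sketch below uses a hypothetical function named describe (not from the original text) to show a required parameter followed by *args and **kwargs, and the reverse operation of unpacking a tuple and a dictionary at the call site.

```python
def describe(title, *args, **kwargs):
    """Return a one-line summary of everything the caller passed in."""
    positional = ", ".join(str(a) for a in args)                # args is a tuple
    named = ", ".join(f"{k}={v}" for k, v in kwargs.items())    # kwargs is a dict
    return f"{title}: [{positional}] {{{named}}}"


# Extra positional values land in args, extra keyword values land in kwargs
summary = describe("student", 1, 2, 3, name="Jivin", age=8)
print(summary)

# The same * and ** symbols unpack existing collections when calling a function
numbers = (1, 2, 3)
options = {"name": "Jivin", "age": 8}
same_summary = describe("student", *numbers, **options)
print(same_summary)
```

The order in the signature matters: ordinary parameters come first, then *args, then **kwargs.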

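Modules

A module is a logically organized collection of independent but related code, functions, or classes. The key principle behind creating modules is that they are easier to understand, use, and maintain. When you import a module, the Python interpreter searches for it in the following order.

First, it searches the current directory, that is, the directory from which the Python program was invoked. If the module is not found there, Python searches each directory in the PYTHONPATH environment variable. If that fails, it searches the default package installation path.

Note that the module search path is stored as the sys.path variable in the system module named sys; it contains the current directory, PYTHONPATH, and the installation-dependent defaults.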
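The search order described above can be inspected on any installation. This is a small sketch; the directories printed depend entirely on the local environment, so no particular output is implied.

```python
import sys

# sys.path is an ordinary list of directories, searched from first to last
for position, directory in enumerate(sys.path[:5]):
    print(position, directory)

search_path_is_list = isinstance(sys.path, list)
print(search_path_is_list)
```

Because sys.path is a plain list, a program can append a directory to it at run time, although setting PYTHONPATH is usually the cleaner option.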

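When you import a module, it is loaded only once, regardless of the number of times it is imported. You can also import specific elements (functions, classes, etc.) from a module into the current namespace. Refer to Listing 1-49 for example code for importing modules.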

# Import a whole module; its contents are accessed as module_name.function_name
import module_name           # Method 1
# Import every public name from a module into the current namespace
from module_name import *    # Method 2

# Import a specific function from a module
# Syntax: from module_name import function_name
from os import getcwd

Listing 1-49. Example Code for Importing Modules

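Internally, Python keeps namespaces as dictionaries that store each variable or identifier name as a key, with the corresponding Python object as its value. There are two types of namespaces, local and global. The global namespace holds the names defined at module level, while each class and function has its own local namespace, created when it runs. If a local and a global variable have the same name, the local variable shadows the global one. Python assumes that any variable assigned a value inside a function is local; to assign to a global variable from inside a function, you must declare it explicitly with the global statement.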
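The shadowing rule and the global statement can be seen side by side. This is a minimal sketch meant to be run as a script; the names counter, shadow, and increment are illustrative.

```python
counter = 0  # lives in the global namespace


def shadow():
    # Assignment inside a function creates a local name that hides the global
    counter = 100
    return counter


def increment():
    # The global statement makes the assignment target the module-level name
    global counter
    counter += 1
    return counter


local_value = shadow()
after_shadow = counter        # the global is untouched by shadow()
after_increment = increment()

print(local_value, after_shadow, after_increment)
```

Without the global statement, the line counter += 1 would raise UnboundLocalError, because Python would treat counter as a local variable that has not been assigned yet.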

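Another key built-in function is dir(); running it on a module returns a sorted list of strings containing the names of all the modules, variables, and functions defined in that module. Refer to Listing 1-50 for example code.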

import os
content = dir(os)
print(content)

---- output ----
['DirEntry', 'F_OK', 'MutableMapping', 'O_APPEND', 'O_BINARY', 'O_CREAT', 'O_EXCL', 'O_NOINHERIT', 'O_RANDOM', 'O_RDONLY', 'O_RDWR', 'O_SEQUENTIAL', 'O_SHORT_LIVED', 'O_TEMPORARY', 'O_TEXT', 'O_TRUNC', 'O_WRONLY', 'P_DETACH', 'P_NOWAIT', 'P_NOWAITO', 'P_OVERLAY', 'P_WAIT', 'PathLike', 'R_OK', 'SEEK_CUR', 'SEEK_END', 'SEEK_SET', 'TMP_MAX', 'W_OK', 'X_OK', '_Environ', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', '_execvpe', '_exists', '_exit', '_fspath', '_get_exports_list', '_putenv', '_unsetenv', '_wrap_close', 'abc', 'abort', 'access', 'altsep', 'chdir', 'chmod', 'close', 'closerange', 'cpu_count', 'curdir', 'defpath', 'device_encoding', 'devnull', 'dup', 'dup2', 'environ', 'error', 'execl', 'execle', 'execlp', 'execlpe', 'execv', 'execve', 'execvp', 'execvpe', 'extsep', 'fdopen', 'fsdecode', 'fsencode', 'fspath', 'fstat', 'fsync', 'ftruncate', 'get_exec_path', 'get_handle_inheritable', 'get_inheritable', 'get_terminal_size', 'getcwd', 'getcwdb', 'getenv', 'getlogin', 'getpid', 'getppid', 'isatty', 'kill', 'linesep', 'link', 'listdir', 'lseek', 'lstat', 'makedirs', 'mkdir', 'name', 'open', 'pardir', 'path', 'pathsep', 'pipe', 'popen', 'putenv', 'read', 'readlink', 'remove', 'removedirs', 'rename', 'renames', 'replace', 'rmdir', 'scandir', 'sep', 'set_handle_inheritable', 'set_inheritable', 'spawnl', 'spawnle', 'spawnv', 'spawnve', 'st', 'startfile', 'stat', 'stat_result', 'statvfs_result', 'strerror', 'supports_bytes_environ', 'supports_dir_fd', 'supports_effective_ids', 'supports_fd', 'supports_follow_symlinks', 'symlink', 'sys', 'system', 'terminal_size', 'times', 'times_result', 'truncate', 'umask', 'uname_result', 'unlink', 'urandom', 'utime', 'waitpid', 'walk', 'write']

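Listing 1-50. Example Code for the dir( ) Operation

Looking at the preceding output, __name__ is a special string variable that holds the module's name, and __file__ is the name of the file from which the module was loaded.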

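File Input/Output

Python provides simple functions for reading and writing file information (Table 1-15). To perform a read or write operation on a file, we first need to open it. Once the required operation is complete, the file needs to be closed so that all resources tied to it are released.

The following is the sequence of a file operation:

Open a file
Perform a read or write operation
Close the file

Table 1-15

File Input/Output Operations

| Description | Syntax | Example |
| --- | --- | --- |
| Open a file | obj = open(filename, access_mode, buffering) | f = open('vehicles.txt', 'w') |
| Read from a file | fileobject.read(size) | f = open('vehicles.txt'); f.readlines() |
| Close a file | fileobject.close() | f.close() |
| Write to a file | fileobject.write(string) | vehicles = ['scooter\n', 'bike\n', 'car\n']; f = open('vehicles.txt', 'w'); f.writelines(vehicles); f.close() |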

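Opening a File

When opening a file, the access_mode determines how the file is opened, that is, read, write, append, and so on. Read (r) is the default file access mode, and the argument is optional. Refer to Table 1-16 for the file opening modes and to Listing 1-51 for example code.

Table 1-16

File Opening Modes

| Mode | Description |
| --- | --- |
| r | Read only |
| rb | Read only in binary format |
| r+ | File is opened for both reading and writing |
| rb+ | File is opened for both reading and writing in binary format |
| w | Write only |
| wb | Write only in binary format |
| w+ | Open for both writing and reading; overwrites the file if it exists, otherwise creates it |
| wb+ | Open for both writing and reading in binary format; overwrites the file if it exists, otherwise creates it |
| a | Opens the file in append mode; creates the file if it does not exist |
| ab | Opens the file in append mode in binary format; creates the file if it does not exist |
| a+ | Opens the file for both appending and reading; creates the file if it does not exist |
| ab+ | Opens the file for both appending and reading in binary format; creates the file if it does not exist |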

# Below code will create a file named vehicles.txt and add the items. \n is a newline character
vehicles = ['scooter\n', 'bike\n', 'car\n']
f = open('vehicles.txt', 'w')
f.writelines(vehicles)
f.close( )

# Reading from file
f = open('vehicles.txt')
print (f.readlines( ))
f.close( )

---- output ----

['scooter\n', 'bike\n', 'car\n']

Listing 1-51. Example Code for File Operations
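A common alternative to calling close() by hand is the with statement, which closes the file automatically when the block ends, even if an error occurs inside it. The sketch below repeats the vehicles example under that pattern; writing into a temporary directory is an assumption made here so the example does not touch real files.

```python
import os
import tempfile

vehicles = ['scooter\n', 'bike\n', 'car\n']

# A temporary directory keeps this sketch from leaving files behind
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'vehicles.txt')

    # The file is closed as soon as the with block ends
    with open(path, 'w') as f:
        f.writelines(vehicles)
    closed_after_write = f.closed

    with open(path) as f:
        lines = f.readlines()

print(closed_after_write)
print(lines)
```

This covers the open, operate, and close sequence in one construct, which removes the risk of forgetting the final step.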

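Exception Handling

Any error that occurs during the execution of a Python program and disrupts the program's expected flow is called an exception. Your program should be designed to handle both expected and unexpected errors.

Python has a rich set of built-in exceptions, listed in Table 1-17, that are raised when something goes wrong in a program.

The following is the list of Python standard exceptions as described in the official Python documentation ( https://docs.python.org/2/library/exceptions.html )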

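Table 1-17

Python Built-in Exceptions

| Exception Name | Description |
| --- | --- |
| Exception | Base class for all exceptions |
| StopIteration | Raised when the next() method of an iterator does not point to any object |
| SystemExit | Raised by the sys.exit() function |
| StandardError | Base class for all built-in exceptions except StopIteration and SystemExit (Python 2 only; removed in Python 3) |
| ArithmeticError | Base class for all errors that occur in numeric calculation |
| OverflowError | Raised when a calculation exceeds the maximum limit for a numeric type |
| FloatingPointError | Raised when a floating-point calculation fails |
| ZeroDivisionError | Raised when division or modulo by zero takes place for any numeric type |
| AssertionError | Raised when an assert statement fails |
| AttributeError | Raised when an attribute reference or assignment fails |
| EOFError | Raised when there is no input from the raw_input() or input() function and the end of file is reached |
| ImportError | Raised when an import statement fails |
| KeyboardInterrupt | Raised when the user interrupts program execution, usually by pressing Ctrl+C |
| LookupError | Base class for all lookup errors |
| IndexError | Raised when an index is not found in a sequence |
| KeyError | Raised when the specified key is not found in the dictionary |
| NameError | Raised when an identifier is not found in the local or global namespace |
| UnboundLocalError | Raised when trying to access a local variable in a function or method before a value has been assigned to it |
| EnvironmentError | Base class for all exceptions that occur outside the Python environment (an alias of OSError in Python 3) |
| IOError | Raised when an input/output operation fails, such as when the print statement or the open() function tries to open a file that does not exist (an alias of OSError in Python 3) |
| OSError | Raised for operating system-related errors |
| SyntaxError | Raised when there is an error in Python syntax |
| IndentationError | Raised when indentation is not specified properly |
| SystemError | Raised when the interpreter finds an internal problem; the Python interpreter does not exit when this error is encountered |
| SystemExit | Raised when the Python interpreter is quit by using the sys.exit() function; if not handled in the code, it causes the interpreter to exit |
| TypeError | Raised when an operation or function is attempted that is invalid for the specified data type |
| ValueError | Raised when the built-in function for a data type receives arguments of a valid type but with invalid values |
| RuntimeError | Raised when a generated error does not fall into any category |
| NotImplementedError | Raised when an abstract method that needs to be implemented in an inherited class is not actually implemented |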

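You can handle exceptions in your Python program by using the try, raise, except, and finally statements.

try and except: The try clause holds any critical operation that could raise an exception in your program; the except clause holds the code that handles the exception. Refer to Listing 1-52 for example code for exception handling.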

import sys

try:
    a = 1
    b = 1
    print ("Result of a/b: ", a / b)
except (ZeroDivisionError):
    print ("Can't divide by zero")
except (TypeError):
    print ("Wrong data type, division is allowed on numeric data type only")
except:
    print ("Unexpected error occurred", '\n', "Error Type: ", sys.exc_info( )[0], '\n', "Error Msg: ", sys.exc_info( )[1])
---- output ----
Result of a/b:  1.0

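Listing 1-52. Example Code for Exception Handling

Note

1) Changing the value of b to zero in the preceding code prints the statement "Can't divide by zero".

2) Replacing 'a' with 'A' in the divide statement prints the following output:

Unexpected error occurred

Error Type:  <class 'NameError'>

Error Msg:  name 'A' is not defined

finally: This is an optional clause intended to define cleanup actions that must be executed under all circumstances.

Refer to Listing 1-53 for example code for exception handling with file operations.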

# Below code will open a file and try to convert the content to integer
import sys

f = None
try:
    f = open('C:\\Users\\Manoh\\Documents\\vehicles.txt')
    s = f.readline( )
    print (s)
    i = int(s.strip( ))
except IOError as e:
    print ("I/O error({0}): {1}".format(e.errno, e.strerror))
except ValueError:
    print ("Could not convert data to an integer.")
except:
    print ("Unexpected error occurred", '\n', "Error Type: ", sys.exc_info( )[0], '\n', "Error Msg: ", sys.exc_info( )[1])
finally:
    # Close the file only if open( ) succeeded
    if f is not None:
        f.close( )
        print ("file has been closed")
---- output ----
scooter
Could not convert data to an integer.
file has been closed

Listing 1-53. Example Code for Exception Handling with File Operations

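Python always executes the finally clause before leaving the try statement, regardless of whether an exception occurred. If an exception raised in the try clause is not handled by an except clause, it is re-raised after the finally clause has executed. Refer to Figure 1-2 for the ideal code flow of an error handler. If a statement such as break, continue, or return forces the program to leave the try clause, finally is still executed on the way out.

Figure 1-2

Code flow for error handler

Note that it is generally a best practice to follow the single exit point principle by using finally. This means that after the main code executes successfully, or after the error handler finishes handling an error, control should pass through finally so that the code exits at the same point under all circumstances.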
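The claim that finally runs even when return leaves the try clause can be traced with a small sketch. The function name safe_divide and the events list are illustrative; the list simply records the order in which the clauses run.

```python
events = []  # records which clauses ran, in order


def safe_divide(a, b):
    try:
        events.append("try")
        return a / b               # finally still runs before this value is returned
    except ZeroDivisionError:
        events.append("except")
        return None
    finally:
        events.append("finally")   # the single exit point for both paths


ok = safe_divide(1, 2)
failed = safe_divide(1, 0)

print(ok, failed)
print(events)
```

Both the successful call and the failing call pass through finally, which is the single exit point behavior described above.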

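Summary

In this chapter, I have tried to cover the basics and essential topics of Python 3. There are many online and offline resources that can help you deepen your knowledge of Python as a programming language. Table 1-18 lists some useful resources for future reference.

Table 1-18

Additional Resources

| Resource | Description | Mode |
| --- | --- | --- |
| http://docs.python-guide.org/en/latest/intro/learning/ | This is the official tutorial by Python; it covers all the basics and offers a detailed tour of the language and standard library. | Online |
| http://awesome-python.com/ | A curated list of awesome Python frameworks, libraries, software, and resources | Online |
| The Hacker's Guide to Python | This book is aimed at developers who already know Python but want to learn from more experienced Python developers. | Book |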
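2. Introduction to Machine Learning

Machine learning (ML) is a subfield of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence (AI). Let's look at a couple of other definitions of ML:

In 1959, Arthur Samuel, an American pioneer in the fields of computer gaming, ML, and AI, defined machine learning as a "field of study that gives computers the ability to learn without being explicitly programmed."
ML is a field of computer science that involves using statistical methods to create programs that either improve in performance over time or detect patterns in massive amounts of data that humans would be unlikely to find.

Both definitions are correct. In short, ML is a collection of algorithms and techniques used to create computational systems that learn from data in order to make predictions and inferences.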

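ML applications abound. Let's look at some of the most common day-to-day applications of ML that happen around us.

Recommendation system: YouTube uses a recommendation system to suggest to each of its users the videos it believes that user will be interested in. Similarly, Amazon and other such e-retailers suggest products a customer will be interested in and likely to purchase by looking at the customer's purchase history and a large inventory of products.

Spam detection: E-mail service providers use an ML model that can automatically detect unsolicited messages and move them to the spam folder.

Prospect customer identification: Banks, insurance companies, and financial organizations have ML models that trigger alerts so that the organization can intervene at the right time, start engaging the customer with the right offers, and persuade them to convert early. These models observe the pattern of behavior of a user during the initial period and map it to the past behavior of all users, to try to identify those who will buy the product and those who will not.

In this chapter, we will learn about the history and evolution of ML to understand where it fits in the wider AI family. We will also learn about the related forms and terms that exist in parallel to ML, such as statistics, data or business analytics, and data science, and why they exist. The high-level categories of ML and the most commonly used frameworks for building efficient ML systems are also discussed. We will also briefly look at the key ML libraries for data analysis.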

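History and Evolution

ML is a subset of AI, so let's first understand what AI is and where ML fits under its wider umbrella. AI is a broad term that aims at using data to offer solutions to existing problems. It is the science and engineering of replicating, and even surpassing, human-level intelligence in machines. That means observing or reading, learning, sensing, and experiencing.

The AI process loop is shown in Figure 2-1.

Figure 2-1

AI process loop

Observe: Identify patterns using the data.
Plan: Find all possible solutions.
Optimize: Find the optimal solution from the list of possible solutions.
Action: Execute the optimal solution.
Learn and adapt: Is the result giving the expected outcome? If not, adapt.

The AI process loop can be implemented using intelligent agents. A robotic intelligent agent can be defined as a component that can perceive its environment through different kinds of sensors (camera, infrared, etc.) and take actions within that environment. Here, robotic agents are designed to mirror humans. We have different sensory organs such as eyes, ears, nose, tongue, and skin to perceive our environment, and organs such as hands, legs, and mouth are the effectors that enable us to act within our environment based on what we perceive.

Stuart J. Russell and Peter Norvig discuss the design of agents in detail in their book Artificial Intelligence: A Modern Approach. Figure 2-2 is a sample illustration.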

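Figure 2-2

Depiction of the robotic intelligent agent concept, which interacts with its environment through sensors and effectors

To understand the concept better, let's look at the components of an intelligent agent designed for a particular environment or use case (Table 2-1). Consider designing an automated taxi driver.

Table 2-1

Example Components of an Intelligent Agent

| Intelligent Agent's Component Name | Description |
| --- | --- |
| Agent type | Taxi driver |
| Goals | Safe trip, legal, comfortable trip, maximize profits, convenient, fast |
| Environment | Roads, traffic, signals, signage, pedestrians, customers |
| Percepts | Speedometer, microphone, GPS, cameras, sonar, sensors |
| Actions | Steer, accelerate, brake, talk to passenger |

The taxi driver robotic intelligent agent will need to know its location, the direction in which it is traveling, the speed at which it is traveling, and what else is on the road. This information can be obtained from percept devices such as controllable cameras in appropriate places, the speedometer, the odometer, and accelerometers. To understand the mechanical state of the vehicle and engine, it needs electrical system sensors. In addition, a satellite global positioning system (GPS) can help provide accurate position information with respect to an electronic map, and infrared/sonar sensors can detect distances to other cars or obstacles around it. The actions available to the intelligent taxi driver agent are control over the engine through the pedals for accelerating and braking, and steering for controlling the direction. There should also be a way to interact with or talk to the passengers to understand the destination or goal.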

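In 1950, the well-known computer scientist Alan Turing proposed a test, known as the Turing test, in his famous paper "Computing Machinery and Intelligence." The test was designed to provide a satisfactory operational definition of intelligence, which requires that a human being should not be able to distinguish the machine from another human being by the replies each gives to questions.

To be able to pass the Turing test, the computer should possess the following capabilities:

Natural language processing: To be able to communicate successfully in a chosen language
Knowledge representation: To store information provided before or during the interrogation, which helps in finding information, making decisions, and planning. This is also known as an expert system.
Automated reasoning (speech): To use the stored knowledge map information to answer questions and to draw new conclusions where required
Machine learning: To analyze data to detect and extrapolate patterns that will help adapt to new circumstances
Computer vision: To perceive objects or to analyze images to find features of the images
Robotics: Devices that can manipulate and interact with their environment, that is, move objects around based on the circumstances
Planning, scheduling, and optimization: To figure out ways to make decision plans or achieve specified goals, as well as to analyze the performance of the plans and designs

The seven capability areas of AI mentioned here have seen a great deal of research and growth over the years. Although many of the terms in these areas are used interchangeably, we can see from the descriptions that their objectives are different (Figure 2-3). In particular, ML has a scope that cuts across all seven areas of AI.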

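Figure 2-3

AI areas

AI Evolution

Let's take a brief look at the past, present, and future of AI.

Artificial narrow intelligence (ANI): Machine intelligence that equals or exceeds human intelligence or efficiency at a specific task. An example is IBM's Watson, which requires close participation of subject matter or domain experts to supply data/information and evaluate its performance.
Artificial general intelligence (AGI): A machine with the ability to apply intelligence to any problem in an area, rather than just one specific problem. Self-driving cars are a good example of this.
Artificial super intelligence (ASI): An intellect that is much smarter than the best human brains in practically every field, including scientific creativity, general wisdom, and social skills. The key theme here is "don't model the world, model the mind."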

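Different Forms

Is ML the only subject in which we use data to learn and then use it for prediction/inference?

To answer this question, let's look at the definitions (from Wikipedia) of a few other key terms that are heard relatively often (this is not an exhaustive list):

Statistics: The study of the collection, analysis, interpretation, presentation, and organization of data.
Data mining: An interdisciplinary subfield of computer science. It is the computational process of discovering patterns in large data sets (from a data warehouse), involving methods at the intersection of AI, ML, statistics, and database systems.
Data analytics: A process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision making. This is also known as business analytics and is widely used in many industries to allow companies/organizations to use the science of examining raw data to draw conclusions about that information and make better business decisions.
Data science: An interdisciplinary field about processes and systems for extracting knowledge or insights from data in various forms, either structured or unstructured. It is a continuation of some of the data analysis fields such as statistics, ML, data mining, and predictive analytics, similar to knowledge discovery in databases (KDD).

Yes, from the preceding definitions, it is clear and perhaps surprising that ML is not the only subject in which we use data to learn and then use it for prediction/inference. Almost identical themes, tools, and techniques are being discussed in each of these areas. This raises a genuine question about why there are so many different names with so much overlap around learning from data. What is the difference between them?

The short answer is that all of these are practically the same. However, there exists a subtle difference, or shade of meaning, expression, or sound, between each of them. To get a better understanding, we have to go back to the history of each of these areas and closely examine the origin, core area of application, and evolution of these terms.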

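Statistics

The German scholar Gottfried Achenwall introduced the word "statistics" in the middle of the 18th century (1749). Use of the word during this period meant that it was related to the administrative functions of a state, supplying the numbers that reflect the periodic actuality of its various administrative areas. The origin of the word statistics can be traced to the Latin word "status" ("council of state") or the Italian word "statista" ("statesman" or "politician"); that is, the meaning of these words is "political state" or government. Shakespeare used the word statist in his play Hamlet (1602). In the past, rulers used statistics to designate the analysis of data about the state, signifying the "science of state."

At the beginning of the 19th century, statistics acquired the meaning of the collection and classification of data. The Scottish politician Sir John Sinclair introduced it into English in 1791 in his book Statistical Account of Scotland. Therefore, the fundamental purpose behind the birth of statistics was the data used by governments and centralized administrative organizations to collect census data about the population of states and localities.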

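Frequentist

John Graunt was one of the first demographers and our first vital statistician. He published his observations on the Bills of Mortality (1662), and this work is often cited as the first instance of descriptive statistics. He presented vast amounts of data in a few tables that can be easily comprehended, a technique that is now widely known as descriptive statistics. In it, we note that weekly mortality statistics first appeared in England in 1603 at the Parish-Clerks Hall. We can learn from it that in 1623, of some 50,000 burials in London, only 28 died of the plague. By 1632, the disease had practically disappeared for the time being, only to reappear in 1636 and again in the terrible epidemic of 1665. This illustrates that the fundamental nature of descriptive statistics is counting. From the registers of all the parishes, he counted the number of persons who died and the number who died of the plague. The counted numbers were often too large to follow, so he also simplified them by using proportions rather than the actual numbers. For example, in 1625 there were 51,758 deaths, of which 35,417 were from the plague. To simplify this, he wrote, "We find the plague to bear unto the whole in proportion as 35 to 51. Or 7 to 10." With this he introduced the concept that relative proportions are often of more interest than raw numbers. We would generally express this proportion as 70%. This type of conjecture, based on the proportion, spread, or frequency of sample data, is called "frequentist statistics." Statistical hypothesis testing is based on an inference framework in which you assume that observed phenomena are caused by unknown but fixed processes.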

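Bayesian

In contrast, Bayesian statistics (named after Thomas Bayes) describes the probability of an event based on conditions that might be related to the event. At the core of Bayesian statistics is Bayes' theorem, which describes the outcome probabilities of related (dependent) events using the concept of conditional probability. For example, if a particular illness is related to age and lifestyle, then applying Bayes' theorem while taking a person's age and lifestyle into account allows the probability that the individual has the illness to be assessed more accurately.

Bayes' theorem is stated mathematically as the following equation: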

$\mathrm{P}\left(\mathrm{A}|\mathrm{B}\right)=\frac{\mathrm{P}\left(\mathrm{B}|\mathrm{A}\right)\kern0.5em \mathrm{P}\left(\mathrm{A}\right)}{\mathrm{P}\left(\mathrm{B}\right)}$

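where A and B are events and P (B) ≠ 0.

P (A) and P (B) are the probabilities of observing A and B without regard to each other.
P (A | B), a conditional probability, is the probability of observing event A given that B is true.
P (B | A) is the probability of observing event B given that A is true.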

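For example, suppose a doctor knows that a sleep disorder causes migraine 50% of the time. The prior probability of any patient having a sleep disorder is 10,000/50,000, and the prior probability of any patient having migraine is 300/1,000. The probability of migraine given a sleep disorder is already known (50%), so let's apply Bayes' theorem to reverse the condition: if a patient has migraine, what is the probability that he/she has a sleep disorder?

$\mathrm{P}\left(\text{sleep disorder}|\text{migraine}\right)=\frac{\mathrm{P}\left(\text{migraine}|\text{sleep disorder}\right)\ \mathrm{P}\left(\text{sleep disorder}\right)}{\mathrm{P}\left(\text{migraine}\right)}$

$\mathrm{P}\left(\text{sleep disorder}|\text{migraine}\right)=\frac{0.5\times 10000/50000}{300/1000}\approx 33\%$

In the preceding scenario, there is a 33% chance that a patient with migraine has a sleep disorder.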
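The arithmetic of this example can be reproduced in a few lines. In the sketch below, the function name bayes and the variable names are illustrative; the numbers are the ones given in the text, and the quantity computed is the probability of a sleep disorder given migraine, which is what those numbers determine.

```python
def bayes(p_b_given_a, p_a, p_b):
    """Return P(A | B) = P(B | A) * P(A) / P(B)."""
    if p_b == 0:
        raise ValueError("P(B) must be non-zero")
    return p_b_given_a * p_a / p_b


p_migraine_given_sleep = 0.5       # sleep disorder leads to migraine 50% of the time
p_sleep = 10000 / 50000            # prior probability of a sleep disorder
p_migraine = 300 / 1000            # prior probability of migraine

p_sleep_given_migraine = bayes(p_migraine_given_sleep, p_sleep, p_migraine)
print(round(p_sleep_given_migraine, 2))
```

Swapping the roles of the two priors in the call would compute the opposite conditional, which is why it helps to name each probability explicitly.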

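Regression

Another major milestone for statisticians was the method of regression, which was published by Legendre in 1805 and by Gauss in 1809. Legendre and Gauss both applied the method to the problem of determining, from astronomical observations, the orbits of bodies about the sun, mostly comets and, later, the newly discovered minor planets. Gauss published a further development of the theory of least squares in 1821. Regression analysis is an essential statistical process for estimating the relationships between factors. It includes many techniques for analyzing and modeling various factors. The main focus here is the relationship between a dependent factor and one or more independent factors, also called predictors, variables, or features. We will learn more about this in the fundamentals of ML with Scikit-learn.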
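To make the least squares idea concrete, here is a minimal sketch of fitting a straight line to one independent factor using the standard closed-form formulas. The function name fit_line and the sample points are illustrative assumptions, not taken from the original text, and real projects would normally use a library implementation instead.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums about the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative points that lie exactly on y = 2x + 1
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

With noisy data the fitted line no longer passes through every point; it is the line that minimizes the sum of squared vertical distances to the points.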

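Over time, the idea behind the word statistics has undergone an extraordinary transformation. The character of the data or information provided has been extended to all spheres of human activity. Let's understand the difference between two terms that are often used along with statistics: 1) data and 2) method. Statistical data are numerical statements of facts, whereas statistical methods deal with the principles and techniques used in collecting and analyzing such data. Today, statistics as a subject separate from mathematics is closely associated with almost all branches of education and human endeavor that can be represented numerically. In modern times, it has innumerable and varied applications, both qualitative and quantitative. Individuals and organizations use statistics to understand data and make informed decisions throughout the natural and social sciences, medicine, business, and other areas. Statistics has served as the backbone and given rise to many other disciplines, as you will see as you read further.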

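Data Mining

The term "knowledge discovery in databases" (KDD) was coined by Gregory Piatetsky-Shapiro in 1989. At the same time, he cofounded the first workshop named KDD. The term "data mining" was introduced in the database community in the 1990s, but data mining is the evolution of a field with a slightly longer history.

Data mining techniques are the result of research on business processes and product development. This evolution began when business data was first stored on computers in relational databases, continued with improvements in data access, and went further to produce new technologies that allow users to navigate through their data in real time. In the business community, data mining focuses on providing the "right data" at the "right time" for the "right decisions." This is achieved by enabling massive data collection and applying algorithms to it with the help of distributed multiprocessor computers, to provide real-time insights from data.

We will learn more about the five stages proposed by KDD for data mining in the section "Frameworks for Building ML Systems."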

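Data Analytics

Analytics has been known to be used in business since the time of the management movement toward industrial efficiency that was initiated in the late 19th century by Frederick Winslow Taylor, an American mechanical engineer. The manufacturing industry adopted measurement of the pace of manufacturing and the assembly line, which revolutionized industrial efficiency. But analytics began to command more attention in the late 1960s, when computers started playing a dominant role in organizations' decision support systems. Traditionally, business managers made decisions based on past experience or rules of thumb, or there were other qualitative aspects to decision making. However, this changed with the development of data warehouses and enterprise resource planning (ERP) systems. Business managers began to consider data and to rely on ad hoc analysis to affirm their experience/knowledge-based assumptions for daily and critical business decisions. This evolved into data-driven business intelligence or business analytics for the decision-making process, which was rapidly adopted by organizations and companies across the globe. Today, businesses of all sizes use analytics. In the business world, the term "business analytics" is often used interchangeably with "data analytics."

Businesses require a holistic view of the market and of how a company competes efficiently within that market in order to increase their return on investment (RoI). This requires a robust analytics environment around the kinds of analytics that are possible. These can be broadly categorized into four types (Figure 2-4).

Figure 2-4

Data analytics types

Descriptive analytics
Diagnostic analytics
Predictive analytics
Prescriptive analytics

Descriptive Analytics

These are the analytics that describe the past to tell us "what has happened." To elaborate, as the name suggests, any activity or method that helps us describe or summarize raw data into something interpretable by humans can be termed descriptive analytics. These are useful because they allow us to learn from past behaviors and understand how they might influence future outcomes.

Statistics such as arithmetic operations of count, min, max, sum, mean, percentage, and percent change fall into this category. Common examples of descriptive analytics are a company's business intelligence reports that cover different aspects of the organization to provide historical hindsight regarding the company's production, operations, sales, revenue, financials, inventory, customers, and market share.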
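The descriptive statistics named above map directly onto Python built-ins and the standard statistics module. The sketch below uses a made-up list of monthly sales figures (the name monthly_sales and its values are purely illustrative) to compute each of them.

```python
import statistics

# Hypothetical monthly sales figures, for illustration only
monthly_sales = [120, 135, 128, 150, 162, 158]

summary = {
    "count": len(monthly_sales),
    "min": min(monthly_sales),
    "max": max(monthly_sales),
    "sum": sum(monthly_sales),
    "mean": statistics.mean(monthly_sales),
}

# Percent change from each month to the next
pct_change = [
    (current - previous) / previous * 100
    for previous, current in zip(monthly_sales, monthly_sales[1:])
]

print(summary)
print([round(change, 1) for change in pct_change])
```

A business intelligence report is essentially a collection of such summaries, sliced by product, region, or time period.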

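Diagnostic Analytics

This is the next step after descriptive analytics; it examines data or information to answer the question "why did it happen?" It is characterized by techniques such as drill-down, data discovery, data mining, and correlations and causation. It basically provides a very good understanding of a limited piece of the problem you want to solve. However, it is very laborious work, as significant human intervention is required to perform the drill-down or data mining needed to go deeper into the data and understand why something happened or what its root cause is. It focuses on determining the factors and events that contributed to the outcome.

For example, assume that the sales performance of a retail company's hardlines (a category that usually includes furniture, appliances, tools, electronics, etc.) is not up to the mark in certain stores, and the product line manager would like to understand the root cause. In this case, the product manager may want to look back and review past trends and patterns of the product line's sales across different stores, based on its placement (which floor, corner, aisle) within the store. The manager may also want to understand whether there is any causal relationship with other products that are closely kept with it. They may look at different external factors such as demographics, season, and macroeconomic factors, separately as well as in unison, to define the relative ranking of related variables based on concluded explanations. To accomplish this, there is no clearly defined set of ordered steps; it depends on the experience level and thinking style of the person carrying out the analysis.

There is significant involvement of the subject matter expert, and the data/information may need to be presented visually for better understanding. There is a plethora of tools available, such as Excel, Tableau, QlikView, Spotfire, and D3, that enable diagnostic analytics.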

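Predictive Analytics

This is the ability to make predictions or estimations of the likelihood of unknown future events based on past or historic patterns. Predictive analytics gives us insight into "what might happen?" It uses many techniques from data mining, statistics, modeling, ML, and AI to analyze current data and make predictions about the future.

It is important to remember that the foundation of predictive analytics is probability, and the quality of prediction by statistical algorithms depends a lot on the quality of the input data. Hence, these algorithms cannot predict the future with 100% certainty. However, companies can use these statistics to forecast the probability of what might happen in the future, and considering these results alongside business knowledge should lead to profitable decisions.

ML is heavily focused on predictive analytics, where we combine historical data from different sources such as an organization's ERP, CRM (customer relationship management), POS (point of sale), employee data, and market research data. This data is used to identify patterns and apply statistical models/algorithms to capture the relationships between various data sets and further predict the likelihood of an event.

Some examples of predictive analytics are weather forecasting, e-mail spam identification, fraud detection, the probability of a customer purchasing a product or renewing an insurance policy, and predicting the chances of a person having a known illness.

Prescriptive Analytics

This is the area of data or business analytics dedicated to finding the best course of action for a given situation. Prescriptive analytics is related to the other three forms of analytics: descriptive, diagnostic, and predictive. The endeavor of prescriptive analytics is to measure the future decision's effect so that decision makers can foresee the possible outcomes before the actual decisions are made. Prescriptive analytic systems are a combination of business rules and ML algorithms, and these tools can be applied to both historic and real-time data feeds. The key objective here is not just to predict what will happen but also why it will happen, by predicting multiple futures based on different scenarios, so that companies can assess the possible outcomes based on their actions.

One example of prescriptive analytics is the use of simulation in design situations to help users identify system behaviors under different configurations. This ensures that all key performance metrics, such as wait times and queue length, are met. Another example is the use of linear or nonlinear programming to identify the best outcome for a business, given constraints and an objective function.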

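Data Science

In 1960, Peter Naur used the term "data science" in his publication Concise Survey of Computer Methods, which is about contemporary data-processing methods in a wide range of applications. In 1991, computer scientist Tim Berners-Lee announced the birth of what would become the World Wide Web as we know it today, in a post in a Usenet group where he set out the specifications for a worldwide, interconnected web of data, accessible to anyone from anywhere. Over time the Web/Internet has been growing tenfold each year and has become a global computer network providing a variety of information and communication facilities, consisting of interconnected networks that use standardized communication protocols. In addition, storage systems evolved, and digital storage became more cost-effective than paper.

As of 2008, the world's servers processed 9.57 zettabytes (9.57 trillion gigabytes) of information, which is equivalent to 12 gigabytes of information per person per day, according to the "How Much Information? 2010 Report on Enterprise Server Information."

The rise of the Internet drastically increased the volume of structured, semistructured, and unstructured data. This led to the birth of the term "big data," characterized by three Vs (Figure 2-5): volume, variety, and velocity. Special tools and systems are required to process high volumes of data with a wide range of variety (text, numbers, audio, video, etc.) generated at high velocity.

Figure 2-5

Three Vs of big data (source: http://blog.sqlauthority.com )

The big data revolution influenced the birth of the term "data science." Although the term data science has existed since 1960, it became popular and is attributed to Jeff Hammerbacher and DJ Patil of Facebook and LinkedIn, because they carefully chose it in attempting to describe their teams and work (per Building Data Science Teams by DJ Patil, published in 2008). They settled on "data scientist," and a buzzword was born. Figure 2-6 explains well the essential skill set for data science, as presented by Drew Conway in 2010.

Figure 2-6

Drew Conway's data science Venn diagram

Executing data science projects requires three key skills:

Programming or hacking skills
Math and statistics
Business or subject matter expertise for the given area in scope

Note that ML originated from AI. It is not a branch of data science; rather, data science merely uses ML as a tool.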

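Statistics vs. Data Mining vs. Data Analytics vs. Data Science

We can learn from the history and evolution of the subjects concerned with learning from data that, even though they use the same methods, they evolved as different cultures, so they have different histories, nomenclature, notation, and philosophical perspectives (Figure 2-7).

Figure 2-7

Learn from data evolution

All the forms together: the path to ultimate AI (Figure 2-8).

Figure 2-8

All the forms together: the path to ultimate AI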

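Machine Learning Categories

At a high level, ML tasks can be categorized into three groups (Figure 2-9), based on the desired output and the kind of input required to produce it.

Figure 2-9

Types of ML

Supervised Learning

The ML algorithm is provided with a large enough example input data set with the corresponding output or event/class, usually prepared in consultation with the subject matter expert of the respective domain. The goal of the algorithm is to learn patterns in the data and build a general set of rules to map inputs to the class or event.

Broadly, there are two types of commonly used supervised learning algorithms:

Regression: The output to be predicted is a continuous number related to a given input data set. Example use cases are predictions of retail sales, the number of staff required for each shift, the number of car parking spaces required for a retail store, the credit score for a customer, and so on.
Classification: The output to be predicted is the actual class or the probability of an event/class, and the number of classes to be predicted can be two or more. The algorithm should learn the patterns in the relevant input of each class from historical data and be able to predict the unseen class or event in the future, considering its input. An example use case is spam e-mail filtering, where the expected output is to classify an e-mail as either spam or not spam.

Building supervised learning ML models has three stages:

Training: The algorithm is provided with historical input data with the mapped output. The algorithm learns the patterns within the input data for each output and represents them as a statistical equation, which is also commonly known as a model.
Testing or validation: In this phase, the performance of the trained model is evaluated, usually by applying it to a data set (that was not used as part of the training) to predict the class or event.
Prediction: Here we apply the trained model to a data set that was part of neither the training nor the testing. The prediction is used to drive business decisions.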
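The three stages can be sketched end to end in pure Python. The example below is only an illustration under stated assumptions: it uses a 1-nearest-neighbor rule (chosen here for brevity, not prescribed by the text), and the labeled data, which pairs a count of suspicious words with a spam/ham label, is invented.

```python
import random


def nearest_label(training, point):
    """1-nearest-neighbor: return the label of the closest training example."""
    closest = min(training, key=lambda example: abs(example[0] - point))
    return closest[1]


# Hypothetical labeled data: (number of suspicious words, label)
data = [(0, "ham"), (1, "ham"), (2, "ham"), (3, "ham"),
        (7, "spam"), (8, "spam"), (9, "spam"), (10, "spam")]

random.seed(0)
random.shuffle(data)
train, test = data[:6], data[6:]   # hold out examples the model never sees

# Training: for 1-nearest-neighbor, the "model" is simply the stored training set

# Testing: measure accuracy on the held-out examples
correct = sum(nearest_label(train, x) == y for x, y in test)
accuracy = correct / len(test)

# Prediction: apply the model to a brand-new input
prediction = nearest_label(train, 12)

print(accuracy, prediction)
```

The important habit is the separation of the data: examples used for testing are never used for training, and the final prediction is made on input that belongs to neither set.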

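Unsupervised Learning

There are situations where the desired output class/event is unknown for historical data. The objective in such cases is to study the patterns in the input data set to get a better understanding and identify similar patterns that can be grouped into specific classes or events. As these types of algorithms do not require any intervention from subject matter experts beforehand, they are called unsupervised learning.

The following are some examples of unsupervised learning:

Clustering: Assume that the classes are not known beforehand for a given data set. The goal here is to divide the input data set into logical groups of related items. Some examples are grouping similar news articles or grouping similar customers based on their profiles.
Dimension reduction: Here the goal is to simplify a large input data set by mapping it to a lower-dimensional space. For example, carrying out analysis on a large-dimension data set is very computationally intensive; so, to simplify, you may want to find the key variables that hold a significant percentage (say 95%) of the information and use only them for analysis.

Reinforcement Learning

The basic objective of reinforcement learning algorithms is to map situations to actions that yield the maximum final reward. While mapping the action, the algorithm should not just consider the immediate reward but also the next and all subsequent rewards. For example, a program to play a game or drive a car will have to interact constantly with a dynamic environment in which it is expected to achieve a certain goal. We will learn about this in detail in Chapter 6.

Examples of reinforcement learning techniques are the following:

Markov decision process
Q-learning
Temporal difference methods
Monte Carlo methods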

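Frameworks for Building ML Systems

Over time, the data mining field has seen a massive expansion. Many experts have put a lot of effort into standardizing methodologies and defining best practices for the ever-growing, diversified, and iterative process of building ML systems. Over the last decade, the field of ML has become very important for different industries, businesses, and organizations because of its ability to extract insight from huge amounts of data. Previously, this data had been of no use or was underutilized for learning about trends/patterns and predicting the possibilities that help drive business decisions leading to profit. Ultimately, the risk of wasting the valuable information contained in rich business data sources grew. This called for the use of adequate techniques to get useful knowledge, and the field of ML, which had emerged in the early 1980s, has seen great growth. With the emergence of this field, different process frameworks were introduced. These process frameworks guide and carry ML tasks and their applications. Efforts were made to use data mining process frameworks that guide the implementation of data mining on huge amounts of data.

Mainly, three data mining process frameworks have been most popular and widely practiced by data mining experts/researchers to build ML systems. These models are the following:

Knowledge discovery in databases (KDD) process model
Cross-industry standard process for data mining (CRISP-DM)
Sample, explore, modify, model, and assess (SEMMA)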

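Knowledge Discovery in Databases

This refers to the overall process of discovering useful knowledge from data, which was presented by Fayyad et al. in 1996. It is an integration of multiple technologies for data management, such as data warehousing, statistical ML, decision support, visualization, and parallel computing. As the name suggests, KDD centers on the overall process of knowledge discovery from data, which covers the entire life cycle of data. This includes how the data is stored, how it is accessed, how algorithms can be scaled efficiently to enormous data sets, and how results can be interpreted and visualized.

Figure 2-10 shows the five stages of KDD, which are described in detail in the following sections.

Figure 2-10

KDD data mining process flow

Selection

In this step, the target data, possibly coming from many different and heterogeneous sources, is selected and integrated. Then the correct subset of variables and data samples relevant to the analysis task is retrieved from the database.

Preprocessing

Real-world data sets are often incomplete. That is, attribute values will be missing, noisy (errors and outliers), and inconsistent, which means there are discrepancies between the collected data. Unclean data can confuse the mining procedures and lead to unreliable and invalid outputs. Also, performing complex analysis and mining on a huge amount of such soiled data may take a very long time. Preprocessing and cleaning should improve the quality of the data and the mining results by enhancing the actual mining process. The actions to be taken include the following:

Collecting the data or information required for modeling
Outlier treatment or removal of noise
Using prior domain knowledge to remove inconsistencies and duplicates from the data
Choosing strategies for handling missing data

Transformation

In this step, data is transformed or consolidated into forms appropriate for mining, that is, finding useful features to represent the data depending on the goal of the task. For example, in high-dimensional spaces or with a large number of attributes, the distances between objects may become meaningless. So dimensionality reduction and transformation methods can be used to reduce the effective number of variables under consideration or to find invariant representations of the data. There are various data transformation techniques:

Smoothing (binning, clustering, regression, etc.)
Aggregation
Generalization, in which a primitive data object can be replaced by higher-level concepts
Normalization, which involves min-max scaling or z-score
Feature construction from existing attributes, using principal component analysis (PCA) or multidimensional scaling (MDS)
Data reduction techniques, applied to produce a reduced representation of the data (a smaller volume that closely maintains the integrity of the original data)
Compression, for example, wavelets, PCA, clustering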
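Of the transformation techniques listed above, normalization is the easiest to show in code. The following is a small sketch of min-max scaling and z-score standardization; the function names and the sample ages are illustrative, and both functions assume the input values are not all identical (otherwise the denominator would be zero).

```python
import statistics


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lowest, highest = min(values), max(values)
    return [(v - lowest) / (highest - lowest) for v in values]


def z_score(values):
    """Center values on the mean and divide by the population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Illustrative attribute values
ages = [20, 30, 40, 50, 60]

scaled = min_max_scale(ages)
standardized = z_score(ages)

print(scaled)
print([round(z, 3) for z in standardized])
```

Min-max scaling preserves the shape of the distribution within fixed bounds, whereas z-scores express each value as a number of standard deviations from the mean, which tends to be less sensitive to the exact minimum and maximum.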

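Data Mining

In this step, ML algorithms are applied to extract data patterns. Exploration/summarization methods such as mean, median, mode, standard deviation, class/concept description, and graphical techniques of low-dimensional plots can be used to understand the data. Predictive models such as classification or regression can be used to predict an event or a future value. Cluster analysis can be used to understand the existence of similar groups. The most appropriate methods to be used for the model and pattern search are selected.

Interpretation/Evaluation

This step is focused on interpreting the mined patterns to make them understandable by the user, through techniques such as summarization and visualization. The mined pattern or model is interpreted. A pattern is a local structure that makes statements only about a restricted region of the space spanned by the variables. A model, on the other hand, is a global structure that makes statements about any point in the measurement space, for example, Y = mX + C (linear model).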

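Cross-Industry Standard Process for Data Mining

This is generally known by its acronym, CRISP-DM. It was established under the European Strategic Program on Research in Information Technology (ESPRIT) initiative, with the aim of creating an unbiased methodology that is not domain dependent. It is an effort to consolidate the best practices of the data mining process followed by experts to tackle data mining problems. It was conceived in 1996, first published in 1999, and reported as the leading methodology for data mining/predictive analytics projects in polls conducted in 2002, 2004, and 2007. There was a plan between 2006 and 2008 to update CRISP-DM, but that update did not take place, and today the original CRISP-DM.org website is no longer active.

This framework is an idealized sequence of activities. It is an iterative process, and many of the tasks backtrack to previous tasks and repeat certain actions to bring more clarity. There are six major phases, as shown in Figure 2-11 and discussed in the following sections:

Figure 2-11

Process diagram showing the relationship between the six phases of CRISP-DM

Business understanding
Data understanding
Data preparation
Modeling
Evaluation
Deployment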

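Phase 1: Business Understanding

As the name suggests, the focus at this stage is to understand the overall project objectives and expectations from a business perspective. These objectives are converted into a data mining or ML problem definition, and a plan of action is designed around data requirements, business owners' input, and outcome performance evaluation metrics.

Phase 2: Data Understanding

In this phase, the initial data that was identified as a requirement in the previous phase is collected. Activities are carried out to understand data gaps or the relevance of the data to the objective at hand, any data quality issues, and first insights into the data to bring out appropriate hypotheses. The outcome of this phase is presented to the business iteratively to bring more clarity to the business understanding and the project objective.

Phase 3: Data Preparation

This phase is all about cleaning the data so that it is ready for the model building phase. Cleaning data could involve filling the known data gaps from the previous step, missing value treatment, identifying the important features, applying transformations, and creating new relevant features where applicable. This is one of the most important phases, as the model's accuracy will depend significantly on the quality of the data that is fed into the algorithm to learn the patterns.

Phase 4: Modeling

There are multiple ML algorithms available to solve a given problem. So various appropriate ML algorithms are applied to the clean data set, and their parameters are tuned to the optimal possible values. The model performance for each of the applied models is recorded.

Phase 5: Evaluation

In this stage, benchmarking is carried out among all the different models that were identified as giving high accuracy. The model is tested against data that was not used as part of the training, to evaluate its performance consistency. The results are verified against the business requirements identified in phase 1. The subject matter experts from the business are involved to ensure that the model results are accurate and usable as required by the project objective.

Phase 6: Deployment

The key focus in this phase is the usability of the model output. So the final model signed off by the subject matter expert is implemented, and the consumers of the model output are trained on how to interpret or use it to make the business decisions defined in the business understanding phase. The implementation could be generating a prediction report and sharing it with the consumer. In addition, periodic model training and prediction times are scheduled based on the business requirement.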

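SEMMA (Sample, Explore, Modify, Model, Assess)

SEMMA is the set of sequential steps for building ML models incorporated in SAS Enterprise Miner, a product of SAS Institute Inc., one of the largest producers of commercial statistical and business intelligence software. These sequential steps guide the development of an ML system. Let's look at the five steps to understand them better.

Sample

This step is about selecting a subset of the right volume from the large dataset provided for building the model, which helps build the model efficiently. This was a well-known practice when computational power was expensive; however, it is still in practice. The selected subset should be a faithful representation of the entire dataset originally collected, which means it should contain sufficient information to retrieve. The data is also partitioned for training and validation at this stage.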
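
A minimal sketch of the Sample step with Pandas follows. SEMMA itself is defined in terms of SAS Enterprise Miner, so this is only an analogous illustration; the 20% sample size, the 75/25 split, and the column names are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical full dataset of 100 records
data = pd.DataFrame({"x": np.arange(100), "y": np.arange(100) % 2})

# Draw a 20% random sample; random_state makes the draw repeatable
sample = data.sample(frac=0.2, random_state=42)

# Partition the sample into training (75%) and validation (25%) sets
train = sample.sample(frac=0.75, random_state=42)
valid = sample.drop(train.index)

print(len(sample), len(train), len(valid))   # 20 15 5
```

Simple random sampling is only one option; when a class is rare, a stratified sample is usually a safer way to keep the subset representative.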

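Explore

In this phase, activities are carried out to understand the data gaps and the relationships among variables. Two key activities are univariate and multivariate analysis. In univariate analysis, each variable is examined individually to understand its distribution, whereas in multivariate analysis the relationships between variables are explored. Data visualization is used heavily to help understand the data better.

Modify

In this phase, variables are cleaned where required. New derived features are created by applying business logic to existing features, based on the requirement. Variables are transformed if necessary. The outcome of this phase is a clean dataset that can be passed to the ML algorithm to build the model.

Model

In this phase, various modeling or data mining techniques are applied to the preprocessed data to benchmark their performance against the desired outcomes.

Assess

This is the last phase. Here model performance is evaluated against the test data (not used in model training) to ensure reliability and business usefulness.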

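KDD is the oldest of the three frameworks. CRISP-DM and SEMMA appear to be practical implementations of the KDD process. CRISP-DM is more complete, because the iterative flow of knowledge across and between phases is clearly defined. It also covers all areas of building a reliable ML system from a business perspective. In SEMMA's Sample stage, it is important to truly understand all aspects of the business to ensure that the sampled data retains maximum information. However, recent innovations have reduced the cost of data storage and computational power, which lets us apply ML algorithms efficiently to the entire dataset, almost removing the need for sampling.

We can see that, in general, the core phases are covered by all three frameworks and there is not a huge difference between them (Figure 2-12). Overall, these processes guide us in applying data mining techniques to practical scenarios. Most researchers and data mining experts follow the KDD and CRISP-DM process models because they are more complete and accurate. I personally recommend following CRISP-DM in a business environment, as it covers end-to-end business activity and the life cycle of building an ML system.

Figure 2-12

Summary of data mining frameworks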

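Machine Learning Python Packages

There is a rich number of open source libraries available to facilitate practical ML. These are mainly known as scientific Python libraries and are generally put to use when performing elementary ML tasks. At a high level, we can divide these libraries into data analysis and core ML libraries based on their usage/purpose.

Data analysis packages: these provide the mathematical and scientific functionality that is essential for data preprocessing and transformation.

Core machine learning packages: these provide all the necessary ML algorithms and functionality that can be applied to a given dataset to extract patterns.

Data Analysis Packages

There are four key packages that are most widely used for data analysis:

NumPy
SciPy
Matplotlib
Pandas

Pandas, NumPy, and Matplotlib play a major role and have a scope of usage in almost all data analysis tasks (Figure 2-13). So in this chapter, we focus on the usage and concepts of these three packages as far as possible. SciPy supplements the NumPy library and has a variety of key high-level science and engineering modules; however, the use of these functions largely depends on the use case. So we touch on or highlight some of its useful functionality in the upcoming chapters wherever possible.

Figure 2-13

Data analysis packages

Note

For conciseness, we cover only the key concepts in each of the libraries, with a brief introduction and code implementation. You can always refer to the official user documentation for these packages, which has been well designed by the developer community to cover things in much more depth.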

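NumPy

NumPy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object and tools for working with these arrays. It is the successor of the Numeric package. In 2005, Travis Oliphant created NumPy by incorporating the features of the rival Numarray into Numeric, with extensive modifications. I think the concepts and code examples have largely been explained in the simplest possible form in his book Guide to NumPy. Here we only look at some of the key NumPy concepts that are a must or good to know in relation to ML.

Array

A NumPy array is a collection of values of the same data type, indexed by a tuple of non-negative integers. The rank of an array is its number of dimensions, and the shape of an array is a tuple of integers giving the size of the array along each dimension.

We can initialize NumPy arrays from nested Python lists and access elements using square brackets (Listing 2-1).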

import numpy as np

# Create a rank 1 array
a = np.array([0, 1, 2])
print (type(a))

# this will print the shape (the size along each dimension) of the array
print (a.shape)
print (a[0])
print (a[1])
print (a[2])

# Change an element of the array
a[0] = 5
print (a)
# ----output-----
<class 'numpy.ndarray'>
(3,)
0
1
2
[5 1 2]

# Create a rank 2 array
b = np.array([[0,1,2],[3,4,5]])
print (b.shape)
print (b)
print (b[0, 0], b[0, 1], b[1, 0])
----output-----
(2, 3)
[[0 1 2]
 [3 4 5]]
0 1 3

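Listing 2-1. Example Code for Initializing NumPy Array

Creating NumPy Array

NumPy also provides many built-in functions to create arrays. The best way to learn this is through examples (Listing 2-2), so let's jump into the code.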

# Create a 3x3 array of all zeros
a = np.zeros((3,3))
print (a)
----- output -----
[[ 0.  0.  0.]
 [ 0.  0.  0.]
 [ 0.  0.  0.]]

# Create a 2x2 array of all ones
b = np.ones((2,2))
print (b)
---- output ----
[[ 1.  1.]
 [ 1.  1.]]

# Create a 3x3 constant array
c = np.full((3,3), 7)
print (c)
---- output ----
[[7 7 7]
 [7 7 7]
 [7 7 7]]

# Create a 3x3 array filled with random values
d = np.random.random((3,3))
print (d)
---- output ----
[[0.67920283 0.54527415 0.89605908]
 [0.73966284 0.42214293 0.10170252]
 [0.26798364 0.07364324 0.260853  ]]

# Create a 3x3 identity matrix
e = np.eye(3)
print (e)
---- output ----
[[ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]]

# convert list to array
f = np.array([2, 3, 1, 0])
print (f)
---- output ----

[2 3 1 0]

# arange() will create arrays with regularly incrementing values
g = np.arange(20)
print (g)
---- output ----
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]

# note mix of tuple and lists
h = np.array([[0, 1,2.0],[0,0,0],(1+1j,3.,2.)])
print (h)
---- output ----
[[ 0.+0.j  1.+0.j  2.+0.j]
 [ 0.+0.j  0.+0.j  0.+0.j]
 [ 1.+1.j  3.+0.j  2.+0.j]]

# create an array of range with float data type
# (the np.float alias was removed in NumPy 1.24; the built-in float works in all versions)
i = np.arange(1, 8, dtype=float)
print (i)
---- output ----
[ 1.  2.  3.  4.  5.  6.  7.]

# linspace() will create arrays with a specified number of items which are
# spaced equally between the specified beginning and end values
j = np.linspace(2., 4., 5)
print (j)
---- output ----
[ 2.   2.5  3.   3.5  4. ]

# indices() will create a set of arrays stacked as a one-higher
# dimensioned array, one per dimension with each representing variation
# in that dimension
k = np.indices((2,2))
print (k)
---- output ----
[[[0 0]
  [1 1]]

 [[0 1]
  [0 1]]]

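Listing 2-2. Creating NumPy Array

Data Types

An array is a collection of items of the same data type. NumPy chooses a data type when an array is created, and the array constructor functions also accept an optional argument to specify the data type explicitly (Listing 2-3).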

# Let numpy choose the data type
x = np.array([0, 1])
y = np.array([2.0, 3.0])

# Force a particular data type
z = np.array([5, 6], dtype=np.int64)

print (x.dtype, y.dtype, z.dtype)
---- output ----
int32 float64 int64

Listing 2-3. NumPy Data Types
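
Beyond setting dtype at creation time, an existing array can be converted with astype(). The short sketch below, on made-up values, is an addition for illustration; note that converting floats to integers truncates toward zero rather than rounding.

```python
import numpy as np

prices = np.array([1.9, 2.5, 3.1])       # float64 by default

# astype() returns a new array with the requested data type
as_int = prices.astype(np.int64)         # fractional part is truncated

# A narrower type uses less memory per element
small = np.array([1, 2, 3], dtype=np.int8)

print(as_int, as_int.dtype)              # [1 2 3] int64
print(small.itemsize)                    # 1 (byte per element)
```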

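Array Indexing

NumPy offers several ways to index into arrays. The standard Python x[obj] syntax can be used to index a NumPy array, where x is the array and obj is the selection.

There are three kinds of indexing available:

Field access
Basic slicing
Advanced indexing

Field Access

If the ndarray object is a structured array, the fields of the array can be accessed by indexing the array with strings, dictionary-like. Indexing x['field-name'] returns a new view of the array that has the same shape as x (except when the field is a subarray, whose dimensions are appended to the shape), has data type x.dtype['field-name'], and contains only the part of the data in the specified field (Listing 2-4).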

x = np.zeros((3,3), dtype=[('a', np.int32), ('b', np.float64, (3,3))])
print ("x['a'].shape: ",x['a'].shape)
print ("x['a'].dtype: ", x['a'].dtype)
print ("x['b'].shape: ", x['b'].shape)
print ("x['b'].dtype: ", x['b'].dtype)
----output-----
x['a'].shape:  (3, 3)
x['a'].dtype:  int32
x['b'].shape:  (3, 3, 3, 3)
x['b'].dtype:  float64

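Listing 2-4. Field Access

Basic Slicing

NumPy arrays can be sliced, similar to lists. Because arrays may be multidimensional, you must specify a slice for each dimension of the array.

The basic slice syntax is i:j:k, where i is the starting index, j is the stopping index, and k is the step, with k not equal to 0. This selects the m elements in the corresponding dimension with index values i, i + k, ..., i + (m - 1)k, where m = q + (r != 0) and q and r are the quotient and remainder obtained by dividing j - i by k: j - i = qk + r, so that i + (m - 1)k < j. Refer to Listings 2-5 through 2-10 for example code on basic slicing.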

x = np.array([5, 6, 7, 8, 9])
x[1:7:2]
---- output ----
array([6, 8])

Listing 2-5. Basic Slicing
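
To make the element-count formula concrete, the sketch below reproduces the slice x[1:7:2] by hand. It assumes a positive step and relies on the fact that a stop index beyond the end of the array is clipped to the array length; the helper variable names are introduced only for this illustration.

```python
import numpy as np

x = np.array([5, 6, 7, 8, 9])
i, j, k = 1, 7, 2

# A stop index past the end is clipped to the number of elements
j_eff = min(j, len(x))

# j - i = q*k + r, and the slice selects m = q + (r != 0) elements
q, r = divmod(j_eff - i, k)
m = q + (r != 0)

# Index values i, i + k, ..., i + (m - 1)k
indices = [i + step * k for step in range(m)]

print(m, indices)        # 2 [1, 3]
print(x[indices])        # [6 8]
print(x[i:j:k])          # [6 8]
```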

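A negative k makes stepping go toward smaller indices. Negative i and j are interpreted as n + i and n + j, where n is the number of elements in the corresponding dimension.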

print (x[-2:5])
print (x[-1:1:-1])
# ---- output ----
[8 9]
[9 8 7]

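Listing 2-6. Basic Slicing (continued)

Assume n is the number of elements in the dimension being sliced. If i is not given, it defaults to 0 for k > 0 and n - 1 for k < 0. If j is not given, it defaults to n for k > 0 and -n - 1 for k < 0, so that the slice runs through the first element. If k is not given, it defaults to 1. Note that :: is the same as : and means select all indices along this axis.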

x[4:]
# ---- output ----
array([9])

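Listing 2-7. Basic Slicing (continued)

If the number of objects in the selection tuple is less than N (the number of dimensions), then : is assumed for any subsequent dimensions.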

y = np.array([[[1],[2],[3]], [[4],[5],[6]]])
print ("Shape of y: ", y.shape)
y[1:3]
# ---- output ----
Shape of y:  (2, 3, 1)
array([[[4],
        [5],
        [6]]])

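Listing 2-8. Basic Slicing (continued)

An Ellipsis (...) expands to the number of : objects needed to make a selection tuple of the same length as x.ndim. Only one Ellipsis may be present.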

x[...,0]
---- output ----
array(5)

# Create a rank 2 array with shape (3, 4)
a = np.array([[5,6,7,8], [1,2,3,4], [9,10,11,12]])
print ("Array a:", a)

# Use slicing to pull out the subarray consisting of the first 2 rows
# and columns 1 and 2; b is the following array of shape (2, 2):
# [[6 7]
#  [2 3]]
b = a[:2, 1:3]
print ("Array b:", b)
---- output ----
Array a:  [[ 5  6  7  8]
 [ 1  2  3  4]
 [ 9 10 11 12]]
Array b:  [[6 7]
 [2 3]]

Listing 2-9. Basic Slicing (continued)

A slice of an array is a view into the same data, so modifying it will modify the original array.

print (a[0, 1])
b[0, 0] = 77
print(a[0, 1])
---- output ----
6
77

Listing 2-10. Basic Slicing (continued)
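
When the original array must stay untouched, one option is to take an explicit copy of the slice. The following sketch, added for illustration, contrasts a view with a copy using the same array values as above.

```python
import numpy as np

a = np.array([[5, 6, 7, 8], [1, 2, 3, 4], [9, 10, 11, 12]])

view = a[:2, 1:3]            # shares memory with a
copy = a[:2, 1:3].copy()     # independent data

copy[0, 0] = 77              # does not touch a
print(a[0, 1])               # 6

view[0, 0] = 99              # writes through to a
print(a[0, 1])               # 99

print(np.shares_memory(a, view), np.shares_memory(a, copy))   # True False
```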

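The middle row of the array can be accessed in two ways: 1) mixing integer indexing with slices yields an array of lower rank, and 2) using only slices yields an array of the same rank.

Example code: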

# Create the following rank 2 array with shape (3, 4)
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])

row_r1 = a[1,:]# Rank 1 view of the second row of a
row_r2 = a[1:2,:]# Rank 2 view of the second row of a
print (row_r1, row_r1.shape)
print (row_r2, row_r2.shape)
---- output ----
[5 6 7 8] (4,)
[[5 6 7 8]] (1, 4)

# We can make the same distinction when accessing columns of an array:
col_r1 = a[:, 1]
col_r2 = a[:, 1:2]
print (col_r1, col_r1.shape)
print (col_r2, col_r2.shape)
---- output ----
[ 2  6 10] (3,)
[[ 2]
 [ 6]
 [10]] (3, 1)

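Advanced Indexing

There are two kinds of advanced indexing: integer array and Boolean array.

Integer array indexing allows you to construct arbitrary arrays from the elements of another array, as shown in Listing 2-11.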

a = np.array([[1,2], [3, 4]])

# An example of integer array indexing.
# The returned array will have shape (2,) and contain a[0, 0] and a[1, 1]
print (a[[0, 1], [0, 1]])

# The preceding example of integer array indexing is equivalent to this:
print (np.array([a[0, 0], a[1, 1]]))
--- output ----
[1 4]
[1 4]

# When using integer array indexing, you can reuse the same
# element from the source array:
print (a[[0, 0], [1, 1]])

# Equivalent to the previous integer array indexing example
print (np.array([a[0, 1], a[0, 1]]))
---- output ----
[2 2]
[2 2]

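Listing 2-11. Advanced Indexing

Boolean array indexing is useful for picking arbitrary elements of an array, and is frequently used to filter elements that satisfy a given condition (Listing 2-12).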

a = np.array([[1,2], [3, 4], [5, 6]])
# Find the elements of a that are bigger than 2
print (a > 2)

# to get the actual value
print (a[a > 2])
---- output ----
[[False False]
 [ True  True]
 [ True  True]]
[3 4 5 6]

Listing 2-12. Boolean Array Indexing
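
Boolean masks can also be combined and used to replace values. The sketch below is an added illustration on the same array; the conditions and the cap value of 4 are arbitrary choices for the example.

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])

# Combine conditions elementwise with & (and) or | (or); the parentheses are required
mask = (a > 2) & (a % 2 == 0)
selected = a[mask]
print(selected)                 # [4 6]

# np.where picks from the second argument where the condition holds, else from the third
capped = np.where(a > 4, 4, a)
print(capped)
```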

Array Math

In NumPy, basic mathematical functions are available both as operator overloads and as functions. They operate elementwise on arrays (Listing 2-13).

import numpy as np

x=np.array([[1,2],[3,4],[5,6]])
y=np.array([[7,8],[9,10],[11,12]])

# Elementwise sum; both produce the array
print (x+y)
print (np.add(x, y))
---- output ----
[[ 8 10]
 [12 14]
 [16 18]]
[[ 8 10]
 [12 14]
 [16 18]]

# Elementwise difference; both produce the array
print(x-y)
print (np.subtract(x, y))
---- output ----
[[-6 -6]
 [-6 -6]
 [-6 -6]]
[[-6 -6]
 [-6 -6]
 [-6 -6]]

# Elementwise product; both produce the array
print (x*y)
print (np.multiply(x, y))
---- output ----
[[ 7 16]
 [27 40]
 [55 72]]
[[ 7 16]
 [27 40]
 [55 72]]

# Elementwise division; both produce the array
print (x/y)
print (np.divide(x, y))
---- output ----
[[0.14285714 0.25      ]
 [0.33333333 0.4       ]
 [0.45454545 0.5       ]]
[[0.14285714 0.25      ]
 [0.33333333 0.4       ]
 [0.45454545 0.5       ]]

# Elementwise square root; produces the array
print(np.sqrt(x))
---- output ----
[[1.         1.41421356]
 [1.73205081 2.        ]
 [2.23606798 2.44948974]]

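Listing 2-13. Array Math

We can use the dot function to compute inner products of vectors, to multiply matrices, or to multiply a vector by a matrix, as shown in the code example in Listing 2-14.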

x = np.array([[1,2],[3,4]])
y = np.array([[5,6],[7,8]])

a = np.array([9,10])
b = np.array([11, 12])

# Inner product of vectors; both produce 219
print (a.dot(b))
print (np.dot(a, b))
---- output ----
219
219

# Matrix / vector product; both produce the rank 1 array [29 67]
print (x.dot(a))
print (np.dot(x, a))
---- output ----
[29 67]
[29 67]

# Matrix / matrix product; both produce the rank 2 array
print (x.dot(y))
print (np.dot(x, y))
---- output ----
[[19 22]
 [43 50]]
[[19 22]
 [43 50]]

Listing 2-14. Array Math (continued)
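
In Python 3.5 and later, the @ operator is an alternative spelling for these products. The sketch below is an addition, not part of the original listing, and reuses the same values.

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])
a = np.array([9, 10])
b = np.array([11, 12])

inner = a @ b          # inner product of two vectors
mat_vec = x @ a        # matrix / vector product
mat_mat = x @ y        # matrix / matrix product

print(inner)           # 219
print(mat_vec)         # [29 67]
print(mat_mat)
```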

NumPy provides many useful functions for performing computations on arrays. One of the most useful is sum; example code is shown in Listing 2-15.

x = np.array([[1,2],[3,4]])

# Compute sum of all elements
print (np.sum(x))
# Compute sum of each column
print (np.sum(x, axis=0))
# Compute sum of each row
print (np.sum(x, axis=1))
---- output ----
10
[4 6]
[3 7]

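Listing 2-15. Sum Function

Transpose is one of the common operations often performed on matrices, and can be achieved using the T attribute of an array object. Refer to Listing 2-16 for a code example.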

x = np.array([[1,2],[3,4]])
print (x)
print (x.T)
---- output ----
[[1 2]
 [3 4]]
[[1 3]
 [2 4]]

# Note that taking the transpose of a rank 1 array does nothing:
v = np.array([1,2,3])
print (v)
print (v.T)
---- output ----
[1 2 3]
[1 2 3]

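Listing 2-16. Transpose Function

Broadcasting

Broadcasting enables arithmetic operations to be performed between arrays of different shapes. Let's look at a simple example (Listing 2-17) of adding a constant vector to each row of a matrix.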

# create a matrix
a = np.array([[1,2,3], [4,5,6], [7,8,9]])
# create a vector
v = np.array([1, 0, 1])

# Create an empty matrix with the same shape as a
b = np.empty_like(a)

# Add the vector v to each row of the matrix a with an explicit loop
for i in range(3):
    b[i, :] = a[i, :] + v

print (b)
---- output ----
[[ 2  2  4]
 [ 5  5  7]
 [ 8  8 10]]

Listing 2-17. Broadcasting

Performing this operation on a large matrix through a Python loop could be slow. Let's look at an alternative approach, shown in Listing 2-18.

# Stack 3 copies of v on top of each other
vv = np.tile(v, (3, 1))
print (vv)
---- output ----
[[1 0 1]
 [1 0 1]
 [1 0 1]]

# Add a and vv elementwise
b = a + vv
print (b)
---- output ----
[[ 2  2  4]
 [ 5  5  7]
 [ 8  8 10]]

Listing 2-18. Broadcasting for Large Matrix

Now let's see how the same result can be achieved using NumPy broadcasting, in the example code in Listing 2-19.

a = np.array([[1,2,3], [4,5,6], [7,8,9]])
v = np.array([1, 0, 1])

# Add v to each row of a using broadcasting
b = a + v
print (b)
---- output ----
[[ 2  2  4]
 [ 5  5  7]
 [ 8  8 10]]

Listing 2-19. Broadcasting Using NumPy

Now let's look at some applications of broadcasting (Listing 2-20).

# Compute outer product of vectors
# v has shape (3,)
v = np.array([1,2,3])
# w has shape (2,)
w = np.array([4,5])
# To compute an outer product, we first reshape v to be a column
# vector of shape (3, 1); we can then broadcast it against w to yield
# an output of shape (3, 2), which is the outer product of v and w:

print (np.reshape(v, (3, 1)) * w)
---- output ----
[[ 4  5]
 [ 8 10]
 [12 15]]

# Add a vector to each row of a matrix
x = np.array([[1,2,3], [4,5,6]])
# x has shape (2, 3) and v has shape (3,) so they broadcast to (2, 3)

print (x + v)

---- output ----
[[2 4 6]
 [5 7 9]]

# Add a vector to each column of a matrix
# x has shape (2, 3) and w has shape (2,).
# If we transpose x then it has shape (3, 2) and can be broadcast
# against w to yield a result of shape (3, 2); transposing this result
# yields the final result of shape (2, 3) which is the matrix x with
# the vector w added to each column

print ((x.T + w).T)
---- output ----
[[ 5  6  7]
 [ 9 10 11]]

# Another solution is to reshape w to be a column vector of shape (2, 1);
# we can then broadcast it directly against x to produce the same
# output.
print (x + np.reshape(w, (2, 1)))
---- output ----
[[ 5  6  7]
 [ 9 10 11]]

# Multiply a matrix by a constant:
# x has shape (2, 3). Numpy treats scalars as arrays of shape ();
# these can be broadcast together to shape (2, 3), producing the
# following array:
print (x * 2)
---- output ----
[[ 2  4  6]
 [ 8 10 12]]

Listing 2-20. Applications of Broadcasting

Broadcasting typically makes your code more concise and faster, so you should strive to use it where possible.
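
A common place where broadcasting appears in ML preprocessing is centering data. The sketch below, on a made-up 2 x 3 matrix, subtracts per-column and per-row means; the keepdims=True argument keeps the reduced axis so that the shapes line up for broadcasting.

```python
import numpy as np

x = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

# Column means have shape (3,), which broadcasts against each row of x
col_means = x.mean(axis=0)
col_centered = x - col_means

# Row means with keepdims=True have shape (2, 1), which broadcasts against each column
row_means = x.mean(axis=1, keepdims=True)
row_centered = x - row_means

print(col_centered)
print(row_centered)
```

Without keepdims=True the row means would have shape (2,), which cannot be broadcast against a (2, 3) array.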

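Pandas

Python has always been great for data munging; however, it was not strong for analysis compared with databases using SQL, Excel, or R data frames. Pandas is an open source Python package that provides fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive. Pandas was developed by Wes McKinney in 2008 while he was at AQR Capital Management, out of the need for a high-performance, flexible tool to perform quantitative analysis on financial data. Before leaving AQR, he was able to convince management to allow him to open source the library.

Pandas is well suited for tabular data with heterogeneously typed columns, as in an SQL table or Excel spreadsheet.

Data Structures

Pandas introduces two new data structures to Python, Series and DataFrame, both of which are built on top of NumPy (which means they are fast).

Series

This is a one-dimensional object, similar to a column in a spreadsheet or SQL table. By default, each item is assigned an index label from 0 to N - 1, where N is the number of items (Listing 2-21).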

import pandas as pd

# creating a series by passing a list of values and a custom index label.
# Note that the index labels each row, and it can contain duplicate values
s = pd.Series([1,2,3,np.nan,5,6], index=['A','B','C','D','E','F'])
print (s)
---- output ----
A    1.0
B    2.0
C    3.0
D    NaN
E    5.0
F    6.0
dtype: float64

Listing 2-21. Creating a Pandas Series

DataFrame

It is a two-dimensional object, similar to a spreadsheet or an SQL table. This is the most commonly used Pandas object (Listing 2-22).

data = {'Gender': ['F', 'M', 'M'],
        'Emp_ID': ['E01', 'E02', 'E03'],
        'Age': [25, 27, 25]}

# We want to order the columns, so lets specify in columns parameter
df = pd.DataFrame(data, columns=['Emp_ID','Gender', 'Age'])
df
---- output ----
       Emp_ID       Gender       Age
0      E01          F            25
1      E02          M            27
2      E03          M            25

Listing 2-22. Creating a Pandas DataFrame

Reading and Writing Data

We will see three commonly used file formats: csv, text file, and Excel (Listing 2-23).

# Reading
df=pd.read_csv('Data/mtcars.csv')             # from csv
df=pd.read_csv('Data/mtcars.txt', sep="\t")   # from text file
df=pd.read_excel('Data/mtcars.xlsx','Sheet2') # from Excel

# reading from multiple sheets of same Excel into different dataframes
xlsx = pd.ExcelFile('file_name.xls')
sheet1_df = pd.read_excel(xlsx, 'Sheet1')
sheet2_df = pd.read_excel(xlsx, 'Sheet2')

# writing
# index = False parameter will not write the index values, default is True
df.to_csv('Data/mtcars_new.csv', index=False)
df.to_csv('Data/mtcars_new.txt', sep="\t", index=False)
df.to_excel('Data/mtcars_new.xlsx',sheet_name='Sheet1', index = False)

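Listing 2-23. Reading/Writing Data from csv, text, Excel

Note

By default, writing will overwrite any existing file with the same name.

Basic Statistics Summary

Pandas has some built-in functions to help us better understand the data using basic statistical summary methods (Listing 2-24).

describe() returns quick statistics for each column of the DataFrame, such as count, mean, standard deviation, minimum, first quartile, median, third quartile, and maximum.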

df = pd.read_csv('Data/iris.csv')
df.describe()
---- output ----
       Sepal.Length Sepal.Width  Petal.Length Petal.Width
count  150.000000   150.000000   150.000000   150.000000
mean   5.843333     3.057333     3.758000     1.199333
std    0.828066     0.435866     1.765298     0.762238
min    4.300000     2.000000     1.000000     0.100000
25%    5.100000     2.800000     1.600000     0.300000
50%    5.800000     3.000000     4.350000     1.300000
75%    6.400000     3.300000     5.100000     1.800000
max    7.900000     4.400000     6.900000     2.500000

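Listing 2-24. Basic Statistics on DataFrame

cov() returns the covariance, which indicates how two variables are related. A positive covariance means the variables are positively related, while a negative covariance means they are inversely related. The drawback of covariance is that it does not tell you the degree of the positive or negative relationship (Listing 2-25).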

df = pd.read_csv('Data/iris.csv')
df.cov()
---- output ----
             Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
Sepal.Length 0.685694     -0.042434     1.274315     0.516271
Sepal.Width -0.042434      0.189979    -0.329656    -0.121639
Petal.Length 1.274315     -0.329656     3.116278     1.295609
Petal.Width  0.516271     -0.121639     1.295609     0.581006

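Listing 2-25. Creating Covariance on DataFrame

corr() returns the correlation, which is another way to determine how two variables are related. In addition to telling you whether variables are positively or inversely related, correlation also tells you the degree to which the variables tend to move together. Correlation measures association and does not by itself show that a change in one variable causes a change in the other. Its value always lies between -1 and 1. In the following code example, petal length has a correlation of about 0.87 with sepal length; that is, flowers with longer petals tend to have longer sepals, and vice versa (Listing 2-26).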

df = pd.read_csv('Data/iris.csv')
df.corr()
----output----
             Sepal.Length Sepal.Width  Petal.Length Petal.Width
Sepal.Length 1.000000     -0.117570    0.871754     0.817941
Sepal.Width  -0.117570    1.000000     -0.428440    -0.366126
Petal.Length 0.871754     -0.428440    1.000000     0.962865
Petal.Width  0.817941     -0.366126    0.962865     1.000000

Listing 2-26. Creating Correlation Matrix on DataFrame
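
The link between the two measures can be checked directly: the (Pearson) correlation is the covariance divided by the product of the two standard deviations. The sketch below uses a small made-up DataFrame rather than the iris file, so it runs without any data on disk.

```python
import numpy as np
import pandas as pd

# Hypothetical data: b is exactly 2 * a, and c tends to fall as a rises
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})

cov = df.cov()
std = df.std()

# corr(i, j) = cov(i, j) / (std(i) * std(j))
corr_manual = cov / np.outer(std, std)

print(corr_manual)
print(df.corr())
```

Because the scale of each variable is divided out, the result is unit-free and bounded by -1 and 1, which is what makes correlations comparable across pairs of columns.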

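Viewing Data

Pandas DataFrame comes with built-in functions to view the contained data (Table 2-2).

Table 2-2

Pandas View Functions

| Description | Syntax |
| --- | --- |
| View the top n records; the default n is 5 if not specified | df.head(n=2) |
| View the bottom n records | df.tail() |
| Get column names | df.columns |
| Get column data types | df.dtypes |
| Get DataFrame index | df.index |
| Get unique values | df[column_name].unique() |
| Get values | df.values |
| Sort DataFrame | df.sort_values(by=['Column1', 'Column2'], ascending=[True, True]) |
| Select/view by column name | df[column_name] |
| Select/view by row number | df[0:3] |
| Select by index | df.loc[0:3] # index 0 to 3 |
| Select by index for specific columns | df.loc[0:3, ['column1', 'column2']] # index 0 to 3 for specific columns |
| Select by position, using a range | df.iloc[0:2] # first 2 rows |
| Select by position, specific rows | df.iloc[[2, 3, 6]] # specific positions |
| Select by position, rows and columns | df.iloc[0:2, 0:2] # first 2 rows and first 2 columns |
| Select a column by position rather than by label | print(df.iloc[:, 2]) # all rows of the column at position 2 (the third column) |
| Faster alternative to iloc to get a scalar value | print(df.iat[1, 1]) # value at row position 1, column position 1 |
| Transpose DataFrame | df.T |
| Filter DataFrame based on a value condition for one column | df[df['column_name'] > 7.5] |
| Filter DataFrame based on a list of values for one column | df[df['column_name'].isin(['condition_value1', 'condition_value2'])] |
| Filter based on multiple conditions on multiple columns using the AND operator | df[(df['column1'] > 7.5) & (df['column2'] > 3)] |
| Filter based on multiple conditions on multiple columns using the OR operator | df[(df['column1'] > 7.5) | (df['column2'] > 3)] |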

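Basic Operations

Pandas comes with a rich set of built-in functions for basic operations (Table 2-3).

Table 2-3

Pandas Basic Operations

| Description | Syntax |
| --- | --- |
| Convert string to date series | pd.to_datetime(pd.Series(['2017-04-01', '2017-04-02', '2017-04-03'])) |
| Rename a specific column name | df.rename(columns={'old_columnname': 'new_columnname'}, inplace=True) |
| Rename all column names of DataFrame | df.columns = ['col1_new_name', 'col2_new_name', ...] |
| Flag duplicates | df.duplicated() |
| Drop duplicates | df = df.drop_duplicates() |
| Drop duplicates in a specific column | df.drop_duplicates(['column_name']) |
| Drop duplicates in a specific column, but retain the first or last observation in the duplicate set | df.drop_duplicates(['column_name'], keep='first') # change to 'last' to retain the last observation of the duplicates |
| Create a new column from an existing column | df['new_column_name'] = df['existing_column_name'] + 5 |
| Create a new column from elements of two columns | df['new_column_name'] = df['existing_column1'] + '_' + df['existing_column2'] |
| Add a list as a new column to DataFrame | df['new_column_name'] = pd.Series(mylist) |
| Drop rows that have missing values (pass axis=1 to drop columns instead) | df.dropna() |
| Replace all missing values with 0 (or any int or str) | df.fillna(value=0) |
| Replace missing values with the last valid observation (useful in time series data). For example, temperature does not change drastically compared with the previous observation, so forward or backward filling is a better approach than taking the mean. Two methods are available: 1) 'pad'/'ffill' for forward fill and 2) 'backfill'/'bfill' for backward fill. limit: the maximum number of consecutive NaN values to forward/backward fill | df.fillna(method='ffill', inplace=True, limit=1) # in recent Pandas versions, where the method argument is deprecated, use df.ffill(limit=1, inplace=True) |
| Check for missing values and return a Boolean True or False for each cell | pd.isnull(df) |
| Replace all missing values of a given column with its mean | mean = df['column_name'].mean(); df['column_name'] = df['column_name'].fillna(mean) |
| Return the mean for each column | df.mean() |
| Return the max for each column | df.max() |
| Return the min for each column | df.min() |
| Return the sum for each column | df.sum() |
| Return the count for each column | df.count() |
| Return the cumulative sum for each column | df.cumsum() |
| Apply a function along any axis of the DataFrame | df.apply(np.cumsum) |
| Iterate over each element of a series and perform the desired action | df['column_name'].map(lambda x: 1 + x) # iterates over the column and adds 1 to each element |
| Apply a function to each element of DataFrame | func = lambda x: x + 1 # function to add a constant 1 to each element; df.applymap(func) |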

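Merge/Join

Pandas provides various facilities for easily combining Series and DataFrame objects (and, in Pandas versions before 0.25, Panel objects), with various kinds of set logic for the indexes and relational algebra functionality, in join/merge-type operations (Listing 2-27).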

data = {
        'emp_id': ['1', '2', '3', '4', '5'],
        'first_name': ['Jason', 'Andy', 'Allen', 'Alice', 'Amy'],
        'last_name': ['Larkin', 'Jacob', 'A', 'AA', 'Jackson']}
df_1 = pd.DataFrame(data, columns = ['emp_id', 'first_name', 'last_name'])

data = {
        'emp_id': ['4', '5', '6', '7'],
        'first_name': ['Brian', 'Shize', 'Kim', 'Jose'],
        'last_name': ['Alexander', 'Suma', 'Mike', 'G']}
df_2 = pd.DataFrame(data, columns = ['emp_id', 'first_name', 'last_name'])

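# Using concat
df = pd.concat([df_1, df_2])
print (df)

# or

# Using append (DataFrame.append was removed in Pandas 2.0;
# on older versions it gives the same result as concat)
# print (df_1.append(df_2))

# Join the two DataFrames along columns
pd.concat([df_1, df_2], axis=1)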

---- output ----
# Table df_1
  emp_id first_name last_name
0      1      Jason    Larkin
1      2       Andy     Jacob
2      3      Allen         A
3      4      Alice        AA
4      5        Amy   Jackson

# Table df_2
  emp_id first_name  last_name
0      4      Brian  Alexander
1      5      Shize       Suma
2      6        Kim       Mike
3      7       Jose          G

# concated table
  emp_id first_name  last_name
0      1      Jason     Larkin
1      2       Andy      Jacob
2      3      Allen          A
3      4      Alice         AA
4      5        Amy    Jackson
0      4      Brian  Alexander
1      5      Shize       Suma
2      6        Kim       Mike
3      7       Jose          G

# concated along columns
  emp_id first_name last_name emp_id first_name  last_name
0      1      Jason    Larkin      4      Brian  Alexander
1      2       Andy     Jacob      5      Shize       Suma
2      3      Allen         A      6        Kim       Mike
3      4      Alice        AA      7       Jose          G
4      5        Amy   Jackson    NaN        NaN        NaN

Listing 2-27. Concat or Append Operation

A common DataFrame operation we may come across is merging two DataFrames based on a common column (Listing 2-28).

# Merge two DataFrames based on the emp_id value
# in this case only the emp_id's present in both tables will be joined
pd.merge(df_1, df_2, on="emp_id")

---- output ----
  emp_id first_name_x last_name_x first_name_y last_name_y
0      4        Alice          AA        Brian   Alexander
1      5          Amy     Jackson        Shize        Suma

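Listing 2-28. Merge Two DataFrames

Join

Pandas also provides SQL-style merges. A left join produces a complete set of records from Table A, with the matching records (where available) in Table B. If there is no match, the right side will contain null (Listing 2-29).

Note: you can add suffixes to avoid duplicate column names; if none are provided, _x is automatically added to the overlapping columns of Table A and _y to those of Table B.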

# Left join
print(pd.merge(df_1, df_2, on="emp_id", how="left"))

# Merge while adding a suffix to duplicate column names of both table
print(pd.merge(df_1, df_2, on="emp_id", how="left", suffixes=('_left', '_right')))

---- output ----
---- without suffix ----
  emp_id first_name_x last_name_x first_name_y last_name_y
0      1        Jason      Larkin          NaN         NaN
1      2         Andy       Jacob          NaN         NaN
2      3        Allen           A          NaN         NaN
3      4        Alice          AA        Brian   Alexander
4      5          Amy     Jackson        Shize        Suma
 ---- with suffix ----
  emp_id first_name_left last_name_left first_name_right last_name_right
0      1           Jason         Larkin              NaN             NaN
1      2            Andy          Jacob              NaN             NaN
2      3           Allen              A              NaN             NaN
3      4           Alice             AA            Brian       Alexander
4      5             Amy        Jackson            Shize            Suma

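Listing 2-29. Left Join Two DataFrames

A right join produces a complete set of records from Table B, with the matching records (where available) in Table A. If there is no match, the left side will contain null (Listing 2-30).

# Right join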
pd.merge(df_1, df_2, on="emp_id", how="right")
---- output ----
  emp_id first_name_x last_name_x first_name_y last_name_y
0      4        Alice          AA        Brian   Alexander
1      5          Amy     Jackson        Shize        Suma
2      6          NaN         NaN          Kim        Mike
3      7          NaN         NaN         Jose           G

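Listing 2-30. Right Join Two DataFrames

An inner join is another common join operation on DataFrames. It produces only the set of records that match in both Table A and Table B (Listing 2-31).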

pd.merge(df_1, df_2, on="emp_id", how="inner")
 ---- output ----
  emp_id first_name_x last_name_x first_name_y last_name_y
0      4        Alice          AA        Brian   Alexander
1      5          Amy     Jackson        Shize        Suma

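Listing 2-31. Inner Join Two DataFrames

Outer join: a full outer join produces the set of all records in Table A and Table B, with matching records from both sides where available. If there is no match, the missing side will contain null (Listing 2-32).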

pd.merge(df_1, df_2, on="emp_id", how="outer")
---- output ----
  emp_id first_name_x last_name_x first_name_y last_name_y
0      1        Jason      Larkin          NaN         NaN
1      2         Andy       Jacob          NaN         NaN
2      3        Allen           A          NaN         NaN
3      4        Alice          AA        Brian   Alexander
4      5          Amy     Jackson        Shize        Suma
5      6          NaN         NaN          Kim        Mike
6      7          NaN         NaN         Jose           G

Listing 2-32. Outer Join Two DataFrames
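
When checking the result of a join, it can help to see which side each row came from. The sketch below is an added illustration that uses the indicator argument of pd.merge on a reduced version of the two tables (only emp_id and first_name are kept for brevity).

```python
import pandas as pd

df_1 = pd.DataFrame({"emp_id": ["1", "2", "3", "4", "5"],
                     "first_name": ["Jason", "Andy", "Allen", "Alice", "Amy"]})
df_2 = pd.DataFrame({"emp_id": ["4", "5", "6", "7"],
                     "first_name": ["Brian", "Shize", "Kim", "Jose"]})

# indicator=True adds a _merge column: left_only, right_only, or both
merged = pd.merge(df_1, df_2, on="emp_id", how="outer", indicator=True)

# Count rows by origin (converted to plain strings for easy lookup)
counts = merged["_merge"].astype(str).value_counts().to_dict()

print(merged)
print(counts)
```

The rows marked both are exactly the ones an inner join would return, so the same column can be used to filter an outer join down to any of the other join types.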

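Grouping

Grouping involves one or more of the following steps (Listing 2-33):

Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure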

df = pd.DataFrame({'Name' : ['jack', 'jane', 'jack', 'jane', 'jack', 'jane',
                             'jack', 'jane'],
                   'State' : ['SFO', 'SFO', 'NYK', 'CA', 'NYK', 'NYK', 'SFO', 'CA'],
                   'Grade':['A','A','B','A','C','B','C','A'],
                   'Age' : np.random.uniform(24, 50, size=8),
                   'Salary' : np.random.uniform(3000, 5000, size=8),})

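# Note that older Pandas versions ordered the columns alphabetically (as in the
# output below); recent versions keep the insertion order of the dictionary.
# For a custom order, pass the columns parameter, for example:
# df = pd.DataFrame(data, columns = ['Name', 'State', 'Age','Salary'])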

# Find max age and salary by Name / State
# with the group by, we can use all aggregate functions such as min, max, mean, count, cumsum
df.groupby(['Name','State']).max()

---- output ----

---- DataFrame ----
         Age Grade  Name       Salary State
0  45.364742     A  jack  3895.416684   SFO
1  48.457585     A  jane  4215.666887   SFO
2  47.742285     B  jack  4473.734783   NYK
3  35.181925     A  jane  4866.492808    CA
4  30.285309     C  jack  4874.123001   NYK
5  35.649467     B  jane  3689.269083   NYK
6  42.320776     C  jack  4317.227558   SFO
7  46.809112     A  jane  3327.306419    CA

 ----- find max age and salary by Name / State -----

                  Age Grade       Salary
Name State
jack NYK    47.742285     C  4874.123001
     SFO    45.364742     C  4317.227558
jane CA     46.809112     A  4866.492808
     NYK    35.649467     B  3689.269083
     SFO    48.457585     A  4215.666887

Listing 2-33. Grouping Operation
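
Several aggregates can be computed in one pass with agg(). The sketch below is an added illustration on a small DataFrame with fixed, made-up salaries (instead of random values) so that the results are reproducible.

```python
import pandas as pd

df = pd.DataFrame({
    "Name":   ["jack", "jane", "jack", "jane"],
    "State":  ["SFO", "SFO", "NYK", "SFO"],
    "Salary": [4000.0, 4200.0, 4500.0, 3800.0],
})

# Split by Name, apply three functions to Salary, combine into one table
summary = df.groupby("Name")["Salary"].agg(["mean", "max", "count"])

print(summary)
```

Selecting the column before aggregating keeps non-numeric columns such as State out of the calculation.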

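Pivot Tables

Pandas provides the function pivot_table to create an MS Excel spreadsheet-style pivot table. It can take the following arguments to perform the pivot operation (Listing 2-34).

data: DataFrame object
values: column to aggregate
index: row labels
columns: column labels
aggfunc: aggregation function to use on values; the default is the mean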

# by state and name find mean age for each grade
pd.pivot_table(df, values="Age", index=['State', 'Name'], columns=['Grade'])
---- output ----
Grade               A          B          C
State Name
CA    jane  40.995519        NaN        NaN
NYK   jack        NaN  47.742285  30.285309
      jane        NaN  35.649467        NaN
SFO   jack  45.364742        NaN  42.320776
      jane  48.457585        NaN        NaN

Listing 2-34. Pivot Tables
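
The aggfunc argument can be changed from the default mean, and fill_value replaces the NaN cells produced by empty combinations. The sketch below is an added illustration on a small made-up DataFrame; the choice of max and of 0 as the fill value is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({
    "State": ["CA", "CA", "NYK", "NYK"],
    "Grade": ["A", "A", "B", "C"],
    "Age":   [30.0, 40.0, 35.0, 45.0],
})

# Maximum age per State and Grade; combinations with no rows are filled with 0
table = pd.pivot_table(df, values="Age", index="State", columns="Grade",
                       aggfunc="max", fill_value=0)

print(table)
```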

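Matplotlib

Matplotlib is a numerical mathematics extension of NumPy and a great package for viewing or presenting data in a pictorial or graphical format. It enables analysts and decision makers to see analytics presented visually, so they can grasp difficult concepts or identify new patterns. There are two broad ways of using pyplot (Matplotlib provides pyplot, a collection of command-style functions that make Matplotlib work like MATLAB):

Global functions
Object-oriented

Using Global Functions

The most common and easy approach is to use global functions to build and display a global figure, using Matplotlib as a global state machine (Listing 2-35). Let's look at some of the most commonly used charts:

plt.bar: creates a bar chart
plt.scatter: makes a scatter plot
plt.boxplot: makes a box and whisker plot
plt.hist: makes a histogram
plt.plot: creates a line plot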

import matplotlib.pyplot as plt
%matplotlib inline

# simple bar and scatter plot
x = np.arange(5)          # assume there are 5 students
y = (20, 35, 30, 35, 27)  # their test scores
plt.bar(x,y)              # Barplot

# close the figure using show() or close(); otherwise any follow-up plot commands will draw on the same figure
plt.show()                # Try commenting this out and rerunning
plt.scatter(x,y)          # scatter plot
plt.show()

# ---- output ----

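Listing 2-35. Creating a Plot on Variables

You can use the histogram, line graph, and box plot directly on a DataFrame. You can see that it is very quick and does not take much coding effort (Listing 2-36).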

df = pd.read_csv('Data/iris.csv')   # Read sample data
df.hist()                           # Histogram
df.plot()                           # Line Graph
df.boxplot()                        # Box plot
#  --- histogram--------------line graph -----------box plot-------

Listing 2-36. Creating Plot on DataFrame

Customize Labels

You can customize the labels to make them more meaningful, as shown in Listing 2-37.

# generate sample data
x = np.linspace(0, 20, 1000)  # 1000 evenly spaced values from 0 to 20
y = np.sin(x)

# customize axis labels
plt.plot(x, y, label = 'Sample Label')
plt.title('Sample Plot Title')                        # chart title
plt.xlabel('x axis label')                            # x axis title
plt.ylabel('y axis label')                            # y axis title
plt.grid(True)                                        # show gridlines

# add footnote
plt.figtext(0.995, 0.01, 'Footnote', ha="right", va="bottom")

# add legend, location pick the best automatically
plt.legend(loc='best', framealpha=0.5, prop={'size':'small'})

# tight_layout() can take keyword arguments of pad, w_pad and h_pad.
# these control the extra padding around the figure border and between subplots.
# The pads are specified in fraction of fontsize.
plt.tight_layout(pad=1)

# Saving chart to a file
plt.savefig('filename.png')

# Close the current figure to allow new plot creation on a separate window / axis;
# alternatively we can use show()
plt.close()
plt.show()

---- output ----

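Listing 2-37. Customize Labels

Object-Oriented

You get an empty Figure from the global factory and then build the plot explicitly using the methods of the Figure and the classes it contains. The Figure is the top-level container for everything on a canvas. Axes is a container class for a specific plot. A Figure may contain many Axes and/or Subplots. Subplots are laid out in a grid within the Figure, whereas Axes can be placed anywhere on the Figure. We can use the subplots factory to get the Figure and all the desired Axes at once. This is demonstrated in Listings 2-38 through 2-48 and Figures 2-14 through 2-16.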

fig, ax = plt.subplots()
fig,(ax1,ax2,ax3) = plt.subplots(nrows=3, ncols=1, sharex=True, figsize=(8,4))

# Iterating the Axes within a Figure
for ax in fig.get_axes():
    pass                            # do something

# ---- output ----

Listing 2-38. Object-Oriented Customization

Line Plots Using ax.plot()

This is a single plot constructed with a Figure and an Axes.

# generate sample data
x = np.linspace(0, 20, 1000)
y = np.sin(x)

fig = plt.figure(figsize=(8,4))                        # get an empty figure and add an Axes
ax = fig.add_subplot(1,1,1)                            # row-col-num
ax.plot(x, y, 'b-', linewidth=2, label='Sample label') # line plot data on the Axes

# add title, labels and legend, etc.
ax.set_ylabel('y axis label', fontsize=16)             # y label
ax.set_xlabel('x axis label', fontsize=16)             # x label
ax.legend(loc='best')                                  # legend
ax.grid(True)                                          # show grid
fig.suptitle('Sample Plot Title')                      # title
fig.tight_layout(pad=1)                                # tidy layout

fig.savefig('filename.png', dpi=125)

# ---- output ----

Listing 2-39. Single Line Plot Using ax.plot()

Multiple Lines on the Same Axis

You can see a code example for drawing multiple line plots on the same axis in Listing 2-40.

# get the Figure and Axes all at once
fig, ax = plt.subplots(figsize=(8,4))

x1 = np.linspace(0, 100, 20)
x2 = np.linspace(0, 100, 20)
x3 = np.linspace(0, 100, 20)
y1 = np.sin(x1)
y2 = np.cos(x2)
y3 = np.tan(x3)

ax.plot(x1, y1, label="sin")
ax.plot(x2, y2, label="cos")
ax.plot(x3, y3, label="tan")

# add grid, legend, title and save
ax.grid(True)

ax.legend(loc='best', prop={'size':'large'})

fig.suptitle('A Simple Multi Axis Line Plot')
fig.savefig('filename.png', dpi=125)

# ---- output ----

Listing 2-40 Multiple Line Plot on the Same Axis

Multiple Lines on Different Axes

Refer to Listing 2-41 for a code example that draws multiple lines on different axes.

# Changing sharex to True will use the same x axis
fig, (ax1,ax2,ax3) = plt.subplots(nrows=3, ncols=1, sharex=False, sharey = False, figsize=(8,4))

# plot some lines
x1 = np.linspace(0, 100, 20)
x2 = np.linspace(0, 100, 20)
x3 = np.linspace(0, 100, 20)
y1 = np.sin(x1)
y2 = np.cos(x2)
y3 = np.tan(x3)

ax1.plot(x1, y1, label="sin")
ax2.plot(x2, y2, label="cos")
ax3.plot(x3, y3, label="tan")

# add grid, legend, title and save
ax1.grid(True)
ax2.grid(True)
ax3.grid(True)

ax1.legend(loc='best', prop={'size':'large'})
ax2.legend(loc='best', prop={'size':'large'})
ax3.legend(loc='best', prop={'size':'large'})

fig.suptitle('A Simple Multi Axis Line Plot')
fig.savefig('filename.png', dpi=125)
# ---- output ----

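Listing 2-41 Multiple Lines on Different Axes

Control the Line Style and Marker Style

Refer to the code example in Listing 2-42 to see how to control the line style and marker style of a chart.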

# get the Figure and Axes all at once
fig, ax = plt.subplots(figsize=(8,4))

# plot some lines
N = 3 # the number of lines we will plot
styles =  ['-', '--', '-.', ':']
markers = list('+ox')
x = np.linspace(0, 100, 20)

for i in range(N): # add line-by-line
    y = x + x/5*i + i
    s = styles[i % len(styles)]
    m = markers[i % len(markers)]
    ax.plot(x, y, alpha = 1, label='Line '+str(i+1)+' '+s+m,
                  marker=m, linewidth=2, linestyle=s)

# add grid, legend, title and save
ax.grid(True)
ax.legend(loc='best', prop={'size':'large'})
fig.suptitle('A Simple Line Plot')
fig.savefig('filename.png', dpi=125)

# ---- output ----

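Listing 2-42 Line Style and Marker Style Controls

Line Style Reference

Figure 2-14 summarizes the available matplotlib line styles.

Figure 2-14 Matplotlib line style reference

Marker Reference

Figure 2-15 summarizes the available matplotlib marker styles.

Figure 2-15 Matplotlib marker reference

Colormaps Reference

All colormaps shown in Figure 2-16 can be reversed by appending _r; for example, gray_r is the reverse of gray.

Figure 2-16 Matplotlib colormaps reference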
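The _r naming rule can be checked directly. The following is a minimal sketch, assuming matplotlib is installed; the Agg backend is selected only so that the snippet runs without a display, and the variable names are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

gray = plt.get_cmap("gray")      # runs from black to white
gray_r = plt.get_cmap("gray_r")  # same colormap, reversed

# A colormap maps a number in [0, 1] to an RGBA tuple
low_gray = gray(0.0)      # darkest end of gray
low_gray_r = gray_r(0.0)  # lightest end, because the order is flipped
print(low_gray, low_gray_r)
```

Any colormap name accepted by the cmap argument of a plotting call should work the same way with the _r suffix.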

Bar Plots Using ax.bar()

Refer to Listing 2-43 for a code example of a bar chart using ax.bar().

# get the data
N = 4
labels = list('ABCD')
data = np.array(range(N)) + np.random.rand(N)

#plot the data
fig, ax = plt.subplots(figsize=(8, 3.5))
width = 0.5
tickLocations = np.arange(N)
rectLocations = tickLocations-(width/2.0)

# for color either the HEX value or the name of the color can be used
# align='edge' puts each bar's left edge at rectLocations, centering it on its tick
ax.bar(rectLocations, data, width,
       color='lightblue',
       edgecolor='#1f10ed', linewidth=4.0,
       align='edge')

# tidy-up the plot
ax.set_xticks(ticks= tickLocations)
ax.set_xticklabels(labels)
ax.set_xlim(min(tickLocations)-0.6, max(tickLocations)+0.6)
ax.set_yticks(range(N)[1:])
ax.set_ylim((0,N))
ax.yaxis.grid(True)
ax.set_ylabel('y axis label', fontsize=8)             # y label
ax.set_xlabel('x axis label', fontsize=8)             # x label

# title and save
fig.suptitle("Bar Plot")
fig.tight_layout(pad=2)
fig.savefig('filename.png', dpi=125)
# ---- output ----

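Listing 2-43 Bar Plots Using ax.bar() and ax.barh()

Horizontal Bar Charts Using ax.barh()

Just as tick placement needs to be managed for vertical bars, it also needs to be managed for horizontal bars, which sit on the y ticks (Listing 2-44).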

# get the data
N = 4
labels = list('ABCD')
data = np.array(range(N)) + np.random.rand(N)

#plot the data
fig, ax = plt.subplots(figsize=(8, 3.5))
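width = 0.5
tickLocations = np.arange(N)
rectLocations = tickLocations-(width/2.0)

# for color either the HEX value or the name of the color can be used
# align='edge' puts each bar's lower edge at rectLocations, centering it on its tick
ax.barh(rectLocations, data, width, color="lightblue", align='edge')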

# tidy-up the plot
ax.set_yticks(ticks= tickLocations)
ax.set_yticklabels(labels)
ax.set_ylim(min(tickLocations)-0.6, max(tickLocations)+0.6)
ax.xaxis.grid(True)
ax.set_ylabel('y axis label', fontsize=8)             # y label
ax.set_xlabel('x axis label', fontsize=8)             # x label

# title and save
fig.suptitle("Bar Plot")
fig.tight_layout(pad=2)
fig.savefig('filename.png', dpi=125)
# ---- output ----

Listing 2-44 Horizontal Bar Charts

Side by Side Bar Chart

Refer to Listing 2-45 for a code example that draws bar charts side by side.

# generate sample data
pre = np.array([19, 6, 11, 9])
post = np.array([15, 11, 9, 8])
labels=['Survey '+x for x in list('ABCD')]

# the plot – left then right
fig, ax = plt.subplots(figsize=(8, 3.5))
width = 0.4 # bar width
xlocs = np.arange(len(pre))
ax.bar(xlocs-width, pre, width,
       color='green', label="True")
ax.bar(xlocs, post, width,
       color='#1f10ed', label="False")

# labels, grids and title, then save
ax.set_xticks(ticks=range(len(pre)))
ax.set_xticklabels(labels)
ax.yaxis.grid(True)
ax.legend(loc='best')
ax.set_ylabel('Count')
fig.suptitle('Sample Chart')
fig.tight_layout(pad=1)
fig.savefig('filename.png', dpi=125)
# ---- output ----

Listing 2-45 Side by Side Bar Chart

Stacked Bar Example Code

Refer to Listing 2-46 for a code example that creates a stacked bar chart.

# generate sample data
pre = np.array([19, 6, 11, 9])
post = np.array([15, 11, 9, 8])
labels=['Survey '+x for x in list('ABCD')]

# the plot – left then right
fig, ax = plt.subplots(figsize=(8, 3.5))
width = 0.4 # bar width
xlocs = np.arange(len(pre)+2)
adjlocs = xlocs[1:-1] - width/2.0
ax.bar(adjlocs, pre, width,
       color='grey', label="True")
ax.bar(adjlocs, post, width,
       color='cyan', label="False",
       bottom=pre)

# labels, grids and title, then save
ax.set_xticks(ticks=xlocs[1:-1])
ax.set_xticklabels(labels)
ax.yaxis.grid(True)
ax.legend(loc='best')
ax.set_ylabel('Count')
fig.suptitle('Sample Chart')
fig.tight_layout(pad=1)
fig.savefig('filename.png', dpi=125)
# ---- output ----

Listing 2-46 Stacked Bar Charts

Pie Chart Using ax.pie()

Refer to Listing 2-47 for a code example that creates a pie chart.

# generate sample data
data = np.array([15,8,4])
labels = ['Feature Engineering', 'Model Tuning', 'Model Building']
explode = (0, 0.1, 0) # explode the second slice (Model Tuning)
colrs=['cyan', 'tan', 'wheat']

# plot
fig, ax = plt.subplots(figsize=(8, 3.5))
ax.pie(data, explode=explode,
       labels=labels, autopct='%1.1f%%',
       startangle=270, colors=colrs)
ax.axis('equal') # keep it a circle

# tidy-up and save
fig.suptitle("ML Pie")
fig.savefig('filename.png', dpi=125)
# ---- output ----

Listing 2-47 Pie Chart

Example Code for Grid Creation

Refer to Listing 2-48 for a code example of grid creation.

# Simple subplot grid layouts
fig = plt.figure(figsize=(8,4))
fig.text(x=0.01, y=0.01, s="Figure",color='#888888', ha="left", va="bottom", fontsize=20)

for i in range(4):
    # fig.add_subplot(nrows, ncols, num)
    ax = fig.add_subplot(2, 2, i+1)
    ax.text(x=0.01, y=0.01, s='Subplot 2 2 '+str(i+1),  color='red', ha="left", va="bottom", fontsize=20)
    ax.set_xticks([]); ax.set_yticks([])
fig.suptitle('Subplots')
fig.savefig('filename.png', dpi=125)
# ---- output ----

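Listing 2-48 Grid Creation

Plotting Defaults

Matplotlib uses the matplotlibrc configuration file to customize all kinds of properties, which we call rc settings or rc parameters. You can control the defaults of almost every property in matplotlib, such as figure size and dpi, line width, color and style, axes, axis and grid properties, text and font properties, and so on (Listing 2-49). The location of the configuration file can be found with the following code, so that you can edit it if required.

import matplotlib
import matplotlib.pyplot as plt

# get the configuration file location
print(matplotlib.matplotlib_fname())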

# get configuration current settings
print (matplotlib.rcParams)

# Change the default settings
plt.rc('figure', figsize=(8,4), dpi=125,facecolor='white', edgecolor="white")
plt.rc('axes', facecolor='#e5e5e5',  grid=True, linewidth=1.0, axisbelow=True)
plt.rc('grid', color="white", linestyle='-',    linewidth=2.0, alpha=1.0)
plt.rc('xtick', direction="out")
plt.rc('ytick', direction="out")
plt.rc('legend', loc="best")

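Listing 2-49 Plotting Defaults

Machine Learning Core Libraries

Python has a plethora of open source ML libraries. Table 2-4 gives a quick summary of the top 10 Python ML libraries, ranked by number of contributors. It also shows the percentage change in their contributor counts between 2016 and 2018.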

Table 2-4 Python ML Libraries

| Project name | Contributors (2016) | Contributors (2018) | Change % | License | Source |
| --- | --- | --- | --- | --- | --- |
| Scikit-learn | 732 | 1237 | 69% | BSD 3 | www.github.com/scikit-learn/scikit-learn |
| Keras | N/A | 770 | N/A | MIT | https://github.com/keras-team/keras |
| Xgboost | N/A | 338 | N/A | Apache 2.0 | https://github.com/dmlc/xgboost |
| Statsmodels | N/A | 167 | N/A | BSD 3 | https://github.com/statsmodels/statsmodels |
| Pylearn2 | 115 | 116 | 1% | BSD 3 | www.github.com/lisa-lab/pylearn2 |
| NuPIC | 75 | 86 | 15% | AGPL 3 | www.github.com/numenta/nupic |
| Nilearn | 46 | 81 | 76% | BSD | www.github.com/nilearn/nilearn |
| PyBrain | 31 | 32 | 3% | BSD 3 | www.github.com/pybrain/pybrain |
| Pattern | 20 | 19 | -5% | BSD 3 | www.github.com/clips/pattern |
| Fuel | 29 | 32 | 10% | MIT | www.github.com/mila-udem/fuel |

Note: The 2016 numbers are based on KDnuggets news.

Scikit-learn is the most popular and widely used ML library. It is built on top of SciPy and features a rich set of supervised and unsupervised learning algorithms.

We'll learn more about the different Scikit-learn algorithms in detail in the next chapter.

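Summary

With this we have reached the end of this chapter. We have learned what machine learning is and where it fits in the wider family of artificial intelligence. We have also learned about the related forms and terms that exist in parallel to ML (such as statistics, data or business analytics, and data science) and why they exist. We have briefly covered the high-level categories of ML and the most commonly used frameworks for building efficient ML systems. Finally, we learned that ML libraries can be categorized into data analysis and core ML packages, and we looked at the key concepts and example implementation code for three important data analysis packages: NumPy, Pandas, and Matplotlib. I want to leave you with some useful resources (Table 2-5) for future reference, to deepen your knowledge of the data analysis packages.

Table 2-5 Additional Resources

| Resource | Description | Mode |
| --- | --- | --- |
| https://docs.scipy.org/doc/numpy/reference/ | This is the NumPy reference documentation, which covers all the concepts in detail. | Online |
| http://pandas.pydata.org/pandas-docs/stable/tutorials.html | This is a guide to many pandas tutorials, geared mainly toward new users. | Online |
| http://matplotlib.org/users/beginner.html | Beginner's guide, Pyplot tutorial | Online |
| Python for Data Analysis | This book is concerned with the nuts and bolts of manipulating, processing, cleaning, and crunching data in Python. | Book |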

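3. Fundamentals of Machine Learning

This chapter focuses on different algorithms of supervised and unsupervised machine learning (ML), using two key Python packages.

Scikit-learn: In 2007, David Cournapeau developed Scikit-learn as part of the Google Summer of Code project. INRIA got involved in 2010, and the beta v0.1 was released to the public. Currently, there are more than 700 active contributors and paid sponsorship from INRIA, the Python Software Foundation, Google, and Tinyclues. Many of the functions of Scikit-learn are built on the SciPy (Scientific Python) library, and it provides a great breadth of efficiently implemented, fundamental, supervised and unsupervised learning algorithms.

Note

Scikit-learn is also known as sklearn, so these two terms are used interchangeably throughout this book.

Statsmodels: This complements the SciPy package and is one of the best packages for running regression models, as it provides an extensive list of statistical results for each estimator of the model.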

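Machine Learning Perspective of Data

Data is the facts and figures (which can also be referred to as raw data) that are available to us in the business context. Data is made up of two aspects:

Objects such as people, trees, animals, etc.
Properties recorded for the objects, such as age, size, weight, cost, etc.

When we measure the properties of an object, we obtain values that vary between objects. For example, if we consider individual plants in a garden as objects, the property "height" will vary between them. Because different properties vary between objects in this way, properties are more collectively known as variables.

Variables are the things that we measure, control, or manipulate for objects. They differ in how well they can be measured, that is, in how much measurable information their measurement scale can provide. The amount of information a variable can provide is determined by its type of measurement scale.

At a high level, there are two types of variables, based on the type of values they can take:

Continuous or quantitative: The variable can take any positive or negative numerical value within a large range. Retail sales amount and insurance claim amount are examples of continuous variables, which can take any number within a large range. These types of variables are also generally known as numerical variables.
Discrete or qualitative: The variable can take only particular values. Retail store location area, state, and city are examples of discrete variables, as each can take only one particular value for a store (here "store" is our object). These types of variables are also known as categorical variables.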

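Scales of Measurement

In general, variables can be measured on four different scales (nominal, ordinal, interval, and ratio). Mean, median, and mode are ways of understanding the central tendency (middle point) of a data distribution. Standard deviation, variance, and range are the most commonly used dispersion measures for understanding the spread of the data.

Nominal Scale of Measurement

Data is measured at the nominal level when each case is classified into one of a number of discrete categories. This is also called categorical, that is, used only for classification. Because a mean is not meaningful here, all we can do is count the number of occurrences of each type and compute the proportion (number of occurrences of each type / total occurrences). Refer to Table 3-1 for nominal scale examples.

Table 3-1 Nominal Scale Examples

| Variable name | Example measurement values |
| --- | --- |
| Color | Red, Green, Yellow, etc. |
| Gender | Female, Male |
| Football players' jersey number | 1, 2, 3, 4, 5, etc. |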
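Counting and proportions, the only statistics that make sense on a nominal scale, can be sketched with the standard library alone. The color values below are made up for illustration.

```python
from collections import Counter

# Hypothetical nominal observations (illustrative data only)
colors = ["red", "green", "red", "yellow", "green", "red"]

counts = Counter(colors)                      # occurrences of each level
total = sum(counts.values())
proportions = {level: n / total for level, n in counts.items()}

# The mode is the most frequent level; a mean would be meaningless here
mode_level, mode_count = counts.most_common(1)[0]
print(counts, proportions, mode_level)
```

With pandas, value_counts() and value_counts(normalize=True) on a Series give the same two summaries.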

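Ordinal Scale of Measurement

Data is measured on an ordinal scale if the categories imply order. The differences between ranks are consistent in direction and authority, but not in magnitude. Refer to Table 3-2 for ordinal scale examples.

Table 3-2 Ordinal Scale Examples

| Variable name | Example measurement values |
| --- | --- |
| Military rank | Second Lieutenant, First Lieutenant, Captain, Major, Lieutenant Colonel, Colonel, etc. |
| Clothing size | Small, Medium, Large, Extra Large, etc. |
| Class rank in an exam | 1, 2, 3, 4, 5, etc. |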

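Interval Scale of Measurement

Data is measured on an interval scale if the differences between values have meaning. Refer to Table 3-3 for interval scale examples.

Table 3-3 Interval Scale Examples

| Variable name | Example measurement values |
| --- | --- |
| Temperature | 10, 20, 30, 40, etc. |
| IQ rating | 85–114, 115–129, 130–144, 145–159, etc. |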

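Ratio Scale of Measurement

Data measured on a ratio scale has differences that are meaningful and that relate to some true zero point. This is the most common scale of measurement. Refer to Table 3-4 for ratio scale examples.

Table 3-4 Ratio Scale Examples

| Variable name | Example measurement values |
| --- | --- |
| Weight | 10, 20, 30, 40, 50, 60, etc. |
| Height | 5, 6, 7, 8, 9, etc. |
| Age | 1, 2, 3, 4, 5, 6, 7, etc. |

Table 3-5 provides a quick summary of the different key scales of measurement.

Table 3-5 Comparison of the Different Scales of Measurement

| | Nominal | Ordinal | Interval | Ratio |
| --- | --- | --- | --- | --- |
| Properties | Identity | Identity, magnitude | Identity, magnitude, equal intervals | Identity, magnitude, equal intervals, true zero |
| Mathematical operations | Count | Rank order | Addition, subtraction | Addition, subtraction, multiplication, division |
| Descriptive statistics | Mode, proportion | Mode, median, range statistics | Mode, median, range statistics, variance, standard deviation | Mode, median, range statistics, variance, standard deviation |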

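Feature Engineering

The output or prediction quality of any ML algorithm depends predominantly on the quality of the input being passed. The process of creating appropriate data features by applying business context is called feature engineering, and it is one of the most important aspects of building an efficient ML system. Business context here means the expression of the business problem we are trying to address, why we are trying to solve it, and what the expected outcome is. So before proceeding to the different types of ML algorithms, let's understand the fundamentals of feature engineering. Figure 3-1 shows the logical flow of raw data to an ML algorithm.

Figure 3-1 Logical flow of data in ML model building

Data from different sources "as is" is raw data, and when we apply business logic to process the raw data, the result is information (processed data). Further insight is derived from information. The process of converting raw data into information, combined with business context, to address a particular business problem is an important aspect of feature engineering. The output of feature engineering is a clean and meaningful set of features that an algorithm can use to identify patterns and build an ML model, which can then be applied to unseen data to predict the likely outcome. To obtain an efficient ML system, feature optimization is usually carried out to reduce the feature dimension and retain only the important and meaningful features, which reduces computation time and improves prediction performance. Note that ML model building is an iterative process. Let's look at some of the common practices in feature engineering.

Dealing with Missing Data

Missing data can mislead an analysis or create problems for it. To avoid such issues, you need to impute the missing data. There are four commonly used techniques for data imputation:

Delete: You can simply delete the rows containing missing values. This technique is more suitable and effective when the number of rows with missing values is insignificant (say < 5%) compared with the total number of records. You can achieve this with pandas' dropna() function.
Replace with summary: This is probably the most commonly used imputation technique. Summarization here means the mean, mode, or median of the respective column. For continuous or quantitative variables, the mean, mode, or median of the respective column can be used to replace the missing values, whereas for categorical or qualitative variables, the mode (most frequent value) works better. You can achieve this with pandas' fillna() function (refer to the "Pandas" section of Chapter 2).
Random replace: You can also replace the missing values with a value picked at random from the respective column. This technique is appropriate when the number of rows with missing values is insignificant.
Using a predictive model: This is an advanced technique. Here you train a regression model for continuous variables, or a classification model for categorical variables, on the available data and use the model to predict the missing values.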
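The first two techniques can be sketched in a few lines of pandas. The DataFrame below is made up for illustration; only dropna(), fillna(), mean(), and mode() are used.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],
    "city": ["Paris", "Lyon", None, "Paris"],
})

# Delete: drop every row that contains a missing value
dropped = df.dropna()

# Replace with summary: mean for the numeric column, mode for the categorical one
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])

print(dropped)
print(filled)
```

Deletion keeps only two of the four rows here, which shows why it is usually reserved for cases where few rows are affected.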

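Handling Categorical Data

Most ML libraries are designed to work well with numerical variables, so categorical variables in their original form of text description can't be used directly for model building. Let's learn some of the common methods of handling categorical data, based on the number of levels.

Create a dummy variable: This is a Boolean variable that indicates the presence of a category with the value 1, and its absence with 0. You should create k-1 dummy variables, where k is the number of levels. The pandas get_dummies function creates dummy variables for a given categorical variable (Listing 3-1); Scikit-learn offers OneHotEncoder for the same purpose.

import pandas as pd

df = pd.DataFrame({'A': ['high', 'medium', 'low'],
                   'B': [10,20,30]},
                    index=[0, 1, 2])

print(df)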
#----output----
        A   B
0    high  10
1  medium  20
2     low  30

# using get_dummies function of pandas package
df_with_dummies= pd.get_dummies(df, prefix="A", columns=['A'])
print (df_with_dummies)
#----output----
    B  A_high  A_low  A_medium
0  10     1.0    0.0       0.0
1  20     0.0    0.0       1.0
2  30     0.0    1.0       0.0

Listing 3-1 Creating Dummy Variables
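Listing 3-1 produces k columns for k levels. To get the k-1 dummy variables recommended above, one option is the drop_first argument of get_dummies; the sketch below reuses the same small DataFrame, and dtype=int is passed only to keep the 0/1 display regardless of pandas version.

```python
import pandas as pd

df = pd.DataFrame({"A": ["high", "medium", "low"], "B": [10, 20, 30]})

# drop_first=True drops the first level (alphabetically, 'high'),
# leaving k-1 = 2 dummy columns for the 3 levels of A
dummies = pd.get_dummies(df, prefix="A", columns=["A"],
                         drop_first=True, dtype=int)
print(dummies)
```

A row with zeros in both remaining columns then stands for the dropped level, so no information is lost.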

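Convert to numbers: Another simple method is to represent the text description of each level with a number, using Scikit-learn's LabelEncoder (Listing 3-2). If the number of levels is high (e.g., zip code, state, etc.), then apply business logic to combine levels into groups. For example, zip codes or states can be combined into regions; however, this method carries the risk of losing critical information. Another method is to combine categories based on similar frequency (the new categories can be high, medium, and low).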

import pandas as pd

# using pandas package's factorize function
df['A_pd_factorized'] = pd.factorize(df['A'])[0]

# Alternatively you can use sklearn package's LabelEncoder function
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

df['A_LabelEncoded'] = le.fit_transform(df.A)
print (df)
#----output----
        A   B  A_pd_factorized  A_LabelEncoded
0    high  10                0               0
1  medium  20                1               2
2     low  30                2               1

Listing 3-2 Converting the Categorical Variable to Numerics
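Combining levels by frequency can be sketched as follows. The state codes and the two cut-off values are hypothetical; in practice the thresholds would come from business logic.

```python
import pandas as pd

# Hypothetical high-cardinality column (illustrative values only)
states = pd.Series(["CA", "CA", "CA", "CA", "NY", "NY", "TX", "WA"])

freq = states.value_counts()  # occurrences per level

def bucket(n):
    """Map a level's frequency to a coarse group (assumed thresholds)."""
    if n >= 4:
        return "high"
    if n >= 2:
        return "medium"
    return "low"

# Replace each state by the frequency group of its level
state_group = states.map(freq).map(bucket)
print(state_group.tolist())
```

The resulting three-level variable can then be encoded with either of the methods shown in Listings 3-1 and 3-2.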

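Normalizing Data

The unit or scale of measurement varies between variables, so an analysis with the raw measurements could be artificially skewed toward the variables with higher absolute values. Bringing all the different variable units to the same order of magnitude removes this scale effect, which could otherwise misrepresent the findings and negatively affect the accuracy of the conclusions. Two widely used methods for rescaling data are normalization and standardization.

Normalizing data can be achieved by min-max scaling. The formula given below scales all numeric values into the range 0 to 1.

$X_{normalized} = \frac{X - X_{min}}{X_{max} - X_{min}}$

Note

Make sure you remove extreme outliers before applying the preceding technique, because they squeeze the normal values in your data into a small interval.

The standardization technique transforms a variable to have zero mean and a standard deviation of one. The formula for standardization is given below, and the outcome is commonly known as a z-score.

$Z = \frac{X - \mu}{\sigma}$

where μ is the mean and σ is the standard deviation.

Standardization is often the preferred method for various analyses, because it tells us where each data point lies within its distribution and gives a rough indication of outliers. Refer to Listing 3-3 for example code for standardization and scaling.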

from sklearn import datasets
import numpy as np
from sklearn import preprocessing

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target

std_scale = preprocessing.StandardScaler().fit(X)
X_std = std_scale.transform(X)

minmax_scale = preprocessing.MinMaxScaler().fit(X)
X_minmax = minmax_scale.transform(X)

print('Mean before standardization: petal length={:.1f}, petal width={:.1f}'
      .format(X[:,0].mean(), X[:,1].mean()))
print('SD before standardization: petal length={:.1f}, petal width={:.1f}'
      .format(X[:,0].std(), X[:,1].std()))

print('Mean after standardization: petal length={:.1f}, petal width={:.1f}'
      .format(X_std[:,0].mean(), X_std[:,1].mean()))
print('SD after standardization: petal length={:.1f}, petal width={:.1f}'
      .format(X_std[:,0].std(), X_std[:,1].std()))

print('\nMin value before min-max scaling: petal length={:.1f}, petal width={:.1f}'
      .format(X[:,0].min(), X[:,1].min()))
print('Max value before min-max scaling: petal length={:.1f}, petal width={:.1f}'
      .format(X[:,0].max(), X[:,1].max()))

print('Min value after min-max scaling: petal length={:.1f}, petal width={:.1f}'
      .format(X_minmax[:,0].min(), X_minmax[:,1].min()))

print('Max value after min-max scaling: petal length={:.1f}, petal width={:.1f}'
      .format(X_minmax[:,0].max(), X_minmax[:,1].max()))
#----output----
Mean before standardization: petal length=3.8, petal width=1.2
SD before standardization: petal length=1.8, petal width=0.8
Mean after standardization: petal length=-0.0, petal width=-0.0
SD after standardization: petal length=1.0, petal width=1.0

Min value before min-max scaling: petal length=1.0, petal width=0.1
Max value before min-max scaling: petal length=6.9, petal width=2.5

Min value after min-max scaling: petal length=0.0, petal width=0.0
Max value after min-max scaling: petal length=1.0, petal width=1.0

Listing 3-3 Normalization and Scaling
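The two rescaling formulas can also be applied by hand, which makes it easier to see what the scalers compute. This is a minimal NumPy sketch on made-up values; it uses the population standard deviation (ddof=0), which is also what StandardScaler uses.

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # illustrative values

# Min-max scaling: (X - X_min) / (X_max - X_min)
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): (X - mean) / standard deviation
x_std = (x - x.mean()) / x.std()

print(x_minmax)                    # values between 0 and 1
print(x_std.mean(), x_std.std())   # approximately 0 and 1
```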

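Feature Construction or Generation

ML algorithms give the best results only when we provide them with the best possible features for the problem you are trying to address. Often these features have to be created manually, by spending a lot of time with the actual raw data and trying to understand its relationship with all the other data you have collected to address the business problem.

This means thinking about aggregating, splitting, or combining features to create new features, or decomposing features. This part is often described as an art form, and it is the key differentiator in competitive ML.

Feature construction is manual and slow, and it requires a lot of intervention from subject matter experts to create rich features that can be exposed to predictive modeling algorithms to produce the best results.

Summarizing the data is a fundamental technique that helps us understand data quality and issues or gaps. Figure 3-2 maps the tabular and graphical data summarization methods to the different data types. Note that this mapping shows the obvious or commonly used methods, not an exhaustive list.

Figure 3-2 Commonly used data summarization methods

Exploratory Data Analysis

EDA is all about understanding your data by using summarization and visualization techniques. At a high level, EDA can be performed in two ways: univariate analysis and multivariate analysis.

Let's work through an example dataset to learn this in practice. The Iris dataset is a well-known dataset used extensively in the pattern recognition literature. It is hosted at the UC Irvine Machine Learning Repository. The dataset contains petal length, petal width, sepal length, and sepal width measurements for three types of iris flowers: Setosa, Versicolor, and Virginica (Figure 3-3).

Figure 3-3 Iris versicolor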

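Univariate Analysis

Individual variables are analyzed in isolation to get a better understanding of them. Pandas provides a describe function that creates summary statistics in tabular form for all variables (Listing 3-4). These statistics are very useful for numerical variables, as they help reveal quality issues such as missing values and the presence of outliers.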

from sklearn import datasets
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

iris = datasets.load_iris()

# Let's convert to dataframe
iris = pd.DataFrame(data= np.c_[iris['data'], iris['target']],
                     columns= iris['feature_names'] + ['species'])

# replace the values with class labels
iris.species = np.where(iris.species == 0.0, 'setosa', np.where(iris.species==1.0,'versicolor', 'virginica'))

# let's remove spaces from column name
iris.columns = iris.columns.str.replace(' ', '')
iris.describe()

#----output----
       sepallength(cm) sepalwidth(cm) petallength(cm) petalwidth(cm)
count  150.00          150.00         150.00          150.00
mean   5.84            3.05           3.75            1.19
std    0.82            0.43           1.76            0.76
min    4.30            2.00           1.00            0.10
25%    5.10            2.80           1.60            0.30
50%    5.80            3.00           4.35            1.30
75%    6.40            3.30           5.10            1.80
max    7.90            4.40           6.90            2.50

# The column 'species' is categorical, so let's check the frequency distribution for each category.

print (iris['species'].value_counts())
#----output----
setosa       50
versicolor   50
virginica    50

Listing 3-4 Univariate Analysis

Pandas supports plotting functions for quick visualization of attributes. From the frequency output we can see that 'species' has three categories with 50 records each (Listing 3-5).

# Set the size of the plot
plt.figure(figsize=(15,8))

iris.hist()        # plot histogram
plt.suptitle("Histogram", fontsize=12) # use suptitle to add title to all subplots
plt.tight_layout(pad=1)
plt.show()

iris.boxplot()     # plot boxplot
plt.title("Box Plot", fontsize=16)
plt.tight_layout(pad=1)

plt.show()
#----output----

Listing 3-5 Pandas DataFrame Visualization

Multivariate Analysis

In multivariate analysis, you try to establish how all the variables relate to one another. Let's find the mean of each feature by species type (Listing 3-6).

# print the mean for each column by species
iris.groupby(by = "species").mean()

# plot for mean of each feature for each label class
iris.groupby(by = "species").mean().plot(kind="bar")

plt.title('Class vs Measurements')
plt.ylabel('mean measurement(cm)')
plt.xticks(rotation=0)  # manage the xticks rotation
plt.grid(True)

# Use bbox_to_anchor option to place the legend outside plot area to be tidy
plt.legend(loc="upper left", bbox_to_anchor=(1,1))
#----output----
        sepallength(cm) sepalwidth(cm) petallength(cm) petalwidth(cm)

setosa       5.006      3.418          1.464           0.244
versicolor   5.936      2.770          4.260           1.326
virginica    6.588      2.974          5.552           2.026

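Listing 3-6 A Multivariate Analysis

Pair Plot

You can understand the relationships between attributes by looking at the distribution of the interactions of each pair of attributes. This uses a built-in function to create a matrix of scatter plots of all attributes against all attributes (Listing 3-8).

from pandas.plotting import scatter_matrix
scatter_matrix(iris, figsize=(10, 10))

# use suptitle to add title to all subplots
plt.suptitle("Pair Plot", fontsize=20)
#----output----

Listing 3-8 Pair Plot

Findings from EDA

There are no missing values.
Sepals are longer than petals. Sepal length ranges between 4.3 and 7.9 with an average of 5.84, whereas petal length ranges between 1 and 6.9 with an average of 3.75.
Sepals are also wider than petals. Sepal width ranges between 2 and 4.4 with an average of 3.05, whereas petal width ranges between 0.1 and 2.5 with an average of 1.19.
The average petal length of setosa is much smaller than that of versicolor and virginica; however, the average sepal width of setosa is higher than that of versicolor and virginica.
Petal length and petal width are strongly correlated (a correlation coefficient of about 0.96), so width tends to increase as length increases.
Petal length is negatively correlated with sepal width (a correlation coefficient of about -0.42), so petal length tends to decrease as sepal width increases.
Tentative conclusion from the data: based on the length and width of sepals and petals alone, you can conclude that versicolor and virginica might resemble each other in size, whereas the characteristics of setosa seem to be noticeably different from the other two.

Looking further at the characteristics of the three iris flowers in Figure 3-4, we can confirm the hypothesis from our EDA.

Figure 3-4 Iris flowers

Statistics and mathematics form the foundation of ML algorithms. Let's begin by understanding some of the basic concepts and algorithms that come from the statistical world, and gradually move on to advanced ML algorithms.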

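Supervised Learning – Regression

Can you guess what is common to the set of business questions across different domains given in Table 3-6?

Table 3-6 Supervised Learning Use Case Examples

| Domain | Question |
| --- | --- |
| Retail | What will be the daily, monthly, and yearly sales of a given store for the next 3 years? |
| Retail | How many car parking spaces should be allocated for a retail store? |
| Manufacturing | How much will the product-wise manufacturing labor cost be? |
| Manufacturing / Retail | How much will my monthly electricity cost be for the next 3 years? |
| Banking | What is the credit score of a customer? |
| Insurance | How many customers will claim insurance this year? |
| Energy / Environmental | What will the temperature be for the next 5 days? |

You might have guessed it right! The presence of the words "how much" and "how many" implies that the answer to each of these questions will be a quantitative or continuous number. Regression is one of the fundamental techniques that helps us find answers to these types of questions, by studying the relationship between the different variables relevant to the question.

Let's consider a use case in which we have collected the average test grade scores and the respective average number of hours studied for a group of students with similar IQs (Listing 3-9).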

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv('Data/Grade_Set_1.csv')
print(df)

# Simple scatter plot
df.plot(kind='scatter', x="Hours_Studied", y="Test_Grade", title='Grade vs Hours Studied')
plt.show()
# check the correlation between variables
print("Correlation Matrix: ")
print(df.corr())
# ---- output ----
   Hours_Studied  Test_Grade
0              2          57
1              3          66
2              4          73
3              5          76
4              6          79
5              7          81
6              8          90
7              9          96
8             10         100

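Correlation Matrix:
               Hours_Studied  Test_Grade
Hours_Studied       1.000000    0.987797
Test_Grade          0.987797    1.000000

Listing 3-9 Students' Score vs. Hours Studied

A simple scatter plot, with hours studied on the x-axis and test grade on the y-axis, shows that the grade increases gradually as the hours studied increase. This suggests a linear relationship between the two variables. In addition, the correlation analysis gives a correlation coefficient of about 0.99 (0.9878) between the two variables, which indicates a very strong positive linear relationship: more hours studied are associated with higher grades.

Fitting a Slope

Let's try to fit a sloped line through the points such that the error or residual (i.e., the distance from the line to each point) is as small as possible (Figure 3-5).

Figure 3-5 Linear regression model components

An error can be positive or negative, depending on which side of the line the point lies, so if we simply sum all the errors, the positive and negative errors cancel out and the sum is zero. So we square the errors to get rid of the negative signs and then sum the squared errors. Hence, the fitted line is also called the least squares line.

The equation of the line is Y = mX + c, where Y is the predicted value for a given value of X.
m is the change in y divided by the change in x (i.e., m is the slope of the line for the x variable, and it indicates how steeply y increases with every unit increase in the value of x).
c is the intercept, which indicates the point at which the line crosses the y-axis; in the case of Figure 3-5 it is 49.67. The intercept is a constant that represents the variability in Y that is not explained by X. It is the value of Y when X is zero.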
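For a single predictor, the least squares values of m and c have a closed form: m = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and c = ȳ − m·x̄. The sketch below applies these formulas with NumPy to the nine data points printed in Listing 3-9; the variable names are illustrative.

```python
import numpy as np

# Data points as printed in Listing 3-9
hours = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
grades = np.array([57, 66, 73, 76, 79, 81, 90, 96, 100], dtype=float)

x_bar, y_bar = hours.mean(), grades.mean()

# Slope: covariance of x and y divided by the variance of x
m = np.sum((hours - x_bar) * (grades - y_bar)) / np.sum((hours - x_bar) ** 2)
# Intercept: the line passes through the point of means
c = y_bar - m * x_bar

print("slope m =", m)
print("intercept c =", c)
```

The intercept agrees with the 49.67 quoted for Figure 3-5.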

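Together, the slope and intercept define the linear relationship between the two variables, and they can be used to predict or estimate an average rate of change. Using this relationship, we can determine the score of a new student from his or her hours of study. Say a student plans to study 6 hours in preparation for the test. Simply drawing a line from x = 6 up to the fitted line and across to the y-axis shows that the student is likely to score about 80. We can use the equation of the line to predict the score for any given number of study hours. In this case, the test grade is the dependent variable, denoted by "Y", and hours studied is the independent variable or predictor, denoted by "X". Let's use the linear regression function from the Scikit-learn library to find the values of m (the coefficient of x) and c (the intercept). Refer to Listing 3-10 for example code.

# importing linear regression function
import sklearn.linear_model as lm

# Create linear regression object
lr = lm.LinearRegression()

x = df.Hours_Studied.values[:, np.newaxis] # independent variable, as a 2-D array
y = df.Test_Grade.values                   # dependent variable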

# Train the model using the training sets
lr.fit(x, y)
print("Intercept: ", lr.intercept_)
print("Coefficient: ", lr.coef_)

# manual prediction for a given value of x
print("Manual prediction :", 49.67777777777776 + 5.01666667*6)

# predict using the built-in function
print("Using predict function: ", lr.predict([[6]]))

# plotting fitted line
plt.scatter(x, y,  color='black')
plt.plot(x, lr.predict(x), color="blue", linewidth=3)
plt.title('Grade vs Hours Studied')
plt.ylabel('Test_Grade')
plt.xlabel('Hours_Studied')
# ---- output ----
Intercept:  49.67777777777776
Coefficient:  [5.01666667]
Manual prediction : 79.77777779777776
Using predict function:  [79.77777778]

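Listing 3-10 Linear Regression

Let's put the fitted values into the equation of the line (m × X + c = Y): 5.0167 × 6 + 49.6778 ≈ 79.78. This means that a student who studies for 6 hours is likely to score about 79.78 on the test.

Note that if X is zero, the value of Y is 49.67. This means that even if a student does not study at all, the score is likely to be 49.67. It suggests that there are other variables that influence the score and that we have not yet accounted for.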

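How Good Is Your Model?

There are three metrics widely used to evaluate the performance of a linear model:

R-squared
RMSE (root mean squared error)
MAE (mean absolute error)

R-Squared for Goodness of Fit

The R-squared metric is the most commonly used way of evaluating how well a model fits the data. The R-squared value is the proportion of the total variance in the dependent variable that is explained by the independent variable. It is a value between 0 and 1; a value closer to 1 indicates a better model fit. See Table 3-7 for an illustration of the R-squared calculation and Listing 3-11 for an example code implementation.

Table 3-7 Sample Table for R-Squared Calculation

where

$R^2 = \frac{SSR}{SST} = \frac{\sum_{i=1}^{n}\left(\hat{y}_i-\bar{y}\right)^2}{\sum_{i=1}^{n}\left(y_i-\bar{y}\right)^2}$

Here SSR is the sum of squares explained by the regression (predicted value minus mean, squared) and SST is the total sum of squares (actual value minus mean, squared), matching the SSR and SST columns computed in Listing 3-11.

$R^2 = 1510.01 / 1547.55 = 0.97$

In this case, R-squared can be interpreted as follows: 97% of the variability in the dependent variable (test score) can be explained by the independent variable (hours studied).

Root Mean Squared Error

This is the square root of the mean of the squared errors. RMSE indicates how close the predicted values are to the actual values; hence a lower RMSE value indicates better model performance. One of the key properties of RMSE is that its unit is the same as that of the target variable.

$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2}$

Mean Absolute Error

This (MAE) is the mean of the absolute values of the errors, that is, of the differences between predicted and actual values.

$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|y_i-\hat{y}_i\right|$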

# function to calculate r-squared, MAE, RMSE
from sklearn.metrics import r2_score , mean_absolute_error, mean_squared_error

# add predict value to the data frame
df['Test_Grade_Pred'] = lr.predict(x)

# Manually calculating R Squared
df['SST'] = np.square(df['Test_Grade'] - df['Test_Grade'].mean())
df['SSR'] = np.square(df['Test_Grade_Pred'] - df['Test_Grade'].mean())

print("Sum of SSR:", df['SSR'].sum())
print("Sum of SST:", df['SST'].sum())

print(df)
df.to_csv('r-squared.csv', index=False)

print("R Squared using manual calculation: ", df['SSR'].sum() / df['SST'].sum())

# Using built-in function
print("R Squared using built-in function: ", r2_score(df.Test_Grade,  df.Test_Grade_Pred))
print("Mean Absolute Error: ", mean_absolute_error(df.Test_Grade, df.Test_Grade_Pred))
print("Root Mean Squared Error: ", np.sqrt(mean_squared_error(df.Test_Grade, df.Test_Grade_Pred)))
# ---- output ----
Sum of SSR: 1510.01666667
Sum of SST: 1547.55555556
R Squared using manual calculation:  0.97574310741
R Squared using built-in function:  0.97574310741
Mean Absolute Error:  1.61851851852
Root Mean Squared Error:  2.04229959955

Listing 3-11 Linear Regression Model Accuracy Metrics
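The three metrics can also be computed without Scikit-learn, which makes the definitions concrete. This is a minimal NumPy sketch on the data printed in Listing 3-9; it uses np.polyfit for the line fit and the form R² = 1 − SSE/SST, which equals SSR/SST for a least squares line with an intercept.

```python
import numpy as np

# Data points as printed in Listing 3-9
hours = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
grades = np.array([57, 66, 73, 76, 79, 81, 90, 96, 100], dtype=float)

m, c = np.polyfit(hours, grades, 1)   # least squares line: slope, intercept
pred = m * hours + c

sst = np.sum((grades - grades.mean()) ** 2)   # total sum of squares
sse = np.sum((grades - pred) ** 2)            # sum of squared errors

r2 = 1 - sse / sst
mae = np.mean(np.abs(grades - pred))
rmse = np.sqrt(np.mean((grades - pred) ** 2))

print("R-squared:", r2)
print("MAE:", mae)
print("RMSE:", rmse)
```

The printed values agree with the output of Listing 3-11.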

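Outliers

Let's introduce an outlier: a student who studied for 5 hours and scored 100. Assume that this student has a higher IQ than the others in the group. Notice the drop in the R-squared value. It is therefore important to apply business logic to avoid including outliers in the training dataset, so that the model generalizes and its accuracy improves (Listing 3-12).

# Load data
df = pd.read_csv('Data/Grade_Set_1.csv')

df.loc[9] = np.array([5, 100])     # add the outlier as a new row

x= df.Hours_Studied.values[:, np.newaxis] # independent variable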
y= df.Test_Grade.values            # dependent variable

# Train the model using the training sets
lr.fit(x, y)
print("Intercept: ", lr.intercept_)
print("Coefficient: ", lr.coef_)

# manual prediction for a given value of x
print("Manual prediction :", 54.4022988505747 + 4.64367816*6)

# predict using the built-in function
print("Using predict function: ", lr.predict([[6]]))

# plotting fitted line
plt.scatter(x, y,  color='black')
plt.plot(x, lr.predict(x), color="blue", linewidth=3)
plt.title('Grade vs Hours Studied')
plt.ylabel('Test_Grade')
plt.xlabel('Hours_Studied')

# add predict value to the data frame

df['Test_Grade_Pred'] = lr.predict(x)

# Using built-in function
print("R Squared : ", r2_score(df.Test_Grade,  df.Test_Grade_Pred))
print("Mean Absolute Error: ", mean_absolute_error(df.Test_Grade, df.Test_Grade_Pred))
print("Root Mean Squared Error: ", np.sqrt(mean_squared_error(df.Test_Grade, df.Test_Grade_Pred)))
# ---- output ----
Intercept:  54.4022988505747
Coefficient:  [4.64367816]
Manual prediction : 82.2643678105747
Using predict function:  [82.26436782]
R Squared :  0.6855461390206965
Mean Absolute Error:  4.480459770114941
Root Mean Squared Error:  7.761235830020588

Listing 3-12 Outlier vs. R-Squared Value

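Polynomial Regression

This is a form of higher-order linear regression, in which the relationship between the dependent and independent variables is modeled as an nth-degree polynomial. Although the model is still linear in its coefficients, it can fit curves better. Essentially, we introduce higher-degree terms of the same independent variable into the equation (Table 3-8 and Listing 3-13).

Table 3-8 Polynomial Regression of Higher Degrees

| Degree | Regression equation |
| --- | --- |
| Quadratic (2) | $Y = m_1X + m_2X^2 + c$ |
| Cubic (3) | $Y = m_1X + m_2X^2 + m_3X^3 + c$ |
| Nth | $Y = m_1X + m_2X^2 + m_3X^3 + \dots + m_nX^n + c$ |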

x = np.linspace(-3,3,1000) # 1000 sample number between -3 to 3

# Plot subplots
fig, ((ax1, ax2, ax3), (ax4, ax5, ax6)) = plt.subplots(nrows=2, ncols=3)

ax1.plot(x, x)
ax1.set_title('linear')
ax2.plot(x, x**2)
ax2.set_title('degree 2')
ax3.plot(x, x**3)
ax3.set_title('degree 3')
ax4.plot(x, x**4)
ax4.set_title('degree 4')
ax5.plot(x, x**5)
ax5.set_title('degree 5')
ax6.plot(x, x**6)
ax6.set_title('degree 6')

plt.tight_layout()# tidy layout

# --- output ----

Listing 3-13 Polynomial Regression

Let's consider another set of average test grade scores and respective average hours studied, again for students with similar IQs (Listing 3-14).

# importing linear regression function
import sklearn.linear_model as lm

# Load data
df = pd.read_csv('Data/Grade_Set_2.csv')
print(df)

# Simple scatter plot
df.plot(kind='scatter', x="Hours_Studied", y="Test_Grade", title='Grade vs Hours Studied')

# check the correlation between variables
print("Correlation Matrix: ")
print(df.corr())

# Create linear regression object
lr = lm.LinearRegression()

x= df.Hours_Studied.values[:, np.newaxis]    # independent variable
y= df.Test_Grade                             # dependent variable

# Train the model using the training sets

lr.fit(x, y)

# plotting fitted line
plt.scatter(x, y,  color='black')
plt.plot(x, lr.predict(x), color="blue", linewidth=3)
plt.title('Grade vs Hours Studied')
plt.ylabel('Test_Grade')
plt.xlabel('Hours_Studied')

print("R Squared: ", r2_score(y, lr.predict(x)))
# ---- output ----
    Hours_Studied  Test_Grade
0             0.5          20
1             1.0          21
2             2.0          22
3             3.0          23
4             4.0          25
5             5.0          37
6             6.0          48
7             7.0          56
8             8.0          67
9             9.0          76
10           10.0          90
11           11.0          89
12           12.0          90

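Correlation Matrix:
                  Hours_Studied       Test_Grade
Hours_Studied     1.000000            0.974868
Test_Grade        0.974868            1.000000

R Squared:  0.9503677767

Listing 3-14 Polynomial Regression Example

The correlation analysis shows a correlation coefficient of about 0.97 between hours studied and test grade, and 95% (R-squared) of the variation in test grade can be explained by hours studied. Note that up to 4 hours of average study results in a score below 30, and that beyond 9 hours of study there is no further gain in score. This is not a perfectly linear relationship, although we can fit a straight line to it. Let's try higher-order polynomial degrees (Listing 3-15).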

lr = lm.LinearRegression()

x= df.Hours_Studied        # independent variable
y= df.Test_Grade           # dependent variable

# NumPy's vander function will return powers of the input vector
for deg in [1, 2, 3, 4, 5]:
    lr.fit(np.vander(x, deg + 1), y);
    y_lr = lr.predict(np.vander(x, deg + 1))
    plt.plot(x, y_lr, label='degree ' + str(deg));
    plt.legend(loc=2);
    print("R-squared for degree " + str(deg) + " = ",  r2_score(y, y_lr))
plt.plot(x, y, 'ok')

# ---- output ----
R-squared for degree 1 =  0.9503677767
R-squared for degree 2 =  0.960872656868
R-squared for degree 3 =  0.993832312037
R-squared for degree 4 =  0.99550001841
R-squared for degree 5 =  0.99562049139

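Listing 3-15 R-Squared for Different Polynomial Degrees

Note that degree 1 here is the linear fit, and that the higher-order polynomial regressions fit the curve better: R-squared jumps by about 4 percentage points at degree 3 compared with the linear fit. Beyond degree 3 there is no major change in R-squared, so we can say that degree 3 is the more suitable fit.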
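The same comparison can be reproduced with NumPy alone, using np.polyfit in place of the Vandermonde matrix plus LinearRegression of Listing 3-15 (both solve the same least squares problem). The data is copied from the output of Listing 3-14; the helper r_squared is defined here for illustration.

```python
import numpy as np

# Data points as printed in Listing 3-14
hours = np.array([0.5, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
grades = np.array([20, 21, 22, 23, 25, 37, 48, 56, 67, 76, 90, 89, 90],
                  dtype=float)

def r_squared(y, y_hat):
    """Proportion of the variance in y explained by the fitted values."""
    sse = np.sum((y - y_hat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - sse / sst

r2_by_degree = {}
for deg in range(1, 6):
    coefs = np.polyfit(hours, grades, deg)          # least squares polynomial fit
    r2_by_degree[deg] = r_squared(grades, np.polyval(coefs, hours))
    print("degree", deg, "R-squared:", r2_by_degree[deg])
```

Because each higher degree contains the lower ones as special cases, R-squared on the training data can only stay the same or increase; the point of the comparison is where the gains flatten out.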

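Scikit-learn provides a function that generates a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree (Listing 3-16).

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

x= df.Hours_Studied.values[:, np.newaxis] # independent variable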
y= df.Test_Grade                   # dependent variable

degree = 3
model = make_pipeline(PolynomialFeatures(degree), lr)

model.fit(x, y)

plt.scatter(x, y,  color='black')
plt.plot(x, model.predict(x), color="green")
plt.title('Grade vs Hours Studied')
plt.ylabel('Test_Grade')
plt.xlabel('Hours_Studied')

print("R Squared using built-in function: ", r2_score(y, model.predict(x)))
# ---- output ----
R Squared using built-in function:  0.993832312037

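Listing 3-16 Scikit-learn Polynomial Features

Multivariate Regression

So far we have seen simple regression with one independent variable for a given dependent variable. In most real-life use cases there will be more than one independent variable, and regression with multiple independent variables is called multivariate regression. The equation takes the following form.

$y = m_1x_1 + m_2x_2 + m_3x_3 + \dots + m_nx_n$

where each independent variable is represented by an x, and the m's are the corresponding coefficients. We'll use the statsmodels Python library to learn the fundamentals of multivariate regression, because it provides more useful statistical results, which are helpful from a learning perspective. Once you understand the fundamental concepts, you can use either Scikit-learn or statsmodels, as both are efficient.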
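Before turning to statsmodels, the equation can be illustrated with plain NumPy. The following sketch generates noise-free synthetic data from known coefficients (2 and 3, chosen arbitrarily) and recovers them by least squares; like the equation above, it has no intercept term.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic independent variables, 50 observations (illustrative data)
X = rng.normal(size=(50, 2))

# Dependent variable built from known coefficients m1 = 2, m2 = 3
y = 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Least squares estimate of the coefficients
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated coefficients:", coefs)
```

With real data the response contains noise, so the estimates only approximate the underlying coefficients, which is why the statistical summaries from statsmodels are useful.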

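We'll use the housing dataset (Table 3-9, from RDatasets), which contains the sale prices of houses in the city of Windsor. Below is a brief description of each variable.

Table 3-9 Housing Dataset (from RDatasets)

| Variable name | Description | Data type |
| --- | --- | --- |
| price | Sale price of a house | Numeric |
| lotsize | The lot size of a property in square feet | Numeric |
| bedrooms | Number of bedrooms | Numeric |
| bathrms | Number of full bathrooms | Numeric |
| stories | Number of stories excluding the basement | Categorical |
| driveway | Does the house have a driveway? | Boolean/Categorical |
| recroom | Does the house have a recreational room? | Boolean/Categorical |
| fullbase | Does the house have a full finished basement? | Boolean/Categorical |
| gashw | Does the house use gas for hot water heating? | Boolean/Categorical |
| airco | Does the house have central air conditioning? | Boolean/Categorical |
| garagepl | Number of garage places | Numeric |
| prefarea | Is the house located in a preferred neighborhood of the city? | Boolean/Categorical |

Let's build a model to predict the house price (dependent variable), treating the rest of the variables as independent variables.

Categorical variables need to be handled appropriately before running the first iteration of the model. Scikit-learn provides useful built-in preprocessing functions to handle categorical variables.

LabelBinarizer: This replaces the text of a binary variable with numeric values. We'll use this function for the binary categorical variables.
LabelEncoder: This replaces category levels with a numeric representation.
OneHotEncoder: This converts n levels into n-1 new variables, each of which uses 1 to indicate the presence of a level and 0 otherwise. Note that before calling OneHotEncoder, we should use LabelEncoder to convert the levels to numbers. Alternatively, we can achieve the same with get_dummies from the Pandas package. That is more convenient, because we can use it directly on a column of text descriptions without having to convert it to numbers first.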

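Multicollinearity and Variance Inflation Factor

The dependent variable should have a strong relationship with the independent variables. However, no independent variable should be strongly correlated with the other independent variables. Multicollinearity is the situation in which two or more independent variables are highly correlated with each other. In that case we should use only one of the correlated independent variables.

The variance inflation factor (VIF) is an indicator of the existence of multicollinearity, and statsmodels provides a function to calculate the VIF for each independent variable. A value greater than 10 is the rule of thumb for the possible existence of high multicollinearity. The standard guideline for VIF values is: VIF = 1 means no correlation exists, and VIF > 1 but < 5 means moderate correlation exists (Listing 3-17).

$VIF_i=\frac{1}{1-R_i^2}$

where $R_i^2$ is the coefficient of determination obtained by regressing variable $X_i$ on the other independent variables.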
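To make the formula concrete, here is a minimal sketch for the special case of only two predictors, where $R_i^2$ reduces to the squared Pearson correlation between them. The helper names and the sample values are hypothetical; with more predictors, $R_i^2$ has to come from a full regression of $X_i$ on the others.

```python
import math

def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R^2); with one other predictor, R^2 = r^2."""
    r_squared = pearson_r(x1, x2) ** 2
    return 1.0 / (1.0 - r_squared)

# Made-up values for two correlated predictors
bedrooms = [2, 3, 3, 4, 5]
bathrms = [1, 1, 2, 2, 3]
print("VIF:", vif_two_predictors(bedrooms, bathrms))
```

A VIF close to 1 means the predictor is nearly uncorrelated with the other one; the value grows without bound as the correlation approaches 1.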

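import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn import preprocessing

# Load data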
df = pd.read_csv('Data/Housing_Modified.csv')

# Convert binary fields to numeric boolean fields
lb = preprocessing.LabelBinarizer()

df.driveway = lb.fit_transform(df.driveway)
df.recroom = lb.fit_transform(df.recroom)
df.fullbase = lb.fit_transform(df.fullbase)
df.gashw = lb.fit_transform(df.gashw)
df.airco = lb.fit_transform(df.airco)
df.prefarea = lb.fit_transform(df.prefarea)

# Create dummy variables for stories
df_stories = pd.get_dummies(df['stories'], prefix="stories", drop_first=True)

# Join the dummy variables to the main dataframe
df = pd.concat([df, df_stories], axis=1)
del df['stories']

# let's plot the correlation matrix using plot_corr from the statsmodels graphics package

# create correlation matrix
corr = df.corr()
sm.graphics.plot_corr(corr, xnames=list(corr.columns))
plt.show()
# ---- output ----

Listing 3-17
Multicollinearity and VIF

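From the plot, we can see that stories_one has a strong negative correlation with stories_two. Let's perform the VIF analysis to eliminate strongly correlated independent variables (Listing 3-18).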

from statsmodels.stats.outliers_influence import variance_inflation_factor, OLSInfluence

# create a Python list of feature names
independent_variables = ['lotsize', 'bedrooms', 'bathrms','driveway', 'recroom',
                         'fullbase','gashw','airco','garagepl', 'prefarea',
                         'stories_one','stories_two','stories_three']

# use the list to select a subset from original DataFrame
X = df[independent_variables]
y = df['price']

thresh = 10

for i in np.arange(0,len(independent_variables)):
    vif = [variance_inflation_factor(X[independent_variables].values, ix) for ix in range(X[independent_variables].shape[1])]
    maxloc = vif.index(max(vif))
    if max(vif) > thresh:
        print("vif :", vif)
        print('dropping \'' + X[independent_variables].columns[maxloc] + '\' at index: ' + str(maxloc))

        del independent_variables[maxloc]
    else:
        break

print('Final variables:', independent_variables)
# ---- output ----
vif : [8.9580980878443359, 18.469878559519948, 8.9846723472908643, 7.0885785420918861, 1.4770152815033917, 2.013320236472385, 1.1034879198994192, 1.7567462065609021, 1.9826489313438442, 1.5332946465459893, 3.9657526747868612, 5.5117024083548918, 1.7700402770614867]
dropping 'bedrooms' at index: 1
Final variables: ['lotsize', 'bathrms', 'driveway', 'recroom', 'fullbase', 'gashw', 'airco', 'garagepl', 'prefarea', 'stories_one', 'stories_two', 'stories_three']

Listing 3-18 Remove Multicollinearity

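We can see that the VIF analysis has eliminated bedrooms, whose VIF is greater than 10; however, stories_one and stories_two have been retained.

Let's run the first iteration of the multivariate regression model with the set of independent variables that passed the VIF analysis.

To test the model performance, the common practice is to split the dataset 80/20 (or 70/30) into train/test sets and use the training dataset to build the model. The trained model is then applied to the test dataset to evaluate its performance (Listing 3-19).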

from sklearn.model_selection import train_test_split
from sklearn import metrics
# create a Python list of feature names
independent_variables = ['lotsize', 'bathrms','driveway', 'fullbase','gashw', 'airco','garagepl',
                         'prefarea','stories_one','stories_three']

# use the list to select a subset from original DataFrame
X = df[independent_variables]

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.80, random_state=1)

# create a fitted model
lm = sm.OLS(y_train, X_train).fit()

# print the summary

print(lm.summary())

# make predictions on the testing set
y_train_pred = lm.predict(X_train)
y_test_pred = lm.predict(X_test)
y_pred = lm.predict(X) # full data
print("Train MAE: ", metrics.mean_absolute_error(y_train, y_train_pred))
print("Train RMSE: ", np.sqrt(metrics.mean_squared_error(y_train, y_train_pred)))

print("Test MAE: ", metrics.mean_absolute_error(y_test, y_test_pred))

print("Test RMSE: ", np.sqrt(metrics.mean_squared_error(y_test, y_test_pred)))

# ---- output ----
Train MAE:  11987.660160035877
Train RMSE:  15593.474917800835
Test MAE:  12722.079675396284
Test RMSE:  17509.25004003038

Listing 3-19 Build the Multivariate Linear Regression Model

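Interpreting the Ordinary Least Squares (OLS) Regression Results

Adjusted R-squared: The simple R-squared value keeps increasing as independent variables are added. To fix this issue, adjusted R-squared is used in multivariate regression to understand the explanatory power of the independent variables.

$Adjusted\ R^2=1-\frac{\left(1-R^2\right)\left(N-1\right)}{N-p-1}$

where N is the total number of observations (the sample size) and p is the number of predictors.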
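The adjustment is easy to check by hand. The sketch below implements the formula above directly; the function name and the sample numbers are illustrative only.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared for N observations and p predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Same R-squared, same sample size, different number of predictors
print(adjusted_r2(0.80, n_obs=50, n_predictors=5))
print(adjusted_r2(0.80, n_obs=50, n_predictors=20))
```

With R-squared held fixed, the model that uses more predictors gets the lower adjusted value, which is the penalty the text describes.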

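Figure 3-6

R-squared vs. adjusted R-squared

Figure 3-6 shows how R-squared tracks adjusted R-squared as more variables are added.
R-squared always tends to increase as more variables are included.
Adjusted R-squared drops if an added variable does not explain the variability in the dependent variable.

Coefficient: This is the individual coefficient of the respective independent variable. It can be positive or negative, indicating that each unit increase in the independent variable has a positive or negative effect on the value of the dependent variable.

Standard error: This is the average distance of the respective observed values from the regression line. A smaller value indicates a better model fit.

Durbin-Watson: This is one of the common statistics used to detect autocorrelation in the residuals of a regression model (it does not measure multicollinearity, which is the correlation among independent variables and is assessed with the VIF). The Durbin-Watson statistic is always a number between 0 and 4. A value around 2 is ideal (a range of 1.5 to 2.5 is relatively normal); it means there is no autocorrelation in the residuals.

Confidence interval: This is the 95% confidence interval calculated for the coefficient (slope) of each independent variable.

t and p-value: The p-value is one of the important statistics. To understand it better, we have to look at the concepts of hypothesis testing and the normal distribution.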

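Hypothesis testing is an assertion regarding the distribution of the observations and the validation of that assertion. The steps of hypothesis testing are as follows:

A hypothesis is made.
The validity of the hypothesis is tested.
If the hypothesis is found to be true, it is accepted.
If it is found to be untrue, it is rejected.
The hypothesis that is being tested for possible rejection is called the null hypothesis.
The null hypothesis is denoted by H0.
The hypothesis that is accepted when the null hypothesis is rejected is called the alternative hypothesis, Ha.
The alternative hypothesis is often the interesting one and is usually what someone sets out to prove.
For example, the null hypothesis H0 is that lotsize has no real effect on house price, so any effect you see is due to chance; in this case, the coefficient m in the regression equation (y = m * lotsize + c) equals zero.
The alternative hypothesis Ha is that lotsize does have a real effect on house price, which means the coefficient m in the regression equation is not equal to zero.
In order to be able to say whether the regression estimate is close enough to the hypothesized value to be acceptable, we take the range of estimates implied by the estimated variance and see whether this range contains the hypothesized value. To do this, we can transform the estimate into a standard normal distribution; we know that about 95% of all values of a variable that has a mean of 0 and a variance of 1 lie within 2 standard deviations of the mean. Given a regression estimate and its standard error, we can be 95% confident that the true (unknown) value of m lies in this region (Figure 3-7).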

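Figure 3-7

Normal distribution (red is the rejection region)
The t-value is used to determine the p-value (probability). A p-value ≤ 0.05 indicates strong evidence against the null hypothesis, so you reject the null hypothesis. A p-value > 0.05 indicates weak evidence against the null hypothesis, so you fail to reject it. So in our case, a variable with a p-value ≤ 0.05 is significant for the model.
The process of testing a hypothesis implies the possibility of making an error. There are two types of errors for any given dataset, and they are inversely related: the smaller the risk of one, the higher the risk of the other.
Type I error: the error of rejecting the null hypothesis H0 even though H0 is true
Type II error: the error of accepting the null hypothesis H0 even though H0 is false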
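The decision rule can be sketched with the standard library alone. This is an approximation under a stated assumption: it uses the standard normal distribution mentioned above, whereas statsmodels reports p-values from the t distribution, so the numbers differ slightly for small samples. The coefficient and standard error are made-up values.

```python
import math

def two_sided_p_value(coef, std_err):
    """Two-sided p-value for H0: coefficient = 0 (normal approximation)."""
    z = coef / std_err                          # standardized estimate
    return math.erfc(abs(z) / math.sqrt(2))     # P(|Z| >= |z|)

# Hypothetical coefficient and standard error
p = two_sided_p_value(coef=3.9, std_err=1.5)
print("p-value:", p)
print("reject H0" if p <= 0.05 else "fail to reject H0")
```

An estimate that sits about 1.96 standard errors from zero lands almost exactly on the 0.05 threshold, which is where the "within 2 standard deviations" rule of thumb comes from.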
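Note that the variables "stories_three" and "recroom" have large p-values, indicating that they are insignificant. So let's rerun the regression without these variables and look at the results.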

Train MAE: 11993.3436816
Train RMSE: 16860.686868686617
Test MAE: 12902.4799591
Test RMSE: 18660.686868686617

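Note that dropping the variables has not negatively affected the adjusted R-squared.

Regression Diagnostics

There is a set of procedures and assumptions that need to be verified about our model results; otherwise the model could be misleading. Let's look at some of the important regression diagnostics.

Outliers

Data points that are far away from the fitted regression line are called outliers, and they can affect the accuracy of the model. Plotting the normalized residuals against leverage gives us a good understanding of the outlier points. Residuals are the differences between the actual and predicted values, whereas leverage is a measure of how far an observation's independent variable values are from those of the other observations (Listing 3-20).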

# lets plot the normalized residual vs leverage
from statsmodels.graphics.regressionplots import plot_leverage_resid2
fig, ax = plt.subplots(figsize=(8,6))
fig = plot_leverage_resid2(lm, ax = ax)
# ---- output ----

Listing 3-20 Plot the Normalized Residual vs. Leverage

From the chart, we see that there are many observations with high leverage and residuals. Running a Bonferroni outlier test gives us a p-value for each observation, and the observations with a p-value < 0.05 are the outliers affecting the accuracy. It is a good practice to consult or apply business domain knowledge when deciding whether to remove the outlier points and rerun the model; these points could be natural in the process, although they are mathematically found to be outliers (Listing 3-21).

# Find outliers #
# Bonferroni outlier test
test = lm.outlier_test()

print('Bad data points (bonf(p) < 0.05):')
print(test[test['bonf(p)'] < 0.05])
# ---- output ----
Bad data points (bonf(p) < 0.05):
     student_resid   unadj_p   bonf(p)
377       4.387449  0.000014  0.006315

Listing 3-21 Find Outliers

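Homoscedasticity and Normality

The error variance should be constant, which is known as homoscedasticity, and the errors should be normally distributed (Listing 3-22).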

# plot to check homoscedasticity
plt.plot(lm.resid,'o')
plt.title('Residual Plot')
plt.ylabel('Residual')
plt.xlabel('Observation Numbers')
plt.show()
plt.hist(lm.resid, density=True)  # 'normed' was removed from Matplotlib; 'density' is its replacement
# ---- output ----

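Listing 3-22 Homoscedasticity Test

The relationship between the predictors and the outcome variable should be linear. If the relationship is not linear, an appropriate transformation (such as log, square root, or higher-order polynomials) should be applied to the dependent/independent variables to fix the issue (Listing 3-23).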

# linearity plots
fig = plt.figure(figsize=(10,15))
fig = sm.graphics.plot_partregress_grid(lm, fig=fig)
# ---- output ----

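Listing 3-23 Linearity Check

Overfitting and Underfitting

Underfitting occurs when the model does not fit the data well and is unable to capture the underlying trend in it. In this case, we see low accuracy in both the training and test datasets.

Conversely, overfitting occurs when the model fits the data too well, capturing all the noise. In this case, we see high accuracy in the training dataset, whereas the same model yields low accuracy on the test dataset. This means the model has fitted the line so closely to the training dataset that it fails to generalize to unseen data. Figure 3-8 shows how the different fits look for the example use case discussed earlier. Choosing the right degree of polynomial is important to avoid overfitting or underfitting in regression. We'll also discuss different ways to handle these problems in detail in the next chapter.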

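Figure 3-8

Model fitting

Regularization

As the number of variables and the model complexity increase, the probability of overfitting also increases. Regularization is a technique to avoid the overfitting problem.

Statsmodels and Scikit-learn provide ridge and LASSO (least absolute shrinkage and selection operator) regression to handle overfitting. As model complexity increases, the size of the coefficients grows exponentially, so ridge and LASSO regression (Figure 3-9) apply a penalty on the magnitude of the coefficients to handle this issue. See Listing 3-24 for an example code implementation.

LASSO: This provides a sparse solution and is also known as L1 regularization. It drives parameter values to zero (i.e., the coefficients of the variables that add only minor value to the model will be zero) and adds a penalty equal to the absolute value of the magnitude of the coefficients.

Ridge regression: Also known as Tikhonov (L2) regularization, it drives parameters close to zero, but not to zero. Use it when you have many variables that individually add minor value to the model accuracy but together improve it and cannot be excluded from the model. Ridge regression applies a penalty that shrinks the coefficients of all the variables that add minor value to the model accuracy; the added penalty is equal to the square of the magnitude of the coefficients. Alpha is the regularization strength and must be a positive float.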
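As a sketch of the two penalties, the objectives can be written as follows, with $\alpha$ as the regularization strength and $w_j$ as the coefficients. The scaling constants in front of each term are an assumption here and differ between libraries.

```latex
J_{ridge}(w) = \sum_{i}\left(y^{i}-\hat{y}^{i}\right)^2 + \alpha \sum_{j} w_j^2

J_{lasso}(w) = \sum_{i}\left(y^{i}-\hat{y}^{i}\right)^2 + \alpha \sum_{j} \left|w_j\right|
```

The squared penalty shrinks every coefficient smoothly toward zero, while the absolute-value penalty can push individual coefficients exactly to zero, which is why LASSO yields sparse solutions.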

Figure 3-9

Regularization

from sklearn import linear_model

# Load data
df = pd.read_csv('Data/Grade_Set_2.csv')
df.columns = ['x','y']

for i in range(2,50):               # power of 1 is already there
    colname = 'x_%d'%i              # new var will be x_power
    df[colname] = df['x']**i

independent_variables = list(df.columns)
independent_variables.remove('y')

X= df[independent_variables]       # independent variable
y= df.y                            # dependent variable

# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.80, random_state=1)

# Ridge regression

lr = linear_model.Ridge(alpha=0.001)
lr.fit(X_train, y_train)
y_train_pred = lr.predict(X_train)
y_test_pred = lr.predict(X_test)

print("------ Ridge Regression ------")
print("Train MAE: ", metrics.mean_absolute_error(y_train, y_train_pred))
print("Train RMSE: ", np.sqrt(metrics.mean_squared_error(y_train, y_train_pred)))

print("Test MAE: ", metrics.mean_absolute_error(y_test, y_test_pred))

print("Test RMSE: ", np.sqrt(metrics.mean_squared_error(y_test, y_test_pred)))
print("Ridge Coef: ", lr.coef_)

# LASSO regression
lr = linear_model.Lasso(alpha=0.001)
lr.fit(X_train, y_train)
y_train_pred = lr.predict(X_train)
y_test_pred = lr.predict(X_test)

print("----- LASSO Regression -----")
print("Train MAE: ", metrics.mean_absolute_error(y_train, y_train_pred))
print("Train RMSE: ", np.sqrt(metrics.mean_squared_error(y_train, y_train_pred)))

print("Test MAE: ", metrics.mean_absolute_error(y_test, y_test_pred))
print("Test RMSE: ", np.sqrt(metrics.mean_squared_error(y_test, y_test_pred)))
print("LASSO Coef: ", lr.coef_)
#--- output ----
------ Ridge Regression ------
Train MAE:  12.775326528414379

Train RMSE:  16.72063936357992
Test MAE:  22.397943556789926
Test RMSE:  22.432642089791898
Ridge Coef:  [ 1.01446487e-88  1.27690319e-87  1.41113660e-86  1.49319913e-85
  1.54589299e-84  1.58049535e-83  1.60336716e-82  1.61825366e-81
  1.62742313e-80  1.63228352e-79  1.63372709e-78  1.63232721e-77
  1.62845333e-76  1.62233965e-75  1.61412730e-74  1.60389073e-73
  1.59165478e-72  1.57740595e-71  1.56110004e-70  1.54266755e-69
  1.52201757e-68  1.49904080e-67  1.47361205e-66  1.44559243e-65
  1.41483164e-64  1.38117029e-63  1.34444272e-62  1.30448024e-61
  1.26111524e-60  1.21418622e-59  1.16354417e-58  1.10906042e-57
  1.05063662e-56  9.88217010e-56  9.21803842e-55  8.51476330e-54
  7.77414158e-53  6.99926407e-52  6.19487106e-51  5.36778815e-50
  4.52745955e-49  3.68659929e-48  2.86198522e-47  2.07542549e-46
  1.35493365e-45  7.36155358e-45  2.64098894e-44 -4.76790286e-45
  2.09597530e-46]
----- LASSO Regression -----
Train MAE:  0.8423742988874519
Train RMSE:  1.219129185560593
Test MAE:  4.32364759404346
Test RMSE:  4.872324349696696
LASSO Coef:  [ 1.29948409e+00  3.92103580e-01  1.75369422e-02  7.79647589e-04

  3.02339084e-05  3.35699852e-07 -1.13749601e-07 -1.79773817e-08
 -1.93826156e-09 -1.78643532e-10 -1.50240566e-11 -1.18610891e-12
 -8.91794276e-14 -6.43309631e-15 -4.46487394e-16 -2.97784537e-17
 -1.89686955e-18 -1.13767046e-19 -6.22157254e-21 -2.84658206e-22
 -7.32019963e-24  5.16015995e-25  1.18616856e-25  1.48398312e-26
  1.55203577e-27  1.48667153e-28  1.35117812e-29  1.18576052e-30
  1.01487234e-31  8.52473862e-33  7.05722034e-34  5.77507464e-35
  4.68162529e-36  3.76585569e-37  3.00961249e-38  2.39206785e-39
  1.89235649e-40  1.49102460e-41  1.17072537e-42  9.16453614e-44
  7.15512017e-45  5.57333358e-46  4.33236496e-47  3.36163309e-48
  2.60423554e-49  2.01461728e-50  1.55652093e-51  1.20123190e-52
  9.26105400e-54]

Listing 3-24
Regularization

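Nonlinear Regression

Linear models are linear in their coefficients, although they need not produce a straight-line fit. In contrast, the fitted line of a nonlinear model can take any shape. This usually happens when models are derived on the basis of physical or biological considerations. Nonlinear models have a direct interpretation in terms of the process under study. The SciPy library provides a curve_fit function to fit models to scientific data based on theory, in order to determine the parameters of a physical system. Some example use cases are Michaelis-Menten enzyme kinetics, the Weibull distribution, and the power law distribution (Listing 3-25).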

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
%matplotlib inline

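# x was not defined in this listing; these nine values are assumed (chosen to span the plotted range of -2 to 3)
x = np.array([-2, -1.64, -0.7, 0, 0.45, 1.2, 1.64, 2.32, 2.9])
y = np.array([1.0, 1.5, 2.4, 2, 1.49, 1.2, 1.3, 1.2, 0.5])

# Function for nonlinear curve fitting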
def func(x, p1,p2):
    return p1*np.sin(p2*x) + p2*np.cos(p1*x)

popt, pcov = curve_fit(func, x, y,p0=(1.0,0.2))

p1 = popt[0]
p2 = popt[1]
residuals = y - func(x,p1,p2)
fres = sum(residuals**2)

curvex=np.linspace(-2,3,100)
curvey=func(curvex,p1,p2)

plt.plot(x,y,'bo ')
plt.plot(curvex,curvey,'r')
plt.title('Non-linear fitting')
plt.xlabel('x')
plt.ylabel('y')
plt.legend(['data','fit'],loc='best')
plt.show()

# ---- output ----

Listing 3-25
Nonlinear Regression

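Supervised Learning – Classification

Let's look at another set of questions (Table 3-10). Can you guess what these business questions from different domains have in common?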

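Table 3-10

Classification use case examples

| Domain | Question |
| --- | --- |
| Telecom | Is a customer likely to leave the network? (churn prediction) |
| Retail | Is he a prospective customer (i.e., likelihood of purchase vs. non-purchase)? |
| Insurance | To issue insurance, should a customer be sent for a medical checkup? |
| Insurance | Will the customer renew the insurance? |
| Banking | Will a customer default on a loan amount? |
| Banking | Should a customer be given a loan? |
| Manufacturing | Will the equipment fail? |
| Healthcare | Is the patient infected with a disease? |
| Healthcare | What type of disease does a patient have? |
| Entertainment | What is the genre of the music? |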

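The answers to these questions are discrete classes. The number of levels or classes can vary from a minimum of two (e.g., true or false, yes or no) to many. In ML, classification deals with identifying the probability that a new object is a member of a class or set. Classifiers are the algorithms that map the input data (also called features) to categories.

Logistic Regression

Let's consider a use case where we have to predict a student's test outcome, pass (1) or fail (0), based on the hours studied. In this case the outcome to be predicted is discrete. Let's build a linear regression and try to use a threshold: anything above a certain value is a pass, otherwise a fail (Listing 3-26).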

import sklearn.linear_model as lm

# Load data
df = pd.read_csv('Data/Grade_Set_1_Classification.csv')
print (df)
x= df.Hours_Studied.values[:, np.newaxis] # independent variable
y= df.Result                       # dependent variable

# Create linear regression object
lr = lm.LinearRegression()

# Train the model using the training sets
lr.fit(x, y)

# plotting fitted line
plt.scatter(x, y,  color='black')
plt.plot(x, lr.predict(x), color="blue", linewidth=3)
plt.title('Hours Studied vs Result')
plt.ylabel('Result')
plt.xlabel('Hours_Studied')

# add predict value to the data frame
df['Result_Pred'] = lr.predict(x)

# Using built-in function
print ("R Squared : ", r2_score(df.Result, df.Result_Pred))
print ("Mean Absolute Error: ", mean_absolute_error(df.Result, df.Result_Pred))
print ("Root Mean Squared Error: ", np.sqrt(mean_squared_error(df.Result, df.Result_Pred)))
# ---- output ----
R Squared :  0.675
Mean Absolute Error:  0.22962962963
Root Mean Squared Error:  0.268741924943

Listing 3-26
Logistic Regression

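The outcome that we are expecting is either 1 or 0; the issue with linear regression is that it can give values larger than 1 or less than 0. In the preceding plot, we can see that linear regression is not able to draw a boundary to classify the observations.

The solution to this is to introduce the sigmoid or logit function (which takes an S shape) into the regression equation. The fundamental idea here is that the hypothesis uses a linear approximation, which is then mapped with a logistic function for binary prediction. The linear regression equation in this case is y = mx + c.

Logistic regression can be explained better in terms of odds. The odds of an event occurring are defined as the probability of the event occurring divided by the probability of the event not occurring.

odds of pass vs. fail = probability(y = 1) / (1 - probability(y = 1))

The logit is the natural log (base e) of the odds, so using the logit model:

log(p / (1 - p)) = mx + c

Figure 3-10 shows the logistic regression equation, probability(y = 1) = 1 / (1 + e^-(mx + c)), and Listing 3-27 shows the code implementation to plot the sigmoid.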
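The relationship between probability, odds, and logit can be checked numerically. In this minimal sketch the slope m and intercept c are hypothetical values chosen for illustration, not fitted ones.

```python
import math

def odds(p):
    """Odds of an event with probability p."""
    return p / (1 - p)

def logit(p):
    """Natural log of the odds."""
    return math.log(odds(p))

def sigmoid(z):
    """Inverse of the logit: maps any real number into (0, 1)."""
    return 1 / (1 + math.exp(-z))

m, c = 1.5, -4.0   # hypothetical slope and intercept
for hours in (1, 3, 5):
    p = sigmoid(m * hours + c)
    # logit(p) recovers the linear term m*hours + c
    print(hours, round(p, 3), round(logit(p), 3))
```

Because the logit and the sigmoid are inverses, a model that is linear in the log-odds always produces probabilities between 0 and 1.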

Figure 3-10

Linear regression vs. logistic regression

# plot sigmoid function
x = np.linspace(-10, 10, 100)
y = 1.0 / (1.0 + np.exp(-x))

plt.plot(x, y, 'r-', label="logit")
plt.legend(loc='lower right')

# --- output ----

Listing 3-27
Plot Sigmoid Function

Listing 3-28 shows example code for implementing logistic regression with the Scikit-learn package.

from sklearn.linear_model import LogisticRegression

# manually add intercept
df['intercept'] = 1
independent_variables = ['Hours_Studied', 'intercept']

x = df[independent_variables]         # independent variable
y = df['Result']                      # dependent variable

# instantiate a logistic regression model, and fit with X and y
model = LogisticRegression()
model = model.fit(x, y)

# check the accuracy on the training set
model.score(x, y)

# predict_proba will return array containing probability of y = 0 and y = 1
print ('Predicted probability:', model.predict_proba(x)[:,1])

# predict will convert the probability(y=1) values > .5 to 1 else 0
print ('Predicted Class:',model.predict(x))

# plotting fitted line
plt.scatter(df.Hours_Studied, y,  color='black')
plt.yticks([0.0, 0.5, 1.0])
plt.plot(df.Hours_Studied, model.predict_proba(x)[:,1], color="blue", linewidth=3)
plt.title('Hours Studied vs Result')
plt.ylabel('Result')
plt.xlabel('Hours_Studied')

plt.show()
# ---- output ----
Predicted probability: [0.38623098 0.49994056 0.61365629 0.71619252 0.80036836 0.86430823 0.91006991 0.94144416 0.96232587]
Predicted Class: [0 0 1 1 1 1 1 1 1]

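Listing 3-28 Logistic Regression Using Scikit-learn

Evaluating a Classification Model Performance

A confusion matrix is a table used to describe the performance of a classification model. Figure 3-11 shows the confusion matrix.

Figure 3-11

Confusion matrix

True negatives (TN): actual false that was predicted as false
False positives (FP): actual false that was predicted as true (type I error)
False negatives (FN): actual true that was predicted as false (type II error)
True positives (TP): actual true that was predicted as true

Ideally, a good model should have high TN and TP and fewer type I and type II errors. Table 3-11 describes the key metrics derived from the confusion matrix to measure classification model performance. Listing 3-29 is example code for producing a confusion matrix.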

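Table 3-11

Classification performance metrics

| Metric | Description | Formula |
| --- | --- | --- |
| Accuracy | What percentage of predictions were correct? | (TP+TN)/(TP+TN+FP+FN) |
| Misclassification rate | What percentage of predictions were wrong? | (FP+FN)/(TP+TN+FP+FN) |
| True positive rate, sensitivity, or recall (completeness) | What percentage of positive cases did the model catch? | TP/(FN+TP) |
| False positive rate | What percentage of negatives were predicted as positive? | FP/(FP+TN) |
| Specificity | What percentage of negatives were predicted as negative? | TN/(TN+FP) |
| Precision (exactness) | What percentage of positive predictions were correct? | TP/(TP+FP) |
| F1 score | Weighted (harmonic) average of precision and recall | 2*(precision*recall)/(precision+recall) |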
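The formulas in Table 3-11 can be applied directly to the four cells of a confusion matrix. The sketch below uses the counts from the matrix reported in Listing 3-29 ([[2 1] [0 6]], read as TN=2, FP=1, FN=0, TP=6); the function name is illustrative.

```python
def classification_metrics(tn, fp, fn, tp):
    """Compute the Table 3-11 metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "misclassification_rate": (fp + fn) / total,
        "recall": recall,
        "false_positive_rate": fp / (fp + tn),
        "specificity": tn / (tn + fp),
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
    }

m = classification_metrics(tn=2, fp=1, fn=0, tp=6)
for name, value in m.items():
    print(f"{name}: {value:.2f}")
```

The precision, recall, and F1 values for the positive class agree with the row for class 1 in the classification report of Listing 3-29.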

from sklearn import metrics

# generate evaluation metrics

print ("Accuracy :", metrics.accuracy_score(y, model.predict(x)))
print ("AUC :", metrics.roc_auc_score(y, model.predict_proba(x)[:,1]))

print ("Confusion matrix :",metrics.confusion_matrix(y, model.predict(x)))
print ("classification report :", metrics.classification_report(y, model.predict(x)))
# ----output----
Accuracy : 0.88
AUC : 1.0
Confusion matrix : [[2 1] [0 6]]
classification report :
                precision    recall  f1-score   support

           0       1.00      0.67      0.80         3
           1       0.86      1.00      0.92         6

   micro avg       0.89      0.89      0.89         9
   macro avg       0.93      0.83      0.86         9
weighted avg       0.90      0.89      0.88         9

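Listing 3-29 Confusion Matrix

Receiver Operating Characteristic Curve

The ROC (receiver operating characteristic) curve is another important metric and one of the most commonly used ways to visualize the performance of a binary classifier; the AUC (area under the curve) is considered one of the best ways to summarize performance in a single number. The AUC is the probability that the classifier assigns a higher score to a randomly chosen positive example than to a randomly chosen negative example. If you have multiple models with nearly the same accuracy, you can pick the one that gives a higher AUC (Listing 3-30).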
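The ranking interpretation of the AUC can be computed directly by comparing every positive score with every negative score. This is a minimal sketch for small datasets (it is quadratic in the number of samples), with made-up labels and scores; ties are counted as half a win.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print("AUC:", pairwise_auc(y_true, scores))   # 3 of 4 pairs are ordered correctly
```

An AUC of 1.0 means every positive example scores above every negative one, and 0.5 corresponds to random ranking.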

# Determine the false positive and true positive rates
fpr, tpr, _ = metrics.roc_curve(y, model.predict_proba(x)[:,1])

# Calculate the AUC
roc_auc = metrics.auc(fpr, tpr)
print ('ROC AUC: %0.2f' % roc_auc)

# Plot of a ROC curve for a specific class
plt.figure()
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()
#---- output ----

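Listing 3-30 Area Under the Curve

In the preceding case, the AUC is 100% because the model scores every positive case higher than every negative case, so some threshold separates the two classes perfectly.

Fitting Line

The inverse of the regularization strength is one of the key aspects of fitting a logistic regression line. It defines the complexity of the fitted line. Let's try fitting lines for different values of this parameter (C, default 1) and see how the fitted line and the accuracy change (Listing 3-31).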

# instantiate a logistic regression model with default c value, and fit with X and y
model = LogisticRegression()
model = model.fit(x, y)

# check the accuracy on the training set
print ("C = 1 (default), Accuracy :", metrics.accuracy_score(y, model.predict(x)))

# instantiate a logistic regression model with c = 10, and fit with X and y
model1 = LogisticRegression(C=10)
model1 = model1.fit(x, y)

# check the accuracy on the training set
print ("C = 10, Accuracy :", metrics.accuracy_score(y, model1.predict(x)))

# instantiate a logistic regression model with c = 100, and fit with X and y
model2 = LogisticRegression(C=100)
model2 = model2.fit(x, y)

# check the accuracy on the training set
print ("C = 100, Accuracy :", metrics.accuracy_score(y, model2.predict(x)))

# instantiate a logistic regression model with c = 1000, and fit with X and y
model3 = LogisticRegression(C=1000)
model3 = model3.fit(x, y)

# check the accuracy on the training set
print ("C = 1000, Accuracy :", metrics.accuracy_score(y, model3.predict(x)))

# plotting fitted line
plt.scatter(df.Hours_Studied, y,  color='black', label="Result")
plt.yticks([0.0, 0.5, 1.0])
plt.plot(df.Hours_Studied, model.predict_proba(x)[:,1], color="gray", linewidth=2, label='C=1.0')
plt.plot(df.Hours_Studied, model1.predict_proba(x)[:,1], color="blue", linewidth=2,label='C=10')
plt.plot(df.Hours_Studied, model2.predict_proba(x)[:,1], color="green", linewidth=2,label='C=100')
plt.plot(df.Hours_Studied, model3.predict_proba(x)[:,1], color="red", linewidth=2,label='C=1000')
plt.legend(loc='lower right') # legend location
plt.title('Hours Studied vs Result')
plt.ylabel('Result')
plt.xlabel('Hours_Studied')

plt.show()
#----output----
C = 1 (default), Accuracy : 0.88
C = 10, Accuracy : 1.0
C = 100, Accuracy : 1.0
C = 1000, Accuracy : 1.0

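Listing 3-31 Controlling Complexity for Fitting a Line

Stochastic Gradient Descent

Fitting the right slope that minimizes the error (also known as the cost function) for a large dataset can be tricky. However, this can be achieved through the stochastic gradient descent (steepest descent) optimization algorithm. In the case of regression problems, the cost function J to learn the weights can be defined as the sum of squared errors (SSE) between the actual and predicted values.

$J(w)=\frac{1}{2}\sum_{i}\left(y^{i}-\hat{y}^{i}\right)^2$, where $y^{i}$ is the i-th actual value and $\hat{y}^{i}$ is the i-th predicted value.

The stochastic gradient descent algorithm to update the weights (w), for every weight j over the training samples i, can be given as (repeat until convergence) $w_j := w_j+\alpha \sum_{i=1}^{m}\left(y^{i}-\hat{y}^{i}\right)x_j^{i}$. Alpha (α) is the learning rate, and choosing a smaller value for it ensures that the algorithm does not miss the global cost minimum (Figure 3-12).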
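The update rule can be sketched for a one-feature linear model. Note one assumption: as written, the rule sums the error over all m samples before each update (a batch step); a strictly stochastic variant would update after each individual sample. The function name, data, learning rate, and epoch count are all illustrative.

```python
def fit_line_gd(xs, ys, alpha=0.01, epochs=5000):
    """Fit y = w0 + w1*x by repeating w_j := w_j + alpha * sum(error * x_j)."""
    w0, w1 = 0.0, 0.0
    for _ in range(epochs):
        # error = actual - predicted, for every training sample
        errors = [y - (w0 + w1 * x) for x, y in zip(xs, ys)]
        w0 += alpha * sum(errors)                                 # x_0 = 1 (intercept)
        w1 += alpha * sum(e * x for e, x in zip(errors, xs))
    return w0, w1

# Noise-free toy data generated from y = 2x + 1
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
w0, w1 = fit_line_gd(xs, ys)
print("intercept:", round(w0, 4), "slope:", round(w1, 4))
```

With a learning rate that is too large the updates overshoot and diverge, which is why a small alpha is the safer choice at the cost of more iterations.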

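Figure 3-12

Gradient descent

The default solver parameter for logistic regression in Scikit-learn was "liblinear" in the version used here (releases from 0.22 onward default to "lbfgs"), which works fine for smaller datasets. For large datasets with a large number of independent variables, the "sag" (stochastic average gradient descent) solver is recommended to fit the optimal slope faster.

Regularization

With an increase in the number of variables, the probability of overfitting also increases. LASSO (L1) and ridge (L2) can be applied to logistic regression as well to avoid overfitting. Let's look at an example to understand the overfitting/underfitting issue in logistic regression (Listing 3-32).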

import pandas as pd
data = pd.read_csv('Data/LR_NonLinear.csv')

pos = data['class'] == 1
neg = data['class'] == 0
x1 = data['x1']
x2 = data['x2']

# function to draw scatter plot between two variables
def draw_plot():
    plt.figure(figsize=(6, 6))
    plt.scatter(np.extract(pos, x1),
                np.extract(pos, x2),
                c='b', marker="s", label="pos")
    plt.scatter(np.extract(neg, x1),
                np.extract(neg, x2),
                c='r', marker="o", label="neg")
    plt.xlabel('x1');
    plt.ylabel('x2');
    plt.gca().set_aspect('equal', 'datalim')  # gca() reuses the current axes instead of creating new ones
    plt.legend();

# create higher order polynomial for independent variables
order_no = 6

# map the variable 1 & 2 to its higher order polynomial
def map_features(variable_1, variable_2, order=order_no):
    assert order >= 1
    def iter():
        for i in range(1, order + 1):
            for j in range(i + 1):
                yield np.power(variable_1, i - j) * np.power(variable_2, j)

    return np.vstack(iter())

out = map_features(data['x1'], data['x2'], order=order_no)
X = out.transpose()
y = data['class']

# split the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# function to draw classifier line
def draw_boundary(classifier):
    dim = np.linspace(-0.8, 1.1, 100)
    dx, dy = np.meshgrid(dim, dim)
    v = map_features(dx.flatten(), dy.flatten(), order=order_no)
    z = (np.dot(classifier.coef_, v) + classifier.intercept_).reshape(100, 100)
    plt.contour(dx, dy, z, levels=[0], colors=['r'])

# fit with c = 0.01
clf = LogisticRegression(C=0.01).fit(X_train, y_train)
print ('Train Accuracy for C=0.01: ', clf.score(X_train, y_train))
print ('Test Accuracy for C=0.01: ', clf.score(X_test, y_test))
draw_plot()

plt.title('Fitting with C=0.01')
draw_boundary(clf)
plt.legend();

# fit with c = 1
clf = LogisticRegression(C=1).fit(X_train, y_train)
print ('Train Accuracy for C=1: ', clf.score(X_train, y_train))
print ('Test Accuracy for C=1: ', clf.score(X_test, y_test))
draw_plot()
plt.title('Fitting with C=1')
draw_boundary(clf)

plt.legend();

# fit with c = 10000
clf = LogisticRegression(C=10000).fit(X_train, y_train)
print ('Train Accuracy for C=10000: ', clf.score(X_train, y_train))
print ('Test Accuracy for C=10000: ', clf.score(X_test, y_test))
draw_plot()
plt.title('Fitting with C=10000')
draw_boundary(clf)
plt.legend();
#----output----
Train Accuracy for C=0.01:  0.624242424242
Test Accuracy for C=0.01:  0.619718309859
Train Accuracy for C=1:  0.842424242424
Test Accuracy for C=1:  0.859154929577
Train Accuracy for C=10000:  0.860606060606
Test Accuracy for C=10000:  0.788732394366

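Listing 3-32 Underfitting, Right-Fitting, and Overfitting

Notice that overfitting occurs at higher values of C (that is, with weaker regularization). The same can be determined by comparing the accuracy on the training and test datasets (i.e., the accuracy drops significantly on the test dataset).

Multiclass Logistic Regression

Logistic regression can also be used to predict a dependent or target variable with multiple classes. Let's learn multiclass prediction with the Iris dataset, one of the best-known databases in the pattern recognition literature. The dataset contains three classes of 50 instances each, where each class refers to a type of iris plant. It comes as part of the Scikit-learn datasets, where the third column represents the petal length and the fourth column represents the petal width of the flower samples. The classes are already converted to integer labels, where 0 = Iris-Setosa, 1 = Iris-Versicolor, 2 = Iris-Virginica.

Load Data

We can load the data from the sklearn datasets as shown in Listing 3-33.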

from sklearn import datasets
import numpy as np
import pandas as pd
iris = datasets.load_iris()
X = iris.data
y = iris.target
print('Class labels:', np.unique(y))
#----output----
Class labels: [0 1 2]

Listing 3-33
Load Data

Normalize Data

The units of measurement might differ, so let's normalize the data before building the model (Listing 3-34).

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X)
X = sc.transform(X)

Listing 3-34
Normalize Data

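Split Data

Split the data into train and test sets. Whenever we use a random function, it is advised to use a seed to ensure the reproducibility of the results (Listing 3-35).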

# split data into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

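Listing 3-35 Split Data into Train and Test

Training the Logistic Regression Model and Evaluating

Listing 3-36 is an example code implementation of logistic regression model training and evaluation.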

from sklearn.linear_model import LogisticRegression

# l1 regularization gives better results
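# the l1 penalty needs a solver that supports it; the current default (lbfgs) does not
lr = LogisticRegression(penalty='l1', C=10, random_state=0, solver='liblinear')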
lr.fit(X_train, y_train)

from sklearn import metrics

# generate evaluation metrics
print("Train - Accuracy :", metrics.accuracy_score(y_train, lr.predict(X_train)))
print("Train - Confusion matrix :",metrics.confusion_matrix(y_train, lr.predict(X_train)))
print("Train - classification report :", metrics.classification_report(y_train, lr.predict(X_train)))

print("Test - Accuracy :", metrics.accuracy_score(y_test, lr.predict(X_test)))
print("Test - Confusion matrix :",metrics.confusion_matrix(y_test, lr.predict(X_test)))
print("Test - classification report :", metrics.classification_report(y_test, lr.predict(X_test)))

#----output----
Train - Accuracy : 0.9809523809523809
Train - Confusion matrix : [[34  0  0]
                            [ 0 30  2]
                            [ 0  0 39]]
Train - classification report :
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        34
           1       1.00      0.94      0.97        32
           2       0.95      1.00      0.97        39

   micro avg       0.98      0.98      0.98       105
   macro avg       0.98      0.98      0.98       105
weighted avg       0.98      0.98      0.98       105

Test - Accuracy : 0.9777777777777777
Test - Confusion matrix : [[16  0  0]
                           [ 0 17  1]
                           [ 0  0 11]]
Test - classification report :
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        16
           1       1.00      0.94      0.97        18
           2       0.92      1.00      0.96        11

   micro avg       0.98      0.98      0.98        45
   macro avg       0.97      0.98      0.98        45
weighted avg       0.98      0.98      0.98        45

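Listing 3-36 Logistic Regression Model Training and Evaluation

Generalized Linear Models

The generalized linear model (GLM) was formulated by John Nelder and Robert Wedderburn as a way of unifying various commonly used statistical models, such as linear, logistic, and Poisson (Table 3-12). See Listing 3-37 for an example code implementation.

Table 3-12

Different GLM distribution families

| Family | Description |
| --- | --- |
| Binomial | Target variable is a binary response |
| Poisson | Target variable is a count of occurrences |
| Gaussian | Target variable is a continuous number |
| Gamma | This distribution arises when the waiting times between Poisson-distributed events are relevant (i.e., a number of events occurred between two time periods). |
| InverseGaussian | The tails of the distribution decrease more slowly than those of a normal distribution (i.e., there is an inverse relationship between the time required to cover a unit distance and the distance covered in unit time). |
| NegativeBinomial | Target variable denotes the number of successes in a sequence before a random failure |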

df = pd.read_csv('Data/Grade_Set_1.csv')

print('####### Linear Regression Model ########')
# Create linear regression object
lr = lm.LinearRegression()

x= df.Hours_Studied.values[:, np.newaxis] # independent variable
y= df.Test_Grade.values            # dependent variable

# Train the model using the training sets
lr.fit(x, y)

print ("Intercept: ", lr.intercept_)
print ("Coefficient: ", lr.coef_)

print('\n####### Generalized Linear Model ########')
import statsmodels.api as sm

# To be able to run GLM, we'll have to add the intercept constant to x variable
x = sm.add_constant(x, prepend=False)

# Instantiate a gaussian family model with the default link function.
model = sm.GLM(y, x, family = sm.families.Gaussian())
model = model.fit()
print (model.summary())

#----output----

####### Linear Regression Model ########
Intercept:  49.6777777778
Coefficient:  [ 5.01666667]

####### Generalized Linear Model ########
                 Generalized Linear Model Regression Results
===========================================================================
Dep. Variable:                   y   No. Observations:                    9
Model:                         GLM   Df Residuals:                        7
Model Family:             Gaussian   Df Model:                            1
Link Function:            identity   Scale:                          5.3627
Method:                       IRLS   Log-Likelihood:                -19.197
Date:             Sat, 09 Feb 2019   Deviance:                       37.539
Time:                     10:01:22   Pearson chi2:                     37.5
No. Iterations:                  3   Covariance Type:             nonrobust
===========================================================================
              coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------
x1          5.0167      0.299     16.780      0.000       4.431      5.603
const      49.6778      1.953     25.439      0.000      45.850     53.505

===========================================================================

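Listing 3-37 Generalized Linear Model

Note that the coefficients are the same for linear regression and GLM. However, GLM can be used for other distributions, such as binomial and Poisson, by changing the family parameter.

Supervised Learning – Process Flow

At this point, you have seen how to build regression and logistic regression models, so let me summarize the process flow for supervised learning in Figure 3-13.

Figure 3-13

Supervised learning process flow

First, you need to train and validate a supervised model by applying ML techniques to historical data. Then you apply this model to a new dataset to predict future values.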

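Decision Trees

In 1986, J.R. Quinlan published "Induction of Decision Trees," summarizing an approach to synthesizing decision trees using ML. It used an illustrative example dataset, where the objective is to decide whether to play outside on a Saturday morning. As the name suggests, a decision tree is a tree-like structure in which an internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label; the decision is made after computing all attributes. A path from root to leaf represents a classification rule. Thus, a decision tree consists of three types of nodes:

Root node
Branch node
Leaf node (class label)

Decision tree model output is easy to interpret, and it provides the rules that drive a decision or event. In the preceding use case, we can get the rules that lead to a don't-play scenario: sunny and temperature > 30°C, and rainy and windy is true. Often, businesses may be more interested in these decision rules than in the decision itself. For example, insurance companies may be more interested in the rules or conditions under which an insurance applicant should be sent for a medical checkup, rather than feeding the applicant's data to a black-box model to find the decision (Figure 3-14).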

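Figure 3-14

J.R. Quinlan's example for synthesizing a decision tree

A tree generator model is built using the training data; it determines which variable to split at a node and the value of the split. The decision to stop or to split again assigns leaf nodes to a class. An advantage of decision trees is that there is no need for the explicit creation of dummy variables.

How the Tree Splits and Grows

The base algorithm is known as a greedy algorithm, in which the tree is constructed in a top-down, recursive, divide-and-conquer manner.
At the start, all the training examples are at the root.
The input data is partitioned recursively based on selected attributes.
The test attribute at each node is selected on the basis of a heuristic or statistical impurity measure, for example Gini or information gain (entropy).
- Gini = $1-\sum_{i}p_i^2$, where $p_i$ is the probability of each label.
- Entropy = -p log2(p) - q log2(q), where p and q represent the probability of success and failure, respectively, in a given node.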
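Both impurity measures are short enough to compute by hand. The sketch below takes the class probabilities in a node as input; the function names are illustrative, and the entropy version generalizes the two-class formula to any number of classes.

```python
import math

def gini(probs):
    """Gini impurity: 1 - sum(p_i^2)."""
    return 1 - sum(p * p for p in probs)

def entropy(probs):
    """Entropy in bits: -sum(p_i * log2(p_i)), skipping empty classes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Compare a pure node, a skewed node, and an evenly mixed node
for node in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(node, "gini:", round(gini(node), 3), "entropy:", round(entropy(node), 3))
```

Both measures are zero for a pure node and largest for an even mix, so a split is good when it produces child nodes with lower impurity than the parent.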

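Conditions for Stopping Partitioning

All samples for a given node belong to the same class.
There are no remaining attributes for further partitioning; majority voting is employed to classify the leaf.
There are no samples left.

Note

The default criterion is "gini" because it is comparatively faster to compute than "entropy"; however, both measures give almost identical decisions on splits.

Listing 3-38 provides an example of implementing a decision tree model on the Iris dataset.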

from sklearn import datasets
import numpy as np
import pandas as pd
from sklearn import tree

iris = datasets.load_iris()

# X = iris.data[:, [2, 3]]
X = iris.data
y = iris.target
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
sc.fit(X)
X = sc.transform(X)

# split data into train and test
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = tree.DecisionTreeClassifier(criterion = 'entropy', random_state=0)
clf.fit(X_train, y_train)

# generate evaluation metrics
print("Train - Accuracy :", metrics.accuracy_score(y_train, clf.predict(X_train)))

print("Train - Confusion matrix :",metrics.confusion_matrix(y_train, clf.predict(X_train)))
print("Train - classification report :", metrics.classification_report(y_train, clf.predict(X_train)))

print("Test - Accuracy :", metrics.accuracy_score(y_test, clf.predict(X_test)))
print("Test - Confusion matrix :",metrics.confusion_matrix(y_test, clf.predict(X_test)))
print("Test - classification report :", metrics.classification_report(y_test, clf.predict(X_test)))

tree.export_graphviz(clf, out_file='tree.dot')

from io import StringIO  # sklearn.externals.six was removed from Scikit-learn
import pydot
out_data = StringIO()

tree.export_graphviz(clf, out_file=out_data,
                    feature_names=iris.feature_names,
                    class_names=clf.classes_.astype(int).astype(str),
                    filled=True, rounded=True,
                    special_characters=True,
                    node_ids=1,)
graph = pydot.graph_from_dot_data(out_data.getvalue())
graph[0].write_pdf("iris.pdf")  # save to pdf
#----output----
Train - Accuracy : 1.0
Train - Confusion matrix : [[34  0  0]
                            [ 0 32  0]
                            [ 0  0 39]]
Train - classification report :
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        34
           1       1.00      1.00      1.00        32
           2       1.00      1.00      1.00        39

   micro avg       1.00      1.00      1.00       105
   macro avg       1.00      1.00      1.00       105
weighted avg       1.00      1.00      1.00       105

Test - Accuracy : 0.9777777777777777
Test - Confusion matrix : [[16  0  0]
                           [ 0 17  1]
                           [ 0  0 11]]
Test - classification report :
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        16
           1       1.00      0.94      0.97        18
           2       0.92      1.00      0.96        11

   micro avg       0.98      0.98      0.98        45
   macro avg       0.97      0.98      0.98        45
weighted avg       0.98      0.98      0.98        45

Listing 3-38. Decision Tree Model

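Key Parameters for Stopping Tree Growth

One of the key issues with decision trees is that the tree can grow very large, ending up with one leaf per observation.

max_features: the maximum number of features to consider at each split; default = None, which means all features will be considered.

min_samples_split: a node with fewer samples than this number is not allowed to split.

min_samples_leaf: a split is not allowed if it would leave a leaf node with fewer samples than this minimum.

max_depth: no further splitting is allowed beyond this depth; default = None.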
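As a rough illustration of how these parameters limit growth, the sketch below fits one unconstrained tree and one constrained tree on the Iris data. The specific values (`max_depth=2`, `min_samples_leaf=5`) are arbitrary choices for the example, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# tree that is allowed to grow until every leaf is pure
full = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X_train, y_train)

# tree whose growth is limited (illustrative values only)
pruned = DecisionTreeClassifier(criterion="entropy", max_depth=2,
                                min_samples_leaf=5, random_state=0).fit(X_train, y_train)

for name, clf in [("unconstrained", full), ("constrained", pruned)]:
    print(name, "- depth:", clf.get_depth(),
          "leaves:", clf.get_n_leaves(),
          "test accuracy:", clf.score(X_test, y_test))
```

Comparing depth, leaf count, and test accuracy side by side shows how much of the unconstrained tree is needed to generalize and how much only memorizes the training data.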

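Support Vector Machine

Vladimir N. Vapnik and Alexey Ya. Chervonenkis proposed SVM in 1963. A key objective of SVM is to draw a hyperplane that separates the two classes optimally, such that the margin between the hyperplane and the observations is maximal. Figure 3-15 illustrates that many different hyperplanes are possible; the objective of SVM is to find the one that gives the largest margin.

Figure 3-15

Support vector machine

To maximize the margin, we need to minimize $\frac{1}{2}\|w\|^2$ subject to $y_i(w^T x_i + b) - 1 \ge 0$ for all i.

The final SVM equation can be written mathematically as

$L = \sum \limits_{i} \alpha_i - \frac{1}{2}\sum \limits_{ij} \alpha_i \alpha_j y_i y_j \left(\overline{X}_i \cdot \overline{X}_j\right)$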

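Note

Compared with logistic regression, SVM is less sensitive to outliers, as it only cares about the points closest to the decision boundary, that is, the support vectors.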
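To make the margin concrete, the following sketch fits a linear SVM on a small synthetic two-class dataset and reads the margin width 2/||w|| off the fitted coefficients. The dataset (`make_blobs` with these settings) and the large `C` value used to approximate a hard margin are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# two well-separated synthetic clusters (illustrative data)
X, y = make_blobs(n_samples=50, centers=2, cluster_std=0.6, random_state=0)

# a large C approximates the hard-margin problem described above
clf = SVC(kernel="linear", C=1000).fit(X, y)

w = clf.coef_[0]
b = clf.intercept_[0]
margin_width = 2.0 / np.linalg.norm(w)

# constraint values y_i (w . x_i + b), with the labels recoded to -1/+1
y_signed = np.where(y == 1, 1, -1)
constraint_values = y_signed * (X @ w + b)

print("margin width 2/||w||:", margin_width)
print("number of support vectors:", len(clf.support_vectors_))
print("smallest y_i (w . x_i + b):", constraint_values.min())
```

Only the few points nearest the boundary become support vectors; moving any other point (without crossing the margin) would leave the fitted hyperplane unchanged.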

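Key Parameters

C: this is the penalty parameter, which helps in fitting the boundary smoothly and appropriately; default = 1.

kernel: a kernel is a similarity function used for pattern analysis. It must be one of rbf/linear/poly/sigmoid/precomputed; default = "rbf" (radial basis function). Choosing an appropriate kernel results in a better model fit (Listing 3-39).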

from sklearn import datasets
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn import metrics

iris = datasets.load_iris()

X = iris.data[:, [2, 3]]
y = iris.target

print('Class labels:', np.unique(y))
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
sc.fit(X)
X = sc.transform(X)

# split data into train and test
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
from sklearn.svm import SVC

clf = SVC(kernel='linear', C=1.0, random_state=0)
clf.fit(X_train, y_train)

# generate evaluation metrics
print("Train - Accuracy :", metrics.accuracy_score(y_train, clf.predict(X_train)))
print("Train - Confusion matrix :",metrics.confusion_matrix(y_train, clf.predict(X_train)))
print("Train - classification report :", metrics.classification_report(y_train, clf.predict(X_train)))

print("Test - Accuracy :", metrics.accuracy_score(y_test, clf.predict(X_test)))
print("Test - Confusion matrix :", metrics.confusion_matrix(y_test, clf.predict(X_test)))
print("Test - classification report :", metrics.classification_report(y_test, clf.predict(X_test)))
#----output----
Train - Accuracy : 0.9523809523809523
Train - Confusion matrix : [[34  0  0]
                            [ 0 30  2]
                            [ 0  3 36]]
Train - classification report :
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        34
           1       0.91      0.94      0.92        32
           2       0.95      0.92      0.94        39

   micro avg       0.95      0.95      0.95       105
   macro avg       0.95      0.95      0.95       105
weighted avg       0.95      0.95      0.95       105

Test - Accuracy : 0.9777777777777777
Test - Confusion matrix : [[16  0  0]
                           [ 0 17  1]
                           [ 0  0 11]]
Test - classification report :
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        16
           1       1.00      0.94      0.97        18
           2       0.92      1.00      0.96        11

   micro avg       0.98      0.98      0.98        45
   macro avg       0.97      0.98      0.98        45
weighted avg       0.98      0.98      0.98        45

Listing 3-39. Support Vector Machine (SVM) Model

Plotting the decision boundary: for simplicity, let's consider a two-class example (Listing 3-40).

# Let's use sklearn's make_classification function to create some test data.
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100, n_features=2, n_informative=2, n_redundant=0,
                           weights=[.5, .5], random_state=0)

# build a simple linear SVM model
clf = SVC(kernel='linear', random_state=0)
clf.fit(X, y)

# get the separating hyperplane
w = clf.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(-5, 5)
yy = a * xx - (clf.intercept_[0]) / w[1]

# plot the parallels to the separating hyperplane that pass through the
# support vectors
b = clf.support_vectors_[0]
yy_down = a * xx + (b[1] - a * b[0])

b = clf.support_vectors_[-1]
yy_up = a * xx + (b[1] - a * b[0])

# Plot the decision boundary
plot_decision_regions(X, y, classifier=clf)

# plot the line, the points, and the nearest vectors to the plane

plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=80, facecolors="none")
plt.plot(xx, yy_down, 'k--')
plt.plot(xx, yy_up, 'k--')

plt.xlabel('X1')
plt.ylabel('X2')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()
#----output----

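Listing 3-40. Plotting SVM Decision Boundaries

k-Nearest Neighbors

The k-nearest neighbor classification (kNN) was developed from the need to perform discriminant analysis when reliable parametric estimates of probability densities are unknown or difficult to determine. Fix and Hodges introduced a nonparametric method for pattern classification in 1951, which has since become known as the k-nearest neighbor rule.

As the name suggests, the algorithm works on the basis of a majority vote among the classes of its k nearest neighbors. In Figure 3-16, the k = 5 nearest neighbors of the unknown data point are identified based on the chosen distance measure, and the unknown point is assigned to the majority class among those neighbors. The key drawback of kNN is the complexity of searching for the nearest neighbors of each sample (Listing 3-41).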

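Figure 3-16

k-nearest neighbors with k = 5

Things to remember:

Choose an odd k value for a two-class problem.
k must not be a multiple of the number of classes.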

from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=5, p=2, metric="minkowski")
clf.fit(X_train, y_train)

# generate evaluation metrics
print("Train - Accuracy :", metrics.accuracy_score(y_train, clf.predict(X_train)))
print("Train - Confusion matrix :",metrics.confusion_matrix(y_train, clf.predict(X_train)))
print("Train - classification report :", metrics.classification_report(y_train, clf.predict(X_train)))

print("Test - Accuracy :", metrics.accuracy_score(y_test, clf.predict(X_test)))

print("Test - Confusion matrix :", metrics.confusion_matrix(y_test, clf.predict(X_test)))
print("Test - classification report :", metrics.classification_report(y_test, clf.predict(X_test)))
#----output----
Train - Accuracy : 0.9714285714285714

Train - Confusion matrix : [[34  0  0]
                            [ 0 31  1]
                            [ 0  2 37]]
Train - classification report :
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        34
           1       0.94      0.97      0.95        32
           2       0.97      0.95      0.96        39

   micro avg       0.97      0.97      0.97       105
   macro avg       0.97      0.97      0.97       105
weighted avg       0.97      0.97      0.97       105

Test - Accuracy : 0.9777777777777777
Test - Confusion matrix : [[16  0  0]
                           [ 0 17  1]
                           [ 0  0 11]]
Test - classification report :
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        16
           1       1.00      0.94      0.97        18
           2       0.92      1.00      0.96        11

   micro avg       0.98      0.98      0.98        45
   macro avg       0.97      0.98      0.98        45
weighted avg       0.98      0.98      0.98        45

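Listing 3-41. k Nearest Neighbor Model

Note

The concepts behind decision tree, SVM, and kNN algorithms can also be applied to predict dependent variables that are continuous numbers; scikit-learn provides DecisionTreeRegressor, SVR (support vector regressor), and KNeighborsRegressor for this purpose.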
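A minimal sketch of the three regressors mentioned in the note, fitted to a noisy sine curve. The synthetic data and the hyperparameter values are assumptions chosen only to keep the example short.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# synthetic continuous target: a sine curve with a little noise
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(80)

models = {
    "DecisionTreeRegressor": DecisionTreeRegressor(max_depth=3, random_state=0),
    "SVR": SVR(kernel="rbf", C=1.0),
    "KNeighborsRegressor": KNeighborsRegressor(n_neighbors=5),
}

scores = {}
for name, model in models.items():
    model.fit(X, y)
    scores[name] = model.score(X, y)  # R^2 on the training data
    print(name, "R^2:", round(scores[name], 3))
```

The regressors share the `fit`/`predict`/`score` interface of their classifier counterparts; `score` returns R-squared instead of accuracy.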

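Time-Series Forecasting

In simple terms, a series of data points collected sequentially at regular intervals over a period of time is called time-series data. A time series whose mean and variance are constant over time is called a stationary time series.

Time series tend to have a linear relationship between lagged variables, which is called autocorrelation. Hence the historic data of a time series can be modeled to forecast future data points without involving any other independent variables; these types of models are generally known as time-series forecasting. Some key application areas of time series are sales forecasting, economic forecasting, and stock market forecasting.

Components of Time Series

A time series can be made up of three key components (Figure 3-17).

Figure 3-17

Time-series components

Trend: a long-term increase or decrease is termed a trend.
Seasonality: the effect of seasonal factors over a fixed or known period. For example, retail store sales will be high during weekends and holiday seasons.
Cycle: these are longer ups and downs that are not of a fixed or known period, caused by external factors.

Listing 3-42 shows the code implementation for decomposing the time-series components.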

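Autoregressive Integrated Moving Average (ARIMA)

ARIMA is a key and popular time-series model, so understanding the concepts involved will set the foundation for your time-series modeling.

Autoregressive model (AR): as the name indicates, this is a regression of the variable against itself (i.e., a linear combination of past values of the variable is used to forecast the future value).

$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_n y_{t-n} + e_t$, where c is a constant, $e_t$ is the random error, and $y_{t-1}$, $y_{t-2}$, … are the lagged values of the series.

Moving average (MA): instead of past values, past forecast errors are used to build the model.

$y_t = c + \theta_1 e_{t-1} + \theta_2 e_{t-2} + \dots + \theta_n e_{t-n} + e_t$

An autoregressive (AR), moving average (MA) model with integration (the opposite of differencing) is known as the ARIMA model.

$y'_t = c + \phi_1 y'_{t-1} + \dots + \phi_p y'_{t-p} + \theta_1 e_{t-1} + \dots + \theta_q e_{t-q} + e_t$, where $y'_t$ is the series after differencing d times.

The predictors on the right-hand side of the equation are lagged values and lagged errors. The model is written ARIMA(p, d, q); these are the key parameters of ARIMA, and picking the right values for p, d, and q yields better model results.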

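p = order of the autoregressive part. This is the number of unknown terms that multiply your signal at past times (as many past times as your value of p).

d = degree of differencing involved. This is the number of times you have to difference your time series to obtain a stationary series.

q = order of the moving average part. This is the number of unknown terms that multiply your forecast errors at past times (as many past times as your value of q).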
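The role of d can be illustrated with plain NumPy: differencing once removes a linear trend, differencing twice removes a quadratic one, and a cumulative sum undoes the differencing (the "integration" in ARIMA). The toy series below are assumptions made for the example.

```python
import numpy as np

t = np.arange(10, dtype=float)
linear = 2.0 * t + 5.0      # series with a linear trend
quadratic = t ** 2          # series with a quadratic trend

d1 = np.diff(linear)            # first difference (d = 1): constant series
d2 = np.diff(quadratic, n=2)    # second difference (d = 2): constant series

# integration: the cumulative sum of the differences plus the first value restores the series
restored = np.concatenate(([linear[0]], linear[0] + np.cumsum(d1)))

print("first difference of linear trend:", d1)
print("second difference of quadratic trend:", d2)
print("restored equals original:", np.allclose(restored, linear))
```

The same cumulative-sum step is what converts predictions made on a differenced series back to the original scale.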

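Running the ARIMA Model

Plot the chart to check whether a trend, cycle, or seasonality exists in the dataset.
Stationarize the series: to stationarize a series, we need to remove the trend (varying mean) and seasonality (varying variance) components from the series. Moving average and differencing techniques can be used to stabilize the trend, whereas a log transform will stabilize the seasonal variance. Further, the Dickey-Fuller test can be used to assess the stationarity of the series: the null hypothesis of the Dickey-Fuller test is that the data is non-stationary (has a unit root), so a test result with p-value > 0.05 means the null hypothesis cannot be rejected and the data is treated as non-stationary (Listing 3-43).
Find optimal parameters: once the series is stationarized, you can look at the autocorrelation function (ACF) and partial autocorrelation function (PACF) charts to pick the number of AR or MA terms needed to remove the autocorrelation. ACF is a bar chart of the correlation coefficients against the lags; similarly, PACF is a bar chart of the partial correlation coefficients (the correlation between the variable and a lag of itself that is not explained by correlations at all lower-order lags) against the lags (Listing 3-44).
Build model and evaluate: since a time series is a continuous number, MAE and RMSE can be used to evaluate the deviation between actual and predicted values in the training dataset. Other useful metrics are the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). These come from information theory and estimate the quality of an individual model relative to a given collection of models; they favor models with smaller residuals (Listing 3-45).

AIC = -2log(L) + 2(p+q+k+1), where L is the maximum likelihood of the fitted model and p, q, k count the parameters in the model

BIC = AIC + (log(T) - 2)(p+q+k+1), where T is the number of observations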
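The two formulas above translate directly into code. The function names `aic` and `bic` and the log-likelihood value below are hypothetical, chosen only to show the arithmetic; statistical packages normally report both criteria for you.

```python
import math

def aic(log_likelihood, p, q, k):
    """AIC = -2 log(L) + 2 (p + q + k + 1)."""
    return -2.0 * log_likelihood + 2.0 * (p + q + k + 1)

def bic(log_likelihood, p, q, k, n_obs):
    """BIC = AIC + (log(T) - 2) (p + q + k + 1), with T = n_obs observations."""
    return aic(log_likelihood, p, q, k) + (math.log(n_obs) - 2.0) * (p + q + k + 1)

# hypothetical fit: log-likelihood of -10 for a model with p = 2, q = 2, k = 1 on 77 observations
example_aic = aic(-10.0, p=2, q=2, k=1)
example_bic = bic(-10.0, p=2, q=2, k=1, n_obs=77)

print("AIC:", example_aic)
print("BIC:", example_bic)
```

Because log(T) exceeds 2 for any series longer than seven observations, BIC penalizes each extra parameter more heavily than AIC and therefore tends to favor smaller models.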

import pandas as pd
import matplotlib.pyplot as plt

# Data Source: O.D. Anderson (1976), in file: data/anderson14, Description: Monthly sales of company X Jan '65 – May '71 C. Chatfield
df = pd.read_csv('Data/TS.csv')
ts = pd.Series(list(df['Sales']), index=pd.to_datetime(df['Month'],format='%Y-%m'))

from statsmodels.tsa.seasonal import seasonal_decompose
decomposition = seasonal_decompose(ts)

trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid

# plot the original series and its components
plt.subplot(411)
plt.plot(ts, label="Original")  # ts_log is only defined later, in Listing 3-43
plt.legend(loc='best')
plt.subplot(412)
plt.plot(trend, label="Trend")
plt.legend(loc='best')
plt.subplot(413)
plt.plot(seasonal,label='Seasonality')
plt.legend(loc='best')
plt.subplot(414)
plt.plot(residual, label='Residuals')
plt.legend(loc='best')

plt.tight_layout()

Listing 3-42. Decompose Time Series

Checking for Stationarity

from statsmodels.tsa.stattools import adfuller

# log transform
ts_log = np.log(ts)
ts_log.dropna(inplace=True)

s_test = adfuller(ts_log, autolag="AIC")
print ("Log transform stationary check p value: ", s_test[1])

#Take first difference:
ts_log_diff = ts_log - ts_log.shift()
ts_log_diff.dropna(inplace=True)

plt.title('Trend removed plot with first order difference')
plt.plot(ts_log_diff)
plt.ylabel('First order log diff')

s_test = adfuller(ts_log_diff, autolag="AIC")
print ("First order difference stationary check p value: ", s_test[1] )

# moving average smoothens the line
moving_avg = ts_log.rolling(12).mean()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (10,3))
ax1.set_title('First order difference')
ax1.tick_params(axis='x', labelsize=7)
ax1.tick_params(axis='y', labelsize=7)
ax1.plot(ts_log_diff)

ax2.plot(ts_log)
ax2.set_title('Log vs Moving AVg')
ax2.tick_params(axis='x', labelsize=7)
ax2.tick_params(axis='y', labelsize=7)
ax2.plot(moving_avg, color="red")
plt.tight_layout()

#----output----
Log transform stationary check p value:  0.785310212485
First order difference stationary check p value:  0.0240253928399

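Listing 3-43. Check Stationary

Autocorrelation Test

We determined that the log of the time series requires at least one order of differencing to become stationary. Now let's plot the ACF and PACF charts for the first-order differenced log series.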

import statsmodels.api as sm

fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (10,3))

# ACF chart
fig = sm.graphics.tsa.plot_acf(ts_log_diff.values.squeeze(), lags=20, ax=ax1)

# draw 95% confidence interval line
ax1.axhline(y=-1.96/np.sqrt(len(ts_log_diff)),linestyle='--',color='gray')
ax1.axhline(y=1.96/np.sqrt(len(ts_log_diff)),linestyle='--',color='gray')
ax1.set_xlabel('Lags')

# PACF chart

fig = sm.graphics.tsa.plot_pacf(ts_log_diff, lags=20, ax=ax2)

# draw 95% confidence interval line
ax2.axhline(y=-1.96/np.sqrt(len(ts_log_diff)),linestyle='--',color='gray')
ax2.axhline(y=1.96/np.sqrt(len(ts_log_diff)),linestyle='--',color='gray')
ax2.set_xlabel('Lags')
#----output----

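Listing 3-44. Check Autocorrelation

The PACF plot has a significant spike only at lag 1, meaning that all the higher-order autocorrelations are effectively explained by the lag-1 and lag-2 autocorrelation. The ideal lag values are p = 2 and q = 2 (i.e., the lag value at which the ACF/PACF chart crosses the upper confidence interval for the first time).

Build Model and Evaluate

Let's fit the ARIMA model on the dataset and evaluate the model performance.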

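from sklearn.metrics import mean_absolute_error, mean_squared_error

# build model
# note: fit(disp=-1) is the legacy statsmodels ARIMA interface; recent releases
# (statsmodels.tsa.arima.model.ARIMA) call fit() without the disp argument
model = sm.tsa.ARIMA(ts_log, order=(2,0,2))
results_ARIMA = model.fit(disp=-1)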

ts_predict = results_ARIMA.predict()

# Evaluate model
print("AIC: ", results_ARIMA.aic)
print("BIC: ", results_ARIMA.bic)

print("Mean Absolute Error: ", mean_absolute_error(ts_log.values, ts_predict.values))
print("Root Mean Squared Error: ", np.sqrt(mean_squared_error(ts_log.values, ts_predict.values)))

# check autocorrelation

print("Durbin-Watson statistic :", sm.stats.durbin_watson(results_ARIMA.resid.values))
#----output-----
AIC:  7.8521105380873735
BIC:  21.914943069209478
Mean Absolute Error:  0.19596606887750853
Root Mean Squared Error:  0.2397921908617542
Durbin-Watson statistic : 1.8645776109746208

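Listing 3-45. Build ARIMA Model and Evaluate

A common practice is to build several models with different p and q and select the one with the smallest values of AIC, BIC, MAE, and RMSE. Now let's increase p to 3 and see if there is any difference in the result (Listing 3-46).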

model = sm.tsa.ARIMA(ts_log, order=(3,0,2))
results_ARIMA = model.fit(disp=-1)

ts_predict = results_ARIMA.predict()
plt.title('ARIMA Prediction - order(3,0,2)')
plt.plot(ts_log, label="Actual")
plt.plot(ts_predict, 'r--', label="Predicted")
plt.xlabel('Year-Month')
plt.ylabel('Sales')
plt.legend(loc='best')

print("AIC: ", results_ARIMA.aic)
print("BIC: ", results_ARIMA.bic)

print("Mean Absolute Error: ", mean_absolute_error(ts_log.values, ts_predict.values))
print("Root Mean Squared Error: ", np.sqrt(mean_squared_error(ts_log.values, ts_predict.values)))

# check autocorrelation

print("Durbin-Watson statistic :", sm.stats.durbin_watson(results_ARIMA.resid.values))
#----output----
AIC:  -7.786042455163056
BIC:  8.620595497812733
Mean Absolute Error:  0.16721947678957297
Root Mean Squared Error:  0.21618486190507652
Durbin-Watson statistic : 2.5184568082461936

Listing 3-46. Build ARIMA Model and Evaluate by Increasing p to 3

Let's try it with first-order differencing (i.e., d = 1) to see whether the model performance improves (Listing 3-47).

model = sm.tsa.ARIMA(ts_log, order=(3,1,2))
results_ARIMA = model.fit(disp=-1)

ts_predict = results_ARIMA.predict()

# Correction for differencing: cumulatively sum the predicted differences and add the first observation
predictions_ARIMA_diff = pd.Series(ts_predict, copy=True)
predictions_ARIMA_diff_cumsum = predictions_ARIMA_diff.cumsum()
predictions_ARIMA_log = pd.Series(ts_log.iloc[0], index=ts_log.index)
predictions_ARIMA_log = predictions_ARIMA_log.add(predictions_ARIMA_diff_cumsum,fill_value=0)

# plot actual vs. predicted values
plt.title('ARIMA Prediction - order(3,1,2)')
plt.plot(ts_log, label="Actual")

plt.plot(predictions_ARIMA_log, 'r--', label="Predicted")
plt.xlabel('Year-Month')
plt.ylabel('Sales')
plt.legend(loc='best')

print("AIC: ", results_ARIMA.aic)
print("BIC: ", results_ARIMA.bic)

print("Mean Absolute Error: ", mean_absolute_error(ts_log_diff.values, ts_predict.values))
print("Root Mean Squared Error: ", np.sqrt(mean_squared_error(ts_log_diff.values, ts_predict.values)))

# check autocorrelation

print("Durbin-Watson statistic :", sm.stats.durbin_watson(results_ARIMA.resid.values))

#----output----
AIC:  -35.41898773672588
BIC:  -19.103854354721562
Mean Absolute Error:  0.13876538862134086
Root Mean Squared Error:  0.1831024379477494
Durbin-Watson statistic : 1.941165833847913

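Listing 3-47. ARIMA with First Order Differencing

In the preceding plot, we can see that the model over-predicts at some places. Note: AIC and BIC can be positive or negative; lower values indicate a better model, but they are only comparable between models fitted to the same series, so the values for d = 1 (fitted to the differenced series) should not be compared directly with those for d = 0.

Predicting Future Values

The parameter values (p=3, d=0, q=2) give smaller numbers for the evaluation metrics than (p=2, d=0, q=2), so let's use this as the final model to predict future values through May 1972 (Listing 3-48).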

# final model
model = sm.tsa.ARIMA(ts_log, order=(3,0,2))
results_ARIMA = model.fit(disp=-1)

# predict future values
ts_predict = results_ARIMA.predict('1971-06-01', '1972-05-01')
plt.title('ARIMA Future Value Prediction - order(3,0,2)')
plt.plot(ts_log, label="Actual")
plt.plot(ts_predict, 'r--', label="Predicted")
plt.xlabel('Year-Month')
plt.ylabel('Sales')
plt.legend(loc='best')

#----output----

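Listing 3-48. ARIMA Predict Function

Note

A minimum of 3 to 4 years of historic data is required to ensure that the seasonal patterns are regular.

Unsupervised Learning Process Flow

The unsupervised learning process flow is shown in Figure 3-18. Similar to supervised learning, we can train a model and use it to predict on an unknown dataset. However, the key difference is that there are no predefined categories or labels available for the target variable, and the goal is often to create categories or labels based on the patterns available in the data.

Figure 3-18

Unsupervised learning process flow

Clustering

Clustering is an unsupervised learning problem. The key objective is to identify distinct groups (called clusters) based on some notion of similarity within a given dataset. Clustering analysis has its origins in the fields of anthropology and psychology in the 1930s. The most popularly used clustering techniques are k-means (divisive) and hierarchical (agglomerative).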

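K-means

The key objective of the k-means algorithm is to organize data into clusters such that there is high intra-cluster similarity and low inter-cluster similarity. An item will belong to only one cluster, not several (i.e., it generates a specific number of disjoint, non-hierarchical clusters). K-means uses a divide-and-conquer strategy and is a classic example of an expectation maximization (EM) algorithm. EM algorithms are made up of two steps: the first step, known as expectation (E), is to find the expected point associated with a cluster; the second step, known as maximization (M), is to improve the estimate of the cluster using knowledge from the first step. The two steps are repeated until convergence is reached.

Suppose we have n data points that we need to cluster into k groups (c1, c2, c3) (Figure 3-19).

Figure 3-19

Expectation maximization algorithm workflow

Step 1: In the first step, k centroids (in the preceding case k = 3) are randomly picked (only in the first iteration), and all the points nearest to each centroid are assigned to that specific cluster. A centroid is the arithmetic mean, or average position, of all the points.

Step 2: Here, the centroid is recalculated using the average coordinates of all the points in that cluster. Then step 1 (assigning the nearest points) is repeated until the clusters converge.

Note: K-means works only with Euclidean distance.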

$\text{Euclidean distance} = d = \sqrt{\sum \limits_{i=1}^{N}{\left({X}_i-{Y}_i\right)}^2}$
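The two steps can be sketched from scratch in NumPy. This is a simplified illustration under stated assumptions (random initial centroids drawn from the data, a fixed iteration cap, three synthetic blobs); the function name `kmeans` is made up for the example, and scikit-learn's KMeans adds smarter initialization and multiple restarts.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: alternate nearest-centroid assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # pick k distinct data points as the initial centroids (first iteration only)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 1: assign every point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids

# three synthetic blobs (illustrative data)
data_rng = np.random.default_rng(42)
X = np.vstack([data_rng.normal(loc=c, scale=0.3, size=(30, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

labels, centroids = kmeans(X, k=3)
print("cluster sizes:", np.bincount(labels, minlength=3))
print("centroids:\n", centroids)
```

Because the starting centroids are random, different seeds can converge to different local solutions, which is one reason library implementations run several initializations and keep the best.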

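Limitations of K-means

K-means clustering needs the number of clusters to be specified.
K-means has problems when clusters are of differing sizes or densities, or have non-spherical shapes.
The presence of outliers can skew the results.

Let's load the Iris data and assume for a moment that the species column is missing: we have the measured values for sepal length/width and petal length/width, but we do not know how many species exist.

Now let's use unsupervised learning (clustering) to find out how many species exist. The goal here is to group all similar items into a cluster. We can assume k to be 3 for now; we'll learn later about approaches for finding the value of k (Listing 3-49).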

from sklearn import datasets
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()

# Let's convert to dataframe
iris = pd.DataFrame(data= np.c_[iris['data'], iris['target']],
                     columns= iris['feature_names'] + ['species'])

# let's remove spaces from column name
iris.columns = iris.columns.str.replace(' ', '')
iris.head()

X = iris.iloc[:,:3]  # independent variables
y = iris.species   # dependent variable

sc = StandardScaler()
sc.fit(X)
X = sc.transform(X)

# K Means Cluster
model = KMeans(n_clusters=3, random_state=11)
model.fit(X)
print (model.labels_)
# ----output----
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 0 1 1 1 1 1 1 1 1 2 2 2 0 2 0 2 0 2 0 0 0 0 0 0 2 0 0 0 0 2 0 0 0
 2 2 2 2 0 0 0 0 0 0 0 2 2 0 0 0 0 2 0 0 0 0 0 0 0 0 2 0 2 2 2 2 0 2 2 2 2
 2 2 0 0 2 2 2 2 0 2 0 2 0 2 2 0 2 2 2 2 2 2 2 0 2 2 2 0 2 2 2 0 2 2 2 0 2
 2 0]

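Listing 3-49. k-means Clustering

We see that the clustering algorithm has assigned a cluster label to each record. Let's compare this with the actual species labels to understand the accuracy of grouping similar records (Listing 3-50).

from sklearn import metrics
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# since it is unsupervised, the cluster labels have been assigned
# not in line with the actual labels, so let's convert all the 1s to 0s and 0s to 1s
# 2's look fine
iris['pred_species'] =  np.choose(model.labels_, [1, 0, 2]).astype(np.int64)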

print ("Accuracy :", metrics.accuracy_score(iris.species, iris.pred_species))
print ("Classification report :", metrics.classification_report(iris.species, iris.pred_species))

# Set the size of the plot

plt.figure(figsize=(10,7))

# Create a colormap for red, green and blue
cmap = ListedColormap(['r', 'g', 'b'])

# Plot Sepal
plt.subplot(2, 2, 1)
# cast the float species codes to int so the colormap indexes one color per class
plt.scatter(iris['sepallength(cm)'], iris['sepalwidth(cm)'], c=cmap(iris.species.astype(int)), marker="o", s=50)
plt.xlabel('sepallength(cm)')
plt.ylabel('sepalwidth(cm)')
plt.title('Sepal (Actual)')

plt.subplot(2, 2, 2)
plt.scatter(iris['sepallength(cm)'], iris['sepalwidth(cm)'], c=cmap(iris.pred_species), marker="o", s=50)
plt.xlabel('sepallength(cm)')
plt.ylabel('sepalwidth(cm)')
plt.title('Sepal (Predicted)')

plt.subplot(2, 2, 3)
plt.scatter(iris['petallength(cm)'], iris['petalwidth(cm)'], c=cmap(iris.species.astype(int)),marker='o', s=50)
plt.xlabel('petallength(cm)')
plt.ylabel('petalwidth(cm)')
plt.title('Petal (Actual)')

plt.subplot(2, 2, 4)
plt.scatter(iris['petallength(cm)'], iris['petalwidth(cm)'], c=cmap(iris.pred_species),marker='o', s=50)
plt.xlabel('petallength(cm)')
plt.ylabel('petalwidth(cm)')
plt.title('Petal (Predicted)')
plt.tight_layout()
#----output----
Accuracy : 0.8066666666666666
Classification report :
               precision    recall  f1-score   support

         0.0       1.00      0.98      0.99        50
         1.0       0.71      0.70      0.71        50
         2.0       0.71      0.74      0.73        50

   micro avg       0.81      0.81      0.81       150
   macro avg       0.81      0.81      0.81       150
weighted avg       0.81      0.81      0.81       150

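Listing 3-50. Accuracy of k-means Clustering

From the preceding charts we can see that k-means has done a decent job of clustering similar labels, with an accuracy of about 80% compared with the actual labels.

Finding the Value of k

Two methods are commonly used to determine the value of k:

Elbow method
Average silhouette method

Elbow Method

Perform k-means clustering on the dataset for a range of k values (for example, 1 to 10) and calculate the sum of squared errors (SSE) or the percentage of variance explained for each value of k. Plot a line chart of the number of clusters against SSE, and then look for an elbow shape on the line graph, which indicates the ideal number of clusters. As k increases, the SSE tends to decrease toward 0. The SSE is zero if k equals the total number of data points in the dataset, because at that stage each data point becomes its own cluster and there is no error between the cluster and its center. So the goal of the elbow method is to choose a small value of k that still has a low SSE, and the elbow usually represents this value. The percentage of variance explained tends to increase as k increases, and we pick the point where the elbow shape appears (Listing 3-51).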

from scipy.spatial.distance import cdist, pdist
from sklearn.cluster import KMeans

K = range(1,10)
KM = [KMeans(n_clusters=k).fit(X) for k in K]
centroids = [k.cluster_centers_ for k in KM]

D_k = [cdist(X, cent, 'euclidean') for cent in centroids]
cIdx = [np.argmin(D,axis=1) for D in D_k]
dist = [np.min(D,axis=1) for D in D_k]
avgWithinSS = [sum(d)/X.shape[0] for d in dist]

# Total with-in sum of square
wcss = [sum(d**2) for d in dist]
tss = sum(pdist(X)**2)/X.shape[0]
bss = tss-wcss
varExplained = bss/tss*100

##### plot ###
kIdx = 2  # index into K of the elbow point to highlight (K[2] is k = 3)

# elbow curve
# Set the size of the plot
plt.figure(figsize=(10,4))

plt.subplot(1, 2, 1)
plt.plot(K, avgWithinSS, 'b*-')
plt.plot(K[kIdx], avgWithinSS[kIdx], marker="o", markersize=12,
    markeredgewidth=2, markeredgecolor="r", markerfacecolor="None")
plt.grid(True)
plt.xlabel('Number of clusters')
plt.ylabel('Average within-cluster sum of squares')
plt.title('Elbow for KMeans clustering')

plt.subplot(1, 2, 2)
plt.plot(K, varExplained, 'b*-')
plt.plot(K[kIdx], varExplained[kIdx], marker="o", markersize=12,
    markeredgewidth=2, markeredgecolor="r", markerfacecolor="None")
plt.grid(True)
plt.xlabel('Number of clusters')

plt.ylabel('Percentage of variance explained')
plt.title('Elbow for KMeans clustering')
plt.tight_layout()
#----output----

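Listing 3-51. Elbow Method

Average Silhouette Method

In 1986, Peter J. Rousseeuw described the silhouette method, which aims to explain the consistency within clustered data. The silhouette value ranges between -1 and 1. A high value indicates that an item is well matched within its own cluster and weakly matched to neighboring clusters (Listing 3-52).

$s(i) = \frac{b(i) - a(i)}{\max \{a(i), b(i)\}}$, where a(i) is the average dissimilarity of the ith item to the other data points in the same cluster, and b(i) is the lowest average dissimilarity of i to any other cluster of which i is not a member.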
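To see the formula at work, the sketch below computes s(i) by hand for five hypothetical points in two clusters and compares the result with scikit-learn's `silhouette_samples`. The helper `silhouette_manual` and the toy data are assumptions made for this illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# five toy points in two clusters (illustrative data)
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0], [5.0, 2.0]])
labels = np.array([0, 0, 1, 1, 1])

def silhouette_manual(X, labels):
    """s(i) = (b(i) - a(i)) / max(a(i), b(i)) using Euclidean distances."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                      # exclude the point itself
        if not same.any():
            continue                         # singleton cluster: s(i) is taken as 0
        a = dist[i, same].mean()             # mean distance to own cluster
        b = min(dist[i, labels == c].mean()  # lowest mean distance to another cluster
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

manual = silhouette_manual(X, labels)
print("manual :", manual)
print("sklearn:", silhouette_samples(X, labels))
```

Averaging these per-sample values gives the single silhouette score that is usually compared across candidate values of k.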

from sklearn.metrics import silhouette_samples, silhouette_score
from matplotlib import cm

score = []
for n_clusters in range(2,10):
    kmeans = KMeans(n_clusters=n_clusters)
    kmeans.fit(X)

    labels = kmeans.labels_
    centroids = kmeans.cluster_centers_

    score.append(silhouette_score(X, labels, metric="euclidean"))

# Set the size of the plot
plt.figure(figsize=(10,4))

plt.subplot(1, 2, 1)
plt.plot(range(2,10), score)  # x-axis shows the k value that was tried
plt.grid(True)
plt.ylabel("Silhouette Score")
plt.xlabel("k")
plt.title("Silhouette for K-means")

# Initialize the clusterer with n_clusters value and a random generator
model = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
model.fit_predict(X)

cluster_labels = np.unique(model.labels_)
n_clusters = cluster_labels.shape[0]

# Compute the silhouette scores for each sample
silhouette_vals = silhouette_samples(X, model.labels_)

plt.subplot(1, 2, 2)

# Get spectral values for colormap.
cmap = plt.get_cmap("Spectral")

y_lower, y_upper = 0,0
yticks = []
for i, c in enumerate(cluster_labels):
    # silhouette values of the samples that belong to cluster c
    c_silhouette_vals = silhouette_vals[model.labels_ == c]
    c_silhouette_vals.sort()
    y_upper += len(c_silhouette_vals)
    color = cmap(float(i) / n_clusters)
    plt.barh(range(y_lower, y_upper), c_silhouette_vals, facecolor=color, edgecolor=color, alpha=0.7)
    yticks.append((y_lower + y_upper) / 2)
    y_lower += len(c_silhouette_vals)
silhouette_avg = np.mean(silhouette_vals)

plt.yticks(yticks, cluster_labels+1)

# The vertical line for average silhouette score of all the values

plt.axvline(x=silhouette_avg, color="red", linestyle="--")

plt.ylabel('Cluster')
plt.xlabel('Silhouette coefficient')
plt.title("Silhouette for K-means")
plt.show()

#---output----

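Listing 3-52. Silhouette Method

Hierarchical Clustering

Agglomerative clustering is a hierarchical clustering technique that builds nested clusters using a bottom-up approach, in which each data point starts in its own cluster and, as we move up, clusters are merged based on a distance matrix.

Key Parameters

n_clusters: this is the number of clusters to find; the default is 2.

linkage: it has to be one of ward, complete, or average; the default is ward (Figure 3-20).

Let's dig deeper into each linkage. Ward's method merges the pair of clusters for which the resulting within-cluster variance, or error sum of squares, is minimal. In the "average" method, all pairwise distances between the two clusters are used, and it is less affected by outliers. The "complete" method considers the distance between the farthest elements of two clusters, so it is also known as maximum linkage. Listing 3-53 is an example implementation of hierarchical clustering.

Figure 3-20

Agglomerative clustering linkage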

from sklearn.cluster import AgglomerativeClustering

# Agglomerative Cluster
model = AgglomerativeClustering(n_clusters=3)

# lets fit the model to the iris data set that we imported in Listing 3-49
model.fit(X)

print(model.labels_)
iris['pred_species'] =  model.labels_

print("Accuracy :", metrics.accuracy_score(iris.species, iris.pred_species))
print("Classification report :", metrics.classification_report(iris.species, iris.pred_species))
#----output----
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 1 0 0 0 0 0 0 0 0 2 2 2 1 2 1 2 1 2 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1
 1 2 2 2 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 2 2 1 2 1 2 2
 1 2 1 1 2 2 2 2 1 2 1 2 1 2 2 1 1 1 2 2 2 1 1 1 2 2 2 1 2 2 2 1 2 2 2 1 2
 2 1]
Accuracy : 0.7733333333333333
Classification report :
               precision    recall  f1-score   support

         0.0       1.00      0.98      0.99        50
         1.0       0.64      0.74      0.69        50
         2.0       0.70      0.60      0.65        50

   micro avg       0.77      0.77      0.77       150
   macro avg       0.78      0.77      0.77       150
weighted avg       0.78      0.77      0.77       150

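Listing 3-53. Hierarchical Clustering

Hierarchical clustering results are better interpreted through a dendrogram visualization. SciPy provides the necessary functions for dendrogram visualization (Listing 3-54). Currently, scikit-learn lacks these functions.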

from scipy.cluster.hierarchy import cophenet, dendrogram, linkage
from scipy.spatial.distance import pdist

# generate the linkage matrix
Z = linkage(X, 'ward')
c, coph_dists = cophenet(Z, pdist(X))

# calculate full dendrogram

plt.figure(figsize=(25, 10))
plt.title('Agglomerative Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(
    Z,
    leaf_rotation=90.,  # rotates the x axis labels
    leaf_font_size=8.,  # font size for the x axis labels
)
plt.tight_layout()
#----output----

Listing 3-54. Hierarchical Clustering

Since we know that k = 3, we can cut the tree at a distance threshold of around 10 to get exactly three distinct clusters.
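Cutting the tree can be done programmatically with SciPy's `fcluster`. The sketch below assumes the same setup as the earlier listings (the first three Iris features, standardized, with Ward linkage); the threshold of 10 is taken from the text, and the number of clusters it yields is printed rather than assumed.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# assumption: same features as the k-means listing (first three columns, standardized)
X = StandardScaler().fit_transform(load_iris().data[:, :3])

# build the linkage matrix with Ward's method
Z = linkage(X, "ward")

# option 1: ask directly for at most three flat clusters
labels_k = fcluster(Z, t=3, criterion="maxclust")

# option 2: cut the dendrogram at a distance threshold
labels_d = fcluster(Z, t=10, criterion="distance")

print("clusters with maxclust=3:", len(np.unique(labels_k)))
print("clusters at distance threshold 10:", len(np.unique(labels_d)))
```

The `maxclust` criterion is convenient when k is known in advance, whereas the `distance` criterion mirrors drawing a horizontal line across the dendrogram.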

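Principal Component Analysis

The presence of a large number of features or dimensions makes analysis computationally intensive and makes ML tasks for pattern identification hard to perform. Principal component analysis (PCA) is the most popular unsupervised linear transformation technique for dimensionality reduction. PCA finds the directions of maximum variance in high-dimensional data, so that most of the information is retained, and projects the data onto a subspace of smaller dimension (Figure 3-21).

Figure 3-21

Principal component analysis

The PCA approach can be summarized as follows:

Standardize the data.
Use the standardized data to generate a covariance matrix or correlation matrix.
Perform eigendecomposition: compute the eigenvectors, which are the principal components and give the directions, and compute the eigenvalues, which give the magnitudes.
Sort the eigenpairs and select the eigenvectors with the largest eigenvalues that cumulatively capture information above a certain threshold (say 95%).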

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

iris = datasets.load_iris()
X = iris.data

# standardize data
X_std = StandardScaler().fit_transform(X)

# create covariance matrix
cov_mat = np.cov(X_std.T)

print('Covariance matrix \n%s' %cov_mat)

eig_vals, eig_vecs = np.linalg.eig(cov_mat)
print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)

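# pair each eigenvalue with its eigenvector and sort in decreasing order of eigenvalue
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]
eig_pairs.sort(key=lambda pair: pair[0], reverse=True)

tot = sum(eig_vals)
var_exp = [(i / tot)*100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)  # cumulative percentage of variance explained
print("Cummulative Variance Explained", cum_var_exp)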

plt.figure(figsize=(6, 4))

plt.bar(range(4), var_exp, alpha=0.5, align="center",
        label='Individual explained variance')
plt.step(range(4), cum_var_exp, where="mid",
         label='Cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.legend(loc='best')
plt.tight_layout()
plt.show()
#----output----
Covariance matrix

[[ 1.00671141 -0.11835884  0.87760447  0.82343066]
 [-0.11835884  1.00671141 -0.43131554 -0.36858315]
 [ 0.87760447 -0.43131554  1.00671141  0.96932762]
 [ 0.82343066 -0.36858315  0.96932762  1.00671141]]
Eigenvectors
[[ 0.52106591 -0.37741762 -0.71956635  0.26128628]
 [-0.26934744 -0.92329566  0.24438178 -0.12350962]
 [ 0.5804131  -0.02449161  0.14212637 -0.80144925]
 [ 0.56485654 -0.06694199  0.63427274  0.52359713]]

Eigenvalues
[2.93808505 0.9201649  0.14774182 0.02085386]

Cummulative Variance Explained
[ 72.96244541  95.8132072   99.48212909 100.        ]

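Listing 3-55. Principal Component Analysis

In the preceding plot, we can see that the first three principal components explain 99% of the variance. Let's perform PCA using scikit-learn and plot the first three eigenvectors.

Listing 3-56 shows an example code implementation for visualizing PCA.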

# source: http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html#
import matplotlib.pyplot as plt

from mpl_toolkits.mplot3d import Axes3D
from sklearn import datasets
from sklearn.decomposition import PCA

# import some data to play with
iris = datasets.load_iris()
Y = iris.target

# To get a better understanding of the interaction of the dimensions,
# plot the first three PCA dimensions
fig = plt.figure(1, figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d', elev=-150, azim=110)
X_reduced = PCA(n_components=3).fit_transform(iris.data)
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], X_reduced[:, 2], c=Y, cmap=plt.cm.Paired)
ax.set_title("First three PCA directions")
ax.set_xlabel("1st eigenvector")
ax.xaxis.set_ticklabels([])
ax.set_ylabel("2nd eigenvector")
ax.yaxis.set_ticklabels([])
ax.set_zlabel("3rd eigenvector")
ax.zaxis.set_ticklabels([])
plt.show()

#---output----

Listing 3-56. Visualize PCA
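The 95% selection rule from the PCA summary can also be applied with scikit-learn directly. The sketch below is one possible way to do it, assuming standardized Iris data; the variable name `n_components_95` is made up for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# standardize, then fit PCA with all components kept
X_std = StandardScaler().fit_transform(load_iris().data)
pca = PCA().fit(X_std)

# cumulative percentage of variance explained by the leading components
cum_var = np.cumsum(pca.explained_variance_ratio_) * 100

# smallest number of components whose cumulative variance reaches 95%
n_components_95 = int(np.searchsorted(cum_var, 95.0) + 1)

print("cumulative variance explained (%):", np.round(cum_var, 2))
print("components needed for 95%:", n_components_95)

# project the data onto that many components
X_reduced = PCA(n_components=n_components_95).fit_transform(X_std)
print("reduced shape:", X_reduced.shape)
```

The ratios reported by `explained_variance_ratio_` correspond to the eigenvalues of the covariance matrix divided by their sum, so this reproduces the manual eigendecomposition approach.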

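Summary

With this, we have reached the end of this step. We briefly learned different fundamental ML concepts and their implementation. Data quality is an important aspect of building efficient ML systems. To that end, we learned about different types of data, commonly practiced EDA techniques for understanding data quality, and the fundamental preprocessing techniques for fixing data gaps. Supervised models, such as linear and nonlinear regression techniques, are useful for modeling patterns to predict continuous numerical data types, whereas logistic regression, decision trees, SVM, and kNN are useful for modeling classification problems (regression counterparts are also available). We also learned about ARIMA, which is a key time-series forecasting model. Unsupervised techniques such as k-means and hierarchical clustering are useful for grouping similar items, whereas principal component analysis can be used to reduce high-dimensional data to lower dimensions for efficient computation.

In the next step, you will learn how to select the best parameters for a model, widely known as hyperparameter tuning, to improve model accuracy; the common practices for selecting the best model among multiple models for a given problem; and how to combine multiple models to get the best from the individual models.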

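4. Model Diagnosis and Tuning

In this chapter, we will look at the different pitfalls that you should be aware of and will encounter when building a machine learning (ML) system. We will also learn industry-standard, efficient design practices for addressing them.

Throughout this chapter, we will mostly use the "Pima Indian diabetes" dataset from the UCI repository, which has 768 records, 8 attributes, 2 classes, 268 (34.9%) positive results for a diabetes test, and 500 (65.1%) negative results. All patients were females of Pima Indian heritage at least 21 years old.

Attributes of the dataset:

Number of times pregnant
Plasma glucose concentration at 2 hours in an oral glucose tolerance test
Diastolic blood pressure (mm Hg)
Triceps skin fold thickness (mm)
2-hour serum insulin (mu U/ml)
Body mass index (weight in kg/(height in m)^2)
Diabetes pedigree function
Age (years)

Optimal Probability Cutoff Point

A predicted probability is a number between 0 and 1. Traditionally, > 0.5 is the cutoff point used for converting a predicted probability to 1 (positive); otherwise it is 0 (negative). This logic works well when your training dataset has equal numbers of positive and negative examples; however, this is not the case in real-world scenarios.

The solution is to find the optimal cutoff point (i.e., the point where the true positive rate is high and the false positive rate is low). Anything above this threshold can be labeled 1, otherwise 0. Listing 4-1 should illustrate this point, so let's load the data and check the class distribution.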

import pandas as pd
import pylab as plt
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# read the data in
df = pd.read_csv("Data/Diabetes.csv")

# target variable % distribution
print (df['class'].value_counts(normalize=True))
#----output----
0    0.651042
1    0.348958

Listing 4-1. Load Data and Check the Class Distribution

Let's build a quick logistic regression model and check its accuracy (Listing 4-2).

X = df.iloc[:,:8]     # independent variables
y = df['class']     # dependent variables

# evaluate the model by splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# instantiate a logistic regression model, and fit
model = LogisticRegression()
model = model.fit(X_train, y_train)

# predict class labels for the train set. The predict function converts probability values > .5 to 1 else 0
y_pred = model.predict(X_train)

# generate class probabilities
# Notice that 2 elements will be returned in probs array,
# 1st element is probability for negative class,
# 2nd element gives probability for positive class
probs = model.predict_proba(X_train)
y_pred_prob = probs[:, 1]

# generate evaluation metrics
print ("Accuracy: ", metrics.accuracy_score(y_train, y_pred))
#----output----
Accuracy:  0.767225325885

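Listing 4-2. Build a Logistic Regression Model and Evaluate the Performance

The optimal cutoff is where the true positive rate (tpr) is high and the false positive rate (fpr) is low, and tpr - (1-fpr) is zero or near zero. Listing 4-3 is example code for plotting the receiver operating characteristic (ROC) chart of tpr vs. 1-fpr.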

# extract false positive, true positive rate
fpr, tpr, thresholds = metrics.roc_curve(y_train, y_pred_prob)
roc_auc = metrics.auc(fpr, tpr)
print("Area under the ROC curve : %f" % roc_auc)

i = np.arange(len(tpr)) # index for df
roc = pd.DataFrame({'fpr' : pd.Series(fpr, index=i),'tpr' : pd.Series(tpr, index = i),'1-fpr' : pd.Series(1-fpr, index = i), 'tf' : pd.Series(tpr - (1-fpr), index = i),'thresholds' : pd.Series(thresholds, index = i)})
roc.iloc[(roc.tf-0).abs().argsort()[:1]]

# Plot tpr vs 1-fpr
fig, ax = plt.subplots()
plt.plot(roc['tpr'], label="tpr")
plt.plot(roc['1-fpr'], color = 'red', label='1-fpr')
plt.legend(loc='best')
plt.xlabel('1-False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.show()
#----output----

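Listing 4-3. Find Optimal Cutoff Point

From the chart, the point where tpr crosses 1-fpr is the optimal cutoff point. To simplify finding the optimal probability threshold and to make the logic reusable, I have written a function that finds the optimal probability cutoff point (Listing 4-4).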

def Find_Optimal_Cutoff(target, predicted):
    """ Find the optimal probability cutoff point for a classification model related to the event rate
    Parameters
    ----------
    target: Matrix with dependent or target data, where rows are observations

    predicted: Matrix with predicted data, where rows are observations

    Returns
    -------
    list type, with optimal cutoff value

    """
    fpr, tpr, threshold = metrics.roc_curve(target, predicted)
    i = np.arange(len(tpr))
    roc = pd.DataFrame({'tf' : pd.Series(tpr-(1-fpr), index=i), 'threshold' : pd.Series(threshold, index=i)})
    roc_t = roc.iloc[(roc.tf-0).abs().argsort()[:1]]

    return list(roc_t['threshold']) 

# Find optimal probability threshold
# Note: probs[:, 1] will have the probability of being a positive label
threshold = Find_Optimal_Cutoff(y_train, probs[:, 1])
print ("Optimal Probability Threshold: ", threshold)

# Applying the threshold to the prediction probability
y_pred_optimal = np.where(y_pred_prob >= threshold, 1, 0)

# Let's compare the accuracy of traditional/normal approach vs optimal cutoff
print ("\nNormal - Accuracy: ", metrics.accuracy_score(y_train, y_pred))
print ("Optimal Cutoff - Accuracy: ", metrics.accuracy_score(y_train, y_pred_optimal))
print ("\nNormal - Confusion Matrix: \n", metrics.confusion_matrix(y_train, y_pred))
print ("Optimal - Cutoff Confusion Matrix: \n", metrics.confusion_matrix(y_train, y_pred_optimal))
#----output----
Optimal Probability Threshold:  [0.36133240553264734]

Normal - Accuracy:  0.767225325885
Optimal Cutoff - Accuracy:  0.761638733706

Normal - Confusion Matrix:
[[303  40]
 [ 85 109]]
Optimal - Cutoff Confusion Matrix:
[[260  83]
 [ 47 147]]

Listing 4-4. A Function for Finding Optimal Probability Cutoff

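Notice that overall accuracy does not differ significantly between the normal and optimal cutoff methods; both are about 76%. However, with the optimal cutoff the number of true positives rises from 109 to 147, so you now capture roughly 35% more of the positive cases as positive. In addition, false positives (Type I errors) roughly double, from 40 to 83 (i.e., the probability of predicting an individual without diabetes as positive has increased).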
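To make the comparison concrete, the short sketch below recomputes the true positive rate and false positive rate from the two confusion matrices printed in Listing 4-4. The helper name `rates` is ours, introduced only for illustration; the matrices are assumed to follow sklearn's layout of [[TN, FP], [FN, TP]].

```python
# Confusion matrices copied from the output of Listing 4-4
# (rows = actual class, columns = predicted class).
normal = [[303, 40], [85, 109]]
optimal = [[260, 83], [47, 147]]

def rates(cm):
    """Return (true positive rate, false positive rate) for a 2x2 matrix."""
    (tn, fp), (fn, tp) = cm
    tpr = tp / (tp + fn)   # share of actual positives that were caught
    fpr = fp / (fp + tn)   # share of actual negatives flagged as positive
    return tpr, fpr

for name, cm in (("Normal (0.5)", normal), ("Optimal cutoff", optimal)):
    tpr, fpr = rates(cm)
    print("%-15s tpr=%.3f  fpr=%.3f" % (name, tpr, fpr))
```

Lowering the cutoff trades more caught positives for more false alarms, which is exactly the trade-off the next section discusses.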

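Which Error Is Costlier?

Well, there is no single answer to this question! It depends on the domain, the problem you are trying to solve, and the business requirements (Figure 4-1). In our Pima diabetes case, a Type II error is arguably more damaging than a Type I error, but this is debatable.

Figure 4-1

Type I vs. Type II error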

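Rare Event or Imbalanced Dataset

Providing an equal sample of positive and negative instances to a classification algorithm generally produces the best results. Datasets that are highly skewed toward one or more classes have proven to be a challenge.

Resampling is a common practice for addressing imbalanced datasets. Although there are many resampling techniques, here we'll learn the three most popular ones (Figure 4-2):

Figure 4-2

Imbalanced dataset handling techniques

Random undersampling: Reduce the majority class to match the minority class count.
Random oversampling: Increase the minority class by randomly picking samples within the minority class until the counts of both classes match.
Synthetic Minority Over-Sampling Technique (SMOTE): Increase the minority class by introducing synthetic examples, created by connecting each minority sample to its k (default = 5) nearest minority class neighbors using feature space similarity (Euclidean distance).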
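Before turning to the library implementations, here is a minimal pure-Python sketch of the three ideas on toy two-dimensional data. The function names and the toy data are our own assumptions for illustration, and `smote_like` is only a simplified stand-in for SMOTE (interpolating between a minority point and one of its nearest minority neighbors), not the imblearn implementation.

```python
import random

random.seed(2017)  # fixed seed so the sketch is repeatable

# Toy imbalanced data: 90 majority points (label 0) and 10 minority points (label 1)
majority = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(90)]
minority = [(random.gauss(2, 0.5), random.gauss(2, 0.5)) for _ in range(10)]

def random_under_sample(major, minor):
    """Drop majority rows at random until both classes have the same count."""
    return random.sample(major, len(minor)), list(minor)

def random_over_sample(major, minor):
    """Duplicate minority rows (drawn with replacement) until the counts match."""
    extra = [random.choice(minor) for _ in range(len(major) - len(minor))]
    return list(major), list(minor) + extra

def smote_like(major, minor, k=5):
    """Add synthetic minority points on segments joining a point to a near neighbor."""
    def dist2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    synthetic = []
    while len(minor) + len(synthetic) < len(major):
        p = random.choice(minor)
        neighbors = sorted((q for q in minor if q is not p),
                           key=lambda q: dist2(p, q))[:k]
        q = random.choice(neighbors)
        gap = random.random()  # position along the segment p -> q
        synthetic.append((p[0] + gap * (q[0] - p[0]),
                          p[1] + gap * (q[1] - p[1])))
    return list(major), list(minor) + synthetic

for name, fn in [("under", random_under_sample),
                 ("over", random_over_sample),
                 ("smote-like", smote_like)]:
    maj, mino = fn(majority, minority)
    print(name, "-> majority:", len(maj), "minority:", len(mino))
```

Undersampling ends with 10 rows per class, while both oversampling variants end with 90 per class; the difference is that the SMOTE-style variant adds new points rather than exact duplicates.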

Let's create a sample imbalanced dataset using sklearn's make_classification function (Listing 4-5).

# Load libraries
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE

# Generate the dataset with 2 features to keep it simple
X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
                           n_redundant=0, weights=[0.9, 0.1], random_state=2017)

print ("Positive class: ", y.tolist().count(1))
print ("Negative class: ", y.tolist().count(0))
#----output----
Positive class:  514
Negative class:  4486

Listing 4-5. Rare Event or Imbalanced Data Handling

Let's apply the three sampling techniques described earlier to balance the dataset, and visualize the results for better understanding.

# Apply the random under-sampling
rus = RandomUnderSampler()
X_RUS, y_RUS = rus.fit_resample(X, y)   # fit_resample replaces the older fit_sample

# Apply the random over-sampling
ros = RandomOverSampler()
X_ROS, y_ROS = ros.fit_resample(X, y)

# Apply regular SMOTE
sm = SMOTE()   # regular SMOTE is the default; newer imblearn releases dropped the kind argument
X_SMOTE, y_SMOTE = sm.fit_resample(X, y)

# Original vs resampled subplots
plt.figure(figsize=(10, 6))
plt.subplot(2,2,1)
plt.scatter(X[y==0,0], X[y==0,1], marker="o", color="blue")
plt.scatter(X[y==1,0], X[y==1,1], marker='+', color="red")
plt.xlabel('x1')
plt.ylabel('x2')
plt.title('Original: 1=%s and 0=%s' %(y.tolist().count(1), y.tolist().count(0)))

plt.subplot(2,2,2)
plt.scatter(X_RUS[y_RUS==0,0], X_RUS[y_RUS==0,1], marker="o", color="blue")
plt.scatter(X_RUS[y_RUS==1,0], X_RUS[y_RUS==1,1], marker='+', color="red")
plt.xlabel('x1')
plt.ylabel('x2')
plt.title('Random Under-sampling: 1=%s and 0=%s' %(y_RUS.tolist().count(1), y_RUS.tolist().count(0)))

plt.subplot(2,2,3)
plt.scatter(X_ROS[y_ROS==0,0], X_ROS[y_ROS==0,1], marker="o", color="blue")
plt.scatter(X_ROS[y_ROS==1,0], X_ROS[y_ROS==1,1], marker='+', color="red")
plt.xlabel('x1')
plt.ylabel('x2') 

plt.title('Random over-sampling: 1=%s and 0=%s' %(y_ROS.tolist().count(1), y_ROS.tolist().count(0)))

plt.subplot(2,2,4)
plt.scatter(X_SMOTE[y_SMOTE==0,0], X_SMOTE[y_SMOTE==0,1], marker="o", color="blue")
plt.scatter(X_SMOTE[y_SMOTE==1,0], X_SMOTE[y_SMOTE==1,1], marker='+', color="red")
plt.xlabel('x1')
plt.ylabel('x2')
plt.title('SMOTE: 1=%s and 0=%s' %(y_SMOTE.tolist().count(1), y_SMOTE.tolist().count(0)))

plt.tight_layout()

plt.show()
#----output----

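Caution

Keep in mind that random undersampling raises the chance of losing information or concepts, because we are reducing the majority class, and that random oversampling and SMOTE can lead to overfitting because of multiple related instances.

Which Resampling Technique Is the Best?

Well, yet again there is no single answer to this question! Let's try a quick classification model on the three resampled datasets and compare the results. We'll use the AUC metric, as it is one of the best representations of model performance (Listing 4-6).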

from sklearn import tree
from sklearn import metrics
from sklearn.model_selection import train_test_split

X_RUS_train, X_RUS_test, y_RUS_train, y_RUS_test = train_test_split(X_RUS, y_RUS, test_size=0.3, random_state=2017)
X_ROS_train, X_ROS_test, y_ROS_train, y_ROS_test = train_test_split(X_ROS, y_ROS, test_size=0.3, random_state=2017)
X_SMOTE_train, X_SMOTE_test, y_SMOTE_train, y_SMOTE_test = train_test_split(X_SMOTE, y_SMOTE, test_size=0.3, random_state=2017)

# build a decision tree classifier for each resampled dataset
# (one estimator per dataset; refitting a single shared estimator keeps only the last fit,
#  so scores from this version may differ from the output shown)
clf_rus = tree.DecisionTreeClassifier(random_state=2017).fit(X_RUS_train, y_RUS_train)
clf_ros = tree.DecisionTreeClassifier(random_state=2017).fit(X_ROS_train, y_ROS_train)
clf_smote = tree.DecisionTreeClassifier(random_state=2017).fit(X_SMOTE_train, y_SMOTE_train)

# evaluate model performance
print ("\nRUS - Train AUC : ",metrics.roc_auc_score(y_RUS_train, clf_rus.predict(X_RUS_train)))
print ("RUS - Test AUC : ",metrics.roc_auc_score(y_RUS_test, clf_rus.predict(X_RUS_test)))
print ("ROS - Train AUC : ",metrics.roc_auc_score(y_ROS_train, clf_ros.predict(X_ROS_train)))
print ("ROS - Test AUC : ",metrics.roc_auc_score(y_ROS_test, clf_ros.predict(X_ROS_test)))
print ("\nSMOTE - Train AUC : ",metrics.roc_auc_score(y_SMOTE_train, clf_smote.predict(X_SMOTE_train)))
print ("SMOTE - Test AUC : ",metrics.roc_auc_score(y_SMOTE_test, clf_smote.predict(X_SMOTE_test)))
#----output----

RUS - Train AUC :  0.988945248974
RUS - Test AUC :  0.983964646465
ROS - Train AUC :  0.985666951094
ROS - Test AUC :  0.986630288452

SMOTE - Train AUC :  1.0
SMOTE - Test AUC :  0.956132695918

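Listing 4-6. Build Models on Various Resampling Methods and Evaluate Performance

Here, random oversampling performs better on both the train and test sets. As a best practice, in real-world use cases it is recommended to look at other metrics (such as precision, recall, and the confusion matrix) and to apply business context or domain knowledge to assess the true performance of the model.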

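Bias and Variance

A fundamental problem in supervised learning is the bias-variance tradeoff. Ideally, a model should have two key characteristics:

It should be sensitive enough to accurately capture the key patterns in the training dataset.
It should be general enough to work well on any unseen dataset.

Unfortunately, while trying to achieve the first point, there is a good chance of overfitting to noisy or unrepresentative training data points, so that the model fails to generalize. On the other hand, trying to generalize a model may cause it to miss important regularities (Figure 4-3).

Bias

If model accuracy is low on both the training and test datasets, the model is said to be underfitting, or to have high bias. This means the model is not fitting the training data points well in regression, or the decision boundary is not separating the classes well in classification. Two key reasons for bias are 1) not including the right features, and 2) not picking the correct polynomial degree for model fitting.

To address underfitting or reduce bias, try including more meaningful features and increase model complexity by trying higher-order polynomial fits.

Variance

If a model gives high accuracy on the training dataset but accuracy drops drastically on the test dataset, the model is said to be overfitting, or to have high variance. The key reason for overfitting is using a higher-order polynomial degree than needed, which makes the decision boundary fit all data points too well, including the noise in the training dataset, rather than the underlying relationship. This leads to high accuracy (actual vs. predicted) on the training dataset and high prediction error when the model is applied to the test dataset.

To address overfitting:

Try reducing the number of features, that is, keep only the meaningful features, or try regularization methods that keep all the features but reduce the magnitude of the feature parameters.
Dimensionality reduction can eliminate noisy features, in turn reducing model variance.
Bringing in more data points to enlarge the training dataset will also reduce variance.
Choosing the right model parameters can help reduce both bias and variance. For example:
- Using the right regularization parameters can decrease variance in regression-based models.
- For a decision tree, reducing the depth of the decision tree will reduce the variance.

Figure 4-3

Bias-variance tradeoff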

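K-Fold Cross-Validation

K-fold cross-validation splits the training dataset into k folds without replacement; that is, any given data point will be part of only one of the subsets. K-1 folds are used for model training and one fold is used for testing. The procedure is repeated k times so that we obtain k models and performance estimates (Figure 4-4).

We then calculate the average performance of the models across the individual folds to obtain a performance estimate that is less sensitive to the subpartitioning of the training data than the holdout or single-fold method.

Figure 4-4

K-fold cross-validation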
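The splitting step itself is simple enough to sketch by hand. The generator below is a hypothetical helper of ours (not sklearn's KFold) that yields train and test indices so that every row lands in exactly one test fold.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # spread any remainder so fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for i, (train, test) in enumerate(folds, 1):
    print("Fold %d: train=%s test=%s" % (i, train, test))
```

In practice you would fit a model on each `train` slice, score it on the matching `test` slice, and average the k scores.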

Listing 4-7 shows example code that uses sklearn's k-fold cross-validation to build a classification model.

from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# read the data in
df = pd.read_csv("Data/Diabetes.csv")

X = df.iloc[:,:8].values   # independent variables (.ix was removed from pandas)
y = df['class'].values     # dependent variables

# Normalize Data
sc = StandardScaler()
sc.fit(X)
X = sc.transform(X)

# evaluate the model by splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2017)

# build a decision tree classifier
clf = tree.DecisionTreeClassifier(random_state=2017)

# evaluate the model using 5-fold cross-validation (scored on accuracy)
train_scores = cross_val_score(clf, X_train, y_train, scoring="accuracy", cv=5)
test_scores = cross_val_score(clf, X_test, y_test, scoring="accuracy", cv=5)
print ("Train Fold Accuracy Scores: ", train_scores)
print ("Train CV Accuracy Score: ", train_scores.mean())

print ("\nTest Fold Accuracy Scores: ", test_scores)
print ("Test CV Accuracy Score: ", test_scores.mean())
#---output----
Train Fold Accuracy Scores:  [0.80555556 0.73148148 0.81308411 0.76635514 0.71028037]
Train CV Accuracy Score:  0.7653513326410523

Test Fold Accuracy Scores:  [0.80851064 0.78723404 0.78723404 0.77777778 0.8   ]
Test CV Accuracy Score:  0.7921513002364066

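Listing 4-7. K-fold Cross-Validation

Stratified K-Fold Cross-Validation

An extension of cross-validation is stratified k-fold cross-validation, in which the class proportions are preserved in each fold, leading to better bias and variance estimates (Listings 4-8 and 4-9).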

from sklearn import model_selection
from sklearn.metrics import roc_curve, auc
from itertools import cycle

# random_state only applies when shuffle=True, so it is omitted here
kfold = model_selection.StratifiedKFold(n_splits=5)

mean_tpr = 0.0
mean_fpr = np.linspace(0, 1, 100)

colors = cycle(['cyan', 'indigo', 'seagreen', 'yellow', 'blue', 'darkorange'])
lw = 2

i = 0
for (train, test), color in zip(kfold.split(X, y), colors):
    probas_ = clf.fit(X[train], y[train]).predict_proba(X[test])
    # Compute ROC curve and area under the curve
    fpr, tpr, thresholds = roc_curve(y[test], probas_[:, 1])
    # np.interp is used directly; scipy.interp was only a deprecated alias for it
    mean_tpr += np.interp(mean_fpr, fpr, tpr)
    mean_tpr[0] = 0.0
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, lw=lw, color=color,
             label='ROC fold %d (area = %0.2f)' % (i, roc_auc))

    i += 1
plt.plot([0, 1], [0, 1], linestyle="--", lw=lw, color="k",
         label='Luck')

mean_tpr /= kfold.get_n_splits(X, y)
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
plt.plot(mean_fpr, mean_tpr, color="g", linestyle="--",
         label='Mean ROC (area = %0.2f)' % mean_auc, lw=lw)

plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate') 

plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()
#----Output----

Listing 4-9. Plotting the ROC Curve for Stratified K-fold Cross-Validation

from sklearn import model_selection

kfold = model_selection.StratifiedKFold(n_splits=5)  # random_state has no effect without shuffle=True

train_scores = []
test_scores = []
k = 0
for (train, test) in kfold.split(X_train, y_train):
    clf.fit(X_train[train], y_train[train])
    train_score = clf.score(X_train[train], y_train[train])
    train_scores.append(train_score)
    # score for test set
    test_score = clf.score(X_train[test], y_train[test])
    test_scores.append(test_score)

    k += 1
    print('Fold: %s, Class dist.: %s, Train Acc: %.3f, Test Acc: %.3f'
          % (k, np.bincount(y_train[train]), train_score, test_score))

print('\nTrain CV accuracy: %.3f' % (np.mean(train_scores)))
print('Test CV accuracy: %.3f' % (np.mean(test_scores)))
#----output----
Fold: 1, Class dist.: [277 152], Train Acc: 0.758, Test Acc: 0.806
Fold: 2, Class dist.: [277 152], Train Acc: 0.779, Test Acc: 0.731
Fold: 3, Class dist.: [278 152], Train Acc: 0.767, Test Acc: 0.813
Fold: 4, Class dist.: [278 152], Train Acc: 0.781, Test Acc: 0.766
Fold: 5, Class dist.: [278 152], Train Acc: 0.781, Test Acc: 0.710

Train CV accuracy: 0.773
Test CV accuracy: 0.765

Listing 4-8. Stratified K-fold Cross-Validation
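To see what "preserving class proportions" means mechanically, consider the following rough sketch. It is our own simplification (dealing each class's rows round-robin across folds), not sklearn's StratifiedKFold algorithm, and the 65/35 toy labels merely echo the class split seen in Listing 4-1.

```python
from collections import defaultdict

def stratified_k_fold(labels, k):
    """Return k lists of test indices with roughly the same class mix as labels."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        # deal each class's rows round-robin across the folds
        for position, idx in enumerate(members):
            folds[position % k].append(idx)
    return [sorted(fold) for fold in folds]

labels = [0] * 65 + [1] * 35        # toy labels, about 65% negative and 35% positive
strat_folds = stratified_k_fold(labels, k=5)
for i, test_idx in enumerate(strat_folds, 1):
    positives = sum(labels[j] for j in test_idx)
    print("Fold %d: size=%d, positives=%d" % (i, len(test_idx), positives))
```

Every fold ends up with 20 rows, 7 of them positive, so each test fold mirrors the overall 35% event rate.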

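Ensemble Methods

Ensemble methods combine multiple model scores into a single score to create a robust, generalized model.

At a high level, there are two types of ensemble methods:

Combine multiple models of a similar type.
- Bootstrap aggregation (bagging)
- Boosting
Combine multiple models of various types.
- Voting classification
- Blending or stacking

Bagging

Bootstrap aggregation (also known as bagging) was proposed by Leo Breiman in 1994; it is a model aggregation technique for reducing model variance. The training data is split into multiple samples drawn with replacement, called bootstrap samples. Each bootstrap sample has the same size as the original sample; because it is drawn with replacement, it contains on average about 63% of the distinct original values, with the rest being repeats (Figure 4-5).

Figure 4-5

Bootstrapping

An independent model is built on each bootstrap sample, and the final model is created by averaging the predictions for regression or by majority vote for classification.

Figure 4-6 shows the bagging process flow. If N is the number of bootstrap samples created from the original training set, then for i = 1 to N, train a base ML model $C_i$.

$C_{\mathrm{final}} = \underset{y}{\arg\max} \sum \limits_{i} I\left(C_i = y\right)$

Figure 4-6

Bagging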
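The two building blocks of bagging, drawing a bootstrap sample and taking a majority vote, can be sketched in a few lines of standard-library Python. The helper names and the five example votes are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter

random.seed(2019)  # fixed seed so the sketch is repeatable

def bootstrap_sample(data):
    """Draw len(data) items from data with replacement."""
    return [random.choice(data) for _ in data]

data = list(range(10000))
sample = bootstrap_sample(data)
unique_fraction = len(set(sample)) / len(data)
print("Bootstrap size:", len(sample))
print("Share of distinct original rows: %.3f" % unique_fraction)

def majority_vote(predictions):
    """C_final = argmax over y of sum_i I(C_i = y), for one observation."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical class predictions from five base models for one observation
votes = [1, 0, 1, 1, 0]
print("Majority vote:", majority_vote(votes))
```

The share of distinct rows settles near 1 - 1/e (about 0.63) for large samples, which is why every bootstrap sample leaves some rows out and repeats others.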

Let's compare the performance of a stand-alone decision tree model with a bagging decision tree model of 100 trees (Listing 4-10).

# Bagged Decision Trees for Classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# read the data in
df = pd.read_csv("Data/Diabetes.csv")

X = df.iloc[:,:8].values   # independent variables (.ix was removed from pandas)
y = df['class'].values     # dependent variables

#Normalize
X = StandardScaler().fit_transform(X)

# evaluate the model by splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2019)

kfold = model_selection.StratifiedKFold(n_splits=5)  # random_state has no effect without shuffle=True
num_trees = 100

# Decision Tree with 5 fold cross validation
clf_DT = DecisionTreeClassifier(random_state=2019).fit(X_train,y_train)
results = model_selection.cross_val_score(clf_DT, X_train,y_train, cv=kfold)
print ("Decision Tree (stand alone) - Train : ", results.mean())
print ("Decision Tree (stand alone) - Test : ", metrics.accuracy_score(clf_DT.predict(X_test), y_test))

# Using bagging, let's build 100 decision tree models and average/majority vote the predictions
# (the base model is passed positionally because its keyword name differs across sklearn versions)
clf_DT_Bag = BaggingClassifier(clf_DT, n_estimators=num_trees, random_state=2019).fit(X_train,y_train)
results = model_selection.cross_val_score(clf_DT_Bag, X_train, y_train, cv=kfold)
print ("\nDecision Tree (Bagging) - Train : ", results.mean())
print ("Decision Tree (Bagging) - Test : ", metrics.accuracy_score(clf_DT_Bag.predict(X_test), y_test))
#----output----
Decision Tree (stand alone) - Train :  0.6742199894235854
Decision Tree (stand alone) - Test :  0.6428571428571429

Decision Tree (Bagging) - Train :  0.7460074034902167
Decision Tree (Bagging) - Test :  0.8051948051948052

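Listing 4-10. Stand-Alone Decision Tree vs. Bagging

Feature Importance

The decision tree model has an attribute that shows the important features, based on the Gini or entropy information gain (Listing 4-11).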

feature_importance = clf_DT.feature_importances_
# make importances relative to max importance
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
plt.subplot(1, 2, 2)
plt.barh(pos, feature_importance[sorted_idx], align="center")
plt.yticks(pos, df.columns[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.show()

#----output----

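Listing 4-11. Decision Tree Feature Importance Function

Random Forest

A subset of observations and a subset of variables are randomly picked to build multiple independent tree-based models. The trees are less correlated because only a subset of variables is considered at each split, rather than greedily choosing the best split point among all variables when constructing the tree (Listing 4-12).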

from sklearn.ensemble import RandomForestClassifier
num_trees = 100

kfold = model_selection.StratifiedKFold(n_splits=5)  # random_state has no effect without shuffle=True

clf_RF = RandomForestClassifier(n_estimators=num_trees).fit(X_train, y_train)
results = model_selection.cross_val_score(clf_RF, X_train, y_train, cv=kfold)

print ("\nRandom Forest - Train : ", results.mean())
print ("Random Forest - Test : ", metrics.accuracy_score(clf_RF.predict(X_test), y_test))
#----output----
Random Forest - Train :  0.7379693283976732
Random Forest - Test :  0.8051948051948052

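Listing 4-12. RandomForest Classifier

Extremely Randomized Trees (ExtraTree)

This algorithm introduces more randomness into the bagging process. Tree splits are chosen completely at random from the range of values in the sample at each split, which reduces the variance of the model further, at the cost of a slight increase in bias (Listing 4-13).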

from sklearn.ensemble import ExtraTreesClassifier
num_trees = 100

kfold = model_selection.StratifiedKFold(n_splits=5)  # random_state has no effect without shuffle=True

clf_ET = ExtraTreesClassifier(n_estimators=num_trees).fit(X_train, y_train)
results = model_selection.cross_val_score(clf_ET, X_train, y_train, cv=kfold)

print ("\nExtraTree - Train : ", results.mean())
print ("ExtraTree - Test : ", metrics.accuracy_score(clf_ET.predict(X_test), y_test))
#----output----
ExtraTree - Train :  0.7410893707033315
ExtraTree - Test :  0.7987012987012987

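Listing 4-13. Extremely Randomized Trees (ExtraTree)

How Does the Decision Boundary Look?

Let's perform principal component analysis (PCA) and consider only the first two principal components for ease of plotting. The model-building code stays the same, except that after normalization and before splitting the data into train and test sets, we need to add the PCA lines at the top of the following listing (Listing 4-14).

Once we have run the models successfully, we can use the rest of the listing to plot the decision boundaries of the stand-alone model and the different bagging models.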

from sklearn.decomposition import PCA
from matplotlib.colors import ListedColormap
# PCA
X = PCA(n_components=2).fit_transform(X)

def plot_decision_regions(X, y, classifier):

    h = .02  # step size in the mesh
    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, h),
                           np.arange(x2_min, x2_max, h))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                    alpha=0.8, c=colors[idx],
                    marker=markers[idx], label=cl)

# Plot the decision boundary
plt.figure(figsize=(10,6))
plt.subplot(221)
plot_decision_regions(X, y, clf_DT)
plt.title('Decision Tree (Stand alone)')
plt.xlabel('PCA1')
plt.ylabel('PCA2')

plt.subplot(222)
plot_decision_regions(X, y, clf_DT_Bag)
plt.title('Decision Tree (Bagging - 100 trees)')
plt.xlabel('PCA1')
plt.ylabel('PCA2')
plt.legend(loc='best')

plt.subplot(223)
plot_decision_regions(X, y, clf_RF)
plt.title('RandomForest Tree (100 trees)')
plt.xlabel('PCA1')
plt.ylabel('PCA2')
plt.legend(loc='best')

plt.subplot(224)
plot_decision_regions(X, y, clf_ET)
plt.title('Extreme Random Tree (100 trees)')
plt.xlabel('PCA1')
plt.ylabel('PCA2')
plt.legend(loc='best')
plt.tight_layout()
plt.show()

#----output----

Decision Tree (stand alone) - Train :  0.5781332628239026
Decision Tree (stand alone) - Test :  0.6688311688311688

Decision Tree (Bagging) - Train :  0.6319936541512428
Decision Tree (Bagging) - Test :  0.7467532467532467

Random Forest - Train :  0.6418297197250132
Random Forest  - Test :  0.7662337662337663

ExtraTree - Train :  0.6205446853516658
ExtraTree - Test :  0.7402597402597403

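Listing 4-14. Plot the Decision Boundaries

Bagging: Essential Tuning Parameters

Let's look at the key tuning parameters for getting better model results.

n_estimators: This is the number of trees; the larger the better. Note that beyond a certain point the results will not improve significantly.
max_features: This is the size of the random subset of features considered when splitting a node; the lower it is, the greater the reduction in variance (but also the greater the increase in bias). Ideally, for a regression problem it should equal n_features (the total number of features), and for classification it should equal the square root of n_features.
n_jobs: The number of cores used to build the trees in parallel. If set to -1, all available cores in the system are used; otherwise you can specify the number.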

Boosting

Freund and Schapire introduced the concept of boosting in 1995 with the well-known AdaBoost (adaptive boosting) algorithm. The core concept of boosting is that, rather than relying on a single independent hypothesis, combining hypotheses sequentially increases accuracy. Essentially, boosting algorithms convert weak learners into strong learners. Boosting algorithms are designed to address bias (Figure 4-7).

At a high level, the AdaBoosting process can be broken down into three steps:

Figure 4-7

AdaBoosting

Assign uniform weights to all data points, $W_0(x) = 1/N$, where N is the total number of training data points.
At each iteration, fit a classifier $y_m(x_n)$ to the training data and update the weights to minimize the weighted error function.

The weight is calculated as $W_n^{(m+1)} = W_n^{(m)} \exp\left\{\alpha_m\, I\left(y_m(x_n) \ne t_n\right)\right\}$.

The hypothesis weight (or loss function) is given by $\alpha_m = \frac{1}{2}\log\left\{\frac{1-\epsilon_m}{\epsilon_m}\right\}$, and the error rate is given by $\epsilon_m = \frac{\sum_{n=1}^{N} W_n^{(m)}\, I\left(y_m(x_n) \ne t_n\right)}{\sum_{n=1}^{N} W_n^{(m)}}$, where $I\left(y_m(x_n) \ne t_n\right)$ takes the value 0 or 1, i.e., 0 if $x_n$ is correctly classified and 1 otherwise.
The final model is given by $Y_M(x) = \operatorname{sign}\left(\sum \limits_{m=1}^{M} \alpha_m y_m(x)\right)$.

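Example Illustration for AdaBoost

Let's consider training data with two class labels and ten data points. Assume that, initially, all data points have equal weights given by 1/10, as shown in Figure 4-8.

Figure 4-8

Sample dataset with ten data points

Boosting Iteration 1

Notice in Figure 4-9 that three points of the positive class are misclassified by the first simple classification model, so they will be assigned higher weights. The error term and the hypothesis weight (learning rate) are calculated as 0.30 and 0.42, respectively. Data points P3, P4, and P5 get a higher weight (0.15) because they were misclassified, whereas the other data points retain the original weight (0.1).

Figure 4-9

Y_m1 is the first classification or hypothesis

Boosting Iteration 2

Let's fit another classification model, as shown in Figure 4-10, and notice that three data points of the negative class (P6, P7, and P8) are misclassified. Hence these points are assigned a higher weight of 0.17, as calculated, whereas the weights of the remaining data points stay the same because they are classified correctly.

Figure 4-10

Y_m2 is the second classification or hypothesis

Boosting Iteration 3

The third classification model misclassifies a total of three data points: two of the positive class, P1 and P2, and one of the negative class, P9. Hence these misclassified data points are assigned a new higher weight of 0.19, as calculated, whereas the remaining data points retain their earlier weights (Figure 4-11).

Figure 4-11

Y_m3 is the third classification or hypothesis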
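The weights quoted in the three iterations can be reproduced with a few lines of arithmetic. The sketch below assumes the update rule given earlier (error rate as the weighted share of misclassified points, hypothesis weight of one half the log odds, and only misclassified points scaled up); the zero-based index sets are our encoding of the misclassified points named in the walkthrough.

```python
import math

n_points = 10
weights = [1.0 / n_points] * n_points   # W_0(x) = 1/N

# Zero-based indices of the points each weak model misclassifies:
# P3 to P5, then P6 to P8, then P1, P2 and P9
misclassified_per_round = [{2, 3, 4}, {5, 6, 7}, {0, 1, 8}]

history = []
for round_no, missed in enumerate(misclassified_per_round, 1):
    # weighted error rate of this round's classifier
    error = sum(w for i, w in enumerate(weights) if i in missed) / sum(weights)
    # hypothesis weight
    alpha = 0.5 * math.log((1 - error) / error)
    # only misclassified points are up-weighted; the rest keep their weight
    weights = [w * math.exp(alpha) if i in missed else w
               for i, w in enumerate(weights)]
    boosted = weights[min(missed)]
    history.append((round(error, 2), round(alpha, 2), round(boosted, 2)))
    print("Round %d: error=%.2f alpha=%.2f new weight=%.2f"
          % (round_no, error, alpha, boosted))
```

The new weights come out as 0.15, 0.17, and 0.19, matching the figures above, and the first round gives the quoted error of 0.30 and hypothesis weight of 0.42.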

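Final Model

Now, as per the AdaBoost algorithm, let's combine the weak classification models as shown in Figure 4-12. Notice that the final combined model will have the minimum error term and the maximum learning rate, leading to a higher degree of accuracy.

Figure 4-12

AdaBoost algorithm combining the weak classifiers

Let's pick weak predictors from the Pima diabetes dataset and compare the performance of a stand-alone decision tree model vs. AdaBoost with 100 boosting rounds on the decision tree model (Listing 4-15).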

# AdaBoost with Decision Trees for Classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# read the data in
df = pd.read_csv("Data/Diabetes.csv")

# Let's use some weak features to build the tree
X = df[['age','serum_insulin']]     # independent variables
y = df['class'].values              # dependent variables

#Normalize
X = StandardScaler().fit_transform(X)

# evaluate the model by splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2019)

kfold = model_selection.StratifiedKFold(n_splits=5)  # random_state has no effect without shuffle=True
num_trees = 100

# Decision tree with 5-fold cross-validation
# let's restrict max_depth to 1 to have more impure leaves
clf_DT = DecisionTreeClassifier(max_depth=1, random_state=2019).fit(X_train,y_train)
results = model_selection.cross_val_score(clf_DT, X_train,y_train, cv=kfold.split(X_train, y_train))
print("Decision Tree (stand alone) - CV Train : %.2f" % results.mean())
print("Decision Tree (stand alone) - Train : %.2f" % metrics.accuracy_score(clf_DT.predict(X_train), y_train))
print("Decision Tree (stand alone) - Test : %.2f" % metrics.accuracy_score(clf_DT.predict(X_test), y_test))

# Using adaptive boosting with 100 iterations
# (the base model is passed positionally because its keyword name differs across sklearn versions)
clf_DT_Boost = AdaBoostClassifier(clf_DT, n_estimators=num_trees, learning_rate=0.1, random_state=2019).fit(X_train,y_train)
results = model_selection.cross_val_score(clf_DT_Boost, X_train, y_train, cv=kfold.split(X_train, y_train))
print("\nDecision Tree (AdaBoosting) - CV Train : %.2f" % results.mean())
print("Decision Tree (AdaBoosting) - Train : %.2f" % metrics.accuracy_score(clf_DT_Boost.predict(X_train), y_train))
print("Decision Tree (AdaBoosting) - Test : %.2f" % metrics.accuracy_score(clf_DT_Boost.predict(X_test), y_test))
#----output----
Decision Tree (stand alone) - CV Train : 0.64
Decision Tree (stand alone) - Train : 0.64
Decision Tree (stand alone) - Test : 0.70

Decision Tree (AdaBoosting) - CV Train : 0.68
Decision Tree (AdaBoosting) - Train : 0.71
Decision Tree (AdaBoosting) - Test : 0.79

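Listing 4-15. Stand-Alone Decision Tree vs. AdaBoost

Notice that in this case the AdaBoost algorithm improves accuracy by roughly 7 to 9 percentage points on the train and test datasets compared with the stand-alone decision tree model.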

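Gradient Boosting

Due to the stagewise additivity, the loss function can be represented in a form suitable for optimization. This gave rise to a class of generalized boosting algorithms known as generalized boosting machines (GBM). Gradient boosting is an example implementation of GBM, and it can work with different loss functions for regression, classification, risk modeling, and so on. As the name suggests, it is a boosting algorithm that identifies the shortcomings of a weak learner through gradients (AdaBoost uses high-weight data points), hence the name gradient boosting.

Iteratively fit a classifier $y_m(x_n)$ to the training data. The initial model will be a constant value, $y_0(x) = \underset{\delta}{\arg\min} \sum \limits_{i=1}^{n} L\left(y_i, \delta\right)$.
Compute the loss (i.e., the predicted value vs. the actual value) for each model fit iteration, $g_m(x)$, or compute the negative gradient and use it to fit a new base learner function $h_m(x)$, and find the best gradient descent step size $\delta_m = \underset{\delta}{\arg\min} \sum \limits_{i=1}^{n} L\left(y_i, y_{m-1}(x_i) + \delta\, h_m(x_i)\right)$.
Update the function estimate $y_m(x) = y_{m-1}(x) + \delta_m h_m(x)$ and output $y_m(x)$.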
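The three steps above can be sketched for the simplest case, regression with squared loss and one-split trees (stumps) as base learners. This is a toy illustration under our own assumptions, not sklearn's implementation: the helper names are hypothetical, the data is synthetic, and a fixed learning rate stands in for the line search over the step size.

```python
def fit_stump(x, residuals):
    """Best single-split regression stump on 1-D inputs (least squares)."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def gradient_boost(x, y, n_rounds=20, learning_rate=0.5):
    """Boost stumps on squared loss; return the predictor and per-round training MSE."""
    f0 = sum(y) / len(y)              # step 1: constant that minimizes squared loss
    preds = [f0] * len(y)
    learners, mse_history = [], []
    for _ in range(n_rounds):
        # step 2: the negative gradient of 1/2 (y - f)^2 is the residual
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        h = fit_stump(x, residuals)
        learners.append(h)
        # step 3: update the function estimate with a shrunken step
        preds = [pi + learning_rate * h(xi) for xi, pi in zip(x, preds)]
        mse_history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y))
    predict = lambda xi: f0 + learning_rate * sum(h(xi) for h in learners)
    return predict, mse_history

x = [i / 10 for i in range(30)]
y = [xi ** 2 for xi in x]             # toy target: y = x^2
model, mse_history = gradient_boost(x, y)
print("MSE after round 1: %.4f" % mse_history[0])
print("MSE after round 20: %.4f" % mse_history[-1])
```

Each round fits the next stump to what the current ensemble still gets wrong, so the training error shrinks round after round.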

Listing 4-16 shows an example code implementation of a gradient boosting classifier.

from sklearn.ensemble import GradientBoostingClassifier

# Using Gradient Boosting of 100 iterations
clf_GBT = GradientBoostingClassifier(n_estimators=num_trees, learning_rate=0.1, random_state=2019).fit(X_train, y_train)
results = model_selection.cross_val_score(clf_GBT, X_train, y_train, cv=kfold)

print ("\nGradient Boosting - CV Train : %.2f" % results.mean())
print ("Gradient Boosting - Train : %.2f" % metrics.accuracy_score(clf_GBT.predict(X_train), y_train))
print ("Gradient Boosting - Test : %.2f" % metrics.accuracy_score(clf_GBT.predict(X_test), y_test))
#----output----
Gradient Boosting - CV Train : 0.66
Gradient Boosting - Train : 0.79
Gradient Boosting - Test : 0.75

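Listing 4-16. Gradient Boosting Classifier

Let's look at a letter-recognition example (the dataset is loaded from digit.csv) to illustrate how model performance improves with each iteration.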

from sklearn.ensemble import GradientBoostingClassifier

df= pd.read_csv('Data/digit.csv')

X = df.iloc[:,1:17].values
y = df['lettr'].values

# evaluate the model by splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2019)

kfold = model_selection.StratifiedKFold(n_splits=5)  # random_state has no effect without shuffle=True
num_trees = 10

clf_GBT = GradientBoostingClassifier(n_estimators=num_trees, learning_rate=0.1, random_state=2019).fit(X_train, y_train)
results = model_selection.cross_val_score(clf_GBT, X_train, y_train, cv=kfold)

print ("\nGradient Boosting - Train : %.2f" % metrics.accuracy_score(clf_GBT.predict(X_train), y_train))
print ("Gradient Boosting - Test : %.2f" % metrics.accuracy_score(clf_GBT.predict(X_test), y_test))

# Let's predict for the letter 'T' and understand how the prediction accuracy changes in each boosting iteration
X_valid= (2,8,3,5,1,8,13,0,6,6,10,8,0,8,0,8)
print ("Predicted letter: ", clf_GBT.predict([X_valid]))

# Staged prediction will give the predicted probability for each boosting iteration
stage_preds = list(clf_GBT.staged_predict_proba([X_valid]))
final_preds = clf_GBT.predict_proba([X_valid])

# Plot
x = range(1,27)
label = np.unique(df['lettr'])

plt.figure(figsize=(10,3))
plt.subplot(131)
plt.bar(x, stage_preds[0][0], align="center")
plt.xticks(x, label)
plt.xlabel('Label')
plt.ylabel('Prediction Probability')
plt.title('Round One')
plt.autoscale()

plt.subplot(132)
plt.bar(x, stage_preds[4][0],align='center')  # index 4 is the fifth boosting round
plt.xticks(x, label)
plt.xlabel('Label')
plt.ylabel('Prediction Probability')
plt.title('Round Five')
plt.autoscale()

plt.subplot(133)
plt.bar(x, stage_preds[9][0],align='center')
plt.xticks(x, label)
plt.autoscale()
plt.xlabel('Label')
plt.ylabel('Prediction Probability')
plt.title('Round Ten')

plt.tight_layout()
plt.show()
#----output----
Gradient Boosting - Train :  0.75
Gradient Boosting - Test :  0.72
Predicted letter: 'T'

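Gradient boosting corrects the negative impact of an erroneous boosting iteration in subsequent iterations. Notice that in the first iteration the predicted probability for the letter 'T' is 0.25, and it gradually increases to 0.76 by the tenth iteration, whereas the probabilities for the other letters decrease with each round.

Boosting: Essential Tuning Parameters

Model complexity and overfitting can be controlled by using the right values for two categories of parameters:

Tree structure

n_estimators: This is the number of weak learners to be built.

max_depth: This is the maximum depth of the individual estimators. The best value depends on the interaction of the input variables.

min_samples_leaf: This helps ensure that each leaf holds a sufficient number of samples.

subsample: This is the fraction of samples used to fit the individual models (default = 1). Typically 0.8 (80%) is used to introduce random selection of samples, which in turn increases robustness against overfitting.
Regularization parameter

learning_rate: This controls the magnitude of change in the estimators. A lower learning rate is better, but it requires a higher n_estimators (that is the tradeoff).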

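Xgboost (eXtreme Gradient Boosting)

In March 2014, Tianqi Chen built xgboost in C++ as part of the Distributed (Deep) Machine Learning Community, and it has an interface for Python. It is an extended, more regularized version of the gradient boosting algorithm. It is one of the best-performing, large-scale, scalable ML algorithms and has played a major role in winning solutions in Kaggle (a forum for predictive modeling and analytics competitions) data science competitions.

XGBoost objective function: $\mathrm{obj}(\theta) = \sum \limits_{i}^{n} l\left(y_i, \hat{y}_i\right) + \sum \limits_{k=1}^{K} \Omega\left(f_k\right)$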

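The regularization term is given by $\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum \limits_{j=1}^{T} w_j^2$, where $T$ is the number of leaves in the tree and $w_j$ is the score on leaf $j$.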

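The gradient descent technique is used to optimize the objective function; more of the mathematics behind the algorithm can be found at http://xgboost.readthedocs.io/en/latest/.

Some of the key advantages of the xgboost algorithm are

It implements parallel processing.
It has a built-in standard for handling missing values, which means the user can specify a particular value different from the other observations (such as -1 or -999) and pass it as a parameter.
It splits the tree up to the maximum depth, unlike gradient boosting, which stops splitting a node as soon as it encounters a negative loss in the split.

XGboost has a set of parameters, and at a high level we can group them into three categories. Let's look at the most important ones within each category.

General parameters
1. nthread: The number of parallel threads; if no value is given, all cores are used.
2. booster: This is the type of model to run, with gbtree (tree-based model) as the default. Use "gblinear" for linear models.
Booster parameters
1. eta: This is the learning rate or step size shrinkage used to prevent overfitting; the default is 0.3, and it can range between 0 and 1.
2. max_depth: The maximum depth of a tree, with a default of 6.
3. min_child_weight: The minimum sum of weights of all observations required in a child. Start with 1/sqrt(event rate).
4. colsample_bytree: The fraction of columns to be randomly sampled for each tree, with a default of 1.
5. subsample: The fraction of observations to be randomly sampled for each tree, with a default of 1. Lowering this value makes the algorithm conservative to avoid overfitting.
6. lambda: The L2 regularization term on weights, with a default of 1.
7. alpha: The L1 regularization term on weights.
Task parameters
1. objective: This defines the loss function to be minimized, with a default of "reg:linear". For binary classification it should be "binary:logistic", and for multiclass it should be "multi:softprob" to get probability values or "multi:softmax" to get the predicted class. For multiclass, num_class (the number of unique classes) must also be specified.
2. eval_metric: The metric used to validate model performance.

sklearn has a wrapper for xgboost (XGBClassifier). Let's continue with the diabetes dataset and build a model using the weak learner (Listing 4-17).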

import xgboost as xgb
from xgboost.sklearn import XGBClassifier

# read the data in
df = pd.read_csv("Data/Diabetes.csv")

predictors = ['age','serum_insulin']
target = 'class'

# Most common preprocessing step include label encoding and missing value treatment
from sklearn import preprocessing
for f in df.columns:
    if df[f].dtype=='object':
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(df[f].values))
        df[f] = lbl.transform(list(df[f].values))

df.fillna((-999), inplace=True)

# Let's use some weak features to build the tree
X = df[['age','serum_insulin']] # independent variables
y = df['class'].values          # dependent variables

#Normalize
X = StandardScaler().fit_transform(X)

# evaluate the model by splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2017)
num_rounds = 100

kfold = model_selection.StratifiedKFold(n_splits=5)  # random_state has no effect without shuffle=True

clf_XGB = XGBClassifier(n_estimators = num_rounds,
                        objective= 'binary:logistic',
                        seed=2017)

# use early_stopping_rounds to stop the cv when there is no score improvement
clf_XGB.fit(X_train,y_train, early_stopping_rounds=20, eval_set=[(X_test, y_test)], verbose=False)

results = model_selection.cross_val_score(clf_XGB, X_train,y_train, cv=kfold)
print ("\nxgBoost - CV Train : %.2f" % results.mean())
print ("xgBoost - Train : %.2f" % metrics.accuracy_score(clf_XGB.predict(X_train), y_train))
print ("xgBoost - Test : %.2f" % metrics.accuracy_score(clf_XGB.predict(X_test), y_test))
#----output----
xgBoost - CV Train : 0.69
xgBoost - Train : 0.73
xgBoost - Test : 0.74

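Listing 4-17. xgboost Classifier Using sklearn Wrapper

Now let's look at how to build a model using xgboost's native interface. DMatrix is the internal data structure that xgboost uses for input data. It is good practice to convert a large dataset to a DMatrix object to save preprocessing time (Listing 4-18).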

xgtrain = xgb.DMatrix(X_train, label=y_train, missing=-999)
xgtest = xgb.DMatrix(X_test, label=y_test, missing=-999)

# set xgboost params
param = {'max_depth': 3,  # the maximum depth of each tree
         'objective': 'binary:logistic'}

clf_xgb_cv = xgb.cv(param, xgtrain, num_rounds,
                    stratified=True,
                    nfold=5,
                    early_stopping_rounds=20,
                    seed=2017)

print ("Optimal number of trees/estimators is %i" % clf_xgb_cv.shape[0])

watchlist  = [(xgtest,'test'), (xgtrain,'train')]
clf_xgb = xgb.train(param, xgtrain,clf_xgb_cv.shape[0], watchlist)

# predict function will produce the probability

# so we'll use 0.5 cutoff to convert probability to class label
y_train_pred = (clf_xgb.predict(xgtrain, ntree_limit=clf_xgb.best_iteration) > 0.5).astype(int)
y_test_pred = (clf_xgb.predict(xgtest, ntree_limit=clf_xgb.best_iteration) > 0.5).astype(int)

print ("XGB - Train : %.2f" % metrics.accuracy_score(y_train_pred, y_train))
print ("XGB - Test : %.2f" % metrics.accuracy_score(y_test_pred, y_test))

Listing 4-18. xgboost Using Its Native Python Package Code

#----output----

Optimal number of trees (estimators) is 6
[0]    test-error:0.344156    train-error:0.299674
[1]    test-error:0.324675    train-error:0.273616
[2]    test-error:0.272727    train-error:0.281759
[3]    test-error:0.266234    train-error:0.278502
[4]    test-error:0.266234    train-error:0.273616
[5]    test-error:0.311688    train-error:0.254072
XGB - Train : 0.73
XGB - Test : 0.73

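Ensemble Voting: Machine Learning's Biggest Heroes United

Figure 4-13

Ensemble: ML's biggest heroes united

A voting classifier lets us combine predictions through majority voting across multiple ML algorithms of different types, unlike bagging/boosting, where multiple classifiers of a similar type are used for the majority vote.

First, you create multiple stand-alone models from the training dataset. Then a voting classifier wraps your models and averages the predictions of the submodels when asked to make predictions for new data. The predictions of the submodels can be weighted, but specifying the weights for the classifiers manually, or even heuristically, is difficult. More advanced methods can learn how to best weight the predictions from the submodels; this is called stacking (stacked aggregation) and, at the time of writing, was not provided in Scikit-learn.

Let's build individual models on the Pima diabetes dataset and try the voting classifier to combine the model results and compare the change in accuracy (Listing 4-19).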

import pandas as pd
import numpy as np

# set seed for reproducibility
np.random.seed(2017)

import statsmodels.api as sm
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import GradientBoostingClassifier

# currently it's available as part of mlxtend and not sklearn
from mlxtend.classifier import EnsembleVoteClassifier
from sklearn import model_selection
from sklearn.model_selection import train_test_split

# read the data in
df = pd.read_csv("Data/Diabetes.csv")

X = df.iloc[:,:8]     # independent variables
y = df['class']       # dependent variables

# evaluate the model by splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2017)

LR = LogisticRegression(random_state=2017)
RF = RandomForestClassifier(n_estimators = 100, random_state=2017)
SVM = SVC(random_state=0, probability=True)
KNC = KNeighborsClassifier()
DTC = DecisionTreeClassifier()
ABC = AdaBoostClassifier(n_estimators = 100)
BC = BaggingClassifier(n_estimators = 100)
GBC = GradientBoostingClassifier(n_estimators = 100)

clfs = []
print('5-fold cross validation:\n')
for clf, label in zip([LR, RF, SVM, KNC, DTC, ABC, BC, GBC],
                      ['Logistic Regression',
                       'Random Forest',
                       'Support Vector Machine',
                       'KNeighbors',
                       'Decision Tree',
                       'Ada Boost',
                       'Bagging',
                       'Gradient Boosting']):
    scores = model_selection.cross_val_score(clf, X_train, y_train, cv=5, scoring="accuracy") 

    print("Train CV Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
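    # NOTE: fitting on the full dataset (X, y) lets the model see the test rows,
    # which is why several test accuracies in the output reach 1.00;
    # fit on (X_train, y_train) instead for an unbiased test estimate
    md = clf.fit(X, y)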
    clfs.append(md)
    print("Test Accuracy: %0.2f " % (metrics.accuracy_score(clf.predict(X_test), y_test)))
#----output----
5-fold cross validation:

Train CV Accuracy: 0.76 (+/- 0.03) [Logistic Regression]
Test Accuracy: 0.79
Train CV Accuracy: 0.74 (+/- 0.03) [Random Forest]
Test Accuracy: 1.00
Train CV Accuracy: 0.65 (+/- 0.00) [Support Vector Machine]
Test Accuracy: 1.00
Train CV Accuracy: 0.70 (+/- 0.05) [KNeighbors]
Test Accuracy: 0.84
Train CV Accuracy: 0.69 (+/- 0.02) [Decision Tree]
Test Accuracy: 1.00
Train CV Accuracy: 0.73 (+/- 0.04) [Ada Boost]
Test Accuracy: 0.83
Train CV Accuracy: 0.75 (+/- 0.04) [Bagging]
Test Accuracy: 1.00
Train CV Accuracy: 0.75 (+/- 0.03) [Gradient Boosting]
Test Accuracy: 0.92

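Listing 4-19 Ensemble Model

From the preceding benchmarking we see that logistic regression, random forest, bagging, and Ada/gradient boosting give better accuracy than the other models. Let's combine dissimilar models such as logistic regression (base model), random forest (bagging model), and gradient boosting (boosting model) to create a robust generalized model.

Hard Voting vs. Soft Voting

Majority voting is also known as hard voting. Taking the argmax of the sum of predicted probabilities is known as soft voting. The "weights" parameter can be used to assign specific weights to the classifiers. The predicted class probabilities of each classifier are multiplied by the classifier weight and averaged. The final class label is then the class label with the highest average probability.

Assume we assign an equal weight of 1 to all classifiers (Table 4-1). Based on soft voting, the predicted class label is 1, as it has the highest average probability. Refer to Listing 4-20 for an example code implementation of the ensemble voting model.

Table 4-1

Soft Voting

Note

Some classifiers of Scikit-learn do not support the predict_proba method.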
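
To make the difference between hard and soft voting concrete, here is a minimal pure-Python sketch. The probabilities are hypothetical numbers chosen for illustration (they are not the values of Table 4-1), and the helper names hard_vote and soft_vote are assumptions of this sketch, not part of mlxtend or Scikit-learn.

```python
from collections import Counter

# Hypothetical class probabilities from three classifiers for ONE sample
# (illustrative numbers only): each row is [P(class 0), P(class 1)]
probas = [
    [0.40, 0.60],  # classifier 1
    [0.55, 0.45],  # classifier 2
    [0.30, 0.70],  # classifier 3
]
weights = [1, 1, 1]  # equal weights, as in the text


def hard_vote(probas):
    """Majority vote over each classifier's most probable class."""
    labels = [max(range(len(p)), key=p.__getitem__) for p in probas]
    return Counter(labels).most_common(1)[0][0]


def soft_vote(probas, weights):
    """Argmax of the weighted average of the class probabilities."""
    total = sum(weights)
    n_classes = len(probas[0])
    avg = [sum(w * p[c] for w, p in zip(weights, probas)) / total
           for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__), avg


hard_label = hard_vote(probas)
soft_label, avg_proba = soft_vote(probas, weights)
print("Hard voting label:", hard_label)
print("Soft voting label:", soft_label, "average probabilities:", avg_proba)
```

With these numbers both schemes pick class 1, but they can disagree: soft voting lets one very confident classifier outweigh two marginal ones, which is why it requires classifiers that expose class probabilities.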

# ### Ensemble Voting
clfs = []
print('5-fold cross validation:\n')

ECH = EnsembleVoteClassifier(clfs=[LR, RF, GBC], voting="hard")
ECS = EnsembleVoteClassifier(clfs=[LR, RF, GBC], voting="soft", weights=[1,1,1])

for clf, label in zip([ECH, ECS],
                      ['Ensemble Hard Voting',
                       'Ensemble Soft Voting']):
    scores = model_selection.cross_val_score(clf, X_train, y_train, cv=5, scoring="accuracy")
    print("Train CV Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
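    # NOTE: fitting on (X, y) includes the test rows; use (X_train, y_train) for an unbiased test estimate
    md = clf.fit(X, y)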
    clfs.append(md)
    print("Test Accuracy: %0.2f " % (metrics.accuracy_score(clf.predict(X_test), y_test)))
#----output----
5-fold cross validation:

Train CV Accuracy: 0.75 (+/- 0.02) [Ensemble Hard Voting]
Test Accuracy: 0.93
Train CV Accuracy: 0.76 (+/- 0.02) [Ensemble Soft Voting]
Test Accuracy: 0.95

Listing 4-20 Ensemble Voting Model

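Stacking

David H. Wolpert (1992) presented the concept of stacked generalization, commonly known as "stacking," in an article published in the journal Neural Networks. In stacking, you initially train multiple base models of different types on a training/test dataset. It is ideal to mix models that work differently (kNN, bagging, boosting, etc.) so that each can learn some part of the problem. At level 1, the predicted values from the base models are used as features to train a model, which is known as the meta-model. Combining the learning of the individual models in this way tends to result in improved accuracy. This is a simple level-1 stacking; similarly, you can stack multiple levels of different types of models (Figure 4-14).

Figure 4-14

Simple level-2 stacking model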
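
One practical detail of stacking is how the meta-model's training features are produced. A common variant, sketched below under stated assumptions, builds them from out-of-fold predictions so that the meta-model never sees a prediction made on a row the base model was trained on. The data here is synthetic (make_classification stands in for a real dataset), and the variable names meta_train and meta_test are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: NOT the diabetes dataset)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

base_models = [
    KNeighborsClassifier(),
    RandomForestClassifier(n_estimators=10, random_state=0),
]

# One column of meta-features per base model
meta_train = np.zeros((X_train.shape[0], len(base_models)))
meta_test = np.zeros((X_test.shape[0], len(base_models)))

for i, model in enumerate(base_models):
    # Out-of-fold probabilities: every training row is scored by a model
    # that was fitted on the other folds only
    oof = cross_val_predict(model, X_train, y_train, cv=5,
                            method="predict_proba")
    meta_train[:, i] = oof[:, 1]
    # Refit on the whole training set to score the held-out test rows
    meta_test[:, i] = model.fit(X_train, y_train).predict_proba(X_test)[:, 1]

# Level-1 meta-model trained on the base models' predictions
meta_model = LogisticRegression().fit(meta_train, y_train)
test_accuracy = meta_model.score(meta_test, y_test)
print("Meta-model test accuracy: %.2f" % test_accuracy)
```

Feeding the meta-model in-sample predictions instead usually inflates its training and cross-validation accuracy well beyond what it achieves on the test set.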

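Let's apply the stacking concept discussed previously to the diabetes dataset and compare the accuracy of the base models vs. the meta-model (Listing 4-21).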

# Classifiers
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
# needed for the normalization step below
from sklearn.preprocessing import StandardScaler

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

seed = 2019

np.random.seed(seed)  # seed to shuffle the train set

# read the data in
df = pd.read_csv("Data/Diabetes.csv")

X = df.iloc[:,0:8] # independent variables
y = df['class'].values     # dependent variables

#Normalize
X = StandardScaler().fit_transform(X)

# evaluate the model by splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

kfold = model_selection.StratifiedKFold(n_splits=5, random_state=seed)
num_trees = 10
verbose = True # to print the progress

clfs = [KNeighborsClassifier(),
        RandomForestClassifier(n_estimators=num_trees, random_state=seed),
        GradientBoostingClassifier(n_estimators=num_trees, random_state=seed)]

# Creating train and test sets for blending
dataset_blend_train = np.zeros((X_train.shape[0], len(clfs)))
dataset_blend_test = np.zeros((X_test.shape[0], len(clfs)))

print('5-fold cross validation:\n')
for i, clf in enumerate(clfs):
    scores = model_selection.cross_val_score(clf, X_train, y_train, cv=kfold, scoring="accuracy")
    print("##### Base Model %0.0f #####" % i)
    print("Train CV Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std()))
    clf.fit(X_train, y_train)
    print("Train Accuracy: %0.2f " % (metrics.accuracy_score(clf.predict(X_train), y_train)))
    dataset_blend_train[:,i] = clf.predict_proba(X_train)[:, 1]
    dataset_blend_test[:,i] = clf.predict_proba(X_test)[:, 1]
    print("Test Accuracy: %0.2f " % (metrics.accuracy_score(clf.predict(X_test), y_test)))

print ("##### Meta Model #####")

clf = LogisticRegression()
scores = model_selection.cross_val_score(clf, dataset_blend_train, y_train, cv=kfold, scoring="accuracy")
clf.fit(dataset_blend_train, y_train)
print("Train CV Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std()))
print("Train Accuracy: %0.2f " % (metrics.accuracy_score(clf.predict(dataset_blend_train), y_train)))
print("Test Accuracy: %0.2f " % (metrics.accuracy_score(clf.predict(dataset_blend_test), y_test)))
#----output----
5-fold cross validation:

##### Base Model 0 #####
Train CV Accuracy: 0.71 (+/- 0.03)
Train Accuracy: 0.83
Test Accuracy: 0.75
##### Base Model 1 #####
Train CV Accuracy: 0.73 (+/- 0.02)
Train Accuracy: 0.98
Test Accuracy: 0.79
##### Base Model 2 #####
Train CV Accuracy: 0.74 (+/- 0.01)
Train Accuracy: 0.80
Test Accuracy: 0.80
##### Meta Model #####
Train CV Accuracy: 0.99 (+/- 0.02)
Train Accuracy: 0.99
Test Accuracy: 0.77 

Listing 4-21 Model Stacking

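Hyperparameter Tuning

One of the primary objectives and challenges in the ML process is improving the performance score based on data patterns and observed evidence. To achieve this objective, almost all ML algorithms have a specific set of parameters that need to be estimated from the dataset, which will maximize the performance score. Think of these parameters as knobs that you need to turn to different values to find the combination that gives you the best model accuracy (Figure 4-15). The most thorough way to pick good hyperparameters is trial and error over all possible combinations of parameter values. Scikit-learn provides the GridSearchCV and RandomizedSearchCV functions to facilitate an automatic and reproducible approach to hyperparameter tuning.

Figure 4-15

Hyperparameter tuning

Grid Search

For a given model, you can define a set of parameter values that you would like to try. Scikit-learn's GridSearchCV then builds models for all possible combinations of the preset list of hyperparameter values you provide and chooses the best combination based on the cross-validation score. GridSearchCV has two disadvantages:

Computationally expensive: The more parameter values you supply, the more expensive the grid search becomes. Consider an example with five parameters, where you want to try five values for each; this results in 5^5 = 3,125 combinations. Multiply that by the number of cross-validation folds used (e.g., with 5 folds, 3,125 × 5 = 15,625 model fits).
Near-optimal rather than optimal parameters: Grid search looks only at the fixed points you provide for numerical parameters, so there is a good chance of missing an optimal point that lies between them. For example, assume you try the fixed points 'n_estimators': [100, 250, 500, 750, 1000] for a random forest model, and the optimal point lies between two of those values. Grid search is not designed to search between fixed points.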
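
The arithmetic behind the first disadvantage can be checked with a few lines of standard-library Python. This is only an illustrative sketch; count_fits and the placeholder parameter names p0 to p4 are hypothetical and do not belong to Scikit-learn.

```python
from itertools import product


def count_fits(param_grid, n_folds):
    """Return (number of parameter combinations, number of model fits)."""
    combos = list(product(*param_grid.values()))
    return len(combos), len(combos) * n_folds


# Five hypothetical parameters with five candidate values each
five_by_five = {"p%d" % i: [1, 2, 3, 4, 5] for i in range(5)}

n_combos, n_fits = count_fits(five_by_five, n_folds=5)
print("Combinations:", n_combos)   # 5^5
print("Model fits  :", n_fits)     # combinations x folds
```

Adding just one more value to each parameter raises the count to 6^5 = 7,776 combinations, which is why the grid should be kept coarse or replaced by a sampled search when many parameters are tuned.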

Let's try GridSearchCV for the RandomForest classifier on the Pima diabetes dataset to find the optimal parameter values (Listing 4-22).

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
seed = 2017

# read the data in
df = pd.read_csv("Data/Diabetes.csv")

X = df.iloc[:,:8].values     # independent variables
y = df['class'].values       # dependent variables

#Normalize
X = StandardScaler().fit_transform(X)

# evaluate the model by splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=seed)

kfold = model_selection.StratifiedKFold(n_splits=5, random_state=seed)
num_trees = 100

clf_rf = RandomForestClassifier(random_state=seed).fit(X_train, y_train)

rf_params = {
    'n_estimators': [100, 250, 500, 750, 1000],
    'criterion':  ['gini', 'entropy'],
    'max_features': [None, 'auto', 'sqrt', 'log2'],
    'max_depth': [1, 3, 5, 7, 9]
}

# setting verbose = 10 will print the progress for every 10 task completion
grid = GridSearchCV(clf_rf, rf_params, scoring="roc_auc", cv=kfold, verbose=10, n_jobs=-1)
grid.fit(X_train, y_train)

print ('Best Parameters: ', grid.best_params_)

results = model_selection.cross_val_score(grid.best_estimator_, X_train,y_train, cv=kfold)
print ("Accuracy - Train CV: ", results.mean())
print ("Accuracy - Train : ", metrics.accuracy_score(grid.best_estimator_.predict(X_train), y_train))
print ("Accuracy - Test : ", metrics.accuracy_score(grid.best_estimator_.predict(X_test), y_test))
#----output----
Fitting 5 folds for each of 200 candidates, totalling 1000 fits
Best Parameters:  {'criterion': 'entropy', 'max_depth': 5, 'max_features': 'log2', 'n_estimators': 500}
Accuracy - Train CV:  0.7447905849775008
Accuracy - Train :  0.8621973929236499
Accuracy - Test :  0.7965367965367965

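Listing 4-22 Grid Search for Hyperparameter Tuning

Random Search

As the name suggests, the random search algorithm tries random combinations of a range of values for the given parameters. Numerical parameters can be specified as a range (unlike the fixed values in grid search). You control the number of random search iterations to perform. It is known to find a very good combination in much less time than grid search; however, you have to choose the parameter ranges and the number of iterations carefully, because with too few iterations or too narrow a range it can miss the best parameter combination.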
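
The sampling idea can be sketched without any ML library. In the hedged example below, sample_params is a hypothetical helper (not a Scikit-learn function): a tuple is treated as an inclusive integer range and a list as a set of discrete choices. Note that this convention is an assumption of the sketch; scipy.stats.randint, for instance, excludes the upper bound.

```python
import random


def sample_params(space, n_iter, seed=0):
    """Draw n_iter random parameter combinations from the search space."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_iter):
        params = {}
        for name, spec in space.items():
            if isinstance(spec, tuple):
                # (low, high) integer range, both ends inclusive
                params[name] = rng.randint(spec[0], spec[1])
            else:
                # list of discrete choices
                params[name] = rng.choice(spec)
        candidates.append(params)
    return candidates


space = {
    "n_estimators": (100, 1000),
    "criterion": ["gini", "entropy"],
    "max_depth": [None, 1, 3, 5, 7, 9],
}

candidates = sample_params(space, n_iter=20)
for params in candidates[:3]:
    print(params)
```

Each candidate would then be scored with cross-validation, so 20 iterations with 5 folds cost 100 fits regardless of how wide the ranges are.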

Let's try RandomizedSearchCV for the same combinations that we tried with grid search and compare the time/accuracy (Listing 4-23).

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint as sp_randint

# specify parameters and distributions to sample from
param_dist = {'n_estimators':sp_randint(100,1000),
              'criterion': ['gini', 'entropy'],
              'max_features': [None, 'auto', 'sqrt', 'log2'],
              'max_depth': [None, 1, 3, 5, 7, 9]
             }

# run randomized search
n_iter_search = 20
random_search = RandomizedSearchCV(clf_rf, param_distributions=param_dist, cv=kfold, n_iter=n_iter_search, verbose=10, n_jobs=-1, random_state=seed)

random_search.fit(X_train, y_train)
# report(random_search.cv_results_)

print ('Best Parameters: ', random_search.best_params_)

results = model_selection.cross_val_score(random_search.best_estimator_, X_train,y_train, cv=kfold)
print ("Accuracy - Train CV: ", results.mean())
print ("Accuracy - Train : ", metrics.accuracy_score(random_search.best_estimator_.predict(X_train), y_train))
print ("Accuracy - Test : ", metrics.accuracy_score(random_search.best_estimator_.predict(X_test), y_test))
#----output----
Fitting 5 folds for each of 20 candidates, totalling 100 fits
Best Parameters:  {'criterion': 'entropy', 'max_depth': 3, 'max_features': None, 'n_estimators': 694}
Accuracy - Train CV:  0.7542402215299411
Accuracy - Train :  0.7802607076350093
Accuracy - Test :  0.8051948051948052

Listing 4-23 Random Search for Hyperparameter Tuning

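Notice that in this case, with RandomizedSearchCV, 100 fits gave accuracy comparable to the 1,000 fits of GridSearchCV.

Figure 4-16 is a sample illustration (not an actual representation) of how grid search and random search results differ between two parameters. Assume that the optimal region for max_depth lies between 3 and 5 (shaded in blue) and that the optimal region for n_estimators lies between 500 and 700 (shaded in amber). The ideal value for the combined parameters lies where the individual regions intersect. Both methods can find a near-optimal combination, but not necessarily the perfect optimal point.

Figure 4-16

Grid search vs. random search

Bayesian Optimization

One of the key emerging hyperparameter tuning techniques is Bayesian optimization, which uses Gaussian process regression fitted to the observed combinations of parameters and their associated target values. The objective of Bayesian optimization is to find the maximum of an unknown function in as few iterations as possible. The key difference from grid and random search is that the search space has a probability distribution for each hyperparameter rather than discrete values. The technique is particularly suited to optimizing high-cost functions, in situations where the balance between exploration and exploitation is important. Although it works well for continuous variables, there is no intuitive way to deal with discrete parameters. Refer to Listing 4-24 for a simple code implementation example of Bayesian optimization for hyperparameter tuning.

You can learn more about the package and see examples at https://github.com/fmfn/BayesianOptimization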

# pip install bayesian-optimization
from bayes_opt import BayesianOptimization
from sklearn.model_selection import cross_val_score
from bayes_opt.util import Colours
from sklearn.ensemble import RandomForestClassifier as RFC

def rfc_cv(n_estimators, min_samples_split, max_features, data, targets):
    """Random Forest cross validation.
    This function will instantiate a random forest classifier with parameters
    n_estimators, min_samples_split, and max_features. Combined with data and
    targets this will in turn be used to perform cross-validation. The result
    of cross validation is returned. Our goal is to find combinations of
    n_estimators, min_samples_split, and max_features that minimize the log loss.
    """
    estimator = RFC(
        n_estimators=n_estimators,
        min_samples_split=min_samples_split,
        max_features=max_features,
        random_state=2
    )
    cval = cross_val_score(estimator, data, targets,
                           scoring='neg_log_loss', cv=4)
    return cval.mean()

def optimize_rfc(data, targets):
    """Apply Bayesian Optimization to Random Forest parameters."""
    def rfc_crossval(n_estimators, min_samples_split, max_features):
        """Wrapper of RandomForest cross validation.
        Notice how we ensure n_estimators and min_samples_split are cast
        to integer before we pass them along. Moreover, to avoid max_features
        taking values outside the (0, 1) range, we also ensure it is capped
        accordingly.
        """
        return rfc_cv(
            n_estimators=int(n_estimators),
            min_samples_split=int(min_samples_split),
            max_features=max(min(max_features, 0.999), 1e-3),
            data=data,
            targets=targets,
        )

    optimizer = BayesianOptimization(
        f=rfc_crossval,
        pbounds={
            "n_estimators": (10, 250),
            "min_samples_split": (2, 25),
            "max_features": (0.1, 0.999),
        },
        random_state=1234,
        verbose=2
    )
    optimizer.maximize(n_iter=10)

    print("Final result:", optimizer.max)
    return optimizer

print(Colours.green("--- Optimizing Random Forest ---"))
optimize_rfc(X_train, y_train)
#----output----

--- Optimizing Random Forest ---
|   iter    |  target   | max_fe... | min_sa... | n_esti... |
-------------------------------------------------------------
|  1        | -0.5112   |  0.2722   |  16.31    |  115.1    |
|  2        | -0.5248   |  0.806    |  19.94    |  75.42    |
|  3        | -0.5075   |  0.3485   |  20.44    |  240.0    |
|  4        | -0.528    |  0.8875   |  10.23    |  130.2    |
|  5        | -0.5098   |  0.7144   |  18.39    |  98.86    |
|  6        | -0.51     |  0.999    |  25.0     |  176.7    |
|  7        | -0.5113   |  0.7731   |  24.94    |  249.8    |
|  8        | -0.5339   |  0.999    |  2.0      |  250.0    |
|  9        | -0.5107   |  0.9023   |  24.96    |  116.2    |
|  10       | -0.8284   |  0.1065   |  2.695    |  10.04    |
|  11       | -0.5235   |  0.1204   |  24.89    |  208.1    |
|  12       | -0.5181   |  0.1906   |  2.004    |  81.15    |
|  13       | -0.5203   |  0.1441   |  2.057    |  185.3    |
|  14       | -0.5257   |  0.1265   |  24.85    |  153.1    |
|  15       | -0.5336   |  0.9906   |  2.301    |  219.3    |
=============================================================
Final result: {'target': -0.5075247858575866, 'params': {'max_features': 0.34854136537364394, 'min_samples_split': 20.443060083305443, 'n_estimators': 239.95344488408924}}

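Listing 4-24 Bayesian Optimization for Hyperparameter Tuning

Noise Reduction for Time-Series IoT Data

The last decade has seen tremendous growth on both the software and hardware sides of technology, which gave birth to the concept of the connected world, or the Internet of Things (IoT). This means that physical devices, everyday objects, and hardware are fitted with tiny sensors that continuously capture the state of the machine in the form of different parameters, and are extended with Internet connectivity so that the devices can communicate/interact with each other and be monitored/controlled remotely. Predictive analytics is the ML area that deals with mining IoT data to gain insight. Broadly, there are two key aspects. One is anomaly detection, the process of identifying precursor failure signatures (abnormal behavior, also known as anomalies) as early as possible, such as a sudden increase or decrease in the voltage or temperature of a device, so that the necessary actions can be planned to avoid an impending failure. The other is remaining useful life (RUL) prediction, the process of predicting the RUL, or how far away an impending failure is, from the point at which the anomaly is detected. We can use regression models to predict RUL.

The datasets we collect from sensors are prone to noise, particularly high-bandwidth measurements such as vibration or ultrasound signals. The Fourier transform is a well-known technique that lets us analyze a signal in the frequency or spectral domain to gain deeper insight into high-bandwidth signals such as vibration measurement curves. A Fourier series is a series of sine waves, and the Fourier transform essentially decomposes a signal into its individual sine wave components. Its main drawback is that once the signal is transformed from the time domain to the frequency domain, all time-dependent information is lost, so it cannot provide localized information when the spectral components of the signal change rapidly over time. The wavelet transform addresses this drawback and is well suited to high-bandwidth signal processing. There is a wide array of literature explaining the wavelet transform and its applications, so this book does not go into much detail. Essentially, wavelets can be used to decompose a signal into a series of coefficients. The first coefficients represent the lowest frequencies, and the last coefficients represent the highest frequencies. By removing the higher-frequency coefficients and then reconstructing the signal from the truncated coefficients, we can smooth the signal without smoothing over all of the interesting peaks the way a moving average would.

The wavelet transform decomposes a time-series signal into two parts: a low-frequency part, or low-pass filter, which gives a smoothed approximation of the original signal, and a high-frequency part, or high-pass filter, which yields detailed local properties such as anomalies, as shown in Figure 4-17.

Figure 4-17

Filtering concept of wavelet decomposition

One of the strengths of the wavelet transform is its ability to perform multilevel decomposition by iterating to obtain multiple resolutions: the approximations are decomposed successively in turn, so that one signal can be broken down into many lower-resolution components. The wavelet transform expresses the function f(x) as a series expansion in which the wavelet basis ψ, known as the mother wavelet function, is used for the decomposition.

$f(x)=\frac{1}{\sqrt{M}}\ \sum \limits_k{W}_{\phi}\left({j}_0,k\right){\phi}_{j_0,k}(x)+\frac{1}{\sqrt{M}}\ \sum \limits_{j={j}_0}^{\infty}\sum \limits_k{W}_{\psi}\left(j,k\right){\psi}_{j,k}(x)$

where j₀ is an arbitrary starting scale, W_φ(j₀, k) are called the approximation or scaling coefficients, and W_ψ(j, k) are called the detail or wavelet coefficients.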
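
To illustrate what approximation and detail coefficients look like, the following pure-Python sketch performs a single level of the Haar wavelet transform, the simplest wavelet, and then denoises by soft-thresholding the detail coefficients. It is a teaching sketch under assumptions: the signal values and the threshold of 0.5 are made up, the helper names are hypothetical, and the input must have an even length. It is not the db7 wavelet packet approach of PyWavelets.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_decompose(signal):
    """One Haar level: pairwise sums (approximation) and differences (detail)."""
    approx = [(signal[i] + signal[i + 1]) / SQRT2
              for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / SQRT2
              for i in range(0, len(signal), 2)]
    return approx, detail


def haar_reconstruct(approx, detail):
    """Invert haar_decompose."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / SQRT2, (a - d) / SQRT2])
    return out


def soft_threshold(coeffs, t):
    """Shrink each coefficient toward zero by t; small ones become zero."""
    return [math.copysign(max(abs(c) - t, 0.0), c) for c in coeffs]


# Made-up noisy step signal (even length assumed)
signal = [4.0, 4.2, 3.9, 4.1, 8.0, 8.1, 7.9, 8.2]

approx, detail = haar_decompose(signal)
rebuilt = haar_reconstruct(approx, detail)  # lossless round trip
denoised = haar_reconstruct(approx, soft_threshold(detail, 0.5))

print("approximation:", [round(a, 3) for a in approx])
print("detail       :", [round(d, 3) for d in detail])
print("denoised     :", [round(v, 3) for v in denoised])
```

The small detail coefficients carry the sample-to-sample jitter, so zeroing them flattens the noise inside each pair while the jump from about 4 to about 8, which lives in the approximation coefficients, is preserved.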

import pywt
from statsmodels.robust import mad
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # needed for the plots at the end of the listing

df = pd.read_csv('Data/Temperature.csv')

# Function to denoise the sensor data using wavelet transform
def wp_denoise(df):
    for column in df:
        x = df[column]
        wp = pywt.WaveletPacket(data=x, wavelet="db7", mode="symmetric")
        new_wp = pywt.WaveletPacket(data=None, wavelet="db7", mode="symmetric")
        for i in range(wp.maxlevel):
            nodes = [node.path for node in wp.get_level(i, 'natural')]
            # Remove the high and low pass signals
            for node in nodes:
                sigma = mad(wp[node].data)
                uthresh = sigma * np.sqrt( 2*np.log( len( wp[node].data ) ) )
                new_wp[node] = pywt.threshold(wp[node].data, value=uthresh, mode="soft")
        y = new_wp.reconstruct(update=False)[:len(x)]
        df[column] = y
    return df

# denoise the sensor data
df_denoised = wp_denoise(df.iloc[:,3:4])
df['Date'] = pd.to_datetime(df['Date'])

plt.figure(1)
ax1 = plt.subplot(221)
df['4030CFDC'].plot(ax=ax1, figsize=(8, 8), title='Signal with noise')

ax2 = plt.subplot(222)
df_denoised['4030CFDC'].plot(ax=ax2, figsize=(8, 8), title='Signal without noise')
plt.tight_layout()
#----output----

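Listing 4-25 Wavelet Transform Implementation

Summary

In this step, we learned about various common issues that can hamper model accuracy, such as not choosing the optimal probability cutoff point for class creation, variance, and bias. We also briefly covered model tuning techniques such as bagging, boosting, ensemble voting, and grid search/random search and Bayesian optimization for hyperparameter tuning. We also looked at a noise reduction technique for IoT data. To be concise, we only looked at the most important aspects of each topic to get you started. However, there are more optimization options for each of the algorithms, and each of these techniques is evolving quickly. So I encourage you to keep an eye on their respective official web pages and GitHub repositories (Table 4-2).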

Table 4-2

Additional Resources

| Name | Web Page | GitHub Repository |
| --- | --- | --- |
| Scikit-learn | http://scikit-learn.org/stable/# | https://github.com/scikit-learn/scikit-learn |
| Xgboost | https://xgboost.readthedocs.io/en/latest/ | https://github.com/dmlc/xgboost |
| Bayesian optimization | N/A | https://github.com/fmfn/BayesianOptimization |
| Wavelet transforms | https://pywavelets.readthedocs.io/en/latest/# | https://github.com/PyWavelets/pywt |

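We have reached the end of Step 4, which means you are past the halfway point of your machine learning journey. In the next chapter, we'll learn about text mining techniques.

5. Text Mining and Recommender Systems

One of the key areas of artificial intelligence is natural language processing (NLP), also widely known as text mining, which deals with teaching computers how to extract meaning from text. Over the last two decades, with the explosion of the Internet and the rise of social media, a great deal of valuable data has been generated in the form of text. The process of unearthing meaningful patterns from text data is called text mining. This chapter gives an overview of the high-level text mining process, the key concepts, and the common techniques involved.

Apart from Scikit-learn, there are a number of established NLP-focused libraries available for Python, and the number keeps growing. Table 5-1 lists the most popular libraries based on their number of contributors as of 2019.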

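Table 5-1

Popular Python Text Mining Libraries

| Package Name | # of Contributors (2019) | License | Description |
| --- | --- | --- | --- |
| NLTK | 255 | Apache | It's the most popular and widely used toolkit, predominantly built to support research and development of NLP |
| Gensim | 311 | LGPL-2 | Mainly built for topic modeling, document indexing, and similarity retrieval on large corpora |
| spaCy | 300 | MIT | Built using Python + Cython for efficient production implementation of NLP concepts |
| TextBlob | 36 | MIT | It's a wrapper around the NLTK and Pattern libraries for easy access to their capabilities. Suitable for fast prototyping |
| Polyglot | 22 | GPL-3 | It's a multilingual text processing toolkit that supports massive multilingual applications |
| Pattern | 19 | BSD-3 | It's a web mining module for Python with capabilities for scraping, NLP, machine learning, and network analysis/visualization |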

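Note

Another well-known library is Stanford CoreNLP, a suite of Java-based toolkits. A number of Python wrappers are available for it; however, the number of contributors for these wrappers is still low.

Text Mining Process Overview

The overall text mining process can be broadly divided into the following four phases, as shown in Figure 5-1:

Figure 5-1

Text mining process overview

Text data assembly
Text data preprocessing
Data exploration or visualization
Model building

Data Assembly (Text)

It is observed that 70% of the data available to any business is unstructured. The first step is to collate unstructured data from different sources such as open-ended feedback; phone calls; e-mail support; online chat; and social media networks like Twitter, LinkedIn, and Facebook. Assembling these data and applying mining/machine learning (ML) techniques to analyze them gives organizations a valuable opportunity to build more power into the customer experience.

There are several libraries available for extracting text content from the different formats discussed. By far the best library that provides a simple, single interface for multiple formats is "textract" (open source, MIT license). Note that, as of now, this library/package is available for Linux and Mac OS but not for Windows. Table 5-2 lists the supported formats.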

Table 5-2

textract Supported Formats

| Format | Supported Via | Additional Info |
| --- | --- | --- |
| .csv / .eml / .json / .odt / .txt | Python built-ins | |
| .doc | Antiword | www.winfield.demon.nl/ |
| .docx | Python-docx | https://python-docx.readthedocs.io/en/latest/ |
| .epub | Ebooklib | https://github.com/aerkalov/ebooklib |
| .gif / .jpg / .jpeg / .png / .tiff / .tif | tesseract-ocr | https://github.com/tesseract-ocr |
| .html / .htm | Beautifulsoup4 | http://beautiful-soup-4.readthedocs.io/en/latest/ |
| .mp3 / .ogg / .wav | SpeechRecognition and sox | URL 1: URL 2: http://sox.sourceforge.net/ |
| .msg | msg-extractor | https://github.com/mattgwwalker/msg-extractor |
| .pdf | pdftotext and pdfminer.six | URL 1: URL 2: https://github.com/pdfminer/pdfminer.six |
| .pptx | Python-pptx | https://python-pptx.readthedocs.io/en/latest/ |
| .ps | ps2text | http://pages.cs.wisc.edu/~ghost/doc/pstotext.htm |
| .rtf | Unrtf | www.gnu.org/software/unrtf/ |
| .xlsx / .xls | xlrd | https://pypi.python.org/pypi/xlrd |

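Let's look at the code for the formats most widespread in the business world: pdf, jpg, and audio files (Listing 5-1). Note that extracting text from the other formats is also relatively simple.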

# You can read/learn more about latest updates about textract on their official documents site at http://textract.readthedocs.io/en/latest/
import textract

# Extracting text from normal pdf
text = textract.process('Data/PDF/raw_text.pdf', language="eng")

# Extracting text from two columned pdf
text = textract.process('Data/PDF/two_column.pdf', language="eng")

# Extracting text from scanned text pdf
text = textract.process('Data/PDF/ocr_text.pdf', method="tesseract", language="eng")

# Extracting text from jpg
text = textract.process('Data/jpg/raw_text.jpg', method="tesseract", language="eng")

# Extracting text from audio file
text = textract.process('Data/wav/raw_text.wav', language="eng")

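Listing 5-1 Example Code for Extracting Data from pdf, jpg, Audio

Social Media

Did you know that Twitter, the online news and social networking service provider, has 320 million users, with an average of 42 million active tweets every day? (Source: Global social media research summary 2016 by smartinsights)

Let's understand how to explore the rich information in social media (I'll consider Twitter as an example) to find out what is being said about a chosen topic (Figure 5-2). Most of these forums provide an API for developers to access the posts.

Figure 5-2

Pulling Twitter posts for analysis

Step 1: Get an access key (one-time activity). Follow these steps to set up a new Twitter app and obtain the consumer/access key, secret, and token (do not share the key token with unauthorized persons).

Go to https://apps.twitter.com/
Click "Create New App"
Fill in the required information and click "Create your Twitter Application"
You'll get the access details under the "Keys and Access Tokens" tab

Step 2: Fetch tweets. Once you have the authorization secret and access tokens, you can use the code example in Listing 5-2 to establish the connection.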

#Import the necessary methods from tweepy library
import tweepy
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream

#provide your access details below
access_token = "Your token goes here"
access_token_secret = "Your token secret goes here"
consumer_key = "Your consumer key goes here"
consumer_secret = "Your consumer secret goes here"

# establish a connection
auth = tweepy.auth.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)

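Listing 5-2 Twitter Authentication

Let's assume that you would like to understand what is being said about the iPhone 7 and its camera feature. So let's pull the ten most recent posts.

Note: Depending on the number of posts, you can fetch historical user posts about a topic only for the last 10 to 15 days at most.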

#fetch recent 10 tweets containing words iphone7, camera
fetched_tweets = api.search(q=['iPhone 7','iPhone7','camera'], result_type="recent", lang="en", count=10)
print ("Number of tweets: ",len(fetched_tweets))
#----output----
Number of tweets:  10

# Print the tweet text
for tweet in fetched_tweets:
    print ('Tweet ID: ', tweet.id)
    print ('Tweet Text: ', tweet.text, '\n')
#----output----
Tweet ID:  825155021390049281
Tweet Text:  RT @volcanojulie: A Tau Emerald dragonfly. The iPhone 7 camera is exceptional!
#nature #insect #dragonfly #melbourne #australia #iphone7 #…

Tweet ID:  825086303318507520
Tweet Text:  Fuzzy photos? Protect your camera lens instantly with #iPhone7 Full Metallic Case. Buy now! https://t.co/d0dX40BHL6 https://t.co/AInlBoreht

You can capture useful features into a dataframe for further analysis (Listing 5-3).

# function to save required basic tweets info to a dataframe
def populate_tweet_df(tweets):
    #Create an empty dataframe
    df = pd.DataFrame()

    df['id'] = list(map(lambda tweet: tweet.id, tweets))
    df['text'] = list(map(lambda tweet: tweet.text, tweets))
    df['retweeted'] = list(map(lambda tweet: tweet.retweeted, tweets))
    df['place'] = list(map(lambda tweet: tweet.user.location, tweets))
    df['screen_name'] = list(map(lambda tweet: tweet.user.screen_name, tweets))
    df['verified_user'] = list(map(lambda tweet: tweet.user.verified, tweets))
    df['followers_count'] = list(map(lambda tweet: tweet.user.followers_count, tweets))
    df['friends_count'] = list(map(lambda tweet: tweet.user.friends_count, tweets))

    # A highly popular user's tweet could possibly be seen by a large audience, so let's check the popularity of the user
    df['friendship_coeff'] = list(map(lambda tweet: float(tweet.user.followers_count)/float(tweet.user.friends_count), tweets))
    return df

df = populate_tweet_df(fetched_tweets)
print(df.head(10))
#---output----
                   id                                               text
0  825155021390049281  RT @volcanojulie: A Tau Emerald dragonfly. The...
1  825086303318507520  Fuzzy photos? Protect your camera lens instant...
2  825064476714098690  RT @volcanojulie: A Tau Emerald dragonfly. The...
3  825062644986023936  RT @volcanojulie: A Tau Emerald dragonfly. The...
4  824935025217040385  RT @volcanojulie: A Tau Emerald dragonfly. The...
5  824933631365779458  A Tau Emerald dragonfly. The iPhone 7 camera i...
6  824836880491483136  The camera on the IPhone 7 plus is fucking awe...
7  823805101999390720  'Romeo and Juliet' Ad Showcases Apple's iPhone...
8  823804251117850624  iPhone 7 Images Show Bigger Camera Lens - I ha...
9  823689306376196096  RT @computerworks5: Premium HD Selfie Stick &a...

  retweeted                          place      screen_name verified_user
0     False            Melbourne, Victoria        MonashEAE         False
1     False                California, USA        ShopNCURV         False
2     False    West Islip, Long Island, NY  FusionWestIslip         False
3     False  6676 Fresh Pond Rd Queens, NY  FusionRidgewood         False
4     False                                   Iphone7review         False
5     False   Melbourne; Monash University     volcanojulie         False
6     False                  Hollywood, FL       Hbk_Mannyp         False
7     False       Toronto.NYC.the Universe  AaronRFernandes         False
8     False                 Lagos, Nigeria    moyinoluwa_mm         False
9     False                                   Iphone7review         False

   followers_count  friends_count  friendship_coeff
0              322            388          0.829897
1              279            318          0.877358
2               13            193          0.067358
3               45            218          0.206422
4              199           1787          0.111360
5              398            551          0.722323
6               57             64          0.890625
7            18291              7       2613.000000
8              780            302          2.582781
9              199           1787          0.111360

Listing 5-3 Save Features to Dataframe

Instead of a topic, you can also choose a screen_name focused on a topic. Let's look at the posts by the screen name Iphone7review (Listing 5-4).

# For help about api look here http://tweepy.readthedocs.org/en/v2.3.0/api.html
fetched_tweets =  api.user_timeline(id='Iphone7review', count=5)

# Print the tweet text
for tweet in fetched_tweets:
    print('Tweet ID: ', tweet.id)
    print('Tweet Text: ', tweet.text, '\n')
#----output----
Tweet ID:  825169063676608512
Tweet Text:  RT @alicesttu: iPhone 7S to get Samsung OLED display next year #iPhone https://t.co/BylKbvXgAG #iphone

Tweet ID:  825169047138533376
Tweet Text:  Nothing beats the Iphone7! Who agrees? #Iphone7 https://t.co/e03tXeLOao

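Listing 5-4 Example Code for Extracting Tweets Based on Screen Name

Glancing through the posts quickly, one can generally conclude that there are positive comments about the camera feature of the iPhone 7.

Data Preprocessing (Text)

This step deals with cleansing the consolidated text to remove noise, to ensure efficient syntactic and semantic text analysis for deriving meaningful insights from the text. Some common cleaning steps are briefly described in the following.

Convert to Lowercase and Tokenize

Here, all the data is converted into lowercase. This is carried out to prevent words like "like" and "LIKE" from being interpreted as different words. Python provides the function lower() to convert text to lowercase.

Tokenizing is the process of breaking a large set of text into smaller meaningful chunks such as sentences, words, and phrases.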
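
Before turning to NLTK, the two ideas above can be seen with the standard library alone. The sketch below lowercases a made-up sentence and tokenizes it with naive regular expressions; this is only an illustration of the concept and is much cruder than NLTK's pretrained tokenizers (it would, for example, split wrongly after abbreviations such as "Dr.").

```python
import re

text = "I LIKE analytics. Do you like it too?"

# Lowercasing makes "LIKE" and "like" the same token
lowered = text.lower()

# Naive sentence split: break on whitespace that follows ., ! or ?
sentences = [s for s in re.split(r'(?<=[.!?])\s+', lowered) if s]

# Naive word split: runs of letters, digits, or apostrophes
words = re.findall(r"[a-z0-9']+", lowered)

print(sentences)
print(words)
print("count of 'like':", words.count("like"))
```

Without the lowercasing step, "LIKE" and "like" would be counted as two different words, each with a count of one.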

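Sentence Tokenizing

The NLTK (Natural Language Toolkit) library provides sent_tokenize for sentence-level tokenizing, which uses a pretrained model, PunktSentenceTokenizer, to determine the punctuation and characters that mark the end of a sentence for European languages (Listing 5-5).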

import nltk
from nltk.tokenize import sent_tokenize

text='Statistics skills, and programming skills are equally important for analytics. Statistics skills, and domain knowledge are important for analytics. I like reading books and travelling.'

sent_tokenize_list = sent_tokenize(text)
print(sent_tokenize_list)
#----output----
['Statistics skills, and programming skills are equally important for analytics.', 'Statistics skills, and domain knowledge are important for analytics.', 'I like reading books and travelling.']

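Listing 5-5 Example Code for Sentence Tokenizing

NLTK supports sentence tokenizing for a total of 17 European languages. Listing 5-6 gives example code that loads the tokenizing model for a specific language, which is saved as a pickle file as part of nltk.data.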

import nltk.data
spanish_tokenizer = nltk.data.load('tokenizers/punkt/spanish.pickle')
spanish_tokenizer.tokenize('Hola. Esta es una frase espanola.')
#----output----
['Hola.', 'Esta es una frase espanola.']

Listing 5-6 Sentence Tokenizing for European Languages

Word Tokenizing

The word_tokenize function of NLTK is a wrapper function that calls tokenize on the TreebankWordTokenizer (Listing 5-7).

from nltk.tokenize import word_tokenize
print(word_tokenize(text))

# Another equivalent call method using TreebankWordTokenizer
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
print (tokenizer.tokenize(text))
#----output----
['Statistics', 'skills', ',', 'and', 'programming', 'skills', 'are', 'equally', 'important', 'for', 'analytics', '.', 'Statistics', 'skills', ',', 'and', 'domain', 'knowledge', 'are', 'important', 'for', 'analytics', '.', 'I', 'like', 'reading', 'books', 'and', 'travelling', '.']

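Listing 5-7 Example Code for Word Tokenizing

Removing Noise

You should remove all information that is not relevant to the text analytics. This can be seen as noise to the text analytics. The most common types of noise are numbers, punctuation, stop words, white space, etc. (Listing 5-8).

Numbers: Numbers are removed, as they may not be relevant and may not hold valuable information.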

import re

def remove_numbers(text):
    # replace every run of digits with an empty string
    return re.sub(r'\d+', '', text)

text = 'This is a     sample  English   sentence, \n with whitespace and numbers 1234!'
print ('Removed numbers: ', remove_numbers(text))
#----output----
Removed numbers:  This is a     sample  English   sentence,
 with whitespace and numbers !

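Listing 5-8 Example Code for Removing Noise from Text

Punctuation: Punctuation characters need to be removed from the dataset so that each word is identified properly. For example, "like" and "like," or "coca-cola" and "CocaCola" would be interpreted as different words if the punctuation were not removed (Listing 5-9).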

import string
# Function to remove punctuations
def remove_punctuations(text):
    words = nltk.word_tokenize(text)
    punt_removed = [w for w in words if w.lower() not in string.punctuation]
    return " ".join(punt_removed)

print (remove_punctuations('This is a sample English sentence, with punctuations!'))
#----output----
This is a sample English sentence with punctuations

Listing 5-9 Example Code for Removing Punctuations from Text

Stop words: Words like "the," "and," and "or" carry little meaning and add unnecessary noise to the analysis. For this reason, they are removed (Listing 5-10).

from nltk.corpus import stopwords

# Function to remove stop words
def remove_stopwords(text, lang="english"):
    words = nltk.word_tokenize(text)
    lang_stopwords = stopwords.words(lang)
    stopwords_removed = [w for w in words if w.lower() not in lang_stopwords]
    return " ".join(stopwords_removed)

print (remove_stopwords('This is a sample English sentence'))
#----output----
sample English sentence

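Listing 5-10 Example Code for Removing Stop Words from the Text

Note

Remove your own stop words (if required). Certain words can be very commonly used in a particular domain. Along with English stop words, we can also remove our own stop words. The choice of our own stop words might depend on the domain of discourse and might not become apparent until we have done some analysis.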
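
As a small illustration of the note above, the sketch below extends a stop word list with domain-specific words. Everything in it is hypothetical: the tiny base set stands in for a full English stop word list, the domain words are invented for a healthcare-style example, and remove_custom_stopwords is not an NLTK function.

```python
def remove_custom_stopwords(text, base_stopwords, extra_stopwords):
    """Drop words found in the base list or in the user-supplied extras."""
    stop = set(base_stopwords) | {w.lower() for w in extra_stopwords}
    return " ".join(w for w in text.split() if w.lower() not in stop)


# Tiny stand-in for an English stop word list (assumption)
base = {"this", "is", "a", "the", "and", "or"}

# Hypothetical domain-specific words that appear in nearly every document
domain = {"patient", "hospital"}

sentence = "The patient is stable and the hospital discharged a patient"
cleaned = remove_custom_stopwords(sentence, base, domain)
print(cleaned)
```

In practice the extra list is usually built after a first pass of word-frequency analysis, once it becomes clear which frequent words carry no discriminating information in the domain.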

空白:通常在文本分析中，多余的空白(空格、制表符、回车、换行符)会被识别为一个单词。这种异常可以通过该步骤中的基本编程程序来避免(列表 5-11 )。

# Function to remove whitespace
def remove_whitespace(text):
    return " ".join(text.split())
text = 'This is a     sample  English   sentence, \n with whitespace and numbers 1234!'
print ('Removed whitespace: ', remove_whitespace(text))
#----output----
Removed whitespace:  This is a sample English sentence, with whitespace and numbers 1234!

Listing 5-11Example Code for Removing Whitespace from Text
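The four noise-removal steps can also be chained into a single function. The following is a minimal sketch that uses only the standard library; the function name clean_text and the tiny DEMO_STOPWORDS set are hypothetical stand-ins for illustration (in practice you would pass the NLTK stop word list).

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list for illustration only;
# in practice you would use nltk.corpus.stopwords.words('english').
DEMO_STOPWORDS = {"this", "is", "a", "with", "and"}

def clean_text(text, stopwords=DEMO_STOPWORDS):
    """Remove numbers, punctuation, stop words, and extra whitespace."""
    text = re.sub(r"\d+", "", text)                                   # numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    words = [w for w in text.split() if w.lower() not in stopwords]   # stop words
    return " ".join(words)                                            # whitespace

sample = 'This is a     sample  English   sentence, \n with whitespace and numbers 1234!'
cleaned = clean_text(sample)
print(cleaned)
```

Note that this sketch strips punctuation character by character, so "coca-cola" becomes "cocacola"; the token-based function in Listing 5-9 behaves differently for such words.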

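Part-of-Speech (PoS) Tagging

PoS tagging is the process of assigning language-specific parts of speech, such as nouns, verbs, adjectives, and adverbs, to each word in the given text.

NLTK supports multiple PoS tagging models; the default tagger is maxent_treebank_pos_tagger, which uses the Penn (University of Pennsylvania) Treebank corpus (Table 5-3). It has 36 possible PoS tags. A parser represents a sentence as a tree with three children: a noun phrase (NP), a verb phrase (VP), and the full stop (.). The root of the tree is S. Listings 5-12 and 5-13 give example code for PoS tagging and for visualizing the sentence tree.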

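Table 5-3

NLTK PoS Taggers

| PoS Tagger | Short Description |
| --- | --- |
| maxent_treebank_pos_tagger | A maximum entropy (ME) classifier trained on the Wall Street Journal subset of the Penn Treebank corpus |
| BrillTagger | Brill's transformational rule-based tagger |
| CRFTagger | Conditional random fields |
| HiddenMarkovModelTagger | Hidden Markov models (HMMs), largely used to assign the correct label sequence to sequential data or to assess the probability of a given label and data sequence |
| HunposTagger | A module for interfacing with the HunPos open source PoS tagger |
| PerceptronTagger | Based on the averaged perceptron technique proposed by Matthew Honnibal |
| SennaTagger | Semantic/syntactic extraction using a neural network architecture |
| SequentialBackoffTagger | Classes for tagging sentences sequentially, left to right |
| StanfordPOSTagger | Researched and developed at Stanford University |
| TnT | Implementation of "TnT: A Statistical Part-of-Speech Tagger" by Thorsten Brants |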

from nltk import chunk

tagged_sent = nltk.pos_tag(nltk.word_tokenize('This is a sample English sentence'))
print (tagged_sent)

tree = chunk.ne_chunk(tagged_sent)
tree.draw() # this will draw the sentence tree
#----output----
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('sample', 'JJ'), ('English', 'JJ'), ('sentence', 'NN')]

Listing 5-12Example Code for PoS Tagging the Sentence and Visualizing the Sentence Tree

# To use PerceptronTagger
from nltk.tag.perceptron import PerceptronTagger
PT = PerceptronTagger()
print (PT.tag('This is a sample English sentence'.split()))
#----output----
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('sample', 'JJ'), ('English', 'JJ'), ('sentence', 'NN')]

# To get help about tags
nltk.help.upenn_tagset('NNP')
#----output----
NNP: noun, proper, singular

Listing 5-13Example Code for Using Perceptron Tagger and Getting Help on Tags

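Stemming

Stemming is the process of reducing a word to its root form. It uses an algorithm that removes common word endings of English words, such as "ly," "es," "ed," and "s." For example, for a given analysis you may want to treat "carefully," "cared," "cares," and "caringly" as "care" rather than as separate words. The three most widely used stemming algorithms are listed in Figure 5-3. Listing 5-14 provides example code for stemming.

Figure 5-3

The most popular NLTK stemmers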

from nltk import PorterStemmer, LancasterStemmer, SnowballStemmer

# Function to apply stemming to a list of words
def words_stemmer(words, type="PorterStemmer", lang="english", encoding="utf8"):
    # `encoding` is kept for backward compatibility; Python 3 strings are already Unicode
    supported_stemmers = ["PorterStemmer", "LancasterStemmer", "SnowballStemmer"]
    if type is False or type not in supported_stemmers:
        return words
    if type == "PorterStemmer":
        stemmer = PorterStemmer()
    elif type == "LancasterStemmer":
        stemmer = LancasterStemmer()
    else:
        stemmer = SnowballStemmer(lang)
    # Stem each word; no .encode() call, so join() receives str rather than bytes
    return " ".join(stemmer.stem(word) for word in words)

words =  'caring cares cared caringly carefully'

print ("Original: ", words)
print ("Porter: ", words_stemmer(nltk.word_tokenize(words), "PorterStemmer"))

print ("Lancaster: ", words_stemmer(nltk.word_tokenize(words), "LancasterStemmer"))
print ("Snowball: ", words_stemmer(nltk.word_tokenize(words), "SnowballStemmer"))
#----output----
Original:  caring cares cared caringly carefully
Porter:  care care care caringly care
Lancaster:  car car car car car
Snowball:  care care care care care

Listing 5-14Example Code for Stemming

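Lemmatization

Lemmatization is the process of transforming a word to its dictionary base form. For this you can use WordNet, a large lexical database of English words linked together by their semantic relationships. It works like a thesaurus: it groups words together based on their meanings (Listing 5-15).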

from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()

# Function to apply lemmatization to a list of words
def words_lemmatizer(text, encoding="utf8"):
    words = nltk.word_tokenize(text)
    lemma_words = []
    wl = WordNetLemmatizer()
    for word in words:
        pos = find_pos(word)
        lemma_words.append(wl.lemmatize(word, pos))  # keep str; bytes would break join() in Python 3
    return " ".join(lemma_words)

# Function to find part of speech tag for a word
def find_pos(word):
    # Part of Speech constants
    # ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
    # You can learn more about these at http://wordnet.princeton.edu/wordnet/man/wndb.5WN.html#sect3
    # You can learn more about all the penn tree tags at https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
    pos = nltk.pos_tag(nltk.word_tokenize(word))[0][1]
    # Adjective tags - 'JJ', 'JJR', 'JJS'
    if pos.lower()[0] == 'j':
        return 'a'
    # Adverb tags - 'RB', 'RBR', 'RBS'

    elif pos.lower()[0] == 'r':
        return 'r'
    # Verb tags - 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'
    elif pos.lower()[0] == 'v':
        return 'v'
    # Noun tags - 'NN', 'NNS', 'NNP', 'NNPS'
    else:
        return 'n'

print ("Lemmatized: ", words_lemmatizer(words))
#----output----
Lemmatized:  care care care caringly carefully

In the preceding case,'caringly'/'carefully' are inflected forms of care and they are an entry word listed in WordNet Dictionary so they are retained in their actual form itself.

Listing 5-15Example Code for Lemmatization

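The NLTK English WordNet includes approximately 155,287 words and 117,000 synonym sets. For a given word, WordNet provides the definition, examples, synonyms (groups of similar nouns, adjectives, and verbs), antonyms (words with the opposite meaning), and so on. Listing 5-16 provides example code for WordNet.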

from nltk.corpus import wordnet

syns = wordnet.synsets("good")
print("Definition: ", syns[0].definition())
print("Example: ", syns[0].examples())

synonyms = []
antonyms = []

# Collect synonyms and antonyms (words with the opposite meaning)
for syn in wordnet.synsets("good"):

    for l in syn.lemmas():
        synonyms.append(l.name())
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())

print ("synonyms: \n", set(synonyms))
print ("antonyms: \n", set(antonyms))
#----output----
Definition:  benefit
Example:  [u'for your own good', u"what's the good of worrying?"]
synonyms:
set([u'beneficial', u'right', u'secure', u'just', u'unspoilt', u'respectable', u'good', u'goodness', u'dear', u'salutary', u'ripe', u'expert', u'skillful', u'in_force', u'proficient', u'unspoiled', u'dependable', u'soundly', u'honorable', u'full', u'undecomposed', u'safe', u'adept', u'upright', u'trade_good', u'sound', u'in_effect', u'practiced', u'effective', u'commodity', u'estimable', u'well', u'honest', u'near', u'skilful', u'thoroughly', u'serious'])
antonyms:
set([u'bad', u'badness', u'ill', u'evil', u'evilness'])

Listing 5-16Example Code for Wordnet

N-grams

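An important concept in text mining is the n-gram, which is a contiguous, co-occurring sequence of n items from a given larger sequence of text. The items can be words, letters, or syllables. Let's take an example sentence and extract n-grams for different values of n (Listing 5-17).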

from nltk.util import ngrams
from collections import Counter

# Function to extract n-grams from text
def get_ngrams(text, n):
    n_grams = ngrams(nltk.word_tokenize(text), n)
    return [ ' '.join(grams) for grams in n_grams]

text = 'This is a sample English sentence'
print ("1-gram: ", get_ngrams(text, 1))
print ("2-gram: ", get_ngrams(text, 2))
print ("3-gram: ", get_ngrams(text, 3))
print ("4-gram: ", get_ngrams(text, 4))
#----output----
1-gram:['This', 'is', 'a', 'sample', 'English', 'sentence']
2-gram:['This is', 'is a', 'a sample', 'sample English', 'English sentence']
3-gram:['This is a', 'is a sample', 'a sample English', 'sample English sentence']
4-gram: ['This is a sample', 'is a sample English', 'a sample English sentence']

Listing 5-17Example Code for Extracting n-grams from the Sentence
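To see what n-gram extraction does under the hood, here is a minimal sketch that needs no NLTK. The helper name simple_ngrams is hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
def simple_ngrams(tokens, n):
    """Return all contiguous n-token windows, joined into strings."""
    # zip over n shifted views of the token list; zip stops at the shortest view
    return [" ".join(window) for window in zip(*(tokens[i:] for i in range(n)))]

tokens = "This is a sample English sentence".split()
bigrams = simple_ngrams(tokens, 2)
trigrams = simple_ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

A sequence of k tokens always yields k - n + 1 n-grams, which is why the six-word sentence produces five bigrams and four trigrams.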

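Note

A 1-gram is also called a unigram; 2-grams and 3-grams are called bigrams and trigrams, respectively.

The n-gram technique is relatively simple, and simply increasing the value of n gives us more context. It is widely used in probabilistic language models to predict the next item in a sequence. For example, search engines use this technique to predict or recommend the next character or word as the user types (Listing 5-18).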

text = 'Statistics skills, and programming skills are equally important for analytics. Statistics skills and domain knowledge are important for analytics'

# remove punctuations
text = remove_punctuations(text)

# Extracting bigrams
result = get_ngrams(text,2)

# Counting bigrams
result_count = Counter(result)

# Converting the result to a data frame
import pandas as pd
df = pd.DataFrame.from_dict(result_count, orient="index")
df = df.rename(columns={'index':'words', 0:'frequency'}) # Renaming index and column name
print (df)
#----output----
                      frequency
are equally                   1
domain knowledge              1
skills are                    1
knowledge are                 1
programming skills            1
are important                 1
skills and                    2
for analytics                 2
and domain                    1
important for                 2
and programming               1
Statistics skills             2
equally important             1
analytics Statistics          1

Listing 5-18Example Code for Extracting 2-grams from the Sentence and Storing in a Dataframe

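Bag of Words

Text has to be represented as numbers before any algorithm can be applied. Bag of words (BoW) is a way of counting how many times each word appears in a document, disregarding grammar and word order. This can be achieved by creating a term document matrix (TDM): a matrix with terms as rows, document names as columns, and word frequency counts as the cells (Figure 5-4). Let's learn to create a TDM through an example: consider three text documents with some text in them. Scikit-learn provides functions under feature_extraction.text to convert a collection of text documents to a matrix of word counts (Listing 5-19).

Figure 5-4

Term document matrix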

import os
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

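# Function to create a dictionary with file names and the text of every file in a given folder
def CorpusFromDir(dir_path):
    file_names = sorted(os.listdir(dir_path))  # fixed order so texts and column names line up
    texts = []
    for f in file_names:
        with open(os.path.join(dir_path, f)) as fh:  # close each file after reading
            texts.append(fh.read())
    # ColNames is a list (not a one-shot map iterator) so it can be reused for df.columns
    return dict(docs=texts, ColNames=file_names)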

docs = CorpusFromDir('Data/')

# Initialize

vectorizer = CountVectorizer()
doc_vec = vectorizer.fit_transform(docs.get('docs'))

#create dataFrame
df = pd.DataFrame(doc_vec.toarray().transpose(), index = vectorizer.get_feature_names())

# Change column headers to be file names
df.columns = docs.get('ColNames')
print (df)
#----output----
             Doc_1.txt  Doc_2.txt  Doc_3.txt
analytics            1          1          0
and                  1          1          1
are                  1          1          0
books                0          0          1
domain               0          1          0
equally              1          0          0
for                  1          1          0
important            1          1          0
knowledge            0          1          0
like                 0          0          1
programming          1          0          0
reading              0          0          1
skills               2          1          0
statistics           1          1          0
travelling           0          0          1

Listing 5-19Creating a Term Document Matrix from a Corpus of Sample Documents
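The counting behind a term document matrix can be sketched with the standard library alone. In the sketch below, the three in-memory documents are an assumption (they reuse the sample sentences from the tokenizing example rather than reading files from disk), and the regular expression only approximates the default tokenization of CountVectorizer by keeping lowercase tokens of two or more characters.

```python
import re
from collections import Counter

# Assumed in-memory stand-ins for Doc_1.txt, Doc_2.txt, and Doc_3.txt
docs = {
    "Doc_1": "Statistics skills, and programming skills are equally important for analytics.",
    "Doc_2": "Statistics skills, and domain knowledge are important for analytics.",
    "Doc_3": "I like reading books and travelling.",
}

def tokenize(text):
    # lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
vocabulary = sorted(set().union(*counts.values()))

# term document matrix: one row per term, one column per document
tdm = {term: [counts[name][term] for name in docs] for term in vocabulary}

for term, row in tdm.items():
    print(f"{term:<12}", row)
```

Each row lists the counts of one term across the three documents, in the same terms-as-rows layout as the data frame in Listing 5-19.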

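Note

The document term matrix (DTM) is the transpose of the term document matrix. In a DTM the rows are the document names and the column headers are the terms. Both are matrix formats that are useful for analysis; however, the TDM is more commonly used because the number of terms tends to be much larger than the number of documents. In that case, having more rows is better than having a large number of columns.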

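Term Frequency-Inverse Document Frequency (TF-IDF)

In the area of information retrieval, TF-IDF is a statistic that reflects how relevant a term is to a document in a collection of documents or corpus. Let's break TF-IDF down and apply an example to understand it better.

Term frequency tells you how frequently a given term appears in a document.

TF (term) = $\frac{Number\ of\ times\ term\ appears\ in\ a\ document}{Total\ number\ of\ terms\ in\ the\ document}$

For example, consider a document containing 100 words in which the word "ML" appears three times; then TF (ML) = 3 / 100 = 0.03.

Document frequency tells you how important a term is.

DF (term) = $\frac{d\ \left( number\ of\ documents\ containing\ a\ given\ term\right)}{D\ \left( the\ size\ of\ the\ collection\ of\ documents\right)}$

Assume we have 10 million documents and the word ML appears in 1,000 of them; then DF (ML) = 1,000 / 10,000,000 = 0.0001.

To normalize, take the base-10 logarithm, log (d/D): log (0.0001) = -4.

As the previous example shows, D > d, so log (d/D) typically gives a negative value. To solve this problem, invert the ratio inside the log expression, which gives the inverse document frequency (IDF). Essentially, we are compressing the scale of values so that very large or very small quantities can be compared smoothly.

IDF (term) = $\log \left(\frac{Total\ number\ of\ documents}{Number\ of\ documents\ with\ a\ given\ term\ in\ it}\right)$

Continuing the previous example, IDF (ML) = log (10,000,000 / 1,000) = 4.

TF-IDF is the product of the two quantities; for the previous example, TF-IDF (ML) = 0.03 * 4 = 0.12. Scikit-learn provides the TfidfVectorizer class to calculate TF-IDF for text; by default, however, it normalizes the term vectors using L2 normalization and smooths the IDF by adding one to the document frequencies to prevent zero divisions (Listing 5-20).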

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

doc_vec = vectorizer.fit_transform(docs.get('docs'))
#create dataFrame
df = pd.DataFrame(doc_vec.toarray().transpose(), index = vectorizer.get_feature_names())

# Change column headers to be file names
df.columns = docs.get('ColNames')
print (df)
#----output----
             Doc_1.txt  Doc_2.txt  Doc_3.txt
analytics     0.276703   0.315269   0.000000
and           0.214884   0.244835   0.283217
are           0.276703   0.315269   0.000000
books         0.000000   0.000000   0.479528
domain        0.000000   0.414541   0.000000
equally       0.363831   0.000000   0.000000
for           0.276703   0.315269   0.000000
important     0.276703   0.315269   0.000000
knowledge     0.000000   0.414541   0.000000
like          0.000000   0.000000   0.479528
programming   0.363831   0.000000   0.000000
reading       0.000000   0.000000   0.479528
skills        0.553405   0.315269   0.000000
statistics    0.276703   0.315269   0.000000
travelling    0.000000   0.000000   0.479528

Listing 5-20Create a Term Document Matrix (TDM) with TF-IDF
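The worked "ML" example from the TF-IDF definitions can be reproduced with a few lines of arithmetic. This is a sketch of the textbook formulas only; the helper names tf and idf are hypothetical, and the numbers will not match the TfidfVectorizer output because scikit-learn uses the natural logarithm, IDF smoothing, and L2 normalization.

```python
import math

def tf(term_count, total_terms):
    """Term frequency: share of the document's terms taken by this term."""
    return term_count / total_terms

def idf(total_docs, docs_with_term):
    """Inverse document frequency with a base-10 logarithm, as in the worked example."""
    return math.log10(total_docs / docs_with_term)

tf_ml = tf(3, 100)                  # "ML" appears 3 times in a 100-word document
idf_ml = idf(10_000_000, 1_000)     # "ML" appears in 1,000 of 10 million documents
tfidf_ml = tf_ml * idf_ml
print(tf_ml, idf_ml, round(tfidf_ml, 2))
```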

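Data Exploration (Text)

In this stage the corpus is explored to understand the common key words, the content, the relationships, and the presence and level of noise. This can be achieved by creating basic statistics and by using visualization techniques such as word frequency counts, word co-occurrence, or correlation plots, which help us discover hidden patterns, if any.

Frequency Chart

This visualization presents a bar chart whose bar lengths correspond to the frequency with which particular words occur. Let's plot a frequency chart for the Doc_1.txt file (Listing 5-21).

import numpy as np
import matplotlib.pyplot as plt

# Sort the Doc_1.txt column (first column) in descending order; take the labels from the sorted result
freq = df.iloc[:, 0].sort_values(ascending=False)
words = freq.index

pos = np.arange(len(words))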
width=1.0
ax=plt.axes(frameon=True)
ax.set_xticks(pos)
ax.set_xticklabels(words, rotation="vertical", fontsize=9)
ax.set_title('Word Frequency Chart')
ax.set_xlabel('Words')
ax.set_ylabel('Frequency')
plt.bar(pos, freq, width, color="b")
plt.show()
#----output----

Listing 5-21Example Code for Frequency Chart

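Word Cloud

A word cloud is a visual representation of text data that helps give a high-level understanding of the important key words in the data in terms of their occurrence. The wordcloud package can be used to generate words whose font size relates to their frequency (Listing 5-22).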

from wordcloud import WordCloud

# Read the whole text.
text = open('Data/Text_Files/Doc_1.txt').read()

# Generate a word cloud image
wordcloud = WordCloud().generate(text)

# Display the generated image:
# the matplotlib way:
import matplotlib.pyplot as plt
plt.imshow(wordcloud.recolor(random_state=2017))
plt.title('Most Frequent Words')
plt.axis("off")
plt.show()
#----output----

Listing 5-22Example Code for the Word Cloud

From the preceding chart we can see that "skills" appears the most often, comparatively.

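Lexical Dispersion Plot

This plot helps determine the location of a word in a sequence of text. On the x axis is the word offset, and each row on the y axis represents the entire text, with a marker indicating an instance of the word of interest (Listing 5-23).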

from nltk import word_tokenize

def dispersion_plot(text, words):
    words_token = word_tokenize(text)
    points = [(x,y) for x in range(len(words_token)) for y in range(len(words)) if words_token[x] == words[y]]

    if points:
        x,y=zip(*points)
    else:
        x=y=()

    plt.plot(x,y,"rx",scalex=.1)
    plt.yticks(range(len(words)),words,color="b")
    plt.ylim(-1,len(words))
    plt.title("Lexical Dispersion Plot")
    plt.xlabel("Word Offset")
    plt.show()

text = 'statistics skills and programming skills are equally important for analytics. statistics skills and domain knowledge are important for analytics'

dispersion_plot(text, ['statistics', 'skills', 'and', 'important'])
#----output----

Listing 5-23Example Code for Lexical Dispersion Plot

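Co-occurrence Matrix

Calculating the co-occurrence between words in a sequence of text helps explain the relationships between words. A co-occurrence matrix tells us how many times every word has co-occurred with the current word. Plotting this matrix as a heat map is a powerful visual tool for spotting relationships between words (Listing 5-24).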

import statsmodels.api as sm
import scipy.sparse as sp

# default unigram model
count_model = CountVectorizer(ngram_range=(1,1))
docs_unigram = count_model.fit_transform(docs.get('docs'))

# co-occurrence matrix in sparse csr format
docs_unigram_matrix = (docs_unigram.T * docs_unigram)

# fill same word cooccurence to 0
docs_unigram_matrix.setdiag(0)

# recompute the co-occurrence matrix so its diagonal (each word's own count) is available for normalization
docs_unigram_matrix = (docs_unigram.T * docs_unigram)
docs_unigram_matrix_diags = sp.diags(1./docs_unigram_matrix.diagonal())

# normalized co-occurence matrix
docs_unigram_matrix_norm = docs_unigram_matrix_diags * docs_unigram_matrix

# Convert to a dataframe

df = pd.DataFrame(docs_unigram_matrix_norm.todense(), index = count_model.get_feature_names())
df.columns = count_model.get_feature_names()

# Plot
sm.graphics.plot_corr(df, title='Co-occurrence Matrix', xnames=list(df.index))
plt.show()
#----output----

Listing 5-24Example Code for Cooccurrence Matrix

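Model Building

As you might be familiar with by now, model building is the process of understanding and establishing relationships between variables. So far you have learned how to extract text content from various sources, preprocess it to remove noise, and perform exploratory analysis to get a basic understanding of and statistics about the text data at hand. Now you will learn to apply ML techniques to the processed data to build models.

Text Similarity

Similarity is a measure of how alike two objects are, described through a distance measure over dimensions represented by the features of the objects (here, text). A smaller distance indicates a high degree of similarity and vice versa. Note that similarity is highly subjective and depends on the domain or application. For text similarity, it is important to choose the right distance measure to get good results. There are various distance measures; Euclidean distance, the straight-line distance between two points, is the most common. However, a significant amount of research in the text mining field has found that cosine distance is better suited to text similarity.

Let's look at a simple example (Table 5-4) to understand similarity better. Consider three documents containing certain simple text key words, and assume that the top two key words are "accident" and "New York." For the moment, ignore the other key words and calculate the similarity of the documents based on the frequencies of these two key words.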

Table 5-4

Sample Term Document Matrix

| Document # | Count of "Accident" | Count of "New York" |
| --- | --- | --- |
| 1 | 2 | 8 |
| 2 | 3 | 7 |
| 3 | 7 | 3 |

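Figure 5-5 plots the document word vectors as points on a two-dimensional chart. Notice that cosine similarity is a measure of the angle between two data points, whereas Euclidean distance is the square root of the sum of the squared differences between the data points. For nonnegative term counts, the cosine similarity equation gives a value between 0 and 1. A smaller angle gives a larger cosine value, indicating higher similarity. Euclidean distance works the other way: it is 0 only when two documents have identical counts and grows as they diverge. Let's put the values into the formulas to find the similarity between documents 1 and 2.

Euclidean distance (doc1, doc2) = $\sqrt{\left(2-3\right)^2+\left(8-7\right)^2}=\sqrt{1+1}$ = 1.41

Cosine (doc1, doc2) = $\frac{62}{8.24\ast 7.61}$ = 0.98, where

doc1 = (2, 8)

doc2 = (3, 7)

doc1 · doc2 = (2 * 3 + 8 * 7) = (6 + 56) = 62

||doc1|| = $\sqrt{\left(2\ast 2\right)+\left(8\ast 8\right)}$ = 8.24

||doc2|| = $\sqrt{\left(3\ast 3\right)+\left(7\ast 7\right)}$ = 7.61

Similarly, let's find the similarity between documents 1 and 3 (Figure 5-5).

Euclidean distance (doc1, doc3) = $\sqrt{\left(2-7\right)^2+\left(8-3\right)^2}=\sqrt{25+25}$ = 7.07

Cosine (doc1, doc3) = $\frac{38}{8.24\ast 7.61}$ = 0.60, where doc1 · doc3 = (2 * 7 + 8 * 3) = 38 and ||doc3|| = $\sqrt{\left(7\ast 7\right)+\left(3\ast 3\right)}$ = 7.61

Figure 5-5

Euclidean vs. cosine

According to the cosine equation, documents 1 and 2 are 98% similar; this could mean that these two documents talk more about New York, whereas document 3 can be assumed to focus more on "accident." However, New York is mentioned a couple of times in document 3, which results in a 60% similarity between documents 1 and 3.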

Listing 5-25 provides example code for calculating cosine similarity for the example given in Figure 5-5.

from sklearn.metrics.pairwise import cosine_similarity

print("Similarity b/w doc 1 & 2: ", cosine_similarity([df['Doc_1.txt']], [df['Doc_2.txt']]))
print("Similarity b/w doc 1 & 3: ", cosine_similarity([df['Doc_1.txt']], [df['Doc_3.txt']]))
print("Similarity b/w doc 2 & 3: ", cosine_similarity([df['Doc_2.txt']], [df['Doc_3.txt']]))
#----output----
Similarity b/w doc 1 & 2:  [[ 0.76980036]]
Similarity b/w doc 1 & 3:  [[ 0.12909944]]
Similarity b/w doc 2 & 3:  [[ 0.1490712]]

Listing 5-25Example Code for Calculating Cosine Similarity for Documents
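The hand calculation for the "accident" and "New York" counts in Table 5-4 can be checked with a short standard-library sketch. The helper names euclidean and cosine are hypothetical; the vectors are the three rows of the table.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    """Cosine of the angle between two count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# (count of "accident", count of "New York") for each document in Table 5-4
doc1, doc2, doc3 = (2, 8), (3, 7), (7, 3)

print(euclidean(doc1, doc2), cosine(doc1, doc2))
print(euclidean(doc1, doc3), cosine(doc1, doc3))
```

With unrounded norms the cosine for documents 1 and 2 comes out at about 0.987, which the text reports as 0.98; the value for documents 1 and 3 is about 0.605.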

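Text Clustering

As an example, we will use the 20 newsgroups dataset, which consists of more than 18,000 newsgroup posts on 20 topics. You can learn more about the dataset at http://qwone.com/~jason/20Newsgroups/. Let's load the data and check the topic names (Listing 5-26).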

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Normalizer
from sklearn import metrics
from sklearn.cluster import KMeans, MiniBatchKMeans
import numpy as np

# load data and print topic names
newsgroups_train = fetch_20newsgroups(subset='train')
print(list(newsgroups_train.target_names))
#----output----
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']

Listing 5-26Example Code for Text Clustering

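To keep it simple, let's filter only three topics. Assume that we do not know the topics. Let's run clustering algorithms and examine the key words of each cluster.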

categories = ['alt.atheism', 'comp.graphics', 'rec.motorcycles']

dataset = fetch_20newsgroups(subset='all', categories=categories, shuffle=True, random_state=2017)

print("%d documents" % len(dataset.data))
print("%d categories" % len(dataset.target_names))

labels = dataset.target

print("Extracting features from the dataset using a sparse vectorizer")
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(dataset.data)
print("n_samples: %d, n_features: %d" % X.shape)
#----output----
2768 documents
3 categories
Extracting features from the dataset using a sparse vectorizer
n_samples: 2768, n_features: 35311

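Latent Semantic Analysis (LSA)

LSA is a mathematical method that tries to bring out latent relationships within a collection of documents. Rather than looking at each document in isolation, it looks at all the documents as a whole, and at the terms within them, to identify relationships. Let's perform LSA by running singular value decomposition (SVD) on the data to reduce the dimensionality.

SVD of an m × n matrix A = $U \Sigma V^T$, where

r = rank of matrix A

U = column-orthonormal m × r matrix

$\Sigma$ = diagonal r × r matrix with the singular values sorted in descending order

V = column-orthonormal n × r matrix (so $V^T$ is r × n)

In our case we have three topics, 2,768 documents, and a vocabulary of 35,311 words (Figure 5-6).

Original matrix = 2768 * 35311 ≈ $10^8$

SVD = 3 * 2768 + 3 + 3 * 35311 = 114,240 ≈ $10^{5.1}$

The resulting SVD takes roughly 850 times less space than the original matrix. Listing 5-27 provides example code for LSA through SVD.

Note

Latent semantic analysis (LSA) and latent semantic indexing (LSI) are the same thing; the latter name is sometimes used to refer specifically to indexing a collection of documents for search (information retrieval).

Figure 5-6

Singular value decomposition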

from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

# Lets reduce the dimensionality to 2000
svd = TruncatedSVD(2000)
lsa = make_pipeline(svd, Normalizer(copy=False))

X = lsa.fit_transform(X)

explained_variance = svd.explained_variance_ratio_.sum()
print("Explained variance of the SVD step: {}%".format(int(explained_variance * 100)))
#----output----
Explained variance of the SVD step: 95%

Listing 5-27Example Code for LSA Through SVD
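The storage comparison made before Listing 5-27 is plain arithmetic and can be checked directly. The sketch below assumes a dense matrix and three retained components (one per topic), as in the text; the variable names are illustrative only. In practice the TF-IDF matrix is stored in sparse form, so the real saving is smaller than this dense count suggests.

```python
# m documents, n vocabulary terms, r retained components (one per topic here)
m, n, r = 2768, 35311, 3

original_cells = m * n                # dense document-term matrix
svd_cells = m * r + r + n * r         # U (m x r), r singular values, V (n x r)
ratio = original_cells / svd_cells

print(original_cells, svd_cells, round(ratio))
```

For these counts the factored form stores 114,240 numbers against 97,740,848 for the dense matrix.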

Listing 5-28 provides example code for running k-means clustering on the SVD output.

from __future__ import print_function

km = KMeans(n_clusters=3, init='k-means++', max_iter=100, n_init=1)

# Scikit learn provides MiniBatchKMeans to run k-means in batch mode suitable for a very large corpus
# km = MiniBatchKMeans(n_clusters=5, init='k-means++', n_init=1, init_size=1000, batch_size=1000)

print("Clustering sparse data with %s" % km)
km.fit(X)

print("Top terms per cluster:")
original_space_centroids = svd.inverse_transform(km.cluster_centers_)
order_centroids = original_space_centroids.argsort()[:, ::-1]

terms = vectorizer.get_feature_names()
for i in range(3):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()
#----output----
Top terms per cluster:
Cluster 0: edu graphics university god subject lines organization com posting uk
Cluster 1: com bike edu dod ca writes article sun like organization
Cluster 2: keith sgi livesey caltech com solntze wpd jon edu sandvik

Listing 5-28k-means Clustering on SVD Dataset

Listing 5-29 provides example code for running hierarchical clustering on the SVD dataset.

from sklearn.metrics.pairwise import cosine_similarity
dist = 1 - cosine_similarity(X)

from scipy.cluster.hierarchy import ward, dendrogram

linkage_matrix = ward(dist) #define the linkage_matrix using ward clustering pre-computed distances

fig, ax = plt.subplots(figsize=(8, 8)) # set size
ax = dendrogram(linkage_matrix, orientation="right")

plt.tick_params(axis= 'x', which="both")

plt.tight_layout() #show plot with tight layout
plt.show()
#----output----

Listing 5-29Hierarchical Clustering on SVD Dataset

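Topic Modeling

Topic modeling algorithms enable you to discover hidden topical patterns or thematic structure in a large collection of documents. The most popular topic modeling techniques are latent Dirichlet allocation (LDA) and nonnegative matrix factorization (NMF).

Latent Dirichlet Allocation

LDA is a graphical model presented by David Blei, Andrew Ng, and Michael Jordan in 2003 (Figure 5-7).

Figure 5-7

LDA graph model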

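LDA is given by $P(d, w) = P(d) \cdot P(\theta^{(d)} \mid \alpha) \cdot \sum_{z} P(\Phi^{(z)} \mid \beta) \cdot P(w \mid z, \Phi^{(z)}) \cdot P(z \mid \theta^{(d)})$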

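where $\Phi^{(z)}$ = word distribution for a topic,

α = Dirichlet parameter of the prior on the per-document topic distribution,

β = Dirichlet parameter of the prior on the per-topic word distribution,

$\theta^{(d)}$ = topic distribution for a document.

The objective of LDA is to maximize the separation between the means of the projected topics and to minimize the variance within each projected topic. LDA therefore defines each topic as a bag of words by carrying out the three steps described next (Figure 5-8).

Figure 5-8

Latent Dirichlet allocation

Step 1: Initialize k clusters and assign each word in each document to one of the k topics.

Step 2: Reassign each word to a new topic based on a) the proportion of words in the document that belong to a topic and b) the proportion of that topic across all documents.

Step 3: Repeat step 2 until coherent topics result.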
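The three steps can be sketched as a toy sampler in pure Python. This is an illustrative collapsed Gibbs sampler under stated assumptions, not the variational method that scikit-learn's LatentDirichletAllocation uses: the function name toy_lda, the tiny corpus, and the values of alpha and beta are all hypothetical, and step 3 is approximated by a fixed number of sweeps rather than a coherence check.

```python
import random

def toy_lda(docs, k, iterations=50, alpha=0.1, beta=0.01, seed=2017):
    """Toy collapsed Gibbs sampler; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # Step 1: assign every word occurrence to a random topic and count
    assignments = [[rng.randrange(k) for _ in doc] for doc in docs]
    doc_topic = [[0] * k for _ in docs]        # words in doc d assigned to topic t
    topic_word = [dict() for _ in range(k)]    # occurrences of word w in topic t
    topic_total = [0] * k                      # all words assigned to topic t
    for d, doc in enumerate(docs):
        for w, t in zip(doc, assignments[d]):
            doc_topic[d][t] += 1
            topic_word[t][w] = topic_word[t].get(w, 0) + 1
            topic_total[t] += 1
    # Step 3: repeat the reassignment sweep a fixed number of times
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = assignments[d][i]
                # take the current word out of the counts
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # Step 2: weight = (share of doc in topic) * (share of topic taken by word)
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j].get(w, 0) + beta)
                    / (topic_total[j] + beta * vocab_size)
                    for j in range(k)
                ]
                t = rng.choices(range(k), weights=weights)[0]
                assignments[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] = topic_word[t].get(w, 0) + 1
                topic_total[t] += 1
    return assignments, doc_topic, topic_word

# Hypothetical four-document corpus with two obvious themes
docs = [
    "bike ride engine bike".split(),
    "god faith belief god".split(),
    "engine bike road ride".split(),
    "faith belief church god".split(),
]
assignments, doc_topic, topic_word = toy_lda(docs, k=2)
print(doc_topic)
```

Each row of doc_topic shows how many words of a document ended up in each topic; on a corpus this small the split depends on the random seed.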

Listing 5-30 provides example code for implementing LDA.

from sklearn.decomposition import LatentDirichletAllocation

# continuing with the 20 newsgroup dataset and 3 topics
total_topics = 3
lda = LatentDirichletAllocation(n_components=total_topics,
                                max_iter=100,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=2017)
lda.fit(X)

feature_names = np.array(vectorizer.get_feature_names())

for topic_idx, topic in enumerate(lda.components_):
    print("Topic #%d:" % topic_idx)
    print(" ".join([feature_names[i] for i in topic.argsort()[:-20 - 1:-1]]))
#----output----
Topic #0:
edu com writes subject lines organization

article posting university nntp host don like god uk ca just bike know graphics
Topic #1:
anl elliptical maier michael_maier qmgate separations imagesetter 5298 unscene appreshed linotronic l300 iici amnesia glued veiw halftone 708 252 dot
Topic #2:
hl7204 eehp22 raoul vrrend386 qedbbs choung qed daruwala ims kkt briarcliff kiat philabs col op_rows op_cols keeve 9327 lakewood gans

Listing 5-30Example Code for LDA

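Nonnegative Matrix Factorization

NMF is a decomposition method for multivariate data and is given by V = WH, where V is the product of the matrices W and H. W is a matrix of word ranks in the features, and H is the coefficient matrix, with each row being a feature. None of the three matrices has negative elements (Listing 5-31).

from sklearn.decomposition import NMF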

nmf = NMF(n_components=total_topics, random_state=2017, alpha=.1, l1_ratio=.5)
nmf.fit(X)

for topic_idx, topic in enumerate(nmf.components_):
    print("Topic #%d:" % topic_idx)
    print(" ".join([feature_names[i] for i in topic.argsort()[:-20 - 1:-1]]))
#----output----
Topic #0:
edu com god writes article don subject lines organization just university bike people posting like know uk ca think host
Topic #1:
sgi livesey keith solntze wpd jon caltech morality schneider cco moral com allan edu objective political cruel atheists gap writes
Topic #2:
sun east green ed egreen com cruncher microsystems ninjaite 8302 460 rtp 0111 nc 919 grateful drinking pixel biker showed

Listing 5-31Example Code for Nonnegative Matrix Factorization

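Text Classification

The ability to represent text features as numbers opens up the opportunity to run classification ML algorithms. Let's use a subset of the 20 newsgroups data to build a classification model and assess its accuracy (Listing 5-32).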

categories = ['alt.atheism', 'comp.graphics', 'rec.motorcycles', 'sci.space', 'talk.politics.guns']

newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=2017, remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories,
shuffle=True, random_state=2017, remove=('headers', 'footers', 'quotes'))

y_train = newsgroups_train.target
y_test = newsgroups_test.target

vectorizer = TfidfVectorizer(sublinear_tf=True, smooth_idf = True, max_df=0.5,  ngram_range=(1, 2), stop_words="english")
X_train = vectorizer.fit_transform(newsgroups_train.data)
X_test = vectorizer.transform(newsgroups_test.data)

print("Train Dataset")
print("%d documents" % len(newsgroups_train.data))
print("%d categories" % len(newsgroups_train.target_names))
print("n_samples: %d, n_features: %d" % X_train.shape)

print("Test Dataset")
print("%d documents" % len(newsgroups_test.data))
print("%d categories" % len(newsgroups_test.target_names))
print("n_samples: %d, n_features: %d" % X_test.shape)
#----output----
Train Dataset
2801 documents
5 categories
n_samples: 2801, n_features: 241036
Test Dataset
1864 documents
5 categories
n_samples: 1864, n_features: 241036

Listing 5-32Example Code Text Classification on 20 News Groups Dataset

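Let's build a simple naïve Bayes classification model and assess its accuracy. Essentially, we can replace naïve Bayes with any other classification algorithm or use an ensemble model to build an efficient model (Listing 5-33).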

from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

clf = MultinomialNB()
clf = clf.fit(X_train, y_train)

y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)

print ('Train accuracy_score: ', metrics.accuracy_score(y_train, y_train_pred))
print ('Test accuracy_score: ',metrics.accuracy_score(newsgroups_test.target, y_test_pred))

print ("Train Metrics: ", metrics.classification_report(y_train, y_train_pred))
print ("Test Metrics: ", metrics.classification_report(newsgroups_test.target, y_test_pred))
#----output----
Train accuracy_score:  0.9760799714387719
Test accuracy_score:  0.8320815450643777
Train Metrics:       precision    recall  f1-score   support

           0              1.00      0.97      0.98       480
           1              1.00      0.97      0.98       584
           2              0.91      1.00      0.95       598
           3              0.99      0.97      0.98       593
           4              1.00      0.97      0.99       546

   micro avg              0.98      0.98      0.98      2801
   macro avg              0.98      0.98      0.98      2801
weighted avg              0.98      0.98      0.98      2801

Test Metrics:        precision    recall  f1-score   support

           0              0.91      0.62      0.74       319
           1              0.90      0.90      0.90       389
           2              0.81      0.90      0.86       398
           3              0.80      0.84      0.82       394
           4              0.78      0.86      0.82       364

   micro avg              0.83      0.83      0.83      1864
   macro avg              0.84      0.82      0.83      1864
weighted avg              0.84      0.83      0.83      1864

Listing 5-33Example Code Text Classification Using Multinomial Naïve Bayes

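Sentiment Analysis

The process of discovering and classifying the opinions expressed in a piece of text (such as comment or feedback text) is called sentiment analysis. The intended outcome of this analysis is to determine whether the writer's attitude toward a topic, product, service, and so on is neutral, positive, or negative (Listing 5-34).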

from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.sentiment.util import *
data = pd.read_csv('Data/customer_review.csv')

SIA = SentimentIntensityAnalyzer()
data['polarity_score']=data.Review.apply(lambda x:SIA.polarity_scores(x)['compound'])
data['neutral_score']=data.Review.apply(lambda x:SIA.polarity_scores(x)['neu'])
data['negative_score']=data.Review.apply(lambda x:SIA.polarity_scores(x)['neg'])
data['positive_score']=data.Review.apply(lambda x:SIA.polarity_scores(x)['pos'])
data['sentiment']=''
data.loc[data.polarity_score>0,'sentiment']='POSITIVE'
data.loc[data.polarity_score==0,'sentiment']='NEUTRAL'
data.loc[data.polarity_score<0,'sentiment']='NEGATIVE'
data.head()

data.sentiment.value_counts().plot(kind='bar',title="sentiment analysis")
plt.show()
#----output----
   ID                                             Review  polarity_score
0   1  Excellent service my claim was dealt with very...          0.7346
1   2  Very sympathetically dealt within all aspects ...         -0.8155
2   3  Having received yet another ludicrous quote fr...          0.9785
3   4  Very prompt and fair handling of claim. A mino...          0.1440
4   5  Very good and excellent value for money simple...          0.8610

   neutral_score  negative_score  positive_score sentiment
0          0.618           0.000           0.382  POSITIVE
1          0.680           0.320           0.000  NEGATIVE
2          0.711           0.039           0.251  POSITIVE
3          0.651           0.135           0.214  POSITIVE
4          0.485           0.000           0.515  POSITIVE

Listing 5-34Example Code for Sentiment Analysis
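The labeling rule in Listing 5-34 depends only on the sign of the compound score, so it can be isolated in a small function. This is a minimal sketch; label_sentiment is a hypothetical helper, and the scores are copied from the sample output, with 0.0 added to show the neutral case.

```python
def label_sentiment(compound_score):
    """Map a compound polarity score to a label using the sign rule in Listing 5-34."""
    if compound_score > 0:
        return "POSITIVE"
    if compound_score < 0:
        return "NEGATIVE"
    return "NEUTRAL"

# compound scores taken from the sample output, plus 0.0 for the neutral case
scores = [0.7346, -0.8155, 0.9785, 0.1440, 0.8610, 0.0]
labels = [label_sentiment(s) for s in scores]
print(labels)
```

A function like this can be applied to a column of compound scores after calling polarity_scores once per review, instead of once for each of the four score columns.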

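Deep Natural Language Processing (DNLP)

First, let me clarify that DNLP is not to be mistaken for deep learning NLP. A technique such as topic modeling is generally known as shallow NLP, in which you try to extract knowledge from text through a semantic or syntactic analysis approach (i.e., you try to form groups by retaining words that are similar and that hold higher weight in a sentence or document). Shallow NLP is less noisy than n-grams; its key drawback, however, is that it does not specify the role of items in the sentence. In contrast, DNLP focuses on a semantic approach: it detects relationships within a sentence, which can then be represented or expressed as a complex construction of a form such as subject:predicate:object (known as a triple or triplet) obtained from the syntactically parsed sentence, so that the context is retained. Sentences are made up of any combination of actor, action, object, and named entities (person, organization, place, date, etc.). For example, consider the sentence "The flat tire was replaced by the driver." Here "driver" is the subject (actor), "replaced" is the predicate (action), and "flat tire" is the object. The triple is therefore driver:replaced:tire, which captures the context of the sentence. Note that triples are one of the widely used forms, and you can form similar complex structures based on the domain or problem at hand.

For a demonstration, I'll use the sopex package, which uses the Stanford CoreNLP tree parser (Listing 5-35).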

from chunker import PennTreebackChunker
from extractor import SOPExtractor

# Initialize chunker
chunker = PennTreebackChunker()
extractor = SOPExtractor(chunker)

# function to extract triples
def extract(sentence):
    sentence = sentence if sentence[-1] == '.' else sentence+'.'
    global extractor
    sop_triplet = extractor.extract(sentence)
    return sop_triplet

sentences = [
  'The quick brown fox jumps over the lazy dog.',
  'A rare black squirrel has become a regular visitor to a suburban garden',
  'The driver did not change the flat tire',
  "The driver crashed the bike white bumper"
]

#Loop over sentence and extract triples
for sentence in sentences:
    sop_triplet = extract(sentence)
    print(sop_triplet.subject + ':' + sop_triplet.predicate + ':' + sop_triplet.object)

#----output----
fox:jumps:dog
squirrel:become:visitor
driver:change:tire
driver:crashed:bumper

Listing 5-35Example Code for Deep NLP

Word2Vec

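A team led by Tomas Mikolov at Google created the Word2Vec (word to vector) model in 2013. It uses documents to train a neural network model to maximize the conditional probability of a context given a word.

It uses two models: CBOW and skip-gram.

Figure 5-9

Skip-gram for a window of 2

The continuous bag-of-words (CBOW) model predicts the current word from a window of surrounding context words; that is, given a set of context words, it predicts the missing word that is likely to appear in that context. CBOW is faster to train than skip-gram and gives better accuracy for frequently appearing words.
The continuous skip-gram model predicts the surrounding window of context words using the current word; that is, given a single word, it predicts the probability of other words that are likely to appear near it in that context. Skip-gram is known to give good results for both frequent and rare words. Let's look at an example sentence and create skip-grams for a window of 2 (Figure 5-9). The word highlighted in yellow is the input word.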
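How skip-gram training pairs are formed for a window of 2 can be sketched in a few lines. The function name skipgram_pairs and the example sentence are hypothetical (the sentence is not the one shown in Figure 5-9); each pair is (input word, context word).

```python
def skipgram_pairs(tokens, window=2):
    """Return (input_word, context_word) training pairs."""
    pairs = []
    for i, center in enumerate(tokens):
        # context = up to `window` tokens on each side of the input word
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# hypothetical example sentence, not the one shown in Figure 5-9
tokens = "the quick brown fox jumps".split()
pairs = skipgram_pairs(tokens, window=2)
for pair in pairs[:5]:
    print(pair)
```

Words near the start or end of the sentence have fewer neighbors inside the window, so they contribute fewer pairs than words in the middle.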

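You can download Google's pretrained model for Word2Vec from the following link; it includes a vocabulary of 3 million words and phrases taken from 100 billion words of a Google News dataset.

URL: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit

Listing 5-36 provides example code for a Word2Vec implementation.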

import gensim

# Load Google's pre-trained Word2Vec model.
model = gensim.models.KeyedVectors.load_word2vec_format('Data/GoogleNews-vectors-negative300.bin', binary=True)

model.most_similar(positive=['woman', 'king'], negative=['man'], topn=5)
#----output----
[(u'queen', 0.7118192911148071),
 (u'monarch', 0.6189674139022827),
 (u'princess', 0.5902431607246399),
 (u'crown_prince', 0.5499460697174072),
 (u'prince', 0.5377321243286133)]

model.most_similar(['girl', 'father'], ['boy'], topn=3)
#----output----
[(u'mother', 0.831214427947998),
 (u'daughter', 0.8000643253326416),
 (u'husband', 0.769158124923706)]

model.doesnt_match("breakfast cereal dinner lunch".split())
#----output----
'cereal'

Listing 5-36. Example Code for Word2Vec

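You can train a Word2Vec model on your own dataset. The key model parameters to remember are size, window, min_count, and sg (Listing 5-37).

size: The dimensionality of the vectors. Larger size values require more training data but can lead to more accurate models. (gensim 4.0 renamed this parameter to vector_size.)

sg: 0 for the CBOW model and 1 for the skip-gram model.

min_count: Ignore all words with a total frequency lower than this.

window: The maximum distance between the current and predicted word within a sentence.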

sentences = [['cigarette','smoking','is','injurious', 'to', 'health'],['cigarette','smoking','causes','cancer'],['cigarette','are','not','to','be','sold','to','kids']]

# train word2vec on the three sentences
model = gensim.models.Word2Vec(sentences, min_count=1, sg=1, window = 3)

# query the trained word vectors through model.wv
model.wv.most_similar(positive=['cigarette', 'smoking'], negative=['kids'], topn=1)
#----output----
[('injurious', 0.16142114996910095)]

Listing 5-37. Example Code for Training word2vec on Your Own Dataset

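Summary

In this step, you learned the fundamentals of the text mining process and different tools/techniques for extracting text from various file formats. You also learned the basic text preprocessing steps for removing noise from data, and different visualization techniques for better understanding the corpus at hand. You then learned various models that can be built to understand relationships and gain insight from the data.

We also learned two important recommender system methods: content-based filtering and collaborative filtering.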

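6. Deep and Reinforcement Learning

Deep learning has been the buzzword in the machine learning (ML) world in recent times. The main objective of deep learning algorithms so far has been to use ML to achieve artificial general intelligence (AGI), that is, to replicate human-level intelligence in machines to solve any problem in a given area. Deep learning has shown promising results in computer vision, audio processing, and text mining. Advances in this area have led to breakthroughs such as self-driving cars. In this chapter, you'll learn about deep learning's core concepts, evolution (perceptron to convolutional neural network [CNN]), key applications, and implementation.

A number of powerful and popular open source libraries focused mainly on deep learning have been built in the past few years (Table 6-1).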

Table 6-1

Popular Deep Learning Libraries (as of end of 2019)

| Library Name | Launch Year | License | Number of Contributors | Official Website |
| --- | --- | --- | --- | --- |
| Theano | 2010 | BSD | 333 | http://deeplearning.net/software/theano/ |
| Pylearn2 | 2011 | BSD-3-Clause | 115 | http://deeplearning.net/software/pylearn2/ |
| TensorFlow | 2015 | Apache-2.0 | 1963 | http://tensorflow.org |
| PyTorch | 2016 | BSD | 1023 | https://pytorch.org/ |
| Keras | 2015 | MIT | 792 | https://keras.io/ |
| MXNet | 2015 | Apache-2.0 | 684 | http://mxnet.io/ |
| Caffe | 2015 | BSD-2-Clause | 266 | http://caffe.berkeleyvision.org/ |
| Lasagne | 2015 | MIT | 65 | http://lasagne.readthedocs.org/ |

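Following is a short description of each of the libraries from Table 6-1. Their official websites provide quality documentation and examples. I strongly recommend visiting the respective sites to learn more after completing this chapter, if required.

Theano: A Python library developed predominantly by academics at the Université de Montréal. Theano lets you efficiently define, optimize, and evaluate mathematical expressions involving complex multidimensional arrays. It is designed to work with GPUs and performs efficient symbolic differentiation. It is fast and stable, with extensive unit tests.
TensorFlow: According to the official documentation, it is a library for numerical computation using data flow graphs for scalable ML, developed by Google researchers. It is currently used by Google products for research and production. It was open-sourced in 2015 and has gained wide popularity in the ML world.
Pylearn2: An ML library based on Theano, which means users can write new models/algorithms using mathematical expressions, and Theano will optimize, stabilize, and compile those expressions.
PyTorch: An open source deep learning platform that provides a seamless path from research prototyping to production deployment. Its key features include a hybrid front end, distributed training, support for popular Python libraries, and a rich ecosystem of tools/libraries that extend PyTorch.
Keras: A high-level neural network library, written in Python and capable of running on top of either TensorFlow or Theano. It is an interface rather than an end-to-end ML framework. It is easy to get started with, highly modular, and simple yet deep enough to be extended to build and support complex models.
MXNet: Developed in collaboration by researchers from CMU, NYU, NUS, and MIT. It is a lightweight, portable, flexible, distributed/mobile library that supports multiple languages such as Python, R, Julia, Scala, Go, JavaScript, and more.
Caffe: A deep learning framework from the Berkeley Vision and Learning Center, written in C++ with Python/MATLAB bindings.
Lasagne: A lightweight library for building and training neural networks in Theano.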

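In this chapter, the Scikit-learn and Keras libraries (with a TensorFlow or Theano back end) are used where appropriate, because they are the best choices for a beginner to get the hang of the concepts. They are also among the libraries most widely used by ML practitioners.

Note

There is plenty of good material available on how to set up Keras with TensorFlow or Theano, so it is not covered here. Also, remember to install the "graphviz" and "pydot-ng" packages to support the graphical view of neural networks. The Keras code in this chapter was built on the Linux platform; however, it should work on other platforms without modification if the supporting packages are installed correctly.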

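Artificial Neural Network (ANN)

Before jumping into the details of deep learning, I think it is important to briefly understand how human vision works. The human brain is a complex, interconnected neural network in which different regions are responsible for different jobs; these regions are the machinery of the brain that receives signals and processes them to take the necessary action. Figure 6-1 shows the visual pathway of the human brain.

Figure 6-1

Visual pathway

Our brain is made up of clusters of small connected units called neurons, which send electrical signals to one another. Long-term knowledge is represented by the strength of the connections between neurons. When we see an object, light travels through the retina and the visual information is converted into electrical signals. These signals then pass through a hierarchy of connected neurons in different regions of the brain within a few milliseconds to decode the information.

What Happens Behind the Scenes When a Computer Looks at an Image?

In a computer, an image is represented as one large three-dimensional array of numbers. For example, consider Figure 6-2: it is a handwritten grayscale digit image of 28 × 28 × 1 (width × height × depth), giving 784 data points. Each number in the array is an integer from 0 (black) to 255 (white). In a typical classification problem, the model has to turn this large matrix into a single label. A color image additionally has three color channels, red, green, and blue (RGB), for each pixel, so the same image in color would have 28 × 28 × 3 = 2,352 data points.

Figure 6-2

Handwritten digit (zero) image and corresponding array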
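The array view of an image can be checked with a few lines of NumPy. This is only a sketch: the pixel values below are made up for illustration and do not come from a real digit image.

```python
import numpy as np

# a blank 28 x 28 grayscale image: one channel, 8-bit integers (0 = black, 255 = white)
gray = np.zeros((28, 28, 1), dtype=np.uint8)
gray[10:18, 12:16, 0] = 255  # paint a small white rectangle (made-up content)

# the same image size with three color channels (RGB)
rgb = np.zeros((28, 28, 3), dtype=np.uint8)

print(gray.shape, gray.size)  # number of data points in the grayscale image
print(rgb.shape, rgb.size)    # number of data points in the color image

# a classifier typically receives the image flattened into one long feature vector
flat = gray.reshape(-1)
print(flat.shape)
```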

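Why Not a Simple Classification Model for Images?

Image classification can be challenging for a computer, because there are a variety of challenges associated with the representation of images. A simple classification model might not be able to address most of these issues without a lot of feature engineering effort. Let's look at some of the key issues (Table 6-2).

Table 6-2

Visual Challenges in Image Data

| Description | Example |
| --- | --- |
| Viewpoint variation: The same object can have different orientations. | |
| Scale and illumination variation: The size of an object and the level of illumination at the pixel level can vary. | |
| Deformation/twist and intraclass variation: Nonrigid bodies can be deformed in extreme ways, and there can be different types of objects with varying appearance within a class. | |
| Occlusion: Only a small portion of the object of interest might be visible. | |
| Background clutter: Objects can blend into their environment, which makes them hard to identify. | |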

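Perceptron: Single Artificial Neuron

Inspired by biological neurons, McCulloch and Pitts introduced the concept of an artificial neuron in 1943; Rosenblatt's perceptron (1958) built on it and became the basic building block of the artificial neural network. Artificial neurons are not only named after their biological counterparts but are also modeled after the behavior of the neurons in our brain (Figure 6-3).

Figure 6-3

Biological vs. artificial neuron

Just as a biological neuron has dendrites to receive signals, a cell body to process them, and an axon/axon terminals to pass signals on to other neurons, an artificial neuron has multiple input channels that accept training samples represented as a vector, and a processing stage in which the weights (w) are adjusted so that the output error (actual vs. predicted) is minimized. The result is then fed into an activation function to produce the output, for example, a classification label. The activation function for a classification problem is a threshold cutoff (the standard is 0.5), above which the class is 1 and otherwise 0. Let's see how to implement this using Scikit-learn (Listing 6-1).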

# import the Perceptron estimator from sklearn.linear_model
from sklearn.linear_model import Perceptron
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# Let's use sklearn's make_classification function to create some test data.
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=20, n_features=2, n_informative=2, n_redundant=0, weights=[.5, .5], random_state=2017)

# Create the model (max_iter with tol=None runs a fixed 100 epochs, as n_iter did in older releases)
clf = Perceptron(max_iter=100, tol=None, verbose=0, random_state=2017, fit_intercept=True, eta0=0.002)
clf.fit(X,y)

# Print the results

print ("Prediction: " + str(clf.predict(X)))
print ("Actual:     " + str(y))
print ("Accuracy:   " + str(clf.score(X, y)*100) + "%")

# Output the values
print ("X1 Coefficient: " + str(clf.coef_[0,0]))
print ("X2 Coefficient: " + str(clf.coef_[0,1]))
print ("Intercept:      " + str(clf.intercept_))

# Plot the decision boundary using the custom helper 'plot_decision_regions' (it must be defined beforehand)
plot_decision_regions(X, y, classifier=clf)
plt.title('Perceptron Model Decision Boundary')
plt.xlabel('X1')
plt.ylabel('X2')
plt.legend(loc='upper left')
plt.show()
#----output----
Prediction: [1 1 1 0 0 0 0 1 0 1 1 0 0 1 0 1 0 0 1 1]
Actual:     [1 1 1 0 0 0 0 1 0 1 1 0 0 1 0 1 0 0 1 1]
Accuracy:   100.0%
X1 Coefficient: 0.00575308754305
X2 Coefficient: 0.00107517941422
Intercept:      [-0.002]

Listing 6-1. Example Code for Sklearn Perceptron

Note

The drawback of the single perceptron approach is that it can only learn linearly separable functions.
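The linear separability limit can be seen with a from-scratch sketch of the classic perceptron learning rule. This is not the Scikit-learn implementation: the function names, the step activation at 0 (instead of a 0.5 cutoff), and the learning rate of 1.0 are assumptions chosen to keep the arithmetic exact. The neuron learns the linearly separable AND function but cannot fit XOR.

```python
import numpy as np


def train_perceptron(X, y, eta=1.0, n_epochs=20):
    """Classic perceptron rule: nudge weights by eta * (target - prediction) * input."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        for xi, target in zip(X, y):
            pred = 1 if np.dot(w, xi) + b > 0 else 0  # step activation
            update = eta * (target - pred)             # zero when the prediction is right
            w += update * xi
            b += update
    return w, b


def predict(X, w, b):
    return (X @ w + b > 0).astype(int)


X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])  # linearly separable
y_xor = np.array([0, 1, 1, 0])  # not linearly separable

w_and, b_and = train_perceptron(X, y_and)
w_xor, b_xor = train_perceptron(X, y_xor)

print("AND predictions:", predict(X, w_and, b_and))
print("XOR predictions:", predict(X, w_xor, b_xor))
```

No single straight line separates the XOR classes, so no choice of weights can classify all four points correctly, however long the training runs.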

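Multilayer Perceptrons (Feedforward Neural Network)

To address the drawback of single perceptrons, multilayer perceptrons were proposed, also commonly known as feedforward neural networks. A multilayer perceptron is a composition of multiple perceptrons connected in different ways and operating on different activation functions to enable improved learning mechanisms. The training sample propagates forward through the network and the output error is back-propagated; the error is minimized using the gradient descent method, which computes the gradient of the loss function with respect to all the weights in the network (Figure 6-4).

Figure 6-4

Multilayer perceptron representation

The activation function for a simple one-level hidden layer of a multilayer perceptron can be given by:

$f(x)=g\left(\sum \limits_{j=0}^M{W}_{kj}^{(2)}\,g\left(\sum \limits_{i=0}^d{W}_{ji}^{(1)}{x}_i\right)\right)$, where ${x}_i$ is the input, ${W}_{ji}^{(1)}$ is the input layer weights, and ${W}_{kj}^{(2)}$ is the hidden layer weights.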
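The formula above can be read as two matrix products, each followed by the activation g. The sketch below assumes g is the logistic sigmoid, uses random weights, and leaves out the bias terms (which the sums starting at index 0 would absorb by fixing one input to 1); the layer sizes are made up for illustration.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def mlp_forward(x, W1, W2, g=sigmoid):
    """f(x) = g(W2 . g(W1 . x)) for a network with one hidden layer."""
    hidden = g(W1 @ x)     # inner sum over the d inputs, one value per hidden unit
    return g(W2 @ hidden)  # outer sum over the M hidden units, one value per output


rng = np.random.default_rng(2017)
d, M, K = 4, 3, 2                 # inputs, hidden units, outputs (illustrative sizes)
W1 = rng.normal(size=(M, d))      # input-to-hidden weights, W^(1)
W2 = rng.normal(size=(K, M))      # hidden-to-output weights, W^(2)
x = rng.normal(size=d)

out = mlp_forward(x, W1, W2)
print(out)
```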

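A multilayered neural network can have many hidden layers, where the network holds its internal abstract representation of the training sample. The upper layers build new abstractions on top of the previous layers. So having more hidden layers for a complex dataset helps the neural network learn better.

As you can see from Figure 6-4, the MLP (multilayer perceptron) architecture has a minimum of three layers: input, hidden, and output. The input layer's neuron count equals the total number of features, and in some libraries there is an additional neuron for the intercept/bias. These neurons are represented as nodes. The output layer has a single neuron for regression models and binary classifiers; otherwise, its neuron count equals the total number of class labels for multiclass classification models.

Note that using too few neurons for a complex dataset can result in an underfitted model, because it might fail to learn the patterns in complex data. However, using too many neurons can result in an overfitted model, because it has the capacity to capture patterns that might be noise or specific to the given training dataset. So to build an efficient multilayered neural network, the fundamental questions to be answered about hidden layers are: 1) What is the ideal number of hidden layers? 2) What should be the number of neurons in each hidden layer?

A widely accepted rule of thumb is that you can start with one hidden layer, as there is a theory that one hidden layer is sufficient for the majority of problems. Then gradually increase the layers on a trial-and-error basis to see if there is any improvement in accuracy. The number of neurons in the hidden layer can ideally be the mean of the neurons in the input and output layers.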
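As a quick worked example of that rule of thumb, the helper below (a hypothetical name, not a library function) averages the input and output layer sizes. It is only a starting point for trial and error, not a guarantee of a good architecture.

```python
def rule_of_thumb_hidden_units(n_inputs, n_outputs):
    """Starting guess: the mean of the input and output layer sizes."""
    return round((n_inputs + n_outputs) / 2)


# 8 x 8 digit images (64 features) with ten classes
print(rule_of_thumb_hidden_units(64, 10))

# 28 x 28 images (784 features) with ten classes
print(rule_of_thumb_hidden_units(784, 10))
```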

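Let's look at the MLP algorithm from the Scikit-learn library in action on a classification problem. We'll use the digits dataset bundled with Scikit-learn, which consists of 1,797 samples of handwritten grayscale digits as 8 × 8 images (a small MNIST-like dataset).

Load MNIST Data

Listing 6-2 provides example code for loading the digits data used to train the MLPClassifier. The digits data is part of the sklearn datasets.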

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix

from sklearn.datasets import load_digits
np.random.seed(seed=2017)

# load data
digits = load_digits()
print('We have %d samples'%len(digits.target))

# plot the first 32 samples to get a sense of the data
fig = plt.figure(figsize = (8,8))
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)
for i in range(32):
    ax = fig.add_subplot(8, 8, i+1, xticks=[], yticks=[])
    ax.imshow(digits.images[i], cmap=plt.cm.gray_r)
    # label each image with its target digit
    ax.text(0, 1, str(digits.target[i]), bbox=dict(facecolor='white'))
#----output----
We have 1797 samples

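Listing 6-2. Example Code for Loading MNIST Data for Training MLPClassifier

Key Parameters for Scikit-learn MLP

Let's look at the key parameters for tuning the Scikit-learn MLP model. Listing 6-3 provides example code for implementing MLPClassifier.

hidden_layer_sizes: You provide the number of hidden layers and the neurons in each. For example, hidden_layer_sizes=(5, 3, 3) means there are three hidden layers, with 5, 3, and 3 neurons in the first, second, and third layers, respectively. The default is (100,), that is, one hidden layer with 100 neurons.
activation: The activation function for the hidden layers. Four activation functions are available; the default is "relu".
    identity: No-op activation, useful for implementing a linear bottleneck; returns f(x) = x
    logistic: The logistic sigmoid function; returns f(x) = 1 / (1 + exp(-x))
    tanh: The hyperbolic tangent function; returns f(x) = tanh(x)
    relu: The rectified linear unit function; returns f(x) = max(0, x)
solver: The solver for weight optimization. Three options are available; the default is "adam".
    lbfgs: An optimizer in the family of quasi-Newton methods; works well for small datasets
    sgd: Stochastic gradient descent
    adam: A stochastic gradient-based optimizer proposed by Diederik Kingma and Jimmy Ba; works well for large datasets
max_iter: The maximum number of iterations for the solver to converge; the default is 200.
learning_rate_init: The initial learning rate, which controls the step size in updating the weights (only for solvers sgd/adam); the default is 0.001.

It is recommended to scale or normalize your data before modeling, as MLP is sensitive to feature scaling.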

# split data to training and testing data
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.2, random_state=2017)
print ('Number of samples in training set: %d' %(len(y_train)))
print ('Number of samples in test set: %d' %(len(y_test)))

# Standardise data, and fit only to the training data
scaler = StandardScaler()
scaler.fit(X_train)

# Apply the transformations to the data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize ANN classifier with three hidden layers of 30 neurons each (matches the output below)
mlp = MLPClassifier(hidden_layer_sizes=(30, 30, 30), activation="logistic", max_iter = 100)

# Train the classifier with the training data
mlp.fit(X_train_scaled,y_train)
#----output----
Number of samples in training set: 1437
Number of samples in test set: 360

MLPClassifier(activation='logistic', alpha=0.0001, batch_size="auto",
       beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(30, 30, 30), learning_rate="constant",
       learning_rate_init=0.001, max_iter=100, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver="adam", tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

print("Training set score: %f" % mlp.score(X_train_scaled, y_train))
print("Test set score: %f" % mlp.score(X_test_scaled, y_test))
#----output----
Training set score: 0.990953
Test set score: 0.983333

# predict results from the test data

X_test_predicted = mlp.predict(X_test_scaled)

fig = plt.figure(figsize=(8, 8))  # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

# plot the digits: each image is 8x8 pixels
for i in range(32):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(X_test.reshape(-1, 8, 8)[i], cmap=plt.cm.gray_r)

    # label the image with the target value
    if X_test_predicted[i] == y_test[i]:
        ax.text(0, 1, X_test_predicted[i], color="green", bbox=dict(facecolor='white'))
    else:
        ax.text(0, 1, X_test_predicted[i], color="red", bbox=dict(facecolor='white'))
#----output----

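Listing 6-3. Example Code for Sklearn MLPClassifier

Restricted Boltzmann Machines (RBMs)

Geoffrey Hinton (2007) proposed the RBM algorithm, which learns a probability distribution over its sample training data inputs. It has seen wide application in different areas of supervised/unsupervised ML, such as feature learning, dimensionality reduction, classification, collaborative filtering, and topic modeling.

Consider the example movie ratings discussed in the "Recommender System" section of Chapter 5. Movies like Avengers, Avatar, and Interstellar have strong associations with a latent fantasy and science fiction factor. Based on the user ratings, an RBM will discover latent factors that can explain the activation of movie choices. In short, an RBM describes variability among correlated variables of the input dataset in terms of a potentially lower number of unobserved variables.

The energy function is given by $E(v,h)=-{a}^Tv-{b}^Th-{v}^TWh$.

The free energy function of a visible input layer (the negative log of its unnormalized probability) can be given by $f(v)=-{a}^Tv-\sum \limits_i\log \sum \limits_{h_i}{e}^{h_i\left({b}_i+{W}_iv\right)}$.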
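The two formulas can be checked against each other numerically. The sketch below assumes binary hidden units (each h_i is 0 or 1) and a tiny RBM with random parameters; the function names are illustrative. Summing exp(-E(v, h)) over every hidden configuration and taking the negative log should reproduce the closed-form f(v).

```python
import itertools

import numpy as np


def energy(v, h, a, b, W):
    """E(v, h) = -a^T v - b^T h - v^T W h"""
    return -a @ v - b @ h - v @ W @ h


def free_energy(v, a, b, W):
    """f(v) = -a^T v - sum_i log(1 + exp(b_i + W_i v)) for binary hidden units."""
    return -a @ v - np.sum(np.log1p(np.exp(b + v @ W)))


rng = np.random.default_rng(0)
n_visible, n_hidden = 3, 2
a = rng.normal(size=n_visible)               # visible biases
b = rng.normal(size=n_hidden)                # hidden biases
W = rng.normal(size=(n_visible, n_hidden))   # connection weights
v = np.array([1.0, 0.0, 1.0])                # one visible configuration

# brute force: marginalize over all 2**n_hidden hidden configurations
total = sum(np.exp(-energy(v, np.array(h, dtype=float), a, b, W))
            for h in itertools.product([0, 1], repeat=n_hidden))
brute_force = -np.log(total)

print(free_energy(v, a, b, W), brute_force)
```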

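Let's build a logistic regression model on the digits dataset using BernoulliRBM and compare its accuracy with that of a straight logistic regression model (without BernoulliRBM).

Let's nudge the dataset by moving the images 1 pixel to the left, right, down, and up by convolving the image (Listing 6-4).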

# Function to nudge the dataset
def nudge_dataset(X, Y):
    """
    This produces a dataset 5 times bigger than the original one,
    by moving the 8x8 images in X around by 1px to left, right, down, up
    """
    direction_vectors = [
        [[0, 1, 0],
         [0, 0, 0],
         [0, 0, 0]],

        [[0, 0, 0],
         [1, 0, 0],
         [0, 0, 0]],

        [[0, 0, 0],
         [0, 0, 1],
         [0, 0, 0]],

        [[0, 0, 0],
         [0, 0, 0],
         [0, 1, 0]]]

    shift = lambda x, w: convolve(x.reshape((8, 8)), mode="constant",
                                  weights=w).ravel()
    X = np.concatenate([X] +
                       [np.apply_along_axis(shift, 1, X, vector)
                        for vector in direction_vectors])
    Y = np.concatenate([Y for _ in range(5)], axis=0)
    return X, Y

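Listing 6-4. Function to Nudge the Dataset

BernoulliRBM assumes that the columns of our feature vectors fall within the range 0 to 1. However, the MNIST dataset is represented as unsigned 8-bit integers, falling within the range 0 to 255.

Define a function to scale the columns into the range (0, 1). The scale function takes two parameters: our data matrix X and an epsilon value used to prevent division-by-zero errors (Listing 6-5).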
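One possible form of such a scale function is sketched below. It is an assumption rather than the author's code: it uses the standard min-max form, dividing by the column range plus epsilon, whereas the listing that follows divides by the column maximum plus a small constant. The demo matrix is made up.

```python
import numpy as np


def scale(X, eps=1e-4):
    """Min-max scale each column of X into [0, 1); eps guards against division by zero."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    return (X - col_min) / (col_max - col_min + eps)


# made-up data; the last column is constant and would divide by zero without eps
X_demo = np.array([[0.0, 16.0, 5.0],
                   [8.0, 0.0, 5.0],
                   [16.0, 8.0, 5.0]])
X_scaled = scale(X_demo)
print(X_scaled)
```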

# Example adapted from scikit-learn documentation
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets, metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from scipy.ndimage import convolve

# Load Data
digits = datasets.load_digits()
X = np.asarray(digits.data, 'float32')
y = digits.target

X, y = nudge_dataset(X, digits.target)

# Scale the features such that the values are between 0-1 scale
X = (X - np.min(X, 0)) / (np.max(X, 0) + 0.0001)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2017)
print (X.shape)
print (y.shape)
#----output----
(8985, 64)
(8985,)

# Gridsearch for logistic regression

# perform a grid search on the 'C' parameter of Logistic
params = {"C": [1.0, 10.0, 100.0]}

Grid_Search = GridSearchCV(LogisticRegression(), params, n_jobs = -1, verbose = 1)
Grid_Search.fit(X_train, y_train)

# print diagnostic information to the user
print ("Best Score: %0.3f" % (Grid_Search.best_score_))

# grab the parameters of the best model
bestParams = Grid_Search.best_estimator_.get_params()

print (bestParams.items())
#----output----
Fitting 3 folds for each of 3 candidates, totalling 9 fits
Best Score: 0.774
[('warm_start', False), ('C', 100.0), ('n_jobs', 1), ('verbose', 0), ('intercept_scaling', 1), ('fit_intercept', True), ('max_iter', 100), ('penalty', 'l2'), ('multi_class', 'ovr'), ('random_state', None), ('dual', False), ('tol', 0.0001), ('solver', 'liblinear'), ('class_weight', None)]

# evaluate using Logistic Regression and only the raw pixel features
logistic = LogisticRegression(C = 100)
logistic.fit(X_train, y_train)

print ("Train accuracy: ", metrics.accuracy_score(y_train, logistic.predict(X_train)))
print ("Test accuracy: ", metrics.accuracy_score(y_test, logistic.predict(X_test)))
#----output----
Train accuracy:  0.797440178075
Test accuracy:  0.800779076238

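Listing 6-5. Example Code for Using BernoulliRBM with Classifier

Let's perform a grid search for the RBM + logistic regression model: a grid search over the learning rate, number of iterations, and number of components of the RBM, and over C for the logistic regression.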

# initialize the RBM + Logistic Regression pipeline
rbm = BernoulliRBM()
logistic = LogisticRegression()
classifier = Pipeline([("rbm", rbm), ("logistic", logistic)])

params = {
    "rbm__learning_rate": [0.1, 0.01, 0.001],
    "rbm__n_iter": [20, 40, 80],
    "rbm__n_components": [50, 100, 200],
    "logistic__C": [1.0, 10.0, 100.0]}

# perform a grid search over the parameter
Grid_Search = GridSearchCV(classifier, params, n_jobs = -1, verbose = 1)
Grid_Search.fit(X_train, y_train)

# print diagnostic information to the user and grab the
# best model
print ("Best Score: %0.3f" % (Grid_Search.best_score_))

print ("RBM + Logistic Regression parameters")
bestParams = Grid_Search.best_estimator_.get_params()

# loop over the parameters and print each of them out
# so they can be manually set
for p in sorted(params.keys()):
    print ("\t %s: %f" % (p, bestParams[p]))
#----output----
Fitting 3 folds for each of 81 candidates, totalling 243 fits
Best Score: 0.505
RBM + Logistic Regression parameters
     logistic__C: 100.000000
     rbm__learning_rate: 0.001000
     rbm__n_components: 200.000000
     rbm__n_iter: 20.000000

# initialize the RBM + Logistic Regression classifier with
# the cross-validated parameters
rbm = BernoulliRBM(n_components = 200, n_iter = 20, learning_rate = 0.1,  verbose = False)
logistic = LogisticRegression(C = 100)

# train the classifier and show an evaluation report
classifier = Pipeline([("rbm", rbm), ("logistic", logistic)])
classifier.fit(X_train, y_train)

print (metrics.accuracy_score(y_train, classifier.predict(X_train)))
print (metrics.accuracy_score(y_test, classifier.predict(X_test)))
#----output----
0.936839176405
0.932109070673

# plot RBM components
plt.figure(figsize=(15, 15))
for i, comp in enumerate(rbm.components_):
    plt.subplot(20, 20, i + 1)
    plt.imshow(comp.reshape((8, 8)), cmap=plt.cm.gray_r,
               interpolation='nearest')
    plt.xticks(())
    plt.yticks(())
plt.suptitle('200 components extracted by RBM', fontsize=16)
plt.show()
#----output----

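Listing 6-6. Example Code for Grid Search with RBM + Logistic Regression

Notice that the logistic regression model with RBM lifts the model score by more than 10 percentage points compared with the model without RBM (test accuracy of about 0.93 vs. 0.80).

Note

To practice further and get a better understanding, I recommend trying the preceding example code on Scikit-learn's Olivetti faces dataset, which contains face images taken between April 1992 and April 1994 at AT&T Laboratories Cambridge. You can load the data using olivetti = datasets.fetch_olivetti_faces().

A stack of RBMs is known as a deep belief network (DBN), which is an initialization technique. The technique was popular during 2006-2007 but is now outdated, so there is no out-of-the-box implementation of DBN in Keras. However, if you are interested in a simple DBN implementation, I recommend having a look at https://github.com/albertbup/deep-belief-network , which has an MIT license.

MLP Using Keras

In Keras, neural networks are defined as a sequence of layers, and the container for these layers is the Sequential class. The Sequential model is a linear stack of layers; the output of each layer feeds into the input of the next.

The first layer in the neural network defines the number of inputs to expect. The activation functions transform the summed signal from each neuron in a layer; they can also be extracted and added to the Sequential model as a layer-like object called Activation. The choice of activation depends on the type of problem (such as regression, binary classification, or multiclass classification) that we are trying to address.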

from matplotlib import pyplot as plt
import numpy as np
np.random.seed(2017)

from keras.models import Sequential
from keras.datasets import mnist
from keras.layers import Dense, Activation, Dropout, Input
from keras.models import Model
from keras.utils import np_utils

from IPython.display import SVG
from keras import backend as K
from keras.callbacks import EarlyStopping
from keras.utils.vis_utils import model_to_dot, plot_model

nb_classes = 10 # class size
# flatten 28*28 images to a 784 vector for each image
input_unit_size = 28*28

# load data
(X_train, y_train), (X_test, y_test) = mnist.load_data()

X_train = X_train.reshape(X_train.shape[0], input_unit_size)
X_test  = X_test.reshape(X_test.shape[0], input_unit_size)
X_train = X_train.astype('float32')
X_test  = X_test.astype('float32')

# Scale the pixel values to the 0-1 range by dividing by 255
X_train /= 255
X_test  /= 255

# one-hot representation, required for multiclass problems
y_train = np_utils.to_categorical(y_train, nb_classes)
y_test = np_utils.to_categorical(y_test, nb_classes)

print('X_train shape:', X_train.shape)
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')

# create model

model = Sequential()
model.add(Dense(input_unit_size, input_dim=input_unit_size, kernel_initializer="normal", activation="relu"))
model.add(Dense(nb_classes, kernel_initializer="normal", activation="softmax"))
#----output----
'X_train shape:', (60000, 784)
60000, 'train samples'
10000, 'test samples'

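Listing 6-7. Example Code for Keras MLP

Compilation is a precompute step for the model that transforms the sequence of layers we defined into a highly efficient series of matrix transforms. It takes three arguments: an optimizer, a loss function, and a list of evaluation metrics.

Unlike the Scikit-learn implementation, Keras provides a large number of optimizers, such as SGD, RMSprop, Adagrad (adaptive subgradient), Adadelta (adaptive learning rate), Adam, Adamax, Nadam, and TFOptimizer. For brevity, I won't explain them here; refer to the official Keras site for further reference.

Some standard loss functions are "mse" for regression, binary_crossentropy (logarithmic loss) for binary classification, and categorical_crossentropy (multiclass logarithmic loss) for multiclass classification problems.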
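To show what these three losses measure, here is a plain NumPy sketch of each. These are textbook definitions written for illustration, not the Keras implementations, which may differ in details such as the clipping constant; the example inputs are made up.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error, used for regression."""
    return np.mean((y_true - y_pred) ** 2)


def binary_crossentropy(y_true, p, eps=1e-7):
    """Log loss for 0/1 labels and predicted probabilities p."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))


def categorical_crossentropy(y_true, p, eps=1e-7):
    """Multiclass log loss for one-hot labels and per-class probabilities p."""
    p = np.clip(p, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(p), axis=1))


print(mse(np.array([1.0, 2.0]), np.array([1.0, 4.0])))
print(binary_crossentropy(np.array([1.0, 0.0]), np.array([0.5, 0.5])))
print(categorical_crossentropy(np.array([[1.0, 0.0, 0.0]]),
                               np.array([[0.5, 0.25, 0.25]])))
```

A prediction of 0.5 for the true class costs log 2 in both cross-entropy variants, which is why the last two values agree.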

Standard evaluation metrics for different types of problems are supported, and you can pass a list of them to evaluate (Listing 6-8).

# Compile model
model.compile(loss='categorical_crossentropy', optimizer="adam", metrics=['accuracy'])

SVG(model_to_dot(model, show_shapes=True).create(prog='dot', format="svg"))

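Listing 6-8. Compile the Model

The network is trained using a back-propagation algorithm and optimized according to the specified method and loss function. Each epoch can be partitioned into batches.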
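The relationship between epochs and batches is simple arithmetic, sketched below with hypothetical helper names. One epoch is a full pass over the training data, and the weights are updated once per batch; the last batch is smaller when the sample count is not a multiple of the batch size.

```python
import math


def batches_per_epoch(n_samples, batch_size):
    """Number of weight updates in one full pass over the data."""
    return math.ceil(n_samples / batch_size)


def iterate_batches(n_samples, batch_size):
    """Yield (start, end) index ranges covering the data once."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)


# 60,000 samples with batch_size=500, and 768 samples with batch_size=10
print(batches_per_epoch(60000, 500))
print(batches_per_epoch(768, 10))

# the final batch of the second case holds only the leftover samples
print(list(iterate_batches(768, 10))[-1])
```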

# model training
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=5, batch_size=500, verbose=2)

# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Error: %.2f%%" % (100-scores[1]*100))
#----output----
Train on 60000 samples, validate on 10000 samples
Epoch 1/5
 - 3s - loss: 0.3863 - acc: 0.8926 - val_loss: 0.1873 - val_acc: 0.9477
Epoch 2/5
 - 3s - loss: 0.1558 - acc: 0.9561 - val_loss: 0.1280 - val_acc: 0.9612
Epoch 3/5
 - 3s - loss: 0.1071 - acc: 0.9696 - val_loss: 0.1009 - val_acc: 0.9697
Epoch 4/5
 - 3s - loss: 0.0800 - acc: 0.9773 - val_loss: 0.0845 - val_acc: 0.9756
Epoch 5/5
 - 3s - loss: 0.0607 - acc: 0.9832 - val_loss: 0.0760 - val_acc: 0.9776
Error: 2.24%

Listing 6-9. Train Model and Evaluate

import pandas as pd

# load pima indians dataset
dataset = pd.read_csv('Data/Diabetes.csv')

# split into input (X) and output (y) variables
X = dataset.iloc[:,0:8].values
y = dataset['class'].values     # dependent variables

# create model
model = Sequential()
model.add(Dense(12, input_dim=8, kernel_initializer="uniform", activation="relu"))
model.add(Dense(1, kernel_initializer="uniform", activation="sigmoid"))

# Compile model
model.compile(loss='binary_crossentropy', optimizer="adam", metrics=['accuracy'])

SVG(model_to_dot(model, show_shapes=True).create(prog='dot', format="svg"))

# Fit the model
model.fit(X, y, epochs=5, batch_size=10)

# evaluate the model
scores = model.evaluate(X, y)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
#----output----
Epoch 1/5
768/768 [==============================] - 0s 306us/step - loss: 0.6737 - acc: 0.6250
Epoch 2/5
768/768 [==============================] - 0s 118us/step - loss: 0.6527 - acc: 0.6510
Epoch 3/5
768/768 [==============================] - 0s 96us/step - loss: 0.6432 - acc: 0.6563
Epoch 4/5
768/768 [==============================] - 0s 109us/step - loss: 0.6255 - acc: 0.6719
Epoch 5/5
768/768 [==============================] - 0s 113us/step - loss: 0.6221 - acc: 0.6706
768/768 [==============================] - 0s 84us/step
acc: 68.75%

Listing 6-10. Additional Example to Train Model and Evaluate for Diabetes Dataset

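Autoencoders

As the name suggests, an autoencoder aims to learn an encoding, as a representation of the training sample data, automatically and without human intervention. Autoencoders are widely used for dimensionality reduction and data denoising (Figure 6-5).

Figure 6-5

Autoencoder

Building an autoencoder typically involves three elements:

An encoding function that maps the input to a hidden representation through a nonlinear function, z = sigmoid(Wx + b)
A decoding function, such as x' = sigmoid(W'z + b'), which maps back to a reconstruction x' with the same shape as x
A loss function, which is a distance function that measures the information loss between the original data and its reconstruction from the compressed representation. The reconstruction error can be measured using the traditional squared error ||x - x'||².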
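The three elements can be sketched in a few lines of NumPy. This is an untrained forward pass only, with random weights and a random input standing in for a flattened 28 × 28 image; the sizes 784 and 144 and the variable names are illustrative assumptions.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


rng = np.random.default_rng(2017)
n_in, n_hidden = 784, 144          # input size and compressed size (illustrative)

# encoder and decoder parameters (random here; training would learn them)
W = rng.normal(scale=0.01, size=(n_hidden, n_in))
b = np.zeros(n_hidden)
W_prime = rng.normal(scale=0.01, size=(n_in, n_hidden))
b_prime = np.zeros(n_in)

x = rng.random(n_in)               # stand-in for one flattened image scaled to [0, 1)

z = sigmoid(W @ x + b)                       # encoding: hidden representation
x_rec = sigmoid(W_prime @ z + b_prime)       # decoding: reconstruction with x's shape
loss = np.sum((x - x_rec) ** 2)              # squared reconstruction error

print(z.shape, x_rec.shape, loss)
```

Training an autoencoder means adjusting the encoder and decoder parameters so that this reconstruction error shrinks across the training set.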

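We'll use the well-known MNIST database of handwritten digits, which contains about 70,000 grayscale image samples of handwritten digits from 0 to 9. Each image is 28 × 28 with intensity levels from 0 to 255 and carries an integer label from 0 to 9; 60,000 images form the training set and the remaining 10,000 form the test dataset.

Dimension Reduction Using an Autoencoder

Listing 6-11 provides an example code implementation of dimension reduction using an autoencoder.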

import numpy as np
np.random.seed(2017)

from keras.datasets import mnist
from keras.models import Model
from keras.layers import Input, Dense
from keras.optimizers import Adadelta
from keras.utils import np_utils

from IPython.display import SVG
from keras import backend as K
from keras.callbacks import EarlyStopping
from keras.utils.vis_utils import model_to_dot
from matplotlib import pyplot as plt

# Load mnist data
input_unit_size = 28*28
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# function to plot digits
def draw_digit(data, row, col, n):
    size = int(np.sqrt(data.shape[0]))
    plt.subplot(row, col, n)
    plt.imshow(data.reshape(size, size))
    plt.gray()

# Normalize
X_train = X_train.reshape(X_train.shape[0], input_unit_size)
X_train = X_train.astype('float32')
X_train /= 255
print('X_train shape:', X_train.shape)
#----output----
'X_train shape:', (60000, 784)

# Autoencoder
inputs = Input(shape=(input_unit_size,))
x = Dense(144, activation="relu")(inputs)
outputs = Dense(input_unit_size)(x)
model = Model(inputs=inputs, outputs=outputs)
model.compile(loss='mse', optimizer="adadelta")

SVG(model_to_dot(model, show_shapes=True).create(prog='dot', format="svg"))
#----output----

Listing 6-11. Example Code for Dimension Reduction Using an Autoencoder

Note that the 784 dimensions are reduced by encoding to 144 in the hidden layer, and then reconstructed back to 784 in layer 3 using the decoder.

model.fit(X_train, X_train, epochs=5, batch_size=258)
#----output----
Epoch 1/5
60000/60000 [==============================] - 8s - loss: 0.0733
Epoch 2/5
60000/60000 [==============================] - 9s - loss: 0.0547
Epoch 3/5
60000/60000 [==============================] - 11s - loss: 0.0451
Epoch 4/5
60000/60000 [==============================] - 11s - loss: 0.0392
Epoch 5/5
60000/60000 [==============================] - 11s - loss: 0.0354

# plot the images from input layers
show_size = 5
total = 0
plt.figure(figsize=(5,5))
for i in range(show_size):
    for j in range(show_size):
        draw_digit(X_train[total], show_size, show_size, total+1)
        total+=1
plt.show()
#----output----

# plot the encoded (compressed) layer image
get_layer_output = K.function([model.layers[0].input],
                                  [model.layers[1].output])

hidden_outputs = get_layer_output([X_train[0:show_size**2]])[0]

total = 0
plt.figure(figsize=(5,5))
for i in range(show_size):
    for j in range(show_size):
        draw_digit(hidden_outputs[total], show_size, show_size, total+1)
        total+=1
plt.show()
#----output----

# Plot the decoded (de-compressed) layer images
get_layer_output = K.function([model.layers[0].input],
                                  [model.layers[2].output])

last_outputs = get_layer_output([X_train[0:show_size**2]])[0]

total = 0
plt.figure(figsize=(5,5))
for i in range(show_size):
    for j in range(show_size):
        draw_digit(last_outputs[total], show_size, show_size, total+1)
        total+=1
plt.show()
#----output----

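Denoising Images Using an Autoencoder

Discovering robust features from the compressed hidden layer is an important aspect of enabling the autoencoder to efficiently reconstruct the input from a noisy (corrupted) version of the original image. This is addressed by the denoising autoencoder, which is a stochastic version of the autoencoder.

Let's introduce some noise to the digits dataset and try to build a model to denoise the images (Listing 6-12).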

# Introducing noise to the image
noise_factor = 0.5
X_train_noisy = X_train + noise_factor * np.random.normal(loc=0.0, scale=1.0, size=X_train.shape)
X_train_noisy = np.clip(X_train_noisy, 0., 1.)

# Function for visualization
def draw(data, row, col, n):
    plt.subplot(row, col, n)
    plt.imshow(data, cmap=plt.cm.gray_r)
    plt.axis('off')

show_size = 10
plt.figure(figsize=(20,20))

for i in range(show_size):
    draw(X_train_noisy[i].reshape(28,28), 1, show_size, i+1)
plt.show()
#----output----

Listing 6-12. Example Code for Denoising Using an Autoencoder

# Let's fit a model on the noisy training dataset
model.fit(X_train_noisy, X_train, epochs=5, batch_size=258)

# Prediction for denoised image
X_train_pred = model.predict(X_train_noisy)

show_size = 10
plt.figure(figsize=(20,20))

for i in range(show_size):
    draw(X_train_pred[i].reshape(28,28), 1, show_size, i+1)
plt.show()
#----output----

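Note that we can tune the model to improve the sharpness of the denoised images.

Convolutional Neural Network (CNN)

In the world of image classification, the CNN has become the go-to algorithm for building efficient models. CNNs are similar to ordinary neural networks, except that they explicitly assume the inputs are images, which allows us to encode certain properties into the architecture. These properties make the forward function efficient to implement and reduce the number of parameters in the network. The neurons are arranged in three dimensions: width, height, and depth.

Let's consider CIFAR-10 (Canadian Institute for Advanced Research), a standard computer vision and deep learning image dataset. It consists of 60,000 color photos of 32 × 32 pixels, with RGB values for each pixel, divided into ten classes of common objects: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Essentially, each image is of size 32 × 32 × 3 (width × height × RGB color channels).

A CNN consists of four main types of layers: input layer, convolution layer, pooling layer, and fully connected layer.

The input layer holds the raw pixels, so a CIFAR-10 image has 32 × 32 × 3 dimensions at the input layer. The convolution layer computes a dot product between its weights and small local regions of the input layer, so if we decide to have five filters, the resulting dimension is 32 × 32 × 5. The ReLU layer applies an element-wise activation function, which does not affect the dimension. The pooling layer downsamples the spatial dimensions along width and height, resulting in dimension 16 × 16 × 5. Finally, the fully connected layer computes the class scores, and the resulting dimension is a single vector of 1 × 1 × 10 (ten class scores). Each neuron in this layer is connected to all the numbers in the previous volume (Figure 6-6).

Figure 6-6

Convolutional neural network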
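The layer dimensions described above follow from the usual output-size formula, (W - K + 2P) / S + 1, for input width W, kernel size K, padding P, and stride S. The sketch below uses hypothetical helper names and assumes 3 × 3 filters; a padding of 1 keeps the 32 × 32 size quoted in the text, while no padding shrinks each side by 2.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution: (W - K + 2P) / S + 1."""
    return (size - kernel + 2 * padding) // stride + 1


def pool_output_size(size, pool):
    """Spatial output size of non-overlapping pooling."""
    return size // pool


def conv_params(kernel, in_channels, filters):
    """Trainable parameters: one kernel x kernel x in_channels filter plus a bias, per filter."""
    return (kernel * kernel * in_channels + 1) * filters


# 32 x 32 x 3 input, five 3 x 3 filters, padding chosen to preserve the size
same = conv_output_size(32, kernel=3, padding=1)
pooled = pool_output_size(same, pool=2)
print((same, same, 5), "->", (pooled, pooled, 5))

# without padding the spatial size shrinks
valid = conv_output_size(32, kernel=3)
print(valid)

# parameter count for ten 3 x 3 filters over the three RGB channels
print(conv_params(kernel=3, in_channels=3, filters=10))
```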

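The following example illustration uses Keras with the Theano back end. To start Keras with the Theano back end, run the following command while starting Jupyter Notebook: "KERAS_BACKEND=theano jupyter notebook" (Listing 6-13).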

import keras
# choose the image dimension ordering that matches the active back end
if keras.backend.backend() == 'tensorflow':
    keras.backend.set_image_dim_ordering('tf')
else:
    keras.backend.set_image_dim_ordering('th')

from keras.models import Sequential
from keras.datasets import cifar10
from keras.layers import Dense, Dropout, Activation, Conv2D, MaxPooling2D, Flatten
from keras.utils import np_utils
from keras.preprocessing import sequence

from keras import backend as K
from IPython.display import SVG, display
from keras.utils.vis_utils import model_to_dot, plot_model
import numpy as np
np.random.seed(2017)

img_rows, img_cols = 32, 32
img_channels = 3

batch_size = 256
nb_classes = 10
nb_epoch = 4
nb_filters = 10
nb_conv = 3
nb_pool = 2
kernel_size = 3 # convolution kernel size

if K.image_dim_ordering() == 'th':
    input_shape = (3, img_rows, img_cols)
else:
    input_shape = (img_rows, img_cols, 3)

(X_train, y_train), (X_test, y_test) = cifar10.load_data()
print('X_train shape:', X_train.shape)
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255

Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)
#----output----
X_train shape: (50000, 32, 32, 3)
50000 train samples
10000 test samples

# define two groups of layers: feature (convolutions) and classification (dense)
feature_layers = [
    Conv2D(nb_filters, kernel_size, input_shape=input_shape),
    Activation('relu'),
    Conv2D(nb_filters, kernel_size),
    Activation('relu'),
    MaxPooling2D(pool_size=(nb_pool, nb_pool)),
    Flatten(),
]
classification_layers = [
    Dense(512),
    Activation('relu'),
    Dense(nb_classes),
    Activation('softmax')
]

# create complete model

model = Sequential(feature_layers + classification_layers)

model.compile(loss='categorical_crossentropy', optimizer="adadelta", metrics=['accuracy'])

SVG(model_to_dot(model, show_shapes=True).create(prog='dot', format="svg"))
#----output----
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d_1 (Conv2D)            (None, 30, 30, 10)        280
_________________________________________________________________
activation_1 (Activation)    (None, 30, 30, 10)        0
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 28, 28, 10)        910
_________________________________________________________________
activation_2 (Activation)    (None, 28, 28, 10)        0
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 14, 14, 10)        0
_________________________________________________________________
flatten_1 (Flatten)          (None, 1960)              0
_________________________________________________________________
dense_1 (Dense)              (None, 512)               1004032
_________________________________________________________________
activation_3 (Activation)    (None, 512)               0
_________________________________________________________________
dense_2 (Dense)              (None, 10)                5130
_________________________________________________________________
activation_4 (Activation)    (None, 10)                0
=================================================================
Total params: 1,010,352
Trainable params: 1,010,352
Non-trainable params: 0

# fit model
model.fit(X_train, Y_train, validation_data=(X_test, Y_test),
          epochs=nb_epoch, batch_size=batch_size, verbose=2)
#----output----
Train on 50000 samples, validate on 10000 samples
Epoch 1/4
 - 50s - loss: 1.8512 - acc: 0.3422 - val_loss: 1.5729 - val_acc: 0.4438
Epoch 2/4
 - 38s - loss: 1.4350 - acc: 0.4945 - val_loss: 1.4312 - val_acc: 0.4832
Epoch 3/4
 - 38s - loss: 1.2542 - acc: 0.5566 - val_loss: 1.3300 - val_acc: 0.5191
Epoch 4/4
 - 38s - loss: 1.1375 - acc: 0.6021 - val_loss: 1.1760 - val_acc: 0.5760

Let's visualize each layer. Note that we applied ten filters.

# functions for visualization
def draw(data, row, col, n):
    plt.subplot(row, col, n)
    plt.imshow(data)

def draw_digit(data, row, col):
    for j in range(row):
        plt.figure(figsize=(16,16))
        for i in range(col):
            plt.subplot(row, col, i+1)
            plt.imshow(data[j,:,:,i])
            plt.axis('off')
        plt.tight_layout()
    plt.show()

### Input layer (original image)
show_size = 10
plt.figure(figsize=(16,16))
for i in range(show_size):
    draw(X_train[i], 1, show_size, i+1)
plt.show()
#----output----

Listing 6-13. CNN Using Keras with Theano Backend on CIFAR10 Dataset

Notice in the following that the hidden layers features are stored in ten filters.

# first layer

get_first_layer_output = K.function([model.layers[0].input],
                          [model.layers[1].output])
first_layer = get_first_layer_output([X_train[0:show_size]])[0]

print ('first layer shape: ', first_layer.shape)
draw_digit(first_layer, first_layer.shape[0], first_layer.shape[3])
#----output----

# second layer

get_second_layer_output = K.function([model.layers[0].input],
                          [model.layers[3].output])
second_layers = get_second_layer_output([X_train[0:show_size]])[0]

print ('second layer shape: ', second_layers.shape)
draw_digit(second_layers, second_layers.shape[0], second_layers.shape[3])
#----output----

# third layer

get_third_layer_output = K.function([model.layers[0].input],
                          [model.layers[4].output])
third_layers = get_third_layer_output([X_train[0:show_size]])[0]

print ('third layer shape: ', third_layers.shape)
draw_digit(third_layers, third_layers.shape[0], third_layers.shape[3])
#----output-----

CNN on the MNIST Dataset

As an additional example, let's look at how a CNN performs on the MNIST handwritten digits dataset (Listing 6-14).

import keras
keras.backend.backend()
keras.backend.image_dim_ordering()

# using theano as backend
K = keras.backend.backend()
if K=='tensorflow':
    keras.backend.set_image_dim_ordering('tf')
else:
    keras.backend.set_image_dim_ordering('th')

from matplotlib import pyplot as plt

%matplotlib inline

import numpy as np
np.random.seed(2017)

from keras import backend as K
from keras.models import Sequential
from keras.datasets import mnist
from keras.layers import Dense, Dropout, Activation, Conv2D, MaxPooling2D, Flatten
from keras.utils import np_utils
from keras.preprocessing import sequence

from keras import backend as K
from IPython.display import SVG, display
from keras.utils.vis_utils import model_to_dot, plot_model

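nb_filters = 5 # the number of filters
nb_pool = 2 # window size of pooling
nb_conv = 3 # window or kernel size of filter
nb_epoch = 5
kernel_size = 3 # convolution kernel size
nb_classes = 10 # number of classes (digits 0 to 9)
img_rows, img_cols = 28, 28 # input image dimensions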

if K.image_dim_ordering() == 'th':
    input_shape = (1, img_rows, img_cols)
else:
    input_shape = (img_rows, img_cols, 1)

# data

(X_train, y_train), (X_test, y_test) = mnist.load_data()

X_train = X_train.reshape(X_train.shape[0], img_rows, img_cols, 1)
X_test = X_test.reshape(X_test.shape[0], img_rows, img_cols, 1)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255
print('X_train shape:', X_train.shape)
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')

# convert class vectors to binary class matrices
Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)
#----output----
'X_train shape:', (60000, 1, 28, 28)
60000, 'train samples'
10000, 'test samples'

# define two groups of layers: feature (convolutions) and classification (dense)
feature_layers = [
    Conv2D(nb_filters, kernel_size, input_shape=input_shape),
    Activation('relu'),
    Conv2D(nb_filters, kernel_size),
    Activation('relu'),
    MaxPooling2D(pool_size = nb_pool),
    Dropout(0.25),
    Flatten(),
]
classification_layers = [
    Dense(128),
    Activation('relu'),
    Dropout(0.5),
    Dense(nb_classes),
    Activation('softmax')
]

# create complete model

model = Sequential(feature_layers + classification_layers)

model.compile(loss='categorical_crossentropy', optimizer="adadelta", metrics=['accuracy'])

SVG(model_to_dot(model, show_shapes=True).create(prog='dot', format="svg"))

print(model.summary())
#----output----

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d_1 (Conv2D)            (None, 26, 26, 5)         50
_________________________________________________________________
activation_1 (Activation)    (None, 26, 26, 5)         0
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 24, 24, 5)         230
_________________________________________________________________
activation_2 (Activation)    (None, 24, 24, 5)         0
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 12, 12, 5)         0
_________________________________________________________________
dropout_1 (Dropout)          (None, 12, 12, 5)         0
_________________________________________________________________
flatten_1 (Flatten)          (None, 720)               0
_________________________________________________________________
dense_1 (Dense)              (None, 128)               92288
_________________________________________________________________
activation_3 (Activation)    (None, 128)               0
_________________________________________________________________
dropout_2 (Dropout)          (None, 128)               0
_________________________________________________________________
dense_2 (Dense)              (None, 10)                1290
_________________________________________________________________
activation_4 (Activation)    (None, 10)                0
=================================================================
Total params: 93,858
Trainable params: 93,858
Non-trainable params: 0

model.fit(X_train, Y_train, batch_size=256, epochs=nb_epoch, verbose=2,  validation_split=0.2)
#----output----
Train on 48000 samples, validate on 12000 samples
Epoch 1/5
 - 15s - loss: 0.6098 - acc: 0.8085 - val_loss: 0.1609 - val_acc: 0.9523
Epoch 2/5
 - 15s - loss: 0.2427 - acc: 0.9251 - val_loss: 0.1148 - val_acc: 0.9675
Epoch 3/5
 - 15s - loss: 0.1941 - acc: 0.9410 - val_loss: 0.0950 - val_acc: 0.9727
Epoch 4/5
 - 15s - loss: 0.1670 - acc: 0.9483 - val_loss: 0.0866 - val_acc: 0.9753
Epoch 5/5
 - 15s - loss: 0.1500 - acc: 0.9548 - val_loss: 0.0830 - val_acc: 0.9767

Listing 6-14. CNN Using Keras with Theano Backend on MNIST Dataset
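The parameter counts in the model summary of Listing 6-14 can be reproduced by hand, which is a quick way to check that the layers are wired as intended. The sketch below is plain arithmetic and does not call Keras. The helper names `conv2d_params` and `dense_params` are made up for illustration, and the calculation assumes 'valid' padding with stride 1, which is what the output shapes in the summary suggest.

```python
def conv2d_params(kernel_size, in_channels, filters):
    """Weights (k * k * in_channels per filter) plus one bias per filter."""
    return (kernel_size * kernel_size * in_channels + 1) * filters

def dense_params(n_inputs, n_units):
    """One weight per input/unit pair plus one bias per unit."""
    return (n_inputs + 1) * n_units

# Shapes follow the summary: 28x28x1 -> 26x26x5 -> 24x24x5 -> pooling -> 12x12x5
flat = 12 * 12 * 5

counts = {
    "conv2d_1": conv2d_params(3, 1, 5),
    "conv2d_2": conv2d_params(3, 5, 5),
    "dense_1": dense_params(flat, 128),
    "dense_2": dense_params(128, 10),
}
total = sum(counts.values())

for name, n in counts.items():
    print(name, n)
print("total", total)
```

Activation, pooling, dropout, and flatten layers have no weights, so they contribute zero, just as the summary shows.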

Visualizing the Layers

# visualization

def draw(data, row, col, n):
    plt.subplot(row, col, n)
    plt.imshow(data, cmap=plt.cm.gray_r)
    plt.axis('off')

def draw_digit(data, row, col):
    for j in range(row):
        plt.figure(figsize=(8,8))
        for i in range(col):
            plt.subplot(row, col, i+1)
            plt.imshow(data[j,:,:,i], cmap=plt.cm.gray_r)
            plt.axis('off')
        plt.tight_layout()
    plt.show()

# Sample input layer (original image)
show_size = 10
plt.figure(figsize=(20,20))

for i in range(show_size):
    draw(X_train[i].reshape(28,28), 1, show_size, i+1)
plt.show()
#----output----

# First layer with 5 filters
get_first_layer_output = K.function([model.layers[0].input], [model.layers[1].output])
first_layer = get_first_layer_output([X_train[0:show_size]])[0]

print ('first layer shape: ', first_layer.shape)

draw_digit(first_layer, first_layer.shape[0], first_layer.shape[3])
#----output----

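Recurrent Neural Network (RNN)

A multilayer perceptron (a feedforward network) is known to perform poorly on sequential-event models, such as a probabilistic language model that predicts the next word from the previous words at every point in a sequence. The RNN architecture addresses this issue. It is similar to an MLP except that it has a feedback loop, which means it feeds the previous time step into the current step. This type of architecture can generate sequences to simulate situations and create synthetic data, which makes it a good modeling choice for sequence data such as speech, text mining, image captioning, time-series prediction, robot control, and language modeling (Figure 6-7).

Figure 6-7

Recurrent neural network

The previous step's hidden layer and final output are fed back into the network and used as input to the next step's hidden layer. In other words, the network remembers the past and repeatedly predicts what will happen next. The drawback of the general RNN architecture is that it can be memory heavy and hard to train for long-term temporal dependencies (i.e., the context of a long text should be known at any given stage).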
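The feedback loop described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the Keras implementation: the layer sizes, the random weights, the `tanh` activation, and the name `rnn_step` are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(2017)
n_in, n_hidden, n_steps = 3, 4, 5   # hypothetical sizes

W = rng.normal(scale=0.1, size=(n_hidden, n_in))      # input -> hidden
U = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # hidden -> hidden (the feedback loop)
b = np.zeros(n_hidden)

def rnn_step(x_t, h_prev):
    """One time step: the new hidden state depends on the input and the previous state."""
    return np.tanh(W @ x_t + U @ h_prev + b)

xs = rng.normal(size=(n_steps, n_in))  # a toy input sequence
h = np.zeros(n_hidden)                 # initial hidden state
states = []
for x_t in xs:
    h = rnn_step(x_t, h)               # h carries information forward in time
    states.append(h)
states = np.stack(states)
print(states.shape)
```

Because the same `U` is applied at every step, gradients flowing back through many steps are multiplied by it repeatedly, which is one way to see why long-term dependencies are hard to train.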

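Long Short-Term Memory (LSTM)

LSTM is an implementation of an improved RNN architecture that addresses the issues of the general RNN and enables long-range dependencies. It is designed to have better memory through linear memory cells surrounded by a set of gate units that control the flow of information: when information should enter the memory, when it should be forgotten, and when it should be output. It uses no activation function within its recurrent components, so the gradient term does not vanish with backpropagation. Figure 6-8 compares a simple multilayer perceptron with an RNN and an LSTM.

Figure 6-8

Simple MLP vs. RNN vs. LSTM

Refer to Table 6-3 for the key LSTM component formulas.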

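Table 6-3

LSTM components

| LSTM component | Formula |
| --- | --- |
| Input gate layer: decides which values to store in the cell state. | $i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$ |
| Forget gate layer: as the name suggests, decides what information to discard from the cell state. | $f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$ |
| Output gate layer: decides which parts of the cell state are passed on as output. | $o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$ |
| Memory cell state vector | $c_t = f_t \circ c_{t-1} + i_t \circ \tanh(W_c x_t + U_c h_{t-1} + b_c)$ |

Here $\sigma$ is the sigmoid function and $\circ$ denotes element-wise multiplication.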
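To make the formulas in Table 6-3 concrete, here is a minimal NumPy sketch of a single LSTM step. The sizes and random weights are hypothetical, the helper names `affine` and `lstm_step` are made up for this example, and the final hidden-state update $h_t = o_t \circ \tanh(c_t)$ is the standard textbook form rather than something listed in the table.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2017)
n_in, n_hidden = 3, 4  # hypothetical sizes

# One (W, U, b) triple per gate, keyed by the subscripts used in Table 6-3
params = {
    gate: (rng.normal(scale=0.1, size=(n_hidden, n_in)),
           rng.normal(scale=0.1, size=(n_hidden, n_hidden)),
           np.zeros(n_hidden))
    for gate in ("i", "f", "o", "c")
}

def affine(gate, x_t, h_prev):
    W, U, b = params[gate]
    return W @ x_t + U @ h_prev + b

def lstm_step(x_t, h_prev, c_prev):
    i_t = sigmoid(affine("i", x_t, h_prev))   # input gate
    f_t = sigmoid(affine("f", x_t, h_prev))   # forget gate
    o_t = sigmoid(affine("o", x_t, h_prev))   # output gate
    c_t = f_t * c_prev + i_t * np.tanh(affine("c", x_t, h_prev))  # cell state
    h_t = o_t * np.tanh(c_t)  # assumption: standard hidden-state update, not in Table 6-3
    return h_t, c_t

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_in)):   # a toy sequence of five inputs
    h, c = lstm_step(x_t, h, c)
print(h.shape, c.shape)
```

Note that the cell state is updated additively (`f_t * c_prev + ...`), which is the "linear memory cell" the text refers to.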

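Let's look at an example using the IMDB dataset, which contains movie reviews labeled by sentiment (positive/negative). The reviews have already been preprocessed and encoded as sequences of word indexes (Listing 6-15).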

import numpy as np
np.random.seed(2017)  # for reproducibility

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Activation, Embedding
from keras.layers import LSTM
from keras.datasets import imdb

max_features = 20000
maxlen = 80  # cut texts after this number of words (among top max_features most common words)
batch_size = 32

print('Loading data...')
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=max_features)
print(len(X_train), 'train sequences')
print(len(X_test), 'test sequences')

print('Pad sequences (samples x time)')
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)
#----output----
Loading data...
25000 train sequences
25000 test sequences
Pad sequences (samples x time)
X_train shape: (25000, 80)
X_test shape: (25000, 80)

#Model configuration
model = Sequential()
model.add(Embedding(max_features, 128))
model.add(LSTM(128, recurrent_dropout=0.2, dropout=0.2))  # try using a GRU instead, for fun
model.add(Dense(1))
model.add(Activation('sigmoid'))

# Try using different optimizers and different optimizer configs

model.compile(loss='binary_crossentropy', optimizer="adam", metrics=['accuracy'])

#Train
model.fit(X_train, y_train, batch_size=batch_size, epochs=5, validation_data=(X_test, y_test))
#----output----
Epoch 1/5
25000/25000 [==============================] - 99s 4ms/step - loss: 0.4604 - acc: 0.7821 - val_loss: 0.3762 - val_acc: 0.8380
Epoch 2/5
25000/25000 [==============================] - 86s 3ms/step - loss: 0.3006 - acc: 0.8766 - val_loss: 0.3710 - val_acc: 0.8353
Epoch 3/5
25000/25000 [==============================] - 86s 3ms/step - loss: 0.2196 - acc: 0.9146 - val_loss: 0.4113 - val_acc: 0.8212
Epoch 4/5
25000/25000 [==============================] - 86s 3ms/step - loss: 0.1558 - acc: 0.9411 - val_loss: 0.4733 - val_acc: 0.8116
Epoch 5/5
25000/25000 [==============================] - 86s 3ms/step - loss: 0.1112 - acc: 0.9597 - val_loss: 0.6225 - val_acc: 0.8202

# Evaluate
train_score, train_acc = model.evaluate(X_train, y_train, batch_size=batch_size)
test_score, test_acc = model.evaluate(X_test, y_test, batch_size=batch_size)

print ('Train score:', train_score)
print ('Train accuracy:', train_acc)

print ('Test score:', test_score)
print ('Test accuracy:', test_acc)
#----output----
25000/25000 [==============================] - 37s 1ms/step
25000/25000 [==============================] - 28s 1ms/step
Train score: 0.055540263980031014
Train accuracy: 0.98432
Test score: 0.5643649271917344
Test accuracy: 0.82388

Listing 6-15. Example Code for Keras LSTM
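Listing 6-15 relies on `sequence.pad_sequences` to turn reviews of different lengths into fixed-length rows. The pure-Python sketch below mimics that step under the assumption that the Keras defaults apply (`padding='pre'` and `truncating='pre'`, with 0 as the fill value). The function name `pad_pre` and the sample sequences are made up for illustration.

```python
def pad_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` items, then left-pad with `value` up to `maxlen`."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

short_review = [11, 22, 33]            # shorter than maxlen: gets padded
long_review = [1, 2, 3, 4, 5, 6, 7]    # longer than maxlen: gets truncated

print(pad_pre(short_review, maxlen=5))
print(pad_pre(long_review, maxlen=5))
```

With `maxlen = 80`, as in the listing, only the last 80 word indexes of a long review would reach the model under these defaults.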

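Transfer Learning

Drawing on past experience, we humans can learn a new skill quite easily. We learn more efficiently when the task at hand is similar to something we have done before. For example, learning a new programming language is relatively easy for a computer professional, and driving a new type of vehicle is relatively easy for a seasoned driver.

Transfer learning is an area of ML that aims to use the knowledge gained while solving one problem to solve a different but related problem (Figure 6-9).

Figure 6-9

Transfer learning

Nothing beats understanding through an example, so let's train a simple CNN model with two groups of layers (a feature group and a classification group) on the first five digits (0 to 4) of the MNIST dataset, then apply transfer learning to freeze the feature layers and fine-tune the dense layers for classifying digits 5 to 9 (Listing 6-16).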

import numpy as np
np.random.seed(2017)  # for reproducibility

from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.utils import np_utils
from keras import backend as K

batch_size = 128
nb_classes = 5
nb_epoch = 5

# input image dimensions
img_rows, img_cols = 28, 28

# number of convolutional filters to use
nb_filters = 32

# size of pooling area for max pooling
pool_size = 2

# convolution kernel size
kernel_size = 3

input_shape = (img_rows, img_cols, 1)

# the data, shuffled and split between train and test sets
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# create two datasets: one with digits below 5 and one with 5 and above
X_train_lt5 = X_train[y_train < 5]
y_train_lt5 = y_train[y_train < 5]
X_test_lt5 = X_test[y_test < 5]
y_test_lt5 = y_test[y_test < 5]

X_train_gte5 = X_train[y_train >= 5]
y_train_gte5 = y_train[y_train >= 5] - 5  # make classes start at 0 for
X_test_gte5 = X_test[y_test >= 5]         # np_utils.to_categorical
y_test_gte5 = y_test[y_test >= 5] - 5

# Train model for digits 0 to 4
def train_model(model, train, test, nb_classes):
    X_train = train[0].reshape((train[0].shape[0],) + input_shape)
    X_test = test[0].reshape((test[0].shape[0],) + input_shape)
    X_train = X_train.astype('float32')
    X_test = X_test.astype('float32')
    X_train /= 255
    X_test /= 255
    print('X_train shape:', X_train.shape)
    print(X_train.shape[0], 'train samples')
    print(X_test.shape[0], 'test samples')

    # convert class vectors to binary class matrices
    Y_train = np_utils.to_categorical(train[1], nb_classes)
    Y_test = np_utils.to_categorical(test[1], nb_classes)

    model.compile(loss='categorical_crossentropy',
                  optimizer='adadelta',
                  metrics=['accuracy'])

    model.fit(X_train, Y_train,
              batch_size=batch_size, epochs=nb_epoch,
              verbose=1,
              validation_data=(X_test, Y_test))
    score = model.evaluate(X_test, Y_test, verbose=0)
    print('Test score:', score[0])
    print('Test accuracy:', score[1])

# define two groups of layers: feature (convolutions) and classification (dense)
feature_layers = [
    Conv2D(nb_filters, kernel_size,
                  padding='valid',
                  input_shape=input_shape),
    Activation('relu'),
    Conv2D(nb_filters, kernel_size),
    Activation('relu'),
    MaxPooling2D(pool_size=(pool_size, pool_size)),
    Dropout(0.25),
    Flatten(),
]
classification_layers = [
    Dense(128),
    Activation('relu'),
    Dropout(0.5),
    Dense(nb_classes),
    Activation('softmax')
]

# create complete model

model = Sequential(feature_layers + classification_layers)

# train model for 5-digit classification [0..4]
train_model(model, (X_train_lt5, y_train_lt5), (X_test_lt5, y_test_lt5), nb_classes)
#----output----
X_train shape: (30596, 28, 28, 1)
30596 train samples
5139 test samples
Train on 30596 samples, validate on 5139 samples
Epoch 1/5
30596/30596 [==============================] - 40s 1ms/step - loss: 0.1692 - acc: 0.9446 - val_loss: 0.0573 - val_acc: 0.9798
Epoch 2/5
30596/30596 [==============================] - 37s 1ms/step - loss: 0.0473 - acc: 0.9858 - val_loss: 0.0149 - val_acc: 0.9947
Epoch 3/5
30596/30596 [==============================] - 37s 1ms/step - loss: 0.0316 - acc: 0.9906 - val_loss: 0.0112 - val_acc: 0.9947
Epoch 4/5
30596/30596 [==============================] - 37s 1ms/step - loss: 0.0257 - acc: 0.9928 - val_loss: 0.0094 - val_acc: 0.9967
Epoch 5/5
30596/30596 [==============================] - 37s 1ms/step - loss: 0.0204 - acc: 0.9940 - val_loss: 0.0078 - val_acc: 0.9977
Test score: 0.00782204038783338
Test accuracy: 0.9976649153531816

Now transfer the model trained on digits 0 to 4 to build a model for digits 5 to 9.

# freeze feature layers and rebuild model
for layer in feature_layers:
    layer.trainable = False

# transfer: train dense layers for new classification task [5..9]
train_model(model, (X_train_gte5, y_train_gte5), (X_test_gte5, y_test_gte5), nb_classes)
#----output----
X_train shape: (29404, 28, 28, 1)
29404 train samples
4861 test samples
Train on 29404 samples, validate on 4861 samples
Epoch 1/5
29404/29404 [==============================] - 14s 484us/step - loss: 0.2290 - acc: 0.9353 - val_loss: 0.0504 - val_acc: 0.9846
Epoch 2/5
29404/29404 [==============================] - 14s 475us/step - loss: 0.0755 - acc: 0.9764 - val_loss: 0.0325 - val_acc: 0.9899
Epoch 3/5
29404/29404 [==============================] - 14s 480us/step - loss: 0.0563 - acc: 0.9828 - val_loss: 0.0326 - val_acc: 0.9881
Epoch 4/5
29404/29404 [==============================] - 14s 480us/step - loss: 0.0472 - acc: 0.9852 - val_loss: 0.0258 - val_acc: 0.9893
Epoch 5/5
29404/29404 [==============================] - 14s 481us/step - loss: 0.0404 - acc: 0.9871 - val_loss: 0.0259 - val_acc: 0.9907
Test score: 0.025926338075212205
Test accuracy: 0.990742645546184

Listing 6-16. Example Code for Transfer Learning

Note that we achieved 99.8% test accuracy after 5 epochs for the classifier of the first five digits, and 99.1% for the last five digits after transfer and fine-tuning.
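It can help to work out how many weights are actually frozen in Listing 6-16. The arithmetic below is a sketch derived from the layer definitions in the listing (3x3 kernels, 32 filters, 2x2 pooling, 128 dense units, 5 classes) and assumes 'valid' padding with stride 1; the helper names are made up for illustration and nothing here calls Keras.

```python
def conv2d_params(kernel_size, in_channels, filters):
    """Weights (k * k * in_channels per filter) plus one bias per filter."""
    return (kernel_size * kernel_size * in_channels + 1) * filters

def dense_params(n_inputs, n_units):
    """One weight per input/unit pair plus one bias per unit."""
    return (n_inputs + 1) * n_units

# 28x28x1 -> conv 3x3 -> 26x26x32 -> conv 3x3 -> 24x24x32 -> pool 2x2 -> 12x12x32
flat = 12 * 12 * 32

feature_params = conv2d_params(3, 1, 32) + conv2d_params(3, 32, 32)
classifier_params = dense_params(flat, 128) + dense_params(128, 5)

print("frozen feature parameters: ", feature_params)
print("trainable dense parameters:", classifier_params)
```

The frozen convolution layers hold relatively few weights, but they are applied at every image position, so skipping their updates plausibly accounts for much of the drop in epoch time seen in the output (roughly 37s to 14s).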

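Reinforcement Learning

Reinforcement learning is a goal-oriented learning method based on interaction with an environment. The objective is to get an agent to act in an environment so as to maximize its rewards. Here the agent is an intelligent program, and the environment is the external condition (Figure 6-10).

Figure 6-10

Reinforcement learning is like teaching your dog a trick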

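Consider the example of a predefined system for teaching a dog a new trick: you do not tell the dog what to do, but you can reward it when it does the right thing and punish it when it does the wrong thing. At every step it has to remember what earned it the reward or the punishment; this is commonly known as the credit assignment problem. Similarly, we can train a computer agent whose objective is to take actions that move it from state $s_t$ to state $s_{t+1}$, and to find a behavior function that maps states to actions and maximizes the expected sum of discounted rewards. According to the paper published by DeepMind Technologies in 2013, the Q-learning rule for updating a state is given by $Q[s, a]_{new} = Q[s, a]_{prev} + \alpha \, (r + \gamma \max(s, a) - Q[s, a]_{prev})$, where

$\alpha$ is the learning rate,

$r$ is the reward for the latest action,

$\gamma$ is the discount factor, and

$\max(s, a)$ is the estimate of the new value from the best action.

If the optimal value $Q[s, a]$ of the sequence $s'$ at the next time step were known for all possible actions $a'$, then the optimal strategy would be to select the action $a'$ that maximizes the expected value of $r + \gamma \max(s, a) - Q[s, a]_{prev}$.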
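A single application of the Q-learning rule above can be written as a one-line function. This is a minimal sketch: the function name `q_update` and the numbers passed to it are hypothetical and serve only to show how the learning rate blends the old estimate with the new target.

```python
def q_update(q_prev, reward, max_next, alpha, gamma):
    """Move the old estimate toward reward + gamma * (best value of the next state)."""
    return q_prev + alpha * (reward + gamma * max_next - q_prev)

# Hypothetical numbers: old estimate 40, reward 100, best next-state value 50
half_step = q_update(q_prev=40.0, reward=100.0, max_next=50.0, alpha=0.5, gamma=0.8)
full_step = q_update(q_prev=40.0, reward=100.0, max_next=50.0, alpha=1.0, gamma=0.8)
print(half_step, full_step)
```

With `alpha = 1` the old estimate drops out and the rule reduces to `reward + gamma * max_next`, which is the simplified form used in the update step of Listing 6-17.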

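Let's consider an example in which an agent is trying to get out of a maze (Figure 6-11). It can move one square or area at a time in any direction, and it earns a reward if it exits. The most common way to formalize a reinforcement problem is to represent it as a Markov decision process. Assume the agent is in state b (a maze area) and the target is to reach state f; the agent can then get from b to f in one step. Let's set a reward of 100 (otherwise 0) for links between nodes that allow the agent to reach the target state. Listing 6-17 provides an example code implementation of Q-learning.

Figure 6-11

Left: Maze with five states. Right: Markov decision process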

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection

# defines the reward/link connection graph
R = np.array([[-1, -1, -1, -1,  0,  -1],
              [-1, -1, -1,  0, -1, 100],
              [-1, -1, -1,  0, -1,  -1],
              [-1,  0,  0, -1,  0,  -1],
              [ 0, -1, -1,  0, -1, 100],
              [-1,  0, -1, -1,  0, 100]]).astype("float32")
Q = np.zeros_like(R)

Listing 6-17. Example Code for Q-learning

A value of -1 in the table means there is no link between the nodes. For example, state 'a' cannot go to state 'b'.
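One way to read the reward table is to list, for each state, the states it links to. The sketch below does this in plain Python; it assumes the labels 'a' to 'f' correspond to row and column indexes 0 to 5, which is consistent with the examples in the text but is not stated explicitly.

```python
# Same reward/link table as in Listing 6-17, written as nested lists
R = [[-1, -1, -1, -1,  0,  -1],
     [-1, -1, -1,  0, -1, 100],
     [-1, -1, -1,  0, -1,  -1],
     [-1,  0,  0, -1,  0,  -1],
     [ 0, -1, -1,  0, -1, 100],
     [-1,  0, -1, -1,  0, 100]]

names = "abcdef"  # assumed labels for states 0 to 5

# For every state, collect the states whose entry is not -1
links = {
    names[s]: [names[a] for a, reward in enumerate(row) if reward >= 0]
    for s, row in enumerate(R)
}
for state, reachable in links.items():
    print(state, "->", reachable)
```

Only the entries pointing at 'f' carry the reward of 100; every other allowed move has a reward of 0.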

# learning parameter
gamma = 0.8

# Initialize a random non-goal starting state (0 to 4)
initial_state = np.random.randint(0, 5)

# This function returns all available actions in the state given as an argument
def available_actions(state):
    current_state_row = R[state,]
    # R is a 2-D array, so a single row is 1-D and np.where returns a 1-tuple
    av_act = np.where(current_state_row >= 0)[0]
    return av_act

# This function chooses at random which action to be performed within the range
# of all the available actions.
def sample_next_action(available_actions_range):
    next_action = int(np.random.choice(available_actions_range))
    return next_action

# This function updates the Q matrix according to the path selected and the
# Q-learning algorithm
def update(current_state, action, gamma):
    # indexes of the best-valued moves from the next state (there may be ties)
    max_index = np.where(Q[action,] == np.max(Q[action,]))[0]

    if max_index.shape[0] > 1:
        max_index = int(np.random.choice(max_index))
    else:
        max_index = int(max_index[0])
    max_value = Q[action, max_index]

    # Q learning formula
    Q[current_state, action] = R[current_state, action] + gamma * max_value

# Get available actions in the current state
available_act = available_actions(initial_state)

# Sample next action to be performed
action = sample_next_action(available_act)

# Train over 100 iterations (re-iterate the process above)
for i in range(100):
    current_state = np.random.randint(0, int(Q.shape[0]))
    available_act = available_actions(current_state)
    action = sample_next_action(available_act)
    update(current_state, action, gamma)

# Normalize the "trained" Q matrix
print ("Trained Q matrix: \n", Q/np.max(Q)*100)

# Testing
current_state = 2
steps = [current_state]

while current_state != 5:
    next_step_index = np.where(Q[current_state,] == np.max(Q[current_state,]))[0]
    if next_step_index.shape[0] > 1:
        next_step_index = int(np.random.choice(next_step_index))
    else:
        next_step_index = int(next_step_index[0])
    steps.append(next_step_index)
    current_state = next_step_index

# Print selected sequence of steps
print ("Best sequence path: ", steps)
#----output----
Best sequence path:  [2, 3, 1, 5]
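Because Listing 6-17 samples states and actions at random, its result can vary from run to run. As a cross-check, the sketch below applies the same update, `Q[s, a] = R[s, a] + gamma * max(Q[a])`, to every allowed link in repeated sweeps until the values settle. This is a deterministic variant written for illustration, not the book's procedure; the sweep count of 500 and the masking of disallowed moves during the greedy walk are my own choices.

```python
import numpy as np

R = np.array([[-1, -1, -1, -1,  0,  -1],
              [-1, -1, -1,  0, -1, 100],
              [-1, -1, -1,  0, -1,  -1],
              [-1,  0,  0, -1,  0,  -1],
              [ 0, -1, -1,  0, -1, 100],
              [-1,  0, -1, -1,  0, 100]], dtype=float)
gamma = 0.8
goal = 5

Q = np.zeros_like(R)
for _ in range(500):                      # plenty of sweeps for gamma = 0.8
    Q_next = Q.copy()
    for s, a in zip(*np.where(R >= 0)):   # every allowed (state, action) link
        Q_next[s, a] = R[s, a] + gamma * Q[a].max()
    Q = Q_next

print(np.round(Q / Q.max() * 100))

# Greedy walk restricted to allowed links; np.argmax breaks ties by lowest index
state, path = 2, [2]
while state != goal:
    masked = np.where(R[state] >= 0, Q[state], -np.inf)
    state = int(np.argmax(masked))
    path.append(state)
print(path)
```

From state 2 the only move is to state 3, and from there both 1 and 4 lead to the goal in one more step, so `[2, 3, 1, 5]` and `[2, 3, 4, 5]` are equally short routes; which one is reported depends on how ties are broken.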

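Summary

In this chapter, you learned briefly about various topics in deep learning with artificial neural networks, starting from the single perceptron, moving to the multilayer perceptron, and on to more complex forms of deep neural networks such as CNN and RNN. You learned about the issues associated with image data and how researchers have tried to mimic the human brain to build models that can solve complex problems in computer vision and text mining, using convolutional neural networks and recurrent neural networks, respectively. You also learned how autoencoders can be used to compress/decompress data or remove noise from image data. You learned about the widely popular RBM, which can learn the probability distribution of the input data and so enables us to build better models. You learned about transfer learning, which helps us transfer knowledge from one model to another model with a similar task. Finally, we briefly looked at a simple example of reinforcement learning using Q-learning. Congratulations! You have reached the end of your six-step expedition to mastering machine learning.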

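7. Conclusion

I hope you have enjoyed the six-step simplified machine learning (ML) expedition. You started your learning journey with step 1, where you learned the core philosophy and key concepts of the Python 3 programming language. In step 2, you learned about ML history, the high-level categories (supervised/unsupervised/reinforcement learning), three important frameworks for building ML systems (SEMMA, CRISP-DM, and the KDD data mining process), the primary data analysis packages (NumPy, Pandas, Matplotlib) and their key concepts, and a comparison of different core ML libraries. In step 3, you learned about different data types, key data quality issues and how to handle them, exploratory analysis, and the core methods of supervised/unsupervised learning with example implementations. In step 4, you learned various model diagnosis techniques, bagging techniques for overfitting, boosting techniques for underfitting, ensemble techniques, and hyperparameter tuning (grid/random search) for building efficient models. In step 5, you got an overview of the text mining process: data assembly, data preprocessing, data exploration or visualization, and the various models that can be built. You also learned how to build collaborative/content-based recommender systems to personalize the user experience. In step 6, you learned about artificial neural networks through the perceptron, convolutional neural networks (CNNs) for image analytics, recurrent neural networks (RNNs) for text analytics, and a simple toy example for learning the reinforcement learning concept. These are all advanced topics that have seen great development in the last few years.

Overall, you have learned a broad range of commonly used ML topics; each of them comes with a number of parameters to control and tune model performance. To keep things simple throughout the book, I have either used the default parameters or introduced you only to the key parameters (in some places). The creators of the packages have chosen the default options carefully so that they give decent results to get you started. So, to start with, you can go with the default parameters. However, I recommend that you explore the other parameters and experiment with them using manual/grid/random searches to ensure a robust model. Table 7-1 summarizes various possible problem types, example use cases, and the potential ML algorithms you can use. Note that this is a sample list only, not an exhaustive one.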

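Table 7-1

Problem types vs. potential ML algorithms

| Problem type | Example use case(s) | Potential ML algorithms |
| --- | --- | --- |
| Predicting a continuous number | What will the store's daily/weekly sales be? | Linear regression or polynomial regression |
| Predicting a count-type number | How many staff are required for a shift? How many parking spaces are required for a new store? | Generalized linear model with Poisson distribution |
| Predicting the probability of an event (true/false) | What is the probability that a transaction is fraudulent? | Binary classification models (logistic regression, decision tree models, boosting models, kNN, etc.) |
| Predicting the probability of one event out of many possible events (multiclass) | What is the probability that a transaction is high risk/medium risk/low risk? | Multiclass classification models (logistic regression, decision tree models, boosting models, kNN, etc.) |
| Grouping content based on similarity | Group similar customers? Group similar categories? | K-means clustering, hierarchical clustering |
| Dimension reduction | What are the important dimensions that hold the maximum percentage of information? | Principal component analysis (PCA), singular value decomposition (SVD) |
| Topic modeling | Group documents based on topics or thematic structure? | Latent Dirichlet allocation (LDA), non-negative matrix factorization (NMF) |
| Opinion mining | Predict the sentiment associated with a text? | Natural Language Toolkit (NLTK) |
| Recommender systems | What products/items should be marketed to a user? | Content-based filtering, collaborative filtering |
| Text classification | Predict the probability that a document belongs to a known class? | Recurrent neural network (RNN), binary or multiclass classification models |
| Image classification | Predict the probability that an image belongs to a known class. | Convolutional neural network (CNN), binary or multiclass classification models |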

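Tips

Building an efficient model can be a challenging task for a beginner. Now that you have learned which algorithms to use, I would like to offer my two cents: a list of things to keep in mind as you get started with model building.

Start with Questions/Hypotheses, Then Move to Data!

Don't jump into the data before you have formulated the objective you want to achieve with it. It is good practice to start with a list of questions and to work closely with domain experts to understand the core issue and frame the problem statement. This will help you choose the right ML approach (supervised vs. unsupervised) before you move on to understanding the different data sources (Figure 7-1).

Figure 7-1

Questions/hypotheses to data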

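Don't Reinvent the Wheel from Scratch

The ML open source community is very active; many efficient tools are available, and many more are being developed and released. So don't try to reinvent the wheel in terms of solutions, algorithms, or tools unless you need to (Figure 7-2). Try to understand what solutions already exist before venturing into building something from scratch.

Figure 7-2

Don't reinvent the wheel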

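Start with Simple Models

Always start with simple models (such as regressions), because they can be explained easily in layman's terms to any nontechnical person (Figure 7-3). This will help you and the subject matter experts understand the relationships between variables and gain confidence in the model. It will also help you significantly in creating the right features. Move to complex models only if you see a noticeable increase in model performance.

Figure 7-3

Start with a simple model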

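Focus on Feature Engineering

Relevant features lead to efficient models, not more features! Note that including a large number of features might lead to overfitting. Including relevant features in the model is the key to building an efficient model. Remember that feature engineering is often described as an art form and is the key differentiator in competitive ML. The right ingredients mixed in the right quantities are the secret of tasty food; similarly, passing the relevant/right features to the ML algorithm is the secret of an efficient model (Figure 7-4).

Figure 7-4

Feature engineering is an art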

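Beware of Common ML Imposters

Handle the common ML imposters with care: data quality issues (such as missing data, outliers, categorical data, and scaling), imbalanced datasets for classification, overfitting, and underfitting. Use the appropriate techniques for handling data quality issues discussed in Chapter 3 and the techniques discussed in Chapter 4, such as ensemble techniques and hyperparameter tuning, to improve model performance.

Happy Machine Learning

I hope this expedition through machine learning in six simplified steps was worthwhile, and I hope it helps you start a new journey of applying these techniques to real-world problems. I wish you all the best and success in your further endeavors.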

posted @ 2024-10-05 17:13 绝不原创的飞龙
