Introduction to seaborn

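Seaborn is a Python data visualization library built on matplotlib. It provides a high-level interface for drawing attractive statistical graphics. Seaborn wraps matplotlib in a higher-level API, which makes plotting easier: in most cases seaborn alone is enough to produce attractive figures, while matplotlib remains available for figures that need more customization. Seaborn should be seen as a complement to matplotlib, not a replacement. It also works closely with numpy and pandas data structures and with the statistical routines in scipy and statsmodels. Learning seaborn helps us explore data and charts more efficiently and understand them more deeply.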


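  • Built-in themes that build on matplotlib's aesthetics, plus several additional plot types
  • Color palette tools for choosing colors that reveal patterns in your data
  • Functions for plotting and comparing univariate and bivariate distributions across subsets of data
  • Visualization of matrix data, with clustering algorithms to reveal structure
  • Flexible handling of time series data
  • Grid-based tools for building complex multi-panel figures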






There are several ways to draw a scatter plot in seaborn. The most basic, which should be used when both variables are numeric, is the scatterplot() function. In the categorical visualization tutorial, we will see specialized tools for using scatterplots to visualize categorical data. The scatterplot() is the default kind in relplot() (it can also be forced by setting kind="scatter"):

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")
In [2]:
tips = sns.load_dataset("tips")
In [3]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
In [4]:
<matplotlib.collections.PathCollection at 0x7fbbcebdd240>
In [5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fbbcebc0978>

In addition, the same result can be obtained by specifying kind="scatter" in relplot().

In [6]:
sns.relplot(x="total_bill", y="tip", data=tips);


In [7]:
sns.relplot(x="total_bill", y="tip", hue="smoker", data=tips);

To emphasize the difference between the classes, and to improve accessibility, you can use a different marker style for each class:


In [8]:
sns.relplot(x="total_bill", y="tip", hue="smoker", style="size", data=tips);


In [9]:
sns.relplot(x="total_bill", y="tip", hue="smoker", style="time", data=tips);


In [10]:
sns.relplot(x="total_bill", y="tip", hue="size", data=tips);


In [11]:
sns.relplot(x="total_bill", y="tip", size="size", data=tips);


In [12]:
sns.relplot(x="total_bill", y="tip", hue='sex', style = 'time',size="size", data=tips);


In [13]:
fmri = sns.load_dataset("fmri")
fmri
subject timepoint event region signal
0 s13 18 stim parietal -0.017552
1 s5 14 stim parietal -0.080883
2 s12 18 stim parietal -0.081033
3 s11 18 stim parietal -0.046134
4 s10 18 stim parietal -0.037970
5 s9 18 stim parietal -0.103513
6 s8 18 stim parietal -0.064408
7 s7 18 stim parietal -0.060526
8 s6 18 stim parietal -0.007029
9 s5 18 stim parietal -0.040557
10 s4 18 stim parietal -0.048812
11 s3 18 stim parietal -0.047148
12 s2 18 stim parietal -0.086623
13 s1 18 stim parietal -0.046659
14 s0 18 stim parietal -0.075570
15 s13 17 stim parietal -0.008265
16 s12 17 stim parietal -0.088512
17 s7 9 stim parietal 0.058897
18 s10 17 stim parietal -0.016847
19 s9 17 stim parietal -0.121574
20 s8 17 stim parietal -0.076287
21 s7 17 stim parietal -0.043812
22 s6 17 stim parietal -0.014746
23 s5 17 stim parietal -0.056682
24 s4 17 stim parietal -0.044582
25 s3 17 stim parietal -0.053514
26 s2 17 stim parietal -0.077292
27 s1 17 stim parietal -0.038021
28 s0 17 stim parietal -0.071300
29 s13 16 stim parietal -0.002856
... ... ... ... ... ...
1034 s5 13 cue frontal -0.014985
1035 s4 13 cue frontal -0.021514
1036 s3 13 cue frontal -0.047639
1037 s2 13 cue frontal 0.047918
1038 s1 13 cue frontal 0.028379
1039 s0 13 cue frontal -0.021729
1040 s13 12 cue frontal -0.020686
1041 s12 12 cue frontal -0.003034
1042 s11 12 cue frontal 0.055766
1043 s10 12 cue frontal 0.005711
1044 s9 12 cue frontal 0.024292
1045 s7 12 cue frontal -0.014005
1046 s2 7 cue frontal -0.078363
1047 s10 10 cue frontal -0.016124
1048 s8 10 cue frontal -0.015141
1049 s10 8 cue frontal -0.052505
1050 s9 8 cue frontal -0.008729
1051 s8 8 cue frontal 0.007278
1052 s7 8 cue frontal 0.015765
1053 s6 8 cue frontal -0.063961
1054 s5 8 cue frontal -0.028292
1055 s4 8 cue frontal -0.160821
1056 s3 8 cue frontal -0.033848
1057 s2 8 cue frontal -0.069666
1058 s1 8 cue frontal -0.136059
1059 s0 8 cue frontal 0.018165
1060 s13 7 cue frontal -0.029130
1061 s12 7 cue frontal -0.004939
1062 s11 7 cue frontal -0.025367
1063 s0 0 cue parietal -0.006899

1064 rows × 5 columns

In [14]:
sns.relplot(x="timepoint", y="signal", kind="line", data=fmri);


In [15]:
sns.relplot(x="timepoint", y="signal", kind="line", ci="sd", data=fmri);


In [16]:
sns.relplot(x="timepoint", y="signal", estimator=None, kind="line", data=fmri);




In [17]:
sns.relplot(x="timepoint", y="signal", hue="event", kind="line", data=fmri);


In [18]:
sns.relplot(x="timepoint", y="signal", hue="region", style="event", kind="line", data=fmri);


In [19]:
sns.relplot(x="timepoint", y="signal", hue="region", style="event", dashes=False, markers=True, kind="line", data=fmri);


In [20]:
dots = sns.load_dataset("dots").query("align == 'dots'")
dots.head()
align choice time coherence firing_rate
0 dots T1 -80 0.0 33.189967
1 dots T1 -80 3.2 31.691726
2 dots T1 -80 6.4 34.279840
3 dots T1 -80 12.8 32.631874
4 dots T1 -80 25.6 35.060487


In [21]:
sns.relplot(x="time", y="firing_rate", hue="coherence", style="choice", kind="line", data=dots);


In [22]:
palette = sns.cubehelix_palette(light=.8, n_colors=6)
sns.relplot(x="time", y="firing_rate", hue="coherence", size="choice", palette=palette, kind="line", data=dots);



In [23]:
sns.relplot(x="total_bill", y="tip", hue="smoker", col="time", data=tips);


In [24]:
sns.relplot(x="timepoint", y="signal", hue="subject", col="region", row="event", height=4, kind="line", estimator=None, data=fmri);


In [25]:
sns.relplot(x="timepoint", y="signal", hue="event", style="event", col="subject", col_wrap=3, height=3, aspect=.75, linewidth=2.5, kind="line", data=fmri.query("region == 'frontal'"));

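The visualizations above are known as “lattice” plots or “small multiples”. They are especially valuable when studying large datasets, because they lay out complex data according to a consistent pattern that the eye can pick up. Keep in mind that a simple plot is sometimes more effective than a complex one for finding and solving problems.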


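In the previous section we learned how to use relplot() to describe relationships between multiple variables in a dataset, focusing mainly on the relationship between two numeric variables. In this section we go a step further and discuss how to plot categorical variables.

Seaborn offers many ways to visualize categorical variables. Just as relplot() relates to scatterplot() and lineplot(), there is a catplot() function that gives higher-level access to a family of functions such as swarmplot(), boxplot(), and violinplot().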


  • Categorical scatterplots:

    • stripplot() (with kind="strip"; the default)
    • swarmplot() (with kind="swarm")
  • Categorical distribution plots:

    • boxplot() (with kind="box")
    • violinplot() (with kind="violin")
    • boxenplot() (with kind="boxen")
  • Categorical estimate plots:
    • pointplot() (with kind="point")
    • barplot() (with kind="bar")
    • countplot() (with kind="count")




In [26]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="ticks", color_codes=True)


The default representation of the data in catplot() uses a scatterplot. There are actually two different categorical scatter plots in seaborn. They take different approaches to resolving the main challenge in representing categorical data with a scatter plot, which is that all of the points belonging to one category would fall on the same position along the axis corresponding to the categorical variable. The approach used by stripplot(), which is the default “kind” in catplot() is to adjust the positions of points on the categorical axis with a small amount of random “jitter”:


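Method 1: use stripplot(), which adds a small random offset to each point along the categorical axis so that points do not overlap too much. stripplot() is the default kind in catplot().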
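To make the idea of jitter concrete, here is a minimal sketch using numpy alone. It is not seaborn's actual implementation; the category positions, the jitter width of 0.1, and the variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: three categories drawn at integer positions 0, 1, 2,
# with 50 observations in each category.
positions = np.repeat([0, 1, 2], 50)

# Add a uniform random offset of at most +/- jitter_width to each position.
jitter_width = 0.1
jittered = positions + rng.uniform(-jitter_width, jitter_width, size=positions.size)

# Before jittering all points in a category share one coordinate;
# afterwards they are spread over a narrow band around it.
print(np.unique(positions).size, np.unique(jittered).size)
```

Each point stays within 0.1 of its category position, so categories remain clearly separated while individual points become distinguishable.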

In [27]:
tips = sns.load_dataset("tips")
tips.head()
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
In [28]:
sns.catplot(x="day", y="total_bill", data=tips);


In [29]:
sns.catplot(x="day", y="total_bill", jitter=0.2, data=tips);

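Method 2: use swarmplot(), which uses an algorithm to spread points out along the categorical axis so that they do not overlap, giving a better picture of the distribution. It works especially well for small datasets. To use it, set kind="swarm" in catplot().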

In [30]:
sns.catplot(x="day", y="total_bill", kind="swarm", data=tips);


In [31]:
sns.catplot(x="day", y="total_bill", hue="sex",kind="swarm", data=tips);


In [32]:
sns.catplot(x="size", y="total_bill", kind="swarm", data=tips.query("size != 3"));


In [33]:
sns.catplot(x="smoker", y="tip", order=["Yes", "No"], data=tips);


In [34]:
sns.catplot(x="day", y="total_bill", hue="time", kind="swarm", data=tips);
In [35]:
# sns.catplot(x="sex", y="day", hue="time", kind="swarm", data=tips);




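The simplest of the distribution-based plots is the box plot. The theory behind box plots was covered in earlier lecture notes and is not repeated here. Drawing one is simple: just set kind="box".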
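As a quick reminder of what a box plot summarizes, the sketch below computes the quartiles, whiskers, and outliers by hand with numpy. It assumes the common convention that whiskers reach the most extreme observations within 1.5 × IQR of the box; the sample values are made up for illustration.

```python
import numpy as np

# Hypothetical sample with one obvious outlier (40.0).
values = np.array([10.0, 12.0, 13.0, 15.0, 16.0, 18.0, 21.0, 40.0])

# The box spans the first to third quartile, with a line at the median.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Whiskers extend to the most extreme points within 1.5 * IQR of the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything beyond the fences is drawn as an individual outlier point.
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(q1, median, q3, whisker_low, whisker_high, outliers)
```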

In [36]:
sns.catplot(x="day", y="total_bill", kind="box", data=tips);


In [37]:
sns.catplot(x="day", y="total_bill", hue="smoker", kind="box", data=tips);


In [38]:
tips["weekend"] = tips["day"].isin(["Sat", "Sun"]) #tips
In [39]:
sns.catplot(x="day", y="total_bill", hue="weekend", kind="box", dodge=False, data=tips);


In [40]:
diamonds = sns.load_dataset("diamonds")
diamonds
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
5 0.24 Very Good J VVS2 62.8 57.0 336 3.94 3.96 2.48
6 0.24 Very Good I VVS1 62.3 57.0 336 3.95 3.98 2.47
7 0.26 Very Good H SI1 61.9 55.0 337 4.07 4.11 2.53
8 0.22 Fair E VS2 65.1 61.0 337 3.87 3.78 2.49
9 0.23 Very Good H VS1 59.4 61.0 338 4.00 4.05 2.39
10 0.30 Good J SI1 64.0 55.0 339 4.25 4.28 2.73
11 0.23 Ideal J VS1 62.8 56.0 340 3.93 3.90 2.46
12 0.22 Premium F SI1 60.4 61.0 342 3.88 3.84 2.33
13 0.31 Ideal J SI2 62.2 54.0 344 4.35 4.37 2.71
14 0.20 Premium E SI2 60.2 62.0 345 3.79 3.75 2.27
15 0.32 Premium E I1 60.9 58.0 345 4.38 4.42 2.68
16 0.30 Ideal I SI2 62.0 54.0 348 4.31 4.34 2.68
17 0.30 Good J SI1 63.4 54.0 351 4.23 4.29 2.70
18 0.30 Good J SI1 63.8 56.0 351 4.23 4.26 2.71
19 0.30 Very Good J SI1 62.7 59.0 351 4.21 4.27 2.66
20 0.30 Good I SI2 63.3 56.0 351 4.26 4.30 2.71
21 0.23 Very Good E VS2 63.8 55.0 352 3.85 3.92 2.48
22 0.23 Very Good H VS1 61.0 57.0 353 3.94 3.96 2.41
23 0.31 Very Good J SI1 59.4 62.0 353 4.39 4.43 2.62
24 0.31 Very Good J SI1 58.1 62.0 353 4.44 4.47 2.59
25 0.23 Very Good G VVS2 60.4 58.0 354 3.97 4.01 2.41
26 0.24 Premium I VS1 62.5 57.0 355 3.97 3.94 2.47
27 0.30 Very Good J VS2 62.2 57.0 357 4.28 4.30 2.67
28 0.23 Very Good D VS2 60.5 61.0 357 3.96 3.97 2.40
29 0.23 Very Good F VS1 60.9 57.0 357 3.96 3.99 2.42
... ... ... ... ... ... ... ... ... ... ...
53910 0.70 Premium E SI1 60.5 58.0 2753 5.74 5.77 3.48
53911 0.57 Premium E IF 59.8 60.0 2753 5.43 5.38 3.23
53912 0.61 Premium F VVS1 61.8 59.0 2753 5.48 5.40 3.36
53913 0.80 Good G VS2 64.2 58.0 2753 5.84 5.81 3.74
53914 0.84 Good I VS1 63.7 59.0 2753 5.94 5.90 3.77
53915 0.77 Ideal E SI2 62.1 56.0 2753 5.84 5.86 3.63
53916 0.74 Good D SI1 63.1 59.0 2753 5.71 5.74 3.61
53917 0.90 Very Good J SI1 63.2 60.0 2753 6.12 6.09 3.86
53918 0.76 Premium I VS1 59.3 62.0 2753 5.93 5.85 3.49
53919 0.76 Ideal I VVS1 62.2 55.0 2753 5.89 5.87 3.66
53920 0.70 Very Good E VS2 62.4 60.0 2755 5.57 5.61 3.49
53921 0.70 Very Good E VS2 62.8 60.0 2755 5.59 5.65 3.53
53922 0.70 Very Good D VS1 63.1 59.0 2755 5.67 5.58 3.55
53923 0.73 Ideal I VS2 61.3 56.0 2756 5.80 5.84 3.57
53924 0.73 Ideal I VS2 61.6 55.0 2756 5.82 5.84 3.59
53925 0.79 Ideal I SI1 61.6 56.0 2756 5.95 5.97 3.67
53926 0.71 Ideal E SI1 61.9 56.0 2756 5.71 5.73 3.54
53927 0.79 Good F SI1 58.1 59.0 2756 6.06 6.13 3.54
53928 0.79 Premium E SI2 61.4 58.0 2756 6.03 5.96 3.68
53929 0.71 Ideal G VS1 61.4 56.0 2756 5.76 5.73 3.53
53930 0.71 Premium E SI1 60.5 55.0 2756 5.79 5.74 3.49
53931 0.71 Premium F SI1 59.8 62.0 2756 5.74 5.73 3.43
53932 0.70 Very Good E VS2 60.5 59.0 2757 5.71 5.76 3.47
53933 0.70 Very Good E VS2 61.2 59.0 2757 5.69 5.72 3.49
53934 0.72 Premium D SI1 62.7 59.0 2757 5.69 5.73 3.58
53935 0.72 Ideal D SI1 60.8 57.0 2757 5.75 5.76 3.50
53936 0.72 Good D SI1 63.1 55.0 2757 5.69 5.75 3.61
53937 0.70 Very Good D SI1 62.8 60.0 2757 5.66 5.68 3.56
53938 0.86 Premium H SI2 61.0 58.0 2757 6.15 6.12 3.74
53939 0.75 Ideal D SI2 62.2 55.0 2757 5.83 5.87 3.64

53940 rows × 10 columns

In [41]:
diamonds = sns.load_dataset("diamonds")
sns.catplot(x="color", y="price", kind="boxen", data=diamonds.sort_values("color"));



In [42]:
sns.catplot(x="total_bill", y="day", hue="time", kind="violin", data=tips);

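The violin plot uses kernel density estimation (KDE) and therefore gives a richer description of how the data are distributed. On the other hand, because a KDE is involved, there are more parameters to tune, such as bw and cut.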

In [43]:
sns.catplot(x="total_bill", y="day", hue="time", kind="violin", bw=0.5, cut=0, data=tips);


In [44]:
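f, ax = plt.subplots(figsize=(20, 5))
# catplot() is a figure-level function and does not draw onto an existing Axes,
# so the axes-level violinplot() is used here instead.
sns.violinplot(x="day", y="total_bill", hue="sex", split=True, data=tips, ax=ax);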


In [45]:
sns.catplot(x="day", y="total_bill", hue="sex", kind="violin", inner="stick", split=True, palette="Set1", data=tips);


In [46]:
g = sns.catplot(x="day", y="total_bill", kind="violin", inner=None, data=tips)
sns.swarmplot(x="day", y="total_bill", color="k", size=3, data=tips, ax=g.ax);
In [47]:
g = sns.catplot(x="day", y="total_bill", kind="box", data=tips)
sns.swarmplot(x="day", y="total_bill", color="k", size=3, data=tips, ax=g.ax);



Bar plots

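The most familiar way to visualize categorical data is the bar plot. In seaborn, barplot() applies an estimator to the data in each category (the mean by default). When a category contains multiple observations, it also uses bootstrapping to compute a confidence interval around the estimate and draws it as an error bar.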
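The bootstrap behind those error bars can be sketched in a few lines of numpy. This is a simplified illustration of a percentile bootstrap for the mean, not seaborn's internal code; the simulated sample, the 1000 resamples, and the 95% level are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical observations belonging to a single category.
sample = rng.normal(loc=20.0, scale=5.0, size=60)

# Resample with replacement many times and record the mean of each resample.
n_boot = 1000
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(n_boot)
])

# Percentile interval: the middle 95% of the bootstrap means.
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean={sample.mean():.2f}, 95% CI=({ci_low:.2f}, {ci_high:.2f})")
```

The bar height corresponds to `sample.mean()` and the error bar to the interval from `ci_low` to `ci_high`.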

In [48]:
titanic = sns.load_dataset("titanic")
titanic.head()
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True
In [49]:
titanic = sns.load_dataset("titanic")
sns.catplot(x="sex", y="survived", hue="class", kind="bar", data=titanic);


In [50]:
sns.catplot(x="deck", kind="count", palette="rocket", data=titanic);


In [51]:
sns.catplot(y="deck", hue="class", kind="count", palette="pastel", edgecolor=".6", data=titanic);

Point plots


In [52]:
sns.catplot(x="sex", y="survived", hue="class", kind="point", data=titanic);


In [53]:
sns.catplot(x="class", y="survived", hue="sex", palette={"male": "g", "female": "m"}, markers=["^", "o"], linestyles=["-", "--"], kind="point", data=titanic);


Besides the long-form input used above, these functions also accept “wide-form” data, such as pandas DataFrames and two-dimensional numpy arrays.

In [54]:
iris = sns.load_dataset("iris")
iris.head()
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
In [55]:
iris = sns.load_dataset("iris")
sns.catplot(data=iris, orient="h", kind="box");
In [56]:
sns.violinplot(x=iris.species, y=iris.sepal_length);


In [57]:
f, ax = plt.subplots(figsize=(10, 5))
sns.countplot(y="deck", data=titanic, color="m");



In [58]:
sns.catplot(x="time", y="total_bill", hue="smoker", col="day", aspect=.6, kind="swarm", data=tips);


In [59]:
g = sns.catplot(x="fare", y="survived", row="class", kind="box", orient="h", height=1.5, aspect=4, data=titanic.query("fare > 0"))
g.set(xscale="log");



In [60]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats



In [62]:
x = np.random.normal(size=100)
sns.distplot(x);



In [63]:
sns.distplot(x, kde=True, rug=True);


In [64]:
sns.distplot(x, bins=5, kde=False, rug=True);
In [65]:
sns.distplot(x, bins=25, kde=False, rug=True);

Kernel density estimation


In [66]:
sns.distplot(x, hist=False, rug=True,kde = True);

So how do we obtain such a curve? Each data point is replaced by a Gaussian curve centered on it, and these Gaussian curves are then summed.

In [67]:
x = np.random.normal(0, 1, size=30)  # 30 samples from a normal distribution with mean 0 and scale 1
bandwidth = 1.06 * x.std() * x.size ** (-1 / 5.)  # choose the bandwidth
support = np.linspace(-4, 4, 200)

kernels = []
for x_i in x:
    # one Gaussian curve centered on each observation
    kernel = stats.norm(x_i, bandwidth).pdf(support)
    kernels.append(kernel)
    plt.plot(support, kernel, color="r")

sns.rugplot(x, color=".2", linewidth=3);


In [68]:
from scipy.integrate import trapz
density = np.sum(kernels, axis=0)
density /= trapz(density, support)  # normalize by the area under the curve (trapezoidal rule)
plt.plot(support, density);

As we can see, seaborn's kdeplot() produces the same curve, and so does distplot(kde=True).

In [69]:
sns.kdeplot(x, shade=True);

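Besides the kernel function, the other parameter that affects a KDE is the bandwidth (h). The bandwidth controls how smooth the KDE curve is, that is, how much weight each individual observation carries in shaping the curve: the larger the bandwidth, the less each observation matters and the flatter the curve; the smaller the bandwidth, the more each observation matters and the more sharply peaked the curve.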
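The effect of the bandwidth can be checked numerically with a hand-written Gaussian KDE. This is a sketch under stated assumptions: `gaussian_kde_curve` is a hypothetical helper, `bandwidth` is taken to be the standard deviation of each Gaussian bump, and how seaborn's own bw parameter maps onto that value may differ between versions.

```python
import numpy as np

def gaussian_kde_curve(data, support, bandwidth):
    """Average of Gaussian bumps of width `bandwidth` centered on each data point."""
    z = (support[:, None] - data[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

rng = np.random.default_rng(1)
data = rng.normal(size=30)
support = np.linspace(-6, 6, 400)

narrow = gaussian_kde_curve(data, support, bandwidth=0.2)
wide = gaussian_kde_curve(data, support, bandwidth=2.0)

# The wide-bandwidth curve is flatter, so its highest point is lower.
print(narrow.max(), wide.max())
```

Because each bump integrates to one and the bumps are averaged, the resulting curve is already a density, so no separate normalization step is needed here.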

In [70]:
sns.kdeplot(x)
sns.kdeplot(x, bw=.2, label="bw: 0.2")
sns.kdeplot(x, bw=2, label="bw: 2")
plt.legend();


In [71]:
sns.kdeplot(x, shade=True, cut=4)
sns.rugplot(x);



In [72]:
x = np.random.gamma(6, size=200)
sns.distplot(x, kde=False, fit=stats.gamma);  # fit a gamma distribution and plot it



In [73]:
mean = [0, 1]
cov = [(1, .5), (.5, 1)]
data = np.random.multivariate_normal(mean, cov, 200)
df = pd.DataFrame(data, columns=["x", "y"])
In [74]:
x y
0 1.819591 1.557201
1 -0.136995 0.814663
2 -0.487868 1.262799
3 -0.773655 -0.177352
4 1.311222 1.988374



In [75]:
sns.jointplot(x="x", y="y", data=df);

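Hexbin plots

The bivariate analogue of a histogram is the hexbin plot, which counts the observations that fall within each hexagonal bin and encodes the count with color. It works particularly well for large datasets.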

In [76]:
x, y = np.random.multivariate_normal(mean, cov, 1000).T
sns.jointplot(x=x, y=y, kind="hex", color="k");
# with sns.axes_style("white"):
#     sns.jointplot(x=x, y=y, kind="hex", color="k");


In [77]:
x, y = np.random.multivariate_normal(mean, cov, 1000).T
with sns.axes_style("white"):
    sns.jointplot(x=x, y=y, kind="reg", color="k");


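As in the one-dimensional case, we can also perform kernel density estimation in two dimensions. Setting kind="kde" gives a contour plot of the joint density along with a KDE curve for each variable.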

In [78]:
sns.jointplot(x="x", y="y", data=df, kind="kde");
In [79]:
with sns.axes_style("white"):
    sns.jointplot(x="x", y="y", data=df, kind="kde");


In [80]:
f, ax = plt.subplots(figsize=(6, 6))
sns.kdeplot(df.x, df.y, ax=ax)
sns.rugplot(df.x, color="g", ax=ax)
sns.rugplot(df.y, vertical=True, ax=ax);

If you wish to show the bivariate density more continuously, you can simply increase the number of contour levels:

In [81]:
f, ax = plt.subplots(figsize=(6, 6))
cmap = sns.cubehelix_palette(as_cmap=True, dark=0, light=1, reverse=True)
sns.kdeplot(df.x, df.y, cmap=cmap, n_levels=509, shade=True);


In [82]:
g = sns.jointplot(x="x", y="y", data=df, kind="kde", color="m")
g.plot_joint(plt.scatter, c="w", s=30, linewidth=1, marker="+")
g.ax_joint.collections[0].set_alpha(0.5)
g.set_axis_labels("$X$", "$Y$");



In [83]:
iris = sns.load_dataset("iris")
sns.pairplot(iris);
In [84]:
import matplotlib.pyplot as plt
import numpy as np

x = [1, 2, 3, 4, 5]
y1 = [3, 1, 5, 9, 4]
y4 = [4, 2, 1, 3, 9]
barWidth = 0.2
# color 'c' is cyan (a greenish blue)
plt.bar(x, y1, barWidth, align='center', color='c', tick_label=['label1', 'label2', 'label3', 'label4', 'label5'])
plt.bar(np.array(x) + barWidth, y4, barWidth, align='center', bottom=np.add(y1, y4), color='g', tick_label=['label1', 'label2', 'label3', 'label4', 'label5'])
<BarContainer object of 5 artists>




In [85]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
In [87]:
tips = sns.load_dataset("tips")




In [88]:
sns.regplot(x="total_bill", y="tip", data=tips);
In [89]:
sns.lmplot(x="total_bill", y="tip", data=tips);



  • regplot() accepts data in more forms, such as numpy arrays, pandas Series, or references to variables in a pandas DataFrame, whereas lmplot() only accepts references to variables in a pandas DataFrame; in other words, it requires “tidy” data
  • regplot() supports only a subset of lmplot()'s parameters


In [90]:
sns.lmplot(x="size", y="tip", data=tips);


In [91]:
sns.lmplot(x="size", y="tip", data=tips, x_jitter=.1);


In [92]:
sns.lmplot(x="size", y="tip", data=tips, x_estimator=np.mean);



Here we use Anscombe's quartet, a dataset in which differently shaped data yield the same regression equation even though the quality of the fit differs.
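The claim that different data give the same regression line can be verified directly. The sketch below fits an ordinary least-squares line with numpy to datasets I and II of the quartet; the values are typed in by hand from the published quartet, so no download is needed.

```python
import numpy as np

# Anscombe's quartet, datasets I and II (they share the same x values).
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

# Ordinary least-squares straight line for each dataset.
slope1, intercept1 = np.polyfit(x, y1, 1)
slope2, intercept2 = np.polyfit(x, y2, 1)

print(f"I:  y = {intercept1:.2f} + {slope1:.2f} x")
print(f"II: y = {intercept2:.2f} + {slope2:.2f} x")
```

Both fits come out at roughly y = 3 + 0.5x, even though dataset I is approximately linear and dataset II follows a curve.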


In [93]:
anscombe = sns.load_dataset("anscombe")
In [94]:
sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'I'"), ci=None, scatter_kws={"s": 80});


In [95]:
sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'II'"), ci=None, scatter_kws={"s": 80});


In [96]:
sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'II'"), order=129, ci=None, scatter_kws={"s": 80});
/opt/conda/lib/python3.6/site-packages/seaborn/regression.py:237: RankWarning: Polyfit may be poorly conditioned return np.polyval(np.polyfit(_x, _y, order), grid)


In [97]:
sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'III'"), ci=None, scatter_kws={"s": 80});


In [98]:
sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'III'"), robust=True, ci=None, scatter_kws={"s": 80});


In [99]:
tips["big_tip"] = (tips.tip / tips.total_bill) > .15
sns.lmplot(x="total_bill", y="big_tip", data=tips, y_jitter=.03);


In [100]:
sns.lmplot(x="total_bill", y="big_tip", data=tips, logistic=True, y_jitter=.03);

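Note that logistic regression and robust regression are computationally more expensive than simple linear regression, and the confidence interval is computed with a bootstrap. To speed things up, you can turn the bootstrap off by passing ci=None.

Other ways of fitting the data include locally weighted scatterplot smoothing (LOWESS), a nonparametric method. The main idea of LOWESS is to take a fraction of the data around each point and fit a polynomial regression to that subset, which lets us see local patterns and trends in the data.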

In [101]:
sns.lmplot(x="total_bill", y="tip", data=tips, lowess=True);

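residplot() lets us check whether simple linear regression is an adequate fit for a dataset. Ideally, the residuals of a simple linear regression should be scattered randomly around y = 0.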
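What a residual plot displays can be reproduced numerically. The sketch below fits a straight line to dataset II of Anscombe's quartet (values typed in by hand) and prints the residuals in order of x; it illustrates the idea only and does not use residplot() itself.

```python
import numpy as np

# Anscombe's quartet, dataset II: the points follow a curve, not a line.
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

# Fit a straight line and compute residual = observed - predicted.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Sort by x so that any systematic pattern becomes visible.
order = np.argsort(x)
for xi, ri in zip(x[order], residuals[order]):
    print(f"x={xi:4.0f}  residual={ri:+.2f}")
```

The residuals are negative at both ends of the x range and positive in the middle. That arch shape, rather than random scatter around zero, signals that a straight line is the wrong model for this dataset.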

In [102]:
sns.residplot(x="x", y="y", data=anscombe.query("dataset == 'I'"), scatter_kws={"s": 80});


In [103]:
sns.residplot(x="x", y="y", data=anscombe.query("dataset == 'II'"), scatter_kws={"s": 80});



This is where regplot() and lmplot() differ. regplot() can only show the relationship between two variables, while lmplot() can additionally bring in a third factor (a categorical variable).


In [104]:
sns.lmplot(x="total_bill", y="tip", hue="smoker", data=tips);


In [105]:
sns.lmplot(x="total_bill", y="tip", hue="smoker", data=tips, markers=["o", "x"], palette="Set1");

To add another variable, you can draw multiple “facets”, with each level of the variable appearing in the rows or columns of the grid:


In [106]:
sns.lmplot(x="total_bill", y="tip", hue="smoker", col="time", data=tips);
In [107]:
sns.lmplot(x="total_bill", y="tip", hue="smoker", col="time", row="sex", data=tips);




In [108]:
f, ax = plt.subplots(figsize=(5, 6))
sns.regplot(x="total_bill", y="tip", data=tips, ax=ax);


In [109]:
sns.lmplot(x="total_bill", y="tip", col="day", data=tips, col_wrap=2, height=3);
In [110]:
sns.lmplot(x="total_bill", y="tip", col="day", data=tips, aspect=.5);


Several other seaborn functions support linear regression at a higher level. For example, in jointplot(), covered earlier, passing kind="reg" draws the linear regression of the data.

In [111]:
sns.jointplot(x="total_bill", y="tip", data=tips, kind="reg");


In [112]:
sns.pairplot(tips, x_vars=["total_bill", "size"], y_vars=["tip"], height=5, aspect=.8, kind="reg");

Going further, combining hue with kind="reg" in pairplot() lets us study linear relationships in higher-dimensional data.

In [113]:
sns.pairplot(tips, x_vars=["total_bill", "size"], y_vars=["tip"], hue="smoker", height=5, aspect=.8, kind="reg");


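Note, however, that to use seaborn's grid-based plotting tools, the data must be a pandas DataFrame. It must also be “tidy” data: each column is a variable (feature) and each row is an observation (sample). Data that do not follow this layout cannot be used with seaborn's grid-based plotting tools.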
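If your data are in a wide layout, pandas can reshape them into tidy form. The following is a small sketch with a made-up table; the column names `subject`, `mon`, `tue`, `day`, and `score` are hypothetical.

```python
import pandas as pd

# Hypothetical wide-form table: one column per day of measurement.
wide = pd.DataFrame({
    "subject": ["a", "b", "c"],
    "mon": [1.0, 2.0, 3.0],
    "tue": [1.5, 2.5, 3.5],
})

# Long ("tidy") form: one row per observation, one column per variable.
tidy = wide.melt(id_vars="subject", var_name="day", value_name="score")
print(tidy)
```

After reshaping, `day` can serve as a faceting or hue variable and `score` as the plotted value.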

In [114]:
import seaborn as sns
import matplotlib.pyplot as plt

Grid-based plotting with FacetGrid



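In fact, relplot(), catplot(), and lmplot() all use the grid machinery described here internally, so what you learn in this section can also be used to modify the output of those three functions.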

In [116]:
tips = sns.load_dataset("tips")

Initializing the grid like this sets up the matplotlib figure and axes, but doesn’t draw anything on them.



  • the names of the variables to plot
  • the plotting function


In [117]:
g = sns.FacetGrid(tips, col="time")
g.map(plt.hist, "tip");


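As another example, suppose we want to study how sex, smoking status, and total bill relate to the tip. The code below does this. It also shows that keyword arguments such as alpha=.7 can be passed to FacetGrid.map(); they are forwarded to the plotting function, which here is plt.scatter.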

In [118]:
g = sns.FacetGrid(tips, col="sex", hue="smoker")
g.map(plt.scatter, "total_bill", "tip", alpha=.7)
g.add_legend();


In [119]:
g = sns.FacetGrid(tips, row="smoker", col="time", margin_titles=True)
g.map(sns.regplot, "size", "total_bill", color=".3", fit_reg=False, x_jitter=.1);
<div class="alert alert-block alert-danger"> <b>Note:</b> margin_titles is not yet formally supported by the matplotlib API, so it may raise errors in some cases. If it does, simply turn the option off. </div>


In [120]:
g = sns.FacetGrid(tips, col="day", height=4, aspect=.5)
g.map(sns.barplot, "sex", "total_bill");
/opt/conda/lib/python3.6/site-packages/seaborn/axisgrid.py:715: UserWarning: Using the barplot function without specifying `order` is likely to produce an incorrect plot. warnings.warn(warning)


In [121]:
ordered_days = ['Sat', 'Sun', 'Thur', 'Fri']
g = sns.FacetGrid(tips, row="day", row_order=ordered_days, height=1.7, aspect=4)
g.map(sns.distplot, "total_bill", hist=True, rug=True);


In [122]:
pal = dict(Lunch="seagreen", Dinner="gray")
g = sns.FacetGrid(tips, hue="time", palette=pal, height=5)
g.map(plt.scatter, "total_bill", "tip", s=50, alpha=.7, linewidth=.5, edgecolor="white")
g.add_legend();


In [123]:
g = sns.FacetGrid(tips, hue="sex", palette="Set1", height=5, hue_kws={"marker": ["o", "v"]})
g.map(plt.scatter, "total_bill", "tip", s=100, linewidth=.5, edgecolor="white")
g.add_legend();


In [124]:
attend = sns.load_dataset("attention").query("subject <= 12")
g = sns.FacetGrid(attend, col="subject", height=2, ylim=(0, 10))
g.map(sns.pointplot, "solutions", "score", color=".3", ci=None);
/opt/conda/lib/python3.6/site-packages/seaborn/axisgrid.py:715: UserWarning: Using the pointplot function without specifying `order` is likely to produce an incorrect plot. warnings.warn(warning)


In [125]:
attend = sns.load_dataset("attention").query("subject <= 12")
g = sns.FacetGrid(attend, col="subject", col_wrap=4, height=2, ylim=(0, 10))
g.map(sns.pointplot, "solutions", "score", color=".3", ci=None);
/opt/conda/lib/python3.6/site-packages/seaborn/axisgrid.py:715: UserWarning: Using the pointplot function without specifying `order` is likely to produce an incorrect plot. warnings.warn(warning)


In [126]:
with sns.axes_style("white"):
    g = sns.FacetGrid(tips, row="sex", col="smoker", margin_titles=True, height=2.5)
g.map(plt.scatter, "total_bill", "tip", color="#334488", edgecolor="white", lw=.5);
g.set_axis_labels("Total bill (US Dollars)", "Tip");
g.set(xticks=[10, 30, 50], yticks=[2, 6, 10]);
g.fig.subplots_adjust(wspace=.02, hspace=.02);

For even more customization, you can work directly with the underlying matplotlib Figure and Axes objects, which are stored as member attributes at fig and axes (a two-dimensional array), respectively. When making a figure without row or column faceting, you can also use the ax attribute to directly access the single axes.

In [127]:
g = sns.FacetGrid(tips, col="smoker", margin_titles=True, height=4)
g.map(plt.scatter, "total_bill", "tip", color="#338844", edgecolor="white", s=50, lw=1)
for ax in g.axes.flat:
    ax.plot((0, 50), (0, .2 * 50), c=".2", ls="--")
g.set(xlim=(0, 60), ylim=(0, 14));

Using custom functions

You’re not limited to existing matplotlib and seaborn functions when using FacetGrid. However, to work properly, any function you use must follow a few rules:

  • It must plot onto the “currently active” matplotlib Axes. This will be true of functions in the matplotlib.pyplot namespace, and you can call plt.gca to get a reference to the current Axes if you want to work directly with its methods.
  • It must accept the data that it plots in positional arguments. Internally, FacetGrid will pass a Series of data for each of the named positional arguments passed to FacetGrid.map().
  • It must be able to accept color and label keyword arguments, and, ideally, it will do something useful with them. In most cases, it’s easiest to catch a generic dictionary of **kwargs and pass it along to the underlying plotting function.

Let’s look at a minimal example of a function you can plot with. This function will just take a single vector of data for each facet:

In [128]:
from scipy import stats

def quantile_plot(x, **kwargs):
    qntls, xr = stats.probplot(x, fit=False)
    plt.scatter(xr, qntls, **kwargs)

g = sns.FacetGrid(tips, col="sex", height=4)
g.map(quantile_plot, "total_bill");


If you want to make a bivariate plot, you should write the function so that it accepts the x-axis variable first and the y-axis variable second:

In [129]:
def qqplot(x, y, **kwargs):
    _, xr = stats.probplot(x, fit=False)
    _, yr = stats.probplot(y, fit=False)
    plt.scatter(xr, yr, **kwargs)

g = sns.FacetGrid(tips, col="smoker", height=4)
g.map(qqplot, "total_bill", "tip");


Because plt.scatter accepts color and label keyword arguments and does the right thing with them, we can add a hue facet without any difficulty:

In [130]:
g = sns.FacetGrid(tips, hue="time", col="sex", height=4)
g.map(qqplot, "total_bill", "tip")
g.add_legend();

This approach also lets us use additional aesthetics to distinguish the levels of the hue variable, along with keyword arguments that won’t be dependent on the faceting variables:

In [131]:
g = sns.FacetGrid(tips, hue="time", col="sex", height=4, hue_kws={"marker": ["s", "D"]})
g.map(qqplot, "total_bill", "tip", s=40, edgecolor="w")
g.add_legend();

Sometimes, though, you’ll want to map a function that doesn’t work the way you expect with the color and label keyword arguments. In this case, you’ll want to explicitly catch them and handle them in the logic of your custom function. For example, this approach will allow us to map plt.hexbin, which otherwise does not play well with the FacetGrid API:

In [132]:
def hexbin(x, y, color, **kwargs):
    cmap = sns.light_palette(color, as_cmap=True)
    plt.hexbin(x, y, gridsize=15, cmap=cmap, **kwargs)

with sns.axes_style("dark"):
    g = sns.FacetGrid(tips, hue="time", col="time", height=4)
g.map(hexbin, "total_bill", "tip", extent=[0, 50, 0, 10]);


Plotting pairwise data relationships

PairGrid also allows you to quickly draw a grid of small subplots using the same plot type to visualize data in each. In a PairGrid, each row and column is assigned to a different variable, so the resulting plot shows each pairwise relationship in the dataset. This style of plot is sometimes called a “scatterplot matrix”, as this is the most common way to show each relationship, but PairGrid is not limited to scatterplots.

It’s important to understand the differences between a FacetGrid and a PairGrid. In the former, each facet shows the same relationship conditioned on different levels of other variables. In the latter, each plot shows a different relationship (although the upper and lower triangles will have mirrored plots). Using PairGrid can give you a very quick, very high-level summary of interesting relationships in your dataset.

The basic usage of the class is very similar to FacetGrid. First you initialize the grid, then you pass a plotting function to a map method and it will be called on each subplot. There is also a companion function, pairplot(), that trades off some flexibility for faster plotting.

In [133]:
iris = sns.load_dataset("iris")
g = sns.PairGrid(iris)
g.map(plt.scatter);

It’s possible to plot a different function on the diagonal to show the univariate distribution of the variable in each column. Note that the axis ticks won’t correspond to the count or density axis of this plot, though.

In [134]:
g = sns.PairGrid(iris)
g.map_diag(plt.hist)
g.map_offdiag(plt.scatter);

A very common way to use this plot colors the observations by a separate categorical variable. For example, the iris dataset has four measurements for each of three different species of iris flowers so you can see how they differ.

In [135]:
g = sns.PairGrid(iris, hue="species")
g.map_diag(plt.hist)
g.map_offdiag(plt.scatter)
g.add_legend();

By default every numeric column in the dataset is used, but you can focus on particular relationships if you want.

In [136]:
g = sns.PairGrid(iris, vars=["sepal_length", "sepal_width"], hue="species")
g.map(plt.scatter);

It’s also possible to use a different function in the upper and lower triangles to emphasize different aspects of the relationship.

In [137]:
g = sns.PairGrid(iris)
g.map_upper(plt.scatter)
g.map_lower(sns.kdeplot)
g.map_diag(sns.kdeplot, lw=3, legend=False);

The square grid with identity relationships on the diagonal is actually just a special case, and you can plot with different variables in the rows and columns.

In [138]:
g = sns.PairGrid(tips, y_vars=["tip"], x_vars=["total_bill", "size"], height=4)
g.map(sns.regplot, color=".3")
g.set(ylim=(-1, 11), yticks=[0, 5, 10]);

Of course, the aesthetic attributes are configurable. For instance, you can use a different palette (say, to show an ordering of the hue variable) and pass keyword arguments into the plotting functions.

In [139]:
g = sns.PairGrid(tips, hue="size", palette="GnBu_d")
g.map(plt.scatter, s=50, edgecolor="white")
g.add_legend();

PairGrid is flexible, but to take a quick look at a dataset, it can be easier to use pairplot(). This function uses scatterplots and histograms by default, although a few other kinds will be added (currently, you can also plot regression plots on the off-diagonals and KDEs on the diagonal).

In [140]:
sns.pairplot(iris, hue="species", height=2.5);

You can also control the aesthetics of the plot with keyword arguments, and it returns the PairGrid instance for further tweaking.

In [141]:
g = sns.pairplot(iris, hue="species", palette="Set2", diag_kind="kde", height=2.5)

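Producing figures that are pleasing to look at is one of the goals of data visualization. Visualization helps us present quantitative insights to an audience more intuitively and explain what the data have to say. Beyond that, we also want our charts to draw readers in and make them more interested in our work.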


In [142]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


In [143]:
def sinplot(flip=1):
    x = np.linspace(0, 14, 100)
    for i in range(1, 7):
        plt.plot(x, np.sin(x + i * .5) * (7 - i) * flip)




In [145]:
sns.set()
sinplot()

(Note that in versions of seaborn prior to 0.8, set() was called on import. On later versions, it must be explicitly invoked).

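Seaborn splits matplotlib parameters into two groups. The first group sets the aesthetic style of the plot (background, line style and width, fonts, axes, and so on); the second scales plot elements for different contexts (figures for papers, slides, and posters have different requirements).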


Seaborn figure styles

Seaborn has five preset styles: darkgrid, whitegrid, dark, white, and ticks. The default is darkgrid.



In [146]:
sns.set_style("whitegrid")
data = np.random.normal(size=(20, 6)) + np.arange(6) / 2
sns.boxplot(data=data);


In [147]:
sns.set_style("dark")
sinplot()
In [148]:
sns.set_style("white")
sinplot()


In [149]:
sns.set_style("ticks")
sinplot()


In [150]:
sinplot()
sns.despine()


In [151]:
sns.set_style("white")
sns.boxplot(data=data, palette="deep")
sns.despine(left=True, bottom=True)



In [152]:
f = plt.figure()
with sns.axes_style("darkgrid"):
    ax = f.add_subplot(1, 2, 1)
    sinplot()
ax = f.add_subplot(1, 2, 2)
sinplot(-1)

Customizing seaborn styles


In [153]:
{'axes.facecolor': 'white', 'axes.edgecolor': '.15', 'axes.grid': False, 'axes.axisbelow': True, 'axes.labelcolor': '.15', 'figure.facecolor': 'white', 'grid.color': '.8', 'grid.linestyle': '-', 'text.color': '.15', 'xtick.color': '.15', 'ytick.color': '.15', 'xtick.direction': 'out', 'ytick.direction': 'out', 'lines.solid_capstyle': 'round', 'patch.edgecolor': 'w', 'image.cmap': 'rocket', 'font.family': ['sans-serif'], 'font.sans-serif': ['Arial', 'DejaVu Sans', 'Liberation Sans', 'Bitstream Vera Sans', 'sans-serif'], 'patch.force_edgecolor': True, 'xtick.bottom': False, 'xtick.top': False, 'ytick.left': False, 'ytick.right': False, 'axes.spines.left': True, 'axes.spines.bottom': True, 'axes.spines.right': True, 'axes.spines.top': True}


In [154]:
sns.set_style("white", {"ytick.right": True, 'axes.grid': False})
sinplot()


