如何在 Python 中创建用于分类的模拟数据
在本教程中,我们将学习如何在 Python 中从分类创建模拟数据。
介绍
模拟数据可以定义为不代表真实现象但使用参数和约束合成生成的任何数据。
我们何时以及为什么需要模拟数据?
有时,在机器学习或深度学习中对特定算法进行原型设计时,我们通常会面临缺乏对我们有用的良好真实世界数据的稀缺。有时,给定任务没有此类数据可用。在这种情况下,我们可能需要合成生成的数据。此数据也可以来自实验室模拟。
模拟数据的优势
-
主要表示数据,因为它可能是真实形式
-
包含较少的噪声变化,因此可以被认为是理想的数据集
-
适用于快速原型设计和 POC
生成模拟数据以使用 Python 进行分类
在本演示中,我们将使用 sci-ki learn 来生成模拟数据。
例
from sklearn
.datasets
import make_classification
import pandas
as pd
import seaborn
as sns
# Creating a simulated feature matrix and output vector with 100 samples features
, output
= make_classification
(n_samples
=
100
,
# taking ten features n_features
=
10
,
# five features that predict the output's classes n_informative
=
5
,
# five features that are random and unrelated to the output's classes n_redundant
=
5
,
# three output classes n_classes
=
3
,
# with 20% of observations in the first class, 30% in the second class,
# and 50% in the third class. ('None' makes balanced classes) weights
=
[
.2
,
.3
,
.8
]
)
print
(
"Feature Dataframe: "
)
; df_features
= pd
.DataFrame
(features
, columns
=
[
"Feature 1"
,
"Feature 2"
,
"Feature 3"
,
"Feature 4"
,
"Feature 5"
,
"Feature 6"
,
"Feature 7"
,
"Feature 8"
,
"Feature 9"
,
"Feature 10"
]
) output_series
= pd
.Series
(output
,name
=
'label'
) df
= pd
.concat
(
[df_features
,output_series
]
,axis
=
1
)
print
(df
.head
(
)
)
## plot using seaborn sns
.
set
(rc
=
{
"figure.figsize"
:
(
16
,
8
)
}
)
## Plotting 'Feature 1' vs label sns
.scatterplot
(data
=df
,x
=
'Feature 1'
,y
=
'label'
,s
=
50
)
输出
<span style="color:#000000">Feature Dataframe:
Feature 1 Feature 2 Feature 3 Feature 4 Feature 5 Feature 6 \
0 0.849715 -0.381343 0.650106 -1.439747 -0.442026 0.785891
1 1.841786 0.912779 2.090686 -2.220130 -0.744132 -0.116817
2 -0.915034 -3.324696 -2.613417 0.852612 -3.908363 4.352266
3 1.305116 -1.582905 -0.797318 -0.943912 -1.753893 1.721998
4 0.894486 -0.130399 -0.968311 0.989773 -0.987330 -0.296457
Feature 7 Feature 8 Feature 9 Feature 10 label
0 0.119725 1.156633 0.794226 0.511587 2
1 -0.064624 2.311732 0.178347 1.294978 1
2 3.038898 -2.273558 4.194868 2.693096 2
3 0.817046 0.577196 2.651006 1.826657 2
4 -0.280331 0.096983 1.227921 0.909471 2
</span>
另一种方法是使用 Faker python 库。让我们看一下下面的示例。 安装伪造程序库
例
!pip install Faker
from random
import randint
import pandas
as pd
from faker
import Faker
from faker
.providers
import DynamicProvider medical_professions_provider
= DynamicProvider
( provider_name
=
"medical_profession"
, elements
=
[
"dr."
,
"doctor"
,
"nurse"
,
"surgeon"
,
"clerk"
]
,
) fake
= Faker
(
) fake
.add_provider
(medical_professions_provider
)
def
input_data
(x
)
:
# pandas dataframe data
= pd
.DataFrame
(
)
for i
in
range
(
0
, x
)
: data
.loc
[i
,
'id'
]
= randint
(
1
,
100
) data
.loc
[i
,
'name'
]
= fake
.name
(
) data
.loc
[i
,
'address'
]
= fake
.address
(
) data
.loc
[i
,
'latitude'
]
=
str
(fake
.latitude
(
)
) data
.loc
[i
,
'longitude'
]
=
str
(fake
.longitude
(
)
) data
.loc
[i
,
'target'
]
=
str
(fake
.medical_profession
(
)
)
return data
print
(input_data
(
10
)
)
输出
<span style="color:#000000">id name address \
7.0 Monique Rodriguez 481 Rebecca Landing Suite 727\nDominiquefurt, ...
4.0 Elizabeth Johnson 62492 Zimmerman Crest Apt. 047\nPort Jerome, W...
18.0 Max Rangel 4379 Obrien Curve\nDavistown, IA 02341
31.0 Tammie Kent 4866 Angela Turnpike Apt. 658\nNorth Sheilabor...
42.0 James Johnston 26827 Jeremiah Alley\nFreystad, SC 86902
21.0 Shawn Robles 137 Jessica Ridges Apt. 436\nWilliamburgh, AZ ...
13.0 Stephen Hodges Unit 9799 Box 0625\nDPO AA 94415
91.0 Eric Lewis PhD 4711 Nicholas Loaf\nWest Lisa, UT 28944
68.0 Matthew Munoz 37836 White Crest\nGonzalezport, NC 75320
34.0 Lawrence Anderson 76712 Garza Mills Apt. 751\nPort Penny, CT 43042
latitude longitude target 0 60.574796 109.367770 clerk
1 84.7225155 -167.216393 dr.
2 82.598649 62.961322 surgeon
3 26.9617205 89.333171 doctor
4 -37.1740195 -140.766121 dr.
5 -40.8904645 28.820918 clerk
6 88.809220 76.442779 dr.
7 35.728143 178.729120 doctor
8 -16.5669945 126.686740 dr.
9 -49.271970 160.737754 clerk
</span>
结论
模拟数据在用于原型设计或小型 POC 的日常机器学习应用程序中非常有用。Python 中有一些方便的工具,可以在几行代码中创建模拟数据变得非常简单。