Machine Learning : Pre-processing features
from:http://analyticsbot.ml/2016/10/machine-learning-pre-processing-features/
Machine Learning : Pre-processing features
I am participating in this Kaggle competition. It is a prediction problem contest. The problem statement is:
How severe is an insurance claim?
When you’ve been devastated by a serious car accident, your focus is on the things that matter the most: family, friends, and other loved ones. Pushing paper with your insurance agent is the last place you want your time or mental energy spent. This is why Allstate, a personal insurer in the United States, is continually seeking fresh ideas to improve their claims service for the over 16 million households they protect.
Allstate is currently developing automated methods of predicting the cost, and hence severity, of claims. In this recruitment challenge, Kagglers are invited to show off their creativity and flex their technical chops by creating an algorithm which accurately predicts claims severity. Aspiring competitors will demonstrate insight into better ways to predict claims severity for the chance to be part of Allstate’s efforts to ensure a worry-free customer experience.
You can take a look at the data here. You can easily open the dataset in excel and take a look at the variables/features in the dataset. There are 116 categorical variables in the dataset and 14 continuous variables. Let’s start the analysis
Import all necessary modules.
1
2
3
4
5
6
7
8
9
|
# import required libraries
# pandas for reading data and manipulation
# scikit learn to one hot encoder and label encoder
# sns and matplotlib to visualize
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction import DictVectorizer
import operator
|
All these modules should be installed on your machine. I am using Python 2.7.11. If you have to install these modules, you can simply do
1
2
3
4
|
pip install <module name>
Example:
pip install pandas
|
Let’s read the datasets using pandas
1
2
3
|
# read data from csv file
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
|
Let’s take a look at the data
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
|
# let's take a look at the train and test data
print '**************************************'
print 'TRAIN DATA'
print '**************************************'
print train.head(5)
print '**************************************'
print 'TEST DATA'
print '**************************************'
print test.head(5)
# the above code wont print all columns.
# to print all columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
# let's take a look at the train and test data again
print '**************************************'
print 'TRAIN DATA'
print '**************************************'
print train.head(5)
print '**************************************'
print 'TEST DATA'
print '**************************************'
print test.head(5)
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
|
TRAIN DATA
**************************************
id cat1 cat2 cat3 cat4 cat5 cat6 cat7 cat8 cat9 ... cont6 \
0 1 A B A B A A A A B ... 0.718367
1 2 A B A A A A A A B ... 0.438917
2 5 A B A A B A A A B ... 0.289648
3 10 B B A B A A A A B ... 0.440945
4 11 A B A B A A A A B ... 0.178193
cont7 cont8 cont9 cont10 cont11 cont12 cont13 \
0 0.335060 0.30260 0.67135 0.83510 0.569745 0.594646 0.822493
1 0.436585 0.60087 0.35127 0.43919 0.338312 0.366307 0.611431
2 0.315545 0.27320 0.26076 0.32446 0.381398 0.373424 0.195709
3 0.391128 0.31796 0.32128 0.44467 0.327915 0.321570 0.605077
4 0.247408 0.24564 0.22089 0.21230 0.204687 0.202213 0.246011
cont14 loss
0 0.714843 2213.18
1 0.304496 1283.60
2 0.774425 3005.09
3 0.602642 939.85
4 0.432606 2763.85
[5 rows x 132 columns]
**************************************
TEST DATA
**************************************
id cat1 cat2 cat3 cat4 cat5 cat6 cat7 cat8 cat9 ... cont5 \
0 4 A B A A A A A A B ... 0.281143
1 6 A B A B A A A A B ... 0.836443
2 9 A B A B B A B A B ... 0.718531
3 12 A A A A B A A A A ... 0.397069
4 15 B A A A A B A A A ... 0.302678
cont6 cont7 cont8 cont9 cont10 cont11 cont12 \
0 0.466591 0.317681 0.61229 0.34365 0.38016 0.377724 0.369858
1 0.482425 0.443760 0.71330 0.51890 0.60401 0.689039 0.675759
2 0.212308 0.325779 0.29758 0.34365 0.30529 0.245410 0.241676
3 0.369930 0.342355 0.40028 0.33237 0.31480 0.348867 0.341872
4 0.398862 0.391833 0.23688 0.43731 0.50556 0.359572 0.352251
cont13 cont14
0 0.704052 0.392562
1 0.453468 0.208045
2 0.258586 0.297232
3 0.592264 0.555955
4 0.301535 0.825823
[5 rows x 131 columns]
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
|
**************************************
TRAIN DATA
**************************************
id cat1 cat2 cat3 cat4 cat5 cat6 cat7 cat8 cat9 cat10 cat11 cat12 cat13 \
0 1 A B A B A A A A B A B A A
1 2 A B A A A A A A B B A A A
2 5 A B A A B A A A B B B B B
3 10 B B A B A A A A B A A A A
4 11 A B A B A A A A B B A B A
cat14 cat15 cat16 cat17 cat18 cat19 cat20 cat21 cat22 cat23 cat24 cat25 \
0 A A A A A A A A A B A A
1 A A A A A A A A A A A A
2 A A A A A A A A A A A A
3 A A A A A A A A A B A A
4 A A A A A A A A A B A A
cat26 cat27 cat28 cat29 cat30 cat31 cat32 cat33 cat34 cat35 cat36 cat37 \
0 A A A A A A A A A A A A
1 A A A A A A A A A A A A
2 A A A A A A A A A A B A
3 A A A A A A A A A A A A
4 A A A A A A A A A A A A
cat38 cat39 cat40 cat41 cat42 cat43 cat44 cat45 cat46 cat47 cat48 cat49 \
0 A A A A A A A A A A A A
1 A A A A A A A A A A A A
2 A A A A A A A A A A A A
3 A A A A A A A A A A A A
4 A A A A A A A A A A A A
cat50 cat51 cat52 cat53 cat54 cat55 cat56 cat57 cat58 cat59 cat60 cat61 \
0 A A A A A A A A A A A A
1 A A A A A A A A A A A A
2 A A A A A A A A A A A A
3 A A A A A A A A A A A A
4 A A A A A A A A A A A A
cat62 cat63 cat64 cat65 cat66 cat67 cat68 cat69 cat70 cat71 cat72 cat73 \
0 A A A A A A A A A A A A
1 A A A A A A A A A A A A
2 A A A A A A A A A A A A
3 A A A A A A A A A A A B
4 A A A A A A A A A A B A
cat74 cat75 cat76 cat77 cat78 cat79 cat80 cat81 cat82 cat83 cat84 cat85 \
0 A B A D B B D D B D C B
1 A A A D B B D D A B C B
2 A A A D B B B D B D C B
3 A A A D B B D D D B C B
4 A A A D B D B D B B C B
cat86 cat87 cat88 cat89 cat90 cat91 cat92 cat93 cat94 cat95 cat96 cat97 \
0 D B A A A A A D B C E A
1 D B A A A A A D D C E E
2 B B A A A A A D D C E E
3 D B A A A A A D D C E E
4 B C A A A B H D B D E E
cat98 cat99 cat100 cat101 cat102 cat103 cat104 cat105 cat106 cat107 cat108 \
0 C T B G A A I E G J G
1 D T L F A A E E I K K
2 A D L O A B E F H F A
3 D T I D A A E E I K K
4 A P F J A A D E K G B
cat109 cat110 cat111 cat112 cat113 cat114 cat115 cat116 cont1 cont2 \
0 BU BC C AS S A O LB 0.726300 0.245921
1 BI CQ A AV BM A O DP 0.330514 0.737068
2 AB DK A C AF A I GK 0.261841 0.358319
3 BI CS C N AE A O DJ 0.321594 0.555782
4 H C C Y BM A K CK 0.273204 0.159990
cont3 cont4 cont5 cont6 cont7 cont8 cont9 \
0 0.187583 0.789639 0.310061 0.718367 0.335060 0.30260 0.67135
1 0.592681 0.614134 0.885834 0.438917 0.436585 0.60087 0.35127
2 0.484196 0.236924 0.397069 0.289648 0.315545 0.27320 0.26076
3 0.527991 0.373816 0.422268 0.440945 0.391128 0.31796 0.32128
4 0.527991 0.473202 0.704268 0.178193 0.247408 0.24564 0.22089
cont10 cont11 cont12 cont13 cont14 loss
0 0.83510 0.569745 0.594646 0.822493 0.714843 2213.18
1 0.43919 0.338312 0.366307 0.611431 0.304496 1283.60
2 0.32446 0.381398 0.373424 0.195709 0.774425 3005.09
3 0.44467 0.327915 0.321570 0.605077 0.602642 939.85
4 0.21230 0.204687 0.202213 0.246011 0.432606 2763.85
**************************************
TEST DATA
**************************************
id cat1 cat2 cat3 cat4 cat5 cat6 cat7 cat8 cat9 cat10 cat11 cat12 cat13 \
0 4 A B A A A A A A B A B A A
1 6 A B A B A A A A B A A A A
2 9 A B A B B A B A B B A B B
3 12 A A A A B A A A A A A A A
4 15 B A A A A B A A A A A A A
cat14 cat15 cat16 cat17 cat18 cat19 cat20 cat21 cat22 cat23 cat24 cat25 \
0 A A A A A A A A A A A A
1 A A A A A A A A A B B A
2 B A A A A A A A A B A A
3 A A A A A A A A A A A A
4 A A A A A A A A A A A A
cat26 cat27 cat28 cat29 cat30 cat31 cat32 cat33 cat34 cat35 cat36 cat37 \
0 A A A A A A A A A A A A
1 A A A A A A A A A A A A
2 A A A A A A A A A A B A
3 A A A A A A A A A A B A
4 A A A A A A A A A A A A
cat38 cat39 cat40 cat41 cat42 cat43 cat44 cat45 cat46 cat47 cat48 cat49 \
0 A A A A A A A A A A A A
1 A A A A A A A A A A A A
2 B B A A A A A A A A A A
3 B A A B A A A A A A A A
4 A A A A A A A A A A A A
cat50 cat51 cat52 cat53 cat54 cat55 cat56 cat57 cat58 cat59 cat60 cat61 \
0 A A A A A A A A A A A A
1 A A A A A A A A A A A A
2 A A A A A A A B A A A A
3 A A A A A A A A A A A A
4 B A A A A A A A A A A A
cat62 cat63 cat64 cat65 cat66 cat67 cat68 cat69 cat70 cat71 cat72 cat73 \
0 A A A A A A A A A A A A
1 A A A A A A A A A A B A
2 A A A A A A A A A A A A
3 A A A A A A A A A A B A
4 A A A A A A A A A A A A
cat74 cat75 cat76 cat77 cat78 cat79 cat80 cat81 cat82 cat83 cat84 cat85 \
0 A A A D B B D D B B C B
1 A B A D B B D D B B C B
2 A A B D B B B B B D C B
3 A A A D B D B D B B A B
4 A A A D B B D D B B C B
cat86 cat87 cat88 cat89 cat90 cat91 cat92 cat93 cat94 cat95 cat96 cat97 \
0 D B A A A A A D C C E C
1 B B A A A A A D D D E A
2 B B A B A A A D D C E E
3 D D A A A G H D D C E E
4 B B A A A A A D B D E A
cat98 cat99 cat100 cat101 cat102 cat103 cat104 cat105 cat106 cat107 cat108 \
0 D T H G A A G E I L K
1 A P B D A A G G G F B
2 A D G Q A D D E J G A
3 D T G A A D E E I K K
4 A P A A A A F E G E B
cat109 cat110 cat111 cat112 cat113 cat114 cat115 cat116 cont1 cont2 \
0 BI BC A J AX A Q HG 0.321594 0.299102
1 BI CO E G X A L HK 0.634734 0.620805
2 BI CS C U AE A K CK 0.290813 0.737068
3 BI CR A AY AJ A P DJ 0.268622 0.681761
4 AB EG A E I C J HA 0.553846 0.299102
cont3 cont4 cont5 cont6 cont7 cont8 cont9 \
0 0.246911 0.402922 0.281143 0.466591 0.317681 0.61229 0.34365
1 0.654310 0.946616 0.836443 0.482425 0.443760 0.71330 0.51890
2 0.711159 0.412789 0.718531 0.212308 0.325779 0.29758 0.34365
3 0.592681 0.354893 0.397069 0.369930 0.342355 0.40028 0.33237
4 0.263570 0.696873 0.302678 0.398862 0.391833 0.23688 0.43731
cont10 cont11 cont12 cont13 cont14
0 0.38016 0.377724 0.369858 0.704052 0.392562
1 0.60401 0.689039 0.675759 0.453468 0.208045
2 0.30529 0.245410 0.241676 0.258586 0.297232
3 0.31480 0.348867 0.341872 0.592264 0.555955
4 0.50556 0.359572 0.352251 0.301535 0.825823
|
You might have noticed that we printed the same thing twice. Well, the first time Python prints a small number of columns and the first five observations but the second time it prints all columns and their 5 observations. This is because of the
1
2
|
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
|
Make sure you have the 5 in head else it will print everything on screen, which will not be pretty. Take a look at the columns present in the train and test set.
1
2
|
print 'columns in train set : ', train.columns
print 'columns in test set : ', test.columns
|
There is an ID column in both the data sets which we don’t need for any analysis. Moreover, we will keep the loss column from the training data set into a separate variable.
1
2
3
4
|
# remove ID column. No use.
train.drop('id',axis=1,inplace=True)
test.drop('id',axis=1,inplace=True)
loss = train.drop('loss', axis = 1, inplace = True)
|
Let’s take a look at the continuous variables and their basic statistical analysis.
1
2
3
4
5
|
# high level statistics. mean media mode count and quartiles
# note - this will work only for the continous variables
# not for the categorical variables
print train.describe()
print test.describe()
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
|
## train
cont1 cont2 cont3 cont4 \
count 188318.000000 188318.000000 188318.000000 188318.000000
mean 0.493861 0.507188 0.498918 0.491812
std 0.187640 0.207202 0.202105 0.211292
min 0.000016 0.001149 0.002634 0.176921
25% 0.346090 0.358319 0.336963 0.327354
50% 0.475784 0.555782 0.527991 0.452887
75% 0.623912 0.681761 0.634224 0.652072
max 0.984975 0.862654 0.944251 0.954297
cont5 cont6 cont7 cont8 \
count 188318.000000 188318.000000 188318.000000 188318.000000
mean 0.487428 0.490945 0.484970 0.486437
std 0.209027 0.205273 0.178450 0.199370
min 0.281143 0.012683 0.069503 0.236880
25% 0.281143 0.336105 0.350175 0.312800
50% 0.422268 0.440945 0.438285 0.441060
75% 0.643315 0.655021 0.591045 0.623580
max 0.983674 0.997162 1.000000 0.980200
cont9 cont10 cont11 cont12 \
count 188318.000000 188318.000000 188318.000000 188318.000000
mean 0.485506 0.498066 0.493511 0.493150
std 0.181660 0.185877 0.209737 0.209427
min 0.000080 0.000000 0.035321 0.036232
25% 0.358970 0.364580 0.310961 0.311661
50% 0.441450 0.461190 0.457203 0.462286
75% 0.566820 0.614590 0.678924 0.675759
max 0.995400 0.994980 0.998742 0.998484
cont13 cont14
count 188318.000000 188318.000000
mean 0.493138 0.495717
std 0.212777 0.222488
min 0.000228 0.179722
25% 0.315758 0.294610
50% 0.363547 0.407403
75% 0.689974 0.724623
max 0.988494 0.844848
|
In many competitions, you’ll find there are some features that might be present in the training set but not in the test set and vice-versa.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
|
# at this point, it is wise to check whether there are any features that
# are there is one of the dataset but not in other
missingFeatures = False
inTrainNotTest = []
for feature in train.columns:
if feature not in test.columns:
missingFeatures = True
inTrainNotTest.append(feature)
if len(inTrainNotTest)>0:
print ', '. join(inTrainNotTest), ' features are present in training set but not in test set'
inTestNotTrain = []
for feature in test.columns:
if feature not in train.columns:
missingFeatures = True
inTestNotTrain.append(feature)
if len(inTestNotTrain)>0:
print ', '. join(inTestNotTrain), ' features are present in test set but not in training set'
|
In this case, we see that there are not different columns between train and test sets.
Let’s identify the categorical and continuous variables. For this data set, there are two ways to find them:
- variables have ‘cat’ and ‘cont’ in them, defining them
- pandas consider the data type as object
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
|
# find categorical variables
# in this problem, categorical variables are start with cat which is easy
# to identify
# in other problems it not might be like that
# we will see two ways to identify this in this problem
# we will also find the continous or numerical variables
## 1. by name
categorical_train = [var for var in train.columns if 'cat' in var]
categorical_test = [var for var in test.columns if 'cat' in var]
continous_train = [var for var in train.columns if 'cont' in var]
continous_test = [var for var in test.columns if 'cont' in var]
## 2. by type = object
categorical_train = train.dtypes[train.dtypes == "object"].index
categorical_test = test.dtypes[test.dtypes == "object"].index
continous_train = train.dtypes[train.dtypes != "object"].index
continous_test = test.dtypes[test.dtypes != "object"].index
|
Correlation between continuous variables
Let’s take a look at correlation between the variables. The idea is to remove variables that are highly correlated.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
|
# lets check for correlation between continous data
# correlation between numerical variables is something like this
# if we increase one variable, there is a siginficant almost increase/decrease
# in the other variable. it varies from -1 to 1
correlation_train = train[continous_train].corr()
correlation_test = train[continous_test].corr()
# for the purpose of this analysis, we will consider to variables to
# highly correlation if the correlation is more than 0.6
threshold = 0.6
for i in range(len(correlation_train)):
for j in range(len(correlation_train)):
if (i>j) and (correlation_train.iloc[i,j]>threshold):
print ("%s and %s = %.2f" % (train.columns[i],train.columns[j],correlation_train.iloc[i,j]))
for i in range(len(correlation_test)):
for j in range(len(correlation_test)):
if (i>j) and (correlation_test.iloc[i,j]>threshold):
print ("%s and %s = %.2f" % (test.columns[i],test.columns[j],correlation_test.iloc[i,j]))
# we can remove one of the two highly correlatied variables to improve performance
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
|
cat6 and cat1 = 0.76
cat7 and cat6 = 0.66
cat9 and cat1 = 0.93
cat9 and cat6 = 0.80
cat10 and cat1 = 0.81
cat10 and cat6 = 0.88
cat10 and cat9 = 0.79
cat11 and cat6 = 0.77
cat11 and cat7 = 0.75
cat11 and cat9 = 0.61
cat11 and cat10 = 0.70
cat12 and cat1 = 0.61
cat12 and cat6 = 0.79
cat12 and cat7 = 0.74
cat12 and cat9 = 0.63
cat12 and cat10 = 0.71
cat12 and cat11 = 0.99
cat13 and cat6 = 0.82
cat13 and cat9 = 0.64
cat13 and cat10 = 0.71
cat6 and cat1 = 0.76
cat7 and cat6 = 0.66
cat9 and cat1 = 0.93
cat9 and cat6 = 0.80
cat10 and cat1 = 0.81
cat10 and cat6 = 0.88
cat10 and cat9 = 0.79
cat11 and cat6 = 0.77
cat11 and cat7 = 0.75
cat11 and cat9 = 0.61
cat11 and cat10 = 0.70
cat12 and cat1 = 0.61
cat12 and cat6 = 0.79
cat12 and cat7 = 0.74
cat12 and cat9 = 0.63
cat12 and cat10 = 0.71
cat12 and cat11 = 0.99
cat13 and cat6 = 0.82
cat13 and cat9 = 0.64
cat13 and cat10 = 0.71
|
Let’s take at the labels present in the categorical variables. Although we don’t have any different columns, what may happen is some labels might not be present in one or the other data set.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
|
# lets check for factors in the categorical variables
for feature in categorical_train:
print feature, 'has ', len(train[feature].unique()), 'values. Unique values are :: ', train[feature].unique()
for feature in categorical_test:
print feature, 'has ', len(test[feature].unique()), 'values. Unique values are :: ', test[feature].unique()
# lets take a look whether the unique values/factors are not present in each of the dataset
# for example cat1 in both the datasets has values only A & B. Sometimes
# it may happen that some new value is present in the test set which maybe ruin your model
featuresDone = []
for feature in categorical_train:
if feature in categorical_test:
if set(train[feature].unique()) - set(test[feature].unique()) != set([]):
print 'Train set has ', len(train[feature].unique()), 'values. Unique values are :: ', train[feature].unique(), '\n'
print 'test set has ', len(test[feature].unique()), 'values. Unique values are :: ', test[feature].unique(), '\n'
print 'Missing vaues are : ', set(train[feature].unique()) - set(test[feature].unique())
featuresDone.append(feature)
for feature in categorical_test:
if (feature in categorical_train) and (feature not in featuresDone):
if set(train[feature].unique()) - set(test[feature].unique()) != set([]):
print 'Train set has ', len(train[feature].unique()), 'values. Unique values are :: ', train[feature].unique(), '\n'
print 'test set has ', len(test[feature].unique()), 'values. Unique values are :: ', test[feature].unique(), '\n'
print 'Missing vaues are : ', set(train[feature].unique()) - set(test[feature].unique())
featuresDone.append(feature)
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
|
cat1 has 2 values. Unique values are :: ['A' 'B']
cat2 has 2 values. Unique values are :: ['B' 'A']
cat3 has 2 values. Unique values are :: ['A' 'B']
cat4 has 2 values. Unique values are :: ['B' 'A']
cat5 has 2 values. Unique values are :: ['A' 'B']
cat6 has 2 values. Unique values are :: ['A' 'B']
cat7 has 2 values. Unique values are :: ['A' 'B']
cat8 has 2 values. Unique values are :: ['A' 'B']
cat9 has 2 values. Unique values are :: ['B' 'A']
cat10 has 2 values. Unique values are :: ['A' 'B']
cat11 has 2 values. Unique values are :: ['B' 'A']
cat12 has 2 values. Unique values are :: ['A' 'B']
cat13 has 2 values. Unique values are :: ['A' 'B']
cat14 has 2 values. Unique values are :: ['A' 'B']
cat15 has 2 values. Unique values are :: ['A' 'B']
cat16 has 2 values. Unique values are :: ['A' 'B']
cat17 has 2 values. Unique values are :: ['A' 'B']
cat18 has 2 values. Unique values are :: ['A' 'B']
cat19 has 2 values. Unique values are :: ['A' 'B']
cat20 has 2 values. Unique values are :: ['A' 'B']
cat21 has 2 values. Unique values are :: ['A' 'B']
cat22 has 2 values. Unique values are :: ['A' 'B']
cat23 has 2 values. Unique values are :: ['B' 'A']
cat24 has 2 values. Unique values are :: ['A' 'B']
cat25 has 2 values. Unique values are :: ['A' 'B']
cat26 has 2 values. Unique values are :: ['A' 'B']
cat27 has 2 values. Unique values are :: ['A' 'B']
cat28 has 2 values. Unique values are :: ['A' 'B']
cat29 has 2 values. Unique values are :: ['A' 'B']
cat30 has 2 values. Unique values are :: ['A' 'B']
cat31 has 2 values. Unique values are :: ['A' 'B']
cat32 has 2 values. Unique values are :: ['A' 'B']
cat33 has 2 values. Unique values are :: ['A' 'B']
cat34 has 2 values. Unique values are :: ['A' 'B']
cat35 has 2 values. Unique values are :: ['A' 'B']
cat36 has 2 values. Unique values are :: ['A' 'B']
cat37 has 2 values. Unique values are :: ['A' 'B']
cat38 has 2 values. Unique values are :: ['A' 'B']
cat39 has 2 values. Unique values are :: ['A' 'B']
cat40 has 2 values. Unique values are :: ['A' 'B']
cat41 has 2 values. Unique values are :: ['A' 'B']
cat42 has 2 values. Unique values are :: ['A' 'B']
cat43 has 2 values. Unique values are :: ['A' 'B']
cat44 has 2 values. Unique values are :: ['A' 'B']
cat45 has 2 values. Unique values are :: ['A' 'B']
cat46 has 2 values. Unique values are :: ['A' 'B']
cat47 has 2 values. Unique values are :: ['A' 'B']
cat48 has 2 values. Unique values are :: ['A' 'B']
cat49 has 2 values. Unique values are :: ['A' 'B']
cat50 has 2 values. Unique values are :: ['A' 'B']
cat51 has 2 values. Unique values are :: ['A' 'B']
cat52 has 2 values. Unique values are :: ['A' 'B']
cat53 has 2 values. Unique values are :: ['A' 'B']
cat54 has 2 values. Unique values are :: ['A' 'B']
cat55 has 2 values. Unique values are :: ['A' 'B']
cat56 has 2 values. Unique values are :: ['A' 'B']
cat57 has 2 values. Unique values are :: ['A' 'B']
cat58 has 2 values. Unique values are :: ['A' 'B']
cat59 has 2 values. Unique values are :: ['A' 'B']
cat60 has 2 values. Unique values are :: ['A' 'B']
cat61 has 2 values. Unique values are :: ['A' 'B']
cat62 has 2 values. Unique values are :: ['A' 'B']
cat63 has 2 values. Unique values are :: ['A' 'B']
cat64 has 2 values. Unique values are :: ['A' 'B']
cat65 has 2 values. Unique values are :: ['A' 'B']
cat66 has 2 values. Unique values are :: ['A' 'B']
cat67 has 2 values. Unique values are :: ['A' 'B']
cat68 has 2 values. Unique values are :: ['A' 'B']
cat69 has 2 values. Unique values are :: ['A' 'B']
cat70 has 2 values. Unique values are :: ['A' 'B']
cat71 has 2 values. Unique values are :: ['A' 'B']
cat72 has 2 values. Unique values are :: ['A' 'B']
cat73 has 3 values. Unique values are :: ['A' 'B' 'C']
cat74 has 3 values. Unique values are :: ['A' 'B' 'C']
cat75 has 3 values. Unique values are :: ['B' 'A' 'C']
cat76 has 3 values. Unique values are :: ['A' 'C' 'B']
cat77 has 4 values. Unique values are :: ['D' 'C' 'B' 'A']
cat78 has 4 values. Unique values are :: ['B' 'A' 'C' 'D']
cat79 has 4 values. Unique values are :: ['B' 'D' 'A' 'C']
cat80 has 4 values. Unique values are :: ['D' 'B' 'A' 'C']
cat81 has 4 values. Unique values are :: ['D' 'B' 'A' 'C']
cat82 has 4 values. Unique values are :: ['B' 'A' 'D' 'C']
cat83 has 4 values. Unique values are :: ['D' 'B' 'A' 'C']
cat84 has 4 values. Unique values are :: ['C' 'A' 'D' 'B']
cat85 has 4 values. Unique values are :: ['B' 'A' 'C' 'D']
cat86 has 4 values. Unique values are :: ['D' 'B' 'C' 'A']
cat87 has 4 values. Unique values are :: ['B' 'C' 'D' 'A']
cat88 has 4 values. Unique values are :: ['A' 'D' 'E' 'B']
cat89 has 8 values. Unique values are :: ['A' 'B' 'C' 'E' 'D' 'H' 'I' 'G']
cat90 has 7 values. Unique values are :: ['A' 'B' 'C' 'D' 'F' 'E' 'G']
cat91 has 8 values. Unique values are :: ['A' 'B' 'G' 'C' 'D' 'E' 'F' 'H']
cat92 has 7 values. Unique values are :: ['A' 'H' 'B' 'C' 'D' 'I' 'F']
cat93 has 5 values. Unique values are :: ['D' 'C' 'A' 'B' 'E']
cat94 has 7 values. Unique values are :: ['B' 'D' 'C' 'A' 'F' 'E' 'G']
cat95 has 5 values. Unique values are :: ['C' 'D' 'E' 'A' 'B']
cat96 has 8 values. Unique values are :: ['E' 'D' 'G' 'B' 'F' 'A' 'I' 'C']
cat97 has 7 values. Unique values are :: ['A' 'E' 'C' 'G' 'D' 'F' 'B']
cat98 has 5 values. Unique values are :: ['C' 'D' 'A' 'E' 'B']
cat99 has 16 values. Unique values are :: ['T' 'D' 'P' 'S' 'R' 'K' 'E' 'F' 'N' 'J' 'C' 'M' 'H' 'G' 'I' 'O']
cat100 has 15 values. Unique values are :: ['B' 'L' 'I' 'F' 'J' 'H' 'C' 'M' 'A' 'G' 'O' 'N' 'K' 'D' 'E']
cat101 has 19 values. Unique values are :: ['G' 'F' 'O' 'D' 'J' 'A' 'C' 'Q' 'M' 'I' 'L' 'R' 'S' 'E' 'N' 'H' 'B' 'U'
'K']
cat102 has 9 values. Unique values are :: ['A' 'C' 'B' 'D' 'G' 'E' 'F' 'H' 'J']
cat103 has 13 values. Unique values are :: ['A' 'B' 'C' 'F' 'E' 'D' 'G' 'H' 'I' 'L' 'K' 'J' 'N']
cat104 has 17 values. Unique values are :: ['I' 'E' 'D' 'K' 'H' 'F' 'G' 'P' 'C' 'J' 'L' 'M' 'N' 'O' 'B' 'A' 'Q']
cat105 has 20 values. Unique values are :: ['E' 'F' 'H' 'G' 'I' 'D' 'J' 'K' 'M' 'C' 'A' 'L' 'N' 'P' 'T' 'Q' 'R' 'O'
'B' 'S']
cat106 has 17 values. Unique values are :: ['G' 'I' 'H' 'K' 'F' 'J' 'E' 'L' 'M' 'D' 'A' 'C' 'N' 'O' 'R' 'B' 'P']
cat107 has 20 values. Unique values are :: ['J' 'K' 'F' 'G' 'I' 'M' 'H' 'L' 'E' 'D' 'O' 'C' 'N' 'A' 'Q' 'P' 'U' 'B'
'R' 'S']
cat108 has 11 values. Unique values are :: ['G' 'K' 'A' 'B' 'D' 'I' 'F' 'H' 'E' 'C' 'J']
cat109 has 84 values. Unique values are :: ['BU' 'BI' 'AB' 'H' 'K' 'CD' 'BQ' 'M' 'G' 'BL' 'L' 'AL' 'N' 'CL' 'R' 'F'
'BJ' 'AR' 'AT' 'S' 'AS' 'BO' 'X' 'D' 'BM' 'I' 'BH' 'CI' 'CF' 'C' 'AM' 'U'
'BE' 'BR' 'CJ' 'AE' 'A' 'Q' 'AW' 'T' 'AJ' 'AH' 'BA' 'BV' 'CC' 'CA' 'BG'
'BB' 'O' 'BD' 'AV' 'AX' 'AQ' 'AA' 'AI' 'AU' 'BX' 'AP' 'CK' 'Y' 'CH' 'BS'
'AN' 'AO' 'BC' 'CE' 'E' 'BY' 'CB' 'BT' 'P' 'BK' 'AF' 'B' 'BF' 'CG' 'V'
'ZZ' 'AY' 'BP' 'BN' 'J' 'AG' 'AK']
cat110 has 131 values. Unique values are :: ['BC' 'CQ' 'DK' 'CS' 'C' 'EB' 'DW' 'AM' 'AI' 'EG' 'CL' 'BS' 'BT' 'CO' 'CM'
'EL' 'AY' 'W' 'EE' 'AC' 'DX' 'CI' 'DT' 'A' 'V' 'DM' 'EF' 'DL' 'DA' 'BP'
'DH' 'CF' 'N' 'T' 'CR' 'X' 'CH' 'EM' 'DC' 'AX' 'BG' 'CJ' 'EA' 'AD' 'U'
'AK' 'BX' 'AW' 'G' 'BA' 'L' 'AP' 'CG' 'R' 'DU' 'I' 'AR' 'O' 'DF' 'AT' 'E'
'AB' 'AU' 'DI' 'CN' 'CP' 'AL' 'ED' 'DJ' 'AO' 'CY' 'BE' 'BJ' 'D' 'AA' 'CK'
'CV' 'BK' 'BB' 'AE' 'BO' 'P' 'DO' 'CT' 'AJ' 'BR' 'Y' 'DR' 'BQ' 'BL' 'B'
'BW' 'H' 'DP' 'DG' 'AG' 'BN' 'J' 'CW' 'DV' 'Q' 'DY' 'EI' 'AV' 'DQ' 'BU'
'K' 'BF' 'BD' 'DS' 'DE' 'BM' 'BY' 'CD' 'BI' 'DD' 'DB' 'AH' 'CC' 'DN' 'CU'
'BV' 'CX' 'AN' 'EK' 'EJ' 'AS' 'AF' 'CB' 'EH' 'S']
cat111 has 16 values. Unique values are :: ['C' 'A' 'G' 'E' 'I' 'M' 'W' 'S' 'K' 'O' 'Q' 'U' 'F' 'B' 'Y' 'D']
cat112 has 51 values. Unique values are :: ['AS' 'AV' 'C' 'N' 'Y' 'J' 'AH' 'K' 'U' 'E' 'AK' 'AI' 'AE' 'A' 'L' 'F' 'AP'
'AD' 'AF' 'AL' 'AN' 'S' 'AW' 'I' 'AR' 'AX' 'AU' 'AQ' 'O' 'AO' 'R' 'H' 'G'
'AC' 'AT' 'AG' 'X' 'AA' 'Q' 'AY' 'D' 'BA' 'P' 'B' 'AM' 'M' 'T' 'W' 'V'
'AB' 'AJ']
cat113 has 61 values. Unique values are :: ['S' 'BM' 'AF' 'AE' 'Y' 'AX' 'H' 'K' 'L' 'A' 'J' 'AK' 'N' 'M' 'AJ' 'AT' 'F'
'BC' 'AY' 'AD' 'BG' 'BO' 'AS' 'BD' 'AN' 'I' 'BF' 'BK' 'AW' 'AG' 'BJ' 'AO'
'Q' 'AM' 'X' 'AU' 'BN' 'BH' 'AI' 'C' 'AV' 'AQ' 'AH' 'G' 'E' 'BA' 'AL' 'BI'
'U' 'AB' 'V' 'O' 'BB' 'AP' 'B' 'BL' 'BE' 'T' 'P' 'AC' 'AR']
cat114 has 19 values. Unique values are :: ['A' 'J' 'E' 'C' 'F' 'L' 'N' 'I' 'R' 'U' 'O' 'B' 'Q' 'V' 'D' 'X' 'W' 'S'
'G']
cat115 has 23 values. Unique values are :: ['O' 'I' 'K' 'P' 'Q' 'L' 'J' 'R' 'N' 'M' 'H' 'G' 'F' 'A' 'S' 'W' 'T' 'C'
'E' 'D' 'B' 'X' 'U']
cat116 has 326 values. Unique values are :: ['LB' 'DP' 'GK' 'DJ' 'CK' 'LO' 'IE' 'LY' 'GS' 'HK' 'DC' 'MP' 'DS' 'LE' 'HQ'
'HJ' 'GC' 'BY' 'HX' 'HL' 'HG' 'MD' 'LF' 'LM' 'CB' 'CS' 'KQ' 'HN' 'LQ' 'KW'
'IT' 'LN' 'CW' 'LC' 'GX' 'GE' 'CP' 'HB' 'GI' 'GM' 'CR' 'JR' 'HA' 'EE' 'BA'
'LJ' 'IH' 'HV' 'GU' 'HM' 'CY' 'IC' 'KD' 'KI' 'DN' 'MG' 'LL' 'KN' 'LH' 'DF'
'EY' 'LW' 'KA' 'EK' 'DK' 'EO' 'CG' 'K' 'HC' 'DI' 'FB' 'IG' 'FR' 'CI' 'EC'
'KR' 'HI' 'IU' 'MC' 'BP' 'JW' 'FH' 'IF' 'E' 'DA' 'KL' 'LX' 'IL' 'KB' 'IQ'
'EL' 'JX' 'H' 'GN' 'CD' 'DH' 'AC' 'FD' 'ME' 'KC' 'FT' 'CT' 'DM' 'GL' 'ES'
'JL' 'BX' 'II' 'HP' 'ED' 'CU' 'EN' 'FG' 'MJ' 'KE' 'CF' 'EB' 'DD' 'EI' 'FX'
'EA' 'BO' 'KP' 'EP' 'FC' 'GB' 'JU' 'LV' 'CO' 'EF' 'BD' 'HW' 'LI' 'GT' 'HH'
'KJ' 'CN' 'B' 'FE' 'GA' 'FW' 'IY' 'MO' 'JG' 'ID' 'DX' 'FA' 'LA' 'HR' 'GJ'
'GO' 'KT' 'GW' 'U' 'MI' 'GP' 'F' 'DU' 'KM' 'BV' 'DT' 'IM' 'LD' 'GR' 'HD'
'BS' 'AJ' 'KX' 'LR' 'ML' 'KU' 'CE' 'IA' 'DE' 'R' 'AO' 'MU' 'AK' 'CX' 'HY'
'EH' 'MA' 'GH' 'LK' 'DL' 'AX' 'IN' 'BI' 'JM' 'JF' 'KK' 'DR' 'LT' 'GF' 'AW'
'KY' 'CA' 'MK' 'DV' 'EG' 'DW' 'MN' 'V' 'CM' 'GY' 'AF' 'JC' 'MR' 'JE' 'IP'
'KV' 'KH' 'BW' 'MQ' 'D' 'HF' 'CV' 'BL' 'FL' 'GV' 'CQ' 'BM' 'JB' 'J' 'FU'
'AG' 'EJ' 'CH' 'MW' 'X' 'DG' 'AV' 'EW' 'O' 'DO' 'BK' 'FS' 'T' 'CL' 'Y'
'JQ' 'I' 'AL' 'JJ' 'HT' 'FF' 'JA' 'GD' 'FV' 'BQ' 'M' 'S' 'EU' 'P' 'FJ'
'AR' 'LG' 'IR' 'GQ' 'MM' 'AY' 'MF' 'GG' 'KG' 'JD' 'L' 'KS' 'AH' 'JV' 'EV'
'CC' 'AB' 'FK' 'JY' 'G' 'W' 'BC' 'AM' 'KF' 'LU' 'IK' 'BU' 'AT' 'JP' 'Q'
'IJ' 'JO' 'JH' 'AS' 'JN' 'BF' 'AD' 'FP' 'MV' 'AA' 'CJ' 'DY' 'IB' 'AN' 'EQ'
'JT' 'BG' 'AP' 'MB' 'JK' 'FI' 'MS' 'HE' 'C' 'IV' 'IO' 'BT' 'DQ' 'FM' 'HO'
'MH' 'MT' 'FO' 'JI' 'FQ' 'AU' 'FN' 'BB' 'HU' 'IX' 'AE']
cat1 has 2 values. Unique values are :: ['A' 'B']
cat2 has 2 values. Unique values are :: ['B' 'A']
cat3 has 2 values. Unique values are :: ['A' 'B']
cat4 has 2 values. Unique values are :: ['A' 'B']
cat5 has 2 values. Unique values are :: ['A' 'B']
cat6 has 2 values. Unique values are :: ['A' 'B']
cat7 has 2 values. Unique values are :: ['A' 'B']
cat8 has 2 values. Unique values are :: ['A' 'B']
cat9 has 2 values. Unique values are :: ['B' 'A']
cat10 has 2 values. Unique values are :: ['A' 'B']
cat11 has 2 values. Unique values are :: ['B' 'A']
cat12 has 2 values. Unique values are :: ['A' 'B']
cat13 has 2 values. Unique values are :: ['A' 'B']
cat14 has 2 values. Unique values are :: ['A' 'B']
cat15 has 2 values. Unique values are :: ['A' 'B']
cat16 has 2 values. Unique values are :: ['A' 'B']
cat17 has 2 values. Unique values are :: ['A' 'B']
cat18 has 2 values. Unique values are :: ['A' 'B']
cat19 has 2 values. Unique values are :: ['A' 'B']
cat20 has 2 values. Unique values are :: ['A' 'B']
cat21 has 2 values. Unique values are :: ['A' 'B']
cat22 has 2 values. Unique values are :: ['A' 'B']
cat23 has 2 values. Unique values are :: ['A' 'B']
cat24 has 2 values. Unique values are :: ['A' 'B']
cat25 has 2 values. Unique values are :: ['A' 'B']
cat26 has 2 values. Unique values are :: ['A' 'B']
cat27 has 2 values. Unique values are :: ['A' 'B']
cat28 has 2 values. Unique values are :: ['A' 'B']
cat29 has 2 values. Unique values are :: ['A' 'B']
cat30 has 2 values. Unique values are :: ['A' 'B']
cat31 has 2 values. Unique values are :: ['A' 'B']
cat32 has 2 values. Unique values are :: ['A' 'B']
cat33 has 2 values. Unique values are :: ['A' 'B']
cat34 has 2 values. Unique values are :: ['A' 'B']
cat35 has 2 values. Unique values are :: ['A' 'B']
cat36 has 2 values. Unique values are :: ['A' 'B']
cat37 has 2 values. Unique values are :: ['A' 'B']
cat38 has 2 values. Unique values are :: ['A' 'B']
cat39 has 2 values. Unique values are :: ['A' 'B']
cat40 has 2 values. Unique values are :: ['A' 'B']
cat41 has 2 values. Unique values are :: ['A' 'B']
cat42 has 2 values. Unique values are :: ['A' 'B']
cat43 has 2 values. Unique values are :: ['A' 'B']
cat44 has 2 values. Unique values are :: ['A' 'B']
cat45 has 2 values. Unique values are :: ['A' 'B']
cat46 has 2 values. Unique values are :: ['A' 'B']
cat47 has 2 values. Unique values are :: ['A' 'B']
cat48 has 2 values. Unique values are :: ['A' 'B']
cat49 has 2 values. Unique values are :: ['A' 'B']
cat50 has 2 values. Unique values are :: ['A' 'B']
cat51 has 2 values. Unique values are :: ['A' 'B']
cat52 has 2 values. Unique values are :: ['A' 'B']
cat53 has 2 values. Unique values are :: ['A' 'B']
cat54 has 2 values. Unique values are :: ['A' 'B']
cat55 has 2 values. Unique values are :: ['A' 'B']
cat56 has 2 values. Unique values are :: ['A' 'B']
cat57 has 2 values. Unique values are :: ['A' 'B']
cat58 has 2 values. Unique values are :: ['A' 'B']
cat59 has 2 values. Unique values are :: ['A' 'B']
cat60 has 2 values. Unique values are :: ['A' 'B']
cat61 has 2 values. Unique values are :: ['A' 'B']
cat62 has 2 values. Unique values are :: ['A' 'B']
cat63 has 2 values. Unique values are :: ['A' 'B']
cat64 has 2 values. Unique values are :: ['A' 'B']
cat65 has 2 values. Unique values are :: ['A' 'B']
cat66 has 2 values. Unique values are :: ['A' 'B']
cat67 has 2 values. Unique values are :: ['A' 'B']
cat68 has 2 values. Unique values are :: ['A' 'B']
cat69 has 2 values. Unique values are :: ['A' 'B']
cat70 has 2 values. Unique values are :: ['A' 'B']
cat71 has 2 values. Unique values are :: ['A' 'B']
cat72 has 2 values. Unique values are :: ['A' 'B']
cat73 has 3 values. Unique values are :: ['A' 'B' 'C']
cat74 has 3 values. Unique values are :: ['A' 'B' 'C']
cat75 has 3 values. Unique values are :: ['A' 'B' 'C']
cat76 has 3 values. Unique values are :: ['A' 'B' 'C']
cat77 has 4 values. Unique values are :: ['D' 'C' 'B' 'A']
cat78 has 4 values. Unique values are :: ['B' 'D' 'C' 'A']
cat79 has 4 values. Unique values are :: ['B' 'D' 'A' 'C']
cat80 has 4 values. Unique values are :: ['D' 'B' 'C' 'A']
cat81 has 4 values. Unique values are :: ['D' 'B' 'C' 'A']
cat82 has 4 values. Unique values are :: ['B' 'A' 'D' 'C']
cat83 has 4 values. Unique values are :: ['B' 'D' 'A' 'C']
cat84 has 4 values. Unique values are :: ['C' 'A' 'D' 'B']
cat85 has 4 values. Unique values are :: ['B' 'D' 'C' 'A']
cat86 has 4 values. Unique values are :: ['D' 'B' 'C' 'A']
cat87 has 4 values. Unique values are :: ['B' 'D' 'C' 'A']
cat88 has 4 values. Unique values are :: ['A' 'D' 'E' 'B']
cat89 has 8 values. Unique values are :: ['A' 'B' 'D' 'C' 'F' 'H' 'E' 'G']
cat90 has 6 values. Unique values are :: ['A' 'B' 'C' 'D' 'F' 'E']
cat91 has 8 values. Unique values are :: ['A' 'G' 'B' 'C' 'E' 'D' 'F' 'H']
cat92 has 8 values. Unique values are :: ['A' 'H' 'B' 'C' 'G' 'I' 'D' 'E']
cat93 has 5 values. Unique values are :: ['D' 'E' 'C' 'B' 'A']
cat94 has 7 values. Unique values are :: ['C' 'D' 'B' 'E' 'F' 'A' 'G']
cat95 has 5 values. Unique values are :: ['C' 'D' 'E' 'A' 'B']
cat96 has 9 values. Unique values are :: ['E' 'B' 'G' 'D' 'F' 'I' 'A' 'C' 'H']
cat97 has 7 values. Unique values are :: ['C' 'A' 'E' 'G' 'D' 'F' 'B']
cat98 has 5 values. Unique values are :: ['D' 'A' 'C' 'E' 'B']
cat99 has 17 values. Unique values are :: ['T' 'P' 'D' 'H' 'R' 'F' 'K' 'S' 'N' 'C' 'E' 'J' 'I' 'G' 'M' 'U' 'O']
cat100 has 15 values. Unique values are :: ['H' 'B' 'G' 'A' 'F' 'I' 'L' 'K' 'J' 'N' 'O' 'M' 'D' 'C' 'E']
cat101 has 17 values. Unique values are :: ['G' 'D' 'Q' 'A' 'F' 'M' 'L' 'O' 'C' 'I' 'J' 'S' 'R' 'E' 'B' 'H' 'K']
cat102 has 7 values. Unique values are :: ['A' 'C' 'B' 'E' 'D' 'G' 'F']
cat103 has 14 values. Unique values are :: ['A' 'D' 'C' 'B' 'E' 'F' 'G' 'I' 'H' 'K' 'J' 'M' 'L' 'N']
cat104 has 17 values. Unique values are :: ['G' 'D' 'E' 'F' 'H' 'K' 'I' 'O' 'L' 'C' 'J' 'M' 'N' 'P' 'B' 'A' 'Q']
cat105 has 18 values. Unique values are :: ['E' 'G' 'F' 'H' 'I' 'D' 'J' 'A' 'L' 'C' 'K' 'N' 'M' 'P' 'O' 'T' 'B' 'Q']
cat106 has 18 values. Unique values are :: ['I' 'G' 'J' 'D' 'F' 'K' 'H' 'E' 'L' 'M' 'A' 'O' 'C' 'N' 'R' 'B' 'Q' 'P']
cat107 has 20 values. Unique values are :: ['L' 'F' 'G' 'K' 'E' 'D' 'C' 'M' 'H' 'I' 'J' 'A' 'O' 'S' 'P' 'N' 'Q' 'U'
'R' 'B']
cat108 has 11 values. Unique values are :: ['K' 'B' 'A' 'G' 'D' 'F' 'E' 'H' 'J' 'I' 'C']
cat109 has 74 values. Unique values are :: ['BI' 'AB' 'K' 'G' 'BU' 'M' 'I' 'O' 'BO' 'CD' 'T' 'BQ' 'R' 'X' 'AR' 'E'
'BL' 'CI' 'S' 'AL' 'BH' 'N' 'U' 'F' 'AS' 'AQ' 'AW' 'CC' 'AN' 'AJ' 'C' 'AT'
'D' 'H' 'CA' 'A' 'AX' 'L' 'BD' 'V' 'BX' 'AH' 'CL' 'AM' 'BA' 'BR' 'AO' 'AE'
'AY' 'BB' 'BJ' 'AP' 'BN' 'AI' 'Q' 'BS' 'CK' 'AU' 'CE' 'BC' 'BG' 'AD' 'Y'
'BK' 'AA' 'CG' 'AV' 'P' 'AF' 'CB' 'CF' 'BE' 'CH' 'ZZ']
cat110 has 123 values. Unique values are :: ['BC' 'CO' 'CS' 'CR' 'EG' 'CL' 'EL' 'BT' 'EB' 'CQ' 'BS' 'C' 'W' 'DX' 'CM'
'A' 'EF' 'CI' 'DL' 'AI' 'BP' 'N' 'DJ' 'CT' 'E' 'DW' 'CH' 'V' 'AM' 'DK'
'EA' 'BR' 'DR' 'D' 'EE' 'T' 'AP' 'I' 'AC' 'CY' 'DM' 'AL' 'CK' 'AD' 'AY'
'CF' 'CD' 'BG' 'AK' 'DA' 'DC' 'DQ' 'BA' 'U' 'CX' 'BJ' 'AV' 'AR' 'K' 'CG'
'DT' 'CN' 'O' 'BO' 'DU' 'CJ' 'AX' 'DH' 'BX' 'AH' 'AU' 'AB' 'BV' 'EM' 'L'
'BH' 'DI' 'DB' 'DE' 'CV' 'DO' 'BQ' 'AW' 'AJ' 'J' 'CU' 'P' 'CP' 'DS' 'BL'
'AO' 'AA' 'DF' 'DG' 'CC' 'X' 'BF' 'AE' 'BU' 'AT' 'BB' 'B' 'ED' 'Y' 'G'
'BE' 'DD' 'DY' 'DP' 'R' 'CW' 'DN' 'AG' 'BW' 'BY' 'EK' 'CA' 'AS' 'EJ' 'BM'
'Q' 'S' 'EN']
cat111 has 16 values. Unique values are :: ['A' 'E' 'C' 'G' 'K' 'I' 'Q' 'U' 'M' 'O' 'S' 'F' 'L' 'W' 'Y' 'B']
cat112 has 51 values. Unique values are :: ['J' 'G' 'U' 'AY' 'E' 'AN' 'AG' 'R' 'N' 'AV' 'AW' 'AS' 'AJ' 'AU' 'T' 'AH'
'AK' 'AF' 'D' 'L' 'AP' 'AI' 'K' 'A' 'AM' 'AT' 'AO' 'O' 'F' 'AD' 'C' 'S'
'AC' 'AA' 'X' 'Y' 'AE' 'AL' 'W' 'Q' 'I' 'B' 'M' 'AR' 'BA' 'AX' 'H' 'V'
'AB' 'P' 'AQ']
cat113 has 60 values. Unique values are :: ['AX' 'X' 'AE' 'AJ' 'I' 'BC' 'S' 'Y' 'L' 'A' 'AO' 'AN' 'N' 'BM' 'AK' 'Q'
'BK' 'J' 'M' 'AV' 'H' 'AD' 'AS' 'AW' 'BN' 'K' 'AG' 'BJ' 'F' 'BG' 'AF' 'AU'
'BO' 'AT' 'BH' 'BD' 'AI' 'AY' 'BF' 'AM' 'E' 'AH' 'C' 'BI' 'AB' 'BA' 'BB'
'O' 'B' 'AQ' 'V' 'BL' 'G' 'AP' 'U' 'AA' 'R' 'AR' 'AL' 'P']
cat114 has 18 values. Unique values are :: ['A' 'C' 'E' 'N' 'I' 'O' 'F' 'J' 'R' 'L' 'U' 'V' 'Q' 'B' 'W' 'G' 'D' 'S']
cat115 has 23 values. Unique values are :: ['Q' 'L' 'K' 'P' 'J' 'I' 'H' 'O' 'M' 'N' 'R' 'G' 'S' 'A' 'F' 'T' 'U' 'X'
'W' 'D' 'C' 'E' 'B']
cat116 has 311 values. Unique values are :: ['HG' 'HK' 'CK' 'DJ' 'HA' 'HY' 'MD' 'KC' 'GC' 'DT' 'HX' 'GE' 'HV' 'HJ' 'DA'
'HL' 'KB' 'JR' 'EP' 'DF' 'DP' 'LN' 'IE' 'GK' 'KW' 'CD' 'CR' 'CG' 'GS' 'LF'
'IF' 'HQ' 'FB' 'LL' 'LQ' 'JE' 'GL' 'LM' 'LB' 'LO' 'DC' 'HB' 'GT' 'CS' 'GX'
'BD' 'CI' 'IC' 'CW' 'EC' 'CH' 'KI' 'MG' 'JW' 'JU' 'HM' 'IT' 'IH' 'IG' 'LY'
'MC' 'EL' 'FH' 'MO' 'KD' 'GU' 'MJ' 'KA' 'FD' 'HH' 'DK' 'AC' 'GI' 'LW' 'BY'
'HN' 'CU' 'BU' 'BO' 'GM' 'KU' 'FR' 'EO' 'CN' 'EI' 'HC' 'LI' 'DS' 'EA' 'ME'
'E' 'GA' 'CB' 'LV' 'CP' 'GN' 'KL' 'CX' 'DH' 'CA' 'BV' 'BX' 'JL' 'KJ' 'EF'
'DD' 'AQ' 'FC' 'GP' 'LX' 'FT' 'HP' 'CM' 'BP' 'CO' 'GJ' 'KR' 'JX' 'KN' 'KP'
'K' 'IU' 'EK' 'LC' 'DO' 'LJ' 'R' 'LT' 'FU' 'KX' 'LD' 'HW' 'DI' 'GW' 'EE'
'GB' 'L' 'KQ' 'BQ' 'EY' 'FE' 'MP' 'MK' 'KS' 'DN' 'LA' 'EN' 'DM' 'AF' 'HD'
'FX' 'FG' 'CQ' 'IM' 'AW' 'EH' 'LK' 'IN' 'DG' 'JC' 'B' 'MU' 'FF' 'KT' 'CT'
'GR' 'IL' 'IQ' 'MI' 'GY' 'MQ' 'AO' 'FA' 'ED' 'I' 'DW' 'AX' 'DU' 'ES' 'EJ'
'HI' 'EB' 'GO' 'LG' 'LE' 'MN' 'BK' 'CL' 'ML' 'IY' 'JM' 'H' 'MA' 'EM' 'AK'
'KE' 'CF' 'HF' 'AJ' 'II' 'Y' 'DX' 'ID' 'GV' 'EW' 'KK' 'HR' 'CV' 'DR' 'IP'
'LH' 'MM' 'BS' 'FW' 'AR' 'GG' 'EG' 'MW' 'KM' 'DL' 'MS' 'JY' 'FP' 'JF' 'BW'
'KY' 'FY' 'GD' 'S' 'CE' 'GH' 'AN' 'KV' 'DE' 'GF' 'AI' 'HT' 'IA' 'BA' 'LR'
'N' 'JP' 'EU' 'JQ' 'BC' 'U' 'MR' 'JG' 'T' 'J' 'BG' 'BM' 'KF' 'IR' 'ET' 'Q'
'MV' 'KO' 'HE' 'JA' 'FK' 'KG' 'FV' 'O' 'BJ' 'JH' 'JV' 'JB' 'IW' 'AD' 'BT'
'F' 'AU' 'IJ' 'AE' 'IV' 'AA' 'DB' 'G' 'JK' 'JJ' 'LP' 'CJ' 'MX' 'BR' 'AV'
'BH' 'JS' 'FQ' 'M' 'FM' 'KH' 'ER' 'AG' 'A' 'AL' 'FL' 'BN' 'BE' 'IS' 'DV'
'FJ' 'CY' 'MH' 'LU' 'BB' 'LS' 'D' 'HS' 'FI' 'EX']
Train set has 8 values. Unique values are :: ['A' 'B' 'C' 'E' 'D' 'H' 'I' 'G']
test set has 8 values. Unique values are :: ['A' 'B' 'D' 'C' 'F' 'H' 'E' 'G']
Missing vaues are : set(['I'])
Train set has 7 values. Unique values are :: ['A' 'B' 'C' 'D' 'F' 'E' 'G']
test set has 6 values. Unique values are :: ['A' 'B' 'C' 'D' 'F' 'E']
Missing vaues are : set(['G'])
Train set has 7 values. Unique values are :: ['A' 'H' 'B' 'C' 'D' 'I' 'F']
test set has 8 values. Unique values are :: ['A' 'H' 'B' 'C' 'G' 'I' 'D' 'E']
Missing vaues are : set(['F'])
Train set has 19 values. Unique values are :: ['G' 'F' 'O' 'D' 'J' 'A' 'C' 'Q' 'M' 'I' 'L' 'R' 'S' 'E' 'N' 'H' 'B' 'U'
'K']
test set has 17 values. Unique values are :: ['G' 'D' 'Q' 'A' 'F' 'M' 'L' 'O' 'C' 'I' 'J' 'S' 'R' 'E' 'B' 'H' 'K']
Missing vaues are : set(['U', 'N'])
Train set has 9 values. Unique values are :: ['A' 'C' 'B' 'D' 'G' 'E' 'F' 'H' 'J']
test set has 7 values. Unique values are :: ['A' 'C' 'B' 'E' 'D' 'G' 'F']
Missing vaues are : set(['H', 'J'])
Train set has 20 values. Unique values are :: ['E' 'F' 'H' 'G' 'I' 'D' 'J' 'K' 'M' 'C' 'A' 'L' 'N' 'P' 'T' 'Q' 'R' 'O'
'B' 'S']
test set has 18 values. Unique values are :: ['E' 'G' 'F' 'H' 'I' 'D' 'J' 'A' 'L' 'C' 'K' 'N' 'M' 'P' 'O' 'T' 'B' 'Q']
Missing vaues are : set(['S', 'R'])
Train set has 84 values. Unique values are :: ['BU' 'BI' 'AB' 'H' 'K' 'CD' 'BQ' 'M' 'G' 'BL' 'L' 'AL' 'N' 'CL' 'R' 'F'
'BJ' 'AR' 'AT' 'S' 'AS' 'BO' 'X' 'D' 'BM' 'I' 'BH' 'CI' 'CF' 'C' 'AM' 'U'
'BE' 'BR' 'CJ' 'AE' 'A' 'Q' 'AW' 'T' 'AJ' 'AH' 'BA' 'BV' 'CC' 'CA' 'BG'
'BB' 'O' 'BD' 'AV' 'AX' 'AQ' 'AA' 'AI' 'AU' 'BX' 'AP' 'CK' 'Y' 'CH' 'BS'
'AN' 'AO' 'BC' 'CE' 'E' 'BY' 'CB' 'BT' 'P' 'BK' 'AF' 'B' 'BF' 'CG' 'V'
'ZZ' 'AY' 'BP' 'BN' 'J' 'AG' 'AK']
test set has 74 values. Unique values are :: ['BI' 'AB' 'K' 'G' 'BU' 'M' 'I' 'O' 'BO' 'CD' 'T' 'BQ' 'R' 'X' 'AR' 'E'
'BL' 'CI' 'S' 'AL' 'BH' 'N' 'U' 'F' 'AS' 'AQ' 'AW' 'CC' 'AN' 'AJ' 'C' 'AT'
'D' 'H' 'CA' 'A' 'AX' 'L' 'BD' 'V' 'BX' 'AH' 'CL' 'AM' 'BA' 'BR' 'AO' 'AE'
'AY' 'BB' 'BJ' 'AP' 'BN' 'AI' 'Q' 'BS' 'CK' 'AU' 'CE' 'BC' 'BG' 'AD' 'Y'
'BK' 'AA' 'CG' 'AV' 'P' 'AF' 'CB' 'CF' 'BE' 'CH' 'ZZ']
Missing vaues are : set(['CJ', 'BF', 'B', 'AG', 'BM', 'AK', 'J', 'BT', 'BV', 'BP', 'BY'])
Train set has 131 values. Unique values are :: ['BC' 'CQ' 'DK' 'CS' 'C' 'EB' 'DW' 'AM' 'AI' 'EG' 'CL' 'BS' 'BT' 'CO' 'CM'
'EL' 'AY' 'W' 'EE' 'AC' 'DX' 'CI' 'DT' 'A' 'V' 'DM' 'EF' 'DL' 'DA' 'BP'
'DH' 'CF' 'N' 'T' 'CR' 'X' 'CH' 'EM' 'DC' 'AX' 'BG' 'CJ' 'EA' 'AD' 'U'
'AK' 'BX' 'AW' 'G' 'BA' 'L' 'AP' 'CG' 'R' 'DU' 'I' 'AR' 'O' 'DF' 'AT' 'E'
'AB' 'AU' 'DI' 'CN' 'CP' 'AL' 'ED' 'DJ' 'AO' 'CY' 'BE' 'BJ' 'D' 'AA' 'CK'
'CV' 'BK' 'BB' 'AE' 'BO' 'P' 'DO' 'CT' 'AJ' 'BR' 'Y' 'DR' 'BQ' 'BL' 'B'
'BW' 'H' 'DP' 'DG' 'AG' 'BN' 'J' 'CW' 'DV' 'Q' 'DY' 'EI' 'AV' 'DQ' 'BU'
'K' 'BF' 'BD' 'DS' 'DE' 'BM' 'BY' 'CD' 'BI' 'DD' 'DB' 'AH' 'CC' 'DN' 'CU'
'BV' 'CX' 'AN' 'EK' 'EJ' 'AS' 'AF' 'CB' 'EH' 'S']
test set has 123 values. Unique values are :: ['BC' 'CO' 'CS' 'CR' 'EG' 'CL' 'EL' 'BT' 'EB' 'CQ' 'BS' 'C' 'W' 'DX' 'CM'
'A' 'EF' 'CI' 'DL' 'AI' 'BP' 'N' 'DJ' 'CT' 'E' 'DW' 'CH' 'V' 'AM' 'DK'
'EA' 'BR' 'DR' 'D' 'EE' 'T' 'AP' 'I' 'AC' 'CY' 'DM' 'AL' 'CK' 'AD' 'AY'
'CF' 'CD' 'BG' 'AK' 'DA' 'DC' 'DQ' 'BA' 'U' 'CX' 'BJ' 'AV' 'AR' 'K' 'CG'
'DT' 'CN' 'O' 'BO' 'DU' 'CJ' 'AX' 'DH' 'BX' 'AH' 'AU' 'AB' 'BV' 'EM' 'L'
'BH' 'DI' 'DB' 'DE' 'CV' 'DO' 'BQ' 'AW' 'AJ' 'J' 'CU' 'P' 'CP' 'DS' 'BL'
'AO' 'AA' 'DF' 'DG' 'CC' 'X' 'BF' 'AE' 'BU' 'AT' 'BB' 'B' 'ED' 'Y' 'G'
'BE' 'DD' 'DY' 'DP' 'R' 'CW' 'DN' 'AG' 'BW' 'BY' 'EK' 'CA' 'AS' 'EJ' 'BM'
'Q' 'S' 'EN']
Missing vaues are : set(['BD', 'EI', 'EH', 'AF', 'H', 'BN', 'BI', 'BK', 'CB', 'DV', 'AN'])
Train set has 16 values. Unique values are :: ['C' 'A' 'G' 'E' 'I' 'M' 'W' 'S' 'K' 'O' 'Q' 'U' 'F' 'B' 'Y' 'D']
test set has 16 values. Unique values are :: ['A' 'E' 'C' 'G' 'K' 'I' 'Q' 'U' 'M' 'O' 'S' 'F' 'L' 'W' 'Y' 'B']
Missing vaues are : set(['D'])
Train set has 61 values. Unique values are :: ['S' 'BM' 'AF' 'AE' 'Y' 'AX' 'H' 'K' 'L' 'A' 'J' 'AK' 'N' 'M' 'AJ' 'AT' 'F'
'BC' 'AY' 'AD' 'BG' 'BO' 'AS' 'BD' 'AN' 'I' 'BF' 'BK' 'AW' 'AG' 'BJ' 'AO'
'Q' 'AM' 'X' 'AU' 'BN' 'BH' 'AI' 'C' 'AV' 'AQ' 'AH' 'G' 'E' 'BA' 'AL' 'BI'
'U' 'AB' 'V' 'O' 'BB' 'AP' 'B' 'BL' 'BE' 'T' 'P' 'AC' 'AR']
test set has 60 values. Unique values are :: ['AX' 'X' 'AE' 'AJ' 'I' 'BC' 'S' 'Y' 'L' 'A' 'AO' 'AN' 'N' 'BM' 'AK' 'Q'
'BK' 'J' 'M' 'AV' 'H' 'AD' 'AS' 'AW' 'BN' 'K' 'AG' 'BJ' 'F' 'BG' 'AF' 'AU'
'BO' 'AT' 'BH' 'BD' 'AI' 'AY' 'BF' 'AM' 'E' 'AH' 'C' 'BI' 'AB' 'BA' 'BB'
'O' 'B' 'AQ' 'V' 'BL' 'G' 'AP' 'U' 'AA' 'R' 'AR' 'AL' 'P']
Missing vaues are : set(['BE', 'AC', 'T'])
Train set has 19 values. Unique values are :: ['A' 'J' 'E' 'C' 'F' 'L' 'N' 'I' 'R' 'U' 'O' 'B' 'Q' 'V' 'D' 'X' 'W' 'S'
'G']
test set has 18 values. Unique values are :: ['A' 'C' 'E' 'N' 'I' 'O' 'F' 'J' 'R' 'L' 'U' 'V' 'Q' 'B' 'W' 'G' 'D' 'S']
Missing vaues are : set(['X'])
Train set has 326 values. Unique values are :: ['LB' 'DP' 'GK' 'DJ' 'CK' 'LO' 'IE' 'LY' 'GS' 'HK' 'DC' 'MP' 'DS' 'LE' 'HQ'
'HJ' 'GC' 'BY' 'HX' 'HL' 'HG' 'MD' 'LF' 'LM' 'CB' 'CS' 'KQ' 'HN' 'LQ' 'KW'
'IT' 'LN' 'CW' 'LC' 'GX' 'GE' 'CP' 'HB' 'GI' 'GM' 'CR' 'JR' 'HA' 'EE' 'BA'
'LJ' 'IH' 'HV' 'GU' 'HM' 'CY' 'IC' 'KD' 'KI' 'DN' 'MG' 'LL' 'KN' 'LH' 'DF'
'EY' 'LW' 'KA' 'EK' 'DK' 'EO' 'CG' 'K' 'HC' 'DI' 'FB' 'IG' 'FR' 'CI' 'EC'
'KR' 'HI' 'IU' 'MC' 'BP' 'JW' 'FH' 'IF' 'E' 'DA' 'KL' 'LX' 'IL' 'KB' 'IQ'
'EL' 'JX' 'H' 'GN' 'CD' 'DH' 'AC' 'FD' 'ME' 'KC' 'FT' 'CT' 'DM' 'GL' 'ES'
'JL' 'BX' 'II' 'HP' 'ED' 'CU' 'EN' 'FG' 'MJ' 'KE' 'CF' 'EB' 'DD' 'EI' 'FX'
'EA' 'BO' 'KP' 'EP' 'FC' 'GB' 'JU' 'LV' 'CO' 'EF' 'BD' 'HW' 'LI' 'GT' 'HH'
'KJ' 'CN' 'B' 'FE' 'GA' 'FW' 'IY' 'MO' 'JG' 'ID' 'DX' 'FA' 'LA' 'HR' 'GJ'
'GO' 'KT' 'GW' 'U' 'MI' 'GP' 'F' 'DU' 'KM' 'BV' 'DT' 'IM' 'LD' 'GR' 'HD'
'BS' 'AJ' 'KX' 'LR' 'ML' 'KU' 'CE' 'IA' 'DE' 'R' 'AO' 'MU' 'AK' 'CX' 'HY'
'EH' 'MA' 'GH' 'LK' 'DL' 'AX' 'IN' 'BI' 'JM' 'JF' 'KK' 'DR' 'LT' 'GF' 'AW'
'KY' 'CA' 'MK' 'DV' 'EG' 'DW' 'MN' 'V' 'CM' 'GY' 'AF' 'JC' 'MR' 'JE' 'IP'
'KV' 'KH' 'BW' 'MQ' 'D' 'HF' 'CV' 'BL' 'FL' 'GV' 'CQ' 'BM' 'JB' 'J' 'FU'
'AG' 'EJ' 'CH' 'MW' 'X' 'DG' 'AV' 'EW' 'O' 'DO' 'BK' 'FS' 'T' 'CL' 'Y'
'JQ' 'I' 'AL' 'JJ' 'HT' 'FF' 'JA' 'GD' 'FV' 'BQ' 'M' 'S' 'EU' 'P' 'FJ'
'AR' 'LG' 'IR' 'GQ' 'MM' 'AY' 'MF' 'GG' 'KG' 'JD' 'L' 'KS' 'AH' 'JV' 'EV'
'CC' 'AB' 'FK' 'JY' 'G' 'W' 'BC' 'AM' 'KF' 'LU' 'IK' 'BU' 'AT' 'JP' 'Q'
'IJ' 'JO' 'JH' 'AS' 'JN' 'BF' 'AD' 'FP' 'MV' 'AA' 'CJ' 'DY' 'IB' 'AN' 'EQ'
'JT' 'BG' 'AP' 'MB' 'JK' 'FI' 'MS' 'HE' 'C' 'IV' 'IO' 'BT' 'DQ' 'FM' 'HO'
'MH' 'MT' 'FO' 'JI' 'FQ' 'AU' 'FN' 'BB' 'HU' 'IX' 'AE']
test set has 311 values. Unique values are :: ['HG' 'HK' 'CK' 'DJ' 'HA' 'HY' 'MD' 'KC' 'GC' 'DT' 'HX' 'GE' 'HV' 'HJ' 'DA'
'HL' 'KB' 'JR' 'EP' 'DF' 'DP' 'LN' 'IE' 'GK' 'KW' 'CD' 'CR' 'CG' 'GS' 'LF'
'IF' 'HQ' 'FB' 'LL' 'LQ' 'JE' 'GL' 'LM' 'LB' 'LO' 'DC' 'HB' 'GT' 'CS' 'GX'
'BD' 'CI' 'IC' 'CW' 'EC' 'CH' 'KI' 'MG' 'JW' 'JU' 'HM' 'IT' 'IH' 'IG' 'LY'
'MC' 'EL' 'FH' 'MO' 'KD' 'GU' 'MJ' 'KA' 'FD' 'HH' 'DK' 'AC' 'GI' 'LW' 'BY'
'HN' 'CU' 'BU' 'BO' 'GM' 'KU' 'FR' 'EO' 'CN' 'EI' 'HC' 'LI' 'DS' 'EA' 'ME'
'E' 'GA' 'CB' 'LV' 'CP' 'GN' 'KL' 'CX' 'DH' 'CA' 'BV' 'BX' 'JL' 'KJ' 'EF'
'DD' 'AQ' 'FC' 'GP' 'LX' 'FT' 'HP' 'CM' 'BP' 'CO' 'GJ' 'KR' 'JX' 'KN' 'KP'
'K' 'IU' 'EK' 'LC' 'DO' 'LJ' 'R' 'LT' 'FU' 'KX' 'LD' 'HW' 'DI' 'GW' 'EE'
'GB' 'L' 'KQ' 'BQ' 'EY' 'FE' 'MP' 'MK' 'KS' 'DN' 'LA' 'EN' 'DM' 'AF' 'HD'
'FX' 'FG' 'CQ' 'IM' 'AW' 'EH' 'LK' 'IN' 'DG' 'JC' 'B' 'MU' 'FF' 'KT' 'CT'
'GR' 'IL' 'IQ' 'MI' 'GY' 'MQ' 'AO' 'FA' 'ED' 'I' 'DW' 'AX' 'DU' 'ES' 'EJ'
'HI' 'EB' 'GO' 'LG' 'LE' 'MN' 'BK' 'CL' 'ML' 'IY' 'JM' 'H' 'MA' 'EM' 'AK'
'KE' 'CF' 'HF' 'AJ' 'II' 'Y' 'DX' 'ID' 'GV' 'EW' 'KK' 'HR' 'CV' 'DR' 'IP'
'LH' 'MM' 'BS' 'FW' 'AR' 'GG' 'EG' 'MW' 'KM' 'DL' 'MS' 'JY' 'FP' 'JF' 'BW'
'KY' 'FY' 'GD' 'S' 'CE' 'GH' 'AN' 'KV' 'DE' 'GF' 'AI' 'HT' 'IA' 'BA' 'LR'
'N' 'JP' 'EU' 'JQ' 'BC' 'U' 'MR' 'JG' 'T' 'J' 'BG' 'BM' 'KF' 'IR' 'ET' 'Q'
'MV' 'KO' 'HE' 'JA' 'FK' 'KG' 'FV' 'O' 'BJ' 'JH' 'JV' 'JB' 'IW' 'AD' 'BT'
'F' 'AU' 'IJ' 'AE' 'IV' 'AA' 'DB' 'G' 'JK' 'JJ' 'LP' 'CJ' 'MX' 'BR' 'AV'
'BH' 'JS' 'FQ' 'M' 'FM' 'KH' 'ER' 'AG' 'A' 'AL' 'FL' 'BN' 'BE' 'IS' 'DV'
'FJ' 'CY' 'MH' 'LU' 'BB' 'LS' 'D' 'HS' 'FI' 'EX']
Missing vaues are : set(['BF', 'FS', 'W', 'BL', 'BI', 'HU', 'JN', 'JO', 'JI', 'DY', 'JD', 'FN', 'FO', 'IB', 'JT', 'DQ', 'IX', 'C', 'AB', 'GQ', 'CC', 'AH', 'AM', 'AP', 'AS', 'AT', 'IO', 'V', 'AY', 'X', 'EV', 'EQ', 'MF', 'MB', 'IK', 'MT', 'P', 'HO'])
|
Let’s plot the categorical variables to see the distribution of the variables.
1
2
3
4
5
6
7
8
9
10
|
# lets visualize the values in each of the features
# keep in mind you'll be seeing a lot of plots now
# better is use ipython/jupyter notebook to plot inline plots
for feature in categorical_train:
sns.countplot(x = train[feature], data = train)
plt.show()
for feature in categorical_test:
sns.countplot(x = test[feature], data = test)
plt.show()
|
One Hot Encoding of categorical variables
Encode categorical integer features using a one-hot aka one-of-K scheme. The input to this transformer should be a matrix of integers, denoting the values taken on by categorical (discrete) features. The output will be a sparse matrix where each column corresponds to one possible value of one feature.
- The first way is to use dictvectorizer to encode the labels in the feature.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
|
# cat1 to cat72 have only two labels A and B
# cat73 to cat 108 have more than two labels
# cat109 to cat116 have many labels
# moreover you must have noticed that some labels are missing in some features of train/test dataset
# this might become a problem when working with multiple datasets
# to avoid this, we will merge data before doing onehotencoding
train_test = pd.concat(train, test).reset_index(drop=True)
categorical = train_test.dtypes[train_test.dtypes == "object"].index
# lets check for factors in the categorical variables
for feature in categorical:
print feature, 'has ', len(train_test[feature].unique()), 'values. Unique values are :: ', train_test[feature].unique()
# 1. one hot encoding all categorical variables
v = DictVectorizer()
train_test_qual = v.fit_transform(train_test[categorical].to_dict('records'))
print 'total vocabulary :: ', train_test_qual.vocabulary_
print 'total number of columns', len(train_test_qual.vocabulary_.keys())
print 'total number of new columns added ', len(train_test_qual.vocabulary_.keys()) - len(categorical)
# it can be seen that we are adding too many new variables. This encoding is important
# since machine learning algorithms dont understand strings and we have to convert string factors
# as numeric factors which increase our dimensionality
new_df = pd.DataFrame(X_qual.toarray(), columns= [i[0] for i in sorted(v.vocabulary_.items(), key=operator.itemgetter(1))])
new_df = pd.concat([new_df, train_test], axis=1)
# remove initial categorical variables
new_df.drop(categorical, axis=1, inplace=True)
# take back the train and test set from the above data
train_featured = new_df.iloc[:train.shape[0], :]
test_featured = new_df.iloc[train.shape[0]:, :]
train_featured[continous_train] = train[continous_train]
test_featured[continous_train] = test[continous_train]
train_featured['loss'] = loss
|
2. Second method is to use pandas to get dummy variables
1
2
3
4
5
6
7
8
9
10
11
12
13
|
# 2. using get dummies from pandas
new_df2 = train_test
dummies = pd.get_dummies(train_test[categorical], drop_first = True)
new_df2 = pd.concat([new_df2, dummies], axis=1)
new_df2.drop(categorical, inplace=True, axis=1)
# take back the train and test set from the above data
train_featured2 = new_df2.iloc[:train.shape[0], :]
test_featured2 = new_df2.iloc[train.shape[0]:, :]
train_featured2[continous_train] = train[continous_train]
test_featured2[continous_train] = test[continous_train]
train_featured2['loss'] = loss
|
3. Some of these variables only have two labels and some have more than two. One way is to use factorize to convert these labels to numeric
1
2
3
4
5
6
7
8
9
10
11
|
# 3. pd.factorize
new_df3 = train_test
for feature in new_df3.columns:
new_df3[feature] = pd.factorize(new_df3[feature], sort=True)[0]
# take back the train and test set from the above data
train_featured3 = new_df3.iloc[:train.shape[0], :]
test_featured3 = new_df3.iloc[train.shape[0]:, :]
train_featured3[continous_train] = train[continous_train]
test_featured3[continous_train] = test[continous_train]
train_featured3['loss'] = loss
|
4. Another way to is to mix the dummy variables and factorize
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
|
# 4. mixed model
# what we can do is mix these models since cat1 to cat72 just have 2 labels, so we can factorize
# these variables
# for the rest we can make dummies
new_df4 = train_test
for feature in new_df4.columns[:72]:
new_df4[feature] = pd.factorize(new_df4[feature], sort=True)[0]
dummies = pd.get_dummies(train_test[categorical[73:]], drop_first = True)
new_df4 = pd.concat([new_df4, dummies], axis=1)
new_df4.drop(categorical[73:], inplace=True, axis=1)
# take back the train and test set from the above data
train_featured4 = new_df4.iloc[:train.shape[0], :]
test_featured4 = new_df4.iloc[train.shape[0]:, :]
train_featured4[continous_train] = train[continous_train]
test_featured4[continous_train] = test[continous_train]
train_featured4['loss'] = loss
|
Here’s the full code
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
|
# import required libraries
# pandas for reading data and manipulation
# scikit learn to one hot encoder and label encoder
# sns and matplotlib to visualize
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction import DictVectorizer
import operator
# read data from csv file
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
# let's take a look at the train and test data
print '**************************************'
print 'TRAIN DATA'
print '**************************************'
print train.head(5)
print '**************************************'
print 'TEST DATA'
print '**************************************'
print test.head(5)
# the above code wont print all columns.
# to print all columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
# let's take a look at the train and test data again
print '**************************************'
print 'TRAIN DATA'
print '**************************************'
print train.head(5)
print '**************************************'
print 'TEST DATA'
print '**************************************'
print test.head(5)
# remove ID column. No use.
train.drop('id',axis=1,inplace=True)
test.drop('id',axis=1,inplace=True)
loss = train.drop('loss', axis = 1, inplace = True)
# high level statistics. mean media mode count and quartiles
# note - this will work only for the continous variables
# not for the categorical variables
print train.describe()
print test.describe()
# at this point, it is wise to check whether there are any features that
# are there is one of the dataset but not in other
missingFeatures = False
inTrainNotTest = []
for feature in train.columns:
if feature not in test.columns:
missingFeatures = True
inTrainNotTest.append(feature)
if len(inTrainNotTest)>0:
print ', '. join(inTrainNotTest), ' features are present in training set but not in test set'
inTestNotTrain = []
for feature in test.columns:
if feature not in train.columns:
missingFeatures = True
inTestNotTrain.append(feature)
if len(inTestNotTrain)>0:
print ', '. join(inTestNotTrain), ' features are present in test set but not in training set'
# find categorical variables
# in this problem, categorical variables are start with cat which is easy
# to identify
# in other problems it not might be like that
# we will see two ways to identify this in this problem
# we will also find the continous or numerical variables
## 1. by name
categorical_train = [var for var in train.columns if 'cat' in var]
categorical_test = [var for var in test.columns if 'cat' in var]
continous_train = [var for var in train.columns if 'cont' in var]
continous_test = [var for var in test.columns if 'cont' in var]
## 2. by type = object
categorical_train = train.dtypes[train.dtypes == "object"].index
categorical_test = test.dtypes[test.dtypes == "object"].index
continous_train = train.dtypes[train.dtypes != "object"].index
continous_test = test.dtypes[test.dtypes != "object"].index
# lets check for correlation between continous data
# correlation between numerical variables is something like this
# if we increase one variable, there is a siginficant almost increase/decrease
# in the other variable. it varies from -1 to 1
correlation_train = train[continous_train].corr()
correlation_test = train[continous_test].corr()
# for the purpose of this analysis, we will consider to variables to
# highly correlation if the correlation is more than 0.6
threshold = 0.6
for i in range(len(correlation_train)):
for j in range(len(correlation_train)):
if (i>j) and (correlation_train.iloc[i,j]>threshold):
print ("%s and %s = %.2f" % (train.columns[i],train.columns[j],correlation_train.iloc[i,j]))
for i in range(len(correlation_test)):
for j in range(len(correlation_test)):
if (i>j) and (correlation_test.iloc[i,j]>threshold):
print ("%s and %s = %.2f" % (test.columns[i],test.columns[j],correlation_test.iloc[i,j]))
####cat6 and cat1 = 0.76
####cat7 and cat6 = 0.66
####cat9 and cat1 = 0.93
####cat9 and cat6 = 0.80
####cat10 and cat1 = 0.81
####cat10 and cat6 = 0.88
####cat10 and cat9 = 0.79
####cat11 and cat6 = 0.77
####cat11 and cat7 = 0.75
####cat11 and cat9 = 0.61
####cat11 and cat10 = 0.70
####cat12 and cat1 = 0.61
####cat12 and cat6 = 0.79
####cat12 and cat7 = 0.74
####cat12 and cat9 = 0.63
####cat12 and cat10 = 0.71
####cat12 and cat11 = 0.99
####cat13 and cat6 = 0.82
####cat13 and cat9 = 0.64
####cat13 and cat10 = 0.71
# we can remove one of the two highly correlatied variables to improve performance
# lets check for factors in the categorical variables
for feature in categorical_train:
print feature, 'has ', len(train[feature].unique()), 'values. Unique values are :: ', train[feature].unique()
for feature in categorical_test:
print feature, 'has ', len(test[feature].unique()), 'values. Unique values are :: ', test[feature].unique()
# lets take a look whether the unique values/factors are not present in each of the dataset
# for example cat1 in both the datasets has values only A & B. Sometimes
# it may happen that some new value is present in the test set which maybe ruin your model
featuresDone = []
for feature in categorical_train:
if feature in categorical_test:
if set(train[feature].unique()) - set(test[feature].unique()) != set([]):
print 'Train set has ', len(train[feature].unique()), 'values. Unique values are :: ', train[feature].unique(), '\n'
print 'test set has ', len(test[feature].unique()), 'values. Unique values are :: ', test[feature].unique(), '\n'
print 'Missing vaues are : ', set(train[feature].unique()) - set(test[feature].unique())
featuresDone.append(feature)
for feature in categorical_test:
if (feature in categorical_train) and (feature not in featuresDone):
if set(train[feature].unique()) - set(test[feature].unique()) != set([]):
print 'Train set has ', len(train[feature].unique()), 'values. Unique values are :: ', train[feature].unique(), '\n'
print 'test set has ', len(test[feature].unique()), 'values. Unique values are :: ', test[feature].unique(), '\n'
print 'Missing vaues are : ', set(train[feature].unique()) - set(test[feature].unique())
featuresDone.append(feature)
# lets visualize the values in each of the features
# keep in mind you'll be seeing a lot of plots now
# better is use ipython/jupyter notebook to plot inline plots
for feature in categorical_train:
sns.countplot(x = train[feature], data = train)
#plt.show()
for feature in categorical_test:
sns.countplot(x = test[feature], data = test)
#plt.show()
# cat1 to cat72 have only two labels A and B
# cat73 to cat 108 have more than two labels
# cat109 to cat116 have many labels
# moreover you must have noticed that some labels are missing in some features of train/test dataset
# this might become a problem when working with multiple datasets
# to avoid this, we will merge data before doing onehotencoding
train_test = pd.concat(train, test).reset_index(drop=True)
categorical = train_test.dtypes[train_test.dtypes == "object"].index
# lets check for factors in the categorical variables
for feature in categorical:
print feature, 'has ', len(train_test[feature].unique()), 'values. Unique values are :: ', train_test[feature].unique()
# 1. one hot encoding all categorical variables
v = DictVectorizer()
train_test_qual = v.fit_transform(train_test[categorical].to_dict('records'))
print 'total vocabulary :: ', train_test_qual.vocabulary_
print 'total number of columns', len(train_test_qual.vocabulary_.keys())
print 'total number of new columns added ', len(train_test_qual.vocabulary_.keys()) - len(categorical)
# it can be seen that we are adding too many new variables. This encoding is important
# since machine learning algorithms dont understand strings and we have to convert string factors
# as numeric factors which increase our dimensionality
new_df = pd.DataFrame(X_qual.toarray(), columns= [i[0] for i in sorted(v.vocabulary_.items(), key=operator.itemgetter(1))])
new_df = pd.concat([new_df, train_test], axis=1)
# remove initial categorical variables
new_df.drop(categorical, axis=1, inplace=True)
# take back the train and test set from the above data
train_featured = new_df.iloc[:train.shape[0], :]
test_featured = new_df.iloc[train.shape[0]:, :]
train_featured[continous_train] = train[continous_train]
test_featured[continous_train] = test[continous_train]
train_featured['loss'] = loss
# 2. using get dummies from pandas
new_df2 = train_test
dummies = pd.get_dummies(train_test[categorical], drop_first = True)
new_df2 = pd.concat([new_df2, dummies], axis=1)
new_df2.drop(categorical, inplace=True, axis=1)
# take back the train and test set from the above data
train_featured2 = new_df2.iloc[:train.shape[0], :]
test_featured2 = new_df2.iloc[train.shape[0]:, :]
train_featured2[continous_train] = train[continous_train]
test_featured2[continous_train] = test[continous_train]
train_featured2['loss'] = loss
# 3. pd.factorize
new_df3 = train_test
for feature in new_df3.columns:
new_df3[feature] = pd.factorize(new_df3[feature], sort=True)[0]
# take back the train and test set from the above data
train_featured3 = new_df3.iloc[:train.shape[0], :]
test_featured3 = new_df3.iloc[train.shape[0]:, :]
train_featured3[continous_train] = train[continous_train]
test_featured3[continous_train] = test[continous_train]
train_featured3['loss'] = loss
# 4. mixed model
# what we can do is mix these models since cat1 to cat72 just have 2 labels, so we can factorize
# these variables
# for the rest we can make dummies
new_df4 = train_test
for feature in new_df4.columns[:72]:
new_df4[feature] = pd.factorize(new_df4[feature], sort=True)[0]
dummies = pd.get_dummies(train_test[categorical[73:]], drop_first = True)
new_df4 = pd.concat([new_df4, dummies], axis=1)
new_df4.drop(categorical[73:], inplace=True, axis=1)
# take back the train and test set from the above data
train_featured4 = new_df4.iloc[:train.shape[0], :]
test_featured4 = new_df4.iloc[train.shape[0]:, :]
train_featured4[continous_train] = train[continous_train]
test_featured4[continous_train] = test[continous_train]
train_featured4['loss'] = loss
## this we can use for training and testing in the model
|
Happy Python-ing!
Posted in Machine Learning, Python