Understand the data
A new data set (problem) is a wrapped gift. It’s full of promise and anticipation of the miracles you can work once you’ve solved it. But it remains a mystery until you’ve opened it. This chapter is about opening up your new data set so you can see what’s inside, get an appreciation for what you’ll be able to do with the data, and start thinking about how you’ll approach model building with it.
Attributes (the variables being used to make predictions) are also known as the following:
■ Predictors
■ Features
■ Independent variables
■ Inputs
Labels are also known as the following:
■ Outcomes
■ Targets
■ Dependent variables
■ Responses
Different Types of Attributes and Labels Drive Modeling Choices
The attributes come in two different types: numeric variables and categorical (or factor) variables. Attribute 1 (height) is a numeric variable and is the most usual type of attribute. Attribute 2 is gender and is indicated by the entry Male or Female. This type of attribute is called a categorical or factor variable. Categorical variables have the property that there’s no order relation between the various values. There’s no sense to Male < Female (despite centuries of squabbling). Categorical variables can be two-valued, like Male/Female, or multivalued, like states (AL, AK, AR . . . WY). Other distinctions can be drawn regarding attributes (integer versus float, for example), but they do not have the same impact on machine learning algorithms. The reason for this is that many machine learning algorithms take numeric attributes only; they cannot handle categorical or factor variables. Penalized regression algorithms deal only with numeric attributes. The same is true for support vector machines, kernel methods, and K-nearest neighbors.
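Because of this, categorical attributes generally get converted to numbers before training. One common conversion is to code each category as a 0/1 indicator. Here is a minimal sketch, assuming a two-valued gender attribute like the one above; the one_hot helper and the example values are illustrative, not part of any library or data set.

# a minimal sketch: 0/1 (one-hot) coding for a categorical attribute
def one_hot(value, categories):
    # build a 0/1 vector with a single 1 marking the observed category
    return [1 if value == c else 0 for c in categories]

genders = ["Male", "Female", "Female", "Male"]
categories = sorted(set(genders))                  # ['Female', 'Male']
coded = [one_hot(g, categories) for g in genders]
print(coded)   # [[0, 1], [1, 0], [1, 0], [0, 1]]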
When the labels are numeric, the problem is called a regression problem. When the labels are categorical, the problem is called a classification problem. If the categorical target takes only two values, the problem is called a binary classification problem. If it takes more than two values, the problem is called a multiclass classification problem.
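For instance, the rocks versus mines labels introduced later in this chapter take only the two values R and M, so that problem is binary classification. A minimal sketch of this bookkeeping follows; the numeric test is a loose heuristic, not a hard rule.

# a minimal sketch: name the problem type from the label values
labels = ["R", "M", "R", "M"]          # two categorical values
unique = set(labels)
if all(isinstance(v, (int, float)) for v in unique):
    problem = "regression"
elif len(unique) == 2:
    problem = "binary classification"
else:
    problem = "multiclass classification"
print(problem)                          # binary classification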
The classification problem might also be simpler than the regression problem. Consider, for instance, the difference in complexity between a topographic map with a single contour line (say, the 100-foot contour line) and a topographic map with contour lines every 10 feet. The single contour divides the map into the areas that are higher than 100 feet and those that are lower, and it contains considerably less information than the more detailed contour map. A classifier is trying to compute a single dividing contour without regard for behavior distant from the decision boundary, whereas regression is trying to draw the whole map.
Items to Check (a code sketch follows the list):
■ Number of rows and columns
■ Number of categorical variables and number of unique values for each
■ Missing values
■ Summary statistics for attributes and labels
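A minimal sketch of the row/column and missing-value checks, assuming the data sit in a comma-separated file; the file name data.csv is hypothetical, and missing values are assumed to show up as empty strings. The categorical counts and summary statistics are sketched in the sections that follow.

# a minimal sketch of basic checks; "data.csv" is a hypothetical file name
import csv

with open("data.csv") as f:
    rows = [row for row in csv.reader(f)]

# number of rows and columns
print("rows:", len(rows), "columns:", len(rows[0]))

# missing values, counted here as empty strings in each column
for col in range(len(rows[0])):
    missing = sum(1 for row in rows if row[col].strip() == "")
    if missing > 0:
        print("column", col, "has", missing, "missing entries")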
Classification Problems: Detecting Unexploded Mines Using Sonar
The data result from some experiments to determine if sonar can be used to detect unexploded mines left in harbors subsequent to military actions. The sonar signal is what’s called a chirped signal. That means that the signal rises (or falls) in frequency over the duration of the sound pulse. The measurements in the data set represent the power measurements collected in the sonar receiver at different points in the returned signal. For roughly half of the examples, the sonar is illuminating a rock, and for the other half a metal cylinder having the shape of a mine. The data set goes by the name of “Rocks versus Mines.”
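A quick count of the two label values (R for rock, M for mine, stored in the last column of each row) confirms the rough 50/50 split; a minimal sketch, using the same UCI URL that appears later in Listing 2-4:

# a minimal sketch: count rock (R) versus mine (M) labels in the sonar data
import urllib.request

target_url = ("https://archive.ics.uci.edu/ml/machine-learning-"
              "databases/undocumented/connectionist-bench/sonar/sonar.all-data")

counts = {"R": 0, "M": 0}
for line in urllib.request.urlopen(target_url):
    label = line.decode().strip().split(",")[-1]   # the label is the last field
    if label in counts:
        counts[label] += 1
print(counts)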
The number of rows and columns has several impacts on how you proceed. First, the overall size gives you a rough idea of how long your training times are going to be. For a small data set like the rocks versus mines data, training time will be less than a minute, which will facilitate iterating through the process of training and tweaking. If the data set grows to 1,000 rows × 1,000 columns, the training times will grow to a fraction of a minute for penalized linear regression and a few minutes for an ensemble method. As the data set gets to several tens of thousands of rows and columns, the training times will expand to 3 or 4 hours for penalized linear regression and 12 to 24 hours for an ensemble method. The larger training times will have an impact on your development time because you’ll iterate a number of times.
The second important observation regarding row and column counts is that if the data set has many more columns than rows, penalized linear regression is more likely to give the best predictions; if it has many more rows than columns, an ensemble method is the more likely winner.
The next step in the checklist is to determine how many of the columns of data are numeric versus categorical.
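A minimal sketch of that determination, reusing the rows list from the earlier checklist sketch: try to convert each entry to a float, and treat any column with non-convertible entries as categorical.

# a minimal sketch: classify each column as numeric or categorical;
# rows is the list of lists read in the earlier sketch
for col in range(len(rows[0])):
    numeric = other = 0
    for row in rows:
        try:
            float(row[col])
            numeric += 1
        except ValueError:
            other += 1
    kind = "numeric" if other == 0 else "categorical"
    print("column", col, "looks", kind,
          "(", numeric, "numeric entries,", other, "non-numeric )")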
Statistical Summaries
After determining which attributes are categorical and which are numeric, you’ll want some descriptive statistics for the numeric variables and a count of the unique categories in each categorical attribute.
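A minimal sketch of such summaries follows; the column values and category labels are illustrative only, standing in for one numeric and one categorical column from your data.

# a minimal sketch of summary statistics; the values are illustrative only
import numpy as np

colData = [0.02, 0.037, 0.026, 0.045, 0.12]        # a numeric column
colArray = np.array(colData)
print("mean:", np.mean(colArray), " std:", np.std(colArray))
# quantile boundaries: min, 25th, 50th, 75th percentiles, and max
print("quantiles:", np.percentile(colArray, [0, 25, 50, 75, 100]))

catData = ["R", "M", "R", "M", "M"]                # a categorical column
print({v: catData.count(v) for v in set(catData)}) # unique-value counts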
Visualization of Outliers Using Quantile‐Quantile Plot
One way to study outliers in more detail is to plot the distribution of the data in question relative to some reasonable distribution to see whether the relative numbers match up. Listing 2-4 shows how to use the Python function probplot to help determine whether the data contain outliers. The resulting plot shows how the boundaries associated with empirical percentiles in the data compare to the boundaries for the same percentiles of a Gaussian distribution. If the data being analyzed come from a Gaussian distribution, the points being plotted will lie on a straight line. Figure 2-1 shows that a couple of points from column 4 of the rocks versus mines data fall very far from the line. That means that the tails of the rocks versus mines data contain more examples than the tails of a Gaussian density would.
Listing 2-4: Quantile-Quantile Plot for a Rocks Versus Mines Attribute

import pylab
import scipy.stats as stats
import urllib.request

# read the rocks versus mines data from the UCI repository
target_url = ("https://archive.ics.uci.edu/ml/machine-learning-"
              "databases/undocumented/connectionist-bench/sonar/sonar.all-data")
data = urllib.request.urlopen(target_url)

# arrange the attributes into a list of lists
xList = []
for line in data:
    # decode the byte string and split on commas
    row = line.decode().strip().split(",")
    xList.append(row)

# generate a quantile-quantile plot for column 4 (index 3)
col = 3
colData = [float(row[col]) for row in xList]

stats.probplot(colData, dist="norm", plot=pylab)
pylab.show()
Outliers may cause trouble either for model building or prediction. After you’ve trained a model on this data set, you can look at the errors your model makes and see whether the errors are correlated with these outliers. If they are, you can then take steps to correct them. For example, you can replicate the poor‐performing examples to force them to be more heavily represented. You can segregate them out and train on them as a separate class. You can also edit them out of the data if they represent an abnormality that won’t be present in the data your model will see when deployed.
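As a concrete illustration of the replication remedy, here is a minimal sketch; X, y, and isOutlier are hypothetical names standing in for your attributes, labels, and error-analysis flags, and the replication factor is an arbitrary choice.

# a minimal sketch of the replication remedy; X, y, and isOutlier are
# hypothetical, and nCopies is an arbitrary choice
X = [[0.02, 0.37], [0.45, 0.11], [0.99, 0.95]]   # attribute rows
y = ["R", "M", "M"]                              # labels
isOutlier = [False, False, True]                 # flags from error analysis

nCopies = 3
Xaug, yaug = list(X), list(y)
for xi, yi, flag in zip(X, y, isOutlier):
    if flag:
        # replicate the poorly performing example nCopies extra times
        Xaug.extend([xi] * nCopies)
        yaug.extend([yi] * nCopies)
print(len(Xaug), "training rows after replication")   # 6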