Introduction to Data Mining - 1
Classification | [Predictive] |
Clustering | [Descriptive] |
Association Rule Discovery | [Descriptive] |
Sequential Pattern Discovery | [Descriptive] |
Regression | [Predictive] |
Deviation Detection | [Predictive] |
categorical/qualitative
1) nominal:
mode
entropy
contingency correlation
χ² test (chi-square test)
2) ordinal:
median/percentiles/rank correlation/run tests/sign tests
numeric/quantitative
3) Interval:
mean/standard deviation/Pearson's correlation/t and F tests
4) Ratio:
geometric mean/harmonic mean/percent variation
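
A minimal sketch of a few of the statistics listed above, one or two per attribute type, using only the Python standard library. The sample values and variable names are made up for illustration; the entropy helper is an assumption of this sketch (Shannon entropy in bits of the empirical distribution), not something the notes prescribe.

```python
import math
import statistics
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical sample data, one list per attribute type.
colors = ["red", "blue", "red", "green"]   # nominal: only equality is meaningful
grades = [1, 2, 2, 3, 5]                   # ordinal: order is meaningful
ratios = [1.0, 2.0, 4.0]                   # ratio: true zero, ratios are meaningful

nominal_mode = statistics.mode(colors)          # most frequent value
nominal_entropy = entropy(colors)               # spread over the categories
ordinal_median = statistics.median(grades)      # needs only an ordering
geo_mean = statistics.geometric_mean(ratios)    # meaningful for ratio data
harm_mean = statistics.harmonic_mean(ratios)    # meaningful for ratio data
```

Each attribute type also admits the statistics of the types listed before it, so the mode and median are still valid for the ratio-scaled list, while the geometric mean is not meaningful for the nominal one.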
data quality problems:
1) Noise and outliers
2) missing values
why: 1. information was not collected; 2. attributes are not applicable to all objects
how: 1. eliminate data objects; 2. estimate missing values; 3. ignore missing values during analysis; 4. replace with all possible values (weighted by their probabilities)
3) duplicate data
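
A minimal sketch of three of the remedies above: dropping duplicate objects, eliminating objects with missing values, and estimating a missing value. It assumes records are plain dicts with `None` marking a missing value, and it uses the attribute mean as the estimate, which is only one of several possible choices. The helper names and sample rows are hypothetical.

```python
import statistics


def drop_duplicates(rows):
    """Keep the first occurrence of each identical record."""
    seen = set()
    unique = []
    for row in rows:
        signature = tuple(sorted(row.items()))
        if signature not in seen:
            seen.add(signature)
            unique.append(row)
    return unique


def drop_missing(rows):
    """Eliminate data objects that have any missing (None) value."""
    return [row for row in rows if None not in row.values()]


def impute_mean(rows, key):
    """Estimate missing values of `key` with the mean of the observed ones."""
    observed = [row[key] for row in rows if row[key] is not None]
    fill = statistics.mean(observed)
    return [{**row, key: fill if row[key] is None else row[key]} for row in rows]


# Hypothetical records: one exact duplicate and one missing income.
rows = [
    {"age": 20, "income": 30.0},
    {"age": 40, "income": None},
    {"age": 20, "income": 30.0},
    {"age": 60, "income": 50.0},
]

deduped = drop_duplicates(rows)
complete = drop_missing(deduped)
imputed = impute_mean(deduped, "income")
```

Eliminating objects is the simplest remedy but discards information; imputing keeps every object at the cost of introducing estimated values.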
data preprocessing:
1) aggregation
2) sampling
3) dimensionality reduction
curse of dimensionality: as dimensionality increases, data become increasingly sparse in the space they occupy, so density and distance between points become less meaningful
how: Principal Component Analysis; Singular Value Decomposition
4) feature subset selection
5) feature creation
feature extraction: domain-specific
mapping data to new space: Fourier transform/Wavelet transform
feature construction: combining features
6) discretization and binarization
7) attribute transformation
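
A minimal sketch of step 6, discretization followed by binarization. It assumes equal-width binning for the discretization and a one-hot encoding for the binarization; both are common choices, but other schemes (equal-frequency bins, supervised splits) exist. The function names and sample ages are hypothetical.

```python
def equal_width_bins(values, k):
    """Map each numeric value to a bin index in 0..k-1 using equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        # All values are identical, so everything falls into one bin.
        return [0] * len(values)
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def one_hot(labels):
    """Turn categorical labels into binary vectors, one position per category."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]


ages = [0, 5, 10, 15, 20, 30]     # hypothetical continuous attribute
bins = equal_width_bins(ages, 3)  # discretization: continuous -> categorical
binary = one_hot(bins)            # binarization: categorical -> binary attributes
```

Here `bins` is `[0, 0, 1, 1, 2, 2]`, and each row of `binary` has exactly one 1, in the position of its bin.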
Euclidean density = number of points per unit volume
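
One way to make this definition concrete is a cell-based estimate: divide the space into equal cells and divide each cell's point count by the cell volume. The sketch below assumes axis-aligned cubic cells of side `cell_size`; the function name and sample points are hypothetical.

```python
import math
from collections import Counter


def grid_density(points, cell_size):
    """Return {cell index: points per unit volume} for cubic grid cells."""
    # Each point is assigned to the cell whose index is floor(coordinate / size).
    counts = Counter(
        tuple(math.floor(x / cell_size) for x in point) for point in points
    )
    volume = cell_size ** len(points[0])
    return {cell: n / volume for cell, n in counts.items()}


# Hypothetical 2-D points.
points = [(0.1, 0.2), (0.4, 0.9), (0.8, 0.3), (1.5, 1.5), (3.2, 0.1)]

fine = grid_density(points, 1.0)    # unit cells: density equals the raw count
coarse = grid_density(points, 2.0)  # 2 x 2 cells: counts are divided by 4
```

The estimate depends on the cell size: with unit cells, cell `(0, 0)` holds three points and has density 3.0, while with 2 x 2 cells it holds four points and has density 1.0.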