Introduction to Data Mining - 1

Classification [Predictive]
Clustering  [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]

 

categorical/qualitative
1) nominal:
mode
entropy
contingency correlation
χ² test (chi-square test)
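
The nominal statistics listed above can be sketched with the standard library alone. This is a minimal illustration, not a reference implementation: the function names and the sample data are made up for the example, and the chi-square function returns only the test statistic (no p-value).

```python
from collections import Counter
from math import log2

def mode(values):
    """Most frequent value; ties go to the value seen first."""
    return Counter(values).most_common(1)[0][0]

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def chi_square(xs, ys):
    """Pearson chi-square statistic for the contingency table of xs vs ys."""
    n = len(xs)
    row, col = Counter(xs), Counter(ys)
    observed = Counter(zip(xs, ys))
    stat = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n  # count expected under independence
            stat += (observed[(r, c)] - expected) ** 2 / expected
    return stat

# Hypothetical sample data
colors = ["red", "blue", "red", "green", "red", "blue"]
print(mode(colors), entropy(colors))
print(chi_square(["a", "a", "b", "b"], ["x", "x", "y", "y"]))
```

A statistic of 0 means the observed table matches what independence would predict; larger values indicate stronger association between the two nominal attributes.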

2) Ordinal: median/percentiles/rank correlation/
run tests
sign test
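
For ordinal attributes only the order of the values matters, so the statistics work on ranks. A rough sketch, assuming a hypothetical four-level rating scale (`ORDER`) and, for the rank correlation, data without ties:

```python
from statistics import median_low

# Hypothetical ordinal scale, lowest to highest
ORDER = ["poor", "fair", "good", "excellent"]

def ordinal_median(labels, order=ORDER):
    """Median category; median_low always returns an actual category."""
    ranks = [order.index(x) for x in labels]
    return order[median_low(ranks)]

def spearman(xs, ys):
    """Spearman rank correlation; this simple formula assumes no tied values."""
    n = len(xs)
    rank_x = {v: i for i, v in enumerate(sorted(xs))}
    rank_y = {v: i for i, v in enumerate(sorted(ys))}
    d2 = sum((rank_x[x] - rank_y[y]) ** 2 for x, y in zip(xs, ys))
    return 1 - 6 * d2 / (n * (n * n - 1))

ratings = ["good", "poor", "excellent", "fair", "good"]
print(ordinal_median(ratings))
print(spearman([1, 2, 3, 4], [40, 30, 20, 10]))
```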

numeric/quantitative

3) Interval:
mean/standard deviation/Pearson's correlation/t and F tests
4) Ratio:
geometric mean/harmonic mean/percent variation
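
Ratio attributes have a true zero, so ratios of values are meaningful and the statistics above apply. A small sketch using Python's `statistics` module (`geometric_mean` needs Python 3.8+). Reading "percent variation" as the percent change between two values is an assumption here; `percent_change` is a helper invented for the example.

```python
from statistics import geometric_mean, harmonic_mean

def percent_change(old, new):
    """Percent change from old to new (assumed reading of 'percent variation')."""
    return (new - old) / old * 100

# Average yearly growth factor over two years: +100% then +700%
print(geometric_mean([2, 8]))

# Average speed over two legs of equal distance at 40 and 60 km/h
print(harmonic_mean([40, 60]))

print(percent_change(50, 60))
```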


 data quality problems:

1) Noise and outliers
2) missing values
why: 1. information was not collected; 2. an attribute is not applicable to every object
how: 1. eliminate data objects; 2. estimate missing values; 3. ignore missing values during analysis; 4. replace with all possible values (weighted by their probabilities)
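
The first three handling strategies can be illustrated on a toy table. This is only a sketch: the records, the attribute names, and the choice of mean imputation as the estimate are assumptions made for the example.

```python
from statistics import mean

# Hypothetical records; None marks a missing value
records = [
    {"age": 25, "income": 50000},
    {"age": None, "income": 62000},
    {"age": 40, "income": None},
    {"age": 31, "income": 58000},
]

def drop_incomplete(rows):
    """Strategy 1: eliminate data objects that have any missing value."""
    return [r for r in rows if None not in r.values()]

def impute_mean(rows, attr):
    """Strategy 2: estimate missing values, here with the attribute mean."""
    fill = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: fill if r[attr] is None else r[attr]} for r in rows]

def mean_ignoring_missing(rows, attr):
    """Strategy 3: ignore missing values during the analysis itself."""
    return mean(r[attr] for r in rows if r[attr] is not None)

print(len(drop_incomplete(records)))
print(impute_mean(records, "age")[1])
print(mean_ignoring_missing(records, "income"))
```

Dropping objects is the simplest option but discards the other attributes of those objects too, which matters when many rows have at least one gap.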
3) duplicate data


data preprocessing:
1) aggregation
2) sampling
3) dimensionality reduction
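curse of dimensionality: as dimensionality increases, the data become increasingly sparse in the space they occupy, so density and the distance between points become less meaningful
how: Principal Component Analysis (PCA); Singular Value Decomposition (SVD)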
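
The loss of meaningful distances can be seen in a small experiment. The sketch below assumes uniformly random points in the unit hypercube and measures the relative contrast (max distance - min distance) / min distance from a random query point; the function name, point count, and seed are arbitrary choices for the illustration. It uses `math.dist`, which needs Python 3.8+.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(max - min) / min of the distances from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

# The contrast shrinks as the number of dimensions grows
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

When the contrast approaches zero, the nearest and farthest neighbors are almost equally far away, which is why distance-based and density-based methods degrade in high dimensions.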
4) feature subset selection

5) feature creation

feature extraction: domain-specific
mapping data to new space: Fourier transform/Wavelet transform
feature construction: combining features
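
As an example of mapping data to a new space, a time series can be turned into frequency-domain features with a discrete Fourier transform. The sketch below is a naive O(n²) DFT written for clarity, not speed; the signal (a sine wave with 5 cycles over 64 samples) is made up for the illustration.

```python
import cmath
import math

def dft_magnitudes(signal):
    """Magnitude of each DFT coefficient (naive O(n^2) implementation)."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n)
    ]

n = 64
signal = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
mags = dft_magnitudes(signal)

# Only the first half of the spectrum is informative for a real-valued signal
dominant = max(range(n // 2), key=lambda k: mags[k])
print(dominant)
```

In the time domain the periodicity is spread over all 64 samples; in the frequency domain it collapses into a single large coefficient, which is a much more compact feature.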

6) discretization and binarization
7) attribute transformation
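
Discretization and binarization (item 6) can be sketched as follows. Equal-width binning and one-hot encoding are just one common choice for each step; the function names and data are illustrative, and the binning assumes the values are not all identical.

```python
def equal_width_bins(values, k):
    """Discretize numeric values into k equal-width bins (indices 0..k-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k  # assumes hi > lo
    # The maximum value is clamped into the last bin
    return [min(int((v - lo) / width), k - 1) for v in values]

def one_hot(categories):
    """Binarize a categorical attribute: one 0/1 column per distinct level."""
    levels = sorted(set(categories))
    return [[1 if c == level else 0 for level in levels] for c in categories]

ages = [18, 22, 35, 47, 60, 78]
print(equal_width_bins(ages, 3))
print(one_hot(["b", "a", "b"]))
```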


 


 



Euclidean density = number of points per unit volume
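
One way to compute this is a center-based estimate: count the points within a fixed radius of a center and divide by the volume of that ball. This is only a sketch of that variant (a grid-cell count would be another); the function name and the sample points are assumptions.

```python
import math

def euclidean_density(center, points, radius):
    """Points within `radius` of `center`, divided by the volume of that ball."""
    dim = len(center)
    inside = sum(1 for p in points if math.dist(center, p) <= radius)
    # Volume of a dim-dimensional ball of the given radius
    volume = math.pi ** (dim / 2) / math.gamma(dim / 2 + 1) * radius ** dim
    return inside / volume

points = [(0.0, 0.0), (0.5, 0.0), (2.0, 2.0)]
print(euclidean_density((0.0, 0.0), points, radius=1.0))
```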

 

posted @ 2017-03-03 22:24  陆离可