# Data Mining - Week 2, Data Preprocessing

## 2.1 Data Preprocessing

> Why data preprocessing? Real-world data are notoriously **dirty**, which is the biggest challenge in many data mining projects.

## 2.2 Data Cleaning

### 2.2.1 Missing Data

*Types of missing data*:

- Missing completely at random
- Missing conditionally at random
- Not missing at random

*Handling missing data*:

- Ignore the record
- Fill in the missing value manually/automatically

### 2.2.2 Outliers

> TODO: Regression, Clustering.

#### Local Outlier Factor (LOF)

- $\text{distance}_k(O)$: the distance between point $O$ and its $k$-th nearest neighbor
- $d(A, B)$: the distance between $A$ and $B$

*Definition 1* (the reachability distance of $A$ from $B$):

$$
\text{distance}_k(A, B) = \max \{\text{distance}_k(B), d(A, B)\}
$$

*Definition 2* (the local reachability density of $A$):

$$
\text{lrd}(A) = 1/\left(\frac{\sum_{B\in N_k(A)} \text{distance}_k(A, B)}{|N_k(A)|}\right)
$$

> $N_k(A)$: the set of $k$-nearest neighbors of point $A$

*Definition 3*:

$$
\text{LOF}_k(A) =\frac{\sum_{B \in N_k(A)} \frac{\text{lrd}(B)}{\text{lrd}(A)}}{|N_k(A)|} =\frac{\sum_{B \in N_k(A)} \text{lrd}(B)}{|N_k(A)|}/\text{lrd}(A)
$$

![Pic.2-1 Source: https://v1-www.xuetangx.com/asset-v1:TsinghuaX+80240372X+sp+type@asset+block/Data_Preprocessing-0630.pptx, Page 13, Dr. Bo Yuan](/usr/uploads/2021/09/3766661594.png)

Pic.2-1 shows that the larger a point's LOF value, the more likely the point is an outlier.

> More: https://zhuanlan.zhihu.com/p/385238291

#### Duplicate Data

$$
\boxed{\text{Create Keys}} \to \boxed{\text{Sort}} \to \boxed{\text{Merge}}
$$

## 2.3 Data Transformation

### 2.3.1 Attribute Types

- Continuous
- Discrete
- Ordinal
- Nominal
- String

> For nominal attributes, different encoding schemes lead to structurally different problems.

### 2.3.2 Sampling

- **Aggregation**: a change of scale; aggregated data are more stable and show less variability
- **Imbalanced dataset**: sampling can also be used to adjust the class distribution

## 2.4 Data Description

### 2.4.1 Normalization

*Min-max normalization*:

$$
v'=\frac{v-\min}{\max-\min}(\text{new\_max}-\text{new\_min})+\text{new\_min}
$$

### 2.4.2 Mean, Median, Mode, Variance

*Mean*:

$$
\bar{x}=\frac{1}{n}\sum^n_{i=1}x_i=\frac{1}{n}(x_1+\cdots+x_n)
$$

*Median* (for a continuous distribution with density $f$):

$$
P(X \leq M)=P(X \geq M)=\int^M_{-\infty}f(x)\,dx=\frac{1}{2}
$$

*Mode*:

- The most frequently occurring value in a list

*Variance*:

$$
Var(X)=E[(X-\mu)^2]=\int(x-\mu)^2f(x)\,dx
$$

### 2.4.3 Product-moment Correlation

$$
r_{A,B}=\frac{\sum(A-\bar{A})(B-\bar{B})}{(n-1)\sigma_{A}\sigma_{B}}=\frac{\sum(AB)-n\bar{A}\bar{B}}{(n-1)\sigma_{A}\sigma_{B}}
$$

- If $r_{A,B}>0$, $A$ and $B$ are positively correlated.
- If $r_{A,B}=0$, there is **no linear correlation** between $A$ and $B$.
- If $r_{A,B}<0$, $A$ and $B$ are negatively correlated.

### 2.4.4 Chi-square Test

$$
\chi^2=\sum \frac{(\text{Observed}-\text{Expected})^2}{\text{Expected}}
$$

## 2.5 Feature Selection

### 2.5.1 Entropy

$$
H(X)=-\sum^n_{i=1}p(x_i)\log_b p(x_i)
$$

> Example:
> $X: \{a=\text{"Non-Smoker"}; b=\text{"Smoker"}\}$
> $H(S|X=a)=-0.8 \cdot \log_2 0.8 -0.2 \cdot \log_2 0.2 =0.7219$
> $H(S|X=b)=0.2864$
> $H(S|X) = 0.6 \cdot 0.7219+0.4 \cdot 0.2864 =0.5477$
> $Gain(S, X)=H(S)-H(S|X)=0.4523$ → the information gain
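The short Python sketches below make several of the techniques in these notes concrete; they are minimal illustrations under stated assumptions, not reference implementations. First, the two ways of handling missing data from §2.2.1, using pandas (the column names and the mean-imputation rule are made-up examples):

```python
import pandas as pd

# Toy table with gaps; "age" and "income" are made-up column names.
df = pd.DataFrame({"age": [23, None, 31, 27],
                   "income": [48000, 52000, None, 61000]})

dropped = df.dropna()                            # strategy 1: ignore incomplete records
filled = df.fillna(df.mean(numeric_only=True))   # strategy 2: fill automatically (mean imputation)
print(dropped.shape, filled.shape)               # (2, 2) (4, 2)
```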
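Next, a minimal NumPy sketch of LOF following Definitions 1-3 in §2.2.2, assuming Euclidean distance and no duplicate points (so each point is its own nearest neighbor at distance 0):

```python
import numpy as np

def lof(points, k):
    """Local Outlier Factor of each row of `points`, following Defs. 1-3."""
    n = len(points)
    # Pairwise Euclidean distances d(A, B).
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    order = np.argsort(d, axis=1)           # row i: indices sorted by distance to i
    k_dist = d[np.arange(n), order[:, k]]   # distance_k(O); index 0 is O itself
    neighbors = order[:, 1:k + 1]           # N_k(A)
    # Definition 1: distance_k(A, B) = max{distance_k(B), d(A, B)}
    reach = np.maximum(k_dist[neighbors], d[np.arange(n)[:, None], neighbors])
    lrd = 1.0 / reach.mean(axis=1)          # Definition 2
    return lrd[neighbors].mean(axis=1) / lrd  # Definition 3

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (30, 2)), [[8.0, 8.0]]])  # one far-away point
print(lof(X, k=5)[-1])   # clearly > 1 for the outlier, as in Pic.2-1
```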
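The Create Keys → Sort → Merge pipeline for duplicate data (§2.2.2) can be sketched as a sorted-neighborhood pass; the key rule (lowercased name prefix plus zip code) and the field names are made-up assumptions:

```python
def dedup(records):
    """Create a key per record, sort by it, then merge records with equal keys."""
    keyed = [((r["name"][:4].lower(), r["zip"]), r) for r in records]  # Create Keys
    keyed.sort(key=lambda kr: kr[0])                                   # Sort
    merged, dups = [], []
    for key, rec in keyed:                                             # Merge
        if merged and merged[-1][0] == key:
            dups.append(rec)            # same key -> candidate duplicate
        else:
            merged.append((key, rec))
    return [r for _, r in merged], dups

records = [
    {"name": "Alice Smith", "zip": "10001"},
    {"name": "ALICE SMITH", "zip": "10001"},   # duplicate under this key
    {"name": "Bob Jones",   "zip": "94107"},
]
unique, dups = dedup(records)
print(len(unique), len(dups))   # 2 1
```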
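Min-max normalization (§2.4.1) in code, assuming max > min; the sample values are made up:

```python
import numpy as np

def min_max(v, new_min=0.0, new_max=1.0):
    """Rescale v into [new_min, new_max] per the formula in 2.4.1."""
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

v = np.array([12000.0, 73600.0, 98000.0])
print(min_max(v))   # [0.     0.716... 1.   ]
```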
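The product-moment correlation from §2.4.3, using the sample standard deviation (`ddof=1`) to match the $n-1$ in the formula; the data are made up:

```python
import numpy as np

def pearson_r(A, B):
    """r_{A,B} per 2.4.3, with sample standard deviations (ddof=1)."""
    n = len(A)
    return ((A - A.mean()) * (B - B.mean())).sum() / ((n - 1) * A.std(ddof=1) * B.std(ddof=1))

A = np.array([1.0, 2.0, 3.0, 4.0])
B = np.array([2.1, 3.9, 6.2, 8.1])
print(pearson_r(A, B))   # ~0.999: strongly positively correlated
```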
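The chi-square statistic from §2.4.4 for a contingency table, with expected counts derived from the marginals; the observed counts are made up:

```python
import numpy as np

observed = np.array([[250.0, 200.0],
                     [50.0, 1000.0]])
row = observed.sum(axis=1, keepdims=True)   # row totals
col = observed.sum(axis=0, keepdims=True)   # column totals
expected = row @ col / observed.sum()       # E_ij = row_i * col_j / N
chi2 = ((observed - expected) ** 2 / expected).sum()
print(chi2)   # ~507.9; a large value suggests the two attributes are correlated
```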
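Finally, the entropy example from §2.5.1 reproduced in code; the class splits (0.8/0.2 and 0.95/0.05) and the 50/50 prior for $S$ are the values implied by the quoted numbers:

```python
import numpy as np

def entropy(p):
    """H = -sum p_i log2 p_i, skipping zero probabilities."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

H_S = entropy([0.5, 0.5])        # assumed 50/50 prior -> H(S) = 1.0
H_Sa = entropy([0.8, 0.2])       # H(S|X=a) ~ 0.7219
H_Sb = entropy([0.95, 0.05])     # H(S|X=b) ~ 0.2864
H_SX = 0.6 * H_Sa + 0.4 * H_Sb   # H(S|X)  ~ 0.5477
print(H_S - H_SX)                # Gain(S, X) ~ 0.4523
```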