# Data Mining - Week 5, Clustering

> Just realized I am a week ahead of schedule and skipped the NN chapter; since I have studied it before, it should not be a problem.

- Partitioning Methods
  - K-means
  - Sequential Leader
- Model-Based Methods
- Density-Based Methods
- Hierarchical Methods

## 5.1 Clustering

### 5.1.1 Definition

> Ref: https://developers.google.com/machine-learning/clustering/overview

*Definition 5.1*: Grouping **unlabeled examples** is called **Clustering**.

### 5.1.2 Similarity measure

Before you can group similar examples, you first need to find similar examples. You can measure similarity between examples by combining the examples' **feature data** into a metric, called a **similarity measure**.

### 5.1.3 Usage

- market segmentation
- social network analysis
- search result grouping
- medical imaging
- image segmentation
- anomaly detection

## 5.2 Clustering Algorithm

### 5.2.1 Evaluation

$$
J_e=\sum^c_{i=1} \sum_{x \in D_i}\Vert x-m_i \Vert^2 , \quad m_i=\frac{1}{n_i}\sum_{x \in D_i}x
$$

> This sum-of-squared-error criterion can be problematic on non-spherical data.

### 5.2.2 Silhouette coefficient

$$
s(i)=\frac{b(i)-a(i)}{\max\{b(i),a(i)\}}
\text{ or }
s(i)=\left\{
\begin{aligned}
&1-\frac{a(i)}{b(i)} &, a(i)<b(i) \\
&0 &, a(i)=b(i) \\
&\frac{b(i)}{a(i)}-1 &, a(i)>b(i)
\end{aligned}
\right.
$$

- $a(i)$: average dissimilarity of $i$ with all other points in the same cluster
- $b(i)$: the lowest average dissimilarity of $i$ to the points of any other cluster

> $s(i) \in [-1,1]$
>
> - When $a(i) \ll b(i)$, $s(i) \to 1$: clusters are well apart from each other and clearly distinguished.
> - When $a(i) \simeq b(i)$, $s(i) = 0$: clusters are indifferent, i.e. the distance between clusters is not significant.
> - When $a(i) \gg b(i)$, $s(i) \to -1$: points have been assigned to the wrong clusters.

### 5.2.3 K-Means

> Ref: https://developers.google.com/machine-learning/clustering/algorithm/advantages-disadvantages

#### Algorithm Steps

Given an initial set of $k$ means $a_1, a_2, \ldots, a_k$ (the centroids):

1. **Assignment step**: assign each example to the cluster whose centroid is closest; examples nearest the same centroid belong to the same group.
2. **Update step**: recompute each centroid as the mean of the examples assigned to it.
3. Repeat until the assignments no longer change.

A minimal NumPy sketch of this loop is given after section 5.2.4 below.

#### Advantages

- Relatively simple to implement.
- Scales to large data sets.
- Guarantees convergence.
- Can warm-start the positions of centroids.
- Easily adapts to new examples.
- Generalizes to clusters of different shapes and sizes, such as elliptical clusters.

#### Disadvantages

- Choosing $k$ manually.
- Being dependent on initial values.
- Clustering data of varying sizes and density.

### 5.2.4 Sequential Leader Clustering

#### Algorithm Steps

For every new data point:

1. Compute the distance between the new data point and every cluster's center.
2. If the minimum distance is smaller than the chosen threshold, assign the new data point to the corresponding cluster and re-compute the cluster center.
3. Otherwise, create a new cluster with the new data point as its center.
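To make the assignment/update loop of section 5.2.3 concrete, here is a minimal NumPy sketch of k-means that also reports the $J_e$ criterion from section 5.2.1. It is only an illustration under my own naming (`kmeans`, `X`, `k`); the course notes do not prescribe an implementation.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means sketch: X has shape (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct examples at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: index of the closest centroid for each example.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned examples.
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments have stabilized
        centroids = new_centroids
    # Sum-of-squared-error criterion J_e from section 5.2.1.
    J_e = sum(np.sum((X[labels == i] - centroids[i]) ** 2) for i in range(k))
    return labels, centroids, J_e
```

If scikit-learn is available, `sklearn.cluster.KMeans` and `sklearn.metrics.silhouette_score` provide the same clustering plus the silhouette coefficient of section 5.2.2 without hand-rolled code.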
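The three steps of Sequential Leader Clustering listed in section 5.2.4 translate almost line for line into code. This is a small sketch under assumed names (`sequential_leader`, `threshold`), not an implementation from the notes.

```python
import numpy as np

def sequential_leader(stream, threshold):
    """One-pass Sequential Leader clustering over an iterable of points."""
    centers, members = [], []  # cluster centers and their assigned points
    for x in stream:
        x = np.asarray(x, dtype=float)
        if centers:
            # Step 1: distance from the new point to every cluster center.
            dists = [np.linalg.norm(x - c) for c in centers]
            j = int(np.argmin(dists))
            if dists[j] < threshold:
                # Step 2: assign to the nearest cluster and re-compute its center.
                members[j].append(x)
                centers[j] = np.mean(members[j], axis=0)
                continue
        # Step 3: otherwise start a new cluster with this point as its center.
        centers.append(x.copy())
        members.append([x])
    return centers, members
```

Note that the result depends on the order in which the points arrive and on the chosen threshold, which is the usual trade-off of this one-pass scheme.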
## 5.3 EM Method

### 5.3.1 Gaussian Mixture Model

> Ref: https://towardsdatascience.com/gaussian-mixture-models-explained-6986aaf5a95

#### Definition

A **Gaussian Mixture** is a function that is comprised of **several** Gaussians, each identified by $k \in \{1, \ldots, K\}$, where $K$ is the number of clusters of our dataset. Each mixture component has:

- $\mu$: mean, defines its **center**.
- $\sigma$: covariance, defines its **width**.
- $\pi$: mixing probability, defines how **big or small** the Gaussian function will be.

The mixing coefficients are themselves probabilities:

$$
\sum^K_{k=1} \pi_k = 1
$$

In general, the $D$-dimensional Gaussian density function is:

$$
\mathcal{N}(\boldsymbol{x}|\mu,\sigma)=\frac{1}{(2\pi)^{D/2}|\sigma|^{1/2}} \exp\left(-\frac{1}{2}(\boldsymbol{x}-\mu)^T \sigma^{-1}(\boldsymbol{x}-\mu)\right)
$$

### 5.3.2 Expectation-maximization algorithm

#### Algorithm Steps

Define $\theta = \{\pi, \mu, \sigma\}$, the parameters of our model.

1. Initialize $\theta$ (e.g. use a k-means result as a good starting point for this algorithm).
2. **Expectation step**: evaluate the responsibilities, i.e. the posterior probability that point $\boldsymbol{x}_n$ was generated by component $k$, using the current parameters:
   $$
   \gamma(z_{nk})=\frac{\pi_k \, \mathcal{N}(\boldsymbol{x}_n|\mu_k,\sigma_k)}{\sum^K_{j=1} \pi_j \, \mathcal{N}(\boldsymbol{x}_n|\mu_j,\sigma_j)}
   $$
3. **Maximization step**: re-estimate the parameters from the responsibilities:
   $$
   \pi_k=\frac{1}{N}\sum^N_{n=1}\gamma(z_{nk}), \quad
   \mu_k=\frac{\sum_n \gamma(z_{nk})\,\boldsymbol{x}_n}{\sum_n \gamma(z_{nk})}, \quad
   \sigma_k=\frac{\sum_n \gamma(z_{nk})\,(\boldsymbol{x}_n-\mu_k)(\boldsymbol{x}_n-\mu_k)^T}{\sum_n \gamma(z_{nk})}
   $$
4. Evaluate the log-likelihood and check for convergence; if it has not converged, return to step 2.
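As a companion to section 5.3.2, here is a compact NumPy/SciPy sketch of the E- and M-steps for a Gaussian mixture. The names (`em_gmm`, `gamma`, the random initialization) are my own choices, and the convergence check is simplified to a fixed iteration count; this is a sketch, not the notes' reference implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """Fit a K-component Gaussian mixture to X (n_samples, n_features) with EM."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # 1. Initialize theta = {pi, mu, sigma} (a k-means result also works well here).
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(n, size=K, replace=False)]
    sigma = np.stack([np.cov(X, rowvar=False) + 1e-6 * np.eye(d)] * K)

    for _ in range(n_iters):
        # 2. E-step: responsibilities gamma[n, k] proportional to pi_k * N(x_n | mu_k, sigma_k).
        dens = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=sigma[k])
            for k in range(K)
        ])
        gamma = dens / dens.sum(axis=1, keepdims=True)

        # 3. M-step: re-estimate pi, mu, sigma from the responsibilities.
        Nk = gamma.sum(axis=0)
        pi = Nk / n
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, sigma, gamma
```

For a library version, `sklearn.mixture.GaussianMixture` implements the same procedure together with a log-likelihood convergence check.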