# Data Mining - Week 5, Clustering

> Just realized I am a week ahead of schedule and skipped the NN chapter; since I have studied it before, it should not be a problem.

- Partitioning Methods
  - K-means
  - Sequential Leader
- Model-Based Methods
- Density-Based Methods
- Hierarchical Methods

## 5.1 Clustering

### 5.1.1 Definition

> Ref: https://developers.google.com/machine-learning/clustering/overview

*Definition 5.1*: Grouping **unlabeled examples** is called **Clustering**.

### 5.1.2 Similarity measure

Before you can group similar examples, you first need to find similar examples. You can measure similarity between examples by combining the examples' **feature data** into a metric, called a **similarity measure**.

### 5.1.3 Usage

- market segmentation
- social network analysis
- search result grouping
- medical imaging
- image segmentation
- anomaly detection

## 5.2 Clustering Algorithm

### 5.2.1 Evaluation

$$
J_e=\sum^c_{i=1} \sum_{x \in D_i}\Vert x-m_i \Vert^2 , \quad m_i=\frac{1}{n_i}\sum_{x \in D_i}x
$$

> This sum-of-squared-error criterion can be problematic on non-spherical data.

### 5.2.2 Silhouette coefficient

$$
s(i)=\frac{b(i)-a(i)}{\max\{b(i),a(i)\}}
\text{ or }
s(i)=\left\{
\begin{aligned}
&1-\frac{a(i)}{b(i)} &, a(i)<b(i) \\
&0 &, a(i)=b(i) \\
&\frac{b(i)}{a(i)}-1 &, a(i)>b(i)
\end{aligned}
\right.
$$

- $a(i)$: average dissimilarity of $i$ with all other points in the same cluster
- $b(i)$: the lowest average dissimilarity of $i$ to the points of any other cluster

> $s(i) \in [-1,1]$
>
> - When $a(i) \ll b(i)$, $s(i) \to 1$: clusters are well apart from each other and clearly distinguished.
> - When $a(i) \simeq b(i)$, $s(i) = 0$: clusters are indifferent, i.e. the distance between clusters is not significant.
> - When $a(i) \gg b(i)$, $s(i) \to -1$: points have been assigned to the wrong clusters.

### 5.2.3 K-Means

> Ref: https://developers.google.com/machine-learning/clustering/algorithm/advantages-disadvantages

#### Algorithm Steps

Given an initial set of $k$ means $a_1, a_2, \ldots, a_k$ (the centroids):

1. **Assignment step**: assign each example to the cluster whose centroid is closest; examples nearest the same centroid belong to the same group.
2. **Update step**: recompute each centroid as the mean of the examples assigned to it.
3. Repeat until the assignments no longer change.

A minimal NumPy sketch of this loop is given after section 5.2.4 below.

#### Advantages

- Relatively simple to implement.
- Scales to large data sets.
- Guarantees convergence.
- Can warm-start the positions of centroids.
- Easily adapts to new examples.
- Generalizes to clusters of different shapes and sizes, such as elliptical clusters.

#### Disadvantages

- Choosing $k$ manually.
- Being dependent on initial values.
- Clustering data of varying sizes and density.

### 5.2.4 Sequential Leader Clustering

#### Algorithm Steps

For every new data point:

1. Compute the distance between the new data point and every cluster's center.
2. If the minimum distance is smaller than the chosen threshold, assign the new data point to the corresponding cluster and re-compute the cluster center.
3. Otherwise, create a new cluster with the new data point as its center.
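To make the assignment/update loop of section 5.2.3 concrete, here is a minimal NumPy sketch of k-means that also reports the $J_e$ criterion from section 5.2.1. It is only an illustration under my own naming (`kmeans`, `X`, `k`); the course notes do not prescribe an implementation.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means sketch: X has shape (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct examples at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: index of the closest centroid for each example.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned examples.
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments have stabilized
        centroids = new_centroids
    # Sum-of-squared-error criterion J_e from section 5.2.1.
    J_e = sum(np.sum((X[labels == i] - centroids[i]) ** 2) for i in range(k))
    return labels, centroids, J_e
```

If scikit-learn is available, `sklearn.cluster.KMeans` and `sklearn.metrics.silhouette_score` provide the same clustering plus the silhouette coefficient of section 5.2.2 without hand-rolled code.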
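The three steps of Sequential Leader Clustering listed in section 5.2.4 translate almost line for line into code. This is a small sketch under assumed names (`sequential_leader`, `threshold`), not an implementation from the notes.

```python
import numpy as np

def sequential_leader(stream, threshold):
    """One-pass Sequential Leader clustering over an iterable of points."""
    centers, members = [], []  # cluster centers and their assigned points
    for x in stream:
        x = np.asarray(x, dtype=float)
        if centers:
            # Step 1: distance from the new point to every cluster center.
            dists = [np.linalg.norm(x - c) for c in centers]
            j = int(np.argmin(dists))
            if dists[j] < threshold:
                # Step 2: assign to the nearest cluster and re-compute its center.
                members[j].append(x)
                centers[j] = np.mean(members[j], axis=0)
                continue
        # Step 3: otherwise start a new cluster with this point as its center.
        centers.append(x.copy())
        members.append([x])
    return centers, members
```

Note that the result depends on the order in which the points arrive and on the chosen threshold, which is the usual trade-off of this one-pass scheme.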
## 5.3 EM Method

### 5.3.1 Gaussian Mixture Model

> Ref: https://towardsdatascience.com/gaussian-mixture-models-explained-6986aaf5a95

#### Definition

A **Gaussian Mixture** is a function that is comprised of **several** Gaussians, each identified by $k \in \{1, \ldots, K\}$, where $K$ is the number of clusters of our dataset. Each mixture component has:

- $\mu$: mean, defines its **center**.
- $\sigma$: covariance, defines its **width**.
- $\pi$: mixing probability, defines how **big or small** the Gaussian function will be.

The mixing coefficients are themselves probabilities:

$$
\sum^K_{k=1} \pi_k = 1
$$

In general, the $D$-dimensional Gaussian density function is:

$$
\mathcal{N}(\boldsymbol{x}|\mu,\sigma)=\frac{1}{(2\pi)^{D/2}|\sigma|^{1/2}} \exp\left(-\frac{1}{2}(\boldsymbol{x}-\mu)^T \sigma^{-1}(\boldsymbol{x}-\mu)\right)
$$

### 5.3.2 Expectation-maximization algorithm

#### Algorithm Steps

Define $\theta = \{\pi, \mu, \sigma\}$, the parameters of our model.

1. Initialize $\theta$ (e.g. use a k-means result as a good starting point for this algorithm).
2. **Expectation step**: evaluate the responsibilities, i.e. the posterior probability that point $\boldsymbol{x}_n$ was generated by component $k$, using the current parameters:
   $$
   \gamma(z_{nk})=\frac{\pi_k \, \mathcal{N}(\boldsymbol{x}_n|\mu_k,\sigma_k)}{\sum^K_{j=1} \pi_j \, \mathcal{N}(\boldsymbol{x}_n|\mu_j,\sigma_j)}
   $$
3. **Maximization step**: re-estimate the parameters from the responsibilities:
   $$
   \pi_k=\frac{1}{N}\sum^N_{n=1}\gamma(z_{nk}), \quad
   \mu_k=\frac{\sum_n \gamma(z_{nk})\,\boldsymbol{x}_n}{\sum_n \gamma(z_{nk})}, \quad
   \sigma_k=\frac{\sum_n \gamma(z_{nk})\,(\boldsymbol{x}_n-\mu_k)(\boldsymbol{x}_n-\mu_k)^T}{\sum_n \gamma(z_{nk})}
   $$
4. Evaluate the log-likelihood and check for convergence; if it has not converged, return to step 2.
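As a companion to section 5.3.2, here is a compact NumPy/SciPy sketch of the E- and M-steps for a Gaussian mixture. The names (`em_gmm`, `gamma`, the random initialization) are my own choices, and the convergence check is simplified to a fixed iteration count; this is a sketch, not the notes' reference implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """Fit a K-component Gaussian mixture to X (n_samples, n_features) with EM."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # 1. Initialize theta = {pi, mu, sigma} (a k-means result also works well here).
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(n, size=K, replace=False)]
    sigma = np.stack([np.cov(X, rowvar=False) + 1e-6 * np.eye(d)] * K)

    for _ in range(n_iters):
        # 2. E-step: responsibilities gamma[n, k] proportional to pi_k * N(x_n | mu_k, sigma_k).
        dens = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=sigma[k])
            for k in range(K)
        ])
        gamma = dens / dens.sum(axis=1, keepdims=True)

        # 3. M-step: re-estimate pi, mu, sigma from the responsibilities.
        Nk = gamma.sum(axis=0)
        pi = Nk / n
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, sigma, gamma
```

For a library version, `sklearn.mixture.GaussianMixture` implements the same procedure together with a log-likelihood convergence check.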