What is K-means clustering?
The K-means clustering algorithm was developed by MacQueen. It aims to partition a set of objects, based on their attributes, into k clusters, where k is a predefined constant. Each cluster is identified by a reference point, its centroid, and the centroid of a cluster is computed as the mean of all the objects assigned to that cluster. Below is a simple implementation of Lloyd's algorithm for performing k-means clustering in Python:
import numpy as np
import random

def cluster_points(X, mu):
    # Assignment step: attach each point to its nearest centroid
    clusters = {}
    for x in X:
        bestmukey = min([(i[0], np.linalg.norm(x - mu[i[0]]))
                         for i in enumerate(mu)], key=lambda t: t[1])[0]
        try:
            clusters[bestmukey].append(x)
        except KeyError:
            clusters[bestmukey] = [x]
    return clusters

def reevaluate_centers(mu, clusters):
    # Update step: each new centroid is the mean of its cluster's points
    newmu = []
    keys = sorted(clusters.keys())
    for k in keys:
        newmu.append(np.mean(clusters[k], axis=0))
    return newmu

def has_converged(mu, oldmu):
    # Converged when the set of centroids no longer changes
    return set([tuple(a) for a in mu]) == set([tuple(a) for a in oldmu])

def find_centers(X, K):
    # Initialize to K random centers (list() so random.sample accepts arrays)
    oldmu = random.sample(list(X), K)
    mu = random.sample(list(X), K)
    while not has_converged(mu, oldmu):
        oldmu = mu
        # Assign all points in X to clusters
        clusters = cluster_points(X, mu)
        # Reevaluate centers
        mu = reevaluate_centers(oldmu, clusters)
    return (mu, clusters)

def init_board_gauss(N, k):
    # Generate N points drawn from k random 2-D Gaussians
    n = float(N) / k
    X = []
    for i in range(k):
        c = (random.uniform(-1, 1), random.uniform(-1, 1))
        s = random.uniform(0.05, 0.5)
        x = []
        while len(x) < n:
            a, b = np.array([np.random.normal(c[0], s),
                             np.random.normal(c[1], s)])
            # Continue drawing points from the distribution in the range [-1, 1]
            if abs(a) < 1 and abs(b) < 1:
                x.append([a, b])
        X.extend(x)
    X = np.array(X)[:N]
    return X
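To see the assignment and update steps in action, here is a minimal, self-contained sketch of the same Lloyd iteration on a toy 2-D dataset (the six points and the two starting centroids are made up for illustration; a real run would use random initialization as above):

```python
import numpy as np

# Toy data: two well-separated blobs of three points each
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

# Start from two points known to lie in different blobs
mu = [X[0], X[3]]

for _ in range(10):  # a few Lloyd iterations are plenty here
    # Assignment step: index of the nearest centroid for every point
    labels = np.array([np.argmin([np.linalg.norm(x - m) for m in mu])
                       for x in X])
    # Update step: each centroid becomes the mean of its assigned points
    mu = [X[labels == k].mean(axis=0) for k in range(2)]

print(labels.tolist())   # -> [0, 0, 0, 1, 1, 1]
print(np.round(mu[0], 2).tolist())  # centroid near the first blob
print(np.round(mu[1], 2).tolist())  # centroid near the second blob
```

With well-separated blobs the labels stabilize after the first iteration, which is exactly the condition `has_converged` tests for in the full implementation.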
Application of Clustering using K-Means
There are many clustering approaches that can be used to analyse web content, such as text-based clustering, link-based clustering and graph-based clustering. Text-based clustering is based on the content of web documents. The basic idea is that if two documents contain many common words, then the two documents are likely to be very similar. Text-based approaches can be further classified by the clustering method used, such as partitional, hierarchical or graph-based.
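The "many common words" idea can be sketched with a simple word-overlap (Jaccard) similarity; the three toy documents below are made up for illustration:

```python
# Hypothetical toy documents
doc_a = "web clustering groups similar web documents"
doc_b = "clustering groups similar documents by content"
doc_c = "the weather is sunny today"

def jaccard(d1, d2):
    # similarity = |shared distinct words| / |all distinct words|
    w1, w2 = set(d1.split()), set(d2.split())
    return len(w1 & w2) / len(w1 | w2)

print(round(jaccard(doc_a, doc_b), 2))  # many shared words -> high similarity
print(round(jaccard(doc_a, doc_c), 2))  # no shared words -> 0.0
```

Real text-based systems typically go further (stemming, stop-word removal, TF-IDF weighting), but the underlying notion of similarity is the same word overlap.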
Partitional clustering
The partitional document clustering approach attempts a flat partitioning of a set of documents into a predefined number of disjoint clusters. The most common partitional clustering algorithm is k-means, which relies on the idea that a centroid can be a good representative of a cluster.
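In the document setting, a centroid is just the mean of the term vectors in a cluster. A tiny sketch, assuming made-up term-count vectors over a three-word vocabulary:

```python
import numpy as np

# Hypothetical term-count vectors for three documents
# over the vocabulary ["web", "cluster", "graph"]
docs = np.array([[2, 1, 0],
                 [1, 2, 0],
                 [0, 0, 3]])

# The centroid of a cluster is the mean vector of its documents
centroid = docs[:2].mean(axis=0)   # centroid of the first two documents
print(centroid.tolist())           # -> [1.5, 1.5, 0.0]
```

The first two documents share vocabulary, so their centroid summarizes them well, while the third document would pull a centroid toward an unrelated region of the space.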
Graph based clustering
This clustering approach relies on representing the documents as a weighted graph. Each node represents a document, and there is an edge between two nodes if the similarity between the corresponding documents exceeds a threshold, weighted by that similarity. The basic idea is that the weights of edges within the same cluster will be greater than the weights of edges across clusters. The advantages of these approaches are that they can capture the structure of the data and that they work effectively in high-dimensional spaces. The disadvantage is that the graph must fit in memory.
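Building such a graph can be sketched by thresholding a pairwise similarity matrix; the similarity values and the threshold below are made up for illustration:

```python
import numpy as np

# Hypothetical pairwise document similarities (symmetric matrix):
# documents 0 and 1 are similar; document 2 is unrelated
sim = np.array([[1.0, 0.8, 0.1],
                [0.8, 1.0, 0.2],
                [0.1, 0.2, 1.0]])

# Keep an edge only when similarity exceeds a chosen threshold
threshold = 0.5
edges = [(i, j, float(sim[i, j]))
         for i in range(3) for j in range(i + 1, 3)
         if sim[i, j] > threshold]
print(edges)   # only the heavy within-cluster edge (0, 1) survives
```

Edges within a cluster carry the large weights, so cutting the light cross-cluster edges separates the clusters, which is the principle graph-based methods exploit.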
Hi, Kevin, that's really a great piece of work you have done. I just think it would be better if you could add more comments to the code you wrote; that would make it easier for people like me who are not familiar with programming to follow your idea.
Hi, He Huang, I'm glad you like my work. The main idea of the Python code is to use NumPy for the N-dimensional array calculations and for the graphical presentation of K-Means.
NumPy's main object is the homogeneous multidimensional array. It is a table of elements (usually numbers), all of the same type, indexed by a tuple of positive integers. In NumPy, dimensions are called axes. The number of axes is the rank.
For example, the coordinates of a point in 3D space, [1, 2, 1], form an array of rank 1, because it has one axis, and that axis has a length of 3. An array such as [[1, 0, 0], [0, 1, 2]] has rank 2 (it is 2-dimensional): the first dimension (axis) has a length of 2, and the second dimension has a length of 3.
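These two cases can be checked directly in NumPy, where `ndim` gives the rank and `shape` gives the length of each axis:

```python
import numpy as np

a = np.array([1, 2, 1])          # rank 1: a single axis of length 3
b = np.array([[1, 0, 0],
              [0, 1, 2]])        # rank 2: axes of length 2 and 3

print(a.ndim, a.shape)   # -> 1 (3,)
print(b.ndim, b.shape)   # -> 2 (2, 3)
```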
K-means is a classic algorithm for clustering. Your blog includes implementation details of the algorithm. It's really worth reading.
This blog really helps me. I am also doing something with NumPy now. Thanks for your sharing.