
Characteristics of the k-Means Clustering Algorithm

1. Pros: easy to implement.

2. Cons: may converge to a local minimum; converges slowly on large data sets.

3. Works with: numeric values.

General Workflow of k-Means Clustering

1. Collect the data: any method.
2. Prepare the data: numeric values are required to compute distances; nominal values can be mapped to binary values and then used in distance calculations.
3. Analyze the data: any method.
4. Train the algorithm: not applicable here; as unsupervised learning, k-means has no training step.
5. Test the algorithm: apply the clustering algorithm and inspect the results. A quantitative error measure such as the sum of squared errors (introduced later) can be used to evaluate the outcome.
6. Use the algorithm: anything you wish. Typically, a cluster's centroid can be taken as representative of all the data in that cluster when making decisions.
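Step 2 above mentions mapping nominal values to binary values so they can enter a distance calculation. A minimal sketch of such a one-hot mapping (the function name and sample values here are my own, not from the book):

```python
# Map a nominal feature to binary (one-hot) columns so it can be used
# in distance calculations. Hypothetical helper for illustration.
def nominal_to_binary(values):
    categories = sorted(set(values))
    # each value becomes a 0/1 vector with a single 1 at its category's index
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]

# categories sorted alphabetically: ['blue', 'red']
print(nominal_to_binary(['red', 'blue', 'red']))
# → [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
```

The Euclidean distance between two such binary vectors then behaves sensibly: identical categories give distance 0, different categories give a fixed positive distance.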

Before implementing k-means itself, we need a few helper functions: one to load a data file, one to compute the distance between two vectors, and one to generate random initial centroids. The support functions for k-means clustering are:

from numpy import *

# Load a tab-delimited text file into a list of float lists
def loadDataSet(fileName):
    dataSet = []
    fr = open(fileName)
    for line in fr.readlines():
        curLine = line.strip().split('\t')
        fltLine = list(map(float, curLine))  # cast every field to float
        dataSet.append(fltLine)
    return dataSet

# Euclidean distance between two vectors
def distEclud(vecA, vecB):
    return sqrt(sum(power(vecA - vecB, 2)))

# Build k random centroids, each lying within the bounds of the data
def randCent(dataMat, k):
    n = shape(dataMat)[1]
    centroids = mat(zeros((k, n)))
    for j in range(n):
        minJ = min(dataMat[:, j])
        rangeJ = float(max(dataMat[:, j]) - minJ)
        centroids[:, j] = mat(minJ + rangeJ * random.rand(k, 1))
    return centroids

At the Python prompt, enter:

[Screenshot: Python session exercising the support functions]
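If the book's `testSet.txt` file isn't at hand, the support functions can be exercised on a small inline matrix instead (the data values below are made up for illustration):

```python
from numpy import *

# Euclidean distance between two vectors
def distEclud(vecA, vecB):
    return sqrt(sum(power(vecA - vecB, 2)))

# k random centroids within the data's bounding box
def randCent(dataMat, k):
    n = shape(dataMat)[1]
    centroids = mat(zeros((k, n)))
    for j in range(n):
        minJ = min(dataMat[:, j])
        rangeJ = float(max(dataMat[:, j]) - minJ)
        centroids[:, j] = mat(minJ + rangeJ * random.rand(k, 1))
    return centroids

# four made-up 2-D points
dataMat = mat([[1.0, 1.0], [2.0, 2.0], [8.0, 8.0], [9.0, 9.0]])
print(distEclud(dataMat[0], dataMat[1]))  # distance between the first two points
cents = randCent(dataMat, 2)
print(cents)  # two random centroids, each inside the data's bounds
```

The random centroids differ on every run, but each coordinate always falls between the column's minimum and maximum.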

The k-Means Clustering Algorithm

def kMeans(dataMat, k, distMeas=distEclud, createCent=randCent):
    m = shape(dataMat)[0]
    # clusterAssment: column 0 = assigned cluster index, column 1 = squared distance
    clusterAssment = mat(zeros((m, 2)))
    centroids = createCent(dataMat, k)
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        for i in range(m):
            minDist = inf
            minIndex = -1
            # find the nearest centroid for point i
            for j in range(k):
                distJI = distMeas(centroids[j, :], dataMat[i, :])
                if distJI < minDist:
                    minDist = distJI
                    minIndex = j
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist**2
        print(centroids)
        # recompute each centroid as the mean of the points assigned to it
        for cent in range(k):
            ptsInClust = dataMat[nonzero(clusterAssment[:, 0].A == cent)[0]]
            centroids[cent, :] = mean(ptsInClust, axis=0)
    return centroids, clusterAssment

At the Python prompt, enter:

[Screenshot: Python session running kMeans]
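To try kMeans without the book's data file, the snippet below restates the function compactly and swaps in a deterministic initializer (`firstKCent`, my own substitute for `randCent`, chosen so the demo is reproducible) before running it on two obvious synthetic blobs:

```python
from numpy import *

def distEclud(vecA, vecB):
    return sqrt(sum(power(vecA - vecB, 2)))

# deterministic demo initializer: take the first k points as centroids
# (substitute for randCent so the output is reproducible)
def firstKCent(dataMat, k):
    return dataMat[:k, :].copy()

def kMeans(dataMat, k, distMeas=distEclud, createCent=firstKCent):
    m = shape(dataMat)[0]
    clusterAssment = mat(zeros((m, 2)))
    centroids = createCent(dataMat, k)
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        for i in range(m):
            minDist = inf
            minIndex = -1
            for j in range(k):
                distJI = distMeas(centroids[j, :], dataMat[i, :])
                if distJI < minDist:
                    minDist = distJI
                    minIndex = j
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist ** 2
        for cent in range(k):
            ptsInClust = dataMat[nonzero(clusterAssment[:, 0].A == cent)[0]]
            centroids[cent, :] = mean(ptsInClust, axis=0)
    return centroids, clusterAssment

# two well-separated synthetic blobs
dataMat = mat([[1.0, 1.0], [1.5, 1.8], [8.0, 8.0], [8.5, 8.2]])
centroids, clustAssing = kMeans(dataMat, 2)
print(centroids)
# converges to the two blob means: [[1.25, 1.4], [8.25, 8.1]]
```

With random initialization the centroid order (and occasionally the final result, due to local minima) can vary between runs, which is exactly the weakness listed under "Cons" above.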

The Bisecting k-Means Clustering Algorithm

# Bisecting k-means clustering algorithm
def biKMeans(dataMat, k, distMeas=distEclud):
    m = shape(dataMat)[0]
    clusterAssment = mat(zeros((m, 2)))
    # start with a single cluster: the centroid of the entire data set
    centroid0 = mean(dataMat, axis=0).tolist()[0]
    centList = [centroid0]
    for j in range(m):
        clusterAssment[j, 1] = distMeas(mat(centroid0), dataMat[j, :]) ** 2
    while len(centList) < k:
        lowestSSE = inf  # reset before evaluating each candidate split
        for i in range(len(centList)):
            ptsInCurrCluster = dataMat[nonzero(clusterAssment[:, 0].A == i)[0], :]
            centroidMat, splitClustAss = kMeans(ptsInCurrCluster, 2, distMeas)
            sseSplit = sum(splitClustAss[:, 1])
            sseNotSplit = sum(clusterAssment[nonzero(clusterAssment[:, 0].A != i)[0], 1])
            print("sseSplit, and notSplit: ", sseSplit, sseNotSplit)
            # keep the split that yields the lowest total SSE
            if (sseSplit + sseNotSplit) < lowestSSE:
                bestCentToSplit = i
                bestNewCents = centroidMat
                bestClustAss = splitClustAss.copy()
                lowestSSE = sseSplit + sseNotSplit
        # relabel the split clusters: 1 -> new index, 0 -> the index that was split
        bestClustAss[nonzero(bestClustAss[:, 0].A == 1)[0], 0] = len(centList)
        bestClustAss[nonzero(bestClustAss[:, 0].A == 0)[0], 0] = bestCentToSplit
        print('the bestCentToSplit is: ', bestCentToSplit)
        print('the len of bestClustAss is: ', len(bestClustAss))
        centList[bestCentToSplit] = bestNewCents[0, :].tolist()[0]
        centList.append(bestNewCents[1, :].tolist()[0])
        clusterAssment[nonzero(clusterAssment[:, 0].A == bestCentToSplit)[0], :] = bestClustAss
    return mat(centList), clusterAssment
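A self-contained run of biKMeans on synthetic data is sketched below. kMeans is restated compactly with a deterministic first-k-points initializer (my own substitute for `randCent`, so the result is reproducible; the data values are made up):

```python
from numpy import *

def distEclud(vecA, vecB):
    return sqrt(sum(power(vecA - vecB, 2)))

# deterministic demo initializer (substitute for randCent)
def firstKCent(dataMat, k):
    return dataMat[:k, :].copy()

def kMeans(dataMat, k, distMeas=distEclud, createCent=firstKCent):
    m = shape(dataMat)[0]
    clusterAssment = mat(zeros((m, 2)))
    centroids = createCent(dataMat, k)
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        for i in range(m):
            minDist = inf
            minIndex = -1
            for j in range(k):
                distJI = distMeas(centroids[j, :], dataMat[i, :])
                if distJI < minDist:
                    minDist = distJI
                    minIndex = j
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist ** 2
        for cent in range(k):
            ptsInClust = dataMat[nonzero(clusterAssment[:, 0].A == cent)[0]]
            centroids[cent, :] = mean(ptsInClust, axis=0)
    return centroids, clusterAssment

def biKMeans(dataMat, k, distMeas=distEclud):
    m = shape(dataMat)[0]
    clusterAssment = mat(zeros((m, 2)))
    centroid0 = mean(dataMat, axis=0).tolist()[0]
    centList = [centroid0]
    for j in range(m):
        clusterAssment[j, 1] = distMeas(mat(centroid0), dataMat[j, :]) ** 2
    while len(centList) < k:
        lowestSSE = inf
        for i in range(len(centList)):
            ptsInCurrCluster = dataMat[nonzero(clusterAssment[:, 0].A == i)[0], :]
            centroidMat, splitClustAss = kMeans(ptsInCurrCluster, 2, distMeas)
            sseSplit = sum(splitClustAss[:, 1])
            sseNotSplit = sum(clusterAssment[nonzero(clusterAssment[:, 0].A != i)[0], 1])
            if (sseSplit + sseNotSplit) < lowestSSE:
                bestCentToSplit = i
                bestNewCents = centroidMat
                bestClustAss = splitClustAss.copy()
                lowestSSE = sseSplit + sseNotSplit
        bestClustAss[nonzero(bestClustAss[:, 0].A == 1)[0], 0] = len(centList)
        bestClustAss[nonzero(bestClustAss[:, 0].A == 0)[0], 0] = bestCentToSplit
        centList[bestCentToSplit] = bestNewCents[0, :].tolist()[0]
        centList.append(bestNewCents[1, :].tolist()[0])
        clusterAssment[nonzero(clusterAssment[:, 0].A == bestCentToSplit)[0], :] = bestClustAss
    return mat(centList), clusterAssment

# two well-separated synthetic blobs; ask for two clusters
dataMat = mat([[1.0, 1.0], [1.5, 1.8], [8.0, 8.0], [8.5, 8.2]])
myCentroids, clustAssing = biKMeans(dataMat, 2)
print(myCentroids)
# the single split separates the blobs: [[1.25, 1.4], [8.25, 8.1]]
```

Because bisecting k-means always accepts the split with the lowest total SSE, it is less sensitive to initialization than plain k-means, which is why the book introduces it as a remedy for the local-minimum problem.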

Blog content follows the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License
