聚类算法就是通过一个固定的准则将若干个数据分成不同的类,而这个准则就是算法,即分类的标准。
1.样本:
数据是这样的,300个数据点:
186.663 202.772198.676 148.778143.059 205.835124.315 209.143183.409 151.252186.651 184.617152.448 213.176193.86 187.893147.529 260.705192.255 195.25199.348 246.308193.697 188.236112.12 201.993168.106 211.265212.573 186.155196.042 189.468163.708 204.957181.054 220.624158.703 168.099159.757 184.74196.79 192.998186.786 210.828196.497 207.053198.588 202.922181.534 173.303163.578 213.044179.282 176.883196.609 190.543138.516 157.012195.177 156.58190.53 182.799185.528 198.14142.969 164.181179.023 247.875214.873 197.911205.648 225.069152.519 237.886117.663 200.206195.056 178.23206.471 231.914195.335 134.527179.842 192.186201.969 232.993146.255 179.038205.406 208.909116.01 196.927209.268 204.178194.259 198.687178.556 182.883198.249 194.934196.83 190.598194.126 171.121119.272 163.223170.944 150.63182.481 206.846186.658 190.327197.471 162.009159.709 209.665199.476 211.293206.748 245.509206.406 204.516176.252 199.142190.133 229.646178.712 188.019151.013 237.015176.742 212.558182.972 201.977199.323 146.504156.122 239.561186.448 192.126179.963 192.297198.579 185.982188.084 201.899183.696 243.438147.175 193.677191.479 191.342108.569 191.222182.775 136.605130.451 156.001214.888 193.649161.908 148.296159.809 178.67204.497 154.195171.158 222.761196.04 181.647179.137 199.344153.147 151.605196.244 142.589207.977 225.414154.339 236.739207.607 225.961191.832 171.313164.26 215.03197.486 96.329199.638 59.965211.683 54.121151.582 23.532271.28 71.503264.923 101.928167.617 100.39202.113 114.749274.472 35.1209.937 18.919260.42 52.741157.854 27.62227.209 102.074188.259 90.859198.543 120.785141.484 26.01167.229 72.261205.988 117.576196.063 87.301156.426 31.878282.295 68.04291.867 17.576255.483 38.275185.408 89.429279.012 66.076275.475 47.206273.288 47.413214.551 77.592195.28 47.477233.479 84.275201.75 121.026258.297 100.726145.24 17.732168.497 80.165152.201 87.073156.81 100.00640.015 274.342111.668 225.726132.572 318.50281.682 208.12792.682 313.25783.935 256.66463.135 259.184124.016 260.5743.4 228.49424.468 221.772100.061 316.45398.86 271.58113.752 219.064110.894 212.3341.353 304.50815.272 280.95456.536 239.83537.807 221.0515.459 224.6963.999 248.9378.504 363.068138.674 288.37595.426 268.56995.851 352.587115.264 219.74519.005 214.40324.337 251.301138.374 262.9333.097 201.849111.099 296.60368.028 279.671225.167 197.408281.761 153.633265.153 201.25234.606 199.763242.599 161.636288.481 181.345232.487 146.963239.962 247.851230.852 155.934287.084 207.745258.476 253.752245.504 250.344231.481 220.091289.341 158.156224.293 218.578274.152 194.052266.65 199.529220.442 169.775273.666 154.976278.837 166.881287.532 188.421269.012 263.561254.356 209.196326.444 240.195269.494 130.213274.942 181.885351.502 294.563239.381 257.045285.555 174.956237.724 166.39318.404 240.652228.208 161.013219.366 203.459233.696 243.415228.683 182.809280.194 173.569238.482 195.29236.198 181.33223.364 173.82286.391 157.439220.934 198.874273.372 212.147260.989 286.917182.289 367.853362.761 317.836209.364 310.228177.461 291.76205.365 375.53237.474 355.536187.025 392.858294.034 353.4251.77 341.213306.181 318.855258.727 362.831193.536 338.408284.516 335.944264.24 275.262155.706 317.301137.6 339.338217.667 288.749228.865 389.289156.911 365.382196.577 267.226131.481 380.664243.27 284.093340.328 328.199129.81 383.682227.398 285.797210.289 305.682121.652 351.048214.065 380.543165.671 344.769297.968 358.993180.87 319.932229.68 334.947229.294 275.786280.687 361.591214.035 396.153205.155 332.869188.183 269.347245.506 349.31136.127 127.038103.733 3.847117.045 109.70220.688 130.3199.413 143.01842.53 102.254134.522 51.703127.222 145.68944.47 79.91825.086 74.26780.817 67.63640.818 76.98866.217 99.70892.698 155.32118.949 141.192111.029 146.34192.091 68.26995.929 88.974112.129 32.05513.645 123.9598.771 119.90789.082 80.09544.047 61.279137.867 117.78497.626 166.5420 129.274105.707 154.58293.265 44.73235.537 156.55882.69 151.633118.047 9.60657.817 66.352310.759 112.875370.382 182.441394.08 164.391353.638 137.656362.721 63.942348.123 194.11353.612 138.251351.363 152.021357.519 126.935289.905 94.706353.809 213.003392.191 162.078331.554 147.666313.834 93.734343.987 86.263382.006 206.06386.329 131.662310.848 147.063308.198 75.715314.647 88.739286.943 139.761377.336 144.399400 164.092363.689 199.723297.952 131.461361.09 80.444372.778 172.28
这些数据显示成图形,如下图:
如上图,这是三百个数据点,单凭肉眼看,我无法分别点和点,那个和哪个是同一类,这些点很没有规律,但是请看下图:
如上图:我便可以看清楚这个点与点之间是有联系的,经过连线我看到了点与点之间的关系,将它们可以分成七个类(簇),大致的块我都可以说出来,这些个点程序是怎么找出来的。
2.核心思想:
1. 将数据分为k个非空子集
2. 计算每个类中心点(k-means<centroid>中心点是所有点的average),记为seed point
3. 将每个object聚类到最近seed point
4. 返回2,当聚类结果不再变化的时候stop
代码:来自于维基百科:
#!/usr/bin/python# -*- coding: UTF-8 -*-from math import pi, sin, cosfrom collections import namedtuplefrom random import random, choicefrom copy import copytry: import psyco psyco.full()except ImportError: passFLOAT_MAX = 1e100class Point: __slots__ = ["x", "y", "group"] def __init__(self, x=0.0, y=0.0, group=0): self.x, self.y, self.group = x, y, group"""创建数据源,先初始化300个点对象,再循环赋值"""def generate_points(npoints, radius): points = [Point() for _ in xrange(npoints)] # note: this is not a uniform 2-d distribution for p in points: r = random() * radius ang = random() * 2 * pi p.x = r * cos(ang) p.y = r * sin(ang) return pointsdef nearest_cluster_center(point, cluster_centers): """Distance and index of the closest cluster center""" def sqr_distance_2D(a, b): return (a.x - b.x) ** 2 + (a.y - b.y) ** 2 min_index = point.group min_dist = FLOAT_MAX for i, cc in enumerate(cluster_centers): d = sqr_distance_2D(cc, point) if min_dist > d: min_dist = d min_index = i return (min_index, min_dist)'''points是数据点,nclusters是给定的簇类数目cluster_centers包含初始化的nclusters个中心点,开始都是对象->(0,0,0)'''def kpp(points, cluster_centers): cluster_centers[0] = copy(choice(points)) #随机选取第一个中心点 d = [0.0 for _ in xrange(len(points))] #列表,长度为len(points),保存每个点离最近的中心点的距离 for i in xrange(1, len(cluster_centers)): # i=1...len(c_c)-1 sum = 0 for j, p in enumerate(points): d[j] = nearest_cluster_center(p, cluster_centers[:i])[1]#第j个数据点p与各个中心点距离的最小值 sum += d[j] sum *= random() for j, di in enumerate(d): sum -= di if sum > 0: continue cluster_centers[i] = copy(points[j]) break for p in points: p.group = nearest_cluster_center(p, cluster_centers)[0]'''points是数据点,nclusters是给定的簇类数目'''def lloyd(points, nclusters): cluster_centers = [Point() for _ in xrange(nclusters)]#根据指定的中心点个数,初始化中心点,均为(0,0,0) # call k++ init kpp(points, cluster_centers) #选择初始种子点 lenpts10 = len(points) >> 10 changed = 0 while True: # group element for centroids are used as counters for cc in cluster_centers: cc.x = 0 cc.y = 0 cc.group = 0 for p in points: #与该种子点在同一簇的数据点的个数 cluster_centers[p.group].group += 1 cluster_centers[p.group].x += p.x cluster_centers[p.group].y += p.y for cc in cluster_centers:#生成新的中心点 cc.x /= cc.group cc.y /= cc.group # find closest centroid of each PointPtr changed = 0 #记录所属簇发生变化的数据点的个数 for p in points: min_i = nearest_cluster_center(p, cluster_centers)[0] if min_i != p.group: changed += 1 p.group = min_i # stop when 99.9% of points are good if changed <= lenpts10: break for i, cc in enumerate(cluster_centers): cc.group = i return cluster_centersdef print_eps(points, cluster_centers, W=400, H=400): Color = namedtuple("Color", "r g b"); colors = [] for i in xrange(len(cluster_centers)): colors.append(Color((3 * (i + 1) % 11) / 11.0, (7 * i % 11) / 11.0, (9 * i % 11) / 11.0)) max_x = max_y = -FLOAT_MAX min_x = min_y = FLOAT_MAX for p in points: if max_x < p.x: max_x = p.x if min_x > p.x: min_x = p.x if max_y < p.y: max_y = p.y if min_y > p.y: min_y = p.y scale = min(W / (max_x - min_x), H / (max_y - min_y)) cx = (max_x + min_x) / 2 cy = (max_y + min_y) / 2 print "%%!PS-Adobe-3.0\n%%%%BoundingBox: -5 -5 %d %d" % (W + 10, H + 10) print ("/l {rlineto} def /m {rmoveto} def\n" + "/c { .25 sub exch .25 sub exch .5 0 360 arc fill } def\n" + "/s { moveto -2 0 m 2 2 l 2 -2 l -2 -2 l closepath " + " gsave 1 setgray fill grestore gsave 3 setlinewidth" + " 1 setgray stroke grestore 0 setgray stroke }def") for i, cc in enumerate(cluster_centers): print ("%g %g %g setrgbcolor" % (colors[i].r, colors[i].g, colors[i].b)) for p in points: if p.group != i: continue print ("%.3f %.3f c" % ((p.x - cx) * scale + W / 2, (p.y - cy) * scale + H / 2)) print ("\n0 K中心 %g %g s" % ((cc.x - cx) * scale + W / 2, (cc.y - cy) * scale + H / 2)) print "\n%%%%EOF"def main(): npoints = 300 k = 7 # # clusters points = generate_points(npoints, 10) cluster_centers = lloyd(points, k) print_eps(points, cluster_centers)main()
3.计算结果:
计算出的7个点是:
数据视图:
178.432 194.88211.643 69.325870.6203 261.456258.91 196.902221.139 334.23482.3629 100.829347.752 138.943
全局分类结果:
结果二: