博客
关于我
强烈建议你试试无所不能的chatGPT,快点击我
机器学习-聚类算法2
阅读量:6643 次
发布时间:2019-06-25

本文共 10517 字,大约阅读时间需要 35 分钟。

聚类算法就是通过一个固定的准则将若干个数据分成不同的类,而这个准则就是算法,即分类的标准。

1.样本:

数据是这样的,300个数据点:

 

186.663    202.772198.676    148.778143.059    205.835124.315    209.143183.409    151.252186.651    184.617152.448    213.176193.86    187.893147.529    260.705192.255    195.25199.348    246.308193.697    188.236112.12    201.993168.106    211.265212.573    186.155196.042    189.468163.708    204.957181.054    220.624158.703    168.099159.757    184.74196.79    192.998186.786    210.828196.497    207.053198.588    202.922181.534    173.303163.578    213.044179.282    176.883196.609    190.543138.516    157.012195.177    156.58190.53    182.799185.528    198.14142.969    164.181179.023    247.875214.873    197.911205.648    225.069152.519    237.886117.663    200.206195.056    178.23206.471    231.914195.335    134.527179.842    192.186201.969    232.993146.255    179.038205.406    208.909116.01    196.927209.268    204.178194.259    198.687178.556    182.883198.249    194.934196.83    190.598194.126    171.121119.272    163.223170.944    150.63182.481    206.846186.658    190.327197.471    162.009159.709    209.665199.476    211.293206.748    245.509206.406    204.516176.252    199.142190.133    229.646178.712    188.019151.013    237.015176.742    212.558182.972    201.977199.323    146.504156.122    239.561186.448    192.126179.963    192.297198.579    185.982188.084    201.899183.696    243.438147.175    193.677191.479    191.342108.569    191.222182.775    136.605130.451    156.001214.888    193.649161.908    148.296159.809    178.67204.497    154.195171.158    222.761196.04    181.647179.137    199.344153.147    151.605196.244    142.589207.977    225.414154.339    236.739207.607    225.961191.832    171.313164.26    215.03197.486    96.329199.638    59.965211.683    54.121151.582    23.532271.28    71.503264.923    101.928167.617    100.39202.113    114.749274.472    35.1209.937    18.919260.42    52.741157.854    27.62227.209    102.074188.259    90.859198.543    120.785141.484    26.01167.229    72.261205.988    117.576196.063    87.301156.426    31.878282.295    68.04291.867    17.576255.483    38.275185.408    89.429279.012    66.076275.475    47.206273.288    47.413214.551    77.592195.28    47.477233.479    84.275201.75    121.026258.297    100.726145.24    17.732168.497    80.165152.201    87.073156.81    100.00640.015    274.342111.668    225.726132.572    318.50281.682    208.12792.682    313.25783.935    256.66463.135    259.184124.016    260.5743.4    228.49424.468    221.772100.061    316.45398.86    271.58113.752    219.064110.894    212.3341.353    304.50815.272    280.95456.536    239.83537.807    221.0515.459    224.6963.999    248.9378.504    363.068138.674    288.37595.426    268.56995.851    352.587115.264    219.74519.005    214.40324.337    251.301138.374    262.9333.097    201.849111.099    296.60368.028    279.671225.167    197.408281.761    153.633265.153    201.25234.606    199.763242.599    161.636288.481    181.345232.487    146.963239.962    247.851230.852    155.934287.084    207.745258.476    253.752245.504    250.344231.481    220.091289.341    158.156224.293    218.578274.152    194.052266.65    199.529220.442    169.775273.666    154.976278.837    166.881287.532    188.421269.012    263.561254.356    209.196326.444    240.195269.494    130.213274.942    181.885351.502    294.563239.381    257.045285.555    174.956237.724    166.39318.404    240.652228.208    161.013219.366    203.459233.696    243.415228.683    182.809280.194    173.569238.482    195.29236.198    181.33223.364    173.82286.391    157.439220.934    198.874273.372    212.147260.989    286.917182.289    367.853362.761    317.836209.364    310.228177.461    291.76205.365    375.53237.474    355.536187.025    392.858294.034    353.4251.77    341.213306.181    318.855258.727    362.831193.536    338.408284.516    335.944264.24    275.262155.706    317.301137.6    339.338217.667    288.749228.865    389.289156.911    365.382196.577    267.226131.481    380.664243.27    284.093340.328    328.199129.81    383.682227.398    285.797210.289    305.682121.652    351.048214.065    380.543165.671    344.769297.968    358.993180.87    319.932229.68    334.947229.294    275.786280.687    361.591214.035    396.153205.155    332.869188.183    269.347245.506    349.31136.127    127.038103.733    3.847117.045    109.70220.688    130.3199.413    143.01842.53    102.254134.522    51.703127.222    145.68944.47    79.91825.086    74.26780.817    67.63640.818    76.98866.217    99.70892.698    155.32118.949    141.192111.029    146.34192.091    68.26995.929    88.974112.129    32.05513.645    123.9598.771    119.90789.082    80.09544.047    61.279137.867    117.78497.626    166.5420    129.274105.707    154.58293.265    44.73235.537    156.55882.69    151.633118.047    9.60657.817    66.352310.759    112.875370.382    182.441394.08    164.391353.638    137.656362.721    63.942348.123    194.11353.612    138.251351.363    152.021357.519    126.935289.905    94.706353.809    213.003392.191    162.078331.554    147.666313.834    93.734343.987    86.263382.006    206.06386.329    131.662310.848    147.063308.198    75.715314.647    88.739286.943    139.761377.336    144.399400    164.092363.689    199.723297.952    131.461361.09    80.444372.778    172.28

 

这些数据显示成图形,如下图:

如上图,这是三百个数据点,单凭肉眼看,我无法分别点和点,那个和哪个是同一类,这些点很没有规律,但是请看下图:

 

如上图:我便可以看清楚这个点与点之间是有联系的,经过连线我看到了点与点之间的关系,将它们可以分成七个类(簇),大致的块我都可以说出来,这些个点程序是怎么找出来的。

2.核心思想:

       1. 将数据分为k个非空子集

       2. 计算每个类中心点(k-means<centroid>中心点是所有点的average),记为seed point

       3. 将每个object聚类到最近seed point

       4. 返回2,当聚类结果不再变化的时候stop

 

代码:来自于维基百科:

#!/usr/bin/python# -*- coding: UTF-8 -*-from math import pi, sin, cosfrom collections import namedtuplefrom random import random, choicefrom copy import copytry:    import psyco    psyco.full()except ImportError:    passFLOAT_MAX = 1e100class Point:    __slots__ = ["x", "y", "group"]    def __init__(self, x=0.0, y=0.0, group=0):        self.x, self.y, self.group = x, y, group"""创建数据源,先初始化300个点对象,再循环赋值"""def generate_points(npoints, radius):    points = [Point() for _ in xrange(npoints)]    # note: this is not a uniform 2-d distribution    for p in points:        r = random() * radius        ang = random() * 2 * pi        p.x = r * cos(ang)        p.y = r * sin(ang)    return pointsdef nearest_cluster_center(point, cluster_centers):    """Distance and index of the closest cluster center"""    def sqr_distance_2D(a, b):        return (a.x - b.x) ** 2 + (a.y - b.y) ** 2    min_index = point.group    min_dist = FLOAT_MAX    for i, cc in enumerate(cluster_centers):        d = sqr_distance_2D(cc, point)        if min_dist > d:            min_dist = d            min_index = i    return (min_index, min_dist)'''points是数据点,nclusters是给定的簇类数目cluster_centers包含初始化的nclusters个中心点,开始都是对象->(0,0,0)'''def kpp(points, cluster_centers):    cluster_centers[0] = copy(choice(points)) #随机选取第一个中心点    d = [0.0 for _ in xrange(len(points))] #列表,长度为len(points),保存每个点离最近的中心点的距离    for i in xrange(1, len(cluster_centers)): # i=1...len(c_c)-1        sum = 0        for j, p in enumerate(points):            d[j] = nearest_cluster_center(p, cluster_centers[:i])[1]#第j个数据点p与各个中心点距离的最小值            sum += d[j]        sum *= random()        for j, di in enumerate(d):            sum -= di            if sum > 0:                continue            cluster_centers[i] = copy(points[j])            break    for p in points:        p.group = nearest_cluster_center(p, cluster_centers)[0]'''points是数据点,nclusters是给定的簇类数目'''def lloyd(points, nclusters):    cluster_centers = [Point() for _ in xrange(nclusters)]#根据指定的中心点个数,初始化中心点,均为(0,0,0)    # call k++ init    kpp(points, cluster_centers) #选择初始种子点    lenpts10 = len(points) >> 10    changed = 0    while True:        # group element for centroids are used as counters        for cc in cluster_centers:            cc.x = 0            cc.y = 0            cc.group = 0        for p in points: #与该种子点在同一簇的数据点的个数            cluster_centers[p.group].group += 1            cluster_centers[p.group].x += p.x            cluster_centers[p.group].y += p.y        for cc in cluster_centers:#生成新的中心点            cc.x /= cc.group            cc.y /= cc.group        # find closest centroid of each PointPtr        changed = 0 #记录所属簇发生变化的数据点的个数        for p in points:            min_i = nearest_cluster_center(p, cluster_centers)[0]            if min_i != p.group:                changed += 1                p.group = min_i        # stop when 99.9% of points are good        if changed <= lenpts10:            break    for i, cc in enumerate(cluster_centers):        cc.group = i    return cluster_centersdef print_eps(points, cluster_centers, W=400, H=400):    Color = namedtuple("Color", "r g b");    colors = []    for i in xrange(len(cluster_centers)):        colors.append(Color((3 * (i + 1) % 11) / 11.0,                            (7 * i % 11) / 11.0,                            (9 * i % 11) / 11.0))    max_x = max_y = -FLOAT_MAX    min_x = min_y = FLOAT_MAX    for p in points:        if max_x < p.x: max_x = p.x        if min_x > p.x: min_x = p.x        if max_y < p.y: max_y = p.y        if min_y > p.y: min_y = p.y    scale = min(W / (max_x - min_x),                H / (max_y - min_y))    cx = (max_x + min_x) / 2    cy = (max_y + min_y) / 2    print "%%!PS-Adobe-3.0\n%%%%BoundingBox: -5 -5 %d %d" % (W + 10, H + 10)    print ("/l {rlineto} def /m {rmoveto} def\n" +           "/c { .25 sub exch .25 sub exch .5 0 360 arc fill } def\n" +           "/s { moveto -2 0 m 2 2 l 2 -2 l -2 -2 l closepath " +           "   gsave 1 setgray fill grestore gsave 3 setlinewidth" +           " 1 setgray stroke grestore 0 setgray stroke }def")    for i, cc in enumerate(cluster_centers):        print ("%g %g %g setrgbcolor" %               (colors[i].r, colors[i].g, colors[i].b))        for p in points:            if p.group != i:                continue            print ("%.3f %.3f c" % ((p.x - cx) * scale + W / 2,                                    (p.y - cy) * scale + H / 2))        print ("\n0 K中心 %g %g s" % ((cc.x - cx) * scale + W / 2,                                        (cc.y - cy) * scale + H / 2))    print "\n%%%%EOF"def main():    npoints = 300    k = 7  # # clusters    points = generate_points(npoints, 10)    cluster_centers = lloyd(points, k)    print_eps(points, cluster_centers)main()

 3.计算结果:

计算出的7个点是:

数据视图:

178.432	194.88211.643	69.325870.6203	261.456258.91	196.902221.139	334.23482.3629	100.829347.752	138.943

 

全局分类结果:

结果二:

 

转载于:https://www.cnblogs.com/gnool/p/6662956.html

你可能感兴趣的文章
Android 开发 -------- 自己定义View 画 五子棋
查看>>
c和c++的输入输出
查看>>
git填坑笔记
查看>>
linux下配置LAMP开发环境,以及经常使用小细节
查看>>
程序员的出路0
查看>>
关于Unity中UI中的RawImage节点以及制作地图滚动效果
查看>>
HiWorkV1.3版震撼公布,今日起正式公开測试!
查看>>
(七)Thymeleaf的 th:* 属性之—— th: ->设值& 遍历迭代& 条件判断
查看>>
linux 线程切换效率与进程切换效率相差究竟有多大?
查看>>
Tensorflow一些常用基本概念与函数
查看>>
2014中国民营企业500强在京津冀经济区、珠江三角洲、长江三角洲分布
查看>>
【JEECG技术博文】JEECG 简单实例解说权限控制
查看>>
如果想从jenkins直接生成docker镜像,并推送到harbor中,最简单的脚本如何实现?...
查看>>
条件随机场CRF(二) 前向后向算法评估标记序列概率
查看>>
codecombat之边远地区的森林1-11关及地牢38关代码分享
查看>>
小米2S电池电量用尽充电无法开机解决方法
查看>>
确定性这剂毒药,你喝过没
查看>>
GODOT 3.0 开发进度汇报 #6
查看>>
c#简单写售票系统
查看>>
简单RPC框架-业务线程池
查看>>