Implementing the K-Means Algorithm

The K-Means algorithm is among the simplest and most effective algorithms for classifying data. I have written about it before in an earlier post:

K-means

It still has two obvious drawbacks, though. First, the entire dataset must be kept around: if the training set is large, that takes a lot of storage space, and on top of that a distance has to be computed against every sample, which is time-consuming. The other drawback is that it gives no information about the underlying structure of the data. I will write about ways around these problems in a later post.

This article implements the algorithm in both Python and Matlab, and at the end briefly builds handwriting recognition on top of it.

Source code and test files

Python implementation

First, import numpy, operator, and listdir:

# -*- coding: utf-8 -*-

from numpy import *
import operator

# Pull listdir out of the os module; it lists the contents of a given directory
from os import listdir

Main idea of the algorithm

For each point in a dataset whose class labels are unknown, perform the following steps in turn:
(1) Compute the distance between the current point and every point in the dataset with known class labels
(2) Sort the distances in ascending order
(3) Select the k points closest to the current point
(4) Determine how often each class appears among those k points
(5) Return the class that appears most often among the k points as the predicted class of the current point

Algorithm implementation

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # Distance calculation (Euclidean)
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}
    # Vote with the k nearest points
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # Sort the classes by vote count, descending (items() instead of iteritems() for Python 3)
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
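
As a quick sanity check, classify0 can be called directly on a couple of made-up points (the toy data below is mine, not part of the original test files):

# Made-up toy points: two samples of class 'A' and two of class 'B'
group = array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(classify0([0.0, 0.2], group, labels, 3))  # prints 'B': two of the three nearest points are 'B'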

Normalization

# newValue = (oldValue - min) / (max - min)
def autoNorm(dataset):
    minvals = dataset.min(0)
    maxvals = dataset.max(0)
    ranges = maxvals - minvals
    normDataSet = zeros(shape(dataset))
    m = dataset.shape[0]
    normDataSet = dataset - tile(minvals, (m, 1))
    # Element-wise division to scale each feature into [0, 1]
    normDataSet = normDataSet / tile(ranges, (m, 1))
    return normDataSet, ranges, minvals
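
On a small made-up matrix this behaves as follows (illustrative numbers only, not from the test files):

# Each column is scaled into [0, 1] independently
raw = array([[10.0, 400.0], [20.0, 800.0], [30.0, 600.0]])
normed, ranges, minvals = autoNorm(raw)
# normed  -> [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5]]
# ranges  -> [20.0, 400.0], minvals -> [10.0, 400.0]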

Testing the algorithm

The test files are also in the git repository linked above.

First test

def file2matrix(filename):
    fr = open(filename)
    # Get the number of lines in the file
    arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)
    # Create the NumPy matrix to return
    returnMat = zeros((numberOfLines, 3))
    classLabelVector = []
    index = 0
    # Parse the file data into the feature matrix and label list
    for line in arrayOLines:
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector

Testing method:

Change into the current directory and start the Python shell.

Load the module, then enter the following to run the test:

import kNN
datingDataMat, datingLabels = kNN.file2matrix('datingTestSet2.txt')
datingDataMat
datingLabels[0:20]

The result is shown in the screenshot below; please excuse that while testing I fat-fingered data as deta.

Second test

def datingClassTest():
    hoRatio = 0.50  # hold out 50% of the data for testing
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')  # load data set from file
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]: errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
    print(errorCount)

The test procedure is as follows.

Reload the kNN module, then enter the following:

reload(kNN)  # on Python 3, first do: from importlib import reload
normMat, ranges, minVals = kNN.autoNorm(datingDataMat)
normMat
ranges
minVals
kNN.datingClassTest()

The normalization is tested here as well, since this test depends on the normalization routine.

The result is shown below; since the output of kNN.datingClassTest() is quite long, only the tail end is captured.

Here the error rate is 6.4%, i.e. 32 errors out of the 500 held-out samples. The classifier involves no training step, so an error rate this high, or even higher, can occur.

Matlab implementation

These are my assignments from the Machine Learning course on Coursera taught by the well-known Stanford professor Andrew Ng; they passed the online grader, so their correctness is reasonably assured.

find closest centroids

function idx = findClosestCentroids(X, centroids)
%FINDCLOSESTCENTROIDS computes the centroid memberships for every example
K = size(centroids, 1);
idx = zeros(size(X, 1), 1);
for i = 1:size(X, 1)
    % Assign example i to the centroid with the smallest Euclidean distance
    distance = inf;
    for j = 1:K
        KDist = norm(X(i,:) - centroids(j,:));
        if KDist < distance
            distance = KDist;
            idx(i) = j;
        end
    end
end
end

compute centroids

function centroids = computeCentroids(X, idx, K)
%COMPUTECENTROIDS returns the new centroids by computing the means of the
%data points assigned to each centroid.
[m, n] = size(X);
centroids = zeros(K, n);
for i = 1:K
    indices = idx == i;
    for j = 1:n
        centroids(i,j) = sum(X(:,j) .* indices) / sum(indices);
    end
end
end

kMeansInitCentroids

function centroids = kMeansInitCentroids(X, K)
%KMEANSINITCENTROIDS This function initializes K centroids that are to be
%used in K-Means on the dataset X
centroids = zeros(K, size(X, 2));
% Pick K distinct examples from X at random as the initial centroids
randidx = randperm(size(X, 1));
centroids = X(randidx(1:K), :);
end

run k-Means

function [centroids, idx] = runkMeans(X, initial_centroids, ...
                                      max_iters, plot_progress)
%RUNKMEANS runs the K-Means algorithm on data matrix X, where each row of X
%is a single example
% Set default value for plot_progress
if ~exist('plot_progress', 'var') || isempty(plot_progress)
    plot_progress = false;
end

% Plot the data if we are plotting progress
if plot_progress
    figure;
    hold on;
end

% Initialize values
[m, n] = size(X);
K = size(initial_centroids, 1);
centroids = initial_centroids;
previous_centroids = centroids;
idx = zeros(m, 1);

% Run K-Means
for i = 1:max_iters

    % Output progress
    fprintf('K-Means iteration %d/%d...\n', i, max_iters);
    if exist('OCTAVE_VERSION')
        fflush(stdout);
    end

    % For each example in X, assign it to the closest centroid
    idx = findClosestCentroids(X, centroids);

    % Optionally, plot progress here
    if plot_progress
        plotProgresskMeans(X, centroids, previous_centroids, idx, K, i);
        previous_centroids = centroids;
        fprintf('Press enter to continue.\n');
        pause;
    end

    % Given the memberships, compute new centroids
    centroids = computeCentroids(X, idx, K);
end

% Hold off if we are plotting progress
if plot_progress
    hold off;
end
end
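
For comparison, the same assign-then-update loop can be sketched in NumPy. This is my own illustrative sketch (the function and variable names are made up), not part of the course code, and it does not handle the case of an empty cluster:

import numpy as np

def run_kmeans_numpy(X, K, max_iters=10):
    # Random initialization: pick K distinct rows of X as the starting centroids
    m, n = X.shape
    centroids = X[np.random.permutation(m)[:K], :]
    for _ in range(max_iters):
        # Assignment step (findClosestCentroids): nearest centroid for each row of X
        dists = ((X[:, np.newaxis, :] - centroids[np.newaxis, :, :]) ** 2).sum(axis=2)
        idx = dists.argmin(axis=1)
        # Update step (computeCentroids): mean of the points assigned to each centroid
        centroids = np.array([X[idx == k].mean(axis=0) for k in range(K)])
    return centroids, idx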

Handwriting recognition based on this algorithm

Some additional applications.

Implementation approach

(1) Collect the data: text files are provided
(2) Prepare the data: write a function img2vector() that converts the image format into the vector format the classifier can use
(3) Analyze the data: inspect the data and make sure it meets the requirements
(4) Train the algorithm
(5) Test the algorithm: write a function that uses part of the provided dataset as test samples. The difference between test samples and non-test samples is that test samples are already labeled; whenever the predicted class differs from the actual class, it is counted as an error
(6) Use the algorithm

Converting an image into a test vector

def img2vector(filename):
    # Flatten a 32x32 text image into a 1x1024 vector
    returnVect = zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect
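
For example (the file name below is only an assumed sample from the digits folder in the repository):

# Assumed sample file; substitute any file from digits/testDigits
testVector = img2vector('digits/testDigits/0_13.txt')
testVector[0, 0:31]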

Recognition algorithm

def handwritingClassTest():
    hwLabels = []
    # Get the contents of the training directory
    trainingFileList = listdir('digits/trainingDigits')
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    # Parse the class label out of each file name
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector('digits/trainingDigits/%s' % fileNameStr)
    testFileList = listdir('digits/testDigits')
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('digits/testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumStr))
        if classifierResult != classNumStr: errorCount += 1.0
    print("\nthe total number of errors is: %d" % errorCount)
    print("\nthe total error rate is: %f" % (errorCount / float(mTest)))

Testing method

As before, from the current directory, reload kNN in the Python shell, then enter the following:

reload(kNN)  # on Python 3, first do: from importlib import reload
kNN.handwritingClassTest()

This run also produces a lot of output, since much more data is involved, so again only the tail end is captured.

You can see that with the larger data sample this time, the error rate is noticeably lower.

Putting this together took effort; please credit the source when reposting.