The Backend Programmer's Road 13: Digit Recognition with KNN

An experiment in using KNN for handwritten digit recognition. The test data comes from:
MNIST handwritten digit database, Yann LeCun, Corinna Cortes and Chris Burges
http://yann.lecun.com/exdb/mnist/

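1. Data
Each bitmap is converted into a vector (array), k is tried with values from 3 to 15, and distances are computed as Euclidean distance.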
d(x,y)=\sqrt{\sum_{i=1}^{n}(x_i-y_i)^2}
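
As a quick check of the formula above, here is a minimal sketch of the distance between two vectors with numpy. The function name euclidean_distance is a hypothetical helper for illustration and is not part of the code in this post.

```python
import numpy as np

def euclidean_distance(x, y):
    """Return sqrt(sum_i (x_i - y_i)^2) for two equal-length vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

# A 3-4-5 right triangle: the distance between (0, 0) and (3, 4) is 5.
d = euclidean_distance([0, 0], [3, 4])
print(d)  # 5.0
```

For MNIST the same function would apply to two flattened 784-element image vectors.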

2. Testing
Vary k and the number of base (training) samples, then measure how k affects recognition accuracy and how long classification takes.
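
The measurement described above could be organised roughly as follows. This is only a sketch under stated assumptions: it uses small synthetic Gaussian clusters in place of MNIST so that it runs on its own, and knn_predict, make_blobs and the sample sizes are hypothetical names and values chosen for illustration.

```python
import time
from collections import Counter

import numpy as np

def knn_predict(x, train_x, train_y, k):
    """Classify one vector by majority vote among its k nearest neighbours."""
    dist = np.sqrt(np.sum((train_x - x) ** 2, axis=1))
    nearest = np.argsort(dist)[:k]
    return Counter(train_y[nearest].tolist()).most_common(1)[0][0]

def make_blobs(n_per_class, rng):
    """Synthetic stand-in for MNIST: three Gaussian clusters in 16 dimensions."""
    xs, ys = [], []
    for label in range(3):
        xs.append(rng.normal(loc=2.0 * label, scale=1.0, size=(n_per_class, 16)))
        ys.append(np.full(n_per_class, label))
    return np.vstack(xs), np.concatenate(ys)

rng = np.random.default_rng(0)
train_x, train_y = make_blobs(200, rng)
test_x, test_y = make_blobs(50, rng)

results = []
for n_train in (60, 600):
    # Take a random subset of the training samples as the "base samples".
    idx = rng.permutation(len(train_x))[:n_train]
    sub_x, sub_y = train_x[idx], train_y[idx]
    for k in range(3, 16, 2):
        start = time.perf_counter()
        hits = sum(int(knn_predict(x, sub_x, sub_y, k) == y)
                   for x, y in zip(test_x, test_y))
        elapsed = time.perf_counter() - start
        accuracy = hits / len(test_x)
        results.append((n_train, k, accuracy, elapsed))
        print("n_train=%d k=%d accuracy=%.2f%% time=%.3fs"
              % (n_train, k, accuracy * 100, elapsed))
```

Swapping the synthetic arrays for the real MNIST arrays would give the actual accuracy and timing figures, which are not reproduced here.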

Reference: How to parse MNIST images with Python - 海上扬凡's blog - CSDN.NET
http://blog.csdn.net/u014046170/article/details/47445919

# -*- coding: utf-8 -*-
"""
Created on Wed Mar 08 14:38:15 2017

@author: zapline<278998871@qq.com>
"""
# dataset.py: loads the MNIST IDX files (imported as `dataset` by the test script).

import struct
import os
import numpy

def read_file_data(filename):
    # Read the whole file into memory as bytes.
    with open(filename, 'rb') as f:
        return f.read()

def loadImageDataSet(filename):
    index = 0
    buf = read_file_data(filename)
    # Header: magic number, image count, rows, columns (big-endian unsigned 32-bit ints).
    magic, images, rows, columns = struct.unpack_from('>IIII', buf, index)
    index += struct.calcsize('>IIII')
    # One row per image, one column per pixel.
    data = numpy.zeros((images, rows * columns))
    for i in range(images):
        imgVector = numpy.zeros((1, rows * columns))
        for x in range(rows):
            for y in range(columns):
                # Each pixel is a single unsigned byte.
                imgVector[0, x * columns + y] = int(struct.unpack_from('>B', buf, index)[0])
                index += struct.calcsize('>B')
        data[i, :] = imgVector
    return data

def loadLableDataSet(filename):
    index = 0
    buf = read_file_data(filename)
    # Header: magic number and label count (big-endian unsigned 32-bit ints).
    magic, images = struct.unpack_from('>II', buf, index)
    index += struct.calcsize('>II')
    data = []
    for i in range(images):
        # Each label is a single unsigned byte (a digit from 0 to 9).
        lable = int(struct.unpack_from('>B', buf, index)[0])
        index += struct.calcsize('>B')
        data.append(lable)
    return data

def loadDataSet():
    # The author's local directory; change it to wherever the MNIST files are stored.
    path = "D:\\kingsoft\\ml\\dataset\\"
    trainingImageFile = path + "train-images.idx3-ubyte"
    trainingLableFile = path + "train-labels.idx1-ubyte"
    testingImageFile = path + "t10k-images.idx3-ubyte"
    testingLableFile = path + "t10k-labels.idx1-ubyte"
    train_x = loadImageDataSet(trainingImageFile)
    train_y = loadLableDataSet(trainingLableFile)
    test_x = loadImageDataSet(testingImageFile)
    test_y = loadLableDataSet(testingLableFile)
    return train_x, train_y, test_x, test_y
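
Unpacking one byte at a time is easy to follow but slow for tens of thousands of images. As a possible alternative, not the approach used in this post, the pixel bytes can be read in one call with numpy.frombuffer. The sketch below assumes the same header layout as the loader above and runs on tiny hand-built buffers rather than the real files; parse_idx_images and parse_idx_labels are hypothetical names.

```python
import struct

import numpy as np

def parse_idx_images(buf):
    """Parse an IDX3 image buffer into an (images, rows * columns) float array."""
    magic, images, rows, columns = struct.unpack_from('>IIII', buf, 0)
    offset = struct.calcsize('>IIII')
    pixels = np.frombuffer(buf, dtype=np.uint8,
                           count=images * rows * columns, offset=offset)
    return pixels.reshape(images, rows * columns).astype(float)

def parse_idx_labels(buf):
    """Parse an IDX1 label buffer into a list of ints."""
    magic, images = struct.unpack_from('>II', buf, 0)
    offset = struct.calcsize('>II')
    labels = np.frombuffer(buf, dtype=np.uint8, count=images, offset=offset)
    return labels.astype(int).tolist()

# Tiny hand-built buffers: two 2x2 "images" and their two labels.
# 2051 and 2049 are the magic numbers of the MNIST image and label files.
image_buf = struct.pack('>IIII', 2051, 2, 2, 2) + bytes([0, 1, 2, 3, 4, 5, 6, 7])
label_buf = struct.pack('>II', 2049, 2) + bytes([7, 3])

images = parse_idx_images(image_buf)
labels = parse_idx_labels(label_buf)
print(images.shape, labels)  # (2, 4) [7, 3]
```

With real data, buf would be the bytes returned by read_file_data.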

# -*- coding: utf-8 -*-
"""
Created on Wed Mar 08 14:35:55 2017

@author: zapline<278998871@qq.com>
"""
# knn.py: the classifier (imported as `knn` by the test script).

import numpy

def kNNClassify(newInput, dataSet, labels, k):
    numSamples = dataSet.shape[0]
    # Step 1: Euclidean distance from newInput to every training sample.
    diff = numpy.tile(newInput, (numSamples, 1)) - dataSet
    squaredDiff = diff ** 2
    squaredDist = numpy.sum(squaredDiff, axis=1)
    distance = squaredDist ** 0.5
    # Step 2: indices of the training samples, nearest first.
    sortedDistIndices = numpy.argsort(distance)

    # Step 3: count the labels of the k nearest samples.
    classCount = {}
    for i in range(k):
        voteLabel = labels[sortedDistIndices[i]]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1

    # Step 4: return the label with the most votes.
    maxCount = 0
    maxIndex = None
    for key, value in classCount.items():
        if value > maxCount:
            maxCount = value
            maxIndex = key
    return maxIndex
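
The same idea can be written more compactly. The sketch below is a variation rather than the post's implementation: it relies on numpy broadcasting instead of numpy.tile and on collections.Counter for the vote, and it is exercised on a made-up six-point data set. The name knn_classify and the toy data are assumptions for illustration.

```python
from collections import Counter

import numpy as np

def knn_classify(new_input, data_set, labels, k):
    """Return the majority label among the k training samples nearest to new_input."""
    # Broadcasting subtracts new_input from every row of data_set.
    distance = np.sqrt(np.sum((data_set - new_input) ** 2, axis=1))
    nearest = np.argsort(distance)[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data: three points near the origin (label 0), three near (5, 5) (label 1).
train_x = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
train_y = [0, 0, 0, 1, 1, 1]

print(knn_classify(np.array([0.5, 0.5]), train_x, train_y, 3))  # 0
print(knn_classify(np.array([5.5, 5.5]), train_x, train_y, 3))  # 1
```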

# -*- coding: utf-8 -*-
"""
Created on Wed Mar 08 14:39:21 2017

@author: zapline<278998871@qq.com>
"""
# Test script: expects the two modules above to be saved as dataset.py and knn.py.

import dataset
import knn

def testHandWritingClass():
    print("step 1: load data...")
    train_x, train_y, test_x, test_y = dataset.loadDataSet()

    print("step 2: training...")
    # KNN has no training phase; the training samples are used directly at prediction time.
    pass

    print("step 3: testing...")
    numTestSamples = test_x.shape[0]
    matchCount = 0
    for i in range(numTestSamples):
        # Classify each test image with k = 3 and compare against its true label.
        predict = knn.kNNClassify(test_x[i], train_x, train_y, 3)
        if predict == test_y[i]:
            matchCount += 1
    accuracy = float(matchCount) / numTestSamples

    print("step 4: show the result...")
    print('The classify accuracy is: %.2f%%' % (accuracy * 100))

testHandWritingClass()
print("game over")

Summary: the code above runs fairly slowly, but with enough training data the accuracy is good.
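
One possible way to reduce the running time, offered as a sketch and not something measured in this post, is to compute the distances for many test vectors at once using the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b, which turns the inner loop into a matrix multiplication. The function names and toy data below are assumptions. Note that ties are broken differently from the code above: numpy.bincount(...).argmax() picks the smallest label among tied ones.

```python
import numpy as np

def pairwise_squared_distances(test_x, train_x):
    """Squared Euclidean distance between every test row and every training row."""
    test_sq = np.sum(test_x ** 2, axis=1)[:, np.newaxis]
    train_sq = np.sum(train_x ** 2, axis=1)[np.newaxis, :]
    cross = test_x @ train_x.T
    # Rounding can leave tiny negative values, so clip at zero.
    return np.maximum(test_sq + train_sq - 2.0 * cross, 0.0)

def knn_classify_batch(test_x, train_x, train_y, k):
    """Classify all test rows; train_y must be an array of non-negative ints."""
    d2 = pairwise_squared_distances(test_x, train_x)
    nearest = np.argsort(d2, axis=1)[:, :k]   # shape (n_test, k)
    votes = train_y[nearest]                  # labels of the k nearest samples
    return np.array([np.bincount(row).argmax() for row in votes])

# Toy data: three points near the origin (label 0), three near (5, 5) (label 1).
train_x = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
train_y = np.array([0, 0, 0, 1, 1, 1])
test_x = np.array([[0.5, 0.5], [5.5, 5.5]])

preds = knn_classify_batch(test_x, train_x, train_y, 3)
print(preds.tolist())  # [0, 1]
```

A full 10000 x 60000 distance matrix in float64 would need roughly 4.8 GB, so with the real MNIST arrays the test set would have to be processed in smaller chunks.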
