使用k-近邻算法改进约会网站的配对效果

---恢复内容开始---

《 Machine Learning 机器学习实战》的确是一本学习python，掌握数据相关技能的，不可多得的好书！！

最近邻算法源码如下，给有需要的入门者学习，大神请绕道。

'''

Created on Sep 16, 2010

kNN: k Nearest Neighbors

Input:      inX: vector to compare to existing dataset (1xN)

            dataSet: size m data set of known vectors (NxM)

            labels: data set labels (1xM vector)

            k: number of neighbors to use for comparison (should be an odd number)

Output:     the most popular class label

@author: pbharrin

'''

from numpy import *

import operator

from os import listdir

def classify0(inX, dataSet, labels, k):

    dataSetSize = dataSet.shape[0]

    diffMat = tile(inX, (dataSetSize,1)) - dataSet

    sqDiffMat = diffMat**2

    sqDistances = sqDiffMat.sum(axis=1)

    distances = sqDistances**0.5

    sortedDistIndicies = distances.argsort()

    classCount={}

    for i in range(k):

        voteIlabel = labels[sortedDistIndicies[i]]

        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1

    sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)

    return sortedClassCount[0][0]

def createDataSet():

    group = array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])

    labels = ['A','A','B','B']

    return group, labels

def file2matrix(filename):

    fr = open(filename)

    numberOfLines = len(fr.readlines())         #get the number of lines in the file

    returnMat = zeros((numberOfLines,3))        #prepare matrix to return

    classLabelVector = []                       #prepare labels return

    fr = open(filename)

    index = 0

    for line in fr.readlines():

        line = line.strip()

        listFromLine = line.split('\t')

        returnMat[index,:] = listFromLine[0:3]          # 读取前三个属性值

        classLabelVector.append(int(listFromLine[-1]))  # 读取类标签

        index += 1

    return returnMat,classLabelVector

def autoNorm(dataSet):

    minVals = dataSet.min(0)

    maxVals = dataSet.max(0)

    ranges = maxVals - minVals

    normDataSet = zeros(shape(dataSet))

    m = dataSet.shape[0]

    normDataSet = dataSet - tile(minVals, (m,1))

    normDataSet = normDataSet/tile(ranges, (m,1))   #element wise divide

    return normDataSet, ranges, minVals

def datingClassTest():

    hoRatio = 0.50      #hold out 10%

    datingDataMat,datingLabels = file2matrix('datingTestSet2.txt')       #load data setfrom file

    normMat, ranges, minVals = autoNorm(datingDataMat)

    m = normMat.shape[0]

    numTestVecs = int(m*hoRatio)

    errorCount = 0.0

    for i in range(numTestVecs):

        classifierResult = classify0(normMat[i,:],normMat[numTestVecs:m,:],datingLabels[numTestVecs:m],3)

        print "the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i])

        if (classifierResult != datingLabels[i]): errorCount += 1.0

    print "the total error rate is: %f" % (errorCount/float(numTestVecs))

    print errorCount

手写数字识别

# 解析文本数据

def img2vector(filename):

    returnVect = zeros((1,1024))

    fr = open(filename)

    for i in range(32):

        lineStr = fr.readline()

        for j in range(32):

            returnVect[0,32*i+j] = int(lineStr[j])

    return returnVect

# 测试

def handwritingClassTest():

    hwLabels = []

    trainingFileList = listdir('trainingDigits')           #load the training set

    m = len(trainingFileList)

    trainingMat = zeros((m,1024))

    for i in range(m):

        fileNameStr = trainingFileList[i]

        fileStr = fileNameStr.split('.')[0]     #take off .txt

        classNumStr = int(fileStr.split('_')[0])

        hwLabels.append(classNumStr)

        trainingMat[i,:] = img2vector('trainingDigits/%s' % fileNameStr)

    testFileList = listdir('testDigits')        #iterate through the test set

    errorCount = 0.0

    mTest = len(testFileList)

    for i in range(mTest):

        fileNameStr = testFileList[i]

        fileStr = fileNameStr.split('.')[0]     #take off .txt

        classNumStr = int(fileStr.split('_')[0])

        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)

        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)

        print "the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumStr)

        if (classifierResult != classNumStr): errorCount += 1.0

    print "\nthe total number of errors is: %d" % errorCount

    print "\nthe total error rate is: %f" % (errorCount/float(mTest))

使用k-近邻算法改进约会网站的配对效果的更多相关文章

使用K近邻算法改进约会网站的配对效果
1 定义数据集导入函数 import numpy as np """ 函数说明:打开并解析文件,对数据进行分类:1 代表不喜欢,2 代表魅力一般,3 代表极具魅力 Par ...
k-近邻(KNN)算法改进约会网站的配对效果[Python]
使用Python实现k-近邻算法的一般流程为: 1.收集数据:提供文本文件 2.准备数据:使用Python解析文本文件,预处理 3.分析数据:可视化处理 4.训练算法:此步骤不适用与k——近邻算法 5 ...
【Machine Learning in Action --2】K-近邻算法改进约会网站的配对效果
摘自:<机器学习实战>,用python编写的(需要matplotlib和numpy库) 海伦一直使用在线约会网站寻找合适自己的约会对象.尽管约会网站会推荐不同的人选,但她没有从中找到喜欢的 ...
机器学习读书笔记（二）使用k-近邻算法改进约会网站的配对效果
一.背景海伦女士一直使用在线约会网站寻找适合自己的约会对象.尽管约会网站会推荐不同的任选,但她并不是喜欢每一个人.经过一番总结,她发现自己交往过的人可以进行如下分类不喜欢的人魅力一般的人极具魅 ...
吴裕雄--天生自然python机器学习：使用K-近邻算法改进约会网站的配对效果
在约会网站使用K-近邻算法准备数据:从文本文件中解析数据海伦收集约会数据巳经有了一段时间,她把这些数据存放在文本文件(1如1^及抓比加中,每个样本数据占据一行,总共有1000行.海伦的样本主 ...
机器学习实战1-2.1 KNN改进约会网站的配对效果 datingTestSet2.txt 下载方法
今天读<机器学习实战>读到了使用k-临近算法改进约会网站的配对效果,道理我都懂,但是看到代码里面的数据样本集 datingTestSet2.txt 有点懵,这个样本集在哪里,只给了我一个文 ...
KNN算法项目实战——改进约会网站的配对效果
KNN项目实战——改进约会网站的配对效果 1.项目背景: 海伦女士一直使用在线约会网站寻找适合自己的约会对象.尽管约会网站会推荐不同的人选,但她并不是喜欢每一个人.经过一番总结,她发现自己交往过的人可 ...
kNN分类算法实例1：用kNN改进约会网站的配对效果
目录实战内容用sklearn自带库实现kNN算法分类将内含非数值型的txt文件转化为csv文件用sns.lmplot绘图反映几个特征之间的关系参考资料 @ 实战内容海伦女士一直使用在线约会 ...
k-近邻算法-优化约会网站的配对效果
KNN原理 1. 假设有一个带有标签的样本数据集(训练样本集),其中包含每条数据与所属分类的对应关系. 2. 输入没有标签的新数据后,将新数据的每个特征与样本集中数据对应的特征进行比较. a. 计算新 ...

随机推荐

LINQ及EntityFramework何时从数据库返回数据，备忘
Generally speaking, LINQ queries are executed when the application code processes data (for instance ...
linux概念之/proc与/sys
http://blog.chinaunix.net/uid-1835494-id-3070465.html proc/x:1/sched http://bbs.chinaunix.net/threa ...
jQuery选择器大全(48个代码片段+21幅图演示)
选择器是jQuery最基础的东西,本文中列举的选择器基本上囊括了所有的jQuery选择器,也许各位通过这篇文章能够加深对jQuery选择器的理解,它们本身用法就非常简单,我更希望的是它能够提升个人编 ...
intellij idea +maven4+springmvc4搭建
0.淘宝mave培训PPT http://www.open-open.com/doc/view/4058453cde4b429c82ff2807d8aa81f0 1.intellij创建空的maven ...
linux 下 nginx 启动服务器 80端口被占用问题
把80端口占用的程序杀死 sudo fuser -k 80/tcp rm -fr 文件 ----删除文件及文加下的所有文件 echo > filename ---清空文件的内容
EV电池指标及特点
在电池的大家族中,蓄电池的种类是最多的,共同的特点是可以经历多次充电.放电循环,反复使用,这也正是蓄电池作为电动汽车动力源的基础.当然,并不是所有的蓄电池都适合应用于电动汽车,从全球新能源汽车的发展来 ...
Android 使用Telephony API
Android 使用Telephony API public class TelephonyDemo extends Activity { TextView textOut; TelephonyMan ...
[mysql] mysql explain 使用
explain显示了mysql如何使用索引来处理select语句以及连接表.可以帮助选择更好的索引和写出更优化的查询语句. 先解析一条sql语句,看出现什么内容 EXPLAINSELECTs.uid, ...
初级——程序如何打包成apk文件
将Eclipse Android项目打包成APK文件是本文要介绍的内容,主要是来了解并学习Eclipse Android打包的内容,具体关于Eclipse Android内容的详解来看本文.Eclip ...
C#对数组去重
#region ArrayList的示例应用 /// 方法名:DelArraySame /// 功能: 删除数组中重复的元素 /// </summary> /// <param na ...

使用k-近邻算法改进约会网站的配对效果

使用k-近邻算法改进约会网站的配对效果的更多相关文章

随机推荐

热门专题