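Task: use 10-fold cross-validation to implement SVM-based recognition on a face database, show how different kernel-function parameters affect the recognition results, and plot comparison curves.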

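The work is done in Python and mainly follows reference [4]; whenever I met a function I did not understand, I looked it up in the official documentation and related material. It covers plotting in Python, traversing files, reading images, PCA dimensionality reduction, SVM, and cross-validation.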

0. Data description and preprocessing

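Download the AT&T face database (http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html). After unpacking there are 40 folders, each holding 10 face photos of one person. The photos are read with Python's glob library and PIL's Image, and each one is flattened into a one-dimensional vector. Note that glob does not return files in any particular order, so the data has to be read folder by folder, one person at a time, with the matching class label recorded for each image.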

 PICTURE_PATH = u"F:\\att_faces"
 all_data_set = []    # full raw data set: an n*m matrix, n samples, m attributes
 all_data_label = []  # class label of every sample

 def get_picture():
     label = 1
     # read every image and flatten it to one dimension
     while (label <= 20):  # only the first 20 of the 40 people are used
         for name in glob.glob(PICTURE_PATH + "\\s" + str(label) + "\\*.pgm"):
             img = Image.open(name)
             #img.getdata()
             #np.array(img).reshape(1, 92*112)
             all_data_set.append( list(img.getdata()) )
             all_data_label.append(label)
         label += 1

 get_picture()
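
The loading step above can also be sketched with a deterministic file order. This is a minimal sketch, not the code used for the results in this post: `load_faces` and its parameters are hypothetical names, it assumes Python 3 with NumPy and Pillow installed, and it assumes the `s1`, `s2`, ... folder layout described above. `sorted()` fixes the order of files inside each folder, and every image becomes one row of a NumPy array.

```python
import glob
import os

import numpy as np
from PIL import Image


def load_faces(root, n_subjects):
    """Hypothetical helper: read root/s1 ... root/s<n_subjects>, return (X, y)."""
    data, labels = [], []
    for label in range(1, n_subjects + 1):
        pattern = os.path.join(root, "s%d" % label, "*.pgm")
        # glob gives no ordering guarantee, so sort the names explicitly
        for name in sorted(glob.glob(pattern)):
            with Image.open(name) as img:
                # force 8-bit grayscale, then flatten height x width into one row
                pixels = np.asarray(img.convert("L"), dtype=np.float64)
            data.append(pixels.ravel())
            labels.append(label)
    # one row per image, one column per pixel
    return np.vstack(data), np.array(labels)
```

With the path used in this post, the call would be `load_faces(u"F:\\att_faces", 20)`; the returned arrays can stand in for `all_data_set` and `all_data_label`.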

1. PCA dimensionality reduction

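Once the raw data is loaded, it is reduced with PCA. Choosing the number of features to keep was a problem: the reference example sets n_components to 150, but with a value that large the recognition rate on this data set is very low. The training faces are recognized 100% correctly, yet new faces cannot be predicted, which looks like overfitting (?). After setting n_components to 16 the results became very good. How many dimensions should be kept after PCA, and what criterion decides it? Note that with many dimensions SVM training is very slow and uses a lot of memory, whereas a small number of dimensions gave both good results and good efficiency.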

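Also, the example reduces the training set and the test set separately, that is, PCA is fitted on the training set only. The code below instead fits PCA once on the whole data set, before cross-validation. The data is converted to numpy arrays to make the later code easier to write.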

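 # number of features kept after reduction; if it is too large, e.g. 100,
 # the results become extremely inaccurate. Why??
 n_components = 16
 pca = PCA(n_components = n_components, svd_solver='auto',
           whiten=True).fit(all_data_set)
 # the full data set after PCA reduction
 all_data_pca = pca.transform(all_data_set)
 # X is the reduced data, y the matching class labels
 X = np.array(all_data_pca)
 y = np.array(all_data_label)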
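
One common heuristic for the question of how many dimensions to keep is to look at how much variance the leading components explain and keep just enough of them to reach a chosen fraction. The sketch below illustrates that idea under assumptions and is not the method used in this post: `components_for_variance` and the 0.95 threshold are hypothetical choices, and it assumes Python 3 with NumPy and scikit-learn. A variance threshold says nothing directly about recognition accuracy, so the final value is still best checked by cross-validation. A possible but unverified factor in the poor results at large n_components is that whiten=True rescales every kept component to unit variance, so low-variance components weigh as much as the leading ones.

```python
import numpy as np
from sklearn.decomposition import PCA


def components_for_variance(data, threshold=0.95):
    """Hypothetical helper: smallest number of principal components whose
    cumulative explained-variance ratio reaches `threshold`."""
    pca = PCA().fit(data)  # keep all components so the full spectrum is visible
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # searchsorted returns the first index with cumulative >= threshold;
    # adding 1 turns that index into a component count
    count = int(np.searchsorted(cumulative, threshold)) + 1
    # guard against rounding when threshold is (almost) 1.0
    return min(count, len(cumulative))
```

On the face data the call would be `components_for_variance(all_data_set)`. scikit-learn's PCA also accepts a float between 0 and 1 as n_components (together with svd_solver='full') to make the same selection internally.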

2. SVM training and recognition

The reduced data is then used for training and recognition.

 # takes a kernel name and a gamma value, returns the 10-fold
 # cross-validated accuracy of the SVM
 def SVM(kernel_name, param):
     # 10-fold cross-validation to compute the average accuracy
     # n_splits folds, samples shuffled randomly
     kf = KFold(n_splits=10, shuffle = True)
     precision_average = 0.0
     param_grid = {'C': [1e3, 5e3, 1e4, 5e4, 1e5]}  # exhaustively search for the best C
     clf = GridSearchCV(SVC(kernel=kernel_name, class_weight='balanced', gamma = param),
                        param_grid)
     for train, test in kf.split(X):
         clf = clf.fit(X[train], y[train])
         #print(clf.best_estimator_)
         test_pred = clf.predict(X[test])
         #print classification_report(y[test], test_pred)
         # count the correct predictions in this fold
         precision = 0
         for i in range(0, len(y[test])):
             if (y[test][i] == test_pred[i]):
                 precision = precision + 1
         precision_average = precision_average + float(precision)/len(y[test])
     # average over the 10 folds
     precision_average = precision_average / 10
     #print (u"accuracy: " + str(precision_average))
     return precision_average
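
The same measurement can be sketched more compactly, with PCA refitted on the training part of every fold as in the reference example. This is an illustrative sketch under assumptions, not the code that produced the results in this post: `cv_accuracy` and its default values are hypothetical, it assumes Python 3 with scikit-learn, C is fixed instead of being searched with GridSearchCV, and StratifiedKFold is swapped in for KFold so that every fold holds the same number of images per person.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


def cv_accuracy(data, labels, kernel="rbf", gamma=0.01, C=1e3,
                n_components=16, seed=0):
    """Hypothetical helper: mean 10-fold accuracy of PCA followed by an SVM."""
    # inside a pipeline, PCA is fitted on the training folds only,
    # so the held-out fold never influences the projection
    model = make_pipeline(
        PCA(n_components=n_components, whiten=True),
        SVC(kernel=kernel, gamma=gamma, C=C, class_weight="balanced"),
    )
    # stratified folds keep the class proportions equal across folds
    folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    scores = cross_val_score(model, data, labels, cv=folds, scoring="accuracy")
    return scores.mean()
```

Here `data` would be the raw pixel matrix (`all_data_set` as an array) rather than the already reduced `X`, because the pipeline performs the reduction itself.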

3. Main program: comparing different parameter settings

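Following the gamma values used in the example, gamma can start from a very small value, 0.0001. Manual experiments showed that by gamma = 1 the rbf kernel already gives poor results, so the comparison plot samples gamma between 0.0001 and 1. Only 100 points are taken, because the program becomes very slow with more. x_label in the program holds the enumerated gamma values. To save time only the first 20 people are used; the final run took 366.672 seconds.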

 t0 = time()
 kernel_to_test = ['rbf', 'poly', 'sigmoid']
 #print SVM(kernel_to_test[0], 0.1)
 plt.figure(1)
 for kernel_name in kernel_to_test:
     x_label = np.linspace(0.0001, 1, 100)  # gamma values to try
     y_label = []                           # accuracy for each gamma
     for i in x_label:
         y_label.append(SVM(kernel_name, i))
     plt.plot(x_label, y_label, label=kernel_name)
 print("done in %0.3fs" % (time() - t0))
 plt.xlabel("Gamma")
 plt.ylabel("Accuracy")  # SVM() returns the fraction of correct predictions
 plt.title('Different Kernels Contrast')
 plt.legend()
 plt.show()

Figure 1. Recognition rate of different kernels under different parameter values
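
Because gamma spans four orders of magnitude here, a logarithmically spaced grid is a possible alternative to the linear one: it puts as many points between 0.0001 and 0.001 as between 0.1 and 1, and needs far fewer SVM runs. This is only a sketch of that alternative, not what was run for Figure 1; `sweep_gamma` and the choice of 9 points are hypothetical, and it assumes Python 3 with NumPy.

```python
import numpy as np


def sweep_gamma(score_fn, kernels, gammas):
    """Hypothetical helper: evaluate score_fn(kernel, gamma) on a grid and
    return {kernel: [score for each gamma]}."""
    return {k: [score_fn(k, g) for g in gammas] for k in kernels}


# same range as the linear grid (0.0001 to 1), evenly spaced in log10(gamma):
# 1e-4, about 3.16e-4, 1e-3, ..., about 0.316, 1
gammas = np.logspace(-4, 0, num=9)
```

Passing the `SVM` function from section 2 as `score_fn` gives one accuracy list per kernel, and `plt.semilogx(gammas, scores, label=kernel_name)` draws each curve with a logarithmic x axis.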

References:

[1] Philipp Wagner. Face Recognition with Python. July 18, 2012.

[2] Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. A Practical Guide to Support Vector Classification. National Taiwan University, Taipei 106, Taiwan.

[3] http://www.cnblogs.com/cvlabs/archive/2010/04/13/1711470.html

[4] http://scikit-learn.org/stable/auto_examples/applications/face_recognition.html#sphx-glr-auto-examples-applications-face-recognition-py

[5] http://blog.csdn.net/ikerpeng/article/details/20370041

Appendix: complete code

 # -*- coding: utf-8 -*-
 """
 Created on Fri Dec 02 15:51:14 2016
 @author: JiaY
 """
 from time import time
 from PIL import Image
 import glob
 import numpy as np
 import sys
 from sklearn.model_selection import KFold
 from sklearn.model_selection import train_test_split
 from sklearn.decomposition import PCA
 from sklearn.model_selection import GridSearchCV
 from sklearn.svm import SVC
 from sklearn.metrics import classification_report
 import matplotlib.pyplot as plt

 # Set the interpreter's default encoding to utf8 (Python 2 only); for some
 # reason the coding comment at the top of the file had no effect.
 # Even with this setting the script still fails under IPython and has to be
 # run with the stock Python interpreter.
 reload(sys)
 sys.setdefaultencoding("utf8")
 print sys.getdefaultencoding()

 PICTURE_PATH = u"F:\\课程相关资料\\研究生——数据挖掘\\16年作业\\att_faces"
 all_data_set = []    # full raw data set: an n*m matrix, n samples, m attributes
 all_data_label = []  # class label of every sample

 def get_picture():
     label = 1
     # read every image and flatten it to one dimension
     while (label <= 20):  # only the first 20 of the 40 people are used
         for name in glob.glob(PICTURE_PATH + "\\s" + str(label) + "\\*.pgm"):
             img = Image.open(name)
             #img.getdata()
             #np.array(img).reshape(1, 92*112)
             all_data_set.append( list(img.getdata()) )
             all_data_label.append(label)
         label += 1

 get_picture()

 # number of features kept after reduction; if it is too large, e.g. 100,
 # the results become extremely inaccurate. Why??
 n_components = 16
 pca = PCA(n_components = n_components, svd_solver='auto',
           whiten=True).fit(all_data_set)
 # the full data set after PCA reduction
 all_data_pca = pca.transform(all_data_set)
 # X is the reduced data, y the matching class labels
 X = np.array(all_data_pca)
 y = np.array(all_data_label)

 # takes a kernel name and a gamma value, returns the 10-fold
 # cross-validated accuracy of the SVM
 def SVM(kernel_name, param):
     # 10-fold cross-validation to compute the average accuracy
     # n_splits folds, samples shuffled randomly
     kf = KFold(n_splits=10, shuffle = True)
     precision_average = 0.0
     param_grid = {'C': [1e3, 5e3, 1e4, 5e4, 1e5]}  # exhaustively search for the best C
     clf = GridSearchCV(SVC(kernel=kernel_name, class_weight='balanced', gamma = param),
                        param_grid)
     for train, test in kf.split(X):
         clf = clf.fit(X[train], y[train])
         #print(clf.best_estimator_)
         test_pred = clf.predict(X[test])
         #print classification_report(y[test], test_pred)
         # count the correct predictions in this fold
         precision = 0
         for i in range(0, len(y[test])):
             if (y[test][i] == test_pred[i]):
                 precision = precision + 1
         precision_average = precision_average + float(precision)/len(y[test])
     # average over the 10 folds
     precision_average = precision_average / 10
     #print (u"accuracy: " + str(precision_average))
     return precision_average

 t0 = time()
 kernel_to_test = ['rbf', 'poly', 'sigmoid']
 #print SVM(kernel_to_test[0], 0.1)
 plt.figure(1)
 for kernel_name in kernel_to_test:
     x_label = np.linspace(0.0001, 1, 100)  # gamma values to try
     y_label = []                           # accuracy for each gamma
     for i in x_label:
         y_label.append(SVM(kernel_name, i))
     plt.plot(x_label, y_label, label=kernel_name)
 print("done in %0.3fs" % (time() - t0))
 plt.xlabel("Gamma")
 plt.ylabel("Accuracy")  # SVM() returns the fraction of correct predictions
 plt.title('Different Kernels Contrast')
 plt.legend()
 plt.show()

 # The two string blocks below are earlier experiments, kept disabled.
 """
 clf = GridSearchCV(SVC(kernel='rbf', class_weight='balanced'), param_grid)
 X_train, X_test, y_train, y_test = train_test_split(
     X, y, test_size=0.1, random_state=42)
 clf = clf.fit(X_train, y_train)
 test_pred = clf.predict(X_test)
 print classification_report(y_test, test_pred)
 # 10-fold cross-validation to compute the average accuracy
 precision_average = 0.0
 for train, test in kf.split(X):
     clf = clf.fit(X[train], y[train])
     #print(clf.best_estimator_)
     test_pred = clf.predict(X[test])
     #print classification_report(y[test], test_pred)
     # count the correct predictions in this fold
     precision = 0
     for i in range(0, len(y[test])):
         if (y[test][i] == test_pred[i]):
             precision = precision + 1
     precision_average = precision_average + float(precision)/len(y[test])
 precision_average = precision_average / 10
 print ("accuracy: " + str(precision_average))
 print("done in %0.3fs" % (time() - t0))
 """
 """
 print("Fitting the classifier to the training set")
 t0 = time()
 param_grid = {'C': [1e3, 5e3, 1e4, 5e4, 1e5],
               'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1], }
 clf = GridSearchCV(SVC(kernel='rbf', class_weight='balanced'), param_grid)
 clf = clf.fit(all_data_pca, all_data_label)
 print("done in %0.3fs" % (time() - t0))
 print("Best estimator found by grid search:")
 print(clf.best_estimator_)
 all_data_set_pred = clf.predict(all_data_pca)
 #target_names = range(1, 11)
 print(classification_report(all_data_set_pred, all_data_label))
 """
