MNIST is often called the "Hello World" of deep learning. Collected and organized by Yann LeCun and colleagues, it is a dataset of handwritten digits with 60,000 training examples and 10,000 test examples, which makes it an ideal dataset for beginners. Its website also records the results that different algorithms have achieved on it over the years, so it doubles as a short history of machine learning: from early linear classifiers and logistic regression, to K-nearest neighbors, to two-layer neural networks, to multi-layer networks, and most recently to convolutional neural networks. As the models improved, the error rate kept falling; it now sits at around 0.2%, roughly on par with human recognition ability.

In this article we'll use a slightly more interesting dataset, notMNIST. Unlike MNIST, it is a collection of letters rendered in a wide variety of typefaces, covering the ten classes A through J. Some of the images for the letter A look like this:

In this example we'll use TensorFlow, scikit-learn and a few other libraries to run the dataset through a series of processing steps, and finally train a logistic regression model and use it to make predictions.

1. Setting up the environment

Install Python 2.7 and pip

Python 2.7 is available from the official website: https://www.python.org/getit/

pip is Python's package installer; it pulls packages from PyPI (the Python Package Index) and makes finding and installing third-party libraries very convenient. Installation instructions: https://pip.pypa.io/en/stable/installing/

Install TensorFlow

$ pip install tensorflow
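
To confirm the install worked, you can run a quick sanity check from the Python REPL (a minimal sketch; the exact version string depends on which release pip picked up):

import tensorflow as tf

# Prints the installed TensorFlow version, e.g. '1.4.0'
print(tf.__version__)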

2. Downloading the data

Import the third-party modules we'll need:

# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import matplotlib.pyplot as plt
import numpy as np
import os
import sys
import tarfile
from IPython.display import display, Image
from scipy import ndimage
from sklearn.linear_model import LogisticRegression
from six.moves.urllib.request import urlretrieve
from six.moves import cPickle as pickle

# Config the matplotlib backend as plotting inline in IPython
%matplotlib inline

First, download the datasets to your machine. All the images are 28×28 pixels, labelled "A" through "J" (10 classes). The full dataset contains roughly 500,000 training images and 19,000 test images, which is still small enough to train on in a reasonable amount of time on most machines. The training data is in notMNIST_large.tar.gz and the test data is in notMNIST_small.tar.gz.

url = 'http://commondatastorage.googleapis.com/books1000/'
last_percent_reported = None
data_root = '.'  # Change me to store data elsewhere

def download_progress_hook(count, blockSize, totalSize):
    """A hook to report the progress of a download. This is mostly intended for users with
    slow internet connections. Reports every 5% change in download progress.
    """
    global last_percent_reported
    percent = int(count * blockSize * 100 / totalSize)

    if last_percent_reported != percent:
        if percent % 5 == 0:
            sys.stdout.write("%s%%" % percent)
            sys.stdout.flush()
        else:
            sys.stdout.write(".")
            sys.stdout.flush()
        last_percent_reported = percent

def maybe_download(filename, expected_bytes, force=False):
    """Download a file if not present, and make sure it's the right size."""
    dest_filename = os.path.join(data_root, filename)
    if force or not os.path.exists(dest_filename):
        print('Attempting to download:', filename)
        filename, _ = urlretrieve(url + filename, dest_filename, reporthook=download_progress_hook)
        print('\nDownload Complete!')
    statinfo = os.stat(dest_filename)
    if statinfo.st_size == expected_bytes:
        print('Found and verified', dest_filename)
    else:
        raise Exception(
            'Failed to verify ' + dest_filename + '. Can you get to it with a browser?')
    return dest_filename

train_filename = maybe_download('notMNIST_large.tar.gz', 247336696)
test_filename = maybe_download('notMNIST_small.tar.gz', 8458043)

Extract the datasets. This produces a set of directories labelled A through J.

num_classes = 10
np.random.seed(133)

def maybe_extract(filename, force=False):
    root = os.path.splitext(os.path.splitext(filename)[0])[0]  # remove .tar.gz
    if os.path.isdir(root) and not force:
        # You may override by setting force=True.
        print('%s already present - Skipping extraction of %s.' % (root, filename))
    else:
        print('Extracting data for %s. This may take a while. Please wait.' % root)
        tar = tarfile.open(filename)
        sys.stdout.flush()
        tar.extractall(data_root)
        tar.close()
    data_folders = [
        os.path.join(root, d) for d in sorted(os.listdir(root))
        if os.path.isdir(os.path.join(root, d))]
    if len(data_folders) != num_classes:
        raise Exception(
            'Expected %d folders, one per class. Found %d instead.' % (
                num_classes, len(data_folders)))
    print(data_folders)
    return data_folders

train_folders = maybe_extract(train_filename)
test_folders = maybe_extract(test_filename)

The output looks like this:

./notMNIST_large already present - Skipping extraction of ./notMNIST_large.tar.gz.
['./notMNIST_large/A', './notMNIST_large/B', './notMNIST_large/C', './notMNIST_large/D', './notMNIST_large/E', './notMNIST_large/F', './notMNIST_large/G', './notMNIST_large/H', './notMNIST_large/I', './notMNIST_large/J']
./notMNIST_small already present - Skipping extraction of ./notMNIST_small.tar.gz.
['./notMNIST_small/A', './notMNIST_small/B', './notMNIST_small/C', './notMNIST_small/D', './notMNIST_small/E', './notMNIST_small/F', './notMNIST_small/G', './notMNIST_small/H', './notMNIST_small/I', './notMNIST_small/J']

3. Loading the data

As a sanity check, display the first 20 images in the A directory:

fn = os.listdir("notMNIST_small/A/")
for file in fn[:20]:
    path = 'notMNIST_small/A/' + file
    display(Image(path))

Now we convert the image files into pixel arrays and normalize them to have approximately zero mean and uniform scale, which makes training better behaved. The whole dataset is loaded into a 3D array (image index, x, y); any image that can't be read is simply skipped.
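
To make the normalization concrete, here is a tiny sketch (my own illustration, using the same pixel_depth constant as the loading code below) of what happens to raw grayscale values:

import numpy as np

pixel_depth = 255.0  # number of levels per pixel, matching load_letter below

# Raw values 0, 128 and 255 map to roughly -0.5, 0.002 and 0.5, so every
# feature ends up centered near zero within a fixed [-0.5, 0.5] range.
raw = np.array([0.0, 128.0, 255.0])
print((raw - pixel_depth / 2) / pixel_depth)  # [-0.5  0.00196078  0.5]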

Since the full dataset may not fit in memory at once, we process each letter's directory separately and store the result in a corresponding pickle file.

image_size = 28  # Pixel width and height.
pixel_depth = 255.0  # Number of levels per pixel.

def load_letter(folder, min_num_images):
    """Load the data for a single letter label."""
    image_files = os.listdir(folder)
    dataset = np.ndarray(shape=(len(image_files), image_size, image_size),
                         dtype=np.float32)
    print(folder)
    num_images = 0
    for image in image_files:
        image_file = os.path.join(folder, image)
        try:
            image_data = (ndimage.imread(image_file).astype(float) -
                          pixel_depth / 2) / pixel_depth
            if image_data.shape != (image_size, image_size):
                raise Exception('Unexpected image shape: %s' % str(image_data.shape))
            dataset[num_images, :, :] = image_data
            num_images = num_images + 1
        except IOError as e:
            print('Could not read:', image_file, ':', e, '- it\'s ok, skipping.')

    dataset = dataset[0:num_images, :, :]
    if num_images < min_num_images:
        raise Exception('Many fewer images than expected: %d < %d' %
                        (num_images, min_num_images))

    print('Full dataset tensor:', dataset.shape)
    print('Mean:', np.mean(dataset))
    print('Standard deviation:', np.std(dataset))
    return dataset

def maybe_pickle(data_folders, min_num_images_per_class, force=False):
    dataset_names = []
    for folder in data_folders:
        set_filename = folder + '.pickle'
        dataset_names.append(set_filename)
        if os.path.exists(set_filename) and not force:
            # You may override by setting force=True.
            print('%s already present - Skipping pickling.' % set_filename)
        else:
            print('Pickling %s.' % set_filename)
            dataset = load_letter(folder, min_num_images_per_class)
            try:
                with open(set_filename, 'wb') as f:
                    pickle.dump(dataset, f, pickle.HIGHEST_PROTOCOL)
            except Exception as e:
                print('Unable to save data to', set_filename, ':', e)

    return dataset_names

train_datasets = maybe_pickle(train_folders, 45000)
test_datasets = maybe_pickle(test_folders, 1800)

The output looks like this:

notMNIST_large/A.pickle already present - Skipping pickling.
notMNIST_large/B.pickle already present - Skipping pickling.
notMNIST_large/C.pickle already present - Skipping pickling.
notMNIST_large/D.pickle already present - Skipping pickling.
notMNIST_large/E.pickle already present - Skipping pickling.
notMNIST_large/F.pickle already present - Skipping pickling.
notMNIST_large/G.pickle already present - Skipping pickling.
notMNIST_large/H.pickle already present - Skipping pickling.
notMNIST_large/I.pickle already present - Skipping pickling.
notMNIST_large/J.pickle already present - Skipping pickling.
notMNIST_small/A.pickle already present - Skipping pickling.
notMNIST_small/B.pickle already present - Skipping pickling.
notMNIST_small/C.pickle already present - Skipping pickling.
notMNIST_small/D.pickle already present - Skipping pickling.
notMNIST_small/E.pickle already present - Skipping pickling.
notMNIST_small/F.pickle already present - Skipping pickling.
notMNIST_small/G.pickle already present - Skipping pickling.
notMNIST_small/H.pickle already present - Skipping pickling.
notMNIST_small/I.pickle already present - Skipping pickling.
notMNIST_small/J.pickle already present - Skipping pickling.

To verify the data, pick a random image out of A.pickle and display it:
# index 0 should be all As, 1 = all Bs, etc.
pickle_file = train_datasets[0]

# "with" automatically closes the file after the nested block of code
with open(pickle_file, 'rb') as f:
    letter_set = pickle.load(f)  # unpickle

# pick a random image index
sample_idx = np.random.randint(len(letter_set))
# extract a 2D slice
sample_image = letter_set[sample_idx, :, :]
plt.figure()
plt.imshow(sample_image)  # display it

4. Preparing the training, validation and test data

Next we read the pickle files back and merge them into training (Training), validation (Validation) and test (Testing) sets:

def make_arrays(nb_rows, img_size):
    if nb_rows:
        dataset = np.ndarray((nb_rows, img_size, img_size), dtype=np.float32)
        labels = np.ndarray(nb_rows, dtype=np.int32)
    else:
        dataset, labels = None, None
    return dataset, labels

def merge_datasets(pickle_files, train_size, valid_size=0):
    num_classes = len(pickle_files)
    valid_dataset, valid_labels = make_arrays(valid_size, image_size)
    train_dataset, train_labels = make_arrays(train_size, image_size)
    vsize_per_class = valid_size // num_classes
    tsize_per_class = train_size // num_classes

    start_v, start_t = 0, 0
    end_v, end_t = vsize_per_class, tsize_per_class
    end_l = vsize_per_class + tsize_per_class
    for label, pickle_file in enumerate(pickle_files):
        try:
            with open(pickle_file, 'rb') as f:
                letter_set = pickle.load(f)
                # let's shuffle the letters to have random validation and training set
                np.random.shuffle(letter_set)
                if valid_dataset is not None:
                    valid_letter = letter_set[:vsize_per_class, :, :]
                    valid_dataset[start_v:end_v, :, :] = valid_letter
                    valid_labels[start_v:end_v] = label
                    start_v += vsize_per_class
                    end_v += vsize_per_class

                train_letter = letter_set[vsize_per_class:end_l, :, :]
                train_dataset[start_t:end_t, :, :] = train_letter
                train_labels[start_t:end_t] = label
                start_t += tsize_per_class
                end_t += tsize_per_class
        except Exception as e:
            print('Unable to process data from', pickle_file, ':', e)
            raise

    return valid_dataset, valid_labels, train_dataset, train_labels

train_size = 200000
valid_size = 10000
test_size = 10000

valid_dataset, valid_labels, train_dataset, train_labels = merge_datasets(
    train_datasets, train_size, valid_size)
_, _, test_dataset, test_labels = merge_datasets(test_datasets, test_size)

print('Training:', train_dataset.shape, train_labels.shape)
print('Validation:', valid_dataset.shape, valid_labels.shape)
print('Testing:', test_dataset.shape, test_labels.shape)

The output looks like this:

Training: (200000, 28, 28) (200000,)
Validation: (10000, 28, 28) (10000,)
Testing: (10000, 28, 28) (10000,)

Then shuffle the data. A single random permutation is applied to both the images and the labels, so they stay aligned:

def randomize(dataset, labels):
    permutation = np.random.permutation(labels.shape[0])
    shuffled_dataset = dataset[permutation, :, :]
    shuffled_labels = labels[permutation]
    return shuffled_dataset, shuffled_labels

train_dataset, train_labels = randomize(train_dataset, train_labels)
test_dataset, test_labels = randomize(test_dataset, test_labels)
valid_dataset, valid_labels = randomize(valid_dataset, valid_labels)

Save everything to a single notMNIST.pickle file:

pickle_file = 'notMNIST.pickle'

try:
    f = open(pickle_file, 'wb')
    save = {
        'train_dataset': train_dataset,
        'train_labels': train_labels,
        'valid_dataset': valid_dataset,
        'valid_labels': valid_labels,
        'test_dataset': test_dataset,
        'test_labels': test_labels,
    }
    pickle.dump(save, f, pickle.HIGHEST_PROTOCOL)
    f.close()
except Exception as e:
    print('Unable to save data to', pickle_file, ':', e)
    raise
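
Later posts in this series read this file back, so it's worth knowing the round trip works (a minimal sketch; the dictionary keys are the same ones used in the save dict above):

from six.moves import cPickle as pickle

# Reload the combined pickle and pull the training split back out
with open('notMNIST.pickle', 'rb') as f:
    save = pickle.load(f)

train_dataset = save['train_dataset']
train_labels = save['train_labels']
print('Reloaded training set:', train_dataset.shape, train_labels.shape)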

Check for duplicate images across the datasets. notMNIST contains many near-identical renderings, so the splits overlap; the code below counts how many images in one set also appear in another:

import time

def check_overlaps(images1, images2):
    images1.flags.writeable = False
    images2.flags.writeable = False
    start = time.clock()
    hash1 = set([hash(image1.data) for image1 in images1])
    hash2 = set([hash(image2.data) for image2 in images2])
    all_overlaps = set.intersection(hash1, hash2)
    return all_overlaps, time.clock() - start

r, execTime = check_overlaps(train_dataset, test_dataset)
print('Number of overlaps between training and test sets: {}. Execution time: {}.'.format(len(r), execTime))

r, execTime = check_overlaps(train_dataset, valid_dataset)
print('Number of overlaps between training and validation sets: {}. Execution time: {}.'.format(len(r), execTime))

r, execTime = check_overlaps(valid_dataset, test_dataset)
print('Number of overlaps between validation and test sets: {}. Execution time: {}.'.format(len(r), execTime))

The output looks like this:

Number of overlaps between training and test sets: 1153. Execution time: 0.951144.
Number of overlaps between training and validation sets: 952. Execution time: 1.014579.
Number of overlaps between validation and test sets: 55. Execution time: 0.088879.
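
The code above only counts the overlaps. If you also want to drop them, here is a minimal sketch of my own (sanitize is not part of the original notebook) that removes any image from one set whose hash appears in a reference set:

def sanitize(dataset, labels, reference_dataset):
    # Remove images (and their labels) whose byte-level hash also appears
    # in reference_dataset, using the same hashing trick as check_overlaps
    # above (Python 2: hashing a buffer requires writeable=False).
    dataset.flags.writeable = False
    reference_dataset.flags.writeable = False
    reference_hashes = set(hash(image.data) for image in reference_dataset)
    keep = np.array([hash(image.data) not in reference_hashes
                     for image in dataset])
    return dataset[keep], labels[keep]

test_dataset_clean, test_labels_clean = sanitize(test_dataset, test_labels, train_dataset)
print('Sanitized test set:', test_dataset_clean.shape)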

5. Training the model

We'll train a logistic regression model and see what accuracy it can reach.

# Prepare training data
samples, width, height = train_dataset.shape
X_train = np.reshape(train_dataset, (samples, width * height))
y_train = train_labels

# Prepare testing data
samples, width, height = test_dataset.shape
X_test = np.reshape(test_dataset, (samples, width * height))
y_test = test_labels

# Import
from sklearn.linear_model import LogisticRegression

# Instantiate
lg = LogisticRegression(multi_class='multinomial', solver='lbfgs', random_state=42,
                        verbose=1, max_iter=1000, n_jobs=-1)

# Fit
lg.fit(X_train, y_train)

# Predict
y_pred = lg.predict(X_test)

# Score
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred)

The output looks like this:

[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  4.6min finished
0.90059999999999996

Training took about five minutes, and the resulting logistic regression model reaches roughly 90% accuracy on the test set. Not a bad first attempt!
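
For completeness, you can score the same model on the validation set from step 4 as well (a quick sketch; the number will differ slightly from the test accuracy):

# Flatten the validation images the same way as the training data
samples, width, height = valid_dataset.shape
X_valid = np.reshape(valid_dataset, (samples, width * height))
print('Validation accuracy:', metrics.accuracy_score(valid_labels, lg.predict(X_valid)))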
