A summary of the Faster R-CNN paper and its PyTorch implementation.

Code structure of the simple-faster-rcnn-pytorch repository:

  • data

    • __init__.py
    • dataset.py
    • util.py
    • voc_dataset.py  
  • misc
    • convert_caffe_pretrain.py
    • train_fast.py  
  • model
    • utils

      • nms

        • __init__.py
        • _nms_gpu_post.py
        • build.py
        • non_maximum_suppression.py  
      • __init__.py
      • bbox_tools.py
      • creator_tool.py
      • roi_cupy.py  
    • __init__.py
    • faster_rcnn.py
    • faster_rcnn_vgg16.py
    • region_proposal_network.py
    • roi_module.py  
  • utils
    • __init__.py
    • array_tool.py
    • config.py
    • eval_tool.py
    • vis_tool.py
  • demo.ipynb
  • train.py
  • trainer.py

The code contains four packages: data, misc, model, and utils. The core lives in model, which covers NMS (non-maximum suppression), the RPN implementation, the model definitions, and so on. train.py and trainer.py are the training scripts.

This post covers the first part of the code: the data package and the utils package.

I. The data package

First, download the VOC2007 dataset:

wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtest_06-Nov-2007.tar
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCdevkit_08-Jun-2007.tar

Then extract the three tarballs into a single directory (named VOCdevkit):

tar xvf VOCtrainval_06-Nov-2007.tar
tar xvf VOCtest_06-Nov-2007.tar
tar xvf VOCdevkit_08-Jun-2007.tar

1.  util.py

import numpy as np
from PIL import Image
import random


def read_image(path, dtype=np.float32, color=True):
    """Read an image from a file.

    This function reads an image from given file. The image is CHW format and
    the range of its value is :math:`[0, 255]`. If :obj:`color = True`, the
    order of the channels is RGB.

    Args:
        path (str): A path of image file.
        dtype: The type of array. The default value is :obj:`~numpy.float32`.
        color (bool): This option determines the number of channels.
            If :obj:`True`, the number of channels is three. In this case,
            the order of the channels is RGB. This is the default behaviour.
            If :obj:`False`, this function returns a grayscale image.

    Returns:
        ~numpy.ndarray: An image.
    """
    f = Image.open(path)
    try:
        if color:
            img = f.convert('RGB')
        else:
            img = f.convert('P')
        img = np.asarray(img, dtype=dtype)
    finally:
        if hasattr(f, 'close'):
            f.close()

    if img.ndim == 2:
        # reshape (H, W) -> (1, H, W)
        return img[np.newaxis]
    else:
        # transpose (H, W, C) -> (C, H, W)
        return img.transpose((2, 0, 1))


def resize_bbox(bbox, in_size, out_size):
    """Resize bounding boxes according to image resize.

    The bounding boxes are expected to be packed into a two dimensional
    tensor of shape :math:`(R, 4)`, where :math:`R` is the number of
    bounding boxes in the image. The second axis represents attributes of
    the bounding box. They are :math:`(y_{min}, x_{min}, y_{max}, x_{max})`,
    where the four attributes are coordinates of the top left and the
    bottom right vertices.

    Args:
        bbox (~numpy.ndarray): An array whose shape is :math:`(R, 4)`.
            :math:`R` is the number of bounding boxes.
        in_size (tuple): A tuple of length 2. The height and the width
            of the image before resized.
        out_size (tuple): A tuple of length 2. The height and the width
            of the image after resized.

    Returns:
        ~numpy.ndarray:
        Bounding boxes rescaled according to the given image shapes.
    """
    bbox = bbox.copy()
    y_scale = float(out_size[0]) / in_size[0]
    x_scale = float(out_size[1]) / in_size[1]
    bbox[:, 0] = y_scale * bbox[:, 0]
    bbox[:, 2] = y_scale * bbox[:, 2]
    bbox[:, 1] = x_scale * bbox[:, 1]
    bbox[:, 3] = x_scale * bbox[:, 3]
    return bbox


def flip_bbox(bbox, size, y_flip=False, x_flip=False):
    """Flip bounding boxes accordingly.

    The bounding boxes are expected to be packed into a two dimensional
    tensor of shape :math:`(R, 4)`, where :math:`R` is the number of
    bounding boxes in the image. The second axis represents attributes of
    the bounding box. They are :math:`(y_{min}, x_{min}, y_{max}, x_{max})`,
    where the four attributes are coordinates of the top left and the
    bottom right vertices.

    Args:
        bbox (~numpy.ndarray): An array whose shape is :math:`(R, 4)`.
            :math:`R` is the number of bounding boxes.
        size (tuple): A tuple of length 2. The height and the width
            of the image before resized.
        y_flip (bool): Flip bounding box according to a vertical flip of
            an image.
        x_flip (bool): Flip bounding box according to a horizontal flip of
            an image.

    Returns:
        ~numpy.ndarray:
        Bounding boxes flipped according to the given flips.
    """
    H, W = size
    bbox = bbox.copy()
    if y_flip:
        y_max = H - bbox[:, 0]
        y_min = H - bbox[:, 2]
        bbox[:, 0] = y_min
        bbox[:, 2] = y_max
    if x_flip:
        x_max = W - bbox[:, 1]
        x_min = W - bbox[:, 3]
        bbox[:, 1] = x_min
        bbox[:, 3] = x_max
    return bbox


def crop_bbox(
        bbox, y_slice=None, x_slice=None,
        allow_outside_center=True, return_param=False):
    """Translate bounding boxes to fit within the cropped area of an image.

    This method is mainly used together with image cropping.
    This method translates the coordinates of bounding boxes like
    :func:`data.util.translate_bbox`. In addition,
    this function truncates the bounding boxes to fit within the cropped area.
    If a bounding box does not overlap with the cropped area,
    this bounding box will be removed.

    The bounding boxes are expected to be packed into a two dimensional
    tensor of shape :math:`(R, 4)`, where :math:`R` is the number of
    bounding boxes in the image. The second axis represents attributes of
    the bounding box. They are :math:`(y_{min}, x_{min}, y_{max}, x_{max})`,
    where the four attributes are coordinates of the top left and the
    bottom right vertices.

    Args:
        bbox (~numpy.ndarray): Bounding boxes to be transformed. The shape is
            :math:`(R, 4)`. :math:`R` is the number of bounding boxes.
        y_slice (slice): The slice of y axis.
        x_slice (slice): The slice of x axis.
        allow_outside_center (bool): If this argument is :obj:`False`,
            bounding boxes whose centers are outside of the cropped area
            are removed. The default value is :obj:`True`.
        return_param (bool): If :obj:`True`, this function returns
            indices of kept bounding boxes.

    Returns:
        ~numpy.ndarray or (~numpy.ndarray, dict):

        If :obj:`return_param = False`, returns an array :obj:`bbox`.

        If :obj:`return_param = True`,
        returns a tuple whose elements are :obj:`bbox, param`.
        :obj:`param` is a dictionary of intermediate parameters whose
        contents are listed below with key, value-type and the description
        of the value.

        * **index** (*numpy.ndarray*): An array holding indices of used \
            bounding boxes.
    """
    t, b = _slice_to_bounds(y_slice)
    l, r = _slice_to_bounds(x_slice)
    crop_bb = np.array((t, l, b, r))

    if allow_outside_center:
        mask = np.ones(bbox.shape[0], dtype=bool)
    else:
        center = (bbox[:, :2] + bbox[:, 2:]) / 2
        mask = np.logical_and(crop_bb[:2] <= center, center < crop_bb[2:]) \
            .all(axis=1)

    bbox = bbox.copy()
    bbox[:, :2] = np.maximum(bbox[:, :2], crop_bb[:2])
    bbox[:, 2:] = np.minimum(bbox[:, 2:], crop_bb[2:])
    bbox[:, :2] -= crop_bb[:2]
    bbox[:, 2:] -= crop_bb[:2]

    mask = np.logical_and(mask, (bbox[:, :2] < bbox[:, 2:]).all(axis=1))
    bbox = bbox[mask]

    if return_param:
        return bbox, {'index': np.flatnonzero(mask)}
    else:
        return bbox


def _slice_to_bounds(slice_):
    if slice_ is None:
        return 0, np.inf

    if slice_.start is None:
        l = 0
    else:
        l = slice_.start

    if slice_.stop is None:
        u = np.inf
    else:
        u = slice_.stop

    return l, u


def translate_bbox(bbox, y_offset=0, x_offset=0):
    """Translate bounding boxes.

    This method is mainly used together with image transforms, such as padding
    and cropping, which translates the left top point of the image from
    coordinate :math:`(0, 0)` to coordinate
    :math:`(y, x) = (y_{offset}, x_{offset})`.

    The bounding boxes are expected to be packed into a two dimensional
    tensor of shape :math:`(R, 4)`, where :math:`R` is the number of
    bounding boxes in the image. The second axis represents attributes of
    the bounding box. They are :math:`(y_{min}, x_{min}, y_{max}, x_{max})`,
    where the four attributes are coordinates of the top left and the
    bottom right vertices.

    Args:
        bbox (~numpy.ndarray): Bounding boxes to be transformed. The shape is
            :math:`(R, 4)`. :math:`R` is the number of bounding boxes.
        y_offset (int or float): The offset along y axis.
        x_offset (int or float): The offset along x axis.

    Returns:
        ~numpy.ndarray:
        Bounding boxes translated according to the given offsets.
    """
    out_bbox = bbox.copy()
    out_bbox[:, :2] += (y_offset, x_offset)
    out_bbox[:, 2:] += (y_offset, x_offset)

    return out_bbox


def random_flip(img, y_random=False, x_random=False,
                return_param=False, copy=False):
    """Randomly flip an image in vertical or horizontal direction.

    Args:
        img (~numpy.ndarray): An array that gets flipped. This is in
            CHW format.
        y_random (bool): Randomly flip in vertical direction.
        x_random (bool): Randomly flip in horizontal direction.
        return_param (bool): Returns information of flip.
        copy (bool): If False, a view of :obj:`img` will be returned.

    Returns:
        ~numpy.ndarray or (~numpy.ndarray, dict):

        If :obj:`return_param = False`,
        returns an array :obj:`out_img` that is the result of flipping.

        If :obj:`return_param = True`,
        returns a tuple whose elements are :obj:`out_img, param`.
        :obj:`param` is a dictionary of intermediate parameters whose
        contents are listed below with key, value-type and the description
        of the value.

        * **y_flip** (*bool*): Whether the image was flipped in the\
            vertical direction or not.
        * **x_flip** (*bool*): Whether the image was flipped in the\
            horizontal direction or not.
    """
    y_flip, x_flip = False, False
    if y_random:
        y_flip = random.choice([True, False])
    if x_random:
        x_flip = random.choice([True, False])

    if y_flip:
        img = img[:, ::-1, :]
    if x_flip:
        img = img[:, :, ::-1]

    if copy:
        img = img.copy()

    if return_param:
        return img, {'y_flip': y_flip, 'x_flip': x_flip}
    else:
        return img

Utility functions:

read_image uses PIL to load an image as RGB or as a single-channel ('P' mode) image, then converts it to C×H×W or 1×H×W format respectively. Pixel values lie in [0, 255].

resize_bbox rescales bboxes of shape (R, 4) according to the input and output image height and width.

flip_bbox flips the input bboxes horizontally and/or vertically according to the given flip flags.

crop_bbox adjusts bboxes to fit within the cropped region of an image, dropping boxes that no longer overlap it.

translate_bbox shifts bboxes horizontally or vertically by the given offsets.

random_flip randomly flips an image (CHW format) horizontally or vertically (a short usage sketch follows the list):

  • img = img[:, ::-1, :]     vertical flip
  • img = img[:, :, ::-1]     horizontal flip
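
To see these helpers in action, here is a minimal sketch; the image path is illustrative and any local JPEG will do:

import numpy as np
from data.util import read_image, resize_bbox, random_flip, flip_bbox

# Illustrative path; replace with any local image.
img = read_image('VOCdevkit/VOC2007/JPEGImages/000001.jpg')  # (3, H, W), values in [0, 255]
_, H, W = img.shape

# One box in (ymin, xmin, ymax, xmax) order, as used throughout the repo.
bbox = np.array([[12., 8., 498., 352.]], dtype=np.float32)

# Rescale the box to match an image resized to (600, 800).
bbox_resized = resize_bbox(bbox, (H, W), (600, 800))

# Randomly flip the image horizontally and mirror the box the same way.
img_f, params = random_flip(img, x_random=True, return_param=True)
bbox_f = flip_bbox(bbox, (H, W), x_flip=params['x_flip'])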

2.  voc_dataset.py

import os
import xml.etree.ElementTree as ET import numpy as np from .util import read_image class VOCBboxDataset:
"""Bounding box dataset for PASCAL `VOC`_. .. _`VOC`: http://host.robots.ox.ac.uk/pascal/VOC/voc2012/ The index corresponds to each image. When queried by an index, if :obj:`return_difficult == False`,
this dataset returns a corresponding
:obj:`img, bbox, label`, a tuple of an image, bounding boxes and labels.
This is the default behaviour.
If :obj:`return_difficult == True`, this dataset returns corresponding
:obj:`img, bbox, label, difficult`. :obj:`difficult` is a boolean array
that indicates whether bounding boxes are labeled as difficult or not. The bounding boxes are packed into a two dimensional tensor of shape
:math:`(R, 4)`, where :math:`R` is the number of bounding boxes in
the image. The second axis represents attributes of the bounding box.
They are :math:`(y_{min}, x_{min}, y_{max}, x_{max})`, where the
four attributes are coordinates of the top left and the bottom right
vertices. The labels are packed into a one dimensional tensor of shape :math:`(R,)`.
:math:`R` is the number of bounding boxes in the image.
The class name of the label :math:`l` is :math:`l` th element of
:obj:`VOC_BBOX_LABEL_NAMES`. The array :obj:`difficult` is a one dimensional boolean array of shape
:math:`(R,)`. :math:`R` is the number of bounding boxes in the image.
If :obj:`use_difficult` is :obj:`False`, this array is
a boolean array with all :obj:`False`. The type of the image, the bounding boxes and the labels are as follows. * :obj:`img.dtype == numpy.float32`
* :obj:`bbox.dtype == numpy.float32`
* :obj:`label.dtype == numpy.int32`
* :obj:`difficult.dtype == numpy.bool` Args:
data_dir (string): Path to the root of the training data.
i.e. "/data/image/voc/VOCdevkit/VOC2007/"
split ({'train', 'val', 'trainval', 'test'}): Select a split of the
dataset. :obj:`test` split is only available for
2007 dataset.
year ({'2007', '2012'}): Use a dataset prepared for a challenge
held in :obj:`year`.
use_difficult (bool): If :obj:`True`, use images that are labeled as
difficult in the original annotation.
return_difficult (bool): If :obj:`True`, this dataset returns
a boolean array
that indicates whether bounding boxes are labeled as difficult
or not. The default value is :obj:`False`. """ def __init__(self, data_dir, split='trainval',
use_difficult=False, return_difficult=False,
): # if split not in ['train', 'trainval', 'val']:
# if not (split == 'test' and year == '2007'):
# warnings.warn(
# 'please pick split from \'train\', \'trainval\', \'val\''
# 'for 2012 dataset. For 2007 dataset, you can pick \'test\''
# ' in addition to the above mentioned splits.'
# )
id_list_file = os.path.join(
data_dir, 'ImageSets/Main/{0}.txt'.format(split)) self.ids = [id_.strip() for id_ in open(id_list_file)]
self.data_dir = data_dir
self.use_difficult = use_difficult
self.return_difficult = return_difficult
self.label_names = VOC_BBOX_LABEL_NAMES def __len__(self):
return len(self.ids) def get_example(self, i):
"""Returns the i-th example. Returns a color image and bounding boxes. The image is in CHW format.
The returned image is RGB. Args:
i (int): The index of the example. Returns:
tuple of an image and bounding boxes """
id_ = self.ids[i]
anno = ET.parse(
os.path.join(self.data_dir, 'Annotations', id_ + '.xml'))
bbox = list()
label = list()
difficult = list()
for obj in anno.findall('object'):
# when in not using difficult split, and the object is
# difficult, skipt it.
if not self.use_difficult and int(obj.find('difficult').text) == 1:
continue difficult.append(int(obj.find('difficult').text))
bndbox_anno = obj.find('bndbox')
# subtract 1 to make pixel indexes 0-based
bbox.append([
int(bndbox_anno.find(tag).text) - 1
for tag in ('ymin', 'xmin', 'ymax', 'xmax')])
name = obj.find('name').text.lower().strip()
label.append(VOC_BBOX_LABEL_NAMES.index(name))
bbox = np.stack(bbox).astype(np.float32)
label = np.stack(label).astype(np.int32)
# When `use_difficult==False`, all elements in `difficult` are False.
difficult = np.array(difficult, dtype=np.bool).astype(np.uint8) # PyTorch don't support np.bool # Load a image
img_file = os.path.join(self.data_dir, 'JPEGImages', id_ + '.jpg')
img = read_image(img_file, color=True) # if self.return_difficult:
# return img, bbox, label, difficult
return img, bbox, label, difficult __getitem__ = get_example VOC_BBOX_LABEL_NAMES = (
'aeroplane',
'bicycle',
'bird',
'boat',
'bottle',
'bus',
'car',
'cat',
'chair',
'cow',
'diningtable',
'dog',
'horse',
'motorbike',
'person',
'pottedplant',
'sheep',
'sofa',
'train',
'tvmonitor')

This implements the VOC2007 dataset class (9,963 images in total).

VOC2007 provides the {'train', 'val', 'trainval', 'test'} splits and 20 object classes (21 with background). The four splits contain 2,501, 2,510, 5,011, and 4,952 images respectively (trainval = train + val). VOC2012 has no test split.

Training uses the trainval split; testing uses the test split.

Each image's annotations are stored in an XML file:

<annotation>
  <folder>VOC2007</folder>
  <filename>000001.jpg</filename>
  <source>
    <database>The VOC2007 Database</database>
    <annotation>PASCAL VOC2007</annotation>
    <image>flickr</image>
    <flickrid>341012865</flickrid>
  </source>
  <owner>
    <flickrid>Fried Camels</flickrid>
    <name>Jinky the Fruit Bat</name>
  </owner>
  <size>
    <width>353</width>
    <height>500</height>
    <depth>3</depth>
  </size>
  <segmented>0</segmented>
  <object>
    <name>dog</name>
    <pose>Left</pose>
    <truncated>1</truncated>
    <difficult>0</difficult>
    <bndbox>
      <xmin>48</xmin>
      <ymin>240</ymin>
      <xmax>195</xmax>
      <ymax>371</ymax>
    </bndbox>
  </object>
  <object>
    <name>person</name>
    <pose>Left</pose>
    <truncated>1</truncated>
    <difficult>0</difficult>
    <bndbox>
      <xmin>8</xmin>
      <ymin>12</ymin>
      <xmax>352</xmax>
      <ymax>498</ymax>
    </bndbox>
  </object>
</annotation>

Each XML file records the image size and, for every object, the bbox coordinates, the bbox's label, and whether it is marked difficult.

The VOCBboxDataset class is a plain class deriving from object; instantiating it only requires the path to the VOC dataset.

Its one core method is get_example (aliased as __getitem__), which returns the information of the i-th image: image, bbox, label, and difficult. A minimal usage sketch:
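
from data.voc_dataset import VOCBboxDataset, VOC_BBOX_LABEL_NAMES

# Illustrative path; point this at your extracted VOCdevkit.
dataset = VOCBboxDataset('VOCdevkit/VOC2007/', split='trainval')
print(len(dataset))                 # 5011 images in trainval

img, bbox, label, difficult = dataset[0]
print(img.shape)                    # (3, H, W), float32, range [0, 255]
print(bbox.shape, label.dtype)      # (R, 4) float32, int32 labels
print([VOC_BBOX_LABEL_NAMES[l] for l in label])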

3.  dataset.py

import torch as t
from .voc_dataset import VOCBboxDataset
from skimage import transform as sktsf
from torchvision import transforms as tvtsf
from . import util
import numpy as np
from utils.config import opt


def inverse_normalize(img):
    if opt.caffe_pretrain:
        img = img + (np.array([122.7717, 115.9465, 102.9801]).reshape(3, 1, 1))
        return img[::-1, :, :]
    # approximate un-normalize for visualize
    return (img * 0.225 + 0.45).clip(min=0, max=1) * 255


def pytorch_normalze(img):
    """
    https://github.com/pytorch/vision/issues/223
    return appr -1~1 RGB
    """
    normalize = tvtsf.Normalize(mean=[0.485, 0.456, 0.406],
                                std=[0.229, 0.224, 0.225])
    img = normalize(t.from_numpy(img))
    return img.numpy()


def caffe_normalize(img):
    """
    return appr -125-125 BGR
    """
    img = img[[2, 1, 0], :, :]  # RGB -> BGR
    img = img * 255
    mean = np.array([122.7717, 115.9465, 102.9801]).reshape(3, 1, 1)
    img = (img - mean).astype(np.float32, copy=True)
    return img


def preprocess(img, min_size=600, max_size=1000):
    """Preprocess an image for feature extraction.

    The length of the shorter edge is scaled to :obj:`self.min_size`.
    After the scaling, if the length of the longer edge is longer than
    :obj:`self.max_size`, the image is scaled to fit the longer edge
    to :obj:`self.max_size`.

    After resizing the image, the image is subtracted by a mean image value
    :obj:`self.mean`.

    Args:
        img (~numpy.ndarray): An image. This is in CHW and RGB format.
            The range of its value is :math:`[0, 255]`.

    Returns:
        ~numpy.ndarray: A preprocessed image.
    """
    C, H, W = img.shape
    scale1 = min_size / min(H, W)
    scale2 = max_size / max(H, W)
    scale = min(scale1, scale2)
    img = img / 255.
    img = sktsf.resize(img, (C, H * scale, W * scale), mode='reflect')
    # both the longer and shorter edge should stay below
    # max_size and min_size respectively
    if opt.caffe_pretrain:
        normalize = caffe_normalize
    else:
        normalize = pytorch_normalze
    return normalize(img)


class Transform(object):

    def __init__(self, min_size=600, max_size=1000):
        self.min_size = min_size
        self.max_size = max_size

    def __call__(self, in_data):
        img, bbox, label = in_data
        _, H, W = img.shape
        img = preprocess(img, self.min_size, self.max_size)
        _, o_H, o_W = img.shape
        scale = o_H / H
        bbox = util.resize_bbox(bbox, (H, W), (o_H, o_W))

        # horizontally flip
        img, params = util.random_flip(
            img, x_random=True, return_param=True)
        bbox = util.flip_bbox(
            bbox, (o_H, o_W), x_flip=params['x_flip'])

        return img, bbox, label, scale


class Dataset:
    def __init__(self, opt):
        self.opt = opt
        self.db = VOCBboxDataset(opt.voc_data_dir)
        self.tsf = Transform(opt.min_size, opt.max_size)

    def __getitem__(self, idx):
        ori_img, bbox, label, difficult = self.db.get_example(idx)

        img, bbox, label, scale = self.tsf((ori_img, bbox, label))
        # TODO: check whose stride is negative to fix this instead copy all
        # some of the strides of a given numpy array are negative.
        return img.copy(), bbox.copy(), label.copy(), scale

    def __len__(self):
        return len(self.db)


class TestDataset:
    def __init__(self, opt, split='test', use_difficult=True):
        self.opt = opt
        self.db = VOCBboxDataset(opt.voc_data_dir, split=split, use_difficult=use_difficult)

    def __getitem__(self, idx):
        ori_img, bbox, label, difficult = self.db.get_example(idx)
        img = preprocess(ori_img)
        return img, ori_img.shape[1:], bbox, label, difficult

    def __len__(self):
        return len(self.db)

Data preparation:

inverse_normalize undoes the normalization for both the caffe and the torchvision variant: the pretrained VGG weights can come either from the caffe release or from torchvision, with the latter giving slightly worse results than the former.

pytorch_normalze standardizes an input image for the torchvision model: the [0, 255] RGB image is first scaled to [0, 1] (in preprocess) and then normalized with the ImageNet mean and std to roughly [-1, 1] RGB.

caffe_normalize standardizes an input image for the caffe model: the [0, 1] RGB image is reordered to BGR, scaled back to [0, 255], and the per-channel mean is subtracted, giving roughly [-125, 125] BGR.

preprocess performs the image preprocessing: the image from read_image is CHW in [0, 255]; it is first divided by 255 and then resized so that, following the paper, the longer edge does not exceed 1000 pixels and the shorter edge does not exceed 600 pixels. Finally pytorch_normalze or caffe_normalize is applied. A worked example of the scale selection follows.
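
Using the 500×353 image from the annotation above, the scale works out as:

min_size, max_size = 600, 1000
H, W = 500, 353

scale1 = min_size / min(H, W)   # 600 / 353 ≈ 1.700
scale2 = max_size / max(H, W)   # 1000 / 500 = 2.0
scale = min(scale1, scale2)     # ≈ 1.700

# Resized to about (850, 600): the shorter edge hits 600 first,
# and the longer edge (850) stays under 1000.
print(round(H * scale), round(W * scale))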

Transform implements the preprocessing pipeline through its __call__ method: it runs preprocess on the image, rescales the bbox by the same factor as the image, and then randomly flips the image and bbox horizontally together.

Dataset generates the training samples, i.e. the trainval split. Its __getitem__ method uses VOCBboxDataset to fetch one training image and applies the Transform class, returning the processed image, bbox, label, and scale.

TestDataset generates the test samples, i.e. the test split. Its __getitem__ method fetches one test image via VOCBboxDataset and, unlike training, only calls preprocess: the bbox is not resized, and the original (pre-processing) image size is returned instead.
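
How train.py consumes these classes, as a minimal sketch (opt comes from utils/config.py; batch_size is fixed at 1, matching the note at the end of this post):

from torch.utils.data import DataLoader
from data.dataset import Dataset, TestDataset
from utils.config import opt

train_set = Dataset(opt)
test_set = TestDataset(opt)

# batch_size must stay 1 here: images have different sizes after
# rescaling, so they cannot be stacked into one batch tensor.
train_loader = DataLoader(train_set, batch_size=1, shuffle=True,
                          num_workers=opt.num_workers)
test_loader = DataLoader(test_set, batch_size=1, shuffle=False,
                         num_workers=opt.test_num_workers)

for img, bbox, label, scale in train_loader:
    print(img.shape, bbox.shape, label.shape, scale)
    break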

II. The utils package

1.  array_tool.py

"""
tools to convert specified type
"""
import torch as t
import numpy as np def tonumpy(data):
if isinstance(data, np.ndarray):
return data
if isinstance(data, t._TensorBase):
return data.cpu().numpy()
if isinstance(data, t.autograd.Variable):
return tonumpy(data.data) def totensor(data, cuda=True):
if isinstance(data, np.ndarray):
tensor = t.from_numpy(data)
if isinstance(data, t._TensorBase):
tensor = data
if isinstance(data, t.autograd.Variable):
tensor = data.data
if cuda:
tensor = tensor.cuda()
return tensor def tovariable(data):
if isinstance(data, np.ndarray):
return tovariable(totensor(data))
if isinstance(data, t._TensorBase):
return t.autograd.Variable(data)
if isinstance(data, t.autograd.Variable):
return data
else:
raise ValueError("UnKnow data type: %s, input should be {np.ndarray,Tensor,Variable}" %type(data)) def scalar(data):
if isinstance(data, np.ndarray):
return data.reshape(1)[0]
if isinstance(data, t._TensorBase):
return data.view(1)[0]
if isinstance(data, t.autograd.Variable):
return data.data.view(1)[0]

Type-conversion helpers for converting among torch Tensor, numpy ndarray, and autograd Variable (t._TensorBase and Variable date from pre-0.4 PyTorch).
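
A round-trip sketch (CPU only; totensor defaults to cuda=True, so it is disabled here):

import numpy as np
from utils import array_tool as at

a = np.arange(6, dtype=np.float32).reshape(2, 3)

t_cpu = at.totensor(a, cuda=False)   # ndarray -> Tensor, stays on CPU
b = at.tonumpy(t_cpu)                # Tensor -> ndarray via .cpu().numpy()
s = at.scalar(np.array([3.14]))      # one-element array -> Python scalar

print(b.shape, s)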

2.  config.py

from pprint import pprint


# Default Configs for training
# NOTE that config items can be overwritten by passing arguments through the command line.
# e.g. --voc-data-dir='./data/'

class Config:
    # data
    voc_data_dir = '/home/cy/.chainer/dataset/pfnet/chainercv/voc/VOCdevkit/VOC2007/'
    min_size = 600  # image resize
    max_size = 1000  # image resize
    num_workers = 8
    test_num_workers = 8

    # sigma for l1_smooth_loss
    rpn_sigma = 3.
    roi_sigma = 1.

    # param for optimizer
    # 0.0005 in origin paper but 0.0001 in tf-faster-rcnn
    weight_decay = 0.0005
    lr_decay = 0.1  # 1e-3 -> 1e-4
    lr = 1e-3

    # visualization
    env = 'faster-rcnn'  # visdom env
    port = 8097
    plot_every = 40  # vis every N iter

    # preset
    data = 'voc'
    pretrained_model = 'vgg16'

    # training
    epoch = 14

    use_adam = False  # Use Adam optimizer
    use_chainer = False  # try match everything as chainer
    use_drop = False  # use dropout in RoIHead
    # debug
    debug_file = '/tmp/debugf'

    test_num = 10000
    # model
    load_path = None

    caffe_pretrain = False  # use caffe pretrained model instead of torchvision
    caffe_pretrain_path = 'checkpoints/vgg16-caffe.pth'

    def _parse(self, kwargs):
        state_dict = self._state_dict()
        for k, v in kwargs.items():
            if k not in state_dict:
                raise ValueError('UnKnown Option: "--%s"' % k)
            setattr(self, k, v)

        print('======user config========')
        pprint(self._state_dict())
        print('==========end============')

    def _state_dict(self):
        return {k: getattr(self, k) for k, _ in Config.__dict__.items()
                if not k.startswith('_')}


opt = Config()

The configuration file: dataset path, visdom environment, image sizes, pretrained-weight type, learning rate, and the other hyperparameters.
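
Options can also be overridden programmatically through _parse, which train.py uses to apply command-line arguments; a sketch (the voc_data_dir value is illustrative):

from utils.config import opt

# Override defaults; unknown keys raise ValueError.
opt._parse({'voc_data_dir': './VOCdevkit/VOC2007/',
            'env': 'faster-rcnn-demo',
            'caffe_pretrain': False})

print(opt.voc_data_dir, opt.lr, opt.epoch)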

3.  vis_tool.py

import time

import numpy as np
import matplotlib
import torch as t
import visdom

matplotlib.use('Agg')
from matplotlib import pyplot as plot

# from data.voc_dataset import VOC_BBOX_LABEL_NAMES

VOC_BBOX_LABEL_NAMES = (
    'fly',
    'bike',
    'bird',
    'boat',
    'pin',
    'bus',
    'c',
    'cat',
    'chair',
    'cow',
    'table',
    'dog',
    'horse',
    'moto',
    'p',
    'plant',
    'shep',
    'sofa',
    'train',
    'tv',
)


def vis_image(img, ax=None):
    """Visualize a color image.

    Args:
        img (~numpy.ndarray): An array of shape :math:`(3, height, width)`.
            This is in RGB format and the range of its value is
            :math:`[0, 255]`.
        ax (matplotlib.axes.Axis): The visualization is displayed on this
            axis. If this is :obj:`None` (default), a new axis is created.

    Returns:
        ~matploblib.axes.Axes:
        Returns the Axes object with the plot for further tweaking.
    """

    if ax is None:
        fig = plot.figure()
        ax = fig.add_subplot(1, 1, 1)
    # CHW -> HWC
    img = img.transpose((1, 2, 0))
    ax.imshow(img.astype(np.uint8))
    return ax


def vis_bbox(img, bbox, label=None, score=None, ax=None):
    """Visualize bounding boxes inside image.

    Args:
        img (~numpy.ndarray): An array of shape :math:`(3, height, width)`.
            This is in RGB format and the range of its value is
            :math:`[0, 255]`.
        bbox (~numpy.ndarray): An array of shape :math:`(R, 4)`, where
            :math:`R` is the number of bounding boxes in the image.
            Each element is organized
            by :math:`(y_{min}, x_{min}, y_{max}, x_{max})` in the second axis.
        label (~numpy.ndarray): An integer array of shape :math:`(R,)`.
            The values correspond to id for label names stored in
            :obj:`label_names`. This is optional.
        score (~numpy.ndarray): A float array of shape :math:`(R,)`.
            Each value indicates how confident the prediction is.
            This is optional.
        label_names (iterable of strings): Name of labels ordered according
            to label ids. If this is :obj:`None`, labels will be skipped.
        ax (matplotlib.axes.Axis): The visualization is displayed on this
            axis. If this is :obj:`None` (default), a new axis is created.

    Returns:
        ~matploblib.axes.Axes:
        Returns the Axes object with the plot for further tweaking.
    """

    label_names = list(VOC_BBOX_LABEL_NAMES) + ['bg']
    # add for index `-1`
    if label is not None and not len(bbox) == len(label):
        raise ValueError('The length of label must be same as that of bbox')
    if score is not None and not len(bbox) == len(score):
        raise ValueError('The length of score must be same as that of bbox')

    # Returns newly instantiated matplotlib.axes.Axes object if ax is None
    ax = vis_image(img, ax=ax)

    # If there is no bounding box to display, visualize the image and exit.
    if len(bbox) == 0:
        return ax

    for i, bb in enumerate(bbox):
        xy = (bb[1], bb[0])
        height = bb[2] - bb[0]
        width = bb[3] - bb[1]
        ax.add_patch(plot.Rectangle(
            xy, width, height, fill=False, edgecolor='red', linewidth=2))

        caption = list()

        if label is not None and label_names is not None:
            lb = label[i]
            if not (-1 <= lb < len(label_names)):  # modified here to add background
                raise ValueError('No corresponding name is given')
            caption.append(label_names[lb])
        if score is not None:
            sc = score[i]
            caption.append('{:.2f}'.format(sc))

        if len(caption) > 0:
            ax.text(bb[1], bb[0],
                    ': '.join(caption),
                    style='italic',
                    bbox={'facecolor': 'white', 'alpha': 0.5, 'pad': 0})
    return ax


def fig2data(fig):
    """
    brief Convert a Matplotlib figure to a 4D numpy array with RGBA
    channels and return it

    @param fig: a matplotlib figure
    @return a numpy 3D array of RGBA values
    """
    # draw the renderer
    fig.canvas.draw()

    # Get the RGBA buffer from the figure
    w, h = fig.canvas.get_width_height()
    buf = np.fromstring(fig.canvas.tostring_argb(), dtype=np.uint8)
    buf.shape = (w, h, 4)

    # canvas.tostring_argb gives a pixmap in ARGB mode. Roll the ALPHA channel to get RGBA mode
    buf = np.roll(buf, 3, axis=2)
    return buf.reshape(h, w, 4)


def fig4vis(fig):
    """
    convert figure to ndarray
    """
    ax = fig.get_figure()
    img_data = fig2data(ax).astype(np.int32)
    plot.close()
    # HWC -> CHW
    return img_data[:, :, :3].transpose((2, 0, 1)) / 255.


def visdom_bbox(*args, **kwargs):
    fig = vis_bbox(*args, **kwargs)
    data = fig4vis(fig)
    return data


class Visualizer(object):
    """
    wrapper for visdom
    you can still access native visdom functions via
    self.line, self.scatter, self._send, etc.,
    due to the implementation of `__getattr__`
    """

    def __init__(self, env='default', **kwargs):
        self.vis = visdom.Visdom(env=env, **kwargs)
        self._vis_kw = kwargs

        # e.g. ('loss', 23) means the 23rd value of loss
        self.index = {}
        self.log_text = ''

    def reinit(self, env='default', **kwargs):
        """
        change the config of visdom
        """
        self.vis = visdom.Visdom(env=env, **kwargs)
        return self

    def plot_many(self, d):
        """
        plot multiple values
        @params d: dict (name, value) i.e. ('loss', 0.11)
        """
        for k, v in d.items():
            if v is not None:
                self.plot(k, v)

    def img_many(self, d):
        for k, v in d.items():
            self.img(k, v)

    def plot(self, name, y, **kwargs):
        """
        self.plot('loss', 1.00)
        """
        x = self.index.get(name, 0)
        self.vis.line(Y=np.array([y]), X=np.array([x]),
                      win=name,
                      opts=dict(title=name),
                      update=None if x == 0 else 'append',
                      **kwargs
                      )
        self.index[name] = x + 1

    def img(self, name, img_, **kwargs):
        """
        self.img('input_img', t.Tensor(64, 64))
        self.img('input_imgs', t.Tensor(3, 64, 64))
        self.img('input_imgs', t.Tensor(100, 1, 64, 64))
        self.img('input_imgs', t.Tensor(100, 3, 64, 64), nrows=10)
        !!! don't ~~self.img('input_imgs', t.Tensor(100, 64, 64), nrows=10)~~ !!!
        """
        self.vis.images(t.Tensor(img_).cpu().numpy(),
                        win=name,
                        opts=dict(title=name),
                        **kwargs
                        )

    def log(self, info, win='log_text'):
        """
        self.log({'loss': 1, 'lr': 0.0001})
        """
        self.log_text += ('[{time}] {info} <br>'.format(
            time=time.strftime('%m%d_%H%M%S'),
            info=info))
        self.vis.text(self.log_text, win)

    def __getattr__(self, name):
        return getattr(self.vis, name)

    def state_dict(self):
        return {
            'index': self.index,
            'vis_kw': self._vis_kw,
            'log_text': self.log_text,
            'env': self.vis.env
        }

    def load_state_dict(self, d):
        # fixed: the original referenced self.d, which does not exist
        self.vis = visdom.Visdom(env=d.get('env', self.vis.env), **(d.get('vis_kw', {})))
        self.log_text = d.get('log_text', '')
        self.index = d.get('index', dict())
        return self

vis_image takes a (3, H, W) RGB image and displays it.

vis_bbox displays an image together with its bboxes, and each bbox's label and score.

visdom_bbox calls vis_bbox and then fig4vis/fig2data to return the rendered figure as an image array.

Visualizer wraps everything that should be shown in visdom.
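
A sketch of how trainer.py-style code drives it (assumes a visdom server is running on the default port, started with python -m visdom.server; the dummy image and box are illustrative):

import numpy as np
from utils.vis_tool import Visualizer, visdom_bbox

vis = Visualizer(env='faster-rcnn')

# Scalar curves: each call appends one point to the named window.
for step in range(5):
    vis.plot('total_loss', float(np.exp(-step)))

# Draw boxes on an image (CHW RGB, values in [0, 255]).
img = np.random.randint(0, 256, (3, 300, 400)).astype(np.float32)
bbox = np.array([[50., 60., 200., 300.]])
label = np.array([11])  # 'dog' in the shortened name list above
vis.img('gt_img', visdom_bbox(img, bbox, label=label))

vis.log({'loss': 0.37, 'lr': 1e-3})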

4.  eval_tool.py

Evaluates detection results:

calc_detection_voc_prec_rec computes per-class precision and recall.

calc_detection_voc_ap builds on it to compute per-class average precision (AP).

eval_detection_voc calls the two functions above to obtain AP and mAP.
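
The code of eval_tool.py is not reproduced here; as a sketch of the entry point (signature as in this repo, which mirrors ChainerCV's evaluation utilities; the arrays below are dummy data):

import numpy as np
from utils.eval_tool import eval_detection_voc

# One image, one predicted box vs. one ground-truth box; all arguments
# are lists over images, with boxes in (ymin, xmin, ymax, xmax) order.
pred_bboxes = [np.array([[50., 60., 200., 300.]])]
pred_labels = [np.array([11])]
pred_scores = [np.array([0.9])]
gt_bboxes = [np.array([[48., 58., 198., 305.]])]
gt_labels = [np.array([11])]
gt_difficults = [np.array([False])]

result = eval_detection_voc(
    pred_bboxes, pred_labels, pred_scores,
    gt_bboxes, gt_labels, gt_difficults,
    use_07_metric=True)  # VOC2007 11-point AP

print(result['ap'])   # per-class AP (NaN for classes absent from the data)
print(result['map'])  # mean AP over classes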

Notes:

1. bbox coordinates always appear in (R, 4) form. During bounding-box regression the coordinates are converted to center coordinates (x, y) plus height and width, because the regression targets are offsets and scales, which requires that parameterization (shown below). In all other places, bboxes are stored as top-left and bottom-right corners, i.e. (y_min, x_min, y_max, x_max).
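
Concretely, for a source (anchor) box with center (x_a, y_a), width w_a and height h_a, and a target box (x, y, w, h), the regression targets from the paper (computed by bbox2loc in model/utils/bbox_tools.py, covered in a later part) are:

t_y = (y - y_a) / h_a
t_x = (x - x_a) / w_a
t_h = log(h / h_a)
t_w = log(w / w_a)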

2. The implementation uses batch_size = 1, and most of the rest of the code assumes this as well, so one image is fed at a time, which keeps the processing simple.

Reference:

从编程实现角度学习Faster R-CNN(附极简实现) (Learning Faster R-CNN from an implementation perspective, with a minimal implementation)

Precision, Recall, and AP concepts
