[Placeholder, unfinished] scikit-learn General Examples, Part 11: Feature Union with Heterogeneous Data Sources

Datasets often contain components that require different feature extraction and processing pipelines. This scenario might occur when:

1. Your dataset consists of heterogeneous data types (e.g. raster images and text captions).
2. Your dataset is stored in a Pandas DataFrame and different columns require different processing pipelines.

This example demonstrates how to use sklearn.pipeline.FeatureUnion on a dataset containing different types of features. We use the 20-newsgroups dataset and compute standard bag-of-words features for the subject line and the body in separate pipelines, as well as ad hoc features on the body. We combine them (with weights) using a FeatureUnion and finally train a classifier on the combined set of features.

The choice of features is not particularly helpful, but serves to illustrate the technique.
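Before the full example, the combining step can be sketched on its own. This is a minimal illustration under stated assumptions: the toy array and the two helper functions `first_column` and `doubled` are made up here and are not part of the original example. FeatureUnion applies each transformer to the same input, scales each output block by its weight, and stacks the blocks side by side.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

# Toy data: 4 samples, 2 columns (hypothetical values, for illustration only).
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])


def first_column(X):
    """Hypothetical transformer: keep only the first column (as a 2D block)."""
    return X[:, [0]]


def doubled(X):
    """Hypothetical transformer: double every value."""
    return 2 * X


union = FeatureUnion(
    transformer_list=[
        ("first", FunctionTransformer(first_column)),
        ("doubled", FunctionTransformer(doubled)),
    ],
    # Each block is multiplied by its weight before concatenation.
    transformer_weights={"first": 0.5, "doubled": 1.0},
)

combined = union.fit_transform(X)
print(combined.shape)  # one column from "first" plus two from "doubled"
print(combined)
```

The result has three columns: the weighted first column followed by the two doubled columns. The full example that follows uses the same mechanism, with text pipelines in place of these toy functions.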

# Author: Matt Terry <matt.terry@gmail.com>
#
# License: BSD 3 clause
from __future__ import print_function

import numpy as np

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import fetch_20newsgroups
# Newer scikit-learn releases moved these helpers to a private module;
# fall back to the old public path on older installations.
try:
    from sklearn.datasets._twenty_newsgroups import strip_newsgroup_footer
    from sklearn.datasets._twenty_newsgroups import strip_newsgroup_quoting
except ImportError:
    from sklearn.datasets.twenty_newsgroups import strip_newsgroup_footer
    from sklearn.datasets.twenty_newsgroups import strip_newsgroup_quoting
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

class ItemSelector(BaseEstimator, TransformerMixin):
    """For data grouped by feature, select subset of data at a provided key.

    The data is expected to be stored in a 2D data structure, where the first
    index is over features and the second is over samples.  i.e.

    >> len(data[key]) == n_samples

    Please note that this is the opposite convention to scikit-learn feature
    matrices (where the first index corresponds to sample).

    ItemSelector only requires that the collection implement getitem
    (data[key]).  Examples include: a dict of lists, 2D numpy array, Pandas
    DataFrame, numpy record array, etc.

    >> data = {'a': [1, 5, 2, 5, 2, 8],
               'b': [9, 4, 1, 4, 1, 3]}
    >> ds = ItemSelector(key='a')
    >> data['a'] == ds.transform(data)

    ItemSelector is not designed to handle data grouped by sample.  (e.g. a
    list of dicts).  If your data is structured this way, consider a
    transformer along the lines of `sklearn.feature_extraction.DictVectorizer`.

    Parameters
    ----------
    key : hashable, required
        The key corresponding to the desired value in a mappable.
    """
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]

class TextStats(BaseEstimator, TransformerMixin):
    """Extract features from each document for DictVectorizer"""

    def fit(self, x, y=None):
        return self

    def transform(self, posts):
        return [{'length': len(text),
                 'num_sentences': text.count('.')}
                for text in posts]

class SubjectBodyExtractor(BaseEstimator, TransformerMixin):
    """Extract the subject & body from a usenet post in a single pass.

    Takes a sequence of strings and produces a dict of sequences.  Keys are
    `subject` and `body`.
    """
    def fit(self, x, y=None):
        return self

    def transform(self, posts):
        features = np.recarray(shape=(len(posts),),
                               dtype=[('subject', object), ('body', object)])
        for i, text in enumerate(posts):
            headers, _, bod = text.partition('\n\n')
            bod = strip_newsgroup_footer(bod)
            bod = strip_newsgroup_quoting(bod)
            features['body'][i] = bod

            prefix = 'Subject:'
            sub = ''
            for line in headers.split('\n'):
                if line.startswith(prefix):
                    sub = line[len(prefix):]
                    break
            features['subject'][i] = sub

        return features

pipeline = Pipeline([
    # Extract the subject & body
    ('subjectbody', SubjectBodyExtractor()),

    # Use FeatureUnion to combine the features from subject and body
    ('union', FeatureUnion(
        transformer_list=[

            # Pipeline for pulling features from the post's subject line
            ('subject', Pipeline([
                ('selector', ItemSelector(key='subject')),
                ('tfidf', TfidfVectorizer(min_df=50)),
            ])),

            # Pipeline for standard bag-of-words model for body
            ('body_bow', Pipeline([
                ('selector', ItemSelector(key='body')),
                ('tfidf', TfidfVectorizer()),
                ('best', TruncatedSVD(n_components=50)),
            ])),

            # Pipeline for pulling ad hoc features from post's body
            ('body_stats', Pipeline([
                ('selector', ItemSelector(key='body')),
                ('stats', TextStats()),  # returns a list of dicts
                ('vect', DictVectorizer()),  # list of dicts -> feature matrix
            ])),

        ],

        # weight components in FeatureUnion
        transformer_weights={
            'subject': 0.8,
            'body_bow': 0.5,
            'body_stats': 1.0,
        },
    )),

    # Use a SVC classifier on the combined features
    ('svc', SVC(kernel='linear')),
])

# limit the list of categories to make running this example faster.
categories = ['alt.atheism', 'talk.religion.misc']
train = fetch_20newsgroups(random_state=1,
                           subset='train',
                           categories=categories,
                           )
test = fetch_20newsgroups(random_state=1,
                          subset='test',
                          categories=categories,
                          )

pipeline.fit(train.data, train.target)
y = pipeline.predict(test.data)
# classification_report expects the true labels first, then the predictions
print(classification_report(test.target, y))
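To see what the header/body split inside SubjectBodyExtractor does to a single post, here is a standalone sketch that needs only the standard library. The function name `split_subject_body` and the sample post are hypothetical and exist only for this illustration; the footer and quoting clean-up done by `strip_newsgroup_footer` and `strip_newsgroup_quoting` is left out.

```python
def split_subject_body(text):
    """Hypothetical helper mirroring the parsing loop in SubjectBodyExtractor."""
    # Headers and body are separated by the first blank line.
    headers, _, body = text.partition('\n\n')

    prefix = 'Subject:'
    subject = ''
    for line in headers.split('\n'):
        if line.startswith(prefix):
            # Everything after the prefix is kept, including a leading space.
            subject = line[len(prefix):]
            break
    return subject, body


# A made-up post, for illustration only.
post = (
    "From: someone@example.com\n"
    "Subject: Re: feature unions\n"
    "Organization: Example Org\n"
    "\n"
    "First line of the body.\n"
    "\n"
    "Second paragraph."
)

subject, body = split_subject_body(post)
print(repr(subject))
print(repr(body))
```

Because `str.partition` splits only at the first blank line, later blank lines stay inside the body. A post without a `Subject:` header yields an empty subject string, which is the same fallback the extractor uses.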
