[占位-未完成]scikit-learn一般实例之十一:异构数据源的特征联合
[占位-未完成]scikit-learn一般实例之十一:异构数据源的特征联合
Datasets can often contain components of that require different feature extraction and processing pipelines. This scenario might occur when:
- 1.Your dataset consists of heterogeneous data types (e.g. raster images and text captions)
- 2.Your dataset is stored in a Pandas DataFrame and different columns require different processing pipelines.
This example demonstrates how to use sklearn.feature_extraction.FeatureUnion on a dataset containing different types of features. We use the 20-newsgroups dataset and compute standard bag-of-words features for the subject line and body in separate pipelines as well as ad hoc features on the body. We combine them (with weights) using a FeatureUnion and finally train a classifier on the combined set of features.
The choice of features is not particularly helpful, but serves to illustrate the technique.
# Author: Matt Terry <matt.terry@gmail.com>
#
# License: BSD 3 clause
from __future__ import print_function
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import fetch_20newsgroups
from sklearn.datasets.twenty_newsgroups import strip_newsgroup_footer
from sklearn.datasets.twenty_newsgroups import strip_newsgroup_quoting
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
class ItemSelector(BaseEstimator, TransformerMixin):
"""For data grouped by feature, select subset of data at a provided key.
The data is expected to be stored in a 2D data structure, where the first
index is over features and the second is over samples. i.e.
>> len(data[key]) == n_samples
Please note that this is the opposite convention to scikit-learn feature
matrixes (where the first index corresponds to sample).
ItemSelector only requires that the collection implement getitem
(data[key]). Examples include: a dict of lists, 2D numpy array, Pandas
DataFrame, numpy record array, etc.
>> data = {'a': [1, 5, 2, 5, 2, 8],
'b': [9, 4, 1, 4, 1, 3]}
>> ds = ItemSelector(key='a')
>> data['a'] == ds.transform(data)
ItemSelector is not designed to handle data grouped by sample. (e.g. a
list of dicts). If your data is structured this way, consider a
transformer along the lines of `sklearn.feature_extraction.DictVectorizer`.
Parameters
----------
key : hashable, required
The key corresponding to the desired value in a mappable.
"""
def __init__(self, key):
self.key = key
def fit(self, x, y=None):
return self
def transform(self, data_dict):
return data_dict[self.key]
class TextStats(BaseEstimator, TransformerMixin):
"""Extract features from each document for DictVectorizer"""
def fit(self, x, y=None):
return self
def transform(self, posts):
return [{'length': len(text),
'num_sentences': text.count('.')}
for text in posts]
class SubjectBodyExtractor(BaseEstimator, TransformerMixin):
"""Extract the subject & body from a usenet post in a single pass.
Takes a sequence of strings and produces a dict of sequences. Keys are
`subject` and `body`.
"""
def fit(self, x, y=None):
return self
def transform(self, posts):
features = np.recarray(shape=(len(posts),),
dtype=[('subject', object), ('body', object)])
for i, text in enumerate(posts):
headers, _, bod = text.partition('\n\n')
bod = strip_newsgroup_footer(bod)
bod = strip_newsgroup_quoting(bod)
features['body'][i] = bod
prefix = 'Subject:'
sub = ''
for line in headers.split('\n'):
if line.startswith(prefix):
sub = line[len(prefix):]
break
features['subject'][i] = sub
return features
pipeline = Pipeline([
# Extract the subject & body
('subjectbody', SubjectBodyExtractor()),
# Use FeatureUnion to combine the features from subject and body
('union', FeatureUnion(
transformer_list=[
# Pipeline for pulling features from the post's subject line
('subject', Pipeline([
('selector', ItemSelector(key='subject')),
('tfidf', TfidfVectorizer(min_df=50)),
])),
# Pipeline for standard bag-of-words model for body
('body_bow', Pipeline([
('selector', ItemSelector(key='body')),
('tfidf', TfidfVectorizer()),
('best', TruncatedSVD(n_components=50)),
])),
# Pipeline for pulling ad hoc features from post's body
('body_stats', Pipeline([
('selector', ItemSelector(key='body')),
('stats', TextStats()), # returns a list of dicts
('vect', DictVectorizer()), # list of dicts -> feature matrix
])),
],
# weight components in FeatureUnion
transformer_weights={
'subject': 0.8,
'body_bow': 0.5,
'body_stats': 1.0,
},
)),
# Use a SVC classifier on the combined features
('svc', SVC(kernel='linear')),
])
# limit the list of categories to make running this example faster.
categories = ['alt.atheism', 'talk.religion.misc']
train = fetch_20newsgroups(random_state=1,
subset='train',
categories=categories,
)
test = fetch_20newsgroups(random_state=1,
subset='test',
categories=categories,
)
pipeline.fit(train.data, train.target)
y = pipeline.predict(test.data)
print(classification_report(y, test.target))
[占位-未完成]scikit-learn一般实例之十一:异构数据源的特征联合的更多相关文章
- [占位-未完成]scikit-learn一般实例之十:核岭回归和SVR的比较
[占位-未完成]scikit-learn一般实例之十:核岭回归和SVR的比较
- scikit learn 模块 调参 pipeline+girdsearch 数据举例:文档分类 (python代码)
scikit learn 模块 调参 pipeline+girdsearch 数据举例:文档分类数据集 fetch_20newsgroups #-*- coding: UTF-8 -*- import ...
- Scikit Learn: 在python中机器学习
转自:http://my.oschina.net/u/175377/blog/84420#OSC_h2_23 Scikit Learn: 在python中机器学习 Warning 警告:有些没能理解的 ...
- Thinkphp框架拓展包使用方式详细介绍--验证码实例(十一)
原文:Thinkphp框架拓展包使用方式详细介绍--验证码实例(十一) 拓展压缩包的使用方式详细介绍 1:将拓展包解压:ThinkPHP3.1.2_Extend.zip --> 将其下的 \ ...
- (原创)(三)机器学习笔记之Scikit Learn的线性回归模型初探
一.Scikit Learn中使用estimator三部曲 1. 构造estimator 2. 训练模型:fit 3. 利用模型进行预测:predict 二.模型评价 模型训练好后,度量模型拟合效果的 ...
- (原创)(四)机器学习笔记之Scikit Learn的Logistic回归初探
目录 5.3 使用LogisticRegressionCV进行正则化的 Logistic Regression 参数调优 一.Scikit Learn中有关logistics回归函数的介绍 1. 交叉 ...
- Scikit Learn
Scikit Learn Scikit-Learn简称sklearn,基于 Python 语言的,简单高效的数据挖掘和数据分析工具,建立在 NumPy,SciPy 和 matplotlib 上.
- [占位-未完成]scikit-learn一般实例之十二:用于RBF核的显式特征映射逼近
It shows how to use RBFSampler and Nystroem to approximate the feature map of an RBF kernel for clas ...
- Linear Regression with Scikit Learn
Before you read This is a demo or practice about how to use Simple-Linear-Regression in scikit-lear ...
随机推荐
- SQLServer文件收缩-图形化+命令
汇总篇:http://www.cnblogs.com/dunitian/p/4822808.html#tsql 收缩前 图形化演示: 不仅仅可以收缩日记文件,数据库文件也是可以收缩的,只不过日记收缩比 ...
- Go结构体实现类似成员函数机制
Go语言结构体成员能否是函数,从而实现类似类的成员函数的机制呢?答案是肯定的. package main import "fmt" type stru struct { testf ...
- PHP设计模式(三)抽象工厂模式(Abstract Factory For PHP)
一.什么是抽象工厂模式 抽象工厂模式的用意为:给客户端提供一个接口,可以创建多个产品族中的产品对象 ,而且使用抽象工厂模式还要满足以下条件: 系统中有多个产品族,而系统一次只可能消费其中一族产品. 同 ...
- DevOps对于企业IT的价值
其实从敏捷延展开的 DevOps 概念很早就已经被提出,不过由于配套的技术成熟度水平层次不齐, DevOps 的价值一直没有有效地发挥出来.现如今,随着容器技术的发展, DevOps 在企业中的实践难 ...
- docker4dotnet #2 容器化主机
.NET 猿自从认识了小鲸鱼,感觉功力大增.上篇<docker4dotnet #1 前世今生&世界你好>中给大家介绍了如何在Windows上面配置Docker for Window ...
- 基于select的python聊天室程序
python网络编程具体参考<python select网络编程详细介绍>. 在python中,select函数是一个对底层操作系统的直接访问的接口.它用来监控sockets.files和 ...
- python select网络编程详细介绍
刚看了反应堆模式的原理,特意复习了socket编程,本文主要介绍python的基本socket使用和select使用,主要用于了解socket通信过程 一.socket模块 socket - Low- ...
- 使用apache自带日志分割模块rotatelogs,分割日志
rotatelogs 是 Apache 2.2 中自带的管道日志程序,参数如下(参见:http://lamp.linux.gov.cn/Apache/ApacheMenu/programs/rotat ...
- Centos 6.6 下搭建php5.2.17+Zend Optimizer3.3.9+Jexus环境
(为何安装php5.2.17这个版本 因为phpweb这个程序用到了Zend Optimizer3.3.9 这个东东已经停止更新了 最高支持5.2版本的php 所以就有了一晚上填坑的自己和总结了这篇文 ...
- keepalived 知识备注
keepalived可用于配置nginx/lvs等负载均衡设备的双机热备. keepalived基于VRRP协议,简单的说就是两个物理路由节点(一主一备),虚拟成一个逻辑上的路由节点. 实际消息的路由 ...