[amazonaccess 1] logistic.py: feature extraction
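This post walks through logistic.py.
About amazonaccess:
Given an employee's profile (features such as the role code and the role family code), predict whether the employee should have access to a given resource.
Key ideas in logistic.py (Python):
1. Build new features by combining several existing features
For example, combining MGR_ID and ROLE_FAMILY gives the new feature hash((85475,290919))=1071656665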
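2. Greedy feature selection
i. First pick the single candidate feature that performs best on the training data, add it to goodfeatures, and remove it from the candidates
ii. Try each remaining candidate together with the features already in goodfeatures, pick the candidate that performs best on the training data, add it to goodfeatures, and remove it from the candidates
iii. Keep selecting until performance on the training data stops improving (the code measures this as the mean cross-validated AUC)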
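3. One Hot Encoding
For example, to encode the discrete values [23 33 33 44]:
i. First relabel them as [0 1 1 2]
ii. Encode label 0 as 0 0 1 (value 23)
Encode label 1 as 0 1 0 (value 33)
Encode label 2 as 1 0 0 (value 44)
This way, when the linear model is fitted at the end, every label of a discrete feature gets its own weight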
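The relabel-then-encode example above can be reproduced with a few lines of NumPy. This is a minimal sketch rather than the encoder used in logistic.py; it places label 0 in column 0, which mirrors the bit order written above, but the effect is the same because each label still gets its own column.

```python
import numpy as np

values = np.array([23, 33, 33, 44])

# Relabel: map each distinct value to its rank among the sorted unique values.
uniques, relabeled = np.unique(values, return_inverse=True)

# One-hot encode: row i has a single 1 in column relabeled[i].
one_hot = np.zeros((len(values), len(uniques)), dtype=int)
one_hot[np.arange(len(values)), relabeled] = 1

print(relabeled)  # [0 1 1 2]
print(one_hot)
```

Rows 1 and 2 come out identical because both hold the value 33.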
Code walkthrough:
1. Read the data and drop the ROLE_CODE attribute
- learner = 'log'
- print "Reading dataset..."
- train_data = pd.read_csv('train.csv')
- test_data = pd.read_csv('test.csv')
- submit=learner + str(SEED) + '.csv'
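- # The slice 1:-1 drops the first column and the trailing ROLE_CODE column; train and test are stacked so that both go through the same transformations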
- all_data = np.vstack((train_data.ix[:,1:-1], test_data.ix[:,1:-1]))
- num_train = np.shape(train_data)[0]
2. Relabel the data
- # Transform data
- print "Transforming data..."
- # Relabel the variable values to smallest possible so that I can use bincount
- # on them later.
- relabler = preprocessing.LabelEncoder()
- for col in range(len(all_data[0,:])):
-     relabler.fit(all_data[:, col])
-     all_data[:, col] = relabler.transform(all_data[:, col])
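3. Combine features to generate new ones. Pairs and triples of features are combined, producing 26 (28-2) and 44 (56-12) new features respectively, which are then merged with the original features
Any combination that contains both ROLE_TITLE and ROLE_FAMILY (column indices 5 and 7 in the code), or both ROLE_ROLLUP_1 and ROLE_ROLLUP_2 (indices 2 and 3), is skipped
Because many labels in these features cover only 1 or 2 rows, such labels are merged: all labels with 1 row share one new label, and all labels with 2 rows share another
The function that combines features: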
- def group_data(data, degree=3, hash=hash):
-     """
-     numpy.array -> numpy.array
-     Groups all columns of data into all combinations of `degree` columns
-     """
-     new_data = []
-     m,n = data.shape
-     for indicies in combinations(range(n), degree):
-         # Skip combinations containing both ROLE_TITLE and ROLE_FAMILY
-         if 5 in indicies and 7 in indicies:
-             print "feature Xd"
-         # Skip combinations containing both ROLE_ROLLUP_1 and ROLE_ROLLUP_2
-         elif 2 in indicies and 3 in indicies:
-             print "feature Xd"
-         else:
-             new_data.append([hash(tuple(v)) for v in data[:,indicies]])
-     return array(new_data).T
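To check the feature counts quoted above (28-2 pairs and 56-12 triples), here is a small Python 3 sketch of the same idea run on random toy data. The name `group_columns` and the `excluded_pairs` parameter are illustrative and not part of logistic.py; the excluded index pairs (5, 7) and (2, 3) are taken from the function above.

```python
from itertools import combinations

import numpy as np


def group_columns(data, degree=2, excluded_pairs=((5, 7), (2, 3))):
    """Hash every `degree`-wise column combination into one new column,
    skipping combinations that contain both columns of an excluded pair."""
    n_cols = data.shape[1]
    new_cols = []
    for idx in combinations(range(n_cols), degree):
        if any(a in idx and b in idx for a, b in excluded_pairs):
            continue
        # Rows with the same value tuple receive the same hash.
        new_cols.append([hash(tuple(row)) for row in data[:, list(idx)]])
    return np.array(new_cols).T


rng = np.random.default_rng(0)
toy = rng.integers(0, 5, size=(6, 8))  # 6 rows, 8 categorical columns

pairs = group_columns(toy, degree=2)
triples = group_columns(toy, degree=3)
print(pairs.shape, triples.shape)  # (6, 26) (6, 44)
```

The hash values themselves depend on the Python build, so only the shapes are worth checking here.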
Merging labels that cover only 1 or 2 rows
- dp = group_data(all_data, degree=2)
- for col in range(len(dp[0,:])):
-     relabler.fit(dp[:, col])
-     dp[:, col] = relabler.transform(dp[:, col])
-     uniques = len(set(dp[:,col]))
-     maximum = max(dp[:,col])
-     print col
-     if maximum < 65534:
-         count_map = np.bincount((dp[:, col]).astype('uint16'))
-         for n,i in enumerate(dp[:, col]):
-             # Labels that cover only 1 row: merge into one shared label
-             if count_map[i] <= 1:
-                 dp[n, col] = uniques
-             # Labels that cover exactly 2 rows: merge into another shared label
-             elif count_map[i] == 2:
-                 dp[n, col] = uniques+1
-     else:
-         for n,i in enumerate(dp[:, col]):
-             if (dp[:, col] == i).sum() <= 1:
-                 dp[n, col] = uniques
-             elif (dp[:, col] == i).sum() == 2:
-                 dp[n, col] = uniques+1
-     print uniques # unique values
-     uniques = len(set(dp[:,col]))
-     print uniques
-     relabler.fit(dp[:, col])
-     dp[:, col] = relabler.transform(dp[:, col])
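The merging step can also be written without the per-row loop. The following is a vectorized sketch of the same idea for a single column, not the code from logistic.py; `merge_rare_labels` is a hypothetical helper name, and `np.unique` stands in for the LabelEncoder relabeling.

```python
import numpy as np


def merge_rare_labels(col):
    """Give all labels seen once one shared label and all labels seen twice
    another, then relabel so the result is contiguous from 0."""
    col = np.asarray(col)
    uniques, inverse, counts = np.unique(
        col, return_inverse=True, return_counts=True
    )
    out = inverse.copy()           # labels relabeled to 0..k-1
    k = len(uniques)
    row_counts = counts[inverse]   # frequency of each row's label
    out[row_counts == 1] = k       # shared bucket for labels seen once
    out[row_counts == 2] = k + 1   # shared bucket for labels seen twice
    return np.unique(out, return_inverse=True)[1]


col = [7, 7, 7, 3, 9, 9, 4, 5, 5]
print(merge_rare_labels(col))  # [0 0 0 1 2 2 1 2 2]
```

In the toy column, 3 and 4 each appear once and end up sharing label 1, while 9 and 5 each appear twice and share label 2.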
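Merge the new features with the original features (dt is presumably the degree=3 counterpart of dp, built the same way; its construction is not shown here)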
- # Collect the training features together
- y = array(train_data.ACTION)
- X = all_data[:num_train]
- X_2 = dp[:num_train]
- X_3 = dt[:num_train]
- # Collect the testing features together
- X_test = all_data[num_train:]
- X_test_2 = dp[num_train:]
- X_test_3 = dt[num_train:]
- X_train_all = np.hstack((X, X_2, X_3))
- X_test_all = np.hstack((X_test, X_test_2, X_test_3))
4.one hot encoding
- def OneHotEncoder(data, keymap=None):
-     """
-     OneHotEncoder takes data matrix with categorical columns and
-     converts it to a sparse binary matrix.
-     Returns sparse binary matrix and keymap mapping categories to indicies.
-     If a keymap is supplied on input it will be used instead of creating one
-     and any categories appearing in the data that are not in the keymap are
-     ignored
-     """
-     if keymap is None:
-         keymap = []
-         for col in data.T:
-             uniques = set(list(col))
-             keymap.append(dict((key, i) for i, key in enumerate(uniques)))
-     total_pts = data.shape[0]
-     outdat = []
-     for i, col in enumerate(data.T):
-         km = keymap[i]
-         num_labels = len(km)
-         spmat = sparse.lil_matrix((total_pts, num_labels))
-         for j, val in enumerate(col):
-             if val in km:
-                 spmat[j, km[val]] = 1
-         outdat.append(spmat)
-     outdat = sparse.hstack(outdat).tocsr()
-     return outdat, keymap
- # Xts holds one hot encodings for each individual feature in memory
- # speeding up feature selection
- Xts = [OneHotEncoder(X_train_all[:,[i]])[0] for i in range(num_features)]
5.greedy feature selection
- print "Performing greedy feature selection..."
- score_hist = []
- N = 10
- good_features = set([])
- # Greedy feature selection loop
- while len(score_hist) < 2 or score_hist[-1][0] > score_hist[-2][0]:
-     scores = []
-     for f in range(len(Xts)):
-         if f not in good_features:
-             feats = list(good_features) + [f]
-             Xt = sparse.hstack([Xts[j] for j in feats]).tocsr()
-             score = cv_loop(Xt, y, model, N)
-             scores.append((score, f))
-             print "Feature: %i Mean AUC: %f" % (f, score)
-     good_features.add(sorted(scores)[-1][1])
-     score_hist.append(sorted(scores)[-1])
-     print "Current features: %s" % sorted(list(good_features))
- # Remove last added feature from good_features
- good_features.remove(score_hist[-1][1])
- good_features = sorted(list(good_features))
- print "Selected features %s" % good_features
- gf = open("feats" + submit, 'w')
- print >>gf, good_features
- gf.close()
- print len(good_features), " features"
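The control flow of the greedy loop is easier to follow with the model stripped out. Below is a minimal Python 3 sketch of forward selection under stated assumptions: `greedy_select` and `toy_score` are hypothetical names, and `toy_score` is a made-up scoring rule standing in for cv_loop(Xt, y, model, N). Unlike the loop above, which adds the last feature and then removes it, this version stops before adding a feature that does not improve the score; the selected set is the same.

```python
def greedy_select(n_features, score_fn):
    """Forward selection: add the best-scoring feature each round and
    stop as soon as the best achievable score no longer improves."""
    good, history = [], []
    remaining = set(range(n_features))
    while remaining:
        best_score, best_f = max((score_fn(good + [f]), f) for f in remaining)
        if history and best_score <= history[-1]:
            break  # no improvement: stop without adding the feature
        good.append(best_f)
        remaining.remove(best_f)
        history.append(best_score)
    return sorted(good), history


# Made-up usefulness per feature, with a small penalty per selected feature.
weights = {0: 0.10, 1: 0.30, 2: 0.02, 3: 0.20}


def toy_score(features):
    return 0.5 + sum(weights[f] for f in features) - 0.05 * len(features)


selected, history = greedy_select(4, toy_score)
print(selected)  # [0, 1, 3]
```

Feature 2 is never selected because its gain (0.02) is smaller than the per-feature penalty, so adding it would lower the score.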
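6. Choose the best hyperparameter by cross-validation; for logistic regression this is the regularization parameter C (in scikit-learn, the inverse of the regularization strength)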
- print "Performing hyperparameter selection..."
- # Hyperparameter selection loop
- score_hist = []
- Xt = sparse.hstack([Xts[j] for j in good_features]).tocsr()
- if learner == 'NB':
-     Cvals = [0.001, 0.003, 0.006, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.1]
- else:
-     Cvals = np.logspace(-4, 4, 15, base=2) # for logistic
- for C in Cvals:
-     if learner == 'NB':
-         model.alpha = C
-     else:
-         model.C = C
-     score = cv_loop(Xt, y, model, N)
-     score_hist.append((score,C))
-     print "C: %f Mean AUC: %f" %(C, score)
- bestC = sorted(score_hist)[-1][1]
- print "Best C value: %f" % (bestC)
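For reference, the logistic-regression grid np.logspace(-4, 4, 15, base=2) contains 15 values spaced evenly in the exponent, from 2^-4 = 0.0625 up to 2^4 = 16. The sketch below prints the grid and shows the selection rule used above (highest score wins). Here `pick_best` is a hypothetical helper, and the score function is a made-up curve standing in for the cross-validated AUC.

```python
import numpy as np

# 15 candidate values, evenly spaced in log2 from 2**-4 to 2**4.
c_values = np.logspace(-4, 4, 15, base=2)
print(np.round(c_values, 4))


def pick_best(candidates, score_fn):
    """Return the candidate with the highest score (ties go to the larger value)."""
    return max((score_fn(c), c) for c in candidates)[1]


# Hypothetical score curve that peaks at C = 1; a real run would call
# the cross-validation routine here instead.
best_c = pick_best(c_values, lambda c: -abs(np.log2(c)))
print(best_c)
```

Searching on a log scale covers both strongly and weakly regularized models with only a few fits.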
7. Prediction
- print "Performing One Hot Encoding on entire dataset..."
- Xt = np.vstack((X_train_all[:,good_features], X_test_all[:,good_features]))
- Xt, keymap = OneHotEncoder(Xt)
- X_train = Xt[:num_train]
- X_test = Xt[num_train:]
- if learner == 'NB':
-     model.alpha = bestC
- else:
-     model.C = bestC
- print "Training full model..."
- print "Making prediction and saving results..."
- model.fit(X_train, y)
- preds = model.predict_proba(X_test)[:,1]
- create_test_submission(submit, preds)
- preds = model.predict_proba(X_train)[:,1]
- create_test_submission('Train'+submit, preds)