pandas dataframe 做机器学习训练数据=》直接使用iloc或者as_matrix即可
样本示意,为kdd99数据源:
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.01,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,255,1.00,0.00,0.01,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
0,udp,domain_u,SF,29,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,1,0.00,0.00,0.00,0.00,0.50,1.00,0.00,10,3,0.30,0.30,0.30,0.00,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,253,0.99,0.01,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
0,tcp,http,SF,223,185,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,4,4,0.00,0.00,0.00,0.00,1.00,0.00,0.00,71,255,1.00,0.00,0.01,0.01,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
0,tcp,http,SF,230,260,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,19,0.00,0.00,0.00,0.00,1.00,0.00,0.11,3,255,1.00,0.00,0.33,0.07,0.33,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.01,0.00,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,252,0.99,0.01,0.00,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
1,tcp,smtp,SF,3170,329,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,2,0.00,0.00,0.00,0.00,1.00,0.00,1.00,54,39,0.72,0.11,0.02,0.00,0.02,0.00,0.09,0.13,normal.
0,tcp,http,SF,297,13787,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,177,255,1.00,0.00,0.01,0.01,0.00,0.00,0.00,0.00,normal.
0,tcp,http,SF,291,3542,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,12,12,0.00,0.00,0.00,0.00,1.00,0.00,0.00,187,255,1.00,0.00,0.01,0.01,0.00,0.00,0.00,0.00,normal.
0,tcp,http,SF,295,753,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,21,22,0.00,0.00,0.00,0.00,1.00,0.00,0.09,196,255,1.00,0.00,0.01,0.01,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.01,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
0,tcp,http,SF,268,9235,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,5,5,0.00,0.00,0.00,0.00,1.00,0.00,0.00,58,255,1.00,0.00,0.02,0.05,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,253,0.99,0.01,0.00,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
0,tcp,http,SF,223,185,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,3,3,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,255,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,tcp,http,SF,227,8841,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,13,13,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,255,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,tcp,http,SF,222,19564,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,22,23,0.00,0.00,0.00,0.00,1.00,0.00,0.09,255,255,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,tcp,ftp_data,SF,740,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,77,33,0.34,0.08,0.34,0.06,0.00,0.00,0.00,0.00,normal.
0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,normal.
0,tcp,ftp_data,SF,35195,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,10,10,0.00,0.00,0.00,0.00,1.00,0.00,0.00,92,44,0.43,0.07,0.43,0.05,0.00,0.00,0.00,0.00,normal.
0,tcp,ftp_data,SF,8325,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,20,20,0.00,0.00,0.00,0.00,1.00,0.00,0.00,103,54,0.49,0.06,0.49,0.04,0.00,0.00,0.00,0.00,normal.
代码:
# -*- coding:utf-8 -*- import re
import matplotlib.pyplot as plt
import os
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import preprocessing
from sklearn import cross_validation
import os
from sklearn.datasets import load_iris
from sklearn import tree
import pydotplus
from sklearn.preprocessing import LabelEncoder
import numpy as np
import pandas as pd
from sklearn_pandas import DataFrameMapper def label(x):
if x == "normal.":
return 0
else:
return 1 if __name__ == '__main__':
data = pd.read_csv('../data/kddcup99/corrected', sep=",", header=None)
print data.columns
print data.iloc[0,0], data.iloc[0,1]
print len(data)
col_cnt = len(data.columns) normal = data.loc[data.loc[:, col_cnt-1] == "normal.", :]
print "normal len:", len(normal)
guess = data.loc[data.loc[:, col_cnt-1] == "guess_passwd.", :]
print "normal len:", len(guess) data = pd.concat([normal, guess])
print len(data) le = preprocessing.LabelEncoder()
for i in range(col_cnt-1):
if isinstance(data.iloc[0,i], str):
print "tranform string column only:", i
data.loc[:,i] = le.fit_transform(data.loc[:,i])
data.loc[:,col_cnt-1] = data.loc[:,col_cnt-1].apply(label)
print data.iloc[0,0], data.iloc[0,1]
x = data.iloc[:, range(col_cnt-1)]
#x = data.iloc[:, [0,4,5,6,7,8,22,23,24,25,26,27,28,29,30]]
y = data.iloc[:, col_cnt-1]
''' also OK
data = data.as_matrix()
x = data[:, range(col_cnt-1)]
y = data[:, col_cnt-1]
'''
print "x=>"
print x.iloc[0:3, :]
print "y=>"
print y[-3:]
#v=load_kdd99("../data/kddcup99/corrected")
#x,y=get_guess_passwdandNormal(v)
clf = tree.DecisionTreeClassifier()
clf = clf.fit(x, y)
print clf print cross_validation.cross_val_score(clf, x, y, n_jobs=-1, cv=10) clf = clf.fit(x, y)
dot_data = tree.export_graphviz(clf, out_file=None)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_pdf("../photo/6/iris-dt.pdf")
结果:
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41],
dtype='int64')
0 udp
311029
normal len: 60593
normal len: 4367
64960
tranform string column only: 1
tranform string column only: 2
tranform string column only: 3
0 2
x=>
0 1 2 3 4 5 6 7 8 9 ... 31 32 33 34 35 \
0 0 2 15 7 105 146 0 0 0 0 ... 255 254 1.0 0.01 0.0
1 0 2 15 7 105 146 0 0 0 0 ... 255 254 1.0 0.01 0.0
2 0 2 15 7 105 146 0 0 0 0 ... 255 254 1.0 0.01 0.0 36 37 38 39 40
0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 [3 rows x 41 columns]
y=>
142098 1
142099 1
142101 1
Name: 41, dtype: int64
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
fg[ 0.9561336 0.99892258 0.99938433 0.99984606 0.99984606 0.99969212
1. 0.99984604 0.99969207 1. ]
pandas dataframe 做机器学习训练数据=》直接使用iloc或者as_matrix即可的更多相关文章
- python pandas.DataFrame选取、修改数据最好用.loc,.iloc,.ix
先手工生出一个数据框吧 import numpy as np import pandas as pd df = pd.DataFrame(np.arange(0,60,2).reshape(10,3) ...
- pandas.DataFrame.quantile
pandas.DataFrame.quantile 用于返回数据中的 处于1/5 1/2(中位数)等数据
- 机器学习之数据预处理,Pandas读取excel数据
Python读写excel的工具库很多,比如最耳熟能详的xlrd.xlwt,xlutils,openpyxl等.其中xlrd和xlwt库通常配合使用,一个用于读,一个用于写excel.xlutils结 ...
- 如何通过Elasticsearch Scroll快速取出数据,构造pandas dataframe — Python多进程实现
首先,python 多线程不能充分利用多核CPU的计算资源(只能共用一个CPU),所以得用多进程.笔者从3.7亿数据的索引,取200多万的数据,从取数据到构造pandas dataframe总共大概用 ...
- Pandas DataFrame数据的增、删、改、查
Pandas DataFrame数据的增.删.改.查 https://blog.csdn.net/zhangchuang601/article/details/79583551 #删除列 df_2 = ...
- Pandas DataFrame 数据选取和过滤
This would allow chaining operations like: pd.read_csv('imdb.txt') .sort(columns='year') .filter(lam ...
- pandas.DataFrame——pd数据框的简单认识、存csv文件
接着前天的豆瓣书单信息爬取,这一篇文章看一下利用pandas完成对数据的存储. 回想一下我们当时在最后得到了六个列表:img_urls, titles, ratings, authors, detai ...
- pandas中DataFrame和Series的数据去重
在SQL语言中去重是一件相当简单的事情,面对一个表(也可以称之为DataFrame)我们对数据进行去重只需要GROUP BY 就好. select custId,applyNo from tmp.on ...
- 用PyQt5来即时显示pandas Dataframe的数据,附qdarkstyle黑夜主题样式(美美哒的黑夜主题)
import sys from qdarkstyle import load_stylesheet_pyqt5 from PyQt5.QtWidgets import QApplication, QT ...
随机推荐
- ASP.NET-datatable转换成list对象
#region 讲DataTable转换为List对象 /// <summary> /// 利用反射将DataTable转换为List<T>对象 /// </summar ...
- LeetCode OJ 215. Kth Largest Element in an Array 堆排序求解
题目链接:https://leetcode.com/problems/kth-largest-element-in-an-array/ 215. Kth Largest Element in an A ...
- shadowOffset 具体解释
x向右为正,y向下为正 1.y<0 UILabel *label=[[UILabelalloc] initWithFrame:CGRectMake(40,40, 250,50)]; label. ...
- 棋盘覆盖问题python3实现
在2^k*2^k个方格组成的棋盘中,有一个方格被占用,用下图的4种L型骨牌覆盖全部棋盘上的其余全部方格,不能重叠. 代码例如以下: def chess(tr,tc,pr,pc,size): globa ...
- javascript系列-class10.DOM(下)
1.node节点(更详细的获取(设置)页面中所有的内容) 根据 W3C 的 HTML DOM 标准,HTML 文档中的所有内容都是节点: 元素是节点的别称,节点包含元素当然节点还有 ...
- linux ps 命令查看进程状态
显示其他用户启动的进程(a) 查看系统中属于自己的进程(x) 启动这个进程的用户和它启动的时间(u) 使用“date -s”命令来修改系统时间 比如将系统时间设定成1996年6月10日的命令如下. # ...
- spark thrift server configuration
# MainApplicationProperties # --master yarn --deploy-mode client 下的配置, client 模式表示,driver 是在本地机器上跑的, ...
- 【原创】websphere部署war包报错
应用程序在Tomcat上运行一切正常,但在websphere上部署时报以下错误:错误 500 处理请求时发生一个错误: /admin/upload.do 消息: WEB-INF/web.xml 详细错 ...
- SLAM概念学习之随机SLAM算法
这一节,在熟悉了Featue maps相关概念之后,我们将开始学习基于EKF的特征图SLAM算法. 1. 机器人,图和增强的状态向量 随机SLAM算法一般存储机器人位姿和图中的地标在单个状态向量中,然 ...
- 1028C:Rectangles
You are given n rectangles on a plane with coordinates of their bottom left and upper right points. ...