Sample records from the KDD99 data set:

    0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,normal.
    0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,normal.
    0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,normal.
    0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
    0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.01,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
    0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,255,1.00,0.00,0.01,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
    0,udp,domain_u,SF,29,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,1,0.00,0.00,0.00,0.00,0.50,1.00,0.00,10,3,0.30,0.30,0.30,0.00,0.00,0.00,0.00,0.00,normal.
    0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,253,0.99,0.01,0.00,0.00,0.00,0.00,0.00,0.00,normal.
    0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
    0,tcp,http,SF,223,185,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,4,4,0.00,0.00,0.00,0.00,1.00,0.00,0.00,71,255,1.00,0.00,0.01,0.01,0.00,0.00,0.00,0.00,normal.
    0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
    0,tcp,http,SF,230,260,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,19,0.00,0.00,0.00,0.00,1.00,0.00,0.11,3,255,1.00,0.00,0.33,0.07,0.33,0.00,0.00,0.00,normal.
    0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.01,0.00,0.00,0.00,0.00,0.00,normal.
    0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,252,0.99,0.01,0.00,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
    1,tcp,smtp,SF,3170,329,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,2,0.00,0.00,0.00,0.00,1.00,0.00,1.00,54,39,0.72,0.11,0.02,0.00,0.02,0.00,0.09,0.13,normal.
    0,tcp,http,SF,297,13787,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,177,255,1.00,0.00,0.01,0.01,0.00,0.00,0.00,0.00,normal.
    0,tcp,http,SF,291,3542,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,12,12,0.00,0.00,0.00,0.00,1.00,0.00,0.00,187,255,1.00,0.00,0.01,0.01,0.00,0.00,0.00,0.00,normal.
    0,tcp,http,SF,295,753,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,21,22,0.00,0.00,0.00,0.00,1.00,0.00,0.09,196,255,1.00,0.00,0.01,0.01,0.00,0.00,0.00,0.00,normal.
    0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.01,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
    0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
    0,tcp,http,SF,268,9235,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,5,5,0.00,0.00,0.00,0.00,1.00,0.00,0.00,58,255,1.00,0.00,0.02,0.05,0.00,0.00,0.00,0.00,normal.
    0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,253,0.99,0.01,0.00,0.00,0.00,0.00,0.00,0.00,snmpgetattack.
    0,tcp,http,SF,223,185,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,3,3,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,255,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal.
    0,tcp,http,SF,227,8841,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,13,13,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,255,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal.
    0,tcp,http,SF,222,19564,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,22,23,0.00,0.00,0.00,0.00,1.00,0.00,0.09,255,255,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal.
    0,tcp,ftp_data,SF,740,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,77,33,0.34,0.08,0.34,0.06,0.00,0.00,0.00,0.00,normal.
    0,udp,private,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,normal.
    0,tcp,ftp_data,SF,35195,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,10,10,0.00,0.00,0.00,0.00,1.00,0.00,0.00,92,44,0.43,0.07,0.43,0.05,0.00,0.00,0.00,0.00,normal.
    0,tcp,ftp_data,SF,8325,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,20,20,0.00,0.00,0.00,0.00,1.00,0.00,0.00,103,54,0.49,0.06,0.49,0.04,0.00,0.00,0.00,0.00,normal.
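
Each record above has 41 comma-separated feature fields followed by a class label that ends with a period. As a minimal sketch (the two rows are copied from the sample; the names `ROW_NORMAL`, `ROW_ATTACK`, `features` and `labels` are only illustrative), such lines can be loaded with pandas and split into a feature frame and a binary label:

```python
import io

import pandas as pd

# Two records copied from the sample above (first and fourth rows).
ROW_NORMAL = (
    "0,udp,private,SF,105,146,"
    "0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"
    "1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,"
    "255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,normal."
)
ROW_ATTACK = (
    "0,udp,private,SF,105,146,"
    "0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"
    "2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,"
    "255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,snmpgetattack."
)

# header=None because the file has no header row; columns are numbered 0..41.
df = pd.read_csv(io.StringIO(ROW_NORMAL + "\n" + ROW_ATTACK), header=None)

features = df.iloc[:, :-1]                          # the 41 feature columns
labels = (df.iloc[:, -1] != "normal.").astype(int)  # 0 = normal, 1 = attack

print(df.shape)
print(labels.tolist())
```

The string-valued columns (1, 2 and 3) still need to be encoded as numbers before a scikit-learn estimator can use them.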

Code:

    # -*- coding: utf-8 -*-
    # Train a decision tree on the KDD99 "normal." and "guess_passwd." records,
    # passing the pandas DataFrame to scikit-learn directly.
    # Printed label strings are kept verbatim so they match the recorded output.

    import pandas as pd
    import pydotplus
    from sklearn import preprocessing
    from sklearn import tree
    from sklearn.model_selection import cross_val_score


    def label(x):
        # Binary target: 0 for normal traffic, 1 for anything else.
        if x == "normal.":
            return 0
        else:
            return 1


    if __name__ == '__main__':
        data = pd.read_csv('../data/kddcup99/corrected', sep=",", header=None)
        print(data.columns)
        print(data.iloc[0, 0], data.iloc[0, 1])
        print(len(data))
        col_cnt = len(data.columns)

        # Keep only the normal and guess_passwd records.
        normal = data.loc[data.loc[:, col_cnt-1] == "normal.", :]
        print("normal len:", len(normal))
        guess = data.loc[data.loc[:, col_cnt-1] == "guess_passwd.", :]
        # The label text is a copy-paste slip: this is the guess_passwd count.
        print("normal len:", len(guess))

        data = pd.concat([normal, guess])
        print(len(data))

        # Encode the string-valued feature columns as integers.
        le = preprocessing.LabelEncoder()
        for i in range(col_cnt-1):
            if isinstance(data.iloc[0, i], str):
                print("tranform string column only:", i)
                data[i] = le.fit_transform(data[i])
        data[col_cnt-1] = data[col_cnt-1].apply(label)
        print(data.iloc[0, 0], data.iloc[0, 1])

        # Features are every column except the last; the last column is the target.
        x = data.iloc[:, :col_cnt-1]
        #x = data.iloc[:, [0,4,5,6,7,8,22,23,24,25,26,27,28,29,30]]
        y = data.iloc[:, col_cnt-1]

        # also OK (as_matrix() was removed in pandas 1.0; use to_numpy() there):
        #     data = data.as_matrix()
        #     x = data[:, range(col_cnt-1)]
        #     y = data[:, col_cnt-1]

        print("x=>")
        print(x.iloc[0:3, :])
        print("y=>")
        print(y.iloc[-3:])
        #v=load_kdd99("../data/kddcup99/corrected")
        #x,y=get_guess_passwdandNormal(v)
        clf = tree.DecisionTreeClassifier()
        clf = clf.fit(x, y)
        print(clf)

        # 10-fold cross-validation; sklearn.cross_validation was replaced by
        # sklearn.model_selection in scikit-learn 0.18.
        print(cross_val_score(clf, x, y, n_jobs=-1, cv=10))

        # Export the fitted tree to a PDF via Graphviz.
        dot_data = tree.export_graphviz(clf, out_file=None)
        graph = pydotplus.graph_from_dot_data(dot_data)
        graph.write_pdf("../photo/6/iris-dt.pdf")
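
The "also OK" alternative in the script converts the frame to a NumPy array before slicing. A small sketch on a toy frame (the values are made up for illustration, and `toy` is a hypothetical name) suggests that both routes yield the same features and labels. It uses `to_numpy()` because `as_matrix()` was removed in pandas 1.0:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the encoded KDD99 frame: three feature columns plus a label.
toy = pd.DataFrame([[0, 2, 105, 0],
                    [0, 2, 105, 1],
                    [1, 1, 3170, 0]])
col_cnt = len(toy.columns)

# Route 1: positional selection with iloc, staying in pandas.
x_df = toy.iloc[:, :col_cnt - 1]
y_df = toy.iloc[:, col_cnt - 1]

# Route 2: convert to a NumPy array first, then slice the array.
arr = toy.to_numpy()
x_np = arr[:, :col_cnt - 1]
y_np = arr[:, col_cnt - 1]

same_x = np.array_equal(x_df.to_numpy(), x_np)
same_y = np.array_equal(y_df.to_numpy(), y_np)
print(same_x, same_y)
```

Staying with `iloc` keeps the row index and column labels, which helps when inspecting individual records afterwards; the array route drops them.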

Results:

    Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
                17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
                34, 35, 36, 37, 38, 39, 40, 41],
               dtype='int64')
    0 udp
    311029
    normal len: 60593
    normal len: 4367
    64960
    tranform string column only: 1
    tranform string column only: 2
    tranform string column only: 3
    0 2
    x=>
       0  1   2  3    4    5  6  7  8  9 ...   31   32   33    34   35  \
    0  0  2  15  7  105  146  0  0  0  0 ...  255  254  1.0  0.01  0.0
    1  0  2  15  7  105  146  0  0  0  0 ...  255  254  1.0  0.01  0.0
    2  0  2  15  7  105  146  0  0  0  0 ...  255  254  1.0  0.01  0.0

        36   37   38   39   40
    0  0.0  0.0  0.0  0.0  0.0
    1  0.0  0.0  0.0  0.0  0.0
    2  0.0  0.0  0.0  0.0  0.0

    [3 rows x 41 columns]
    y=>
    142098    1
    142099    1
    142101    1
    Name: 41, dtype: int64
    DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                max_features=None, max_leaf_nodes=None,
                min_impurity_decrease=0.0, min_impurity_split=None,
                min_samples_leaf=1, min_samples_split=2,
                min_weight_fraction_leaf=0.0, presort=False, random_state=None,
                splitter='best')
    [ 0.9561336   0.99892258  0.99938433  0.99984606  0.99984606  0.99969212
      1.          0.99984604  0.99969207  1.        ]
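
To put the ten fold scores in context, a short sketch (all numbers copied from the output above; the variable names are illustrative) computes their mean and spread and compares them with the accuracy of always predicting `normal.`, given 60593 normal and 4367 `guess_passwd.` records:

```python
import numpy as np

# Fold accuracies copied from the cross_val_score output above.
scores = np.array([0.9561336, 0.99892258, 0.99938433, 0.99984606, 0.99984606,
                   0.99969212, 1.0, 0.99984604, 0.99969207, 1.0])

mean_score = scores.mean()
std_score = scores.std()

# Majority-class baseline: always predict "normal.".
n_normal, n_guess = 60593, 4367
baseline = n_normal / (n_normal + n_guess)

print(f"mean={mean_score:.4f} std={std_score:.4f} baseline={baseline:.4f}")
```

The mean fold accuracy is about 0.995 against a majority-class baseline of about 0.933, so the tree does considerably better than always guessing the majority class. Because the classes are imbalanced, per-class metrics such as recall on the attack class would give a fuller picture than accuracy alone.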
