[Repost] A warning about the KDD Cup '99 dataset, for the attention of anyone working in this area
From: Terry Brugger
Date: 15 Sep 2007
Subject: KDD Cup '99 dataset (Network Intrusion) considered harmful
Oftentimes in the scientific community, we become interested in new techniques or approaches based on characteristics of the technique or approach itself. While such investigation may be informative from a pure research standpoint, the general public -- and particularly most research sponsors -- tend to be more interested in the application of this technology. To this end, the KDD Cup Challenge has, for over ten years, provided the KDD community with datasets from real-world problems to demonstrate the applicability and performance of different knowledge discovery techniques. Researchers in the computer security community (based on the tone of papers published at the time) were initially excited to see a problem from their domain adopted for the 1999 KDD Cup Challenge. Since then, however, the dataset has become widely discredited. This letter is intended to briefly outline the problems that have been cited with the KDD Cup '99 dataset, and to discourage its further use.
The KDD Cup '99 dataset was created by processing the tcpdump portions of the 1998 DARPA Intrusion Detection System (IDS) Evaluation dataset, created by Lincoln Lab under contract to DARPA [Lippmann et al]. Since one cannot know the intention (benign or malicious) of every connection on a real-world network (if we could, we would not need research in intrusion detection), the artificial data was generated using a closed network, some proprietary network traffic generators, and hand-injected attacks. It was intended to simulate the traffic seen in a medium-sized US Air Force base (and was created in collaboration with the AFRL in Rome, NY, which could be characterized as a medium-sized US Air Force base).
Based on the published description of how the data was generated, McHugh published a fairly harsh criticism of the dataset. Among the issues raised, the most important seemed to be that no validation was ever performed to show that the DARPA dataset actually looked like real network traffic. Indeed, even a cursory examination of the data showed that the data rates were far below what would be experienced in a real medium-sized network. Nevertheless, IDS researchers continued to use the dataset (and the KDD Cup dataset that was derived from it) for lack of anything better.
In 2003, Mahoney and Chan built a trivial intrusion detection system and ran it against the DARPA tcpdump data. They found numerous irregularities, including that -- due to the way the data was generated -- all the malicious packets had a TTL of 126 or 253 whereas almost all the benign packets had a TTL of 127 or 254. This served to demonstrate to most people in the network security research community that the DARPA dataset (and by extension, the KDD Cup '99 dataset) was fundamentally broken, and one could not draw any conclusions from any experiments run using them. Numerous researchers indicated to us (in personal conversations) that if they were reviewing a paper based solely on the DARPA dataset, they would reject it solely on that basis.
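To make the TTL irregularity concrete, here is a minimal sketch of the kind of one-line "detector" such an artifact permits. The toy records, the names `SAMPLE_PACKETS`, `ttl_rule`, `accuracy`, and `ttl_by_label` are all hypothetical and chosen only to mirror the pattern described above; they are not taken from the DARPA data or from Mahoney and Chan's system.

```python
from collections import Counter

# Hypothetical toy records of the form (ttl, is_attack).
# Illustrative values only; NOT drawn from the DARPA tcpdump files.
SAMPLE_PACKETS = [
    (127, False), (254, False), (127, False), (254, False),
    (126, True), (253, True), (126, True),
]

# TTL values the letter reports for malicious packets.
ATTACK_TTLS = frozenset({126, 253})


def ttl_rule(ttl: int) -> bool:
    """Flag a packet as malicious using nothing but its TTL."""
    return ttl in ATTACK_TTLS


def accuracy(packets) -> float:
    """Fraction of packets for which the TTL rule matches the label."""
    correct = sum(ttl_rule(ttl) == is_attack for ttl, is_attack in packets)
    return correct / len(packets)


def ttl_by_label(packets):
    """Count TTL values per class; disjoint value sets suggest a label leak."""
    counts = {True: Counter(), False: Counter()}
    for ttl, is_attack in packets:
        counts[is_attack][ttl] += 1
    return counts


if __name__ == "__main__":
    print(accuracy(SAMPLE_PACKETS))
    print(ttl_by_label(SAMPLE_PACKETS))
```

A rule like this learns nothing about attacks; it only recognizes how the testbed was wired. Tabulating a feature per label, as `ttl_by_label` does, is one simple way to look for this kind of leak before trusting results from any labeled dataset.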
Indeed, at the time we were conducting our own assessment of the DARPA dataset, using Snort [Caswell and Roesch]. Trivial detection using the TTL aside, we found that it was still useful to evaluate the true positive performance of a network IDS; however, any false positive results were meaningless [Brugger and Chow]. Anonymous reviewers at respectable information security conferences were unimpressed; one noted, "is there any interest to study the capacities of SNORT on such data?". A reviewer from another conference summarized their review with "The content of the paper is really out of date. If this paper appears five years ago, there is some value, but not much now."
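One way to see why true positive results can survive while false positive results do not is to write out the two rates. The sketch below uses the standard definitions; the function name `detection_rates` and the counts in the usage example are hypothetical and are not results from [Brugger and Chow].

```python
def detection_rates(tp: int, fn: int, fp: int, tn: int):
    """Return (true positive rate, false positive rate).

    tp, fn: attack connections that were detected / missed.
    fp, tn: benign connections that were flagged / passed.
    """
    # TPR depends only on the attack traffic.
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    # FPR depends only on the benign (background) traffic.
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return tpr, fpr


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(detection_rates(tp=90, fn=10, fp=5, tn=995))
```

The true positive rate is computed only from attack traffic, and the false positive rate only from benign traffic. If the simulated background traffic does not resemble a real network, the second number says little about how an IDS would behave in deployment, even when the first remains informative.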
While the DARPA (and KDD Cup '99) dataset has fallen from grace in the network security community, we still see it widely used in the greater KDD community. Examples in the past couple of years include [Kayacik et al.], [Sarasamma et al.], [Gao et al.], [Chan et al.], and [Zhang et al.]. While this sample doesn't necessarily represent the top-tier journals and conferences in the KDD community, they are to the best of our knowledge respectable, peer-reviewed publications. Obviously, the knowledge discovery researchers are well intentioned in wanting to show the usefulness of every technique imaginable to the network intrusion detection domain. Unfortunately, due to the problems with the dataset, such conclusions cannot be drawn. As a result, we strongly recommend that (1) all researchers stop using the KDD Cup '99 dataset, (2) the KDD Cup and UCI websites include a warning on the KDD Cup '99 dataset webpage informing researchers that there are known problems with the dataset, and (3) peer reviewers for conferences and journals ding papers (or even outright reject them, as is common in the network security community) with results drawn solely from the KDD Cup '99 dataset.
S Terry Brugger, zow at acm dot org
UC Davis, Department of Computer Science
References
- Brugger, S. T. and J. Chow (January 2007). An assessment of the DARPA IDS Evaluation Dataset using Snort. Technical Report CSE-2007-1, University of California, Davis, Department of Computer Science, Davis, CA. http://www.cs.ucdavis.edu/research/tech-reports/2007/CSE-2007-1.pdf.
- Caswell, B. and M. Roesch (16 May 2004). Snort: The open source network intrusion detection system. http://www.snort.org/.
- Chan, A. P., W. W. Y. Ng, D. S. Yeung, and E. C. C. Tsang (19-21 August 2005). Comparison of different fusion approaches for network intrusion detection using ensemble of RBFNN. In Proc. of 2005 Intl. Conf. on Machine Learning and Cybernetics, Volume 6, Guangzhou, China, pp. 3846-3851. IEEE.
- Gao, H.-H., H.-H. Yang, and X.-Y. W. (27-29 August 2005). Principal component neural networks based intrusion feature extraction and detection using SVM. In Advances in Natural Computation, Volume 3611 of Lecture Notes in Computer Science, Changsha, China, pp. 21-27. Springer.
- Kayacik, H. G., A. N. Zincir-Heywood, and M. I. Heywood (June 2007). A hierarchical SOM-based intrusion detection system. Engineering Applications of Artificial Intelligence 20 (4), 439-451. Full text not available; analysis based on detailed abstract.
- Lippmann, R. P., D. J. Fried, I. Graf, J. W. Haines, K. Kendall, D. McClung, D. Weber, S. Webster, D. Wyschogrod, R. K. Cunningham, and M. Zissman (January 2000). Evaluating intrusion detection systems: The 1998 DARPA off-line intrusion detection evaluation. In Proc. of the DARPA Information Survivability Conference and Exposition, Los Alamitos, CA. IEEE Computer Society Press.
- Mahoney, M. V. and P. K. Chan (8-10 September 2003). An analysis of the 1999 DARPA/Lincoln Laboratory Evaluation Data for network anomaly detection. In G. Vigna, E. Jonsson, and C. Krugel (Eds.), Proc. 6th Intl. Symp. on Recent Advances in Intrusion Detection (RAID 2003), Volume 2820 of Lecture Notes in Computer Science, Pittsburgh, PA, pp. 220-237. Springer.
- McHugh, J. (2000). Testing intrusion detection systems: a critique of the 1998 and 1999 DARPA intrusion detection system evaluations as performed by Lincoln Laboratory. ACM Trans. Information System Security 3 (4), 262-294.
- Sarasamma, S. T., Q. A. Zhu, and J. Huff (April 2005). Hierarchical Kohonenen net for anomaly detection in network security. IEEE Trans. Syst., Man, Cybern. B 35 (2), 302-312.
- Zhang, C., J. Jiang, and M. Kamel (May 2005). Intrusion detection using hierarchical neural networks. Pattern Recognition Letters 26 (6), 779-791.
All opinions expressed are solely the view of the author(s), and are not necessarily shared or endorsed by The University of California, Davis, or their employer(s).
Original article:
http://www.kdnuggets.com/news/2007/n18/4i.html