[Repost] A warning about the KDD Cup '99 dataset, for the attention of anyone working in related areas

Features

From: Terry Brugger
Date: 15 Sep 2007
Subject: KDD Cup '99 dataset (Network Intrusion) considered harmful

Oftentimes in the scientific community, we become interested in new techniques or approaches based on characteristics of the technique or approach itself. While such investigation may be informative from a pure research standpoint, the general public -- and particularly most research sponsors -- tend to be more interested in the application of this technology. To this end, the KDD Cup Challenge has, for over ten years, provided the KDD community with datasets from real world problems to demonstrate the applicability and performance of different knowledge discovery techniques. Researchers in the computer security community (based on the tone of papers published at the time) were initially excited to see a problem from their domain adopted for the 1999 KDD Cup Challenge. Since then, however, the dataset has become widely discredited. This letter is intended to briefly outline the problems that have been cited with the KDD Cup '99 dataset, and discourage its further use.

The KDD Cup '99 dataset was created by processing the tcpdump portions of the 1998 DARPA Intrusion Detection System (IDS) Evaluation dataset, created by Lincoln Lab under contract to DARPA [Lippmann et al.]. Since one cannot know the intention (benign or malicious) of every connection on a real world network (if we could, we would not need research in intrusion detection), the artificial data was generated using a closed network, some proprietary network traffic generators, and hand-injected attacks. It was intended to simulate the traffic seen in a medium-sized US Air Force base (and was created in collaboration with the AFRL in Rome, NY, which could be characterized as a medium-sized US Air Force base).

Based on the published description of how the data was generated, McHugh published a fairly harsh criticism of the dataset [McHugh]. Among the issues raised, the most important seemed to be that no validation was ever performed to show that the DARPA dataset actually looked like real network traffic. Indeed, even a cursory examination of the data showed that the data rates were far below what would be experienced in a real medium-sized network. Nevertheless, IDS researchers continued to use the dataset (and the KDD Cup dataset that was derived from it) for lack of anything better.

In 2003, Mahoney and Chan built a trivial intrusion detection system and ran it against the DARPA tcpdump data. They found numerous irregularities, including that -- due to the way the data was generated -- all the malicious packets had a TTL of 126 or 253 whereas almost all the benign packets had a TTL of 127 or 254. This served to demonstrate to most people in the network security research community that the DARPA dataset (and by extension, the KDD Cup '99 dataset) was fundamentally broken, and one could not draw any conclusions from any experiments run using them. Numerous researchers indicated to us (in personal conversations) that if they were reviewing a paper based solely on the DARPA dataset, they would reject it solely on that basis.
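To make the TTL artifact concrete, here is a minimal Python sketch of how trivially such a signal can be exploited. It is not Mahoney and Chan's system: the function name `flag_by_ttl`, the record layout, and the toy sample are all assumptions made for illustration, and only the TTL values themselves come from the paragraph above.

```python
# Hypothetical sketch: label a packet as malicious purely from its IP TTL.
# Only the TTL values are taken from the text; everything else is assumed.
ATTACK_TTLS = {126, 253}


def flag_by_ttl(ttl: int) -> bool:
    """Return True when the TTL matches the values reported for attack packets."""
    return ttl in ATTACK_TTLS


# Made-up (ttl, is_attack) pairs for illustration only, not taken from the dataset.
packets = [(126, True), (253, True), (127, False), (254, False), (254, False)]

# Count how often the TTL rule agrees with the toy labels.
correct = sum(flag_by_ttl(ttl) == is_attack for ttl, is_attack in packets)
accuracy = correct / len(packets)
print(f"accuracy on toy sample: {accuracy:.2f}")
```

A rule like this inspects nothing about the attack itself, which is why a detector that scores well on such data says little about its behavior on real traffic.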

Indeed, at the time, we were conducting our own assessment of the DARPA dataset, using Snort [Caswell and Roesch]. Trivial detection using the TTL aside, we found that it was still useful to evaluate the true positive performance of a network IDS; however, any false positive results were meaningless [Brugger and Chow]. Anonymous reviewers at respectable information security conferences were unimpressed; one noted, "is there any interest to study the capacities of SNORT on such data?". A reviewer from another conference summarized their review with "The content of the paper is really out of date. If this paper appears five years ago, there is some value, but not much now."

While the DARPA (and KDD Cup '99) dataset has fallen from grace in the network security community, we still see it widely used in the greater KDD community. Examples in the past couple of years include [Kayacik et al.], [Sarasamma et al.], [Gao et al.], [Chan et al.], and [Zhang et al.]. While this sample doesn't necessarily represent the top-tier journals and conferences in the KDD community, these are, to the best of our knowledge, respectable, peer-reviewed publications. Obviously, the knowledge discovery researchers are well intentioned in wanting to show the usefulness of every technique imaginable to the network intrusion detection domain. Unfortunately, due to the problems with the dataset, such conclusions cannot be drawn. As a result, we strongly recommend that (1) all researchers stop using the KDD Cup '99 dataset, (2) the KDD Cup and UCI websites include a warning on the KDD Cup '99 dataset webpage informing researchers that there are known problems with the dataset, and (3) peer reviewers for conferences and journals ding papers (or even outright reject them, as is common in the network security community) with results drawn solely from the KDD Cup '99 dataset.

S Terry Brugger, zow at acm dot org
UC Davis, Department of Computer Science

References

Brugger, S. T. and J. Chow (January 2007). An assessment of the DARPA IDS Evaluation Dataset using Snort. Technical Report CSE-2007-1, University of California, Davis, Department of Computer Science, Davis, CA. http://www.cs.ucdavis.edu/research/tech-reports/2007/CSE-2007-1.pdf.
Caswell, B. and M. Roesch (16 May 2004). Snort: The open source network intrusion detection system. http://www.snort.org/.
Chan, A. P., W. W. Y. Ng, D. S. Yeung, and E. C. C. Tsang (19-21 August 2005). Comparison of different fusion approaches for network intrusion detection using ensemble of RBFNN. In Proc. of 2005 Intl. Conf. on Machine Learning and Cybernetics, Volume 6, Guangzhou, China, pp. 3846-3851. IEEE.
Gao, H.-H., H.-H. Yang, and X.-Y. Wang (27-29 August 2005). Principal component neural networks based intrusion feature extraction and detection using SVM. In Advances in Natural Computation, Volume 3611 of Lecture Notes in Computer Science, Changsha, China, pp. 21-27. Springer.
Kayacik, H. G., A. N. Zincir-Heywood, and M. I. Heywood (June 2007). A hierarchical SOM-based intrusion detection system. Engineering Applications of Artificial Intelligence 20 (4), 439-451. Full text not available; analysis based on detailed abstract.
Lippmann, R. P., D. J. Fried, I. Graf, J. W. Haines, K. Kendall, D. McClung, D. Weber, S. Webster, D. Wyschogrod, R. K. Cunningham, and M. Zissman (January 2000). Evaluating intrusion detection systems: The 1998 DARPA off-line intrusion detection evaluation. In Proc. of the DARPA Information Survivability Conference and Exposition, Los Alamitos, CA. IEEE Computer Society Press.
Mahoney, M. V. and P. K. Chan (8-10 September 2003). An analysis of the 1999 DARPA/Lincoln Laboratory Evaluation Data for network anomaly detection. In G. Vigna, E. Jonsson, and C. Krugel (Eds.), Proc. 6th Intl. Symp. on Recent Advances in Intrusion Detection (RAID 2003), Volume 2820 of Lecture Notes in Computer Science, Pittsburgh, PA, pp. 220-237. Springer.
McHugh, J. (2000). Testing intrusion detection systems: a critique of the 1998 and 1999 DARPA intrusion detection system evaluations as performed by Lincoln Laboratory. ACM Trans. Information System Security 3 (4), 262-294.
Sarasamma, S. T., Q. A. Zhu, and J. Huff (April 2005). Hierarchical Kohonenen net for anomaly detection in network security. IEEE Trans. Syst., Man, Cybern. B 35 (2), 302-312.
Zhang, C., J. Jiang, and M. Kamel (May 2005). Intrusion detection using hierarchical neural networks. Pattern Recognition Letters 26 (6), 779-791.

All opinions expressed are solely the view of the author(s), and are not necessarily shared or endorsed by The University of California, Davis, or their employer(s).

Original article:

http://www.kdnuggets.com/news/2007/n18/4i.html
