Data Mining Tasks Description

Task 1: 2016 e-News categorisation

This year's dataset is sourced from six online news outlets:

The New Zealand Herald (www.nzherald.co.nz), Reuters (www.reuters.com), The Times (www.timesonline.co.uk), Yahoo News (news.yahoo.com), BBC (www.bbc.co.uk) and The Press (www.stuff.co.nz).

The five selected news categories are business, entertainment, sport, technology, and travel. Each document in the dataset was labelled manually by skimming the text and deciding its category. In the provided data files, each news item is a single line of plain text whose last character is the class label (training data only). All punctuation and symbols were removed when the dataset was built.
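The line layout described above (text first, class label as the last character) can be parsed as in the following minimal sketch. The function name `split_news_line` and the sample line are hypothetical; the actual label characters and whether a space precedes the label are not stated in the task description, so the sketch tolerates both.

```python
from typing import Tuple


def split_news_line(line: str) -> Tuple[str, str]:
    """Split one training line into (text, label).

    Assumption: the label is the final character of the line, as the
    task description states; any trailing newline is ignored.
    """
    stripped = line.rstrip("\r\n")
    if not stripped:
        raise ValueError("empty line has no label")
    text = stripped[:-1].rstrip()  # drop the label and any separator before it
    label = stripped[-1]
    return text, label


# Hypothetical encrypted line; the real encoding of text and labels may differ.
text, label = split_news_line("qzv kpl wrt mmn 3\n")
```

The test files carry no label, so this split applies only to the training data.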

Note that the dataset text is encrypted for the sake of fair play, and decryption is not the aim of this task. Any use of decryption techniques is therefore prohibited and must not form part of the methods used in the competition. The results of any participant found to have done so will be declared void.

The training dataset is summarised below:

Topic	No. of News
Business	361
Entertainment	343
Sport	363
Technology	356
Travel	362

Task 2: UniteCloud Operation Log for Anomaly Detection

UniteCloud is a resilient private cloud infrastructure built at Unitec Institute of Technology, New Zealand, using OpenNebula for cloud orchestration and KVM for virtualization.

This dataset consists of operational data captured from a live UniteCloud server at a sampling interval of one minute. Each sample has 243 features, which correspond to operational measurements from 243 sensors on the UniteCloud servers. Samples are labelled according to the anomalous events and anomaly categories identified in the collected log data. The supplied training dataset contains 57,654 samples, each with 243 sensor values, and a non-zero label in the last column indicates one of the seven anomalous events.

The goal of this task is to identify the various abnormal events accurately from the sensor logs without incurring high computational cost.

The statistical information of this dataset is summarized as:

No. of Samples	No. of Features	No. of Classes	No. of Training Samples	No. of Testing Samples
82,363	243	8	57,654	24,709
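As a minimal sketch of the training layout described for this task (243 sensor values followed by a label in the last column, where non-zero means anomalous), the reader below separates features from labels. The comma delimiter, the function name `read_samples`, and the three-feature demo rows are assumptions for illustration; the actual file format is not specified in the task description.

```python
import csv
import io

N_FEATURES = 243  # number of sensor values per sample, from the task description


def read_samples(handle, n_features=N_FEATURES, delimiter=","):
    """Yield (features, label) pairs from a delimited text stream.

    Assumption: one sample per row, label in the last column.
    """
    for row in csv.reader(handle, delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        if len(row) != n_features + 1:
            raise ValueError(f"expected {n_features + 1} columns, got {len(row)}")
        features = [float(value) for value in row[:-1]]
        label = int(float(row[-1]))  # 0 = normal, non-zero = one of the anomalous events
        yield features, label


# Hypothetical demo with 3 features instead of 243.
demo = io.StringIO("0.5,1.0,2.0,0\n0.7,9.9,2.1,3\n")
samples = list(read_samples(demo, n_features=3))
anomalous = [label != 0 for _, label in samples]
```

Reading row by row keeps memory use flat, which is in keeping with the task's emphasis on low computational cost.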

Task 3: Android Malware Classification

This dataset was created from a set of APK (application package) files collected from the Opera Mobile Store between January and September 2014. Just as Windows (PC) systems use .exe files to install software, Android uses APK files to install software on the Android operating system.

The permission system restricts access to privileged system resources and is regarded as the first barrier against malware. Application developers must explicitly declare the permissions in the AndroidManifest.xml file contained in the APK. All official Android permissions fall into four types: Normal, Dangerous, Signature and SignatureOrSystem. Because dangerous permissions grant access to restricted resources and can have a negative impact if used incorrectly, they require the user's approval at installation.

To serve as input to a machine-learning algorithm, permissions are commonly coded as binary variables, i.e., each element of the feature vector takes one of two values: 1 for a requested permission and 0 otherwise. The number of possible Android permissions varies with the version of the OS. In this task, for each APK file under consideration, we provide the list of permissions declared in its AndroidManifest.xml file. The class label of the APK file (+1 if it is regarded as malicious and -1 otherwise) is determined by the detection results of the security appliances hosted by VirusTotal. Note that adware is not counted as malware in our setting. Participants in the CDMC 2016 competition are invited to design a classifier that best matches this result.
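The binary coding described above can be sketched as follows. The helper names `build_vocabulary` and `encode` are hypothetical, and the example apps are made up; how the competition files actually list the permissions is not specified here, so the sketch starts from permission lists that are already parsed.

```python
def build_vocabulary(permission_lists):
    """Map every permission seen in the training data to a column index."""
    unique = sorted({perm for perms in permission_lists for perm in perms})
    return {perm: index for index, perm in enumerate(unique)}


def encode(permissions, vocabulary):
    """Return a 0/1 vector: 1 where the APK requests the permission, else 0."""
    vector = [0] * len(vocabulary)
    for perm in permissions:
        index = vocabulary.get(perm)
        if index is not None:  # permissions unseen in training are ignored
            vector[index] = 1
    return vector


# Made-up example apps; real APKs may declare many more permissions.
train_apps = [
    ["android.permission.INTERNET", "android.permission.SEND_SMS"],
    ["android.permission.INTERNET"],
]
vocabulary = build_vocabulary(train_apps)
vectors = [encode(perms, vocabulary) for perms in train_apps]
```

Building the vocabulary from the training set only is one possible design choice; it fixes the vector length, and a permission that appears only in the test set is simply dropped.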

The statistical information of the dataset is summarized as:

No. of APK files	No. of Permissions	No. of Classes	No. of Training	No. of Testing
61,730	up to 583	2	30,920	30,810

MD5 hashes are also provided in case you need to verify the downloaded files:
CDMC2016_AndroidPermissions.Train, md5(473f64d9e650e82325b1ce0216cc50c9)
CDMC2016_AndroidLabels.Train, md5(784b2ce7da61ff2935dca770c4bcbfb3)
CDMC2016_AndroidPermissions.Test, md5(192c70a8489c41fa95f5b95732fcdfb1)
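A minimal sketch for checking a downloaded file against the hashes listed above, using Python's standard `hashlib`. The function names `md5_of_file` and `verify` are hypothetical, and the sketch assumes the files sit in the current directory under the names given.

```python
import hashlib

# Expected MD5 digests, copied from the list above.
EXPECTED_MD5 = {
    "CDMC2016_AndroidPermissions.Train": "473f64d9e650e82325b1ce0216cc50c9",
    "CDMC2016_AndroidLabels.Train": "784b2ce7da61ff2935dca770c4bcbfb3",
    "CDMC2016_AndroidPermissions.Test": "192c70a8489c41fa95f5b95732fcdfb1",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected_hex.lower()
```

For example, `verify("CDMC2016_AndroidLabels.Train", EXPECTED_MD5["CDMC2016_AndroidLabels.Train"])` should return `True` for an intact download.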
