http://www.csmining.org/cdmc2016/

Data Mining Tasks Description

Task 1: 2016 e-News categorisation

This year, the dataset is sourced from six online news outlets:

The New Zealand Herald (www.nzherald.co.nz), Reuters (www.reuters.com), The Times (www.timesonline.co.uk), Yahoo News (news.yahoo.com), BBC (www.bbc.co.uk), and The Press (www.stuff.co.nz).

The five selected news categories are business, entertainment, sport, technology, and travel. Each document was labelled manually by skimming the text and deciding on its category. In the provided data files, each news item is a single line of plain text; in the training data, the last character of the line is the class label. All punctuation and symbols were removed when the dataset was prepared.
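The line format described above can be read with a few lines of Python. This is a minimal sketch, not the organisers' loader: the function name `split_text_and_label` is hypothetical, the sample tokens are made up (the real text is encrypted), and the description does not say which symbols are used as labels, so the digit in the example is an assumption.

```python
from typing import Tuple


def split_text_and_label(line: str) -> Tuple[str, str]:
    """Split one training line into (text, label).

    The task description states that the last character of each
    training line is the class label; everything before it is text.
    """
    # Drop only the line terminator so the label stays the last character.
    stripped = line.rstrip("\r\n")
    if not stripped:
        raise ValueError("empty line has no label")
    text = stripped[:-1].rstrip()  # remove the separator before the label, if any
    label = stripped[-1]
    return text, label


# Made-up tokens standing in for encrypted words; "3" is an assumed label symbol.
text, label = split_text_and_label("xq7 plm zzt 3\n")
print(repr(text), repr(label))
```

Test lines carry no label, so a loader for the test file would keep the whole line as text.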

Note that the dataset text is encrypted for fair play, and decryption is not part of this task. Any use of decryption techniques is prohibited and must not form part of a competition method. Results from participants found to have used them will be declared void.

The training dataset is summarised below:

Topic No. of News
Business 361
Entertainment 343
Sport 363
Technology 356
Travel 362

Task 2: UniteCloud Operation Log for Anomaly Detection

UniteCloud is a resilient private cloud infrastructure built at Unitec Institute of Technology, New Zealand, using OpenNebula for cloud orchestration and KVM for virtualization.

This dataset consists of operational data captured from the live UniteCloud servers at a sampling interval of one minute. Each sample has 243 features, which correspond to operational measurements from 243 sensors on the UniteCloud servers. Samples are labelled according to the anomalous events and anomaly categories identified in the collected log data. The supplied training dataset contains 57,654 samples, each with 243 sensor values; the last column holds the label, where a non-zero value indicates one of the seven anomalous events.

The goal of this task is to identify the various abnormal events accurately from a range of sensor log data without high computational cost.
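The sample layout described above (243 sensor values followed by a label column) can be separated as in the sketch below. The helper names are hypothetical, and the on-disk format (delimiter, header row) is not given in the description, so the code works on rows that have already been parsed into numbers.

```python
from typing import List, Sequence, Tuple

N_FEATURES = 243  # sensor values per sample, as stated in the task description


def split_features_and_label(row: Sequence[float]) -> Tuple[List[float], int]:
    """Split one parsed training row into (sensor values, label)."""
    if len(row) != N_FEATURES + 1:
        raise ValueError(f"expected {N_FEATURES + 1} columns, got {len(row)}")
    features = list(row[:N_FEATURES])
    label = int(row[N_FEATURES])  # the last column holds the label
    return features, label


def is_anomalous(label: int) -> bool:
    """Non-zero labels mark one of the seven anomalous events."""
    return label != 0


# Synthetic row for illustration only: 243 dummy readings and an assumed label 3.
features, label = split_features_and_label([0.5] * N_FEATURES + [3])
print(len(features), label, is_anomalous(label))
```

Collapsing the labels with `is_anomalous` gives a binary normal/abnormal view, which can be a useful first check before attempting the full eight-class problem.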

The statistical information of this dataset is summarized as:

No. of Samples No. of Features No. of Classes No. of Training No. of Testing
82,363 243 8 57,654 24,709

Task 3: Android Malware Classification

This dataset was created from a set of APK (application package) files collected from the Opera Mobile Store between January and September 2014. Just as Windows (PC) systems use .exe files to install software, Android uses APK files to install software on the Android operating system.

The permission system restricts access to privileged system resources and is considered the first barrier against malware. Application developers must explicitly declare the permissions in the AndroidManifest.xml file contained in the APK. All official Android permissions fall into four types: Normal, Dangerous, Signature, and SignatureOrSystem. Because dangerous permissions grant access to restricted resources and can cause harm if misused, they require the user's approval at installation.

To serve as input to a machine-learning algorithm, permissions are commonly coded as binary variables, i.e., each element of the vector takes one of two values: 1 for a requested permission and 0 otherwise. The number of possible Android permissions varies with the version of the OS. In this task, for each APK file under consideration, we provide the list of permissions declared in its AndroidManifest.xml file. The class label of the APK file (+1 if it is regarded as malicious and -1 otherwise) is determined by the detection results of security appliances hosted by VirusTotal. Note that adware is not counted as malware in our setting. Participants in the CDMC 2016 competition are invited to design a classifier that best matches this result.
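The binary coding described above can be sketched as follows. The function names are hypothetical and the two example apps are made up; the description does not specify the layout of the permissions file, so the code starts from permission lists that have already been read into memory.

```python
from typing import Dict, Iterable, List


def build_vocabulary(apps: Iterable[Iterable[str]]) -> Dict[str, int]:
    """Map every permission name seen in the data to a fixed column index."""
    names = sorted({perm for perms in apps for perm in perms})
    return {name: index for index, name in enumerate(names)}


def encode(perms: Iterable[str], vocab: Dict[str, int]) -> List[int]:
    """Return a 0/1 vector: 1 where the app requests the permission, 0 otherwise."""
    vector = [0] * len(vocab)
    for perm in perms:
        index = vocab.get(perm)
        if index is not None:  # permissions outside the vocabulary are ignored
            vector[index] = 1
    return vector


# Made-up permission lists for two apps, for illustration only.
apps = [
    ["android.permission.INTERNET", "android.permission.SEND_SMS"],
    ["android.permission.INTERNET"],
]
vocab = build_vocabulary(apps)
print([encode(perms, vocab) for perms in apps])
```

Building the vocabulary from the training set alone means that permissions appearing only in the test set are dropped, which is one simple way to keep the feature space fixed; other choices are possible.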

The statistical information of the dataset is summarized as:

No. of APK files No. of Permissions No. of Classes No. of Training No. of Testing
61,730 up to 583 2 30,920 30,810

The MD5 hashes of the data files are provided for checksum verification:
CDMC2016_AndroidPermissions.Train, md5(473f64d9e650e82325b1ce0216cc50c9)
CDMC2016_AndroidLabels.Train, md5(784b2ce7da61ff2935dca770c4bcbfb3)
CDMC2016_AndroidPermissions.Test, md5(192c70a8489c41fa95f5b95732fcdfb1)
