[ML] Load and preview large scale data
Ref: [Feature] Preprocessing tutorial
This mainly covers the steps that come before feature scaling (nondimensionalization).
Swipejobs is all about matching Jobs to Workers. Your challenge is to analyse the data provided and answer the questions below. You can access the data by opening the following S3 bucket: /* somewhere */
Please note that each Worker (worker parquet files) has one or more job tickets (jobticket parquet files) associated with it. Using these parquet files:
Correlation:
1. Is there a correlation between the jobticket.jobTicketState, jobticket.clickedCalloff and jobticket.assignedBySwipeJobs values across workers?
Prediction:
2. Looking at the Worker.profileLastUpdatedDate values, calculate an estimate of the workers who will update their profile in the next two weeks.
Requirement:
head -5 <file>
less <file>
PATH = "/home/ubuntu/work/rajdeepd-spark-ml/spark-ml/data"
user_data = sc.textFile("%s/ml-100k/u.user" % PATH)
user_fields = user_data.map(lambda line: line.split("|"))
PythonRDD[29] at RDD at PythonRDD.scala:53
[['', '', 'M', 'technician', ''],
['', '', 'F', 'other', ''],
['', '', 'M', 'writer', ''],
['', '', 'M', 'technician', ''],
['', '', 'F', 'other', '']]
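Spark SQL is still the preferred tool; see: [Spark] 03 - Spark SQL
Ref: Several ways to read and write Parquet files
2. Read and write Parquet in Hive with Spark SQL.
3. Read and write Parquet files with the old and new MapReduce APIs.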
Ref: How to read parquet data from S3 to spark dataframe Python?
from pyspark.sql import SparkSession
spark = (SparkSession.builder
    .appName("app name")
    .config("spark.some.config.option", True)
    .getOrCreate())
df = spark.read.parquet("s3://path/to/parquet/file.parquet")
from pyspark.sql.types import StructType, StructField, FloatType

# define the schema, corresponding to a line in the csv data file.
schema = StructType([
    StructField("long", FloatType(), nullable=True),
    StructField("lat", FloatType(), nullable=True),
    StructField("medage", FloatType(), nullable=True),
    StructField("totrooms", FloatType(), nullable=True),
    StructField("totbdrms", FloatType(), nullable=True),
    StructField("pop", FloatType(), nullable=True),
    StructField("houshlds", FloatType(), nullable=True),
    StructField("medinc", FloatType(), nullable=True),
    StructField("medhv", FloatType(), nullable=True)]
)
# The schema argument carries the column definitions
housing_df = spark.read.csv(path=HOUSING_DATA, schema=schema).cache()
# User-friendly table display
# Includes the column properties
MySQL (binlog) --> Maxwell --> Kafka --> HBase --> Parquet.
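(1) MySQL to HBase
(2) HBase to Parquet
Ref: How to move HBase tables to HDFS in Parquet format?
Ref: Which is faster for Spark to read, HBase or Parquet?
The RDD approach, as well as the standard higher-level approach: [Spark] 03 - Spark SQL
# Can be used to check for empty or invalid data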
def convert_year(x):
    try:
        return int(x[-4:])
    except ValueError:
        # there is a 'bad' data point with a blank year, which we set to 1900 and will filter out later
        return 1900

movie_fields = movie_data.map(lambda lines: lines.split("|"))
years = movie_fields.map(lambda fields: fields[2]).map(lambda x: convert_year(x))
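The effect of the fallback value is easier to see on a tiny sample. The sketch below runs in plain Python without Spark; the `dates` list is made up for illustration (it only imitates a release-date field with one blank entry), and the filtering step is an assumption about what "filter out later" means.

```python
def convert_year(x):
    """Return the trailing 4-digit year, or 1900 when it cannot be parsed."""
    try:
        return int(x[-4:])
    except ValueError:
        return 1900

# Hypothetical release-date strings; the blank one stands for the 'bad' record.
dates = ['01-Jan-1995', '', '14-Feb-1998']

years = [convert_year(d) for d in dates]

# Drop the sentinel value so the bad record does not distort later statistics.
valid_years = [y for y in years if y != 1900]

print(years)
print(valid_years)
```

Using a sentinel such as 1900 keeps the map step total (every record yields an integer), so the bad record can be removed in a separate, explicit filter.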
num_genders = user_fields.map(lambda fields: fields[2]).distinct().count()
num_occupations = user_fields.map(lambda fields: fields[3]).distinct().count()
num_zipcodes = user_fields.map(lambda fields: fields[4]).distinct().count()
The plt.hist method
import matplotlib.pyplot as plt

ages = user_fields.map(lambda x: int(x[1])).collect()
# `normed` was removed from newer matplotlib releases; `density` replaces it
plt.hist(ages, bins=30, color='gray', density=True)
fig = plt.gcf()
fig.set_size_inches(8, 5)
* The Pandas.plot method
Plot the distribution of the feature column "medage" as a bar chart.
result_df.toPandas().plot.bar(x='medage',figsize=(14, 6))
The reduceByKey method
import numpy as np

count_by_occupation = user_fields.map(lambda fields: (fields[3], 1)).reduceByKey(lambda x, y: x + y).collect()
# count_by_occupation2 = user_fields.map(lambda fields: fields[3]).countByValue()

# NumPy is a small-data tool, but that is fine here: after collect() there is
# only one (occupation, count) pair per occupation.
x_axis1 = np.array([c[0] for c in count_by_occupation])
y_axis1 = np.array([c[1] for c in count_by_occupation])
# sort by y_axis1
x_axis = x_axis1[np.argsort(y_axis1)]
y_axis = y_axis1[np.argsort(y_axis1)]

pos = np.arange(len(x_axis))
width = 1.0

ax = plt.axes()
ax.set_xticks(pos + (width / 2))
ax.set_xticklabels(x_axis)

plt.bar(pos, y_axis, width, color='lightblue')
fig = plt.gcf()
fig.set_size_inches(16, 5)
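To make the counting step concrete, here is a plain-Python sketch of what a reduce-by-key does to (key, 1) pairs, followed by the same ascending sort used for the bar chart. The `reduce_by_key` helper and the `occupations` list are hypothetical and only mimic the behaviour on a single machine; they are not Spark's implementation.

```python
# Hypothetical occupation values, standing in for fields[3] of each user record.
occupations = ['technician', 'other', 'writer', 'technician', 'other', 'other']

# Same shape as the map step above: one (key, 1) pair per record.
pairs = [(occ, 1) for occ in occupations]

def reduce_by_key(pairs, func):
    """Merge the values of each key with func, like a local reduceByKey."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result

counts = reduce_by_key(pairs, lambda x, y: x + y)

# Sort by count so the bars appear in ascending order.
sorted_counts = sorted(counts.items(), key=lambda kv: kv[1])
print(sorted_counts)
```

Because the result has one entry per distinct key, it is small enough to collect and plot locally even when the input is large.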
Getting one column from an RDD
# rating_data_raw is assumed to be the raw ratings RDD loaded earlier with sc.textFile
rating_data = rating_data_raw.map(lambda line: line.split("\t"))
ratings = rating_data.map(lambda fields: int(fields[2]))
num_ratings = ratings.count()  # needed for the mean below
max_rating = ratings.reduce(lambda x, y: max(x, y))
min_rating = ratings.reduce(lambda x, y: min(x, y))
mean_rating = ratings.reduce(lambda x, y: x + y) / float(num_ratings)
median_rating = np.median(ratings.collect())
We can also use the stats function to get some similar information to the above.
ratings.stats()
Out[11]:
(count: 100000, mean: 3.52986, stdev: 1.12566797076, max: 5.0, min: 1.0)
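For a quick cross-check on a small collected sample, the same figures can be computed with Python's `statistics` module. This is only a sketch: the `sample` list is made up, and it uses the population standard deviation (`pstdev`), on the assumption that this is what the `stdev` field of `stats()` reports.

```python
import statistics

# Hypothetical ratings; in the post these would come from ratings.collect().
sample = [3, 3, 1, 2, 1, 4, 2, 5, 3, 3]

summary = {
    "count": len(sample),
    "mean": statistics.mean(sample),
    "stdev": statistics.pstdev(sample),   # population standard deviation
    "max": max(sample),
    "min": min(sample),
    "median": statistics.median(sample),
}
print(summary)
```

This only makes sense once the data fits in driver memory; on the full RDD, `stats()` computes the summary in a single distributed pass.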
* Summary Statistics
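from pyspark.sql import functions as F

# describe() yields the count/mean/stddev/min/max rows shown below
housing_df.describe().select(
    "summary",
    F.round("medage", 4).alias("medage"),
    F.round("totrooms", 4).alias("totrooms"),
    F.round("totbdrms", 4).alias("totbdrms"),
    F.round("pop", 4).alias("pop"),
    F.round("houshlds", 4).alias("houshlds"),
    F.round("medinc", 4).alias("medinc"),
    F.round("medhv", 4).alias("medhv")).show()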
|summary| medage| totrooms|totbdrms| pop|houshlds| medinc| medhv|
| count|20640.0| 20640.0| 20640.0| 20640.0| 20640.0|20640.0| 20640.0|
| mean|28.6395|2635.7631| 537.898|1425.4767|499.5397| 3.8707|206855.8169|
| stddev|12.5856|2181.6153|421.2479|1132.4621|382.3298| 1.8998|115395.6159|
| min| 1.0| 2.0| 1.0| 3.0| 1.0| 0.4999| 14999.0|
| max| 52.0| 39320.0| 6445.0| 35682.0| 6082.0|15.0001| 500001.0|
Spark SQL's DataFrame is the main tool here; see: [Spark] 03 - Spark SQL
Ref: https://github.com/drabastomek/learningPySpark/blob/master/Chapter04/LearningPySpark_Chapter04.ipynb
1. Find duplicate rows
print('Count of rows: {0}'.format(df.count()))
print('Count of distinct rows: {0}'.format(df.distinct().count()))  # distinct over all columns
print('Count of distinct rows, ignoring id: {0}'.format(df.select([c for c in df.columns if c != 'id']).distinct().count()))  # distinct over a chosen subset of columns
2. Drop rows that are exact duplicates, id included
df = df.dropDuplicates()
3. Drop rows that are duplicates once the id is ignored
df = df.dropDuplicates(subset=[c for c in df.columns if c != 'id'])
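The difference between the two kinds of duplicates is easy to check on a handful of rows. The sketch below is plain Python with made-up rows and a hypothetical `drop_duplicates` helper; unlike this helper, Spark does not promise which of the duplicate rows it keeps.

```python
# Hypothetical rows: (id, weight, height, age, gender)
rows = [
    (1, 144.5, 5.9, 33, 'M'),
    (2, 167.2, 5.4, 45, 'M'),
    (3, 124.1, 5.2, 23, 'F'),
    (4, 144.5, 5.9, 33, 'M'),   # same as id 1 apart from the id
    (3, 124.1, 5.2, 23, 'F'),   # exact duplicate, id included
]

def drop_duplicates(rows, key=lambda r: r):
    """Keep the first row seen for each key, preserving input order."""
    seen = set()
    kept = []
    for r in rows:
        k = key(r)
        if k not in seen:
            seen.add(k)
            kept.append(r)
    return kept

exact = drop_duplicates(rows)                              # id included
ignoring_id = drop_duplicates(rows, key=lambda r: r[1:])   # id excluded

print(len(rows), len(exact), len(ignoring_id))
```

Comparing the three counts is the same diagnostic as the three `print` calls in step 1: a drop from the first to the second signals exact duplicates, and a further drop signals rows that differ only in their id.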
Build a typical "problem" data table.
df_miss = spark.createDataFrame([
(1, 143.5, 5.6, 28, 'M', 100000),
(2, 167.2, 5.4, 45, 'M', None),
(3, None , 5.2, None, None, None),
(4, 144.5, 5.9, 33, 'M', None),
(5, 133.2, 5.7, 54, 'F', None),
(6, 124.1, 5.2, None, 'F', None),
(7, 129.2, 5.3, 42, 'M', 76000),
], ['id', 'weight', 'height', 'age', 'gender', 'income'])
(1) Which rows have missing values?
df_miss.rdd.map(lambda row: (row['id'], sum(c is None for c in row))).collect()
[(1, 0), (2, 1), (3, 4), (4, 1), (5, 1), (6, 2), (7, 0)]
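The per-row count can be reproduced without Spark to confirm the result above. This is a plain-Python sketch over the same seven rows as `df_miss`; the variable names are my own.

```python
# Same data as df_miss: (id, weight, height, age, gender, income)
rows = [
    (1, 143.5, 5.6, 28,   'M',  100000),
    (2, 167.2, 5.4, 45,   'M',  None),
    (3, None,  5.2, None, None, None),
    (4, 144.5, 5.9, 33,   'M',  None),
    (5, 133.2, 5.7, 54,   'F',  None),
    (6, 124.1, 5.2, None, 'F',  None),
    (7, 129.2, 5.3, 42,   'M',  76000),
]

# For each row, pair the id with the number of None values in that row.
missing_per_row = [(r[0], sum(v is None for v in r)) for r in rows]
print(missing_per_row)
```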
(2) Look at the details
df_miss.where('id == 3').show()
| id|weight|height| age|gender|income|
| 3| null| 5.2|null| null| null|
(3) What is the missing rate of each column?
import pyspark.sql.functions as fn
df_miss.agg(*[
    (1 - (fn.count(c) / fn.count('*'))).alias(c + '_missing')
    for c in df_miss.columns
]).show()
|id_missing| weight_missing|height_missing| age_missing| gender_missing| income_missing|
| 0.0|0.1428571428571429| 0.0|0.2857142857142857|0.1428571428571429|0.7142857142857143|
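The missing rate is simply one minus the share of non-null values in each column, since counting a column skips nulls while counting `*` counts every row. A plain-Python sketch over the same rows as `df_miss` (variable names are my own):

```python
columns = ['id', 'weight', 'height', 'age', 'gender', 'income']
rows = [
    (1, 143.5, 5.6, 28,   'M',  100000),
    (2, 167.2, 5.4, 45,   'M',  None),
    (3, None,  5.2, None, None, None),
    (4, 144.5, 5.9, 33,   'M',  None),
    (5, 133.2, 5.7, 54,   'F',  None),
    (6, 124.1, 5.2, None, 'F',  None),
    (7, 129.2, 5.3, 42,   'M',  76000),
]

# 1 - (non-null values in the column / total rows)
missing_rate = {
    name + '_missing': 1 - sum(r[i] is not None for r in rows) / len(rows)
    for i, name in enumerate(columns)
}

for name, rate in missing_rate.items():
    print(name, round(rate, 4))
```

With roughly 71% of `income` missing, that column carries little usable information, which motivates the next step.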
(4) Drop features with too many missing values
df_miss_no_income = df_miss.select([c for c in df_miss.columns if c != 'income'])
| id|weight|height| age|gender|
| 1| 143.5| 5.6| 28| M|
| 2| 167.2| 5.4| 45| M|
| 3| null| 5.2|null| null|
| 4| 144.5| 5.9| 33| M|
| 5| 133.2| 5.7| 54| F|
| 6| 124.1| 5.2|null| F|
| 7| 129.2| 5.3| 42| M|
(5) Drop rows with too many missing values
| id|weight|height| age|gender|
| 1| 143.5| 5.6| 28| M|
| 2| 167.2| 5.4| 45| M|
| 4| 144.5| 5.9| 33| M|
| 5| 133.2| 5.7| 54| F|
| 6| 124.1| 5.2|null| F|
| 7| 129.2| 5.3| 42| M|
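The code for this step did not survive in the post. The output is consistent with keeping only rows that have at least three non-null values; in PySpark that would presumably be a `dropna` call with a `thresh` argument on `df_miss_no_income`, but that is an assumption. The rule itself can be sketched in plain Python, with `MIN_NON_NULL` as a hypothetical name for the threshold:

```python
# Same data as df_miss_no_income: (id, weight, height, age, gender)
rows = [
    (1, 143.5, 5.6, 28,   'M'),
    (2, 167.2, 5.4, 45,   'M'),
    (3, None,  5.2, None, None),
    (4, 144.5, 5.9, 33,   'M'),
    (5, 133.2, 5.7, 54,   'F'),
    (6, 124.1, 5.2, None, 'F'),
    (7, 129.2, 5.3, 42,   'M'),
]

MIN_NON_NULL = 3  # assumed threshold: a row needs at least this many values

kept = [r for r in rows if sum(v is not None for v in r) >= MIN_NON_NULL]

for r in kept:
    print(r)
```

Row 3 has only two non-null values (id and height) and is dropped, while row 6 still has four and is kept.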
(6) Fill in missing values
means = df_miss_no_income.agg(
    *[fn.mean(c).alias(c) for c in df_miss_no_income.columns if c != 'gender']
).toPandas().to_dict('records')[0]

means['gender'] = 'missing'
df_miss_no_income.fillna(means).show()
| id| weight|height|age| gender|
| 1| 143.5| 5.6| 28| M|
| 2| 167.2| 5.4| 45| M|
| 3|140.28333333333333| 5.2| 40|missing|
| 4| 144.5| 5.9| 33| M|
| 5| 133.2| 5.7| 54| F|
| 6| 124.1| 5.2| 40| F|
| 7| 129.2| 5.3| 42| M|
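The filled values in the table above can be verified by hand. The sketch below redoes the mean imputation in plain Python on the same rows; the helper name `column_mean` is my own, and truncating the age mean to an integer is an assumption made to mirror the `40` shown in the integer `age` column.

```python
columns = ['id', 'weight', 'height', 'age', 'gender']
rows = [
    (1, 143.5, 5.6, 28,   'M'),
    (2, 167.2, 5.4, 45,   'M'),
    (3, None,  5.2, None, None),
    (4, 144.5, 5.9, 33,   'M'),
    (5, 133.2, 5.7, 54,   'F'),
    (6, 124.1, 5.2, None, 'F'),
    (7, 129.2, 5.3, 42,   'M'),
]

def column_mean(name):
    """Mean of the non-null values in one column."""
    i = columns.index(name)
    values = [r[i] for r in rows if r[i] is not None]
    return sum(values) / len(values)

fill = {
    'weight': column_mean('weight'),
    'age': int(column_mean('age')),   # 40.4 truncated to 40 to stay an integer
    'gender': 'missing',              # categorical column gets its own label
}

# Replace each None with the fill value for its column.
filled = [
    tuple(fill.get(name) if value is None else value
          for name, value in zip(columns, row))
    for row in rows
]

for r in filled:
    print(r)
```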
Alternatively, fill in missing values with Imputer, as follows.
from pyspark.ml.feature import Imputer

df = spark.createDataFrame([
    (1.0, float("nan")),
    (2.0, float("nan")),
    (float("nan"), 3.0),
    (4.0, 4.0),
    (5.0, 5.0)
], ["a", "b"])

imputer = Imputer(inputCols=["a", "b"], outputCols=["out_a", "out_b"])
model = imputer.fit(df)
model.transform(df).show()
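1. Basic strategy
- Before labeling a point an outlier, first inspect the data through summary statistics and visualization.
- Data points that defy common sense can also be removed outright, e.g. age = 300.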
df_outliers = spark.createDataFrame([
(1, 143.5, 5.3, 28),
(2, 154.2, 5.5, 45),
(3, 342.3, 5.1, 99),
(4, 144.5, 5.5, 33),
(5, 133.2, 5.4, 54),
(6, 124.1, 5.1, 21),
(7, 129.2, 5.3, 42),
], ['id', 'weight', 'height', 'age'])
2. Define the valid range
cols = ['weight', 'height', 'age']
bounds = {}

for col in cols:
    # quartiles with a relative error of 0.05
    quantiles = df_outliers.approxQuantile(col, [0.25, 0.75], 0.05)
    IQR = quantiles[1] - quantiles[0]
    bounds[col] = [quantiles[0] - 1.5 * IQR, quantiles[1] + 1.5 * IQR]

bounds
{'age': [-11.0, 93.0],
'height': [4.499999999999999, 6.1000000000000005],
'weight': [91.69999999999999, 191.7]}
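The same 1.5 × IQR rule can be tried locally with exact quartiles. This plain-Python sketch uses `statistics.quantiles` (Python 3.8+) on the same values as `df_outliers`; because it computes exact rather than approximate quantiles, its bounds will not match the numbers printed above, though the same points end up outside them.

```python
import statistics

# Same values as df_outliers, one list per column.
data = {
    'weight': [143.5, 154.2, 342.3, 144.5, 133.2, 124.1, 129.2],
    'height': [5.3, 5.5, 5.1, 5.5, 5.4, 5.1, 5.3],
    'age':    [28, 45, 99, 33, 54, 21, 42],
}

bounds = {}
for col, values in data.items():
    # Exact quartiles; 'inclusive' treats the data as the whole population.
    q1, _, q3 = statistics.quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    bounds[col] = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Values falling outside the valid range of their column.
outliers = {
    col: [v for v in values if v < bounds[col][0] or v > bounds[col][1]]
    for col, values in data.items()
}
print(outliers)
```

The IQR rule relies only on the middle half of the data, so a single extreme value such as 342.3 does not stretch the bounds the way it would stretch a mean-and-standard-deviation rule.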
3. Filter by the valid range
outliers = df_outliers.select(*['id'] + [
    (
        (df_outliers[c] < bounds[c][0]) |
        (df_outliers[c] > bounds[c][1])
    ).alias(c + '_o') for c in cols
])
outliers.show()
| id|weight_o|height_o|age_o|
| 1| false| false|false|
| 2| false| false|false|
| 3| true| false| true|
| 4| false| false|false|
| 5| false| false|false|
| 6| false| false|false|
| 7| false| false|false|
df_outliers = df_outliers.join(outliers, on='id')
df_outliers.filter('weight_o').select('id', 'weight').show()
df_outliers.filter('age_o').select('id', 'age').show()
| id|weight|
| 3| 342.3|

| id|age|
| 3| 99|