1

2 It provides a way to initialize H2O services on each node of the Spark cluster and to access data stored in both Spark and H2O data structures.
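For example, moving a table between the two systems is a one-liner in each direction. A minimal sketch, assuming Sparkling Water is installed; the file path and app name are purely illustrative:

from pysparkling import H2OContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ConversionDemo").getOrCreate()
hc = H2OContext.getOrCreate(spark)

df = spark.read.csv("some_file.csv", header=True)  # a Spark DataFrame
hf = hc.as_h2o_frame(df)     # hand it to H2O as an H2OFrame
df2 = hc.as_spark_frame(hf)  # and publish it back to Spark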

3 The internal backend is the easiest to deploy; however, when Spark or YARN kills an executor - which is not unusual - the entire H2O cluster goes down, because H2O does not support high availability.

4 The internal backend is the default behavior for Sparkling Water. The backend type can also be changed by calling the setExternalClusterMode() or setInternalClusterMode() method on the H2OConf class. H2OConf is a simple wrapper around SparkConf and inherits all properties of the Spark configuration.
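A minimal sketch of switching the backend through H2OConf (the exact constructor signature varies across Sparkling Water releases, so treat this as illustrative):

from pysparkling import H2OConf, H2OContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BackendDemo").getOrCreate()

# Wrap the Spark configuration and pick the backend explicitly
conf = H2OConf(spark).setInternalClusterMode()  # or .setExternalClusterMode()
hc = H2OContext.getOrCreate(spark, conf)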

5 It seems that installing Sparkling Water also pulls in pyspark and H2O: pip install h2o_pysparkling_2.3
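A quick sanity check that both dependencies really arrived with the install (just a sketch):

# both imports should succeed if pip pulled in the dependencies
import h2o
import pysparkling
print(h2o.__version__)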

=======================

1 Start Spark:  ./sbin/start-master.sh      ./sbin/start-slave.sh spark://zcy-VirtualBox:7077

2 First run a very simple script to check whether the environment is ready. To make it run successfully, I had to increase the virtual machine's memory (I changed it to 2 GB):

from pysparkling import *
from pyspark.sql import SparkSession
import h2o

# Initiate SparkSession
spark = SparkSession.builder.appName("App name").getOrCreate()

# Initiate H2OContext
hc = H2OContext.getOrCreate(spark)

# Stop H2O and Spark services
h2o.cluster().shutdown()
spark.stop()

./bin/spark-submit --master spark://zcy-VirtualBox:7077  --conf "spark.executor.memory=1g" /home/zcy/working/tst.py

The result is as follows:

3 Run a slightly more complex script:

import h2o
import os
from datetime import datetime
from pyspark import SparkConf, SparkFiles
from pyspark.sql import Row, SparkSession
from pysparkling import *

# Refine date column
def refine_date_col(data, col):
    data["Day"] = data[col].day()
    data["Month"] = data[col].month()
    data["Year"] = data[col].year()
    data["WeekNum"] = data[col].week()
    data["WeekDay"] = data[col].dayOfWeek()
    data["HourOfDay"] = data[col].hour()

    # Create weekend and season cols
    # Spring = Mar, Apr, May. Summer = Jun, Jul, Aug. Autumn = Sep, Oct. Winter = Nov, Dec, Jan, Feb.
    # data["Weekend"] = [1 if x in ("Sun", "Sat") else 0 for x in data["WeekDay"]]
    data["Weekend"] = ((data["WeekDay"] == "Sun") | (data["WeekDay"] == "Sat"))
    # Break points reconstructed from the season mapping in the comment above
    data["Season"] = data["Month"].cut([0, 2, 5, 8, 10, 12], ["Winter", "Spring", "Summer", "Autumn", "Winter"])

# This is just a helper function returning the path to data files
def _locate(file_name):
    if os.path.isfile("/home/zcy/working/data_tst/" + file_name):
        return "/home/zcy/working/data_tst/" + file_name
    else:
        print("eeeeeeeeeeee")

spark = SparkSession.builder.appName("ChicagoCrimeTest").getOrCreate()
# Start H2O services
h2oContext = H2OContext.getOrCreate(spark)

# Define file names
chicagoAllWeather = "chicagoAllWeather.csv"
chicagoCensus = "chicagoCensus.csv"
chicagoCrimes10k = "chicagoCrimes10k.csv.zip"

# h2o.import_file expects a cluster-relative path, so upload the local files instead
f_weather = h2o.upload_file(_locate(chicagoAllWeather))
f_census = h2o.upload_file(_locate(chicagoCensus))
f_crimes = h2o.upload_file(_locate(chicagoCrimes10k))

# Transform weather table
# Remove 1st column (date)
f_weather = f_weather[1:]

# Transform census table
# Remove all spaces from column names (causing problems in Spark SQL)
col_names = list(map(lambda s: s.strip().replace(' ', '_').replace('+', '_'), f_census.col_names))
# Update column names in the table
f_census.names = col_names

# Transform crimes table
# Drop useless columns
f_crimes = f_crimes[:]
# Set time zone to UTC for date manipulation
h2o.cluster().timezone = "Etc/UTC"
# Replace ' ' by '_' in column names
col_names = list(map(lambda s: s.replace(' ', '_'), f_crimes.col_names))
f_crimes.names = col_names
refine_date_col(f_crimes, "Date")
f_crimes = f_crimes.drop("Date")

# Expose H2O frames as Spark DataFrames
df_weather = h2oContext.as_spark_frame(f_weather)
df_census = h2oContext.as_spark_frame(f_census)
df_crimes = h2oContext.as_spark_frame(f_crimes)

# Register DataFrames as tables
df_weather.createOrReplaceTempView("chicagoWeather")
df_census.createOrReplaceTempView("chicagoCensus")
df_crimes.createOrReplaceTempView("chicagoCrime")

crimeWithWeather = spark.sql("""SELECT
a.Year, a.Month, a.Day, a.WeekNum, a.HourOfDay, a.Weekend, a.Season, a.WeekDay,
a.IUCR, a.Primary_Type, a.Location_Description, a.Community_Area, a.District,
a.Arrest, a.Domestic, a.Beat, a.Ward, a.FBI_Code,
b.minTemp, b.maxTemp, b.meanTemp,
c.PERCENT_AGED_UNDER_18_OR_OVER_64, c.PER_CAPITA_INCOME, c.HARDSHIP_INDEX,
c.PERCENT_OF_HOUSING_CROWDED, c.PERCENT_HOUSEHOLDS_BELOW_POVERTY,
c.PERCENT_AGED_16__UNEMPLOYED, c.PERCENT_AGED_25__WITHOUT_HIGH_SCHOOL_DIPLOMA
FROM chicagoCrime a
JOIN chicagoWeather b
ON a.Year = b.year AND a.Month = b.month AND a.Day = b.day
JOIN chicagoCensus c
ON a.Community_Area = c.Community_Area_Number""")

# Publish Spark DataFrame as H2OFrame with given name
crimeWithWeatherHF = h2oContext.as_h2o_frame(crimeWithWeather, "crimeWithWeatherTable")

# Transform selected String columns to categoricals
cat_cols = ["Arrest", "Season", "WeekDay", "Primary_Type", "Location_Description", "Domestic"]
for col in cat_cols:
    crimeWithWeatherHF[col] = crimeWithWeatherHF[col].asfactor()

# Split the frame into two - the first part as the training frame, the second as the validation frame
splits = crimeWithWeatherHF.split_frame(ratios=[0.8])
train = splits[0]
test = splits[1]
h2o.download_csv(train, '/home/zcy/working/data_tst/ret/train.csv')
h2o.download_csv(test, '/home/zcy/working/data_tst/ret/test.csv')

# Stop H2O and Spark services
h2o.cluster().shutdown()
spark.stop()

4 Run the script:

./bin/spark-submit --master spark://zcy-VirtualBox:7077  --conf "spark.executor.memory=1g" /home/zcy/working/sparkH2O.py
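After the job finishes, a quick way to sanity-check the 80/20 split is to compare the row counts of the two exported CSVs (a sketch, assuming pandas is available on the driver machine):

import pandas as pd

train = pd.read_csv('/home/zcy/working/data_tst/ret/train.csv')
test = pd.read_csv('/home/zcy/working/data_tst/ret/test.csv')
# split_frame(ratios=[0.8]) samples randomly, so expect roughly 0.8, not exactly
print(float(len(train)) / (len(train) + len(test)))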
