Spark Tutorial (5): Getting Started with PySpark
Start PySpark:
[root@node1 ~]# pyspark
Python 2.7.5 (default, Nov 6 2016, 00:28:07)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Python version 2.7.5 (default, Nov 6 2016 00:28:07)
SparkContext available as sc, HiveContext available as sqlContext.
The shell session already provides sc and sqlContext, as the last line of the startup output shows:
SparkContext available as sc, HiveContext available as sqlContext.
Run the script:
>>> from __future__ import print_function
>>> import os
>>> import sys
>>> from pyspark import SparkContext
>>> from pyspark.sql import SQLContext
>>> from pyspark.sql.types import Row, StructField, StructType, StringType, IntegerType
# RDD is created from a list of rows
>>> some_rdd = sc.parallelize([Row(name="John", age=19),Row(name="Smith", age=23),Row(name="Sarah", age=18)])
# Infer schema from the first row, create a DataFrame and print the schema
>>> some_df = sqlContext.createDataFrame(some_rdd)
>>> some_df.printSchema()
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)

# Another RDD is created from a list of tuples
>>> another_rdd = sc.parallelize([("John", 19), ("Smith", 23), ("Sarah", 18)])
# Schema with two fields - person_name and person_age
>>> schema = StructType([StructField("person_name", StringType(), False),StructField("person_age", IntegerType(), False)])
# Create a DataFrame by applying the schema to the RDD and print the schema
>>> another_df = sqlContext.createDataFrame(another_rdd, schema)
>>> another_df.printSchema()
root
 |-- person_name: string (nullable = false)
 |-- person_age: integer (nullable = false)
Download the people.json file from GitHub:
Then upload it to HDFS:
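Neither the file content nor the upload command is shown at this point. As a stand-in, the sketch below recreates people.json locally, assuming it holds the same three records as the sample file shipped under examples/src/main/resources in the Spark source tree, and uses plain Python (no Spark) to preview which rows an age filter of 13 to 19 would keep. The helper names write_people_json and names_in_age_range are hypothetical.

```python
import json
import os
import tempfile

# Assumed content: the three records of Spark's bundled sample people.json.
PEOPLE = [
    {"name": "Michael"},
    {"name": "Andy", "age": 30},
    {"name": "Justin", "age": 19},
]


def write_people_json(path):
    """Write one JSON object per line, the layout Spark's JSON reader expects."""
    with open(path, "w") as f:
        for person in PEOPLE:
            f.write(json.dumps(person) + "\n")


def names_in_age_range(path, low=13, high=19):
    """Return names whose age lies in [low, high]; rows without an age are skipped."""
    names = []
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            age = record.get("age")
            if age is not None and low <= age <= high:
                names.append(record["name"])
    return names


# Write the file to a temporary directory and preview the filter result.
local_path = os.path.join(tempfile.mkdtemp(), "people.json")
write_people_json(local_path)
print(names_in_age_range(local_path))
```

A file produced this way could then be copied to HDFS with a command along the lines of hdfs dfs -put people.json /user/cf/people.json. The target path is the one the script reads; the exact command the author used is not shown.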
Continue running the script:
# A JSON dataset is pointed to by path.
# The path can be either a single text file or a directory storing text files.
>>> if len(sys.argv) < 2:
...     path = "/user/cf/people.json"
... else:
...     path = sys.argv[1]
...
# Create a DataFrame from the file(s) pointed to by path
>>> people = sqlContext.jsonFile(path)
[Stage 5:> (0 + 1) / 2]19/07/04 10:34:33 WARN spark.ExecutorAllocationManager: No stages are running, but numRunningTasks != 0
# The inferred schema can be visualized using the printSchema() method.
>>> people.printSchema()
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)

# Register this DataFrame as a table.
>>> people.registerAsTable("people")
/opt/cloudera/parcels/CDH-5.14.2-1.cdh5.14.2.p0.3/lib/spark/python/pyspark/sql/dataframe.py:142: UserWarning: Use registerTempTable instead of registerAsTable.
  warnings.warn("Use registerTempTable instead of registerAsTable.")
# SQL statements can be run by using the sql methods provided by sqlContext
>>> teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
>>> for each in teenagers.collect():
...     print(each[0])
...
Justin
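The UserWarning in the session above recommends registerTempTable, and jsonFile has likewise been deprecated since Spark 1.4 in favor of sqlContext.read.json. Below is a minimal sketch of the same steps using the non-deprecated Spark 1.x calls. The helper name find_teenagers is hypothetical, and it takes the shell's sqlContext as an argument rather than creating one. Later Spark releases provide SparkSession and createOrReplaceTempView instead, which this sketch does not cover.

```python
def find_teenagers(sqlContext, path):
    """Load a JSON dataset and return the names of people aged 13 to 19.

    sqlContext is assumed to be the SQLContext/HiveContext that the
    pyspark shell exposes; path points to a JSON file or directory.
    """
    # DataFrameReader.json replaces the deprecated sqlContext.jsonFile
    people = sqlContext.read.json(path)
    # registerTempTable replaces the deprecated registerAsTable
    people.registerTempTable("people")
    teenagers = sqlContext.sql(
        "SELECT name FROM people WHERE age >= 13 AND age <= 19")
    # Each collected Row has a single column, the name
    return [row[0] for row in teenagers.collect()]
```

In the pyspark shell this could be called as print(find_teenagers(sqlContext, "/user/cf/people.json")), which would be expected to list the same name as the session output above.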
Stop the SparkContext when finished:
>>> sc.stop()
>>>
Reference program:
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

from __future__ import print_function

import os
import sys

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import Row, StructField, StructType, StringType, IntegerType


if __name__ == "__main__":
    sc = SparkContext(appName="PythonSQL")
    sqlContext = SQLContext(sc)

    # RDD is created from a list of rows
    some_rdd = sc.parallelize([Row(name="John", age=19),
                               Row(name="Smith", age=23),
                               Row(name="Sarah", age=18)])
    # Infer schema from the first row, create a DataFrame and print the schema
    some_df = sqlContext.createDataFrame(some_rdd)
    some_df.printSchema()
    # Output observed in the shell session above:
    # root
    #  |-- age: long (nullable = true)
    #  |-- name: string (nullable = true)

    # Another RDD is created from a list of tuples
    another_rdd = sc.parallelize([("John", 19), ("Smith", 23), ("Sarah", 18)])
    # Schema with two fields - person_name and person_age
    schema = StructType([StructField("person_name", StringType(), False),
                         StructField("person_age", IntegerType(), False)])
    # Create a DataFrame by applying the schema to the RDD and print the schema
    another_df = sqlContext.createDataFrame(another_rdd, schema)
    another_df.printSchema()
    # root
    #  |-- person_name: string (nullable = false)
    #  |-- person_age: integer (nullable = false)

    # A JSON dataset is pointed to by path.
    # The path can be either a single text file or a directory storing text files.
    if len(sys.argv) < 2:
        path = "file://" + \
            os.path.join(os.environ['SPARK_HOME'], "examples/src/main/resources/people.json")
    else:
        path = sys.argv[1]
    # Create a DataFrame from the file(s) pointed to by path
    people = sqlContext.jsonFile(path)

    # The inferred schema can be visualized using the printSchema() method.
    people.printSchema()
    # root
    #  |-- age: long (nullable = true)
    #  |-- name: string (nullable = true)

    # Register this DataFrame as a table.
    people.registerAsTable("people")

    # SQL statements can be run by using the sql methods provided by sqlContext
    teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

    for each in teenagers.collect():
        print(each[0])

    sc.stop()