
pyspark.sql.SQLContext

Main entry point for DataFrame and SQL functionality.

[pyspark.sql.SQLContext]
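
A minimal sketch of the entry point in use (the master URL and app name here are illustrative; registerTempTable is the Spark 1.x spelling, replaced by createOrReplaceTempView in 2.x):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(master='local[2]', appName='sqlcontext-demo')  # illustrative settings
sql_ctx = SQLContext(sc)

# Create a small DataFrame, expose it as a temp table, and query it with SQL.
df = sql_ctx.createDataFrame([(2, 'Alice'), (5, 'Bob')], ['age', 'name'])
df.registerTempTable('people')
sql_ctx.sql('SELECT name FROM people WHERE age > 3').show()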


pyspark.sql.DataFrame

A distributed collection of data grouped into named columns.

Spark DataFrames and pandas DataFrames

Spark DataFrame operations are largely the same as pandas DataFrame operations [Pandas小记(6) ], as the sketch below illustrates.
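
For example, a row filter reads nearly the same in both APIs (a small sketch, assuming pandas_df and spark_df each have an age column):

# pandas: boolean-mask indexing, evaluated eagerly
adults_pd = pandas_df[pandas_df['age'] > 3]

# Spark: filter() with a Column expression, evaluated lazily until an action runs
adults_spark = spark_df.filter(spark_df['age'] > 3)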

Converting between the two

Converting from a pandas_df:

spark_df = sql_ctx.createDataFrame(pandas_df)  # createDataFrame is called on an SQLContext instance

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(master='local[8]', appName='kmeans')
sql_ctx = SQLContext(sc)
lldf_df = sql_ctx.createDataFrame(lldf)  # lldf is a pandas DataFrame; the result is a Spark DataFrame, not an RDD

In addition, createDataFrame can build a spark_df from a Python list, whose elements may be tuples, dicts, or Rows, or directly from an RDD, as sketched below.
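
A minimal sketch of those variants (the column names are illustrative, and sql_ctx/sc are the objects created above):

from pyspark.sql import Row

# From a list of tuples, with column names supplied explicitly
df1 = sql_ctx.createDataFrame([(2, 'Alice'), (5, 'Bob')], ['age', 'name'])

# From a list of Rows; the schema is inferred from the Row field names
df2 = sql_ctx.createDataFrame([Row(age=2, name='Alice'), Row(age=5, name='Bob')])

# From an RDD of tuples
df3 = sql_ctx.createDataFrame(sc.parallelize([(2, 'Alice'), (5, 'Bob')]), ['age', 'name'])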

Converting from a spark_df:

pandas_df = spark_df.toPandas()

toPandas()

Returns the contents of this DataFrame as a pandas.DataFrame.

Note that this method should only be used if the resulting pandas DataFrame is expected to be small, as all the data is loaded into the driver’s memory.

This is only available if Pandas is installed and available.

>>> df.toPandas()
   age   name
0    2  Alice
1    5    Bob

[Spark与Pandas中DataFrame对比(详细)]

Spark DataFrame methods

rdd

Returns the content as a pyspark.RDD of Row.
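
For example (assuming the same two-row df used in these docs):

>>> df.rdd.map(lambda row: row.age).collect()
[2, 5]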

rollup(*cols)

Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them. In the output below, null marks the subtotal rows that rollup adds: one count per name aggregated over all ages, plus a grand total in the first row.

>>> df.rollup("name", df.age).count().orderBy("name", "age").show()
+-----+----+-----+
| name| age|count|
+-----+----+-----+
| null|null|    2|
|Alice|null|    1|
|Alice|   2|    1|
|  Bob|null|    1|
|  Bob|   5|    1|
+-----+----+-----+

select(*cols)

Projects a set of expressions and returns a new DataFrame.

Parameters: cols – list of column names (string) or expressions (Column). If one of the column names is ‘*’, that column is expanded to include all columns in the current DataFrame.
>>> df.select('*').collect()
[Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
>>> df.select('name', 'age').collect()
[Row(name=u'Alice', age=2), Row(name=u'Bob', age=5)]
>>> df.select(df.name, (df.age + 10).alias('age')).collect()
[Row(name=u'Alice', age=12), Row(name=u'Bob', age=15)]

selectExpr(*expr)

Projects a set of SQL expressions and returns a new DataFrame.

This is a variant of select() that accepts SQL expressions.

>>> df.selectExpr("age * 2", "abs(age)").collect()
[Row((age * 2)=4, abs(age)=2), Row((age * 2)=10, abs(age)=5)]

toDF(*cols)

Returns a new DataFrame with the specified column names.

Parameters: cols – list of new column names (string)
>>> df.toDF('f1', 'f2').collect()
[Row(f1=2, f2=u'Alice'), Row(f1=5, f2=u'Bob')]

persist(storageLevel=StorageLevel(False, True, False, False, 1))

Sets the storage level to persist the DataFrame's values across operations after the first time it is computed. This can only be used to assign a new storage level if the RDD does not have a storage level set yet. If no storage level is specified, it defaults to MEMORY_ONLY.
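
A minimal usage sketch (StorageLevel is importable from the pyspark package):

from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)  # spill partitions to disk when they don't fit in memory
df.count()      # the first action computes the data and fills the cache
df.unpersist()  # release the cached data when it is no longer needed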

[pyspark.sql.DataFrame]

from: http://blog.csdn.net/pipisorry/article/details/53320669
