[Spark][Python] DataFrame left and right join examples

$ hdfs dfs -cat people.json

{"name":"Alice","pcode":"94304"}
{"name":"Brayden","age":30,"pcode":"94304"}
{"name":"Carla","age":19,"pcoe":"10036"}
{"name":"Diana","age":46}
{"name":"Etienne","pcode":"94104"}

$ hdfs dfs -cat pcodes.json

{"pcode":"10036","city":"New York","state":"NY"}
{"pcode":"87501","city":"Santa Fe","state":"NM"}
{"pcode":"94304","city":"Palo Alto","state":"CA"}
{"pcode":"94104","city":"San Francisco","state":"CA"}

$ pyspark

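from pyspark.sql import HiveContext  # explicit import, in case the shell has not preloaded it

sqlContext = HiveContext(sc)
# Load people.json; the schema is inferred from the keys found in the records
peopleDF = sqlContext.read.json("people.json")
peopleDF.limit(5).show()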

+----+-------+-----+-----+
| age|   name|pcode| pcoe|
+----+-------+-----+-----+
|null|  Alice|94304| null|
|  30|Brayden|94304| null|
|  19|  Carla| null|10036|
|  46|  Diana| null| null|
|null|Etienne|94104| null|
+----+-------+-----+-----+
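
The extra pcoe column appears because Carla's record spells the key as "pcoe" instead of "pcode", and the inferred schema contains every key seen in any record. Below is a minimal pure-Python sketch of that behavior, assuming inference amounts to taking the union of all keys and filling the missing ones with null (None). The helper name infer_columns is hypothetical and not part of Spark.

```python
import json

# The five records from people.json, including the misspelled "pcoe" key.
lines = [
    '{"name":"Alice","pcode":"94304"}',
    '{"name":"Brayden","age":30,"pcode":"94304"}',
    '{"name":"Carla","age":19,"pcoe":"10036"}',
    '{"name":"Diana","age":46}',
    '{"name":"Etienne","pcode":"94104"}',
]
records = [json.loads(line) for line in lines]


def infer_columns(rows):
    """Hypothetical helper: union of all keys, sorted by name."""
    return sorted({key for row in rows for key in row})


columns = infer_columns(records)
# Keys missing from a record become None, which show() prints as null.
table = [[row.get(col) for col in columns] for row in records]

print(columns)
for row in table:
    print(row)
```

Because Carla has no "pcode" key at all, her pcode is null, which is why she finds no match in the joins that follow.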

# Reuse the sqlContext created above to load the postal-code lookup data
pcodesDF = sqlContext.read.json("pcodes.json")
pcodesDF.limit(5).show()

+-------------+-----+-----+
|         city|pcode|state|
+-------------+-----+-----+
|     New York|10036|   NY|
|     Santa Fe|87501|   NM|
|    Palo Alto|94304|   CA|
|San Francisco|94104|   CA|
+-------------+-----+-----+

# Inner join (the default join type) on the shared pcode column
mydf000 = peopleDF.join(pcodesDF, "pcode")
mydf000.limit(5).show()

+-----+----+-------+----+-------------+-----+
|pcode| age|   name|pcoe|         city|state|
+-----+----+-------+----+-------------+-----+
|94304|null|  Alice|null|    Palo Alto|   CA|
|94304|  30|Brayden|null|    Palo Alto|   CA|
|94104|null|Etienne|null|San Francisco|   CA|
+-----+----+-------+----+-------------+-----+
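
Carla and Diana are missing from this result because an inner join keeps only rows whose pcode matches on both sides, and a null key never matches anything. The following is a rough pure-Python sketch of that semantics, not Spark's actual implementation. The function name inner_join is hypothetical, and the pcoe column is left out for brevity.

```python
people = [
    {"name": "Alice", "age": None, "pcode": "94304"},
    {"name": "Brayden", "age": 30, "pcode": "94304"},
    {"name": "Carla", "age": 19, "pcode": None},
    {"name": "Diana", "age": 46, "pcode": None},
    {"name": "Etienne", "age": None, "pcode": "94104"},
]
pcodes = [
    {"pcode": "10036", "city": "New York", "state": "NY"},
    {"pcode": "87501", "city": "Santa Fe", "state": "NM"},
    {"pcode": "94304", "city": "Palo Alto", "state": "CA"},
    {"pcode": "94104", "city": "San Francisco", "state": "CA"},
]


def inner_join(left, right, key):
    """Hypothetical helper: hash-based inner join on a single key."""
    # Index the right side by key so each left row is looked up once.
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    joined = []
    for l in left:
        if l[key] is None:  # a null key matches nothing
            continue
        for r in index.get(l[key], []):
            joined.append({**l, **r})
    return joined


result = inner_join(people, pcodes, "pcode")
for row in result:
    print(row["pcode"], row["name"], row["city"], row["state"])
```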

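# Left semi join: keep peopleDF rows that have a match in pcodesDF; only peopleDF columns are returned
mydf001 = peopleDF.join(pcodesDF, "pcode", "leftsemi")
mydf001.limit(5).show()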

+-----+----+-------+----+
|pcode| age|   name|pcoe|
+-----+----+-------+----+
|94304|null|  Alice|null|
|94304|  30|Brayden|null|
|94104|null|Etienne|null|
+-----+----+-------+----+
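
A left semi join behaves like a filter: it keeps the left rows that have at least one match and returns only the left columns. One consequence is that duplicate keys on the right side do not multiply the left rows, unlike an inner join. The sketch below illustrates this in plain Python. The duplicate 94304 entry is made up for illustration and is not in pcodes.json, and both function names are hypothetical.

```python
# (name, pcode) pairs; Carla has no pcode.
people = [("Alice", "94304"), ("Carla", None), ("Etienne", "94104")]

# (pcode, city) pairs; the second 94304 row is an invented duplicate.
pcodes = [
    ("94304", "Palo Alto"),
    ("94304", "Palo Alto (duplicate)"),
    ("94104", "San Francisco"),
]


def left_semi_join(left, right):
    """Keep left rows whose key exists on the right; left columns only."""
    right_keys = {pcode for pcode, _ in right if pcode is not None}
    return [row for row in left if row[1] in right_keys]


def inner_join(left, right):
    """Emit one output row per matching (left, right) pair."""
    return [
        (name, pcode, city)
        for name, pcode in left
        for right_pcode, city in right
        if pcode is not None and pcode == right_pcode
    ]


semi = left_semi_join(people, pcodes)
inner = inner_join(people, pcodes)
print(len(semi), semi)
print(len(inner), inner)
```

Here the semi join returns Alice once, while the inner join returns her twice, once per matching right row.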

# Left outer join: keep every row of peopleDF; unmatched rows get null city and state
mydf002 = peopleDF.join(pcodesDF, "pcode", "left_outer")
mydf002.limit(5).show()

+-----+----+-------+-----+-------------+-----+
|pcode| age|   name| pcoe|         city|state|
+-----+----+-------+-----+-------------+-----+
|94304|null|  Alice| null|    Palo Alto|   CA|
|94304|  30|Brayden| null|    Palo Alto|   CA|
| null|  19|  Carla|10036|         null| null|
| null|  46|  Diana| null|         null| null|
|94104|null|Etienne| null|San Francisco|   CA|
+-----+----+-------+-----+-------------+-----+

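# Right outer join: keep every row of pcodesDF; postal codes with no person get null age and name
mydf003 = peopleDF.join(pcodesDF, "pcode", "right_outer")
mydf003.limit(5).show()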

+-----+----+-------+----+-------------+-----+
|pcode| age|   name|pcoe|         city|state|
+-----+----+-------+----+-------------+-----+
|10036|null|   null|null|     New York|   NY|
|87501|null|   null|null|     Santa Fe|   NM|
|94304|null|  Alice|null|    Palo Alto|   CA|
|94304|  30|Brayden|null|    Palo Alto|   CA|
|94104|null|Etienne|null|San Francisco|   CA|
+-----+----+-------+----+-------------+-----+
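
The two outer joins differ only in which side is preserved: left_outer keeps every person, right_outer keeps every postal code, and the columns of the missing side are filled with null. A minimal pure-Python sketch of that idea follows, treating a right outer join as a left outer join with the sides swapped. The helper left_outer_join is a hypothetical name, the pcoe column is omitted, and this is not how Spark executes the join.

```python
people = [
    {"name": "Alice", "age": None, "pcode": "94304"},
    {"name": "Brayden", "age": 30, "pcode": "94304"},
    {"name": "Carla", "age": 19, "pcode": None},
    {"name": "Diana", "age": 46, "pcode": None},
    {"name": "Etienne", "age": None, "pcode": "94104"},
]
pcodes = [
    {"pcode": "10036", "city": "New York", "state": "NY"},
    {"pcode": "87501", "city": "Santa Fe", "state": "NM"},
    {"pcode": "94304", "city": "Palo Alto", "state": "CA"},
    {"pcode": "94104", "city": "San Francisco", "state": "CA"},
]


def left_outer_join(left, right, key, right_cols):
    """Hypothetical helper: keep every left row, padding with None."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    out = []
    for l in left:
        # A null key never matches, so it always takes the padding branch.
        matches = index.get(l[key], []) if l[key] is not None else []
        if matches:
            out.extend({**l, **r} for r in matches)
        else:
            padding = {c: None for c in right_cols if c != key}
            out.append({**l, **padding})
    return out


left_rows = left_outer_join(people, pcodes, "pcode", ["pcode", "city", "state"])
# Right outer join: swap the sides so that pcodes is the preserved table.
right_rows = left_outer_join(pcodes, people, "pcode", ["name", "age", "pcode"])

for row in left_rows:
    print("left_outer ", row["pcode"], row["name"], row["city"])
for row in right_rows:
    print("right_outer", row["pcode"], row["name"], row["city"])
```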
