[Spark][Python]DataFrame的左右连接例子
[Spark][Python]DataFrame的左右连接例子
$ hdfs dfs -cat people.json
{"name":"Alice","pcode":"94304"}
{"name":"Brayden","age":30,"pcode":"94304"}
{"name":"Carla","age":19,"pcoe":"10036"}
{"name":"Diana","age":46}
{"name":"Etienne","pcode":"94104"}
$ hdfs dfs -cat pcodes.json
{"pcode":"10036","city":"New York","state":"NY"}
{"pcode":"87501","city":"Santa Fe","state":"NM"}
{"pcode":"94304","city":"Palo Alto","state":"CA"}
{"pcode":"94104","city":"San Francisco","state":"CA"}
$pyspark
sqlContext = HiveContext(sc)
peopleDF = sqlContext.read.json("people.json")
peopleDF.limit(5).show()
+----+-------+-----+-----+
| age| name|pcode| pcoe|
+----+-------+-----+-----+
|null| Alice|94304| null|
| 30|Brayden|94304| null|
| 19| Carla| null|10036|
| 46| Diana| null| null|
|null|Etienne|94104| null|
+----+-------+-----+-----+
sqlContext = HiveContext(sc)
pcodesDF = sqlContext.read.json("pcodes.json")
pcodesDF.limit(5).show()
+-------------+-----+-----+
| city|pcode|state|
+-------------+-----+-----+
| New York|10036| NY|
| Santa Fe|87501| NM|
| Palo Alto|94304| CA|
|San Francisco|94104| CA|
+-------------+-----+-----+
mydf000 = peopleDF.join(pcodesDF,"pcode")
mydf000.limit(5).show()
+-----+----+-------+----+-------------+-----+
|pcode| age| name|pcoe| city|state|
+-----+----+-------+----+-------------+-----+
|94304|null| Alice|null| Palo Alto| CA|
|94304| 30|Brayden|null| Palo Alto| CA|
|94104|null|Etienne|null|San Francisco| CA|
+-----+----+-------+----+-------------+-----+
mydf001=peopleDF.join(pcodesDF,"pcode","leftsemi")
mydf001.limit(5).show()
+-----+----+-------+----+
|pcode| age| name|pcoe|
+-----+----+-------+----+
|94304|null| Alice|null|
|94304| 30|Brayden|null|
|94104|null|Etienne|null|
+-----+----+-------+----+
mydf002=peopleDF.join(pcodesDF,"pcode","left_outer")
mydf002.limit(5).show()
+-----+----+-------+-----+-------------+-----+
|pcode| age| name| pcoe| city|state|
+-----+----+-------+-----+-------------+-----+
|94304|null| Alice| null| Palo Alto| CA|
|94304| 30|Brayden| null| Palo Alto| CA|
| null| 19| Carla|10036| null| null|
| null| 46| Diana| null| null| null|
|94104|null|Etienne| null|San Francisco| CA|
+-----+----+-------+-----+-------------+-----+
mydf003=peopleDF.join(pcodesDF,"pcode","right_outer")
mydf003.limit(5).show()
+-----+----+-------+----+-------------+-----+
|pcode| age| name|pcoe| city|state|
+-----+----+-------+----+-------------+-----+
|10036|null| null|null| New York| NY|
|87501|null| null|null| Santa Fe| NM|
|94304|null| Alice|null| Palo Alto| CA|
|94304| 30|Brayden|null| Palo Alto| CA|
|94104|null|Etienne|null|San Francisco| CA|
+-----+----+-------+----+-------------+-----+
[Spark][Python]DataFrame的左右连接例子的更多相关文章
- [Spark][Python][DataFrame][RDD]DataFrame中抽取RDD例子
[Spark][Python][DataFrame][RDD]DataFrame中抽取RDD例子 sqlContext = HiveContext(sc) peopleDF = sqlContext. ...
- [Spark][Python][DataFrame][RDD]从DataFrame得到RDD的例子
[Spark][Python][DataFrame][RDD]从DataFrame得到RDD的例子 $ hdfs dfs -cat people.json {"name":&quo ...
- [Spark][Python][DataFrame][Write]DataFrame写入的例子
[Spark][Python][DataFrame][Write]DataFrame写入的例子 $ hdfs dfs -cat people.json {"name":" ...
- [Spark][Python][DataFrame][SQL]Spark对DataFrame直接执行SQL处理的例子
[Spark][Python][DataFrame][SQL]Spark对DataFrame直接执行SQL处理的例子 $cat people.json {"name":" ...
- [Spark][Python]DataFrame where 操作例子
[Spark][Python]DataFrame中取出有限个记录的例子 的 继续 [15]: myDF=peopleDF.where("age>21") In [16]: m ...
- [Spark][Python]DataFrame select 操作例子
[Spark][Python]DataFrame中取出有限个记录的例子 的 继续 In [4]: peopleDF.select("age")Out[4]: DataFrame[a ...
- [Spark][Python]DataFrame中取出有限个记录的例子
[Spark][Python]DataFrame中取出有限个记录的例子: sqlContext = HiveContext(sc) peopleDF = sqlContext.read.json(&q ...
- [Spark][Python]DataFrame select 操作例子II
[Spark][Python]DataFrame中取出有限个记录的 继续 In [4]: peopleDF.select("age","name") In ...
- [Spark][Python][RDD][DataFrame]从 RDD 构造 DataFrame 例子
[Spark][Python][RDD][DataFrame]从 RDD 构造 DataFrame 例子 from pyspark.sql.types import * schema = Struct ...
随机推荐
- Android Studio列表用法之一:ListView图文列表显示(实例)
前言: ListView这个列表控件在Android中是最常用的控件之一,几乎在所有的应用程序中都会使用到它. 目前正在做的一个记账本APP中就用到了它,主要是用它来呈现收支明细,是一个图文列表的呈现 ...
- Django中使用bookstarp框架(4)
Django中使用bookstarp框架(4) 注意:要使用bookstarp框架前,要先有css的基础 因为主要是研究后台的使用方法,就引入前端的框架,简化html上的耗时(主要是不想把时间浪费在前 ...
- mybatis学习系列三(部分)
1 forearch_oracle下批量保存(47) oracle批量插入 不支持values(),(),()方式 1.多个insert放在begin-end里面 begin insert into ...
- Python之岭回归
实现:# -*- coding: UTF-8 -*- import numpy as npfrom sklearn.linear_model import Ridge __author__ = 'zh ...
- c#所有部门及其下所部门生成树形图(递归算法获取或键值对方式获取)
部门数据库的设计: 代码: /// <summary> /// 获取部门(入口) /// </summary> /// <returns></returns& ...
- 通过DbVisualizer 工具运行DB2存储过程实现INSERT语句主键自增造数
1.需求简介 最近开发人员需要进行一批数据进行生产上SQL语句耗时过长问题的验证与优化.所以在性能测试库中批量建造数据,由于交易本身业务逻辑过于复杂以及需要各种授权,最后决定采用插表的方式完成. 2. ...
- Configure Monit on AWS CentOS7 to guard Squid proxy
Install Monit:sudo -iamazon-linux-extras install epelyum -y install monit Config monit: vim /etc/mon ...
- 扩展BootstapTable支持TreeGrid
(function ($) { 'use strict'; var sprintf = function (str) { var args = arguments, flag = true, i = ...
- python 机器学习中模型评估和调参
在做数据处理时,需要用到不同的手法,如特征标准化,主成分分析,等等会重复用到某些参数,sklearn中提供了管道,可以一次性的解决该问题 先展示先通常的做法 import pandas as pd f ...
- wordpress使用rsync加openvpn进行同步和备份
目录 wordpress使用rsync加openvpn进行同步和备份 环境 普通同步策略的弊端 改良的同步方案 从头细说起 备份数据 打包wordpress项目和备份 服务端安装rsync 服务端配置 ...