[Spark][Python]spark 从 avro 文件获取 Dataframe 的例子

从如下地址获取文件:
https://github.com/databricks/spark-avro/raw/master/src/test/resources/episodes.avro

导入到 hdfs 系统:
hdfs dfs -put episodes.avro

读入:
mydata001=sqlContext.read.format("com.databricks.spark.avro").load("episodes.avro")

交互式运行结果:

In [7]: mydata001=sqlContext.read.format("com.databricks.spark.avro").load("episodes.avro")
17/10/03 07:00:47 INFO avro.AvroRelation: Listing hdfs://localhost:8020/user/training/episodes.avro on driver

In [8]: type(mydata001)
Out[8]: pyspark.sql.dataframe.DataFrame

In [9]: mydata001.count()
17/10/03 07:01:05 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 65.5 KB, free 65.5 KB)
17/10/03 07:01:05 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 21.4 KB, free 86.9 KB)
17/10/03 07:01:05 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:40075 (size: 21.4 KB, free: 208.8 MB)
17/10/03 07:01:05 INFO spark.SparkContext: Created broadcast 3 from count at NativeMethodAccessorImpl.java:-2
17/10/03 07:01:05 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 230.4 KB, free 317.3 KB)
17/10/03 07:01:06 INFO storage.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 21.5 KB, free 338.8 KB)
17/10/03 07:01:06 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:40075 (size: 21.5 KB, free: 208.8 MB)
17/10/03 07:01:06 INFO spark.SparkContext: Created broadcast 4 from hadoopFile at AvroRelation.scala:121
17/10/03 07:01:06 INFO mapred.FileInputFormat: Total input paths to process : 1
17/10/03 07:01:07 INFO spark.SparkContext: Starting job: count at NativeMethodAccessorImpl.java:-2
17/10/03 07:01:07 INFO scheduler.DAGScheduler: Registering RDD 16 (count at NativeMethodAccessorImpl.java:-2)
17/10/03 07:01:07 INFO scheduler.DAGScheduler: Got job 1 (count at NativeMethodAccessorImpl.java:-2) with 1 output partitions
17/10/03 07:01:07 INFO scheduler.DAGScheduler: Final stage: ResultStage 3 (count at NativeMethodAccessorImpl.java:-2)
17/10/03 07:01:07 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 2)
17/10/03 07:01:07 INFO scheduler.DAGScheduler: Missing parents: List(ShuffleMapStage 2)
17/10/03 07:01:07 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 2 (MapPartitionsRDD[16] at count at NativeMethodAccessorImpl.java:-2), which has no missing parents
17/10/03 07:01:07 INFO storage.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 11.5 KB, free 350.3 KB)
17/10/03 07:01:07 INFO storage.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 5.7 KB, free 356.0 KB)
17/10/03 07:01:07 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on localhost:40075 (size: 5.7 KB, free: 208.8 MB)
17/10/03 07:01:07 INFO spark.SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1006
17/10/03 07:01:07 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 2 (MapPartitionsRDD[16] at count at NativeMethodAccessorImpl.java:-2)
17/10/03 07:01:07 INFO scheduler.TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
17/10/03 07:01:07 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, localhost, partition 0,PROCESS_LOCAL, 2249 bytes)
17/10/03 07:01:07 INFO executor.Executor: Running task 0.0 in stage 2.0 (TID 2)
17/10/03 07:01:07 INFO rdd.HadoopRDD: Input split: hdfs://localhost:8020/user/training/episodes.avro:0+597
17/10/03 07:01:08 INFO executor.Executor: Finished task 0.0 in stage 2.0 (TID 2). 2484 bytes result sent to driver
17/10/03 07:01:08 INFO scheduler.DAGScheduler: ShuffleMapStage 2 (count at NativeMethodAccessorImpl.java:-2) finished in 0.691 s
17/10/03 07:01:08 INFO scheduler.DAGScheduler: looking for newly runnable stages
17/10/03 07:01:08 INFO scheduler.DAGScheduler: running: Set()
17/10/03 07:01:08 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 3)
17/10/03 07:01:08 INFO scheduler.DAGScheduler: failed: Set()
17/10/03 07:01:08 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 693 ms on localhost (1/1)
17/10/03 07:01:08 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
17/10/03 07:01:08 INFO scheduler.DAGScheduler: Submitting ResultStage 3 (MapPartitionsRDD[19] at count at NativeMethodAccessorImpl.java:-2), which has no missing parents
17/10/03 07:01:08 INFO storage.MemoryStore: Block broadcast_6 stored as values in memory (estimated size 12.6 KB, free 368.5 KB)
17/10/03 07:01:08 INFO storage.MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 6.1 KB, free 374.7 KB)
17/10/03 07:01:08 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on localhost:40075 (size: 6.1 KB, free: 208.8 MB)
17/10/03 07:01:08 INFO spark.SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:1006
17/10/03 07:01:08 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 3 (MapPartitionsRDD[19] at count at NativeMethodAccessorImpl.java:-2)
17/10/03 07:01:08 INFO scheduler.TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
17/10/03 07:01:08 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 3.0 (TID 3, localhost, partition 0,NODE_LOCAL, 1999 bytes)
17/10/03 07:01:08 INFO executor.Executor: Running task 0.0 in stage 3.0 (TID 3)
17/10/03 07:01:08 INFO storage.ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
17/10/03 07:01:08 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
17/10/03 07:01:08 INFO executor.Executor: Finished task 0.0 in stage 3.0 (TID 3). 1666 bytes result sent to driver
17/10/03 07:01:08 INFO scheduler.DAGScheduler: ResultStage 3 (count at NativeMethodAccessorImpl.java:-2) finished in 0.344 s
17/10/03 07:01:08 INFO scheduler.DAGScheduler: Job 1 finished: count at NativeMethodAccessorImpl.java:-2, took 1.480495 s
17/10/03 07:01:08 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 3.0 (TID 3) in 345 ms on localhost (1/1)
17/10/03 07:01:08 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
Out[9]: 8

In [10]: mydata001.take(1)
17/10/03 07:01:18 INFO storage.MemoryStore: Block broadcast_7 stored as values in memory (estimated size 230.1 KB, free 604.8 KB)
17/10/03 07:01:18 INFO storage.MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 21.4 KB, free 626.2 KB)
17/10/03 07:01:18 INFO storage.BlockManagerInfo: Added broadcast_7_piece0 in memory on localhost:40075 (size: 21.4 KB, free: 208.7 MB)
17/10/03 07:01:18 INFO spark.SparkContext: Created broadcast 7 from take at <ipython-input-10-35862abbc114>:1
17/10/03 07:01:18 INFO storage.MemoryStore: Block broadcast_8 stored as values in memory (estimated size 230.5 KB, free 856.7 KB)
17/10/03 07:01:18 INFO storage.MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 21.5 KB, free 878.2 KB)
17/10/03 07:01:18 INFO storage.BlockManagerInfo: Added broadcast_8_piece0 in memory on localhost:40075 (size: 21.5 KB, free: 208.7 MB)
17/10/03 07:01:18 INFO spark.SparkContext: Created broadcast 8 from take at <ipython-input-10-35862abbc114>:1
17/10/03 07:01:18 INFO mapred.FileInputFormat: Total input paths to process : 1
17/10/03 07:01:18 INFO spark.SparkContext: Starting job: take at <ipython-input-10-35862abbc114>:1
17/10/03 07:01:18 INFO scheduler.DAGScheduler: Got job 2 (take at <ipython-input-10-35862abbc114>:1) with 1 output partitions
17/10/03 07:01:18 INFO scheduler.DAGScheduler: Final stage: ResultStage 4 (take at <ipython-input-10-35862abbc114>:1)
17/10/03 07:01:18 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/10/03 07:01:18 INFO scheduler.DAGScheduler: Missing parents: List()
17/10/03 07:01:18 INFO scheduler.DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[27] at take at <ipython-input-10-35862abbc114>:1), which has no missing parents
17/10/03 07:01:19 INFO storage.MemoryStore: Block broadcast_9 stored as values in memory (estimated size 5.6 KB, free 883.8 KB)
17/10/03 07:01:19 INFO storage.MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 3.0 KB, free 886.9 KB)
17/10/03 07:01:19 INFO storage.BlockManagerInfo: Added broadcast_9_piece0 in memory on localhost:40075 (size: 3.0 KB, free: 208.7 MB)
17/10/03 07:01:19 INFO spark.SparkContext: Created broadcast 9 from broadcast at DAGScheduler.scala:1006
17/10/03 07:01:19 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 4 (MapPartitionsRDD[27] at take at <ipython-input-10-35862abbc114>:1)
17/10/03 07:01:19 INFO scheduler.TaskSchedulerImpl: Adding task set 4.0 with 1 tasks
17/10/03 07:01:19 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 4.0 (TID 4, localhost, partition 0,PROCESS_LOCAL, 2260 bytes)
17/10/03 07:01:19 INFO executor.Executor: Running task 0.0 in stage 4.0 (TID 4)
17/10/03 07:01:19 INFO rdd.HadoopRDD: Input split: hdfs://localhost:8020/user/training/episodes.avro:0+597
17/10/03 07:01:19 INFO codegen.GenerateUnsafeProjection: Code generated in 124.624053 ms
17/10/03 07:01:19 INFO executor.Executor: Finished task 0.0 in stage 4.0 (TID 4). 2237 bytes result sent to driver
17/10/03 07:01:19 INFO scheduler.DAGScheduler: ResultStage 4 (take at <ipython-input-10-35862abbc114>:1) finished in 0.415 s
17/10/03 07:01:19 INFO scheduler.DAGScheduler: Job 2 finished: take at <ipython-input-10-35862abbc114>:1, took 0.565858 s
17/10/03 07:01:19 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 4.0 (TID 4) in 415 ms on localhost (1/1)
17/10/03 07:01:19 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool
Out[10]: [Row(title=u'The Eleventh Hour', air_date=u'3 April 2010', doctor=11)]

In [11]:

[Spark][Python]spark 从 avro 文件获取 Dataframe 的例子的更多相关文章

  1. [Spark][Python]Spark 访问 mysql , 生成 dataframe 的例子:

    [Spark][Python]Spark 访问 mysql , 生成 dataframe 的例子: mydf001=sqlContext.read.format("jdbc").o ...

  2. Spark中如何生成Avro文件

    研究spark的目的之一就是要取代MR,目前我司MR的一个典型应用场景即为生成Avro文件,然后加载到HIVE表里,所以如何在Spark中生成Avro文件,就是必然之路了. 我本人由于对java不熟, ...

  3. [Spark][Python]Spark Python 索引页

    Spark Python 索引页 为了查找方便,建立此页 === RDD 基本操作: [Spark][Python]groupByKey例子

  4. [spark][python]Spark map 处理

    map 就是对一个RDD的各个元素都施加处理,得到一个新的RDD 的过程 [training@localhost ~]$ cat names.txtYear,First Name,County,Sex ...

  5. python批量读取txt文件为DataFrame

    我们有时候会批量处理同一个文件夹下的文件,并且希望读取到一个文件里面便于我们计算操作.比方我有下图一系列的txt文件,我该如何把它们写入一个txt文件中并且读取为DataFrame格式呢? 首先我们要 ...

  6. Python抓取远程文件获取真实文件名

    用urllib下载远程文件并转存到hdfs服务器,在下载时,下载地址中不一定包含文件名,需要从连接信息中获取. 1 file_url = request.form.get('file_url') 2 ...

  7. Python——urllib函数网络文件获取

    */ * Copyright (c) 2016,烟台大学计算机与控制工程学院 * All rights reserved. * 文件名:text.cpp * 作者:常轩 * 微信公众号:Worldhe ...

  8. [Spark][Python]Spark Join 小例子

    [training@localhost ~]$ hdfs dfs -cat people.json {"name":"Alice","pcode&qu ...

  9. [Spark][python]以DataFrame方式打开Json文件的例子

    [Spark][python]以DataFrame方式打开Json文件的例子: [training@localhost ~]$ cat people.json{"name":&qu ...

随机推荐

  1. 安卓开发_浅谈主配置文件(AndroidManifest.xml)

    AndroidManifest.xml本质:是整个应用的主配置清单文件包含:该应用的包名,版本号,组件,权限等信息作用:记录该应用的相关的配置信息 一.常用标签(1).全局篇(包名,版本信息)(2). ...

  2. 13.1、多进程:进程锁Lock、信号量、事件

    进程锁: 为什么要有进程锁:假如现在有一台打印机,qq要使用打印机,word文档也要使用打印机,如果没有使用进程锁,可能会导致一些问题,比如QQ的任务打印到一半,Word插进来,于是打印出来的结果是各 ...

  3. JHipster技术简介

    本文简单介绍Jhipster是什么,为什么用Jhipster,怎么用Jhipster. WHAT - 技术栈 JHipster是什么 JHipster是一个开发平台,用于生成,开发,部署Spring ...

  4. html 知识整理

    一. 前言 本文全面介绍了html的定义.使用和具体常用标签. 参考资料:菜鸟教程 二.定义 html是HyperText Markup Language的简称,也就是超文本标记语言的缩写.通过htm ...

  5. [20171213]john破解oracle口令.txt

    [20171213]john破解oracle口令.txt --//跟别人讨论的oracle破解问题,我曾经提过不要使用6位字符以下的密码,其实不管那种系统低于6位口令非常容易破解.--//而且orac ...

  6. Java概述和项目演示

    Java概述和项目演示 1. 软件开发学习方法 多敲 多思考 解决问题 技术文档阅读(中文,英文) 项目文档 多阅读源码 2. 计算机 简称电脑,执行一系列指令的电子设备 3. 硬件组成 输入设备:键 ...

  7. 09LaTeX学习系列之---Latex 字体的设置

    目录 目录: (一) 字体族的设置 1.说明: 2.源代码: 3.输出结果: (二) 字体系列的设置 1.源代码: 2.输出效果: (三) 字体形状的设置 1.源代码: 2.输出效果: (四) 字体大 ...

  8. .net core 入坑经验 - 3、MVC Core之jQuery不能使用了?

    在View中添加了一段jQuery代码用来控制一个按钮的点击事件.发现运行时提示$对象没有定义,经过在浏览器右键查看源文件发现,script代码在引用jquery代码的上方,执行时jquery还未引入 ...

  9. MATLAB矩阵的LU分解及在解线性方程组中的应用

    作者:凯鲁嘎吉 - 博客园http://www.cnblogs.com/kailugaji/ 三.实验程序 五.解答(按如下顺序提交电子版) 1.(程序) (1)LU分解源程序: function [ ...

  10. C语言变量定义与数据溢出(初学者)

    1.变量定义的一般形式为:类型说明符.变量名标识符等:例:int a,b,c;(abc为整型变量) 在书写变量定义时应注意以下几点: (1)允许在一个类型说明符后,定义多个相同类型的变量.各变量之间用 ...