How to Install and Run PySpark in Jupyter Notebook on Windows
When I write PySpark code, I use Jupyter notebook to test my code before submitting a job on the cluster. In this post, I will show you how to install and run PySpark locally in Jupyter Notebook on Windows. I’ve tested this guide on a dozen Windows 7 and 10 PCs in different languages.
A. Items needed
1. A Spark distribution from spark.apache.org
2. Python and Jupyter Notebook. You can get both by installing the Python 3.x version of the Anaconda distribution.
3. winutils.exe, a Hadoop binary for Windows, from Steve Loughran's GitHub repo. Go to the folder matching the Hadoop version your Spark distribution was built for and find winutils.exe under /bin. For example, https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe .
4. The findspark Python module, which can be installed by running python -m pip install findspark in either the Windows command prompt or Git Bash once the Python from item 2 is installed. You can find the command prompt by searching for cmd in the search box. (A quick sanity-check sketch for items 2, 4, and 5 follows this list.)
5. If you don't have Java, or your Java version is 7.x or older, download and install Java from Oracle. I recommend getting the latest JDK (current version 9.0.1).
6. If you don't know how to unpack a .tgz file on Windows, download and install 7-Zip, then unpack the .tgz file from the Spark distribution in item 1 by right-clicking the file icon and selecting 7-Zip > Extract Here.
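Before moving on, it can help to confirm that the pieces above are actually visible to Python. The snippet below is an optional sanity check, not part of the original setup: it assumes Java is already on your PATH and that findspark has been installed with pip, and it only prints version and location information for items 2, 4, and 5.

```python
import subprocess
import sys

# Item 2: the Python interpreter Jupyter will use (should be the Anaconda 3.x install).
print(sys.version)

# Item 5: "java -version" writes to stderr, so merge it into stdout before decoding.
java_check = subprocess.run(["java", "-version"],
                            stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
print(java_check.stdout.decode(errors="replace"))

# Item 4: findspark should be importable after "python -m pip install findspark".
import findspark
print(findspark.__file__)
```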
B. Installing PySpark
After getting all the items in section A, let’s set up PySpark.
1. Unpack the .tgz file. For example, I unpacked it with 7-Zip from step A6 and put mine under D:\spark\spark-2.2.1-bin-hadoop2.7
2. Move the winutils.exe downloaded in step A3 to the \bin folder of the Spark distribution. For example, D:\spark\spark-2.2.1-bin-hadoop2.7\bin\winutils.exe
3. Add environment variables: the environment variables let Windows find where the files are when we start the PySpark kernel. You can find the environment variable settings by typing "environ…" in the search box.
The variables to add are, in my example,
| Name | Value |
|---|---|
| SPARK_HOME | D:\spark\spark-2.2.1-bin-hadoop2.7 |
| HADOOP_HOME | D:\spark\spark-2.2.1-bin-hadoop2.7 |
| PYSPARK_DRIVER_PYTHON | jupyter |
| PYSPARK_DRIVER_PYTHON_OPTS | notebook |
4. In the same environment variable settings window, look for the Path or PATH variable, click edit, and add D:\spark\spark-2.2.1-bin-hadoop2.7\bin to it. In Windows 7 you need to separate the values in Path with a semicolon ; between the values.
5. (Optional, if you see a Java-related error in step C) Find the installed Java JDK folder from step A5, for example, D:\Program Files\Java\jdk1.8.0_121, and add the following environment variable:
| Name | Value |
|---|---|
| JAVA_HOME | D:\Progra~1\Java\jdk1.8.0_121 |
If JDK is installed under \Program Files (x86), then replace the Progra~1 part by Progra~2 instead. In my experience, this error only occurs in Windows 7, and I think it’s because Spark couldn’t parse the space in the folder name.
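If you would rather not edit the Windows settings dialog, or you want to double-check the values before restarting anything, the same variables can be set for a single session from Python before calling findspark.init(). This is only a sketch using the example paths from this post, so substitute your own folders; as far as I can tell, the PYSPARK_DRIVER_PYTHON / PYSPARK_DRIVER_PYTHON_OPTS pair only matters when Spark is launched through the pyspark command, so it is left out here.

```python
import os

# Example paths from this post; adjust them to wherever you unpacked Spark and installed the JDK.
os.environ["SPARK_HOME"] = r"D:\spark\spark-2.2.1-bin-hadoop2.7"
os.environ["HADOOP_HOME"] = r"D:\spark\spark-2.2.1-bin-hadoop2.7"
# Only needed if you hit the Java-related error; note the short Progra~1 form without spaces.
os.environ["JAVA_HOME"] = r"D:\Progra~1\Java\jdk1.8.0_121"

# Quick check that winutils.exe from step A3 ended up in the Spark \bin folder.
winutils = os.path.join(os.environ["SPARK_HOME"], "bin", "winutils.exe")
print("winutils.exe found:", os.path.exists(winutils))

import findspark
findspark.init()  # reads SPARK_HOME and puts pyspark on sys.path for this session
```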
Edit (1/23/19): You might also find Gerard’s comment helpful: http://disq.us/p/1z5qou4
C. Running PySpark in Jupyter Notebook
To run Jupyter Notebook, open the Windows command prompt or Git Bash and run jupyter notebook. If you use Anaconda Navigator to open Jupyter Notebook instead, you might see a "Java gateway process exited before sending the driver its port number" error from PySpark later in this section. Fall back to the Windows command prompt if that happens.
Once inside Jupyter notebook, open a Python 3 notebook
In the notebook, run the following code:

```python
import findspark
findspark.init()  # locate the Spark installation via the SPARK_HOME environment variable

import pyspark  # only import pyspark after findspark.init()
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # start a local Spark session
df = spark.sql('''select 'spark' as hello ''')
df.show()
```
When you press run, it might trigger a Windows firewall pop-up. I pressed cancel on the pop-up as blocking the connection doesn’t affect PySpark.
If you see the following output, then you have installed PySpark on your Windows system!
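The screenshot of the expected output does not carry over here, but the df.show() call at the end should print a one-row, one-column DataFrame along these lines:

```
+-----+
|hello|
+-----+
|spark|
+-----+
```

If that table appears, the SparkSession is working and you can start experimenting with your own DataFrames locally before submitting anything to a cluster.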