Apache Spark

Prerequisite
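
  • Spark runs on the JVM, so a working Java installation is required first (Spark 2.x generally targets Java 8); PySpark additionally needs a local Python installation.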

Download
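
  • Get a prebuilt Spark 2.x package (e.g. "Pre-built for Apache Hadoop") from the official downloads page: https://spark.apache.org/downloads.html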

Install

  • extract the archive
    • tar -zxvf spark-2......tgz
  • move the extracted directory into /usr/local
    • sudo mv spark-2... /usr/local
  • rename
    • sudo mv /usr/local/spark-2... /usr/local/spark
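  • verify (optional)
    • running /usr/local/spark/bin/pyspark should start an interactive PySpark shell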

PySpark in Jupyter

There are two ways to get PySpark available in a Jupyter Notebook:

  1. Configure the PySpark driver to use Jupyter Notebook: running pyspark will then automatically open a Jupyter Notebook.
  2. Open a regular Jupyter Notebook and load PySpark using the findspark package.

Method 1 - Configure PySpark driver

  • Update the PySpark driver environment variables by adding these lines to your ~/.bashrc (or ~/.bash_profile / ~/.profile):
export SPARK_PATH=/usr/local/spark

# Params below are used to launch the PySpark shell in Jupyter Notebook.
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"

function snotebook(){
  # --master sets the Spark master URL; local[2] runs Spark
  # locally with 2 worker threads, enough for local testing.
  $SPARK_PATH/bin/pyspark --master local[2]
}
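  • After reloading the shell configuration (e.g. source ~/.bashrc), running snotebook starts PySpark and, because of the driver variables above, opens it as a Jupyter Notebook in the browser.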

Method 2 - FindSpark package

  • There is another, more general way to use PySpark in a Jupyter Notebook: the findspark package locates Spark and adds PySpark to sys.path at runtime, so it also works from a normally launched notebook or any other Python process.
  • pip install findspark
import findspark
# findspark.init() locates Spark via the SPARK_HOME environment variable;
# pass the path explicitly if it is not set, e.g. findspark.init("/usr/local/spark").
findspark.init()

import pyspark
import random

sc = pyspark.SparkContext(appName="Pi")
num_samples = 100000000

# Monte Carlo estimate of pi: points are drawn uniformly from the unit
# square, and the fraction landing inside the unit quarter-circle
# approximates pi/4.
def inside(p):
  x, y = random.random(), random.random()
  return x*x + y*y < 1

count = sc.parallelize(range(0, num_samples)).filter(inside).count()

# A float literal keeps the division exact under Python 2 as well.
pi = 4.0 * count / num_samples
print(pi)

sc.stop()
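
In Spark 2.x the recommended entry point is SparkSession rather than a bare SparkContext. Below is a minimal sketch of the same local setup using it; the builder values (master URL, app name) are illustrative:

import findspark
findspark.init()

from pyspark.sql import SparkSession

# SparkSession wraps SparkContext and is the unified entry point in Spark 2.x.
spark = (SparkSession.builder
         .master("local[2]")  # run locally with 2 worker threads
         .appName("Pi")
         .getOrCreate())

sc = spark.sparkContext  # the underlying SparkContext, usable for the RDD code above

spark.stop()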
