Apache Spark
Prerequisite
- Java must be installed beforehand.
- How to install Java
Download
- Go to http://spark.apache.org/downloads.html and download the version and package type you want.
- wget https://www.apache.org/dyn/closer.lua/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
Install
- extract the archive
tar -zxvf spark-2.3.0-bin-hadoop2.7.tgz
- move the extracted directory into /usr/local
sudo mv spark-2.3.0-bin-hadoop2.7 /usr/local
- rename it
sudo mv /usr/local/spark-2.3.0-bin-hadoop2.7 /usr/local/spark
PySpark in Jupyter
There are two ways to get PySpark available in a Jupyter Notebook:
- Configure the PySpark driver to use Jupyter Notebook: running pyspark will then automatically open a Jupyter Notebook.
- Load a regular Jupyter Notebook and load PySpark using the findspark package.
Method 1 - Configure PySpark driver
- Update the PySpark driver environment variables by adding these lines to your .bashrc, .bash_profile, or .profile:
SPARK_PATH=/usr/local/spark
# Params below are used to launch the PySpark shell in Jupyter Notebook.
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
function snotebook(){
# The master param is used for setting the master node address.
# Here we launch Spark locally on 2 cores for local testing.
$SPARK_PATH/bin/pyspark --master local[2]
}
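After reloading the profile (for example, source ~/.bashrc), running snotebook starts Jupyter with the PySpark shell attached, so a SparkContext named sc already exists in every notebook. A minimal sanity check in a new notebook, assuming the local[2] master above, could look like this:

# sc is pre-created by the PySpark shell; these calls only inspect it.
print(sc.master)                          # expected: local[2]
print(sc.version)                         # installed Spark version, e.g. 2.3.0
print(sc.parallelize(range(10)).sum())    # 45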
Method 2 - findspark package (recommended)
- A more general way to use PySpark in a Jupyter Notebook is the findspark package, which locates the Spark installation at runtime so pyspark can be imported like any other Python library.
pip install findspark
import findspark
findspark.init()  # locate the Spark installation and add pyspark to sys.path

import pyspark
import random

sc = pyspark.SparkContext(appName="Pi")
num_samples = 100000000

def inside(p):
    # Draw a random point in the unit square and test whether it falls
    # inside the quarter circle of radius 1.
    x, y = random.random(), random.random()
    return x*x + y*y < 1

# Monte Carlo estimate: the fraction of sampled points inside the
# quarter circle approximates pi/4.
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()
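If SPARK_HOME is not set, findspark.init() can also be pointed straight at the install directory from the steps above. The sketch below additionally shows the Spark 2.x SparkSession entry point for DataFrame work; the master setting, app name, and sample rows are illustrative assumptions, not part of the original guide.

import findspark
findspark.init("/usr/local/spark")  # explicit path; omit to rely on SPARK_HOME

from pyspark.sql import SparkSession

# SparkSession wraps SparkContext and is the Spark 2.x entry point for DataFrames.
spark = SparkSession.builder.master("local[2]").appName("example").getOrCreate()

df = spark.createDataFrame([(1, "spark"), (2, "jupyter")], ["id", "name"])
df.show()

spark.stop()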