Apache Spark
Prerequisite
- Java must be installed beforehand.
- How to install Java
Download
- Go to http://spark.apache.org/downloads.html and download the version and package type you want.
- wget https://www.apache.org/dyn/closer.lua/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
Install
- extract the archive
tar -zxvf spark-2.3.0-bin-hadoop2.7.tgz
- move the extracted directory into /usr/local
sudo mv spark-2.3.0-bin-hadoop2.7 /usr/local
- rename it
sudo mv /usr/local/spark-2.3.0-bin-hadoop2.7 /usr/local/spark
PySpark in Jupyter
There are two ways to get PySpark available in a Jupyter Notebook:
- Configure the PySpark driver to use Jupyter Notebook: running pyspark will then automatically open a Jupyter Notebook.
- Load a regular Jupyter Notebook and load PySpark using the findspark package.
Method 1 - Configure PySpark driver
- Update the PySpark driver environment variables by adding these lines to your .bashrc, .bash_profile, or .profile:
SPARK_PATH=/usr/local/spark
# Params below are used to launch the PySpark shell in Jupyter Notebook.
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
function snotebook(){
# The master param is used for setting the master node address.
# Here we launch Spark locally on 2 cores for local testing.
$SPARK_PATH/bin/pyspark --master local[2]
}
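After reloading the profile (for example, source ~/.bashrc), running snotebook starts Jupyter with the PySpark shell attached, so a SparkContext named sc already exists in every notebook. A minimal sanity check in a new notebook, assuming the local[2] master above, could look like this:

# sc is pre-created by the PySpark shell; these calls only inspect it.
print(sc.master)                          # expected: local[2]
print(sc.version)                         # installed Spark version, e.g. 2.3.0
print(sc.parallelize(range(10)).sum())    # 45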
Method 2 - findspark package (recommended)
- A more general way to use PySpark in a Jupyter Notebook is the findspark package, which locates the Spark installation at runtime so pyspark can be imported like any other Python library.
pip install findspark
import findspark
findspark.init()  # locate the Spark installation and add pyspark to sys.path

import pyspark
import random

sc = pyspark.SparkContext(appName="Pi")
num_samples = 100000000

def inside(p):
    # Draw a random point in the unit square and test whether it falls
    # inside the quarter circle of radius 1.
    x, y = random.random(), random.random()
    return x*x + y*y < 1

# Monte Carlo estimate: the fraction of sampled points inside the
# quarter circle approximates pi/4.
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()
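If SPARK_HOME is not set, findspark.init() can also be pointed straight at the install directory from the steps above. The sketch below additionally shows the Spark 2.x SparkSession entry point for DataFrame work; the master setting, app name, and sample rows are illustrative assumptions, not part of the original guide.

import findspark
findspark.init("/usr/local/spark")  # explicit path; omit to rely on SPARK_HOME

from pyspark.sql import SparkSession

# SparkSession wraps SparkContext and is the Spark 2.x entry point for DataFrames.
spark = SparkSession.builder.master("local[2]").appName("example").getOrCreate()

df = spark.createDataFrame([(1, "spark"), (2, "jupyter")], ["id", "name"])
df.show()

spark.stop()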