- Nikhil Bhaskar
- July 22, 2021
How To Install Apache Spark On Ubuntu 20.04 LTS
Apache Spark is a free, open-source framework for distributed cluster computing and big-data workloads. It is an engine for large-scale data processing that provides high-level APIs in Java, Scala, and Python.
Install Apache Spark On Ubuntu
Update the system.
apt-get update
Install Java.
apt-get install openjdk-11-jdk
Check Java version.
java --version
Here is the command output.
openjdk 11.0.11
OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
OpenJDK 64-Bit Server VM (build 11.0.11+9-Ubuntu-0ubuntu2.20.04, mixed mode, sharing)
Install Scala.
apt-get install scala
Check Scala version.
scala -version
Here is the command output.
Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL
Launch the Scala REPL.
scala
Here is the command output.
Welcome to Scala 2.11.12 (OpenJDK 64-Bit Server VM, Java 11.0.11).
Type in expressions for evaluation. Or try :help.
scala>
Run the command.
scala> println("Hello World")
Hello World
Install Apache Spark
Download the file.
curl -O https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
Extract the downloaded file.
tar xvf spark-3.1.1-bin-hadoop3.2.tgz
Move the extracted directory to /opt/spark.
mv spark-3.1.1-bin-hadoop3.2/ /opt/spark
Open bashrc configuration file.
vim ~/.bashrc
Add the following lines:
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Reload the bashrc file to apply the changes.
source ~/.bashrc
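To confirm the two export lines took effect, a quick sanity check (assuming the /opt/spark location used above):

```shell
# Re-apply the exports and confirm the environment is set up
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

echo "$SPARK_HOME"                          # prints /opt/spark
echo "$PATH" | tr ':' '\n' | grep /opt/spark  # prints the bin and sbin entries
```

If the grep prints /opt/spark/bin and /opt/spark/sbin, the Spark scripts used in the next steps will resolve without full paths.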
Start a master server.
start-master.sh
Here is the command output.
starting org.apache.spark.deploy.master.Master, logging to
/opt/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-ubuntu.out
Open port number 8080 on ufw firewall.
ufw allow 8080/tcp
Access the Apache Spark web interface in a browser.
http://server-ip:8080/
Start the worker process. In Spark 3.1 this script was renamed to start-worker.sh (start-slave.sh remains as a deprecated alias); replace ubuntu with the master hostname shown on the web interface.
start-worker.sh spark://ubuntu:7077
Use Spark shell.
/opt/spark/bin/spark-shell
Use pyspark for Python.
/opt/spark/bin/pyspark