PySpark – A Beginner’s Guide to Apache Spark and Big Data

The article “PySpark – A Beginner’s Guide to Apache Spark and Big Data” first appeared on AlgoTrading101 Blog.

Excerpt

What is PySpark?

PySpark is a Python library that serves as an interface for Apache Spark.

What is Apache Spark?

Apache Spark is an open-source distributed computing engine that is used for Big Data processing.

It is a general-purpose engine as it supports Python, R, SQL, Scala, and Java.

What is Apache Spark used for?

Apache Spark is often used with Big Data as it enables distributed computing and offers built-in data streaming, machine learning, SQL, and graph processing. It is commonly used by data engineers and data scientists.

What is PySpark used for?

PySpark is used as the Python API for Apache Spark. This allows us to leave the Apache Spark shell and work in our preferred Python IDE without losing what Apache Spark has to offer.
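
As a quick taste of what that looks like (the installation steps are covered below), here is a minimal sketch that starts a local Spark session straight from Python; the application name is just an illustrative placeholder:

from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session from plain Python code.
# "MyFirstApp" is only an illustrative application name.
spark = (
    SparkSession.builder
    .appName("MyFirstApp")
    .getOrCreate()
)

print(spark.version)  # prints the version of the underlying Apache Spark engine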

Is Apache Spark free?

Apache Spark is an open-source engine and thus it is completely free to download and use.

Why should I use Apache Spark?

  • Apache Spark offers distributed computing
  • Apache Spark is easy to use
  • Apache Spark is free
  • Offers advanced analytics
  • Is a very powerful engine
  • Offers machine learning, streaming, SQL, and graph processing modules
  • Is applicable to various programming languages like Python, R, Java…
  • Has a good community and is advancing as a product

Why shouldn’t I use Apache Spark?

  • Apache Spark can have scaling problems with compute-intensive jobs
  • It can consume a lot of memory
  • Can have issues with small files
  • Is constrained by the number of available ML algorithms

Why should I use PySpark?

  • PySpark is easy to use
  • PySpark can handle synchronization errors
  • The learning curve isn’t as steep as with other languages like Scala
  • Can easily handle big data
  • Has all the pros of Apache Spark added to it

Why shouldn’t I use PySpark?

  • PySpark can be less efficient as it uses Python
  • It is slow when compared to other languages like Scala
  • It can be replaced with other libraries like Dask that easily integrate with Pandas (depends on the problem and dataset)
  • Suffers from all the cons of Apache Spark

What are some Apache Spark alternatives?

Apache Spark can be replaced with a number of alternatives, such as the following:

  • Apache Hadoop
  • Google BigQuery
  • Amazon EMR
  • IBM Analytics Engine
  • Apache Flink
  • Lumify
  • Presto
  • Apache Pig

What are some Apache Spark clients?

Some of the programming clients that have Apache Spark APIs are Scala, Java, Python (PySpark), R (SparkR), and SQL.

How to get started with Apache Spark?

In order to get started with Apache Spark and the PySpark library, we will need to go through multiple steps. This can be a bit confusing if you have never done something similar but don’t worry. We will do it together!

Prerequisites

The first things that we need to take care of are the prerequisites that we need in order to make Apache Spark and PySpark work. These prerequisites are Java 8, Python 3, and something to extract .tar files.

Let’s see what Java version you’re rocking on your computer. If you’re on Windows like me, go to Start, type cmd, and enter the Command Prompt. Once there, type the following command:

java -version

And you’ll get a message similar to this one that will specify your Java version:

java version "1.8.0_281"

If you didn’t get a response, you don’t have Java installed. If your Java is outdated (< 8) or non-existent, go over to the following link and download the latest version.

If you, for some reason, don’t have Python installed, here is a link to download it. And lastly, for the extraction of .tar files, I use 7-Zip. You can use anything that does the job.

Download and set-up

Go over to the following link and download the 3.0.3 Spark release that is pre-built for Apache Hadoop 2.7.

Now click the blue link that is written under number 3 and select one of the mirrors that you would like to download from. While it is downloading, create a folder named Spark in your root drive (C:).

Go into that folder and extract the downloaded file into it. The next thing that you need to add is the winutils.exe file for the underlying Hadoop version that Spark will be utilizing.

To do this, go over to the following GitHub page and select the version of Hadoop that we downloaded. After that, scroll down until you see the winutils.exe file. Click on it and download it.

Now create a new folder in your root drive and name it “Hadoop”, then create a folder inside of that folder and name it “bin”. Inside the bin folder paste the winutils.exe file that we just downloaded.

Now for the final steps, we need to configure our environment variables. Environment variables allow us to add Spark and Hadoop to our system PATH. This way we can call Spark from Python as they will be on the same PATH.

Click Start and type “environment”. Then select the “Edit the system environment variables” option. A new window will pop up and in the lower right corner of it select “Environment Variables”.

A new window will appear that will show your environment variables. In my case, I already have Spark there.

To add it there, click on “New”. Then set the name to be “SPARK_HOME” and for the variable value add the path where you extracted your Spark. It should be something like this: C:\Spark\spark... Click OK.

For the next step, be careful not to change your existing Path entries. Click on “Path” in your user variables and then select “Edit”. A new window will appear; click on the “New” button and then write this: %SPARK_HOME%\bin

You’ve successfully added Spark to your PATH! Now, repeat this process for both Hadoop and Java. The only things that will change are their folder locations and the variable names you give them (HADOOP_HOME and JAVA_HOME).

Your end product should be three new variables (SPARK_HOME, HADOOP_HOME, and JAVA_HOME), each pointing to its own folder.

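If you want to double-check from Python that the three variables are visible, a quick sketch like the one below will print them (the exact paths will, of course, depend on where you placed everything):

import os

# Print the variables configured above; anything missing shows up as None
for var in ("SPARK_HOME", "HADOOP_HOME", "JAVA_HOME"):
    print(var, "=", os.environ.get(var))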

Launch Spark

Now let us launch our Spark and see it in its full glory. Start a new command prompt and then enter spark-shell to launch Spark. A new window will appear with Spark up and running.

Now open up your browser and go to http://localhost:4040/ (or the Web UI address that spark-shell printed, which may use your machine’s name instead of localhost). This will open the Apache Spark UI, where you will be able to see all the information you might need.

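As a side note, if you are working from PySpark rather than the Scala shell, the running session can tell you where its UI lives. A minimal sketch, assuming PySpark is already installed:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("UiCheck").getOrCreate()

# Ask the running SparkContext where its web UI is being served
print(spark.sparkContext.uiWebUrl)  # typically http://localhost:4040 or http://<hostname>:4040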

What are the main components of Apache Spark?

There are several components that make up Apache Spark, and they are the following:

  • Spark Core – is the main part of Apache Spark that provides in-built memory computing and does all the basic I/O functions, memory management, and much more.
  • Spark Streaming – allows for data streaming that can go up to a couple of gigabytes per second.
  • Spark SQL – allows the use of SQL (Structured Query Language) for easier data manipulation and analysis (see the sketch after this list).
  • MLlib – packs several machine learning models that can be used in several programming languages.
  • GraphX – provides several methods for implementing graph theory to your dataset (i.e. network analysis).
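
To give a feel for how these pieces fit together from Python, here is a minimal sketch that touches Spark SQL: it builds a tiny DataFrame by hand, registers it as a temporary view, and queries it with plain SQL. The column names and values are made up purely for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ComponentsDemo").getOrCreate()

# A tiny hand-made DataFrame (illustrative data only)
prices = spark.createDataFrame(
    [("AAPL", 150.0), ("MSFT", 250.0), ("AAPL", 155.0)],
    ["ticker", "price"],
)

# Spark SQL: register the DataFrame as a view and query it with SQL
prices.createOrReplaceTempView("prices")
spark.sql("SELECT ticker, AVG(price) AS avg_price FROM prices GROUP BY ticker").show()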

What is the Apache Spark RDD?

Apache Spark RDD (Resilient Distributed Dataset) is a data structure that serves as the main building block. An RDD can be seen as an immutable and partitioned set of data values that can be processed on a distributed system.

To conclude, they are resilient because they are fault-tolerant (lost partitions can be recomputed from their lineage), distributed as they have partitions that can be processed in a distributed manner, and datasets as they hold our data.
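
A minimal sketch of working with an RDD directly: we distribute a small Python list across two partitions, apply a transformation, and collect the results back to the driver:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDDemo").getOrCreate()
sc = spark.sparkContext

# Distribute a small Python list across two partitions as an RDD
numbers = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)

# Transformations are lazy; collect() triggers the actual computation
squares = numbers.map(lambda x: x * x)
print(squares.collect())  # [1, 4, 9, 16, 25]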

How to use PySpark in Jupyter Notebooks?

To use PySpark in your Jupyter notebook, all you need to do is to install the PySpark pip package with the following command:

pip install pyspark

As your Python is located on your system PATH, it will work with your Apache Spark installation. If you want to use something like Google Colab, you will need to run the following block of code that will set up Apache Spark for you:

# Install Java 8, download and extract Spark 3.0.3, then point the notebook at them
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# Older Spark releases are served from the Apache archive
!wget -q https://archive.apache.org/dist/spark/spark-3.0.3/spark-3.0.3-bin-hadoop2.7.tgz
!tar xf spark-3.0.3-bin-hadoop2.7.tgz
!pip install -q findspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.3-bin-hadoop2.7"
import findspark
findspark.init()  # makes the SPARK_HOME installation importable as pyspark

If you want to use Kaggle like we’re going to do, you can just go straight to the “pip install pyspark” command as Apache Spark will be ready for use.
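
Whichever environment you pick, a first session looks roughly the same. The sketch below starts a session and reads a CSV file into a DataFrame; "your_dataset.csv" is a placeholder for whatever file you actually have:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("FirstNotebook").getOrCreate()

# Read a CSV into a DataFrame; "your_dataset.csv" is a placeholder path
df = spark.read.csv("your_dataset.csv", header=True, inferSchema=True)

df.printSchema()  # inspect the inferred column types
df.show(5)        # preview the first five rows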

Visit AlgoTrading101 Blog for additional insight on this topic: https://algotrading101.com/learn/pyspark-guide/.

Disclosure: Interactive Brokers

Information posted on IBKR Campus that is provided by third-parties does NOT constitute a recommendation that you should contract for the services of that third party. Third-party participants who contribute to IBKR Campus are independent of Interactive Brokers and Interactive Brokers does not make any representations or warranties concerning the services offered, their past or future performance, or the accuracy of the information provided by the third party. Past performance is no guarantee of future results.

This material is from AlgoTrading101 and is being posted with its permission. The views expressed in this material are solely those of the author and/or AlgoTrading101 and Interactive Brokers is not endorsing or recommending any investment or trading discussed in the material. This material is not and should not be construed as an offer to buy or sell any security. It should not be construed as research or investment advice or a recommendation to buy, sell or hold any security or commodity. This material does not and is not intended to take into account the particular financial conditions, investment objectives or requirements of individual customers. Before acting on this material, you should consider whether it is suitable for your particular circumstances and, as necessary, seek professional advice.