Native Spark Jobs on Kubernetes

This post describes what the ‘Spark on Kubernetes’ project is, and shows how to run the first alpha release of Spark with Kubernetes support against a Kubernetes (or GKE) cluster. ‘Spark on Kubernetes’ is an effort that began around November 2016 and is being actively developed by several open source community members.

Spark is a top-level open source Apache project and an important data processing framework. It is used to run a diverse set of workloads ranging from analytics to machine learning to batch and streaming.

Kubernetes is an open source cluster manager that was donated to the CNCF by Google. It has a large and thriving community and is a broad rethink of cluster management at scale. Kubernetes strives to be the one place for running all types of applications, be it stateless applications, databases, or data processing frameworks like Spark.

Previously, there was only one way to run Spark on Kubernetes: stand up a Spark Standalone cluster inside Kubernetes. This approach had several drawbacks:

  • Multiple layers of resource management: Two levels of cluster management with no knowledge of one another.
  • Multi-tenancy: No easy way to automatically scale the cluster up/down in response to load, or to share the cluster with non-batch workloads.
  • Locality and Performance: Locality with respect to storage, as well as general performance, are difficult to reason about.

To mitigate these issues, we started looking into adding Kubernetes as a native cluster scheduler backend for Spark. This means that the Spark project itself contains the code needed to talk to Kubernetes directly and leverage the framework to launch Spark jobs without any additional setup. This is the gap that Spark on Kubernetes aims to close.

I’ll now go over how you can use Spark on Kubernetes to run Spark jobs on any Kubernetes cluster. Brief instructions can be found within the project itself, but this post details some of those steps for first-time users or people who are new to either project. Note that this is an alpha release under active development, so there may be stability issues.


Step 1: Provision a Kubernetes Cluster

If you already have a Kubernetes cluster, you can proceed to the next step.

Provisioning a new Kubernetes cluster takes only a few minutes, either on Google Container Engine or on your local machine using Minikube. Google Container Engine is part of Google Cloud Platform and stands up a hosted Kubernetes cluster on your behalf. Minikube is a local single-node Kubernetes cluster intended for testing and development. If you are using Minikube, ensure that you allocate at least 4 GB of RAM and 4 CPUs so that the Spark job can run on the local single-node cluster.

$ minikube start --memory 4000 --cpus 4
Starting local Kubernetes cluster...
Starting VM...
SSH-ing files into VM...
Setting up certs...
Starting cluster components...
Connecting to cluster...
Setting up kubeconfig...
Kubectl is now configured to use the cluster.
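
Alternatively, if you would rather use Google Container Engine than Minikube, a cluster can be stood up with the gcloud CLI. The cluster name, node count, and machine type below are illustrative placeholders rather than recommendations:

$ gcloud container clusters create spark-demo --num-nodes 3 --machine-type n1-standard-4
$ gcloud container clusters get-credentials spark-demo

The get-credentials step configures kubectl to talk to the new cluster, much like the Minikube output above.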

We also require that kubectl be installed and configured so that we can communicate with the Kubernetes cluster we have just set up.
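
A quick sanity check, assuming kubectl is already on your PATH, is to confirm that it can reach the cluster and list its nodes:

$ kubectl version
$ kubectl get nodes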

Step 2: Download the Spark distribution

The Spark distribution can be downloaded from the project’s GitHub releases page; the wget command below points at the release used in this post. It is a fork of Spark 2.1 that contains additional logic to run against Kubernetes. There are two builds: one compiled for Hadoop 2.7+ and one without Hadoop. In this example we are not using any persistence, so the build without Hadoop is sufficient. Once it is downloaded, extract the tarball to a directory of your choice.

$ wget https://github.com/apache-spark-on-k8s/spark/releases/download/v2.1.0-kubernetes-0.1.0-alpha.1/spark-2.1.0-k8s-0.1.0-without-hadoop.tgz

Length: 136010334 (130M) [application/octet-stream]
Saving to: ‘spark-2.1.0-k8s-0.1.0-without-hadoop.tgz’
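
The tarball can then be unpacked with tar. The directory name below is assumed to match the tarball name; adjust it if your extracted directory is named differently:

$ tar -xzf spark-2.1.0-k8s-0.1.0-without-hadoop.tgz
$ cd spark-2.1.0-k8s-0.1.0-without-hadoop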

Step 3: Run a Spark Job

To test this release, let us run a Spark job against this cluster. First, we find the address of the Kubernetes API server of the current Kubernetes cluster.

$ kubectl cluster-info 
Kubernetes master is running at https://192.168.99.101:8443

Here, we make note of the URL that the Kubernetes master is running at. We need this URL in order to send requests to our Kubernetes cluster. If you have multiple Kubernetes clusters and contexts already configured, switch to the cluster that you will be launching Spark jobs in.
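
For example, the active context can be checked and switched with kubectl; the context name below is a placeholder for whatever your cluster is called in your kubeconfig:

$ kubectl config current-context
$ kubectl config use-context my-spark-cluster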

To run a test job, change into the release directory that you extracted in the previous step and run:

bin/spark-submit \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --master k8s://https://<apiserver-host>:<apiserver-port> \
  --kubernetes-namespace default \
  --conf spark.executor.instances=2 \
  --conf spark.app.name=spark-pi \
  --conf spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.1.0-kubernetes-0.1.0-rc1 \
  --conf spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.1.0-kubernetes-0.1.0-rc1 \
  examples/jars/spark-examples_2.11-2.1.0-k8s-0.1.0-SNAPSHOT.jar 1000

In the above, replace <apiserver-host> and <apiserver-port> with the actual IP address and port of the API server that we discovered at the beginning of this step.

For example, I’d use:

...
--master k8s://https://192.168.99.101:8443 \
...

The number of executors provisioned in the above example is 2. This should launch the Spark driver, and subsequently the executors, within the cluster and start the Spark job. To see how this works, you can use:

$ kubectl get pods

The output of the command should show a Spark driver pod, which will eventually spawn one or more executor pods and execute your job. To stream logs from the driver, the command is:

$ kubectl logs -f <spark-driver-pod-name>

Finally, when the job finishes, the driver pod will go to the ‘Completed’ state, and you can read the output using kubectl logs as mentioned above.
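
Since the example job is SparkPi, the computed value can be pulled straight out of the driver log once the pod has completed. Substitute the driver pod name reported by kubectl get pods:

$ kubectl logs <spark-driver-pod-name> | grep "Pi is roughly"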

You can try other variants of the job, with a higher number of executors and so on. Several Spark properties that help tune and control jobs are currently supported by the Kubernetes backend as well.
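
As a minimal sketch, the same job could be resubmitted with more and larger executors by adjusting standard Spark properties such as spark.executor.instances, spark.executor.memory, and spark.executor.cores; the values below are purely illustrative:

bin/spark-submit \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --master k8s://https://<apiserver-host>:<apiserver-port> \
  --kubernetes-namespace default \
  --conf spark.executor.instances=5 \
  --conf spark.executor.memory=2g \
  --conf spark.executor.cores=2 \
  --conf spark.app.name=spark-pi \
  --conf spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.1.0-kubernetes-0.1.0-rc1 \
  --conf spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.1.0-kubernetes-0.1.0-rc1 \
  examples/jars/spark-examples_2.11-2.1.0-k8s-0.1.0-SNAPSHOT.jar 10000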

Advanced Use Cases

There is a lot of upcoming work in this area. Additional resources, such as design documents, can be found on our community resource page. As a community, we are focused on tackling Spark integration problems such as:

  • Dynamic Executor Scaling
  • Encryption and Security
  • Integrations and data locality with HDFS, Cassandra, Kafka, etc.
  • Custom Container Images

Future posts will cover some of these more advanced use cases.

Afterword

We welcome feedback, as well as new contributors, in our GitHub repository.
