Cloud Dataproc Best Practices

Suds Kumar
3 min read · Jan 24, 2021
  1. Specify cluster image versions.
    Cloud Dataproc uses image versions to bundle operating system and big data components (including core and optional components) and GCP connectors into a single package that is deployed on a cluster.
    If you don’t specify an image version when creating a new cluster, Cloud Dataproc will default to the most recent stable image version. For production environments, we recommend that you always associate your cluster creation step with a specific minor Cloud Dataproc version, as shown in this example gcloud command:
    gcloud dataproc clusters create my-pinned-cluster --image-version 1.4-debian9
  2. Know when to use custom images.
    If you have dependencies that must be shipped with every cluster, such as native Python libraries that must be installed on all nodes, or security hardening or virus protection software that must be baked into the image, create a custom image from the latest image in your target minor track.
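    For example, assuming you have already built such a custom image (the project and image names below are placeholders), you can point cluster creation at it with the --image flag instead of --image-version:
    gcloud dataproc clusters create my-custom-cluster --image=projects/my-project/global/images/my-custom-dataproc-image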
  3. Use the Jobs API for submissions.
    The Cloud Dataproc Jobs API makes it possible to submit a job to an existing Cloud Dataproc cluster with a jobs.submit call over HTTP, with the gcloud command-line tool, or from the GCP Console itself.
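    For example, a Spark job can be submitted to an existing cluster through gcloud; a minimal sketch, where the cluster name and region are placeholders and the example jar is the one that ships on Dataproc images:
    gcloud dataproc jobs submit spark --cluster=my-cluster --region=us-central1 --class=org.apache.spark.examples.SparkPi --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000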
  4. Control the location of your initialization actions.
    Initialization actions let you provide your own customizations to Cloud Dataproc. We’ve taken some of the most commonly installed OSS components and made example installation scripts available in the dataproc-initialization-actions GitHub repository.
    While these scripts provide an easy way to get started, when you’re running in a production environment you should always run these initialization actions from a location that you control. Typically, a first step is to copy the Google-provided script into your own Cloud Storage location.
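    A minimal sketch of that first step (the source script path and bucket names below are placeholders): copy the script into a bucket you own, then reference your copy when creating the cluster:
    gsutil cp gs://goog-dataproc-initialization-actions-us-central1/cloud-sql-proxy/cloud-sql-proxy.sh gs://my-init-actions-bucket/cloud-sql-proxy.sh
    gcloud dataproc clusters create my-cluster --region=us-central1 --initialization-actions=gs://my-init-actions-bucket/cloud-sql-proxy.sh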
  5. Keep an eye on Dataproc release notes.
    Cloud Dataproc releases new sub-minor image versions each week. To stay on top of all the latest changes, review the release notes that accompany each change to Cloud Dataproc. You can also add the release notes feed to your favorite feed reader.
  6. Know how to investigate failures.
    Even with these practices in place, an error may still occur. When an error occurs because of something that happens within the cluster itself and not simply in a Cloud Dataproc API call, the first place to look will be your cluster’s staging bucket. Typically, you will be able to find the Cloud Storage location of your cluster’s staging bucket in the error message itself.
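    If the message does not include the bucket, you can look it up and browse the driver output yourself; a minimal sketch, with placeholder cluster, region, and bucket names:
    gcloud dataproc clusters describe my-cluster --region=us-central1 --format="value(config.configBucket)"
    gsutil ls gs://my-staging-bucket/google-cloud-dataproc-metainfo/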
  7. Use Google Cloud Storage as the primary data source and sink.
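    In practice this usually means pointing jobs at gs:// paths rather than HDFS. A minimal sketch, assuming a PySpark script and input/output locations already in Cloud Storage (all bucket and file names below are placeholders):
    gcloud dataproc jobs submit pyspark gs://my-code-bucket/wordcount.py --cluster=my-cluster --region=us-central1 -- gs://my-data-bucket/input/ gs://my-data-bucket/output/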
  8. Secondary workers — preemptible and non-preemptible VMs
    (1) Processing only — Secondary workers do not store data. They only function as processing nodes. Therefore, you can use secondary workers to scale compute without scaling storage.
    (2) No secondary-worker-only clusters — Your cluster must have primary workers. If you create a cluster and you do not specify the number of primary workers, Dataproc adds two primary workers to the cluster.
    (3) Two worker types — There are two types of secondary workers: preemptible and non-preemptible. All secondary workers in your cluster must be of the same type, either preemptible or non-preemptible. The default is preemptible.
    For best results, the number of preemptible workers in your cluster should be less than 50% of the total number of all workers (primary plus all secondary workers) in your cluster, as in the example after this list.
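    A minimal sketch that stays within that guidance (cluster name, region, and counts are placeholders; two preemptible workers out of six total is about 33%):
    gcloud dataproc clusters create my-cluster --region=us-central1 --num-workers=4 --num-secondary-workers=2 --secondary-worker-type=preemptible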
