In this blog, we will focus on Dataproc PHS best practices. To set up PHS to access web interfaces of MapReduce and Spark job history files, please refer to Dataproc documentation.
PHS Best Practices
Cluster Planning and Maintenance
It’s common to have a single PHS for a given GCP project. If needed, you can create two or more PHSs pointing to different GCS buckets in a project. This allows you to isolate and monitor specific business applications that run multiple ephemeral jobs and require a dedicated PHS.
For disaster recovery, you can quickly spin up a new PHS in another region in the event of a zonal or regional failure.
If you require High Availability (HA), you can spin up two or more PHS instances across zones or regions. All instances can be backed by the dual-regional or multi-regional GCS bucket.
You can run PHS on a single-node Dataproc cluster, as it is not running large-scale parallel processing jobs. For the PHS machine type:
N2 are the most cost-effective and performant machines for Dataproc. We also recommend 500-1000GB pd-standard disks.
For <1000 apps and if there are apps with 50K-100K tasks, we suggest n2-highmem-8.
For <1000 apps and there are apps with 50K-100K tasks, we suggest n2-highmem-8.
For >10000, we suggest n2-highmem16.
We recommend you benchmark with your Spark applications in the testing environment before configuring PHS in production. Once in production, we recommend monitoring your GCE backed PHS instance for memory and CPU utilization and tweaking machine shape as required.
In the event of significant performance degradation within the Spark UI due to a large amount of applications or large jobs generating large event logs, you can recreate the PHS with increased machine size with higher memory.
As Dataproc releases new sub-minor versions on a bi-weekly cadence or greater, we recommend recreating your PHS instance so it has access to the latest Dataproc binaries and OS security patches.
As PHS services (e.g. Spark UI, MapReduce History Server) are backwards compatible, it’s suggested to create a Dataproc 2.0+ based PHS cluster for all instances.
Logs Storage
Configure spark:spark.history.fs.logDirectory to specify where to store event log history written by ephemeral clusters or serverless Spark. You need to create the GCS bucket in advance.
Event logs are critical for PHS servers. As the event logs are stored in a GCS bucket, it is recommended to use a multi-Region GCS bucket for high availability. Objects inside the multi-region bucket are stored redundantly in at least two separate geographic places separated by at least 100 miles.
Configuration
PHS is stateless and it constructs the Spark UI of the applications by reading the application’s event logs from the GCS bucket. SPARK_DAEMON_MEMORY is the memory to allocate to the history server and has a default of 3840m. If too many users are trying to access the Spark UI and access job application details, or if there are long-running Spark jobs (iterated through several stages with 50K or 100K tasks), the heap size is probably too small. Since there is no way to limit the number of tasks stored on the heap, try increasing the heap size to 8g or 16g until you find a number that works for your scenario.
If you’ve increased the heap size and still see performance issues, you can configure spark.history.retainedApplications and lower the number of retained applications in the PHS.
Configure mapred:mapreduce.jobhistory.read-only.dir-pattern to access MapReduce job history logs written by ephemeral clusters.
By default, spark:spark.history.fs.gs.outputstream.type is set to BASIC. The job cluster will send data to GCS after job completion. Set this to FLUSHABLE_COMPOSITE to copy data to GCS at regular intervals while the job is running.
Configure spark:spark.history.fs.gs.outputstream.sync.min.interval.ms to control the frequency at which the job cluster transfers data to GCS.
To enable the executor logs in PHS, specify the custom Spark executor log URL for supporting external log service. Configure the following properties: