November 9, 2018
Because of the scale of big data clusters, it is crucial that developers make the best use of a cluster’s hardware resources. However, it is challenging to figure out the best parameter settings for an entire big data software stack.
Intel and Dell EMC have collaborated on research to help developers better optimize big data clusters. Their research shows that workload performance is sensitive both to the CPU and to the number of nodes in a cluster. Accurate simulations of these workloads give developers a tool for choosing better values for configuration parameters.
For this optimization project, the research team used the TPCx-BB benchmark to evaluate the performance of both software and hardware components for big data clusters. The white paper the team published, “Optimize configuration parameters faster and more accurately, and speed up analyses of scaling big-data clusters,” offers advice on how to extend TPCx-BB to evaluate data ingestion and real-time processing. The authors also share some ideas on how to implement fuller TPCx-BB coverage for end-to-end big data clusters.
The project compared the optimized parameter values suggested by Intel® CoFluent™ Technology for Big Data to the settings chosen by big data experts. Results showed that Intel CoFluent delivered a 32% gain in the benchmark performance score (112 versus 85 BBQpm@1000) over the parameter choices of expert human developers. This improvement is equivalent to the performance gain typically seen from a new processor generation.
|Optimized settings|Benchmark score (BBQpm@1000)|
|---|---|
|Optimized by expert developer|85|
|Optimized by Intel® CoFluent™ Technology for Big Data|112|
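
The 32% figure can be checked directly from the two benchmark scores in the table. The following is a minimal Python sketch of that arithmetic; the function name `relative_gain_percent` and the variable names are illustrative assumptions, not part of the white paper or of any Intel tool.

```python
def relative_gain_percent(baseline: float, improved: float) -> float:
    """Return the percentage gain of `improved` over `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


# Scores taken from the table above (BBQpm@1000).
expert_score = 85.0     # settings chosen by an expert developer
cofluent_score = 112.0  # settings suggested by Intel CoFluent

gain = relative_gain_percent(expert_score, cofluent_score)
print(f"Gain over expert settings: {gain:.1f}%")
```

The unrounded result is about 31.8%, which rounds to the 32% quoted in the text.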
Learn more: Scaling and Optimizing Big Data Clusters