Apache Spark is a popular distributed big data processing framework which is used by many companies to process data in batches as well as real-time.
However, how fast your spark program runs depends a lot on the code that a developer/engineer writes.
Most people who are new to data engineering often try to increase the resources such as the cores/memory to increase the speed of execution.
But from my experience, that is not always needed. e.g. To process 10GB data, you generally don’t need 1TB of memory and 400 cores which is just a waste of resources and will lead to huge infra costs.
Here are some tips to avoid such bottleneck situations
#1 Don’t use GroupByKey
GroupByKey is used for collecting data with respect to a key.
GroupByKey shuffles/redistributes all data to their respective partitions before merging them. This leads to lot of network I/O.
A better approach is to use ReduceByKey. It first groups data on a each partition and then shuffles the data to respective partitions thus reducing the network traffic.
Node1 -> (User1, 1), (User2, 1), (User1, 3)
Node2 -> (User1, 3), (User2, 3), (User1, 4)
If we do GroupByKey on users, then User1’s data from Node2 to Node1 will be transferred as such but in case of ReduceByKey data transferred will be (User1, (3,4)) (data is first reduced on Node 2 and then sent over network)
#2 Don’t use Pyspark/Native Scala Spark
Since python is the most commonly used language among data engineers/scientists leading to popularity of pyspark.
Spark runs as a JVM process and Pyspark is a python process which leads to an additional communication overhead between these two.
Scala native code is much better, however, spark doesn’t apply any optimisations on top of it.
The fastest and cleanest approach is to use Spark SQL. The Catalyst optimiser in spark applies tons of optimisations such as predicate pushdowns and boolean expression simplification. I’ll be writing in detail on this in my latter article.
#3 Partition data properly
This is one of the most common yet ignored issue.
Suppose you partition some user data on country. So a country with more number of users will take more time to process, compared to country with less number of users.
In this case most of the nodes on which smaller countries have been processed will lie idle and your spark task will take longer to finish.
A better approach will be to key on userid. This will create a larger number of small partitions which can be processed simultaneously on different executors.
#4 Don’t run large SQL queries on sources
Sources such as MySQL or Hive usually have limited resources and most of the time no distributed computing. So if the data is not too large e.g. 100 GB or greater ( i.e. it won’t cause the network to become the bottleneck), you can read all the data and use spark to do all the necessary filters and joins.
#5 Never use .collect()
Collect brings all the data from the executors to the driver and hence cause a lot of network load and may cause spark to crash if data is too large for driver’s memory.
You should only use collect for printing a subset of data for debugging otherwise refrain from using it.
Software Developer | Technical Writer | Lives in Bangalore, IndiaLearn more
Data from Goodreads
Homo Deus: A History of Tomorrow
Yuval Noah Harari13 % (1 year ago)13 % (1 year ago)
Data from Goodreads
Lifespan: Why We Age—and Why We Don't Have To
David A. Sinclair
Thinking, Fast and Slow
Loonshots: How to Nurture the Crazy Ideas That Win Wars, Cure Diseases, and Transform Industries