Whole Stage Code Gen in Spark

Subham Singhal
3 min readOct 4, 2024

--

WSCG is a query plan optimisation technique used in most DBMS systems. Before undestanding about WSCG we need to first understand issues spark 1.x had earlier and how it got fixed with WSCG.

Spark1.x and Volcano Iterator Model:

Let’s start with a query plan and understand how it would get executed in spark1.x.

Select count(*) from sales where item_id=512

Corresponding Query Plan would be:

The way volacano iterator model works is that each operator implement a common interface which provides a function to emit next data.

In Spark RDD there is a “compute” function which provides Iterator[InternalRow].

def compute(): Iterator[InternalRow]

All of these operator would would fetch one record from prev operator using iterator.next() and process row and provide output to next operator.

This is the execution model:

Advantantage of this model is that operator get nice interface(iterator) and can get fetch data with iterator.next() method and does not need to know about operator prev or next to it but on downside

  1. There are too many virtual calls for each record. As can be seen here there are 4 virtual calls for a single record.
  2. High memory usage. Intermediate results needs to be written in memory to be consumed by next operator.
  3. Unable to leverage lot of modern techniques like pipelining, prefetching, branch prediction, SIMD, loop unrolling etc.

Here comes WSCG to the rescue…

What WSCG does is it combines multiple operators into a single function call. During runtime generate byte code which will be run and this would only need a single function call.

Spark execution after WSCG:

And this is how query plan looks after conversion:

But not all spark operator supports Code generation, in such case spark will fall back to volcano iterator model or vectorized execution in spark 2.0

Checkout my 2nd blog for internal working of WSCG:

Reference:

--

--

No responses yet