A short performance story: Scala Native and CArgs
Recently, I got intrigued by Scala Native, essentially a compiler + runtime that gives Scala native performance — in particular, no JVM. However, my initial experiments gave abysmal performance, worse than that of naive Python or bash. Read on to find out where the devil lay, and how I tracked it down!
Problem Statement
I started reading Modern Systems Programming in Scala Native, and the first benchmarked example is about going through a large TSV file and finding the row with the maximum value in one of the columns. In bash, you can do this via cat <file> | sort -k <x> -r -n | head -n1 (which is asymptotically suboptimal: O(n log n) instead of O(n) for n rows). On a file with 10M rows, this takes 9.6s on my machine. A naive solution in Python — basically just line_value = int(line.split('\t')[x]) followed by max_so_far = line_value if line_value > max_so_far else max_so_far — runs in about 3.8s.
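For reference, the naive Python baseline can be sketched like this (the function name and its return shape are my own; the post does not show the full script):

```python
def max_row(path, col):
    """Return (line, value) for the TSV row whose `col`-th column is maximal."""
    max_so_far = None
    best_line = None
    with open(path) as f:
        for line in f:
            line_value = int(line.split('\t')[col])
            if max_so_far is None or line_value > max_so_far:
                max_so_far = line_value
                best_line = line
    return best_line, max_so_far
```

The whole trick is that a single pass keeps only the running maximum, which is why this is O(n) where the sort-based one-liner is O(n log n).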
The Scala Native solution, demonstrated in the book and implemented in its appendix, differs from how you would approach this in vanilla Scala: it uses the C-style functions fgets, sscanf, and strncpy instead of the usual Scala/Java string methods. Additionally, you have to reason in pointers (e.g., to extract an int from a string, you need sscanf(my_string, c"%d", int_pointer)) and explicitly allocate memory on the stack, e.g. for intermediate string buffers. One would thus expect this extra effort to be compensated by C-level performance.
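A condensed sketch of that style — not the book's exact code; the identifiers and buffer size are mine, it only handles the first column, and it needs the Scala Native toolchain to build:

```scala
import scala.scalanative.unsafe._
import scala.scalanative.libc.stdio._

// C-style scan: fgets into a stack buffer, sscanf to pull one int column out.
def maxOfFirstColumn(path: CString): Int = {
  val file  = fopen(path, c"r")
  val line  = stackalloc[Byte](1024)  // intermediate buffer, on the stack
  val value = stackalloc[Int]()       // sscanf writes the parsed int here
  var maxSoFar = Int.MinValue
  while (fgets(line, 1024, file) != null) {
    sscanf(line, c"%d", value)        // vararg call across the C boundary
    if (!value > maxSoFar) maxSoFar = !value
  }
  fclose(file)
  maxSoFar
}
```

Note that nothing here looks like it should touch the heap — which is exactly what made the measurement below so puzzling.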
The shock came when the solution ran for a lengthy 12s!
Problem Solution
An initial check persuaded me that I was explicitly doing all the allocations outside the loop that iterates over the lines, and that within the loop there were just vanilla C functions. So I resorted to perf (see e.g. here, but note that perf is a general-purpose tool, not just for Scala Native). It turned out, compared to the corresponding flame graph of the Python solution, that an unreasonably big chunk of time was spent in Heap_AllocSmall — which was very surprising, as all the allocations I was doing myself were on the stack.
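Concretely, such a profiling session looks roughly like this (the binary name and file paths are illustrative, and the last line assumes Brendan Gregg's FlameGraph scripts are on the PATH):

```shell
# Sample call stacks while the program processes the file
perf record -g ./tsv-max data.tsv

# Interactive breakdown of where the time went
# (this is where Heap_AllocSmall stood out)
perf report

# Optionally, render a flame graph for easier comparison
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg
```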
It turned out that the devil was in sscanf(input_string, format, p1, p2, p3, ...), that is, the line that parses a line of the TSV file into individual columns. Digging into Scala Native's own code, I found that the wrapper between user code and the C libraries has to collect the varargs into a CVarArgList, and the code doing so runs new Array and friends — that is, heap allocations, on every single call!
Working around this was, in this case, pretty simple — I explicitly built val args = toCVarArgList(p1, p2, p3, ...) before entering the line-iterating loop, and inside it called just sscanf(input_string, format, args). That brought me to the expected ~2s.
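In code, the fix amounts to hoisting the vararg packing out of the loop. Again a sketch with illustrative names, built on the assumption (taken from the article's own fix) that sscanf accepts a prebuilt CVarArgList; toCVarArgList needs an implicit Zone in scope:

```scala
import scala.scalanative.unsafe._
import scala.scalanative.libc.stdio._

def maxOfFirstColumn(path: CString): Int = Zone { implicit z =>
  val file  = fopen(path, c"r")
  val line  = stackalloc[Byte](1024)
  val value = stackalloc[Int]()
  val args  = toCVarArgList(value)    // the one heap allocation, done once
  var maxSoFar = Int.MinValue
  while (fgets(line, 1024, file) != null) {
    sscanf(line, c"%d", args)         // no per-line CVarArgList allocation
    if (!value > maxSoFar) maxSoFar = !value
  }
  fclose(file)
  maxSoFar
}
```

This only works because the set of output pointers is the same on every iteration; with a varying argument list, the list would have to be rebuilt each time.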
Conclusion
- Don't trust benchmarks and claims until you replicate them.
- perf and flame graphs are cool.
- Performance bottlenecks can hide in surprising places such as function invocations, especially at language boundaries.
As for Scala Native itself — it feels quite weird and dangerous that every call to a C function with varargs requires a fresh list on the heap. In this case, I was lucky that the vararg set was static — in more involved cases, the code would have to be far clumsier. Of course, I'm not an expert and there is a chance I'm doing something completely wrong (I haven't even finished the book yet!), and I have no idea how to design it better.
However, this little mishap aside, Scala Native still feels like a nice thing to have in my toolbox.