1 min read · Oct 9, 2020
Did you know that hdfs-shell actually has a built-in merge-parquet command, shorthanded `mp`? It does not launch a Spark job; instead it calls parquet-tools' merge command under the hood. The resulting job is not distributed, so it won't scale to huge files, but it is very handy overall, and you can cap the maximum size of the resulting file with a parameter. (Note -- I did not implement the feature myself :), but I was close by when it was written.)
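To give a feel for how this is used, here is a minimal sketch of an hdfs-shell session. The paths and the exact shape of the `mp` arguments are assumptions for illustration only, not the documented syntax; check the command's built-in help inside hdfs-shell for the real signature and for the name of the size-limit parameter mentioned above.

```
# Illustrative sketch only — argument order and paths are assumptions.
# Inside an hdfs-shell session, merge a directory of small Parquet files
# into a single output file (runs parquet-tools locally, not a Spark job):
hdfs-shell > mp /warehouse/events/day=2020-10-09 /warehouse/events/day=2020-10-09/merged.parquet
```

Because the merge runs in a single local process, it is best suited to compacting many small files of modest total size; for very large datasets a distributed compaction job is still the better tool.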