Quit Emailing Yourself

2 links tagged with all of: data-processing + apache-spark

Links

Why Apache Spark is often considered as slow?

Open-source (vanilla) Spark is a versatile distributed query engine that handles many kinds of workloads, but for OLAP tasks it is generally slower than purely vectorized engines such as Trino or Snowflake because of its hybrid processing model. Spark's approach gives it flexibility with semi-structured data and complex queries, but it lacks optimizations specific to columnar data formats. The article also discusses extensions and other solutions that could turn Spark into a more vectorized engine.

Saved by tldr-importer · Last saved October 29, 2025 · 6 min read

apache-spark · olap · vectorization · query-optimization · data-processing
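
To make the row-at-a-time versus vectorized distinction in the summary above more concrete, here is a minimal pure-Python sketch. It is only an illustration of the two processing styles, not how Spark, Trino, or Snowflake are implemented; the sample data and the function names `revenue_row_at_a_time` and `revenue_columnar` are hypothetical.

```python
from array import array

# Hypothetical sample data: one dict per record (row-oriented layout).
rows = [
    {"price": 2.0, "qty": 3},
    {"price": 5.0, "qty": 1},
    {"price": 1.5, "qty": 4},
]


def revenue_row_at_a_time(rows, min_qty):
    """Process one record per iteration, looking up each field per row."""
    total = 0.0
    for row in rows:
        if row["qty"] >= min_qty:
            total += row["price"] * row["qty"]
    return total


# The same data in a column-oriented layout: one typed array per column.
prices = array("d", (r["price"] for r in rows))
qtys = array("q", (r["qty"] for r in rows))


def revenue_columnar(prices, qtys, min_qty, batch_size=2):
    """Process columns in batches: filter a whole batch, then aggregate it."""
    total = 0.0
    for start in range(0, len(prices), batch_size):
        p = prices[start:start + batch_size]
        q = qtys[start:start + batch_size]
        # The filter runs over the entire batch and produces a selection mask.
        mask = [x >= min_qty for x in q]
        total += sum(pi * qi for pi, qi, keep in zip(p, q, mask) if keep)
    return total


row_total = revenue_row_at_a_time(rows, min_qty=3)
col_total = revenue_columnar(prices, qtys, min_qty=3)
```

Both functions return the same result. In an engine written in a compiled language, the batch-per-column form is the one that lends itself to tight loops over contiguous memory, which is the kind of optimization the summary says vanilla Spark lacks.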

How to Fix Data Skew in Apache Spark with the Salting Technique | HackerNoon

The article discusses the issue of data skew in Apache Spark and presents the salting technique as an effective solution. By introducing randomness into the data partitioning process, the salting method helps to evenly distribute data across partitions, improving performance and reducing processing time. The author provides practical insights on implementing this technique to enhance Spark applications.

Saved by tldr-importer · Last saved October 29, 2025 · 1 min read

apache-spark · data-skew · salting-technique · performance · data-processing
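
The salting idea summarized in the entry above can be sketched without a cluster. The following is a pure-Python simulation under stated assumptions, not PySpark code and not the article's implementation: `partition_of` stands in for a hash partitioner, and the key names, partition count, and salt count are all hypothetical. It appends a random suffix to each key so that a single hot key spreads over several partitions, then strips the suffix in a second aggregation step to recover the true counts.

```python
import random
import zlib
from collections import Counter


def partition_of(key: str, num_partitions: int) -> int:
    """Stand-in for a hash partitioner (deterministic across runs)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def salt_key(key: str, num_salts: int, rng: random.Random) -> str:
    """Append a random salt so one logical key maps to several physical keys."""
    return f"{key}_{rng.randrange(num_salts)}"


def partition_sizes(keys, num_partitions):
    """Count how many records land in each partition."""
    counts = Counter(partition_of(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def salted_count(keys, num_salts, rng):
    """Two-stage count: aggregate per salted key, then merge by original key."""
    partial = Counter(salt_key(k, num_salts, rng) for k in keys)
    final = Counter()
    for salted, n in partial.items():
        original = salted.rsplit("_", 1)[0]  # strip the salt suffix
        final[original] += n
    return final


rng = random.Random(42)
# Hypothetical skewed dataset: one hot key dominates.
keys = ["hot"] * 9000 + [f"user{i}" for i in range(1000)]

plain = partition_sizes(keys, num_partitions=8)
salted = partition_sizes([salt_key(k, 16, rng) for k in keys], num_partitions=8)
counts = salted_count(keys, 16, rng)

print("largest partition without salting:", max(plain))
print("largest partition with salting:   ", max(salted))
```

Without salting, every record for the hot key lands in the same partition. With salting, the work is spread out at the cost of a second aggregation step to merge the partial results.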