3 links
tagged with all of: data-processing + spark
Links
The article discusses the challenges of configuring Spark, a popular data processing framework. It walks through the configuration options and their implications, and notes that Spark's settings are often confusing, which makes applications hard to optimize. The author argues that understanding these configurations is necessary to get the most out of Spark.
Pinterest is transitioning from its aging Hadoop-based platform to Moka, a Kubernetes-based data processing platform designed to address scalability and performance needs. The first part of this series covers the rationale behind the shift, the architecture of the new platform, and initial design considerations, and outlines the benefits of using Kubernetes for data processing at massive scale.
Migrating from DataFrame to Dataset in Apache Spark can significantly reduce runtime errors thanks to type safety, compile-time checks, and clearer schema awareness. The transition addresses common problems such as human error and schema mismatches, leading to more robust and maintainable data processing systems. The article explains the advantages of Dataset over DataFrame for large-scale data processing, with an emphasis on correctness.