Links
This article introduces the features of Apache Spark 4.1, highlighting Spark Declarative Pipelines for easier data transformation, Real-Time Mode for low-latency streaming, and improved PySpark performance with Arrow-native UDFs. It also covers SQL enhancements and Spark Connect improvements aimed at better stability and scalability.
Kostas Pardalis discusses Fenic, an open-source DataFrame engine inspired by PySpark, aimed at enhancing data engineering for AI applications. He highlights how Fenic incorporates semantic operators to improve data transformation and management, addressing the limitations of traditional data infrastructure in the AI era.
This article explains how to use built-in PySpark functions to efficiently manipulate map data types in data pipelines. It covers functions like `transform_keys`, `map_filter`, and `map_contains_key`, highlighting their utility in cleaning and transforming semi-structured data.
This article explains how Apache Hudi manages schema evolution in data lakehouses, so that table schemas can change without disrupting pipelines. It covers a practical implementation using PySpark and highlights the benefits: agility, backward compatibility, and pipeline reliability.