Links
This article explains how to use built-in PySpark functions to efficiently manipulate map data types in data pipelines. It covers functions like `transform_keys`, `map_filter`, and `map_contains_key`, highlighting their utility in cleaning and transforming semi-structured data.
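As a rough illustration of what these three functions do, the sketch below mirrors their semantics on plain Python dicts instead of Spark map columns. The helpers reuse the PySpark function names for readability, but they are hypothetical stand-ins written for this example, not the PySpark API; in PySpark the equivalents live in `pyspark.sql.functions` and take a map column plus a lambda over column expressions. The sample record is also made up.

```python
from typing import Callable, Dict


def transform_keys(m: Dict[str, str], f: Callable[[str, str], str]) -> Dict[str, str]:
    """Dict analogue of transform_keys: rebuild the map with f(key, value) as each new key."""
    return {f(k, v): v for k, v in m.items()}


def map_filter(m: Dict[str, str], pred: Callable[[str, str], bool]) -> Dict[str, str]:
    """Dict analogue of map_filter: keep only entries where pred(key, value) is true."""
    return {k: v for k, v in m.items() if pred(k, v)}


def map_contains_key(m: Dict[str, str], key: str) -> bool:
    """Dict analogue of map_contains_key: report whether the key is present."""
    return key in m


# Hypothetical semi-structured record with inconsistent keys and an empty value.
raw = {" Color ": "red", "SIZE": "", "weight": "2kg"}

# Normalize the keys, drop empty values, then check for a required attribute.
normalized = transform_keys(raw, lambda k, v: k.strip().lower())
cleaned = map_filter(normalized, lambda k, v: v != "")
has_color = map_contains_key(cleaned, "color")
```

In a real pipeline the same three steps would be chained as column expressions on a DataFrame, so the work stays inside Spark's built-in functions instead of a Python UDF.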
This article explores the evolving role of data engineers over the past 50 years, highlighting their often unnoticed contributions to data infrastructure. It discusses the challenges they face, such as managing dependencies and schema changes, while emphasizing that the core problems remain unchanged despite new tools and technologies.
The article introduces Apache Spark 4.0, highlighting its new features, performance improvements, and enhancements aimed at simplifying data processing tasks. It emphasizes the importance of this release for developers and data engineers seeking to leverage Spark's capabilities for big data analytics and machine learning applications.
The article introduces PyIceberg, a tool designed to help data engineers manage and query large datasets efficiently. It emphasizes the importance of handling data in motion and how PyIceberg integrates with modern data infrastructure to streamline processes. Key features and use cases are highlighted to showcase its effectiveness in data engineering workflows.
Netflix has developed a Real-Time Distributed Graph (RDG) to address the complexities arising from its evolving business model, which now spans streaming, ads, and gaming. The first part of this series details the architecture and the ingestion pipeline, which processes vast amounts of data to support fast querying and insights.
Tulika Bhatt, a senior software engineer at Netflix, discusses her experiences with large-scale data processing and the challenges of managing impression data for personalization. She emphasizes the need for a balance between off-the-shelf solutions and custom-built systems while highlighting the complexities of ensuring data quality and observability in high-speed environments. The conversation also touches on the future of data engineering technologies and the impact of generative AI on data management practices.