5 min read | Saved February 14, 2026
Do you care about this?
This article explains Spark Declarative Pipelines (SDP), a framework for creating data pipelines in Spark. It covers key concepts like flows, datasets, and pipelines, along with how to implement them in Python and SQL. The guide also includes installation instructions and usage of the command line interface.
If you do, here's more
Spark Declarative Pipelines (SDP) is a framework for constructing data pipelines in Apache Spark. It streamlines ETL processes by allowing developers to specify what data transformations to perform rather than managing the execution logistics. SDP supports both batch and streaming data processing, making it versatile for various use cases, including data ingestion from cloud storage and message buses like Kafka and Kinesis.
The core component of SDP is the flow, which reads data from a source, applies transformation logic, and writes the result to a target dataset. Flows write to datasets, which come in several kinds: streaming tables, materialized views, and temporary views, each serving a specific role within the pipeline. For instance, a streaming table processes incoming data incrementally, while a materialized view precomputes results for efficiency. Pipelines themselves are the main units of development; they infer dependencies between datasets and determine execution order automatically.
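The concepts above can be sketched in a small pipeline file. This is a minimal illustration, not a definitive implementation: the `pyspark.pipelines` import path is assumed (it may differ across Spark versions), the `spark` session is assumed to be provided by the pipeline runtime, and the source path `/data/events/` is hypothetical.

```python
# Sketch of an SDP pipeline definition file.
# Assumptions: the `pyspark.pipelines` import path, a runtime-provided
# `spark` session, and a hypothetical JSON source at /data/events/.
from pyspark import pipelines as dp
from pyspark.sql.functions import col

# Streaming table: a flow that ingests new records incrementally.
@dp.table
def raw_events():
    return (
        spark.readStream
        .format("json")
        .load("/data/events/")  # hypothetical cloud-storage path
    )

# Materialized view: a flow whose batch result is precomputed
# from the streaming table defined above.
@dp.materialized_view
def daily_counts():
    return (
        spark.read.table("raw_events")
        .groupBy(col("event_date"))
        .count()
    )
```

Because the framework is declarative, the pipeline runner sees that `daily_counts` reads from `raw_events` and schedules the two flows in dependency order; the author never wires up execution explicitly.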
To manage pipelines, the `spark-pipelines` command line interface (CLI) provides commands for generating projects, executing pipelines, and conducting dry runs to catch potential errors before execution. Installation is straightforward via pip. Python programming with SDP uses decorators to define materialized views, temporary views, and streaming tables. For instance, the `@dp.materialized_view` decorator creates a view computed from batch data, while the `@dp.table` decorator defines a streaming table fed by incremental data. Datasets defined this way can then be read and queried by other flows in the same pipeline.
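A typical session with the CLI might look like the following. This is a hedged sketch: the `init`, `dry-run`, and `run` subcommands follow the capabilities the article describes, but exact flags (such as `--name`) and the PySpark version that bundles SDP may differ in your environment.

```shell
# Install PySpark; SDP ships with recent releases (version requirement
# is an assumption, check your Spark documentation).
pip install pyspark

# Scaffold a new pipeline project (the --name flag is assumed).
spark-pipelines init --name my_pipeline
cd my_pipeline

# Validate the dataflow graph without executing any flows.
spark-pipelines dry-run

# Execute the pipeline.
spark-pipelines run
```

The dry run is useful in CI: it catches unresolved references and invalid definitions before any cluster time is spent.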