Quit Emailing Yourself

GitHub - StabRise/spark-pdf: PDF DataSource for Apache Spark that reads PDF files directly into a DataFrame and OCRs them

3 min read | Saved October 29, 2025

Tags: spark, pdf, databricks, ocr, data-source

Do you care about this?

The project provides a custom data source for Apache Spark that reads PDF files into Spark DataFrames. It reads large PDF files efficiently, handles scanned documents through OCR, and is compatible with multiple Spark versions and with Databricks. The package is published in the Maven Central Repository and exposes configuration options for handling PDFs.
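
As a rough illustration of what reading PDFs through a Spark data source of this kind might look like, here is a minimal PySpark sketch. The format name `pdf` and the option names (`imageType`, `resolution`, `pagePerPartition`, `reader`) are assumptions recalled from the project's README and are not confirmed by this summary, so check them against the repository before relying on them. The helper names `pdf_read_options` and `read_pdfs` are hypothetical and exist only for this example.

```python
from typing import Any, Dict


def pdf_read_options(resolution: int = 200,
                     pages_per_partition: int = 2,
                     reader: str = "pdfBox") -> Dict[str, str]:
    """Build the reader options as strings.

    ASSUMPTION: these option names and values come from memory of the
    spark-pdf README; they are not stated in the summary above.
    """
    return {
        "imageType": "BINARY",                         # assumed: how rendered pages are encoded
        "resolution": str(resolution),                 # assumed: DPI used when rendering pages
        "pagePerPartition": str(pages_per_partition),  # assumed: pages per Spark partition
        "reader": reader,                              # assumed: PDF backend name
    }


def read_pdfs(spark: Any, path: str, **kwargs: Any) -> Any:
    """Return whatever the data source yields for `path` (file, directory, or glob).

    `spark` is an existing SparkSession whose classpath already contains
    the spark-pdf package.
    """
    df_reader = spark.read.format("pdf")  # assumed short name of the data source
    for key, value in pdf_read_options(**kwargs).items():
        df_reader = df_reader.option(key, value)
    return df_reader.load(path)
```

With a session started with the package on its classpath (for example via `--packages` and the coordinate listed on Maven Central for your Spark version), a call such as `read_pdfs(spark, "/data/invoices/*.pdf", resolution=300)` would be expected to return a DataFrame with roughly one row per page. The path here is only a placeholder.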
