

GitHub - StabRise/spark-pdf: PDF DataSource for Apache Spark that reads PDF files directly into a DataFrame and runs OCR on them

The project provides a custom data source for Apache Spark that reads PDF files into Spark DataFrames. It reads large PDF files efficiently, can run OCR on scanned documents, and is compatible with multiple Spark versions as well as Databricks. The package is published to the Maven Central Repository and exposes configuration options that control how PDFs are read.
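As a rough illustration of what reading PDFs through a Spark data source looks like, here is a minimal PySpark-style sketch. The format name "pdf" and the option keys in EXAMPLE_OPTIONS are assumptions recalled from the project's README and may differ between releases; read_pdfs is a hypothetical helper, not part of the package. Check the repository for the current names before relying on any of this.

```python
# Option keys below are ASSUMED from the spark-pdf README; verify before use.
EXAMPLE_OPTIONS = {
    "imageType": "BINARY",    # assumed: how rendered page images are encoded
    "resolution": 200,        # assumed: rendering resolution for scanned pages
    "pagePerPartition": 2,    # assumed: number of pages placed in each partition
    "reader": "pdfBox",       # assumed: backend used to read the PDF
}


def read_pdfs(spark, path, options=None):
    """Hypothetical helper: load PDFs via the (assumed) "pdf" data source.

    `spark` is an existing SparkSession that already has the spark-pdf
    package on its classpath. Returns whatever DataFrame the source yields.
    """
    reader = spark.read.format("pdf")  # assumed short name of the data source
    for key, value in (options or {}).items():
        # DataFrameReader.option accepts strings, so normalize every value.
        reader = reader.option(key, str(value))
    return reader.load(path)
```

With a real session this would be called as read_pdfs(spark, "/path/to/pdfs", EXAMPLE_OPTIONS), after starting Spark with the package pulled from Maven Central (the exact coordinates are listed in the project README).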

Saved by tldr-importer · Last saved October 29, 2025 · 3 min read

Tags: spark, pdf, databricks, ocr, data-source
