Quit Emailing Yourself

GitHub - StabRise/spark-pdf: PDF DataSource for Apache Spark, allow to read PDF files directly to the DataFrame and ocr it

The project provides a custom data source for Apache Spark, enabling users to read PDF files into Spark DataFrames. It supports efficient reading of large PDF files, including scanned documents with OCR capabilities, and is compatible with various Spark versions and Databricks. The package is available in the Maven Central Repository and includes various configuration options for handling PDFs.

Saved by tldr-importer · Last saved October 29, 2025 · 3 min read

Tags: spark, pdf, databricks, ocr, data-source

GitHub - docling-project/docling: Get your documents ready for gen AI

Docling is a document processing tool that parses a range of formats, with advanced PDF support and extensive OCR. It integrates with generative AI frameworks, provides a unified document representation with multiple export options, and can run locally so that sensitive data stays on the machine. It installs through package managers and includes a CLI for document conversion and its more advanced features.

Saved by tldr-importer · Last saved October 29, 2025 · 2 min read

Tags: document-processing, pdf, ai-integration, ocr, cli
