The project provides a custom data source for Apache Spark that reads PDF files into Spark DataFrames. It reads large PDF files efficiently, handles scanned documents through OCR, and is compatible with multiple Spark versions as well as Databricks. The package is published to the Maven Central Repository and exposes configuration options that control how PDFs are read.
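
A minimal sketch of how such a data source is typically driven from PySpark, using the standard `DataFrameReader` chain (`format`, `option`, `load`). The format name `"pdf"` and the option keys `resolution` and `pagePerPartition` are assumptions made for illustration; the project's own documentation is the authority on the real names and defaults. The helper `read_pdfs` is hypothetical.

```python
from typing import Any, Dict

# Assumed option names and values, for illustration only.
# Check the data source's documentation for the keys it really accepts.
DEFAULT_PDF_OPTIONS: Dict[str, str] = {
    "resolution": "200",       # assumed: rendering DPI for scanned pages
    "pagePerPartition": "2",   # assumed: pages processed per Spark partition
}


def read_pdfs(spark: Any, path: str, fmt: str = "pdf", **overrides: str) -> Any:
    """Load PDFs into a DataFrame through Spark's generic reader API.

    `spark` is an existing SparkSession; `fmt` is the data source's
    short name (assumed here to be "pdf").
    """
    # Caller-supplied options take precedence over the assumed defaults.
    options = {**DEFAULT_PDF_OPTIONS, **overrides}
    reader = spark.read.format(fmt)
    for key, value in options.items():
        reader = reader.option(key, value)
    return reader.load(path)
```

With a real session, the package would first be put on the classpath, for example by setting `spark.jars.packages` to its Maven coordinates when building the `SparkSession`, and then `read_pdfs(spark, "/data/pdfs", resolution="300")` would return the DataFrame.
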
Docling is a document processing tool that parses a range of document formats, with advanced PDF handling and extensive OCR support. It integrates with generative AI frameworks, offers a unified document representation with multiple export options, and can run entirely locally, which suits sensitive data. It can be installed through standard package managers, and its CLI handles document conversion and exposes the advanced features.
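
A short sketch of the conversion flow from Python, assuming Docling has been installed (for example with `pip install docling`). It converts one document and exports the unified representation as Markdown. The wrapper `pdf_to_markdown` and its injectable `converter` parameter are hypothetical conveniences added here, not part of Docling.

```python
from typing import Any, Optional


def pdf_to_markdown(source: str, converter: Optional[Any] = None) -> str:
    """Convert a document (local path or URL) and return it as Markdown.

    `converter` may be any object with a Docling-style `convert` method;
    when omitted, a Docling DocumentConverter is created.
    """
    if converter is None:
        # Imported lazily so this module still loads where Docling is absent.
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()

    # The conversion result carries the unified document representation.
    result = converter.convert(source)
    return result.document.export_to_markdown()
```

Calling `pdf_to_markdown("report.pdf")` runs the conversion locally. The same conversion is available from the command line by passing a file path or URL to the `docling` command.
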