Use Tesseract to make a PDF searchable.

Installation

Install Tesseract v3.0.5. This library shells out to it.

Then install this package. Maven-style:

<dependency>
  <groupId>org.overviewproject</groupId>
  <artifactId>pdfocr_2.12</artifactId>
  <version>0.0.10</version>
</dependency>

Sbt-style:

libraryDependencies += "org.overviewproject" %% "pdfocr" % "0.0.10"

Usage

You've got to use Scala. Code something like this:

import java.nio.file.{Path, Paths}
import java.util.Locale
import org.overviewproject.pdfocr.{PdfOcr,PdfOcrProgress,PdfOcrResult}
import org.overviewproject.pdfocr.exceptions._
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global // .map and .recover below need an ExecutionContext

val pdfOcr = new PdfOcr()                                 // default settings: finds tesseract in your $PATH
val inPdf: Path = Paths.get("/path/to/needs-ocr.pdf")     // exists
val outPdf: Path = Paths.get("/path/to/ocr-finished.pdf") // doesn't exist; will be deleted if it does
val process = pdfOcr.makePdfSearchable(inPdf, outPdf, Seq(new Locale("en")))

process.progress // Future[PdfOcrProgress]
  .map { progress =>
    // It's a Future because we don't know how many pages there are until
    // we begin parsing the PDF, which takes time.

    progress.value       // 0.0 ... 1.0
    progress.currentPage // 1 .. nPages
    progress.nPages      // n
  }

process.result // Future[PdfTextResult]
  .map { result =>
    // do something with outPdf now...

    // Also, since the data is handy and would otherwise take a long time
    // to compute, PdfOcr returns the text, in pages.
    val text = result.pages.map(_.text).mkString("\n")
  }
  .recover {
    // outPdf is guaranteed not to exist

    case e: TesseractMissingException => throw e
    case e: TesseractLanguageMissingException => throw e
    case e: EncryptedPdfException => throw e
    case e: InvalidPdfException => throw e
    // Other errors may happen -- PDFBox bugs, Tesseract bugs,
    // out-of-memory.... You shouldn't catch those.
  }

// Or if you got impatient, you could:
process.cancel // Future[Unit]
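
In a script or a test you may prefer to block until the work finishes instead of chaining callbacks. The following is a minimal sketch that uses only the Scala standard library. `BlockingSketch`, `awaitPages` and `fakeResult` are hypothetical names, and `fakeResult` stands in for the real `process.result` so the snippet runs without pdfocr or Tesseract installed.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.util.{Failure, Success, Try}

object BlockingSketch {
  // Blocks the calling thread until `result` completes or `timeout` elapses.
  // A failed Future (or a timeout) comes back as Failure instead of throwing.
  def awaitPages(result: Future[Seq[String]], timeout: FiniteDuration): Try[Seq[String]] =
    Try(Await.result(result, timeout))

  def main(args: Array[String]): Unit = {
    // Hypothetical stand-in for the page texts that process.result would yield.
    val fakeResult: Future[Seq[String]] = Future.successful(Seq("page one", "page two"))

    awaitPages(fakeResult, 30.seconds) match {
      case Success(pages) => println(pages.mkString("\n"))
      case Failure(e)     => println(s"OCR failed: ${e.getMessage}")
    }
  }
}
```

Blocking like this ties up a thread for as long as OCR runs, so it is best kept out of code that already runs on a shared thread pool.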

How PdfOcr behaves

PdfOcr processes one page at a time.
PdfOcr sends Tesseract any page that's missing fonts or has fewer than 100 characters of text.
PdfOcr's progress reports are page-by-page. If one page needs OCR and nine don't, progress will not track elapsed time: the nine quick pages and the one slow page each count for a tenth.
PdfOcr communicates with Tesseract via stdin and stdout.
For any method that will block on I/O, PdfOcr returns a Future. In other words: blocking methods are asynchronous.
PdfOcr does heavy computations (especially in PdfPage), which are slow. These never wait on I/O, so they run synchronously on the calling thread.
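
The page-selection rule and the page-by-page progress value can be sketched as follows. This is an illustration only, not pdfocr's implementation: `PageInfo`, `OcrSketch`, `needsOcr` and `progress` are hypothetical names, and the 100-character threshold is read as "fewer than 100 characters".

```scala
// Hypothetical summary of one page; these names are not part of pdfocr's API.
final case class PageInfo(hasMissingFonts: Boolean, text: String)

object OcrSketch {
  // Assumed reading of the rule: OCR a page when fonts are missing
  // or when it carries fewer than 100 characters of text.
  val MinTextLength = 100

  def needsOcr(page: PageInfo): Boolean =
    page.hasMissingFonts || page.text.length < MinTextLength

  // Page-by-page progress: the fraction of pages finished,
  // regardless of how long each page took.
  def progress(pagesDone: Int, nPages: Int): Double =
    if (nPages <= 0) 0.0 else pagesDone.toDouble / nPages

  def main(args: Array[String]): Unit = {
    val pages = Seq(
      PageInfo(hasMissingFonts = false, text = "x" * 500), // plenty of text: skip OCR
      PageInfo(hasMissingFonts = false, text = "scan"),    // almost no text: OCR
      PageInfo(hasMissingFonts = true, text = "y" * 500)   // fonts missing: OCR
    )
    println(pages.map(needsOcr)) // List(false, true, true)
    println(progress(9, 10))     // 0.9
  }
}
```

Because every page carries the same weight, a document with one slow page reports most of its progress in an instant and spends most of its wall-clock time on a single tenth, wherever that page falls.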

Developing

First, install sbt.

After that,

1. Run sbt ~test to run unit tests in the background.
2. Edit files in src/test until a test fails.
3. Edit files in src/main until the test passes.
4. Return to step 2.
5. Commit to a git branch, push it to GitHub, and submit a pull request.

Publishing

We use [sbt-sonatype](https://github.com/xerial/sbt-sonatype) for all this. See its documentation for more details.

Setup: using the sbt-sonatype instructions, ensure you've done these things:

Created an account at https://oss.sonatype.org and gotten access to this project.
Created ~/.sbt/1.0/sonatype.sbt with your credentials.
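
For orientation, a credentials file of the kind the sbt-sonatype documentation describes looks roughly like this. The username and password are placeholders, and the exact form may differ between plugin versions, so treat the plugin's own instructions as the authority.

```scala
// ~/.sbt/1.0/sonatype.sbt
// Placeholders: substitute your own Sonatype username and password.
credentials += Credentials(
  "Sonatype Nexus Repository Manager",
  "oss.sonatype.org",
  "(Sonatype user name)",
  "(Sonatype password)"
)
```

Keeping this file under ~/.sbt rather than in the project directory keeps the credentials out of version control.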

Then, every new version:

sbt publishSigned to deploy to staging
sbt sonatypeRelease to close and promote it

If the version ends in -SNAPSHOT, you won't be able to release it.
