OCR search support?

I’ve read elsewhere in the forum that this was planned in the past, but I haven’t seen anything recently.

Is there currently a plan to add this feature? If so, when could we expect it?

Note: I do not mean OCR recognition. I mean making PDFs that already have an OCR layer searchable within Tropy.

No one is working on this at the moment, so unfortunately I can’t give you a roadmap for it. We’d definitely like to add OCR support, though it hasn’t been decided whether this should be a plugin or built in.

Fetching the text layer from a PDF is likely not going to happen for now (though it could be done as an import plugin). The reason is that Tropy’s PDF support is built for PDFs used as containers for embedded images. Of course you can import regular PDFs, but since Tropy will convert them to images, this is not an ideal approach. A better solution would be dedicated viewer components for different file types beyond images (e.g., audio, video, text), because a dedicated PDF viewer component would be better suited to working with PDF text than the current image viewer.

Thanks for the reply. Would a quicker approach be to simply extract the text layer and insert it as a note on the first image?

Since every page will be turned into one image, I’d extract the text for each page and attach it as a note to the respective image. But one combined note attached to the first image would work as well.

Would such a solution be helpful to you? I’m sure we could come up with something useful if you have something like pdftotext or pdftohtml installed (either via plugin or using the API).
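To make the idea a bit more concrete, here is a rough Python sketch of the extraction step only, assuming pdftotext (from Poppler) is installed and on your PATH. The function names are made up for illustration and are not part of Tropy or any plugin. It relies on pdftotext ending each page with a form feed character by default, so the output can be split into one string per page. Attaching the text to the items in Tropy is left out here.

```python
import subprocess


def split_pages(raw: str) -> list[str]:
    """Split pdftotext output into one string per page.

    pdftotext ends each page with a form feed character by default,
    so the empty piece after the last form feed is dropped.
    Blank pages are kept so that indexes still line up with the images.
    """
    pages = raw.split("\f")
    if pages and not pages[-1].strip():
        pages.pop()
    return [page.strip() for page in pages]


def extract_page_texts(pdf_path: str) -> list[str]:
    """Run pdftotext on a PDF and return the text of each page.

    Passing "-" as the output file makes pdftotext write to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return split_pages(result.stdout)


def combined_note(pages: list[str]) -> str:
    """Join all non-empty pages into a single note for the first image."""
    return "\n\n".join(
        f"[Page {number}]\n{text}"
        for number, text in enumerate(pages, start=1)
        if text
    )
```

With something like this, extract_page_texts("scan.pdf") would give the per-page variant (one note per image), and combined_note(...) on that result would give the single note for the first image.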

Yes, simply pasting the OCR text as a note on the first image would be a big benefit. What you describe is better (OCR text on each page), but if what I described is quicker/easier, I would prefer that, as it would be a big leap in terms of workflow. (Currently I use a separate app for search, then paste the document name into Tropy to find it.)
