Optical Character Recognition?

nancyjacobs · July 23, 2018, 12:09pm

I am new to all data management software, including Tropy. I have organized my archival photos and am now working with content. My question is: is it possible to transfer this well-organized set of photos to some package that can perform OCR on the handwriting? It may be messy and I may have to edit it carefully, but it could be easier than transcribing the set.

inukshuk · July 23, 2018, 12:37pm

Text recognition (for both printed and handwritten texts) is something we would like to support in Tropy in the long run, likely via plugins to a dedicated OCR service. We’re still evaluating possibilities, so please feel free to post suggestions or ideas on the subject here.

More specifically, we’ve been exploring a potential collaboration with the Transkribus project, so that’s certainly something worth checking out!

nancyjacobs · July 24, 2018, 5:33am

Thanks. Transkribus is very intriguing but it would take a lot of time and effort to teach it the handwriting on these surveys. One further question: is it possible to export my items as multipage PDFs?

abbymullen · July 24, 2018, 1:20pm

Hi Nancy,

No, it’s not currently possible to export your items as multipage PDFs. Could you describe a little more about why you would want that, and what you’d imagine that type of export would look like?

mhedstrom · July 17, 2019, 5:49pm

Along these same lines, I am wondering if anyone has developed a workflow using something like Automator (or AppleScript?) that might allow me to take the photos I have in Tropy, run them through an OCR app like ABBYY FineReader, and then paste the results in the notes field of Tropy.

I could certainly do this manually, one photo at a time, but ideally it would all be scripted or automated. I am still trying to figure out how to do it, but it should be possible. If I come up with a workflow myself, I’ll certainly share it here.

(I am, by the way, a 20th-century US historian, so nearly all my documents are either typed or printed. I have very little handwriting to deal with other than signatures on letters.)

inukshuk · July 17, 2019, 8:06pm

OCR notes are exactly one of the motivations for providing an API to your project (we’ll likely start rolling out the API with Tropy 1.7 this fall or early winter). That should make it fairly easy to write a script which, for example, runs through all your photos, sends them to an OCR app and writes the result back to Tropy as a note attached to the photo.

Since the Tropy project file is a SQLite database you can already do this today using SQLite. If you want to give this a try, we’re happy to help. Adding notes to photos is relatively easy (working with metadata fields is more complicated, because of Tropy’s support for templates). Basically, you would run a simple query using SQLite to fetch all photo paths and ids, send each photo to the OCR app, and then create a note using the photo id and the OCR result. Like I said, I’m happy to help you sort this out if you like – once we add the API something like this will definitely find its way into the documentation.

mhedstrom · July 17, 2019, 8:32pm

I’d be thrilled for help with this, having never used SQLite. Happy to be a guinea pig for others. I know my way around the Mac pretty well and can follow directions!

inukshuk · July 18, 2019, 8:35pm

It all depends on what kind of script you have in mind. I’d suggest you start with a script that can work with a list of (hard-coded) file paths and send them to your OCR tool to get text back for each file.
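A minimal sketch of that first step, assuming POSIX sh and Tesseract as a stand-in OCR tool (this thread does not settle on one). The file paths, the `OCR_CMD` variable and the `ocr_photo` / `ocr_all` function names are made up for illustration, so swap in whatever command-line OCR program you actually use. It writes each photo's recognized text to a `.txt` file next to the photo and skips paths that do not exist.

```shell
#!/bin/sh
# Sketch only: OCR a hard-coded list of photos, one text file per photo.
# Assumption: the OCR tool is Tesseract, which prints the recognized text to
# standard output when the output name is given as "stdout".
OCR_CMD="${OCR_CMD:-tesseract}"

# Run the OCR tool on one image and print the recognized text.
ocr_photo() {
  "$OCR_CMD" "$1" stdout
}

# OCR every path given as an argument; write the result next to each photo.
ocr_all() {
  for photo in "$@"; do
    if [ -f "$photo" ]; then
      # letter-001.jpg -> letter-001.txt
      ocr_photo "$photo" > "${photo%.*}.txt"
    else
      echo "skipping missing file: $photo" >&2
    fi
  done
}

# Hypothetical hard-coded paths; replace these with your own photos.
ocr_all "photos/letter-001.jpg" "photos/letter-002.jpg"
```

Once this works for a handful of files, the hard-coded list is the only part that needs to change.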

You could then plug in the actual files of your Tropy project. For example, in a macOS terminal/shell you could get the list with the following command:

$ sqlite3 -noheader -list moby.tpy "select path from photos"

This would print each photo’s path on a line by itself. Other output formats such as CSV are also possible, but one path per line is easy to consume in typical shell scripts. Obviously this query could be tuned (e.g., to exclude deleted items or to include only photos in a certain list, and so on).

I’ll save creating notes for later. For that we’ll also need to print out the id of each photo along with its path. If we can manage to send those paths on to your OCR tool and get text back, I’m certain we’ll manage to send the results back to Tropy.
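A sketch of how the id could travel along with each path, again assuming POSIX sh and Tesseract as a stand-in OCR tool. With `-list`, sqlite3 separates columns with `|` by default, so a query such as `select id, path from photos` would print lines like `12|/path/to/photo.jpg`. The `id` column name is an assumption based on the description above and should be checked against your own project file. The function name `ocr_listed_photos` and the output directory are hypothetical.

```shell
#!/bin/sh
# Sketch only: read "id|path" lines on standard input, OCR each photo, and
# save the text as <outdir>/<id>.txt so it can be matched to the photo later.
# Assumption: Tesseract is the OCR tool ("stdout" makes it print the text).
OCR_CMD="${OCR_CMD:-tesseract}"

ocr_listed_photos() {
  outdir="$1"
  mkdir -p "$outdir"
  # Split each line at the first "|": the id comes first, the rest is the path.
  while IFS='|' read -r id path; do
    if [ -f "$path" ]; then
      # < /dev/null keeps the OCR tool from swallowing the remaining lines.
      "$OCR_CMD" "$path" stdout < /dev/null > "$outdir/$id.txt"
    else
      echo "skipping photo $id: $path not found" >&2
    fi
  done
}

# Intended use (not run here):
#   sqlite3 -noheader -list moby.tpy "select id, path from photos" \
#     | ocr_listed_photos ocr-output
```

Naming each text file after the photo id keeps the link between OCR result and photo, which is what a later note-creating step would need.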

HSJensen · July 19, 2019, 12:36pm

Do you think it would be possible to use the OCR tool that is available for Google Docs? It seems really good.

mhedstrom · July 20, 2019, 3:45am

I’ll need to do some work – some research and learning – to be able to fully follow what you have given above.

By the way, I am sure you are super busy, but I am close to George Mason and could swing by if you’d find it useful to work with an ordinary user on a script like this. In any case, thank you!

inukshuk · July 20, 2019, 11:51am

Using Google’s OCR sounds tempting, especially if you store your Tropy photos in Google Drive. Apparently it’s possible to OCR the photos if they’re in Google Drive, but I’m not sure if this is easily scriptable. Even if not, it might be relatively easy to kick the process off manually for the photos you want. The issue, however, is that Google stores the results in Google Docs, as you pointed out, and not in files which are synced back to your local drive (at least I believe that’s the case). To write a script that runs on your computer and creates notes in your Tropy project file, the script must be able to access the contents of the OCR result, either in a local file or at a URL where the contents can be downloaded. I don’t think it’s possible to access the contents in Google Docs easily like that.

So unless I’m missing something, using Google’s OCR should be relatively straightforward, especially if you already keep your photos in Google Drive. But you’d have to copy the OCR contents manually back into Tropy.

inukshuk · July 20, 2019, 12:03pm

I think the next step is to find out how to script sending a photo to your OCR tool to receive text content. This depends on your OCR tool of choice and the kind of script you’d like to use. My example above assumes a Unix shell script of some kind (which should work fine on macOS, Linux and even Windows nowadays); it should also work with Automator / AppleScript.

I’m a few timezones away from GMU myself, but I’ll keep this in mind, thanks for the offer! Once we start rolling out the API in the fall we were also planning on 1-2 blog posts to demonstrate possible usage, so perhaps we could collaborate on something like that?

jbbennett · October 27, 2021, 7:48pm

Have there been any developments on this OCR front? I’m a new user with lots of images just imported into Tropy (mostly typescript) and am trying to figure out how best to use an OCR tool to get the transcription into the notes section.

inukshuk · October 28, 2021, 7:45am

We’re considering a built-in OCR solution for the next development cycle. Meanwhile, I think the best approach would be to use a dedicated OCR tool and use a script to create JSON data to import the photos with notes. I’m happy to help with this if you would like to give it a try.

jbbennett · October 28, 2021, 4:52pm

I appreciate the offer, though I have no experience or skills with that sort of thing, so that would be a big ask. My current process is to open the image in a Google Doc, which does the OCR below the image in the doc, and then manually paste the text into the notes section of Tropy.

JayLiemy · March 3, 2023, 9:20am

From my experience, messy handwriting isn’t recognized well by OCR apps.

AndewBear · March 7, 2023, 11:56am

That’s true. And also, Tropy itself doesn’t currently have OCR functionality built-in, but there are plenty of OCR software options available that you could use in combination with Tropy.
One OCR engine is Smart Engines, which supports many different languages and scripts (as far as I know it is a commercial product rather than open source, so check the licensing first). Try exporting your photos from Tropy and using Smart Engines to extract text from them. Depending on the quality of the handwriting, you may need to do some manual editing to correct errors in the OCR output.