Enable PDF import

Birgit.Reissland · June 6, 2018, 7:15pm

Hi, I LOVE Tropy, but is there any chance that you could enable us to import PDF and other image file formats?

inukshuk · June 7, 2018, 12:26pm

Yes, TIFF, JPEG2000 and PDF import are high up on our wishlist. We’re not currently working on it, but we’ll definitely get to it. Stay tuned!

HistoryResearcher · June 25, 2018, 7:15pm

I wanted to chime in here that after searching far and wide for a software solution for managing archival research, Tropy is the first that has the potential to be the best fit for me. However, the vast majority of files I receive from archives are in TIFF or PDF format. Personally, I always take photos in RAW and convert to TIFF (never jpeg, unless it is my only option). Virtually none of my files are JPEGs.

I would strongly request that your team consider working on TIFF and PDF import. Great work on an awesome piece of software. I look forward to following your project as it develops!

kutne · July 23, 2018, 3:39pm

Another echo for TIFF and PDF import capabilities

lovesarthistory · August 27, 2018, 4:30pm

Yes, please do add support for TIFF files, and please let us know when it is in the sandbox or scheduled to be available. Almost all museums supply hi-res images in TIFF, and it would be wonderful if art historians could use Tropy as it is intended to be used, using the files they have.

torncurtain · August 30, 2018, 3:55pm

A possible workaround I thought of—before support for opening PDF, TIFF, or any other file format within Tropy would be added—is that you could add the ability to create entries in the Tropy database for items of any file format. That way we could add our metadata to those files, so they could be put alongside all the JPEGs we currently have, and sort them, tag, take notes etc. But once we double click on the file reference in Tropy, it could open in an external program of our choosing, based on the file extension.

This occurred to me as a possible stopgap measure until other file formats could be read natively within Tropy. Mendeley functions this way for epub files, which can be added to our reference library, but open in an external program, unlike PDFs which open natively.
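The stopgap described above, where Tropy stores a reference to any file and hands it to an external program chosen by file extension, could be sketched roughly as follows. This is a minimal illustration only: the `EXTERNAL_VIEWERS` table, the program names in it, and the `viewer_command` helper are all hypothetical and are not part of Tropy.

```python
import os

# Hypothetical mapping from file extension to an external viewer command.
# The program names are placeholders, not anything Tropy ships with.
EXTERNAL_VIEWERS = {
    ".pdf": "pdf-viewer",
    ".tif": "image-viewer",
    ".tiff": "image-viewer",
    ".epub": "ebook-reader",
}


def viewer_command(path, viewers=EXTERNAL_VIEWERS):
    """Return the command list that would open `path` externally, or None.

    None means no external viewer is configured for this extension, so the
    file would be handled natively (as JPEGs are today).
    """
    ext = os.path.splitext(path)[1].lower()
    program = viewers.get(ext)
    if program is None:
        return None
    return [program, path]
```

The returned list could then be passed to something like `subprocess.run`, while the metadata, tags, and notes for the file would stay in the project database.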

The ability to work and sort files of other formats is a feature that is becoming more urgent in my own work, simply because different periods and archives may have resources associated with them that are not in JPEG format.

Thanks and I would love to hear what others think about this.

abbymullen · August 30, 2018, 6:00pm

We’ve also considered this possibility, though we haven’t come to a firm decision yet. Stay tuned!

Emily · October 15, 2018, 9:29pm

Agreed! Many photos I use Tropy for are multi-page pdf files (so I can group documents together that span several pages), and it would be very useful to be able to keep them with other photos.

tkelly7 · November 16, 2018, 6:47pm

Being able to load .pdf files into Tropy is increasingly important to me because a number of the archives I work with are scanning their textual sources (correspondence, newsletters, etc.) as searchable .pdf files. If I could load them directly into Tropy, I’d be in heaven.

As for .tiff files, I’ve been using a service to scan some largish map files of late and they return them to me as .tiff files. So echoes from me for needing to be able to load .tiff and/or RAW files.

Thanks!

inukshuk · November 19, 2018, 9:44am

TIFF support is almost ready and will likely go into a beta release later this week.

As for PDF, since you mentioned that the files have been OCR’ed already, I’d be curious to know, would you expect to import only the photos included in the PDF into Tropy, or also the text (as notes)? And would you be able to share such a PDF with us for testing?

tkelly7 · November 19, 2018, 3:35pm

Hi:

At a minimum I’d want to incorporate the images of the pdfs, because I’m using Tropy not just as an image database, but as my project database. Being able to include pdfs there would help me consolidate my research images into one location rather than having them spread across Tropy and Zotero. If I could incorporate the text as notes that would be even better. I’ll send Abby a sample of one of the files so you can play with it. It is made from the campus copiers here rather than from my own scanner.

rwaxman · November 19, 2018, 5:13pm

Just want to add my support for PDF imports. I use TinyScanner for my archive photos, which automatically saves as a PDF, and I’d love to be able to import these files.

torncurtain · November 20, 2018, 10:01am

I’d like to echo @tkelly7’s concerns and input about PDF import. After a certain point in my research period, the archive switches from jpg images I took myself to PDFs generated by the archive. It would be amazing to have all of these in one place.

OCRed text import would be incredible, but I agree that the first step is getting them into Tropy in the first place, so I can edit the metadata and sort and integrate them into my full project.

I can send you sample files or you can literally use millions of sample files from the US National Archives Access to Archival Databases website. I am working specifically with the Central Foreign Policy Files Database. Due to the small file size, clarity of the text, and already present list of metadata, these ought to be among the easier files to deal with.

Full Central Foreign Policy Files Database. https://aad.archives.gov/aad/series-description.jsp?s=4073&cat=all&bc=sl

Sample search (this takes you to a page with many records to download): https://aad.archives.gov/aad/display-partial-records.jsp?dt=2472&sc=25993%2C25962%2C25986%2C25942%2C25958%2C25973%2C25959%2C25946&cat=all&tf=X&bc=sl%2Cfd&q=&as_alq=&as_anq=&as_epq=&as_woq=&nfo_25993=D%2C0%2C1900&op_25993=3&txt_25993=&txt_25993=&nfo_25962=V%2C0%2C1900&op_25962=0&txt_25962=&nfo_25986=V%2C0%2C1900&op_25986=0&txt_25986=&nfo_25942=V%2C0%2C1900&op_25942=0&txt_25942=Beirut&nfo_25958=V%2C100%2C1900&op_25958=0&txt_25958=&nfo_25973=V%2C0%2C1900&cl_25973=&nfo_25959=V%2C0%2C1900&op_25959=0&txt_25959=&nfo_25946=V%2C4000%2C1900&op_25946=0&txt_25946=&rpp=50

IntrepidHistorian · January 3, 2019, 1:27pm

I would also like to add my support for PDFs. Any news on the timeline for this feature?

rwdahn · January 24, 2019, 3:14pm

Hi, I just found out about your project today. (Wish it had been available while I was on my archival trip a few years back!) Great that you all are filling this desperately needed hole in academic software!

I would love to use Tropy going forward, but 99% of my archival images are in PDF format, and most archives I visit (in Germany) provide PDFs when electronic copies are allowed to be made. So PDF import would be a very valuable feature for me and, I think, for many other researchers. Do you think this could be added? Thanks so much!

moxostoma · February 2, 2019, 2:08am

Before I start converting a large number of pdfs to images so I can include them in a project, is there any way to guess when Tropy might be able to read them?
Please don’t read this as anything other than a question–I’m not urging you to hurry or demanding the feature. Just curious.

inukshuk · February 4, 2019, 10:55am

For PDF import we need to distinguish between two different kinds of PDF usage (the examples in this thread cover both): what I would call ‘real’ PDF documents and PDFs which are collections of (raster) images. Support for the latter is what we’re looking to add to Tropy soon (please don’t hold me to it, but within a month or two) – this would basically extract all the images from the PDF and import them all into a single item (after import you can split up the photos into multiple items any way you want); if you share or move the project file, Tropy would still need access to the original PDF to extract the images again.
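To make the raster-image case above concrete, here is a minimal sketch of the idea: every image extracted from a PDF becomes a photo in one item, each photo keeps a reference to the original PDF, and the item can be split afterwards. The `Photo` and `Item` classes and both helper functions are assumptions made for illustration; they do not reflect Tropy’s actual data model, and no real image extraction happens here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Photo:
    source_pdf: str   # path to the original PDF, still needed after import
    image_index: int  # position of the extracted image within that PDF


@dataclass
class Item:
    title: str
    photos: List[Photo] = field(default_factory=list)


def import_pdf(path, image_count):
    """Create one item holding a reference to every image in the PDF."""
    photos = [Photo(path, i) for i in range(image_count)]
    return Item(title=path, photos=photos)


def split_item(item, at):
    """Split an item into two items at photo position `at`."""
    first = Item(item.title + " (1)", item.photos[:at])
    second = Item(item.title + " (2)", item.photos[at:])
    return first, second
```

Because each photo stores only a path and an index, moving the project without the PDF would leave those references dangling, which is why the original file would still be needed.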

For proper PDF documents there are multiple possibilities: we could convert each page into an image and import those the same way; or we could replace Tropy’s image viewer with a PDF viewer. Either way, this is something that has not been designed yet (it’s part of the discussion of supporting resource material other than just images).

moxostoma · February 4, 2019, 2:34pm

Thanks for the info.
I kind of figured it would be raster images only, at least at first.

Are comments or highlighted text in PDFs stored in a way that can be read? If so, then perhaps there’s a way it could grab those along with the image of the page.

It would be useful to allow the user to choose only specific pages to extract and import, since sometimes you might only need a page or two from a 500 page pdf. For me, in most cases, it would be only the pages that have comments/highlights/annotations, since I went through and marked all the useful info in hundreds of PDFs several years ago. If the PDF flags those pages in some way that’s visible to other software, that could be a useful feature.
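The page selection suggested above could look something like this. It is a minimal sketch, assuming the user types a spec such as `2,10-12`; the `parse_page_spec` function and its spec syntax are hypothetical and not an existing Tropy feature.

```python
def parse_page_spec(spec, page_count):
    """Turn a spec like "2,10-12" into a sorted list of 1-based page numbers.

    Raises ValueError if a page or range falls outside 1..page_count.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start, end = part.split("-", 1)
            first, last = int(start), int(end)
        else:
            first = last = int(part)
        if first < 1 or last > page_count or first > last:
            raise ValueError(f"page range out of bounds: {part!r}")
        pages.update(range(first, last + 1))
    return sorted(pages)
```

If an importer were able to detect which pages carry comments or highlights, that list of page numbers could be merged into the same set before extraction.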

StoltHD · March 12, 2019, 8:12pm

But why would you want to do all the work over again, when there are multiple open source solutions for every feature Tropy will ever need, including both text-based and image-based PDFs?
Just look at Zotero and its PDF plugin, and just use a newer version of the libraries… if you don’t use JS, there are lots of other libraries out there too…
Same with TIFF, there are multiple solutions out there for any image viewer with edit functionality…
dcraw for RAW, CImg for the rest of the formats… or similar, depending on your programming language…
If you integrate ImageMagick, you will get anything and everything you will ever need…

I really hoped for this project to be something I could use, but since it doesn’t support the most basic file formats, a combination of Zotero, digiKam, and Transcript is still the only solution if I want to use desktop software.

The design is clean, it’s not bloated, and that’s how it should be…

disstime · August 29, 2019, 11:28am

I’ve seen a few references to PDF import coming soon over on the Tropy Twitter account… is there any news as to when this might be happening? If not, that’s fine! In any case, I’m looking forward to it (and prepared to start converting some of my PDFs in the meantime)!