CSV import of multipage PDFs

rharper · June 29, 2025, 8:50pm

New user here. I have several hundred multipage PDFs (scanned manuscripts) with metadata recorded in a csv file. I used the CSV plug-in to import them. The metadata imported perfectly, but the plug-in imported only the first page of each pdf file. To get the full file, I had to go to each individual item and use the “add photo” feature. Is the plug-in limited to importing just the single page, or is there a way for me to import the entire file at one go? I have a few thousand more such documents to import so I’d like to do this as efficiently as possible. Thank you.

inukshuk · June 29, 2025, 8:58pm

The CSV import assumes that the metadata structure is given by the file and doesn’t make any assumptions. It’s possible to import multi-page items this way, but you’d have to specify each photo in an item by repeating the relevant columns in the CSV file. That said, I believe that this solution was intended for importing multiple files and does not work well with multi-page PDFs, because we currently have no column to indicate the page in a file. So this way, you could import all the pages with metadata but the photo would always show the first page in the PDF.

I think we’ll have to add a ‘page’ column to the CSV plugin to support this.

rharper · July 1, 2025, 5:59pm

Related question: when importing multipage documents directly (not via the csv plugin), is it easier to import individual images and then merge them within Tropy, or to import them as multipage PDFs (one document per pdf) and have Tropy convert them into multi-image items? I understand that Tropy was originally developed to work with individual images, but the conversion process for pdfs seems to simplify things by keeping the document pages together.

rharper · July 1, 2025, 7:04pm

Also, regarding the original question: it sounds like the optimal approach with the csv plugin would be to mass import the metadata without providing the path, and then add each pdf to its respective item one at a time. In other words, use the csv plugin to import the collection metadata, then use “add photos” to bring in the pdfs. Does that make sense?

inukshuk · July 3, 2025, 9:27am

If you have a choice, I’d always opt to import the pages as separate images rather than as a multi-page PDF. Tropy always preserves your original import, so if you import multiple pages from a PDF, each one will always remain a page of that PDF. That’s not necessarily a bad thing of course, but Tropy allows you to later move the page/photo around (e.g. split it out to be a separate item, or move it to another item altogether) and all photo operations that require the original image will always have to extract the page from the original file. Also, for PDF pages, Tropy has to create full-size images for each page in the image cache – if you import images instead, that is not required.

So, coming back to your question, I would recommend importing the photos separately and then combining them in Tropy as necessary. You can even consider using lists to group collections of photos. Tropy’s interface works best if you keep most information at the item-level – if you have items with hundreds of photos in them, it’s more difficult to access information at the photo level. This is something we’d like to improve in the future, but with the current UI I’d try to keep the number of photos per item low.

Yes, with the current CSV plugin you could import the first page of each PDF and then add each PDF again to get all the other pages. What I would recommend, though, is to first convert the PDFs to images (one folder per PDF) and then update the paths in the CSV to point to the first page/image instead. This way you could also add all the other pages if you add extra path columns at the end. Doing it this way would also allow you to add metadata at the photo/page level. But you could also just import the first page and then add the remaining pages later on.
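
For anyone who wants to script the path update, here is a rough Python sketch of the idea, not something taken from the plugin itself. It assumes the PDFs have already been converted to images, with one folder per PDF named after the PDF file (e.g. `doc1.pdf` becomes `images/doc1/`), and that page files are zero-padded so they sort in page order. The column name passed as `path_column`, the folder layout, and the choice to repeat the same header name for the extra path columns are all assumptions; check them against how your CSV template is actually set up before importing.

```python
import csv
from pathlib import Path

# Assumed set of image types produced by the PDF-to-image conversion.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def page_images(pdf_path, image_root):
    """Return the page images in image_root/<pdf stem>/, sorted by name.

    File names are assumed to be zero-padded (page-001, page-002, ...),
    so that a plain sort gives the page order.
    """
    folder = Path(image_root) / Path(pdf_path).stem
    if not folder.is_dir():
        return []
    return sorted(
        str(p) for p in folder.iterdir()
        if p.suffix.lower() in IMAGE_SUFFIXES
    )


def rewrite_rows(rows, path_column, image_root):
    """Point each row at its first page image and append the remaining
    pages as extra path columns at the end of the row.

    Rows whose image folder is missing are left unchanged.
    """
    header, *records = rows
    idx = header.index(path_column)
    expanded = []
    max_extra = 0
    for record in records:
        new = list(record)
        pages = page_images(record[idx], image_root)
        if pages:
            new[idx] = pages[0]
        expanded.append((new, pages[1:]))
        max_extra = max(max_extra, len(pages) - 1)
    # Assumption: extra photos are declared by repeating the path header.
    out = [list(header) + [path_column] * max_extra]
    for record, extra in expanded:
        out.append(record + extra + [""] * (max_extra - len(extra)))
    return out


def rewrite_csv(src, dst, path_column, image_root):
    """Read src, rewrite the path columns, and write the result to dst."""
    with open(src, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    with open(dst, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rewrite_rows(rows, path_column, image_root))
```

You would call it as something like `rewrite_csv("metadata.csv", "metadata-images.csv", "path", "images")`, where the file names and the `path` column are placeholders for your own. It writes a new file rather than overwriting the original, so you can inspect the result first.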

rharper · July 3, 2025, 5:41pm

Thank you! Some of my pdfs contain up to 100 pages each, comprising many short documents. So I am finding the best method, as you suggest, is to extract the pages in Acrobat as single-page files, then import all of them into Tropy. Many thanks for your help.