Integrate ALTO documents in Omeka-S / Universal Viewer

Fred · January 21, 2021, 1:46pm

Hi,

I’m looking to integrate a newspaper collection into Omeka-S. Each document in this collection is available both in PDF format (with plain text) and in ALTO format (derived from my PDF documents).

My first approach was to access my PDF documents with the PDF viewer embedded in each page. In this case, it’s not possible to use the IIIF API to redistribute and reuse the information contained in these documents, for example. The search functionality in the PDF viewer is also a little basic.

So, I want to use my ALTO documents in Universal Viewer. At this time, I don’t understand exactly what the right way to do this is. For each PDF document, several files are produced during the ALTO conversion process: one XML file per document and some PNG files (one per page of the document). How can I properly link these files (XML + PNGs) to one item?

Ideally, I want to be able to search terms directly in Universal Viewer, but I don’t know if the IIIF Search module is up to date with the latest version of Omeka-S.

So, do you have any idea how to do this properly?

Thanks in advance.

dan · January 21, 2021, 4:05pm

I have no clue either, but am also interested in the topic. There is still the module ExtractText. Maybe that could be an approach?

Daniel_KM · January 22, 2021, 12:51pm

The pdf format is not managed by the IIIF standard, which supports only images (before version 3) and, in v3 only, audio and video. Universal Viewer supports pdf because it interprets the IIIF manifests in a way that lets it display pdf too, and other formats (3D), if managed. So it’s not possible for IIIF to search inside a pdf.

Nevertheless, it’s a common need. So @symac, who develops the module Iiif Search, created the module ExtractOcr, which extracts the text from the pdf into a single xml file; the module Iiif Search then searches inside that file to answer the requests from the viewer. Of course, because the IIIF Search protocol manages only images, each image is attached to the item.
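
To make the request/response side more concrete, here is a minimal Python sketch of the kind of answer a viewer expects from a IIIF Content Search 1.0 service: an annotation list where each hit points at a canvas region through `#xywh=`. This is only an illustration of the protocol, not the module's actual code; the function name `search_response`, the URLs, and the shape of the `hits` dictionaries are assumptions made for the example.

```python
def search_response(search_url, canvas_uris, hits):
    """Build a IIIF Content Search 1.0 style annotation list.

    search_url  -- URL of the search request being answered (hypothetical)
    canvas_uris -- canvas URIs in page order, one per image of the item
    hits        -- dicts with keys: page (1-based), text, x, y, w, h
                   (an assumed shape, chosen for this sketch)
    """
    resources = []
    for n, hit in enumerate(hits, start=1):
        canvas = canvas_uris[hit["page"] - 1]
        region = f"{hit['x']},{hit['y']},{hit['w']},{hit['h']}"
        resources.append({
            "@id": f"{search_url}#hit-{n}",
            "@type": "oa:Annotation",
            "motivation": "sc:painting",
            "resource": {"@type": "cnt:ContentAsText", "chars": hit["text"]},
            # The viewer highlights this rectangle on the page image.
            "on": f"{canvas}#xywh={region}",
        })
    return {
        "@context": "http://iiif.io/api/presentation/2/context.json",
        "@id": search_url,
        "@type": "sc:AnnotationList",
        "resources": resources,
    }


# Example with made-up values: one hit on the first page of an item.
example = search_response(
    "https://example.org/iiif-search/1638?q=viewer",
    ["https://example.org/iiif/1638/canvas/p1"],
    [{"page": 1, "text": "viewer", "x": 190, "y": 200, "w": 90, "h": 20}],
)
```

This is why every page needs its own image: each hit has to be expressed as a rectangle on a canvas.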

So to support Alto, you can either convert the xml files into the simple xml format used by the module, or adapt the module to support Alto directly. Or you can wait for February, because I’ll do it for a project soon.

Daniel_KM · January 22, 2021, 12:55pm

And the module is up to date, I use it here for example: https://collections.maison-salins.fr/s/patrimoine/item/1638.

dan · January 26, 2021, 7:48am

That’s fantastic!
Do you add the Alto XML to the item as media in the Omeka backend? Or how does the connection between image, item and full text work? Can the IIIF Server module already support the Search API?

Daniel_KM · January 27, 2021, 7:18am

Each image and the pdf are attached to the item; the module ExtractOcr extracts the ocr from the pdf and the module Iiif Search searches inside it.

The alto xml files will be managed at the end of February. I don’t know yet if they will be attached to the item or not. The aim is to avoid having to extract the ocr (but it is used anyway for the main search).

dan · January 28, 2021, 2:51pm

Is it mandatory to have a PDF? Often you only have the images and the OCR results as Alto XML (Page XML, or whatever). IIIF server would then have to combine images and XML, right?

It would be good if there was a way to get along without the ExtractOcr module, because the OCR could already be there from another OCR engine (e.g. ABBYY).

Daniel_KM · January 28, 2021, 3:00pm

Indeed, the pdf will no longer be needed once Iiif Search is able to extract strings and positions from the Alto files. In the case of the site above, the pdf will still be used, because we want to allow users to download it.
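
As a rough idea of what "extracting strings and positions" from Alto means, here is a small Python sketch using only the standard library. It reads the `CONTENT`, `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes of each `String` element and counts `Page` elements to know which image a word belongs to. It is not how the module does it; the function name `alto_words` and the sample document are made up for the example, and it assumes the coordinates are already in pixels of the page image (Alto also allows other measurement units, which would need scaling).

```python
import xml.etree.ElementTree as ET

# A tiny made-up Alto document with one page and two words.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page ID="p1" WIDTH="1000" HEIGHT="1500">
      <PrintSpace>
        <TextBlock ID="b1">
          <TextLine>
            <String CONTENT="Omeka" HPOS="100" VPOS="200" WIDTH="80" HEIGHT="20"/>
            <SP/>
            <String CONTENT="viewer" HPOS="190.0" VPOS="200" WIDTH="90" HEIGHT="20"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Drop the XML namespace so Alto v2, v3 and v4 files are handled alike."""
    return tag.rsplit("}", 1)[-1]


def alto_words(xml_text):
    """Return one dict per word: page number (1-based), text and bounding box."""
    root = ET.fromstring(xml_text)
    words = []
    page = 0
    # iter() walks the tree in document order, so a Page element is always
    # seen before the String elements it contains.
    for elem in root.iter():
        name = local_name(elem.tag)
        if name == "Page":
            page += 1
        elif name == "String":
            words.append({
                "page": page,
                "text": elem.get("CONTENT", ""),
                "x": int(float(elem.get("HPOS", 0))),
                "y": int(float(elem.get("VPOS", 0))),
                "w": int(float(elem.get("WIDTH", 0))),
                "h": int(float(elem.get("HEIGHT", 0))),
            })
    return words


def find_hits(words, query):
    """Case-insensitive exact match on single words (a deliberate simplification)."""
    q = query.lower()
    return [w for w in words if w["text"].lower() == q]
```

With something like this, only the images and the Alto file are needed to answer a search; the pdf is just a download.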

Fred · February 2, 2021, 11:16am

Thanks for your reply. The last time I checked, this module was not yet up to date.

This example is exactly what I want to obtain. The search functionality seems to be active when a valid XML file is explicitly added to the item. In my case, for the test, I added the ALTO XML file but searches return nothing (if these files are not yet managed by the module, that’s expected).

Unfortunately, nothing happened in the viewer when I added my PDF file to the item. The text is correctly extracted from my PDF and added to the metadata of the document. I see that an XML file is generated, but I get an error during the process and I’m investigating. It’s probably an error on my side…

Daniel_KM · February 4, 2021, 1:58pm

Check in the browser console, it may help.

Fred · February 9, 2021, 12:50pm

I’ve found the problem. The way to solve it is explained in the link above: Thumbnails are not generated

Now, both my xml file and pdf thumbnail are generated without any error. On the other hand, I’m still not able to perform a search in the viewer when this xml file is added as a media to the item (no search bar is displayed in Universal Viewer). I’ll keep looking into that…
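
One thing that can be checked at this point, offered as a guess rather than a confirmed diagnosis: a IIIF viewer generally offers search only when the manifest of the item declares a search service. The Python sketch below looks for such a declaration in a manifest that has already been loaded as a dictionary. The function name `find_search_service` and the sample manifest are made up for the example, and it assumes the service is declared at the top level of the manifest with a profile starting with `http://iiif.io/api/search/`.

```python
SEARCH_PROFILE_PREFIX = "http://iiif.io/api/search/"


def find_search_service(manifest):
    """Return the URL of the declared search service, or None if there is none."""
    services = manifest.get("service", [])
    # A manifest may carry a single service object or a list of them.
    if isinstance(services, dict):
        services = [services]
    for service in services:
        if not isinstance(service, dict):
            continue
        profile = service.get("profile")
        if isinstance(profile, str) and profile.startswith(SEARCH_PROFILE_PREFIX):
            return service.get("@id") or service.get("id")
    return None


# Made-up manifest fragment that does declare a search service.
sample_manifest = {
    "@type": "sc:Manifest",
    "service": {
        "@context": "http://iiif.io/api/search/1/context.json",
        "@id": "https://example.org/iiif-search/1638",
        "profile": "http://iiif.io/api/search/1/search",
    },
}
```

If the function returns `None` for the real manifest of the item, the missing search bar would be explained on the manifest side rather than in the viewer.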
