OCR-ed PDFs - best way?

Hi.
I have read some posts about this topic here on the Omeka forum. I need the best and simplest way to manage OCR-ed PDFs so that they can be searched, not only from outside, but also when they are opened inside Universal Viewer.
I’m willing to convert those PDFs to any other format if that is needed and works better with Universal Viewer and Omeka. Any suggestions?

No suggestions? I would be satisfied with another viewer that displays only PDF documents (I want to keep Universal Viewer for images as it is perfect for that). I would also be content if search functionality is available only within the document when opened in the viewer; I just need some suggestions.

Hi,

We’ve set it up as follows in our Omeka S instance.

  • The PdfViewer module is used for viewing PDF files. For image files we use Mirador or UniversalViewer.
  • We’ve adapted the theme so that it uses PdfViewer when the mediaType of the resource is PDF, and an IIIF viewer (Mirador or UniversalViewer) in any other case. View the code here.
  • OCR’ed content is extracted from the PDF with the ExtractText module, which stores the extracted text as a string in a metadata field at item level.
  • Metadata is indexed in Solr. For this you need 2 modules (AdvancedSearch and the Solr adapter) and a separate Solr instance.
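
For readers who want to see the viewer-selection rule from the second bullet spelled out, here is a minimal sketch. It is not the actual theme code (that lives in the theme's PHP templates); it only illustrates the decision in Python. The function name `choose_viewer` and the returned labels are hypothetical and chosen for illustration.

```python
# Hypothetical sketch of the viewer-selection rule described above.
# The names here are illustrative; they are not taken from the PdfViewer
# module or from the actual theme code.

def choose_viewer(media_type: str, iiif_viewer: str = "UniversalViewer") -> str:
    """Return the name of the viewer to render for a given MIME type."""
    # Drop any parameters (e.g. "; charset=binary") and ignore case.
    base_type = media_type.split(";", 1)[0].strip().lower()
    if base_type == "application/pdf":
        return "PdfViewer"
    # Everything else (images etc.) goes to the IIIF viewer.
    return iiif_viewer


if __name__ == "__main__":
    for mt in ("application/pdf", "image/jpeg", "image/tiff"):
        print(mt, "->", choose_viewer(mt))
```

Passing `iiif_viewer="Mirador"` would switch the fallback viewer, which mirrors the "Mirador or UniversalViewer" choice mentioned above.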

In this way, one can search through the full text contents of all PDFs using the AdvancedSearch module. This returns a result set of items that match the search string.
Subsequently, a user can open any of the items in the result set and enter the same search string in the search box of the PDF viewer to find the exact page where the text is located.
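
The two-step nature of this search can be illustrated with a small in-memory sketch. It does not use Solr or the PDF viewer; the `Item` class, the helper functions, and the sample data are all made up for illustration. Step 1 matches against the full text stored at item level, and step 2 locates the pages inside one chosen item.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Item:
    """Hypothetical stand-in for an item with one OCR-ed PDF."""
    title: str
    pages: List[str]  # text of each page; index 0 is page 1

    @property
    def full_text(self) -> str:
        # Comparable to the single extracted-text string stored on the item.
        return "\n".join(self.pages)


def search_items(items: List[Item], query: str) -> List[Item]:
    """Step 1: return the items whose full text contains the query."""
    q = query.lower()
    return [item for item in items if q in item.full_text.lower()]


def find_pages(item: Item, query: str) -> List[int]:
    """Step 2: return the 1-based page numbers within one item that match."""
    q = query.lower()
    return [n for n, text in enumerate(item.pages, start=1) if q in text.lower()]


if __name__ == "__main__":
    # Made-up sample data, not taken from any real collection.
    items = [
        Item("Document A", ["alpha beta", "gamma delta", "alpha gamma"]),
        Item("Document B", ["epsilon zeta", "eta theta"]),
    ]
    for hit in search_items(items, "alpha"):
        print(hit.title, find_pages(hit, "alpha"))
```

Because the item-level index only stores one string per item, the page numbers are not known in step 1. That is why the second search inside the viewer is needed.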

It would be ideal if one could click on an item in the search result set and go to the exact page (with highlighted text) where the search string is located, but this is unfortunately not how it works in Omeka S. Searching remains a two-step process.

You’re welcome to try this out yourself at our Omeka S instance.

  1. Open the advanced search page: Myths, legends and fairy tales · Search · Maastricht University Digital Collections
  2. Enter a search string. Example: "the coastwise lights"
  3. Result is 1 item called “A song of the English”. Open this item.
  4. Search for "the coastwise lights" inside the PDF (click the magnifying glass icon). You will find highlighted results on pages 13 and 30.

If you’re a bit more technical, you could also examine our Omeka S in Docker stack to reproduce this yourself.

Best regards,
Maarten Coonen

Thank you so much. I edited show.phtml and it works like a charm. :grinning:

I’ll try that part with ExtractText and report here for feedback.

I checked out your website. It is really beautiful!
