What are the best practices for full text search with Omeka S?

dfox · April 4, 2018, 2:15pm

I have a set of searchable PDFs. Thanks to some help I received here earlier, I can display these using the UniversalViewer and even search within each PDF from the item page.

Now I want to be able to search the full text of all of my PDFs via the site-wide search. I figured the first step was to extract the text from the PDF. I did this and then added this text file as media alongside my PDF. Inferring from some other information I saw about Omeka Classic I realized that this would probably not be searchable. So, I added the full text to a metadata field on the item media (the text file) and not on the item itself. I still was not able to search this text.

When I add the text to the item metadata (and not the media metadata), this is searchable. Which is great, I guess, only the text is incredibly long and fairly riddled with errors. I don’t really want to expose it, I just want to use it for search. I see there is a plugin for Omeka Classic to hide metadata elements, but there doesn’t seem to be anything similar (yet) for Omeka S.
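For anyone who would rather script this step than paste text into the admin form, here is a minimal sketch of the JSON-LD fragment that holds the extracted text as a literal value on an item. The overall shape follows the way the Omeka S REST API represents literal values, but the property term `bibo:content` and the ID `91` are placeholders chosen for illustration, not values from this thread. Property IDs differ per installation, so look them up in your own instance before sending anything.

```python
import json

# Hypothetical values: the property term and its numeric ID differ per
# installation; look them up in your own Omeka S instance.
FULLTEXT_TERM = "bibo:content"
FULLTEXT_PROPERTY_ID = 91


def build_fulltext_value(text, term=FULLTEXT_TERM, property_id=FULLTEXT_PROPERTY_ID):
    """Return a JSON-LD fragment holding `text` as one literal value."""
    return {
        term: [
            {
                "type": "literal",
                "property_id": property_id,
                "@value": text,
            }
        ]
    }


# Build the fragment for one item and show what would be sent.
payload = build_fulltext_value("Extracted text of the PDF goes here.")
print(json.dumps(payload, indent=2))
```

The sketch only builds the payload. Sending it to the API (and merging it with the item's existing metadata so nothing is overwritten) is left out on purpose, since that depends on your API keys and Omeka S version.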

Anyway, I am curious what others are doing? Is what I tried above the right approach or is there some method I missed? If it is the right approach, does anyone have any ideas how I might hide the text from my users?

Thanks!

jflatnes · April 5, 2018, 4:03pm

I don’t think there’s anything you missed. Those would be the two natural approaches I’d suggest: using HTML media to store the text or storing it in a metadata property.

As for hiding the display of some properties, you’re right that there’s not currently a module for that, but in Omeka S 1.1.0 we did just add the functionality such a module would need: filters that allow a module to remove properties from the set that will be displayed.

The module doesn’t exist just yet, but I think it will be coming along soon.

dfox · April 5, 2018, 4:46pm

Thanks for your response. Perhaps there is something I missed with the ‘add as media’ approach. You mentioned ‘HTML media’. I uploaded a plain-text file. Should I bother trying to upload an HTML version of the text?

jflatnes · April 5, 2018, 5:02pm

It wouldn’t make much of a difference in your case.

The HTML media type is basically our “blessed” option for text-type content. In Omeka Classic people describing text often resorted to putting large blocks of text in metadata fields, when that text was really the “data” the metadata was describing. Since Omeka S doesn’t allow HTML in metadata fields, we added the HTML media type to have an option for direct entry of rich text to allow for that kind of content in the system.

However, it still isn’t really plugged into search. “Sitewide” search that works across content types is still something we’re investigating how best to accomplish. It’s also something some outside developers have looked at with modules using things like external search engines such as Solr. The MySQL fulltext search support has been our method in the past with Omeka Classic, but it has many well-known shortcomings that often frustrate users.

dfox · April 5, 2018, 5:03pm

Thanks for the clarification. Last question: I see there are modules for search with Solr for Omeka S. Worth looking at these next?

dfox · April 5, 2018, 5:27pm

Or perhaps I should just roll back to Omeka Classic.

jflatnes · April 6, 2018, 12:24am

Oooooh, you’re putting on the pressure!

Here, try out Hide Properties, the successor to Classic’s Hide Elements plugin, hot off the presses (or whatever makes modules).

You want to download the “HideProperties-1.0.0.zip” file, not the “source code” links. I’ll also get it listed on the official modules list on omeka.org soon.

Daniel_KM · April 6, 2018, 8:58am

About the Solr module: it works perfectly and is not so complicated to install and manage; everything can be done in the Omeka admin interface. I wrote a full readme that explains it (https://github.com/daniel-km/omeka-s-module-solr), along with the associated search module (https://github.com/daniel-km/omeka-s-module-search).

Note: the upstream version works fine, but only on Omeka S beta 2, and my fixes and improvements are not yet integrated upstream (https://github.com/biblibre/omeka-s-module-solr).

Amanda · April 6, 2018, 11:05am

We’re using the Solr full-text search with Omeka Classic and the PDF to Text plug-in. It works fine in extracting the PDF’s text and then searching it. There are pros and cons to Solr – the pros are obviously the ability to really provide a robust full-text search that can be modified according to various factors in the Solr core XML files. I love this feature because it allows full control of relevance weighting and highlighting adjustments and what not.

The cons have nothing to do with Omeka itself and everything to do with Solr on the back end, which I sometimes (ok, often) find convoluted and difficult to understand. (Mind you I am NOT a server admin, have little background in it, and yet we have to run it on our server for this…) I still haven’t figured out a good way to deal with a Solr core that keeps mysteriously de-linking itself every time the server reboots (instructions via the Solr user list led nowhere), and there is no easy way to provide user authentication for the Solr admin console via web browser. The Solr built-in method blocks permissions for Omeka to access Solr, so in the end the only current solution is a reverse proxy. I haven’t figured out how to do the latter successfully yet.

dfox · April 6, 2018, 11:05am

@jflatnes Sorry, that was meant as a question, not a threat. But I’m grateful for the new module anyway. I’ll give it a shot today.

dfox · April 6, 2018, 11:16am

Thanks for your thoughts on Solr, @Amanda. As I’ve been trying to figure this out I’ve been helped by many of your older questions and comments. So, thanks!

dfox · April 6, 2018, 11:21am

I think I will try this as well, as it looks fairly straightforward. My understanding is that the content of the index will be the same as for the site-wide search, but I will get access to Solr features like hit-highlighting. So, I still need to solve the problem of where to put the text itself. And I guess the solution to that is to stick it in the metadata, which now seems kind of obvious. But a few days ago, it was not.
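To make the hit-highlighting idea concrete, here is a small sketch that builds a Solr select URL with highlighting turned on. The `q`, `hl`, `hl.fl`, and `wt` parameters are standard Solr query parameters. The host, the core name `omeka`, and the field name `fulltext_t` are assumptions made up for the example; the real names depend on how the Solr module maps your properties.

```python
from urllib.parse import urlencode


def build_highlight_query(base_url, core, terms, field):
    """Build a Solr select URL that searches `field` and highlights hits in it."""
    params = {
        "q": f"{field}:({terms})",  # search only the full-text field
        "hl": "true",               # turn highlighting on
        "hl.fl": field,             # highlight matches in the same field
        "wt": "json",               # ask for a JSON response
    }
    return f"{base_url}/{core}/select?{urlencode(params)}"


# Hypothetical host, core, and field names.
url = build_highlight_query(
    "http://localhost:8983/solr", "omeka", "city directory", "fulltext_t"
)
print(url)
```

The function only assembles the URL, so it can be checked without a running Solr server. Fetching the URL and reading the `highlighting` section of the response would be the next step.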

Amanda · April 6, 2018, 12:13pm

Actually, it’s different because the default site-wide search is rather exact. This can matter: our site uses Chinese, and there are no spaces between words. Solr allows the use of a Chinese dictionary that correctly tokenizes the indexed text, so the search then looks for the correct pairings of words (and indeed finds phrases that are sometimes separated by another word in between) without the need for spacing. Omeka doesn’t do that by default.
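A toy illustration of why tokenization matters for text without spaces. This is not the dictionary-based tokenizer described above; it swaps in plain character bigrams, a simpler technique, just to show the difference from splitting on whitespace. The sample strings and function names are made up for the example.

```python
def whitespace_tokens(text):
    """Split on whitespace, which is what a naive word search assumes."""
    return text.split()


def bigram_tokens(text):
    """Split into overlapping two-character tokens, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return chars
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


def matches(query, document, tokenizer):
    """Return True when every query token also occurs among the document tokens."""
    doc_tokens = set(tokenizer(document))
    return all(token in doc_tokens for token in tokenizer(query))


document = "北京市電話號碼簿"  # one unbroken run of characters, no spaces
query = "電話"

print(matches(query, document, whitespace_tokens))
print(matches(query, document, bigram_tokens))
```

With whitespace splitting the whole document is a single token, so a two-character query can never match it as a token. With bigrams the query becomes one of the document's tokens. A dictionary-based tokenizer goes further by cutting at real word boundaries, which reduces false matches across adjacent words.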

dfox · April 6, 2018, 12:31pm

Sure, I’m confident that the Solr search is much more powerful. I just meant that the content of the search index is the same for both. I had thought, incorrectly, that text files added as media might become part of the search index (associated with their source item). I’m assuming the PDF module you are using extracts the text and adds it to a metadata field?

dfox · April 7, 2018, 5:12pm

I tried it. The metadata that is ‘hidden’ – is it still being returned from the database when I load the item page? I have quite a lot of text and I am still seeing some page load errors. I think I need to split my books into smaller components. Anyway, thanks for the fast turnaround.

jflatnes · April 9, 2018, 5:21pm

Yeah, generally it’s still going to be retrieved… how much text are we talking about here?

dfox · April 10, 2018, 11:55pm

One example is an ~800 page PDF, the extracted text of which is ~4 MB. These are city directories, which are often quite long.
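One way to split a long book into smaller components is to cut the extracted text on page boundaries and group the pages into chunks, each of which could then be stored on its own item or media. The sketch below assumes the extraction tool separates pages with a form feed character (`\f`), which is common but not guaranteed; the chunk size of 3 is only for the demonstration, and the function name is hypothetical.

```python
def split_into_chunks(text, pages_per_chunk=50, page_separator="\f"):
    """Group the pages of `text` into chunks of at most `pages_per_chunk` pages."""
    if pages_per_chunk < 1:
        raise ValueError("pages_per_chunk must be at least 1")
    pages = text.split(page_separator)
    return [
        page_separator.join(pages[i:i + pages_per_chunk])
        for i in range(0, len(pages), pages_per_chunk)
    ]


# Stand-in for extracted text: seven short "pages" joined by form feeds.
sample = "\f".join(f"page {n}" for n in range(1, 8))
chunks = split_into_chunks(sample, pages_per_chunk=3)
print(len(chunks))
```

Joining the chunks back together with the same separator reproduces the original text, so nothing is lost by splitting. For a ~4 MB text, the right chunk size is something to find by experiment against your own page load times.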

Amanda · April 17, 2018, 2:01pm

Whoops, just saw this now. Yeah, the plug-in extracts the text. Our largest document is about 230 pages long, so I don’t know about 800 pages, but it indexes it just fine; I can search and find the text using the Solr search.

jwagnerwebster · April 25, 2018, 6:59pm

@Amanda @dfox I am following your conversation and would love to hear more about the way you’ve set everything up. I’m investigating using Omeka S with Solr for a collection of long PDF documents that we need to full-text search, but I’m not even sure where to begin. Like you @dfox I would rather not have the OCR text visible in the metadata, so I’d love to hear how that worked out. And I’m wondering if any of the sandbox options have the ability to test out the Solr plugin?

Amanda · May 2, 2018, 9:21am

We are still using Omeka Classic here, but I think what you’d like to do is pretty simple once you’ve installed the appropriate modules. In Omeka Classic one installs the Solr and PDF Text plug-ins, the latter of which extracts the text from the PDFs and indexes it in the database alongside everything else. For us, this has included searchable PDFs of up to 220 pages or so, although we don’t have more than 5-10 documents of that length; the majority are under 20 pages or so.

In terms of setting things up on the back end, the biggest pain IMO is the Solr server. I am not a server admin or guru of any sort, and I still get frustrated with that. But the Omeka side is relatively easy going.