I have a set of searchable PDFs. Thanks to some help I received here earlier, I can display these using the UniversalViewer and even search within each PDF from the item page.
Now I want to be able to search the full text of all of my PDFs via the site-wide search. I figured the first step was to extract the text from each PDF. I did that and added the resulting text file as media alongside the PDF. From other information I had seen about Omeka Classic, I gathered that this would probably not be searchable, so I also added the full text to a metadata field on the media (the text file) rather than on the item itself. I still was not able to search this text.
When I add the text to the item metadata (and not the media metadata), it is searchable (the sketch at the end of this post shows roughly how I'm doing that). Which is great, I guess, except that the text is incredibly long and fairly riddled with errors. I don't really want to expose it; I just want to use it for search. I see there is a plugin for Omeka Classic to hide metadata elements, but there doesn't seem to be anything similar (yet) for Omeka S.
Anyway, I am curious what others are doing? Is what I tried above the right approach or is there some method I missed? If it is the right approach, does anyone have any ideas how I might hide the text from my users?
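For reference, this is roughly how I'm extracting the text and writing it into the item metadata (just a rough sketch of my own workflow; the base URL, API keys, item ID, and property ID below are placeholders for my install, and the pdftotext call assumes poppler-utils is available):

```python
# Rough sketch: extract text with pdftotext, then append it to a metadata
# property on the item through the Omeka S REST API. The base URL, keys,
# item ID, and property ID are placeholders for my own install.
import subprocess
import requests

BASE = "https://example.org/omeka-s/api"                   # placeholder site
KEYS = {"key_identity": "XXXX", "key_credential": "YYYY"}  # placeholder keys
ITEM_ID = 123                                              # placeholder item
PROPERTY_ID = 4                                            # dcterms:description on my install

# 1. Extract plain text from the searchable PDF (requires poppler-utils).
text = subprocess.run(
    ["pdftotext", "book.pdf", "-"],
    capture_output=True, text=True, check=True
).stdout

# 2. Fetch the current item JSON, append the text as a literal value,
#    and send the whole representation back.
item = requests.get(f"{BASE}/items/{ITEM_ID}", params=KEYS).json()
item.setdefault("dcterms:description", []).append({
    "type": "literal",
    "property_id": PROPERTY_ID,
    "@value": text,
})
resp = requests.put(f"{BASE}/items/{ITEM_ID}", params=KEYS, json=item)
resp.raise_for_status()
```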
I don't think there's anything you missed. Those would be the two natural approaches I'd suggest: using HTML media to store the text or storing it in a metadata property.
As for hiding the display of some properties, you're right that there's not currently a module for that, but we did just add the necessary functionality that module would need in Omeka S 1.1.0: filters that allow a module to remove properties from the set that will be displayed.
The module doesnāt exist just yet, but I think it will be coming along soon.
Thanks for your response. Perhaps there is something I missed with the "add as media" approach. You mentioned "HTML media". I uploaded a plain text file. Should I bother trying to upload an HTML version of the text?
It wouldn't make much of a difference in your case.
The HTML media type is basically our "blessed" option for text-type content. In Omeka Classic, people describing text often resorted to putting large blocks of text in metadata fields, when that text was really the "data" the metadata was describing. Since Omeka S doesn't allow HTML in metadata fields, we added the HTML media type to have an option for direct entry of rich text, to allow for that kind of content in the system.
However, it still isn't really plugged into search. "Sitewide" search that works across content types is still something we're investigating how best to accomplish. It's also something some outside developers have looked at, with modules that use external search engines such as Solr. MySQL fulltext search has been our method in the past with Omeka Classic, but it has many well-known shortcomings that often frustrate users.
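To give a sense of where those shortcomings come from, that kind of search ultimately boils down to a MySQL FULLTEXT lookup along these lines (a rough illustration only; the table and column names are stand-ins, not necessarily the actual schema):

```python
# Rough illustration of a MySQL FULLTEXT query of the kind a sitewide search
# relies on (table/column names are stand-ins, not the real schema).
# Typical pain points: words shorter than innodb_ft_min_token_size are not
# indexed, stopwords are dropped, and natural-language mode down-weights or
# ignores terms that appear in most rows.
import pymysql

conn = pymysql.connect(host="localhost", user="omeka", password="secret",
                       database="omeka", charset="utf8mb4")
with conn.cursor() as cur:
    cur.execute(
        "SELECT id, title FROM search_texts "
        "WHERE MATCH(text) AGAINST (%s IN NATURAL LANGUAGE MODE)",
        ("coal mining",),
    )
    for row in cur.fetchall():
        print(row)
```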
Here, try out Hide Properties, the successor to Classic's Hide Elements plugin, hot off the presses (or whatever makes modules).
You want to download the "HideProperties-1.0.0.zip" file, not the "source code" links. I'll also get it listed on the official modules list on omeka.org soon.
We're using the Solr full-text search with Omeka Classic and the PDF Text plug-in. It works fine in extracting the PDF's text and then searching it. There are pros and cons to Solr. The pros are obviously the ability to really provide a robust full-text search that can be modified according to various factors in the Solr core XML files. I love this feature because it allows full control of relevance weighting, highlighting adjustments, and whatnot.
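To give a flavor of what that looks like from the query side, here is a minimal sketch of the sort of request we send to Solr (the core name, field names, and boost values are placeholders, not our actual configuration):

```python
# Minimal sketch of a Solr query with relevance boosts and highlighting.
# The core name, field names, and boosts are placeholders, not our real schema.
import requests

SOLR = "http://localhost:8983/solr/omeka/select"  # placeholder core name

params = {
    "q": "coal mining",
    "defType": "edismax",
    # Weight title matches more heavily than the extracted PDF text.
    "qf": "title^5 text^1",
    # Ask Solr to return highlighted snippets from the full-text field.
    "hl": "true",
    "hl.fl": "text",
    "hl.snippets": 3,
    "wt": "json",
}

resp = requests.get(SOLR, params=params)
resp.raise_for_status()
data = resp.json()
for doc in data["response"]["docs"]:
    print(doc.get("title"), "->", data["highlighting"].get(doc["id"], {}))
```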
The cons have nothing to do with Omeka itself and everything to do with Solr on the back end, which I sometimes (ok, often) find convoluted and difficult to understand. (Mind you, I am NOT a server admin and have little background in it, and yet we have to run it on our server for this...) I still haven't figured out a good way to deal with a Solr core that keeps mysteriously de-linking itself every time the server reboots (instructions via the Solr user list led nowhere), and there is no easy way to provide user authentication for the Solr admin console via web browser. The Solr built-in method blocks permissions for Omeka to access Solr, so in the end the only current solution is a reverse proxy. I haven't figured out how to do the latter successfully yet.
Thanks for your thoughts on Solr, @Amanda. As I've been trying to figure this out, I've been helped by many of your older questions and comments. So, thanks!
I think I will try this as well, as it looks fairly straightforward. My understanding is that the content of the index will be the same as the site-wide search, but I will get access to Solr features like hit highlighting. So, I still need to solve the problem of where to put the text itself, and I guess the solution to that is to stick it in the metadata, which now seems kind of obvious but a few days ago was not.
Actually, it's different, because the default site-wide search is rather exact. This can matter: our site uses Chinese, and there are no spaces between words. Solr allows the use of a Chinese dictionary that correctly tokenizes the indexed text, so basically the search then looks for the correct pairings of words (and indeed finds phrases that are sometimes separated by another word in between) without the need for spacing. Omeka doesn't do that by default.
Sure, I'm confident that the Solr search is much more powerful. I just meant that the content of the search index is the same for both. I had thought, incorrectly, that text files added as media might become part of the search index (associated with their source item). I'm assuming the PDF module you are using extracts the text and adds it to a metadata field?
I tried it. Is the "hidden" metadata still being returned from the database when I load the item page? I have quite a lot of text and I am still seeing some page load errors. I think I need to split my books into smaller components. Anyway, thanks for the fast turnaround.
Whoops, just saw this now. Yeah, the plug-in extracts the text. Our largest document is about 230 pages long, so I don't know about 800 pages, but it indexes it just fine; I can search and find the text using the Solr search.
@Amanda @dfox I am following your conversation and would love to hear more about the way you've set everything up. I'm investigating using Omeka S with Solr for a collection of long PDF documents that we need to search in full text, but I'm not even sure where to begin. Like you, @dfox, I would rather not have the OCR text visible in the metadata, so I'd love to hear how that worked out. And I'm wondering if any of the sandbox options have the ability to test out the Solr plugin?
We are still using Omeka Classic here, but I think what you'd like to do is pretty simple once you've installed the appropriate modules... In Omeka Classic, one installs the Solr and PDF Text plug-ins, the latter of which extracts the text from the PDFs and indexes it in the database alongside everything else. For us, this has included searchable PDFs of up to 220 pages or so, though we don't have more than 5-10 documents of that length; the majority are under 20 pages or so.
In terms of setting things up on the back end, the biggest pain IMO is the Solr server. I am not a server admin or guru of any sort, and I still get frustrated with that. But the Omeka side is relatively easy going.