Omeka S SOLR indexing for PDFs, HTML, items and item set metadata

digitalissoanalog · January 17, 2022, 4:16pm

Dear Community,

We are excited about the possibility of indexing and searching our Omeka S hosted content using the SOLR module. We want to be able to index:

PDFs
Item and item set metadata
Text content within HTML blocks on pages

We have set up a separate SOLR instance and have configured the Omeka S module to communicate with it. So far, it seems that SOLR is only indexing items and item sets, and not HTML or PDF content. It is not clear to us from the documentation whether we have configured the module or the server incorrectly. Does anyone have practical experience configuring an Omeka S + SOLR environment and might be able to help us work out where we are going wrong?

Thanks

Amanda · February 8, 2022, 9:50am

I don’t have experience with Solr for Omeka S, only Solr for Omeka Classic. We use the PDF to text plug-in to store the text of PDFs, and then Solr indexes and searches that. Is there a similar PDF to text plug-in for Omeka S? Or how exactly is the text being stored?

luca.g · February 9, 2022, 1:52pm

Hello,

A few days have passed, so perhaps you have already solved your issues. The way I’d go about it is to make sure that everything is in place Solr-wise before stepping into the Omeka configuration: query the Solr instance directly and see what happens, or just look at the indexes in the core configuration. That way you know whether all the data has been indexed or not.
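
As a minimal sketch of querying Solr directly, the snippet below asks Solr's standard `/select` handler how many documents match a query. The base URL `http://localhost:8983/solr` and the core name `omeka` used in the usage note are assumptions, not values from this thread; substitute your own.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_select_url(base_url, core, query="*:*", rows=0):
    """Build a URL for Solr's standard /select handler."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url.rstrip('/')}/{core}/select?{params}"


def num_found(response_text):
    """Return the hit count from a Solr JSON select response."""
    return json.loads(response_text)["response"]["numFound"]


def count_documents(base_url, core, query="*:*"):
    """Ask a running Solr instance how many indexed documents match the query."""
    with urlopen(build_select_url(base_url, core, query)) as resp:
        return num_found(resp.read().decode("utf-8"))
```

Calling `count_documents("http://localhost:8983/solr", "omeka")` against a running instance returns the total number of indexed documents. Passing a word that only occurs inside one of your PDFs as `query` is a quick way to check whether that text reached the index at all.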

Daniel_KM · February 9, 2022, 6:43pm

As long as the content is in a property, it can be indexed. Media HTML can be indexed too. In all cases, you need to create the mapping between the source (an Omeka property, a media, etc.) and the destination (the Solr indexes). Site pages are currently not indexed, but that could be a future improvement.
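
To illustrate the idea of a source-to-destination mapping, here is a conceptual sketch. It is not the module's code (the module handles mappings through its own configuration). The Solr field names and their `_t` / `_txt` suffixes are hypothetical and would need matching fields in your Solr schema. The input follows the JSON-LD shape returned by the Omeka S REST API.

```python
# Hypothetical mapping: Omeka S property term -> Solr field name.
FIELD_MAP = {
    "dcterms:title": "dcterms_title_t",
    "dcterms:description": "dcterms_description_txt",
}


def to_solr_doc(resource, field_map=FIELD_MAP):
    """Copy mapped literal values from an Omeka S JSON-LD resource into a flat dict."""
    doc = {"id": f"items/{resource['o:id']}"}
    for term, solr_field in field_map.items():
        # Keep only values that carry literal text under "@value".
        values = [v["@value"] for v in resource.get(term, []) if "@value" in v]
        if values:
            doc[solr_field] = values
    return doc
```

Anything without an entry in the mapping is simply left out of the resulting document, which is why content can exist in Omeka and still be missing from search results.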

brian.c.rogers · February 11, 2022, 11:08am

Thanks Daniel, I have managed to make a sub-property for items that is the HTML content of a media attachment, using your AdvancedSearch and SearchSolr modules.

I think that only leaves me the problem of how to automatically extract the text when a PDF is ingested and put that into a media attachment.

Best,
Brian Rogers
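
For the extraction step mentioned above, one possible approach is to call the `pdftotext` command-line tool (part of Poppler) from a script. This is a sketch under that assumption: it requires `pdftotext` on the PATH, and it does not cover storing the text in Omeka S or triggering the extraction at ingest time.

```python
import subprocess


def pdftotext_command(pdf_path):
    """Build the pdftotext call; the trailing "-" sends the text to stdout."""
    return ["pdftotext", "-enc", "UTF-8", "-layout", str(pdf_path), "-"]


def extract_pdf_text(pdf_path):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(
        pdftotext_command(pdf_path),
        check=True,           # raise CalledProcessError if pdftotext fails
        capture_output=True,  # collect stdout instead of printing it
    )
    return result.stdout.decode("utf-8", errors="replace")
```

The returned string could then be saved into a property or an HTML media so that the mapping described earlier in the thread can pick it up. Scanned PDFs without a text layer will yield little or no text and would need OCR first.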
