Extract Text on previously imported XML files

jsuszczynski · April 17, 2024, 5:14pm

Hello, all -

We used the CSV Import module to load a few thousand items into our Omeka S installation. These items contained metadata as well as media (JPG and TEI-XML files).

After the import, we decided we wanted the built-in Omeka S search functionality to include the full text of those TEI-XML files. I installed the Extract Text module in the hope that the filegetcontents extractor would work well enough to pull the text into the search, and attempted to Bulk Edit the items using the ‘Extract Text ==> Refresh Text’ functionality in the Bulk Editing screen.

However, after doing so I’m not seeing any items that match the query: “extracttext:extracted_text has any value” … And none of the full-text of these TEI-XML files is reflected in the site search.

I wonder if I’m missing a step in the documentation, or whether I should indeed expect the ‘extracted_text’ property to be filled with the content of those XML files.

Happy to provide more details or clarifications if needed - thanks!

jimsafley · April 17, 2024, 6:30pm

The Extract Text module does not support XML files. I’ve created an issue to address this.

jsuszczynski · April 18, 2024, 7:01pm

Thanks for creating that issue. Hopefully that functionality is added in the future.

The weird thing is that Extract Text seemed to work on my initial test of a single record, and the extracted text from a TEI file was properly indexed. The issue came when I tried to Bulk Edit many records at once - Extract Text didn’t do anything. Very strange…

Are there any workarounds that you’ve seen? The contingency plan I have is to add a new ‘teitext’ field in our Letter resource, then programmatically loop through each TEI file, strip out the XML tags, and throw the blob of text into that field so that it’s indexed. Seems a bit sloppy but I’m not sure what other options are available for adding the contents of TEI-XML files to the Omeka S site search. Thanks!

jimsafley · April 18, 2024, 7:31pm

The weird thing is that Extract Text seemed to work on my initial test of a single record, and the extracted text from a TEI file was properly indexed. The issue came when I tried to Bulk Edit many records at once - Extract Text didn’t do anything. Very strange…

That is strange. If you attach the TEI file here, I can take a look.

Are there any workarounds that you’ve seen? The contingency plan I have is to add a new ‘teitext’ field in our Letter resource, then programmatically loop through each TEI file, strip out the XML tags, and throw the blob of text into that field so that it’s indexed. Seems a bit sloppy but I’m not sure what other options are available for adding the contents of TEI-XML files to the Omeka S site search. Thanks!

That strategy will work, for sure, but it will take some time if you have many files.
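
For anyone trying the tag-stripping workaround described above, here is a minimal Python sketch of the "loop through each TEI file and strip out the XML tags" step. It is only an illustration, not part of Extract Text or Omeka S: the function names and the directory layout are hypothetical, it assumes the files use the standard TEI namespace (http://www.tei-c.org/ns/1.0), and it assumes only the content of the TEI text element should be indexed, not the teiHeader metadata. Getting the resulting text into the ‘teitext’ field (for example through another CSV Import run) is left out.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Standard TEI P5 namespace; files without a namespace are handled as a fallback.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def tei_to_plain_text(xml_source):
    """Return the text content of a TEI document with all tags removed.

    Accepts the XML as str or bytes. Only the <text> element is used when
    it exists, so <teiHeader> metadata stays out of the result.
    """
    root = ET.fromstring(xml_source)
    body = root.find(f"{TEI_NS}text")
    if body is None:
        body = root.find("text")  # TEI file without a namespace
    if body is None:
        body = root  # not TEI-shaped: fall back to the whole document
    # itertext() yields every text node in document order; collapse runs
    # of whitespace (including newlines) into single spaces.
    return " ".join("".join(body.itertext()).split())


def extract_directory(directory):
    """Yield (filename, plain_text) for every .xml file in a directory."""
    for path in sorted(Path(directory).glob("*.xml")):
        # Read bytes so the parser honours each file's own encoding declaration.
        yield path.name, tei_to_plain_text(path.read_bytes())
```

Text nodes are joined without a separator so that inline markup inside a word does not split it. The trade-off is that two block elements with no whitespace between them would run together, so it is worth spot-checking a few files before indexing the output.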