pdfText plugin and Amazon S3

Hi all;

We are currently piloting a test installation of Omeka here at GVSU, and so far have run into only one major problem: to get indexable full-text representations of PDF files, we’ve installed the pdfText plugin. However, we are also using Omeka with Amazon S3, which the plugin is not compatible with. Full-text indexing of our PDF transcripts is pretty important, and we don’t see any other way to get Omeka to do it, so we were wondering if anyone else has had this problem and, if so, how they solved it. We are looking at the plugin to see if it can be altered to support S3, and are also considering a workflow outside of Omeka to generate the necessary files (although we only want those files to be indexable, not visible to the end user). Are there other options? Has anyone tried to modify the plugin to do this already? Thanks for any help!

PDF Text is only incompatible with remote storage like S3 when processing already-uploaded files. When a new PDF is uploaded, the plugin extracts the text from it before the file gets stored to S3. It should work fine if you’re starting from scratch, as long as you have the plugin installed and activated before you upload the PDFs.

If you really do need to support already-uploaded PDFs, that would require a modification to the plugin. It would have to temporarily redownload each file from S3 so it could process it locally.
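
To make the "redownload and process locally" idea concrete, here is a rough sketch of that loop, written in Python rather than as an actual change to the plugin. Everything in it is an assumption for illustration: the function names are made up, the download step is passed in as a callable so no particular S3 library is assumed, and the text extraction assumes the pdftotext command-line tool is installed on the server.

```python
import subprocess
import tempfile
from pathlib import Path


def pdftotext_extract(pdf_path):
    """Assumes the pdftotext CLI is on PATH; "-" sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def extract_remote_pdf_text(keys, download, extract=pdftotext_extract):
    """Return {key: extracted text} for PDFs that already live in remote storage.

    `download(key, local_path)` is a hypothetical callable that fetches one
    stored file to a local path (for example a thin wrapper around an S3
    client). Each file sits on local disk only while it is being processed.
    """
    texts = {}
    for key in keys:
        # The temporary directory, and the downloaded copy inside it,
        # is deleted as soon as this block exits.
        with tempfile.TemporaryDirectory() as tmp_dir:
            local_path = str(Path(tmp_dir) / "file.pdf")
            download(key, local_path)
            texts[key] = extract(local_path)
    return texts
```

With boto3, for instance, `download` could wrap the client's `download_file(bucket, key, local_path)` call. The extracted text would still need to be saved back onto the matching Omeka file record so that it gets indexed, and that step is not shown here.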

Thank you! That solves our problem, I think.