I am brand new to Omeka, and started by installing the PDF Text plugin. I uploaded a single 16-page PDF and set the plugin to index the full text. After about half an hour, it does not appear to have done anything. When I look at the PDF Text : Text field, it’s empty. If I edit the entry by hand, it will search as desired, but the plugin doesn’t appear to be updating the field.
I FTP’d the PDF up to the server and ran pdftotext on the command line, and it works OK (though it does throw some errors about fonts unless I run it with the -q quiet flag). Is there any way to diagnose where it’s breaking down, or track its progress, or anything like that?
The plugin successfully extracts text from that PDF on my installation, and I get no errors when extracting text on the command line. What version of pdftotext do you have? Run pdftotext -v to get the version.
It’s pdftotext version 0.12.4. And it’s working on the command line with this file.
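For anyone repeating the command-line check described above, here is a minimal sketch. The helper name check_pdf and the file name sample.pdf are made up for illustration and are not part of Omeka or the plugin. It only confirms that pdftotext is reachable and produces output for the current shell user, which may not be the same user the web server runs as.

```shell
# check_pdf is a hypothetical helper, not part of Omeka or the PDF Text plugin.
check_pdf() {
  pdf="$1"
  # Confirm the binary is on PATH for the user running this shell.
  if ! command -v pdftotext >/dev/null 2>&1; then
    echo "pdftotext is not on PATH for this user"
    return 1
  fi
  # Confirm the file exists and is readable by this user.
  if [ ! -r "$pdf" ]; then
    echo "cannot read $pdf"
    return 1
  fi
  # -q suppresses the font warnings; "-" sends the extracted text to stdout.
  chars="$(pdftotext -q "$pdf" - | wc -c)"
  echo "extracted $chars characters from $pdf"
}
```

Usage would be something like check_pdf sample.pdf, with your own file path substituted. A count of 0 suggests the PDF has no extractable text layer, and a "cannot read" message points toward a permissions problem.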
Given that it’s the first document I’m uploading to a new Omeka install and the first plugin I’m adding to that install, I’m guessing that something about my setup isn’t working as planned, and I wonder where I can start on tracking down the problem. Maybe a bad permissions setting?
Is it possible to tell if the plugin is initiating any processes or if it’s not even getting that far? I don’t know what tools might exist on the server for diagnosing the problem.
We’ve got the logging working, but I’m not seeing any log results when I try to run the PDF Text plugin. So I want to see if my expectations are correct about how this should work. I made a video to show my process:
All I’m doing is:
Going to the plugin page
Clicking “Configure” for the PDF Text plugin
Checking the checkbox on the plugin page and clicking “Save Changes”
The screen reports that the changes have been saved.
Is that all I’m supposed to do? Should there be any other sign that the plugin is operating? The video also shows my path to check on the metadata – I’m doing:
Go to the one item in the collection
Click Edit
Go to the Files tab
Click Edit next to the PDF
Click on the PDF Text tab
Look for text in what is sadly a blank field
Is this all correct? Is there something that I’m not seeing that I should be?
I was entirely misunderstanding how the plugin worked. I had read something last week that made me think that it worked on files that had already been uploaded, rather than files that were uploaded in the future. I’ve just added two files and it extracted the text just fine.
That’s great, but the plugin should indeed extract text from PDFs that have already been uploaded, following the actions you took in the video. I suspect that Omeka is not auto-detecting the correct path to PHP-CLI for running background processes. You can open application/config/config.ini and set background.php.path to the correct path. You can find the path by running which php on the command line.
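A rough sketch of that lookup, for reference. It uses command -v, a portable stand-in for which, and prints a candidate line for config.ini. Whether the binary it finds is really the CLI build is an assumption to verify: the first line of php -v normally includes "(cli)" when it is.

```shell
# Look up the php binary on PATH; the variable stays empty if none is found.
php_path="$(command -v php || true)"

if [ -n "$php_path" ]; then
  # The CLI build normally identifies itself with "(cli)" on this line.
  "$php_path" -v | head -n 1
  # Candidate line for application/config/config.ini (check the path before using it):
  echo "background.php.path = \"$php_path\""
else
  echo "php was not found on PATH; ask your host where PHP-CLI is installed"
fi
```

On shared hosting the php found on PATH is sometimes a CGI build, in which case the host's documentation should list a separate path for the command-line binary.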
Thanks – I’m not sure if it’s working because I had already deleted the file that didn’t get indexed, but since there are only two items in the repository now, maybe it won’t matter.