PDF Text plugin does not appear to extract fulltext

I am brand new to Omeka, and started by installing the PDF Text plugin. I uploaded a single 16-page PDF and set the plugin to index the full text. After about half an hour, it does not appear to have done anything. When I look at the PDF Text:Text field, it’s empty. If I fill in the field by hand, search works as desired, but the plugin doesn’t appear to be updating the field.

I FTP’d the PDF up to the server and ran pdftotext on the command line, and it works OK (though it does throw some errors about fonts unless I run it in quiet mode with -q). Is there any way to diagnose where it’s breaking down, or track its progress, or anything like that?
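
For reference, a minimal sketch of that command-line check. The function name pdf_text_chars and the file name sample.pdf are placeholders, not anything from Omeka or the plugin; the only real tool used is pdftotext, with its -q option and "-" as the output file so the text goes to stdout.

```shell
#!/bin/sh
# Sketch: count the characters pdftotext can extract from a PDF.
# pdf_text_chars is a hypothetical helper name, not part of Omeka or PDF Text.
pdf_text_chars() {
    # Fail early if the tool or the file is missing.
    command -v pdftotext >/dev/null 2>&1 || { echo "pdftotext not installed" >&2; return 1; }
    [ -f "$1" ] || { echo "no such file: $1" >&2; return 1; }
    # -q suppresses messages such as font warnings; "-" writes the text to stdout.
    pdftotext -q "$1" - | wc -c
}

# Example call; sample.pdf is a placeholder path.
if [ -f sample.pdf ]; then
    pdf_text_chars sample.pdf
fi
```

A count well above zero suggests the PDF has an extractable text layer; a count near zero would more likely point to an image-only PDF than to a problem with the plugin.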

Thanks
Ken

PS - I’m using Omeka 2.6.1 and PDF Text 1.3

The best way for me to diagnose where it’s breaking down is to test the file on my installation. Would you send me the problem PDF (upload or link)?

Hi Jim – thanks for the reply.
Here’s a link to a downloadable copy of the file:

Thanks
Ken

The plugin successfully extracts text from that PDF on my installation, and I get no errors when extracting text on the command line. What version of pdftotext do you have? Run pdftotext -v to get the version.
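
A small sketch of the version check, for anyone who wants just the number. The helper name parse_pdftotext_version is made up for illustration, and the banner format it expects ("pdftotext version X.Y.Z") is an assumption based on the output quoted in this thread.

```shell
#!/bin/sh
# Sketch: pull the version number out of the pdftotext banner.
# parse_pdftotext_version is a hypothetical helper name.
parse_pdftotext_version() {
    # Reads banner text on stdin and prints only the version number.
    sed -n 's/^pdftotext version \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

if command -v pdftotext >/dev/null 2>&1; then
    # The banner may be written to stderr, so merge it into stdout first.
    pdftotext -v 2>&1 | parse_pdftotext_version
else
    echo "pdftotext not installed" >&2
fi
```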

It’s pdftotext version 0.12.4. And it’s working on the command line with this file.

Given that it’s the first document I’m uploading to a new Omeka install and the first plugin I’m adding to that install, I’m guessing that something about my setup isn’t working as planned, and I wonder where I can start on tracking down the problem. Maybe a bad permissions setting?

Is it possible to tell if the plugin is initiating any processes or if it’s not even getting that far? I don’t know what tools might exist on the server for diagnosing the problem.

Any ideas?

Thanks,
Ken

You can activate error logging following these directions: https://omeka.org/classic/docs/Troubleshooting/Retrieving_Error_Messages/#activate-error-logging
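
As a rough sketch of what to verify once logging has been activated: the check below assumes the standard Omeka Classic layout, with the log.errors setting in application/config/config.ini and the log file at application/logs/errors.log. The function name check_logging and the OMEKA_DIR variable are hypothetical, and the writability test only reflects the user running the script, which may differ from the web server's user.

```shell
#!/bin/sh
# Sketch: report whether error logging looks enabled for an Omeka install.
# check_logging and OMEKA_DIR are hypothetical names used for illustration.
check_logging() {
    ini="$1/application/config/config.ini"
    log="$1/application/logs/errors.log"

    # Look for an uncommented "log.errors = true" line in config.ini.
    if grep -Eq '^[[:space:]]*log\.errors[[:space:]]*=[[:space:]]*true' "$ini" 2>/dev/null; then
        echo "log.errors: enabled"
    else
        echo "log.errors: not enabled"
    fi

    # -w tests writability for the current user only, not the web server user.
    if [ -w "$log" ]; then
        echo "errors.log: writable"
    else
        echo "errors.log: missing or not writable by this user"
    fi
}

# Run against the Omeka root directory (defaults to the current directory).
check_logging "${OMEKA_DIR:-.}"
```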

Ever more complicated: I can’t get the logging to work at all. Any idea how I can troubleshoot that?

The Omeka install was done through Softaculous on a web host using LiteSpeed. Too many layers of trouble…

Perhaps the solution in this thread will work: “Cannot upload any files and errors.log doesn't work”

Otherwise, contact Softaculous or LiteSpeed about the issue.

Thanks – I’ll take a look at this, and already contacted Softaculous. We’ll see what comes of it…

Thank you!
Ken

We’ve got the logging working, but I’m not seeing any log results when I try to run the PDF Text plugin. So I want to see if my expectations are correct about how this should work. I made a video to show my process:

All I’m doing is

  1. Going to the plugin page
  2. Clicking “Configure” for the PDF Text plugin
  3. Checking the checkbox on the plugin page and clicking “Save Changes”
  4. The screen reports that the changes have been saved.

Is that all I’m supposed to do? Should there be any other sign that the plugin is operating? The video also shows my path to check on the metadata – I’m doing:

  1. Go to the one item in the collection
  2. Click Edit
  3. Go to the Files tab
  4. Click Edit next to the PDF
  5. Click on the PDF Text tab
  6. Look for text in what is sadly a blank field

Is this all correct? Is there something that I’m not seeing that I should be?

Thanks

What happens when you add a new PDF file to an item? Do you see the extracted text in the file’s “PDF Text:text” element?

Yes! That worked!

I was entirely misunderstanding how the plugin worked. I had read something last week that made me think that it worked on files that had already been uploaded, rather than files that were uploaded in the future. I’ve just added two files and it extracted the text just fine.

Thank you so much!
Ken

That’s great, but the plugin should indeed extract text from PDFs that have already been uploaded, following the actions you took in the video. I suspect that Omeka is not auto-detecting the correct path to PHP-CLI for running background processes. You can open application/config/config.ini and add the correct path to background.php.path. You can find the path by running $ which php on the command line.
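
A minimal sketch of that step, assuming php is on the PATH of the shell you are logged into. The helper name php_path_setting is hypothetical; it only prints the line to put in application/config/config.ini and does not edit the file. The path it finds is whatever the shell resolves, which on some shared hosts may not be the CLI build, so the first line of the version output is printed as a sanity check (a CLI build normally shows "(cli)" there).

```shell
#!/bin/sh
# Sketch: find PHP and print the matching config.ini line.
# php_path_setting is a hypothetical helper name; nothing here edits config.ini.
php_path_setting() {
    printf 'background.php.path = "%s"\n' "$1"
}

# command -v is the portable equivalent of `which`.
php_bin=$(command -v php || true)

if [ -n "$php_bin" ]; then
    # The first line of the version output should mention "(cli)" for a CLI build.
    "$php_bin" -v | head -n 1
    php_path_setting "$php_bin"
else
    echo "php not found on PATH" >&2
fi
```

Copy the printed line into config.ini in place of the existing background.php.path entry, then save the PDF Text configuration again to retrigger processing of existing files.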

Thanks – I’m not sure if it’s working because I had already deleted the file that didn’t get indexed, but since there are only two items in the repository now, maybe it won’t matter.

Thanks!
