PDF Text plugin does not appear to extract fulltext

kenirwin · October 16, 2018, 8:11pm

I am brand new to Omeka, and started by installing the PDF Text plugin. I uploaded a single 16-page PDF and set the plugin to index the full text. After about half an hour, it does not appear to have done anything. When I look at the PDF Text : Text field, it’s empty. If I edit the entry by hand, it will search as desired, but the plugin doesn’t appear to be updating the field.

I FTP’d the PDF up to the server and ran pdftotext on the command line, and it works OK (though it does throw some errors about fonts unless I run it in quiet mode with -q). Is there any way to diagnose where it’s breaking down, or track its progress, or anything like that?
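For anyone who wants to repeat this command-line check from a script, here is a minimal Python sketch. It assumes pdftotext is on the PATH; the helper names `pdftotext_command` and `extract_text` are made up for illustration and are not part of Omeka or the PDF Text plugin.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, quiet=True):
    """Build the pdftotext argument list; '-' sends the text to stdout."""
    cmd = ["pdftotext"]
    if quiet:
        cmd.append("-q")  # quiet mode: suppress messages such as font warnings
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path):
    """Run pdftotext and return (exit_code, extracted_text)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext is not on PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.returncode, result.stdout


# Example (hypothetical file name):
# code, text = extract_text("sample.pdf")
# print(code, len(text))
```

An exit code of 0 with non-empty text suggests pdftotext itself is fine, which points the investigation at how the plugin is invoked rather than at the PDF.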

Thanks
Ken

PS - I’m using Omeka 2.6.1 and PDF Text 1.3

jimsafley · October 18, 2018, 2:30pm

The best way for me to diagnose where it’s breaking down is to test the file on my installation. Would you send me the problem PDF (upload or link)?

kenirwin · October 18, 2018, 3:07pm

Hi Jim – thanks for the reply.
Here’s a link to a downloadable copy of the file:

Thanks
Ken

jimsafley · October 18, 2018, 3:36pm

The plugin successfully extracts text from that PDF on my installation, and I get no errors when extracting text on the command line. What version of pdftotext do you have? Run pdftotext -v to get the version.

kenirwin · October 18, 2018, 3:45pm

It’s pdftotext version 0.12.4. And it’s working on the command line with this file.

Given that it’s the first document I’m uploading to a new Omeka install and the first plugin I’m adding to that install, I’m guessing that something about my setup isn’t working as planned, and I wonder where I can start on tracking down the problem. Maybe a bad permissions setting?

Is it possible to tell if the plugin is initiating any processes or if it’s not even getting that far? I don’t know what tools might exist on the server for diagnosing the problem.

Any ideas?

Thanks,
Ken

jimsafley · October 18, 2018, 3:50pm

You can activate error logging following these directions: https://omeka.org/classic/docs/Troubleshooting/Retrieving_Error_Messages/#activate-error-logging

kenirwin · October 19, 2018, 2:07pm

Ever more complicated: I can’t get the logging to work at all. Any idea how I can troubleshoot that?

The Omeka install was done through Softaculous on a web host using Lightspeed. Too many layers of trouble…

jimsafley · October 19, 2018, 3:09pm

Perhaps the solution in this thread will work: “Cannot upload any files and errors.log doesn't work”

Otherwise, contact Softaculous or Lightspeed about the issue.

kenirwin · October 19, 2018, 9:14pm

Thanks – I’ll take a look at this, and already contacted Softaculous. We’ll see what comes of it…

Thank you!
Ken

kenirwin · October 22, 2018, 1:10pm

We’ve got the logging working, but I’m not seeing any log results when I try to run the PDF Text plugin. So I want to see if my expectations are correct about how this should work. I made a video to show my process:

All I’m doing is:

1. Going to the plugin page
2. Clicking “Configure” for the PDF Text plugin
3. Checking the checkbox on the plugin page and clicking “Save Changes”
4. The screen reports that the changes have been saved.

Is that all I’m supposed to do? Should there be any other sign that the plugin is operating? The video also shows my path to check on the metadata – I’m doing:

1. Go to the one item in the collection
2. Click Edit
3. Go to the Files tab
4. Click Edit next to the PDF
5. Click on the PDF Text tab
6. Look for text in what is sadly a blank field

Is this all correct? Is there something that I’m not seeing that I should be?

Thanks

jimsafley · October 22, 2018, 1:35pm

What happens when you add a new PDF file to an item? Do you see the extracted text in the file’s “PDF Text:text” element?

kenirwin · October 22, 2018, 2:04pm

Yes! That worked!

I was entirely misunderstanding how the plugin worked. I had read something last week that made me think that it worked on files that had already been uploaded, rather than files that were uploaded in the future. I’ve just added two files and it extracted the text just fine.

Thank you so much!
Ken

jimsafley · October 22, 2018, 2:26pm

That’s great, but the plugin should indeed extract text from PDFs that have already been uploaded, following the actions you took in the video. I suspect that Omeka is not auto-detecting the correct path to PHP-CLI for running background processes. You can open application/config/config.ini and add the correct path to background.php.path. You can find the path by running $ which php on the command line.
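As a rough illustration of that last step, the Python sketch below looks up php on the PATH (the same lookup `which php` performs) and prints the line to put in application/config/config.ini. The helper name `background_php_line` is hypothetical, and it is an assumption that the php found on the PATH is the PHP-CLI binary your host expects Omeka to use; on shared hosting it may differ.

```python
import shutil


def background_php_line(php_path):
    """Format the config.ini setting for a given PHP-CLI path."""
    return f'background.php.path = "{php_path}"'


# Equivalent to running `which php` in a shell.
php_path = shutil.which("php")

if php_path is None:
    print("php was not found on PATH; ask your host for the PHP-CLI path")
else:
    print(background_php_line(php_path))
```

After editing config.ini, saving the plugin configuration again should start the background job for files that were already uploaded.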

kenirwin · October 22, 2018, 2:35pm

Thanks – I’m not sure if it’s working because I had already deleted the file that didn’t get indexed, but since there are only two items in the repository now, maybe it won’t matter.

Thanks!
