PDF text not being indexed for search

Amanda · March 3, 2017, 1:45pm

Hi - so we’re using the Solr search plug-in. The PDF to text plug-in works in terms of creating and storing the text, but the system does not seem to be indexing the text even after I choose “Clear and Reindex.” Has anyone else had a similar problem? If not, I guess I need to take this to the GitHub page for the Solr plug-in…

EDIT: I found this old forum post, which appears to suggest this should be possible, but says:

“You just have to make sure to index the PDFs with it first and then launch the Solr indexation once again.”

What’s the difference between that and:

“Running that, then reindexing Solr, might do the trick.”

Amanda · March 5, 2017, 7:55pm

Hi again - I just wanted to update my previous post: for some reason the text also appears not to be searchable in the regular (non-Solr-powered) search. Is there something I could check to see why this would be happening? I can’t search the text from the public or admin side, even using the advanced search criteria. Is the text for the PDF stored in the same place as Item Type Metadata text, or elsewhere?

Thanks!

Amanda · March 6, 2017, 1:44pm

OK, so never mind on the solr search - there’s an option within the plug-in to search that text, so I got that bit working.

However, if I use the non-solr advanced search page, I still can’t get the search to find the records. Is there something I need to add to the keyword search or search by field on this page?

jimsafley · March 6, 2017, 3:08pm

Hi Amanda,

The plugin saves PDF text as File metadata, under “PDF Text:Text”. First check to see if the File (not the Item) has the PDF text. If it does, the plugin should be working correctly and a full-text search should return expected results.

What often trips people up is that the advanced search does not perform a full-text search; it performs a simple “LIKE ‘%keyword%’” search instead. If you want to see results, you’ll need to use the general-purpose, full-text search that’s found on every page of the website.
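To make the LIKE versus full-text distinction concrete, here is a minimal sketch in Python using an in-memory SQLite database. It is not Omeka's code: the `docs` table and the `like_search` and `word_search` functions are hypothetical, and `word_search` is only a crude whole-word stand-in for a real full-text index (it does no relevancy ranking).

```python
import re
import sqlite3

# Hypothetical toy table; not Omeka's real schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, text TEXT)")
conn.executemany(
    "INSERT INTO docs (text) VALUES (?)",
    [
        ("The cat sat on the mat",),
        ("How to concatenate strings",),
        ("Dogs only",),
    ],
)


def like_search(keyword):
    """Substring match, the LIKE '%keyword%' style of search."""
    rows = conn.execute(
        "SELECT id FROM docs WHERE text LIKE ? ORDER BY id",
        (f"%{keyword}%",),
    )
    return [row[0] for row in rows]


def word_search(keyword):
    """Whole-word match, a rough stand-in for a full-text index."""
    hits = []
    for doc_id, text in conn.execute("SELECT id, text FROM docs ORDER BY id"):
        words = set(re.findall(r"\w+", text.lower()))
        if keyword.lower() in words:
            hits.append(doc_id)
    return hits


print(like_search("cat"))  # substring search also hits "concatenate"
print(word_search("cat"))  # whole-word search only hits "The cat sat..."
```

The point of the sketch is only that the two kinds of search can return different rows for the same keyword, so results from the advanced search and the site-wide search are not directly comparable.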

Amanda · March 6, 2017, 3:15pm

Hi Jim,

Thanks for the response. Yep, I found the text within the File metadata, and even tracked it to the database itself on the back end. No worries there - the text is stored correctly, and the solr search has no problem finding it using the full-text keyword search. It’s just the advanced search and the admin search, as you say.

If I wanted to change the admin or advanced search to perform the full-text search (specifically, the keywords text field): could you remind me where I’d change that?

Thanks for your help!

jimsafley · March 6, 2017, 3:57pm

As far as I know there’s no configuration to change the advanced search from LIKE to full-text. It’s designed specifically to narrow the search by a list of parameters, which is incompatible with full-text. I know this distinction is lost for most users but I’m not sure how to reconcile it.

Amanda · March 8, 2017, 2:51pm

Sorry to bug you again on this, but I’m a little confused and I’m not sure if it’s because we’re using the solr search plug-in (which ONLY works with the site-wide keyword search on every page in the header) or what. As it stands now, the site-wide search using solr DOES pick up the PDF text. So the issue seems to be the Advanced Search page. I’m not quite sure what or how I can change the Advanced Search page to default to using the solr search. Perhaps I should give an example to help explain.

We have two places people can search: the site-wide keyword search in the top right corner (solr search), and a link to the Advanced Search page, which also has a keyword search (that currently does not work on solr search). In both cases, it seems that the Item Type Metadata “Text” is properly indexed and searched (using LIKE is perhaps irrelevant here; I just mean a basic search to find anything in the text fields). Example: if I search for “some random phrase” in either place - top right or through Advanced Search - it successfully locates ALL documents where Item Type Metadata - Text contains “some random phrase.” Furthermore, if I search for “some random phrase” using the site-wide keyword (solr) search, then “some random phrase” can easily also be found in the PDF Text.

However, when I go to the Advanced Search page and search for “some random phrase” in the keyword text field, only those items with Item Type Metadata “Text” are found, i.e., it’s as if PDF text is never even searched. I do not think this is supposed to happen; if it can find “some random phrase” in ITM “Text” then why can it not find the same phrase in PDF Text?

jimsafley · March 8, 2017, 3:55pm

I’m not familiar with the SolrSearch plugin, so maybe someone else can clarify its role here.

Still, you’re likely seeing different results because the site-wide search and advanced search query different tables. The site-wide search performs a full-text (i.e. relevancy) search on the search_texts table, while the advanced search page performs a search on the element_texts table.

When the PDF Text plugin detects a PDF file, it extracts the text and saves it as File metadata, not Item metadata. This is an important distinction because the advanced search only queries Item metadata, so the PDF text is invisible when using that page. The site-wide search, on the other hand, does query File metadata (because of an automatic process that appends it to the parent Item’s metadata).

In short, you aren’t seeing PDF text in your advanced search results because the advanced search doesn’t search File metadata. You are seeing PDF text in your site-wide search because the full-text table contains an aggregation of File and Item metadata.
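A toy model of the explanation above, written in Python with an in-memory SQLite database, may help. The table names `element_texts` and `search_texts` come from this thread, but the columns are simplified assumptions, the function names are hypothetical, and LIKE is used in both queries purely as a stand-in (the real site-wide search is a full-text query).

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Simplified, assumed columns; the real tables have more fields.
db.executescript(
    """
    CREATE TABLE element_texts (record_type TEXT, record_id INTEGER, text TEXT);
    CREATE TABLE search_texts  (record_type TEXT, record_id INTEGER, text TEXT);
    """
)

# Item 1 has its own "Text" metadata; its File 10 holds the extracted PDF text.
db.executemany(
    "INSERT INTO element_texts VALUES (?, ?, ?)",
    [
        ("Item", 1, "typed item description"),
        ("File", 10, "some random phrase from the pdf"),
    ],
)

# The search table row for Item 1 aggregates Item and File metadata.
db.execute(
    "INSERT INTO search_texts VALUES (?, ?, ?)",
    ("Item", 1, "typed item description some random phrase from the pdf"),
)


def advanced_search(keyword):
    """Looks only at Item rows in element_texts, so File text is invisible."""
    rows = db.execute(
        "SELECT DISTINCT record_id FROM element_texts "
        "WHERE record_type = 'Item' AND text LIKE ?",
        (f"%{keyword}%",),
    )
    return [row[0] for row in rows]


def site_wide_search(keyword):
    """Looks at the aggregated search_texts rows (LIKE is a stand-in here)."""
    rows = db.execute(
        "SELECT record_id FROM search_texts "
        "WHERE record_type = 'Item' AND text LIKE ?",
        (f"%{keyword}%",),
    )
    return [row[0] for row in rows]


print(advanced_search("random phrase"))   # PDF text lives on the File row
print(site_wide_search("random phrase"))  # aggregated row includes PDF text
print(advanced_search("typed item"))      # Item metadata is found either way
```

In this sketch the phrase stored on the File row is found only through the aggregated table, which mirrors the behavior described in this thread.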

I’ve created an issue on GitHub that may solve this problem, but I’m not sure if and when we can get to it.

Amanda · March 8, 2017, 5:25pm

Thank you for the clarification - I did indeed realize that there was a File metadata, but did not realize (for some reason) that different tables are being searched for the site-wide versus advanced search pages. That makes a lot more sense now. Thanks for creating the issue - I appreciate it - and I won’t worry about it right now as we’re still developing our database privately.