Accessing PDF Text content in search results (keyword in context/KWIC)

kenirwin · December 10, 2018, 5:32pm

I’m trying to modify the theme/search/index.php page to show keyword in context (KWIC) in the results. Is there a method for accessing the relevant results?

I did a var_dump on $searchText and I see a pretty complex object (object(Omeka_Db)) in ["_db":protected]
and buried deep inside the Omeka_Db object is the fulltext that I’d want to extract a section of.

Does Omeka have a method for doing that? I fear that the “protected” aspect may mean that it’s not accessible. I couldn’t figure out raw PHP syntax that would support accessing that (e.g. $searchText->['"_db":protected'] )

Is there a way to do this?

I saw that the LiveBook plugin does have some sort of KWIC support, but I’m not sure that I want to go that route – although if anyone is using both PDFText and LiveBook, I’d be interested in hearing how that works.

Any ideas?

Thanks!
Ken

jflatnes · December 10, 2018, 6:39pm

You should be able to get the text out of the $searchText object with just $searchText->text.

kenirwin · December 10, 2018, 7:26pm

I had hoped that would be the case. I forgot to mention that I’d already tried it: $searchText->text returns NULL, and I’m not sure why.

Looking on the database side, I see this table: omkjy_element_texts
It contains a text column that contains the information I want.

Here’s what my var_dump looks like:

>  object(SearchText)#68 (15) {
>   ["record_type"]=>
>   string(4) "File"
>   ["record_id"]=>
>   int(11)
>   ["public"]=>
>   NULL
>   ["title"]=>
>   string(28) "Torch 88.10 - 2001-12-04.pdf"
>   ["text"]=>
>   NULL
>   ["id"]=>
>   NULL
>   ["_errors":"Omeka_Record_AbstractRecord":private]=>
>   object(Omeka_Validate_Errors)#175 (2) {
>     ["_errors":protected]=>
>     array(0) {
>     }
>     ["storage":"ArrayObject":private]=>
>     array(0) {
>     }
>   }
>   ["_cache":protected]=>
>   array(0) {
>   }
>   ["_mixins":protected]=>
>   array(0) {
>   }
>   ["_db":protected]=>
>   object(Omeka_Db)#52 (4) {
>     ["prefix"]=>
>     string(6) "omkjy_"
>     ["_adapter":protected]=>
>     object(Zend_Db_Adapter_Mysqli)#50 (12) {
>       ["_numericDataTypes":protected]=>
>       array(16) {
>         [0]=>
>         int(0)
>         [1]=>
>         int(1)
>         [2]=>
>         int(2)
>         ["INT"]=>
>         int(0)
>         ["INTEGER"]=>
>         int(0)
>         ["MEDIUMINT"]=>
>         int(0)
>         ["SMALLINT"]=>
>         int(0)
>         ["TINYINT"]=>
>         int(0)
>         ["BIGINT"]=>
>         int(1)
>         ["SERIAL"]=>
>         int(1)
>         ["DEC"]=>
>         int(2)
>         ["DECIMAL"]=>
>         int(2)
>         ["DOUBLE"]=>
>         int(2)
>         ["DOUBLE PRECISION"]=>
>         int(2)
>         ["FIXED"]=>
>         int(2)
>         ["FLOAT"]=>
>         int(2)
>       }
>       ["_stmt":protected]=>
>       object(Zend_Db_Statement_Mysqli)#395 (12) {
>         ["_keys":protected]=>
>         array(6) {
>           [0]=>
>           string(2) "id"
>           [1]=>
>           string(9) "record_id"
>           [2]=>
>           string(11) "record_type"
>           [3]=>
>           string(10) "element_id"
>           [4]=>
>           string(4) "html"
>           [5]=>
>           string(4) "text"
>         }
>         ["_values":protected]=>
>         array(6) {
>           [0]=>
>           &int(234)
>           [1]=>
>           &int(11)
>           [2]=>
>           &string(4) "File"
>           [3]=>
>           &int(52)
>           [4]=>
>           &int(0)
>           [5]=>
>           &string(118048) "
> The student-run newspaper of Wittenberg University
> VOLUME 88, ISSUE 10
> SPRINGFIELD, OHIO
> Mount Union ends Wittenberg’s title shot with a crushing 49-21 victory
> powering Riders moved the def..."

jflatnes · December 10, 2018, 9:23pm

Ah, I forgot: when we query the search_texts table we don’t actually select the text, since it can be quite large and we don’t actually use it.

You can look up the element texts for each result if you want to do that… you’d be using the $record variable instead and using the metadata helper or some other method of loading the element texts.

kenirwin · December 11, 2018, 8:02pm

Thanks. I feel like I’m getting close here, but I’m struggling to extract the metadata. In the theme/search/index.php view template I’m trying this:

$m = new Omeka_View_Helper_Metadata();
var_dump($m->metadata($record, array('PDF Text','Text')));

I get the response:
Omeka_Record_Exception. There is no element "PDF Text", "Text"!

Am I approaching this wrong? Here’s what I’m looking at:

The text I want is in omeka_element_texts.text

The rows in omeka_element_texts that hold this text all have element_id=52
omeka_elements.id=52 has element_set_id=4, name="Text"
omeka_element_sets.id=4 has name="PDF Text"

The metadata view helper docs describe:
$metadata (string|array) – The metadata field to retrieve. If a string, refers to a property of the record itself. If an array, refers to an Element: the first entry is the set name, the second is the element name.

So I think my set name is “PDF Text” and my element name is “Text”, but it doesn’t seem to work. Any ideas what I’m doing wrong? I’ve tried swapping the order of the array elements just in case, but it didn’t work either.

Thank you for your help!

jflatnes · December 11, 2018, 8:34pm

A couple things: typically you’d just use the global function metadata in a theme view like this, rather than creating an instance of the helper.

Likely the source of your issue here: the PDF Text element set is, to my recollection, only specified for Files, not Items. If you make that metadata call on an Item, it will give you this kind of error since the element doesn’t exist for Items. If Files and Items are mixed in your results and you only want this to run on the Files, you can simply check the record type, which is already pulled out in the search results view.

If your results are Items and you want to get at the file metadata, you could loop over the item’s files and call metadata on each one. Alternatively, you could probably query the text column of the actual “search text” table: we don’t select it as part of the actual search process, but you could use get_record to re-request the row from the database, and you should get everything including the text column.
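
As a rough sketch of the "loop over the files" option: the helper name pdf_texts_for_record() below is made up for illustration, and it assumes that Item::getFiles() returns the item's File records and that the 'all' option makes metadata() return an array of texts. Treat it as a starting point, not as tested theme code.

```php
<?php
/**
 * Collect the PDF Text values for a search result, whether it is a File or an Item.
 * pdf_texts_for_record() is a hypothetical helper name, not an Omeka function.
 */
function pdf_texts_for_record($record)
{
    // Files carry the "PDF Text" element set directly.
    if ($record instanceof File) {
        return metadata($record, array('PDF Text', 'Text'), 'all');
    }

    $texts = array();
    // Items have no such element, so gather the texts from their attached files.
    if ($record instanceof Item) {
        foreach ($record->getFiles() as $file) { // assumption: Item::getFiles()
            $texts = array_merge(
                $texts,
                metadata($file, array('PDF Text', 'Text'), 'all')
            );
        }
    }

    // Any other record type (e.g. a Collection) yields no PDF text.
    return $texts;
}
```

Returning an empty array for other record types means the calling view never hits the "There is no element" exception.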

kenirwin · December 11, 2018, 8:53pm

Ah ha! That worked – thank you so much.
Here’s the code that gave access to the content I wanted:

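        // The PDF Text element set exists only for File records, so check the type first.
        $fulltext = array();
        if ($record instanceof File) {
            // 'all' asks for every text stored for the element.
            $fulltext = metadata($record, array('PDF Text', 'Text'), 'all');
        }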

Thanks for your help!
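
For anyone who lands here wanting the KWIC part itself, here is a minimal sketch of pulling snippets out of the text once you have it. kwic_snippets() is a made-up helper, not something Omeka provides, and it assumes the mbstring extension is installed; the radius and limit defaults are arbitrary.

```php
<?php
/**
 * Return up to $limit keyword-in-context snippets from $text.
 * kwic_snippets() is a hypothetical helper, not an Omeka function.
 * Assumes the mbstring extension is available.
 */
function kwic_snippets($text, $keyword, $radius = 60, $limit = 3)
{
    $snippets = array();
    if ($keyword === '') {
        return $snippets;
    }

    // Collapse whitespace so line breaks from the PDF extraction don't split a snippet.
    $text = trim(preg_replace('/\s+/', ' ', $text));
    $textLength = mb_strlen($text, 'UTF-8');
    $keywordLength = mb_strlen($keyword, 'UTF-8');

    $offset = 0;
    while (count($snippets) < $limit && $offset < $textLength) {
        // Case-insensitive search, continuing after the previous match.
        $pos = mb_stripos($text, $keyword, $offset, 'UTF-8');
        if ($pos === false) {
            break;
        }

        // Take $radius characters on each side, clamped to the text bounds.
        $start = max(0, $pos - $radius);
        $end = min($textLength, $pos + $keywordLength + $radius);
        $snippet = mb_substr($text, $start, $end - $start, 'UTF-8');

        // Mark the sides where the snippet was cut out of longer text.
        $snippets[] = ($start > 0 ? '...' : '')
            . $snippet
            . ($end < $textLength ? '...' : '');

        $offset = $pos + $keywordLength;
    }

    return $snippets;
}
```

Since the 'all' option returns an array of texts, you would likely join it first, e.g. kwic_snippets(implode(' ', $fulltext), $query), where $query stands for whatever search term your view has available.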