What are the best practices for full text search with Omeka S?

dfox · June 14, 2018, 2:02am

Not much in the indexing job log:

2018-06-14T01:55:10+00:00 INFO (6): Start
2018-06-14T01:55:10+00:00 INFO (6): Index id: 1
2018-06-14T01:55:10+00:00 INFO (6): Indexing resource #2 (items)
2018-06-14T01:55:10+00:00 INFO (6): Indexing resource #3 (items)
2018-06-14T01:55:10+00:00 INFO (6): Indexing resource #4 (items)
2018-06-14T01:55:10+00:00 INFO (6): Indexing resource #5 (items)
2018-06-14T01:55:10+00:00 INFO (6): Indexing resource #6 (items)
2018-06-14T01:55:10+00:00 INFO (6): Indexing resource #7 (items)
2018-06-14T01:55:10+00:00 INFO (6): Indexing resource #8 (items)
2018-06-14T01:55:10+00:00 INFO (6): Indexing resource #9 (items)
2018-06-14T01:55:10+00:00 INFO (6): Indexing resource #10 (items)
2018-06-14T01:55:10+00:00 INFO (6): Indexing resource #11 (items)
2018-06-14T01:55:10+00:00 INFO (6): Indexing resource #12 (items)
2018-06-14T01:55:10+00:00 INFO (6): Indexing resource #13 (items)
2018-06-14T01:55:10+00:00 INFO (6): Indexing resource #14 (items)
2018-06-14T01:55:10+00:00 INFO (6): Indexing resource #15 (items)
2018-06-14T01:55:10+00:00 INFO (6): Indexing resource #16 (items)
2018-06-14T01:55:10+00:00 INFO (6): Indexing resource #17 (items)
2018-06-14T01:55:10+00:00 INFO (6): Indexing resource #18 (items)
2018-06-14T01:55:10+00:00 INFO (6): Indexing resource #19 (items)
2018-06-14T01:55:10+00:00 INFO (6): Indexing resource #20 (items)
2018-06-14T01:55:10+00:00 INFO (6): Indexing resource #48 (items)
2018-06-14T01:55:10+00:00 INFO (6): Commit
2018-06-14T01:55:10+00:00 INFO (6): Commit
2018-06-14T01:55:10+00:00 INFO (6): End

Nothing in the logs found on the Solr admin page either.

Hmmm.

dfox · June 14, 2018, 2:45am

I set up a new index just to confirm what I was seeing, leaving the mappings and configuration in place as shown above. After running the indexing job, I looked back at my Solr config to see which dynamic fields had been added. Everything appeared as before (dc_terms_s, is_public_b, etc.), except that media_content_txt_en_split did not show up.

I added additional files both via the Sideload and the HTML formatter, but nothing seems to be making it into the Solr index. If there’s anything else I can check which might help debug this, let me know.
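One further check is to ask Solr directly how many documents have any value in the media content field. The snippet below is only a minimal sketch: the base URL http://localhost:8983/solr and the core name omeka are assumptions, and buildFieldCheckUrl is a hypothetical helper, not part of the Solr module.

```php
<?php
// Hypothetical helper, not part of the Solr module: builds a Solr select URL
// that counts the documents having any value in the given field.
function buildFieldCheckUrl(string $solrBase, string $core, string $field): string
{
    $query = http_build_query([
        'q' => $field . ':*', // match any document with a value in the field
        'rows' => 0,          // only the count is needed, not the documents
        'wt' => 'json',
    ]);

    return rtrim($solrBase, '/') . '/' . rawurlencode($core) . '/select?' . $query;
}

// Assumed values; replace them with your own Solr URL and core name.
$url = buildFieldCheckUrl('http://localhost:8983/solr', 'omeka', 'media_content_txt_en_split');
echo $url, PHP_EOL;

// Uncomment to run the query against a live Solr instance:
// $response = json_decode(file_get_contents($url), true);
// echo 'Documents with media content: ', $response['response']['numFound'], PHP_EOL;
```

If numFound comes back as 0 while the other mapped fields return matches, the media text probably never reached Solr, which points at extraction rather than at the mapping.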

pols12 · June 14, 2018, 1:49pm

I believe I have found why text media are not indexed, but I still don’t understand why media ingested with the HTML ingester are not indexed.

Please try editing Solr/src/Service/ValueExtractor/ItemValueExtractorFactory.php at line 42:

https://github.com/Daniel-KM/Omeka-S-module-Solr/blob/db25110a5b441e0cfd5e23863e98b6f7d4af9997/src/Service/ValueExtractor/ItemValueExtractorFactory.php#L42

use Interop\Container\ContainerInterface;
use Zend\ServiceManager\Factory\FactoryInterface;
use Solr\ValueExtractor\ItemValueExtractor;

class ItemValueExtractorFactory implements FactoryInterface
{
    public function __invoke(ContainerInterface $services, $requestedName, array $options = null)
    {
        $api = $services->get('Omeka\ApiManager');
        $config = $services->get('Config');
        $baseFilepath = $config['file_store']['local']['base_path'] ?: (OMEKA_PATH . '/files');

        $itemValueExtractor = new ItemValueExtractor;
        $itemValueExtractor->setApiManager($api);
        $itemValueExtractor->setBaseFilepath($baseFilepath);

        return $itemValueExtractor;
    }
}

Replace /files with /files/original like this:

$baseFilepath = $config['file_store']['local']['base_path'] ?: (OMEKA_PATH . '/files/original');

Also, try adding a property to your media and mapping it, to check whether it gets indexed correctly.
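To illustrate why the /files to /files/original change matters, here is a small sketch. It assumes the extractor simply joins the base path with the media's stored filename, which is a guess about the module's behaviour rather than a copy of its code. resolveMediaPath and the filename abc123.txt are made up for the example.

```php
<?php
// Hypothetical stand-in for how an extractor might locate a media file:
// the base path joined with the stored filename (an assumption).
function resolveMediaPath(string $baseFilepath, string $filename): string
{
    return rtrim($baseFilepath, '/') . '/' . $filename;
}

// Build a throwaway directory that mimics a local store where the
// original files are kept in a files/original subdirectory.
$root = sys_get_temp_dir() . '/omeka_demo_' . uniqid();
mkdir($root . '/files/original', 0777, true);
$filename = 'abc123.txt'; // made-up storage name
file_put_contents($root . '/files/original/' . $filename, 'sample text');

$before = resolveMediaPath($root . '/files', $filename);          // old default
$after = resolveMediaPath($root . '/files/original', $filename);  // corrected default

$foundBefore = file_exists($before);
$foundAfter = file_exists($after);

var_dump($foundBefore); // lookup with the old default
var_dump($foundAfter);  // lookup with the corrected default

// Clean up the demo files.
unlink($after);
rmdir($root . '/files/original');
rmdir($root . '/files');
rmdir($root);
```

Under that assumption, a lookup relative to /files misses the file, so there is no text to send to Solr, while a lookup relative to /files/original finds it.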

dfox · June 17, 2018, 8:52pm

Thank you, @pols12. I made the change to the item extractor and was able to index the content of the sideloaded media file. I also followed your suggestion of adding metadata to this media file and mapping this property; this, too, worked like a charm.

I haven’t looked further at the HTML ingester, but I don’t really need it. If I discover anything in this area, I will let you know.