Slow CSV import due to daniel-km/noid4php

I am experiencing a really slow CSV import (running version 2.6.1 of the CSV Import module on Omeka S 4.4.1). The slowness seems to be concentrated in the “creates”, not the “updates”.

This particular CSV import does not have any links to media or IIIF; it does have references (via sdo:identifier) to other Omeka resources. One module in use that might have an impact (and that I do not want to disable, for obvious reasons) is Ark (version 3.5.15). Other modules like Solr Search, History and Statistics are already disabled to hunt down the bottleneck.

Currently this Omeka S instance has about 1.7M items each with an ARK identifier.

I have now used xhprof to measure the import action, specifically line 357 of the create function in the Job/Import.php file.

The profile of a single execution of this function looks as follows:

This shows that the _dba_fetch_range and dba_nextkey functions - which can be found in Ark/vendor/daniel-km/noid4php/lib/Noid.php - take nearly 400 seconds to mint 50 new ARKs for 50 new resources !?

Is this expected behaviour? Well, if I look at the code of Noid.php in daniel-km/noid4php (version 1.1.2, as specified in the composer.json of the Ark module; 1.2.1 is the current version of noid4php, released in April this year), I see the following comment on the _dba_fetch_range function:

    /**
     * Workaround to get an array of all keys matching a simple pattern.
     *
     * @internal The default extension "dba" doesn't allow to get range of keys.
     * This workaround may be slow on big bases and may need a lot of memory.
     * @todo Build a partial temporary base to avoid memory out for big bases.
     *
     * @param string $pattern The pattern of the keys to retrieve (no regex).
     * @param resource $db
     * @return array Ordered associative array of matching keys and values.
     */
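
To illustrate why I think this gets slow at 1.7M items, here is a toy model in Python. It is not the noid4php code: the class, the function names and the "item/" and "meta/" key prefixes are all made up for illustration. The only assumption taken from the comment above is that the store offers first-key/next-key iteration and no range lookup, so finding the keys that match a pattern means walking every key in the database.

```python
from bisect import bisect_left


class SequentialStore:
    """Hypothetical key-value store that only supports first-key /
    next-key iteration, i.e. no range or prefix lookup."""

    def __init__(self, data):
        self._data = dict(data)
        self._keys = list(self._data)
        self.keys_visited = 0  # how many keys iteration has touched

    def first_key(self):
        return self._step(0)

    def next_key(self, position):
        return self._step(position + 1)

    def _step(self, position):
        if position >= len(self._keys):
            return None
        self.keys_visited += 1
        return position, self._keys[position]

    def fetch(self, key):
        return self._data[key]


def fetch_range_by_scan(store, prefix):
    """Collect all entries whose key starts with prefix by walking
    the whole store, one key at a time."""
    matches = {}
    cursor = store.first_key()
    while cursor is not None:
        position, key = cursor
        if key.startswith(prefix):
            matches[key] = store.fetch(key)
        cursor = store.next_key(position)
    return dict(sorted(matches.items()))


def fetch_range_by_index(sorted_keys, data, prefix):
    """Same result using a sorted key index: jump to the first
    candidate and stop at the first key that no longer matches."""
    matches = {}
    visited = 0
    i = bisect_left(sorted_keys, prefix)
    while i < len(sorted_keys):
        visited += 1
        key = sorted_keys[i]
        if not key.startswith(prefix):
            break
        matches[key] = data[key]
        i += 1
    return matches, visited


# 100,000 made-up item keys plus two made-up "meta/" keys.
data = {f"item/{n:06d}": n for n in range(100_000)}
data.update({"meta/a": "x", "meta/b": "y"})

store = SequentialStore(data)
scanned = fetch_range_by_scan(store, "meta/")
indexed, visited = fetch_range_by_index(sorted(data), data, "meta/")

assert scanned == indexed
print(store.keys_visited, visited)  # prints "100002 2"
```

In this model the scan touches every key to find two matches, and it would do so again on every call. If something similar happens for each newly minted ARK, the cost per import row grows with the total size of the database, which would fit what the profile shows.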

@Daniel_KM can I use noid4php version 1.2.1 as a drop-in replacement for version 1.1.2? Or will this be part of the next release of Omeka-S-module-Ark?

Would disabling the ARK module for the import and enabling it again after the import work, or would the “Create ARKs” admin action then take a very long time (just shifting the problem)?

I really want ARKs for all of my items, so I hope a solution can be found to make creating ARKs scale.

FYI: I had no luck replacing noid4php version 1.1.2 with 1.2.1 (too bad, because the new version also allows the use of MariaDB instead of Berkeley DB).

I tried disabling the ARK module prior to the import and enabling the module afterwards. This did work: the “Create ARKs” function generates ARKs for those items which don’t have an ARK yet, at good speed! So this is my workaround for now.
