CSV import doesn't finish

I’m trying to import a TSV file with 19,215 lines. The import seems to stop after 40 batches of 25 items, as far as I can see in the log. There are no errors in the Log (besides info entries like the one shown below) or in application.log. The job’s status stays at In Progress.

Any ideas what could be causing this? Surely there isn’t a limit of 1,000 items?

Log

25 items were updated (update): "ark:/60537/bCgjFV" (70896), "ark:/60537/bumJw5" (70897), "ark:/60537/bms8k2" (70898), "ark:/60537/bdxya1" (70899), "ark:/60537/b5PEGc" (70900), "ark:/60537/bXV6fW" (70901), "ark:/60537/bQ0w53" (70902), "ark:/60537/bAp1p0" (69921), "ark:/60537/bI5UVS" (70903), "ark:/60537/bAbjL0" (70904), "ark:/60537/bsgJBZ" (70905), "ark:/60537/bjn8r7" (70906), "ark:/60537/bbFeXu" (70907), "ark:/60537/b3KENt" (70908), "ark:/60537/bVQ6nc" (70909), "ark:/60537/bNVwcb" (70910), "ark:/60537/bG0V3u" (70911), "ark:/60537/by6jSg" (70912), "ark:/60537/bqbJIf" (70913), "ark:/60537/bhuQfX" (70914), "ark:/60537/b9Af55" (70915), "ark:/60537/b1FEVU" (70916), "ark:/60537/bTL6vE" (70917), "ark:/60537/bLQwkD" (70918), "ark:/60537/bDVVaC" (70919).

Job

Status
   In Progress
Started
   August 4, 2023 at 10:07:19 AM UTC
Ended
   [not ended]
Class
   CSVImport\Job\Import
Owner
   goudatijdmachine
Args
{
    "filename": "bestaandepercelen.tsv",
    "filesize": "15744156",
    "filepath": "/tmp/omekaSFqE5Y",
    "media_type": "text/tab-separated-values",
    "resource_type": "items",
    "comment": "",
    "automap_check_names_alone": true,
    "column-property": {
        "0": {
            "dcterms:identifier": 10
        },
        "1": {
            "sdo:identifier": 1082
        },
        "2": {
            "gtm:kadastraleAanduiding": 1361
        },
        "3": {
            "gtm:kadastraleSectie": 1362
        },
        "4": {
            "gtm:kadastraalNummer": 1363
        },
        "5": {
            "gtm:oppervlakte": 2048
        },
        "8": {
            "geo:asWKT": 1162
        },
        "9": {
            "dcterms:source": 11
        },
        "10": {
            "rdfs:seeAlso": 3506
        }
    },
    "column-data-type": [
        "literal",
        "literal",
        "literal",
        "resource",
        "literal",
        "literal",
        "literal",
        "literal",
        "geometry",
        "uri",
        "uri"
    ],
    "column-map-lat": {
        "6": "1"
    },
    "column-map-lng": {
        "7": "1"
    },
    "generic-data": "default",
    "media-source": "default",
    "mapping-module": "default",
    "o:resource_template": {
        "o:id": 14
    },
    "o:resource_class": {
        "o:id": 911
    },
    "o:owner": "",
    "o:is_public": 1,
    "o:item_set": [
        "69864"
    ],
    "o:site": [
        "2"
    ],
    "multivalue_separator": ";",
    "global_language": "",
    "action": "update",
    "identifier_column": 0,
    "identifier_property": "dcterms:identifier",
    "action_unidentified": "skip",
    "rows_by_batch": 25,
    "column-multivalue": []
}
Log
   view log
Log (database)
   View log (40 results)

System info

  • Omeka S > Version 4.0.1
  • PHP > Version 7.4.33
  • SAPI fpm-fcgi
  • Memory Limit 500M
  • POST Size Limit 100M
  • File Upload Limit 100M
  • Garbage Collection Yes
  • Modules
    active > Advanced Resource Template (3.4.24), Advanced Search (3.4.12), Api Info (3.4.10), Ark (3.5.13.4), Blocks Disposition (3.4.2.3-beta), Bulk Edit (3.4.21), CSV Import (2.4.1), Clean Url (3.17.6), Custom Vocab (2.0.1), Data Type Geometry (3.4.3), Data Type RDF (3.4.7), Derivative Media Optimizer (3.4.6), Easy Admin (3.4.12), Extract Text (1.3.0), Feed (rss/atom) (3.4.3.3), File Sideload (1.7.1), Generic module (3.4.44), History Log (3.4.10-beta-2), Iiif Search (3.4.4), IIIF Server (3.6.14), Image Server (3.6.14), Inverse Properties (1.0.0), Item Sets Tree (0.7.0), Log (3.4.16), Mapping (1.9.0), NDE Termennetwerk (1.0.0), Numeric Data Types (1.11.3), PDF Embed (1.2.1), Sharing (1.4.0), Sitemaps (1.1), Statistics (3.4.5.3), Universal Viewer (3.6.7), URI Dereferencer (1.3.2), Value Suggest (1.16.1)
    not_active Ark URL (0.3.0), Block Plus (3.4.15.9), Comment (3.4.1.12), Linked Data Sets (0.1), Reference (3.4.43), Resource Meta (1.0.0), Advanced Search adapter for Solr (3.5.43)
    not_installed Custom Ontology (3.4.5.1), Block Plus (3.4.15.9)
    invalid_omeka_version Notify Create/Update Item (0.0.2)
  • Paths PHP CLI path /usr/bin/php
  • Paths ImageMagick directory /usr/local/bin/

I also encountered this issue on a “create” CSV import of 1,627 items. The first time it stopped after 600 imported items, the second time after 250 items. I was watching the perform-job.php process on the command line with ps -ef and noticed the process disappeared (i.e. stopped) before it had finished the complete import. Again, nothing in the Log or in application.log.

FYI: the Omeka\Job\DispatchStrategy is configured in module.config.php as Omeka\Job\DispatchStrategy\PhpCli, so this can’t be a timeout issue as there is no timeout on PHP CLI.
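For reference, that setting is just a service alias; a minimal sketch of the relevant config fragment (assuming a stock install; the Synchronous alternative mentioned in the comment is only a debugging option, not what we run):

<?php
// config/local.config.php — sketch of the relevant fragment only.
return [
    'service_manager' => [
        'aliases' => [
            // Default: jobs run in a detached PHP CLI process (no web timeout).
            'Omeka\Job\DispatchStrategy' => 'Omeka\Job\DispatchStrategy\PhpCli',
            // Swapping this to Omeka\Job\DispatchStrategy\Synchronous runs the
            // job in the foreground, which can make fatal errors easier to see.
        ],
    ],
];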

Next, I truncated the csvimport_entity and csvimport_import tables, and ran the last job (identified by job ID 1403) on the command line with:

$ ./application/data/scripts/perform-job.php --job-id 1403 --base-path /data --server-url https://www.goudatijdmachine.nl

This time, the CSV import completed successfully, importing all items from the TSV file…

As you say, it shouldn’t be a timeout issue, unless there’s some external watchdog or something on your server killing processes run by the web server if they run for long enough.

A job stuck at “In Progress” when the process is actually no longer running generally means a fatal error occurred. I would say the most common situation is exceeding PHP’s configured memory limit, but the process being killed externally would produce the same result. A memory issue doesn’t really match your experience of it running fine when run manually, either.

You can enable logging for the PHP CLI (how you do this is somewhat server/distribution dependent), which should indicate any fatal error occurring from PHP’s side.
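If it helps, a small throwaway script (hypothetical, not part of Omeka) run with the same PHP binary the jobs use will show where fatal errors from the CLI will end up:

<?php
// check_cli_logging.php — hypothetical helper; run with the same binary used
// for jobs (e.g. `php7.4 check_cli_logging.php`) to see where fatals will go.
printf("SAPI:            %s\n", PHP_SAPI);
printf("log_errors:      %s\n", ini_get('log_errors') ? 'On' : 'Off');
printf("error_log:       %s\n", ini_get('error_log') ?: '(empty: SAPI default, often syslog for the CLI)');
printf("error_reporting: %s\n", ini_get('error_reporting'));
printf("memory_limit:    %s\n", ini_get('memory_limit'));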

Thanks for your suggestion. By default the PHP CLI logs end up in /var/log/syslog, and there I found that it was an issue with the geometry data I was importing: @Daniel_KM’s DataTypeGeometry module raised an ‘Invalid geometry’ error (some geometries were indeed wrong).

So this raises the question: in general, can errors in jobs that are dispatched via the PHP CLI be caught so they show up in the Log and set the job status to error? Or should modules (which may be triggered during a CSV import) not throw errors but use another means of communicating them?

It depends on the exact nature of the error, but typically any errors that would go to the Omeka error log will be logged in the job’s log. Exceptions, for example, are typically handled this way.

If the error is a PHP-level fatal error like a memory limit problem then catching that and logging it can be trickier.
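To illustrate the general pattern, here is a sketch of a hypothetical module job (not CSV Import’s actual code): recoverable per-row problems can be caught and written to the job’s log, while an exception that escapes perform() is logged and flips the job status to error; a PHP fatal error or a killed process bypasses all of this, which is why the status can stay stuck at In Progress.

<?php
// Sketch of a hypothetical module job, assuming the usual Omeka S job conventions.
namespace MyModule\Job;

use Omeka\Job\AbstractJob;

class ImportSomething extends AbstractJob
{
    public function perform()
    {
        // While a job runs, messages sent to Omeka\Logger end up in the
        // job's log (the one shown on the job detail page).
        $logger = $this->getServiceLocator()->get('Omeka\Logger');

        foreach ($this->getArg('rows', []) as $index => $row) {
            try {
                // ... process one row; may throw on bad data (e.g. invalid geometry)
            } catch (\Exception $e) {
                // Recoverable: log the problem and continue with the next row.
                $logger->err(sprintf('Row %d skipped: %s', $index, $e->getMessage()));
            }
        }

        // An exception that escapes perform() is caught by the job layer,
        // logged, and the job status is set to "error". A PHP fatal error or
        // an externally killed process never reaches that handling.
    }
}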

What I described is the default logging functionality. I do notice you’re using a “Log” module; I don’t have any insight into how logging works with that module.

It depends on the exact nature of the error, but typically any errors that would go to the Omeka error log will be logged in the job’s log. Exceptions, for example, are typically handled this way.

Yes, but not errors raised during a CSV Import run by perform-job.php, which is handled via the PHP CLI? As this is a separate process, I guess application.log and the Log module are “out of the loop”…

A background job kicks off as a separate process but it runs, basically, the whole Omeka S application in the job. So logging and other modules and things like that are all still in play. This is how, for example, CSV Import can support importing data that’s module-specific.

I’m having similar issues with another job: a Bulk Edit / DataTypeGeometry job converting the mapping markers of 77 items to geometry properties. As this job didn’t finish, I tried rerunning it with the perform-job.php script. After a while the script showed Killed on the command line.

A dmesg | grep php showed that my operating system was responsible for killing the process (basically saying 100GB is too much, which I tend to agree with):

[2470803.487281] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=php8.1-fpm.service,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-181327.scope,task=php7.4,pid=2723600,uid=1000
[2470803.487522] Out of memory: Killed process 2723600 (php7.4) total-vm:101029400kB, anon-rss:100278628kB, file-rss:5456kB, shmem-rss:17336kB, UID:1000 pgtables:196600kB oom_score_adj:0
[2470806.762298] oom_reaper: reaped process 2723600 (php7.4), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

So, how can I (in general) debug situations like this?

That’s certainly an odd one: it’s quite rare to see a PHP process killed for using too much memory itself, because PHP has its own internal memory limit, which is almost always vastly lower than the values you’re talking about. Omeka itself and our modules don’t attempt to increase or disable the memory limit. I’m not aware of any similar reports from other users.

I don’t have any immediate advice, though… Using a debugger or inserting debugging/logging statements into the job’s code is an option, as is using strace to get a look at what the process is doing as it runs, but these are only very general bits of advice. For something that evades PHP’s memory limit (unless you’ve just disabled it yourself), I’d look at PHP extensions first, I suppose. They can allocate memory that isn’t accounted for under PHP’s memory limit, in my experience.

I reduced the memory limit for PHP CLI to 30 GB and ran the following command:

 MEMPROF_PROFILE=dump_on_limit php7.4 -dextension=memprof.so perform-job.php

which led to

PHP Fatal error:  Allowed memory size of 30048576000 bytes exhausted (tried to allocate 25767710720 bytes) (memprof dumped to /tmp/memprof.callgrind.1774338386270327) in /home/http/goudatijdmachine.nl/omeka-s-4.0.3/application/src/DataType/Resource/AbstractResource.php on line 108

So now I at least have a line number to work from (great, because I have no clue what to look for when I load the memprof dump into QCacheGrind).

Oh, so you did just have a very high or disabled PHP memory limit.

Any number of gigabytes is really way off the scale of expected memory usage for Omeka S, at least; regular operation works at levels like 128 megabytes.

The error message for a memory limit issue isn’t always the most helpful, as you can just get an innocent allocation that happens to be the straw that broke the camel’s back, but you definitely have something useful there, as it seems to be trying to allocate 25 GB or so in one go… though it also references just the end of a class that doesn’t do anything much interesting.

Less “scientific” options, like logging or printing the memory_get_usage value periodically, can be helpful as a pointer and are easy to do.
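For example, a tiny helper like this (hypothetical, meant to be dropped temporarily into the job’s loop) makes it easier to see which batch the memory balloons in:

<?php
// Hypothetical helper for temporary insertion into a job's code: logs the
// current and peak memory usage so you can see where it starts to grow.
function logMemoryUsage(\Laminas\Log\LoggerInterface $logger, int $row): void
{
    $logger->info(sprintf(
        'Row %d: %.1f MB in use, %.1f MB peak',
        $row,
        memory_get_usage(true) / 1048576,
        memory_get_peak_usage(true) / 1048576
    ));
}

// Inside the job's loop, e.g. every 100 rows:
// if ($row % 100 === 0) {
//     logMemoryUsage($this->getServiceLocator()->get('Omeka\Logger'), $row);
// }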