Slow imports over 10k

I’ve posted about this before, but until now I’d never had confirmation from two separate servers and two different projects. There seems to be very consistent behavior in which the import (CSV Import version 2.1.0, by the Omeka Team and Daniel Berthereau) slows down over the course of thousands of items. At the start I observe record creation rates of 10-20 per second, but by the time a job reaches 10,000 items it is only managing about 2 per second, and by 15,000 items it is down to 1 or fewer.

In this second case, which confirms the behavior for me, I even disabled thumbnail creation (see my other post), so all that is happening is that a file is copied from the Sideload directory into /files/original and a media record is created. This is on a clean AWS EC2 environment, whereas my first example, which I posted about previously, was on an institutional shared server that could reasonably be suspected of having its own quirks.

I feel like I’m the only one dealing with 10k-plus imports, but if you have 100,000+ records to import, having to split them into 10k chunks is a lot of manual work (even with some fancy text-manipulation scripting), and having to wait 90+ minutes between each one creates workflow challenges.

I think it might be something with how the PHP processes are being called internally? One at a time, but then multiplied by thousands.

Is this consistent regardless of how many items are already in the site, or not?

The CSV Importer already breaks its input into smaller chunks on its own, in an effort to deal with some effects of Doctrine, our database layer, that tend to slow down operations as the number of items, etc., handled in a single operation grows. If imports basically uniformly follow the same pattern of slowing over time, that would indicate some problem with that system, I think.

If instead the speed is more of a factor of simply how much is already in the database (so the last “chunked” 10,000 item import in a batch of 10 is still equally/similarly slow as the last records of a 100,000 item import), that would indicate different issues at work.

One further question: are these “item” imports, another specific type, or “mixed resources” imports?

Thanks John, regarding your first question, maybe there is something to that. These operations are generally “append”, and always “item”. In the case of Edison Papers, there are already 150k items (just metadata), and then we are importing to attach images. In the case of Metascripta, there are 1k items (just metadata), and similarly importing to attach images.

I was just reading the note under “Number of rows to process by batch.” Today I had been experimenting with higher batch numbers, but do you think a number lower than 20 (the default, according to that note) would yield better results? Or that 20 would yield better results than a batch size of 100?

So here are the current stats:

Started: June 8, 2020, 2:21:40pm
It’s now 3:27pm on my local computer, so roughly 5-6 minutes in.

It’s already at 8000 items, roughly half of the total CSV records. But now it appears to have hit that slower rate, currently ~4-5 media items (in the media table) per second. Compare that to the last 16k-record import, which took 2.5 hours according to the job details.

EDIT
3:31pm - View log (8429 results)
3:44pm - View log (10104 results)

So I guess my point is that it doesn’t make sense to have such an inconsistent rate of processing. Although the two servers are different, each is a LAMP stack with a SQL database (MariaDB in the AWS case), each uses SSD storage, and each runs Sideload and CSV Import - and each demonstrates the same rapidly decreasing rate of “import.” If the tail-end rate is the “normal” one, then I’m not sure where the initial “burst” comes from; but I’m assuming the “burst” at the beginning is normal, and the tail end is the degraded/slow behavior.

Current import args



{
    "filename": "import-ojpg-151-170.csv",
    "filesize": "492859",
    "filepath": "\/tmp\/omekaoYqHAy",
    "media_type": "text\/csv",
    "resource_type": "items",
    "comment": "",
    "automap_check_names_alone": true,
    "column-property": [
        {
            "dcterms:identifier": 10
        }
    ],
    "column-data-type": [
        "literal",
        "literal"
    ],
    "column-media_source": {
        "1": "sideload"
    },
    "generic-data": "default",
    "media-source": "default",
    "o:resource_template": "",
    "o:resource_class": "",
    "o:owner": "",
    "o:is_public": 1,
    "multivalue_separator": ",",
    "global_language": "",
    "action": "append",
    "identifier_column": 0,
    "identifier_property": "dcterms:identifier",
    "action_unidentified": "skip",
    "rows_by_batch": 100,
    "column-multivalue": [],
    "delimiter": ";",
    "enclosure": "\"",
    "escape": "\\"
}

These are “append” imports, so you’re actually doing batch updates of existing items, yes?

I think this may be the source of the issue: we have functionality for doing a “batch create” that CSV Import uses when creating new items; that’s what I was alluding to in my previous message. Updates don’t work the same way, as we don’t have a “batch update” to use in that situation.

I think you may basically be experiencing the Doctrine slowdown that comes with having a large number of “managed” records. What the batch create I mentioned does is clear the imported records out of Doctrine after every “batch,” so you don’t end up with an ever-growing set of managed records and the slowdown (and memory usage) that comes with it. We should be able to do something similar for updates, if that’s indeed the issue here.
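To make that concrete, here’s a minimal sketch of the pattern (this isn’t the module’s actual code; $rows, $batchSize, and $entityManager are just placeholders for what the import job has in hand):

// Sketch of the "flush, then clear" batching pattern described above.
foreach (array_chunk($rows, $batchSize) as $batch) {
    foreach ($batch as $row) {
        // ... create or update the entity for this row ...
    }
    // Write this batch's changes to the database.
    $entityManager->flush();
    // Detach the managed entities so Doctrine's unit of work (and its memory
    // use and change-tracking cost) stays flat instead of growing with every
    // batch processed so far.
    $entityManager->clear();
}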

It would be particularly instructive to see if a pure “create” import does or does not have the poor performance characteristics you’re seeing here. If my guess of what’s going on is right, a “create” job shouldn’t have this same problem.


OK! It would be great to get to the bottom of this. What you said makes a lot of sense - in prior threads, I’d always assumed “import” was a monolithic function, so I never mentioned what type it was.

I’m happy to try out a create. Does my create need to include the files (media_url in my CSV) in order to be comparable?

It’s probably best to include them, I guess. Since a create will be creating all the items as well, it’s never going to be perfectly comparable, but I think including the files will end up “closer.”

I hope to have some time to do the test today. But I tried setting the batch size to 1, per the note’s instructions about avoiding issues - I had never tried 1 before. This reduced the import time from 2.5 hours to 1 hour 15 minutes, for about 12,000 items! I think the out-of-the-box default is 20, and I had previously kept increasing it - 100, 500. It’s a little counter-intuitive, because with plain SQL it’s better to do multi-value inserts rather than one row at a time. But with Doctrine and the architecture you described, 1 is the most efficient value (for all operations other than create, which has a “bulk” mode). Does that sound correct?

The main cause of the slow processing is the search for identifiers. The module searches for identifiers to avoid duplicating resources. Ideally, the module should be improved to do a two-step import: first extract all identifiers and fetch the list of resource ids in one pass (and possibly do a dry run), then import all the rows.
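For example, something along these lines - just a rough sketch, where $connection, $rows, $identifierColumn, and $identifierPropertyId are placeholders, and the table and column names simply mirror the Omeka S value table:

// Resolve every identifier to a resource id with one query up front, then
// import the rows against the in-memory map instead of querying per batch.
$identifiers = array_column($rows, $identifierColumn);
$stmt = $connection->executeQuery(
    'SELECT `value`, resource_id FROM `value` WHERE property_id = ? AND `value` IN (?)',
    [$identifierPropertyId, $identifiers],
    [\PDO::PARAM_INT, \Doctrine\DBAL\Connection::PARAM_STR_ARRAY]
);
$idMap = [];
foreach ($stmt->fetchAll() as $row) {
    $idMap[$row['value']] = (int) $row['resource_id'];
}
// A row's identifier can then be resolved without another query:
// $resourceId = $idMap[$identifier] ?? null;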

Who is in charge of CSV Import now? I see it is a collaborative effort between the Omeka team and you, Daniel.

The reason I ask is to find out whether this is a fundable improvement.

I’m skeptical that identifier lookup is a significant aspect of what we’re seeing here, actually. The row identifiers for updates are already looked up only once per batch. That also doesn’t really square with Ben’s observation that import times decrease when the batch size decreases, since smaller batches mean more identifier lookup queries. Not to say there couldn’t be improvements made, but I’m not sure it’s really the main issue at play.

Ben, did you ever have time to try out “create”? Hard data on what kinds of differences you’re seeing here would be extremely helpful.

As for who’s “in charge,” the home of the repository for the CSV Import module is still the “official” Omeka S organization. Daniel contributed a very significant portion of the current importer’s functionality, particularly around supporting updates, among other things. What with everything being open source and forkable, “in charge” is a slightly tricky concept to me, but suffice it to say from our end that we (the Omeka team) are continuing to support and improve CSV Import.


Alright, let me try to do the test right now. I just have to work with a new Omeka S install, because there are no “new” records to test with, and I also have to create a new import CSV that has both the metadata AND the media… hmm, but actually, I’m not sure how that is going to work, because I have 1 document (with associated metadata) but a variable number of images for each record. Can Sideload handle a field of images with an internal separator?

Like “xy/xy2349.jpg;xy/xy2350.jpg;xy/xy2351.jpg”? What if there were 100 individual images inside that single field?

I’m recalling that this is why I started doing the “document metadata” import first (as a CREATE, to establish the record); and then the image import afterwards, as an APPEND.

If you mark the column as Multivalue and use the multivalue separator in the cell, you can have multiple media within a single cell.

As for 100 images in a single cell… I would think it should be OK, but it’s certainly bigger than the kind of data we’d typically be testing with. How do you usually handle this situation - with just 100 rows, each listing one media file to append?

Yes, exactly. And for METAscripta (where we have a better server), we have some manuscripts with 400-500 images. For Edison Papers, a lot are 1-5 images, but some notebooks are 100-200 images each.

Wasn’t able to do the create vs. append test, but I did do this. On the better server (t2.large on AWS), I imported 5 “rolls” of images as “append”. This was a total of 4138 records.

It took 10 minutes when I set batch size to 1. Very fast compared to prior imports with larger batch sizes.

Then I doubled it, doing 10 “rolls” of images as “append,” for a total of 8419 records. After ~10 minutes, it was at nearly the same point as the first job - slightly more than 4000 records.

After 20 minutes (10 minutes more), however, it was at just slightly over 6000 records. The total job time was 33 minutes.

It really feels to me like it has something to do with the ~5000 mark.

On the slower server (the other project), we are seeing 10k-record imports take 13.5 hours, but we observe them hitting 5k in much less than half that time. We also observed the final 1000 records taking the last ~3.5 hours.

Will try to get more hard data for you.

Thanks for providing this information.

It may be easier for us to try this out the other way around: I’ve made a small branch of the CSV Import module that adds “clearing” of the Doctrine layer, similar to what already exists for “create” jobs, so you can see whether that improves your experience. There’s probably still more I would change, but this should give a strong indication of whether the Doctrine layer is to blame for your performance issues.

It’s the branch clear-em-on-updates.

If you already have it as a Git checkout, just checking out that branch should be enough to pick up the change. If you’re normally using a zipped copy, note that when installing CSV Import from Git you need to run composer install in the module’s folder.
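For example, from a shell on the server, something like this (the path is only an illustration of a typical layout; adjust it to wherever your modules directory lives):

# Switch an existing Git checkout of the module to the test branch
# and install its dependencies.
cd /var/www/omeka-s/modules/CSVImport
git fetch origin
git checkout clear-em-on-updates
composer install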


@benbakelaar Any news on testing out this branch? If it helps, we’d love to merge it and get a new version out with our next set of module releases.

Thanks for the poke Sharon!

I’m a little stuck on this.

I always use the “release” ZIP copy to install modules, which just involves extracting the zip and then going to the admin UI. I’m not totally clear here - it sounds like even if I download this “branch” as a ZIP and remove the existing CSV Import… I still need to run composer install? But as I’ve never done that (for a plugin/module on an existing installed LAMP-stack app that’s in production), I’m not sure whether I have the right version of composer, whether I should install composer, etc. Basically, I’m not familiar enough with composer to know what to do.

Beyond that, I’ve lost one of my testing sites. With the Edison Papers site, we’ve shifted our strategy to simply leave the “NoThumbnail” setting in place and use UniversalViewer (with IIIF built in!). That gives us the smallest hard-drive footprint for the 20GB of images (240k image files) that comprise the “set” for the legacy digital edition.

The second testing site is METAscripta, which is running on a t2.large on AWS EC2 and is much, much faster than the Edison install. That being said, I’ve just reviewed the job history, and it confirms the same pattern of “slowing down” as imports go on. I’m including it as a screenshot so the formatting doesn’t get lost.

[screenshot of the import job history]

So in my pattern, it appears there is a “doubling time” every 4000 records. Once I get some insight on the “composer install” installation of this branch, I’ll be able to test whether it produces a more linear rate of import - both without thumbnails (as you can see I’ve been doing) and then with thumbnails (I’ll turn them back on).

OK, I was able to successfully install the clear-em branch, install composer on the server, and run composer install. So I’m all set up to do the testing - except I don’t have access to the image files I need to run the imports :) I should be able to get them uploaded on Sunday or Monday morning and post the results.

Good to hear, Ben. I’m glad you figured it out.

Thanks for doing the testing!

Quick results - it’s definitely quicker! But I’m still seeing the “slow down.”

CLEAR-EM BRANCH

June 22 - 5 manus - 4027 items - batch size 1 - 7 mins - NO thumbnails
June 22 - 10 manus - 8227 items - batch size 1 - 20 mins - NO thumbnails
(at 14 mins: 6166 items)

Going to try a 12k import tonight to see if it follows the same pattern. But again, definitely quicker.
