Slow imports over 10k

CLEAR-EM BRANCH

On an AWS EC2 t2.large server, using “1” as the batch import size for the “append” operation.

| Date | Manuscripts | Items | Time | Thumbnails |
|---|---|---|---|---|
| June 22 | 5 | 4,027 | 7 mins | No |
| June 22 | 10 | 8,227 | 20 mins | No |
| June 23 | 15 | 12,052 | 37 mins | No |

Definitely better - about a 50% reduction in import time for 12k-13k items. I’m sure there’s some sort of curve of typical import sizes, and I’m guessing anything above 10k items sits at the rare end of it.

I still see that doubling of the time at play, though. I’m going to look on Stack Overflow - maybe that’s just the nature of SQL, or of MariaDB/MySQL? I’ve never done such close testing before, so perhaps it’s just a natural pattern with large SQL operations.

The scaling’s not too far off linear there at least for your second example. To some extent I’d expect some slowdown no matter what as the operations of the SQL server get incrementally slower with more data already loaded, though this probably indicates some remaining inefficiencies on our end.

Still, cutting the time for ~12,000 items from over an hour down to 37 minutes, if I’m reading your data correctly, is nothing to sneeze at.

These are all still “update” imports, right? Here’s the thing (something I didn’t know, or didn’t remember, until I looked back into this): the batch size input doesn’t have any actual effect for an update, because the code forces the batch size to 1 for those imports. I know why we have that in place for “mixed resource” imports: it’s very common there for a CSV row to reference an item from a previous row, and keeping the batch size at 1 guarantees that this works (the previous item is fully added before we move on to the next row). I’m less convinced that we need to be doing this for plain updates, though.
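To make that concrete, here’s a minimal sketch of the behaviour (illustrative only, not the actual CSVImport code; the function and variable names are made up for this example):

```php
<?php
// Illustrative sketch only, not the module's real code. For update-type
// actions the configured batch size is ignored, so each row is written
// out before the next row is processed.
function effectiveBatchSize(string $action, int $requestedBatchSize): int
{
    $updateActions = ['append', 'revise', 'update', 'replace'];
    return in_array($action, $updateActions, true)
        ? 1                     // forced: flush after every row
        : $requestedBatchSize;  // "create" imports honour the configured size
}

echo effectiveBatchSize('append', 20); // prints 1 regardless of the setting
```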

What this does mean, though, is that there’s potentially something to Daniel’s previous comment, at least in your situation of doing updates: the system is doing a lot more queries to look up which item goes with which identifier than is really necessary, and that’s the kind of thing that slows down as the process moves along. I’d still say the speed effect here is smaller than the initial “clear EM” change, but it’s probably also worth seeing just what kind of effect there is if we re-enable “batching” for updates.
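For illustration, this is the kind of difference I mean (a hypothetical sketch in plain PDO with invented table and column names, not the module’s real queries): with a batch size of 1, every row costs its own identifier-lookup SELECT, while batching lets one query cover a whole batch.

```php
<?php
// Hypothetical sketch; the table and column names are invented.
$pdo = new PDO('mysql:host=localhost;dbname=omeka', 'user', 'pass');

// Batch size 1: N rows cost N identifier-lookup SELECTs.
function lookupPerRow(PDO $pdo, array $identifiers): array
{
    $stmt = $pdo->prepare('SELECT id FROM item_identifiers WHERE identifier = ?');
    $ids = [];
    foreach ($identifiers as $identifier) {
        $stmt->execute([$identifier]);
        $ids[$identifier] = $stmt->fetchColumn();
    }
    return $ids;
}

// Batching: N rows cost roughly N / batchSize SELECTs.
function lookupPerBatch(PDO $pdo, array $identifiers): array
{
    $placeholders = implode(',', array_fill(0, count($identifiers), '?'));
    $stmt = $pdo->prepare(
        "SELECT identifier, id FROM item_identifiers WHERE identifier IN ($placeholders)"
    );
    $stmt->execute(array_values($identifiers));
    return $stmt->fetchAll(PDO::FETCH_KEY_PAIR);
}
```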


As for your comment on the typical size of an import, in my anecdotal experience the median import is in the realm of hundreds or thousands of rows.

But we obviously have people, you included, doing much larger imports, and that is, after all, the purpose of the import module.

Absolutely nothing to sneeze at! A 33-50% improvement is pretty awesome 🙂 👏

These are “append” imports - adding media to the existing item.

This is the closest thing I found in my searching. I couldn’t find anything on Stack Overflow; most of the posts there are about debugging a particular query, not about this kind of pattern.

My search terms: “mysql batch import update query slow increase”

https://www.electricmonk.nl/log/2013/11/06/increasing-performance-of-bulk-updates-of-large-tables-in-mysql/

Yeah, it’s more a matter of changes to our own code that we’d be looking at here. We’re never going to be as fast as just executing raw queries, for all sorts of reasons, but we can obviously cut out unnecessarily slow things.

I’ve updated the clear-em-on-updates branch to re-enable batching for append and the other “update” actions, so now your selected batch size should take effect. Any improvement as a result of this change should be entirely down to reducing the number of queries for identifiers (there are other speed enhancements available to batched creation that still don’t apply here).

The batch size also controls how often we “clear” the EM (the speed-increasing operation introduced by the previous update you were running), so increasing the batch size is likely to start hurting performance again at some point. Even the default of 20 would cut out a large number of SELECT queries, though, so you should be able to tell whether that’s a significant factor.
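As a rough sketch of that trade-off (the loop structure and names here are illustrative, not the module’s actual code, but Doctrine’s flush() and clear() are the real calls involved):

```php
<?php
use Doctrine\ORM\EntityManagerInterface;

// Illustrative batching loop. The batch size controls both how often
// pending writes are flushed and how often the entity manager is cleared:
// bigger batches mean fewer identifier-lookup queries, but clear() runs
// less often, so more entities stay tracked in memory between clears.
function processRows(array $rows, int $batchSize, EntityManagerInterface $em): void
{
    foreach (array_chunk($rows, $batchSize) as $batch) {
        foreach ($batch as $row) {
            // ... resolve the target item and apply the append/update ...
        }
        $em->flush(); // write this batch's changes
        $em->clear(); // detach tracked entities before the next batch
    }
}
```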

Just pulling from the branch is enough to update; you don’t need to do the composer step again.

Oh, and I should say thanks for trying out these changes and reporting back in such detail. Big real-world workloads like these don’t come along all that often, especially not in this sort of “experimental” fashion, so this is very helpful.


OK great, I’ll try it out on the next import. Will have to wait until a break in home internet usage so I can upload a few GB of images that will be imported.

OK, one data point: much faster again! Nearly a 50% improvement over the last improvement - from 20 mins for 7k items down to 11 mins. And that’s on top of going from 1 hr (7.8k items, batch size 250 on the regular CSV Import release) down to 33 mins (8.4k items with batch size 1 on the regular CSV Import release).

So here’s the job below, with a batch import size of 20. Note that I also modified the MySQL config per the article linked above, trying an InnoDB buffer pool of 4 GB and a key buffer size of 256 MB. But based on watching top, free, and iostat during the import, it is very clearly processor-bound: PHP was around 70% and mysqld around 30% whenever I looked, and it stayed that way while I was monitoring. Disk and RAM were hardly utilized according to those utilities.
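For reference, the config change amounted to something like this in my.cnf (exact file location varies by distribution):

```ini
# Settings tried per the article linked above
innodb_buffer_pool_size = 4G
key_buffer_size = 256M
```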

June 23 10 manus 7513 items 11 mins NO thumbnails (InnoDB 4GB)

Status: Completed
Started: June 23, 2020 at 3:24:37 PM CDT
Ended: June 23, 2020 at 3:35:37 PM CDT


Final import for the METAscripta project:
- 13,843 items
- Append operation (attach images to items via Sideload)
- No thumbnails
- Batch import size = 50
- 21 minutes!

Started: June 25, 2020 at 8:05:23 AM CDT
Ended: June 25, 2020 at 8:26:25 AM CDT

Well, this is all encouraging. I think we’ll go ahead and make these changes in the main branch as well, then.

Thanks again for your help on this.

