Performance with large datasets

I have an Omeka S (v4.1.1) installation with 1.8M items and 614K media and growing.
On certain places within admin I am really experience degraded performance. For example, the /admin/item page takes 8-9 seconds to load. On the front-end the performance is still ok, also because of a load-balanced MariaDB (v11) cluster on serious hardware and of course Solr (on yet another server). Still, I hope optimizations can be done to improve the performance of the back-end.

To this end I am looking for expensive queries which show up in the mariadb-slow.log, like the query for the /admin/item page (pretty formatted by me):

SELECT 
	r0_.id AS id_0, r0_.title AS title_1, r0_.is_public AS is_public_2, r0_.created AS created_3, r0_.modified AS modified_4, 
    r0_.resource_type AS resource_type_5, r0_.owner_id AS owner_id_6, r0_.resource_class_id AS resource_class_id_7, 
    r0_.resource_template_id AS resource_template_id_8, r0_.thumbnail_id AS thumbnail_id_9, i1_.primary_media_id AS primary_media_id_10 
FROM 
	item i1_ 
INNER JOIN 
	resource r0_ ON i1_.id = r0_.id 
GROUP BY 
	r0_.id 
ORDER BY 
	r0_.created DESC, r0_.id DESC 
LIMIT 20 OFFSET 80;

which takes over 7-8 seconds. The EXPLAIN of this query looks like this:

An improvement could be to add an index:

CREATE INDEX idx_resource_created_id ON resource (created DESC, id DESC);

Note: this index only covers this option out of all sort options. So you could argue this is not a scalable solution as you have to create a lot of indexes for all possible sorts (both ascending and descending). Bu as the standard sort on the /admin/item is “descending by created”, in practice this would be very beneficial.

With this index in use (*) the query responds in 0.0017 seconds! With an EXPLAIN of:

*) Strangely, after adding the index, the query optimizer didn’t pick the index, I had to “strongly suggest” to use the index. I also had to remove the GROUP BY (which isn’t necessary anyway because item:resource is always 1:1, no?):

SELECT 
	r0_.id AS id_0, r0_.title AS title_1, r0_.is_public AS is_public_2, r0_.created AS created_3, r0_.modified AS modified_4, 
    r0_.resource_type AS resource_type_5, r0_.owner_id AS owner_id_6, r0_.resource_class_id AS resource_class_id_7, 
    r0_.resource_template_id AS resource_template_id_8, r0_.thumbnail_id AS thumbnail_id_9, i1_.primary_media_id AS primary_media_id_10 
FROM 
	item i1_ 
INNER JOIN 
	resource r0_ USE INDEX (idx_resource_created_id) ON i1_.id = r0_.id 
#	resource r0_ ON i1_.id = r0_.id 
#GROUP BY 
#	r0_.id 
ORDER BY 
	r0_.created DESC, r0_.id DESC 
LIMIT 20 OFFSET 80;

I hope this kind of insight helps in optimizing the ORM PHP to increase performance. I’d appreciate advice on how to give insights or to directly improve performance.

2 Likes

The group by is there not so much for this standard case but for use when other sorting/filtering involves additional joins. We might be able to explore only using it when it’s necessary… but doing that in a way that’s not going to break module compatibility and so on might be challenging. Though we might be able to do something like checking/counting for joins in the query builder before adding the group by.

1 Like

I don’t have any insight to add to this, but we’ve definitely been experiencing this kind of performance lag when browsing the media on admin where we have 230k media.

@coret, you mentioned having to remove the GROUP BY clause for your indexing to have an effect.

The reason that’s happening is that indexes don’t get used, at least not in an efficient way, when there’s a GROUP BY that differs from the ORDER BY. This suggests another way around on having an efficient listing: we have the default order as “created,” but it could instead be “id,” and thereby have a matching order and group without modifying anything. ID also already has the necessary index, the primary key.

Without modifying the core at all, you can choose for ID to be the default sort field using user settings by entering id as the “custom sort by.” There is no difference in most cases between the two sorting options. “Created” becomes different only in some minor corner cases, like modification of the database.

Thanks, setting the sort defaults to ID in user settings really improves the performance x 20. Loading the Items page (in my 1.8M items instance) in admin now takes 400ms, yeah!

Maybe make these the standard sort defaults for an out-of-the box Omeka S installation and new users.