Sorting slow on Dublin Core:Date with 100k items

benbakelaar · June 13, 2018, 12:57pm

Hi, wondering if anyone else with a large number of items in the database has made any optimizations to the SQL database structure, or used some plugin to speed up sorting?

http://edison.rutgers.edu/digital/items/browse?type=1&page=3

Response time / page load is between 1.0-2.0 seconds. Perfectly fine.

http://edison.rutgers.edu/digital/items/browse?type=1&sort_field=Dublin+Core%2CDate&sort_dir=a&page=2

Response time / page load is 10+ seconds.

type=1 > just the documents, this doesn’t slow things down
&sort_field=Dublin+Core%2CDate > sort by this metadata field
&sort_dir=a > whether it’s here or default, doesn’t affect speed
&page=2 > so after every click to page ahead, 10+ seconds to reach next page

benbakelaar · June 13, 2018, 1:09pm

There are 1.9 million rows in omeka_element_texts table, and as far as element_id = 40 (Dublin Core:Date), there are 146,833 rows corresponding with 146,833 items in the database.

I can run this query directly in SQL:

SELECT * FROM omeka_element_texts WHERE element_id=40 ORDER BY text ASC LIMIT 100,100

Query time is 0.6 seconds

SELECT * FROM omeka_element_texts WHERE element_id=40 ORDER BY text ASC LIMIT 1000,1000 > 0.8 seconds

SELECT * FROM omeka_element_texts WHERE element_id=40 ORDER BY text ASC LIMIT 10000,10000 > 1.0 seconds

SELECT * FROM omeka_element_texts WHERE element_id=40 ORDER BY text ASC LIMIT 100000,100000 > 1.5 seconds

Now I am assuming the entire result set does not have to be retrieved since pagination is built in. But even with 100k records, it only takes 1.5 seconds to return the result set. I haven’t dug into the code to find out the exact SQL query that Omeka is running, maybe there are some joins and some per-record looping also.

jflatnes · June 13, 2018, 6:55pm

That’s obviously far too slow to be acceptable.

Definitely there’s both per-record looping and joins happening in reality… in particular the actual query being done is a query for items, joining to the element_texts table. The per-record looping isn’t going to be a significant aspect here since it’s pretty much constant between different queries.

So presumably it’s the joining that’s leading to the poor results here… You can get us to log the actual queries performed by enabling log.sql in application/config/config.ini. My guess would be the particular query actually being used is resulting in MySQL deciding not to use the index… DESCRIBE would show for sure.

benbakelaar · June 14, 2018, 2:22am

Just ran the profileDb option.

Will be testing this, just wanted to post:

Executed 264 queries in 17.82079410553 seconds
Average query length: 0.067503007975492 seconds
Queries per second: 14.814154657568
Longest query length: 10.536134004593

Longest query: SELECT items.* FROM omeka_items AS items LEFT JOIN omeka_element_texts AS et_sort ON et_sort.record_id = items.id AND et_sort.record_type = ‘Item’ AND et_sort.element_id = 40 LEFT JOIN omeka_item_types AS item_types ON items.item_type_id = item_types.id WHERE (item_types.id IN (1)) GROUP BY items.id, items.id ORDER BY IF(ISNULL(et_sort.text), 1, 0) ASC, et_sort.text ASC, items.id ASC LIMIT 50 OFFSET 100

benbakelaar · June 14, 2018, 2:28am

I can run this query variant directly in less than 0.01 seconds. So, it appears to be the ORDER BY et_sort.text which is slowing it down.

SELECT et_sort.text, items.* FROM omeka_items AS items 
  LEFT JOIN omeka_element_texts AS et_sort 
    ON et_sort.record_id = items.id 
    AND et_sort.record_type = 'Item' 
    AND et_sort.element_id = 40 
LIMIT 50

benbakelaar · June 14, 2018, 2:45am

Here’s the EXPLAIN. I’m not familiar with interpreting these. But I did run REPAIR, ANALYZE, OPTIMIZE, etc… and even delete and recreate the ‘text’ index for omeka_element_texts. It’s down to 8.8 seconds from 9.6-9.7.

|1|SIMPLE|items|index||PRIMARY|4||147138|Using temporary; Using filesort|
|1|SIMPLE|et_sort|ref|record_type_record_id,element_id|record_type_record_id|156|const,omeka_418.items.id|6|Using where|

And here’s some manual testing.

SELECT * FROM omeka_element_texts WHERE element_id=40 ORDER BY text LIMIT 50; > 0.5 seconds
SELECT * FROM omeka_element_texts WHERE element_id=40 ORDER BY text LIMIT 500; > 0.58 seconds
SELECT * FROM omeka_element_texts WHERE element_id=40 ORDER BY text LIMIT 5000; > 0.82 seconds
SELECT * FROM omeka_element_texts WHERE element_id=40 ORDER BY text LIMIT 50000; > 1.05 seconds
SELECT * FROM omeka_element_texts WHERE element_id=40 ORDER BY text LIMIT 100000 OFFSET 100000; > 1.354 seconds, 46,883 rows (last of result set)

benbakelaar · June 14, 2018, 2:53am

Hey, progress!

SELECT et_sort.text, items.* FROM omeka_items AS items 
  INNER JOIN omeka_element_texts AS et_sort 
    ON et_sort.record_id = items.id 
    AND et_sort.record_type = 'Item' 
    AND et_sort.element_id = 40 
GROUP BY items.id 
ORDER BY et_sort.text ASC, items.id ASC 
LIMIT 50 OFFSET 100

1.734 seconds

Also I’m sure you’re aware that MySQL/MariaDB will attempt to guess the best join if you just state “JOIN”.

SELECT et_sort.text, items.* FROM omeka_items AS items 
  JOIN omeka_element_texts AS et_sort 
    ON et_sort.record_id = items.id 
    AND et_sort.record_type = 'Item' 
    AND et_sort.element_id = 40 
GROUP BY items.id 
ORDER BY et_sort.text ASC, items.id ASC 
LIMIT 50 OFFSET 100

1.744 seconds

And then LEFT JOIN…

SELECT et_sort.text, items.* FROM omeka_items AS items 
  LEFT JOIN omeka_element_texts AS et_sort 
    ON et_sort.record_id = items.id 
    AND et_sort.record_type = 'Item' 
    AND et_sort.element_id = 40 
GROUP BY items.id 
ORDER BY et_sort.text ASC, items.id ASC 
LIMIT 50 OFFSET 100

9.2 seconds

From a StackOverflow article:

There are different types of joins available in SQL:

INNER JOIN: returns rows when there is a match in both tables.

LEFT JOIN: returns all rows from the left table, even if there are no matches in the right table.

RIGHT JOIN: returns all rows from the right table, even if there are no matches in the left table.

FULL JOIN: It combines the results of both left and right outer joins.

I am assuming there is a reason for using a LEFT JOIN in the generic query constructor which has to handle all kinds of variations between installations, vocabularies, filters, searches, user selections, etc. But just to confirm, you do intend the join results to include “all rows from LEFT even if no matches”?

jflatnes · June 14, 2018, 3:56am

We use a LEFT JOIN here because we want to include results that don’t have a text for the given field at all: the INNER JOIN version will exclude any Item that doesn’t have any texts with element_id equal to 40.

benbakelaar · June 14, 2018, 1:29pm

OK right, got it.

Well, after some googling I tried reversing the query direction and table order so that the sorting was happening on the indexed table first, then joining in the item data.

SELECT et_sort.text, items.* FROM omeka_items AS items 
  LEFT JOIN omeka_element_texts AS et_sort 
    ON et_sort.record_id = items.id 
    AND et_sort.record_type = 'Item' 
    AND et_sort.element_id = 40 
ORDER BY et_sort.text ASC

10.1 seconds

SELECT * FROM omeka_element_texts AS et_sort 
RIGHT JOIN omeka_items b ON b.id = et_sort.record_id
WHERE et_sort.record_type = 'Item' AND et_sort.element_id = 40 
ORDER BY et_sort.text ASC

1.2 seconds

So, aside from writing a custom browse page or plugin, would this arrangement hold true for the broader use cases? On the surface it seems it would, since the “text” field in omeka_element_texts is indexed. If you join the text field to the item table, then it is not indexed in that temporary table and has to be file sorted. But if you join the item table to the element_texts table, you are sorting on a field that is indexed and just adding in the other stuff.

Not super great on JOIN directionality, so I’m not sure if RIGHT JOIN is always the same result when tables are reversed… but in this case, it returns the same row count.

jflatnes · June 14, 2018, 6:01pm

Can you share which version of MySQL you’re using? The specifics of the optimizer and index usage can vary quite a bit from version to version.

benbakelaar · June 15, 2018, 1:53am

Wow that would be crazy if a minor version changed the speed!

Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 284399
Server version: 5.5.56-MariaDB MariaDB Server

Linux version 3.10.0-862.2.3.el7.x86_64 (builder@kbuilder.dev.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-28) (GCC) ) #1 SMP Wed May 9 18:05:47 UTC 2018

luk.puk · June 15, 2018, 8:49am

The SQL rewrite like this doesn’t produce same results like the original query using LEFT JOIN. Omeka may have items with no row in element_texts table for Dublin Core:Date. Those items will be never shown (if that is a problem). And the equivalent rewrite doesn’t make any improvements in speed (in my tests):

SELECT * FROM omeka_element_texts AS et_sort 
RIGHT JOIN omeka_items b
    ON b.id = et_sort.record_id 
    AND et_sort.record_type = 'Item' 
    AND et_sort.element_id = 40 
ORDER BY et_sort.text ASC

I have tried your query on database with almost 10 millions (5x more than yours) rows in element_texts table and my result is about 4.5 secs. If you don’t find any better SQL rewrite, there’s possible index improvement for this kind of sorting:

ALTER TABLE `omeka_element_texts`
	ADD INDEX `sorting` (`record_id`, `element_id`);

If that doesn’t help, you may consider upgrading MySQL server to version 5.7 (that should boost performance for you just by simple upgrade). And if it’s still slow, you will need to do database tuning. Start with basic changes like increasing innodb_buffer_pool_size/innodb_log_file_size and/or max_heap_table_size/tmp_table_size. There’s plenty material online about that topic, and MySQLTuner can give you good overview about current state and give you some recommendations. Just don’t tune too much, run benchmark and see if that helps…

benbakelaar · June 16, 2018, 12:24am

Thank you Luka! The additional index you provided decreased total query execution time by about 2 seconds, to near 7 seconds.

I then proceeded to upgrade from MariaDB 5.5 (on CentOS7) to MariaDB 10.0 (wanted to take it step by step, not jump to latest new version). Now the query is 2.2-2.3 seconds! MUCH BETTER!

And sorry John, I should have clarified it is MariaDB earlier, maybe there are some differences between MySQL 5.5 and MariaDB 5.5?

As far as usability goes, it feels more like 3 seconds when you are browsing the Omeka site, add sorting, and then page through results. So, much much better, 10 seconds was not acceptable - but I still want to get it lower if possible - even if it means writing a hook/filter/plugin.

jflatnes · June 16, 2018, 12:46am

Well, I’m glad to hear you’ve had some improvements.

As for version to version and MySQL vs. MariaDB, there are definitely changes, but it can be a bit of a pain to run down exactly what they are and which versions introduce the relevant improvements. It’s definitely a possible factor as you saw though, while you had a relatively recent version before we support very old MySQL versions as well that can have quite different performance characteristics.

I was thinking about a similar/same index change as @luk.puk mentioned, and that’s definitely an option in terms of an easy win to include in the core.

As for improving things further… the things that really slow us down with sorting are that you have to deal with there being 0, 1, or many values for each item for a given element, and that we want to support arbitrary sorts even though the average site may only use a small handful. Additionally we get some problems because the column you’re sorting on is a huge text field and we can only index the first so many characters, so MySQL can sometimes decide it can’t really use the index.

A plugin could be made that does a few things that would help with sorting:

Guarantee a 1:1 relationship of sorting information to items/records/whatever, thereby allowing an INNER JOIN
Make a copy of the appropriate data in a CHAR or VARCHAR column small enough to actually be completely indexed
Additionally you’d have the option to do things like preprocess the data you’re using to sort by (say, to remove intial articles from titles)

Of course, the downside is that you then have to keep this extra copy of the data up to date (as well as initially create it if you have preexisting items or change the “indexed” fields). It’d basically be something similar to the search_texts table used for fulltext search, but geared toward sorting.

benbakelaar · June 16, 2018, 1:19am

That’s exactly what I was thinking. Obviously I’m influenced by the nature of the Edison Papers, but this is not a dynamic collection. Yes, it will grow. But we are starting with 140k documents which will hardly change over time. So it’s better in this case to have a compiled index like search_texts that can be refreshed on demand.

Although the Edison scenario may not be typical, you can imagine that at some point most “collections” that are running on Omeka will stop growing, and if they are larger than X documents it would be better to use a compiled index.

I have the SQL abilities to create this, and I have PHP abilities - I just don’t fully understand the inner workings of Omeka core + plugins yet to easily implement. But I’m learning.

luk.puk · June 17, 2018, 7:12pm

This point leads to another handy feature, which is - support for numeric sorting. This in combination with query speed improvement makes me also interested in such a plugin. I will see if I can make some initial code fast and share it here @benbakelaar

benbakelaar · June 18, 2018, 5:09pm

Any idea how I could do a manual test, where I construct a numerical index in a separate table, and then replace the browse page (or list item function/loop)? My main issue is translating a SQL query into ZF3 OO code. It’s very confusing when you don’t know the syntax.

jflatnes · June 18, 2018, 7:28pm

A couple things:

You’d be looking at Zend Framework 1 if you’re working with Omeka Classic. This is a pretty major distinction as ZF 2 and 3 are very different from 1 in many ways (plus, S, where ZF2/3 is used, actually uses Doctrine for DB stuff)
You probably should look into using the items_browse_sql hook if you want to try something out: rather than having to replace the browse page or anything like that, you can just get direct access to our “Select” object to add clauses. This is the hook plugins generally use to alter a query to provide their own sorting or filtering options.
The most relevant documentation in terms of figuring out the code equivalent of the query you want to write is the Zend_Db_Select page of the old ZF manual. It’s a pretty straightforward “query builder” kind of class, you mostly call methods for adding clauses, joins, etc. With of course some changes for your specifics, you’d probably be looking at doing something like what the core Item table’s sorting method does: adding a join and an an ORDER BY referencing that joined table.

benbakelaar · June 18, 2018, 9:47pm

AWEsome thanks John for the info. I have been working in both Omeka and Omeka-S, so am getting things mixed up I imagine the ZF1 db query builder is much simpler.