Using Mroonga to enable full-text search of CJK text

kentaro · January 28, 2020, 6:54am

Hi,

I’m testing Omeka S’s full-text search feature. While the default full-text search support of MySQL/MariaDB’s InnoDB works very well for English, it cannot index Japanese text (the same probably applies to Chinese and Korean).

Instead of using the Solr module, I tested Mroonga, a plugin for MySQL/MariaDB that provides better CJK full-text search, and it worked satisfactorily as far as I tested (only with test data).

Using Mroonga is quite simple: just ALTER TABLE the ‘fulltext_search’ table as follows:

ALTER TABLE fulltext_search DROP FOREIGN KEY FK_AA31FE4A7E3C61F9;
ALTER TABLE fulltext_search ENGINE = Mroonga COMMENT 'engine "InnoDB"';
ALTER TABLE fulltext_search ADD CONSTRAINT FK_AA31FE4A7E3C61F9 FOREIGN KEY (owner_id) REFERENCES `user` (`id`) ON DELETE SET NULL;
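
To confirm that the engine change took effect, something like the following could be run afterwards. This is only a sketch: it assumes the Mroonga plugin is already installed on the server and that the Omeka S database is the currently selected one.

```sql
-- Mroonga should be listed here if the plugin is installed and enabled.
SHOW ENGINES;

-- Check which engine and comment the table ended up with.
-- In wrapper mode the comment is expected to carry the wrapped engine name.
SELECT ENGINE, TABLE_COMMENT
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'fulltext_search';
```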

So now I plan to develop an Omeka S module that uses Mroonga, but I’m facing a problem.

How can the module get the name of the FOREIGN KEY of the fulltext_search table?
‘FK_AA31FE4A7E3C61F9’ looks very strange to me. I found the name in application/data/migrations/20190515055359_AddOwnerAndIsPublic.php, but I cannot understand where it comes from.

jflatnes · January 28, 2020, 9:40pm

The names of the foreign keys get generated by Doctrine… though they’re odd-looking, they are consistent.
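
If hard-coding the generated name feels fragile, one alternative is to look it up from the database itself. A minimal sketch, assuming the Omeka S database is the currently selected one and that the only foreign key on owner_id points at the user table:

```sql
-- Find the name of the foreign key on fulltext_search.owner_id.
SELECT CONSTRAINT_NAME
FROM information_schema.KEY_COLUMN_USAGE
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'fulltext_search'
  AND COLUMN_NAME = 'owner_id'
  AND REFERENCED_TABLE_NAME = 'user';
```

The same query works on both MySQL and MariaDB, so a module could run it before issuing the DROP FOREIGN KEY statement.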

jflatnes · January 28, 2020, 9:45pm

Have you looked at all at altering the minimum full-text word length for your server? The default is 3, with an official recommendation to set it to 1 for CJK text. It’s a server setting that we can’t mess with from the Omeka side, but it’s an option for particular installations to look at, and it does have the advantage of not needing to alter the actual table definition.
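
For anyone wanting to try this, a rough sketch of the steps on an InnoDB table follows. The variable cannot be changed at runtime, so it has to be set in the server configuration, and existing full-text indexes need to be rebuilt afterwards. The index name below is an assumption taken from elsewhere in this thread; check SHOW CREATE TABLE fulltext_search for the actual name on your installation.

```sql
-- Check the current minimum token size (InnoDB default is 3).
SHOW VARIABLES LIKE 'innodb_ft_min_token_size';

-- After adding the following to the [mysqld] section of the server
-- configuration and restarting the server:
--   innodb_ft_min_token_size=1
-- rebuild the full-text index so that the new setting is applied.
-- (Index name is assumed; verify it before running.)
ALTER TABLE fulltext_search DROP INDEX IDX_AA31FE4A2B36786B3B8BA7C7;
ALTER TABLE fulltext_search ADD FULLTEXT INDEX IDX_AA31FE4A2B36786B3B8BA7C7 (title, text);
```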

kentaro · January 29, 2020, 3:13am

Thank you for your reply.
FK name: OK, I’ll retrieve the name via Doctrine.
minimum full-text word length:
I’ll test it later, but I guess it won’t work because I’m using MariaDB on my server, which does not support N-gram indexing yet. Plus, even MySQL requires some additional settings to use N-gram indexing. See the following links:

kentaro · January 29, 2020, 6:57am

OK, I tested the latest Omeka S (4.1.0) with MySQL 5.7.29 and confirmed that N-gram based full-text search worked nicely. To enable N-gram indexing, I had to modify the table schema as follows:

mysql> ALTER TABLE fulltext_search add FULLTEXT KEY IDX_AA31FE4A2B36786B3B8BA7C7 (title, text) WITH PARSER NGRAM;
(The steps for dropping and re-adding the foreign key are omitted here.)

In this case, there’s no need to set innodb_ft_min_token_size to 1.
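
A quick way to check that the N-gram index is actually being used is to search for a short CJK term directly. This is only a sketch: the search term is an arbitrary example, and it assumes the index was created on (title, text) as above.

```sql
-- The ngram parser uses its own token size setting (default 2, i.e. bigrams),
-- which is why innodb_ft_min_token_size does not matter here.
SHOW VARIABLES LIKE 'ngram_token_size';

-- Search for a Japanese term; the column list must match the indexed columns.
SELECT title
FROM fulltext_search
WHERE MATCH (title, text) AGAINST ('図書館' IN BOOLEAN MODE)
LIMIT 10;
```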

jflatnes · January 29, 2020, 4:04pm

Interesting, thanks for reporting your results. I wasn’t familiar with the ngram parser option.

Especially with the MariaDB issues, I’m not sure it’s something we can provide within Omeka S itself, but it’s good information for people to have anyway, as a change they could make to improve search for their own CJK datasets.

kentaro · February 8, 2020, 7:59am

I have made two modules for Omeka S that enable CJK-ready full-text search. They are still experimental, but they work well with my dataset.

I want to integrate them into a single module, but that’s just a plan at the moment.