Import of multilingual text

jlayt · October 27, 2023, 4:17pm

Hi,

I’m working on the data migration from our old database into Omeka S, and we have 133,000 strings of multilingual text to import. Any given text field (47 of them in total across all Resource Templates) could have text in any one of 15 supported languages, with some fields also having multiple values in different languages.

From looking at both the CSV Import and Bulk Import modules, the usual way to do this seems to be to have multiple columns for each field, one for each language, so 705 columns (47 fields × 15 languages) spread across all the Items. This seems a little excessive, especially if configuring it manually in CSV Import!

I was hoping there would be an easier way? Ideally, each text would have the language tag embedded in it in some standard way and get parsed and set from there. I might have to look at hacking that together if there’s no easier way.

Can I just put in a vote for a nice simple JSON importer using a standardised format, perhaps based on the JSON format of the Omeka S API? Could make structured data imports a lot easier!

John.

AllanaMayer · October 27, 2023, 5:57pm

You are correct that CSV Import wants one column per language per field. I realize the prospect of separating out your data into 15 spreadsheets and doing 15 imports (using Append) doesn’t sound ideal, but that might be the best way to proceed.

The Batch Edit function may help - select all of the French columns, apply “fr” in the language field, save, repeat. Unfortunately you can’t batch-apply mappings (select all the Title columns, map to Title, save, repeat) at this time.

https://omeka.org/s/docs/user-manual/modules/csvimport/#batch-edit

If your 47 fields currently have language information saved inside them (‘“string”,language’) or beside them in each row, you could do some cleanup & spreadsheet prep with OpenRefine, and split those multivalues.

But I think if we’re going to add functionality to CSV Import specifically (or if you’re going to hack something), the fastest thing may be to modify the automapping of column names to allow for more granularity in the column titles (dcterms:title.fr, dcterms:title.en), so you can use those as a reference in the interface for adding language tags while still getting the convenience of 15 columns automapped to each field. I’m going to make an issue on the module’s GitHub to that effect, and another to look at batch-mapping.
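To make the column-naming idea concrete, here is a minimal sketch of how a header such as dcterms:title.fr could be split into a property term and a language tag. This is not how CSV Import behaves today; the function name and the rule that the language tag follows the last dot are assumptions made for illustration.

```python
from typing import Optional, Tuple


def split_header(header: str) -> Tuple[str, Optional[str]]:
    """Split 'dcterms:title.fr' into ('dcterms:title', 'fr').

    Hypothetical helper: assumes the language tag follows the last dot
    and that property terms themselves contain no dots.
    """
    cleaned = header.strip()
    term, sep, lang = cleaned.rpartition(".")
    if not sep or not term or not lang:
        # No usable suffix: treat the whole header as the term.
        return cleaned, None
    return term, lang
```

A header without a suffix, such as dcterms:title, comes back with no language, so existing spreadsheets would keep mapping as before.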

In the meantime, I would double-check the existing Omeka S importers just in case you can format your data into another system that would import cleanly. I’ll leave the json import question to one of our developers.

jflatnes · October 27, 2023, 6:13pm

On the JSON front, if you do have your data formatted compatibly with Omeka’s JSON-LD API format, you can use the API directly to create items.
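For reference, a minimal sketch of what a multilingual item payload might look like in that format. Each property holds a list of value objects, and a literal carries its language in "@language". The property IDs below are placeholders (they depend on the installation and should be looked up on your own site), and the helper names are made up for this example.

```python
import json
from typing import Dict, List, Optional, Tuple

# Assumption: placeholder IDs. Property IDs vary between installations,
# so look them up on your own site before using them.
PROPERTY_IDS = {"dcterms:title": 1, "dcterms:description": 4}


def literal(term: str, text: str, lang: Optional[str]) -> dict:
    """Build one literal value object, tagging the language if known."""
    value = {
        "type": "literal",
        "property_id": PROPERTY_IDS[term],
        "@value": text,
    }
    if lang:
        value["@language"] = lang
    return value


def build_item(values: Dict[str, List[Tuple[str, Optional[str]]]]) -> dict:
    """Turn {term: [(text, lang), ...]} into an item payload."""
    return {
        term: [literal(term, text, lang) for text, lang in pairs]
        for term, pairs in values.items()
    }


if __name__ == "__main__":
    item = build_item({"dcterms:title": [("Hello", "en"), ("Bonjour", "fr")]})
    print(json.dumps(item, ensure_ascii=False, indent=2))
```

The resulting dictionary would then be sent as the JSON body of a POST to the site's /api/items endpoint, authenticated with an API key. The sketch only builds the payload and makes no network calls.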

jlayt · October 30, 2023, 2:29pm

Thanks both.

I had been thinking of a 2-3 line hack to CSVImport/Mapping/PropertyMapping that looks for a prefix such as “en@” on literals when a default language is set for the column.
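The prefix idea, sketched in Python rather than in the module's PHP, might look roughly like this. The “en@” convention, the function name, and the accepted tag pattern are all assumptions; nothing like this exists in CSV Import.

```python
import re
from typing import Optional, Tuple

# Hypothetical convention: "en@Some text". Accepts simple tags such as
# "en" or "pt-BR"; this is not a full BCP 47 validator.
_PREFIX = re.compile(r"^([A-Za-z]{2,3}(?:-[A-Za-z0-9]{2,8})*)@(.*)$", re.DOTALL)


def split_language_prefix(
    text: str, default_lang: Optional[str]
) -> Tuple[str, Optional[str]]:
    """Return (text, language) for one literal cell.

    The prefix is only honoured when the column has a default language,
    mirroring the condition described above.
    """
    if default_lang is None:
        return text, None
    match = _PREFIX.match(text)
    if match is None:
        return text, default_lang
    return match.group(2), match.group(1)
```

One caveat: a value that happens to start with two or three letters followed by “@” (an email address like joe@example.org, say) would be misread as a language prefix, so a real version would probably check the tag against the list of supported languages.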

I had hoped to just adapt some standard CSV dump scripts we use, but looking now at the work involved on some other parts of the export, it will be better to just go with a fully custom script calling the JSON-LD API. That way I avoid the manual messing with import settings, and can track the new IDs to make the script restartable/re-runnable.
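A minimal sketch of the restartable part, assuming the old-ID to new-ID mapping is kept in a local JSON file. The file layout and function names are made up for illustration, and create_item stands in for whatever function actually POSTs to the API and returns the new ID.

```python
import json
import os
from typing import Callable, Dict, Iterable, Tuple


def load_id_map(path: str) -> Dict[str, int]:
    """Load the old-ID -> new-ID map from a previous run, if any."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def migrate(
    records: Iterable[Tuple[str, dict]],
    create_item: Callable[[dict], int],
    map_path: str,
) -> Dict[str, int]:
    """Create each record once, skipping IDs already in the map."""
    id_map = load_id_map(map_path)
    for old_id, payload in records:
        if old_id in id_map:
            continue  # created in an earlier run
        id_map[old_id] = create_item(payload)
        # Write to a temp file and swap it in, so an interrupted run
        # never leaves a half-written map behind.
        tmp_path = map_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as fh:
            json.dump(id_map, fh)
        os.replace(tmp_path, map_path)
    return id_map
```

Rewriting the whole map after every item keeps the sketch simple but gets slow for large imports; an append-only log or a small SQLite table would scale better.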

Thanks!
