How to prevent or remove duplicate linked items?

Hey all, I’m loving Omeka so far!

I’ve been using CSV Import to add Items, which is working great, except that appending doesn’t seem to detect when an Item is already linked from a property.

Is there an easy way to either (a) prevent CSV Import from creating a duplicate linked resource in a property, or (b) easily de-duplicate across the whole collection now that I have several thousand records with duplicates?

I know there will be lots of CSV imports in my future.


Hi @walldigexh ,

When you ran the CSV Importer, which option did you select for the Action?


Hi @fackrellj,

Thanks for the reply! I used “Append”.

I did try “Revise”, “Update”, and “Replace”, and it looks like they all replace the entire field with the new values, whereas “Append” adds new linked resources to the ones already in a given field, which is what I want.

In my example, I uploaded a .csv file to load the initial “Brian Bjornsen” and “Marlboro Man” records shown in my screenshot. Then I uploaded a second .csv file that also contained “Brian Bjornsen” and “Marlboro Man” in that field for that particular item. I used “Append”, which added each of those two items a second time, which isn’t what I wanted. If a resource is already linked, I don’t want it added again.

I’m pretty sure I’ve made a silly mistake somewhere, but now I’m just lost. I checked whether the links point to the same item IDs, and they do.

And for part (b): now that a ton of my records have the same linked resource listed twice in the same field, how can I deduplicate them so each linked resource appears only once per field?

I’m hoping I missed something totally obvious! :sweat_smile:

Hi @walldigexh ,

Perhaps I’m not understanding what you’re trying to do with the 2 uploads.

It seems like the easiest solution would be to combine the 2 files, de-duplicate them before uploading, and use one of the “Revise,” “Update,” or “Replace” options to update all of the fields, which will remove the duplicated linked resources.

Here’s a bit on my workflow: I’m a rural historian working with folks of pretty advanced age to collect info on photographs and other works I’ve scanned and cataloged. I’ve given the participants spreadsheets because that’s what scared them the least. Some of these folks are in their mid-nineties and I love them, but it was like I broke some of them when I showed them Omeka’s interface. Giving them a user account just ain’t happening.

Instead, I give my informants a spreadsheet with the item identifiers, along with a thumb drive of files with identifier filenames, and Granny and Gramps fill in who’s in the photo, where it was made, any stories they remember, and so on. So far, it’s working great.

When I get their CSV file, I do a very basic cleanup before importing it into Omeka. I guess I could export the database and run my own check script, but I thought maybe I was missing a simple checkbox setting or something to disallow duplicates on import.

Bottom line: I don’t want “Append” to add a resource that’s already linked.


“Append” does what it says, though, and appends metadata values. If what you want is to ignore values that already exist (in any metadata field; the CSV Import options are per-upload, not per-field) and only add metadata that is new, it seems to me that you want “Revise”, which will replace values where new ones are supplied and will not delete other metadata.

If what you want is to do different things with different metadata fields (Append in some cases, Revise/Update/Replace etc. in others), you’re going to need to do more than one upload. The CSV Import module settings are not fine-grained enough to append some field’s new metadata and ignore other fields.

I tried to use Revise again, but it replaces the entire field. I want to append new linked resources to the ones already in a field, not replace the whole field, but I definitely don’t want the same resource linked twice.

Since preventing the duplicates doesn’t seem to be possible, I just wrote a de-duplication script that acts directly on the database and removes the duplicate linked resources from every item’s fields.

There is actually deduplication logic built into how “Append” works already, but it looks like there’s a problem with it. The code in place right now incorrectly allows one duplicate of every value that is a linked resource: the detection doesn’t properly match an existing linked resource against the one coming from the import. Subsequent imports will deduplicate those values but then add a new duplicate, so you end up with a steady state of two copies of a linked resource if you repeatedly “Append” the same data.

I’ve just made a change to the way the deduplication works in CSV Import that should fix this problem. I’m also taking this opportunity to remove its historical behavior of “deduplicating” existing values of updated items; with this new setup, CSV Import will not create new duplicates when importing with Append, but it will leave any existing duplicates in place.

We’ll of course need to put it through some testing before a release.
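In the meantime, if you want to get a sense of how many leftover duplicates you have, something along these lines should surface them. Treat it as a rough sketch; it assumes the standard value table, where linked resources are the rows with a value_resource_id:

-- list every linked resource that appears more than once in the same property of the same item
SELECT resource_id, property_id, value_resource_id, COUNT(*) AS copies
FROM value
WHERE value_resource_id IS NOT NULL
GROUP BY resource_id, property_id, value_resource_id
HAVING COUNT(*) > 1;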


Oh! This is great news to see.

Huge thanks for looking into this, @jflatnes, I was tearing my hair out about what I was doing wrong.

I’m using this SQL statement directly on the database to deduplicate linked resources after my imports for now. I’m making liberal use of mysqldump just to be safe, but it seems to be working. What do you think?

-- Deletes the extra copies of a linked resource within the same property of the same item,
-- keeping only the row with the lowest id. Rows without a value_resource_id (plain text
-- values) never match the join, so they are left untouched.
DELETE v1 FROM value v1
JOIN value v2
  ON v1.resource_id = v2.resource_id
 AND v1.property_id = v2.property_id
 AND v1.value_resource_id = v2.value_resource_id
 AND v1.id > v2.id;
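For anyone else who wants to try it, it’s probably worth previewing what the DELETE will touch first. This is just a sketch built on the same join and the same assumed value table layout; each row it returns is one the DELETE above would remove:

SELECT v1.id, v1.resource_id, v1.property_id, v1.value_resource_id
FROM value v1
JOIN value v2
  ON v1.resource_id = v2.resource_id
 AND v1.property_id = v2.property_id
 AND v1.value_resource_id = v2.value_resource_id
 AND v1.id > v2.id;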
