How to prevent or remove duplicate linked items?

Hey all, I’m loving Omeka so far!

I’ve been using CSV Import to add Items, which is working great, except that appending doesn’t seem to detect when an Item is already linked from a property.

Is there an easy way to either (a) prevent CSV Import from creating a duplicate linked resource in a property, or (b) easily de-duplicate across the whole collection now that I have several thousand records with duplicates?

I know there will be lots of CSV imports in my future.


Hi @walldigexh ,

When you ran the CSV Importer, which option did you select for the Action?


Hi @fackrellj,

Thanks for the reply! I used “Append”.

I did try “Revise”, “Update”, and “Replace”, and it looks like they all replace the entire field with the new values, whereas “Append” adds new linked resources to the ones already in a given field, which is what I want.

In my example, I uploaded a .csv file to load the initial “Brian Bjornsen” and “Marlboro Man” records shown in my screenshot. Then I uploaded a second .csv file that also contained “Brian Bjornsen” and “Marlboro Man” in that field for that particular item. I used “Append”, which added each of those two items a second time, which isn’t what I wanted. If a resource is already linked, I don’t want it added again.

I’m pretty sure I’ve made a silly mistake somewhere, but now I’m just lost. I checked whether the links point to the same item IDs, and they do.

And for part (b): now that a ton of my records have the same linked resource listed twice in the same field, how can I deduplicate them so each linked resource appears only once per field?

I’m hoping I missed something totally obvious! :sweat_smile:

Hi @walldigexh ,

Perhaps I’m not understanding what you’re trying to do with the 2 uploads.

It seems like the easiest solution would be to combine the 2 files, de-duplicate them before uploading, and use one of the “Revise,” “Update,” or “Replace” options to update all of the fields, which will remove the duplicated linked resources.

Here’s a bit on my workflow: I’m a rural historian working with folks of pretty advanced age to collect info on photographs and other works I’ve scanned and cataloged. I’ve given the participants spreadsheets because that’s what scared them the least. Some of these folks are in their mid-nineties and I love them, but it was like I broke some of them when I showed them Omeka’s interface. Giving them a user account just ain’t happening.

Instead, I give my informants a spreadsheet with the item identifiers, along with a thumb drive of files with identifier filenames, and Granny and Gramps fill in who’s in the photo, where it was made, any stories they remember, and so on. So far, it’s working great.

When I get their CSV file, I do a very basic cleanup before importing it into Omeka. I guess I could export the database and run my own check script, but I thought maybe I was missing a simple checkbox setting or something to disallow duplicates on import.

Bottom line: I don’t want “Append” to add a resource that’s already linked.


“Append” does what it says, though, and appends metadata values. If what you want is to ignore values that already exist (in any metadata field; the CSV Import options are per-upload, not per-field) and only add metadata that is new, it seems to me that you want “Revise”, which will replace values where new ones are supplied and will not delete other metadata.

If what you want is to do different things with different metadata fields (Append in some cases, Revise/Update/Replace etc. in others), you’re going to need to do more than one upload. The CSV Import module settings are not fine-grained enough to append some field’s new metadata and ignore other fields.

I tried to use Revise again, but it replaces the entire field. I want to append new linked resources to the ones already in a field, not replace the whole field, but I definitely don’t want the same resource linked twice.

Since preventing the duplicates doesn’t seem to be possible, I just wrote a de-duplication script that acts directly on the database and removes the duplicate linked resources from every item’s fields.

There is actually deduplication logic built into how “Append” works already, but it looks like there’s a problem with it. The code in place right now incorrectly allows one duplicate of every value that is a linked resource: the detection doesn’t properly match an existing linked resource against the one coming from the import. Subsequent imports will deduplicate those values but then add a new duplicate, so you end up with a steady state of two copies of a linked resource if you repeatedly “Append” the same data.

I’ve just made a change to the way the deduplication works in CSV Import that should fix this problem. I’m also taking this opportunity to remove its historical behavior of “deduplicating” existing values of updated items; with this new setup, CSV Import will not create new duplicates when importing with Append, but it will leave any existing duplicates in place.

We’ll of course need to put it through some testing before a release.
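In the meantime, if you want to get a sense of how many leftover duplicates you have, something along these lines should surface them. Treat it as a rough sketch; it assumes the standard value table, where linked resources are the rows with a value_resource_id:

-- list every linked resource that appears more than once in the same property of the same item
SELECT resource_id, property_id, value_resource_id, COUNT(*) AS copies
FROM value
WHERE value_resource_id IS NOT NULL
GROUP BY resource_id, property_id, value_resource_id
HAVING COUNT(*) > 1;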


Oh! This is great news to see.

Huge thanks for looking into this, @jflatnes, I was tearing my hair out about what I was doing wrong.

I’m using this SQL statement directly on the database to deduplicate linked resources after my imports for now. I’m making liberal use of mysqldump just to be safe, but it seems to be working. What do you think?

-- Deletes the extra copies of a linked resource within the same property of the same item,
-- keeping only the row with the lowest id. Rows without a value_resource_id (plain text
-- values) never match the join, so they are left untouched.
DELETE v1 FROM value v1
JOIN value v2
  ON v1.resource_id = v2.resource_id
 AND v1.property_id = v2.property_id
 AND v1.value_resource_id = v2.value_resource_id
 AND v1.id > v2.id;
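For anyone else who wants to try it, it’s probably worth previewing what the DELETE will touch first. This is just a sketch built on the same join and the same assumed value table layout; each row it returns is one the DELETE above would remove:

SELECT v1.id, v1.resource_id, v1.property_id, v1.value_resource_id
FROM value v1
JOIN value v2
  ON v1.resource_id = v2.resource_id
 AND v1.property_id = v2.property_id
 AND v1.value_resource_id = v2.value_resource_id
 AND v1.id > v2.id;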
