Items showing up on wrong sites through Google search

Hi all,

We had hoped that this issue would be solved by the new item setups from Omeka S 3.0 but unfortunately that is not the case. Our sites are indexed by Google and quite often when we search for an item it will show up but be displayed on a site that is in no way connected to it. The most recent example of this is when we search for “crossing ice bridge at Niagara falls brock” on Google, the first two results are sites that are in no way connected to that item. The third result is the correct site.

Any insight into why this is happening and how we can prevent it would be appreciated.

Thanks,

Daniel

I’m wondering if maybe the items were visible on the “wrong” sites (as in, in the browse page) at one point? Google will keep crawling URLs it already knows about even after the links to them have disappeared, so that’s a possible explanation here.

Even in the current version, we don’t strictly forbid viewing “outside” items through the lens of a site. This accounts for cases such as an item within a site that links to an item not in the site. That said, turning cases like yours into a 404 error is a possible enhancement, made easier by the 3.0 changes.

Have you tried using Google’s removal tool on these problematic results?

Hi John,

I removed the URL of the first result (Reading the Middle Ages site) using the removal tool on the Google Search Console but weirdly all it did was make it so that the video no longer shows up. Item page with metadata is still showing in Google results. Comparison of blocked page versus non blocked page can be seen between the first and second result.

Does this provide any insight into a solution or is it just more confusing?

Thanks,

Daniel

And now it has disappeared correctly. Maybe the indexing took a while to clear the site? I will try remapping all of our sites and see if that resolves the issue.

Thanks,

Daniel

The same issue has been reported to me multiple times for our website, and I had it on my “to do” list. Thank you @Dbrett for bringing this up.

After doing some manual testing, it would appear that this is very common (as in, “most of the time”) in our case. If I search for the title of an item, I get sent to a random site of our Omeka installation as the top result (and the “right” site is not even listed). As some of our sites are very specific, finding completely unrelated contents on them looks bizarre from the outside.

I feel this is not a minor issue, as this is often the main way people discover our archive.

The “proper” Google way to deal with this would be to use so-called “canonical URLs” as detailed by Google, where they list the pros and cons of various technical implementations. A canonical URL states which is the preferred version of a page when the same content can be found at different URLs. In Omeka S, this would probably imply having a “main” website for each item, to be set either systematically when items are added to a site or as an individual property.

I think this would be the most desirable solution, as it would imply no visible change for the user (so the item pages would still exist for the cases you mention), but would let search engines do their work and send people to the right page.

Hi thanks for the response!

@giocomai Would you be able to clarify the process you are describing above? Would it require the manual adjustment of every possible combination of site and item? If so, that would not be a viable solution in our case, as we have more than 3000 items across 60+ sites.

@jflatnes I had previously thought that the result was removed from the Google search results by the removal tool when it was no longer the first result, but I was wrong. Running the removal tool from the Google Search Console on the page just moved it down the search results and (this is the strangest part) the item page is now broken. So it’s actually even worse now, because we still have an incorrect page, but now it’s a broken incorrect page.

This is all a bit too high level for me honestly, so any further insights would be appreciated.

Thanks,

Daniel

Hi @Dbrett.

So, here’s my thinking about this so far.

Option 1. Everything remains as it is in the database and in Omeka, but we add a nice php line in the theme that outputs something like this:
<link rel="canonical" href="https://example.com/s/website_where_item_is_actually_included/item/1234" />

Since we currently have no better information, website_where_item_is_actually_included would probably be the site with the lowest id, if the item is included in more than one site, or probably the first public website, if the item is not explicitly included in any website.

Option 2. A slightly better version would be that this line would be preceded by an if statement such as “if this item is not part of the current website, then output this nice canonical url”. This would probably be better because if an item is included in more than one site, then it would be up to Google to decide which is the most important, and long-term we’d expect Google to get this right (and anyway, it couldn’t get it really wrong, as all sites where the item is not included would still have a link to the canonical url). Also, this would ensure that the change would not annoy people who, for example, have all items added to an old ugly site with a low id number, and they’d much prefer Google to send them anywhere else… with the if statement, Google decides, but only among sites where the item was added.

Option 3. Omeka would default to something along the lines of Option 2, but would allow the admin to explicitly say which site is the main site for an item if they so wish. This could probably happen either when items are added to a site, in some other bulk form (e.g. all items that are in x item set will have site y as main site), or manually. If the admin says nothing explicitly, things would happen along the lines of Option 2, so only item pages on sites where the item has not been added will tell search engines the preferred address for that given item. This feature could conceivably be introduced either in core Omeka S or as a module.

Option 4 is actually a workaround. If there is really no way to write a nice php line such as the one I propose above, or if that is undesirable for some other reason I’m not currently considering, it would be possible to achieve something similar by querying Omeka’s REST API and systematically creating a sitemap.xml file that implements the same as above. As the page I linked above suggests, sitemap.xml files are less effective than the other solutions, but it should probably still work. This could also be more of a pain to maintain, as it may be implemented outside of Omeka, so you would probably have a script that runs once a day or so and regenerates an updated sitemap.xml (or could have the same done by a module inside Omeka… even if this seems more complicated than Options 1 and 2).
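
A minimal sketch of the decision described in Options 1 and 2, written as plain PHP with no Omeka calls so it can be tried in isolation. The function names, the $baseUrl parameter, and the idea of passing sites as "site id => slug" arrays are assumptions made for illustration; a theme or module would have to supply those values from the current item and site.

```php
<?php
/**
 * Pick the slug of the site whose item page should be treated as canonical.
 *
 * @param array<int,string> $itemSites   site id => slug, for sites the item is added to
 * @param array<int,string> $publicSites site id => slug, for all public sites (fallback)
 * @return string|null slug of the preferred site, or null if there is no candidate
 */
function canonicalSiteSlug(array $itemSites, array $publicSites): ?string
{
    // Option 1: prefer sites the item belongs to, otherwise any public site.
    $candidates = $itemSites ?: $publicSites;
    if (!$candidates) {
        return null;
    }
    ksort($candidates); // lowest site id first
    return reset($candidates);
}

/**
 * Option 2: only emit a canonical link when the current site does not include the item.
 * Returns an empty string when no tag should be printed.
 */
function canonicalLinkTag(string $baseUrl, int $currentSiteId, int $itemId, array $itemSites, array $publicSites): string
{
    if (isset($itemSites[$currentSiteId])) {
        return ''; // the item belongs here, so leave the choice to the search engine
    }
    $slug = canonicalSiteSlug($itemSites, $publicSites);
    if ($slug === null) {
        return '';
    }
    $href = rtrim($baseUrl, '/') . '/s/' . rawurlencode($slug) . '/item/' . $itemId;
    return '<link rel="canonical" href="' . htmlspecialchars($href, ENT_QUOTES) . '" />';
}
```

Option 3 would amount to checking an admin-chosen "main site" before falling back to canonicalSiteSlug(); where that choice would be stored is left open here.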

All of the above would still be better than manually adding links to the removal tool, which is nice if there’s just the odd link out of place, but soon becomes a huge pain if you need to do this for each new item for each of the dozen websites where that item has not been added.

I’d be curious to hear how common this experience is with Omeka S 3, but I would find it surprising if this was bugging only @Dbrett and me.

Considering that Google is really the main source of traffic for our Omeka S sites, as I suppose is the case for many others, I feel it’s not a minor nuisance.

I am not very familiar with either php or the internals of Omeka, so any help in writing that nice php line would certainly be very welcome.

Long term, if not Option 3, I feel at least something like Option 2 should be part of Omeka’s core (as a helper function) or added to default themes. Ultimately, the nice thing about canonical urls is that they do not bring any user-visible change, they would represent a huge improvement for many, and wouldn’t make a difference for those who didn’t have a problem in the first place.

Looking forward to hearing thoughts about this! Thanks a lot as usual!

My thinking is that an option to make Omeka S 404 for items not in the site is probably the most straightforward fix. Cross-cutting linkages could either be filtered before display (though this could get complicated) or simply documented as a problem area best avoided.

The “canonical” link tag could also help, but will be problematic in cases where items really are used in multiple sites legitimately, and Google also doesn’t necessarily have to respect the canonical rel when it sees it.

It should be possible to try out one or both of these approaches with a module.

What I’m curious about is what leads to the situation in the first place: Google only indexes what it can crawl to, so to me the obvious answer is that these sites were published at some point with everything assigned to them, and Google crawled those “wrong” item pages at that point. Does that track with either of your experiences? If there’s something else going on then it would affect what kinds of changes would have to be made.

Thanks for sharing your thoughts. I do see the issue with the canonical approach, which basically implies that there’s just one original version of a page and all others are copies, in a context in which items may legitimately be included in more than one site.

With 404, those item pages would not exist in the first place, hence preventing trouble.

In our case, I’m quite sure those items were never added to those other sites, i.e. they were never visible via the browse (/item) page on unrelated sites.

How Google knows about them in the first place is a good question, surely. (I’d have the same question about some media pages that turn up in Google results, but in that case we probably did have a previous theme linking to media pages).

I will try to troubleshoot further and perhaps give a try to either the php canonical or the sitemap solution (with sitemap, you could probably include all valid item pages, and then let google do its magic).

Hi guys,

To @jflatnes’s question, I have no way to tell if at some point one of our users accidentally assigned an item to the wrong site and then changed it later. I am however led to believe that this is not the case, due to the volume of items found to show up on incorrect sites when searched in Google. My situation is one where I support Omeka S for our University, but researchers making exhibits really have free rein of the platform while they are working on their exhibits.

I am pretty new to supporting platforms like this, so my knowledge of PHP and Google’s crawlers is limited to basic Google-Fu results, but I’m happy to help in any way I can to aid in the development of a solution.

Thanks again for all of the help!

Daniel

Hi @jflatnes, after some more troubleshooting I figured out what leads to the crawling in the first place. In brief, the answer is… using cross references between Omeka items.

We use quite a lot of cross references: for fields such as dcterms:provenance (home institution or owner of the original item); for authors of pictures (each author is an Omeka item of class “person”, which lets us include some basic info about them, as well as a full list of their contents included in the archive); for locations (the vast majority of items in our collections are expected to be part of one of a few dozen administrative units, and each of these administrative units is an Omeka item; again, we have a brief description of that territory and a full list of all items that relate to it); for “is part of” in different contexts; etc.

This is all very nice and allows to explore the archive smoothly, but…

On the item page of, say, one of these locations or authors, there is a full list of all items related in some way to it, no matter if they are added to the current site or not. As we use either locations, authors, or “provenance” in most sites, in the end everything is available for crawlers on all sites… through lists included at the bottom of item pages.

I just ran a crawler on my own Omeka S installation, and I see lots of unrelated items on most sites. And so does Google.
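
For anyone who wants to repeat this check, here is a rough sketch of how a list of crawled URLs could be filtered down to the "wrong site" item pages. It assumes item pages follow the /s/{site-slug}/item/{id} pattern and that you already have, from some other source, a map of which sites each item is added to. The function name and the shape of the $membership array are hypothetical.

```php
<?php
/**
 * Return the crawled item URLs whose site does not include the item.
 *
 * @param string[]            $urls       URLs collected by a crawler
 * @param array<int,string[]> $membership item id => slugs of sites the item is added to
 * @return string[]
 */
function findStrayItemUrls(array $urls, array $membership): array
{
    $stray = [];
    foreach ($urls as $url) {
        // Only single-item pages are checked; browse pages and other URLs are skipped.
        if (!preg_match('~/s/([^/]+)/item/(\d+)(?:[/?#]|$)~', $url, $m)) {
            continue;
        }
        [, $slug, $itemId] = $m;
        // Items missing from the map are treated as added to no site at all.
        $sites = $membership[(int) $itemId] ?? [];
        if (!in_array($slug, $sites, true)) {
            $stray[] = $url;
        }
    }
    return $stray;
}
```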

I suspect that this issue ultimately affects anybody using a lot of cross-references among Omeka items.

So I do want all such items used in cross references (location, provenance, author, etc.) to be available in a site even if they are not added to it, exactly as it’s happening now (it’s nice and important that they do not appear in the browse or search function). But I would probably be happy to have listed under them only items associated to the current site (not 100% sure, I should think this through and ask for feedback internally).

Anyway, I see this can get quite complicated, and preferences and expectations may differ.

I’ve started setting up a script for building sitemap.xml from the API, which seems absolutely doable, but I am afraid it won’t help much, as Google will probably still keep versions it finds via crawling even if they are not in the sitemap.xml. Removing URLs would be a pain, as it would need to be done routinely as new items are added.
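
For reference, the sitemap-building step could look roughly like the sketch below. It deliberately leaves out the API calls: it assumes the item-to-site membership has already been collected into a plain array, and both the function name and that array shape are made up for the example. Only the "valid" site and item combinations end up in the file.

```php
<?php
/**
 * Build a sitemap.xml string listing each item only under the sites it is added to.
 *
 * @param string              $baseUrl    e.g. https://example.com
 * @param array<int,string[]> $membership item id => slugs of sites the item is added to
 */
function buildSitemap(string $baseUrl, array $membership): string
{
    $base = rtrim($baseUrl, '/');
    $lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ];
    foreach ($membership as $itemId => $slugs) {
        foreach ($slugs as $slug) {
            $loc = $base . '/s/' . rawurlencode($slug) . '/item/' . (int) $itemId;
            // Escape the URL so characters such as & stay valid inside XML.
            $lines[] = '  <url><loc>' . htmlspecialchars($loc, ENT_XML1 | ENT_QUOTES) . '</loc></url>';
        }
    }
    $lines[] = '</urlset>';
    return implode("\n", $lines) . "\n";
}
```

Items that are not added to any site simply produce no entry, which matches the concern above: a sitemap can promote the right URLs, but by itself it does not tell a search engine to drop the wrong ones.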

After some more thinking, I’ll pitch another suggestion: what about using the noindex tag for item pages on sites the item has not been added to? (more details about “no index” here)

My new suggestion for a nice php line would then be much easier than the one I suggested in my previous post, and would simply be: if item is not associated to current site, then include in header <meta name="robots" content="noindex">. This could fix the issue, with no visible change to the user, and no complex trade-offs: the item would be visible, possibly on multiple sites, and search engines would ignore it only on sites where it is not included. What do you think?

Sorry about the lengthy post!

Right, I had looked into this before and just forgot about it earlier in this discussion. I think… yes, it’s pretty likely that this will be the most plausible solution that works with all the various ways people might use Omeka S, but still fixes the problem here.

This one should also be doable as a module that checks the site membership of the current item/media and adds the meta tag if it’s not in the site.

To clarify, you’re referring to a module here because there is no obvious way to do this at other levels, right?

I gave this a try, and came up with this code that seems to work as expected if included in view/omeka/site/item/show.phtml. This is however useless, as the noindex tag must be in <head>.

Is there a somewhat straightforward way to get something like this to work in the layout.phtml?

I see that headTitle() gets item-level data into head, but I struggle to apply the same approach to my case (and not even sure if this is the best route).

<?php $sites = $item->sites()?>

<?php if (count($sites) > 0): ?>
  <?php if (!in_array($site, $sites)) {
    echo '<meta name="robots" content="noindex">'; 
    } 
  ?>
<?php endif; ?>

I suggested a module just because that would be the best way of making it work across any theme.

There’s a helper for adding meta tags that works just like headTitle: use $this->headMeta()->appendName()

Bingo!

Thanks a lot.

I added the following code on top of view/omeka/site/item/show.phtml in my theme, and this seems to behave as expected.

I’ll deploy this on my public server and hopefully search engines will adapt after their next crawl.

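<?php
// Sites this item has been explicitly added to.
$sites = $item->sites();

// headMeta() collects tags for the layout to print inside <head>,
// so nothing needs to be echoed here.
if (count($sites) > 0) {
    if (in_array($site, $sites)) {
        // Item belongs to the current site: let search engines index it.
        $this->headMeta()->appendName('robots', 'index');
    } else {
        // Item is only reachable here through a cross-reference.
        $this->headMeta()->appendName('robots', 'noindex');
    }
} else {
    // Item has not been added to any site.
    $this->headMeta()->appendName('robots', 'noindex');
}
?>

N.B. If anybody else sees this, they may want to consider that the final else branch in the above chunk sets the ‘noindex’ tag for items that are not added to any site. I can imagine cases where others may want to have them indexed instead.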
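
For those cases, the same rule can be written as a small standalone function with the "no site at all" behaviour as a switch. This is only a sketch: the function name and the $indexUnassigned flag are invented for the example, and it compares site ids rather than site objects.

```php
<?php
/**
 * Decide the robots directive for an item page.
 *
 * @param int   $currentSiteId   id of the site the page is being viewed in
 * @param int[] $itemSiteIds     ids of the sites the item has been added to
 * @param bool  $indexUnassigned whether items added to no site at all should be indexed
 */
function robotsDirective(int $currentSiteId, array $itemSiteIds, bool $indexUnassigned = false): string
{
    if (!$itemSiteIds) {
        // Item is not added to any site: follow the configured policy.
        return $indexUnassigned ? 'index' : 'noindex';
    }
    // Index only on sites the item actually belongs to.
    return in_array($currentSiteId, $itemSiteIds, true) ? 'index' : 'noindex';
}
```

In a theme, the result would presumably be passed to $this->headMeta()->appendName('robots', ...), with $itemSiteIds built by calling id() on each entry returned by $item->sites(). I have not tested that wiring, so treat it as an assumption.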