Items showing up on wrong sites through Google search

Dbrett · March 29, 2021, 2:02pm

Hi all,

We had hoped that this issue would be solved by the new item setups from Omeka S 3.0 but unfortunately that is not the case. Our sites are indexed by Google and quite often when we search for an item it will show up but be displayed on a site that is in no way connected to it. The most recent example of this is when we search for “crossing ice bridge at Niagara falls brock” on Google, the first two results are sites that are in no way connected to that item. The third result is the correct site.

Any insight into why this is happening and how we can prevent it would be appreciated.

Thanks,

Daniel

jflatnes · March 30, 2021, 2:32pm

I’m wondering if maybe the items were visible on the “wrong” sites (as in, in the browse page) at one point? Google will keep crawling URLs it already knows about even after the links to them have disappeared, so that’s a possible explanation here.

Even in the current version, we don’t strictly forbid viewing “outside” items through the lens of a site. This accounts for cases such as an item within a site that links to an item not in the site. That said, turning cases like yours into a 404 error is a possible enhancement, made easier by the 3.0 changes.

Have you tried using Google’s removal tool on these problematic results?

Dbrett · March 31, 2021, 12:15pm

Hi John,

I removed the URL of the first result (the Reading the Middle Ages site) using the removal tool in Google Search Console, but weirdly all it did was make the video no longer show up. The item page with its metadata is still showing in Google results. Comparing the first and second results shows the difference between the blocked and the non-blocked page.

Does this provide any insight into a solution or is it just more confusing?

Thanks,

Daniel

Dbrett · April 1, 2021, 8:04pm

And now it has disappeared correctly. Maybe the indexing took a while to clear the site? I will try remapping all of our sites and see if that resolves the issue.

Thanks,

Daniel

giocomai · April 4, 2021, 8:09am

The same issue has been reported to me multiple times for our website, and I had this on my “to do” list. Thank you @Dbrett for bringing this up.

After doing some manual testing, it would appear that this is very common (as in, “most of the time”) in our case. If I search for the title of an item, I get sent to a random site of our Omeka installation as the top result (and the “right” site is not even listed). As some of our sites are very specific, finding completely unrelated content on them looks bizarre from the outside.

I feel this is not a minor issue, as this is often the main way people discover our archive.

The “proper” Google way to deal with this would be to use so-called “canonical URLs” as detailed by Google, where they list the pros and cons of various technical implementations. This lets you declare which is the preferred version of a page when the same content can be found at different URLs. In Omeka S, this would probably imply having a “main” website for each item, to be set either systematically when items are added to a site or as an individual property.

I think this would be the most desirable solution, as it would imply no visible change for the user (so the item pages would still exist for the cases you mention), but would let search engines do their work and send people to the right page.

Dbrett · April 5, 2021, 1:52pm

Hi thanks for the response!

@giocomai Would you be able to clarify the process you are describing above? Would it require the manual adjustment of every possible combination of site and item? If so, that would not be a viable solution in our case, as we have more than 3000 items across 60+ sites.

@jflatnes I had previously thought that the removal tool had removed the result from the Google search results when it was no longer the first result, but I was wrong. Running the removal tool from Google Search Console on the page just moved it down the search results and (this is the strangest part) the item page is now broken. So it’s actually even worse now, because we still have an incorrect page, but now it’s a broken incorrect page.

This is all a bit too high level for me honestly, so any further insights would be appreciated.

Thanks,

Daniel

giocomai · April 5, 2021, 7:44pm

Hi @Dbrett.

So, here’s my thinking about this so far.

Option 1. Everything remains as it is in the database and in Omeka, but we add a nice php line in the theme that outputs something like this:
<link rel="canonical" href="https://example.com/s/website_where_item_is_actually_included/item/1234" />

Since we currently have no better information, website_where_item_is_actually_included would probably be the site with the lowest id, if the item is included in more than one site, or probably the first public website, if the item is not explicitly included in any website.

Option 2. A slightly better version would be to precede this line with an if statement such as “if this item is not part of the current website, then output this nice canonical url”. This would probably be better because if an item is included in more than one site, then it would be up to Google to decide which is the most important, and long-term we’d expect Google to get this right (and anyway, it couldn’t get it really wrong, as all sites where the item is not included would still have a link to the canonical url). Also, this would ensure that the change would not annoy people who, for example, have all items added to an old ugly site with a low id number, and who would much prefer Google to send visitors anywhere else… with the if statement, Google decides, but only among sites where the item was added.

Option 3. Would be that Omeka defaults along the lines of Option 2, but would allow the admin to explicitly say which site is the main site for an item if they so wish. This could probably happen either when items are added to a site, in some other bulk form (e.g. all items that are in x item set will have site y as main site), or manually. If the admin says nothing explicitly, things would happen along the lines of Option 2, so only items pages of sites where they have not been added will tell search engines the preferred address for that given item. This feature could conceivably be introduced either in core Omeka S or as a module.
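
As a rough illustration of Options 1 and 2, the sketch below picks a canonical site and builds the link tag. It is plain PHP rather than Omeka theme code: `canonicalSiteId()` and `canonicalLinkTag()` are hypothetical helpers, the site IDs and slug are assumed to have been looked up already, and the “first public website” fallback for items that are in no site is left out.

```php
<?php
// Hypothetical helper: decide which site should be advertised as canonical.
// Returns null when the current site already includes the item (Option 2:
// output nothing and let the search engine choose among legitimate sites),
// or when the item is not attached to any site at all.
function canonicalSiteId(array $itemSiteIds, int $currentSiteId): ?int
{
    if ($itemSiteIds === [] || in_array($currentSiteId, $itemSiteIds, true)) {
        return null;
    }
    // Option 1's rule of thumb: the site with the lowest id.
    return min($itemSiteIds);
}

// Hypothetical helper: build the tag for a URL of the form /s/<slug>/item/<id>.
function canonicalLinkTag(string $baseUrl, string $siteSlug, int $itemId): string
{
    $href = rtrim($baseUrl, '/') . '/s/' . rawurlencode($siteSlug) . '/item/' . $itemId;
    return '<link rel="canonical" href="' . htmlspecialchars($href, ENT_QUOTES) . '">';
}
```

In a theme, the resulting URL would presumably go into the head through a view helper rather than being echoed in the page body; that part is not shown here.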

Option 4 is actually a workaround. If there is really no way to write a nice php line such as the one I propose above, or if that is undesirable for some other reason I’m not currently considering, it would be possible to achieve something similar by querying Omeka’s REST API and systematically generating a sitemap.xml file that implements the same as above. As the page I linked above suggests, sitemap.xml files are less effective than the other solutions, but it should probably still work. This could also be more of a pain to maintain, as it may be implemented outside of Omeka, so you would probably have a script that runs once a day or so and regenerates an updated sitemap.xml (or the same could be done by a module inside Omeka… even if this seems more complicated than Options 1 and 2).
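
For Option 4, the sitemap-writing half of such a script might look like the sketch below. `buildSitemap()` is a hypothetical helper; it assumes the list of valid item URLs for a site has already been collected (for example from the REST API), which is not shown here.

```php
<?php
// Hypothetical helper: turn a list of absolute URLs into a sitemap.xml string.
function buildSitemap(array $urls): string
{
    $xml = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    $xml .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
    foreach ($urls as $url) {
        // Escape &, <, > and quotes so the URL is valid inside XML.
        $loc = htmlspecialchars($url, ENT_XML1 | ENT_QUOTES, 'UTF-8');
        $xml .= '  <url><loc>' . $loc . '</loc></url>' . "\n";
    }
    return $xml . '</urlset>' . "\n";
}

// Example: only item pages that belong to the site are listed.
echo buildSitemap([
    'https://example.com/s/my-site/item/1234',
    'https://example.com/s/my-site/item/1235',
]);
```

A daily cron job could write this output to a file, one per site.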

All of the above would still be better than manually adding links to the removal tool, which is nice if there’s just the odd link out of place, but soon becomes a huge pain if you need to do this for each new item for each of the dozen websites where that item has not been added.

I’d be curious to hear how common this experience is with Omeka S 3, but I would find it surprising if this was bugging only @Dbrett and me.

Considering that Google is really the main source of traffic for our Omeka S sites, as I suppose is the case for many others, I feel it’s not a minor nuisance.

I am not very familiar with either php or the internals of Omeka, so any help in writing that nice php line would certainly be very welcome.

Long term, if not Option 3, I feel at least something like Option 2 should be part of Omeka’s core (as a helper function) or added to default themes. Ultimately, the nice thing about canonical urls is that they do not bring any user-visible change, they would represent a huge improvement for many, and wouldn’t make a difference for those who didn’t have a problem in the first place.

Looking forward to hearing thoughts about this! Thanks a lot as usual!

jflatnes · April 5, 2021, 8:50pm

My thinking is that an option to make Omeka S 404 for items not in the site is probably the most straightforward fix. Cross-cutting linkages could either be filtered before display (though this could get complicated) or simply documented as a problem area best avoided.

The “canonical” link tag could also help, but will be problematic in cases where items really are used in multiple sites legitimately, and Google also doesn’t necessarily have to respect the canonical rel when it sees it.

It should be possible to trial out one or both of these approaches with a module.

What I’m curious about is what leads to the situation in the first place: Google only indexes what it can crawl to, so to me the obvious answer is that these sites were published at some point with everything assigned to them, and Google crawled those “wrong” item pages at that point. Does that track with either of your experiences? If there’s something else going on then it would affect what kinds of changes would have to be made.

giocomai · April 6, 2021, 8:24am

Thanks for sharing your thoughts. I do see the issue with the canonical approach, which basically implies that there’s just one original version of a page and all others are copies, in a context in which items may legitimately be included in more than one site.

With 404, those item pages would not exist in the first place, hence preventing trouble.

In our case, I’m quite sure those items were never added to those other sites, i.e. they were never visible via the browse (/item) page in unrelated sites.

How Google knows about them in the first place is a good question, surely. (I’d have the same question about some media pages that turn up in Google results, but in that case we probably did have a previous theme linking to media pages).

I will try to troubleshoot further and perhaps give a try to either the php canonical or the sitemap solution (with sitemap, you could probably include all valid item pages, and then let google do its magic).

Dbrett · April 6, 2021, 6:35pm

Hi guys,

To @jflatnes’s question, I have no way to tell if at some point one of our users accidentally assigned an item to the wrong site and then changed it later. I am however led to believe that this is not the case, given the volume of items found on incorrect sites when searched for in Google. My situation is one where I support Omeka S for our University, but researchers making exhibits really have free rein of the platform while they are working on their exhibits.

I am pretty new to supporting platforms like this, so my knowledge of PHP and Google’s crawlers is limited to basic Google-Fu results, but I’m happy to help in any way I can to aid in the development of a solution.

Thanks again for all of the help!

Daniel

giocomai · April 6, 2021, 8:59pm

Hi @jflatnes, after some more troubleshooting I figured out what leads to the crawling in the first place. In brief, the answer is… using cross references between Omeka items.

We use quite a lot of cross references: for fields such as dcterms:provenance (home institution or owner of the original item); for authors of pictures (each author is an Omeka item of class “person”, which lets us include some basic info about them, as well as a full list of their contents included in the archive); for locations (the vast majority of items in our collections are expected to belong to one of a few dozen administrative units, and each of these administrative units is an Omeka item, again with a brief description of that territory and a full list of all items that relate to it); for “is part of” in different contexts; etc.

This is all very nice and allows to explore the archive smoothly, but…

But on the item page, say, of one of these locations or of an author, there is a full list of all items related in some way to it, no matter if they are added to the current site or not. As we use either locations, authors, or “provenance” in most sites, in the end everything is available for crawlers on all sites… through lists included at the bottom of item pages.

I just ran a crawler on my own Omeka S installation, and I get to see lots of unrelated items on most sites. And so does Google.

I suspect that this issue ultimately affects anybody using a lot of cross-references among Omeka items.

So I do want all such items used in cross references (location, provenance, author, etc.) to be available in a site even if they are not added to it, exactly as is happening now (it’s nice and important that they do not appear in the browse or search function). But I would probably be happy to have listed under them only items associated with the current site (not 100% sure, I should think this through and ask for feedback internally).

Anyway, I see this can get quite complicated, and preferences and expectations may differ.

I’ve started setting up a script for building sitemap.xml from the API, which seems absolutely doable, but I am afraid it won’t help much, as Google will probably still keep versions it finds via crawling even if they are not in the sitemap.xml. Removing URLs would be a pain, as it would need to be done routinely as new items are added.

After some more thinking, I’ll pitch another suggestion: what about using the noindex tag on pages of items that are not added to the current site? (more details about “noindex” here)

My new suggestion for a nice php line would then be much easier than the one I suggested in my previous post, and would simply be: if the item is not associated with the current site, then include <meta name="robots" content="noindex"> in the header. This could fix the issue, with no visible change to the user, and no complex trade-offs: the item would be visible, possibly on multiple sites, and search engines would ignore it only on sites where it is not included. What do you think?

Sorry about the lengthy post!

jflatnes · April 7, 2021, 4:14pm

Right, I had looked into this before and just forgot about it earlier in this discussion. I think… yes, it’s pretty likely that this will be the most plausible solution that works with all the various ways people might use Omeka S, but still fixes the problem here.

This one should also be doable as a module that checks the site membership of the current item/media and adds the meta tag if it’s not in the site.

giocomai · April 11, 2021, 2:28pm

To clarify, you’re referring to a module here because there is no obvious way to do this at other levels, right?

I gave this a try, and came up with this code that seems to work as expected if included in view/omeka/site/item/show.phtml. This is however useless, as the noindex tag must be in <head>.

Is there a somewhat straightforward way to get something like this to work in the layout.phtml?

I see that headTitle() gets item-level data into head, but I struggle to apply the same approach to my case (and not even sure if this is the best route).

<?php
$sites = $item->sites();

// Prints the tag at this point in the page body, not in <head>.
if (count($sites) > 0 && !in_array($site, $sites)) {
    echo '<meta name="robots" content="noindex">';
}
?>

jflatnes · April 11, 2021, 4:55pm

I suggested a module just because that would be the best way of making it work across any theme.

There’s a helper for adding meta tags that works just like headTitle: use $this->headMeta()->appendName()

giocomai · April 11, 2021, 6:48pm

Bingo!

Thanks a lot.

I added the following code on top of view/omeka/site/item/show.phtml in my theme, and this seems to behave as expected.

I’ll deploy this on my public server and hopefully search engines will adapt after their next crawl.

<?php
// Compare site IDs rather than whole site objects.
$siteIds = array_map(function ($s) { return $s->id(); }, $item->sites());

// headMeta() queues the tag for the layout's <head>, so nothing is echoed here.
if (count($siteIds) > 0) {
    if (in_array($site->id(), $siteIds)) {
        $this->headMeta()->appendName('robots', 'index');
    } else {
        $this->headMeta()->appendName('robots', 'noindex');
    }
} else {
    // The item is not attached to any site.
    $this->headMeta()->appendName('robots', 'noindex');
}
?>

N.B. if anybody else sees this, they may want to consider that the final else branch in the above chunk sets the ‘noindex’ tag for items that are not added to any site. I can imagine cases where others may want to have them indexed instead.
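
For anyone who wants that choice to be explicit, the decision could be pulled out into a small function with a flag. This is only a sketch in plain PHP: `robotsDirective()` and its `$indexUnassigned` parameter are hypothetical names, and the site IDs are assumed to be integers already collected from the item and the current site.

```php
<?php
// Hypothetical helper: which robots value to use for an item page.
// $itemSiteIds     - IDs of the sites the item has been added to
// $currentSiteId   - ID of the site the page is being viewed through
// $indexUnassigned - whether items attached to no site should be indexed
function robotsDirective(array $itemSiteIds, int $currentSiteId, bool $indexUnassigned = false): string
{
    if ($itemSiteIds === []) {
        return $indexUnassigned ? 'index' : 'noindex';
    }
    // Strict comparison, so IDs must be integers on both sides.
    return in_array($currentSiteId, $itemSiteIds, true) ? 'index' : 'noindex';
}
```

The returned string would then be passed as the second argument to headMeta()->appendName('robots', …) in the theme.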

giocomai · June 26, 2021, 9:19am

Quick follow-up to confirm that the solution proposed in the previous post seems to work as expected. Google takes its time, but after some weeks it correctly removes from search results all item pages that were shown on sites they don’t belong to. See screenshot from Google’s search console:

(the small number of pages with errors is caused by an unrelated issue in a theme that I have just fixed).
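
While waiting for Google to catch up, a page can also be checked directly for the tag. The sketch below is one way to do that, assuming PHP’s DOM extension is available; `hasNoindex()` is a hypothetical helper, and downloading the page HTML is not shown.

```php
<?php
// Hypothetical helper: does this HTML carry a robots "noindex" tag in <head>?
function hasNoindex(string $html): bool
{
    $doc = new DOMDocument();
    // Keep libxml quiet about HTML5 elements it does not recognise.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//head/meta[@name="robots"]') as $meta) {
        if (stripos($meta->getAttribute('content'), 'noindex') !== false) {
            return true;
        }
    }
    return false;
}
```

Running it over the same item ID under each site slug should show the tag only on sites the item was not added to.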

This seems to be a great improvement for the user experience in the real world, as many users reached our archive via Google searches that led them to the right item on the wrong site, leading to some confusion.

So I feel this may be the preferred solution for many who:

make generous use of cross references between items
have multiple sites where it would be odd to see items that are not explicitly associated with them

Finally, people who read this thread may be interested in the discussion about the persistent identifier module apparently under development, and perhaps, depending on their scenario, consider the relevance of the canonical tag in that context.

ManOnDaMoon · June 26, 2021, 4:18pm

Hello

If this is of interest to you, I’ve created a Sitemaps module that generates one sitemap.xml file for each site where the option is enabled.
It is still in an early development state, but for now I’m getting positive results from Google Search Console, even if not everything from those sitemap files is indexed yet.

sanjinmuftic · November 24, 2021, 11:39am

Hello,

Thank you all for this discussion and thread. It is super useful, because on our institutional installation of Omeka S we are experiencing the same issues with Google search. Thank you @ManOnDaMoon for the sitemap module, which I have just installed to see if it makes a difference.

I am not sure if this is part of the same issue or something that is unrelated, but we have also noticed that if we manually change the site slug through the public url on any item, we can access items that are only meant to be seen on one site. This happens even if that item is only attached to a particular site.
For example this is the actual item on the site: Reimagining Tragedy from Africa and the Global South · Oral history interview with Ben Omowafola Tomoloju · Ibali
but if I change the website slug I will still get the item:
https://ibali.uct.ac.za/s/woac/item/841 or even Oral history interview with Ben Omowafola Tomoloju · Showcasing Connections through Collections · Ibali

I know that most people won’t change the slug, but they might change the numerical identifier and then see an item that does not belong on that site through that particular theme/site.

Now this item is limited to just the one site, but is still showing up on others.

I know that each site has the “Restrict Browse to attached items” setting unchecked. If it is checked, then most of the items that were not explicitly added to pages do not show up at all.

Not sure if this is related to the google search issue, but would appreciate any guidance about how to limit the presentation of items to their specific site, which I imagine would also stop the google search - even though I am aware that few might try to manually change the address bar with another item number.

sanjinmuftic · November 25, 2021, 10:19am

Hello again,

I have now also taken the solution provided by @giocomai and added the code to my item show pages, and I will keep an eye on the results through the search console. Thanks so much @giocomai, your code is very smooth!

giocomai · November 25, 2021, 11:45am

Thanks, much appreciated! I actually feel this could be default behaviour, or perhaps a site setting… if you don’t add an item to a site, you probably don’t want people to reach you from Google through an item on that site. In terms of how people actually reach our Omeka S websites in the real world, there are other issues that may be more difficult to approach (e.g. people who land via search engines directly on PDF files, rather than first on the pages showing relevant metadata, which would be more useful), but I feel the issue of items crawled from Omeka sites to which they are not associated is a relatively easy one, once you’re aware of it.