We’ve noticed a new issue in our Google indexing error reports for our Omeka Classic digital collections site (http://digtialcollections.library.gvsu.edu). We have several thousand URLs flagged as “duplicate without user-selected canonical.” I looked through this report, and most of the links look like Atom feeds:
From the Google documentation, this error indicates that the indexing software considers the page duplicate content, possibly because it is linked to from more than one place on the site. I’m not sure whether there’s anything I should do about this. The recommended solution is to add a canonical tag to the page, but I’m not aware of any way to edit the XML/Atom feed of an item or collection. Thoughts?
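One possible workaround, since you can’t edit the feed markup itself: for non-HTML responses, Google also honors a canonical sent as an HTTP `Link` header, which you could add at the web-server level. A minimal Apache sketch, assuming mod_rewrite and mod_headers are enabled and that Omeka’s alternate formats are requested via an `output` query parameter (the item page path `items/show/<id>` is an assumption about your URL layout):

```apache
<IfModule mod_rewrite.c>
  RewriteEngine On
  # When an item's Atom output is requested, stash the canonical
  # HTML URL (same path, no query string) in an environment variable.
  RewriteCond %{QUERY_STRING} (^|&)output=atom($|&)
  RewriteRule ^(items/show/[0-9]+)$ - [E=CANONICAL:http://digtialcollections.library.gvsu.edu/$1]
</IfModule>

<IfModule mod_headers.c>
  # Emit the canonical as an HTTP header on those feed responses only.
  Header set Link "<%{CANONICAL}e>; rel=\"canonical\"" env=CANONICAL
</IfModule>
```

This is only a sketch of the approach, not a drop-in config; you’d want to verify the exact feed URL patterns your install produces before deploying it.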
We’ve seen signs of Google doing this in other places: it gets stuck in a loop, repeatedly appending “amp;” to the URLs it’s crawling. It seems to be the omeka-xml format that causes it to do this. (Your examples have atom at the end, but the omeka-xml parts nearer the start of the URL are the sign that this is probably the source.)
The omeka-xml output embeds the current URL in the XML it generates, and my best guess is that this is what triggers the problem. We’ve looked into it on our end, and as best I can tell we’re not doing anything wrong in how we create the XML, build those URLs, or declare that the document is XML, so we’re a little stumped on that one. We haven’t personally seen any crawlers other than Google doing this, so I don’t know whether it’s their “fault” here or what.
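One mitigation for the loop itself, independent of whose fault it is, would be to redirect the mangled URLs back to their cleaned form so Googlebot stops accumulating “amp;” segments. A hedged Apache sketch, assuming mod_rewrite is enabled (each redirect strips one stray “amp;”; repeated hops clean the rest):

```apache
RewriteEngine On
# If the query string contains a doubled "&amp;" artifact,
# 301-redirect to the same path with the stray "amp;" removed.
RewriteCond %{QUERY_STRING} ^(.*)&amp;(.*)$
RewriteRule ^ %{REQUEST_URI}?%1&%2 [R=301,L]
```

This doesn’t fix the underlying crawler behavior, but it gives every mangled URL a single clean destination, which should also help the duplicate-content report settle down.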
Inside Google’s Webmaster Tools I believe there’s an option to report Googlebot crawling issues, which you could try. Other than that, you could block the omeka-xml URLs (either all of them, or just the broken ones) from Googlebot, using either robots.txt or Apache rules.
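For the robots.txt route, a minimal sketch, assuming the alternate formats are requested via an `output` query parameter (Googlebot supports `*` wildcards in Disallow patterns, though not all crawlers do):

```
# Keep Googlebot out of Omeka's machine-readable output URLs,
# e.g. /items/browse?output=omeka-xml
User-agent: Googlebot
Disallow: /*?output=omeka-xml
Disallow: /*&output=omeka-xml
```

Note that blocking crawling this way stops the loop but doesn’t remove already-indexed URLs; pairing it with the report in Webmaster Tools is probably the safer combination.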