Huge collections

ZeT · November 27, 2020, 7:21pm

Hi! I’m wondering how stably Omeka S runs with large sites (50-100 thousand PDFs). Are there problems with document search speed or overall stability? Are there any examples of real installations? Thank you!

mebrett · November 30, 2020, 1:47pm

In terms of examples, here are a few from the directory:

The [Papers of the War Department](https://wardepartmentpapers.org/s/home/page/home) has 79,261 items
Paris Science and Letters has over 13,000 items
Neptun has over 5000 items

benbakelaar · December 4, 2020, 8:27pm

I asked a similar question several years ago (maybe even 4). Will try to find a link to it!

But what I’ve learned over the years is that the concern isn’t how many files (images, PDFs, etc.) are attached to the “items”.

The key question is, how many digital objects, aka “items”, aka “documents” or whatever terminology you use, will there be in the site?

If each file is its own unique item, with its own unique metadata, then I would ask: what is the content inside those PDFs? From my perspective, Omeka isn’t really a document management system, a digital asset management (DAM) system, or even a knowledge management system (KMS). There is better software out there for those needs (again, IMO). But if those PDFs represent historical documents, for instance scans, then it might make sense to use Omeka.

In terms of search capability, once you are at 100k documents, I believe you would want to implement a search engine such as Lucene, Solr (there is a module), or Elasticsearch (wish there was a module!).
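To make the search-engine suggestion concrete, here is a rough sketch of what a faceted Solr query looks like at the HTTP level. The `q`, `rows`, `wt`, `facet`, and `facet.field` parameters are standard Solr request parameters; the core URL and the `subject_s` field name are made-up placeholders, and an Omeka module would build its own queries rather than use this helper.

```python
from urllib.parse import urlencode


def solr_facet_query_url(core_url, text, facet_fields, rows=10):
    """Build a Solr /select URL with faceting enabled.

    core_url and the facet field names are hypothetical; they depend
    entirely on how the Solr core and its schema were set up.
    """
    params = [
        ("q", text),        # full-text query string
        ("rows", rows),     # number of documents to return
        ("wt", "json"),     # ask for a JSON response
        ("facet", "true"),  # turn faceting on
    ]
    # One facet.field parameter per field to count values for.
    params += [("facet.field", field) for field in facet_fields]
    return f"{core_url.rstrip('/')}/select?{urlencode(params)}"
```

Calling `solr_facet_query_url("http://localhost:8983/solr/omeka", "great lakes", ["subject_s"])` would give a URL that returns matching documents plus per-value counts for the chosen field, which is what drives a "refine by" sidebar.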

kloor · December 7, 2020, 1:49pm

I maintain two large Omeka-S sites for the University Libraries at Bowling Green State University. The largest is an index for the Historical Collections of the Great Lakes which has over 460,000 records:
https://greatlakes.bgsu.edu/

We also have a keyword index for our student newspaper, The BG News, with over 120,000 records:
https://lib.bgsu.edu/bgnewsindex/

We ran into a few minor performance issues due to the number of records back with Omeka-S v1, but I think they had all been resolved by v3. The built-in search does seem to work quickly now, but we continue to use Daniel-KM’s Search and SearchSolr modules for additional features such as facets:

We currently do not have an Omeka-S site with PDF files. I’d imagine you would need to find a way to extract OCR text from the files themselves and insert it as metadata within Omeka-S to make the files searchable. This module may be an option:
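Independent of any module, the manual route described above (pull the text out of each PDF, then store it as item metadata) could be sketched roughly as follows. This is only an illustration under several assumptions: the poppler `pdftotext` command is installed, the PDFs already carry a text layer (scans would need a real OCR pass first), the full text goes into `dcterms:description`, and the numeric property IDs are looked up on your own installation. The function names are hypothetical; the payload shape and the `key_identity`/`key_credential` query parameters follow the Omeka S REST API.

```python
import json
import subprocess
from urllib import parse, request


def extract_pdf_text(pdf_path):
    """Return the text layer of a PDF via the poppler `pdftotext` CLI.

    Assumes `pdftotext` is on PATH; "-" sends the output to stdout.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def build_item_payload(title, full_text, title_property_id, text_property_id):
    """Build a JSON-LD body for a new Omeka S item.

    Property IDs are installation-specific, so they are passed in
    rather than hard-coded. Storing the text in dcterms:description
    is an arbitrary choice made for this sketch.
    """
    return {
        "dcterms:title": [
            {"type": "literal", "property_id": title_property_id, "@value": title}
        ],
        "dcterms:description": [
            {"type": "literal", "property_id": text_property_id, "@value": full_text}
        ],
    }


def create_item(base_url, key_identity, key_credential, payload):
    """POST the payload to the items endpoint and return the decoded response."""
    query = parse.urlencode(
        {"key_identity": key_identity, "key_credential": key_credential}
    )
    req = request.Request(
        f"{base_url.rstrip('/')}/api/items?{query}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```

For a batch of 50-100 thousand files, you would loop over the PDFs, call `extract_pdf_text` and `build_item_payload` for each, and send the result with `create_item`, ideally with some throttling and error handling that this sketch leaves out.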

benbakelaar · December 7, 2020, 10:09pm

Nearly 500k records! That’s a great case study.

I also see you have Refine/Faceted Browsing working via Solr; that’s awesome.

The site seems to run extremely fast - faster than I would expect, even for a site with far fewer records. Would you mind sharing the LAMP server details behind the installation? It feels to me like there was some optimization work behind the fast UX!
