Alternatives to AWS S3 external storage? Pairing with Dropbox or other cloud storage

Hello there,

I hope this post is not too far off topic; I’ll first provide some context.

I am preparing to install and run an Omeka website for a collaborative research project involving some 30 collaborators across several countries.
My major issue is storage space. Considering that my own private local archive already amounts to some 300 GB, my estimate is that the project may gather at least 1 TB of original files, if not more.
The choice of an appropriate hosting solution is therefore critical in terms of cost and scalability. Most affordable hosting solutions for running Omeka come with limited storage (up to 500 GB on OVH, for instance).
I would thus like to dissociate the hosting for running Omeka itself, together with the metadata and possibly the thumbnails (VPS hosting), from the hosting for the actual archival files being managed (possibly a cheaper cloud storage solution).
The Omeka forums and how-to guides found on the web describe such a setup only for S3 storage from Amazon Web Services (AWS). This is not satisfactory, since with my current (newbie) knowledge I am unable to forecast the actual costs of such a solution. I would rather opt for a cloud storage solution priced on data volume rather than on data traffic and instance requests, using either:

  • one single cloud storage account (Dropbox Pro, Google Drive, ownCloud managed storage, …);
  • several cloud storage accounts [multiple repositories, as apparently possible with ICA-AtoM], one for each collaborator/user and individually managed by each of them (this would also make copyright management smoother in such a collaborative project).
    The point here is that the Omeka install should be able to display thumbnails of each file associated with an item, and retrieve the original file, without storing it on the same server where the install is hosted.

If it can be done with a single AWS S3 bucket, why not with a storage space located elsewhere, and why not with multiple storage spaces?

Any help in understanding whether this can be done would be greatly appreciated.
Cheers,

Swapping out the storage layer for some other single storage provider should be relatively simple. Omeka’s storage layer is pluggable, so the code responsible for storing files and getting URLs that reference them can be changed easily. It just happens that S3 is the alternative that we’ve written and ship with Omeka.
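
For reference, a custom adapter just has to implement Omeka’s small storage adapter interface. Roughly, the methods involved look like this (as found in Omeka Classic’s Omeka_Storage_Adapter_AdapterInterface; check your version’s source for the exact signatures):

    interface Omeka_Storage_Adapter_AdapterInterface
    {
        public function __construct(array $options = null); // options from config.ini
        public function setUp();                // one-time setup, e.g. creating a bucket
        public function canStore();             // is the adapter usable as configured?
        public function store($source, $dest);  // store a local file at a storage path
        public function move($source, $dest);   // move a file within storage
        public function delete($path);          // remove a file from storage
        public function getUri($path);          // return a URL referencing a stored file
    }

So an adapter for another provider is essentially a matter of implementing those seven methods against that provider’s API.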

Partitioning the storage between several different storage locations is a little trickier, as Omeka just treats its storage as one big undifferentiated location, but it would still be possible with the same system. A thin storage adapter could do something like inspect the path being stored or requested and use that to decide how to store or retrieve the file.

Both those options would require writing some code for a custom storage adapter, though (this code would likely live in a plugin). Depending on what system you want to use, it may be possible to instead just mount the remote storage as a filesystem directly. In this way you could potentially use Omeka’s normal local filesystem storage and still have cloud storage actually occurring (and even achieve the local/cloud hybrid you mentioned if, say, just the original folder were mounted in this way).
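
To make that last idea concrete, here is one way such a mount could look using the s3fs-fuse tool (bucket name and paths are placeholders; tools like rclone can mount non-S3 providers in the same spirit):

    # Mount a cloud bucket over just Omeka's "original" files directory,
    # so originals go to the cloud while derivatives stay on local disk.
    s3fs my-archive-bucket /var/www/omeka/files/original \
        -o passwd_file=/etc/passwd-s3fs -o allow_other

Omeka itself would be none the wiser: it keeps using its plain Filesystem storage.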

Thank you, John, for the prompt answer.
I take your point, and I am reassured as far as our needs are concerned.
I will therefore start setting up a fresh Omeka install with my chosen VPS provider, and try to configure it to store files with a separate cloud storage provider.
To be clear, the reason I was asking about partitioning the storage across different locations is both financial and legal. In fact,

  • distributing the file storage hosting costs among the “users” may bring this cost close to zero, since it is quite easy to obtain 10–20 GB of free personal cloud storage, especially for university staff (one could even have a configuration where each “user’s” original files are stored locally on their personal computer, while only the metadata and thumbnails are uploaded to the Omeka install);
  • partitioning the file storage across different storage locations (one per “user”) would leave the burden of protecting and managing copyright on the original files to the individual “user”, reducing the project’s burden to managing copyright on the metadata (which, by the way, can easily be linked to the individual “user” at the “add item” step).
    But perhaps this is in fact “a little trickier”, since it would probably require some sort of LDAP identification at the login step, associating a specific storage location with a specific “user”…

Nevertheless, to keep it “relatively simple”, considering that “Omeka is designed with non-IT specialists in mind” and that I am a “non-IT specialist” (as are the “users” of our future Omeka install), what would it take to provide a “relatively simple” step-by-step instruction for setting up a specific file storage location in an Omeka install?
Where would I start? From the posts explaining how to set up an S3 external storage configuration?
Where and when should one provide the location of that file storage (URL/FTP/WebDAV?), and when and where should one provide the appropriate credentials (username/password)?
Besides, is there a setting to be made at the install stage to manage access rights to such external storage? While it is clear how to give “users” specific access rights to original files, it is unclear how to restrict publishing of the original files alone on the public website/interface (while keeping the metadata and thumbnails publicly searchable)… but perhaps this does not need to be configured at the install/setup stage and can be managed later on, and is therefore the topic for a separate forum post?

Hence, any help from the forum to make this work would be greatly appreciated.
Thank you,

Splitting user-by-user would be significantly more complex: the only thing you have easy access to in the storage layer is the path of the file being stored or requested. You can get access to global data like the current user, but keeping track of things so it works correctly the way you’re describing would not be a simple undertaking. The system just isn’t designed to work that way.

As for configuring the storage: to work well, it pretty much needs to be done up front, before any items or files are created. It’s possible to do so later, but you’d have to manually move things around to the proper locations if you go that route.

The S3 instructions are a reasonable starting point. The storage system is configured through the application/config/config.ini file. You tell Omeka what adapter it should use to handle storing files, and give whatever options that storage adapter needs there as well. This generally would include the location/URL as well as whatever authentication options are needed. The storage class is simple, just implementing a small interface (see for example a community member’s own S3 adapter that uses Amazon’s official PHP SDK rather than Zend’s). An adapter that stores to any other provider would probably look pretty much the same, but just using that provider’s SDK instead.
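
To give you an idea, the shipped S3 configuration in application/config/config.ini looks like this (key names as documented for Omeka Classic; the values are placeholders):

    storage.adapter = "Omeka_Storage_Adapter_ZendS3"
    storage.adapterOptions.accessKeyId = YOUR_ACCESS_KEY_ID
    storage.adapterOptions.secretAccessKey = YOUR_SECRET_ACCESS_KEY
    storage.adapterOptions.bucket = your-bucket-name

A custom adapter would be configured the same way: set storage.adapter to your class name and pass whatever storage.adapterOptions.* entries it needs.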

Bifurcating the storage locations is perhaps a little more complex, but it can pretty much delegate all the work to other existing adapters. I’d envision the solution looking something like this: the adapter checks whether the storage path starts with “original/”; if it does, it passes off to a cloud storage adapter, and otherwise to the Filesystem adapter. The result is that “original” files are stored with the cloud provider, and all other files are stored locally.
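
A rough, untested sketch of such a delegating adapter (the class name and option layout are hypothetical; the interface methods are the ones from Omeka’s source):

    class Custom_Storage_Adapter_Split implements Omeka_Storage_Adapter_AdapterInterface
    {
        private $_cloud;
        private $_local;

        public function __construct(array $options = null)
        {
            // Each sub-adapter gets its own block of options from config.ini.
            $this->_cloud = new Omeka_Storage_Adapter_ZendS3($options['cloud']);
            $this->_local = new Omeka_Storage_Adapter_Filesystem($options['local']);
        }

        public function setUp()
        {
            $this->_cloud->setUp();
            $this->_local->setUp();
        }

        public function canStore()
        {
            return $this->_cloud->canStore() && $this->_local->canStore();
        }

        public function store($source, $dest)
        {
            $this->_pick($dest)->store($source, $dest);
        }

        public function move($source, $dest)
        {
            $this->_pick($dest)->move($source, $dest);
        }

        public function delete($path)
        {
            $this->_pick($path)->delete($path);
        }

        public function getUri($path)
        {
            return $this->_pick($path)->getUri($path);
        }

        // "original/..." paths go to the cloud; thumbnails and other
        // derivatives go to local disk.
        private function _pick($path)
        {
            return strpos($path, 'original/') === 0 ? $this->_cloud : $this->_local;
        }
    }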

Managing access to the original files can be a little more complex because again it’s not exactly something that Omeka is designed to do. One option with cloud storage is to just configure the storage so public access isn’t allowed at all. S3 for example also allows for signature-based URLs where the system creates a time-limited URL to allow access, preventing hotlinking and even direct “guessing” of the URL.
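
With the S3 adapter we ship, that behavior is a single extra option in config.ini: when an expiration is set, files are stored with private access and Omeka hands out time-limited signed URLs instead of permanent public ones:

    ; expiration time in minutes; when set, files are private and
    ; Omeka generates signed, time-limited URLs for them
    storage.adapterOptions.expiration = 10

A custom adapter for another provider could do the equivalent with that provider’s own signed-URL mechanism, if it offers one.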