Skip to content

4. Limitations and configuration options

kfletch edited this page Mar 22, 2013 · 1 revision

There are very few inherent limits to the DataStage system, other than the physical capacity of the hardware running it. The DataStage administrator may wish to create file size limits for individual users, or for the total instance, but it is not required.

The administrator can set:

There may also be firewall considerations – these may work in favour of your group (additional security preventing unauthorised access) or against you (preventing access by people outside your institution, when you would like to make a file publicly accessible). See the wiki pages on https://github.com/dataflow/DataStage/wiki/2.-Installing-and-configuring-DataStage and https://github.com/dataflow/DataStage/wiki/3.-Using-DataStage-via-web-interface-and-mapped-drive for further information.

Long-term storage

We cannot guarantee to preserve your data. We provide software which is as robust as we can make it, but we cannot take responsibility for the hardware you install it on, or how you configure it.

Software development will carry on after the end of the DataFlow project (www.dataflow.ox.ac.uk) (ongoing as of March 2013). The Bodleian Library at the University of Oxford has decided to trial DataStage for its own use, and will continue open-source development practices, so users can expect bug fixes and security patches until at least May 2013 (and hopefully beyond).

However – no matter what may happen to the DatStage software itself, so long as you have root access to the machine holding your data, you can always get your files back. So long as you maintain that machine, your data will be on it. Make sure you set up some sort of back-up so that data does not disappear if the server is damaged.

Secure connections

DataStage can be accessed via ordinary http, samba, webdav and SFTP. These are not inherently unsafe but are common targets for hackers. You can add security to your web interface by obtaining an SSL certificate (enabling you to use the more secure HTTPS protocol) -- talk to your IT support team. You can also choose to reduce the ports available (e.g. not allowing samba: this is discussed in the configuration section, as well as the mapped drive section of this wiki). You can also install third-party software, such as DenyHosts (http://denyhosts.sourceforge.net/) to block undesired IP address (computers), or configure DataStage to only allow requests from previously-approved IP addresses. Your IT support team should be able to set up any of these options; you can probably figure out how to do it on your own, if you have access to Google, patience, and spare time.

User configurations

The DataStage system is optimised for research groups where most members work mostly independently, or where individual contributions to group goals are well established. So all "shared" and "collaborative" items are filed under the owner's name, rather than in some communal pool.

If you work in a highly collaborative team (where it might be difficult to establish who should "own" a particular file), you could create another user, called something like "PROJECT," and instruct group members to save files to the "collab" area of the "PROJECT" user. Then all files will be in one place, but will still be owned by the individuals who submit these files -- making it clear who has uploaded what. You could also give group members individual credentials, plus the "PROJECT" credentials, so they can choose which identity to assume when adding files to your file-store.

Note: the lack of a collective pool of files was due to a limitation in Ubuntu's file permissions at an early stage of our software development. It may be possible to make a collective pool on more recent versions of Ubuntu, but we haven't tried it. If you get it working, let us know, and/or contribute to the code! katherine.fletcher@cs.ox.ac.uk