Data Ingress

This document describes the data ingress methods supported by the DSH platform.

Choosing An Ingress Mechanism

If you need to ingress small quantities of data (files of 1GB or less each), or only have access to a web browser, we suggest using the web-based file ingress mechanism.
If you need to ingress files larger than 1GB, we suggest using the SFTP-based file ingress mechanism.
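
The rule of thumb above can be sketched as a small shell helper. This is only an illustration: the function names are hypothetical, and it assumes "1GB" means 1024^3 bytes, which may not match the exact cut-off applied by the platform.

```shell
# Assumption: 1GB is treated as 1024^3 bytes.
LIMIT_BYTES=$((1024 * 1024 * 1024))

# Suggest a mechanism for a file of the given size in bytes.
suggest_for_size() {
    if [ "$1" -le "$LIMIT_BYTES" ]; then
        echo "web"
    else
        echo "sftp"
    fi
}

# Suggest a mechanism for a file on disk.
suggest_for_file() {
    # The arithmetic expansion strips any padding that wc adds on some systems.
    suggest_for_size "$(($(wc -c < "$1")))"
}
```

For example, `suggest_for_file ./my-dataset.csv` prints either `web` or `sftp`. If you only have access to a web browser, the web-based mechanism remains the option to use regardless of what the helper suggests.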

Web Based File Ingress

University of Sheffield users of the DSH can request data ingress into their project environment by raising a TopDesk ticket with the SDS.

Info

When making a formal request for Data Ingress you should provide the following information:

The name of the project environment you want to ingress data to.
A short description of the reasons why the data needs to be ingressed.
A list and short description of each file to be ingressed.
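
If you have many files, a listing like the one produced below may be a convenient starting point for the file list in your ticket. This is a minimal sketch, not an SDS requirement: the function name is hypothetical, it assumes all files to be ingressed sit under a single local directory, and you would still need to add the short descriptions by hand.

```shell
# Print "<size in bytes><TAB><relative path>" for every file under a directory.
list_ingress_files() {
    # $1: local directory holding the files to be ingressed (assumed layout)
    (
        cd "$1" || exit 1
        find . -type f | sort | while IFS= read -r f; do
            printf '%s\t%s\n' "$(($(wc -c < "$f")))" "${f#./}"
        done
    )
}
```

Running `list_ingress_files ./to-upload` prints one line per file, which can be pasted into the request.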

Briefly, the process is as follows:

A user identifies the specific files needed, such as a dataset, a CSV file, or code.
A formal request is submitted to the SDS by raising a TopDesk ticket.
A link is emailed to the requester, who uses it to upload the data to a neutral holding area outside the secure DSH. The link has a predefined expiration date and cannot be used after that date.
Before the data moves from the holding area into the DSH, it undergoes manual checks to ensure the contents are correct.
Once cleared, the SDS team moves the files into an S3 storage area within the project workspace.

Important

If the uploaded files differ from those listed in the request, or additional files are uploaded via the ingress request link, the request will be rejected and you must reapply.
If files with the same name and directory structure already exist within the DSH, they will be overwritten.
Each DSH workspace has only one ingress bucket. We recommend using a clear, consistent directory structure to avoid clutter and files being overwritten.

SFTP Based File Ingress

Info

The SFTP service is made available to approved data providers upon request. If you have not been explicitly pointed towards this service, you most likely don't have access. Access is provided on a project-by-project basis, depending on data sharing agreements.

If this service looks like a good fit for your needs, please get in touch with us via the IT Services Helpdesk or, if you are an external data provider, please reach out to your UoS contacts.

To upload data to the SFTP system, it's assumed you are able to install an SFTP client supported by AWS Transfer Family.

Take a look at the AWS docs here, or skip ahead to Connecting for information on supported clients.

This section of the docs is aimed at those looking to upload data to the SFTP service. The process is broken down into two steps:

Generating credentials to be used to access the system.
Connecting and uploading data.

Generating Keys

Should you be granted access to the service, you'll need to generate an RSA or EdDSA (Ed25519) key pair and forward the public key to your internal contact.

Windows

We suggest you use PuTTYgen to generate keys on Windows machines. This is included in the full installer of PuTTY found here.

From PuTTYgen select either EdDSA or RSA as the type of key to generate, then click on "Generate":

With the key generated, we highly recommend you enter a strong passphrase in the key passphrase fields before saving the private key. As the name suggests, this key is private and should not be shared with anyone!

You'll also want to save the public key. This is the file you'll need to send on to your internal contact.

Mac/Linux

Mac and Linux machines come with the ssh-keygen command built in, which can be used to generate the keys we require.

Run the command below in a terminal, replacing <key-name> with a filename of your choosing. You may wish to cd into a suitable directory first.

ssh-keygen -t ed25519 -f <key-name>

This command will ask you to enter a passphrase for the key. We highly recommend you do so.

Once this has been entered, the system will generate two new files: your private key, which has the name you specified after the -f, and your public key, which has the same name suffixed with .pub

Take care to keep your private key safe, as it should not be shared with anyone! The <key-name>.pub file should be forwarded on to your internal contact.
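
Before sending anything on, it may be worth double-checking that the file you are about to forward really is the public half of the pair. The sketch below is one way to do that for files produced by ssh-keygen as described above. The function name is hypothetical, and public keys saved from PuTTYgen may use a different file format that this check does not cover.

```shell
# Return success if the first line of a file looks like an OpenSSH public key.
is_public_key() {
    # $1: path to the file you intend to forward
    case "$(head -n 1 "$1")" in
        ssh-ed25519\ *|ssh-rsa\ *|ecdsa-sha2-*\ *) return 0 ;;
        *) return 1 ;;
    esac
}

# Example usage (not run here):
#   is_public_key <key-name>.pub && echo "OK to forward"
#   ssh-keygen -l -f <key-name>.pub   # prints the key's fingerprint
```

A private key file starts with a "-----BEGIN ... PRIVATE KEY-----" line instead, so the check fails for it.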

Connecting

Once you've been told that your account has been created with the public key you provided in the steps above, you can connect to the service and start transferring data.

sftp.rcc.shef.ac.uk via port 22 is the primary endpoint for accessing the service. Use this when configuring the server address and port with the software suggested below.

Windows

Although we're tool agnostic, this document provides step-by-step guidance for WinSCP. Should you feel confident with configuration, other good tools such as FileZilla will work just fine.

You'll first need to change some settings in WinSCP:

Open the preferences dialogue box from the Options menu in the top right.

From here navigate to the "Transfer" tab, select "Default" and "Edit...":

This will open the "Transfer settings" box, from here ensure the "Preserve timestamp" box is unchecked:

After confirming the transfer settings, open the "Endurance" tab below and set the "Enable transfer resume/transfer to temporary filename for" setting to "Disable":

With these now set, you may need to re-open WinSCP to see the Login form. Once it is open, make sure that the file protocol SFTP is selected:

With the server address entered in the host name field and your user name entered in the user name field, click on the Advanced... button to select your private key.

From the left-hand side of this new menu, go to the SSH - Authentication tab and, under the text box for Private key file:, click on the ... button to open a file selection prompt. This will allow you to select the private key .ppk file you generated in the steps above.

With those filled you should now be able to log into the SFTP service.

Mac/Linux

We don't yet have specific guidance on connecting to the SFTP service via Mac or Linux machines. However, there are many good tools out there that we're happy to suggest:

Cyberduck for Mac
FileZilla for Linux or Mac
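
If you prefer the command line, the OpenSSH sftp client that ships with most Mac and Linux machines may also work. The following is a sketch rather than tested guidance for this service: the function name and the SFTP_BIN override (used here only to allow a dry run) are hypothetical, and it assumes your private key is in OpenSSH format as generated by ssh-keygen above.

```shell
SFTP_HOST="sftp.rcc.shef.ac.uk"
SFTP_PORT=22

# Open an sftp session to the service using a given user name and private key.
dsh_sftp() {
    # $1: your SFTP user name, $2: path to your private key
    # Any remaining arguments are passed through to sftp as extra options.
    user=$1
    key=$2
    shift 2
    ${SFTP_BIN:-sftp} -i "$key" -P "$SFTP_PORT" "$@" "$user@$SFTP_HOST"
}

# Example usage (not run here):
#   dsh_sftp <username> ./<key-name>
#   sftp> put my-dataset.csv
```

By analogy with the WinSCP "Preserve timestamp" setting above, it is probably safest to upload with a plain put and avoid options that preserve file times or permissions.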

Accessing Data

If you are a user of the system looking to access data uploaded by any of the ingress mechanisms, read on.

When data is uploaded via any of the ingress mechanisms, it is placed in a bucket within your project named <PROJECT NAME>-ingress, as shown below:

Warning

A bucket with this name is created whenever one of our ingress systems is used and one does not already exist. If you have already created a bucket with this naming structure, be warned that these services will interact with the bucket.

The upload process for the ingress mechanisms is one way, meaning that data that comes into the system cannot go out this way. For example, if you were to upload data into this bucket yourself, it would not be made accessible to the SFTP users. At a technical level, objects within the ingress bucket are air-gapped from the various ingress mechanisms.

Note

You are free to use the ingress bucket; however, it is strongly advised that you move data out and into a more suitable location for regular usage, as newly uploaded data could overwrite files of the same name.
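
As an illustration of moving data out of the ingress bucket, the sketch below wraps the AWS CLI. It rests on several assumptions that may not hold in your workspace: that the AWS CLI is installed and has access to your project's buckets, and that the destination bucket already exists. The function names and the AWS_BIN override (used only to allow a dry run) are hypothetical.

```shell
# Build the ingress bucket name from the project name: <PROJECT NAME>-ingress
ingress_bucket() {
    printf '%s-ingress\n' "$1"
}

# Move everything under a prefix from the ingress bucket to another bucket.
move_out_of_ingress() {
    # $1: project name, $2: prefix (folder) to move, $3: destination bucket
    src="s3://$(ingress_bucket "$1")/$2"
    dst="s3://$3/$2"
    ${AWS_BIN:-aws} s3 mv "$src" "$dst" --recursive
}

# Example usage (not run here):
#   move_out_of_ingress <PROJECT NAME> raw/ <DESTINATION BUCKET>
```

Keeping the same prefix in the destination preserves your directory structure, which helps avoid accidental overwrites after the move.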