Uploading data in bulk¶
This tutorial follows a Flattened Data Layout, using a project with this example layout:
.
├── biospecimen_experiment_1
│   ├── fileA.txt
│   └── fileB.txt
├── biospecimen_experiment_2
│   ├── fileC.txt
│   └── fileD.txt
├── single_cell_RNAseq_batch_1
│   ├── SRR12345678_R1.fastq.gz
│   └── SRR12345678_R2.fastq.gz
└── single_cell_RNAseq_batch_2
    ├── SRR12345678_R1.fastq.gz
    └── SRR12345678_R2.fastq.gz
Tutorial Purpose¶
In this tutorial you will:
- Find the Synapse ID of your project
- Create a manifest TSV file to upload data in bulk
- Upload all of the files for our project
- Add an annotation to all of our files
- Add a provenance/activity record to one of our files
Prerequisites¶
- Make sure that you have completed the following tutorials:
- This tutorial is set up to upload the data from ~/my_ad_project; make sure that this or another desired directory exists (a sketch for creating it follows this list).
- Pandas is used in this tutorial. Refer to our installation guide to install it. Feel free to skip that portion of the tutorial if you do not wish to use Pandas. You may also use external tools to open and manipulate Tab Separated Value (TSV) files.
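If you are following along and the example directory does not exist yet, here is a minimal sketch that creates the flattened layout shown above with small placeholder files (the file contents are made up; substitute your real data):
import os

# Create the example ~/my_ad_project layout with non-empty placeholder files.
layout = {
    "biospecimen_experiment_1": ["fileA.txt", "fileB.txt"],
    "biospecimen_experiment_2": ["fileC.txt", "fileD.txt"],
    "single_cell_RNAseq_batch_1": ["SRR12345678_R1.fastq.gz", "SRR12345678_R2.fastq.gz"],
    "single_cell_RNAseq_batch_2": ["SRR12345678_R1.fastq.gz", "SRR12345678_R2.fastq.gz"],
}
root = os.path.expanduser(os.path.join("~", "my_ad_project"))
for folder, files in layout.items():
    os.makedirs(os.path.join(root, folder), exist_ok=True)
    for file_name in files:
        with open(os.path.join(root, folder, file_name), "w") as f:
            f.write("placeholder content\n")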
1. Find the Synapse ID of your project¶
First, let's set up some constants we'll use in this script and find the ID of our project:
import os
import synapseclient
import synapseutils
syn = synapseclient.Synapse()
syn.login()
# Create some constants to store the paths to the data
DIRECTORY_FOR_MY_PROJECT = os.path.expanduser(os.path.join("~", "my_ad_project"))
PATH_TO_MANIFEST_FILE = os.path.expanduser(os.path.join("~", "manifest-for-upload.tsv"))
# Step 1: Let's find the synapse ID of our project:
my_project_id = syn.findEntityId(
name="My uniquely named project about Alzheimer's Disease"
)
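Note that findEntityId returns None when nothing matches the given name, so an optional sanity check like the sketch below can save confusion later (the error message is just an example):
# Optional: fail fast if the project was not found under this name.
if my_project_id is None:
    raise ValueError(
        "Project not found. Check the project name and that you have access to it."
    )
print(f"Found project: {my_project_id}")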
2. Create a manifest TSV file to upload data in bulk¶
Let's "walk" our directory on disk to create a manifest file for upload
# Step 2: Create a manifest TSV file to upload data in bulk
# Note: When this command is run it will re-create your directory structure within
# Synapse. Be aware of this before running this command.
# If folders with the exact same names already exist in Synapse, those folders will be used.
synapseutils.generate_sync_manifest(
syn=syn,
directory_path=DIRECTORY_FOR_MY_PROJECT,
parent_id=my_project_id,
manifest_path=PATH_TO_MANIFEST_FILE,
)
After this has run, if you inspect the TSV file that was created, you'll see it looks similar to this:
path parent
/home/user_name/my_ad_project/single_cell_RNAseq_batch_2/SRR12345678_R2.fastq.gz syn60109537
/home/user_name/my_ad_project/single_cell_RNAseq_batch_2/SRR12345678_R1.fastq.gz syn60109537
/home/user_name/my_ad_project/biospecimen_experiment_2/fileD.txt syn60109543
/home/user_name/my_ad_project/biospecimen_experiment_2/fileC.txt syn60109543
/home/user_name/my_ad_project/single_cell_RNAseq_batch_1/SRR12345678_R2.fastq.gz syn60109534
/home/user_name/my_ad_project/single_cell_RNAseq_batch_1/SRR12345678_R1.fastq.gz syn60109534
/home/user_name/my_ad_project/biospecimen_experiment_1/fileA.txt syn60109540
/home/user_name/my_ad_project/biospecimen_experiment_1/fileB.txt syn60109540
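If you would like to preview the generated manifest from Python before uploading, here is a minimal sketch using the standard library (Step 4 below does something similar with pandas):
import csv

# Print each manifest row: the local file path and the Synapse ID of its parent folder.
with open(PATH_TO_MANIFEST_FILE, newline="") as manifest:
    for row in csv.DictReader(manifest, delimiter="\t"):
        print(row["path"], "->", row["parent"])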
3. Upload the data in bulk¶
# Step 3: After generating the manifest file, we can upload the data in bulk
synapseutils.syncToSynapse(
syn=syn, manifestFile=PATH_TO_MANIFEST_FILE, sendMessages=False
)
While this is running you'll see output in your console similar to:
Validation and upload of: /home/user_name/manifest-for-upload.tsv
Validating columns of manifest.....OK
Validating that all paths exist...........OK
Validating that all files are unique...OK
Validating that all the files are not empty...OK
Validating file names...
OK
Validating provenance...OK
Validating that parents exist and are containers...OK
We are about to upload 8 files with a total size of 8.
Uploading 8 files: 100%|███████████████████| 8.00/8.00 [00:01<00:00, 6.09B/s]
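Once the upload completes, you can optionally confirm what was created by listing the children of your project, for example with a small sketch like this:
# Optional check: list the folders and files now present under the project.
for child in syn.getChildren(my_project_id, includeTypes=["folder", "file"]):
    print(child["id"], child["name"], child["type"])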
4. Add an annotation to our manifest file¶
At this point in the tutorial we will start to use pandas to manipulate the TSV file. If you are not comfortable with pandas, you may use any tool that can open and manipulate TSV files, such as Excel or Google Sheets.
# Step 4: Let's add an annotation to our manifest file
# Pandas is a powerful data manipulation library in Python. Although it is not required
# for this tutorial, it is used here to demonstrate how you can manipulate the manifest
# file before uploading it to Synapse.
import pandas as pd
# Read TSV file into a pandas DataFrame
df = pd.read_csv(PATH_TO_MANIFEST_FILE, sep="\t")
# Add a new column to the DataFrame
df["species"] = "Homo sapiens"
# Write the DataFrame back to the manifest file
df.to_csv(PATH_TO_MANIFEST_FILE, sep="\t", index=False)
synapseutils.syncToSynapse(
syn=syn,
manifestFile=PATH_TO_MANIFEST_FILE,
sendMessages=False,
)
Now that you have uploaded and annotated your files, you'll be able to inspect your data on the Files tab of your project in the Synapse web UI. Each file will have the single annotation that you added in the previous step. In more advanced workflows you'll likely need to build a more complex manifest file, but this should give you a good starting point.
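For example, if you wanted a different annotation value per folder instead of one value for every file, a sketch like the one below could extend the manifest (the "assay" column and its values are made up for illustration):
# Hypothetical example: derive an "assay" annotation from the folder each file lives in.
def assay_for(path: str) -> str:
    if "single_cell_RNAseq" in path:
        return "scRNA-seq"
    return "biospecimen assay"  # made-up label for illustration

df["assay"] = df["path"].apply(assay_for)
df.to_csv(PATH_TO_MANIFEST_FILE, sep="\t", index=False)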
5. Create an Activity/Provenance¶
Let's create an Activity/Provenance record for one of our files. In other words, we will record the steps taken to generate the file.
In this code we find a row in our TSV file and point to the file path of another file within our manifest. By doing this, we create a relationship between the two files. This is a simple example of how you can create a provenance record in Synapse. Additionally, we'll link to a sample URL that describes a process that we may have executed to generate the file.
# Step 5: Let's create an Activity/Provenance
# First let's find the row in the TSV we want to update. This code finds the row number
# that we would like to update.
row_index = df[
df["path"] == f"{DIRECTORY_FOR_MY_PROJECT}/biospecimen_experiment_1/fileA.txt"
].index
# After finding the row we want to update let's go ahead and add a relationship to
# another file in our manifest. This allows us to say "We used 'this' file in some way".
df.loc[
row_index, "used"
] = f"{DIRECTORY_FOR_MY_PROJECT}/single_cell_RNAseq_batch_1/SRR12345678_R1.fastq.gz"
# Let's also link to the pipeline that we ran in order to produce these results. In a
# real scenario you may want to link to a specific run of the tool where the results
# were produced.
df.loc[row_index, "executed"] = "https://nf-co.re/rnaseq/3.14.0"
# Let's also add a description for this Activity/Provenance
df.loc[
row_index, "activityDescription"
] = "Experiment results created as a result of the linked data while running the pipeline."
# Write the DataFrame back to the manifest file
df.to_csv(PATH_TO_MANIFEST_FILE, sep="\t", index=False)
synapseutils.syncToSynapse(
syn=syn,
manifestFile=PATH_TO_MANIFEST_FILE,
sendMessages=False,
)
After running this code we may again inspect the Synapse web UI. In this screenshot I've navigated to the Files tab and selected the file that we added a Provenance record to.
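If you prefer to verify the result from code rather than the web UI, here is a minimal sketch (assuming the folder and file names used in this tutorial) that looks up the file and fetches its provenance record with syn.getProvenance:
# Look up the uploaded file by name, then fetch its Activity/Provenance record.
folder_id = syn.findEntityId(name="biospecimen_experiment_1", parent=my_project_id)
file_id = syn.findEntityId(name="fileA.txt", parent=folder_id)
activity = syn.getProvenance(file_id)
print(activity["description"])
print(activity.get("used", []))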
Source code for this tutorial¶
Click to show me
"""
Here is where you'll find the code for the uploading data in bulk tutorial.
"""
import os
import synapseclient
import synapseutils
syn = synapseclient.Synapse()
syn.login()
# Create some constants to store the paths to the data
DIRECTORY_FOR_MY_PROJECT = os.path.expanduser(os.path.join("~", "my_ad_project"))
PATH_TO_MANIFEST_FILE = os.path.expanduser(os.path.join("~", "manifest-for-upload.tsv"))
# Step 1: Let's find the synapse ID of our project:
my_project_id = syn.findEntityId(
name="My uniquely named project about Alzheimer's Disease"
)
# Step 2: Create a manifest TSV file to upload data in bulk
# Note: When this command is run it will re-create your directory structure within
# Synapse. Be aware of this before running this command.
# If folders with the exact same names already exist in Synapse, those folders will be used.
synapseutils.generate_sync_manifest(
syn=syn,
directory_path=DIRECTORY_FOR_MY_PROJECT,
parent_id=my_project_id,
manifest_path=PATH_TO_MANIFEST_FILE,
)
# Step 3: After generating the manifest file, we can upload the data in bulk
synapseutils.syncToSynapse(
syn=syn, manifestFile=PATH_TO_MANIFEST_FILE, sendMessages=False
)
# Step 4: Let's add an annotation to our manifest file
# Pandas is a powerful data manipulation library in Python. Although it is not required
# for this tutorial, it is used here to demonstrate how you can manipulate the manifest
# file before uploading it to Synapse.
import pandas as pd
# Read TSV file into a pandas DataFrame
df = pd.read_csv(PATH_TO_MANIFEST_FILE, sep="\t")
# Add a new column to the DataFrame
df["species"] = "Homo sapiens"
# Write the DataFrame back to the manifest file
df.to_csv(PATH_TO_MANIFEST_FILE, sep="\t", index=False)
synapseutils.syncToSynapse(
syn=syn,
manifestFile=PATH_TO_MANIFEST_FILE,
sendMessages=False,
)
# Step 5: Let's create an Activity/Provenance
# First let's find the row in the TSV we want to update. This code finds the row number
# that we would like to update.
row_index = df[
df["path"] == f"{DIRECTORY_FOR_MY_PROJECT}/biospecimen_experiment_1/fileA.txt"
].index
# After finding the row we want to update let's go ahead and add a relationship to
# another file in our manifest. This allows us to say "We used 'this' file in some way".
df.loc[
row_index, "used"
] = f"{DIRECTORY_FOR_MY_PROJECT}/single_cell_RNAseq_batch_1/SRR12345678_R1.fastq.gz"
# Let's also link to the pipeline that we ran in order to produce these results. In a
# real scenario you may want to link to a specific run of the tool where the results
# were produced.
df.loc[row_index, "executed"] = "https://nf-co.re/rnaseq/3.14.0"
# Let's also add a description for this Activity/Provenance
df.loc[
row_index, "activityDescription"
] = "Experiment results created as a result of the linked data while running the pipeline."
# Write the DataFrame back to the manifest file
df.to_csv(PATH_TO_MANIFEST_FILE, sep="\t", index=False)
synapseutils.syncToSynapse(
syn=syn,
manifestFile=PATH_TO_MANIFEST_FILE,
sendMessages=False,
)