Downloading data in bulk¶
Contained within this tutorial is an experimental interface for working with the Synapse Python Client. These interfaces are subject to change at any time. Use at your own risk.
This tutorial follows the Flattened Data Layout. It assumes a project with this example layout:
.
├── biospecimen_experiment_1
│   ├── fileA.txt
│   └── fileB.txt
├── biospecimen_experiment_2
│   ├── fileC.txt
│   └── fileD.txt
├── single_cell_RNAseq_batch_1
│   ├── SRR12345678_R1.fastq.gz
│   └── SRR12345678_R2.fastq.gz
└── single_cell_RNAseq_batch_2
    ├── SRR12345678_R1.fastq.gz
    └── SRR12345678_R2.fastq.gz
Tutorial Purpose¶
In this tutorial you will:
- Download all files/folders from a project
- Download all files/folders for a specific folder within the project
- Loop over all files/folders on the project/folder object instances
Prerequisites¶
- Make sure that you have completed the following tutorials:
- This tutorial is set up to download the data to ~/my_ad_project. Make sure that this or another desired directory exists.
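The download directory can also be created from Python before syncing. This is a minimal sketch using only the standard library; the helper name `ensure_directory` is an illustrative assumption and is not part of the Synapse client.

```python
import os


def ensure_directory(path: str) -> str:
    """Expand `~`, create the directory if it is missing, return its absolute path."""
    expanded = os.path.abspath(os.path.expanduser(path))
    # exist_ok=True makes this a no-op when the directory already exists
    os.makedirs(expanded, exist_ok=True)
    return expanded


# Example usage (creates ~/my_ad_project if needed):
# ensure_directory(os.path.join("~", "my_ad_project"))
```

Calling the helper twice is safe, so it can sit at the top of a script that is re-run.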
1. Download all files/folders from a project¶
First let's set up some constants we'll use in this script¶
import os
import synapseclient
from synapseclient.models import Folder, Project
syn = synapseclient.Synapse()
syn.login()
# Create some constants to store the paths to the data
DIRECTORY_TO_SYNC_PROJECT_TO = os.path.expanduser(os.path.join("~", "my_ad_project"))
FOLDER_NAME_TO_SYNC = "biospecimen_experiment_1"
DIRECTORY_TO_SYNC_FOLDER_TO = os.path.join(
    DIRECTORY_TO_SYNC_PROJECT_TO, FOLDER_NAME_TO_SYNC
)
Next we'll create an instance of the Project we are going to sync¶
# Step 1: Create an instance of the container to sync data from, then sync it
project = Project(name="My uniquely named project about Alzheimer's Disease")
Finally we'll sync the project from Synapse to your local machine¶
# We'll set the `if_collision` to `keep.local` so that we don't overwrite any files
project.sync_from_synapse(path=DIRECTORY_TO_SYNC_PROJECT_TO, if_collision="keep.local")
# Print out the contents of the directory where the data was synced to
# Explore the directory to see the contents have been recursively synced.
print(os.listdir(DIRECTORY_TO_SYNC_PROJECT_TO))
While syncing your project you'll see results like:
Syncing Project (syn53185532:My uniquely named project about Alzheimer's Disease) from Synapse.
Syncing Folder (syn53205630:experiment_notes) from Synapse.
Syncing Folder (syn53205632:notes_2022) from Synapse.
Syncing Folder (syn53205629:single_cell_RNAseq_batch_1) from Synapse.
Syncing Folder (syn53205656:single_cell_RNAseq_batch_2) from Synapse.
Syncing Folder (syn53205631:notes_2023) from Synapse.
Downloading [####################]100.00% 4.0bytes/4.0bytes (1.8kB/s) fileA.txt Done...
Downloading [####################]100.00% 3.0bytes/3.0bytes (1.1kB/s) SRR92345678_R1.fastq.gz Done...
Downloading [####################]100.00% 4.0bytes/4.0bytes (1.7kB/s) SRR12345678_R1.fastq.gz Done...
Downloading [####################]100.00% 4.0bytes/4.0bytes (1.9kB/s) fileC.txt Done...
Downloading [####################]100.00% 4.0bytes/4.0bytes (2.7kB/s) fileB.txt Done...
Downloading [####################]100.00% 4.0bytes/4.0bytes (2.7kB/s) SRR12345678_R2.fastq.gz Done...
Downloading [####################]100.00% 4.0bytes/4.0bytes (2.6kB/s) SRR12345678_R2.fastq.gz Done...
Downloading [####################]100.00% 4.0bytes/4.0bytes (1.8kB/s) SRR12345678_R1.fastq.gz Done...
Downloading [####################]100.00% 3.0bytes/3.0bytes (1.5kB/s) SRR92345678_R2.fastq.gz Done...
Downloading [####################]100.00% 4.0bytes/4.0bytes (1.6kB/s) fileD.txt Done...
['single_cell_RNAseq_batch_2', 'single_cell_RNAseq_batch_1', 'experiment_notes']
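`os.listdir` shows only the top level of the synced directory. To confirm that the contents were synced recursively, the whole tree can be walked with the standard library. This is a sketch; `list_synced_files` is a hypothetical helper name and it does not call Synapse.

```python
import os
from typing import List


def list_synced_files(root: str) -> List[str]:
    """Return every file below `root` as a sorted path relative to `root`."""
    found = []
    for current_dir, _subdirs, filenames in os.walk(root):
        for filename in filenames:
            full_path = os.path.join(current_dir, filename)
            found.append(os.path.relpath(full_path, root))
    return sorted(found)


# Example usage with the constant defined earlier in this tutorial:
# for relative_path in list_synced_files(DIRECTORY_TO_SYNC_PROJECT_TO):
#     print(relative_path)
```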
2. Download all files/folders for a specific folder within the project¶
Following the same set of steps, let's sync a specific folder:
# Step 2: The same as step 1, but for a single folder
folder = Folder(name=FOLDER_NAME_TO_SYNC, parent_id=project.id)
folder.sync_from_synapse(path=DIRECTORY_TO_SYNC_FOLDER_TO, if_collision="keep.local")
print(os.listdir(os.path.expanduser(DIRECTORY_TO_SYNC_FOLDER_TO)))
While syncing your folder you'll see results like:
Syncing Folder (syn53205630:experiment_notes) from Synapse.
Syncing Folder (syn53205632:notes_2022) from Synapse.
Syncing Folder (syn53205631:notes_2023) from Synapse.
['notes_2022', 'notes_2023']
You'll notice that no files are downloaded. This is because the client sees that you already have the content within this folder and does not attempt to download it again. If you instead set if_collision to "overwrite.local", any local file whose content does not match Synapse will be overwritten.
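This tutorial does not spell out how the client decides that local content "does not match" Synapse. Assuming the comparison is checksum based (an assumption, not something this page confirms), the sketch below computes the MD5 of a local file so that you can compare it by hand against an MD5 recorded for the file in Synapse, if one is available. `local_md5` is a hypothetical helper name.

```python
import hashlib


def local_md5(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are never loaded into memory at once
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example usage (path is illustrative):
# print(local_md5("/path/to/my_ad_project/some_folder/some_file.txt"))
```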
3. Loop over all files/folders on the project/folder object instances¶
Using sync_from_synapse loads the state of all Folders and Files retrieved from Synapse into memory. This allows you to loop over the contents of your container.
# Step 3: Loop over all files/folders on the project/folder object instances
for folder_at_root in project.folders:
    print(f"Folder at root: {folder_at_root.name}")
    for file_in_root_folder in folder_at_root.files:
        print(f"File in {folder_at_root.name}: {file_in_root_folder.name}")
    for folder_in_folder in folder_at_root.folders:
        print(f"Folder in {folder_at_root.name}: {folder_in_folder.name}")
        for file_in_folder in folder_in_folder.files:
            print(f"File in {folder_in_folder.name}: {file_in_folder.name}")
The result of traversing some of your project structure should look like:
Folder at root: experiment_notes
Folder in experiment_notes: notes_2022
File in notes_2022: fileA.txt
File in notes_2022: fileB.txt
Folder in experiment_notes: notes_2023
File in notes_2023: fileC.txt
File in notes_2023: fileD.txt
Folder at root: single_cell_RNAseq_batch_1
File in single_cell_RNAseq_batch_1: SRR12345678_R1.fastq.gz
File in single_cell_RNAseq_batch_1: SRR12345678_R2.fastq.gz
File in single_cell_RNAseq_batch_1: SRR92345678_R1.fastq.gz
File in single_cell_RNAseq_batch_1: SRR92345678_R2.fastq.gz
Folder at root: single_cell_RNAseq_batch_2
File in single_cell_RNAseq_batch_2: SRR12345678_R1.fastq.gz
File in single_cell_RNAseq_batch_2: SRR12345678_R2.fastq.gz
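The nested loops above stop at two levels of folders. For deeper hierarchies, a recursive generator can visit every level. The sketch below relies only on the `name`, `files`, and `folders` attributes used in this tutorial; `StubFile` and `StubFolder` are hypothetical stand-ins so that the example runs without a Synapse connection, and `walk_container` is an illustrative name, not a client API.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class StubFile:
    """Hypothetical stand-in for a synced File; only `name` is used."""
    name: str


@dataclass
class StubFolder:
    """Hypothetical stand-in for a synced Folder or Project."""
    name: str
    files: List[StubFile] = field(default_factory=list)
    folders: List["StubFolder"] = field(default_factory=list)


def walk_container(container, prefix: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (kind, path) pairs for everything below `container`, depth first."""
    # `or []` guards against attributes that are unset (assumed possible before a sync)
    for file in container.files or []:
        yield ("file", f"{prefix}{file.name}")
    for folder in container.folders or []:
        folder_path = f"{prefix}{folder.name}"
        yield ("folder", folder_path)
        # Recurse with the folder path as the new prefix
        yield from walk_container(folder, prefix=f"{folder_path}/")


demo = StubFolder(
    name="project",
    folders=[
        StubFolder(
            name="experiment_notes",
            folders=[StubFolder(name="notes_2022", files=[StubFile("fileA.txt")])],
        )
    ],
)
for kind, path in walk_container(demo):
    print(f"{kind}: {path}")
```

With a real synced container, passing `project` in place of `demo` should work as long as its `files` and `folders` attributes behave like the lists iterated above.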
Source code for this tutorial¶
"""
Here is where you'll find the code for the downloading data in bulk tutorial.
"""
import os
import synapseclient
from synapseclient.models import Folder, Project
syn = synapseclient.Synapse()
syn.login()
# Create some constants to store the paths to the data
DIRECTORY_TO_SYNC_PROJECT_TO = os.path.expanduser(os.path.join("~", "my_ad_project"))
FOLDER_NAME_TO_SYNC = "biospecimen_experiment_1"
DIRECTORY_TO_SYNC_FOLDER_TO = os.path.join(
    DIRECTORY_TO_SYNC_PROJECT_TO, FOLDER_NAME_TO_SYNC
)
# Step 1: Create an instance of the container to sync data from, then sync it
project = Project(name="My uniquely named project about Alzheimer's Disease")
# We'll set the `if_collision` to `keep.local` so that we don't overwrite any files
project.sync_from_synapse(path=DIRECTORY_TO_SYNC_PROJECT_TO, if_collision="keep.local")
# Print out the contents of the directory where the data was synced to
# Explore the directory to see the contents have been recursively synced.
print(os.listdir(DIRECTORY_TO_SYNC_PROJECT_TO))
# Step 2: The same as step 1, but for a single folder
folder = Folder(name=FOLDER_NAME_TO_SYNC, parent_id=project.id)
folder.sync_from_synapse(path=DIRECTORY_TO_SYNC_FOLDER_TO, if_collision="keep.local")
print(os.listdir(os.path.expanduser(DIRECTORY_TO_SYNC_FOLDER_TO)))
# Step 3: Loop over all files/folders on the project/folder object instances
for folder_at_root in project.folders:
    print(f"Folder at root: {folder_at_root.name}")
    for file_in_root_folder in folder_at_root.files:
        print(f"File in {folder_at_root.name}: {file_in_root_folder.name}")
    for folder_in_folder in folder_at_root.folders:
        print(f"Folder in {folder_at_root.name}: {folder_in_folder.name}")
        for file_in_folder in folder_in_folder.files:
            print(f"File in {folder_in_folder.name}: {file_in_folder.name}")