Downloading data in bulk

This tutorial uses an experimental interface for working with the Synapse Python Client. These interfaces are subject to change at any time. Use at your own risk.

This tutorial follows a Flattened Data Layout, using a project with this example layout:

.
├── biospecimen_experiment_1
│   ├── fileA.txt
│   └── fileB.txt
├── biospecimen_experiment_2
│   ├── fileC.txt
│   └── fileD.txt
├── single_cell_RNAseq_batch_1
│   ├── SRR12345678_R1.fastq.gz
│   └── SRR12345678_R2.fastq.gz
└── single_cell_RNAseq_batch_2
    ├── SRR12345678_R1.fastq.gz
    └── SRR12345678_R2.fastq.gz

Tutorial Purpose

In this tutorial you will:

  1. Download all files/folders from a project
  2. Download all files/folders for a specific folder within the project
  3. Loop over all files/folders on the project/folder object instances

Prerequisites

  • Make sure that you have completed the following tutorials:
  • This tutorial is set up to download the data to ~/my_ad_project. Make sure that this directory, or another directory of your choice, exists.
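
One way to satisfy the directory prerequisite is to create the directory from Python before syncing. The sketch below uses only the standard library; `ensure_directory` is a hypothetical helper name introduced here for illustration and is not part of the Synapse Python Client.

```python
import os


def ensure_directory(path: str) -> str:
    """Expand `~` in `path`, create the directory if needed, and return its absolute path."""
    expanded = os.path.abspath(os.path.expanduser(path))
    # exist_ok=True makes this safe to call when the directory already exists
    os.makedirs(expanded, exist_ok=True)
    return expanded
```

Calling `ensure_directory("~/my_ad_project")` once before running the rest of the tutorial should be enough. Calling it again is harmless.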

1. Download all files/folders from a project

First, let's set up some constants we'll use in this script:

import os
from synapseclient.models import (
    Folder,
    Project,
)
import synapseclient

syn = synapseclient.Synapse()
syn.login()

# Create some constants to store the paths to the data
DIRECTORY_TO_SYNC_PROJECT_TO = os.path.expanduser(os.path.join("~", "my_ad_project"))
FOLDER_NAME_TO_SYNC = "biospecimen_experiment_1"
DIRECTORY_TO_SYNC_FOLDER_TO = os.path.join(
    DIRECTORY_TO_SYNC_PROJECT_TO, FOLDER_NAME_TO_SYNC
)

Next, we'll create an instance of the Project we are going to sync:

project = Project(name="My uniquely named project about Alzheimer's Disease")

Finally, we'll sync the project from Synapse to your local machine:

# We'll set the `if_collision` to `keep.local` so that we don't overwrite any files
project.sync_from_synapse(path=DIRECTORY_TO_SYNC_PROJECT_TO, if_collision="keep.local")

# Print out the contents of the directory where the data was synced to
# Explore the directory to see the contents have been recursively synced.
print(os.listdir(DIRECTORY_TO_SYNC_PROJECT_TO))
While syncing your project you'll see results like:
Syncing Project (syn53185532:My uniquely named project about Alzheimer's Disease) from Synapse.
Syncing Folder (syn53205630:experiment_notes) from Synapse.
Syncing Folder (syn53205632:notes_2022) from Synapse.
Syncing Folder (syn53205629:single_cell_RNAseq_batch_1) from Synapse.
Syncing Folder (syn53205656:single_cell_RNAseq_batch_2) from Synapse.
Syncing Folder (syn53205631:notes_2023) from Synapse.
Downloading  [####################]100.00%   4.0bytes/4.0bytes (1.8kB/s) fileA.txt Done...
Downloading  [####################]100.00%   3.0bytes/3.0bytes (1.1kB/s) SRR92345678_R1.fastq.gz Done...
Downloading  [####################]100.00%   4.0bytes/4.0bytes (1.7kB/s) SRR12345678_R1.fastq.gz Done...
Downloading  [####################]100.00%   4.0bytes/4.0bytes (1.9kB/s) fileC.txt Done...
Downloading  [####################]100.00%   4.0bytes/4.0bytes (2.7kB/s) fileB.txt Done...
Downloading  [####################]100.00%   4.0bytes/4.0bytes (2.7kB/s) SRR12345678_R2.fastq.gz Done...
Downloading  [####################]100.00%   4.0bytes/4.0bytes (2.6kB/s) SRR12345678_R2.fastq.gz Done...
Downloading  [####################]100.00%   4.0bytes/4.0bytes (1.8kB/s) SRR12345678_R1.fastq.gz Done...
Downloading  [####################]100.00%   3.0bytes/3.0bytes (1.5kB/s) SRR92345678_R2.fastq.gz Done...
Downloading  [####################]100.00%   4.0bytes/4.0bytes (1.6kB/s) fileD.txt Done...
['single_cell_RNAseq_batch_2', 'single_cell_RNAseq_batch_1', 'experiment_notes']
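
`os.listdir` only shows the top level of the synced directory. To confirm that the contents were synced recursively, you could walk the whole tree. This is a minimal standard-library sketch; `list_synced_files` is a hypothetical helper name, not a client function.

```python
import os
from typing import List


def list_synced_files(root: str) -> List[str]:
    """Return the paths of all files under `root`, relative to `root`, sorted."""
    found = []
    # os.walk visits `root` and every directory beneath it
    for current_dir, _subdirs, filenames in os.walk(root):
        for filename in filenames:
            full_path = os.path.join(current_dir, filename)
            found.append(os.path.relpath(full_path, root))
    return sorted(found)
```

For example, `print("\n".join(list_synced_files(DIRECTORY_TO_SYNC_PROJECT_TO)))` would print one relative path per downloaded file.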

2. Download all files/folders for a specific folder within the project

Following the same steps, let's sync a specific folder:

folder = Folder(name=FOLDER_NAME_TO_SYNC, parent_id=project.id)

folder.sync_from_synapse(path=DIRECTORY_TO_SYNC_FOLDER_TO, if_collision="keep.local")

print(os.listdir(os.path.expanduser(DIRECTORY_TO_SYNC_FOLDER_TO)))
While syncing your folder you'll see results like:
Syncing Folder (syn53205630:experiment_notes) from Synapse.
Syncing Folder (syn53205632:notes_2022) from Synapse.
Syncing Folder (syn53205631:notes_2023) from Synapse.
['notes_2022', 'notes_2023']

You'll notice that no files are downloaded. This is because the client sees that you already have the content of this folder locally and does not download it again. If you instead used an if_collision of "overwrite.local", any local file whose content does not match Synapse would be overwritten.
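
This tutorial does not show how the client decides that local content "does not match" Synapse. Purely as an illustration, and assuming you have a known MD5 checksum for the remote file, a manual check could look like the sketch below. `local_md5` and `content_differs` are hypothetical helpers and do not represent the client's implementation.

```python
import hashlib


def local_md5(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are never loaded fully into memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def content_differs(path: str, expected_md5: str) -> bool:
    """Return True when the local file's MD5 does not equal `expected_md5`."""
    return local_md5(path).lower() != expected_md5.lower()
```

A file for which `content_differs` returns True is the kind of file the text above describes as being overwritten under "overwrite.local".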

3. Loop over all files/folders on the project/folder object instances

Calling sync_from_synapse loads the state of all Folders and Files retrieved from Synapse into memory, which lets you loop over the contents of your container.

for folder_at_root in project.folders:
    print(f"Folder at root: {folder_at_root.name}")

    for file_in_root_folder in folder_at_root.files:
        print(f"File in {folder_at_root.name}: {file_in_root_folder.name}")

    for folder_in_folder in folder_at_root.folders:
        print(f"Folder in {folder_at_root.name}: {folder_in_folder.name}")
        for file_in_folder in folder_in_folder.files:
            print(f"File in {folder_in_folder.name}: {file_in_folder.name}")
The result of traversing some of your project structure should look like:
Folder at root: experiment_notes
Folder in experiment_notes: notes_2022
File in notes_2022: fileA.txt
File in notes_2022: fileB.txt
Folder in experiment_notes: notes_2023
File in notes_2023: fileC.txt
File in notes_2023: fileD.txt
Folder at root: single_cell_RNAseq_batch_1
File in single_cell_RNAseq_batch_1: SRR12345678_R1.fastq.gz
File in single_cell_RNAseq_batch_1: SRR12345678_R2.fastq.gz
File in single_cell_RNAseq_batch_1: SRR92345678_R1.fastq.gz
File in single_cell_RNAseq_batch_1: SRR92345678_R2.fastq.gz
Folder at root: single_cell_RNAseq_batch_2
File in single_cell_RNAseq_batch_2: SRR12345678_R1.fastq.gz
File in single_cell_RNAseq_batch_2: SRR12345678_R2.fastq.gz
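
The nested loops above stop two folder levels down. For deeper hierarchies, the same traversal can be written recursively. The sketch below assumes only that each container exposes `name`, `files`, and `folders` attributes, as used in the loops above. `walk_container`, `FakeFile`, and `FakeFolder` are hypothetical names. The two classes are stand-ins so the example runs without a Synapse login.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class FakeFile:
    """Stand-in for a synced file; only `name` is used here."""
    name: str


@dataclass
class FakeFolder:
    """Stand-in for a synced project or folder."""
    name: str
    files: List[FakeFile] = field(default_factory=list)
    folders: List["FakeFolder"] = field(default_factory=list)


def walk_container(container, prefix: str = "") -> Iterator[str]:
    """Yield a slash-separated path for every file under `container`, depth-first."""
    # `or ()` guards against an attribute that is None rather than an empty list
    for file in container.files or ():
        yield f"{prefix}{file.name}"
    for folder in container.folders or ():
        yield from walk_container(folder, prefix=f"{prefix}{folder.name}/")


# Stand-in data mirroring part of the traversal output shown above
demo_root = FakeFolder(
    name="demo_project",
    folders=[
        FakeFolder(
            name="experiment_notes",
            folders=[FakeFolder(name="notes_2022", files=[FakeFile("fileA.txt")])],
        ),
        FakeFolder(
            name="single_cell_RNAseq_batch_2",
            files=[FakeFile("SRR12345678_R1.fastq.gz")],
        ),
    ],
)
demo_paths = list(walk_container(demo_root))
```

After a successful sync, passing the tutorial's `project` object to `walk_container` in place of `demo_root` should work in the same way, provided the attributes behave as assumed.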

Source code for this tutorial

"""
Here is where you'll find the code for the downloading data in bulk tutorial.
"""

import os
from synapseclient.models import (
    Folder,
    Project,
)
import synapseclient

syn = synapseclient.Synapse()
syn.login()

# Create some constants to store the paths to the data
DIRECTORY_TO_SYNC_PROJECT_TO = os.path.expanduser(os.path.join("~", "my_ad_project"))
FOLDER_NAME_TO_SYNC = "biospecimen_experiment_1"
DIRECTORY_TO_SYNC_FOLDER_TO = os.path.join(
    DIRECTORY_TO_SYNC_PROJECT_TO, FOLDER_NAME_TO_SYNC
)

# Step 1: Create an instance of the container to sync data from, then sync it
project = Project(name="My uniquely named project about Alzheimer's Disease")

# We'll set the `if_collision` to `keep.local` so that we don't overwrite any files
project.sync_from_synapse(path=DIRECTORY_TO_SYNC_PROJECT_TO, if_collision="keep.local")

# Print out the contents of the directory where the data was synced to
# Explore the directory to see the contents have been recursively synced.
print(os.listdir(DIRECTORY_TO_SYNC_PROJECT_TO))

# Step 2: The same as step 1, but for a single folder
folder = Folder(name=FOLDER_NAME_TO_SYNC, parent_id=project.id)

folder.sync_from_synapse(path=DIRECTORY_TO_SYNC_FOLDER_TO, if_collision="keep.local")

print(os.listdir(os.path.expanduser(DIRECTORY_TO_SYNC_FOLDER_TO)))

# Step 3: Loop over all files/folders on the project/folder object instances
for folder_at_root in project.folders:
    print(f"Folder at root: {folder_at_root.name}")

    for file_in_root_folder in folder_at_root.files:
        print(f"File in {folder_at_root.name}: {file_in_root_folder.name}")

    for folder_in_folder in folder_at_root.folders:
        print(f"Folder in {folder_at_root.name}: {folder_in_folder.name}")
        for file_in_folder in folder_in_folder.files:
            print(f"File in {folder_in_folder.name}: {file_in_folder.name}")