Datasets
Datasets in Synapse are a way to organize, annotate, and publish sets of files for others to use. Datasets behave similarly to Tables and EntityViews, but provide some default behavior that makes it easy to put a group of files together.
This tutorial walks through the basics of working with datasets using the Synapse Python Client.
Tutorial Purpose
In this tutorial, you will:
- Create a dataset
- Add files to the dataset
- Query the dataset
- Add a custom column to the dataset
- Save a snapshot of the dataset
Prerequisites
- This tutorial assumes that you have a project in Synapse with one or more files in it. To test all of the ways to add files to a dataset, you will need at least 3 files in your project. A structure like this is recommended:
    Project
    ├── File 1
    ├── File 2
    ├── Folder 1
    │   ├── File 4
    │   ├── ...
- Pandas must be installed as shown in the installation documentation.
1. Get the ID of your Synapse project
Let's get started by authenticating with Synapse and retrieving the ID of your project.
import pandas as pd
from synapseclient import Synapse
from synapseclient.models import (
    Column,
    ColumnType,
    Dataset,
    EntityRef,
    File,
    Folder,
    Project,
)
# First, let's get the project that we want to create the dataset in
syn = Synapse()
syn.login()
project = Project(
    name="My uniquely named project about Alzheimer's Disease"
).get()  # Replace with your project name
project_id = project.id
print(f"My project ID is {project_id}")
2. Create your Dataset
Next, we will create the dataset. We will use the project ID to tell Synapse where we want the dataset to be created. After this step, we will have a Dataset object with all of the needed information to start building the dataset.
my_new_dataset = Dataset(parent_id=project_id, name="My New Dataset").store()
print(f"My Dataset's ID is {my_new_dataset.id}")
Because we haven't added any files yet, the dataset is empty. If you view the dataset's schema in the UI, however, you will notice that datasets come with default columns that describe each file added to the dataset.
3. Add files to the dataset
Let's add some files to the dataset now. There are three ways to add files to a dataset:
- Add an Entity Reference to a file with its ID and version
my_new_dataset.add_item(
    EntityRef(id="syn51790029", version=1)
)  # Replace with the ID of the file you want to add
- Add a File with its ID and version
my_new_dataset.add_item(
    File(id="syn51790028", version_number=1)
)  # Replace with the ID of the file you want to add
- Add a Folder. When adding a folder, all child files inside of the folder are added to the dataset recursively.
my_new_dataset.add_item(
    Folder(id="syn64893446")
)  # Replace with the ID of the folder you want to add
Whenever we make changes to the dataset, we need to call the store() method to save the changes to Synapse.
my_new_dataset.store()
And now we are able to see our dataset with all of the files that we added to it.
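Because the IDs passed to add_item are typed or pasted in by hand, it can help to sanity-check them before building the dataset. The sketch below is not part of the Synapse Python Client: `is_synapse_id` and `split_valid_ids` are hypothetical helpers, and the check assumes only that a Synapse ID is the prefix `syn` followed by digits.

```python
import re

# Assumption: a Synapse ID is "syn" followed by one or more digits.
_SYNAPSE_ID = re.compile(r"syn\d+")


def is_synapse_id(value: str) -> bool:
    """Return True if value looks like a Synapse ID such as 'syn51790029'."""
    return bool(_SYNAPSE_ID.fullmatch(value.strip()))


def split_valid_ids(candidates):
    """Split candidate strings into (valid, invalid) lists, preserving order."""
    valid, invalid = [], []
    for candidate in candidates:
        cleaned = candidate.strip()
        (valid if is_synapse_id(cleaned) else invalid).append(cleaned)
    return valid, invalid


valid, invalid = split_valid_ids(
    ["syn51790029", " syn51790028 ", "51790028", "synABC"]
)
print(valid)
print(invalid)
```

This only checks the shape of the ID. It cannot tell you whether the entity exists or whether you have permission to read it.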
4. Retrieve the dataset
Now that we have a dataset with some files in it, we can retrieve the dataset from Synapse the next time we need to use it.
my_retrieved_dataset = Dataset(id=my_new_dataset.id).get()
print(f"My Dataset's ID is {my_retrieved_dataset.id}")
print(len(my_retrieved_dataset.items))
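Printing the item count confirms that something was stored, but not that every file you expected is present. As a minimal sketch, assuming each entry in `items` exposes an `id` attribute (as the `EntityRef` used earlier does), the hypothetical helper `missing_ids` below reports expected IDs that are absent. `ItemStub` is a stand-in so the example runs without a Synapse connection, and `syn00000000` is a made-up ID.

```python
from dataclasses import dataclass


@dataclass
class ItemStub:
    """Hypothetical stand-in for a dataset item; only `id` is inspected."""

    id: str
    version: int


def missing_ids(items, expected_ids):
    """Return the expected IDs that do not appear among the items, in order."""
    present = {item.id for item in items}
    return [entity_id for entity_id in expected_ids if entity_id not in present]


# Stand-in for my_retrieved_dataset.items
items = [ItemStub("syn51790029", 1), ItemStub("syn51790028", 1)]
print(missing_ids(items, ["syn51790028", "syn51790029", "syn00000000"]))
```

With a real dataset, you would pass `my_retrieved_dataset.items` in place of the stub list.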
5. Query the dataset
With the dataset retrieved, we can query it to find files that match certain criteria. The query below selects every row whose name contains "test".
rows = Dataset.query(
    query=f"SELECT * FROM {my_retrieved_dataset.id} WHERE name like '%test%'"
)
print(rows)
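If the search term comes from user input rather than a literal, building the query string in one place keeps quoting consistent. The following is a sketch only: `name_like_query` is a hypothetical helper, `syn123` is a placeholder ID, and it assumes that doubling a single quote escapes it inside a string literal, as in standard SQL.

```python
def name_like_query(dataset_id: str, fragment: str) -> str:
    """Build a SELECT that matches rows whose name contains `fragment`."""
    # Assumption: doubling a single quote escapes it inside a string literal.
    escaped = fragment.replace("'", "''")
    return f"SELECT * FROM {dataset_id} WHERE name LIKE '%{escaped}%'"


print(name_like_query("syn123", "test"))
```

The returned string can be passed as the `query` argument in the same way as the f-string used in this step.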
6. Add a custom column to the dataset
We can also add a custom column to the dataset. This will allow us to annotate files in the dataset with additional information.
my_retrieved_dataset.add_column(
    column=Column(
        name="my_annotation",
        column_type=ColumnType.STRING,
    )
)
my_retrieved_dataset.store()
Our custom column isn't all that useful empty, so let's update the dataset with some values.
# Each column is a list so that every row's values line up by position
modified_data = pd.DataFrame(
    {
        "id": ["syn51790028"],  # The ID of one of our Files
        "my_annotation": ["excellent data"],
    }
)
my_retrieved_dataset.update_rows(
    values=modified_data, primary_keys=["id"], dry_run=False
)
7. Save a snapshot of the dataset
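When several files need a value at once, it is easy for the `id` list and the annotation list to drift out of step. One way to avoid that, sketched below with a hypothetical helper named `annotation_columns`, is to start from a mapping of file ID to value and derive both columns from it. The second annotation value is invented for illustration.

```python
def annotation_columns(annotations, column_name="my_annotation"):
    """Turn {file_id: value} into column-oriented data with one row per file."""
    return {
        "id": list(annotations.keys()),
        column_name: list(annotations.values()),
    }


columns = annotation_columns(
    {
        "syn51790028": "excellent data",
        "syn51790029": "needs review",  # illustrative value
    }
)
print(columns)
```

The resulting dictionary has the same shape as the one passed to `pd.DataFrame` in this step, so `pd.DataFrame(columns)` can be used as the `values` argument with `primary_keys=["id"]`.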
Finally, let's save a snapshot of the dataset. This creates a read-only version of the dataset that captures the current state of the dataset and can be referenced later.
snapshot_info = my_retrieved_dataset.snapshot(
    comment="My first snapshot",
    label="My first snapshot",
)
print(snapshot_info)
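A snapshot is useful mainly because it can be queried later without picking up newer changes. The sketch below assumes that a snapshot version is addressed as `<id>.<version>` in the FROM clause, which is how versioned tables are referenced in Synapse queries. `snapshot_query` is a hypothetical helper and `syn123` is a placeholder ID. The tutorial does not show the layout of `snapshot_info`, so the version number is passed in explicitly.

```python
def snapshot_query(dataset_id: str, version: int) -> str:
    """Build a SELECT against one snapshot version of a dataset."""
    if version < 1:
        raise ValueError("version must be a positive integer")
    # Assumption: a snapshot is addressed as '<id>.<version>' in the FROM clause.
    return f"SELECT * FROM {dataset_id}.{version}"


print(snapshot_query("syn123", 1))
```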
Source Code for this Tutorial
"""Here is where you'll find the code for the dataset tutorial."""
import pandas as pd
from synapseclient import Synapse
from synapseclient.models import (
    Column,
    ColumnType,
    Dataset,
    EntityRef,
    File,
    Folder,
    Project,
)
# First, let's get the project that we want to create the dataset in
syn = Synapse()
syn.login()
project = Project(
    name="My uniquely named project about Alzheimer's Disease"
).get()  # Replace with your project name
project_id = project.id
print(f"My project ID is {project_id}")
# Next, let's create the dataset. We'll use the project id as the parent id.
# To begin, the dataset will be empty, but if you view the dataset's schema in the UI,
# you will notice that datasets come with default columns.
my_new_dataset = Dataset(parent_id=project_id, name="My New Dataset").store()
print(f"My Dataset's ID is {my_new_dataset.id}")
# Now, let's add some files to the dataset. There are three ways to add files to a dataset:
# 1. Add an Entity Reference to a file with its ID and version
my_new_dataset.add_item(
    EntityRef(id="syn51790029", version=1)
)  # Replace with the ID of the file you want to add
# 2. Add a File with its ID and version
my_new_dataset.add_item(
    File(id="syn51790028", version_number=1)
)  # Replace with the ID of the file you want to add
# 3. Add a Folder. In this case, all child files of the folder are added to the dataset recursively.
my_new_dataset.add_item(
    Folder(id="syn64893446")
)  # Replace with the ID of the folder you want to add
# Our changes won't be persisted to Synapse until we call the store() method.
my_new_dataset.store()
# Now that our Dataset with all of our files has been created, the next time
# we want to use it, we can retrieve it from Synapse.
my_retrieved_dataset = Dataset(id=my_new_dataset.id).get()
print(f"My Dataset's ID is {my_retrieved_dataset.id}")
print(len(my_retrieved_dataset.items))
# If you want to query your dataset for files that match certain criteria, you can do so
# using the query method.
rows = Dataset.query(
    query=f"SELECT * FROM {my_retrieved_dataset.id} WHERE name like '%test%'"
)
print(rows)
# In addition to the default columns, you may want to annotate items in your dataset using
# custom columns.
my_retrieved_dataset.add_column(
    column=Column(
        name="my_annotation",
        column_type=ColumnType.STRING,
    )
)
my_retrieved_dataset.store()
# Now that our custom column has been added, we can update the dataset with new values.
modified_data = pd.DataFrame(
    {
        "id": ["syn51790028"],  # The ID of one of our Files
        "my_annotation": ["excellent data"],
    }
)
my_retrieved_dataset.update_rows(
    values=modified_data, primary_keys=["id"], dry_run=False
)
# Finally, let's save a snapshot of the dataset.
snapshot_info = my_retrieved_dataset.snapshot(
    comment="My first snapshot",
    label="My first snapshot",
)
print(snapshot_info)