Curator
synapseclient.extensions.curator
Synapse Curator Extensions
This module provides library functions for metadata curation tasks in Synapse.
Functions
create_file_based_metadata_task
create_file_based_metadata_task(folder_id: str, curation_task_name: str, instructions: str, attach_wiki: bool = True, entity_view_name: str = 'JSON Schema view', schema_uri: Optional[str] = None, enable_derived_annotations: bool = False, *, synapse_client: Optional[Synapse] = None) -> Tuple[str, str]
Create a file view for a schema-bound folder using schematic.
Creating a file-based metadata curation task with schema binding
In this example, we create an EntityView and CurationTask for file-based metadata curation. If a schema_uri is provided, it will be bound to the folder.
import synapseclient
from synapseclient.extensions.curator import create_file_based_metadata_task

syn = synapseclient.Synapse()
syn.login()

entity_view_id, task_id = create_file_based_metadata_task(
    synapse_client=syn,
    folder_id="syn12345678",
    curation_task_name="BiospecimenMetadataTemplate",
    instructions="Please curate this metadata according to the schema requirements",
    attach_wiki=True,
    entity_view_name="Biospecimen Metadata View",
    schema_uri="sage.schemas.v2571-amp.Biospecimen.schema-0.0.1"
)
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| folder_id | str | The Synapse Folder ID to create the file view for. |
| curation_task_name | str | Name for the CurationTask (used as data_type field). Must be unique within the project; if it matches an existing CurationTask, that task will be updated with new data. |
| instructions | str | Instructions for the curation task. |
| attach_wiki | bool | Whether or not to attach a Synapse Wiki (default: True). |
| entity_view_name | str | Name for the created entity view (default: "JSON Schema view"). |
| schema_uri | Optional[str] | Optional JSON schema URI to bind to the folder. If provided, the schema will be bound to the folder before creating the entity view (e.g., 'sage.schemas.v2571-amp.Biospecimen.schema-0.0.1'). |
| enable_derived_annotations | bool | If true, enable derived annotations. Defaults to False. |
| synapse_client | Optional[Synapse] | If not passed in and caching was not disabled by |
| RETURNS | DESCRIPTION |
|---|---|
| Tuple[str, str] | A tuple containing the Synapse ID of the entity view created and the task ID of the curation task created. |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If required parameters are missing. |
| SynapseError | If there are issues with Synapse operations. |
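The example URIs on this page appear to follow an organization-name-version pattern (for instance, the query_schema_registry examples filter on org="sage.schemas.v2571" and name="ad.Analysis.schema"). The following is a minimal sketch, under that assumption, of splitting a schema_uri into its parts before passing it on. The helper split_schema_uri and the SchemaUriParts type are hypothetical and not part of synapseclient.

```python
from typing import NamedTuple, Optional


class SchemaUriParts(NamedTuple):
    organization: str
    name: str
    version: Optional[str]


def split_schema_uri(schema_uri: str) -> SchemaUriParts:
    """Split an 'organization-name[-version]' style URI into its parts.

    Assumption: parts are separated by '-' and only the version is optional,
    as suggested by the example URIs on this page.
    """
    parts = schema_uri.split("-", 2)
    if len(parts) < 2 or not all(parts):
        raise ValueError(
            f"Expected 'organization-name[-version]', got {schema_uri!r}"
        )
    version = parts[2] if len(parts) == 3 else None
    return SchemaUriParts(parts[0], parts[1], version)


parts = split_schema_uri("sage.schemas.v2571-amp.Biospecimen.schema-0.0.1")
print(parts.organization)  # sage.schemas.v2571
print(parts.name)          # amp.Biospecimen.schema
print(parts.version)       # 0.0.1
```

A check like this can catch a malformed URI locally, before any Synapse call is made.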
Source code in synapseclient/extensions/curator/file_based_metadata_task.py
create_record_based_metadata_task
create_record_based_metadata_task(project_id: str, folder_id: str, record_set_name: str, record_set_description: str, curation_task_name: str, upsert_keys: List[str], instructions: str, schema_uri: str, bind_schema_to_record_set: bool = True, enable_derived_annotations: bool = False, *, synapse_client: Optional[Synapse] = None) -> Tuple[RecordSet, CurationTask, Grid]
Generate and upload CSV templates as a RecordSet for record-based metadata, create a CurationTask, and also create a Grid to bootstrap the ValidationStatistics.
A number of schema URIs that are already registered to Synapse can be found at:
If you have yet to create and register your JSON schema in Synapse, please refer to the tutorial at https://python-docs.synapse.org/en/stable/tutorials/python/json_schema/.
Creating a record-based metadata curation task with a schema URI
In this example, we create a RecordSet and CurationTask for biospecimen metadata
curation using a schema URI. By default this also binds the schema to the
RecordSet; set the bind_schema_to_record_set parameter to False to skip that step.
import synapseclient
from synapseclient.extensions.curator import create_record_based_metadata_task

syn = synapseclient.Synapse()
syn.login()

record_set, task, grid = create_record_based_metadata_task(
    synapse_client=syn,
    project_id="syn12345678",
    folder_id="syn87654321",
    record_set_name="BiospecimenMetadata_RecordSet",
    record_set_description="RecordSet for biospecimen metadata curation",
    curation_task_name="BiospecimenMetadataTemplate",
    upsert_keys=["specimenID"],
    instructions="Please curate this metadata according to the schema requirements",
    schema_uri="schema-org-schema.name.schema-v1.0.0"
)
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| project_id | str | The Synapse ID of the project where the folder exists. |
| folder_id | str | The Synapse ID of the folder to upload to. |
| record_set_name | str | Name for the RecordSet. |
| record_set_description | str | Description for the RecordSet. |
| curation_task_name | str | Name for the CurationTask (used as data_type field). Must be unique within the project; if it matches an existing CurationTask, that task will be updated with new data. |
| upsert_keys | List[str] | List of column names to use as upsert keys. |
| instructions | str | Instructions for the curation task. |
| schema_uri | str | JSON schema URI for the RecordSet schema (e.g., 'sage.schemas.v2571-amp.Biospecimen.schema-0.0.1', 'sage.schemas.v2571-ad.Analysis.schema-0.0.0'). |
| bind_schema_to_record_set | bool | Whether to bind the given schema to the RecordSet (default: True). |
| enable_derived_annotations | bool | If true, enable derived annotations. Defaults to False. |
| synapse_client | Optional[Synapse] | If not passed in and caching was not disabled by |
| RETURNS | DESCRIPTION |
|---|---|
| Tuple[RecordSet, CurationTask, Grid] | Tuple containing the created RecordSet, CurationTask, and Grid objects. |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If required parameters are missing or if schema_uri is not provided. |
| SynapseError | If there are issues with Synapse operations. |
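The upsert_keys parameter names the columns that identify a record. As a rough illustration of general upsert semantics (not of how Synapse applies them internally), the sketch below merges incoming rows into existing ones by key: a matching key updates the row, a new key inserts one. The function upsert_rows and the sample rows are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def upsert_rows(existing: List[Row], incoming: List[Row], upsert_keys: List[str]) -> List[Row]:
    """Merge incoming rows into existing rows, matching on the upsert key columns."""
    merged: Dict[Tuple[str, ...], Row] = {}
    for row in existing + incoming:
        key = tuple(row[column] for column in upsert_keys)
        # A later row with the same key updates the earlier one;
        # a row with a new key is inserted.
        merged[key] = {**merged.get(key, {}), **row}
    return list(merged.values())


existing = [{"specimenID": "S1", "tissue": "blood"}]
incoming = [
    {"specimenID": "S1", "tissue": "brain"},  # same key: update
    {"specimenID": "S2", "tissue": "liver"},  # new key: insert
]
for row in upsert_rows(existing, incoming, upsert_keys=["specimenID"]):
    print(row)
```

Choosing a column (or combination of columns) that is unique per record, such as specimenID in the example above, keeps updates from colliding.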
Source code in synapseclient/extensions/curator/record_based_metadata_task.py
generate_jsonld
generate_jsonld(schema: Any, data_model_labels: DisplayLabelType, output_jsonld: Optional[str], *, synapse_client: Optional[Synapse] = None) -> dict
Convert a CSV data model specification to JSON-LD format with validation and error checking.
This function parses your CSV data model (containing attributes, validation rules,
dependencies, and valid values), converts it to a graph-based JSON-LD representation,
validates the structure for common errors, and saves the result. The generated JSON-LD
file serves as input for generate_jsonschema() and other data model operations.
Data Model Requirements:
Your CSV should include columns defining:
- Attribute names: Property/attribute identifiers
- Display names: Human-readable labels (optional but recommended)
- Descriptions: Documentation for each attribute
- Valid values: Allowed enum values for attributes (comma-separated)
- Validation rules: Rules like list, regex, inRange, required, etc.
- Dependencies: Relationships between attributes using dependsOn
- Required status: Whether attributes are mandatory
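To make the column list above concrete, here is a small sketch that reads a data model CSV with Python's csv module and splits the comma-separated valid values. The column headers and rows are hypothetical placeholders chosen for illustration; the headers your data model actually needs may differ, and this is not how generate_jsonld parses the file.

```python
import csv
import io

# Hypothetical headers and rows, for illustration only.
CSV_TEXT = """Attribute,DisplayName,Description,Valid Values,Validation Rules,DependsOn,Required
specimenID,Specimen ID,Unique specimen identifier,,regex,,True
tissue,Tissue,Tissue of origin,"blood, brain, liver",,,True
tissueSubtype,Tissue Subtype,Subtype of the tissue,,,tissue,False
age,Age,Age in years,,inRange,,False
"""


def split_cell(cell: str) -> list:
    """Split a comma-separated cell into trimmed, non-empty values."""
    return [value.strip() for value in cell.split(",") if value.strip()]


def load_attributes(csv_text: str) -> dict:
    """Return a mapping of attribute name to its parsed definition."""
    attributes = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        attributes[row["Attribute"]] = {
            "display_name": row["DisplayName"],
            "description": row["Description"],
            "valid_values": split_cell(row["Valid Values"]),
            "validation_rules": split_cell(row["Validation Rules"]),
            "depends_on": split_cell(row["DependsOn"]),
            "required": row["Required"].strip().lower() == "true",
        }
    return attributes


attributes = load_attributes(CSV_TEXT)
print(attributes["tissue"]["valid_values"])  # ['blood', 'brain', 'liver']
```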
Validation Checks Performed:
- Ensures all required fields (like displayName) are present
- Detects cycles in attribute dependencies (which would create invalid schemas)
- Checks for blacklisted characters in display names that Synapse doesn't allow
- Validates that attribute names don't conflict with reserved system names
- Verifies the graph structure is a valid directed acyclic graph (DAG)
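The cycle check in the list above can be illustrated with the standard library. This sketch uses graphlib (Python 3.9+) to test whether a set of attribute dependencies forms a DAG; it is an independent illustration, not the validation code generate_jsonld runs, and the function name and sample attributes are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Optional


def find_dependency_cycle(depends_on: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of attribute names, or None if acyclic.

    `depends_on` maps each attribute to the attributes it depends on.
    """
    sorter = TopologicalSorter(depends_on)
    try:
        sorter.prepare()  # raises CycleError if the graph is not a DAG
    except CycleError as error:
        # The second element of args holds the cycle; its first and last nodes match.
        return list(error.args[1])
    return None


acyclic = {"tissueSubtype": ["tissue"], "tissue": []}
cyclic = {"tissueSubtype": ["tissue"], "tissue": ["tissueSubtype"]}

print(find_dependency_cycle(acyclic))  # None
print(find_dependency_cycle(cyclic) is not None)  # True
```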
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| schema | Any | Path to your data model CSV file. This file should contain your complete data model specification with all attributes, validation rules, and relationships. |
| data_model_labels | DisplayLabelType | Label format for the JSON-LD output: |
| output_jsonld | Optional[str] | Path where the JSON-LD file will be saved. If None, saves alongside the input CSV with a |
| synapse_client | Optional[Synapse] | Optional Synapse client instance for logging. If None, creates a new client instance. Use |
Output:
The function logs validation errors and warnings to help you fix data model issues before generating JSON schemas. Errors indicate critical problems that must be fixed, while warnings suggest improvements but won't block schema generation.
| RETURNS | DESCRIPTION |
|---|---|
| dict | The generated data model as a dictionary in JSON-LD format. The same data is also saved to the file path specified in |
Using this function to generate JSONLD Schema files:
Basic usage with default output path:
from synapseclient import Synapse
from synapseclient.extensions.curator import generate_jsonld

syn = Synapse()
syn.login()

jsonld_model = generate_jsonld(
    schema="path/to/my_data_model.csv",
    data_model_labels="class_label",
    output_jsonld=None,  # Saves to my_data_model.jsonld
    synapse_client=syn
)

Specify custom output path:

jsonld_model = generate_jsonld(
    schema="models/patient_model.csv",
    data_model_labels="class_label",
    output_jsonld="~/output/patient_model_v1.jsonld",
    synapse_client=syn
)

Use display labels:

jsonld_model = generate_jsonld(
    schema="my_model.csv",
    data_model_labels="display_label",
    output_jsonld="my_model.jsonld",
    synapse_client=syn
)
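The first example notes that passing output_jsonld=None saves to my_data_model.jsonld next to the input CSV. A minimal sketch of that default, assuming the output name is simply the input path with a .jsonld suffix, could look like this. The helper resolve_output_jsonld is hypothetical and only mirrors the behaviour described in the example comment.

```python
from pathlib import Path
from typing import Optional


def resolve_output_jsonld(schema: str, output_jsonld: Optional[str]) -> Path:
    """Pick the JSON-LD output path: the explicit one, or the CSV path with a .jsonld suffix."""
    if output_jsonld is not None:
        return Path(output_jsonld)
    # Assumption: the default sits alongside the input CSV and swaps the extension.
    return Path(schema).with_suffix(".jsonld")


print(resolve_output_jsonld("path/to/my_data_model.csv", None))
print(resolve_output_jsonld("my_model.csv", "my_model.jsonld"))
```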
Source code in synapseclient/extensions/curator/schema_generation.py
generate_jsonschema
generate_jsonschema(data_model_source: str, output_directory: str, data_type: Optional[list[str]], data_model_labels: DisplayLabelType, synapse_client: Synapse) -> tuple[list[dict[str, Any]], list[str]]
Generate JSON Schema validation files from a data model with validation rules.
This function creates JSON Schema files that enforce validation rules defined in your CSV data model. The generated schemas can validate manifests for required fields, data types, valid values (enums), ranges, regex patterns, conditional dependencies, and more.
Validation Rules Supported:
- Type validation: Enforces string, number, integer, or boolean types
- Valid values: Creates enum constraints from valid values in the data model
- Required fields: Marks attributes as required (can be component-specific)
- Range validation: Translates inRange rules to min/max constraints
- Pattern matching: Converts regex rules to JSON Schema patterns
- Format validation: Applies date (ISO date) and url (URI) format constraints
- Array validation: Handles list rules for array-type properties
- Conditional dependencies: Creates if/then schemas for dependent attributes
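The rules listed above map onto standard JSON Schema keywords. The sketch below builds small schema fragments by hand to show what each kind of constraint looks like: minimum/maximum for ranges, pattern for regular expressions, enum for valid values, array/items for lists, and if/then for conditional requirements. The helper names and arguments are hypothetical, and the schemas that generate_jsonschema actually writes may be structured differently.

```python
import json
from typing import Any, Dict, Optional, Sequence, Tuple


def property_schema(
    json_type: str = "string",
    valid_values: Optional[Sequence[str]] = None,
    in_range: Optional[Tuple[float, float]] = None,
    regex: Optional[str] = None,
    is_list: bool = False,
) -> Dict[str, Any]:
    """Build a JSON Schema fragment for one property from simple rule inputs."""
    schema: Dict[str, Any] = {"type": json_type}
    if valid_values:
        schema["enum"] = list(valid_values)  # valid values -> enum
    if in_range:
        schema["minimum"], schema["maximum"] = in_range  # inRange -> min/max
    if regex:
        schema["pattern"] = regex  # regex -> pattern
    if is_list:
        schema = {"type": "array", "items": schema}  # list -> array of items
    return schema


def conditional_requirement(attribute: str, value: str, dependent: str) -> Dict[str, Any]:
    """Require `dependent` whenever `attribute` equals `value` (if/then)."""
    return {
        "if": {"properties": {attribute: {"const": value}}, "required": [attribute]},
        "then": {"required": [dependent]},
    }


print(json.dumps(property_schema("number", in_range=(0, 120))))
print(json.dumps(property_schema(valid_values=["blood", "brain"], is_list=True)))
print(json.dumps(conditional_requirement("tissue", "brain", "brainRegion")))
```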
Component-Based Rules:
Rules can be applied selectively to specific components using the #Component syntax
in your validation rules. This allows different validation behavior per manifest type.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| data_model_source | str | Path to the data model file (CSV or JSONLD) or URL to the raw JSONLD. Can accept: |
| output_directory | str | Directory path where JSON Schema files will be saved. Each component will generate a separate |
| data_type | Optional[list[str]] | List of specific component names (data types) to generate schemas for. If None, generates schemas for all components in the data model. |
| data_model_labels | DisplayLabelType | Label format for properties in the generated schema: |
| synapse_client | Synapse | Synapse client instance for logging. Use |
| RETURNS | DESCRIPTION |
|---|---|
| tuple[list[dict[str, Any]], list[str]] | A tuple containing a list of JSON schema dictionaries, each corresponding to a component, and a list of file paths where the schemas were written. |
Using this function to generate JSON Schema files:
Generate schemas from a CSV data model:
from synapseclient import Synapse
from synapseclient.extensions.curator import generate_jsonschema

syn = Synapse()
syn.login()

schemas, file_paths = generate_jsonschema(
    data_model_source="path/to/model.csv",
    output_directory="./schemas",
    data_type=None,  # All components
    data_model_labels="class_label",
    synapse_client=syn
)

Generate schemas from a JSONLD data model:

schemas, file_paths = generate_jsonschema(
    data_model_source="path/to/model.jsonld",
    output_directory="./schemas",
    data_type=None,  # All components
    data_model_labels="class_label",
    synapse_client=syn
)

Generate schema for specific components:

schemas, file_paths = generate_jsonschema(
    data_model_source="https://example.com/model.jsonld",
    output_directory="./validation_schemas",
    data_type=["Patient", "Biospecimen"],
    data_model_labels="class_label",
    synapse_client=syn
)
Source code in synapseclient/extensions/curator/schema_generation.py
query_schema_registry
query_schema_registry(synapse_client: Optional[Synapse] = None, schema_registry_table_id: Optional[str] = None, column_config: Optional[SchemaRegistryColumnConfig] = None, return_latest_only: bool = True, **filters) -> Union[str, List[str], None]
Query the schema registry table to find schemas matching the provided filters.
This function searches the Synapse schema registry table for schemas that match the provided filter parameters. Results are sorted by version in descending order (newest first). The function supports any number of filter parameters as long as they are configured in the column_config.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| synapse_client | Optional[Synapse] | Optional authenticated Synapse client instance. |
| schema_registry_table_id | Optional[str] | Optional Synapse ID of the schema registry table. If None, uses the default table ID. |
| column_config | Optional[SchemaRegistryColumnConfig] | Optional configuration for custom column names. If None, uses default configuration ('version' and 'uri' columns). |
| return_latest_only | bool | If True (default), returns only the latest URI as a string. If False, returns all matching URIs as a list of strings. |
| **filters | | Filter parameters to search for matching schemas. These work as follows: |
| RETURNS | DESCRIPTION |
|---|---|
| Union[str, List[str], None] | If return_latest_only is True: Single URI string of the latest version, or None if not found. |
| Union[str, List[str], None] | If return_latest_only is False: List of URI strings sorted by version (highest version first). |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If no filter parameters are provided. |
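The return shapes described above (a single string, a list, or None) can be sketched with plain Python. This example sorts (version, uri) pairs newest first and then applies return_latest_only. It assumes dotted numeric versions such as "12.0.0" and returns an empty list when nothing matches and return_latest_only is False; both are assumptions, and select_uris is a hypothetical helper rather than the library's implementation.

```python
from typing import List, Tuple, Union


def version_key(version: str) -> Tuple[int, ...]:
    """Turn '12.0.0' into (12, 0, 0) so versions compare numerically, not as text."""
    return tuple(int(part) for part in version.split("."))


def select_uris(
    rows: List[Tuple[str, str]], return_latest_only: bool = True
) -> Union[str, List[str], None]:
    """Order (version, uri) rows newest first and shape the result."""
    ordered = sorted(rows, key=lambda row: version_key(row[0]), reverse=True)
    uris = [uri for _, uri in ordered]
    if return_latest_only:
        return uris[0] if uris else None
    return uris


rows = [
    ("9.0.0", "sage.schemas.v2571-mc2.Biospecimen.schema-9.0.0"),
    ("12.0.0", "MultiConsortiaCoordinatingCenter-Biospecimen-12.0.0"),
]
print(select_uris(rows))  # MultiConsortiaCoordinatingCenter-Biospecimen-12.0.0
print(select_uris(rows, return_latest_only=False))
print(select_uris([]))  # None
```

Comparing versions as integer tuples matters here: as plain strings, "9.0.0" would sort above "12.0.0".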
Expected Table Structure
The schema registry table should contain columns for:
- Schema version for sorting (default: 'version')
- JSON schema URI (default: 'uri')
- Any filterable columns as configured in column_config
Additional columns may be present and will be included in results.
Comprehensive filter usage demonstrations
This includes several examples of how to use the filtering system.
Basic Filtering (using default filters):
from synapseclient import Synapse
from synapseclient.extensions.curator import query_schema_registry

syn = Synapse()
syn.login()

# 1. Get latest schema URI for a specific DCC and datatype
latest_uri = query_schema_registry(
    synapse_client=syn,
    dcc="ad",  # Exact match for Alzheimer's Disease DCC
    datatype="Analysis"  # Exact datatype match
)
# Returns: "sage.schemas.v2571-ad.Analysis.schema-0.0.0"

# 2. Get all versions of matching schemas (not just latest)
all_versions = query_schema_registry(
    synapse_client=syn,
    dcc="mc2",
    datatype="Biospecimen",
    return_latest_only=False
)
# Returns: ["MultiConsortiaCoordinatingCenter-Biospecimen-12.0.0",
#           "sage.schemas.v2571-mc2.Biospecimen.schema-9.0.0"]

# 3. Find all "Biospecimen" schemas across all DCCs
#    (exact datatype match, no dcc filter and no wildcard)
biospecimen_schemas = query_schema_registry(
    synapse_client=syn,
    datatype="Biospecimen",  # Exact match for Biospecimen
    return_latest_only=False
)
# Returns: ["MultiConsortiaCoordinatingCenter-Biospecimen-12.0.0",
#           "sage.schemas.v2571-mc2.Biospecimen.schema-9.0.0",
#           "sage.schemas.v2571-veo.Biospecimen.schema-0.3.0",
#           "sage.schemas.v2571-amp.Biospecimen.schema-0.0.1"]

# 4. Pattern matching with wildcards for DCC variations
mc2_schemas = query_schema_registry(
    synapse_client=syn,
    dcc="%C2",  # Matches 'mc2' and 'MC2'
    return_latest_only=False
)
# Returns schemas from both 'mc2' and 'MC2' DCCs

# 5. Using additional columns for filtering (if they exist in your table)
specific_schemas = query_schema_registry(
    synapse_client=syn,
    dcc="amp",  # Must be AMP DCC
    org="sage.schemas.v2571",  # Must match organization
    return_latest_only=False
)
# Returns schemas that match BOTH conditions
Direct Column Filtering (simplified approach):
# Any column in the schema registry table can be used for filtering
# Just use the column name directly as a keyword argument

# Basic filters using standard columns
query_schema_registry(dcc="ad", datatype="Analysis")
query_schema_registry(version="0.0.0")
query_schema_registry(uri="sage.schemas.v2571-ad.Analysis.schema-0.0.0")

# Additional columns (if they exist in your table)
query_schema_registry(org="sage.schemas.v2571")
query_schema_registry(name="ad.Analysis.schema")

# Multiple column filters (all must match)
query_schema_registry(
    dcc="mc2",
    datatype="Biospecimen",
    org="MultiConsortiaCoordinatingCenter"
)
Filter Value Examples with Real Data:
# Exact matching
query_schema_registry(dcc="ad")  # Returns schemas with dcc="ad"
query_schema_registry(datatype="Biospecimen")  # Returns schemas with datatype="Biospecimen"
query_schema_registry(dcc="MC2")  # Returns schemas with dcc="MC2" (case sensitive)

# Pattern matching with wildcards
query_schema_registry(dcc="%C2")  # Matches "mc2", "MC2"
query_schema_registry(datatype="%spec%")  # Matches "Biospecimen"

# Examples with expected results:
query_schema_registry(dcc="ad", datatype="Analysis")
# Returns: "sage.schemas.v2571-ad.Analysis.schema-0.0.0"

query_schema_registry(datatype="Biospecimen", return_latest_only=False)
# Returns: ["MultiConsortiaCoordinatingCenter-Biospecimen-12.0.0",
#           "sage.schemas.v2571-mc2.Biospecimen.schema-9.0.0", ...]

# Multiple conditions (all must be true)
query_schema_registry(
    dcc="amp",  # AND
    datatype="Biospecimen",  # AND
    org="sage.schemas.v2571"  # AND (if org column exists)
)
# Returns: "sage.schemas.v2571-amp.Biospecimen.schema-0.0.1"
# (a single string, because return_latest_only defaults to True)
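The examples above combine exact values and % wildcards, with every filter required to match. One way to picture this is as a SQL-style WHERE clause in which each keyword argument becomes one condition. The sketch below assumes that values containing % map to LIKE and all others to =; this mapping is an assumption made for illustration, and build_where_clause is a hypothetical helper, not the query the library actually issues.

```python
from typing import Dict


def build_where_clause(filters: Dict[str, str]) -> str:
    """Combine column filters into a single AND-joined WHERE clause."""
    if not filters:
        # Mirrors the documented ValueError when no filters are provided.
        raise ValueError("At least one filter parameter is required")
    conditions = []
    for column, value in filters.items():
        # Assumption: a '%' in the value signals a wildcard (LIKE) match.
        operator = "LIKE" if "%" in value else "="
        escaped = value.replace("'", "''")  # escape single quotes in the literal
        conditions.append(f"{column} {operator} '{escaped}'")
    return " AND ".join(conditions)


print(build_where_clause({"dcc": "ad", "datatype": "Analysis"}))
# dcc = 'ad' AND datatype = 'Analysis'
print(build_where_clause({"dcc": "%C2"}))
# dcc LIKE '%C2'
```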
Source code in synapseclient/extensions/curator/schema_registry.py