Tables

Synapse Tables enable storage of tabular data in Synapse in a form that can be queried using a SQL-like query language.

Tables is a BETA feature

The tables feature is in the beta stage. Please report bugs via JIRA.

A table has a Schema and holds a set of rows conforming to that schema.

A Schema is defined in terms of Column objects that specify types from the following choices: STRING, DOUBLE, INTEGER, BOOLEAN, DATE, ENTITYID, FILEHANDLEID.
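
As a quick sketch (the column names here are illustrative and not part of the example that follows), one Column can be declared for each of these types:

example_columns = [
    Column(name='label', columnType='STRING', maximumSize=64),
    Column(name='score', columnType='DOUBLE'),
    Column(name='count', columnType='INTEGER'),
    Column(name='passed_qc', columnType='BOOLEAN'),
    Column(name='measured_on', columnType='DATE'),
    Column(name='source_entity', columnType='ENTITYID'),
    Column(name='attachment', columnType='FILEHANDLEID')]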

Example

Preliminaries:

import synapseclient
from synapseclient import Project, File, Folder
from synapseclient import Schema, Column, Table, Row, RowSet, as_table_columns

syn = synapseclient.Synapse()
syn.login()

project = syn.get('syn123')

To create a Table, you first need to create a Table Schema. This defines the columns of the table:

cols = [
    Column(name='Name', columnType='STRING', maximumSize=20),
    Column(name='Chromosome', columnType='STRING', maximumSize=20),
    Column(name='Start', columnType='INTEGER'),
    Column(name='End', columnType='INTEGER'),
    Column(name='Strand', columnType='STRING', enumValues=['+', '-'], maximumSize=1),
    Column(name='TranscriptionFactor', columnType='BOOLEAN')]

schema = Schema(name='My Favorite Genes', columns=cols, parent=project)

Next, let’s load some data. Let’s say we had a file, genes.csv:

Name,Chromosome,Start,End,Strand,TranscriptionFactor
foo,1,12345,12600,+,False
arg,2,20001,20200,+,False
zap,2,30033,30999,-,False
bah,1,40444,41444,-,False
bnk,1,51234,54567,+,True
xyz,1,61234,68686,+,False

Let’s store that in Synapse:

table = Table(schema, "/path/to/genes.csv")
table = syn.store(table)

The Table() function takes two arguments: a schema object and data in some form, which can be any of the following (each form is sketched after this list):

  • a path to a CSV file
  • a Pandas DataFrame
  • a RowSet object
  • a list of lists where each of the inner lists is a row
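
For instance, here is a minimal sketch of each form, reusing the schema and cols objects created above (the DataFrame form assumes a df like the one built in the Pandas section below):

## a path to a CSV file
table = Table(schema, "/path/to/genes.csv")

## a Pandas DataFrame
table = Table(schema, df)

## a RowSet object
rowset = RowSet(columns=cols, schema=schema,
                rows=[Row(['abc', '1', 1000, 2000, '+', False])])
table = Table(schema, rowset)

## a list of lists, one inner list per row
table = Table(schema, [['abc', '1', 1000, 2000, '+', False],
                       ['def', '2', 3000, 4000, '-', True]])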

With a bit of luck, we now have a table populated with data. Let’s try to query:

results = syn.tableQuery("select * from %s where Chromosome='1' and Start < 41000 and End > 20000" % table.schema.id)
for row in results:
    print(row)

Pandas

Pandas is a popular library for working with tabular data. If you have Pandas installed, the goal is that Synapse Tables will play nice with it.

Create a Synapse Table from a DataFrame:

import pandas as pd

df = pd.read_csv("/path/to/genes.csv", index_col=False)
schema = Schema(name='My Favorite Genes', columns=as_table_columns(df), parent=project)
table = syn.store(Table(schema, df))

Get query results as a DataFrame:

results = syn.tableQuery("select * from %s where Chromosome='2'" % table.schema.id)
df = results.asDataFrame()

Changing Data

Once the schema is settled, changes come in two flavors: appending new rows and updating existing ones.

Appending new rows is fairly straightforward. To continue the previous example, we might add some new genes from another file:

table = syn.store(Table(table.schema.id, "/path/to/more_genes.csv"))

To quickly add a few rows, use a list of row data:

new_rows = [["Qux1", "4", 201001, 202001, "+", False],
            ["Qux2", "4", 203001, 204001, "+", False]]
table = syn.store(Table(schema, new_rows))

Updating rows requires an etag, which identifies the most recent change set plus row IDs and version numbers for each row to be modified. We get those by querying before updating. Minimizing changesets to contain only rows that actually change will make processing faster.

For example, let’s update the names of some of our favorite genes:

results = syn.tableQuery("select * from %s where Chromosome='1'" % table.schema.id)
df = results.asDataFrame()
df['Name'] = ['rzing', 'zing1', 'zing2', 'zing3']

Note that we’re propagating the etag from the query results. Without it, we’d get an error saying something about an “Invalid etag”:

table = syn.store(Table(schema, df, etag=results.etag))

The etag is used by the server to prevent concurrent users from making conflicting changes, a technique called optimistic concurrency. In case of a conflict, your update may be rejected. You then have to do another query and try your update again.
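
For instance, a minimal retry loop might look like the sketch below, assuming a rejected update surfaces as a SynapseHTTPError (the exact exception type may vary between client versions):

from synapseclient.exceptions import SynapseHTTPError

for attempt in range(3):
    results = syn.tableQuery("select * from %s where Chromosome='1'" % table.schema.id)
    df = results.asDataFrame()
    df['Name'] = ['rzing', 'zing1', 'zing2', 'zing3']
    try:
        table = syn.store(Table(schema, df, etag=results.etag))
        break
    except SynapseHTTPError:
        ## someone else changed the table since our query; re-query and retry
        continue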

Changing Table Structure

Adding columns can be done using the addColumn() or addColumns() methods on the Schema object:

schema = syn.get("syn000000")
bday_column = syn.store(Column(name='birthday', columnType='DATE'))
schema.addColumn(bday_column)
schema = syn.store(schema)

Renaming or otherwise modifying a column involves removing the column and adding a new column:

cols = syn.getTableColumns(schema)
for col in cols:
    if col.name == "birthday":
        schema.removeColumn(col)
bday_column2 = syn.store(Column(name='birthday2', columnType='DATE'))
schema.addColumn(bday_column2)
schema = syn.store(schema)

Table attached files

Synapse tables support a special column type called 'File' (columnType FILEHANDLEID) which contains a file handle, an identifier of a file stored in Synapse. Here’s an example of how to upload files into Synapse, associate them with a table and read them back later:

import os

## your synapse project
project = syn.get(...)

covers_dir = '/path/to/album/covers/'

## store the table's schema
cols = [
    Column(name='artist', columnType='STRING', maximumSize=50),
    Column(name='album', columnType='STRING', maximumSize=50),
    Column(name='year', columnType='INTEGER'),
    Column(name='catalog', columnType='STRING', maximumSize=50),
    Column(name='cover', columnType='FILEHANDLEID')]
schema = syn.store(Schema(name='Jazz Albums', columns=cols, parent=project))

## the actual data
data = [["John Coltrane",  "Blue Train",   1957, "BLP 1577", "coltraneBlueTrain.jpg"],
        ["Sonny Rollins",  "Vol. 2",       1957, "BLP 1558", "rollinsBN1558.jpg"],
        ["Sonny Rollins",  "Newk's Time",  1958, "BLP 4001", "rollinsBN4001.jpg"],
        ["Kenny Burrel",   "Kenny Burrel", 1956, "BLP 1543", "burrellWarholBN1543.jpg"]]

## upload album covers
for row in data:
    file_handle = syn._uploadToFileHandleService(os.path.join(covers_dir, row[4]))
    row[4] = file_handle['id']

## store the table data
row_reference_set = syn.store(RowSet(columns=cols, schema=schema, rows=[Row(r) for r in data]))

## Later, we'll want to query the table and download our album covers
results = syn.tableQuery("select artist, album, year, catalog, cover from %s where artist = 'Sonny Rollins'" % schema.id, resultsAs="rowset")
for row in results:
    file_info = syn.downloadTableFile(results, rowId=row.rowId, versionNumber=row.versionNumber, column='cover')
    print("%s_%s" % (row.rowId, row.versionNumber), ", ".join(str(a) for a in row.values), file_info['path'])

Deleting rows

Query for the rows you want to delete and call syn.delete on the results:

results = syn.tableQuery("select * from %s where Chromosome='2'" % table.schema.id)
syn.delete(results.asRowSet())

Deleting the whole table

Deleting the schema deletes the whole table and all rows:

syn.delete(schema)

Queries

The query language is quite similar to SQL select statements, except that joins are not supported. The documentation for the Synapse API has lots of query examples.
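
To give a feel for the syntax, queries along these lines are typically possible (syn1234 stands in for a real table ID; consult the API docs for the full grammar):

results = syn.tableQuery("select * from syn1234")
results = syn.tableQuery("select Name, Chromosome from syn1234 where Start > 100000")
results = syn.tableQuery("select * from syn1234 order by Start desc limit 10")
results = syn.tableQuery("select Chromosome, count(*) from syn1234 group by Chromosome")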

Schema

class synapseclient.table.Schema(name=None, columns=None, parent=None, properties=None, annotations=None, local_state=None, **kwargs)

A Schema is a synapseclient.entity.Entity that defines a set of columns in a table.

Parameters:
  • name – give the Table Schema object a name
  • description – a description of the table
  • columns – a list of Column objects or their IDs
  • parent – the project (file a bug if you’d like folders supported) in Synapse to which this table belongs

Example:

cols = [Column(name='Isotope', columnType='STRING'),
        Column(name='Atomic Mass', columnType='INTEGER'),
        Column(name='Halflife', columnType='DOUBLE'),
        Column(name='Discovered', columnType='DATE')]

schema = syn.store(Schema(name='MyTable', columns=cols, parent=project))

addColumn(column)

Parameters: column – a Column object or its ID

addColumns(columns)

Parameters: columns – a list of Column objects or their IDs

has_columns()

Does this schema have columns specified?

removeColumn(column)

Parameters: column – a Column object or its ID
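
For example, continuing the isotope table above, a couple of columns could be added in one call (a sketch; the new column names are made up):

schema.addColumns([
    Column(name='Decay Mode', columnType='STRING', maximumSize=20),
    Column(name='Stable', columnType='BOOLEAN')])
schema = syn.store(schema)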

Column

class synapseclient.table.Column(**kwargs)

Defines a column to be used in a table synapseclient.table.Schema.

Variables:

id – An immutable ID issued by the platform

Parameters:
  • columnType (string) – Can be any of: “STRING”, “DOUBLE”, “INTEGER”, “BOOLEAN”, “DATE”, “FILEHANDLEID”, “ENTITYID”
  • maximumSize (integer) – A parameter for columnTypes with a maximum size. For example, columns of type STRING have a default maximum size of 50 characters, but can be set to a maximumSize between 1 and 1000 characters.
  • name (string) – The display name of the column
  • enumValues (array of strings) – Columns of type STRING can be constrained to the set of enumeration values given in this list.
  • defaultValue (string) – The default value for this column. Columns of type FILEHANDLEID and ENTITYID are not allowed to have default values.
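
For example (a small sketch; the column definitions are illustrative):

strand_column = Column(name='Strand', columnType='STRING', maximumSize=1,
                       enumValues=['+', '-'], defaultValue='+')
cover_column = Column(name='cover', columnType='FILEHANDLEID')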

Row

class synapseclient.table.Row(values, rowId=None, versionNumber=None)

A row in a Table.

Parameters:
  • values – A list of values
  • rowId – The immutable ID issued to a new row
  • versionNumber – The version number of this row. Each row version is immutable, so when a row is updated a new version is created.
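
For example, rows for the genes table defined earlier could be built and stored like this (a sketch; cols and schema refer to the column list and schema created in the genes example above):

rows = [Row(['abc', '1', 1000, 2000, '+', False]),
        Row(['def', '2', 3000, 4000, '-', True])]
syn.store(RowSet(columns=cols, schema=schema, rows=rows))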

Table

class synapseclient.table.TableAbstractBaseClass(schema, headers=None, etag=None)

Abstract base class for Tables based on different data containers.

class synapseclient.table.RowSetTable(schema, rowset)

A Table object that wraps a RowSet.

class synapseclient.table.TableQueryResult(synapse, query, limit=None, offset=None, isConsistent=True)

An object to wrap rows returned as a result of a table query.

The TableQueryResult object can be used to iterate over results of a query:

results = syn.tableQuery("select * from syn1234")
for row in results:
    print(row)

asDataFrame()

Convert query result to a Pandas DataFrame.

next()

Python 2 iterator

class synapseclient.table.CsvFileTable(schema, filepath, etag=None, quoteCharacter=u'"', escapeCharacter=u'\\', lineEnd=u'\n', separator=u',', header=True, linesToSkip=0, includeRowIdAndRowVersion=None, headers=None)

An object to wrap a CSV file that may be stored into a Synapse table or returned as a result of a table query.

classmethod from_table_query(synapse, query, quoteCharacter=u'"', escapeCharacter=u'\\', lineEnd=u'\n', separator=u',', header=True, includeRowIdAndRowVersion=True)

Create a Table object wrapping a CSV file resulting from querying a Synapse table. Mostly for internal use.

setColumnHeaders(headers)

Set the list of synapseclient.table.SelectColumn objects that will be used to convert fields to the appropriate data types.

Column headers are automatically set when querying.

Module level methods

synapseclient.table.as_table_columns(df)

Return a list of Synapse table Column objects that correspond to the columns in the given Pandas DataFrame.

Parameters: df – a Pandas DataFrame
Returns: a list of Synapse table Column objects
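
For example (a short sketch; the DataFrame contents are made up):

import pandas as pd

df = pd.DataFrame({'gene': ['foo', 'bar'], 'score': [1.5, 2.5]})
cols = as_table_columns(df)
schema = Schema(name='Gene Scores', columns=cols, parent=project)
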
synapseclient.table.Table(schema, values, **kwargs)

Combine a table schema and a set of values into some type of Table object depending on what type of values are given.

Parameters:
  • schema – a table Schema object
  • values – an object that holds the content of the table, which can be:
      • a RowSet
      • a list of lists (or tuples) where each element is a row
      • a string holding the path to a CSV file
      • a Pandas DataFrame

Usually, the immediate next step after creating a Table object is to store it:

table = syn.store(Table(schema, values))

End users should not need to know the details of these Table subclasses.

See also: