Working with data sets

This topic shows some of the operations you can use with BDD data sets.

Note that Dgraph Gateway must be running before these commands can succeed. If Dgraph Gateway is not running, you will see a Connection refused message, as shown in this example:

>>> dss = bc.datasets()
[Errno 111] Connection refused
>>>
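If you want to check for this condition before issuing commands, one option is to test whether anything accepts TCP connections at the gateway's address. The following is a minimal sketch that uses only Python's standard socket module. It is not part of the BDD Shell API, the is_listening name is made up for this illustration, and the host and port in the usage comment are placeholders for your own Dgraph Gateway values.

```python
import socket


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    sock = None
    try:
        sock = socket.create_connection((host, port), timeout)
        return True
    except (socket.error, socket.timeout):
        # Covers "Connection refused" (Errno 111 on Linux) and timeouts.
        return False
    finally:
        if sock is not None:
            sock.close()


# Placeholder values; substitute your own gateway host and port:
#   is_listening("gateway.example.com", 12345)
```

A False result only tells you that the connection attempt failed. It does not distinguish a stopped service from a wrong host name, a wrong port, or a firewall rule.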

Retrieving all data sets

To return all data sets:

>>> dss = bc.datasets()

Finding the count

This command shows the number of returned data sets:

>>> dss.count
2

In the example, there are two data sets in BDD.

Printing the data sets

You can use the Python print statement to print the name and source of each data set. A Python for loop iterates over the data sets:

>>> dss = bc.datasets()
>>> for ds in dss:
...     print ds
... 

WarrantyClaims  default_edp_4f6c159c-1042-4cd5-a6b2-e567e5cd03d3   default_edp_4f6c159c-1042-4cd5-a6b2-e567e5cd03d3   Hive   default.warrantyclaims

MassTowns       default_edp_da5ff7d5-521e-4851-a9c8-2755802f3053   default_edp_da5ff7d5-521e-4851-a9c8-2755802f3053   Hive   default.masstowns
>>>

In this example, there are two data sets:

The data set with the display name "WarrantyClaims" has the Dgraph database name "default_edp_4f6c159c-1042-4cd5-a6b2-e567e5cd03d3", and its collection name is the same. The Hive source table is named "warrantyclaims" and is in the Hive "default" database.
The data set with the display name "MassTowns" has the Dgraph database name "default_edp_da5ff7d5-521e-4851-a9c8-2755802f3053", and its collection name is the same. The Hive source table is named "masstowns" and is also in the Hive "default" database.
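The column layout described above can be illustrated by splitting one printed line into its five fields. This is only a sketch that works on the printed text. The DataSetRow and parse_row names are made up for this illustration and are not part of BDD Shell, and the code assumes that no column contains whitespace (a display name with spaces would need different handling).

```python
from collections import namedtuple

# Field names are illustrative labels for the five printed columns.
DataSetRow = namedtuple(
    "DataSetRow",
    ["display_name", "database_name", "collection_name",
     "source_type", "source_name"],
)


def parse_row(line):
    """Split one printed data set line into its five columns."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError("expected 5 columns, got %d" % len(fields))
    return DataSetRow(*fields)


row = parse_row(
    "WarrantyClaims  default_edp_4f6c159c-1042-4cd5-a6b2-e567e5cd03d3   "
    "default_edp_4f6c159c-1042-4cd5-a6b2-e567e5cd03d3   Hive   "
    "default.warrantyclaims"
)

# For the Hive sources shown here, the source name is <database>.<table>.
hive_database, hive_table = row.source_name.split(".", 1)

print(row.display_name)   # WarrantyClaims
print(hive_database)      # default
print(hive_table)         # warrantyclaims
```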

Similarly, you can retrieve one data set (via its collection name) and then output its name and source:

>>> ds = dss.dataset('default_edp_4f6c159c-1042-4cd5-a6b2-e567e5cd03d3')
>>> ds

WarrantyClaims  default_edp_4f6c159c-1042-4cd5-a6b2-e567e5cd03d3   default_edp_4f6c159c-1042-4cd5-a6b2-e567e5cd03d3   Hive   default.warrantyclaims
>>>

Displaying the data set metadata

You can use the BddDataset properties() function to print the data set's metadata:

>>> ds = dss.dataset('default_edp_4f6c159c-1042-4cd5-a6b2-e567e5cd03d3')
>>> ds.properties()
{'timesViewed': '0', 'sourceName': 'default.warrantyclaims', 'attributeDisplayNames': 'Vehicle_Dealer',
...
'fullDataSet': 'true', 'collectionIdToBeReplaced': None, 'authorizedGroup': None}
>>>

To display only one property, you can use this syntax:

ds.properties()['propName']

For example:

>>> ds.properties()['displayName']
'WarrantyClaims'
>>>
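To look at several properties at once, you can copy the ones you need into a smaller dictionary. The sketch below assumes that properties() returns an ordinary Python dict, as the output above suggests. The pick helper is hypothetical and not part of BDD Shell, and the props dictionary is a stand-in that reuses values from the earlier examples.

```python
def pick(props, names, default=None):
    """Return a dict holding only the requested property names."""
    return dict((name, props.get(name, default)) for name in names)


# Stand-in for the dict returned by ds.properties().
props = {
    "timesViewed": "0",
    "sourceName": "default.warrantyclaims",
    "displayName": "WarrantyClaims",
    "fullDataSet": "true",
}

subset = pick(props, ["displayName", "sourceName", "noSuchProperty"])
print(subset["displayName"])     # WarrantyClaims
print(subset["noSuchProperty"])  # None
```

Using get() with a default avoids the KeyError that direct indexing of a dict raises when a property name is missing.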

To produce more readable output, import the json module and print the result of json.dumps:

>>> dss = bc.datasets()
>>> ds = dss.dataset('default_edp_65e296e7-52b5-4e3e-b837-3386cb3ec079')
>>> import json
>>> print json.dumps(ds.properties(), indent=2, sort_keys=True, ensure_ascii=False)
{
  "accessType": "private", 
  "attributeCount": "24", 
  "attributeDisplayNames": "Vehicle_Dealer", 
  ...
  "transformed": "false", 
  "uploadUserId": "10098", 
  "uploadUserName": "Admin Admin", 
  "version": "3"
}
>>>
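If you format properties often, the json.dumps call can be wrapped in a small helper. This is a sketch: format_properties is a hypothetical name, and the sample dictionary is a stand-in that reuses a few values from the output above. Passing explicit separators removes the trailing space that Python 2 otherwise leaves after each comma when indent is set.

```python
import json


def format_properties(props):
    """Return the properties as indented JSON with sorted keys."""
    return json.dumps(
        props,
        indent=2,
        sort_keys=True,
        ensure_ascii=False,
        separators=(",", ": "),  # no trailing space at line ends
    )


# Stand-in for the dict returned by ds.properties().
sample = {"version": "3", "accessType": "private", "attributeCount": "24"}
text = format_properties(sample)
print(text)
```

Because the helper returns a string rather than printing it, you can also write the result to a file or a log.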