bgdata¶
bgdata is a simple data package manager. It allows you to create, search and use data packages.
By default, it works with the data packages used by our group: the Barcelona Biomedical Genomics Group.
bgdata downloads packages from a remote repository once and keeps them in a local repository (typically a local folder) so that accessing them is fast. For a better overview of how the repositories work, check the repositories section.
However, bgdata is more than that; keep reading to find out what it can do.
The data packages¶
A data package is nothing more, and nothing less, than a set of files (or even a single file).
Identifying¶
bgdata identifies each data package with a 4-level structure:
- project
- dataset
- version
- build
project and dataset are the main identifiers of a data package. In some cases, you might find that the package does not belong to a particular project. For such cases, we use _ as the project name. Some of the bgdata commands will automatically set the project to _ if you do not provide it.
The version is intended to distinguish between incompatible versions of the package, e.g. when you remove some data columns from your files.
The build is an identifier that distinguishes between compatible versions of the same package. Typically, we use the date when we create the package as the build identifier. However, the build can be anything (as long as it does not start with an alpha character), so you might find other builds.
For example, we use the human genome in many of our projects. There are several versions of the human genome available at http://hgdownload.cse.ucsc.edu/downloads.html#human. We downloaded our data of interest for the hg19 version and created packages using _ as project, genomereference as dataset, hg19 as version and 20150724 as build.
Then, we can request this package as:
bgdata get _/genomereference/hg19?20150724
Tags¶
As remembering all the build identifiers for all the packages might be painful, and you would probably need to change all the queries in your scripts to get newer versions, bgdata supports the concept of tags.
A tag is a pointer to a particular build, and in several operations with bgdata you can use a tag instead of a build. bgdata will resolve which build is associated with that tag and use that package.
The advantage of using a tag rather than a build is that, with the same query in your software, you get the most updated version of a particular package by only keeping the tag up to date.
E.g., following the example above, if we ask for the tag master we always get the most recent version:
bgdata get _/genomereference/hg19?master
provided that we keep our tag up to date.
In most cases, bgdata will use the tag master when you do not indicate a build or tag for a particular package.
Important
A tag works essentially as a pointer to a build for a particular project, dataset and version. This means that when asking for a tag you also need to indicate the other parameters.
Repositories¶
bgdata manages the packages through 3 layers of repositories:
- remote
- local
- caches
Remote¶
The remote represents a repository that serves as a source of data packages. Currently, it is an HTTP server that contains the compressed data packages and some tags.
When the user requests a package that is not present in the local repository, bgdata will download it from the remote into the local one.
In addition, bgdata will keep the tags in sync. This means that if a tag of a particular package is updated in the remote, and the user requests that particular tag, he or she will get the latest version from the remote if the local tag was not up to date.
Note
bgdata can work in offline mode. In that case, packages will not be downloaded and tags will not be updated.
Local¶
The local repository is the one where the user can find the packages that have been requested.
While the remote is an HTTP server, the local should be a reachable path from the user’s machine.
The main difference with the remote repository, apart from being in the local machine, is that packages are uncompressed.
The download process¶
The download process from the remote is done using the Python package homura, thanks to which downloads can be resumed. After downloading, bgdata extracts all the files if they were compressed.
Once the download and extraction processes are done, bgdata creates a file named .downloaded with the date and time of that moment. If this file is not present or has been deleted, bgdata assumes the download has failed and reattempts it.
Caches¶
A cache is an extension of the local repository. Like the local repository, it should be a reachable path from the user's machine. Moreover, bgdata supports multiple caches.
When the user requests a package, bgdata searches for it first in each cache and then in the local repository.
A cache can have different uses. As an example, we use the scratch space in the nodes of our cluster as a cache for the packages we use recurrently. For the others, we have a local repository reachable through the network file system.
Important
bgdata will not fail just because a cache is not present. This means that you can also use an external hard drive as a cache and if it is not connected bgdata can still be used.
Configuring bgdata¶
bgdata has a default configuration file which looks like:
version = 2
local_repository = "~/.bgdata"
remote_repository = "http://bbglab.irbbarcelona.org/bgdata"
However, you can create your own configuration file and change it.
Custom configuration¶
To create your own custom configuration you need to create a file named bgdatav2.conf and place it in the corresponding config file folder (this is resolved using the appdirs package, calling the user_config_dir function with bbglab as the only parameter).
That file should follow the same structure as the default, but you can change the sections to fit your own needs.
The local folder (where the data packages are stored) is indicated through local_repository.
# The default local folder where you want to store the data packages
local_repository = "~/.bgdata"
Note
You can put any reachable path.
The remote repository is a (public) URL where the data packages are stored and which bgdata uses to look for packages that are not in the local repository.
# The remote URL from where do you want to download the data packages
remote_repository = "http://bbglab.irbbarcelona.org/bgdata"
If you need to access the remote repository through a proxy, you can also configure it as follows:
# Optional proxy configuration
[proxy]
host = proxy.someurl.org
port = 8080
# If it's an authenticated proxy
user = myname
pass = mypasswd
Optionally, bgdata can be set not to look for newer versions of the packages in the remote repository and to use only what is available locally. To use this option, you need to add:
# If you want to force bgdata to work only locally
offline = True
Using the cache_repositories option you can indicate a list of repositories (similar to the local one) in which to look for the files.
# Cache repositories
[cache_repositories]
# Pairs name and path
my hard drive = /mnt/user/hd
Note
Cache repositories have higher priority than the local repository, meaning that bgdata will look in them before checking the local one. In addition, they are searched from last to first.
As an example of usage, data packages that are used recurrently in our cluster are saved in the scratch directory of each node. This way, bgdata takes the data from the scratch space, which is faster than using the network file system.
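Putting the options above together, a complete custom bgdatav2.conf could look like this (the cache name and all paths here are illustrative):

```ini
version = 2

# Local folder where the uncompressed data packages are stored
local_repository = "~/.bgdata"

# Remote URL from where the data packages are downloaded
remote_repository = "http://bbglab.irbbarcelona.org/bgdata"

# Uncomment to force bgdata to work only locally
# offline = True

[cache_repositories]
# Pairs of name and path; searched before the local repository
my hard drive = /mnt/user/hd
```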
Usage¶
bgdata is a Python package with a command line interface. This means that you can use bgdata as a Python library or from a terminal.
Getting packages¶
The most basic function of bgdata is to retrieve the path to a particular package. This is done through the get method.
The package is identified by a string with the format:
[<project>/]<dataset>/<version>[?<build>|<tag>]
- project is optional. The default project is _
- dataset and version are required
- build or tag are optional. By default, bgdata requests the tag master.
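The identifier format above can be parsed with a few lines of Python (a sketch using a hypothetical `parse_package_id` helper, not part of bgdata's API):

```python
def parse_package_id(query):
    """Split a '[project/]dataset/version[?build-or-tag]' string
    into its four components."""
    if '?' in query:
        path, build = query.split('?', 1)
    else:
        build = 'master'       # default tag when none is given
        path = query
    parts = path.split('/')
    if len(parts) == 2:        # project omitted -> default '_'
        parts.insert(0, '_')
    project, dataset, version = parts
    return project, dataset, version, build
```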
Note
As master is the default tag, it is present in the remote repository and, unless you are in offline mode, bgdata will keep it synchronized.
As an example, we are going to ask for the master tag of the hg19 version of the genomereference dataset in the default project (_).
From the command line:
$ bgdata get _/genomereference/hg19?master
2018-03-19 10:56:08 bgdata.manager INFO -- "master" resolved as 20150724
2018-03-19 10:56:08 bgdata.command INFO -- Dataset downloaded
/home/user/.bgdata/_/genomereference/hg19-20150724
and from Python:
>>> bgdata.get('_/genomereference/hg19?master')
'/home/user/.bgdata/_/genomereference/hg19-20150724'
Important
bgdata returns the path to local or cache folder where the package is present. When there is only one file in the folder, or in some special cases, bgdata returns the path to that file instead of the folder path.
Searching for packages¶
The bgdata list command can be used to check which data packages are in the local repository.
This function (actually a generator) returns three elements: a string that represents the package (like the input for the get method), the name of the repo where the package can be found (local represents the local repository, and the rest will be the names of the caches), and the tags associated with that particular build.
In the command line:
$ bgdata list
_/genomereference/hg19?20150724 local ['master']
From Python:
>>> for pkg, repo, tag in bgdata.list():
... print('Package {} in {} is associated with tags: {}'.format(pkg, repo, tag))
...
Package _/genomereference/hg19?20150724 in local is associated with tags: ['master']
To search for packages you can use the search command. This command lists all available packages at the indicated level. For example, when searching with an empty string, it lists all available projects:
$ bgdata search
_
cgi
intogen
If you search for a project, you get a list of datasets:
$ bgdata search _
genomereference
genomesignature
If you search for a dataset within a project, you get all possible versions:
$ bgdata search _/genomereference
hg19
hg18
hg38
And builds can be found by searching for the version of the dataset within a project:
$ bgdata search _/genomereference/hg19
20150724
Information about the packages¶
The remote repository contains metadata about the packages. This information is used internally by bgdata to know which projects are present, which datasets are in each project, and so on.
The info command can be used to retrieve that information with a simple query.
$ bgdata info _/genomereference/hg19
{'author': 'BBGLab',
'created_on': '20150724',
'description': 'Human Genome HG19',
'license': 'Freely available for public use',
'md5': '851d41ac755f4deba7b98851084927ab',
'source': 'http://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/'}
Logs¶
The logging in bgdata is done using the logging module in Python.
When using bgdata as a Python library, the logging module is not configured at all; how to configure the logging system is thus left to the end user.
The loggers used by bgdata all live below one named bgdata, so you only need to configure that one.
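For example, a minimal configuration of that single logger could look like this (the handler and format string are illustrative choices, not mandated by bgdata):

```python
import logging

# bgdata's loggers all live below the 'bgdata' logger, so configuring
# that one logger is enough for the whole library.
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter('%(asctime)s %(name)s %(levelname)s -- %(message)s'))
logger = logging.getLogger('bgdata')
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```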
When using bgdata from the command line interface, there are two flags that can be used to configure the logging system.
bgdata contains a set of subcommands but there are two flags that are general:
-v, --verbose | Give more information |
-q, --quiet | Suppress all log messages but the ones on the stderr |
The --quiet flag can be useful in your bash scripts to store the output of bgdata in a variable.
Advanced usage¶
Understanding the local repository¶
As we have already mentioned in the package section bgdata identifies each data package with a 4-level structure: project, dataset, version and build.
In the local repository, the 4-level structure is converted into a 3-level folder structure following this layout project/dataset/version-build.
For example, for the hg19 version of the human genome, we set the project to _, the dataset to genomereference, the version to hg19 and the build to the date used to create the package, 20150724.
If you request this package with bgdata (bgdata get _/genomereference/hg19?20150724), after downloading you will see a local repository like:
|- .bgdata/
| |
| |- _/
| | |
| | |- genomereference/
| | | |
| | | |- hg19-20150724/
| | | | |
| | | | |- chr1.txt
| | | | |- chr2.txt
| | | | |- ...
| | | | |- .downloaded
This structure makes it easy to map the query you make (project, dataset and version) to the folder structure.
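The mapping itself is a one-liner (a sketch, not bgdata's code):

```python
def package_path(project, dataset, version, build):
    """Map the 4-level identifier onto the 3-level folder layout
    project/dataset/version-build used in the local repository."""
    return '{}/{}/{}-{}'.format(project, dataset, version, build)
```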
The .downloaded¶
The .downloaded file is created after downloading and extracting the package; it is used internally by bgdata to check whether the package is present and correct.
The .singlefile¶
In some data packages you will find a .singlefile file. It contains the name of one of the files in the folder. If present, this file is used by bgdata to retrieve the path to that particular file rather than the path to the folder.
bgdata creates this file automatically if a downloaded package contains only one file. However, some packages can use this file, even if there is more than one file, to ease the usage. For example, a tabix file is formed by a data file and an index file. However, tools using it only need to receive the path to the data file. For packages consisting of a tabix file, although they contain two files, we always retrieve the path to the data file as if it were the only file in the package.
The tag files¶
The build pointed to by a tag is indicated in a file named after the version. For example, a tag file for the hg19 package mentioned above that sets the master tag to the 20150724 build will be located in:
|- .bgdata/
| |
| |- _/
| | |
| | |- genomereference/
| | | |
| | | |- hg19-20150724/
| | | |
| | | |- hg19.master
The tag file only contains a string with the build.
Cache management¶
bgdata includes some commands to manage your caches. However, keep in mind that caches are like partial copies of your local repository, so adding or removing packages from your caches is as simple as copying them from the local repository or deleting them.
The commands you can use with bgdata cache are:
add | Add a package to the cache |
clean | Clean everything |
remove | Remove a package from the cache |
update | Update packages in caches |
- add
- This command will copy a local package into the cache
- clean
- Clean is a command to remove everything in the cache
- remove
- This command will remove a particular build of a package from the cache
- update
- Update will remove old versions of packages and copy new ones. Care must be taken when using this command. The flow is as follows:
- bgdata resolves which builds are associated with the indicated tags
- for each cache, bgdata gets which packages are present. If the build of a package is not among the resolved ones, it is deleted. The most recent version(s) of the packages are then added to the cache.
It is important to note that if a package is not present in the cache it will not be updated.
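The selection step of this flow can be sketched as follows (`plan_cache_update` is a hypothetical helper illustrating the rules above, not bgdata's implementation):

```python
def plan_cache_update(cached, resolved):
    """cached: builds of one package currently in a cache.
    resolved: builds that the indicated tags currently point to.
    Return (builds to delete, builds to copy in). A package with no
    cached build at all is skipped entirely, as described above."""
    if not cached:
        return [], []  # package not present in the cache -> not updated
    delete = [b for b in cached if b not in resolved]
    copy = [b for b in resolved if b not in cached]
    return delete, copy
```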
Tags in caches¶
Tag files can be used in cache repositories.
In fact, when you request a particular tag, bgdata looks for it first in the local repository and then in the caches.
Warning
Using tag files in the caches is not recommended, as the user must manually keep those tag files up to date.
Creating your own packages¶
Building packages¶
The build command receives the path to a folder (or even a single file) and creates a compressed data package with it. Then it uncompresses it in the local repository and associates that build with the build tag. Thus you can use that tag (e.g. _/genomereference/hg19?build) for your tests.
Uploading packages to the remote¶
Warning
This section is only for people within our group or people that have set up their own system using bgdata.
Once the package is built, it can be uploaded to the remote by making use of the upload command.
Important
Only packages that have been previously built can be uploaded.
The upload process does not go through HTTP. To prevent external users from uploading packages to our remote repository, the upload process is just a copy of files on the network file system. Thus, it will only work for people with access to the NFS.
If you have access, you need to edit your configuration file to add
remote_repository_upload = /path/to/remote
The upload process includes the creation of a metadata file for the uploaded package. This file contains, among other items, a checksum used during the download process.
Fixing your builds¶
The easiest way to fix your builds is to do it directly in your code, e.g. bgdata get project/dataset/version?build.
However, in some cases, it is useful to fix the builds of the packages used
without modifying your code.
Two typical use cases are (there might be many others):
- fixing the builds for reproducibility without modifying your code. Your calls to bgdata get project/dataset/version will return the same build even if you add new builds.
- making a particular package point to a different tag. This can be useful for development: you associate your new build to a develop tag and force bgdata to use the develop data for that package and the default for the rest.
To fix your builds without explicitly indicating it in your code, you can use the environment variable BGDATA_BUILDS to point to a file that sets the builds. Such a file can contain three different ways of fixing your builds:
Indicate a path to a file for a package in the paths section:
[paths]
project/dataset/version = /my/local/path
In this case, any call to project/dataset/version will point to /my/local/path. This has no effect if the request is done indicating a tag or build.
Override your tags in the builds section:
[builds]
[[project/dataset/version]]
master = 20181105
In this case, any request to the master tag of project/dataset/version points to the 20181105 build. The request can be explicit (project/dataset/version?master) or implicit (project/dataset/version, when the default tag is master).
Fix the tags in the tags section:
[tags]
project/dataset/version = master
project = develop
In this case, any call (that does not indicate the build or tag) to any data package under project will use the tag develop by default, except the package project/dataset/version, which will use master. Note that this will not have any effect if you explicitly indicate the build or tag.
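Putting the three mechanisms together, a single BGDATA_BUILDS file could look like this (all package names, paths and builds below are illustrative):

```ini
# File pointed to by the BGDATA_BUILDS environment variable

[paths]
# Requests for this package (without tag or build) map to a local path
other/dataset/version = /my/local/path

[builds]
# Override what a tag resolves to for one package
[[project/dataset/version]]
master = 20181105

[tags]
# Default tags per package or per project
project/dataset/version = master
project = develop
```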