scgpm_seqresults_dnanexus¶
This project is read from GitHub
Scripts¶
accept_pending_transfers.py¶
Accepts pending DNAnexus project transfers for the user under a specified billing account. Not every pending project is accepted: only projects that have the specified queue (set in the queue property of the project) will be transferred to the user.
usage: accept_pending_transfers.py [-h] --access-level
{VIEW,UPLOAD,CONTRIBUTE,ADMINISTER} -q
QUEUE -o ORG
[--share-with-org {VIEW,UPLOAD,CONTRIBUTE,ADMINISTER}]
Named Arguments¶
--access-level | Possible choices: VIEW, UPLOAD, CONTRIBUTE, ADMINISTER |
-q, --queue | |
-o, --org | |
--share-with-org | Possible choices: VIEW, UPLOAD, CONTRIBUTE, ADMINISTER |
add_props_to_fq_files.py¶
Adds properties to FASTQ files in DNAnexus. Make sure you are logged into DNAnexus with the appropriate account before using this script, to ensure access to the relevant projects. The script accepts a tab-delimited input file containing properties to add to barcoded FASTQ files. The file must have a header line containing the two fields ‘dx_project_id’ and ‘barcode’; the remaining columns are treated as property names. It is assumed that the directory containing the FASTQ files in each project is named /stage0_bcl2fastq/fastqs. Note that some standard property fields are in use. For example, in the CIRM stem-cell project, the following property names are in use:
- lab_patient_id (an identifier for a given patient/sample)
- lab_patient_group (a categorization for the patient)
- lab_patient_condition (a treatment condition, perhaps a medicine and dosage)
Properties that represent FASTQ metadata for some lab protocol/setup should start with ‘lab’ followed by an underscore.
usage: add_props_to_fq_files.py [-h] -i INFILE
Named Arguments¶
-i, --infile | Tab-delimited input file. |
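As a minimal sketch of the input described above, the following builds a tab-delimited table whose header starts with the two required fields, followed by property columns. The project ID, barcodes, and property values are made-up placeholders, and the helper name `write_props_table` is hypothetical; it is not part of this package.

```python
import csv
import io

# Placeholder rows: the project ID, barcodes, and property values are invented.
rows = [
    {"dx_project_id": "project-FXq6B809p5jKzp2vJkjkKvg3", "barcode": "AGTTCC",
     "lab_patient_id": "P001", "lab_patient_group": "control"},
    {"dx_project_id": "project-FXq6B809p5jKzp2vJkjkKvg3", "barcode": "CAGATC",
     "lab_patient_id": "P002", "lab_patient_group": "treated"},
]


def write_props_table(rows, handle):
    """Write rows as tab-delimited text, with the header taken from the first row."""
    fieldnames = list(rows[0].keys())
    writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


buf = io.StringIO()
write_props_table(rows, buf)
table_text = buf.getvalue()
print(table_text)
```

Writing to a real file instead of `io.StringIO` would produce a file that could be passed to `-i`.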
add_props_to_project.py¶
Given a DNAnexus project ID, or a file containing multiple DNAnexus project IDs (one per line), adds one or more properties to each project. Properties are specified as positional key=value arguments.
usage: add_props_to_project.py [-h]
(--dx-project-id DX_PROJECT_ID | --infile INFILE)
props [props ...]
Positional Arguments¶
props | |
Named Arguments¶
--dx-project-id | The DNAnexus project ID to add properties to. |
--infile | |
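How the script itself parses the positional key=value arguments is not documented here. The sketch below shows one plausible way to turn such arguments into a dictionary; the function name `parse_props` and the example property values are assumptions for illustration only.

```python
def parse_props(pairs):
    """Convert a list of 'key=value' strings into a dict.

    Splits on the first '=' only, so values may themselves contain '='.
    """
    props = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError("Expected key=value, got {!r}".format(pair))
        props[key] = value
    return props


# Example values are placeholders.
parsed = parse_props(["lab=somelab", "queue=somequeue"])
print(parsed)
```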
add_r1-r2_fastq_paths.py¶
Retrieves the FASTQ file names in DNAnexus for the specified sequenced libraries. The tab-delimited input file may be provided in one of two formats.
- Format 1:
  - DNAnexus project name
  - barcode
- Format 2:
  - uhts run name
  - lane
  - barcode
The format in use is determined by the number of header fields present in the header line, which must appear as the very first line in the input file and begin with a ‘#’.
The output file is identical to the input file, with the exception of two new columns at the start of the file being the FASTQ file name on the DNAnexus platform, and the read number. Thus, the output columns are:
- FASTQ file name
- Read number (1 for forward reads, 2 for reverse reads)
followed by the input file columns. Note that at present, one of three warnings may be output to stdout. A warning is triggered whenever:
- A DNAnexus project isn’t found based on the provided criteria.
- A DNAnexus project was found, but there were not any FASTQ files found within having the specified barcodes.
- A DNAnexus project was found, but only a forward reads or reverse reads FASTQ file was found, not both.
The last warning thus implies that the script assumes all reads are paired-end, which is true.
usage: add_r1-r2_fastq_paths.py [-h] -i INFILE -o OUTFILE
Named Arguments¶
-i, --infile | |
-o, --outfile | |
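The format detection described above (a header line starting with ‘#’, with the format chosen by the number of header fields) can be sketched as follows. This is an illustration under stated assumptions, not the script's actual code: the function name `detect_format` and the header field names in the usage lines are hypothetical, and only the field count matters here.

```python
def detect_format(header_line):
    """Return 1 or 2 depending on how many tab-delimited header fields are present."""
    if not header_line.startswith("#"):
        raise ValueError("The first line must be a header beginning with '#'.")
    fields = header_line[1:].rstrip("\r\n").split("\t")
    if len(fields) == 2:
        return 1  # DNAnexus project name, barcode
    if len(fields) == 3:
        return 2  # uhts run name, lane, barcode
    raise ValueError("Unrecognized header with {} fields.".format(len(fields)))


format_a = detect_format("#dx_project_name\tbarcode\n")
format_b = detect_format("#uhts_run_name\tlane\tbarcode\n")
print(format_a, format_b)
```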
add_standard_props_to_projects.py¶
To fill in.
usage: add_standard_props_to_projects.py [-h] -b BILLING_ACCOUNT
Named Arguments¶
-b, --billing-account | |
batch_download_fastqs.py¶
Calls download_fastqs.py in batch, provided an input file specifying the FASTQs to download. This script passes a log file name to download_fastqs.py for error logging, i.e. if a DNAnexus project isn’t found then it will be logged. The log file is named after this script and contains the number of seconds since the epoch to help generate a unique name.
usage: batch_download_fastqs.py [-h] -i INFILE -d FILE_DOWNLOAD_DIR
[--not-found-error]
Named Arguments¶
-i, --infile | |
-d, --file-download-dir | |
--not-found-error | Default: False |
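The exact log file naming pattern is not given above, only that the name is derived from the script name and the number of seconds since the epoch. One way this could look is sketched below; the `make_log_name` helper, the underscore separator, and the `.log` extension are all assumptions.

```python
import os
import time


def make_log_name(script_path, now=None):
    """Build a log file name from the script's base name and whole epoch seconds."""
    seconds = int(time.time() if now is None else now)
    stem = os.path.splitext(os.path.basename(script_path))[0]
    return "{}_{}.log".format(stem, seconds)


# A fixed timestamp is passed here so the result is reproducible.
log_name = make_log_name("bin/batch_download_fastqs.py", now=1500000000.7)
print(log_name)
```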
download_fastqs.py¶
To fill in.
usage: download_fastqs.py [-h] --errlog ERRLOG [-l LIBRARY_NAME]
[--uhts-run-name UHTS_RUN_NAME]
[--dx-project-name DX_PROJECT_NAME] [--lane LANE]
[-b BARCODE [BARCODE ...]] -d FILE_DOWNLOAD_DIR
[--not-found-error]
Named Arguments¶
--errlog | |
-l, --library-name | |
--uhts-run-name | |
--dx-project-name | |
--lane | |
-b, --barcode | |
-d, --file-download-dir | |
--not-found-error | Default: False |
download_multiple_projects.py¶
Downloads all projects from DNAnexus that have the given value for the ‘seq_run_name’ property set. A billing account can be specified in order to limit the project search space to only those belonging to that account.
usage: download_multiple_projects.py [-h] --seq-run-name SEQ_RUN_NAME
--download-dir DOWNLOAD_DIR
[-b BILLING_ACCOUNT_ID]
[-s SKIP_PROJECTS [SKIP_PROJECTS ...]]
Named Arguments¶
--seq-run-name | |
--download-dir | |
-b, --billing-account-id | |
-s, --skip-projects | |
download_project.py¶
Downloads a SCGPM sequencing results project from DNAnexus. Currently, there is one DNAnexus project for each lane of Illumina sequencing. A folder will be created by the name of the DNAnexus project within the specified output folder. The project folder will contain a FASTQ subfolder, a FASTQC subfolder, and several files including 1) the sequencing QC report, 2) the list of barcodes in the sequencing lane in barcodes.json, 3) run information in run_details.json, and 4) a metadata tarball by the name of ${sequencing_run_name}.metadata.tar. This last file contains important XML files output by the sequencer, such as runParameters.xml and RunInfo.xml.
usage: download_project.py [-h] --download-dir DOWNLOAD_DIR
(--dx-project-id DX_PROJECT_ID | --dx-project-list DX_PROJECT_LIST)
Named Arguments¶
--download-dir | |
--dx-project-id | The DNAnexus ID of the project to download. |
--dx-project-list | |
list_projects.py¶
Writes a tab-delimited file containing information about all projects in DNAnexus that are billed to the specified org. The field names are:
Name, ID, library_name, seq_run_name, lab, paired_end, sequencer_type, seq_instrument, seq_lane_index, queue, experiment_type
usage: list_projects.py [-h] -o OUTFILE --org ORG
Named Arguments¶
-o, --outfile | The output file. |
--org | The DNAnexus org in which the projects belong. |
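The output of list_projects.py can be consumed with the standard csv module. The sketch below assumes the first line of the output file is a header holding the field names listed above (not confirmed by this page), and the sample rows are invented; `project_ids_for_queue` is a hypothetical helper.

```python
import csv
import io

FIELDS = ["Name", "ID", "library_name", "seq_run_name", "lab", "paired_end",
          "sequencer_type", "seq_instrument", "seq_lane_index", "queue",
          "experiment_type"]


def project_ids_for_queue(handle, queue):
    """Return the IDs of all projects whose queue column equals the given value."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["ID"] for row in reader if row["queue"] == queue]


# Invented rows standing in for a real output file.
sample_rows = [
    ["proj_a", "project-AAAA", "lib1", "run1", "lab1", "true", "hiseq",
     "inst1", "1", "queue_a", "rna-seq"],
    ["proj_b", "project-BBBB", "lib2", "run1", "lab2", "true", "hiseq",
     "inst1", "2", "queue_b", "chip-seq"],
]
sample_text = "\n".join("\t".join(r) for r in [FIELDS] + sample_rows) + "\n"

matching_ids = project_ids_for_queue(io.StringIO(sample_text), "queue_a")
print(matching_ids)
```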
scgpm_clean_raw_data.py¶
This script calls the DNAnexus app I built called SCGPM Clean Raw Data at https://platform.dnanexus.com/app/scgpm_clean_raw_data. The app removes unwanted files (which drive up storage costs) from the raw_data folder of a DNAnexus project containing sequencing results from the SCGPM sequencing workflow. Most of the files in the raw_data folder are removed. Moreover, the lane tarball is removed; the XML files RunInfo.xml and runParameters.xml are extracted from Interop.tar and then the tarball is removed; finally, metadata.tar is removed. The extracted XML files are uploaded back to the raw_data folder.
The script queries DNAnexus for all projects that are billed to the specified org and were created within the last -d days.
You must have the environment variable DX_SECURITY_CONTEXT set (described at http://autodoc.dnanexus.com/bindings/python/current/dxpy.html?highlight=token) in order to authenticate with DNAnexus.
usage: scgpm_clean_raw_data.py [-h] [-d DAYS_AGO] -o ORG
Named Arguments¶
-d, --days-ago | Default: 30 |
-o, --org | |
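The DX_SECURITY_CONTEXT variable mentioned above holds a small JSON document with an auth_token_type and an auth_token, as described in the dxpy documentation. A minimal sketch of building that value follows; the token string is a placeholder, and setting the variable from Python only affects the current process and its children, so in practice it is usually exported in the shell before running the script.

```python
import json
import os


def security_context(token):
    """Serialize an API token into the JSON form expected in DX_SECURITY_CONTEXT."""
    return json.dumps({"auth_token_type": "Bearer", "auth_token": token})


# "YOUR_API_TOKEN" is a placeholder, not a real token.
os.environ["DX_SECURITY_CONTEXT"] = security_context("YOUR_API_TOKEN")
print(os.environ["DX_SECURITY_CONTEXT"])
```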
Client API Modules¶
scgpm_seqresults_dnanexus¶
Utilities for working with the SCGPM Sequencing Center application logic on DNAnexus.
scgpm_seqresults_dnanexus.LOG_DIR = 'Logs_scgpm_seqresults_dnanexus'¶
The directory that contains log files.
scgpm_seqresults_dnanexus.debug_logger = <Logger scgpm_seqresults_dnanexus (DEBUG)>¶
A logging.Logger instance that logs all messages sent to it to STDOUT, as well as to a log file in the folder specified by LOG_DIR. In addition, another file handler is created that accepts messages at the level logging.ERROR, and lives in the same folder.
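A logger with the behavior described above (a STDOUT handler, a file handler for all messages, and a second file handler restricted to logging.ERROR) could be configured roughly as follows. This is a sketch using only the standard logging module, not the package's actual setup; the function name `build_logger`, the file names, and the message format are assumptions.

```python
import logging
import os
import sys
import tempfile


def build_logger(name, log_dir):
    """Create a DEBUG-level logger that writes to stdout and to two files in log_dir."""
    os.makedirs(log_dir, exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    formatter = logging.Formatter("%(asctime)s:%(name)s:%(levelname)s: %(message)s")

    stream = logging.StreamHandler(sys.stdout)
    debug_file = logging.FileHandler(os.path.join(log_dir, "debug.log"))
    error_file = logging.FileHandler(os.path.join(log_dir, "error.log"))
    # Only messages at ERROR or above reach the second file.
    error_file.setLevel(logging.ERROR)

    for handler in (stream, debug_file, error_file):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    # A temporary directory keeps the example from writing into the working directory.
    example = build_logger("example_logger", tempfile.mkdtemp())
    example.debug("Goes to stdout and debug.log only.")
    example.error("Goes to stdout, debug.log, and error.log.")
```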
dnanexus_utils¶
exception scgpm_seqresults_dnanexus.dnanexus_utils.DxMissingAlignmentSummaryMetrics¶
Bases: Exception
Raised by DxSeqResults.get_alignment_summary_metrics() when it can’t locate a Picard alignment summary metrics file for a given barcoded sample of FASTQ sequencing results.
exception scgpm_seqresults_dnanexus.dnanexus_utils.DxMissingLibraryNameProperty¶
Bases: Exception
Raised when creating a new DxSeqResults instance for a DNAnexus project that doesn’t have the library_name project property present.
exception scgpm_seqresults_dnanexus.dnanexus_utils.DxProjectMissingQueueProperty¶
Bases: Exception
exception scgpm_seqresults_dnanexus.dnanexus_utils.DxMultipleProjectsWithSameLibraryName¶
Bases: Exception
exception scgpm_seqresults_dnanexus.dnanexus_utils.DnanexusBarcodeNotFound¶
Bases: Exception
scgpm_seqresults_dnanexus.dnanexus_utils.select_newest_project(dx_project_ids)¶
Given a list of DNAnexus project IDs, returns the one that is newest as determined by creation date.
Parameters: dx_project_ids – list of DNAnexus project IDs.
Returns: str.
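The selection logic can be illustrated without contacting DNAnexus. The sketch below assumes the creation time of each project has already been looked up and stored in a plain dict (project ID to creation timestamp); the real function takes only the list of IDs and presumably queries the platform for the dates. The name `newest_project` and the IDs and timestamps are made up.

```python
def newest_project(created_by_id):
    """Return the project ID with the largest creation timestamp."""
    if not created_by_id:
        raise ValueError("No project IDs were given.")
    return max(created_by_id, key=created_by_id.get)


# Invented IDs and creation timestamps (milliseconds since the epoch).
created = {
    "project-AAAA": 1490000000000,
    "project-BBBB": 1500000000000,
    "project-CCCC": 1480000000000,
}
newest = newest_project(created)
print(newest)
```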
scgpm_seqresults_dnanexus.dnanexus_utils.accept_project_transfers(access_level, queue, org, share_with_org=None)¶
Parameters:
- access_level – str. Permissions level the new member should have on transferred projects. Should be one of ["VIEW", "UPLOAD", "CONTRIBUTE", "ADMINISTER"]. See https://wiki.dnanexus.com/API-Specification-v1.0.0/Project-Permissions-and-Sharing for more details on access levels.
- queue – str. The value of the queue property on a DNAnexus project. Only projects that are pending transfer that have this value for the queue property will be transferred to the specified org.
- org – str. The name of the DNAnexus org under which to accept the project transfers for projects that have their queue property set to the value of the ‘queue’ argument.
- share_with_org – str. Set this argument if you’d like to share the transferred projects with the org so that all users of the org will have access to the project. The value you supply should be the access level that members of the org will have.
Returns: The projects that were transferred to the specified billing account. Keys are the project IDs, and values are the project names.
Return type: dict
scgpm_seqresults_dnanexus.dnanexus_utils.find_org_projects_by_name_glob(org, glob)¶
Parameters: glob – str.
Example: Find the project(s) with SREQ-163 at the end of the project’s name:
find_org_projects_by_name_glob(org="org-someorg", glob="*SREQ-163")
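To show which names a pattern such as "*SREQ-163" would select, the sketch below applies the same pattern locally with Python's fnmatch module. This is only a stand-in for illustration: the real function searches DNAnexus, and its matching rules may differ from fnmatch in edge cases. The project names and the helper `filter_names` are invented.

```python
from fnmatch import fnmatchcase


def filter_names(names, glob):
    """Return the names that match the glob pattern (case-sensitive)."""
    return [name for name in names if fnmatchcase(name, glob)]


# Invented project names.
names = ["170101_LANE1_SREQ-163", "170101_LANE2_SREQ-164", "SREQ-163_rerun"]
selected = filter_names(names, "*SREQ-163")
print(selected)
```

Because the pattern has a wildcard only at the front, a name that merely contains SREQ-163 somewhere in the middle is not selected.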
Shares one or more DNAnexus projects with an organization. It appears that DNAnexus requires the user who wants to share a project with the org to first have ADMINISTER access on the project. Only then can the user share the project with the org.
Parameters: - project_ids – list. One or more DNAnexus project identifiers, where each project ID is in the form “project-FXq6B809p5jKzp2vJkjkKvg3”.
- org – str. The name of the DNAnexus org with which to share the projects.
- access_level – The permission level to give to members of the org - one of [“VIEW”,”UPLOAD”,”CONTRIBUTE”,”ADMINISTER”].
- suppress_email_notification – bool. True means the DNAnexus platform will not send an email notification for each shared project.
class scgpm_seqresults_dnanexus.dnanexus_utils.DxSeqResults(dx_project_id=False, dx_project_name=False, uhts_run_name=False, sequencing_lane=False, library_name=False, billing_account_id=None, latest_project=False)¶
Bases: object
Finds the DNAnexus sequencing results project that was uploaded by GSSC. The project can be precisely retrieved if the project ID is specified (via the dx_project_id argument). Otherwise, you can supply the dx_project_name argument if you know the name, or use the library_name argument if you know the name of the library that was submitted to GSSC. All sequencing result projects uploaded to DNAnexus by GSSC contain a property named ‘library_name’, and projects will be searched on this property for a matching library name when the library_name argument is specified. If both the library_name and the dx_project_name arguments are specified, only the latter is used in finding a project match. The billing_account_id argument can optionally be specified to restrict all project searches to only those that are billed to that particular billing account (unless dx_project_id is specified, in which case the DNAnexus project is directly retrieved).
Parameters:
- dx_project_id – str. The ID of the DNAnexus project. If specified, no project search will be performed, as the project will be directly retrieved.
- dx_project_name – str. Name of a DNAnexus project containing sequencing results that were uploaded by GSSC.
- uhts_run_name – str. Name of the sequencing run in UHTS. This is added as a property to all projects in DNAnexus through the ‘seq_run_name’ property.
- sequencing_lane – int. Lane number of the flowcell on which the library was sequenced. This is in a property named seq_lane_index on all GSSC projects in DNAnexus.
- library_name – str. Library name of the sample that was sequenced. This is the name of the library that was submitted to GSSC for sequencing, and is added as a property to all GSSC DNAnexus projects via the ‘library_name’ property.
- billing_account_id – str. Name of the DNAnexus billing account that the project belongs to. This will only be used to restrict the search of projects that the user can see to only those billed by the specified account.
- latest_project – bool. True indicates that if multiple projects are found given the search criteria, the most recently created project will be returned.
FQEXT = '.fastq.gz'¶
The extension used for FASTQ files.
get_run_details_json()¶
Retrieves the JSON object for the stats in the file named run_details.json in the project specified by self.dx_project_id.
Returns: JSON object of the run details.
get_alignment_summary_metrics(barcode)¶
Parses the metrics in a ${barcode}alignment_summary_metrics file in the DNAnexus project (usually in the qc folder). This contains metrics produced by Picard Tools’s CollectAlignmentSummaryMetrics program.
get_barcode_stats(barcode)¶
Loads the JSON in a ${barcode}_stats.json file in the DNAnexus project (usually in the qc folder).
get_sample_stats_json(barcode=None)¶
Deprecated since version 0.1.0: GSSC has removed the sample_stats.json file since the entire folder it was in has been removed. Use get_barcode_stats() instead.
Retrieves the JSON object for the stats in the file named sample_stats.json in the project specified by self.dx_project_id. This file is located in the DNAnexus folder staged_qc_report.
Parameters: barcode – str. The barcode for the sample.
Currently, the sample_stats.json file is of the following form when there isn’t a genome mapping:
[{"Sample name": "AGTTCC"}, {"Sample name": "CAGATC"}, {"Sample name": "GCCAAT"}, …]
When there is a mapping, each dictionary has many more keys in addition to the "Sample name" one.
Returns: list of dicts if barcode=None, otherwise a dict for the given barcode.
download_metadata_tar(download_dir)¶
Downloads the ${run_name}.metadata.tar file from the DNAnexus sequencing results project.
Parameters: download_dir – str. The local directory path to download the metadata tarball to.
Returns: The filepath to the downloaded metadata tarball.
Return type: str
download_run_details_json(download_dir)¶
Downloads the run_details.json file from the DNAnexus sequencing results project.
Parameters: download_dir – str. The local directory path to download the file to.
Returns: str. The filepath to the downloaded run_details.json file.
download_barcodes_json(download_dir)¶
Downloads the barcodes.json file from the DNAnexus sequencing results project.
Parameters: download_dir – str. The local directory path to download the file to.
Returns: str. The filepath to the downloaded barcodes.json file.
download_samplesheet(download_dir)¶
Downloads the SampleSheet used in demultiplexing from the DNAnexus sequencing results project.
Parameters: download_dir – str. The local directory path to download the SampleSheet to.
Returns: str. The filepath to the downloaded SampleSheet.
download_qc_report(download_dir)¶
Downloads the QC report from the DNAnexus sequencing results project.
Parameters: download_dir – str. The local directory path to download the QC report to.
Returns: str. The filepath to the downloaded QC report.
download_fastqc_reports(download_dir)¶
Downloads the FASTQC reports from the DNAnexus sequencing results project.
Parameters: download_dir – str. The local directory path to download the FASTQC reports to.
Returns: str. The filepath to the downloaded FASTQC reports folder.
download_fastqs(dest_dir, barcode, overwrite=False)¶
Downloads all FASTQ files in the project that match the specified barcode. If a barcode isn’t given, all FASTQ files are downloaded, since in this case it is assumed that this is not a multiplexed experiment. Files are downloaded to the directory specified by dest_dir.
Parameters:
- barcode – str. The barcode sequence used.
- dest_dir – str. The local directory in which the FASTQs will be downloaded.
- overwrite – bool. If True, then if the file to download already exists in dest_dir, the file will be downloaded again, overwriting it. If False, the file will not be downloaded again from DNAnexus.
Returns: The key is the barcode, and the value is a dict with integer keys of 1 for the forward reads file, and 2 for the reverse reads file. If not paired-end,
Return type: dict
Raises: Exception – The barcode is specified and the number of FASTQ files found is not exactly 2.
get_fastq_dxfile_objects(barcode=None)¶
Retrieves all the FASTQ files in project self.dx_project_name as DXFile objects.
Parameters: barcode – str. If set, then only FASTQ file properties for FASTQ files having the specified barcode are returned.
Returns: list of DXFile objects representing FASTQ files.
Raises: dnanexus_utils.FastqNotFound – No FASTQ files were found.
revcomp_barcode_in_fastqfile_prop(i7=False, i5=False)¶
Use this method if you need to update the barcode sequence stored as the value of the barcode property of a FASTQ file on DNAnexus.
Parameters:
- i7 – bool. True means to reverse complement the i7 barcode.
- i5 – bool. True means to reverse complement the i5 barcode.
revcomp(seq)¶
Returns the reverse complement of a DNA sequence.
Parameters: seq – str.
Returns: str.
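For readers unfamiliar with the operation, reverse complementing a barcode can be sketched in a few lines. This is a standalone illustration, not the method's actual implementation; handling of lowercase bases and N is an assumption made here.

```python
# Translation table mapping each base to its complement (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def revcomp(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


result = revcomp("AGTTCC")
print(result)  # GGAACT
```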
get_fastq_files_props(barcode=None)¶
Returns the DNAnexus file properties for all FASTQ files in the project that match the specified barcode, or all FASTQ files if no barcode is specified.
Parameters: barcode – str. If set, then only FASTQ file properties for FASTQ files having the specified barcode are returned.
Returns: dict. Keys are the FASTQ file DXFile objects; values are the dict of associated properties on DNAnexus on the file. In addition to the properties on the file in DNAnexus, an additional property is added here called ‘fastq_file_name’.
Raises: dnanexus_utils.FastqNotFound – No FASTQ files were found.