API¶
Building references¶
Indexers¶
IM-Fusion uses Indexer
classes to generate references for the
different RNA-seq aligners, with one class provided for each aligner. The main
methods provided by the Indexers are the check_dependencies
and build
methods. The former is used to check whether required external dependencies
are present in $PATH
, whilst the latter should be called with the required
reference files to build the actual reference.
-
class
imfusion.build.indexers.
Indexer
(logger=None, skip_index=False)[source]¶ Base Indexer class.
This base class defines the interface for Indexer classes and provides basic functionality for building augmented reference genomes. Functionality for building the indices for alignment is provided by sub-classes (as this is aligner specific), typically by overriding the protected
_build_indices
method.The main interface of the class is formed by the
check_dependencies
andbuild
methods. The former is used to check whether required external dependencies are present in$PATH
, whilst the latter is called with the required reference files to build the actual reference. On completion, thebuild
method returns aReference
object, which defines paths to the various reference files. The class of this object is defined by the_reference_class
method, which may be overridden to return aligner-specificReference
subclasses.Additionally, the class provides the methods
configure_args
andfrom_args
, which are used to configure the argument parser forimfusion-build
and to instantate instances from parsed arguments. Subclasses with extra options should override the protected_parse_args
method, which extracts constructor parameters from argparse arguments forfrom_args
, allowing for flexible interpretation of these arguments.Parameters: - logger (logging.Logger) – Logger to be used for logging messages.
- skip_index (bool) – Whether to skip building the index. Mostly used for debugging purposes, as this typically results in an incomplete reference.
-
build
(refseq_path, gtf_path, transposon_path, transposon_features_path, output_dir, blacklist_regions=None, blacklist_genes=None)[source]¶ Builds an augmented reference containing the transposon sequence.
The reference is built by first copying the original reference sequence into the output directory and then augmenting this reference with the transposon sequence. If needed, blacklisted genes/regions are then masked within this reference. Next, other supporting files (such as the reference GTF) are copied and indexed as needed. As a final step, aligner-specific indices are built for the augmented reference.
Parameters: - refseq_path (Path) – Path to the reference sequence (in Fasta format).
- gtf_path (Path) – Path to the reference GTF file.
- transposon_path (Path) – Path to the transposon sequence (in Fasta format).
- transposon_features_path (Path) – Path to the transposon features (tab-separated file).
- output_dir (Path) – Path to the output directory. Should not yet exist.
- blacklist_regions (List[Tuple[str, int, int]]) – List or regions that should be blacklisted, specified as a tuple of (chromosome, start, end).
- blacklist_genes (List[str]) – List of genes that should be blacklisted. Should correspond to
gene_id
entries in the passed GTF file.
Returns: Returns a
Reference
object, which describes the paths to various files within the generated reference.Return type:
-
check_dependencies
()[source]¶ Checks if all required external dependencies are in
$PATH
.Raises a ValueError if any dependencies are missing.
-
classmethod
configure_args
(parser)[source]¶ Configures an argument parser for the Indexer.
Used by
imfusion-build
to configure the sub-command for this indexer (if registered as an Indexer using theregister_indexer
function).Parameters: parser (argparse.ArgumentParser) – Argument parser to configure.
-
dependencies
¶ External dependencies required by Indexer.
-
classmethod
from_args
(args)[source]¶ Constructs an Indexer instance from given arguments.
Instantiates an Indexer instance from the given argparse arguments. Uses the
_parse_args
method internally, which performs the actual argument-to-parameter extraction. As such, the_parse_args
method should be overridden in any subclasses with extra parameters.Parameters: args (argparse.Namespace) – Arguments to parse.
-
class
imfusion.build.indexers.
StarIndexer
(logger=None, skip_index=False, overhang=100, threads=1)[source]¶ Indexer that builds references for the STAR aligner.
Performs the same steps as the base
Indexer
class, but additionally generates an index for alignment with STAR using STAR’sgenomeGenerate
command. Special attention should be paid to theoverhang
parameter, which defines the overhang used by STAR in the build reference (see thesjdbOverhang
parameter in the STAR documentation for more details). Ideally, the value for this parameter should be one less than the length of the used reads.-
classmethod
configure_args
(parser)[source]¶ Configures an argument parser for the Indexer.
Used by
imfusion-build
to configure the sub-command for this indexer (if registered as an Indexer using theregister_indexer
function).Parameters: parser (argparse.ArgumentParser) – Argument parser to configure.
-
dependencies
¶ External dependencies required by this indexer.
-
classmethod
-
class
imfusion.build.indexers.
TophatIndexer
(logger=None, skip_index=False)[source]¶ Indexer that builds references for the Tophat-Fusion aligner.
Performs the same steps as the base
Indexer
class, but additionally generates an index for alignment with Tophat-Fusion using Bowtie and Tophat2.-
dependencies
¶ External dependencies required by this indexer.
-
References¶
Indexer
classes return Reference
objects after building
references, which define paths to specific files within the reference.
This includes, for example, paths to the augmented fasta file, to the file
describing transposon features and to the reference index. The Reference
base class defines paths that should be available for all references.
Subclasses may define additional paths that are specific to the
corresponding aligner.
-
class
imfusion.build.indexers.
Reference
(reference_path)[source]¶ Base Reference class.
Reference classes define paths to specific files within the reference. This includes, for example, paths to the augmented fasta file and to the r eference index. This base class defines paths that should be available for all references. Subclasses may define additional paths that are specific to the corresponding aligner.
Parameters: reference_path (Path) – Path to the reference directory. -
base_path
¶ Path to reference base directory.
-
exon_gtf_path
¶ Path to exon gtf.
-
fasta_path
¶ Path to reference sequence.
-
features_path
¶ Path to transposon features.
-
gtf_path
¶ Path to reference gtf.
-
index_path
¶ Path to index.
-
indexed_gtf_path
¶ Path to reference gtf.
-
transposon_name
¶ Name of transposon sequence.
-
transposon_path
¶ Name of transposon sequence.
-
-
class
imfusion.build.indexers.
StarReference
(reference_path)[source]¶ Star Reference class.
Defines paths to files within the STAR reference. Currently the same as the base
Reference
class.
-
class
imfusion.build.indexers.
TophatReference
(reference_path)[source]¶ Tophat Reference class.
Defines paths to files within the Tophat-Fusion reference. Compared to the base reference, this class adds an additional path to the transcriptome index that is used by Tophat during alignment.
-
index_path
¶ Path to index.
-
transcriptome_path
¶ Path to transcriptome index.
-
Identifying insertions¶
IM-Fusion uses Aligner
classes to identify insertions using the different
RNA-seq aligners, with one class provided for each supported aligner. The main
methods provided by the Indexers are the check_dependencies
and
identify_insertions
methods. The latter should be called with a reference
instance and the sequence read files to identify insertions.
-
class
imfusion.insertions.aligners.
Aligner
(reference, logger=None)[source]¶ Base aligner class.
This base class defines the interface for
Aligner
classes. The main methods provided byAligners
are thecheck_dependencies
andidentify_insertions
methods. The former is used to check whether required external dependencies are present in$PATH
, whilst the latter should be called with a reference instance and the sequence read files to identify insertions. The implementation of theidentify_insertions
method is specific to each aligner and should be overridden in each subclass.Parameters: - reference (Reference) – Reference class describing the reference paths.
- logger (logging.Logger) – Logger to be used for logging messages.
-
check_dependencies
()[source]¶ Checks if all required external dependencies are in
$PATH
.Raises a ValueError if any dependencies are missing.
-
classmethod
configure_args
(parser)[source]¶ Configures an argument parser for the Indexer.
Used by
imfusion-build
to configure the sub-command for this indexer (if registered as an Indexer using theregister_indexer
function).Parameters: parser (argparse.ArgumentParser) – Argument parser to configure.
-
dependencies
¶ External dependencies required by aligner.
-
classmethod
from_args
(args)[source]¶ Constructs an Indexer instance from given arguments.
Instantiates an Indexer instance from the given argparse arguments. Uses the
_parse_args
method internally, which performs the actual argument-to-parameter extraction. As such, the_parse_args
method should be overridden in any subclasses with extra parameters.Parameters: args (argparse.Namespace) – Arguments to parse.
-
identify_insertions
(fastq_path, output_dir, fastq2_path=None)[source]¶ Identifies insertions from given reads.
Aligns RNA-seq reads to the reference genome and uses this alignment to identify gene-transposon fusions. These gene-transposon fusions are summarized to transposon insertions and returned.
Parameters: - fastq_path (Path) – Path to fastq file containing sequence reads. For paired-end data, this should refer to the first read of the pair.
- output_dir (Path) – Output directory, to use for output files such as the alignment.
- fastq2_path (Path) – For paired-end sequencing data, path to fastq file containing the second read of the pair.
Yields: Insertion – Yields the identified insertions.
-
class
imfusion.insertions.aligners.
StarAligner
(reference, assemble=False, assemble_args=None, min_flank=12, threads=1, extra_args=None, logger=None, filter_features=True, filter_orientation=True, filter_blacklist=None, external_sort=False, merge_junction_dist=10, max_spanning_dist=300, max_junction_dist=10000)[source]¶ STAR aligner.
Aligner that identifies insertions using STAR, by aligning reads using STAR (including chimeric alignments) and identifying gene-transposon fusions from chimeric read alignments.
As STAR provides chimeric alignment information but does not provide a an actual list of fusions, we identify fusions by summarizing these alignments ourselves. The strategy to do so is as follows:
- We filter for chimeric reads involving the transposon and genomic loci, to reduce the number of reads we need to consider.
- We select chimeric reads (or mates) that actually overlap the breakpoint of the fusion, as these alignments provide the exact position of the fusion. These reads are grouped by their positions to determine the number of reads that support each fusion. Fusions with close distance (on both the genomic and transposon loci) are merged to account for slight variations in the alignment.
- For paired-end data:
- We select mates that span the fusion site (meaning that they do not overlap the breakpoint, therefore we only have an approximate position) and group mates of which both ends are with close proximity of eachother.
- We then try to assign these groups of spanning mates to previously identified ‘junction’ fusions that are in the close vicinity. If we find a match, the number of mates in the group is registered as ‘spanning support’ for that fusion. If no match is found, we consider the group to identify a unique fusion and derive an approximate location for the fusion from the mate locations.
- We return the combined list of fusions. For single-end data, this only contains junction fusions, whilst for paired-end data this is the combined list of junction and unassigned spanning fusions.
To avoid memory issues with STAR during the sorting of bam files, sorting can be performed using Sambamba. Other external dependencies of this aligner are STAR itself and Stringtie, which is used for the assembly of novel transcripts (if requested).
Parameters: - reference (Reference) – Reference class describing the reference paths.
- assemble (bool) – Whether to perform the assembly of novel transcripts using Stringie.
- assemble_args (Dict[str, Any]) – Extra arguments to pass to Stringie during assembly.
- min_flank (int) – Minimum number of flanking bases required on both sides of the
breakpoint for a fusion to be considered valid. This means that, for
a split read overlapping the fusion boundary, at least
min_flank
bases should be on either sides of the fusion. - threads (int) – Number of threads to use.
- extra_args – Extra arguments to pass to STAR for alignment.
- logger (logging.Logger) – Logger to be used for logging messages.
- filter_features (bool) – Whether insertions should be filtered to remove non SA/SD insertions.
- filter_orientation (bool) – Whether insertions should be filtered for fusions in which the transposon feature is in the wrong orientation.
- filter_blacklist (List[str]) – List of gene ids to filter insertions for.
- external_sort (bool) – Whether sorting should be performed using STAR (False) or using Sambamba (True). Sambamba uses less memory than STAR, but is slower.
- merge_junction_dist (int) – Maximum distance within which fusions supported by split reads (overlapping the junction, so that the exact breakpoint is known) are merged. This merging avoids calling multiple fusions due to slight variations in the alignment, although this value should not be chosen too large to avoid merging distinct insertions.
- max_spanning_dist (int) – Maximum distance within which spanning mate pairs (mates that do not overlap the fusion junction) are grouped when summarizing spanning chimeric reads. Both mates from two pairs need to be within this distance of each other to be merged. The value should be chosen to reflect the expected or emprical insert size.
- max_junction_dist (int) – Maxixmum distance within which groups of spanning mates are assigned to a junction fusion (which is supported by split reads, so that its exact position is known). Groups that cannot be assigned to a junction fusion are considered to arise from a separate insertion.
-
classmethod
configure_args
(parser)[source]¶ Configures an argument parser for the Indexer.
Used by
imfusion-build
to configure the sub-command for this indexer (if registered as an Indexer using theregister_indexer
function).Parameters: parser (argparse.ArgumentParser) – Argument parser to configure.
-
dependencies
¶ External dependencies required by aligner.
-
identify_insertions
(fastq_path, output_dir, fastq2_path=None)[source]¶ Identifies insertions from given reads.
Aligns RNA-seq reads to the reference genome and uses this alignment to identify gene-transposon fusions. These gene-transposon fusions are summarized to transposon insertions and returned.
Parameters: - fastq_path (Path) – Path to fastq file containing sequence reads. For paired-end data, this should refer to the first read of the pair.
- output_dir (Path) – Output directory, to use for output files such as the alignment.
- fastq2_path (Path) – For paired-end sequencing data, path to fastq file containing the second read of the pair.
Yields: Insertion – Yields the identified insertions.
-
class
imfusion.insertions.aligners.
TophatAligner
(reference, assemble=False, assemble_args=None, min_flank=12, threads=1, extra_args=None, logger=None, filter_features=True, filter_orientation=True, filter_blacklist=None)[source]¶ Tophat2 aligner.
Aligner that identifies insertions using Tophat-Fusion (as implemented in Tophat2), which aligns RNA-seq reads and returns a summary file detailing the identified fusions. Gene-transposon fusions are identified directly from this summary file.
External dependencies include Tophat2 and Bowtie1, which are both used for alignment. Stringtie is required for the detection of novel transcripts.
Parameters: - reference (Reference) – Reference class describing the reference paths.
- assemble (bool) – Whether to perform the assembly of novel transcripts using Stringie.
- assemble_args (Dict[str, Any]) – Extra arguments to pass to Stringie during assembly.
- min_flank (int) – Minimum number of flanking bases required on both sides of the
breakpoint for a fusion to be considered valid. This means that, for
a split read overlapping the fusion boundary, at least
min_flank
bases should be on either sides of the fusion. - threads (int) – Number of threads to use.
- extra_args – Extra arguments to pass to Tophat2 for alignment.
- logger (logging.Logger) – Logger to be used for logging messages.
- filter_features (bool) – Whether insertions should be filtered to remove non SA/SD insertions.
- filter_orientation (bool) – Whether insertions should be filtered for fusions in which the transposon feature is in the wrong orientation.
- filter_blacklist (List[str]) – List of gene ids to filter insertions for.
Differential expression¶
Quantification¶
The imfusion.expression
module provides several functions for generating,
reading and normalizing exon expression counts. These functions are used by
IM-Fusion to prepare counts for the differential expression tests.
-
imfusion.expression.
generate_exon_counts
(bam_files, gtf_path, names=None, extra_kws=None, tmp_dir=None, keep_tmp=False)[source]¶ Generates exon counts for given bam files using featureCounts.
This function is used to generate a m-by-n matrix (m = number of samples, n = number of exons) of exon expression counts. This matrix is generated using featureCounts, whose results are then parsed and returned.
Parameters: - bam_files (list[pathlib.Path]) – List of paths to the bam files for which counts should be generated.
- gtf_path (pathlib.Path) – Path to the GTF file containing gene features.
- names (dict[str, str]) – Alternative names to use for the given bam files. Keys of the dict should correspond to bam file paths, values should reflect the sample names that should be used in the resulting count matrix.
- extra_kws (dict[str, tuple]:) – Dictionary of extra arguments that should be passed to feature counts. Keys should correspond to argument names (including dashes), values should be tuples containing the argument values.
- tmp_dir (pathlib.Path) – Temp directory to use for generating counts.
- keep_tmp (bool) – Whether to keep the temp directory (default = False).
- **kwargs – Any kwargs are passed to feature_counts.
Returns: DataFrame containing counts. The index of the DataFrame contains gene ids corresponding to exons in the gff file, the columns correspond to samples/bam files. Column names are either the bam file paths, or the alternative sample names if given.
Return type: pandas.DataFrame
-
imfusion.expression.
read_exon_counts
(file_path)[source]¶ Reads exon counts from an IM-Fusion expression file.
Parameters: - file_path (pathlib.Path) – Path to the exon count file.
- gene_id (Optional[str]) – Optional gene_id.
Returns: Matrix of exon counts. The rows correspond to the counted features, the columns correspond to the index values (chomosome, position etc.) and the samples.
Return type: pd.DataFrame
-
imfusion.expression.
normalize_counts
(counts)[source]¶ Normalizes counts for sequencing depth using median-of-ratios.
Normalizes gene expression counts for differences in sequencing depth between samples using the median-of-ratios approach described in DESeq2.
Parameters: counts (pd.DataFrame) – Matrix of gene counts, with genes along the rows and samples along the columns. Returns: Matrix of normalized expression counts. Return type: pd.DataFrame
-
imfusion.expression.
estimate_size_factors
(counts)[source]¶ Calculates size factors using median-of-ratios.
Calculates normalization factors for the median-of-ratios approach for normalizing expression counts for differences in sequencing depths between samples.
Parameters: counts (pd.DataFrame) – Matrix of gene counts, with genes along the rows and samples along the columns. Returns: Array of normalization factors, with one entry per sample. Return type: np.Array
DE tests¶
IM-Fusion provides three differential expression tests, which are implemented
in their own functions (test_de_exon
, test_de_exon_single
and
test_de_gene
). Each of these functions returns a corresponding test result
instance (see below for more details). The test_de
function is a
convenient wrapper that applies the group-wise exon-level test for multiple
genes, and provides an optional fallback to the gene-level test for genes that
cannot be tested at the exon-level.
-
imfusion.expression.
test_de
(insertions, exon_counts, gene_ids, fallback_to_gene=False, gene_counts=None)[source]¶ Performs and summarizes group-wise DE tests for multiple genes.
High-level function that performs the group-wise DE tests for multiple genes and summarizes the results in a single table. Provides an optional fallback to perform the gene-level test for genes that can not be tested using the exon level test (typically due to a lack of exons before/after the insertion sites in the gene).
Parameters: - insertions (List[Insertion]) – Insertions to use for test.
- counts (pd.DataFrame) – Exon expression counts to use for test. Expected to conform to the format returned by the read_exon_counts function.
- gene_ids (List[str]) – IDs of genes to test for differential expression. Expected to conform with gene_ids used for the insertions and the expression counts.
- fallback_to_gene (bool) – If true, the gene-level test is used as a fallback for genes that do not have any exons before/after their insertion sites. If false, no test is performed for these genes.
- gene_counts (pd.DataFrame) – Optional matrix of gene expression counts to use for the gene-level test. If not given, the exon expression counts are used to estimate gene-level expression by summing the expression values of the gene exons.
Returns: DataFrame summarizing the results of the DE tests. Contains three columns: ‘gene_id’, ‘p_value’ and ‘direction’. The p-value indicates the significance of the differential expression for the given gene, whilst the direction indicates whether the expression goes up (=1) in samples with an insertion or goes down (=-1).
Return type: pd.DataFrame
-
imfusion.expression.
test_de_exon
(insertions, exon_counts, gene_id, pos_samples=None, neg_samples=None)[source]¶ Performs the groupwise exon-level differential expression test.
Tests if the expression of exons after the insertion site(s) in a gene is significantly increased or decreased in samples with an insertion (pos_samples) compared to samples without an insertion (neg_samples). The test is performed by comparing normalized counts after the insertion sites between samples with and without an insertion in the gene, using the non-parametric Mann-Whitney-U test.
Note that the before/after split for the groupwise test is taken as the common set of before/after exons over all samples with an insertion. In cases where either set is empty, for example due to insertions before the first exon of the gene, we attempt to drop samples that prevent a proper split and perform the test without these samples.
Parameters: - insertions (List[Insertion] or pd.DataFrame) – List of insertions.
- exon_counts (pandas.DataFrame) – Matrix containing exon counts, with samples along the columns and exons along the rows. The DataFrame should have a multi-index containing the chromosome, start, end and strand of the exon. The samples should correspond with samples in the insertions frame.
- gene_id (str) – ID of the gene of interest. Should correspond with a gene in the count matrix.
- pos_samples (Set[str]) – Set of positive samples (with insertion) to use in the test. Defaults to all samples with an insertion in the gene of interest.
- neg_samples (Set[str]) – Set of negative samples (without insertion) to use in the test. Defaults to all samples not in the positive set.
Returns: Result of the differential expression test.
Return type:
-
imfusion.expression.
test_de_exon_single
(insertions, exon_counts, insertion_id, gene_id, neg_samples=None)[source]¶ Performs the single sample exon-level differential expression test.
Tests if the expression of exons after the insertion site(s) in a gene is significantly increased or decreased in the sample with the given insertion compared to samples without an insertion (neg_samples). The test is performed by comparing normalized counts between the given sample and the set of background samples using the negative binomial distribution.
Note that the before/after split for the groupwise test is taken as the common set of before/after exons over all samples with an insertion. In cases where either set is empty, for example due to insertions before the first exon of the gene, we attempt to drop samples that prevent a proper split and perform the test without these samples.
Parameters: - insertions (Union[List[Insertion], pd.DataFrame]) – List of insertions.
- exon_counts (pandas.DataFrame) – Matrix containing exon counts, with samples along the columns and exons along the rows. The DataFrame should have a multi-index containing the chromosome, start, end and strand of the exon. The samples should correspond with samples in the insertions frame.
- gene_id (str) – ID of the gene of interest. Should correspond with a gene in the count matrix.
- neg_samples (Set[str]) – Set of negative samples (without insertion) to use in the test. Defaults to all samples without an insertion in the gene.
Returns: Result of the differential expression test.
Return type:
-
imfusion.expression.
test_de_gene
(insertions, gene_counts, gene_id, pos_samples=None, neg_samples=None)[source]¶ Performs the groupwise gene-level differential expression test.
Tests if the expression of the given gene is signficantly increased or decreased in samples with an insertion in the gene. Significance is calculated using a mannwhitneyu test between the two groups, after normalizing for sequencing depth using median-of-ratios normalization (as implemented in DESeq2).
Parameters: - insertions (Union[List[Insertion], pd.DataFrame]) – List of insertions.
- gene_counts (pandas.DataFrame) – Matrix containing gene counts, with samples along the columns and genes along the rows.
- gene_id (str) – ID of the gene of interest. Should correspond with a gene in the count matrix.
- pos_samples (Set[str]) – Set of positive samples (with insertion) to use in the test. Defaults to all samples with an insertion in the gene of interest.
- neg_samples (Set[str]) – Set of negative samples (without insertion) to use in the test. Defaults to all samples not in the positive set.
Returns: Result of the differential expression test.
Return type:
DE results¶
Each of the three differential expression tests returns a result instance that contains the results of the test and provides several functions for visualizing the test result.
-
class
imfusion.expression.test.
DeResult
(before, after, pos_samples, neg_samples, dropped_samples, direction, p_value)[source]¶ Class embodying the results of the groupwise exon-level DE test.
-
before
¶ pandas.DataFrame – Expression counts of exons before the split.
-
after
¶ pandas.DataFrame – Expression counts of exons after the split.
-
positive_samples
¶ Set[str] – Samples with an insertion.
-
negative_samples
¶ Set[str] – Samples without an insertion.
-
dropped_samples
¶ Set[str] – Samples that were dropped from the analysis, because of violating the constraints on the minimum number of exons before/after the split.
-
direction
¶ int – Direction of the differential expression (1 = positive, -1 = negative).
-
p_value
¶ float – P-value of the differential expression test.
-
-
class
imfusion.expression.test.
DeSingleResult
(before, after, pos_sample, neg_samples, direction, p_value)[source]¶ Class embodying the results of the single-sample exon-level DE test.
-
before
¶ pandas.DataFrame – Expression counts of exons before the split.
-
after
¶ pandas.DataFrame – Expression counts of exons after the split.
-
positive_sample
¶ str – Samples with an insertion.
-
negative_samples
¶ Set[str] – Samples without an insertion.
-
direction
¶ int – Direction of the differential expression (1 = positive, -1 = negative).
-
p_value
¶ float – P-value of the differential expression test.
-
-
class
imfusion.expression.test.
DeGeneResult
(counts, pos_samples, neg_samples, direction, p_value)[source]¶ Class embodying the results of the groupwise gene-level DE test.
-
counts
¶ pandas.Series – Expression counts of gene for the different samples.
-
positive_samples
¶ Set[str] – Samples with an insertion.
-
negative_samples
¶ Set[str] – Samples without an insertion.
-
direction
¶ int – Direction of the differential expression (1 = positive, -1 = negative).
-
p_value
¶ float – P-value of the differential expression test.
-