API

Building references

Indexers

IM-Fusion uses Indexer classes to generate references for the different RNA-seq aligners, with one class provided for each aligner. The main methods provided by the Indexers are the check_dependencies and build methods. The former is used to check whether required external dependencies are present in $PATH, whilst the latter should be called with the required reference files to build the actual reference.

class imfusion.build.indexers.Indexer(logger=None, skip_index=False)[source]

Base Indexer class.

This base class defines the interface for Indexer classes and provides basic functionality for building augmented reference genomes. Functionality for building the indices for alignment is provided by sub-classes (as this is aligner specific), typically by overriding the protected _build_indices method.

The main interface of the class is formed by the check_dependencies and build methods. The former is used to check whether required external dependencies are present in $PATH, whilst the latter is called with the required reference files to build the actual reference. On completion, the build method returns a Reference object, which defines paths to the various reference files. The class of this object is defined by the _reference_class method, which may be overridden to return aligner-specific Reference subclasses.

Additionally, the class provides the methods configure_args and from_args, which are used to configure the argument parser for imfusion-build and to instantate instances from parsed arguments. Subclasses with extra options should override the protected _parse_args method, which extracts constructor parameters from argparse arguments for from_args, allowing for flexible interpretation of these arguments.

Parameters:
  • logger (logging.Logger) – Logger to be used for logging messages.
  • skip_index (bool) – Whether to skip building the index. Mostly used for debugging purposes, as this typically results in an incomplete reference.
build(refseq_path, gtf_path, transposon_path, transposon_features_path, output_dir, blacklist_regions=None, blacklist_genes=None)[source]

Builds an augmented reference containing the transposon sequence.

The reference is built by first copying the original reference sequence into the output directory and then augmenting this reference with the transposon sequence. If needed, blacklisted genes/regions are then masked within this reference. Next, other supporting files (such as the reference GTF) are copied and indexed as needed. As a final step, aligner-specific indices are built for the augmented reference.

Parameters:
  • refseq_path (Path) – Path to the reference sequence (in Fasta format).
  • gtf_path (Path) – Path to the reference GTF file.
  • transposon_path (Path) – Path to the transposon sequence (in Fasta format).
  • transposon_features_path (Path) – Path to the transposon features (tab-separated file).
  • output_dir (Path) – Path to the output directory. Should not yet exist.
  • blacklist_regions (List[Tuple[str, int, int]]) – List or regions that should be blacklisted, specified as a tuple of (chromosome, start, end).
  • blacklist_genes (List[str]) – List of genes that should be blacklisted. Should correspond to gene_id entries in the passed GTF file.
Returns:

Returns a Reference object, which describes the paths to various files within the generated reference.

Return type:

Reference

check_dependencies()[source]

Checks if all required external dependencies are in $PATH.

Raises a ValueError if any dependencies are missing.

classmethod configure_args(parser)[source]

Configures an argument parser for the Indexer.

Used by imfusion-build to configure the sub-command for this indexer (if registered as an Indexer using the register_indexer function).

Parameters:parser (argparse.ArgumentParser) – Argument parser to configure.
dependencies

External dependencies required by Indexer.

classmethod from_args(args)[source]

Constructs an Indexer instance from given arguments.

Instantiates an Indexer instance from the given argparse arguments. Uses the _parse_args method internally, which performs the actual argument-to-parameter extraction. As such, the _parse_args method should be overridden in any subclasses with extra parameters.

Parameters:args (argparse.Namespace) – Arguments to parse.
class imfusion.build.indexers.StarIndexer(logger=None, skip_index=False, overhang=100, threads=1)[source]

Indexer that builds references for the STAR aligner.

Performs the same steps as the base Indexer class, but additionally generates an index for alignment with STAR using STAR’s genomeGenerate command. Special attention should be paid to the overhang parameter, which defines the overhang used by STAR in the build reference (see the sjdbOverhang parameter in the STAR documentation for more details). Ideally, the value for this parameter should be one less than the length of the used reads.

classmethod configure_args(parser)[source]

Configures an argument parser for the Indexer.

Used by imfusion-build to configure the sub-command for this indexer (if registered as an Indexer using the register_indexer function).

Parameters:parser (argparse.ArgumentParser) – Argument parser to configure.
dependencies

External dependencies required by this indexer.

class imfusion.build.indexers.TophatIndexer(logger=None, skip_index=False)[source]

Indexer that builds references for the Tophat-Fusion aligner.

Performs the same steps as the base Indexer class, but additionally generates an index for alignment with Tophat-Fusion using Bowtie and Tophat2.

dependencies

External dependencies required by this indexer.

References

Indexer classes return Reference objects after building references, which define paths to specific files within the reference. This includes, for example, paths to the augmented fasta file, to the file describing transposon features and to the reference index. The Reference base class defines paths that should be available for all references. Subclasses may define additional paths that are specific to the corresponding aligner.

class imfusion.build.indexers.Reference(reference_path)[source]

Base Reference class.

Reference classes define paths to specific files within the reference. This includes, for example, paths to the augmented fasta file and to the r eference index. This base class defines paths that should be available for all references. Subclasses may define additional paths that are specific to the corresponding aligner.

Parameters:reference_path (Path) – Path to the reference directory.
base_path

Path to reference base directory.

exon_gtf_path

Path to exon gtf.

fasta_path

Path to reference sequence.

features_path

Path to transposon features.

gtf_path

Path to reference gtf.

index_path

Path to index.

indexed_gtf_path

Path to reference gtf.

transposon_name

Name of transposon sequence.

transposon_path

Name of transposon sequence.

class imfusion.build.indexers.StarReference(reference_path)[source]

Star Reference class.

Defines paths to files within the STAR reference. Currently the same as the base Reference class.

class imfusion.build.indexers.TophatReference(reference_path)[source]

Tophat Reference class.

Defines paths to files within the Tophat-Fusion reference. Compared to the base reference, this class adds an additional path to the transcriptome index that is used by Tophat during alignment.

index_path

Path to index.

transcriptome_path

Path to transcriptome index.

Identifying insertions

IM-Fusion uses Aligner classes to identify insertions using the different RNA-seq aligners, with one class provided for each supported aligner. The main methods provided by the Indexers are the check_dependencies and identify_insertions methods. The latter should be called with a reference instance and the sequence read files to identify insertions.

class imfusion.insertions.aligners.Aligner(reference, logger=None)[source]

Base aligner class.

This base class defines the interface for Aligner classes. The main methods provided by Aligners are the check_dependencies and identify_insertions methods. The former is used to check whether required external dependencies are present in $PATH, whilst the latter should be called with a reference instance and the sequence read files to identify insertions. The implementation of the identify_insertions method is specific to each aligner and should be overridden in each subclass.

Parameters:
  • reference (Reference) – Reference class describing the reference paths.
  • logger (logging.Logger) – Logger to be used for logging messages.
check_dependencies()[source]

Checks if all required external dependencies are in $PATH.

Raises a ValueError if any dependencies are missing.

classmethod configure_args(parser)[source]

Configures an argument parser for the Indexer.

Used by imfusion-build to configure the sub-command for this indexer (if registered as an Indexer using the register_indexer function).

Parameters:parser (argparse.ArgumentParser) – Argument parser to configure.
dependencies

External dependencies required by aligner.

classmethod from_args(args)[source]

Constructs an Indexer instance from given arguments.

Instantiates an Indexer instance from the given argparse arguments. Uses the _parse_args method internally, which performs the actual argument-to-parameter extraction. As such, the _parse_args method should be overridden in any subclasses with extra parameters.

Parameters:args (argparse.Namespace) – Arguments to parse.
identify_insertions(fastq_path, output_dir, fastq2_path=None)[source]

Identifies insertions from given reads.

Aligns RNA-seq reads to the reference genome and uses this alignment to identify gene-transposon fusions. These gene-transposon fusions are summarized to transposon insertions and returned.

Parameters:
  • fastq_path (Path) – Path to fastq file containing sequence reads. For paired-end data, this should refer to the first read of the pair.
  • output_dir (Path) – Output directory, to use for output files such as the alignment.
  • fastq2_path (Path) – For paired-end sequencing data, path to fastq file containing the second read of the pair.
Yields:

Insertion – Yields the identified insertions.

class imfusion.insertions.aligners.StarAligner(reference, assemble=False, assemble_args=None, min_flank=12, threads=1, extra_args=None, logger=None, filter_features=True, filter_orientation=True, filter_blacklist=None, external_sort=False, merge_junction_dist=10, max_spanning_dist=300, max_junction_dist=10000)[source]

STAR aligner.

Aligner that identifies insertions using STAR, by aligning reads using STAR (including chimeric alignments) and identifying gene-transposon fusions from chimeric read alignments.

As STAR provides chimeric alignment information but does not provide a an actual list of fusions, we identify fusions by summarizing these alignments ourselves. The strategy to do so is as follows:

  1. We filter for chimeric reads involving the transposon and genomic loci, to reduce the number of reads we need to consider.
  2. We select chimeric reads (or mates) that actually overlap the breakpoint of the fusion, as these alignments provide the exact position of the fusion. These reads are grouped by their positions to determine the number of reads that support each fusion. Fusions with close distance (on both the genomic and transposon loci) are merged to account for slight variations in the alignment.
  3. For paired-end data:
    • We select mates that span the fusion site (meaning that they do not overlap the breakpoint, therefore we only have an approximate position) and group mates of which both ends are with close proximity of eachother.
    • We then try to assign these groups of spanning mates to previously identified ‘junction’ fusions that are in the close vicinity. If we find a match, the number of mates in the group is registered as ‘spanning support’ for that fusion. If no match is found, we consider the group to identify a unique fusion and derive an approximate location for the fusion from the mate locations.
  4. We return the combined list of fusions. For single-end data, this only contains junction fusions, whilst for paired-end data this is the combined list of junction and unassigned spanning fusions.

To avoid memory issues with STAR during the sorting of bam files, sorting can be performed using Sambamba. Other external dependencies of this aligner are STAR itself and Stringtie, which is used for the assembly of novel transcripts (if requested).

Parameters:
  • reference (Reference) – Reference class describing the reference paths.
  • assemble (bool) – Whether to perform the assembly of novel transcripts using Stringie.
  • assemble_args (Dict[str, Any]) – Extra arguments to pass to Stringie during assembly.
  • min_flank (int) – Minimum number of flanking bases required on both sides of the breakpoint for a fusion to be considered valid. This means that, for a split read overlapping the fusion boundary, at least min_flank bases should be on either sides of the fusion.
  • threads (int) – Number of threads to use.
  • extra_args – Extra arguments to pass to STAR for alignment.
  • logger (logging.Logger) – Logger to be used for logging messages.
  • filter_features (bool) – Whether insertions should be filtered to remove non SA/SD insertions.
  • filter_orientation (bool) – Whether insertions should be filtered for fusions in which the transposon feature is in the wrong orientation.
  • filter_blacklist (List[str]) – List of gene ids to filter insertions for.
  • external_sort (bool) – Whether sorting should be performed using STAR (False) or using Sambamba (True). Sambamba uses less memory than STAR, but is slower.
  • merge_junction_dist (int) – Maximum distance within which fusions supported by split reads (overlapping the junction, so that the exact breakpoint is known) are merged. This merging avoids calling multiple fusions due to slight variations in the alignment, although this value should not be chosen too large to avoid merging distinct insertions.
  • max_spanning_dist (int) – Maximum distance within which spanning mate pairs (mates that do not overlap the fusion junction) are grouped when summarizing spanning chimeric reads. Both mates from two pairs need to be within this distance of each other to be merged. The value should be chosen to reflect the expected or emprical insert size.
  • max_junction_dist (int) – Maxixmum distance within which groups of spanning mates are assigned to a junction fusion (which is supported by split reads, so that its exact position is known). Groups that cannot be assigned to a junction fusion are considered to arise from a separate insertion.
classmethod configure_args(parser)[source]

Configures an argument parser for the Indexer.

Used by imfusion-build to configure the sub-command for this indexer (if registered as an Indexer using the register_indexer function).

Parameters:parser (argparse.ArgumentParser) – Argument parser to configure.
dependencies

External dependencies required by aligner.

identify_insertions(fastq_path, output_dir, fastq2_path=None)[source]

Identifies insertions from given reads.

Aligns RNA-seq reads to the reference genome and uses this alignment to identify gene-transposon fusions. These gene-transposon fusions are summarized to transposon insertions and returned.

Parameters:
  • fastq_path (Path) – Path to fastq file containing sequence reads. For paired-end data, this should refer to the first read of the pair.
  • output_dir (Path) – Output directory, to use for output files such as the alignment.
  • fastq2_path (Path) – For paired-end sequencing data, path to fastq file containing the second read of the pair.
Yields:

Insertion – Yields the identified insertions.

class imfusion.insertions.aligners.TophatAligner(reference, assemble=False, assemble_args=None, min_flank=12, threads=1, extra_args=None, logger=None, filter_features=True, filter_orientation=True, filter_blacklist=None)[source]

Tophat2 aligner.

Aligner that identifies insertions using Tophat-Fusion (as implemented in Tophat2), which aligns RNA-seq reads and returns a summary file detailing the identified fusions. Gene-transposon fusions are identified directly from this summary file.

External dependencies include Tophat2 and Bowtie1, which are both used for alignment. Stringtie is required for the detection of novel transcripts.

Parameters:
  • reference (Reference) – Reference class describing the reference paths.
  • assemble (bool) – Whether to perform the assembly of novel transcripts using Stringie.
  • assemble_args (Dict[str, Any]) – Extra arguments to pass to Stringie during assembly.
  • min_flank (int) – Minimum number of flanking bases required on both sides of the breakpoint for a fusion to be considered valid. This means that, for a split read overlapping the fusion boundary, at least min_flank bases should be on either sides of the fusion.
  • threads (int) – Number of threads to use.
  • extra_args – Extra arguments to pass to Tophat2 for alignment.
  • logger (logging.Logger) – Logger to be used for logging messages.
  • filter_features (bool) – Whether insertions should be filtered to remove non SA/SD insertions.
  • filter_orientation (bool) – Whether insertions should be filtered for fusions in which the transposon feature is in the wrong orientation.
  • filter_blacklist (List[str]) – List of gene ids to filter insertions for.
identify_insertions(fastq_path, output_dir, fastq2_path=None)[source]

Identifies insertions from given reads.

Differential expression

Quantification

The imfusion.expression module provides several functions for generating, reading and normalizing exon expression counts. These functions are used by IM-Fusion to prepare counts for the differential expression tests.

imfusion.expression.generate_exon_counts(bam_files, gtf_path, names=None, extra_kws=None, tmp_dir=None, keep_tmp=False)[source]

Generates exon counts for given bam files using featureCounts.

This function is used to generate a m-by-n matrix (m = number of samples, n = number of exons) of exon expression counts. This matrix is generated using featureCounts, whose results are then parsed and returned.

Parameters:
  • bam_files (list[pathlib.Path]) – List of paths to the bam files for which counts should be generated.
  • gtf_path (pathlib.Path) – Path to the GTF file containing gene features.
  • names (dict[str, str]) – Alternative names to use for the given bam files. Keys of the dict should correspond to bam file paths, values should reflect the sample names that should be used in the resulting count matrix.
  • extra_kws (dict[str, tuple]:) – Dictionary of extra arguments that should be passed to feature counts. Keys should correspond to argument names (including dashes), values should be tuples containing the argument values.
  • tmp_dir (pathlib.Path) – Temp directory to use for generating counts.
  • keep_tmp (bool) – Whether to keep the temp directory (default = False).
  • **kwargs – Any kwargs are passed to feature_counts.
Returns:

DataFrame containing counts. The index of the DataFrame contains gene ids corresponding to exons in the gff file, the columns correspond to samples/bam files. Column names are either the bam file paths, or the alternative sample names if given.

Return type:

pandas.DataFrame

imfusion.expression.read_exon_counts(file_path)[source]

Reads exon counts from an IM-Fusion expression file.

Parameters:
  • file_path (pathlib.Path) – Path to the exon count file.
  • gene_id (Optional[str]) – Optional gene_id.
Returns:

Matrix of exon counts. The rows correspond to the counted features, the columns correspond to the index values (chomosome, position etc.) and the samples.

Return type:

pd.DataFrame

imfusion.expression.normalize_counts(counts)[source]

Normalizes counts for sequencing depth using median-of-ratios.

Normalizes gene expression counts for differences in sequencing depth between samples using the median-of-ratios approach described in DESeq2.

Parameters:counts (pd.DataFrame) – Matrix of gene counts, with genes along the rows and samples along the columns.
Returns:Matrix of normalized expression counts.
Return type:pd.DataFrame
imfusion.expression.estimate_size_factors(counts)[source]

Calculates size factors using median-of-ratios.

Calculates normalization factors for the median-of-ratios approach for normalizing expression counts for differences in sequencing depths between samples.

Parameters:counts (pd.DataFrame) – Matrix of gene counts, with genes along the rows and samples along the columns.
Returns:Array of normalization factors, with one entry per sample.
Return type:np.Array

DE tests

IM-Fusion provides three differential expression tests, which are implemented in their own functions (test_de_exon, test_de_exon_single and test_de_gene). Each of these functions returns a corresponding test result instance (see below for more details). The test_de function is a convenient wrapper that applies the group-wise exon-level test for multiple genes, and provides an optional fallback to the gene-level test for genes that cannot be tested at the exon-level.

imfusion.expression.test_de(insertions, exon_counts, gene_ids, fallback_to_gene=False, gene_counts=None)[source]

Performs and summarizes group-wise DE tests for multiple genes.

High-level function that performs the group-wise DE tests for multiple genes and summarizes the results in a single table. Provides an optional fallback to perform the gene-level test for genes that can not be tested using the exon level test (typically due to a lack of exons before/after the insertion sites in the gene).

Parameters:
  • insertions (List[Insertion]) – Insertions to use for test.
  • counts (pd.DataFrame) – Exon expression counts to use for test. Expected to conform to the format returned by the read_exon_counts function.
  • gene_ids (List[str]) – IDs of genes to test for differential expression. Expected to conform with gene_ids used for the insertions and the expression counts.
  • fallback_to_gene (bool) – If true, the gene-level test is used as a fallback for genes that do not have any exons before/after their insertion sites. If false, no test is performed for these genes.
  • gene_counts (pd.DataFrame) – Optional matrix of gene expression counts to use for the gene-level test. If not given, the exon expression counts are used to estimate gene-level expression by summing the expression values of the gene exons.
Returns:

DataFrame summarizing the results of the DE tests. Contains three columns: ‘gene_id’, ‘p_value’ and ‘direction’. The p-value indicates the significance of the differential expression for the given gene, whilst the direction indicates whether the expression goes up (=1) in samples with an insertion or goes down (=-1).

Return type:

pd.DataFrame

imfusion.expression.test_de_exon(insertions, exon_counts, gene_id, pos_samples=None, neg_samples=None)[source]

Performs the groupwise exon-level differential expression test.

Tests if the expression of exons after the insertion site(s) in a gene is significantly increased or decreased in samples with an insertion (pos_samples) compared to samples without an insertion (neg_samples). The test is performed by comparing normalized counts after the insertion sites between samples with and without an insertion in the gene, using the non-parametric Mann-Whitney-U test.

Note that the before/after split for the groupwise test is taken as the common set of before/after exons over all samples with an insertion. In cases where either set is empty, for example due to insertions before the first exon of the gene, we attempt to drop samples that prevent a proper split and perform the test without these samples.

Parameters:
  • insertions (List[Insertion] or pd.DataFrame) – List of insertions.
  • exon_counts (pandas.DataFrame) – Matrix containing exon counts, with samples along the columns and exons along the rows. The DataFrame should have a multi-index containing the chromosome, start, end and strand of the exon. The samples should correspond with samples in the insertions frame.
  • gene_id (str) – ID of the gene of interest. Should correspond with a gene in the count matrix.
  • pos_samples (Set[str]) – Set of positive samples (with insertion) to use in the test. Defaults to all samples with an insertion in the gene of interest.
  • neg_samples (Set[str]) – Set of negative samples (without insertion) to use in the test. Defaults to all samples not in the positive set.
Returns:

Result of the differential expression test.

Return type:

DeResult

imfusion.expression.test_de_exon_single(insertions, exon_counts, insertion_id, gene_id, neg_samples=None)[source]

Performs the single sample exon-level differential expression test.

Tests if the expression of exons after the insertion site(s) in a gene is significantly increased or decreased in the sample with the given insertion compared to samples without an insertion (neg_samples). The test is performed by comparing normalized counts between the given sample and the set of background samples using the negative binomial distribution.

Note that the before/after split for the groupwise test is taken as the common set of before/after exons over all samples with an insertion. In cases where either set is empty, for example due to insertions before the first exon of the gene, we attempt to drop samples that prevent a proper split and perform the test without these samples.

Parameters:
  • insertions (Union[List[Insertion], pd.DataFrame]) – List of insertions.
  • exon_counts (pandas.DataFrame) – Matrix containing exon counts, with samples along the columns and exons along the rows. The DataFrame should have a multi-index containing the chromosome, start, end and strand of the exon. The samples should correspond with samples in the insertions frame.
  • gene_id (str) – ID of the gene of interest. Should correspond with a gene in the count matrix.
  • neg_samples (Set[str]) – Set of negative samples (without insertion) to use in the test. Defaults to all samples without an insertion in the gene.
Returns:

Result of the differential expression test.

Return type:

DeSingleResult

imfusion.expression.test_de_gene(insertions, gene_counts, gene_id, pos_samples=None, neg_samples=None)[source]

Performs the groupwise gene-level differential expression test.

Tests if the expression of the given gene is signficantly increased or decreased in samples with an insertion in the gene. Significance is calculated using a mannwhitneyu test between the two groups, after normalizing for sequencing depth using median-of-ratios normalization (as implemented in DESeq2).

Parameters:
  • insertions (Union[List[Insertion], pd.DataFrame]) – List of insertions.
  • gene_counts (pandas.DataFrame) – Matrix containing gene counts, with samples along the columns and genes along the rows.
  • gene_id (str) – ID of the gene of interest. Should correspond with a gene in the count matrix.
  • pos_samples (Set[str]) – Set of positive samples (with insertion) to use in the test. Defaults to all samples with an insertion in the gene of interest.
  • neg_samples (Set[str]) – Set of negative samples (without insertion) to use in the test. Defaults to all samples not in the positive set.
Returns:

Result of the differential expression test.

Return type:

DeResult

DE results

Each of the three differential expression tests returns a result instance that contains the results of the test and provides several functions for visualizing the test result.

class imfusion.expression.test.DeResult(before, after, pos_samples, neg_samples, dropped_samples, direction, p_value)[source]

Class embodying the results of the groupwise exon-level DE test.

before

pandas.DataFrame – Expression counts of exons before the split.

after

pandas.DataFrame – Expression counts of exons after the split.

positive_samples

Set[str] – Samples with an insertion.

negative_samples

Set[str] – Samples without an insertion.

dropped_samples

Set[str] – Samples that were dropped from the analysis, because of violating the constraints on the minimum number of exons before/after the split.

direction

int – Direction of the differential expression (1 = positive, -1 = negative).

p_value

float – P-value of the differential expression test.

plot_boxplot(log=False, ax=None, show_points=True, box_kws=None, strip_kws=None)[source]

Plots boxplot of ‘after’ expression for samples with/without insertions in the gene.

plot_sums(log=False, ax=None, line_kws=None)[source]

Plots the distribution of before/after counts for the samples.

class imfusion.expression.test.DeSingleResult(before, after, pos_sample, neg_samples, direction, p_value)[source]

Class embodying the results of the single-sample exon-level DE test.

before

pandas.DataFrame – Expression counts of exons before the split.

after

pandas.DataFrame – Expression counts of exons after the split.

positive_sample

str – Samples with an insertion.

negative_samples

Set[str] – Samples without an insertion.

direction

int – Direction of the differential expression (1 = positive, -1 = negative).

p_value

float – P-value of the differential expression test.

plot_sums(log=False, ax=None, line_kws=None)[source]

Plots the distribution of before/after counts for the samples.

class imfusion.expression.test.DeGeneResult(counts, pos_samples, neg_samples, direction, p_value)[source]

Class embodying the results of the groupwise gene-level DE test.

counts

pandas.Series – Expression counts of gene for the different samples.

positive_samples

Set[str] – Samples with an insertion.

negative_samples

Set[str] – Samples without an insertion.

direction

int – Direction of the differential expression (1 = positive, -1 = negative).

p_value

float – P-value of the differential expression test.

plot_boxplot(ax=None, log=False, box_kws=None)[source]

Plots boxplot comparing expression between the two groups.