Usage

Building a reference

Before detecting insertion sites, IM-Fusion needs an augmented version of the host reference genome that includes the transposon sequence as an additional sequence. This augmented reference is created using the imfusion-build command, which essentially concatenates the host reference genome and the transposon sequence (both in FASTA format) into a new FASTA file and builds the indices needed for alignment. Separate sub-commands are provided for each supported aligner (currently STAR and Tophat-Fusion).

The basic command for building a (STAR-based) reference is as follows:

imfusion-build star \
    --reference_seq Mus_musculus.GRCm38.dna.primary_assembly.fa \
    --reference_gtf Mus_musculus.GRCm38.76.gtf \
    --transposon_seq t2onc2.sequence.fa \
    --transposon_features t2onc2.features.txt \
    --output_dir references/GRCm38.76.t2onc.star \
    --blacklist_genes ENSMUSG00000039095 ENSMUSG00000038402 \
    --star_threads 4

Here, the --reference_seq argument points to the FASTA file containing the sequence of the reference genome, --reference_gtf points to the GTF file containing the reference gene features, --transposon_seq points to the FASTA file containing the transposon sequence and --transposon_features points to a file describing the transposon features. The reference files are written to the path specified by --output_dir.

It is important to blacklist genes or genomic sequences that are part of the transposon sequence using the optional --blacklist_genes and --blacklist_regions arguments. The former can be used to blacklist entire genes (specified by their ID in the GTF file), whilst the latter can be used to blacklist specific regions (specified as chr:start-end). Sequences of blacklisted regions are replaced by ‘N’ nucleotides in the generated reference.

Failure to blacklist shared sequences will result in multiple alignments and prevent proper identification of transposon insertions.
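
Because a malformed region string is easy to overlook, it can help to check the chr:start-end format before passing values to --blacklist_regions. The following is a minimal sketch, not part of IM-Fusion: the function name is_valid_region is hypothetical, and it only checks the shape of the string (a non-empty chromosome name, numeric start and end, start not larger than end), not whether the region exists in the reference.

```shell
# Hypothetical helper (not part of IM-Fusion): succeed if the argument looks
# like chr:start-end with numeric coordinates and start <= end.
is_valid_region() {
    _region=$1
    case $_region in
        *:*-*) ;;
        *) return 1 ;;
    esac
    _chrom=${_region%%:*}    # text before the first ':'
    _range=${_region#*:}     # text after the first ':'
    _start=${_range%%-*}     # text before the first '-'
    _end=${_range#*-}        # text after the first '-'
    [ -n "$_chrom" ] || return 1
    case $_start in ''|*[!0-9]*) return 1 ;; esac
    case $_end in ''|*[!0-9]*) return 1 ;; esac
    [ "$_start" -le "$_end" ]
}

# Example with made-up coordinates:
# is_valid_region "1:182462432-182462712" || echo "bad region" >&2
```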

The command for building a Tophat-Fusion reference is nearly identical:

imfusion-build tophat \
    --reference_seq Mus_musculus.GRCm38.dna.primary_assembly.fa \
    --reference_gtf Mus_musculus.GRCm38.76.gtf \
    --transposon_seq transposon.fa \
    --transposon_features transposon.features.txt \
    --output_dir references/GRCm38.76.t2onc.tophat \
    --blacklist_genes ENSMUSG00000039095 ENSMUSG00000038402

Both aligners do, however, have some aligner-specific arguments for building their references; see the help of the respective sub-commands for details. For STAR, pay special attention to memory usage, as STAR requires approximately 30 GB of memory to build the reference genome.
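
Given the memory requirement mentioned above, a quick check before starting a STAR build can save a failed run. The sketch below is an illustration under stated assumptions: the helper name has_enough_memory is hypothetical, the 30 GB default simply mirrors the approximate figure above, and the commented example reads total (not free) memory from /proc/meminfo, which is only available on Linux.

```shell
# Hypothetical helper (not part of IM-Fusion): succeed if the given amount of
# memory (in kB) is at least the required amount (in GB, default 30).
has_enough_memory() {
    _mem_kb=$1
    _required_gb=${2:-30}
    [ "$_mem_kb" -ge $((_required_gb * 1024 * 1024)) ]
}

# Example (Linux only): read total memory from /proc/meminfo.
# total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
# has_enough_memory "$total_kb" || echo "Warning: less than 30 GB of memory" >&2
```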

Detecting insertions (per sample)

After building the augmented reference, we can detect transposon insertions in each sample using the imfusion-insertions command. This command runs a gene-fusion-aware aligner (either STAR or Tophat-Fusion, with separate sub-commands for each aligner) to align the RNA-seq reads and identify gene fusions, extracts gene-transposon fusions from the results and derives the corresponding insertion sites. The insertions are written as the tab-separated file insertions.txt in the output directory.

The basic command for imfusion-insertions using the STAR aligner is:

imfusion-insertions star \
    --fastq sample_s1.R1.fastq.gz \
    --reference references/GRCm38.76.t2onc.star \
    --output_dir output/sample_s1 \
    --star_threads 4

In this command, the --fastq argument specifies a path to the fastq file containing RNA-seq reads for the given sample. For paired-end samples, the fastq file containing the second read of each pair should be provided using the optional --fastq2 argument. The --reference argument should point to the previously built augmented reference, whilst the --output_dir argument specifies where the sample output should be written.
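
When processing many samples, the single-end versus paired-end handling described above can be wrapped in a small shell function. This is a minimal sketch, not part of IM-Fusion: the function name insertions_cmd and the output/<sample> directory layout are assumptions, and the function only prints the command it would run so that it can be inspected first.

```shell
# Hypothetical wrapper: print the imfusion-insertions (STAR) command for one
# sample. Arguments: sample name, reference directory, first fastq file and,
# for paired-end samples, the second fastq file.
insertions_cmd() {
    _sample=$1
    _reference=$2
    _fastq1=$3
    _fastq2=${4:-}
    set -- imfusion-insertions star \
        --fastq "$_fastq1" \
        --reference "$_reference" \
        --output_dir "output/$_sample"
    # Only add --fastq2 when a second fastq file was given.
    if [ -n "$_fastq2" ]; then
        set -- "$@" --fastq2 "$_fastq2"
    fi
    printf '%s\n' "$*"
}

# Example:
# insertions_cmd sample_s1 references/GRCm38.76.t2onc.star \
#     sample_s1.R1.fastq.gz sample_s1.R2.fastq.gz
```

To execute the command instead of printing it, the final printf line could be replaced by "$@".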

An optional --assemble argument indicates whether IM-Fusion should perform a reference-guided transcript assembly. If given, IM-Fusion runs Stringtie after the RNA-seq alignment to detect novel gene transcripts based on the RNA-seq alignment. The results of this assembly are subsequently used in the insertion detection step to annotate insertions that involve novel transcripts.

The command for using Tophat-Fusion is nearly identical:

imfusion-insertions tophat \
    --fastq sample_s1.R1.fastq.gz \
    --reference references/GRCm38.76.t2onc.tophat \
    --output_dir output/sample_s1 \
    --tophat_threads 4

Both aligners do, however, have some aligner-specific arguments concerning the alignment; see the help of the respective sub-commands for details. For STAR, pay special attention to memory usage, as STAR requires approximately 30 GB of memory (per process) to load the reference genome.
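
After a sample has been processed, a quick sanity check is to count the rows in its insertions.txt file. The helper below is a sketch under an assumption that is not stated in this document: that the first line of the tab-separated file is a header. The function name count_insertions is hypothetical.

```shell
# Hypothetical helper: count the non-empty data rows of a tab-separated
# insertions file, assuming the first line is a header.
count_insertions() {
    awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Example:
# count_insertions output/sample_s1/insertions.txt
```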

Quantifying expression (per sample)

After detecting insertions, we use the generated RNA-seq alignment to quantify exon expression counts for the given sample. These counts are later used to test for differential expression when identifying candidate genes from a group of samples.

The expression counts are generated using the imfusion-expression command:

imfusion-expression \
    --sample_dir output/sample_s1 \
    --reference references/GRCm38.76.t2onc.star

Here, the --sample_dir argument should point to a sample directory (which was previously generated by imfusion-insertions). The --reference argument should point to the reference that was used to identify insertions. Two optional arguments --paired and --stranded can be used to indicate whether the alignment contains paired-end sequencing data and to indicate the strandedness of the RNA-seq reads.

The generated counts are written to the sample directory as the TSV file exon_counts.txt.

Merging sample results

To detect genes that are recurrently mutated across samples, we first merge the individual sample results into a combined dataset using imfusion-merge. This command effectively concatenates the individual results into combined insertions.txt and exon_counts.txt files.

The basic command is as follows:

imfusion-merge --sample_dirs ./output/sample_s1 \
                             ./output/sample_s2 \
               --output ./output/merged.insertions.txt \
               --output_expression ./output/merged.exon_counts.txt

In this command, the --sample_dirs argument points to the sample directories that should be merged and --output indicates the path to which the merged insertion file should be written. The --output_expression argument indicates where merged expression counts should be written. This argument may be omitted if no expression counts were generated for the samples.
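
With many samples, listing the sample directories by hand becomes error-prone. The sketch below collects every sub-directory of an output directory that contains an insertions.txt file. It is an illustration only: the function name list_sample_dirs is hypothetical, and the commented usage relies on shell word splitting, so it assumes that directory names contain no whitespace.

```shell
# Hypothetical helper: print each sub-directory of the given directory that
# contains an insertions.txt file, one per line.
list_sample_dirs() {
    _root=$1
    for _dir in "$_root"/*/; do
        [ -f "${_dir}insertions.txt" ] && printf '%s\n' "${_dir%/}"
    done
    return 0
}

# Example (assumes directory names without whitespace):
# imfusion-merge --sample_dirs $(list_sample_dirs ./output) \
#                --output ./output/merged.insertions.txt
```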

Selecting (DE) CTGs

To identify genes that are commonly targeted by insertions (commonly targeted genes, or CTGs), IM-Fusion uses the Poisson distribution to test whether a given gene has more insertions than may be expected by chance. This test is performed on the merged dataset using the imfusion-ctg command:

imfusion-ctg --insertions ./output/merged.insertions.txt  \
             --expression ./output/merged.exon_counts.txt \
             --reference references/GRCm38.76.t2onc.star \
             --chromosomes 1 2 3 4 5 6 7 8 9 10 11 12 \
                           13 14 15 16 17 18 19 X \
             --output ./output/merged.ctgs.txt

Here, --insertions and --expression should point to the merged insertions and expression files generated by imfusion-merge. The --reference argument refers to the same reference as used for the alignment, whilst --output specifies the path where the CTG output should be written. The --chromosomes argument specifies which chromosomes should be included in the test and is mainly used to omit chromosomes containing the transposon donor loci.
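
Rather than typing the chromosome list by hand, it can be generated while leaving out the chromosomes that should be excluded. The following sketch assumes the mouse chromosome set used in the example above (1 to 19 and X). The function name mouse_chromosomes_except is hypothetical, and which chromosomes to exclude depends on where the transposon donor loci are located in the mouse model being analysed.

```shell
# Hypothetical helper: print chromosomes 1-19 and X on one line, separated by
# spaces, skipping any chromosome given as an argument.
mouse_chromosomes_except() {
    for _chrom in $(seq 1 19) X; do
        _skip=0
        for _excluded in "$@"; do
            [ "$_chrom" = "$_excluded" ] && _skip=1
        done
        [ "$_skip" -eq 0 ] && printf '%s ' "$_chrom"
    done
    printf '\n'
}

# Example (the excluded chromosomes are placeholders):
# imfusion-ctg --insertions ./output/merged.insertions.txt \
#              --reference references/GRCm38.76.t2onc.star \
#              --chromosomes $(mouse_chromosomes_except 1 15) \
#              --output ./output/merged.ctgs.txt
```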

The parameters for the CTG test can be changed using the --window, --pattern and --chromosomes arguments. The --window parameter specifies the size of the window around genes within which insertions should be included. The --pattern argument can be used to account for integration biases of the transposon, if the transposon is known to integrate at specific nucleotide sequences.
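
To give a feel for the kind of Poisson test mentioned above, the sketch below computes the upper-tail probability P(X >= k) of observing at least k insertions when lambda insertions are expected by chance. This is an illustration only and not IM-Fusion's implementation: the function name poisson_upper_tail is hypothetical, and how IM-Fusion derives the expected number of insertions from the window and pattern settings is not described here.

```shell
# Illustration only: upper-tail Poisson probability P(X >= k) for an expected
# count lambda. Arguments: k (observed insertions), lambda (expected).
poisson_upper_tail() {
    awk -v k="$1" -v lambda="$2" 'BEGIN {
        term = exp(-lambda)          # P(X = 0)
        cdf = 0
        for (i = 0; i < k; i++) {
            cdf += term              # add P(X = i)
            term *= lambda / (i + 1) # P(X = i + 1) from P(X = i)
        }
        printf "%.6f\n", 1 - cdf
    }'
}

# Example: probability of at least 2 insertions when 1 is expected.
# poisson_upper_tail 2 1
```

A small probability means that the observed number of insertions would be unlikely under the chance expectation.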

Finally, the significance thresholds for the CTG and differential expression (DE) tests can be specified using the --threshold and --de_threshold arguments:

imfusion-ctg --insertions ./output/merged.insertions.txt  \
             --expression ./output/merged.exon_counts.txt \
             --reference references/GRCm38.76.t2onc.star \
             --chromosomes 1 2 3 4 5 6 7 8 9 10 11 12 \
                           13 14 15 16 17 18 19 X \
             --output ./output/merged.ctgs.txt \
             --threshold 0.05 \
             --de_threshold 0.05

Optionally, the differential expression test can be skipped by not providing the expression data. In this case, only the CTG test is performed:

imfusion-ctg --insertions ./output/merged.insertions.txt  \
             --reference references/GRCm38.76.t2onc.star \
             --chromosomes 1 2 3 4 5 6 7 8 9 10 11 12 \
                           13 14 15 16 17 18 19 X \
             --output ./output/merged.ctgs.txt \
             --threshold 0.05