Building a reference
Before detecting insertion sites, IM-Fusion first needs to build an
augmented version of the host reference genome that includes the transposon
sequence as an additional sequence. This augmented reference is created using
the imfusion-build command, which essentially concatenates the host reference
genome and the transposon sequence (both in FASTA format) into a new FASTA
file and builds the indices needed for alignment. Separate sub-commands are
provided for each supported aligner (currently STAR and Tophat-Fusion).
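The concatenation step can be pictured with a minimal Python sketch. It is
only an illustration under stated assumptions, not the imfusion-build
implementation: the helper name concatenate_fasta is hypothetical, and index
building is left entirely to the aligner.

```python
def concatenate_fasta(input_paths, output_path):
    """Write the records of several FASTA files, in order, to one file.

    Hypothetical helper for illustration; not part of IM-Fusion.
    """
    with open(output_path, "w") as out:
        for path in input_paths:
            last_line = "\n"
            with open(path) as handle:
                # Stream line by line so a large genome is never held in memory.
                for line in handle:
                    out.write(line)
                    last_line = line
            # Guard against a missing final newline, which would otherwise
            # glue the next header onto the end of the previous sequence.
            if not last_line.endswith("\n"):
                out.write("\n")
```

Called with the host genome first and the transposon second, this would yield
a file in which the transposon appears as one extra sequence after the host
chromosomes.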
The basic command for building a (STAR-based) reference is as follows:
imfusion-build star \
--reference_seq Mus_musculus.GRCm38.dna.primary_assembly.fa \
--reference_gtf Mus_musculus.GRCm38.76.gtf \
--transposon_seq t2onc2.sequence.fa \
--transposon_features t2onc2.features.txt \
--output_dir references/ \
--blacklist_genes ENSMUSG00000039095 ENSMUSG00000038402 \
--star_threads 4
Here, the --reference_seq argument points to the FASTA file containing the
sequence of the reference genome, --reference_gtf points to the GTF file
containing the reference gene features, --transposon_seq points to the
transposon sequence and --transposon_features points to a file describing the
transposon features. The reference files are written to the path specified by
--output_dir.
It is important to blacklist genes or genomic sequences that are part of the
transposon sequence using the optional --blacklist_genes and
--blacklist_regions arguments. The former can be used to blacklist entire
genes (specified by their ID in the GTF file), whilst the latter can be used
to blacklist specific regions (specified as chr:start-end). Sequences of
blacklisted regions are replaced by ‘N’ nucleotides in the generated reference.
Without blacklisting, reads from the shared sequences align to multiple
locations, which prevents proper identification of transposon insertions.
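To make the masking concrete, the following Python sketch replaces one region
of a sequence with ‘N’ nucleotides. It is an illustration rather than
IM-Fusion's own code: the function names are hypothetical, and the 1-based,
inclusive reading of chr:start-end is an assumption that should be checked
against the tool before relying on exact boundaries.

```python
def parse_region(region):
    """Split a 'chr:start-end' string into (chromosome, start, end)."""
    chrom, _, span = region.rpartition(":")
    start, end = span.split("-")
    return chrom, int(start), int(end)


def mask_region(sequence, start, end):
    """Replace positions start..end of a sequence with 'N'.

    Assumes 1-based, inclusive coordinates (an assumption, see above).
    """
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("region lies outside the sequence")
    return sequence[:start - 1] + "N" * (end - start + 1) + sequence[end:]
```

For example, mask_region("ACGTACGT", 3, 5) returns "ACNNNCGT": the sequence
keeps its length, so coordinates elsewhere on the chromosome are unaffected.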
The command for building a Tophat-Fusion reference is nearly identical:
imfusion-build tophat \
--reference_seq Mus_musculus.GRCm38.dna.primary_assembly.fa \
--reference_gtf Mus_musculus.GRCm38.76.gtf \
--transposon_seq transposon.fa \
--transposon_features transposon.features.txt \
--output_dir references/GRCm38.76.t2onc.tophat \
--blacklist_genes ENSMUSG00000039095 ENSMUSG00000038402
However, both aligners do have some aligner-specific arguments for building their references. See the help of the respective sub-commands for more details. For STAR, special attention should be paid to memory usage, as STAR requires approximately 30GB of memory for building the reference genome.
Detecting insertions (per sample)
After building the augmented reference, we can detect transposon insertions
in each sample using the imfusion-insertions command. This command
essentially runs a gene-fusion aware aligner (either STAR or Tophat-Fusion,
with separate sub-commands for each aligner) to align the RNA-seq reads and
identify gene fusions, extracts gene-transposon fusions from the results and
derives the corresponding insertion sites. The insertions are written as the
tab-separated file insertions.txt in the output directory.
The basic command for imfusion-insertions using the STAR aligner is:
imfusion-insertions star \
--fastq sample_s1.R1.fastq.gz \
--reference references/ \
--output_dir output/sample_s1 \
--star_threads 4
In this command, the --fastq argument specifies the path to the FASTQ file
containing RNA-seq reads for the given sample. For paired-end samples, the
file containing the second mates should be provided using the optional
--fastq2 argument. The --reference argument should point to the previously
built augmented reference, whilst the --output_dir argument specifies where
the sample output should be written.
An optional --assemble argument indicates whether IM-Fusion should perform
a reference-guided transcript assembly. If given, IM-Fusion runs StringTie
after the RNA-seq alignment to detect novel gene transcripts based on the
RNA-seq alignment. The results of this assembly are subsequently used in the
insertion detection step to annotate insertions that involve novel transcripts.
The command for using Tophat-Fusion is nearly identical:
imfusion-insertions tophat \
--fastq sample_s1.R1.fastq.gz \
--reference references/GRCm38.76.t2onc.tophat \
--output_dir output/sample_s1 \
--tophat_threads 4
However, both aligners do have some aligner-specific arguments concerning the alignment. See the help of the respective sub-commands for more details. For STAR, special attention should be paid to memory usage, as STAR requires approximately 30GB of memory (per process) for loading the reference genome.
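When many samples need processing, the per-sample command can be assembled
programmatically. The Python sketch below only builds the argument list from
the options described above; the function name and the sample paths are
hypothetical, and nothing is executed.

```python
def insertions_command(fastq, reference, output_dir, fastq2=None, threads=1):
    """Build the argument list for one 'imfusion-insertions star' run.

    Hypothetical helper; it only uses options described in this guide.
    """
    cmd = [
        "imfusion-insertions", "star",
        "--fastq", fastq,
        "--reference", reference,
        "--output_dir", output_dir,
        "--star_threads", str(threads),
    ]
    if fastq2 is not None:
        # Second mate file, only for paired-end samples.
        cmd += ["--fastq2", fastq2]
    return cmd


# Hypothetical sample sheet: sample name -> FASTQ path.
samples = {"sample_s1": "sample_s1.R1.fastq.gz"}
commands = [
    insertions_command(path, "references/", "output/" + name, threads=4)
    for name, path in samples.items()
]
```

Each list could then be passed to subprocess.run(cmd, check=True). Running
the samples one after another, rather than in parallel, keeps only a single
STAR process (and its memory footprint) active at a time.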
Quantifying expression (per sample)
After detecting insertions, we use the generated RNA-seq alignment to quantify exon expression counts for the given sample. These counts are later used to test for differential expression when identifying candidate genes from a group of samples.
The expression counts are generated using the imfusion-expression command:
imfusion-expression \
--sample_dir output/sample_s1 \
--reference references/
Here, the --sample_dir argument should point to a sample directory (which
was previously generated by imfusion-insertions). The --reference argument
should point to the reference that was used to identify insertions. Two
optional arguments, --paired and --stranded, can be used to indicate whether
the alignment contains paired-end sequencing data and to indicate the
strandedness of the RNA-seq reads.
The generated counts are written to the sample directory as the TSV file
exon_counts.txt.
Merging sample results
To detect genes that are recurrently mutated across samples, we first merge
the individual sample results into a combined dataset using imfusion-merge.
This command effectively concatenates the individual results into combined
insertions.txt and exon_counts.txt files.
The basic command is as follows:
imfusion-merge --sample_dirs ./output/sample_s1 \
./output/sample_s2 \
--output ./output/merged.insertions.txt \
--output_expression ./output/merged.exon_counts.txt
In this command, the --sample_dirs argument points to the sample
directories that should be merged and --output indicates the path to
which the merged insertion file should be written. The --output_expression
argument indicates where the merged expression counts should be written. This
argument may be omitted if no expression counts were generated for the samples.
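Conceptually, merging the insertion files amounts to stacking the per-sample
tables. The Python sketch below illustrates that idea and is not the
imfusion-merge implementation: the function name is hypothetical, and it
assumes that every insertions.txt starts with a header row and that all
samples share the same columns. Merging of the expression counts is not
covered.

```python
import csv
import os


def merge_insertions(sample_dirs, output_path, filename="insertions.txt"):
    """Concatenate per-sample tab-separated files, keeping one header row.

    Hypothetical helper; assumes a header row and identical columns.
    """
    header = None
    with open(output_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        for sample_dir in sample_dirs:
            path = os.path.join(sample_dir, filename)
            with open(path, newline="") as handle:
                reader = csv.reader(handle, delimiter="\t")
                sample_header = next(reader, None)
                if sample_header is None:
                    continue  # empty file, nothing to add
                if header is None:
                    # First sample defines the columns of the merged file.
                    header = sample_header
                    writer.writerow(header)
                elif sample_header != header:
                    raise ValueError("column mismatch in " + path)
                writer.writerows(reader)
```

Raising on a column mismatch, instead of silently stacking rows, is a
deliberate choice here so that samples produced with different settings are
noticed early.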
Selecting (DE) CTGs
To identify genes that are commonly targeted by insertions (commonly targeted
genes, or CTGs), IM-Fusion uses the Poisson distribution to test whether a
given gene has more insertions than may be expected by chance. This test is
performed on the merged dataset using the imfusion-ctg command:
imfusion-ctg --insertions ./output/merged.insertions.txt \
--expression ./output/merged.exon_counts.txt \
--reference references/ \
--chromosomes 1 2 3 4 5 6 7 8 9 10 11 12 \
13 14 15 16 17 18 19 X \
--output ./output/merged.ctgs.txt
Here, --insertions and --expression should point to the merged
insertions and expression files generated by imfusion-merge. The --reference
argument refers to the same reference as used for the alignment, whilst
--output specifies the path where the CTG output should be written. The
--chromosomes argument specifies which chromosomes should be included in the
test and is mainly used to omit chromosomes containing the transposon donor
loci.
The parameters for the CTG test can be changed using the --window and
--pattern arguments. The --window argument specifies the size of the window
around genes within which insertions should be included. The --pattern
argument can be used to account for integration biases of the transposon, if
the transposon is known to integrate at specific nucleotide sequences.
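The idea behind a Poisson-based CTG test can be sketched in a few lines of
Python. This is a simplified illustration, not IM-Fusion's actual
calculation: the function names are hypothetical, the background is assumed
to be uniform over the genome, the window is assumed to extend equally on
both sides of the gene, and integration biases (--pattern) are ignored.

```python
import math


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)  # P(X = i + 1) derived from P(X = i)
    return max(0.0, 1.0 - cdf)


def expected_insertions(total_insertions, gene_length, window, genome_length):
    """Expected insertions in a gene plus its flanking windows.

    Assumes insertions fall uniformly across the genome (an assumption).
    """
    return total_insertions * (gene_length + 2 * window) / genome_length
```

A gene with k observed insertions would then receive the p-value
poisson_sf(k, expected_insertions(...)), and a small value indicates more
insertions than the uniform background would predict. Because many genes are
tested, such p-values would normally also need a multiple-testing correction
before being compared against a threshold.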
Finally, the significance thresholds for the CTG and DE tests can be specified
using the --threshold and --de_threshold arguments:
imfusion-ctg --insertions ./output/merged.insertions.txt \
--expression ./output/merged.exon_counts.txt \
--reference references/ \
--chromosomes 1 2 3 4 5 6 7 8 9 10 11 12 \
13 14 15 16 17 18 19 X \
--output ./output/merged.ctgs.txt \
--threshold 0.05 \
--de_threshold 0.05
Optionally, the differential expression test can be skipped by not providing the expression data. In this case, only the CTG test is performed:
imfusion-ctg --insertions ./output/merged.insertions.txt \
--reference references/ \
--chromosomes 1 2 3 4 5 6 7 8 9 10 11 12 \
13 14 15 16 17 18 19 X \
--output ./output/merged.ctgs.txt \
--threshold 0.05