Usage
Building a reference
Before detecting insertion sites, IM-Fusion first needs to build an
augmented version of the host reference genome that contains the
sequence of the transposon as additional sequence. This augmented reference
is created using the imfusion-build command, which essentially concatenates
the host reference genome and the transposon sequence (both in Fasta
format) into a new Fasta file and builds the indices needed for alignment.
Separate sub-commands are provided for each supported aligner (currently STAR
and Tophat-Fusion).
The basic command for building a (STAR-based) reference is as follows:
imfusion-build star \
--reference_seq Mus_musculus.GRCm38.dna.primary_assembly.fa \
--reference_gtf Mus_musculus.GRCm38.76.gtf \
--transposon_seq t2onc2.sequence.fa \
--transposon_features t2onc2.features.txt \
--output_dir references/GRCm38.76.t2onc.star \
--blacklist_genes ENSMUSG00000039095 ENSMUSG00000038402 \
--star_threads 4
Here, the --reference_seq argument points to the fasta file containing
the sequence of the reference genome, --reference_gtf points to the gtf
file containing the reference gene features, --transposon_seq points to
the transposon sequence and --transposon_features points to a file
describing the transposon features. The reference files are written to the
path specified by --output_dir.
It is important to blacklist genes or genomic sequences that are part of the
transposon sequence using the optional --blacklist_genes and
--blacklist_regions arguments. The former can be used to blacklist entire
genes (specified by their ID in the GTF file), whilst the latter can be used
to blacklist specific regions (specified as chr:start-end). Sequences of
blacklisted regions are replaced by ‘N’ nucleotides in the generated reference.
Failure to blacklist shared sequences will result in multiple alignments and prevent proper identification of transposon insertions.
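As a minimal sketch of using --blacklist_regions, the snippet below checks that a region string has the chr:start-end form described above before passing it to imfusion-build. The helper is_valid_region and the region value are hypothetical and only serve as illustration; the coordinates are a placeholder, not a real region shared with the transposon.

```shell
# Succeeds if $1 has the form chr:start-end with numeric, ordered coordinates.
is_valid_region() {
    printf '%s\n' "$1" | grep -Eq '^[^:[:space:]]+:[0-9]+-[0-9]+$' || return 1
    start=${1#*:}; start=${start%-*}   # text between ':' and '-'
    end=${1##*-}                       # text after the last '-'
    [ "$start" -le "$end" ]
}

region="1:1000-2000"   # hypothetical placeholder, replace with real coordinates

if is_valid_region "$region"; then
    # Printed for inspection only; drop the leading 'echo' to run the build.
    echo imfusion-build star \
        --reference_seq Mus_musculus.GRCm38.dna.primary_assembly.fa \
        --reference_gtf Mus_musculus.GRCm38.76.gtf \
        --transposon_seq t2onc2.sequence.fa \
        --transposon_features t2onc2.features.txt \
        --output_dir references/GRCm38.76.t2onc.star \
        --blacklist_regions "$region"
else
    echo "Malformed region: $region" >&2
fi
```

Validating the string up front is cheap compared with discovering a typo only after a lengthy index build.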
The command for building a Tophat-Fusion reference is nearly identical:
imfusion-build tophat \
--reference_seq Mus_musculus.GRCm38.dna.primary_assembly.fa \
--reference_gtf Mus_musculus.GRCm38.76.gtf \
--transposon_seq t2onc2.sequence.fa \
--transposon_features t2onc2.features.txt \
--output_dir references/GRCm38.76.t2onc.tophat \
--blacklist_genes ENSMUSG00000039095 ENSMUSG00000038402
However, each aligner also has some aligner-specific arguments for building its reference; see the help of the respective sub-command for details. For STAR, pay special attention to memory usage, as STAR requires approximately 30 GB of memory to build the reference genome.
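Because of this memory requirement, it can help to check the machine before starting a STAR build. The following is a rough, Linux-only sketch that reads the total memory from /proc/meminfo; the helper name total_mem_gib is hypothetical, and the 30 threshold is simply the approximate figure quoted above.

```shell
# Print total memory in whole GiB, read from a meminfo-style file (Linux).
# MemTotal is reported in kB, so dividing twice by 1024 gives GiB.
total_mem_gib() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' "${1:-/proc/meminfo}"
}

required_gib=30   # approximate requirement quoted for STAR

# Skip the check silently on systems without /proc/meminfo.
if [ -r /proc/meminfo ]; then
    if [ "$(total_mem_gib)" -ge "$required_gib" ]; then
        echo "Memory looks sufficient for building a STAR reference."
    else
        echo "Warning: less than ${required_gib} GiB of total memory." >&2
    fi
fi
```

Total memory is only an upper bound; memory already in use by other processes is not taken into account here.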
Detecting insertions (per sample)
After building the augmented reference, we can detect transposon insertions
in each sample using the imfusion-insertions command. This command
essentially runs a gene-fusion aware aligner (either STAR or Tophat-Fusion,
with separate sub-commands for each aligner) to align the RNA-seq reads and
identify gene fusions, extracts gene-transposon fusions from the results and
derives the corresponding insertion sites. The insertions are written as the
tab-separated file insertions.txt in the output directory.
The basic command for imfusion-insertions using the STAR aligner is:
imfusion-insertions star \
--fastq sample_s1.R1.fastq.gz \
--reference references/GRCm38.76.t2onc.star \
--output_dir output/sample_s1 \
--star_threads 4
In this command, the --fastq argument specifies a path to the fastq
file containing RNA-seq reads for the given sample. For paired-end samples, the
reads of the second mate should be provided using the optional --fastq2 argument.
The --reference argument should point to the previously built augmented
reference, whilst the --output_dir argument specifies where the
sample output should be written.
An optional --assemble argument indicates whether IM-Fusion should perform
a reference-guided transcript assembly. If given, IM-Fusion runs Stringtie
after the RNA-seq alignment to detect novel gene transcripts based on the
RNA-seq alignment. The results of this assembly are subsequently used in the
insertion detection step to annotate insertions that involve novel transcripts.
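The optional arguments can be combined with the basic command. The sketch below prints an imfusion-insertions command for one sample, adding --fastq2 only when a second-mate file exists and --assemble on request. The helper insertions_cmd, the R2 file naming and the treatment of --assemble as a switch without a value are assumptions made for illustration.

```shell
# Print an imfusion-insertions command for one sample.
#   $1 = sample name, $2 = FASTQ directory, $3 = "yes" to add --assemble
# The command is built as one string, so paths must not contain spaces.
insertions_cmd() {
    sample=$1
    fastq_dir=$2
    cmd="imfusion-insertions star --fastq $fastq_dir/$sample.R1.fastq.gz"
    # Add the second mate only when a matching R2 file is present.
    if [ -f "$fastq_dir/$sample.R2.fastq.gz" ]; then
        cmd="$cmd --fastq2 $fastq_dir/$sample.R2.fastq.gz"
    fi
    cmd="$cmd --reference references/GRCm38.76.t2onc.star"
    cmd="$cmd --output_dir output/$sample --star_threads 4"
    if [ "${3:-no}" = "yes" ]; then
        cmd="$cmd --assemble"   # assumed to be a switch without a value
    fi
    printf '%s\n' "$cmd"
}

# Example: print the command for sample_s1 with assembly enabled.
insertions_cmd sample_s1 fastq yes
```

Printing the command first makes it easy to review what would be run for single-end and paired-end samples before committing to a long alignment.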
The command for using Tophat-Fusion is nearly identical:
imfusion-insertions tophat \
--fastq sample_s1.R1.fastq.gz \
--reference references/GRCm38.76.t2onc.tophat \
--output_dir output/sample_s1 \
--tophat_threads 4
However, each aligner also has some aligner-specific arguments concerning the alignment; see the help of the respective sub-command for details. For STAR, pay special attention to memory usage, as STAR requires approximately 30 GB of memory (per process) to load the reference genome.
Quantifying expression (per sample)
After detecting insertions, we use the generated RNA-seq alignment to quantify exon expression counts for the given sample. These counts are later used to test for differential expression when identifying candidate genes from a group of samples.
The expression counts are generated using the imfusion-expression command:
imfusion-expression \
--sample_dir output/sample_s1 \
--reference references/GRCm38.76.t2onc.star
Here, the --sample_dir argument should point to a sample directory (which
was previously generated by imfusion-insertions). The --reference
argument should point to the reference that was used to identify insertions.
Two optional arguments --paired and --stranded can be used to indicate
whether the alignment contains paired-end sequencing data and to indicate the
strandedness of the RNA-seq reads.
The generated counts are written to the sample directory as the TSV
file exon_counts.txt.
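Since both per-sample steps write a known file into the sample directory (insertions.txt and exon_counts.txt), a small script can work out which steps are still outstanding for each sample. The following sketch prints the commands that remain to be run; the helper pending_cmds and the sample names are hypothetical.

```shell
reference=references/GRCm38.76.t2onc.star

# Print the commands still needed for the given samples.
#   $1 = base output directory, remaining arguments = sample names
pending_cmds() {
    out_dir=$1; shift
    for sample in "$@"; do
        # Insertion detection has to come first, as it produces the alignment.
        if [ ! -f "$out_dir/$sample/insertions.txt" ]; then
            echo "imfusion-insertions star --fastq $sample.R1.fastq.gz" \
                 "--reference $reference --output_dir $out_dir/$sample" \
                 "--star_threads 4"
        fi
        if [ ! -f "$out_dir/$sample/exon_counts.txt" ]; then
            echo "imfusion-expression --sample_dir $out_dir/$sample" \
                 "--reference $reference"
        fi
    done
}

# Example: list outstanding work for two samples under ./output.
pending_cmds output sample_s1 sample_s2
```

The printed lines can be reviewed and then executed one by one, which also makes it straightforward to resume after an interrupted run.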
Merging sample results
To detect genes that are recurrently mutated across samples, we first merge
the individual sample results into a combined dataset using imfusion-merge.
This command effectively concatenates the individual results into combined
insertions.txt and exon_counts.txt files.
The basic command is as follows:
imfusion-merge --sample_dirs ./output/sample_s1 \
./output/sample_s2 \
--output ./output/merged.insertions.txt \
--output_expression ./output/merged.exon_counts.txt
In this command, the --sample_dirs argument points to the sample
directories that should be merged and --output indicates the path to
which the merged insertion file should be written. The --output_expression
argument indicates where merged expression counts should be written. This
argument may be omitted if no expression counts were generated for the samples.
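With many samples, listing every directory by hand becomes tedious. As a sketch, the helper below (merge_cmd, a hypothetical name) collects every sample directory that already contains an insertions.txt file and prints the corresponding imfusion-merge command.

```shell
# Print an imfusion-merge command covering every sample directory under $1
# that contains an insertions.txt file. Paths must not contain spaces.
merge_cmd() {
    dirs=""
    for f in "$1"/*/insertions.txt; do
        [ -f "$f" ] || continue              # glob matched nothing
        dirs="$dirs ${f%/insertions.txt}"    # strip the file name
    done
    if [ -z "$dirs" ]; then
        echo "No sample directories found in $1" >&2
        return 1
    fi
    # Drop --output_expression if no expression counts were generated.
    echo "imfusion-merge --sample_dirs$dirs" \
         "--output $1/merged.insertions.txt" \
         "--output_expression $1/merged.exon_counts.txt"
}

# Example: print the merge command for ./output (if it has any samples).
merge_cmd ./output || true
```

Samples for which insertion detection has not finished are skipped automatically, because their directories lack insertions.txt.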
Selecting (DE) CTGs
To identify genes that are commonly targeted by insertions (commonly targeted
genes, or CTGs), IM-Fusion uses the Poisson distribution to test whether a
given gene has more insertions than may be expected by chance. This test is
performed on the merged dataset using the imfusion-ctg command:
imfusion-ctg --insertions ./output/merged.insertions.txt \
--expression ./output/merged.exon_counts.txt \
--reference references/GRCm38.76.t2onc.star \
--chromosomes 1 2 3 4 5 6 7 8 9 10 11 12 \
13 14 15 16 17 18 19 X \
--output ./output/merged.ctgs.txt
Here, --insertions and --expression should point to the merged
insertions and expression files generated by imfusion-merge. The
--reference argument refers to the same reference as used for the
alignment, whilst --output specifies the path where the CTG output should
be written. The --chromosomes argument specifies which chromosomes
should be included in the test and is mainly used to omit chromosomes
containing the transposon donor loci.
The parameters for the CTG test can be changed using the --window,
--pattern and --chromosomes arguments. The --window parameter
specifies the size of the window around genes within which insertions should
be included. The --pattern argument can be used to account for integration
biases of the transposon, if the transposon is known to integrate at specific
nucleotide sequences.
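To give some intuition for the Poisson test, the following is a sketch of a one-sided Poisson test under the assumption that insertions are spread uniformly over the eligible positions; it is not necessarily the exact statistic computed by imfusion-ctg. Here k is the number of insertions observed in the window around a gene, N the total number of insertions, w_g the number of eligible positions in that window (for example, matches of the --pattern sequence) and W the number of eligible positions on all included chromosomes. All of these symbols are introduced here for illustration only.

```latex
\lambda = N \cdot \frac{w_g}{W},
\qquad
p = P(X \ge k) = 1 - \sum_{i=0}^{k-1} \frac{\lambda^{i} e^{-\lambda}}{i!},
\qquad X \sim \mathrm{Poisson}(\lambda)
```

For example, with an expected count of \lambda = 0.5 and k = 3 observed insertions, p = 1 - e^{-0.5}(1 + 0.5 + 0.125) \approx 0.014. A larger window or a more frequent pattern increases w_g and therefore \lambda, so the same number of insertions becomes less surprising.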
Finally, the significance thresholds for the CTG and DE tests can be specified
using the --threshold and --de_threshold arguments:
imfusion-ctg --insertions ./output/merged.insertions.txt \
--expression ./output/merged.exon_counts.txt \
--reference references/GRCm38.76.t2onc.star \
--chromosomes 1 2 3 4 5 6 7 8 9 10 11 12 \
13 14 15 16 17 18 19 X \
--output ./output/merged.ctgs.txt \
--threshold 0.05 \
--de_threshold 0.05
Optionally, the differential expression test can be skipped by not providing the expression data. In this case, only the CTG test is performed:
imfusion-ctg --insertions ./output/merged.insertions.txt \
--reference references/GRCm38.76.t2onc.star \
--chromosomes 1 2 3 4 5 6 7 8 9 10 11 12 \
13 14 15 16 17 18 19 X \
--output ./output/merged.ctgs.txt \
--threshold 0.05