===== Usage ===== Building a reference -------------------- Before detecting insertion sites, IM-Fusion first needs to build an augmented version of the host reference genome that contains the sequence of the transposon as additional sequence. This augmented reference is created using the ``imfusion-build`` command, which essentially concatenates the host reference genome and the transposon sequence (both in Fasta format) into a new Fasta file and builds the indices needed for alignment. Separate sub-commands are provided for each supported aligner (currently STAR and Tophat-Fusion). The basic command for building a (STAR-based) reference is as follows: .. code:: bash imfusion-build star \ --reference_seq Mus_musculus.GRCm38.dna.primary_assembly.fa \ --reference_gtf Mus_musculus.GRCm38.76.gtf \ --transposon_seq t2onc2.sequence.fa \ --transposon_features t2onc2.features.txt \ --output_dir references/GRCm38.76.t2onc.star \ --blacklist_genes ENSMUSG00000039095 ENSMUSG00000038402 \ --star_threads 4 Here, the ``--reference_seq`` argument points to the fasta file containing the sequence of the reference genome, ``--reference_gtf`` points to the gtf file containing the reference gene features, ``--transposon_seq`` points to the transposon sequence and ``--transposon_features`` points to a file describing the transposon features. The reference files are written to the path specified by ``--output_dir``. It is important to blacklist genes or genomic sequences that are part of the transposon sequence using the optional ``--blacklist_genes`` and ``--blacklist_regions`` arguments. The former can be used to blacklist entire genes (specified by their ID in the GTF file), whilst the latter can be used to blacklist specific regions (specified as chr:start-end). Sequences of blacklisted regions are replaced by 'N' nucleotides in the generated reference. **Failure to blacklist shared sequences will result in multiple alignments and prevent proper identification of transposon insertions.** The command for building a Tophat-Fusion reference is nearly identical: .. code:: bash imfusion-build tophat \ --reference_seq Mus_musculus.GRCm38.dna.primary_assembly.fa \ --reference_gtf Mus_musculus.GRCm38.76.gtf \ --transposon_seq transposon.fa \ --transposon_features transposon.features.txt \ --output_dir references/GRCm38.76.t2onc.tophat \ --blacklist_genes ENSMUSG00000039095 ENSMUSG00000038402 However, both aligners do have some aligner-specific arguments for building their references. See the help of the respective sub-commands for more details. For STAR, special attention should be paid to memory usage, as STAR requires approximately 30GB of memory for building the reference genome. Detecting insertions (per sample) --------------------------------- After building the augmented reference, we can detect transposon insertions in each sample using the ``imfusion-insertions`` command. This command essentially runs a gene-fusion aware aligner (either STAR or Tophat-Fusion, with seperate sub-commands for each aligner) to align the RNA-seq reads and identify gene fusions, extracts gene-transposon fusions from the results and derives the corresponding insertion sites. The insertions are written as the tab-separated file ``insertions.txt`` in the output directory. The basic command for ``imfusion-insertions`` using the STAR aligner is: .. code:: bash imfusion-insertions star \ --fastq sample_s1.R1.fastq.gz \ --reference references/GRCm38.76.t2onc.star \ --output_dir output/sample_s1 \ --star_threads 4 In this command, the ``--fastq`` argument specifies a path to the fastq file containing RNA-seq reads for the given sample. For paired-end samples, the second pair should be provided using the optional ``--fastq2`` argument. The ``--reference`` argument should point to the previously built augmented reference, whilst the ``--output_dir`` argument specifies where the sample output should be written. An optional ``--assemble`` argument indicates whether IM-Fusion should perform a reference-guided transcript assembly. If given, IM-Fusion runs Stringtie after the RNA-seq alignment to detect novel gene transcripts based on the RNA-seq alignment. The results of this assembly are subsequently used in the insertion detection step to annotate insertions that involve novel transcripts. The command for using Tophat-Fusion is nearly identical: .. code:: bash imfusion-insertions tophat \ --fastq sample_s1.R1.fastq.gz \ --reference references/GRCm38.76.t2onc.tophat \ --output_dir output/sample_s1 \ --tophat_threads 4 However, both aligners do have some aligner-specific arguments concerning the alignment. See the help of the respective sub-commands for more details. For STAR, special attention should be paid to memory usage, as STAR requires approximately 30GB of memory (per process) for loading the reference genome. Quantifying expression (per sample) ----------------------------------- After detecting insertions, we use the generated RNA-seq alignment to quantify exon expression counts for the given sample. These counts are later used to test for differential expression when identifying candidate genes from a group of samples. The expression counts are generated using the ``imfusion-expression`` command: .. code:: bash imfusion-expression \ --sample_dir output/sample_s1 \ --reference references/GRCm38.76.t2onc.star Here, the ``--sample_dir`` argument should point to a sample directory (which was previously generated by ``imfusion-insertions``). The ``--reference`` argument should point to the reference that was used to identify insertions. Two optional arguments ``--paired`` and ``--stranded`` can be used to indicate whether the alignment contains paired-end sequencing data and to indicate the strandedness of the RNA-seq reads. The generated counts are written to the sample directory as the TSV file ``exon_counts.txt``. Merging sample results ---------------------- To detect genes that are recurrently mutated across samples, we first merge the individual sample results into a combined dataset using ``imfusion-merge``. This command effectively concatenates the individual results into combined ``insertions.txt`` and ``exon_counts.txt`` files. The basic command is as follows: .. code:: bash imfusion-merge --sample_dirs ./output/sample_s1 \ ./output/sample_s2 \ --output ./output/merged.insertions.txt \ --output_expression ./output/merged.exon_counts.txt In this command, the ``--sample_dirs`` argument points to the sample directories that should be merged and ``output`` indicates that path to which the merged insertion file should be written. The ``--output_expression`` argument indicates where merged expression counts should be written. This argument may be omitted if no expression counts were generated for the samples. Selecting (DE) CTGs ------------------- To identify genes that are commonly targeted by insertions (commonly targeted genes, or CTGs), IM-Fusion uses the Poisson distribution to test whether a given gene has more insertions than may be expected by chance. This test is performed on the merged dataset using the ``imfusion-ctg`` command: .. code:: bash imfusion-ctg --insertions ./output/merged.insertions.txt \ --expression ./output/merged.exon_counts.txt \ --reference references/GRCm38.76.t2onc.star \ --chromosomes 1 2 3 4 5 6 7 8 9 10 11 12 \ 13 14 15 16 17 18 19 X \ --output ./output/merged.ctgs.txt Here, ``--insertions`` and ``--expression`` should point to the merged insertions and expression files generated by ``imfusion-merge``. The ``--reference`` argument refers to the same reference as used for the alignment, whilst ``--output`` specifies the path where the CTG output should be written. The ``--chromosomes`` argument specifies which chromosomes should be included in the test and is mainly used to omit chromosomes containing the transposon donor loci. The parameters for the CTG test can be changed using the ``--window``, ``--pattern`` and ``--chromosomes`` arguments. The ``--window`` parameter specifies the size of the window around genes within which insertions should be included. The ``--pattern`` argument can be used to account for integration biases of the transposon, if the transposon is known to integrate at specific nucleotide sequences. Finally, the significance thresholds for the CTG and DE tests can be specified using the ``--threshold`` and ``--de_threshold`` arguments: .. code:: bash imfusion-ctg --insertions ./output/merged.insertions.txt \ --expression ./output/merged.exon_counts.txt \ --reference references/GRCm38.76.t2onc.star \ --chromosomes 1 2 3 4 5 6 7 8 9 10 11 12 \ 13 14 15 16 17 18 19 X \ --output ./output/merged.ctgs.txt \ --threshold 0.05 \ --de_threshold 0.05 Optionally, the differential expression test can be skipped by not providing the expression data. In this case, only the CTG test is performed: .. code:: bash imfusion-ctg --insertions ./merged.insertions.txt \ --reference references/GRCm38.76.t2onc.star \ --chromosomes 1 2 3 4 5 6 7 8 9 10 11 12 \ 13 14 15 16 17 18 19 X \ --output ./output/merged.ctgs.txt --threshold 0.05 \