itpseq.parsing.format_sequences

itpseq.parsing.format_sequences(filename, codons=False, aa=False, repeat_header=False, out=None, limit=None)

Formats a nucleotide inverse toeprint file in a custom human-readable format.

Parameters:

filename (str or list) – Name of the nucleotide inverse toeprint file to use as input. If a list of filenames or a directory is passed, the function is applied to all the nucleotide inverse toeprint files.
codons (bool) – If True, splits the coding sequence into codons.
aa (bool) – If True, displays the translated amino acids below each nucleotide sequence, followed by the length of the peptide.
repeat_header (bool or int) – If given an integer, repeats the header every <repeat_header> reads.
out (str or Path) – If defined, write the output to this file. If None, write to stdout.
limit (int or None) – Limit the number of reads to process.
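
To make the effect of the codons and aa options concrete, here is a minimal standalone sketch of codon splitting and translation with the standard genetic code. It is not the itpseq implementation; the names split_codons, translate, and format_read are hypothetical helpers introduced only for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def split_codons(seq):
    """Split a coding sequence into triplets, dropping any incomplete tail."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def translate(seq):
    """Translate a coding sequence; stop codons are rendered as '*'."""
    return "".join(CODON_TABLE[codon] for codon in split_codons(seq.upper()))


def format_read(coding):
    """Hypothetical helper: return (codon line, amino-acid line) for one read.

    Each amino acid is centered under its codon and the peptide length is
    appended, mimicking the layout of the examples on this page.
    """
    peptide = translate(coding)
    codon_line = " ".join(split_codons(coding))
    aa_line = " " + "   ".join(peptide) + f"  {len(peptide)}"
    return codon_line, aa_line


codon_line, aa_line = format_read("ATGGGACGC")
print(codon_line)
print(aa_line)
```

The two printed lines correspond to the coding part of a read shown with codons=True and aa=True; the lowercase 3' portion and the alignment across reads are left out of this sketch.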

Examples

Display the first 5 inverse toeprints of the “nnn15_noa1.nuc.itp.txt” file:

>>> format_sequences('nnn15_noa1.nuc.itp.txt', limit=5)
#                    [E][P][A]
                     ATGGGACGC cccgcagtatct
      ATGAGTTACAAAGGCAACTCGGAA caggtagcatatc
                     ATGGAAGAG gcccatgccattcc
                        ATGAAT cgaaacatgttt
ATGACTATGTTTCTTGGACACACATAAGGG aactagttaggg
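
In the output above, the uppercase coding sequences are right-aligned so that the codons under the [E][P][A] header line up across reads. A minimal sketch of that alignment follows, assuming each read is already split into its coding part and its lowercase 3' part and that the column width is set by the longest coding sequence; how itpseq actually chooses the width is not stated here, and align_reads is a hypothetical helper.

```python
def align_reads(reads):
    """Right-align the coding parts so their 3' ends share a column.

    `reads` is a list of (coding, downstream) tuples. The width is taken
    from the longest coding sequence, which is an assumption of this sketch.
    """
    width = max(len(coding) for coding, _ in reads)
    return [f"{coding:>{width}} {downstream}" for coding, downstream in reads]


reads = [
    ("ATGGGACGC", "cccgcagtatct"),
    ("ATGAAT", "cgaaacatgttt"),
]
for line in align_reads(reads):
    print(line)
```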

Display the first 10 inverse toeprints of the “nnn15_noa1.nuc.itp.txt” file, group the coding sequence by codons, and display the translation. Repeat the header every 5 reads:

>>> format_sequences('nnn15_noa1.nuc.itp.txt', limit=10,
...                  codons=True, aa=True,
...                  repeat_header=5)
#                           [E] [P] [A]
                            ATG GGA CGC cccgcagtatct
                             M   G   R  3
        ATG AGT TAC AAA GGC AAC TCG GAA caggtagcatatc
         M   S   Y   K   G   N   S   E  8
                            ATG GAA GAG gcccatgccattcc
                             M   E   E  3
                                ATG AAT cgaaacatgttt
                                 M   N  2
ATG ACT ATG TTT CTT GGA CAC ACA TAA GGG aactagttaggg
 M   T   M   F   L   G   H   T   *   G  10

#                           [E] [P] [A]
                            ATG CTA TAA taggtcaagcacca
                             M   L   *  3
                    ATG ACC AAT CCG TAG gactaacgccacat
                     M   T   N   P   *  5
            ATG TAG CCG GGC AAG GAG ATC cgcacctcgcgc
             M   *   P   G   K   E   I  7
                                ATG TAA ctatacgacgtcg
                                 M   *  2
                                ATG TAA acacgccttgtcgt
                                 M   *  2

Export the output to a file:

>>> format_sequences('nnn15_noa1.nuc.itp.txt', out='nnn15_noa1_formatted.txt')