itpseq.parsing.parse

itpseq.parsing.parse(filename, *, a1='GTATAAGGAGGAAAAAAT', a2='GGTATCTCGGTGTGACTG', mm1=2, mm2=2, limit=None, min_seq_len=3, max_seq_len=30, quality=30, start='ATG', contaminants='TCCAACATGCTGAGC|^GATCCTTTTTA', untranslated_overhang=12, **kwargs)

Takes a FASTQ file (filename) as input, loops over its reads using fastq_iterator, and applies several checks to extract valid ITP sequences. Also computes various statistics on the dataset.
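The read loop described above can be pictured with a minimal sketch. This is not itpseq's fastq_iterator: the helper names iter_fastq and min_phred are hypothetical, the quality string is assumed to be Phred+33 encoded, and the quality check is assumed to require every base to reach the threshold.

```python
import io
from itertools import islice


def iter_fastq(handle):
    """Yield (name, sequence, quality string) from a 4-line-per-record FASTQ stream.

    Hypothetical helper for illustration, not itpseq's fastq_iterator.
    """
    while True:
        header = handle.readline().rstrip()
        if not header:  # end of file (or blank line): stop iterating
            return
        seq = handle.readline().rstrip()
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip()
        yield header[1:], seq, qual  # drop the leading '@' from the name


def min_phred(qual, offset=33):
    """Lowest Phred score in a quality string (Phred+33 encoding assumed)."""
    return min(ord(char) - offset for char in qual)


# Three toy reads; 'I' encodes Phred 40 and '#' encodes Phred 2.
FASTQ = (
    "@read1\nATGAAA\n+\nIIIIII\n"
    "@read2\nATGCCC\n+\nII#III\n"
    "@read3\nATGGGG\n+\nIIIIII\n"
)

# Stop after two records, similar in spirit to limit=2.
records = list(islice(iter_fastq(io.StringIO(FASTQ)), 2))

# Keep reads whose worst base meets the threshold, similar in spirit to quality=30.
kept = [name for name, seq, qual in records if min_phred(qual) >= 30]
```

Here only read1 survives: read2 contains one low-quality base, and read3 is never reached because of the two-record limit.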

Parameters:
  • filename – fastq file to process

  • a1 (str) – Sequence of the left adapter

  • a2 (str) – Sequence of the right adapter

  • mm1 (int) – Number of tolerated mismatches in a1

  • mm2 (int) – Number of tolerated mismatches in a2

  • limit (int or None) – Maximum number of sequences to process (useful for quick tests)

  • min_seq_len (int or None) – Minimum length of the matched sequence to keep

  • max_seq_len (int or None) – Maximum length of the matched sequence to keep

  • quality (int) – Minimum quality required in the coding sequence to keep a read

  • contaminants (str) – Regex defining contaminant sequences; reads matching it are removed

  • untranslated_overhang (int) – Minimum number of extra nucleotides after the A-site (for RNase R with a ribosome this is 12). The possible overhangs are x / x+2.
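How a1, a2, mm1, mm2, min_seq_len, max_seq_len and contaminants might interact can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: find_adapter and extract_insert are hypothetical names, mismatches are counted as substitutions only (Hamming distance), and the contaminant regex is assumed to be applied to the whole read before adapter matching.

```python
import re

# Default values taken from the signature above.
A1 = "GTATAAGGAGGAAAAAAT"
A2 = "GGTATCTCGGTGTGACTG"
CONTAMINANTS = re.compile("TCCAACATGCTGAGC|^GATCCTTTTTA")


def find_adapter(read, adapter, max_mismatches, start=0):
    """Index of the first window at or after `start` that differs from
    `adapter` by at most `max_mismatches` substitutions, or -1 if none."""
    size = len(adapter)
    for i in range(start, len(read) - size + 1):
        mismatches = sum(a != b for a, b in zip(read[i:i + size], adapter))
        if mismatches <= max_mismatches:
            return i
    return -1


def extract_insert(read, mm1=2, mm2=2, min_seq_len=3, max_seq_len=30):
    """Return the sequence between the two adapters, or None if the read is rejected."""
    if CONTAMINANTS.search(read):  # assumed: contaminants are checked first
        return None
    left = find_adapter(read, A1, mm1)
    if left < 0:
        return None
    insert_start = left + len(A1)
    right = find_adapter(read, A2, mm2, start=insert_start)
    if right < 0:
        return None
    insert = read[insert_start:right]
    if min_seq_len is not None and len(insert) < min_seq_len:
        return None
    if max_seq_len is not None and len(insert) > max_seq_len:
        return None
    return insert


clean = extract_insert(A1 + "ATGAAACCC" + A2)

# Two substitutions in the left adapter are still tolerated with mm1=2 ...
noisy_read = "CTATAAGGAGGAAAAAAA" + "ATGAAACCC" + A2
noisy = extract_insert(noisy_read)
# ... but not when no mismatch is allowed.
strict = extract_insert(noisy_read, mm1=0)

too_short = extract_insert(A1 + "AT" + A2)
contaminated = extract_insert("GATCCTTTTTA" + A1 + "ATGAAACCC" + A2)
```

In this sketch clean and noisy both yield the insert ATGAAACCC, while strict, too_short and contaminated are None. The quality, start and untranslated_overhang checks are deliberately left out.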