itpseq.parsing.parse

itpseq.parsing.parse(filename, *, a1='GTATAAGGAGGAAAAAAT', a2='GGTATCTCGGTGTGACTG', mm1=2, mm2=2, limit=None, min_seq_len=3, max_seq_len=30, quality=30, start='ATG', contaminants='TCCAACATGCTGAGC|^GATCCTTTTTA', untranslated_overhang=12, **kwargs)

Takes a FASTQ file (‘filename’) as input, iterates over its reads using fastq_iterator, and applies several checks to extract valid ITP sequences. Also computes various statistics on the dataset.

Parameters:

filename – fastq file to process
a1 (str) – Sequence of the left adapter
a2 (str) – Sequence of the right adapter
mm1 (int) – Number of tolerated mismatches in a1
mm2 (int) – Number of tolerated mismatches in a2
limit (int or None) – Maximum number of sequences to process (useful for quick tests)
min_seq_len (int or None) – Minimum length of the matched sequence to keep
max_seq_len (int or None) – Maximum length of the matched sequence to keep
quality (int) – Minimum quality required in the coding sequence to keep a read
contaminants (str) – Regex defining the contaminants; reads matching it are removed
untranslated_overhang (int) – Minimum number of extra nucleotides after the A-site (12 for RNase R with a ribosome). The possible overhangs are x / x+2.
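
Example:

The mismatch, contaminant and quality parameters can be illustrated with a minimal, self-contained sketch. This is not the itpseq implementation: the helper names (find_adapter, min_phred), the substitution-only mismatch model, taking the minimum over the whole quality string, and the Phred+33 quality encoding are all assumptions made for illustration. Only the default values of a1 and contaminants are taken from the signature above.

```python
import re

# Default values copied from the documented signature.
A1 = "GTATAAGGAGGAAAAAAT"
CONTAMINANTS = re.compile("TCCAACATGCTGAGC|^GATCCTTTTTA")


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_adapter(read: str, adapter: str, max_mismatches: int) -> int:
    """Hypothetical helper: first index where `adapter` occurs in `read`
    with at most `max_mismatches` substitutions, or -1 if none is found.
    Insertions and deletions are NOT considered (assumption)."""
    n = len(adapter)
    for i in range(len(read) - n + 1):
        if hamming(read[i:i + n], adapter) <= max_mismatches:
            return i
    return -1


def min_phred(quality_string: str, offset: int = 33) -> int:
    """Hypothetical helper: lowest Phred score in a FASTQ quality string,
    assuming the Phred+33 encoding."""
    return min(ord(c) - offset for c in quality_string)


# Two leading bases, the left adapter with one substitution (G -> C),
# then a short coding sequence starting with ATG.
read = "CC" + "GTATAAGGAGCAAAAAAT" + "ATGAAACGT"

print(find_adapter(read, A1, max_mismatches=2))   # adapter found at index 2
print(find_adapter(read, A1, max_mismatches=0))   # -1: no exact match

# The second alternative of the regex is anchored to the start of the read.
print(bool(CONTAMINANTS.search("GATCCTTTTTAGG")))    # True
print(bool(CONTAMINANTS.search("AAGATCCTTTTTAGG")))  # False

# 'I' encodes Phred 40 and '5' encodes Phred 20 under Phred+33.
print(min_phred("IIII5II") >= 30)  # False: this read would fail quality=30
```

In this sketch, raising max_mismatches makes adapter detection more permissive, which mirrors the role that mm1 and mm2 are documented to play for a1 and a2.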