Loading iTP-Seq data in Python#

Automatic loading from a directory#

The easiest way to create a DataSet is to name the files consistently (see Naming conventions).

The parsing step creates four files for each input FASTQ file:

the inverse-toeprint sequences as nucleotides (<file_prefix>.nuc.itp.txt)
the inverse-toeprint sequences as amino-acids (<file_prefix>.aa.itp.txt)
metadata as JSON (<file_prefix>.itp.json)
a log file (<file_prefix>.itp.log)

All the files share the same prefix and the JSON files are used to identify the replicates.
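As a quick sanity check before loading, the sketch below reports which of the four parser outputs are missing for a given prefix. The helper name `missing_outputs` is hypothetical and not part of itpseq; the suffixes are those seen on the parser output files (`.nuc.itp.txt`, `.aa.itp.txt`, `.itp.json`, `.itp.log`).

```python
from pathlib import Path

# Suffixes of the four files written by the parsing step for each prefix.
EXPECTED_SUFFIXES = ('.nuc.itp.txt', '.aa.itp.txt', '.itp.json', '.itp.log')


def missing_outputs(directory, file_prefix):
    """Return the expected output files that are absent from `directory`.

    Hypothetical helper for illustration; it is not part of itpseq.
    """
    directory = Path(directory)
    return [file_prefix + suffix
            for suffix in EXPECTED_SUFFIXES
            if not (directory / (file_prefix + suffix)).exists()]
```

Calling `missing_outputs('.', 'nnn15_noa1')` returns an empty list when all four files for that prefix are present in the current directory.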

Default behavior#

By default, DataSet expects a prefix in the XXX_YYYDD format. XXX (any characters except _) is assigned to the “lib_type” key, YYY (any characters except _ and digits) to the “sample” key, and DD (digits) to the “replicate” key. For example, nnn15_noa1 gives lib_type nnn15, sample noa, and replicate 1.

Therefore a directory containing 3 “noa” and 3 “tcx” replicates would look like:

nnn15_noa1.aa.itp.txt       nnn15_noa3.aa.itp.txt       nnn15_tcx2.aa.itp.txt
nnn15_noa1.assembled.fastq  nnn15_noa3.assembled.fastq  nnn15_tcx2.assembled.fastq
nnn15_noa1.itp.json         nnn15_noa3.itp.json         nnn15_tcx2.itp.json
nnn15_noa1.itp.log          nnn15_noa3.itp.log          nnn15_tcx2.itp.log
nnn15_noa1.nuc.itp.txt      nnn15_noa3.nuc.itp.txt      nnn15_tcx2.nuc.itp.txt
nnn15_noa2.aa.itp.txt       nnn15_tcx1.aa.itp.txt       nnn15_tcx3.aa.itp.txt
nnn15_noa2.assembled.fastq  nnn15_tcx1.assembled.fastq  nnn15_tcx3.assembled.fastq
nnn15_noa2.itp.json         nnn15_tcx1.itp.json         nnn15_tcx3.itp.json
nnn15_noa2.itp.log          nnn15_tcx1.itp.log          nnn15_tcx3.itp.log
nnn15_noa2.nuc.itp.txt      nnn15_tcx1.nuc.itp.txt      nnn15_tcx3.nuc.itp.txt

Loading this directory automatically assigns the 6 Replicates to 2 Samples (“noa” and “tcx”, 3 Replicates each). In addition, if a sample is named “noa”, it is automatically assigned as the reference for the other samples that share the same keys (other than “sample”):

In [1]: from itpseq import DataSet

In [2]: data = DataSet('.') # current directory containing the data files

In [3]: data
Out[3]: 
DataSet(data_path='.',
        file_pattern='(?P<lib_type>[^_]+)_(?P<sample>[^_\\d]+)(?P<replicate>\\d+)',
        samples=[Sample(nnn15.noa:[1, 2, 3]),
                 Sample(nnn15.tcx:[1, 2, 3], ref: nnn15.noa)],
        )

In [4]: data.samples
Out[4]: 
{'nnn15.noa': Sample(nnn15.noa:[1, 2, 3]),
 'nnn15.tcx': Sample(nnn15.tcx:[1, 2, 3], ref: nnn15.noa)}

In [5]: data.replicates
Out[5]: 
{'nnn15.noa.1': Replicate(nnn15.noa.1),
 'nnn15.noa.2': Replicate(nnn15.noa.2),
 'nnn15.noa.3': Replicate(nnn15.noa.3),
 'nnn15.tcx.1': Replicate(nnn15.tcx.1),
 'nnn15.tcx.2': Replicate(nnn15.tcx.2),
 'nnn15.tcx.3': Replicate(nnn15.tcx.3)}

This detection is due to the default regular expression file_pattern: (?P<lib_type>[^_]+)_(?P<sample>[^_\d]+)(?P<replicate>\d+).
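The effect of this pattern on a single prefix can be checked with Python's standard `re` module. This is only an illustration of the regular expression itself; whether itpseq applies it with `re.match` or another matching function is an assumption.

```python
import re

# Default file_pattern, copied from the DataSet representation.
DEFAULT_PATTERN = r'(?P<lib_type>[^_]+)_(?P<sample>[^_\d]+)(?P<replicate>\d+)'

# Each named group becomes a key in the resulting dictionary.
match = re.match(DEFAULT_PATTERN, 'nnn15_noa1')
labels = match.groupdict()
print(labels)
# {'lib_type': 'nnn15', 'sample': 'noa', 'replicate': '1'}
```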

The lib_type and sample keys are automatically used to group the Replicates into a Sample and to create the Sample name.

It is possible to specify the keys to use to group the Replicates:

In [6]: DataSet('.', keys=['sample'])  # ignoring "lib_type"
Out[6]: 
DataSet(data_path='.',
        file_pattern='(?P<lib_type>[^_]+)_(?P<sample>[^_\\d]+)(?P<replicate>\\d+)',
        samples=[Sample(noa:[1, 2, 3]),
                 Sample(tcx:[1, 2, 3], ref: noa)],
        )
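The grouping behaviour can be pictured with the following minimal sketch, which does not use itpseq. It assumes that a sample name is built by joining the values of the selected keys with a dot, as the names in the outputs suggest; `group_replicates` is a hypothetical function, not the library's implementation.

```python
import re
from collections import defaultdict

PATTERN = re.compile(r'(?P<lib_type>[^_]+)_(?P<sample>[^_\d]+)(?P<replicate>\d+)')


def group_replicates(prefixes, keys=('lib_type', 'sample')):
    """Group file prefixes into samples named by joining the chosen keys with '.'."""
    samples = defaultdict(list)
    for prefix in prefixes:
        m = PATTERN.match(prefix)
        if m is None:
            continue  # ignore prefixes that do not follow the naming convention
        labels = m.groupdict()
        name = '.'.join(labels[k] for k in keys)
        samples[name].append(int(labels['replicate']))
    return dict(samples)


prefixes = [f'nnn15_{s}{i}' for s in ('noa', 'tcx') for i in (1, 2, 3)]

print(group_replicates(prefixes))
# {'nnn15.noa': [1, 2, 3], 'nnn15.tcx': [1, 2, 3]}

print(group_replicates(prefixes, keys=['sample']))
# {'noa': [1, 2, 3], 'tcx': [1, 2, 3]}
```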

Custom prefix and keys#

Let’s imagine a dataset with two drugs (drugA and drugB), one control (noa) and a few different concentrations for the drugs (10, 20, 30µM):

drugA1_10µM.itp.json  drugA3_20µM.itp.json  drugB2_30µM.itp.json
drugA1_20µM.itp.json  drugA3_30µM.itp.json  drugB3_10µM.itp.json
drugA1_30µM.itp.json  drugB1_10µM.itp.json  drugB3_20µM.itp.json
drugA2_10µM.itp.json  drugB1_20µM.itp.json  drugB3_30µM.itp.json
drugA2_20µM.itp.json  drugB1_30µM.itp.json  noa1.itp.json
drugA2_30µM.itp.json  drugB2_10µM.itp.json  noa2.itp.json
drugA3_10µM.itp.json  drugB2_20µM.itp.json  noa3.itp.json

As explained above, the different parts of the filename can be defined by passing a regular expression to file_pattern.

Each (?P<name>...) group binds the captured value to the given name, which can later be used to group the samples. Here we want to capture the drug name, the replicate number, and the concentration (if any).

We can therefore define file_pattern as:

(?P<sample>[^_\d]+)(?P<replicate>\d+)(_(?P<concentration>\d+µM))?

(?P<sample>[^_\d]+): matches the sample name (anything but _ or digits)
(?P<replicate>\d+): matches digits defining the replicate number
(_(?P<concentration>\d+µM))?: optionally matches _ followed by a concentration

More information on the syntax of regular expressions can be found in the Python documentation.
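Before passing a custom pattern to DataSet, it can be useful to try it on a few prefixes with the standard `re` module. The sketch below only tests the regular expression; the use of `re.match` here is an assumption and may differ from how itpseq applies the pattern internally.

```python
import re

PATTERN = re.compile(
    r'(?P<sample>[^_\d]+)(?P<replicate>\d+)(_(?P<concentration>\d+µM))?'
)

# One prefix with a concentration and one without.
results = {prefix: PATTERN.match(prefix).groupdict()
           for prefix in ('drugA1_10µM', 'noa1')}

for prefix, labels in results.items():
    print(prefix, labels)
# drugA1_10µM {'sample': 'drugA', 'replicate': '1', 'concentration': '10µM'}
# noa1 {'sample': 'noa', 'replicate': '1', 'concentration': None}
```

The optional group leaves `concentration` set to `None` when the prefix carries no concentration.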

In [7]: from itpseq import DataSet

In [8]: data = DataSet('.', file_pattern=r'(?P<sample>[^_\d]+)(?P<replicate>\d+)(_(?P<concentration>\d+µM))?')

In [9]: data
Out[9]: 
DataSet(data_path='.',
        file_pattern='(?P<sample>[^_\\d]+)(?P<replicate>\\d+)(_(?P<concentration>\\d+µM))?',
        samples=[Sample(drugA.10µM:[1, 2, 3], ref: noa),
                 Sample(drugA.20µM:[1, 2, 3], ref: noa),
                 Sample(drugA.30µM:[1, 2, 3], ref: noa),
                 Sample(drugB.10µM:[1, 2, 3], ref: noa),
                 Sample(drugB.20µM:[1, 2, 3], ref: noa),
                 Sample(drugB.30µM:[1, 2, 3], ref: noa),
                 Sample(noa:[1, 2, 3])],
        )

In [10]: data.samples
Out[10]: 
{'drugA.10µM': Sample(drugA.10µM:[1, 2, 3], ref: noa),
 'drugA.20µM': Sample(drugA.20µM:[1, 2, 3], ref: noa),
 'drugA.30µM': Sample(drugA.30µM:[1, 2, 3], ref: noa),
 'drugB.10µM': Sample(drugB.10µM:[1, 2, 3], ref: noa),
 'drugB.20µM': Sample(drugB.20µM:[1, 2, 3], ref: noa),
 'drugB.30µM': Sample(drugB.30µM:[1, 2, 3], ref: noa),
 'noa': Sample(noa:[1, 2, 3])}

Labels are automatically assigned to each sample:

In [11]: data['drugA.10µM'].labels
Out[11]: {'concentration': '10µM', 'sample': 'drugA'}

In [12]: data['noa'].labels
Out[12]: {'concentration': None, 'sample': 'noa'}

It is also possible to define the labels that will be used to assign the reference. For instance, using ref_labels={'sample': 'drugA'} defines drugA as the reference for the samples whose other labels are identical (here, the same concentration).

In [13]: data = DataSet('.',
   ....:                file_pattern=r'(?P<sample>[^_]+)(?P<replicate>\d+)(_(?P<concentration>\d+µM))?',
   ....:                ref_labels={'sample': 'drugA'},
   ....:                )
   ....: 

In [14]: data
Out[14]: 
DataSet(data_path='.',
        file_pattern='(?P<sample>[^_]+)(?P<replicate>\\d+)(_(?P<concentration>\\d+µM))?',
        samples=[Sample(drugA.10µM:[1, 2, 3]),
                 Sample(drugA.20µM:[1, 2, 3]),
                 Sample(drugA.30µM:[1, 2, 3]),
                 Sample(drugB.10µM:[1, 2, 3], ref: drugA.10µM),
                 Sample(drugB.20µM:[1, 2, 3], ref: drugA.20µM),
                 Sample(drugB.30µM:[1, 2, 3], ref: drugA.30µM),
                 Sample(noa:[1, 2, 3])],
        )
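The matching rule visible in this output can be sketched in plain Python. This is an illustration under stated assumptions, not itpseq's implementation: `find_reference` is a hypothetical function, and it assumes a reference must carry the `ref_labels` values and share every other label with the sample.

```python
def find_reference(labels, all_samples, ref_labels):
    """Return the name of the reference sample for `labels`, or None.

    Illustration only: a candidate matches when it has the ref_labels
    values and is identical to the query sample on all other labels.
    """
    # A sample that is itself a reference gets no reference.
    if all(labels.get(k) == v for k, v in ref_labels.items()):
        return None
    wanted = {**labels, **ref_labels}
    for name, other in all_samples.items():
        if other == wanted:
            return name
    return None


samples = {
    'drugA.10µM': {'sample': 'drugA', 'concentration': '10µM'},
    'drugB.10µM': {'sample': 'drugB', 'concentration': '10µM'},
    'noa': {'sample': 'noa', 'concentration': None},
}

refs = {name: find_reference(labels, samples, {'sample': 'drugA'})
        for name, labels in samples.items()}
print(refs)
# {'drugA.10µM': None, 'drugB.10µM': 'drugA.10µM', 'noa': None}
```

Under this rule, noa gets no reference because no drugA sample has a concentration of `None`.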

Manual loading#

From `Sample`/`Replicate` objects#

It is also possible to create Replicate, Sample, and DataSet objects manually.

In [15]: from itpseq import DataSet, Sample, Replicate

In [16]: R1 = Replicate(replicate='1', file_prefix='nnn15_tcx1') # relative to current directory

In [17]: R2 = Replicate(replicate='2', file_prefix='nnn15_tcx2')

In [18]: R3 = Replicate(replicate='3', file_prefix='nnn15_tcx3')

In [19]: N1 = Replicate(replicate='1', file_prefix='nnn15_noa1')

In [20]: N2 = Replicate(replicate='2', file_prefix='nnn15_noa2')

In [21]: N3 = Replicate(replicate='3', file_prefix='nnn15_noa3')

In [22]: S = Sample(replicates=[R1, R2, R3],
   ....:            name='tcx',
   ....:            reference=Sample(replicates=[N1, N2, N3], name='noa'),
   ....:           )
   ....: 

In [23]: S
Out[23]: Sample(tcx:[1, 2, 3], ref: noa)

From a dictionary#

From a dictionary of samples/replicates:

In [24]: data = DataSet({'tcx': [{'file_prefix': 'nnn15_tcx1'},
   ....:                         {'file_prefix': 'nnn15_tcx2'},
   ....:                         {'file_prefix': 'nnn15_tcx3'}
   ....:                        ],
   ....:                 'noa': [{'file_prefix': 'nnn15_noa1'},
   ....:                         {'file_prefix': 'nnn15_noa2', 'replicate': 'custom_name'},
   ....:                         {'file_prefix': 'nnn15_noa3'}
   ....:                        ]},
   ....:                ref_mapping={'tcx': 'noa'})
   ....: 
Creating temporary cache directory: "/tmp/tmpggon_87s"

In [25]: data
Out[25]: 
DataSet(samples=[Sample(tcx:[rep1, rep2, rep3], ref: noa),
                 Sample(noa:[rep1, custom_name, rep3])],
        )

From config files (e.g. JSON)#

From a JSON file (e.g. samples.json):

{
 "tcx":[
  {"file_prefix":"nnn15_tcx1"},
  {"file_prefix":"nnn15_tcx2"},
  {"file_prefix":"nnn15_tcx3"}
 ],
 "noa":[
  {"file_prefix":"nnn15_noa1"},
  {"file_prefix":"nnn15_noa2","replicate":"custom_name"},
  {"file_prefix":"nnn15_noa3"}
 ]
}

In [26]: import json

In [27]: from itpseq import DataSet

In [28]: with open('samples.json') as f:
   ....:     data = DataSet(json.load(f))
   ....: 
Creating temporary cache directory: "/tmp/tmp_u58thfv"

In [29]: data
Out[29]: 
DataSet(samples=[Sample(tcx:[rep1, rep2, rep3]),
                 Sample(noa:[rep1, custom_name, rep3])],
        )
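When there are many files, the configuration can be generated rather than typed by hand. The sketch below builds the same layout as samples.json from a mapping of sample names to file prefixes; `build_config` is a hypothetical helper, not part of itpseq.

```python
import json


def build_config(prefixes_by_sample):
    """Turn {sample: [file_prefix, ...]} into {sample: [{'file_prefix': ...}, ...]}."""
    return {sample: [{'file_prefix': prefix} for prefix in prefixes]
            for sample, prefixes in prefixes_by_sample.items()}


config = build_config({
    'tcx': ['nnn15_tcx1', 'nnn15_tcx2', 'nnn15_tcx3'],
    'noa': ['nnn15_noa1', 'nnn15_noa2', 'nnn15_noa3'],
})

# Serialize to JSON text; write it to samples.json to reuse it later.
text = json.dumps(config, indent=1)
print(text)
```

A custom replicate name can still be added afterwards, for example `config['noa'][1]['replicate'] = 'custom_name'`, before serializing.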