Loading iTP-Seq data in Python#
Automatic loading from a directory#
The easiest way to create a DataSet is to name the files consistently
(see Naming conventions).
The parsing step creates four files for each input FASTQ file:
- the inverse-toeprint sequences as nucleotides (<file_prefix>.nuc.itp.txt)
- the inverse-toeprint sequences as amino acids (<file_prefix>.aa.itp.txt)
- metadata as JSON (<file_prefix>.itp.json)
- a log file (<file_prefix>.itp.log)
All the files share the same prefix and the JSON files are used to identify the replicates.
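Before loading a directory, it can help to confirm that every prefix has its complete set of output files. The sketch below is not part of itpseq: the helper names `find_prefixes` and `missing_outputs` are hypothetical, and the suffixes are taken from the example directory listing on this page.

```python
# Suffixes of the four files the parsing step is expected to produce
# (assumption: taken from the example directory listing on this page).
EXPECTED_SUFFIXES = (".nuc.itp.txt", ".aa.itp.txt", ".itp.json", ".itp.log")


def find_prefixes(file_names):
    """Return the sorted prefixes of all '*.itp.json' metadata files."""
    suffix = ".itp.json"
    return sorted(n[: -len(suffix)] for n in file_names if n.endswith(suffix))


def missing_outputs(file_names, file_prefix):
    """Return the expected output files that are absent for one prefix."""
    present = set(file_names)
    return [
        file_prefix + suffix
        for suffix in EXPECTED_SUFFIXES
        if file_prefix + suffix not in present
    ]


# Example with a hard-coded listing; os.listdir('.') would give the real one.
names = [
    "nnn15_noa1.aa.itp.txt",
    "nnn15_noa1.assembled.fastq",
    "nnn15_noa1.itp.json",
    "nnn15_noa1.itp.log",
    "nnn15_noa1.nuc.itp.txt",
]
for prefix in find_prefixes(names):
    print(prefix, missing_outputs(names, prefix))
```

An empty list for a prefix means that all four files are present.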
Default behavior#
By default, DataSet expects a prefix with the XXX_YYYDD format. XXX
(alphanumeric) is assigned to the “lib_type” key, YYY (letters) to the “sample”
key, and DD (digits) to the “replicate”. For example, nnn15_noa1 is parsed as
lib_type nnn15, sample noa, and replicate 1.
Therefore a directory containing 3 “noa” and 3 “tcx” replicates would look like:
nnn15_noa1.aa.itp.txt nnn15_noa3.aa.itp.txt nnn15_tcx2.aa.itp.txt
nnn15_noa1.assembled.fastq nnn15_noa3.assembled.fastq nnn15_tcx2.assembled.fastq
nnn15_noa1.itp.json nnn15_noa3.itp.json nnn15_tcx2.itp.json
nnn15_noa1.itp.log nnn15_noa3.itp.log nnn15_tcx2.itp.log
nnn15_noa1.nuc.itp.txt nnn15_noa3.nuc.itp.txt nnn15_tcx2.nuc.itp.txt
nnn15_noa2.aa.itp.txt nnn15_tcx1.aa.itp.txt nnn15_tcx3.aa.itp.txt
nnn15_noa2.assembled.fastq nnn15_tcx1.assembled.fastq nnn15_tcx3.assembled.fastq
nnn15_noa2.itp.json nnn15_tcx1.itp.json nnn15_tcx3.itp.json
nnn15_noa2.itp.log nnn15_tcx1.itp.log nnn15_tcx3.itp.log
nnn15_noa2.nuc.itp.txt nnn15_tcx1.nuc.itp.txt nnn15_tcx3.nuc.itp.txt
Loading this directory automatically assigns the six Replicates to two Samples (“noa” and “tcx”, three replicates each). In addition, a sample named “noa” is automatically used as the reference for the other samples that share the same keys (other than “sample”):
In [1]: from itpseq import DataSet
In [2]: data = DataSet('.') # current directory containing the data files
In [3]: data
Out[3]:
DataSet(data_path='.',
file_pattern='(?P<lib_type>[^_]+)_(?P<sample>[^_\\d]+)(?P<replicate>\\d+)',
samples=[Sample(nnn15.noa:[1, 2, 3]),
Sample(nnn15.tcx:[1, 2, 3], ref: nnn15.noa)],
)
In [4]: data.samples
Out[4]:
{'nnn15.noa': Sample(nnn15.noa:[1, 2, 3]),
'nnn15.tcx': Sample(nnn15.tcx:[1, 2, 3], ref: nnn15.noa)}
In [5]: data.replicates
Out[5]:
{'nnn15.noa.1': Replicate(nnn15.noa.1),
'nnn15.noa.2': Replicate(nnn15.noa.2),
'nnn15.noa.3': Replicate(nnn15.noa.3),
'nnn15.tcx.1': Replicate(nnn15.tcx.1),
'nnn15.tcx.2': Replicate(nnn15.tcx.2),
'nnn15.tcx.3': Replicate(nnn15.tcx.3)}
This detection relies on the default regular expression file_pattern:
(?P<lib_type>[^_]+)_(?P<sample>[^_\d]+)(?P<replicate>\d+).
The lib_type and sample keys are automatically used to group the
Replicates into a Sample and to create the Sample name.
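To see what the default pattern captures, it can be tested on a prefix with the standard `re` module. This is only an illustration: the helper `parse_prefix` is hypothetical, and the use of `re.match` is an assumption, since itpseq may apply the pattern differently.

```python
import re

# Default file_pattern, copied from the DataSet output above.
DEFAULT_PATTERN = r"(?P<lib_type>[^_]+)_(?P<sample>[^_\d]+)(?P<replicate>\d+)"


def parse_prefix(prefix, pattern=DEFAULT_PATTERN):
    """Return the named groups captured from a file prefix, or None."""
    match = re.match(pattern, prefix)
    return match.groupdict() if match else None


print(parse_prefix("nnn15_noa1"))
# {'lib_type': 'nnn15', 'sample': 'noa', 'replicate': '1'}

# A prefix without an underscore does not fit the default format.
print(parse_prefix("noa1"))
# None
```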
It is possible to specify the keys to use to group the Replicates:
In [6]: DataSet('.', keys=['sample']) # ignoring "lib_type"
Out[6]:
DataSet(data_path='.',
file_pattern='(?P<lib_type>[^_]+)_(?P<sample>[^_\\d]+)(?P<replicate>\\d+)',
samples=[Sample(noa:[1, 2, 3]),
Sample(tcx:[1, 2, 3], ref: noa)],
)
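The effect of `keys` can be pictured as grouping the parsed prefixes by the selected labels. The following is a conceptual sketch in plain Python, not the itpseq implementation; the function `group_replicates` and the dot-joined naming are assumptions modeled on the sample names printed above.

```python
import re
from collections import defaultdict

PATTERN = r"(?P<lib_type>[^_]+)_(?P<sample>[^_\d]+)(?P<replicate>\d+)"


def group_replicates(prefixes, keys=("lib_type", "sample"), pattern=PATTERN):
    """Group replicate numbers under a name built from the chosen keys."""
    groups = defaultdict(list)
    for prefix in sorted(prefixes):
        match = re.match(pattern, prefix)
        if match is None:
            continue  # skip prefixes that do not fit the pattern
        labels = match.groupdict()
        # Join the selected label values with dots, skipping missing ones.
        name = ".".join(labels[k] for k in keys if labels.get(k) is not None)
        groups[name].append(int(labels["replicate"]))
    return dict(groups)


prefixes = [f"nnn15_{s}{i}" for s in ("noa", "tcx") for i in (1, 2, 3)]
print(group_replicates(prefixes))
# {'nnn15.noa': [1, 2, 3], 'nnn15.tcx': [1, 2, 3]}
print(group_replicates(prefixes, keys=("sample",)))
# {'noa': [1, 2, 3], 'tcx': [1, 2, 3]}
```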
Custom prefix and keys#
Let’s imagine a dataset with two drugs (drugA and drugB), one control (noa) and a few different concentrations for the drugs (10, 20, 30µM):
drugA1_10µM.itp.json drugA3_20µM.itp.json drugB2_30µM.itp.json
drugA1_20µM.itp.json drugA3_30µM.itp.json drugB3_10µM.itp.json
drugA1_30µM.itp.json drugB1_10µM.itp.json drugB3_20µM.itp.json
drugA2_10µM.itp.json drugB1_20µM.itp.json drugB3_30µM.itp.json
drugA2_20µM.itp.json drugB1_30µM.itp.json noa1.itp.json
drugA2_30µM.itp.json drugB2_10µM.itp.json noa2.itp.json
drugA3_10µM.itp.json drugB2_20µM.itp.json noa3.itp.json
As explained above, the different parts of the filename can be defined by passing a regular expression to file_pattern.
Each (?P<name>...) group stores the captured value under the given name, which can later be used to group the samples.
Here we want to capture the drug name, the replicate number, and the concentration (if any).
We can therefore define file_pattern as:
(?P<sample>[^_\d]+)(?P<replicate>\d+)(_(?P<concentration>\d+µM))?
- (?P<sample>[^_\d]+): matches the sample name (anything but _ or digits)
- (?P<replicate>\d+): matches digits defining the replicate number
- (_(?P<concentration>\d+µM))?: optionally matches _ followed by a concentration
More information on the syntax of regular expressions can be found in the Python documentation.
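A quick way to check a custom pattern before passing it to DataSet is to try it on a few prefixes with the standard `re` module. This sketch only exercises the regular expression itself; how itpseq applies it internally is not shown here.

```python
import re

# Custom pattern from this section: sample, replicate, optional concentration.
PATTERN = r"(?P<sample>[^_\d]+)(?P<replicate>\d+)(_(?P<concentration>\d+µM))?"

# Map each test prefix to the named groups it yields.
labels = {
    prefix: re.match(PATTERN, prefix).groupdict()
    for prefix in ("drugA1_10µM", "noa1")
}

for prefix, captured in labels.items():
    print(prefix, captured)
# drugA1_10µM {'sample': 'drugA', 'replicate': '1', 'concentration': '10µM'}
# noa1 {'sample': 'noa', 'replicate': '1', 'concentration': None}
```

The optional group yields `None` for the control, which has no concentration in its file name.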
In [7]: from itpseq import DataSet
In [8]: data = DataSet('.', file_pattern=r'(?P<sample>[^_\d]+)(?P<replicate>\d+)(_(?P<concentration>\d+µM))?')
In [9]: data
Out[9]:
DataSet(data_path='.',
file_pattern='(?P<sample>[^_\\d]+)(?P<replicate>\\d+)(_(?P<concentration>\\d+µM))?',
samples=[Sample(drugA.10µM:[1, 2, 3], ref: noa),
Sample(drugA.20µM:[1, 2, 3], ref: noa),
Sample(drugA.30µM:[1, 2, 3], ref: noa),
Sample(drugB.10µM:[1, 2, 3], ref: noa),
Sample(drugB.20µM:[1, 2, 3], ref: noa),
Sample(drugB.30µM:[1, 2, 3], ref: noa),
Sample(noa:[1, 2, 3])],
)
In [10]: data.samples
Out[10]:
{'drugA.10µM': Sample(drugA.10µM:[1, 2, 3], ref: noa),
'drugA.20µM': Sample(drugA.20µM:[1, 2, 3], ref: noa),
'drugA.30µM': Sample(drugA.30µM:[1, 2, 3], ref: noa),
'drugB.10µM': Sample(drugB.10µM:[1, 2, 3], ref: noa),
'drugB.20µM': Sample(drugB.20µM:[1, 2, 3], ref: noa),
'drugB.30µM': Sample(drugB.30µM:[1, 2, 3], ref: noa),
'noa': Sample(noa:[1, 2, 3])}
Labels are automatically assigned to each sample:
In [11]: data['drugA.10µM'].labels
Out[11]: {'concentration': '10µM', 'sample': 'drugA'}
In [12]: data['noa'].labels
Out[12]: {'concentration': None, 'sample': 'noa'}
It is also possible to define which labels identify the reference.
For instance, using ref_labels={'sample': 'drugA'} defines drugA as the
reference for the samples that share the same values for the other keys
(here, the concentration).
In [13]: data = DataSet('.',
....: file_pattern=r'(?P<sample>[^_]+)(?P<replicate>\d+)(_(?P<concentration>\d+µM))?',
....: ref_labels={'sample': 'drugA'},
....: )
....:
In [14]: data
Out[14]:
DataSet(data_path='.',
file_pattern='(?P<sample>[^_]+)(?P<replicate>\\d+)(_(?P<concentration>\\d+µM))?',
samples=[Sample(drugA.10µM:[1, 2, 3]),
Sample(drugA.20µM:[1, 2, 3]),
Sample(drugA.30µM:[1, 2, 3]),
Sample(drugB.10µM:[1, 2, 3], ref: drugA.10µM),
Sample(drugB.20µM:[1, 2, 3], ref: drugA.20µM),
Sample(drugB.30µM:[1, 2, 3], ref: drugA.30µM),
Sample(noa:[1, 2, 3])],
)
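One matching rule that is consistent with both outputs shown on this page is sketched below. It is an assumption, not the itpseq implementation: a sample is a reference if it matches `ref_labels`, and it serves another sample when their remaining labels agree, with a missing (`None`) label on the reference treated as a wildcard. The function `assign_references` is hypothetical.

```python
def assign_references(samples, ref_labels):
    """Map each sample name to a reference name (or None).

    samples: dict of sample name -> labels dict.
    ref_labels: labels that identify reference samples.
    """

    def is_reference(labels):
        return all(labels.get(k) == v for k, v in ref_labels.items())

    def compatible(labels, ref):
        # Compare only keys not fixed by ref_labels; a None label on the
        # reference side acts as a wildcard (assumption).
        return all(
            ref.get(k) is None or ref.get(k) == v
            for k, v in labels.items()
            if k not in ref_labels
        )

    refs = {name: lab for name, lab in samples.items() if is_reference(lab)}
    result = {}
    for name, labels in samples.items():
        if name in refs:
            result[name] = None  # a reference has no reference of its own
        else:
            result[name] = next(
                (r for r, rl in refs.items() if compatible(labels, rl)), None
            )
    return result


samples = {
    "drugA.10µM": {"sample": "drugA", "concentration": "10µM"},
    "drugA.20µM": {"sample": "drugA", "concentration": "20µM"},
    "drugB.10µM": {"sample": "drugB", "concentration": "10µM"},
    "drugB.20µM": {"sample": "drugB", "concentration": "20µM"},
    "noa": {"sample": "noa", "concentration": None},
}
print(assign_references(samples, {"sample": "drugA"}))
print(assign_references(samples, {"sample": "noa"}))
```

With `{'sample': 'drugA'}`, each drugB sample is paired with the drugA sample of the same concentration and noa is left without a reference. With `{'sample': 'noa'}`, every drug sample is paired with noa.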
Manual loading#
From Sample/Replicate objects#
It is also possible to create Replicate, Sample, and
DataSet objects manually.
In [15]: from itpseq import DataSet, Sample, Replicate
In [16]: R1 = Replicate(replicate='1', file_prefix='nnn15_tcx1') # relative to current directory
In [17]: R2 = Replicate(replicate='2', file_prefix='nnn15_tcx2')
In [18]: R3 = Replicate(replicate='3', file_prefix='nnn15_tcx3')
In [19]: N1 = Replicate(replicate='1', file_prefix='nnn15_noa1')
In [20]: N2 = Replicate(replicate='2', file_prefix='nnn15_noa2')
In [21]: N3 = Replicate(replicate='3', file_prefix='nnn15_noa3')
In [22]: S = Sample(replicates=[R1, R2, R3],
....: name='tcx',
....: reference=Sample(replicates=[N1, N2, N3], name='noa'),
....: )
....:
In [23]: S
Out[23]: Sample(tcx:[1, 2, 3], ref: noa)
From a dictionary#
From a dictionary of samples/replicates:
In [24]: data = DataSet({'tcx': [{'file_prefix': 'nnn15_tcx1'},
....: {'file_prefix': 'nnn15_tcx2'},
....: {'file_prefix': 'nnn15_tcx3'}
....: ],
....: 'noa': [{'file_prefix': 'nnn15_noa1'},
....: {'file_prefix': 'nnn15_noa2', 'replicate': 'custom_name'},
....: {'file_prefix': 'nnn15_noa3'}
....: ]},
....: ref_mapping={'tcx': 'noa'})
....:
Creating temporary cache directory: "/tmp/tmpggon_87s"
In [25]: data
Out[25]:
DataSet(samples=[Sample(tcx:[rep1, rep2, rep3], ref: noa),
Sample(noa:[rep1, custom_name, rep3])],
)
From config files (e.g. JSON)#
From a JSON file (e.g. samples.json):
{
"tcx":[
{"file_prefix":"nnn15_tcx1"},
{"file_prefix":"nnn15_tcx2"},
{"file_prefix":"nnn15_tcx3"}
],
"noa":[
{"file_prefix":"nnn15_noa1"},
{"file_prefix":"nnn15_noa2","replicate":"custom_name"},
{"file_prefix":"nnn15_noa3"}
]
}
In [26]: import json
In [27]: from itpseq import DataSet
In [28]: with open('samples.json') as f:
....: data = DataSet(json.load(f))
....:
Creating temporary cache directory: "/tmp/tmp_u58thfv"
In [29]: data
Out[29]:
DataSet(samples=[Sample(tcx:[rep1, rep2, rep3]),
Sample(noa:[rep1, custom_name, rep3])],
)
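A configuration like samples.json can also be generated from Python rather than written by hand. The sketch below uses only the standard `json` module; the helper `build_config` is hypothetical and produces the same structure as the file shown above, without the optional "replicate" entry.

```python
import json


def build_config(prefixes_by_sample):
    """Build the samples dictionary expected in the JSON file."""
    return {
        sample: [{"file_prefix": prefix} for prefix in prefixes]
        for sample, prefixes in prefixes_by_sample.items()
    }


config = build_config(
    {
        "tcx": ["nnn15_tcx1", "nnn15_tcx2", "nnn15_tcx3"],
        "noa": ["nnn15_noa1", "nnn15_noa2", "nnn15_noa3"],
    }
)

# Serialize to JSON text; writing it to 'samples.json' is left to the caller.
text = json.dumps(config, indent=2)
print(text)
```

Note that the output above shows no reference for tcx, because the JSON file only describes the samples. Since `json.load` returns a plain dictionary, passing `ref_mapping={'tcx': 'noa'}` as in the dictionary example should presumably restore the reference.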