This study was initiated in an attempt to address these shortcomings by developing a more powerful predictor for identifying DNA recombination spots. The proposed predictor is called iRSpot-EL, where ‘i’ stands for ‘identify’, ‘RSpot’ for ‘recombination spot’ and ‘EL’ for ‘ensemble learning’.
Developing a new predictor usually serves two purposes. One is to stimulate theoretical studies in the relevant areas, and the other is to make it easier for experimental scientists to obtain the information they need. To achieve both, the rest of this article is organized according to the following five guidelines (Chou, 2011): (i) benchmark dataset, (ii) sample representation, (iii) operation algorithm, (iv) validation, and (v) web-server.
2 Materials and methods
2.1 Benchmark dataset
A reliable and stringent benchmark is pivotal to the development of an accurate prediction method. In the literature, a benchmark dataset usually consists of a training dataset and a testing dataset: the former is used to train a proposed model, and the latter to test it. As pointed out in a comprehensive review (Chou and Shen, 2007b), however, there is no need to separate a benchmark dataset into a training dataset and a testing dataset for validating a prediction method if it is tested by the jackknife or subsampling (K-fold) cross-validation, because the outcome thus obtained is actually a combination of many different independent dataset tests. In this study, to facilitate comparison of the proposed predictor with the existing ones, we adopted the widely used benchmark dataset (Chen et al., 2013; Jiang et al., 2007; Liu et al., 2012; Qiu et al., 2014), which can be formulated as

$$S = S^{+} \cup S^{-} \qquad (1)$$

where $S$ is the benchmark dataset, $S^{+}$ the positive subset containing 490 DNA segments (hotspot samples) with relative hybridization ratios (Gerton et al., 2000) higher than 1.5 (Jiang et al., 2007), $S^{-}$ the negative subset containing 591 DNA segments (coldspot samples) with relative hybridization ratios (Gerton et al., 2000) lower than 0.82 (Jiang et al., 2007), and the symbol $\cup$ denotes the union in set theory. To reduce redundancy and homology bias, the CD-HIT software (Li et al., 2001) was used to remove sequences whose similarity is >75%. Finally, 478 hotspots (positive samples) and 572 coldspots (negative samples) were obtained. For readers’ convenience, the 478 hotspot samples and 572 coldspot samples as well as their detailed sequences are given in Supplementary Materials S1.
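The point that K-fold cross-validation tests every sample without a fixed train/test split can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `k_fold_indices`, the seed and the choice of k = 5 are assumptions made only for illustration, and the labels stand in for the actual sequences in Supplementary Materials S1.

```python
import random

# Sample counts reported in the text after CD-HIT filtering.
N_HOT, N_COLD = 478, 572

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once.

    Hypothetical helper for illustration only. Setting k == n_samples
    gives the jackknife (leave-one-out) test.
    """
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]  # k disjoint, near-equal folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

# Labels only: 1 = hotspot, 0 = coldspot.
labels = [1] * N_HOT + [0] * N_COLD

splits = list(k_fold_indices(len(labels), k=5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Because the test folds are disjoint and together cover the whole benchmark, the pooled result is a combination of k independent dataset tests on S.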
2.2 Pseudo k-tuple nucleotide composition
With the avalanche of biological sequences emerging in the post-genomic age, one of the most challenging problems in computational biology is how to formulate a biological sequence as a vector while still retaining its key pattern or characteristics. This is because nearly all the existing machine-learning algorithms were developed to handle vectors rather than sequence samples, as elaborated in a recent review (Chou, 2015). Unfortunately, a vector defined in a discrete model may completely lose all the sequence-order information or sequence pattern characteristics. To overcome this problem for protein/peptide sequences, the pseudo amino acid composition (PseAAC) (Chou, 2001) was introduced, and it has become an important tool (Cao et al., 2013; Du et al., 2012, 2014) widely used in nearly all the areas of computational proteomics [see a long list of references cited in Chou (2011)]. Encouraged by the successes of PseAAC, the pseudo k-tuple nucleotide composition (PseKNC) (Chen et al., 2014, 2015b; Liu et al., 2015a, 2016b) was introduced to formulate DNA/RNA sequences, and it has been increasingly used in computational genetics and genomics [see, e.g. a recent review (Chen et al., 2015a) and the long list of references cited therein]. Recently, a web-server called ‘Pse-in-One’ was developed for generating various modes of pseudo components for DNA/RNA and protein/peptide sequences (Liu et al., 2015b).
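To make the loss of sequence-order information concrete, the following Python sketch computes a plain k-tuple nucleotide composition vector. This is only the discrete model discussed above, not PseKNC itself, whose additional components are not defined in this passage. The function name `ktuple_composition` and the two toy sequences are assumptions chosen for illustration.

```python
from itertools import product

def ktuple_composition(seq, k):
    """Return normalized k-tuple frequencies in lexicographic ACGT order.

    Hypothetical helper for illustration: the plain discrete model only.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows containing ambiguous bases
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[m] / total if total else 0.0 for m in kmers]

# Two toy sequences with the same base counts but a different order.
seq_a, seq_b = "AACCGGTT", "TGCATGCA"

mono_a = ktuple_composition(seq_a, 1)
mono_b = ktuple_composition(seq_b, 1)
di_a = ktuple_composition(seq_a, 2)
di_b = ktuple_composition(seq_b, 2)

print(mono_a == mono_b)  # the 4-D vectors cannot tell the sequences apart
print(di_a == di_b)      # the 16-D vectors can, but only for local order
```

Raising k captures more local order at the cost of a vector of dimension 4^k, which is one reason for seeking a representation that also encodes longer-range order.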