Notably, the classification of some biological niches is especially suited to feature representation. For example, routine annotation tools fail to confidently assign function to bioactive peptides and short proteins (Naamati et al.). A number of previous studies focus on feature extraction from whole protein sequences (Cao et al.). Specialized predictors have been presented for structural tasks including secondary structure, solvent accessibility, stability, disordered regions, domains and more (Cai et al.).
ML approaches have proven suitable for classifying protein properties beyond their 3D structure. Naive classification based on biophysical features outperformed simple sequence-based methods for a number of protein families (Varshavsky et al.).
However, the most likely advantage of the feature- and pattern-based ML approach lies in high-level functionality (e.g. Pe'er et al.). Examples of such predictions include protein-protein interactions (Bock and Gough; Cheng and Baldi), discriminating outer membrane proteins (Gromiha and Suwa), membrane topology (Nugent and Jones), subcellular localization (Hua and Sun) and more.
The strongest features learned by ML classifiers often expose biologically important motifs (Leslie et al.). In this study, we focus on the ability of elementary biophysical features, together with a rich set of engineered representations of proteins, to classify high-level protein functions. These features are suited for both supervised and unsupervised classification. We present a universal, modular workflow for protein function classification: (i) feature generation and extraction from primary sequences (ProFET).
In gathering the protein sets for this study, we used (i) custom sets gathered from public databases such as UniProtKB (Wu et al.).
As a rule, we used only classes that contain a minimal number of samples per group (typically 40, after redundancy removal). Sequences with unknown amino acids (AA), errors, or length shorter than 30 AA were removed. We included in the analysis the most recent SCOP classification (release 2). LocTree3 benchmark (Yachdav et al.). Mammalian subcellular localization: protein-organelle pairs were acquired from SWP. Uncultured bacterium.
Sequences were extracted from UniProtKB and mapped to keyword annotations for major cellular compartments (membrane, cytoplasm, ribosome). SCOPe release 2. The classes that were not included had a small number of folds each. DNA-binding proteins: benchmark dataset from DNAbinder (Kumar et al.).
RNA-binding proteins: benchmark dataset from BindN (Wang et al.). Virus-host pairs: acquired from SWP; the set includes all viral proteins partitioned by the kingdom of the host. Capsids: a compilation of two sets of all viral capsid proteins annotated by SWP: (i) classes according to host type. All features extracted by ProFET are derived directly from the protein sequence and do not require external input (Saeys et al.).
Properties relying on external predictors are thus not included. ProFET can also generate a predefined set of default features for consistency in evaluation and ease of use, callable from the command line. The features described below can be restricted to a segment of a protein. We support two versions of subsequence analysis: (i) relative portions and (ii) fixed lengths.
Activating global feature extraction together with segmental analysis is advantageous; it is motivated by the atypical composition of different segments in numerous protein classes.
Instability index: an estimate of the stability of a protein in vitro (Gasteiger et al.). Aliphatic index: the relative volume occupied by aliphatic side chains (Ala, Val, Ile and Leu) (Gasteiger et al.). Most of these properties are based on the Expasy proteomics collection (Gasteiger et al.). The importance of these elementary global features has been validated previously (Varshavsky et al.). For example, the lysine-arginine dipeptide (KR) is grouped together with its mirror, RK. Reduced AA alphabets.
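For concreteness, the aliphatic index mentioned above has a simple closed form (Ikai's weighted composition formula, as used by Expasy's ProtParam); the following is a minimal sketch, not ProFET's actual implementation:

```python
def aliphatic_index(seq: str) -> float:
    """Relative volume occupied by aliphatic side chains (Ikai).

    AI = X(Ala) + 2.9 * X(Val) + 3.9 * (X(Ile) + X(Leu)),
    where X(aa) is the mole percent of each residue.
    """
    n = len(seq)
    x = {aa: 100.0 * seq.count(aa) / n for aa in "AVIL"}
    return x["A"] + 2.9 * x["V"] + 3.9 * (x["I"] + x["L"])

# Hypothetical short peptide: 30% Ala, 20% Val, 10% Ile, 10% Leu.
print(round(aliphatic_index("MAVILKKAVA"), 1))  # 166.0
```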
Grouping of AAs secures a compact representation. We include a large number of such alphabets from various sources (Murphy et al.).
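Mapping a sequence through a reduced alphabet is essentially a translation step. The six-group alphabet below is a hypothetical example chosen for illustration, not necessarily one of the alphabets shipped with ProFET:

```python
# Hypothetical 6-letter alphabet grouping residues by gross physicochemistry.
GROUPS = {
    "H": "AVLIMC",  # hydrophobic
    "R": "FWY",     # aromatic
    "P": "STNQ",    # polar, uncharged
    "+": "KRH",     # positively charged
    "-": "DE",      # negatively charged
    "G": "GP",      # conformationally special
}
TO_GROUP = {aa: g for g, members in GROUPS.items() for aa in members}

def reduce_alphabet(seq: str) -> str:
    # Unknown residues (e.g. X) are passed through as "X".
    return "".join(TO_GROUP.get(aa, "X") for aa in seq)

print(reduce_alphabet("MKTAYIAK"))  # H+PHRHH+
```

Composition and k-mer features computed on the 6-letter string are far fewer and less sparse than on the 20-letter original.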
The remaining AAs are kept in the uncompressed representation. Potential post-translational modification (PTM) sites. Others include N-glycosylation and Asp or Asn hydroxylation sites. We included a cysteine spacer motif that captures the tendency of Cys to appear within a minimal window (Naamati et al.).
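For instance, potential N-glycosylation sites follow the classic Asn-X-Ser/Thr sequon (X being any residue except Pro) and can be counted with a regular expression. This sketches the idea; ProFET's exact motif definitions may differ:

```python
import re

def count_nglyc_sites(seq: str) -> int:
    """Count potential N-glycosylation sequons: N, any non-Pro residue, then S or T."""
    # A lookahead is used so that overlapping sequons are all counted.
    return len(re.findall(r"(?=N[^P][ST])", seq))

print(count_nglyc_sites("MNASLLNPTKNVT"))  # 2 (NAS and NVT match; NPT is excluded)
```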
Potential disorder (FoldIndex). Local regions of disorder are predicted using the naive FoldIndex (Prilusky et al.), which predicts disorder as a function of the hydrophobic potential and net charge.
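The underlying relation is the charge-hydropathy boundary; FoldIndex is commonly written as FI = 2.785·⟨H⟩ − |⟨R⟩| − 1.151, where ⟨H⟩ is the mean Kyte-Doolittle hydropathy rescaled to [0, 1] and ⟨R⟩ is the mean net charge, with negative values suggesting disorder. A whole-sequence sketch under these assumptions (the real predictor slides a window):

```python
# Kyte-Doolittle hydropathy scale.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}  # His treated as neutral in this sketch

def fold_index(seq: str) -> float:
    n = len(seq)
    mean_h = sum((KD[aa] + 4.5) / 9.0 for aa in seq) / n  # rescale [-4.5, 4.5] -> [0, 1]
    mean_r = abs(sum(CHARGE.get(aa, 0) for aa in seq)) / n
    return 2.785 * mean_h - mean_r - 1.151

print(fold_index("EEEEKKKKSSSS") < 0)  # low hydropathy -> predicted disordered
print(fold_index("ILVFILVFILVF") > 0)  # strongly hydrophobic -> predicted ordered
```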
These features aim to capture the non-random distribution of each AA along the sequence, based on the concept of information entropy. Autocorrelation with selected letters: the lag is then computed (for details, see Ofer and Linial). AA propensity scales map each AA to a quantitative value representing a physicochemical or biochemical property, such as hydropathicity or size. These scales can then be used to represent the protein sequence as a time series, typically over sliding windows of different sizes, from which additional features are extracted.
Maximum and minimum values for a given scale and window size along the entire sequence. We implemented the Dubchak and ProFEAT CTD features (hydrophobicity, normalized van der Waals volume, polarity, polarizability, charge, secondary structure and solvent accessibility) (Dubchak et al.). Code from SPiCE (van den Berg et al.). An additional subdivision of disorder propensity was adapted from Composition Profiler (Vacic et al.).
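A window-extrema feature of this kind can be sketched in a few lines, here using the Kyte-Doolittle hydropathy scale as the example scale; the window size is a free parameter, and the values below are illustrative:

```python
# Kyte-Doolittle hydropathy scale.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def window_extrema(seq: str, scale: dict, w: int):
    """Max and min of the mean scale value over all windows of size w."""
    vals = [scale[aa] for aa in seq]
    means = [sum(vals[i:i + w]) / w for i in range(len(vals) - w + 1)]
    return max(means), min(means)

hi, lo = window_extrema("MKTILVVVAADDDEEK", KD, w=5)
print(round(hi, 2), round(lo, 2))  # 4.18 -3.58
```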
The power of each proposed predictor is tested by several routinely used evaluation methods. We measure performance for the binary and multiclass tasks with the same metrics: the F1 score (the harmonic mean of precision and recall) and accuracy (Acc).
TP represents the number of correctly recognized proteins; FP, the number of proteins wrongly identified; and FN, the number of proteins missed. Performance is evaluated using cross-validation.
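With TN (correctly rejected negatives) added for accuracy, these metrics reduce to a few lines; a sketch of the standard definitions with illustrative counts:

```python
def f1_and_accuracy(tp: int, fp: int, fn: int, tn: int):
    precision = tp / (tp + fp)  # fraction of predicted positives that are correct
    recall = tp / (tp + fn)     # fraction of true positives that were recovered
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    acc = (tp + tn) / (tp + fp + fn + tn)
    return f1, acc

f1, acc = f1_and_accuracy(tp=80, fp=10, fn=20, tn=90)
print(round(f1, 3), acc)  # 0.842 0.85
```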
This pre-filtering step at the cross-validation phase had a negligible impact on overall performance (not shown). A wide array of supervised and unsupervised feature selection methods can be applied to identify the best features, implemented with the Scikit-learn toolkit (Abraham et al.). These include wrapper methods such as Recursive Feature Elimination (Ozcift) and model-based filtering.
In the test cases, we used the RFE method combined with an underlying non-linear ensemble of classifiers (random forests). The underlying principle is iterative fitting of the classifier on the data, with the weakest features being pruned at each iteration (Abraham et al.).
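In Scikit-learn this combination is only a few lines. The dataset below is synthetic stand-in data, and the parameter choices are illustrative rather than those used in the paper:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in for a ProFET feature matrix.
X, y = make_classification(n_samples=200, n_features=40, n_informative=8,
                           random_state=0)
# RFE refits the forest repeatedly, dropping the 5 weakest features per round
# (ranked by feature_importances_) until 15 remain.
selector = RFE(RandomForestClassifier(n_estimators=50, random_state=0),
               n_features_to_select=15, step=5)
selector.fit(X, y)
print(int(selector.support_.sum()))  # 15 features retained
```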
We examined the selected features and the model's classification performance with the reduced feature set, and show novel, interpretable features as well as excellent retained performance.
We introduce two test cases to illustrate the potential of ProFET as a generic platform for analyzing the basis of high-level functionality in proteins. Classifying thermophile proteins serves as a test case for binary classification of a functionality that is not explicitly derived from the sequence.
Classifying neuropeptide (NP) hormone precursors serves to assess classification in a poorly studied protein niche (Karsenty et al.). We generalize the approach to a range of tasks, from subcellular localization to viral phylogeny (see Section 2).
In all the illustrated cases, ProFET was used as a generic framework for feature extraction and prediction. External information that is often available was not used. The workflow is composed of modular sections (Fig.).
ProFET: feature extraction from any protein sequence. Extracted features can be analyzed independently (suitable for ML analysis or unsupervised tasks) or discriminatively. Model Selection: the features are used to train and tune different ML models under any given performance metric. Performance Report: classification performance is measured for a given model and dataset using cross-validation. Feature Selection: informative features are selected and their importance measured using different methods.
These methods include statistical significance tests, wrapper methods, model-based selection, stability selection and more. New sequences can be predicted using a trained ML model, either via the full feature extraction pipeline or with a smaller subset of the selected features. The ProFET framework merges machine-learning protocols, cross-validated tuning, feature selection and prediction.
Set 1: Thermophiles are proteins that function under high temperature. Given the extreme environmental conditions, we expect to detect biophysical signatures in these proteins underlying their thermostability. Set 2: NPs are secreted proteins. Routine sequence alignment-based methods are insufficient to identify the immensely diverse NPs. In compiling a dataset, we used as a negative set a collection of proteins with signal peptides that lack a validated transmembrane domain (TMD) and are therefore most likely secreted.
We keep the same atypical range of lengths to match the labeled NPPs. In both the positive and negative datasets, signal peptides were confirmed and cleaved using SignalP (Petersen et al.). The final dataset held the negatives and the NPPs. For all three sets (as in Section 3), classification was performed using a random forest classifier implemented in Scikit-learn (see Section 2).
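A minimal version of this evaluation setup, with synthetic stand-in data in place of the real ProFET feature matrix:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced binary problem standing in for NPPs vs. negatives.
X, y = make_classification(n_samples=300, n_features=25, n_informative=10,
                           weights=[0.7, 0.3], random_state=0)
# Stratified folds preserve the class ratio in every train/test split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0),
                         X, y, cv=cv, scoring="f1")
print(len(scores), scores.mean())
```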
Figure 2A shows the classification results for the Thermophilic protein and NPP sets as confusion matrices, derived from stratified k-fold cross-validation. Figure 2B shows the performance as receiver operating characteristic (ROC) curves, measured using an automatically tuned SVM with a radial basis function (RBF) kernel under stratified k-fold cross-validation.
The performance was very high, with an FP rate of 0. Performance results for the two datasets used: (A) confusion matrix of the classifier performance, derived from stratified k-fold cross-validation; (B) AUC (area under the ROC curve). Uncultured bacteria comprise a set of poorly characterized proteins (Set 3). The localization performance for the multiclass task is very convincing (tested via 12 rounds of stratified shuffle-split cross-validation). The F1 score is 0.
The most significant E-value was used for each sequence to build an approximate distance matrix. We then trained a K-nearest-neighbors classifier and recorded the performance.
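Scikit-learn supports this kind of setup directly via a precomputed distance metric. A sketch with a random symmetric matrix standing in for the E-value-derived distances:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
raw = rng.random((30, 30))
dist = (raw + raw.T) / 2.0   # symmetric mock "distances"
np.fill_diagonal(dist, 0.0)  # zero self-distance
labels = np.arange(30) % 3   # three mock classes

knn = KNeighborsClassifier(n_neighbors=3, metric="precomputed")
knn.fit(dist, labels)
# At predict time, each row holds distances from a query to the training samples.
pred = knn.predict(dist)
print(pred.shape)  # (30,)
```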
Clustering performance was significantly lower than reported (Fig.). The best results for the Psi-Blast test were obtained with a spectral clustering model.
For the NPP set (total of proteins), the F1 score is 0. The F1 scores for the hold-out sets were 0. Beyond the success of the predictors, the interpretability of the features that contributed most to performance is crucial knowledge.
Several feature selection methods can be applied to identify a minimal set of such features. We applied random forests (an ensemble of decision tree classifiers) combined with the Recursive Feature Elimination (RFE) wrapper method: at each iteration, the weakest features are removed and the model is retrained with the remaining features, until the preselected number of features remains. Performance of the reduced feature set is measured using new splits of the training data and cross-validation.
Recall that the initial set of default generated features included features. The F-test filter reduced the number of features for the Thermophile and NPP sets, respectively. We note the importance of AA composition, particularly of the charged and polar AA groups. Of further importance are features involving glutamic acid (E) and glutamine (Q), and the organizational entropy of E and Q.
We note that merely using the AA composition would not have captured many of these features. With just 15 features, classification performance (F1 score of the positive class) was largely retained. Figure 3 shows the types of the 15 strongest features for the two test cases, ranked by relative importance to the classifier.
Feature titles are self-explanatory. Top 15 informative features dominating the successful classification of thermophilic proteins and NPPs. The workflow was applied to our test cases (Section 3). Each set was measured using randomized stratified k-fold cross-validation.
Altogether, we present 15 additional datasets beyond the NPPs and Thermophilic proteins. The Dummy predictor is a default classifier predicting the largest class in the dataset (rightmost, coloured pink). The classification performance for DNA- and RNA-binding proteins meets the state-of-the-art results obtained by special-purpose predictors (Wang et al.). We used the same benchmark data to assess performance directly.
We show in the figure that excellent performance is achieved using the default settings of the ProFET workflow. Classification success varies according to the task. These sets differ only in the degree of redundancy removal; we found similar levels of accuracy for both. The performance (accuracy, F1 score) for all 17 analyzed datasets, with respect to the Dummy-majority classifier, is shown in Supplementary Table S1. The main drawbacks of existing sequence-based methods are: (i) some functions cannot be detected by sequence-based methods; (ii) current statistical models mostly capture local patterns rather than high-level function; and (iii) rare sequences, or those with very few homologs, cannot support inference or the construction of a good statistical model.
In this study, we introduce ProFET, a feature extraction platform that can serve many classification tasks. ProFET was built as a flexible tool for protein sequences of any size. Our platform adds to previous studies that use quantitative feature representations for sequences.
The commonality of these methods is the transformation step in which protein sequences are converted to hundreds or thousands of features, many of them elementary biochemical and biophysical properties, while others are statistically derived. ProFET includes many novel additions to this elementary representation: for example, features based on reduced alphabets, entropy, high-performance AA scales, binary autocorrelation, sequence segmentation, mirror k-mers and more.
Many of these features not only improve performance while allowing a compact representation, but also expose statistically important properties of proteins (Fig.). The advantage of using reduced alphabets has been noted for 3D-structure representation (Bacardit et al.). ProFET's results were the input for ML approaches, allowing a rigorous assessment of performance, and reach state-of-the-art results. Recovering the classification success with a small set of top features argues for the power of a compact representation in understanding the features that dominate any specific task.
Several conclusions can be drawn from the results of the classification tasks (Fig.). Protein-centric analysis: the feature engineering methods presented in this study should be considered a baseline approach for whole proteins rather than protein domains. Most of our knowledge from 3D structure and evolution relies on the properties of domains within proteins. We propose feature engineering as a complementary approach to the domain-centric one.
This is in contrast to methods that customize features for a specific task. The ProFET pipeline provides a default set of features suitable for many classification tasks; ProFET therefore eliminates the need to duplicate the effort of feature extraction.
Flexibility of use: our pipeline accepts a single sequence, combined files, multiple files or a directory. It automatically labels the input into classes and normalizes the features, if desired. Thus, any user can use ProFET to set the desired combination of features, representations and normalization. From the user's point of view, several considerations were taken:
We use state-of-the-art, open-source, freely available Python data science tools such as pandas, scikit-learn and Biopython (Cock et al.).
Our framework records details of the features as part of the data pipeline, so results are interpretable. Our code is available for academic and non-commercial use, under the GNU 3 license. We provide a large collated resource for feature extraction. Thanks to the modular design of ProFET, adding and tinkering with features is trivial: users can focus on, remove or expand any subset of the features.
ProFET allows tuning of any number of parameters in the feature generation pipeline. In summary, the approach presented here is suitable and powerful for modern ML applications, especially in the emerging fields of deep learning and unsupervised learning of feature representations.
These features can easily be experimented with, allowing further application of biological insight to the task of feature engineering. We thank Michael Doron for extensive collaboration, aid and programming expertise in setting up the framework. Nadav Rappoport supported the Psi-Blast comparisons.
Abraham A. et al.
Atchley W. et al. Proc. Natl Acad. Sci. USA.
Bacardit J. et al. BMC Bioinformatics, 10, 6.
Bock J. and Gough D. Bioinformatics, 17.
Cai Y. et al. BMC Bioinformatics, 2, 3.
Campen A. et al. Protein Pept.
Cao D. et al. Bioinformatics, 29.