Ph.D. Dissertation Defense - Jiefu Li
New Machine Learning Algorithms for Detecting Biological Signals in Protein Sequences
Machine learning techniques have proved successful in a variety of biological applications. However, almost all of these successes relied on data of both high quality and large quantity, a condition that does not hold in general, especially in ongoing biological research. One such case is plasmodesmata-located proteins (PDLPs). PDLPs are type-I transmembrane proteins that are targeted to intercellular pores called plasmodesmata (PD). PD are membrane-lined intercellular communication channels through which essential nutrients and signaling molecules move between neighboring cells in the plant. This cell-to-cell exchange of molecules through PD is fundamental to the physiology, development, and immunity of the plant. However, no universal or consensus PD-targeting signal has been identified, nor are the molecular details known. At the start of the dissertation research, only 8 PDLPs had been experimentally verified in Arabidopsis thaliana. Biologists believe that, once located in PD, PDLPs play their regulatory role via interaction at the transmembrane domain, although only very limited experimental data are currently available on the details of this interaction. The limited amount of data prevents traditional machine learning techniques from learning enough to make useful predictions.
To tackle these difficulties in PDLP study, biological knowledge and machine learning techniques must be closely combined, and specially designed machine learning algorithms are preferred in order to make the most of both the biological knowledge and the limited data. Therefore, for the problems of PDLP candidate prediction and PD-targeting signal discovery, we designed (1) a hyper-classifier combining a hidden Markov model (HMM) and a support vector machine (SVM) to predict PDLP candidates; (2) a 3-state HMM called PDHMM to predict the PD-targeting region; and (3) to overcome the issue of insufficient data, algorithms for training an HMM with partially labelled data and for training an HMM with both positive and negative examples. Through the work in (1) and (2), several PD-targeting signals in known PDLPs, one newly discovered PDLP, and several PD-associated proteins were successfully verified in wet lab experiments. The computational tools have been deployed on a web server to assist biologists in PDLP study.
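As a rough illustration of how a 3-state HMM can mark a region within a sequence, the sketch below runs Viterbi decoding over a toy model. The state names, the reduced two-letter alphabet (`h` for hydrophobic, `p` for polar), and every probability are assumptions chosen for illustration only; they are not the PDHMM architecture or its trained parameters.

```python
import math

# Hypothetical 3-state layout: before the region, inside it, after it.
STATES = ["pre", "signal", "post"]

# Assumed left-to-right transitions; the values are illustrative, not fitted.
TRANS = {
    "pre":    {"pre": 0.9, "signal": 0.1, "post": 0.0},
    "signal": {"pre": 0.0, "signal": 0.8, "post": 0.2},
    "post":   {"pre": 0.0, "signal": 0.0, "post": 1.0},
}
START = {"pre": 1.0, "signal": 0.0, "post": 0.0}

# Toy emissions over a reduced alphabet: 'h' hydrophobic, 'p' polar.
EMIT = {
    "pre":    {"h": 0.3, "p": 0.7},
    "signal": {"h": 0.9, "p": 0.1},
    "post":   {"h": 0.3, "p": 0.7},
}


def _log(x):
    """Log that maps zero probability to negative infinity."""
    return math.log(x) if x > 0 else float("-inf")


def viterbi(obs):
    """Return the most probable state path for a non-empty observation string."""
    # Scores for the first symbol.
    score = [{s: _log(START[s]) + _log(EMIT[s][obs[0]]) for s in STATES}]
    back = []
    for symbol in obs[1:]:
        row, ptr = {}, {}
        for s in STATES:
            # Best predecessor of state s at this position.
            prev = max(STATES, key=lambda r: score[-1][r] + _log(TRANS[r][s]))
            row[s] = score[-1][prev] + _log(TRANS[prev][s]) + _log(EMIT[s][symbol])
            ptr[s] = prev
        score.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    print(viterbi("pppphhhhhhpppp"))
```

With these toy parameters, the run of `h` symbols in the example input is decoded as the `signal` state, flanked by `pre` and `post`. A real model would use the full amino acid alphabet and parameters estimated from data.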
As for PD regulation, the study of which is still at an early stage, with experimental confirmation only of the interaction between PDLP5 and PEPR2, our research has focused on inter-helix contact prediction, leveraging existing data and prediction methods that are not specific to PD regulation. Specifically, we developed a method to combine the results of different machine learning inter-helix contact predictors. In doing so, we use features extracted from patterns in the 2D contact map and devise a mechanism to mitigate the skew between positive (contact residue pair) and negative (non-contact residue pair) examples inherent in inter-helix data, which presents a special challenge to many prediction methods. Moreover, the space and time complexity of our method grows only linearly with the number of methods combined. Cross-validation results show that our method significantly outperforms each individual method.
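To make the class-skew issue concrete, the sketch below shows one generic way to combine scores from several predictors and to reweight a loss so that rare contact pairs count as much as the many non-contact pairs. It is a minimal illustration under stated assumptions, not the method developed in the dissertation: the function names, the weighted-average combiner, and the inverse-frequency weighting are all hypothetical stand-ins.

```python
import math


def inverse_frequency_weights(labels):
    """Return (w_pos, w_neg) so both classes carry equal total weight."""
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one example of each class")
    return n / (2.0 * n_pos), n / (2.0 * n_neg)


def combine_scores(method_scores, method_weights):
    """Weighted average of K methods' scores for each residue pair.

    method_scores holds K lists, one per method, each with one score per
    residue pair. The cost per pair is O(K), so work grows linearly with
    the number of methods combined.
    """
    total = sum(method_weights)
    n_pairs = len(method_scores[0])
    return [
        sum(w * scores[i] for w, scores in zip(method_weights, method_scores)) / total
        for i in range(n_pairs)
    ]


def weighted_log_loss(labels, probs):
    """Log loss with inverse-frequency class weights (1 = contact, 0 = non-contact)."""
    w_pos, w_neg = inverse_frequency_weights(labels)
    eps = 1e-12  # keep probabilities away from 0 and 1
    loss = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        if y == 1:
            loss -= w_pos * math.log(p)
        else:
            loss -= w_neg * math.log(1.0 - p)
    return loss / len(labels)
```

In this sketch, a data set with one contact among four residue pairs gives the contact a weight of 2.0 and each non-contact a weight of about 0.67, so the two classes contribute equally to the loss.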
Zoom link: https://udel.zoom.us/j/9469780726
Monday, July 19, 2021 at 8:00am
Virtual Event