DOCTORAL DEFENSE - JIEFU LI
New Machine Learning Algorithms for Detecting Biological Signals in Protein Sequences
Machine learning techniques have proved successful in a variety of biological applications. However, almost all of these successes relied on data of high quality and in large quantity, conditions that often do not hold, especially in ongoing biological research. One such case is plasmodesmata-located proteins (PDLPs). PDLPs are type-I transmembrane proteins that are targeted to intercellular pores called plasmodesmata (PD). PD are membrane-lined intercellular communication channels through which essential nutrients and signaling molecules move between neighboring cells in the plant. This cell-to-cell exchange of molecules through PD is fundamental to the physiology, development, and immunity of the plant. However, no universal or consensus PD-targeting signal has been identified, nor are the molecular details known. Furthermore, only 8 PDLPs have been experimentally verified in Arabidopsis thaliana. This small amount of data prevents traditional machine learning techniques from learning enough to make useful predictions.
To tackle the difficulties in PDLP study, close integration of biological knowledge with machine learning techniques is necessary, and specially designed machine learning algorithms are preferred in order to make the most of both the biological knowledge and the limited data. For the problems of PDLP candidate prediction and PD-targeting signal discovery, we therefore designed (1) a hyper-classifier combining a hidden Markov model (HMM) and a support vector machine (SVM) to predict PDLP candidates; (2) a 3-state HMM called PDHMM to predict the PD-targeting region; and (3) to overcome the issue of insufficient data, algorithms for training an HMM with partially labelled data and for training an HMM with both positive and negative examples. Through the work in (1) and (2), several PD-targeting signals in known PDLPs, one newly discovered PDLP, and several PD-associated proteins have been successfully verified in wet-lab experiments. Moreover, a web server has been developed to assist biologists in PDLP study.
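As a rough illustration of how a 3-state HMM can mark a region within a sequence, the sketch below runs standard Viterbi decoding over a toy left-to-right model. It is not PDHMM itself: the state names (upstream, target, downstream), the two-letter alphabet, and every probability are assumptions chosen only for this example, since the abstract does not specify the model's states or parameters.

```python
import math

# Hypothetical state layout; the abstract does not describe PDHMM's actual states.
STATES = ["upstream", "target", "downstream"]

# Toy parameters (assumptions for illustration only, not trained values).
START = [1.0, 0.0, 0.0]
TRANS = [
    [0.8, 0.2, 0.0],  # upstream -> upstream / target
    [0.0, 0.8, 0.2],  # target -> target / downstream
    [0.0, 0.0, 1.0],  # downstream -> downstream
]
EMIT = [
    {"A": 0.9, "K": 0.1},  # upstream favors 'A'
    {"A": 0.1, "K": 0.9},  # target region favors 'K'
    {"A": 0.9, "K": 0.1},  # downstream favors 'A'
]


def viterbi(seq, start, trans, emit, floor=1e-6):
    """Return the most probable state path for seq under a discrete HMM."""

    def lg(p):
        # Log-probability, with zero mapped to -inf instead of raising.
        return math.log(p) if p > 0 else float("-inf")

    n = len(start)
    # score[t][j]: best log-probability of any path ending in state j at position t.
    score = [[lg(start[j]) + lg(emit[j].get(seq[0], floor)) for j in range(n)]]
    back = []  # back[t-1][j]: best predecessor of state j at position t
    for ch in seq[1:]:
        prev = score[-1]
        row, ptr = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: prev[i] + lg(trans[i][j]))
            row.append(prev[best_i] + lg(trans[best_i][j]) + lg(emit[j].get(ch, floor)))
            ptr.append(best_i)
        score.append(row)
        back.append(ptr)

    # Trace back from the best final state.
    j = max(range(n), key=lambda k: score[-1][k])
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return [STATES[k] for k in path]


if __name__ == "__main__":
    labels = viterbi("AAAKKKAAA", START, TRANS, EMIT)
    print(list(zip("AAAKKKAAA", labels)))
```

The positions labelled "target" form the predicted region. A real model would use the 20-letter amino-acid alphabet and parameters estimated from data, for example with the partially labelled training described above.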
Our immediate future research is to: (1) develop machine learning based tools that automatically detect a region of special interest in PDLPs, called the JMe region, to strengthen the usefulness of the server; (2) develop a new HMM training algorithm that utilizes both negative examples and partial label information; and (3) from a protein-protein interaction perspective, develop new algorithms or modify successful existing ones to combine knowledge of PDLPs and other well-studied transmembrane proteins in studying PD regulation, helping biologists further understand the PD-targeting mechanism.
Tuesday, December 8, 2020 at 2:30pm
Virtual Event