Joint Multi-family Evolutionary Coupling Analysis for Protein Contact Prediction
Protein contacts carry important information for studying protein structure and function, but contact prediction is very challenging, especially for protein families without many sequence homologs. Evolutionary coupling (EC) analysis, which predicts contacts by analyzing residue co-evolution within a single target family, has recently made good progress thanks to better statistical and optimization techniques. Unlike these single-family EC methods, this paper presents a joint multi-family EC analysis method that predicts the contacts of a target family by jointly modeling residue co-evolution in that family and in (distantly) related families with divergent sequences but similar folds, enforcing consistency among their co-evolution patterns according to their evolutionary distance. To implement this strategy, the paper presents a novel joint graphical lasso method that models a set of related protein families. In particular, we model the related families using a set of correlated multivariate Gaussian distributions, where the inverse covariance matrix (or precision matrix) of each distribution encodes the contact pattern of one family. We then co-estimate the precision matrices by maximizing the likelihood of all the involved sequences, subject to the constraint that the matrices share similar patterns. Finally, we solve this optimization problem using the Alternating Direction Method of Multipliers (ADMM). Experiments show that joint multi-family EC analysis can reveal many more native contacts than single-family analysis, even for a target family with 4000-5000 non-redundant sequence homologs, which makes many more protein families amenable to co-evolution-based structure and function prediction.
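To make the co-estimation idea concrete, the following is a minimal sketch of a generic joint graphical lasso solved with ADMM. It is not the paper's formulation: it assumes a standard sparse group lasso penalty (an elementwise L1 term weighted by `lam1` plus a cross-family group term weighted by `lam2`) in place of the evolutionary-distance-based consistency constraint, gives every family equal weight, and takes precomputed covariance matrices as input. The function name `joint_graphical_lasso` and all parameter names are hypothetical.

```python
import numpy as np

def soft_threshold(a, t):
    """Elementwise soft-thresholding (proximal operator of the L1 norm)."""
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def joint_graphical_lasso(covs, lam1=0.1, lam2=0.1, rho=1.0, n_iter=200):
    """Co-estimate K precision matrices from K covariance matrices.

    Assumed objective (illustrative, not the paper's):
      sum_k [tr(S_k T_k) - logdet T_k]
      + lam1 * sum_k sum_{i != j} |T_k[i, j]|
      + lam2 * sum_{i != j} sqrt(sum_k T_k[i, j]^2)
    """
    S = np.stack(covs)                    # shape (K, p, p)
    K, p, _ = S.shape
    Z = np.stack([np.eye(p)] * K)         # sparse copies of the precision matrices
    U = np.zeros_like(S)                  # scaled dual variables
    off_diag = ~np.eye(p, dtype=bool)     # the penalty applies off the diagonal only

    for _ in range(n_iter):
        # Theta update: closed form via eigendecomposition of rho*(Z - U) - S.
        d, Q = np.linalg.eigh(rho * (Z - U) - S)
        eig = (d + np.sqrt(d ** 2 + 4.0 * rho)) / (2.0 * rho)
        Theta = (Q * eig[:, None, :]) @ np.swapaxes(Q, 1, 2)

        # Z update: elementwise shrinkage, then group shrinkage across families.
        A = Theta + U
        B = soft_threshold(A, lam1 / rho)
        norms = np.sqrt((B ** 2).sum(axis=0))
        shrink = np.maximum(1.0 - (lam2 / rho) / np.maximum(norms, 1e-12), 0.0)
        Z = np.where(off_diag, B * shrink, A)   # diagonal entries are not penalized

        # Dual update.
        U = U + Theta - Z

    return Z
```

In this sketch the group term is what couples the families: an off-diagonal entry is set to zero in all families at once when its combined magnitude across families is small, so the estimated matrices tend to share a sparsity pattern. With both penalties set to zero the iteration reduces to inverting each covariance matrix separately. In a contact-prediction setting the large-magnitude off-diagonal entries of the target family's matrix would be read as candidate contacts, but how the paper scores residue pairs is not stated in the abstract.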