Identification of protein repeats | Computational Biology and Data Mining Group

We found the first homology of the Huntington's disease protein to other protein sequences [1]. This protein contains a repeat of around 40 amino acids which, at the time, had already been described for the alpha subunit of protein phosphatase 2A. We found and characterised this repeat in a number of eukaryotic cytoplasmic proteins, mainly involved in cytoplasmic transport processes and most of them known to be part of protein complexes.

We developed a method [2] for the identification of short protein repeats (between 20 and 40 amino acids long). These repeats are usually very divergent, and recognising them is difficult even with a good profile of the repeat. We observed that the scores of optimal and sub-optimal non-overlapping alignments of a repeat profile against a large database of randomized sequences follow Extreme Value Distributions (EVDs).

From the analysis of those EVDs we can associate E-values with multiple non-overlapping hits of a repeat profile against a query sequence. We tested the method for eleven repeat families in the whole SwissProt database and in Saccharomyces cerevisiae, Caenorhabditis elegans and Homo sapiens proteins. We could detect new unrecognised repeats and unify some repeat families. The method was available as a web server hosted at EMBL (REP, 2000-2014).
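As a rough illustration of how an E-value can be derived from an EVD, the sketch below fits a Gumbel-type distribution to scores obtained from randomized sequences and converts an observed score into an expected number of chance hits. It is not the procedure of [2]: the method-of-moments fit, the function names, and the numbers used in the example are all assumptions made for illustration.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel(scores):
    """Method-of-moments estimate of the Gumbel location (mu) and scale (beta)."""
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)
    beta = sd * math.sqrt(6) / math.pi   # from Var = (pi^2 / 6) * beta^2
    mu = mean - EULER_GAMMA * beta       # from Mean = mu + gamma * beta
    return mu, beta


def p_value(score, mu, beta):
    """Probability that a random score is >= score under the fitted Gumbel."""
    # 1 - exp(-x), computed with expm1 for accuracy when x is tiny
    return -math.expm1(-math.exp(-(score - mu) / beta))


def e_value(score, mu, beta, n_sequences):
    """Expected number of chance hits with at least this score in n_sequences."""
    return n_sequences * p_value(score, mu, beta)


def sample_gumbel(rng, mu, beta):
    """Draw one Gumbel variate by inverting the CDF (stand-in for real null scores)."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep u strictly inside (0, 1)
    return mu - beta * math.log(-math.log(u))


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical null scores; in practice these would come from aligning
    # the repeat profile against randomized sequences.
    null_scores = [sample_gumbel(rng, 20.0, 3.0) for _ in range(5000)]
    mu, beta = fit_gumbel(null_scores)
    print(f"mu={mu:.2f} beta={beta:.2f}")
    print(f"E-value for score 45: {e_value(45.0, mu, beta, 100000):.3g}")
```

In a real setting, separate distributions would be needed for the optimal hit and for each sub-optimal hit, since the text notes that each of these follows its own EVD.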

The previous work showed the difficulty of classifying ARM and HEAT repeats (which occur in at least 1 in 500 eukaryotic protein sequences). They are similar in sequence and structure but we could not account for both of them with a single profile. We have reviewed these repeats [3] correlating sequence similarity between repeats to functional and structural properties. Several profiles were built that improved their detection. They can be used for scanning protein sequences through the REP server.

We developed a neural-network-based method [ARD] to detect repeats such as HEAT, Armadillo, and PBS, which form similar structures composed of alpha-helices (which we termed alpha-rods) [4]. This method allowed us to detect novel instances of this structure, for example in the human proteins STAG1-3, SERAC1, and PSMD1-2 & 5. Applying the method to human huntingtin and comparing the results with orthologs allowed us to delimit three alpha-rods in huntingtin, whose intra-molecular interactions we characterized experimentally using yeast two-hybrid and co-immunoprecipitation of protein fragments encoding the domains. We updated the method to allow the detection of repeats with an internal linker of variable length ([ARD2]; [5]). Using ARD2 we evaluated novel structures and the phylogenetic distribution of these repeats. The results point to multiple likely events of independent emergence of these repeats in distant taxa and to their increased frequency in organisms of high cellular complexity, such as eukarya in general, and cyanobacteria and planctomycetes within prokarya.

Homorepeats

Homorepeats (or polyX) in protein sequences are stretches of consecutive repetitions of a single amino acid residue. There is no consensus on the minimum number of repeated residues that is relevant. Examining conservation in multiple sequence alignments pinpoints conserved homorepeats. To facilitate the study and detection of homorepeats we created a web server that annotates homorepeats in multiple sequence alignments ([dAPE]; [6]).
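The definition above can be made concrete with a minimal sketch that reports runs of a single residue in one sequence. This is not the dAPE implementation, which works on multiple sequence alignments; the function name and the default `min_length` of 4 are assumptions, and the threshold is left as a parameter precisely because no consensus value exists.

```python
import re


def find_homorepeats(sequence, min_length=4):
    """Return (residue, start, end) for each run of one residue that is at
    least min_length long. Positions are 0-based and end is exclusive."""
    if min_length < 2:
        raise ValueError("min_length must be at least 2")
    # One letter followed by at least (min_length - 1) copies of itself
    pattern = re.compile(r"([A-Z])\1{%d,}" % (min_length - 1))
    return [(m.group(1), m.start(), m.end())
            for m in pattern.finditer(sequence.upper())]


if __name__ == "__main__":
    # Hypothetical toy sequence with a polyQ and a polyA stretch
    for residue, start, end in find_homorepeats("MAQQQQQQKLAAAAP"):
        print(f"poly{residue}: positions {start}-{end - 1}, length {end - start}")
```

Raising `min_length` makes the search stricter; with a value of 5, only the polyQ stretch in the toy sequence would be reported.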

We studied the differential use of homorepeats across taxa to evaluate their evolution and function [7]. Our results suggest that homorepeats have a biological function in the creation or modulation of protein-protein interactions in a context-dependent manner, with a tendency to occur outside protein domains and at the protein termini.

Many homorepeats evolve relatively quickly by being inserted at a location. This is apparent because some orthologs have homorepeats of variable length or lack them entirely. We used this property to analyse the structural context of sites where polyQ was inserted during evolution [8]. We observed that polyQ has a bias to be inserted in disordered regions, with some tendency to occur C-terminal to regions with alpha-helical content. This supports its role as a C-terminal modulator of coiled-coil interactions, which also have alpha-helical structure.

Other protein repeats

We have analysed a large protein family of the Arabidopsis thaliana plant genome [9]. This family contains at least 48 proteins of as yet unknown function. We identified kelch repeats (implicated in protein-protein interactions) and an F-box domain (which targets proteins for degradation). The demonstration of the in vivo interaction of one of the members of the family with ASK1 (homolog of yeast Skp1p, a subunit of the SCF complex, which is involved in the ubiquitination of proteins prior to degradation by the 26S proteasome) via the F-box domain gave some insight into the function of this family.

Protein repeats that form structural repeating units that assemble together are quite common in many protein families and organisms. In an invited review we discuss the analysis of such repeats (including computational characterization) and how we think that repetition in protein sequence relates to evolution and function [10].

We identified a protein domain that appears with variable copy number in genes that are usually in the vicinity of a putative Fe3+ siderophore transporter [11]. We named this new domain NEAT, for NEAr Transporter. Given that this domain seems to be specific to pathogenic bacteria, we suggest that it is a potential target for therapy against disease.

We participated in the characterization of microtubule-associated AIR9, a protein that in plants associates with the microtubules of the cortical cells during preprophase and again when the plant cortex is contacted by the cell plate (a plant-specific cellular structure that forms during cell division) [12]. This protein has homologs in trypanosomatid parasites featuring a region with leucine-rich repeats and a number of protein tandem repeats. We termed these repeats A9, characterized them in plant, trypanosome, and bacterial sequences, and predicted them to adopt an immunoglobulin fold. We discussed the phylogeny of the AIR9 proteins in light of novel sequence evidence, as well as the particular amino acid bias in the plant members of this family [13].

Periostin is a protein of the extracellular matrix. Despite its proven association with bone and heart development and with cancer, its function remains elusive. By sequence and database analyses we characterized the variability of Periostin's C-terminal region in terms of exon count, length, and alternative splicing, and the existence of a 13-amino acid repeat that we predict to form consecutive beta strands [14]. These findings are put in the context of functional and structural predictions.

In some situations, even after resolution of a protein's 3D structure, the definition of protein repeats may be under debate. For example, we clarified the presence of armadillo repeats in p115, a structural component of the Golgi apparatus that facilitates the tethering of transport vesicles inbound from the endoplasmic reticulum to the cis-Golgi membrane, following conflicting interpretations of its structure [15].

We characterized a region of 15 repeats of around 10 amino acids in the human mineralocorticoid receptor (MR) [16]. The MR is part of the renin-angiotensin-aldosterone system (RAAS). This protein has an inhibitory domain of unknown structure. We predict that the repeat region adopts a beta-solenoid structure and propose how this could be involved in phosphorylation-dependent inter- and intra-molecular interactions.

Using sequence similarity analyses, we identified a region of tandem repeats covering the C-terminal two thirds of the TPX2 protein [17]. TPX2, conserved in plants and chordata, is essential for spindle pole formation and controls the nucleation of microtubules on chromosomes during mitosis. Until then, no structural information was available for this protein. Structure predictions suggest that the repeat region forms an alpha-helical solenoid, which is supported by CD spectra indicating high alpha-helical content in Xenopus (frog) and Arabidopsis (plant) TPX2.

RepeatsDB

RepeatsDB is a database of protein tandem repeats annotated from known protein 3D structures. The RepeatsDB 2.0 update includes annotations from more than 5400 structures, 60% of them manually curated [18]. Repeats are classified into five categories according to their length and general arrangement, with subclasses that depend on secondary structure content.

[2] Andrade, M.A., C.P. Ponting, T.J. Gibson and P. Bork. 2000. Homology-based method for identification of protein repeats using statistical significance estimates. J. Mol. Biol. 298, 521-537.

[3] Andrade, M.A., C. Petosa, S.I. O'Donoghue, C.W. Müller and P. Bork. 2001. Comparison of ARM and HEAT repeat proteins. J. Mol. Biol. 309, 1-18.

[4] Palidwor, G.A., S. Shcherbinin, M.R. Huska, T. Rasko, U. Stelzl, A. Arumughan, R. Foulle, P. Porras, L. Sanchez-Pulido, E.E. Wanker, M.A. Andrade-Navarro. 2009. Detection of alpha-rod repeats using a neural network and application to huntingtin. PLoS Comp. Biol. 5, e1000304. [ARD].

[5] Fournier, D., G.A. Palidwor, S. Shcherbinin, A. Szengel, M.H. Schaefer, C. Perez-Iratxeta and M.A. Andrade-Navarro. 2013. Functional and genomic analyses of alpha-solenoid proteins. PLoS One. 8, e79894. [ARD2].

[6] Mier, P. and M.A. Andrade-Navarro. dAPE: a web server to detect homorepeats and follow their evolution. Bioinformatics. In press. [dAPE].

[7] Mier, P., G. Alanis-Lobato and M.A. Andrade-Navarro. 2017. Context characterization of amino acid homorepeats using evolution, position and order. Proteins. In press.

[9] Andrade, M.A., M. González-Guzmán, R. Serrano and P.L. Rodríguez. 2001. A combination of the F-box motif and kelch repeats defines a large Arabidopsis family of F-box proteins. Plant Mol. Biol. 46, 603-614.

[10] Andrade, M.A., C. Perez-Iratxeta, and C.P. Ponting. 2001. Protein repeats: structures, functions and evolution. Journal of Structural Biology. 84, 445-451.

[11] Andrade, M.A., F.D. Ciccarelli, C. Perez-Iratxeta and P. Bork. 2002. NEAT: A domain duplicated in genes near the components of a putative Fe3+ siderophore transporter from Gram-positive pathogenic bacteria. Genome Biology. 3, research0047.1-0047.5.

[12] Buschmann, H., J. Chan, L. Sanchez-Pulido, M.A. Andrade-Navarro, J.H. Doonan and C.W. Lloyd. 2006. Microtubule associated AIR9 recognizes the cortical division site at preprophase and again when the cell plate inserts. Current Biology. 16, 1938-1943.

[13] Buschmann, H., L. Sanchez-Pulido, M.A. Andrade-Navarro and C.W. Lloyd. 2007. Homologues of Arabidopsis microtubule-associated AIR9 in trypanosomatid parasites: hints on evolution and function. Plant Signaling & Behavior. 2, 296-299.

[14] Hoersch, S. and M.A. Andrade-Navarro. 2010. Periostin shows increased evolutionary plasticity in its alternatively spliced region. BMC Evolutionary Biology. 10, 30.

[16] Vlassi, M., K. Brauns and M.A. Andrade-Navarro. 2013. Short tandem repeats in the inhibitory domain of the mineralocorticoid receptor: prediction of a ß-solenoid structure. BMC Structural Biology. 13, 17.

[17] Sanchez-Pulido, L., L.H. Perez, S. Kuhn, I. Vernos and M.A. Andrade-Navarro. 2016. The C-terminal domain of TPX2 is made of alpha-helical tandem repeats. BMC Structural Biology. In press.

[18] Paladin, L., L. Hirsch, D. Piovesan, M.A. Andrade-Navarro, A.V. Kajava and S.C.E. Tosatto. 2016. RepeatsDB 2.0: improved annotation, classification, search and visualization of repeat protein structures. Nucleic Acids Research. In press. [RepeatsDB].