ProLego Protein as lego blocks ...

Protein structure space is enormous 1,2. Most linear polymers of amino acids intrinsically form a dynamic and flexible 3D structure, which poses one of the conundrums of the present day 3,5. Mapping the variation in structure space will provide insight into the organisation and evolvability of protein fold space 1,4,5.

The local arrangements, or secondary structure elements (SSEs), provide a regular, well-defined pattern in the protein chain 9,3. The arrangement of these patterns has been addressed at the architecture level (in CATH) 7 and at the fold level (in SCOP) 8 in existing classifications. However, these definitions are influenced by evolutionary and sequence information, which is biased by the current limits of the knowledge space 10.

Here, with ProLego, we provide a simple and intuitive way to study protein structure space, using the core definition of "topology" as applied to protein structures 1,3. Focusing on secondary structures (alpha helices and strands), we have cataloged the variation in protein structure topology across the current structure space.

Definition of Topology 1,3

Here, topology is defined as the arrangement of secondary structures, their spatial contacts and their relative orientation. This definition helps to address the crucial aspects of local and non-local contacts and their relative positions in the context of the 3D structure.

Browsers [Search in ProLego DB]

  • Topology Browser

    As defined in the section above, the topology of each protein chain has been extracted [ref. alpha paper]. Each topology depends on the composition of secondary structures (the sequential arrangement of SSEs from the N- to the C-terminal) and their total count. The Topology Browser provides a tabular representation of the different SSE combinations and their distinct topologies.

    Entries can be searched by class of structure (topClass: Alpha (A), Beta (B) or Alpha-Beta (AB)), number of helices, number of strands, or composition of secondary structure elements (SSEs).

    [Total topologies : 1292]

  • Protein Browser

    A pre-calculated set of non-redundant curated datasets from PDB [different release dates and different sequence homology cutoffs] can be accessed in the Protein Browser. The Protein Browser shows a tabulated view (paginated at 30 entries) of the complete protein data in ProLego DB.

    Entries can be searched with a PDB Id or chain Id. The resulting table shows the entries that match the queried Id. The table includes information on the molecule, with a link to the RCSB-PDB database for cross-reference, and the biological source of the protein.

    [Total protein (non-redundant data @ 60% sequence homology cluster) : 31034]

  • Domain Browser

    Domain sets from CATH (v. 4.1: CATH_4.1) and SCOP (v. 1.75: ASTRAL_30) have been pre-calculated for exploring topologies. The browser provides an interface for searching by domain Id, structure class (i.e. CATH or SCOP structure classification string), molecule in the domain and its biological source.

    [Total Domains (CATH + SCOP) @ 30% sequence homology cluster : 15255]
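
The Topology Browser search fields described above (topClass, number of helices, number of strands) can all be derived from an SSE composition. The following is a minimal sketch, not the ProLego implementation: it assumes the composition is written as a string of "H" (helix) and "E" (strand) letters from N- to C-terminal, and the function name and returned keys are hypothetical.

```python
def summarize_composition(sse):
    """Derive browser-style search fields from an SSE composition string.

    Assumption: `sse` uses one letter per element, "H" for a helix and
    "E" for a strand, ordered from N- to C-terminal (e.g. "HEEH").
    """
    sse = sse.upper()
    unknown = set(sse) - {"H", "E"}
    if not sse or unknown:
        raise ValueError("expected a non-empty string of H and E letters")

    n_helix = sse.count("H")
    n_strand = sse.count("E")

    # Class labels follow the topClass codes named in the text: A, B, AB.
    if n_strand == 0:
        top_class = "A"
    elif n_helix == 0:
        top_class = "B"
    else:
        top_class = "AB"

    return {
        "topClass": top_class,
        "helices": n_helix,
        "strands": n_strand,
        "sse_count": len(sse),
    }


if __name__ == "__main__":
    for composition in ("HHHH", "EEE", "HEEH"):
        print(composition, summarize_composition(composition))
```

Keeping the derived fields in one dictionary mirrors the idea that a single composition determines every field the browser can be searched by.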

Generation of Topology [Runs proLego codes on the fly]

  • By PDB Id

    Given a four-character Protein Data Bank identifier (PDB Id), the search engine retrieves the PDB molecule. The same page gives a brief description of the molecules in the PDB file and prompts for selection of a chain. Once the required conditions are satisfied, the topology generation process starts. When the topology has been built, a result page with all the information appears.

  • By Uploading a File

    The user can upload a PDB-formatted protein file with a keyword description. Upon successful submission and upload of the file, topology building starts and the user is redirected to the result page.

Generic Topology Result

From the Topology Browser, once a topology has been selected (by clicking on "see topology"), the user is redirected to the page for the queried topology. This page has details such as the statistical significance and prevalence status (True/False) of the queried topology, along with the available significance scores (P-values [see below]). The panel on the right shows a generic graph representation of the topologies in this group.

Following the top menu, the bottom panel lists the proteins and domains in that topology group. Relevant information on the protein chains is provided, with links to the individual protein pages.

ProLego server protein (or domain) results:

  • Protein Information: The first panel provides the name, molecule, residues, sequence and SSE sequence (from N- to C-terminal) of the queried protein/domain.
  • Protein Topology: The secondary structure contact map is represented as an adjacency matrix graph of SSEs from N- to C-terminal (left to right on the X-axis and bottom to top on the Y-axis). Contact types are colour coded, whereas non-contacting SSEs are shown as grey rectangles. Hover over a rectangle to see the contact and orientation of the two SSEs.
  • The right panel shows a brief introduction to the topology and its prevalence status as defined by the ProLego DB.

  • Modules: For topologies with SSE >= 4, topology modules have been extracted and are shown in ascending order of the number of component SSEs [SSE = 3 to N-1; N = number of SSEs]. A list of protein:chain entries and domains from ProLego DB that have these modules as topologies is also provided.
  • Linear Topology Graph: This SVG plot represents the connectivity and arrangement of SSEs in the protein chain from N- to C-terminal (left (blue) to right (red)). The constituent SSEs (helices as rectangles and strands as triangles) are shown, along with their relative orientation in the case of strands. The connecting arcs represent the contacts between the corresponding SSEs, with the colour indicating the type of orientation (green: parallel, red: anti-parallel and blue: orthogonal). The zoom button in the bottom-right corner can be used to scale the visualisation up or down.
  • Topology Graph: A 2D SVG graph representing the topological arrangement of SSEs in a plane. The SSEs are represented as circles (helices) and arrows (strands), coloured from blue to red (N- to C-terminal). Sequential connections are shown as solid black lines, whereas spatial connections are shown as dashed red lines. Hovering over a block shows the structural details (start and end residues in the PDB) of the corresponding SSE.
  • Similar in ProLegoDB: If the ProLego DB has any protein or domain entries with a similar topology, this section shows them as a table.

Statistical tests

The ProLego server has two levels of statistical tests. As the grouping of proteins is primarily based on the composition of secondary structures, it is crucial to verify whether the representation of a group is statistically significant. The next level of test addresses the prevalence of a group of topologies.

  • Topology group Significance: The occurrences of topologies for a given composition of secondary structures are supposed to be equally distributed if chosen at random. This is our null hypothesis. We test it for a particular topology across different datasets, using the Chi-Square test as implemented in Python (scipy). (As per requirements, topological sets with a representation of less than 5 have not been considered.) If the null hypothesis is rejected at a confidence level of 99.9% (P-value <= 0.001), the topology representation of that SSE composition group is considered statistically significant.
  • Prevalent group Significance: Within a group of observed topologies, all topologies are supposed to be equally probable (if occurring at random). However, across datasets, a subset of topologies is observed to be consistently over-represented. We have grouped these topologies and refer to them as prevalent, and to the rest as less-prevalent. The difference between these two groups is tested using the Wilcoxon Rank-Sum Test. If the difference is found to be statistically significant (P-value < 0.01), we consider the subset of topologies "prevalent" (or most-frequent).
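
The two tests above can be sketched with scipy as follows. This is an illustration under stated assumptions, not the server's code: the function names and the example counts are hypothetical, the chi-square test is run against a uniform expectation (the null hypothesis described above), and how counts from several datasets are combined is not specified in the text, so a single list of counts per group is assumed.

```python
from scipy.stats import chisquare, ranksums

MIN_COUNT = 5            # topologies represented fewer than 5 times are left out
ALPHA_GROUP = 0.001      # threshold for topology group significance
ALPHA_PREVALENT = 0.01   # threshold for prevalent group significance


def group_is_significant(counts):
    """Chi-square test of observed topology counts against a uniform split.

    `counts` holds the number of chains observed for each topology of one
    SSE composition group (hypothetical input layout).
    """
    kept = [c for c in counts if c >= MIN_COUNT]
    if len(kept) < 2:
        return False, None  # nothing to compare
    # With no f_exp argument, chisquare assumes equal expected frequencies.
    _, p_value = chisquare(kept)
    return bool(p_value <= ALPHA_GROUP), float(p_value)


def prevalent_is_significant(prevalent, less_prevalent):
    """Wilcoxon rank-sum test between prevalent and less-prevalent counts."""
    _, p_value = ranksums(prevalent, less_prevalent)
    return bool(p_value < ALPHA_PREVALENT), float(p_value)


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(group_is_significant([50, 30, 10, 10, 2]))
    print(prevalent_is_significant([50, 45, 60, 55, 48], [5, 6, 7, 8, 9, 10]))
```

Note that `ranksums` uses a normal approximation, so very small groups give only approximate P-values.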

PDB structure to Topology String (or Contact String)

As shown in the figure on the left, contacts between pairs of secondary structures are projected onto an adjacency matrix. The contact type encodes the SSE pair (H:H as H; E:E as E and H:E as C) together with the contact orientation (a, p and r). The 1D contact string is a segmental representation (segments separated by "-") of the adjacency matrix, in which the sequentially adjacent SSEs form the first segment and the sequentially most distant SSEs (N- and C-terminal) form the last segment.
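
The segment ordering described above amounts to reading the adjacency matrix one diagonal at a time, starting next to the main diagonal. The sketch below illustrates that ordering only. Several details are assumptions rather than the published format (see Khan & Ghosh, 2015): the function name, the two-character token (pair code followed by orientation letter), the "0" placeholder for non-contacting pairs, and the way tokens are concatenated inside a segment.

```python
# Pair codes taken from the text: H:H -> H, E:E -> E, H:E -> C.
PAIR_CODE = {
    ("H", "H"): "H",
    ("E", "E"): "E",
    ("H", "E"): "C",
    ("E", "H"): "C",
}


def contact_string(sse, contacts, no_contact="0"):
    """Flatten SSE contacts into a "-"-separated string, diagonal by diagonal.

    sse        composition string of "H"/"E" letters, N- to C-terminal
    contacts   dict mapping an index pair (i, j), i < j, to an orientation
               letter ("a", "p" or "r"); missing pairs are non-contacting
    no_contact placeholder token for a non-contacting pair (an assumption)
    """
    n = len(sse)
    segments = []
    # offset 1 = sequentially adjacent SSEs, offset n-1 = first vs last SSE
    for offset in range(1, n):
        tokens = []
        for i in range(n - offset):
            j = i + offset
            orientation = contacts.get((i, j))
            if orientation is None:
                tokens.append(no_contact)
            else:
                tokens.append(PAIR_CODE[(sse[i], sse[j])] + orientation)
        segments.append("".join(tokens))
    return "-".join(segments)


if __name__ == "__main__":
    # Three helices, all in contact (hypothetical orientations).
    print(contact_string("HHH", {(0, 1): "a", (1, 2): "a", (0, 2): "p"}))
```

A topology with N SSEs always yields N-1 segments, and the k-th segment holds N-k tokens, so the string can be mapped back onto the upper triangle of the matrix.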

Please refer to : Khan, T., & Ghosh, I. (2015). Modularity in protein structures: study on all-alpha proteins. Journal of Biomolecular Structure and Dynamics, 1102(May), 1–15.

Dataset Selection

The Protein Data Bank has been filtered for X-ray structures of good resolution and clustered at sequence identities of 80%, 60% and 30%. The non-redundant subsets were generated with CD-Hit and the PISCES server. The main goal of varying the data is to check the consistency of the resulting topology groups and the robustness of the prevalence classes.

Other datasets tested:
  • CATH (v. 4.1): Domains from the sequence cluster of 40.
  • SCOP (v. 1.75): Domains from the curated ASTRAL set clustered at 30% sequence identity.

Application Design

The application is developed in Python (v2.7.9-12). The web server is implemented in Django (> v. 1.9). Database support is provided by MySQL-lite.

The statistically significant topologies in ProLegoDB can be used by running the local version of proLego. Implementation details can be found in the GitHub repository (as mentioned below). The local version can also be used if you have an ensemble of structures. The output is in JSON format, with topology information as well as module information.

Code can be downloaded from the GitHub repository.

The figure shows the schema of the ProLego database. The four tables are connected with appropriate keys.

Taushif Khan* @
PhD Student, MANF-SRF
email: taushi14_sit AT jnu.ac.in, taushifkhan AT gmail.com

Shailesh K. Panday* @
PhD Student, BINC-JRF
email: shaile27_sit AT jnu.ac.in

Prof. Indira Ghosh*
email: indira0654 AT gmail.com

* School of Computational & Integrative Sciences (SCIS), Jawaharlal Nehru University (JNU), New Delhi, India-110067
@ Contact for bugs and queries.

The developers would like to acknowledge help from the following:

Acknowledgement for funding

  • UGC-MANF grant for PhD and DBT-CCPM project (TK)
  • DBT-BINC fellowship (SKP)

References

  1. Khan, T., & Ghosh, I. (2015). Modularity in protein structures: study on all-alpha proteins. Journal of Biomolecular Structure and Dynamics, 1102(May), 1–15. http://doi.org/10.1080/07391102.2014.1003969
  2. Taylor, W. R., Chelliah, V., Hollup, S. M., MacDonald, J. T., & Jonassen, I. (2009). Probing the “dark matter” of protein fold space. Structure, 17(9), 1244–52. http://doi.org/10.1016/j.str.2009.07.012
  3. Wei, G., Xi, W., Nussinov, R., & Ma, B. (2016). Protein Ensembles: How Does Nature Harness Thermodynamic Fluctuations for Life? The Diverse Functional Roles of Conformational Ensembles in the Cell. Chemical Reviews, 116(11), 6516–6551. http://doi.org/10.1021/acs.chemrev.5b00562
  4. Taylor, W. R., Jones, D. T., & Sadowski, M. I. (2012). Protein topology from predicted residue contacts. Protein Science, 21(2), 299–305. http://doi.org/10.1002/pro.2002
  5. Banerji, A., & Ghosh, I. (2009). Revisiting the myths of protein interior: studying proteins with mass-fractal hydrophobicity-fractal and polarizability-fractal dimensions. PLoS ONE, 4(10), e7361. http://doi.org/10.1371/journal.pone.0007361
  6. Rorick, M. (2012). Quantifying protein modularity and evolvability: a comparison of different techniques. BioSystems, 110(1), 22–33. http://doi.org/10.1016/j.biosystems.2012.06.006
  7. Cuff, A., Redfern, O. C., Greene, L., Sillitoe, I., Lewis, T., Dibley, M., … Orengo, C. (2009). The CATH Hierarchy Revisited — Structural Divergence in Domain Superfamilies and the Continuity of Fold Space. Structure, 17(8), 1051–1062. http://doi.org/10.1016/j.str.2009.06.015
  8. Andreeva, A., Howorth, D., Brenner, S. E., Hubbard, T. J. P., Chothia, C., & Murzin, A. G. (2004). SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Research, 32(Database issue), D226-9. http://doi.org/10.1093/nar/gkh039
  9. Kabsch, W., & Sander, C. (1983). Dictionary of Protein Secondary Structure: Pattern Recognition of Hydrogen-Bonded and Geometrical Features. Biopolymers, 22, 2577–2637.
  10. Skolnick, J., Arakaki, A. K., Lee, S. Y., & Brylinski, M. (2009). The continuity of protein structure space is an intrinsic property of proteins. Proceedings of the National Academy of Sciences of the United States of America, 106(37), 15690–5. http://doi.org/10.1073/pnas.0907683106