Journal of Natural Science, Biology and Medicine

Year : 2012  |  Volume : 3  |  Issue : 2  |  Page : 139-146

Identification and analysis of biomarkers for mismatch repair proteins: A bioinformatic approach

Manika Sehgal, Tiratha Raj Singh 
Department of Biotechnology and Bioinformatics, Jaypee University of Information Technology, Waknaghat, Solan, H.P., India

Correspondence Address:
Tiratha Raj Singh
Department of Biotechnology and Bioinformatics, Jaypee University of Information Technology, Waknaghat, Solan-173234, H.P.


Introduction: Mismatch repair is a highly conserved process from prokaryotes to eukaryotes. Mutations in the human homologues of the Mut proteins impair mismatch repair and affect genomic stability, which can result in microsatellite instability (MI). MI is implicated in most human cancers, and the majority of hereditary nonpolyposis colorectal cancers (HNPCCs) are attributed to defects in MLH1. Materials and Methods: We analyzed the MLH1 protein and its associated nucleotide and protein sequences. The protein sequences involved in mismatch repair in different organisms were found to be evolutionarily related. Several proteins related to MLH1 were also identified through protein-protein interactions; all of them are either mismatch repair proteins or associated with MLH1 in various pathways. Pathway information was confirmed through the MMR and other pathways in KEGG. Q-SiteFinder showed that the active site of the MLH1 protein involves residues from the conserved pattern, takes part in ligand-protein interactions and could be a useful target site. To analyze linkage disequilibrium (LD) and common haplotype patterns in disease association, we performed statistical haplotype analysis on HapMap genotype data of SNPs genotyped in the CEU population on chromosome 3 for MLH1. Results: Various markers were found and an LD plot was generated. Two distinct blocks were identified in the LD plot, which may be independent regions of action; the first block involves 7 markers and the second 17. Conclusion: An overall correlation of 0.95 was found among all interactions of genotyped SNPs, which is significant.

How to cite this article:
Sehgal M, Singh TR. Identification and analysis of biomarkers for mismatch repair proteins: A bioinformatic approach. J Nat Sc Biol Med 2012;3:139-146




DNA mismatch repair takes place in the cells of almost every living organism, both prokaryotic and eukaryotic, which reflects its evolutionary importance. The first evidence for mismatch repair was obtained from Streptococcus pneumoniae, and subsequent work on Escherichia coli identified a number of genes that, when mutationally inactivated, cause hypermutable strains. [1],[2] Three of the encoded proteins are essential in detecting the mismatch and directing the repair machinery to it: MutS, MutH and MutL (MutS is a homologue of HexA and MutL of HexB). MLH1 heterodimerizes with PMS2 to form MutL alpha, a component of the postreplicative DNA mismatch repair (MMR) system. Defects in MLH1 are a cause of mismatch repair cancer syndrome (MMRCS), also known as Turcot syndrome or brain tumor-polyposis syndrome 1 (BTPS1), [3] of Muir-Torre syndrome (MuToS, also abbreviated MTS) and of susceptibility to endometrial cancer (ENDMC). [4] Aberrations in DNA are also produced by replication errors of DNA polymerase and by exposure of the DNA to radiation (gamma rays, X-rays, ultraviolet rays), highly reactive oxygen radicals and various chemicals in the environment. If the genetic information encoded in the DNA is to remain uncorrupted, these changes must be corrected before they become fixed as mutations. The DNA repair ability of a cell is vital to the integrity of its genome and thus to its normal functioning and that of the organism.

Mismatch repair enzymes recognize these errors and correct them. After replication, these enzymes travel down the new DNA molecules and identify mistakes by the "bulge" that results from a mismatched pair. When an error is discovered, the mismatch repair enzymes activate other enzymes that complete the DNA repair. Mutations in the mismatch repair proteins cause various disorders and affect genomic stability, which can result in microsatellite instability (MI). [5] MI is implicated in most human cancers, and the majority of hereditary nonpolyposis colorectal cancers (HNPCC) are attributed to defects in MLH1. [6],[7] DNA damage and repair are also essential processes for understanding the mechanisms of cancer, ageing and various human genetic diseases. [8] There is therefore a need to analyze these proteins and their roles in various disorders. Our approach involves a diversified analysis of the structural, functional and evolutionary aspects of these proteins.

In our study we analyzed the MLH1 protein and other associated proteins. DNA repair is initiated by MutS alpha (MSH2-MSH6) or MutS beta (MSH2-MSH3) binding to a dsDNA mismatch, after which MutL alpha is recruited to the heteroduplex. [9] Assembly of the MutL-MutS-heteroduplex ternary complex in the presence of RFC and PCNA is sufficient to activate the endonuclease activity of PMS2. [10],[11] PMS2 introduces single-strand breaks near the mismatch and thus generates new entry points for the exonuclease EXO1 to degrade the strand containing the mismatch. DNA methylation would prevent cleavage and therefore ensure that only the newly synthesized DNA strand is corrected. MutL alpha (MLH1-PMS2) interacts physically with the clamp loader subunits of DNA polymerase III, suggesting that it may help recruit DNA polymerase III to the site of MMR. MutL alpha is also implicated in DNA damage signaling, a process which induces cell cycle arrest and can lead to apoptosis in case of major DNA damage. MLH1, a mismatch repair protein present in many species, has a common signature motif, GFRGE[AG]L.

The ability to recognize and repair damaged DNA is common to all forms of life, and numerous DNA repair pathways have evolved to repair almost all possible DNA lesions. Comparative and functional genome studies of organisms help to identify conserved regions and various related disorders. [12] There is a strong relationship between DNA repair pathways and human genetic disorders, as these disorders reflect defects in several associated genes, for example in cancer and in multi-system defects that specifically affect the immune and neurological systems. [13] The protein-protein interactions involved in many complex networks and pathways are essential for understanding metabolic and cellular processes and can further serve as novel targets for therapeutic interventions. There is growing interest in understanding haplotype structures in the human genome using identified genetic markers, as haplotype structures may provide critical information on human evolutionary history and help identify genetic variants underlying various human traits. [14] Therefore MLH1, a DNA mismatch repair protein involved in various disorders, has been extensively analyzed in this study.

Materials and Methods

Various in silico approaches and computational tools were applied for the biological analysis of the MLH1 protein. First, the human MLH1 protein sequence was retrieved from NCBI and cross-referenced against the UniProt and Swiss-Prot databases. MLH1 protein sequences from other organisms such as Saccharomyces cerevisiae, Rattus norvegicus, Bos taurus and Mus musculus were also retrieved from NCBI, and all sequences were aligned using the multiple sequence alignment tools MAFFT [15] and MUSCLE. [16] Conserved motifs in these sequences were compared and confirmed through the PROSITE database. [17]
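The motif check described above can be illustrated with a minimal sketch. It assumes that the signature GFRGE[AG]L quoted in this article can be read directly as a regular expression, and it uses short invented fragments rather than real MLH1 sequences; the function name `find_motif` and the sequence labels are hypothetical and are not part of the original analysis, which relied on PROSITE.

```python
import re

# Signature pattern as written in the article; [AG] means "A or G" at that
# position, which is also valid regular-expression syntax.
MOTIF = re.compile(r"GFRGE[AG]L")


def find_motif(sequences):
    """Return {name: [0-based start positions of the motif]} for each sequence."""
    return {
        name: [match.start() for match in MOTIF.finditer(seq)]
        for name, seq in sequences.items()
    }


# Hypothetical toy fragments for illustration only (NOT real MLH1 sequences).
toy_sequences = {
    "toy_species_A": "MKTLGFRGEALASIS",  # carries GFRGEAL
    "toy_species_B": "MRSLGFRGEGLSSVA",  # carries GFRGEGL
    "toy_species_C": "MKTLGFRGEVLASIS",  # V at the variable position: no match
}

hits = find_motif(toy_sequences)
for name, positions in hits.items():
    print(name, positions)
```

A sequence with an empty position list does not carry the signature under this literal reading of the pattern; in practice the full-length sequences retrieved from NCBI would be scanned instead of the toy fragments.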

A phylogenetic tree showing the evolutionary relatedness of the sequences was obtained through Treefinder with the GTR-GI model and 10,000 replicates [18] and through the Phylogenetic Web Repeater (POWER). [19] Protein-protein interactions with MLH1 were obtained from the STRING, [20] BIND, [21] IntAct, [22] and MINT [23] PPI databases. The sub-cellular localization of MLH1 was predicted with PSORT, [24] LOCATE, [25] BaCelLo [26] and MultiLoc, [27] and MLH1 was related to various disease pathways in the KEGG database [28] [Table 1]. The protein structure of MLH1 was downloaded from the Protein Data Bank (PDB) and the active site residues were obtained from Q-SiteFinder. [29] {Table 1}

Linkage disequilibrium (LD), the non-random association of alleles at two or more loci, is widely used in population genetics. [30] Various measures have been proposed for characterizing the statistical association between alleles at different loci. The most common measures are D' and r², and both range between 0 and 1. D' is a measure of LD between two genetic markers: D' = 1 (complete LD) indicates that two SNPs have not been separated by recombination, while D' < 1 (incomplete LD) indicates that the ancestral LD was disrupted during the history of the population. Only D' values near one are a reliable measure of LD extent. r² is also a measure of LD between two genetic markers: r² = 1 (perfect LD) holds for SNPs that have not been separated by recombination and have the same allele frequencies. We applied haplotype block and haplotype tagger analysis to reveal the LD structure. [31] The haplotype analysis was performed using Haploview.
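To make the two measures concrete, the following minimal sketch computes |D'| and r² for a pair of biallelic loci from the standard textbook definitions: D = p_AB - p_A p_B, D' = |D| / D_max, and r² = D² / (p_A (1 - p_A) p_B (1 - p_B)). The function name `ld_measures` and the example frequencies are illustrative assumptions; this is not the Haploview implementation, which additionally has to estimate haplotype frequencies from unphased genotypes.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (|D'|, r^2) for two biallelic loci.

    p_ab -- frequency of the haplotype carrying allele A (locus 1) and B (locus 2)
    p_a  -- frequency of allele A at locus 1
    p_b  -- frequency of allele B at locus 2
    """
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient D
    # D_max depends on the sign of D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d_prime, r2


# Illustrative (made-up) frequencies:
perfect = ld_measures(0.3, 0.3, 0.3)      # same allele frequencies, always together
complete = ld_measures(0.2, 0.2, 0.5)     # no recombinant haplotype, unequal frequencies
independent = ld_measures(0.1, 0.2, 0.5)  # p_ab equals p_a * p_b
print(perfect, complete, independent)
```

The second example shows why the two measures are reported together: D' can equal 1 while r² stays well below 1 when the allele frequencies at the two loci differ.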


In the multiple sequence alignment (MSA) performed with MAFFT and MUSCLE [Figure 1], the conserved signature motif for mismatch repair proteins, GFRGE[AG]L, is shown within the rectangle. This indicates that the protein sequences involved in mismatch repair in different organisms are evolutionarily related, as the MLH1 proteins of these species share a conserved motif, the DNA mismatch repair proteins MutL/HexB/PMS1 signature. The PSORT subcellular localization tool predicted MLH1 to be nuclear, which was confirmed by the other available servers. A phylogenetic tree was reconstructed using Treefinder with consensus analysis for 10,000 replicates under the GTR-GI model with optimum values; it shows the evolutionary relationship of the sequences in this study [Figure 2]. The tree is in harmony with available phylogenies from studies based on different marker data. The pairing of Arabidopsis with Solanum is an interesting aspect of the tree: this pair is the most distant and forms a separate clade far from the rest of the species. Further, the longer branches of Drosophila (Insecta) and Schmidtea (Platyhelminthes) support their position between plants and higher organisms. The positions of the fungi (Ascomycetes), zebrafish (Cyprinidae) and Nasonia (Insecta), with longer branches than rodents and other mammals, complete the shape of the tree. The tree agrees with the available standard phylogenies, but the separation into two rooted blocks is a unique feature among the species in this study.{Figure 1}{Figure 2}

The trees generated by POWER and by the other phylogenetic programs showed almost the same pattern with respect to the phylogenetic trends of all the species in this study. All groups and nodes are supported by repetition scores of more than 80, except for one group of three sequences (Saccharomyces cerevisiae, Sordaria macrospora and Sordaria macrospora k-hell), whose score (60-80) is nevertheless significant, as shown in [Figure 3].{Figure 3}

The MLH1 protein is known to interact with a number of proteins involved in DNA repair pathways. In the STRING database, MLH1 interacts with proteins such as msh2, pms2, msh3, exo1 and msh6, and these interactions are supported by experimental, text-mining, gene fusion, neighborhood, co-expression and other evidence. Some of the important interactions in STRING are similar to those in the BIND database, as shown in [Figure 4] and [Table 2]. In the IntAct and MINT databases, certain new interacting proteins were found, which are listed in [Table 3] and [Table 4], respectively. GeneCards provided information on 102 proteins interacting with MLH1 [Table 5]. A comparison of all these databases therefore revealed a set of common interactions that will be of interest to researchers.{Figure 4}{Table 2}{Table 3}{Table 4}{Table 5}

A search for MLH1 in the KEGG (Kyoto Encyclopedia of Genes and Genomes) Pathway database showed that the protein plays a vital role in various biochemical pathways. Various diseases are closely associated with MLH1, as the protein occurs in the pathways of many cancers, such as colorectal cancer and endometrial cancer, and in the mismatch repair pathway. [32] Active site residues were identified from the MLH1 protein structure so that the putative sites where a ligand could bind the protein are known in advance. Because this protein is involved in a number of diseases, it needs to be analyzed in detail, and the identified pockets where a drug could bind [Figure 5] would help in designing new inhibitors for the protein. This kind of analysis can provide insight for therapeutic applications. All of the protein atoms close to a probe cluster defining the various sites are shown in [Table 6].{Figure 5}{Table 6}

Recent studies have found that each chromosome can be divided into many blocks, named haplotypes. [33] Knowledge of local linkage disequilibrium (LD) and common haplotype patterns has the potential to make disease association studies more comprehensive and efficient. [34] Haplotype tagging refers to methods of selecting the minimal number of SNPs that uniquely identify the common haplotypes (>5% in frequency). The principal use of tagging is to select a 'good' subset of SNPs to be typed in all the studied individuals. We performed haplotype analysis on HapMap genotype data of SNPs genotyped in the CEU population on chromosome 3. An LD plot was generated, and the haploblock diagram [Figure 6] shows two distinct blocks that are alternative blocks within the same locus on the LD plot. We propose that the strong correlation between the blocks indicates independent sites of action. The first block involves 7 markers and the second 17 markers, with significant statistical support. An overall correlation of 0.95 was found among all interactions of genotyped SNPs, which is significant [Figure 7].{Figure 6}{Figure 7}
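The tagging idea defined above, choosing the fewest SNPs that still tell the common haplotypes apart, can be sketched with an exhaustive search. This is only an illustration under stated assumptions: the haplotypes are taken to be already phased and filtered to the common ones, the toy haplotype strings are invented, and the function name `minimal_tag_snps` is hypothetical. It is not the Tagger algorithm used in Haploview, which selects tags by pairwise or multimarker r² thresholds.

```python
from itertools import combinations


def minimal_tag_snps(haplotypes):
    """Return the smallest tuple of SNP column indices that distinguishes
    every distinct haplotype. Exhaustive search, suitable for small blocks only."""
    n_snps = len(haplotypes[0])
    n_distinct = len(set(haplotypes))
    for k in range(1, n_snps + 1):
        for cols in combinations(range(n_snps), k):
            # Project each haplotype onto the candidate SNP subset
            reduced = {tuple(h[c] for c in cols) for h in haplotypes}
            if len(reduced) == n_distinct:
                return cols
    return tuple(range(n_snps))


# Hypothetical common haplotypes over four SNPs (invented for illustration).
common_haplotypes = ["ACGT", "ATGA", "GCGA", "GTCT"]
tags = minimal_tag_snps(common_haplotypes)
print(tags)
```

Because each SNP here has two alleles, no single SNP can separate four haplotypes, so the search returns a pair of columns. The search space grows exponentially with the number of SNPs, which is why practical tools use greedy or LD-based heuristics instead.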


From our analysis it can be concluded that a systems biology approach to the interactions of genes, proteins and networks is essential for understanding cellular processes, and that repair pathways and the associated human genetic disorders need detailed analysis. The protein sequences involved in mismatch repair in different organisms were found to be evolutionarily related, as the common motif GFRGE[AG]L occurs in the MLH1 protein of these species. Multiple sequence alignment using the MAFFT and MUSCLE servers showed that the same pattern is conserved among all species in this study. The phylogenetic tree generated from the MSA also agrees with the standard phylogeny available for various biomarkers. Several other related proteins were identified through protein-protein interactions. All associated proteins are either mismatch repair proteins or associated with MLH1 in various pathways. Pathway information was also confirmed through the MMR and other pathways in KEGG. Further analysis with Q-SiteFinder showed that the active site of the MLH1 protein also involves these residues and that this conserved pattern takes part in ligand-protein interactions, as confirmed through a complex structure of MLH1. The information generated should aid further research, and new inhibitors can be designed based on the conserved active site residues and the various ligand interaction cavities.

Marker information was generated from the sequence level to the structure level, with the conserved signature motif and the active site residues within structural pockets, respectively. In addition, evolutionary information was generated that suggests the selection of a specific and suitable molecular evolutionary model of substitution for MLH1 protein sequences among various organisms. Haplotype analysis revealed 24 (17+7) new alleles with significant statistical scores and confirmed the association of these alleles with various disorders. Two independent sites of action (two distinct but related blocks) were identified for the same allele, which might be helpful in mapping various markers on genomic data. Overall, this study provides a new direction for the multifaceted analysis of repair proteins.


1. Priebe SD, Hadi SM, Greenberg B, Lacks SA. Nucleotide sequence of the hexA gene for DNA mismatch repair in Streptococcus pneumoniae and homology of hexA to mutS of Escherichia coli and Salmonella typhimurium. J Bacteriol 1988;170:190-6.
2. Prudhomme M, Martin B, Mejean V, Claverys J. Nucleotide sequence of the Streptococcus pneumoniae hexB mismatch repair gene: Homology of HexB to MutL of Salmonella typhimurium and to PMS1 of Saccharomyces cerevisiae. J Bacteriol 1989;171:5332-8.
3. Hamilton SR, Liu B, Parsons RE, Papadopoulos N, Jen J, Powell SM, et al. The molecular basis of Turcot's syndrome. N Engl J Med 1995;332:839-47.
4. Shin BY, Chen H, Rozek LS, Paxton L, Peel DJ, Anton-Culver H, et al. Low allele frequency of MLH1 D132H in American colorectal and endometrial cancer patients. Dis Colon Rectum 2005;48:1723-7.
5. Kobayashi K, Matsushima M, Koi S, Saito H, Sagae S, Kudo R, et al. Mutational analysis of mismatch repair genes, hMLH1 and hMSH2, in sporadic endometrial carcinomas with microsatellite instability. Jpn J Cancer Res 1996;87:141-5.
6. Kruger S, Plaschke J, Jeske B, Gorgens H, Pistorius SR, Bier A, et al. Identification of six novel MSH2 and MLH1 germline mutations in HNPCC. Hum Mutat 2003;21:445-46.
7. Vasen HF, Moslein G, Alonso A, Bernstein I, Bertario L, Blanco I, et al. Guidelines for the clinical management of Lynch syndrome (hereditary non-polyposis cancer). J Med Genet 2007;44:353-62.
8. Li GM. Mechanisms and functions of DNA mismatch repair. Cell Res 2008;18:85-98.
9. Jiricny J. MutLalpha: At the cutting edge of mismatch repair. Cell 2006;126:239-41.
10. Flores-Rozas H, Clark D, Kolodner RD. Proliferating cell nuclear antigen and Msh2p-Msh6p interact to form an active mispair recognition complex. Nat Genet 2000;26:375-8.
11. Iyer RR, Pohlhaus TJ, Chen S, Hura GL, Dzantiev L, Beese LS, et al. The MutSalpha-proliferating cell nuclear antigen interaction in human DNA mismatch repair. J Biol Chem 2008;283:13310-9.
12. Manolio TA. Genomewide Association Studies and Assessment of the Risk of Disease. N Engl J Med 2010;363:166-76.
13. Suter CM, Martin IK, Ward RL. Germline epimutation of MLH1 in individuals with multiple cancers. Nat Genet 2004;36:497-501.
14. Crawford DC, Nickerson DA. Definition and clinical importance of haplotypes. Annu Rev Med 2005;56:303-20.
15. Katoh K, Misawa K, Kuma K, Miyata T. MAFFT: A novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 2002;30:3059-66.
16. Edgar RC. MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004;32:1792-7.
17. Sigrist CJA, Cerutti L, de Castro E, Langendijk-Genevaux PS, Bulliard V, Bairoch A, et al. PROSITE, a protein domain database for functional characterization and annotation. Nucleic Acids Res 2010;38:161-6.
18. Jobb G, von Haeseler A, Strimmer K. TREEFINDER: A powerful graphical analysis environment for molecular phylogenetics. BMC Evol Biol 2004;4:18.
19. Lin CY, Lin FK, Lin CH, Lai LW, Hsu HJ, Chin CH, et al. POWER: PhylOgenetic WEb Repeater-an integrated and user-optimized framework for biomolecular phylogenetic analysis. Nucleic Acids Res 2005;33:W553-6.
20. Szklarczyk D, Franceschini A, Kuhn M, Simonovic M, Roth A, Minguez P, et al. The STRING database in 2011: Functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res 2011;39:D561-8.
21. Bader GD, Donaldson I, Wolting C, Ouellette BF, Pawson T, Hogue CW. BIND-The Biomolecular Interaction Network Database. Nucleic Acids Res 2001;29:242-5.
22. Aranda B, Achuthan P, Alam-Faruque Y, Armean I, Bridge A, Derow C, et al. The IntAct molecular interaction database in 2010. Nucleic Acids Res 2010;38:D525-31.
23. Chatr-aryamontri A, Ceol A, Palazzi LM, Nardelli G, Schneider MV, Castagnoli L. MINT: The Molecular INTeraction database. Nucleic Acids Res 2007;35:D572-4.
24. Horton P, Park KJ, Obayashi T, Fujita N, Harada H, Adams-Collier CJ. WoLF PSORT: Protein localization predictor. Nucleic Acids Res 2007;35:W585-7.
25. Fink JL, Aturaliya RN, Davis MJ, Zhang F, Hanson K, Teasdale MS, et al. LOCATE: A mouse protein subcellular localization database. Nucleic Acids Res 2006;34:D213-7.
26. Pierleoni A, Martelli PL, Fariselli P, Casadio R. BaCelLo: A Balanced subCellular Localization predictor. Bioinformatics 2007;22:e408-16.
27. Hoeglund A, Doennes P, Blum T, Adolph HW, Kohlbacher O. MultiLoc: Prediction of protein subcellular localization using N-terminal targeting sequences, sequence motifs, and amino acid composition. Bioinformatics 2006;22:1158-65.
28. Kanehisa M, Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 2000;28:27-30.
29. Laurie AT, Jackson RM. Q-SiteFinder: An energy-based method for the prediction of protein-ligand binding sites. Bioinformatics 2005;21:1908-16.
30. Barrett JC, Fry B, Maller J, Daly MJ. Haploview: Analysis and visualization of LD and haplotype maps. Bioinformatics 2005;21:263-5.
31. Purcell S, Daly MJ, Sham PC. WHAP: Haplotype-based association analysis. Bioinformatics 2007;23:255-6.
32. Vilkki S, Tsao JL, Loukola A, Poyhonen M, Vierimaa O, Herva R, et al. Extensive somatic microsatellite mutations in normal human tissue. Cancer Res 2001;61:4541-4.
33. Suh Y, Vijg J. SNP discovery in associating genetic variation with human disease phenotypes. Mutat Res 2005;573:41-53.
34. The International HapMap Consortium. The International HapMap Project. Nature 2003;426:789-96.