ECCB12 Accepted Posters
Number | Title | Authors | Topic | |
---|---|---|---|---|
A 01
A1 |
This work presents a novel approach to predict functional relations between genes using gene expression data. Genes may be related in various ways, e.g. through regulatory relations or by belonging to the same protein complex or metabolic/signaling pathway, and gene expression data should contain some clues to such relations. The present approach first digitizes the log-ratio type gene expression data of S. cerevisiae to a matrix consisting of 1, 0 and -1, indicating highly expressed, no major change and highly suppressed conditions for genes, respectively. For each pair of genes, a probability mass function table is constructed indicating nine joint probabilities. Those pairs of genes were then chosen for which the sum of probability masses at selected points is statistically significant. It has been shown that such gene pairs share many Gene Ontology (GO) terms. Furthermore, the clustering of the network consisting of these gene pairs generates many modules rich in genes of similar function, as assessed by hypergeometric p-values determined using the R package GOstats. Also, it was verified using the PRIMA (PRomoter Integration in Microarray Analysis) software that in many modules many of the genes contain similar binding sites in their promoters corresponding to known transcription factors of yeast, indicating the effectiveness of the proposed approach in predicting regulatory relations together with other functional relations. |
Altaf-Ul-Amin M, Kanaya S*
*Nara Institute of Science and Technology (NAIST) Japan |
A - Applied and Translational Bioinformatics |
|
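The digitization and joint-probability steps described in poster A1 can be sketched roughly as follows. This is only an illustration under stated assumptions: the threshold of 1.0, the function names, and the choice of (1, 1) and (-1, -1) as the "selected points" are hypothetical and not taken from the abstract, and the significance test is omitted.

```python
from collections import Counter
from itertools import product


def digitize(log_ratios, threshold=1.0):
    """Map log-ratios to 1 (highly expressed), -1 (highly suppressed) or 0.

    The threshold is an assumed value, not one given in the abstract.
    """
    return [1 if x >= threshold else -1 if x <= -threshold else 0
            for x in log_ratios]


def joint_table(states_a, states_b):
    """Return the nine joint probabilities P(a, b) for a, b in {-1, 0, 1}."""
    if len(states_a) != len(states_b) or not states_a:
        raise ValueError("need two equally long, non-empty state vectors")
    counts = Counter(zip(states_a, states_b))
    n = len(states_a)
    return {pair: counts[pair] / n for pair in product((-1, 0, 1), repeat=2)}


def coexpression_mass(table):
    """Sum the mass at (1, 1) and (-1, -1).

    Which points are 'selected' is an assumption made for this sketch.
    """
    return table[(1, 1)] + table[(-1, -1)]


# Toy log-ratio profiles for two hypothetical genes over six conditions.
gene_a = digitize([2.1, 0.2, -1.5, 1.3, -0.1, -2.0])
gene_b = digitize([1.8, -0.3, -1.1, 0.4, 0.0, -1.7])
table = joint_table(gene_a, gene_b)
print(coexpression_mass(table))
```

In a fuller version, the summed mass would be compared against a null distribution (for example from permuted profiles) before a gene pair is kept.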
A 03
A3 |
Understanding complex biological systems involves the experimental study and simulation of biochemical kinetics. This phenomenon plays a key role in all biological processes, since it is responsible for the evolution of chemical concentrations by governing reaction rates. Biochemical kinetics is usually modeled using partial or stochastic differential equations. However, these methods can handle neither the geometrical constraints, such as membrane binding, nor the three-dimensional aspect, both of which are inherent to these processes. Therefore, we propose an individual-based system for the three-dimensional simulation of biochemical kinetics at the microscopic scale. Our system is populated by reactive autonomous entities, which means that they follow the stimulus-response scheme and the usual "perception-decision making-action" life cycle. During the simulation, the entities evolve in a volume by diffusing according to three-dimensional Brownian motion. These entities can undergo bimolecular reactions upon collisions or unimolecular ones individually. Our algorithms for these biochemical reactions use probability theory to determine whether reactions should happen or not. We validate our individual-based system by comparing our results with reference results obtained from differential equation models of simple reactions. We have applied our work to blood coagulation enzyme kinetics, which involves membrane binding events. Our approach provides accurate results, but this application helped us to highlight the fact that these algorithms are computationally expensive. Parallel computing seems to be a promising solution to obtain great performance improvements, and first results are genuinely encouraging. |
Crépin L*, Harrouet F, Kerdélo S, Redou P
*ENIB France |
A - Applied and Translational Bioinformatics |
|
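The individual-based scheme in poster A3 (Brownian diffusion plus a probabilistic reaction test) can be illustrated with a minimal sketch. It covers only a unimolecular decay A → ∅ and is not the authors' implementation: the function names, the time step, and the use of the standard relations σ = sqrt(2·D·Δt) per axis and p = 1 − exp(−k·Δt) are assumptions made for illustration.

```python
import math
import random


def brownian_step(position, diffusion_coeff, dt, rng):
    """Move a particle by one 3D Brownian step (Gaussian displacement per axis)."""
    sigma = math.sqrt(2.0 * diffusion_coeff * dt)
    return tuple(c + rng.gauss(0.0, sigma) for c in position)


def unimolecular_probability(rate_constant, dt):
    """Probability that a first-order reaction with rate k fires within dt."""
    return 1.0 - math.exp(-rate_constant * dt)


def simulate_decay(n_molecules, rate_constant, dt, n_steps,
                   diffusion_coeff=1.0, seed=0):
    """Simulate A -> (nothing) for individually diffusing molecules.

    Returns the number of surviving molecules after each step.
    """
    rng = random.Random(seed)
    p_react = unimolecular_probability(rate_constant, dt)
    positions = [(0.0, 0.0, 0.0)] * n_molecules
    remaining = []
    for _ in range(n_steps):
        survivors = []
        for pos in positions:
            if rng.random() >= p_react:  # the molecule does not react this step
                survivors.append(brownian_step(pos, diffusion_coeff, dt, rng))
        positions = survivors
        remaining.append(len(positions))
    return remaining


# Compare with the ODE solution N(t) = N0 * exp(-k * t) at t = 1.
counts = simulate_decay(5000, rate_constant=1.0, dt=0.02, n_steps=50, seed=1)
print(counts[-1], 5000 * math.exp(-1.0))
```

Comparing the final count with the exponential decay curve mirrors the validation against differential equation models mentioned in the abstract; bimolecular reactions would additionally need collision detection, which is left out here.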
A 04
A4 |
Natural killer T (NKT) and mucosal-associated invariant T (MAIT) cells are specialized, highly effective subsets of T cells with various roles in immunity. T cells express αβ heterodimeric receptors on their surface that enable antigen recognition. Both NKT and MAIT cells typically express semi-invariant TCRs comprised of invariant TCRα-chains that are dominant in a majority of individuals and highly similar across species. The prevalence and evolutionary conservation of the NKT and MAIT TCRs suggest that they play an important role in the immune system; it is therefore surprising that their production is left to chance by a largely random gene recombination process. In this study, we used a computational biology approach, involving bioinformatics analysis of sequence data and computer simulations of the gene recombination process, to investigate NKT and MAIT TCRα sequences for a variety of species. For all reported species, the NKT and MAIT invariant TCRα amino acid sequences exhibited features that suggest efficient production by gene recombination. Moreover, in computer simulations of a random gene recombination process, the NKT and MAIT invariant TCRα amino acid sequences were the most generated of all sequences conforming to the length restrictions associated with NKT and MAIT cells. These results suggest that the highly efficient production of NKT and MAIT invariant TCRα sequences is an important determinant of their prevalence within individuals, across individuals, and across species. |
Greenaway HY, Ng B, Price DA, Douek DC, Davenport MP, Venturi V*
*Centre for Vascular Research, University of New South Wales Australia |
A - Applied and Translational Bioinformatics |
|
A 05
A5 |
Amyloids are proteins that form β-fibrils. The majority of these proteins natively have a completely different, functional structure. Usually amyloids lead to serious diseases, e.g. Alzheimer disease (amyloid-β, tau), Parkinson disease (α-synuclein), etc. The number of diseases that turn out to be amyloid-associated is constantly increasing. Recently it was discovered that amyloidogenic properties can be due solely to short segments of amino acids in a protein sequence, which can transform the structure when not buried. A few hundred such peptides have been found experimentally; however, testing all possible amino acid combinations is not possible. Instead, they can be predicted by physico-chemical or statistical methods. We introduce our original machine learning method, which is based on site-specific correlations. The method is capable of finding the most relevant window in the positive learning set, based on differences in site-specific correlations between the positive and negative training sets, and of classifying peptides of the test set with regard to their potential amyloidogenicity. Additionally, the relevant fragment of the peptide, responsible for its amyloidogenicity, is indicated. The model also shows which residue combinations can increase the risk of amyloidogenicity at each position of a peptide. The method was trained and tested on experimental databases, reaching a sensitivity of 0.75, a specificity of 0.90 and an AUC ROC above 0.8. Tests on a computational dataset from ZipperDB, obtained by the 3D profile method, resulted in an AUC ROC close to 0.9. |
Gasior P, Kotulska M*
*Wroclaw University of Technology, Institute of Biomedical Engineering and Instrumentation Poland |
A - Applied and Translational Bioinformatics |
|
A 06
A6 |
Massively parallel sequencing allows for rapid sequencing of large numbers of sequences in just a single run. Thus, 16S rRNA amplicon sequencing of complex microbial communities has become possible. The sequenced 16S rRNA fragments (reads) are clustered into operational taxonomic units and taxonomic categories are assigned. Recent reports suggest that data pre-processing should be performed before clustering. We assessed combinations of data pre-processing steps (no pre-processing, denoising, chimera checking) and clustering algorithms (BLAST, CD-HIT, ESPRIT-Tree, mothur’s cluster, UCLUST) on cluster accuracy for oral microbial sequence data.
The number of clusters varied up to two orders of magnitude depending on the pre-processing. Pre-processing using both denoising and chimera checking resulted in a number of clusters that was closest to the number of species in the mock dataset (25 versus 16). Based on run time, purity, and normalized mutual information (NMI), we could not identify a single best clustering algorithm. The differences in clustering accuracy among the algorithms after the same pre-processing were minor compared to the differences in accuracy among different pre-processing steps. The pre-processing method that resulted in the highest NMI, the highest purity and the number of clusters closest to the expected number of clusters was the combination of denoising with chimera checking. |
Bonder MJ, Abeln S, Zaura E, Brandt BW*
*Academic Centre for Dentistry Amsterdam (ACTA) Netherlands |
A - Applied and Translational Bioinformatics |
|
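The two accuracy measures named in poster A6, purity and normalized mutual information (NMI), can be computed as in the sketch below. Several NMI normalizations are in use and the abstract does not say which one was applied; dividing by the arithmetic mean of the two entropies is an assumption here, as are the function names and the toy labels.

```python
import math
from collections import Counter


def purity(clusters, classes):
    """Fraction of items that belong to the majority class of their cluster."""
    joint = Counter(zip(clusters, classes))
    best = {}
    for (cluster, _label), count in joint.items():
        best[cluster] = max(best.get(cluster, 0), count)
    return sum(best.values()) / len(clusters)


def entropy(labels):
    """Shannon entropy (natural log) of a labelling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(clusters, classes):
    """Mutual information divided by the mean of the two entropies (assumed)."""
    n = len(clusters)
    joint = Counter(zip(clusters, classes))
    size_c = Counter(clusters)
    size_k = Counter(classes)
    mi = sum((count / n) * math.log(count * n / (size_c[c] * size_k[k]))
             for (c, k), count in joint.items())
    mean_h = (entropy(clusters) + entropy(classes)) / 2.0
    return mi / mean_h if mean_h > 0 else 1.0


# Toy mock community: two species, reads split into three clusters.
species = ["a", "a", "a", "b", "b", "b"]
otus = [0, 0, 1, 1, 2, 2]
print(purity(otus, species), nmi(otus, species))
```

Purity alone rewards over-splitting (one cluster per read gives purity 1), which is one reason to report it together with NMI and the number of clusters, as the abstract does.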
A 07
A7 |
The Gene Ontology (GO) provides core biological knowledge representation for modern biologists, whether computationally or experimentally based. GO resources include biomedical ontologies that cover molecular domains of cellular life forms as well as extensive compilations of gene product annotations to these ontologies that provide comprehensive statements about what gene products do. Although GO resources are extensively used in data analysis workflows and widely incorporated into numerous data analysis platforms and applications, the general user often misses fundamental distinctions about GO structures, GO annotations, and what can and cannot be extrapolated from GO resources. I will present 10 Rules for Using the GO. |
Blake J*
*The Jackson Laboratory United States of America |
A - Applied and Translational Bioinformatics |
|
A 08
A8 |
Prostate cancer is the most frequently diagnosed internal malignancy in the western world. While the majority of prostate cancers are non-lethal, there is currently no reliable approach to distinguish lethal from non-lethal prostate cancer at an early, curable stage. To better understand the underlying molecular mechanisms governing lethality in prostate cancer, we have performed whole genome sequencing of matched whole-blood, primary tumour and distant metastases in two individuals, as well as RNA-seq and DNA methylation profiling. Integration of these data poses significant computational challenges, as current approaches are designed for matched normal and primary tumour only. We present an overview of the challenges faced, and of the computational approaches subsequently designed to handle integration of genomics data in the normal-tumour-metastasis whole-genome sequencing setting. Building on the commonly used Genome Analysis Toolkit (GATK), we have identified a set of enhancements which utilise the relationship between matched normal, primary tumour and metastases to provide accurate and relevant variant discovery. |
Macintyre G*, Hong MKH, Pederson J, Ryan A, Costello A, Shi F, Kowalczyk A, Phal P, Hovens CM, Corcoran NM
*NICTA and The University of Melbourne Australia |
A - Applied and Translational Bioinformatics |
|
A 09
A9 |
Continuous technological improvements facilitate the availability of huge amounts of data resulting from the simultaneous characterisation of different aspects belonging to the same biological process. It is possible to measure the activity of thousands of transcripts, hundreds of proteins and hundreds of metabolites. Only the integrative analysis of all data types yields a deeper understanding of the biological process of interest.
Most of the current analysis techniques are based on the assumption of a direct correlation between genes and proteins. This assumption does not seem to hold due to post-transcriptional and translational expression regulation processes. Here we discuss two alternatives to conservative analysis techniques. Co-inertia analysis (CIA) is an integrative analysis method used to visualize and explore transcriptome and proteome data [Culhane et al. 2003, Fagan et al. 2004]. The generalised singular value decomposition (GSVD) has shown its potential in the analysis of two transcriptome data sets [Alter et al. 2003]. We compare CIA and GSVD by applying them to transcriptome and proteome data of three P. pastoris strains [Dragosits et al. 2010] cultivated at different osmotic conditions. Using CIA we visualise the three strains in a 2D plane and interpret the spatial configuration. With GSVD we decompose the data sets in matrices with biologically meaningful interpretations. Through projection in different subspaces of the GSVD we explore the processes captured by the data sets. We propose the GSVD for the analysis of transcriptome and proteome data (metabolome data integration easily possible) showing that it is a suitable integrative analysis technique. |
Tomescu OA*, Dragosits M, Graf A, Gasser B, Mattanovich D, Thallinger GG
*Austrian Centre for Industrial Biotechnology / Graz University of Technology Austria |
A - Applied and Translational Bioinformatics |
|
A 10
A10 |
While the human intestinal microbiota naturally alters with age, it is known to be further perturbed after antibiotic therapy. The impact of antibiotic therapy on the composition of the intestinal microbiota of a cross-section of the elderly Irish population (n=185, >65 yr) was investigated, taking into consideration their residence location. Forty-two of the 185 elderly subjects were treated with at least one antibiotic within one month of faecal microbiota profiling. The residence location of the elderly subjects varied from long-term nursing care and rehabilitation wards to day hospitals and the community. Microbiota profiling revealed significant changes across the Firmicutes (p=0.026), Proteobacteria (p=0.003) and Bacteroidetes (p=0.01) phyla after antibiotic therapy. Time since cessation of antibiotic therapy had no impact on the composition of the microbiota. This impact of antibiotic therapy in the elderly should be considered in the development of products for healthy aging. |
O'Sullivan O*, Coakley M, Lakshminarayanan B, Conde S, Claesson MJ, Cusack S, Fitzgerald AP, O'Toole PW, Stanton C, Ross RP
*Teagasc Ireland |
A - Applied and Translational Bioinformatics |
|
A 11
A11 |
Background:
Proteins are directed to cellular compartments by peptide sequences that act as targeting signals. Mislocalization due to disrupted signaling caused by sequence mutations is likely to have a major impact on protein function, as well as on physiological processes that such a function brokers. Localization models achieve good predictive performance for most individual cell compartments, but fail to scale to many transport signals simultaneously or to integrate existing annotations. Methods: We describe HumLoc, a protein subcellular annotation pipeline aimed at Homo sapiens and closely related mammalian organisms. This tool enriches expert-curated localization information from UniProt with machine learning predictions to maximize coverage and provide a one-stop shop. Integration of both types of information is achieved by mapping compartments to a 3-element ontology (extracellular, cell membrane or intracellular) which eliminates granularity differences in annotations and makes prediction more tractable. Results: The prediction pipeline of HumLoc achieves an estimated 83% correct classification rate assigning proteins to the 3-element ontology. It is of special interest for the interpretation of GWAS results. Given a set of SNPs, HumLoc can be used to filter those which are predicted to alter localization, potentially leading to a disease. We present such a set of germline and somatic mutations, in addition to some general findings about mislocalization SNPs. Availability: A preliminary web interface as well as a web service can be accessed at http://cbs.dtu.dk/cgi-bin/humloc-2.0.cgi. These allow browsing existing subcellular annotations and pre-computed predictions. Furthermore, the submission of novel sequences is also possible. |
Rubio García A*, de Bono B, Nielsen H, Gupta R
*Technical University of Denmark Denmark |
A - Applied and Translational Bioinformatics |
|
A 12
A12 |
The forkhead box (FOX) L2 protein belongs to an evolutionarily conserved family of transcription factors that regulate transcription of genes involved in cell growth, proliferation and differentiation and play a central role during development. FOXL2 is involved in female sex determination, ovarian development and follicle formation. In humans, mutations in FOXL2 are known to cause blepharophimosis, ptosis, and epicanthus inversus syndrome (BPES), associated with premature ovarian failure (POF) in type I.
Recently, a somatic FOXL2 mutation (p.Cys134Trp) has been found to be associated with adult ovarian granulosa cell tumor (OGCT). Here, we present preliminary results from an ongoing study of Foxl2/DNA binding sites in mouse ovarian tissue and in the mouse pituitary gonadotropic cell line alpha T3-1. We used ChIP-seq, which combines chromatin immunoprecipitation (ChIP) with high-throughput massively parallel sequencing, to identify the genomic locations bound by Foxl2. Interestingly, we found a high relative enrichment level of Foxl2-bound regions in the promoters of known genes (RefSeq) with respect to the genome background (5% ovary; 7% alpha T3-1). Using MEME-chip (http://meme.sdsc.edu/meme/) and CisFinder (http://lgsun.grc.nia.nih.gov/CisFinder/), which are specifically designed for finding over-represented short DNA motifs in ChIP-seq data, we were able to clearly identify a core consensus binding sequence for FOXL2 that is over-represented in the enriched ChIP regions from both experiments. The results of this study will help in finding new targets for Foxl2 as well as in better understanding the mechanisms underlying the role of Foxl2 in ovarian and pituitary development. |
Sbardellati A*, Marongiu M, Meloni A, Marcia L, Cusano R, Angius A, Fotia G, Crisponi L
*CRS4 - Center for Advanced Studies, Research and Development in Sardinia Italy |
A - Applied and Translational Bioinformatics |
|
A 13
A13 |
Motivation:
Short tandem repeats (STRs) are common motifs in the human genome that are difficult to genotype on a genome-wide scale. Here, we describe a method to use short read sequencing data to infer the genotype (repeat length) of the two copies of an STR locus in a diploid individual, which we have tested using a data set of 15-mers. Results: Our method, STRIE, first aligns read pairs from the individual to the reference genome and calculates the distances between the mapped positions of the two reads of each pair (the ‘mapped paired end read separations’, or MPERS). To infer the genotype of an individual at a given STR locus, it compares the distribution of MPERS for read pairs that map on either side of that locus to the distribution of MPERS for all read pairs. We demonstrate STRIE using simulated Illumina sequence data based on individual NA12878 who was sequenced in the 1000 Genomes project. For 15-mer STR loci, we identify the correct length for at least 66% of loci within 2 repeats (within 30 bp) relative to the length in the reference genome. Availability and implementation: The source code written in C++ is obtainable upon request and a release under open source software license is planned in the near future. |
Whitener W, Lyberg D*, Coghlan A, Durbin R
*University College Cork Ireland |
A - Applied and Translational Bioinformatics |
|
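The core intuition behind comparing MPERS distributions in poster A13 can be sketched as follows. If an individual carries more repeat units than the reference, read pairs spanning the locus map closer together on the reference, so their separations shift downwards by the extra length. The sketch is a deliberate simplification and not the STRIE algorithm: it assumes a homozygous locus and uses a median shift instead of a full distribution comparison, and the function name and toy numbers are hypothetical.

```python
import statistics


def estimate_repeat_change(spanning_mpers, background_mpers, motif_length=15):
    """Estimate the change in repeat count relative to the reference.

    spanning_mpers: mapped separations of read pairs flanking the STR locus.
    background_mpers: mapped separations of all read pairs (the library's
    insert-size distribution).
    A positive result means the individual has more repeat units than the
    reference; a negative result means fewer.
    """
    shift_bp = (statistics.median(background_mpers)
                - statistics.median(spanning_mpers))
    return round(shift_bp / motif_length)


# Toy data: library median separation 300 bp; pairs over the locus map ~30 bp closer.
background = [300, 310, 290, 305, 295, 300, 298, 302]
spanning = [270, 268, 272, 265, 275]
print(estimate_repeat_change(spanning, background))
```

A heterozygous locus would produce a mixture of two shifted distributions, which is presumably why the abstract describes comparing whole distributions rather than a single summary statistic.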
A 14
A14 |
Over the past few years thousands of microarray results have become available for exploratory and meta-analysis studies. The quality control step that allows elimination of poor quality arrays is essential for developing valuable databases dedicated to these tasks. The classical quality assessment methods that are intended for identification of outlier arrays within a single experiment may be insufficient to eliminate experiments where the majority of samples are of low quality; thus the development of new methods is necessary.
In the current study we tested known methods for quality assessment of Affymetrix microarrays along with two new methods proposed by us – average rank Inter Quantile Range (arIQR) and PM/MM t-test. As an independent measure of quality we determined how well the gene expression profile from each array correlates with a reference expression profile of homologous genes in the same organ from different species. We discovered that samples that show lower correlation with the reference could be identified as being of poor quality on the basis of some of the quality metrics. The newly proposed method arIQR outperforms all the other methods for this task and will be implemented in Bgee [1], the database built for evolutionary comparison of gene expression patterns between animal species. [1] Bastian F., Parmentier G., Roux J., Moretti S., Laudet V., Robinson-Rechavi M. (2008) Bgee: Integrating and Comparing Heterogeneous Transcriptome Data Among Species. In DILS: Data Integration in Life Sciences. Lecture Notes in Computer Science. 5109:124-131 |
Rosikiewicz M*, Robinson-Rechavi M
*University of Lausanne Switzerland |
A - Applied and Translational Bioinformatics |
|
A 15
A15 |
Mesenchymal stem cells (MSCs) are a group of multipotent cells present in the stromal fraction of human tissues. MSCs are capable of multilineage differentiation into osteoblasts, adipocytes, chondrocytes and other cells. There is particular interest in understanding the biology of MSCs due to their potential as therapeutic agents in numerous diseases. Whilst the later stages of osteoblast differentiation are understood, little is known about the early commitment stages of MSCs to osteoblastic cells. Here, we have used multiple technology platforms to understand the regulatory processes in this early stage of osteoblastic differentiation. Illumina next-generation RNA sequencing (RNASeq) and Affymetrix Whole Transcript (WT) microarray technology have been used to investigate the mRNA changes associated with differentiation of the human MSC cell line (hMSC-TERT) into osteoblasts. Gene expression was sampled over a time course from day 0 to day 12 after differentiation induction. In order to focus on transcriptional regulatory processes that are taking place during differentiation, a previously manually curated list of transcription factors was used to filter the normalised microarray dataset. The transcription factor gene expression profiles were then clustered using self-organising maps (SOMs) to identify inherent patterns in the data. A number of known and novel transcription factors were shown to exhibit strong up-regulation across the time course. These results are being further investigated using RNASeq, which will allow the expression of genes to be verified, along with the investigation of alternative splicing and promoter usage. |
Twine N*, Wilkins M, Kassem M
*University of New South Wales Australia |
A - Applied and Translational Bioinformatics |
|
A 16
A16 |
Sequence based methods for prediction of signal peptides share a common problem: the difficulty in distinguishing between signal peptides and transmembrane helices. We present here a new version of SignalP, the most widely used tool for signal peptide prediction, which has been constructed specifically to address this problem. By extensive benchmarking using realistic data sets where transmembrane proteins are abundant, we show that SignalP 4.0 outperforms ten other current methods, including transmembrane helix topology predictors with built-in signal peptide models. The new version is available at http://www.cbs.dtu.dk/services/SignalP/
|
Nielsen H*, Petersen TN, Brunak S, von Heijne G
*Technical University of Denmark Denmark |
A - Applied and Translational Bioinformatics |
|
A 17
A17 |
Natural products as synthesised by microorganisms are a rich resource for the pharmaceutical and cosmetic industry. These compounds are used e.g. as antibiotics, anti-cancer drugs, pigments, and flavours. Many microorganisms have the ability to produce a great number of different compounds, but only a few of these are detectable under laboratory conditions. Thus, it seems promising to search the genomes of these microorganisms to identify silent gene clusters encoding the biosynthesis of these compounds. In order to facilitate genome mining for natural products, we have created antiSMASH, a web-based detection and annotation pipeline focused on elucidating natural product biosynthesis pathways, annotating the involved proteins, predicting the chemical properties and structures of the respective products, and providing a comparison to already published gene clusters or biosynthesis proteins. As antiSMASH is gaining popularity in the natural product research community, the architecture of the pipeline is being improved to scale up with the increased workload. After the original paper was released in July 2011, the pipeline was handling around 60 jobs per week. At the time of writing, the load has risen by about a factor of ten, with the pipeline handling around 600 jobs per week on average, with load peaks of over 300 jobs a day. In the current development version antiSMASH 2.0, the pipeline has been redesigned to better scale across multiple servers. Applying the Active-Object-Pattern using an Advanced Message Queuing Protocol, multiple servants can concurrently process requests. This poster will present the computational details of the implementation of antiSMASH 2.0. |
Blin K*, Medema MH, Cimermancic P, de Jaeger V, Zakrzewski P, Fischbach MA, Wohlleben W, Breitling R, Takano E, Weber T
*University of Tübingen Germany |
A - Applied and Translational Bioinformatics |
|
A 18
A18 |
Metabolite annotation is one of the major challenges of mass spectrometry (MS) based untargeted metabolomic studies and depends crucially on exact mass measurements. The quality and efficiency of metabolite annotations can be strongly improved by the development of innovative strategies, including a precise evaluation of the instrument-specific mass measurement. The quadrupole time of flight mass spectrometer (Q-TOF-MS), which is the instrument of choice for many metabolomics studies, has been evaluated several times in the past in order to find the important factors affecting mass measurements. However, the applicability of these studies to large scale untargeted metabolomics experiments is limited. In this work, we demonstrate that the quality of peak annotations can be improved by using an adaptive mass accuracy window, estimated by a continuous surface function constructed by analyzing an extensive data set of authentic chemical standards. A method for creating a similar mass accuracy model based on a laboratory-specific database of standards is outlined, and a practical application of the model is evaluated and discussed. |
Shahaf N*, Wehrens R
*Iasma Italy |
A - Applied and Translational Bioinformatics |
|
A 19
A19 |
MicroRNAs are ~22-nucleotide (nt) long molecules that regulate the expression of genes post-transcriptionally. About 1% of the predicted genes in humans, C. elegans, D. melanogaster and A. thaliana code for miRNAs, which makes them one of the most abundant non-coding RNAs in humans and plants. However, there is still much to know about the factors that promote a successful miRNA-mRNA interaction. The discovery of miRNAs in C. elegans by Lee and coworkers in 1993 was the beginning of the race to understand the mechanisms by which miRNAs exert their functions. Even the most established rules, the “seed rule” and the “G:U wobble rule”, have been proven not to apply in many cases. The former states that 3'UTRs that contain perfect 6- to 8-bp match to the 5' end of a miRNA (the “seed”, usually starting at position 2 of the miRNA) are generally expected to be regulated by this miRNA, and the latter states that G:U wobble pairs in the seed are detrimental for miRNA-target interactions.
We propose a straightforward way to combine the tens of currently available prediction algorithms, and assign them a credibility measure based on their performance compared with experimental validations. By combining several algorithms based on different criteria, we improve the prediction of miRNA targets. Most of the recently proven rules for valid miRNA-mRNA interactions are included in our strategy (local AU enrichment, 3' complementary pairing, conservation and others) but we also keep those containing non-conserved miRNAs. |
Martinez-Herrera DJ*, Sanchez Caballero I, Tabas-Madrid D, Muniategui A, Nogales-Cadenas R, Sorzano COS, Rubio A, Pascual-Montano A
*Spanish National Center for Biotechnology-CSIC Spain |
A - Applied and Translational Bioinformatics |
|
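The "seed rule" described in poster A19 lends itself to a small worked example: take the 6-nt seed starting at position 2 of the miRNA, reverse-complement it, and look for perfect matches in a 3'UTR. The sketch below is an illustration only; the function names, the miRNA sequence and the UTR fragment are assumptions made for this example, and no G:U wobble pairing or other rule from the abstract is modelled.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def to_rna(sequence):
    """Upper-case a sequence and convert DNA letters to RNA."""
    return sequence.upper().replace("T", "U")


def seed_site(mirna, start=2, length=6):
    """Return the target site (reverse complement of the miRNA seed).

    start is 1-based; the default seed covers miRNA positions 2 to 7.
    """
    seed = to_rna(mirna)[start - 1:start - 1 + length]
    return "".join(COMPLEMENT[base] for base in reversed(seed))


def find_seed_matches(mirna, utr, start=2, length=6):
    """Return 0-based positions of perfect seed matches in a 3'UTR."""
    site = seed_site(mirna, start, length)
    utr = to_rna(utr)
    hits = []
    pos = utr.find(site)
    while pos != -1:
        hits.append(pos)
        pos = utr.find(site, pos + 1)
    return hits


# Illustrative sequences (5'->3'); the UTR fragment is made up for this example.
mirna = "UGAGGUAGUAGGUUGUAUAGUU"
utr = "AAGCUACCUCAGGGUACCUCUU"
print(seed_site(mirna), find_seed_matches(mirna, utr))
```

Passing `length=7` or `length=8` gives the longer seed variants mentioned in the abstract. A combined predictor of the kind proposed would treat such matches as one line of evidence among several.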
B 01
B1 |
Due to its accessibility and ease of manipulation, the limb has a long and fruitful tradition as a model system for experimental, theoretical and computational studies on organogenesis. Important pathways and drivers of morphogenesis have been identified, but an understanding of the dynamics among them is just emerging. We want to integrate this knowledge step by step into a computational model, in parallel with studies by our wet lab collaborators, to improve experimental design and the understanding of patterning, evolution and morphogenesis in the context of limb bud development.
Together with our collaborators we are collecting and analyzing 3D data. Our reaction-diffusion model for patterning reproduces wild type and mutant data well on a static domain. We are assessing the importance of directed cell behavior employing cellular Potts models. We now focus on growth and down-stream effects, e.g. digit condensation. Finally we want to integrate these models to understand phenotypes in shape. |
Germann P*, Iber D
*D-BSSE, ETHZ Switzerland |
B - Bioimaging, Spatial-temporal Modeling and Data Visualization |
|
B 02
B2 |
Many organs of higher organisms, such as lung, kidney, and glands, are heavily branched structures. The branched trees of lungs and kidneys are generated by the sequential, non-random use of simple modes of branching. Genetic studies have led to the identification of key molecular players. However, an integrative, mechanistic understanding of the branching process has remained elusive.
We propose a model for lung branching morphogenesis based on the interactions between FGF10, SHH and its receptor Patched, and one for kidney based on GDNF signaling through its receptor RET and co-receptor Gfra1. The models are formulated as sets of PDEs on a deforming domain. The key assumptions in the models are that the ligand diffuses faster than the receptor and that ligand-receptor signaling upregulates receptor production. The proposed models generate bifurcations, trifurcations and lateral modes of branching, which are observed during lung and/or kidney morphogenesis. Modeling also shows that switching between branching modes may be a result of extra-regulatory interactions, growth speed, or local FGF10 concentration. An extended model shows that a simple network is sufficient to control branch point selection, smooth muscle and vasculature formation during lung morphogenesis. The proposed models are consistent with all published data on mutants, and all variables and parameters have a clear biological interpretation and lie well within the physiologically accessible range. The key assumptions for the proposed models have been shown experimentally to hold for several developmental systems. Therefore receptor-ligand interactions may form the core mechanism that controls domain patterning and branching during the development of branched organs. |
Menshykau D*, Iber D
*ETH Switzerland |
B - Bioimaging, Spatial-temporal Modeling and Data Visualization |
|
B 03
B3 |
Understanding how genomic information is translated into cellular functions constitutes a main challenge in biology. The eukaryotic genome exists as chromatin, a nucleoprotein complex composed of DNA, regulatory RNAs and a variety of histone and non-histone proteins that are often modified and regulate expression of the genetic information contained in DNA.
In recent years, after sequencing of the genomes of several model organisms was completed, large amounts of data have been gathered regarding different aspects of genome functioning, from gene expression and non-coding RNAs to the genomic distribution of epigenetic factors, namely histone modifications and chromatin-associated proteins. Also, there are a large number of databases describing gene functions and interactions. Tools to analyze, visualize and integrate gene expression data at a functional level are readily available. However, integrating large amounts of experimental results and databases on epigenetic factors, genetic elements, levels of transcription and other functional categories remains a challenge. In this context, we have developed chroGPS, a chromatin-based genome positioning system to integrate, visualize and study the associations between epigenetic factors, their relations to functional genetic elements, gene expression and biological functions. |
Reina O*, Font-Burgada J, Rossell D, Graffelman J, Azorín F
*IRB Barcelona. Bioinformatics and Biostatistics unit Spain |
B - Bioimaging, Spatial-temporal Modeling and Data Visualization |
|
B 04
B4 |
We propose a method for the modelling of planar surfaces of microscopic objects using simple light microscopy. By continuous variation of the microscope focus, a stack of images is generated. We apply the Sobel operator to each single image of the stack [1]. This operator is widely used for edge detection, with higher values for sharper edges. At each (x,y) position, we obtain a series of Sobel values, one for each image. The maximum of this series then defines the z-value of the respective (x,y) point of the object under investigation. We then fit an elastic map to this cloud of (x,y,z) tuples, which provides a smoothed, realistic fit of the object surface. Our method avoids the long processing time of confocal imaging. This is often a prerequisite for higher throughput, and for fragile objects.
We apply our method to the surface reconstruction of the leaves of Arabidopsis thaliana. A reliable quantification of the distribution of the trichomes on the leaf surface can provide valuable information on the molecular process that generates these trichomes [2]. As these leaves are strongly bent, the Euclidean distance between trichomes would bias the results. The 3D reconstruction of the leaf surfaces allows us to calculate the geodesic distance between the trichomes, resulting in more realistic estimates of these distances. We demonstrate the utility of our method by scanning the surfaces of 30 leaves (7 days old, 3rd leaf) of Arabidopsis thaliana, which are then used to generate appropriate summary statistics of their trichome distribution. References: 1. Image and Vision Computing, Vol. 1, No. 1. (February 1983), pp. 37-42 2. Mol Syst Biol, Vol. 4 (2008), doi:10.1038/msb.2008.54 |
Failmezger H*, Jaegle B, Schrader A, Hülskamp M, Tresch A
*Max Planck Institute for Plant Breeding Research Germany |
B - Bioimaging, Spatial-temporal Modeling and Data Visualization |
|
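The focus-stack step described in abstract B4 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names `sobel_magnitude` and `depth_from_focus` are hypothetical, the stack is assumed to be a NumPy array of shape (slices, height, width), and the elastic-map smoothing and geodesic distance steps are not covered.

```python
import numpy as np

def sobel_magnitude(img):
    """Sobel gradient magnitude of a 2-D image (borders handled by edge padding)."""
    p = np.pad(np.asarray(img, dtype=float), 1, mode="edge")
    # Horizontal derivative: right neighbours minus left neighbours, weights 1-2-1.
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]) \
       - (p[:-2, :-2] + 2 * p[1:-1, :-2] + p[2:, :-2])
    # Vertical derivative: lower neighbours minus upper neighbours, weights 1-2-1.
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]) \
       - (p[:-2, :-2] + 2 * p[:-2, 1:-1] + p[:-2, 2:])
    return np.hypot(gx, gy)

def depth_from_focus(stack):
    """Return, for each (x, y), the index of the slice with the largest Sobel value.

    stack: array of shape (n_slices, height, width), one image per focus setting.
    The returned slice index plays the role of the z-value of each surface point.
    """
    sharpness = np.stack([sobel_magnitude(s) for s in stack])
    return np.argmax(sharpness, axis=0)
```

The resulting index map is a raw, noisy height estimate; in the abstract it is the input to the elastic-map fit, which would have to be supplied separately.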
B 05
B5 |
NF-κB is a key transcription factor that is activated by inflammatory cytokines such as TNFα and regulates proliferation, survival, apoptosis and the cellular stress response. Dysregulation of NF-κB is involved in a number of pathologies, including cancer and cardiovascular disease. However, the outcome of NF-κB activation is highly context dependent. In this study we sought to investigate the relationship between NF-κB and cell morphology in tumour and non-tumour cell lines using quantitative high-throughput image analysis. Specifically, we simultaneously determined the morphology, local density and subcellular localization of NF-κB in single cells of 22 cell lines, including 17 cell lines from primary human breast tumours, across 11 treatment conditions. Our novel dataset describes a total of 270 features for approximately 800,000 cells. Using Bayesian inference, we modeled the dependency between NF-κB and cellular morphology in basal and treated conditions to generate 242 network models. These analyses revealed that NF-κB localization is highly dependent on aspects of cell shape (e.g. nuclear roundness), as well as population context (e.g. colony area). We have validated these models experimentally to demonstrate that perturbing cell shape affects NF-κB signaling in different cell lines and in different treatment conditions. These findings suggest that cell shape and population context play a role in tuning cell signaling and contribute to the regulation of NF-κB. The loss of shape-linkages in some cell lines could indicate genetic or microenvironmental factors that contribute to the progression of cancer. Further studies are underway to validate our models in live cells expressing GFP-NFκB. |
Sailem H*, Sero J, Bakal C
*Institute of Cancer Research United Kingdom |
B - Bioimaging, Spatial-temporal Modeling and Data Visualization |
|
B 06
B6 |
The development of long bones requires a sophisticated spatial organization of cellular signaling, proliferation, and differentiation programs [1]. How such spatial organization emerges on the growing long bone domain is still unresolved. It has been proposed that a Turing mechanism of Schnakenberg type enables the emergence of the characteristic protein expression and cellular distribution patterns [2]. Protein expression rates depend on the distribution of cells, while cell proliferation and differentiation depend on the concentration of signaling proteins. As a result, a Turing mechanism for patterning in the developing bone must be tightly linked to a model for tissue growth and cell dynamics. The Turing space is typically small, and we wondered whether realistic patterns could still be obtained in such a coupled model where the classical parameters in the Turing model change as a result of tissue growth. The tissue is modeled as an incompressible Newtonian fluid and growth results from local mass sources. In a first step, we solved a simple Turing system on a constantly growing domain, which corresponds to a one-way coupling. The pattern is stable for more than fivefold longitudinal growth of the tissue. When coupling hypertrophic differentiation (increase in cell volume) to the Turing system using an algebraic expression, the pattern is highly volatile and sensitive to the geometry. In a last step, we introduce a factor that blocks differentiation and acts as an integrator between the Turing system and the hypertrophic differentiation process. As a result, both the pattern and the cell dynamics are stabilized. Such a mechanism may thus serve as an underlying core mechanism to explain the initialization of long bone development.
[1] H. M. Kronenberg, “Developmental regulation of the growth plate,” Nature, vol. 423, no. 6937, pp. 332–336, May 2003. [2] D. A. Garzon-Alvarado, J. M. Garcia-Aznar, and M. Doblare, “A reaction-diffusion model for long bones growth,” Biomechanics And Modeling In Mechanobiology, vol. 8, no. 5, pp. 381–395, 2009. |
Tanaka S*, Iber D
*ETH Zurich Switzerland |
B - Bioimaging, Spatial-temporal Modeling and Data Visualization |
|
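For orientation, the Schnakenberg-type Turing kinetics referred to in abstract B6 are commonly written in the following textbook form. The symbols (concentrations u and v, diffusion coefficients D_u and D_v, rate constants a, b and scale γ, tissue velocity field **w**) are notational assumptions, not taken from the poster, and the authors' coupled model may differ.

```latex
\begin{aligned}
\frac{\partial u}{\partial t} &= D_u \nabla^2 u + \gamma\,(a - u + u^2 v),\\
\frac{\partial v}{\partial t} &= D_v \nabla^2 v + \gamma\,(b - u^2 v).
\end{aligned}
```

On a growing tissue with velocity field **w**, a usual way to account for advection and dilution is to replace each time derivative by \(\partial c/\partial t + \nabla\cdot(c\,\mathbf{w})\) for c = u, v; this is stated here as a common formulation rather than as the specific coupling used in the abstract.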
B 07
B7 |
Viruses spread between cells, tissues and organisms by cell-free and cell-cell transmission mechanisms. Both mechanisms enhance disease caused by different viruses, but it is difficult to distinguish between them. We have previously characterized the transmission mode of human adenovirus in monolayers of epithelial cells using experimental data from live-cell fluorescence microscopy. Employing these data as experimental parameters, we have developed an in silico model using multi-scale hybrid dynamics, cellular automata and particle strength exchange (CA-PSE). Here we present a generalized open source simulation framework based on the model we have developed. Based on the inherent flexibility of the CA-PSE model, this framework aims at enabling predictions of the spatial spread of viruses from different families, taking into account their different modes of spreading, in particular cell-to-cell and cell-free spreading. The framework promises to be useful for a better understanding of the various parameters which modulate viral spreading dynamics. By this, we hope to provide mechanistic insights into commonly used endpoint measurements in virology, such as plaque assays or fluorescent focus forming assays. |
Yakimovich A*, Yakimovich Y, Schmid M, Pelkmans L, Sbalzarini I, Greber U
*University of Zurich, Institute of Molecular Life Science Switzerland |
B - Bioimaging, Spatial-temporal Modeling and Data Visualization |
|
B 08
B8 |
Vasculogenesis is the process of de novo formation of blood and lymphatic vasculature, for instance in a developing embryo, while angiogenesis is the process of vessel formation from pre-existing vasculature. These processes are orchestrated by a plethora of growth and differentiation factors such as vascular endothelial growth factors (VEGFs), angiopoietins, and platelet derived growth factors. The VEGF family of angiogenic growth factors consists of six polypeptides, VEGF-A, -B, -C, -D, -E, and PlGF. VEGFs also exist in different isoforms which lead to distinct signal output. VEGFs specifically bind to three type V receptor tyrosine kinases, VEGFR-1, -2, and -3, and to co-receptors, such as neuropilins. A complex interplay among these receptors regulates blood and lymphatic vessel development. Deregulation of vessel development and homeostasis is responsible for many diseases such as retinopathies or atherosclerosis, but also plays a critical role in tumor cell growth and metastasis in cancer. The technologies for monitoring in vivo the distinct signal output of angiogenic growth factor receptors are limited. Here we present a methodology to study VEGF-induced angiogenesis of endothelial cells on micropatterned substrates that mimic the in vivo environment encountered by endothelial cells in hypoxic tissues undergoing angiogenesis. Immunostaining demonstrates specific VEGF binding to micropatterned glass coverslips engineered by a microfluidic methodology. Immobilized VEGF was biologically active, and endothelial cells expressing VEGF receptors migrated towards and adhered to these patterns. The patterned substrates present a robust and reproducible system that can be used to characterize in vitro specific cellular behavior in response to growth on VEGF patterned substrates.
In addition, we develop analytical tools to acquire, process and quantify microscopic data. Cell biological data obtained from endothelial cell cultures will be compared with in silico models describing the migration of cells on patterned substrates. The combination of cell biological experiments with in silico modeling of cell migration and pattern formation is expected to disclose essential biological principles underlying organogenesis in vivo, for instance in tissues undergoing angiogenesis. |
Zinca S*, Padeste C, Ballmer-Hofer K, Milde F, Koumoutsakos P
*ETH Zurich, Chair of Computational Science, Paul Scherrer Institute, Laboratory of Biomolecular Research Switzerland |
B - Bioimaging, Spatial-temporal Modeling and Data Visualization |
|
B 09
B9 |
Hi-C is a sequencing-based assay which can provide an insight into the 3-dimensional structure of the genome through the creation of large libraries of paired sequence tags deriving from genomic regions found in close physical proximity.
Although several groups have reported methods to extract useful information from Hi-C datasets, there are, as yet, no user-friendly software packages which provide a simple way to visualise and analyse Hi-C data on modest hardware. Here we present a set of extensions to the existing SeqMonk package for NGS analysis which allow the program to analyse Hi-C data. These extensions allow the data to be visualised either by extraction of arbitrary 1D subsets from the whole dataset, or by doing a full genome-wide 2D analysis, backed by a statistical model appropriate to Hi-C data. The tools shown here should make Hi-C analysis accessible to a much wider audience than is currently possible. |
Andrews S*
*The Babraham Institute United Kingdom |
B - Bioimaging, Spatial-temporal Modeling and Data Visualization |
|
C 01
C1 |
Background: Age plays an important role in medicine and medical research; it is an important factor when considering phenotypic changes in health and disease. We recently developed the Age-Phenome Knowledgebase (APK), which formally represents knowledge about clinically-relevant traits, such as disease, that occur at different ages. The APK holds over 35,000 entries which describe the relationships between age and phenotypes, mined from over 1.5 million PubMed abstracts. In this work we demonstrate the integrative analysis of raw clinical measurements with knowledge mined from the literature.
Methods and Results: Data from the NHANES III survey was used to calculate the fraction of abnormal blood test results at each age. Furthermore, the ages of diagnosis of diseases in NHANES were also captured. The pattern of change in abnormal blood values and disease diagnosis was compared to a normalised measure of age-to-disease reports from APK. Correlation analysis of abnormal blood results with APK-derived disease patterns revealed several interesting positive and negative correlations, especially when allowing a 1-3 year shift between the patterns. Furthermore, comparison of the pattern of age of diagnosis of disease from NHANES subjects to that described in the medical literature and captured in APK reveals that these patterns are in good agreement for the majority of diseases tested. Conclusions: In this work we demonstrate the usefulness of analysis of data obtained from different types of bio-medical resources. Furthermore, we show that knowledge stored in the APK captures current medical knowledge and is comparable to that observed in clinical data. |
Geifman N*, Rubin E
*Ben Gurion University Israel |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
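The shifted correlation analysis mentioned in abstract C1 could look roughly like the sketch below. It assumes two equally long, age-indexed series with one value per year of age; the function name `lagged_correlation`, the use of Pearson correlation, and the symmetric shift window are illustrative assumptions, since the abstract does not specify them.

```python
import numpy as np

def lagged_correlation(x, y, max_shift=3):
    """Pearson correlation between x[age] and y[age + s] for each shift s.

    x, y: equally long sequences indexed by age (one value per year).
    Returns a dict mapping each shift in -max_shift..max_shift to r.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    result = {}
    for s in range(-max_shift, max_shift + 1):
        if s >= 0:
            a, b = x[: len(x) - s], y[s:]      # y lags x by s years
        else:
            a, b = x[-s:], y[: len(y) + s]     # y leads x by |s| years
        result[s] = float(np.corrcoef(a, b)[0, 1])
    return result
```

Scanning several shifts inflates the chance of finding a high correlation somewhere, so a real analysis would need a multiple-testing correction on top of this.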
C 02
C2 |
Plasma amino acid levels change in response to metabolic alterations during the course of various diseases. However, the causes of progression from a healthy state to the manifestation of lifestyle disease remain obscure. Here we extend plasma amino acid profiling to the relationships among amino acids and their regulatory properties, and find remarkable differences in one of the lifestyle diseases, diabetes mellitus. In this study, we performed dynamic analysis using time-course data of plasma samples of AKITA mice, which develop diabetes mellitus as a result of insulin deficiency. First, we decided to analyze the dynamic properties of the subnetwork structure including five amino acids located upstream in the network, namely alanine, glycine, leucine, isoleucine, and valine. These amino acids constituted an interrelated loop whose detailed network structure could not be determined by static analysis. We then inferred, using the S-system, the dynamic network structure which reproduces the actual time-courses within an error allowance of 10%. The S-system is a conceptual mathematical model represented by the power-law formalism. It is one of the best formalisms for estimating interaction mechanisms among system components, and enables us to reconstruct network architectures from the experimentally observed time-courses of the quantities of the network components. By performing sensitivity analysis, we selected dominant relations in this network. The emphasis of this study is on altered interactions of plasma amino acids that show stabilizing and destabilizing features in a variety of clinical settings. The result for branched-chain amino acids (leucine, isoleucine, and valine) is in especially good agreement with biological knowledge. |
Tanaka T*, Mochida T, Maki Y, Shiraki Y, Mori H, Matsumoto S, Shimbo K, Ando T, Nakamura K, Endo F, Okamoto M
*Innovative Science and Technology for Bio-industry, Graduate School of Bioresource and Bioenvironmental Sciences, Kyushu University Japan |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
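For readers unfamiliar with the S-system named in abstract C2, its standard power-law form is given below. The symbols follow the usual textbook notation and are not taken from the poster; treating the five amino acids as the only state variables (n = 5) is an assumption made for illustration.

```latex
\frac{dX_i}{dt} \;=\; \alpha_i \prod_{j=1}^{n} X_j^{\,g_{ij}} \;-\; \beta_i \prod_{j=1}^{n} X_j^{\,h_{ij}},
\qquad i = 1, \dots, n
```

Here X_i is the concentration of component i, α_i and β_i are non-negative rate constants for production and degradation, and the kinetic orders g_ij and h_ij encode how component j influences the production and degradation of component i. A non-zero kinetic order corresponds to an edge in the inferred network, and its sign indicates whether the influence is activating or inhibiting.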
C 03
C3 |
MOTIVATION: BV6 is an antagonist of inhibitor of apoptosis (IAP) proteins. Since these are often over-expressed in cancer, BV6 may be used as a regulator of cancer cell survival in therapy. We predict BV6 sensitivity for patients by inference from experimental data to personalize therapy.
MATERIAL: From 3 BV6-sensitive and 3 BV6-resistant patients with primary acute myeloid leukemia (AML), we take cells in culture to profile gene expression with Affymetrix microarrays (experimental dataset). As the clinical dataset, we use 93 AML cDNA-based patient gene expression profiles, published previously. METHODS: We limit both datasets to common genes and standardize expression values. Using Guided Clustering, we find gene sets which are differentially expressed between BV6-sensitive and BV6-resistant samples in the experimental dataset, and show similar correlation patterns in both datasets. To predict BV6-sensitivity of patients from the clinical dataset, we learn Support Vector Machines (SVM) from the experimental dataset limited to guiding genes. RESULTS: We obtain a gene set of 103 genes by Guided Clustering. An SVM based on this gene set predicts 55 patients as sensitive and 38 as resistant to BV6. According to Fisher tests, sensitive patients are strongly related to inv(16) (p-value 1.0e-14) and the FAB subtype 4 (p-value 2.8e-12). Noisy classifiers predict BV6-sensitivity of 78 of 93 patients similarly to the noiseless SVM. CONCLUSION: Our integrative approach has the potential to predict drug sensitivity of patients based on in vitro experiments, and thus to improve the personalized use of drugs. |
Moll A*, Bullinger L, Lottaz C
*Institute for Functional Genomics, University of Regensburg Germany |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 04
C4 |
Alzheimer’s disease (AD) is mainly characterized by a progressive decline in cognitive ability and is pathologically linked to the formation of senile plaques. Although the etiology of the disease is not well understood, the most widely discussed hypothesis is that amyloid-beta accumulation is the causative component of AD pathogenesis. Many published studies indicate that deposition of amyloid-beta perturbs a range of signaling pathways; however, this knowledge remains scattered in the form of free text and pathway cartoons in the literature and various pathway databases.
Converting descriptive knowledge of published studies into a computer-readable form will improve disease-modeling efforts dramatically. Motivated by the capacities of the Biological Expression Language (openBEL) - a language representing scientific findings in the life sciences in a computable form by capturing causal and correlative relationships - we have constructed a first draft of a computer-readable model for APP interacting pathways using BEL. After collecting all the pathway information related to APP biology from the literature, the knowledge was converted into triples using BEL codes. The model in its current form comprises 1808 BEL code lines encoding the information from 240 abstracts and 45 full-text publications. Main pathways represented in the model include mitochondrial function, calcium homeostasis, cholesterol metabolism, inflammatory responses, neurogenesis and apoptosis. We present the integrated model as a graphical network and discuss its application to causal reasoning of processes and pathways involved in the disease mechanism. |
Tom Kodamullil A*, Younesi E, Hofmann-Apitius M
*Department of Bioinformatics, Fraunhofer Institute SCAI Germany |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 06
C6 |
The main goal of the KD4v server, which is based on Inductive Logic Programming (ILP), is to uncover, exploit and display the links between the structural impact of a mutation and a human disease phenotype. KD4v provides 2 complementary services: (i) a knowledgebase consisting of ILP rules based on sequence/structure/evolution predicates that characterize deleterious mutations and that can be interpreted by biologists, (ii) a tool for mutation prediction based on the deduced ILP rules, with performance similar to that of the most widely used methods, PolyPhen-2 and SIFT. More importantly, the rules associated with a deleterious prediction facilitate the biological interpretation. Hopefully, this development will contribute to a more complete elucidation of the chain of events leading from a molecular defect to the associated pathology. The web server is available at http://decrypthon.igbmc.fr/kd4v.
Reference: Luu, Dao; Rusu, Alin; Walter, Vincent; Linard, Benjamin; Poidevin, Laetitia; Ripp, Raymond; Moulinier, Luc; Muller, Jean; Raffelsberger, Wolfgang; Wicker, Nicolas; Lecompte, Odile; Thompson, Julie; Poch, Olivier; Nguyen, Hoan (2012). KD4v: Comprehensible Knowledge Discovery System For Missense Variant. Nucleic Acids Res, Article in Press for July 2012. |
Luu T, Nguyen H*
*IGBMC France |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 07
C7 |
The problem of knowledge discovery from a large number of genes and small samples has been addressed in systems biology recently. The goal of such discovery is to obtain an interpretable model that can be read by a biologist. In this study we are interested in exploring more hidden knowledge from KEGG using sparse gene expression profiles. Using statistical machine learning to discover hidden knowledge in gene expression datasets, such as the causal relationships between large numbers of genes, will potentially enhance our understanding of the higher-order molecular systems that regulate cellular growth. In this study, colorectal cancer cell lines were treated with 5-Fluorouracil (5-FU) based treatments, and RNA was extracted at 24 h to carry out whole genome gene expression arrays on the Illumina platform. The resultant data was extracted using GenomeStudio software after background subtraction and quantile normalization. After that, each treated and control gene expression profile was mapped to the KEGG database. We restricted the search in KEGG to colorectal cancer pathways and found that a large proportion of the genes in each treatment are annotated in well known colorectal cancer pathways, namely the MAPK and Cell Cycle pathways. A known limitation of KEGG is that it shows, for example in the Cell Cycle signaling pathway, that TGF-β (3 proteins) has a direct effect on SMAD2/SMAD3 (2 proteins), but does not show the specific interaction between the two sets, i.e. which protein in TGF-β interacts with which protein in SMAD2/SMAD3. In this work, machine learning of graphical models is used to extend the graphical representation of the MAPK and Cell Cycle pathways to show a more detailed picture of how each protein interacts with those around it.
To the best of our knowledge, there is no existing mechanism for accessing the specific connections between gene families that underlie the generic connections represented in the KEGG signaling diagrams in the context of colorectal cancer. We used the lasso estimator to search for the optimal parents (causal genes) of each affected gene. The resultant extended graphs are extended causal networks, as the learning is based on causal prior knowledge from KEGG. The resultant graphs also show how the treatments used in each cell line affect the connectivity between genes compared to the control (untreated) cell line.
The current study also suggests how prior knowledge from KEGG helps to reduce the search space when searching for the optimal model from small gene expression profiles that have the p >> n property. |
Al Oraini A, Aziz MA*
*King Abdullah International Medical Research Center, KSAU-HS Saudi Arabia |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 09
C9 |
Promoter methylation is an important factor in the regulation of gene expression. In particular, it can control target gene regulation semipermanently. Thus, for example, it is used for tissue-specific regulation of gene expression. Recently, brain tissue promoter methylation patterns have turned out to be more personalized than those in other tissues. In this paper, we try to find which genes are specifically methylated among individuals.
We downloaded promoter methylation profiles from the Gene Expression Omnibus (GEO: GSE15014). The downloaded CEL files were processed with the AffyTiling package in Bioconductor. Because the array includes too many probes (4,275,079), we could not apply principal component analysis (PCA) to all of them at once. We therefore randomly sampled 750,000 probes and applied PCA to them. We computed P-values between different kinds of samples: neuron vs non-neuron, methylation vs genomic. P-values were computed using the t-test function in base R. The first PC primarily represents the difference between methylation and non-methylation samples. Thus, the primary difference in these samples is the difference between methylated and non-methylated samples. The second PC also represents the difference between methylated samples and non-methylated samples but not divergence between individuals. These findings are compatible with the findings that the methylation pattern over individuals is substantially different from those of non-methylated samples. We have successfully shown that PCA can depict useful components which represent divergence between individuals. |
Owada M*, Taguchi Y
*Chuo University Japan |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
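The subsample-then-PCA step in abstract C9 can be illustrated as below. The authors worked in R with Bioconductor; this NumPy version is a stand-in under stated assumptions. The function name `pca_on_probe_subsample`, the probes-by-samples matrix layout and the use of an SVD are all choices made for the sketch.

```python
import numpy as np

def pca_on_probe_subsample(data, n_probes, n_components=2, seed=0):
    """PCA of samples using a random subset of probes.

    data: array of shape (n_probes_total, n_samples), one row per probe.
    Returns (scores, explained_variance_ratio), where scores has shape
    (n_samples, n_components).
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(data.shape[0], size=n_probes, replace=False)
    sub = data[idx].T                    # samples x sampled probes
    sub = sub - sub.mean(axis=0)         # centre every probe across samples
    u, s, _ = np.linalg.svd(sub, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2 / np.sum(s ** 2))[:n_components]
    return scores, explained
```

Repeating the call with different `seed` values is a cheap way to check that the leading components do not depend on which probes happened to be drawn.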
C 10
C10 |
The Critical Assessment of Genome Interpretation (CAGI) is a community experiment to objectively assess computational methods for predicting the phenotypic impacts of genomic variation. In this assessment, participants are provided genetic variants and make predictions of resulting phenotype. These predictions are evaluated against experimental characterizations by independent assessors. The CAGI experiment culminates with a community workshop and publications to disseminate results, and to assess our collective ability to make accurate and meaningful phenotypic predictions. A long-term goal for CAGI is to improve the accuracy of phenotype and disease predictions in clinical settings.
The CAGI 2011 experiment consisted of 11 diverse challenges exploring the phenotypic consequences of genomic variation. CAGI predictors applied state-of-the-art methods to identify the effects of variants in an enzyme and oncogenes, which revealed the relative strengths of each prediction approach and the necessity of customizing such methods to the genes in question. CAGI further demonstrated unexpected successes in predicting Crohn’s disease from exomes, as well as disappointing failures in using genome and transcriptome data to distinguish discordant monozygotic twins with asthma. Complementary approaches from two groups showed promising results in predicting distinct responses of breast cancer cell lines to a panel of drugs. Predictors also made measurable progress in predicting a diversity of phenotypes present in the Personal Genome Project participants, as compared to the CAGI predictions from 2010. CAGI is planned again for 2012 and we welcome participation from the community. Current information will be available at the CAGI website at http://genomeinterpretation.org. |
Repo S*, Moult J, Brenner SE
*Department of Plant and Microbial Biology United States of America |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 11
C11 |
Fusion genes are products of the combination of two separate genes generated by genetic rearrangements, read-through events or trans-splicing. Fusions that disrupt the reading frames of the two original genes often lead to loss-of-function effects, while in-frame fusions can create gain-of-function chimeric protein products. These functional changes may cause neoplastic transformation and thus play an important role in cancer development. Fusion genes have been shown to be important tumor markers. We have identified gene fusions from chimeric transcripts in RNA sequencing data from 381 breast cancer tumor samples from The Cancer Genome Atlas (TCGA). RNA-seq data was mapped to the reference genome using the BWA aligner, followed by extraction of the unmapped sequence reads. These were subsequently used to detect fusion events by a split-read approach implemented in the FusionMap software. Each read was split into parts and retained for analysis if two separate parts of the same read could be mapped to separate chromosomal locations. We aimed at detecting new recurrent chimeric transcripts, including those produced by non-canonical splicing mechanisms. Using this approach, we identified 581 chimeric transcripts, of which 74 were recurrent across samples. Five fusions were also found in RNA sequencing data from breast cancer cell lines and were selected for further experimental validation by RT-PCR and Sanger sequencing. |
Greger L*, Brazma A, Rung J
*EMBL-EBI United Kingdom |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 12
C12 |
Hyperlipidemia affects the progression of atherosclerosis and myocardial infarction and is clinically characterized by high levels of serum cholesterol, low-density lipoprotein (LDL) and triglyceride (TG). Copy number variation (CNV) is responsible for some of the rare variation in the chromosome and probably contributes to susceptibility to metabolic disease. Previous studies confirmed several gain and loss regions in the chromosome which may be associated with hyperlipidemia; however, further study is needed due to the limited sample size and the lack of comparison of the genetic diversity of populations. In this study, a human genome-wide SNP array analysis was performed with 38 Taiwanese hyperlipidemia patients and 121 Chinese controls, and two different copy number baselines were used in CNV detection and in the discovery of significant CNV regions associated with hyperlipidemia in the Chinese population.
In this study, CNV regions in 10q11.21, 12p12.1 and 21q22.11 were significant in Taiwanese hyperlipidemia patients (P < 1 × 10⁻⁴). Different copy number baselines from the HapMap and Chinese populations were used to exclude the bias of population diversity. The biological functions of these CNVs and the genes involved, especially RET and KRAS, indicate several possible relationships with hyperlipidemia or metabolic syndrome. These candidate findings may be conducive to future CNV association studies. |
Shia W*, Chen D, Hsu F
*Changhua Christian Hospital Taiwan |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 13
C13 |
High-throughput profiling of gene expression has opened new avenues for the understanding of biological processes at the molecular level. However, with experience it has become evident that the large lists of differentially expressed genes resulting from these approaches are insufficient to adequately describe the mechanistic biology. Reducing the complexity of such data by evaluating it in a relevant biological context is required in order to gain meaningful insight.
We propose that network-based approaches to pharmacology/toxicology are valuable to quantify biological network perturbations caused by active substances and to identify mechanisms and biomarkers modulated in response to exposure. Recently published biological network models [1,2], which are a knowledge representation of key biological mechanisms (e.g., oxidative stress) will serve as the substrate to integrate transcriptomic data. They consist of literature-derived cause-and-effect relationships between biological entities, which can be different from gene transcription and are encoded in the BEL language (www.belportal.org). The underlying concept is that transcriptional changes are the consequences of the biological processes described in the network and are not the processes per se. We present a novel approach for the computation of a quantitative expression of the Amplitude of Network Perturbations [3] to enable comparisons between different exposures and systems. Our approach enables a quantification of each biological entity in the network, which constitutes a reduced feature space capturing more appropriately the investigated biology than the original high-dimensional gene space. The potential of this knowledge driven approach is demonstrated using several publicly available human gene expression datasets. Quantifying individual patients’ responses through a relevant network appears to be a biologically meaningful, interpretable, robust and efficient way of deriving network signatures. This indicates that the network-derived features extract the biology relevant for the considered experiment with remarkable accuracy. The presented approach efficiently integrates high-throughput gene expression data and “cause and effect” network models to enable broad applications ranging from data interpretation and mechanistic hypothesis generation to patient stratification for diagnostic purposes. 
[1] Schlage et al, PMID: 22011616 [2] Westra et al, PMID: 21722388 [3] Martin et al, accepted in BMC SysBio |
Martin F, Sewer A, Talikka M, Xiang Y, Hoeng J*, Peitsch MC
*Philip Morris International R&D Switzerland |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 14
C14 |
To search for genetic variants associated with an increased risk of early-onset breast cancer, we studied 81 patients with breast cancer onset before the age of 40. We collected blood specimens as well as tumour tissue from breast cancer patients of Eastern Asian origin. Their breast cancers were classified into five subtypes (luminal A, luminal B, HER2+ and triple negative); 9 of the patients had a family history of breast cancer. The whole exonic regions were sequenced at an average depth of 200X. Only variants detected in both paired samples at a minimum depth of 30X were included in further analysis. Among the candidate variants, we focused on mutations and eliminated common variants (i.e. polymorphisms) recorded in public single nucleotide variant resources, including dbSNP, the 1000 Genomes Project and ESP5400. Furthermore, sequences obtained from 47 unaffected Eastern Asian samples were used to remove population-specific polymorphisms, leaving both somatic and germ-line mutations. The possible functional impact of each single base substitution was predicted from sequence homology, the physical characteristics of amino acids and protein domain information, using open resources such as VarioWatch, SIFT, MutationTaster and PolyPhen2. In total, 6454 variants were detected, mapping to 4603 genes. The top 15 variants, each detected in at least 4 patients, included C3AR1, detected in ER(+) patients only; RNASEL, detected in HER2(-) patients; HLA-DQA1, detected in HER2(+) patients; and MMP13, detected in ER(+) and HER2(-) patients. This information sheds light on breast tumourigenesis. |
Lo K*, Liu C, Chen C, Yao W, Cheng A, Shen C, Chang K, Kuo W, Lin C, Lu Y, Lee C, Chen C
*Genomics Research Center, Academia Sinica Taiwan |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
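The filtering steps described in abstract C14 (variants seen in both paired samples at a minimum depth of 30X, then removal of entries found in public catalogues or in unaffected controls) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Variant` record, its field names, the `filter_candidates` helper and all toy identifiers are hypothetical.

```python
from dataclasses import dataclass

MIN_DEPTH = 30  # minimum read depth quoted in the abstract


@dataclass(frozen=True)
class Variant:
    key: str           # hypothetical identifier for the variant
    depth_blood: int   # read depth in the blood sample (0 = not detected)
    depth_tumour: int  # read depth in the tumour sample (0 = not detected)


def filter_candidates(variants, public_polymorphisms, control_variants):
    """Keep variants seen in both paired samples at sufficient depth and
    absent from public catalogues and from the unaffected control set."""
    kept = []
    for v in variants:
        deep_in_both = min(v.depth_blood, v.depth_tumour) >= MIN_DEPTH
        is_common = v.key in public_polymorphisms or v.key in control_variants
        if deep_in_both and not is_common:
            kept.append(v.key)
    return kept


# Toy data: every identifier below is invented for illustration.
variants = [
    Variant("var1", 180, 220),  # passes all filters
    Variant("var2", 12, 250),   # too shallow in blood
    Variant("var3", 200, 190),  # listed in a public catalogue
    Variant("var4", 90, 75),    # present in unaffected controls
]
candidates = filter_candidates(variants, {"var3"}, {"var4"})
print(candidates)
```

Treating the depth threshold as applying to each sample separately is one reading of the abstract; a pooled-depth threshold would only change the `deep_in_both` line.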
C 15
C15 |
Purpose: Pharmacovigilance methods have advanced greatly during the last decades, making post-market drug assessment an essential drug evaluation component. This strategy uses spontaneous reporting systems and health information databases to collect expertise from huge amounts of real-world reports. The EU-ADR Web Platform was built to further facilitate accessing, monitoring and exploring these data, enabling an in-depth analysis of adverse drug reactions risks. Methods: The EU-ADR Web Platform exploits the wealth of data collected within the EU-ADR project. Millions of electronic health records are mined for specific drug events, which are correlated with literature, protein and pathway data, resulting in a rich drug-event dataset. Next, service composition strategies are tailored to coordinate the execution of distributed web services performing data-mining and statistical analysis tasks. This permits obtaining a ranked drug-event list, removing spurious entries and highlighting relationships with high risk potential. Results: The EU-ADR Web Platform is an open workspace for the integrated analysis of pharmacovigilance datasets. Using this software, researchers can access a variety of tools provided by distinct partners in a single centralized environment. Besides performing standalone drug-event assessments, they can also control the pipeline for an improved batch analysis of custom datasets. Drug-event pairs can be filtered, substantiated and statistically analyzed within the platform’s innovative working environment. Conclusions: A pioneering workspace for delivering advanced drug studies has been developed within the EU-ADR project consortium. This tool, targeted at the pharmacovigilance community, is available online at https://bioinformatics.ua.pt/euadr/. |
Lopes P*, Campos D, Nunes T, Oliveira JL
*DETI/IEETA, University of Aveiro Portugal |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 16
C16 |
Exhaled air carries information on human health status. Ion mobility spectrometry combined with a multi-capillary column (MCC/IMS) is a well-known technology for detecting volatile organic compounds (VOCs) in human breath. This technique is relatively inexpensive, robust and easy to use in everyday practice. However, its potential depends on the successful application of computational approaches for finding relevant VOCs and classifying patients into disease-specific profile groups based on the detected VOCs. We developed an integrated state-of-the-art system using statistical learning techniques for VOC-based feature selection and supervised classification into patient groups. We analyzed breath data from 84 volunteers suffering from either chronic obstructive pulmonary disease (COPD) or both COPD and bronchial carcinoma (COPD+BC), as well as from 35 healthy volunteers comprising a control group (CG). We standardized and integrated several statistical learning methods to provide a broad overview of their potential for distinguishing the patient groups. We found strong potential for separating MCC/IMS chromatograms of healthy controls and COPD patients (best accuracy COPD vs. CG: 94%). However, further examination of the impact of bronchial carcinoma on COPD/no-COPD classification performance is necessary (best accuracy CG vs. COPD vs. COPD+BC: 79%). We also extracted 20 high-scoring VOCs that allowed COPD patients to be differentiated from healthy controls. We conclude that these statistical learning methods have generally high accuracy when applied to well-structured medical MCC/IMS data. |
Hauschild A*, Baumbach JI, Baumbach J
*Computational Systems Biology, Max Planck Institute for Informatics Germany |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 17
C17 |
Acute lymphoblastic leukemia (ALL) is the most common childhood cancer. Even though cure rates are high, ~20% of treated children still die from resistant disease, suffer relapse or experience treatment toxicities. Several single nucleotide polymorphisms (SNPs) are known to be key determinants of inter-individual differences in treatment resistance and toxic side effects.
Using multiplexed targeted sequencing, we have genotyped ~900 Danish and German patients for ca. 25,000 functional SNPs with potential clinical relevance for ALL. Such a multiple-SNP assay allows investigation of effects mediated via pathways in which any one of several SNPs may lead to similar clinical phenotypes due to related molecular mechanisms. The selected genes included previously known determinants of treatment response, as well as genes from domains of potential importance for ALL treatment. The contribution of inherited genetic variation to treatment response was investigated by associating germline SNPs with the risk of high minimal residual disease levels after remission induction chemotherapy and with the risk of relapse. This poster will show the main results, current status and challenges of the project, and will describe the integration of data to elucidate common pathway mechanisms and ways to better classify patients by combining the effects of several genotypes. |
Wesolowska A*, Borst L, Dalgaard M, Helt L, Madsen H, Marquart H, Wehner P, Rasmussen M, Willerslev E, Gilbert T, Brunak S, Schmiegelow K, Gupta R
*Technical University of Denmark Denmark |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 18
C18 |
Colorectal cancer (CRC) is one of the most common cancers. Over the past 20 years, numerous genetic events that contribute to the disease have been described. However, this disease is highly heterogeneous and many of the mechanisms leading to tumorigenesis are unknown. Moreover, gene function can be modified by various mechanisms, such as copy number variation (CNV) or single nucleotide variation (SNV). In this work, we used a custom capture set to focus on alterations that impact kinase genes, because they are generally considered to be good therapeutic targets. To obtain a complete picture of genetic aberrations, we defined the SNVs and CNVs that occur in 73 primary CRC samples and studied the alteration patterns of all kinases.
We identified 3711 protein-coding SNVs that we classified as putatively activating or inactivating based on the type of alteration (frame shift, stop gained, essential splice site or non-synonymous coding). A joint analysis of SNVs and CNVs enabled the identification of both established and novel players and the study of their specific alteration patterns. For instance, KRAS and PIK3CA are both specifically activated by SNV, whereas APC, ACVR2A and TP53 are mostly inactivated by SNV. Other genes are inactivated mostly by CNV loss. We went on to analyse a panel of 42 CRC cell lines, which allowed us to identify alterations that confer sensitivity and resistance to specific drugs. If these results can be validated they will open up new avenues for targeted treatment of CRC patients. |
Michaut M*, Majewski I, Chresta C, Beran G, Orphanides G, Roepman P, Simon I, Bernards R, Wessels L
*Netherlands Cancer Institute Netherlands |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 19
C19 |
Asthma is one of the most common chronic diseases in childhood. Various studies have implicated several genetic and environmental factors in asthma pathogenesis. To identify genetic factors associated with asthma, we performed a genome-wide association study (GWAS) on a birth cohort of 411 children of asthmatic mothers. To augment the GWAS, the clinical risk factors associated with the different troublesome lung symptoms each child experienced were recorded at clinical visits over 7 years. Exposure of the child to different bacteria in early life has also been found to affect asthma outcome. Thus, factors relating to the immunological environment of the child were also tracked.
This study attempts to associate combinations of SNPs and clinical risk factors with phenotypic outcomes that are more relevant to improving preventive measures. A non-linear method, genetic algorithms coupled with artificial neural networks, is being used to assess phenotypic association and class prediction. The best-performing combinations of SNPs and clinical risk factors will be assessed for their power to predict the prognosis of childhood asthma. This setup also provides an opportunity to assess the impact of the exposome by combining clinical parameters and environmental information. The poster will present our efforts in this direction. |
Yadav R*, Bønnelykke K, Nordahl Petersen T, Madura Larsen J, Mølgaard A, Brix Pedersen S, Bisgaard H, Gupta R
*Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark Denmark |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 20
C20 |
Plant metabolites have for decades attracted the interest of researchers because of their involvement in the growth, reproduction and protection of plants from pathogens and predators. In addition, secondary metabolites in edible plants are expected to act as modifiers of biological functions in humans and, hence, to be potentially involved in health maintenance and disease prevention.
However, the level of complexity is increased by the simultaneous presence of a variety of components with diverse chemical structures and numerous biological targets. Systems chemical biology approaches have the potential to help us elucidate the effect of a vegetarian diet on humans and explore the field of nutritional research in novel ways. In the present work, we used text mining to construct a unique database with information concerning the vegetarian diet, its molecular components and the associated disease phenotypes. The resulting chemical space of plant metabolites supports the so far intuitive assumption that synthesis of plant metabolites is a consequence of short-term adaptations to environmental constraints and thus does not rigorously follow the established plant taxonomy. Our data show that, from a metabolic point of view, the plant taxonomy seems turbulent, with little agreement on circumscription. In addition, we demonstrate the highly bioactive nature of the plant metabolic space. The general appearance of plants, such as size, shape, growth and orientation (plant habit), as well as the annual/biennial/perennial trait, provides interesting perspectives from which to look at human nutrition and search for metabolic signatures linked to specific therapeutic effects. |
Jensen K*
*Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark Denmark |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 21
C21 |
Protein kinases are responsible for regulating various biological mechanisms. Loss of control over phosphorylation by this protein family is associated with various diseases such as cancer, diabetes and rheumatoid arthritis. The enzyme p38 mitogen-activated protein kinase (p38 MAPK), mainly p38α, is also considered a potential target for the development of anti-inflammatory drugs. In this work, given the large number of three-dimensional structures available in the PDB, we analyzed the pattern of structural variation in the activation loop region (also called the T-loop) of this enzyme. The objective was to obtain the best representative set of structures to be used in ensemble docking studies. Twenty-three p38α-inhibitor complexes were analyzed using an RMSD criterion and classified into six representative groups according to the conformation adopted by the T-loop. Fourteen different test ensembles were constructed using a representative structure for each group and based on cross-docking experiments (performed using the Glide version 5.7 docking program) for all 23 inhibitors. Our analyses showed that docking experiments using the best single representative conformation achieved only a 52% success rate (best energy pose) in reproducing the correct binding modes of the inhibitors. Using sets of three or five representative structures increased the success rate to 82%. We conclude that the ensemble docking methodology proposed in this work can significantly increase the predictive power of p38α MAPK docking experiments. |
Guedes IA*, Fraga CAM, Dardenne LE
*Laboratório Nacional de Computação Científica Brazil |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 22
C22 |
Gene Set Enrichment Analysis is a versatile bioinformatics approach that is frequently used in modern research. Analyses using global WNT gene sets show that the WNT pathway in general is highly active in the molecular basal-like subgroup of breast cancer and in all subgroups which later metastasize to the brain. However, previous studies of breast cancer primaries could not identify a WNT ligand or sub-pathway mediating these signals. Our own results indicate that β-Catenin independent WNT signaling is of importance in breast cancer and its metastases. However, there is currently no WNT model available which differentiates between the distinct WNT sub-pathways. Therefore, we aim to develop a new graph-based WNT model and to use it for a more refined pathway analysis of breast cancer expression data.
As a first step, we collect information about the human WNT pathway from several public databases (PID, Biocarta, Reactome, KEGG). These databases often use the BioPAX format as a standard XML format. To utilize this knowledge within the statistical computing environment R, we have developed a Biopax-Parser package, which allows retrieval of pathway nodes corresponding to signaling components and pathway edges representing molecular interactions. The BioPAX pathways are parsed into R, transformed into adjacency matrices and can be merged, shrunk or extended. This package allows us to generate a consensus WNT model and use it in further analyses. Different algorithms that integrate network knowledge and allow a more refined enrichment analysis are currently being tested and will be used to discriminate activation of different WNT sub-pathways. |
Bayerlova M*, Kramer F, Pukrop T, Klemm F, Bleckmann A, Beissbarth T
*Department of Medical Statistics, University of Göttingen Germany |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
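The idea in abstract C22 of turning parsed pathways into adjacency matrices and merging them can be illustrated with a small sketch. It is written in Python rather than R and does not reproduce the Biopax-Parser package: the function names, the union-based merge and the placeholder node labels are assumptions made for illustration only.

```python
def to_adjacency(nodes, edges):
    """Build a directed 0/1 adjacency matrix (list of rows) for the given
    node order; edges are (source, target) pairs."""
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, target in edges:
        matrix[index[source]][index[target]] = 1
    return matrix


def merge_pathways(edges_a, edges_b):
    """Merge two pathways by taking the union of their edges, then rebuild
    the adjacency matrix over the combined, alphabetically sorted node set."""
    merged_edges = set(edges_a) | set(edges_b)
    nodes = sorted({name for edge in merged_edges for name in edge})
    return nodes, to_adjacency(nodes, merged_edges)


# Placeholder pathways from two hypothetical sources; the labels are generic
# roles, not real database entries.
source_one = {("ligand", "receptor")}
source_two = {("receptor", "effector")}
nodes, consensus = merge_pathways(source_one, source_two)
print(nodes)
print(consensus)
```

A union keeps every interaction reported by any source; an intersection would give a stricter consensus and is an equally plausible choice.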
C 23
C23 |
Abnormal hypothalamic-pituitary-adrenal axis regulation is a key neurobiological characteristic of depression. Glucocorticoid receptor (GR) function has been shown to be disturbed in depression, hence polymorphisms altering the transcriptional effects of GR-activation might be interesting candidates for this disorder.
The aim of this study was to identify SNPs associated with glucocorticoid (GC)-induced gene expression changes (cis-eQTLs) in peripheral blood. 160 male Caucasians (69 cases, 91 controls) were genotyped using Illumina Human660W-Quad BeadChips. Imputation was performed using IMPUTE-v2 with HapMap III and the 1000 Genomes Project as reference panels. Baseline and stimulated (1.5 mg dexamethasone) gene expression was analyzed using Illumina Human HT12v3 arrays. Quality control checks, filtering, batch corrections and linear regression analysis were performed in PLINK and R. Of a total of 4,395 significant cis-eQTLs, 2,364 were significant response-eQTLs, namely loci associated with GC-stimulated gene expression variation, after multiple testing correction. Over 67% of response-eSNPs were located >200kb from the probe, indicating long-range regulation of gene expression by GCs. This was accompanied by significant enrichment of GR response elements (GREs) within the response-eQTLs. We also observed differences in the affinity of GREs between the opposite SNP alleles. Further, response-eQTLs were significantly more likely than baseline eQTLs to be associated with unipolar depression susceptibility loci from a recent meta-analysis. Interestingly, the majority of these enriched eSNPs alter the expression of more distant genes. In conclusion, our data suggest that GC-stimulated eQTLs could expand our understanding of the genetic basis of stress-related disorders, in which GR function plays an important pathophysiologic role. |
Arloth J*
*Max Planck Institute of Psychiatry Germany |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 24
C24 |
The continued reduction in the cost of sequencing has made it more feasible to sequence both the DNA and RNA components of tumour samples. Until recently these data have been analysed in isolation; however, by combining the datasets we can add value to each and gain valuable biological insight. Here we use whole genome DNA sequencing of tumour and matched normal samples, along with RNA from the tumour sample. Looking initially at SNVs and indels, we show that our power to detect somatic variants with high confidence is increased by combining data sets. We report the evidence for RNA editing and details of the causes of false-positive calls in both the DNA and RNA. In addition to SNVs and indels, we show how using both DNA and RNA gene-fusion callers can increase our ability to determine real gene-fusion events, using validated events. We present our methods and results and suggest ways in which other datasets could be combined further to provide additional insight.
|
MacArthur S*, Chen X, Becq J, Shaw R, Chuang H, Khrebtukova I, Mann T, Murray L
*Illumina, Inc. United Kingdom |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 25
C25 |
Gene variations are increasingly being recognized as important diagnostic and prognostic molecular markers in myeloid neoplasms. Several frequently mutated genes in this group of disorders have recently been discovered, and new markers are expected to emerge and to contribute to better characterization of patients. An important candidate marker, due to numerous gene variations affecting its protein sequence, is TET2, an epigenetic regulator of the hematopoietic process. We have developed a web application for predicting the functional effects of TET2 gene variations in coding regions based on differences in the long-range properties of altered sequences. These properties were analyzed on a set of 185 amino acid substitutions by the informational spectrum method (ISM), a Fourier-transform-based method for pattern analysis of the electron-ion interaction potential (EIIP) parameter. The observed significant differences between SNPs and mutations, as well as between mutations occurring in different myeloid neoplasms, are applied in the classification criteria on which predictions are based. Long-range recognition properties represent global characteristics of the sequence; therefore, predictions made by this application are equally efficient in conserved and non-conserved protein regions, which is a distinct advantage over tools based on multiple sequence alignment and on the degree of conservation of residues. Further developments will include other frequently mutated genes such as RUNX1, NRAS, DNMT3A and KIT. |
Veljkovic N*, Gemovic B, Glisic S, Perovic V, Veljkovic V
*Institute of Nuclear Sciences Vinca Serbia and Montenegro |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 26
C26 |
The explosion of biological data has dramatically reshaped today’s biology research. The biggest challenge for biologists and bioinformaticians is the integration and analysis of large quantities of data to provide meaningful insights. One major problem is the combined analysis of data of different types. Bi-cluster editing, a special case of clustering which partitions two different types of data simultaneously, might be used in several biomedical scenarios. However, the underlying algorithmic problem is NP-hard. Here we contribute BiCluE, a software package designed to solve the weighted bi-cluster editing problem. It implements (1) an exact algorithm based on fixed-parameter tractability and (2) a polynomial-time greedy heuristic based on solving the hardest part, edge deletions, first. We evaluated its performance on artificial graphs. Afterwards, as an example, we applied our implementation to real-world biomedical data, in this case GWAS data. BiCluE works on any kind of data that can be modeled as (weighted or unweighted) bipartite graphs. To our knowledge, this is the first software package solving the weighted bi-cluster editing problem.
|
Sun P*, Baumbach J, Guo J
*Max Planck Institute for Informatics Germany |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
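To make the objective of weighted bi-cluster editing in abstract C26 concrete, the sketch below scores a candidate partition of a small bipartite graph. It is not BiCluE and implements neither its exact algorithm nor its greedy heuristic. The weight convention (a positive weight means the edge exists and costs that much to delete, a non-positive weight means the edge is absent and costs its absolute value to insert), the `editing_cost` helper and the toy node names are assumptions.

```python
def editing_cost(weights, cluster_of):
    """Cost of turning a weighted bipartite graph into disjoint bi-cliques
    that match the given cluster assignment.

    weights    -- dict mapping (left_node, right_node) to a signed weight
    cluster_of -- dict mapping every node to its cluster label
    """
    cost = 0.0
    for (left, right), weight in weights.items():
        same_cluster = cluster_of[left] == cluster_of[right]
        if weight > 0 and not same_cluster:
            cost += weight    # existing edge must be deleted
        elif weight <= 0 and same_cluster:
            cost += -weight   # missing edge must be inserted
    return cost


# Toy bipartite graph: left nodes g1..g3, right nodes p1..p2 (invented names).
weights = {
    ("g1", "p1"): 2.0, ("g1", "p2"): 1.0,
    ("g2", "p1"): 1.5, ("g2", "p2"): -0.5,
    ("g3", "p1"): -1.0, ("g3", "p2"): -2.0,
}
two_clusters = {"g1": 0, "g2": 0, "p1": 0, "p2": 0, "g3": 1}
one_cluster = {node: 0 for node in two_clusters}
print(editing_cost(weights, two_clusters))
print(editing_cost(weights, one_cluster))
```

In this toy case, isolating `g3` requires a single insertion, whereas forcing all nodes into one bi-clique requires three, so the first partition is cheaper. Finding the cheapest partition overall is the NP-hard part the abstract refers to.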
C 27
C27 |
Recently, there has been much interest in gene-disease networks and polypharmacology as a basis for drug repositioning. Here, we integrate data from structural and chemical databases to create a drug-target-disease network for 147 promiscuous drugs, their 553 protein targets, and 44 disease indications. Visualizing and analyzing such complex networks is still an open problem. We approach it by mining the network for network motifs of bi-cliques. In our case, a bi-clique is a subnetwork in which every drug is linked to every target and disease. Since the data is incomplete, we identify incomplete bi-cliques, whose completion introduces novel, predicted links from drugs to targets and diseases. We demonstrate the power of this approach by repositioning cardiovascular drugs to parasitic diseases, by predicting the cancer-related kinase PIK3CG as novel target of resveratrol, and by identifying for five drugs a shared binding site in four serine proteases and novel links to cancer, cardiovascular, and parasitic diseases. |
Daminelli S*, Haupt J, Reimann M, Schroeder M
*Technische Universität Dresden Germany |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
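The completion step described in abstract C27, where an almost complete bi-clique yields predicted drug-target or drug-disease links, can be sketched for a single candidate pair of node sets. This is only an illustration: the abstract does not say how candidate bi-cliques are enumerated or how many gaps are tolerated, so the `max_missing` parameter, the function names and the placeholder labels are assumptions.

```python
from itertools import product


def missing_links(drugs, partners, known_links):
    """Return every (drug, partner) pair absent from the known links.
    In a complete bi-clique this list is empty."""
    return [pair for pair in product(drugs, partners) if pair not in known_links]


def predict_by_completion(drugs, partners, known_links, max_missing=1):
    """Treat the node sets as an incomplete bi-clique when only a few links
    are missing, and report those gaps as predicted links."""
    gaps = missing_links(drugs, partners, known_links)
    return gaps if 0 < len(gaps) <= max_missing else []


# Placeholder network: labels are generic and do not refer to real compounds.
drugs = ["drugA", "drugB"]
partners = ["targetX", "targetY"]
known = {("drugA", "targetX"), ("drugA", "targetY"), ("drugB", "targetX")}
predicted = predict_by_completion(drugs, partners, known)
print(predicted)
```

Raising `max_missing` admits sparser subnetworks and therefore more, but less well supported, predictions.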
C 28
C28 |
The Single Amino Acid Polymorphism data analysis pipeline (SAAPdap) allows the analysis and visualization of the structural effects of mutations, coupled with a predictor of whether mutations will be damaging. SAAPdap has been built along the same principles as SAAPdb (http://www.bioinf.org.uk/saap/db/), our database of effects of mutations identified by mapping single nucleotide polymorphisms (SNPs, from dbSNP) and pathogenic deviations (PDs, from OMIM and several locus-specific mutation databases) to protein structural data from the Protein Data Bank. Both SAAPdap and SAAPdb perform an automated analysis of likely local structural effects that may disrupt protein folding, function or stability and therefore may be related to a harmful phenotype.
SAAPdb used fixed thresholds for defining potentially damaging effects for all analyses. However, assigning a mutation as having (or not having) a structural effect has been shown to be very sensitive to precise structural details. In SAAPdap, we provide continuous (rather than Boolean) values for each of the analyses. Analysis of the data in SAAPdb shows clear differences in the sequence and structural characteristics of SNPs and PDs: as might be expected, PDs have additional, and more severe, structural effects. This indicates that there is a clear signal in the data that can be used to predict the pathogenicity of a novel mutation. This has been exploited using a Random Forest predictor, and initial results out-perform any other available method. The work presented here includes an update of the data, initial results from the improved analysis of clashes, from-glycine and to-proline mutations, and the performance of the initial pathogenicity-predicting model. |
Alnumair NSN*, Martin ACR
*Institute of Structural and Molecular Biology, Division of Biosciences, University College London United Kingdom |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 29
C29 |
As a member of the International Cancer Genome Consortium, the Ontario Institute for Cancer Research is investigating genomic abnormalities in pancreatic and prostate cancer. In our lab, we are interested in studying tumour samples at the protein interaction level, to provide insights into how cancer somatic variations trigger the cancer evolution, and with an objective of discovering putative prognostic biomarkers. We have opted to use the transcriptome as a surrogate for identifying which proteins are present, and how these proteins are modified in tumour samples. We performed a comparison of several tools and strategies developed to reconstruct transcripts and estimate gene expression from RNA-Seq data (reference-guided and reference-independent approaches). The goal of the comparison was to determine the advantages and biases of known methods in the elucidation of transcripts. We have exercised these different methods on two data sets, a normal mouse liver sample, and a human cell line extracted from a pancreatic cancer tumour sample. We compared the genes, transcripts and the resulting translated proteins obtained by each tool, focusing on the elements discovered by only one type of method, and searching for common features. Our final analysis pipeline is a combination of methods to maximize the useful output from an RNA-Seq sample. The resulting proteins identified were then used to construct sample-specific protein interaction networks. The potential protein variants identified by the different methods in the pancreatic cancer dataset were mapped on the corresponding tumour-specific protein interaction network to evaluate their essentiality in the network. |
Chautard E*, Ouellette BFF
*Ontario Institute for Cancer Research (OICR) Canada |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 30
C30 |
Analysis of phenotypic information is emerging as a powerful tool to uncover hidden molecular relationships. For instance, genes showing similar phenotypic features in mouse models often belong to the same biological pathway and similarity of drug side effects has been used to reveal novel drug targets. Here we exploited both drug and gene phenotypic information to search for novel molecular associations of drugs. For that purpose, we annotated the side effects of 1346 drugs marketed around the world and 5384 genes from the MGI repository with the MedDRA ontology and measured the phenotypic linkage between drugs and genes by an extended semantic similarity approach.
We benchmarked the predicted relationships with known drug targets from the STITCH database and observed an enrichment of direct gene-drug associations, indicating that our approach is able to detect molecular effects of drugs. Interestingly, we also predicted unknown associations of drugs and genes, suggesting that our method might reveal novel drug modes of action. This approach promises to give new insights into the molecular mechanisms that translate chemical perturbations into phenotypic effects and may thus help to improve the rational use of medicines. |
Prinz J*, Vogt I, Campillos M
*Helmholtz Zentrum München Germany |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 31
C31 |
In order to facilitate the development of targeted drugs for the treatment of colorectal cancer (CRC), it is important to acquire a better understanding of CRC subtypes and their molecular differences. Furthermore, it is necessary to assess the degree to which available cell lines resemble primary tumor subtypes. We developed a new unsupervised approach, iterative non-negative matrix factorization (iNMF), for stratifying tumor samples using genome-wide mRNA expression data. Starting from a gene expression dataset consisting of 63 CRC tumors, we identified two dominant subtypes, which were highly concordant with the classes induced by an epithelial-mesenchymal-transition (EMT) gene expression signature. Further stratification revealed five subtypes that show differential activation of specific signaling pathways. Importantly, the derived subtype gene signatures allowed stratifying independent datasets suggesting that the signatures capture disease-relevant intrinsic features of CRC. Application of the gene signatures to expression data obtained from cell lines revealed that the tumor subtypes were covered in all panels analyzed. By integrating pharmacological response data, we identified several targeted compounds showing differential response across the subtypes. The CRC stratification obtained with our new method, iNMF, offers valuable insight into the differences between CRC subtypes at a functional level. Most importantly, it captures features of the disease that are highly relevant for the development of new targeted drugs in defined CRC patient sub-populations. |
Schlicker A*, Beran G, Chresta CM, McWalter G, Pritchard A, Weston S, Runswick S, Davenport S, Heathcote K, Alferez Castro D, Orphanides G, French T, Wessels LF
*Division of Molecular Carcinogenesis, Netherlands Cancer Institute Netherlands |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 32
C32 |
MicroRNAs (miRs) regulate hundreds of target genes post-transcriptionally. To explore the inter-relationships between miRs and alternative splicing (AS) events in Parkinson's disease (PD) leukocytes, we performed next-generation RNA sequencing of short RNAs and used junction-sensitive splicing arrays and network analyses to find putative inter-relationships.
Over 70 million short reads were aligned to miRBase. 482 mature miRs and 79 mature 3'/5' miR forms were detected, of which 11 were increased and 6 decreased. Linear regression analyses of reciprocal junction pairs in splice junction array data from the same samples revealed 478 disease-associated AS events (in 332 distinct genes) that separated patients from controls. These genes carry putative binding sites for 13 of the changed miRs in the 3' and 5' UTRs and in the coding regions. Network analysis revealed 560 connections between 13 PD-modified miRs and 217 AS target transcripts that they are predicted to bind, which commonly presented multiple binding site predictions. This network included the PD-related transcription factors FOXP1 and PITX3, predicted to be targeted by several dys-regulated miRs. Our findings indicate miR-mediated dys-regulation of balanced AS events that may control disease-associated variations in leukocyte transcripts. Thus, miRNAs may mediate splicing changes in hundreds of PD leukocyte genes through a complex network of miR-AS connections including the 3' UTR and beyond. This integrative analysis approach addresses an unmet need to identify new targets for future therapeutic intervention. |
Soreq L*, Bergman H, Israel Z, Soreq H
*The Hebrew University of Jerusalem Israel |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 34
C34 |
A growing number of hospitals and other healthcare providers use Electronic Medical Records (EMRs) to store and access patients' medical records. It is becoming increasingly clear that, in addition to improving the quality and efficiency of health care, EMRs are a largely unexplored research asset. The patient data have already been collected, at a nearly unprecedented level of detail about patients' phenotypic traits.
In this vast body of medical data, drug treatment information is one of the most common entries. An unfortunate effect of all drug treatments is the risk of patients experiencing an Adverse Drug Reaction (ADR). To gain more knowledge about these undesired effects, we have developed a method for identification and extraction of ADRs from an EMR system. We are able to identify known ADRs occurring at frequencies similar to those stated by the manufacturer and in the literature. In addition, we are able to identify previously unknown ADRs. Currently, most post-authorization drug safety monitoring relies on spontaneous reports and on signals detected from these reports. Since spontaneous reporting is time-consuming and underreporting is a widespread issue, ADR extraction from EMRs could in the future complement and improve this reporting. This would in turn significantly improve the knowledge about undesired effects caused by marketed drugs. |
Eriksson R*, Jensen LJ, Brunak S
*NNF Center for Protein Research, University of Copenhagen Denmark |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 35
C35 |
Viral hemorrhagic fever (VHF) can be caused by several unrelated viral families. Clinical symptoms are not a good metric for identification of the infectious agent, because symptoms can be similar for most VHFs. It is imperative to identify unique markers of individual viral diseases. Microarray analysis can provide a more accurate assessment of immune response in the infected host, and can identify transcriptional patterns that serve as biomarkers of different viral diseases.
We assessed the genome-wide transcriptional response of circulating immune cells of non-human primates (NHPs) infected with EBOV, and subsequently treated with anticoagulant drugs. We identified ~200 statistically significant genes that can distinguish disease outcome between surviving and non-surviving NHPs. These genes are associated with inflammatory response, cell growth and proliferation, T cell death, and inhibition of viral replication. Within this subset of genes, we have identified two transcription factor (TF) hubs that correlate with survival: CCAAT enhancer-binding protein alpha and tumor protein 53. Both TFs are able to distinguish between surviving and non-surviving animals. We have also assessed the transcriptional response of NHPs to Lassa virus (LASV) infection. Sequential sampling over the disease course identified specific genomic signatures of the immune response to LASV infection, including the up-regulation of TLR signaling pathways and innate antiviral transcription factors. Our results suggest that unique gene expression profiles can be used to develop predictive signatures of successful immune response to VHF challenge, and to identify gene subsets which can be biomarkers for clinical discrimination, diagnosis or prognosis. |
Garamszegi S*, Yen J, Connor J, Honko A, Hensley L, Xia Y, Geisbert J, Geisbert T, Rubins K
*Boston University - Graduate Program in Bioinformatics United States of America |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 36
C36 |
The ability to sequence nucleic acids at an unprecedented pace and at decreasing cost using massively parallel sequencing (MPS) strongly affects biomedical research. We developed a new approach for detection of rare mutations based on deep coverage MPS data typical of capture/targeted sequencing. Our approach considers the mutation frequency on each of the two strands separately, the quality scores of the different variants, and the distribution of the reads supporting the variants. The approach is effective in reducing amplification and sequence-context-dependent biases, while still maintaining the deep coverage required to detect low frequency variants. We applied this method to the detection of clinically relevant mutations in a chronic myeloid leukemia (CML) patient. Resistance mutations in BCR/ABL transcripts were monitored at different time points while the patient was treated with tyrosine kinase inhibitors (TKI). The large volume of sequencing data increases sensitivity compared to Sanger direct sequencing and allows detection of marginally represented and previously uncharacterized mutations. We detected changes in the frequency of mutated clones, including the emergence of the T315I mutation and its disappearance with no therapy change. We also observed correlation in the appearance of adjacent mutations. Our approach is implemented in the program DCMD (deep coverage mutation detector), intended for detection of low frequency SNVs/mutations based on deep coverage MPS data. The program is available at: http://sheba-cancer.org.il/software/dcmd
Eyal E*, Tohami T, Cesarkas K, Jacob-Hirsch J, Volchek Y, Amir A, Nagler A, Rechavi G, Amariglio N
*Sheba Medical Center Israel |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
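As a rough illustration of the strand-separated frequency check described in abstract C36, the sketch below accepts a candidate variant only when it is supported on both strands. It is not the DCMD implementation: the function name, the count-based inputs and the `min_freq` and `min_depth` thresholds are assumptions made for this example, and the quality-score and read-distribution criteria mentioned in the abstract are not modeled.

```python
def strand_supported(fwd_alt, fwd_total, rev_alt, rev_total,
                     min_freq=0.01, min_depth=100):
    """Return True if a variant passes the frequency cut-off on BOTH strands.

    fwd_alt / rev_alt     : reads carrying the variant on each strand
    fwd_total / rev_total : total reads covering the site on each strand
    min_freq, min_depth   : hypothetical thresholds, not taken from DCMD
    """
    # Require enough coverage on each strand before trusting a frequency.
    if fwd_total < min_depth or rev_total < min_depth:
        return False
    fwd_freq = fwd_alt / fwd_total
    rev_freq = rev_alt / rev_total
    # A variant seen on one strand only is treated as a likely artefact.
    return fwd_freq >= min_freq and rev_freq >= min_freq
```

Requiring support on both strands is one simple way to suppress strand-specific amplification or sequencing artefacts while still allowing low overall frequencies.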
C 37
C37 |
Motivation: Several analyses of gene expression profiles have been implemented to find markers for disease classification at the level of individual genes and gene sets. Network-based markers have been shown to be more accurate and reproducible than traditional multivariate models that combine individual gene markers. Despite this progress, most network-based approaches do not fully exploit topology features, such as subnetwork density. This factor is important because network topology together with expression data encode biologically meaningful information. To address this limitation, here we introduce Pandino, a new network-based method for disease discrimination.
Results: In this method, we overlay the gene expression data on the PPI network and look for the most discriminative subnetworks. This is done by searching for candidate subnetworks that are both highly connected and highly isolated, while showing cohesive expression patterns. We tested our method on two independent large-scale breast cancer datasets. Discriminatory subnetworks were highly enriched in cancer-related GO terms and KEGG pathways. On the basis of classification accuracy and reproducibility, they were compared with models based on individual genes and pathways, as well as with other network-based classification approaches. Pandino can efficiently find highly modular subnetworks with compact gene expression profiles. In addition, the functionally enriched subnetworks outperformed other approaches when tested on independent datasets. Conclusion: We propose Pandino, an alternative method to identify highly modular and transcriptionally compact subnetworks for disease classification. Although here we report tests on cancer datasets, Pandino can also be applied to other diseases. |
Zhang L*, Ye Z, Wang B, Azuaje F
*CRP-Santé Luxembourg |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 38
C38 |
It has been hypothesised that genetic interactions (i.e. epistasis) may resolve the problem of "missing heritability" currently observed in genome-wide association studies (GWAS). However, the systematic analysis of interactions in non-synthetic data poses significant computational challenges, which are often addressed by an initial reduction of the set of analysed markers. Such a reduction risks omitting epistatic loci that show a very weak effect, or even no effect, on their own; the existence of such loci has been shown theoretically.
In order to facilitate practical analysis of interactions for the full set of SNP markers, we have developed an integrated platform designed for use on a desktop computer. This system performs fast, exhaustive evaluation of SNP pairs for a typical dataset of ≈ 500,000 SNPs and ≈ 5,000 samples in less than 3 hours on a CPU, and in under 30 minutes with a graphics card (GPU) based implementation. This is significantly faster than comparable algorithms, especially as it allows simultaneous use of multiple statistical filters with low overhead. Application of our technique to seven Wellcome Trust Case Control Consortium datasets identifies SNP pairs that have considerably stronger association with disease than their individual component SNPs, which often show negligible effect on their own. For some diseases, thousands of SNP pairs are identified that pass formal multiple test correction (Bonferroni correction). This set of SNP pairs can be further refined with secondary analysis, including more computationally expensive techniques such as permutation testing and sparse regression, in line with recent univariate analysis. |
Goudey B, Wang Q, Fan S*, Kowalczyk A, Gor L, Stern L, Macintyre G, Ong CS, Rawlinson D, Kowalczyk A
*NICTA Australia |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
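To give a sense of the scale behind the exhaustive pairwise scan in abstract C38, the following sketch counts the SNP pairs for a dataset of the stated size and derives the corresponding Bonferroni-corrected significance threshold. The function name and the family-wise error rate of 0.05 are assumptions for this example; the abstract does not state which level was used.

```python
from math import comb

def pair_test_burden(n_snps, alpha=0.05):
    """Return (number of SNP pairs, Bonferroni per-test threshold).

    alpha is an assumed family-wise error rate, not a value from the abstract.
    """
    n_pairs = comb(n_snps, 2)      # n * (n - 1) / 2 unordered pairs
    return n_pairs, alpha / n_pairs

# Dataset size quoted in the abstract: about 500,000 SNPs.
n_pairs, threshold = pair_test_burden(500_000)
print(f"{n_pairs:,} pairs; per-pair p-value threshold {threshold:.2e}")
```

With roughly 1.25 × 10^11 pairs, a pair must reach a p-value on the order of 10^-13 to survive correction, which is why exhaustive evaluation needs a fast implementation.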
C 39
C39 |
Classification analysis of gene expression profiles of tumour tissues has been widely used to aid the identification of the primary site of metastatic tumours. In this work, we demonstrate that accurate identification of tumour origin can be achieved using only Formalin-Fixed, Paraffin-Embedded (FFPE) metastatic tumour samples.
We have developed a classifier trained exclusively on a set of FFPE metastatic tumour tissues (FFPE-Met, 374 samples from 14 tumour types, all profiled on standard Illumina WG-DASL chips). For independent validation, we used the matching subset of profiles of 517 fresh frozen tumour samples (Frozen-Mixed), which is publicly available and includes 289 primary tumours. This dataset was profiled on either Affymetrix-UG133 or Pathwork’s Pathchip. In the main experiment, the predictor trained on FFPE-Met achieved 80.0% accuracy in leave-one-out (LOO) cross-validation and, independently, 81.4% when tested on Frozen-Mixed (here, a correct prediction is defined as agreement of the predicted tumour type with the available pathological reference). This shows that metastatic tissue can preserve robust signatures of its origin, and that FFPE processing does not substantially impair gene expression profiling with a variety of modern microarray technologies. In the second experiment, conversely, we trained on Frozen-Mixed samples and observed the highest accuracy (89%) in the LOO test, but the accuracy was only 58% when testing on FFPE-Met. This is because predicting metastatic samples from the primary tissues included in Frozen-Mixed is challenging, and the high LOO accuracy on Frozen-Mixed reflects only the internal consistency within this set, which may lead to bias when dealing with the more heterogeneous FFPE-Met samples. |
Shi F*, Kowalczyk A, Byron K, Tothill R, Mileshkin L, Klupacs R, Bowtell D
*National ICT Australia Australia |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 41
C41 |
The microbial community in the human intestine constitutes one of the most densely populated ecosystems on earth. It carries 150-fold more unique genes than our own genome, is highly dynamic in space and time, and is a key contributor to the immune system, digestion of food, and various health complications. Characterizing the overall structure, variability, and health associations of this virtual metabolic organ is a major challenge for contemporary human biology.
The Human Intestinal Tract Chip (HITChip) microarray provides a standardized analytic platform to probe the relative abundances of over 1000 resident bacterial phylotypes of the human intestine. The HITChip microarray platform has to date been applied in a hundred individual studies, accumulating a database of 10 000 microarray experiments from thousands of individuals, including versatile metadata on the subjects. The versatility and considerable sample size make this a unique resource for studying the associations between intestinal microbiota, environment, and phenotype. We investigate selected subsets of the HITChip atlas to characterize the diversity, stability, and resilience properties of the resident microbial communities in the human intestine. The analysis highlights alternative stable states in the phylotype space and provides means to quantify the resilience, or the overall capacity to recover from perturbations, based on the estimated strength of each stable attractor. We couple these variables to environmental variables and health indicators and study their evolution across the human life span from birth to retirement. |
Lahti L*, Salojärvi J, Salonen A, van Nes E, Scheffer M, de Vos WM
*Wageningen University Netherlands |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 42
C42 |
Integration of diverse genomic data sets and visualization of microRNA (miRNA) related genomic overlaps would enable prioritization of positional candidates for polygenic diseases. MicroRNAs have been found to be coordinately expressed with their host genes, implying that they might share a common transcriptional mechanism. Therefore, miRNA Genomic Viewer, a tool for visualization of genomic locations of miRNA genes, their host genes (protein-coding, lincRNA, and snoRNA), overlapping QTL, and polymorphisms in human and mouse, was developed. MiRNA Genomic Viewer is freely available at http://www.integratomics-time.com/miRNA-genomic-viewer/. It integrates data from miRBase, Ensembl, RGD (Rat Genome Database), and OMIM, and is linked to the online tool for detection of miRNA gene polymorphisms (miRNA SNiPer; http://www.integratomics-time.com/miRNA-SNiPer).
As a case study, we present a visualization of the genomic distribution of obesity related miRNA genes and their overlapping regions. Obesity related miRNAs were collected from five mammalian species (human, cattle, pig, rat, and mouse; n=167) and presented on the genomic view as human orthologs (n=87). Forty-seven miRNA genes overlapped 53 sense-oriented host genes and 92 obesity related QTL. Sixty-four polymorphisms were found within pre-miRNAs, five of them located within the miRNA seed region that is responsible for target mRNA binding. Using miRNA Genomic Viewer, we identified multi-layer genomic overlaps consisting of QTL, obesity associated miRNAs, their host genes and genetic variability. The miRNA Genomic Viewer tool allows users to identify genomic positional candidates for further experimental analysis and can serve researchers as a starting point for testing more targeted hypotheses and designing experiments related to polygenic diseases. |
Jevšinek Skok D*, Zorc M, Kovač M, Kunej T
*University of Ljubljana, Biotechnical faculty, Department of Animal Science Slovenia |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 43
C43 |
Carcinogenesis is the result of gradual alteration events at the genetic and epigenetic level, some of which induce waves of clonal expansion. Although highly relevant for clinical diagnostics and prognostics, the relative timing of these alterations is still poorly understood. Here, we analyze the gene expression profiles of colorectal adenoma and carcinoma samples, focusing on the genes showing statistically significant differential expression as compared to normal colorectal mucosa. To cope with the well-known high diversity of cancer transcriptomes, we map the identified genes to functional pathways that have been shown to be involved in carcinogenesis. A candidate gene is considered altered whenever its dysregulation exceeds a certain threshold, which is estimated from the data using maximum entropy. A pathway is considered altered if at least one of its gene members is altered. In order to integrate phenotypic markers of tumor progression, such as the TNM staging system, into the analysis, we learn a probabilistic genotype–phenotype map using Isotonic Conjunctive Bayesian Networks (I-CBNs). This graphical model combines estimation of partial order constraints on the level of pathway alterations with isotonic regression for estimating a non-decreasing phenotypic gradient from the observed progression markers. This novel application of our probabilistic graphical modeling framework to gene expression data sheds a different light on the complex process of cancer progression. |
Constantinescu S*, Marra G, Beerenwinkel N
*ETH Zurich Switzerland |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
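The gene-to-pathway mapping rule in abstract C43 (a gene is altered when its dysregulation exceeds a threshold; a pathway is altered when at least one member gene is altered) can be sketched as below. The gene and pathway names are placeholders, and the threshold is passed in as a fixed number; the maximum-entropy estimation of that threshold described in the abstract is not reproduced here.

```python
def altered_genes(dysregulation, threshold):
    """Genes whose absolute dysregulation score exceeds the threshold."""
    return {gene for gene, score in dysregulation.items()
            if abs(score) > threshold}

def altered_pathways(pathways, altered):
    """Mark a pathway as altered if at least one member gene is altered."""
    return {name: any(gene in altered for gene in members)
            for name, members in pathways.items()}

# Hypothetical scores (e.g. log fold change versus normal mucosa).
scores = {"geneA": 2.4, "geneB": -0.3, "geneC": -1.9, "geneD": 0.1}
pathways = {"pathway1": ["geneA", "geneB"], "pathway2": ["geneB", "geneD"]}

genes = altered_genes(scores, threshold=1.5)
status = altered_pathways(pathways, genes)
```

The resulting binary pathway vector per sample is the kind of input on which partial order constraints between alterations could then be estimated.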
C 44
C44 |
Coexpression network analysis is a useful method for exploring clues about genetic regulatory mechanisms. Usually, the coexpression relationship between two genes is determined from the expression profiles of those genes. Here, we propose a coexpression network conditional on macroscopic phenotypes. The phenotypes may include experimental conditions, drug or chemical exposure, or clinical phenotypes. The phenotype-conditioned coexpression network is constructed as follows. First, a coexpression network is built from the gene expression data alone. Then, the residuals of the regression of each gene expression profile on the macroscopic phenotypes are extracted, so that each gene has a residual vector. The correlation coefficient between the residuals of two genes is the partial correlation coefficient of those genes conditioned on the phenotypes. Finally, the difference between each original coexpression value and the corresponding partial correlation is tested statistically. Significant results indicate that the gene pair has a phenotype-specific coexpression relationship. We applied this method to breast cancer microarray data. The analysis revealed phenotype-specific coexpression relationships that had not been discovered with conventional coexpression network analysis. In conclusion, the construction of a phenotype-specific coexpression network identified functional interactions between genes that are related to the expression of phenotypes. |
Cho SB*, Seok JW
*National Institute of Health, KCDC Korea, South |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
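The residual-based partial correlation used in abstract C44 can be illustrated with a minimal sketch. It assumes a single numeric phenotype and ordinary least-squares regression, whereas the abstract allows several phenotypes; the function names and the toy data are invented for this example, and the final statistical test of the difference between raw and partial correlation is not included.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def residuals(expr, phenotype):
    """Residuals of a simple linear regression of expr on phenotype."""
    mp, me = mean(phenotype), mean(expr)
    slope = (sum((p - mp) * (e - me) for p, e in zip(phenotype, expr))
             / sum((p - mp) ** 2 for p in phenotype))
    intercept = me - slope * mp
    return [e - (intercept + slope * p) for p, e in zip(phenotype, expr)]

def partial_correlation(x, y, phenotype):
    """Correlation of x and y after regressing the phenotype out of both."""
    return pearson(residuals(x, phenotype), residuals(y, phenotype))

# Toy data: both genes follow the phenotype, plus unrelated noise.
phenotype = [0, 1, 2, 3, 4, 5, 6, 7]
gene_x = [1, 2, 7, 8, 13, 14, 19, 20]
gene_y = [1, 4, 5, 8, 13, 16, 17, 20]

raw = pearson(gene_x, gene_y)
partial = partial_correlation(gene_x, gene_y, phenotype)
```

In this toy case the raw correlation is high while the partial correlation is close to zero, i.e. the apparent coexpression is explained by the shared phenotype.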
C 45
C45 |
Endocrine therapy is an effective treatment of estrogen receptor-positive (ER+) breast tumors, significantly reducing mortality. However, approximately 30% of patients receiving adjuvant endocrine therapy will experience recurrence within a 15-year period. The mechanisms of endocrine resistance are poorly understood. Understanding the underlying genetic diversity of breast cancers that respond differently to endocrine therapy is important for the development of more optimal and individualized treatment strategies.
In the current study, a panel of isogenic MCF-7-derived human breast cancer cell lines [1-3] that are resistant to Tamoxifen only, or to both Tamoxifen and Fulvestrant, was analyzed for mutations through exome sequencing and compared with the exome of the parental cell line. In addition, global gene expression levels were generated for the same panel of cell lines. Detected variants were integrated with gene expression profiles and analyzed in the context of prior knowledge on drug action and on genes associated with resistance to endocrine therapies, as identified by extensive literature curation. A small panel of somatic point mutations potentially associated with acquired endocrine resistance was identified. Future experimental validation will reveal which of the detected mutations are causally involved in resistance to endocrine therapy. References: [1] Brünner N et al.: Acquisition of Hormone-independent Growth in MCF-7 cells is accompanied by increased expression of estrogen-regulated genes but without detectable DNA amplifications. Cancer Res 1993, 53:283-290. [2] Brünner N et al.: MCF7/LCC2: A 4-Hydroxytamoxifen resistant human breast cancer variant that retains sensitivity to the steroidal antiestrogen ICI 182,780. Cancer Res 1993, 53:3229-3232. [3] Brünner N et al.: MCF7/LCC9: An antiestrogen-resistant MCF-7 variant in which acquired resistance to the steroidal ICI 182,780 confers an early cross-resistance to the nonsteroidal antiestrogen Tamoxifen. Cancer Res 1997, 57:3486-3493. |
Spring Ehlers N*, Shi Da Z, Jun W, Li J, Bjerre CA, Stenvang J, Brünner N, Bolund L, Elias D, Ditzel H, Gupta R
*Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark Denmark |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 46
C46 |
Autism spectrum disorders (ASD) are neuropsychiatric disorders characterized by restricted repetitive behavior and abnormalities in communication and social interaction. ASD has a very complex mode of inheritance and probably involves multiple interacting genes. Genome-wide screens published to date have identified several regions of low to modest predictive value, and only a few studies have been able to replicate the findings. Here, we combined information from previous studies with our genome-wide scan to predict interaction networks affecting ASD.
We have performed a genome-wide scan in a novel set of autism families. The 83 ASD families each contained 1 to 3 affected members. The statistical analysis was carried out with TDT and, additionally, with population-based association approaches using unrelated controls. We also examined association enrichment in gene ontology classifications with the SNP Ratio Test (SRT), a pathway analysis method for genome-wide association datasets. For network analysis, candidate genes were selected based on 1) the results of the TDT and association analyses in this study, 2) previous studies and 3) functions of the genes related to synaptic formation or regulation. In total, 210 genes were selected. The known interactions between these genes were first evaluated using the STRING database. Unknown interactions were then examined with interaction and promoter analysis. In the promoter analysis, we predicted transcription factor binding sites using TRANSFAC in gene areas and in the 20,000 bp upstream of the genes. The accumulated interaction information was used to create networks that are likely to affect ASD. The results are presented at the meeting. |
Oikkonen J*, Kantojärvi K, Ukkola-Vuoti L, Kallela J, Vanhala R, Järvelä I, Onkamo P
*Department of Medical Genetics, University of Helsinki Finland |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 47
C47 |
Human leukocyte antigens (HLA) are proteins involved in the human immunological response. Understanding the HLA-peptide binding interaction is a crucial step in peptide-based vaccine design. However, the high rate of polymorphism in HLA makes this task difficult. The functional supertype classification of HLA proteins can mitigate the complexity of the HLA-peptide binding interaction. The in silico approach therefore represents a useful, less time-consuming and inexpensive way to investigate the peptide binding activity of proteins belonging to these supertypes. Here, we combine different machine learning methods using a meta-learning approach. The proposed MetaPredictor exploits the capability of various well-known supervised classifiers to yield a better solution. For this purpose, Support Vector Machine (SVM), Random Forest (RF), Naive Bayes (NB), Artificial Neural Network (ANN) and K-Nearest Neighbor (K-NN) classifiers are trained at the initial stage. Thereafter, the test samples are predicted by the different classifiers, and the final prediction is derived from all the predicted solutions using a consensus method, the cluster-based similarity partitioning algorithm (CSPA). The performance of the MetaPredictor is described using accuracy, precision, recall and F-measure, together with confusion matrices. The error estimates are calculated using the leave-one-out procedure. Results show that the MetaPredictor produces a maximum gain of 16% over the single machine learning methods. Finally, statistical significance tests have been performed to establish the superiority of the proposed predictor. |
Saha I, Mazzocco G, Plewczynski D*
*Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw Poland |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 48
C48 |
The analysis of heart rate variability has been extensively used in clinical studies to determine alterations of the heart function. We apply non-linear time series methods, which can be classified into fractal, entropy and chaos measures, on time series of inter beat intervals. By using multivariable regression models and cluster analysis on the resulting parameters, we explore the association between the characteristics of the heart beat and air pollution and tobacco smoke exposure. The dataset under scrutiny is part of the SAPALDIA cohort study (Swiss Cohort Study on Air Pollution and Lung and Heart Diseases in Adults), which contains 24h ECG data for 1586 participants and information about tobacco smoke and air pollution exposure, and further covariates such as age, heart disease, etc. |
Häcki C*, Adam M, Probst-Hensch N, Künzli N, Frey U, Delgado-Eckert E
*UKBB Switzerland |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 49
C49 |
During standard drug development drug targets are first identified and validated, then compounds are developed and tested through many steps. Before any human studies the potential safety and efficacy should have been worked out to the extreme. Yet, most drugs, when tested on humans, fail, causing huge financial loss. Mostly they fail due to lack of expected efficacy. Hence, in the process there must be flaws or inadequate test systems used.
In our research, we are working with expression data from different cancer model systems. Our goal is to use genetic pathways to detect the main differences between the models. A pathway that behaves differently in a simpler model is a possible subject for further studies on how such models could be improved in the future. After annotating many datasets and finding similarity between samples, we looked at similarity at the pathway level and found the pathways that behave differently in tumor and normal samples. We also compared tumor samples with samples from the respective tumor model system and found the pathways that differ most in this comparison. |
Metsalu T*, Vilo J
*University of Tartu Estonia |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 50
C50 |
In clinical virology, focus shifts from well-established identification of single-nucleotide variants to genome-wide probing of intra-patient RNA virus population diversity, as low-frequency haplotypes have been shown to affect virulence, immune escape, and drug resistance. A cloud of closely related haplotypes is often referred to as a quasispecies, which is assumed to emerge from a few generating sequences, called generators, subjected to mutation and recombination.
We have elaborated a jumping hidden Markov model that infers the underlying quasispecies from error-prone next-generation sequencing data and predicts the haplotype distribution. In this model, probability tables explain the observed diversity at each site and jump probabilities allow a single observed read to originate from a combination of generators. We have implemented a Variational Bayes modified EM algorithm to compute maximum a posteriori estimates of the model parameters, model selection, and prediction of the haplotype distribution in a Java program called QuasiRecomb. QuasiRecomb is validated by simulation studies, to assess the advantage of explicitly taking the recombination process into account, and applied to clinical studies of HIV-infected samples. |
Töpfer A*, Beerenwinkel N
*ETH Zürich, Department of Biosystems Science and Engineering Switzerland |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 51
C51 |
The presence of underlying disease, including infectious disease and cancer, has been observed to give rise to distinct patterns of whole-blood gene expression. However, using these patterns to derive diagnostic biomarkers capable of distinguishing phenotypically similar diseases remains difficult. The particular challenges in an African setting include deriving a signature robust to geographical heterogeneity, to the presence of endemic underlying co-infections and to the measurement variability from different expression assays.
We propose a bioinformatics pipeline that uses gene expression data from heterogeneous disease groups to derive signatures for distinguishing infectious disease. Our pipeline performs quality control, normalisation and differential expression analyses accounting for age, gender, HIV status and endemic infections. Variable selection is used for identification of minimal biomarker signatures. Finally we propose a novel disease risk-score which is highly generalisable to measurements from different (e.g. non-array) technologies, as well as to independent cohorts. To benchmark our pipeline we used an adult and a paediatric whole-blood RNA cohort (537 & 334 samples respectively) from three sites in Africa. These cohorts include HIV infected and uninfected individuals presenting with tuberculosis (TB), latent tuberculosis infection (LTBI) and other diseases that mimic TB (OD). Data analysis elucidated the host transcriptional response to TB and suggested minimal biomarker signatures that discriminate TB from other phenotypes. Signatures were validated in independent cohorts achieving a sensitivity of 91% in adults (90% in children) for TB vs LTBI and 83% in adults (80% in children) for TB vs OD, thus considerably improving the currently available diagnostic capabilities. |
Kaforou M*, Wright V, Levin M, Coin L
*Imperial College London United Kingdom |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 52
C52 |
Crohn’s disease (CD) and ulcerative colitis (UC) are inflammatory bowel diseases (IBD) characterized by chronic and relapsing inflammation of the gastro-intestinal tract. They cause lifelong suffering, as well as a considerable drain on national and personal health care resources. Although their etiology is still unclear, there is a growing body of evidence for a significant microbial factor. In this study we focus on the global gene expression of the intestinal microbial communities, measured by mRNA sequencing at unprecedented depth. We collected biopsies from inflamed and non-inflamed colonic mucosa of 20 IBD patients and compared the microbial metatranscriptomes of the two tissue states using 600 Gb of Illumina HiSeq RNA-Seq data (15 Gb/sample). We mapped these reads onto a reference of de novo assembled transcriptomes and used DE-Seq to analyze the count data. Preliminary analysis using Real-Time-PCR revealed encouraging transcriptional differences for two pathogenic Bacteroides species, where homologs of genes involved in tissue destruction were up-regulated in inflamed mucosa of UC patients. We also observed higher expression rates of E. coli in inflamed mucosa, as has previously been observed in CD patients. Meta-transcriptome analysis confirmed these results and added a multitude of other gene candidates that were significantly up- or down-regulated in inflamed tissue. Thus, our analysis has revealed transcriptional differences for known microbial pathogens. High-throughput RNA-Seq analysis added extra value to these findings and is a source of continuing analysis with great potential for further interesting findings. |
Claesson M*, O'Callaghan J, Zomer A, Shanahan F
*University College Cork Ireland |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 53
C53 |
High throughput sequencing data generation demands methods for interpreting the effects of genomic variants. Numerous computational methods have been developed to assess the impact of variations because experimental methods are unable to cope with both the speed and volume of data generation. To harness the strength of currently available predictors, the Pathogenic-or-Not-Pipeline (PON-P) integrates 5 predictors to predict the probability that non-synonymous variations affect protein function and may consequently be disease-related. Random forest methodology-based PON-P shows consistently improved performance in cross-validation tests and on independent test sets, providing ternary classification and statistical reliability estimate of results. PON-P may be used as a first step in screening and prioritizing variants in order to determine deleterious ones for further experimentation.
PON-P is available at http://bioinf.uta.fi/PON-P |
Vihinen M*, Olatubosun A, Väliaho J, Härkönen J, Thusberg J
*Lund University Sweden |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 54
C54 |
High-throughput gene-expression profiling technologies yield transcriptomic signatures to predict clinical condition or patient outcome. However, such signatures have limitations, such as dependency on the training set and lack of generalization. We propose an algorithm, ITI (Interactome-Transcriptome Integration, Garcia et al. 2012), to extract a generalizable signature predicting breast cancer relapse by superimposing large-scale protein-protein interaction data over several gene-expression datasets. This method re-implements the Chuang et al. algorithm (2007), with the added capability to extract a transcriptomic signature from several gene-expression datasets simultaneously. Two studies were made with five breast cancer DNA microarray datasets to assess the impact of the training set on ITI. For each study, two separate signatures were established, one on patients over-expressing Estrogen Receptor (ER+) and one on patients not expressing Estrogen Receptor (ER-). We found two sets of 6 and 139 subnetwork signatures linked to 5-year relapse-free survival in ER+ and ER- breast cancer respectively. These were compared with the Wang et al. (2005) and Van’t Veer et al. (2002) signatures on independent datasets. On the Desmedt et al. (2008) dataset, the ER+/ER- accuracy is 0.74/0.54, 0.41/0.44 and 0.60/0.38 for the ITI, Van’t Veer and Wang signatures respectively. On the van de Vijver et al. (2002) dataset, the ER+/ER- accuracy is 0.52/0.53, 0.62/0.53 and 0.63/0.56 for the ITI, Van’t Veer and Wang signatures respectively. We found that the subnetworks formed complexes functionally linked to biological functions related to metastasis and breast cancer. Several driver genes were detected, including CDK1, NCK1 and PDGFB, some of them not previously linked to breast cancer relapse. |
Garcia M*, Finetti P, Bertucci F, Brinbaum D, Bidaut G
*Bioinformatics Integrative Platform, Centre de Recherche en Cancérologie de Marseille, Inserm U1068, CNRS UMR7258, Institut Paoli-Calmettes, Aix-Marseille Univ France |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 55
C55 |
A genome-wide association study on Danish men (405 controls with normal sperm counts and 465 cases with low sperm counts) exploring male infertility and the testicular dysgenesis syndrome was recently published (J Med Genet. 2012 49:58-65). Using data from this study, as well as NordicDB and HapMap, we explore the frequencies of SNPs associated with smoking in order to understand whether there is a defensible genetic predisposition to smoking in Danish men, and whether this phenotype can be predicted. Among various environmental factors, smoking habits were obtained from the hospital patient records. Among the infertile cases, ca. 37% were found to be smokers, and among the normal men (GWAS controls), ca. 22% were found to be smokers. Since the testicular dysgenesis syndrome aggregates several subphenotypes into an overall infertile phenotype, the poster will show current progress and preliminary results in teasing out the association into subphenotypes within the infertile group.
|
Rigina O*, Dalgaard MD, Skakkebaek NE, Juul A, Brunak S, Gupta R
*Center for Biological Sequence Analysis. Technical University of Denmark, Denmark |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 56
C56 |
Detecting pathways that are deregulated in cancer, and using these to determine genes that are likely to play a role in the disease, is a major challenge in cancer research. In this study, we aim to identify pathways of frequently mutated genes by exploiting their network neighborhood encoded in the protein-protein interaction network. To this end, we introduce a multi-scale kernel diffusion framework and apply it to a large collection of murine retroviral insertional mutagenesis data. The diffusion strength plays the role of scale parameter, determining the size of the network neighborhood that is taken into account. As a result, in addition to detecting genes with frequent mutations in their genomic vicinity we can also find genes that harbor frequent mutations in their interaction network context.
We identify densely connected components of known and putatively novel cancer genes and demonstrate that they are strongly enriched for cancer related pathways across the diffusion scales. Moreover, the mutations in the clusters exhibit a significant pattern of mutual exclusion, supporting the conjecture that such genes are functionally linked. Using the multi-scale kernel diffusion approach, various infrequently mutated genes are found to harbor significant numbers of mutations in their interaction network neighborhood. Many of them are well-known cancer genes. Importantly, the putative cancer genes and networks detected in this study are found to be significant at different diffusion scales, confirming the necessity of a multi-scale analysis. Taken together, the results demonstrate the importance of defining recurrent mutations while taking into account the interaction network context at multiple scales. |
Babaei S*, Hulsman M, Reinders M, de Ridder J
*Delft University of Technology Netherlands |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 57
C57 |
Comparing network structures that characterize the normal and tumor states provides insights into the underlying mechanisms of cancer. However, it is not feasible to compare all aspects of large networks, as this requires solving the subgraph isomorphism problem, which is NP-complete. Thus, heuristics for network comparison have arisen. There are two classes of network comparison: global heuristics and local heuristics. Global network properties do not contain the detail needed to capture the structural characteristics of biological networks. Thus, more sensitive local structure measurements have been proposed. Graphlets are all non-isomorphic connected induced graphs on a certain number of nodes. Several measures for comparing network similarities based on graphlets have been proposed; in general, they return a scalar for the difference between two graphs. We introduce the notion of differential graphlet communities to obtain the network structure difference between normal and cancer states. Edges in the graphs correspond to gene co-expression derived from gene expression data.
Applying the differential graphlet community approach to three gene expression datasets with normal lung and lung cancer samples enables us to identify and characterize corresponding network similarities and differences. We identified differential graphlet communities that modified underlying wiring related to the immune system, an emerging hallmark of cancer. Furthermore, across all 3 datasets and all 3 identified differential graphlet communities, shortest path lengths between all pairs of genes in the differential graphlet communities tend to be longer in their normal graphs than in the tumor graphs, suggesting that tumor conditions may create shortcuts that bypass normal mechanisms. |
Wong S*, Cercone N, Jurisica I
*York University and Ontario Cancer Institute Canada |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
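The shortest-path comparison reported in poster C57 can be illustrated with a minimal sketch. It assumes unweighted, undirected co-expression graphs stored as adjacency lists; the toy graphs, gene labels and function names are hypothetical and are not taken from the poster.

```python
from collections import deque
from itertools import combinations


def bfs_distances(graph, source):
    """Breadth-first search distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def mean_shortest_path(graph, genes):
    """Mean shortest path length over all connected pairs of the given genes."""
    lengths = []
    for a, b in combinations(genes, 2):
        d = bfs_distances(graph, a).get(b)
        if d is not None:  # skip pairs that are not connected
            lengths.append(d)
    return sum(lengths) / len(lengths) if lengths else None


# Hypothetical toy graphs: the tumor graph has one extra edge (A-D), a "shortcut".
normal = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
tumor = {"A": ["B", "D"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "A"]}
community = ["A", "C", "D"]

normal_mean = mean_shortest_path(normal, community)
tumor_mean = mean_shortest_path(tumor, community)
```

In this toy case the added edge shortens the mean path length within the community, which is the kind of pattern the abstract describes.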
C 59
C59 |
We present a novel unsupervised Bayesian integrative method for discovering driver genes in cancer. The algorithm incorporates signals from different data types such as copy number, gene expression, point mutations and functional assays into a single probabilistic score. The method was applied to breast cancer, correctly identifying known drivers and uncovering novel oncogenes that were validated both in vitro and in vivo. |
Sanchez-Garcia F*, Chen B, Kotlier D, Silva J, Pe’er D
*Columbia University United States of America |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 60
C60 |
Understanding how cell fate decisions are regulated may in the future allow the cell commitment process to be controlled and directed. Bone mass disorders such as osteoporosis affect a large part of the population, and the ability to modify bone formation in a controlled way is key to treating these disorders.
This poster describes an approach to simulating the dynamics of bone and fat formation, starting from osteo-adipo progenitor cells that are stimulated by external signals such as Wnt pathway activation. The process depends on multiple factors, each one described with a different level of precision. Our approach is based on the reuse and composition of models using Hybrid Systems theory, allowing randomness and non-determinism. We have taken a first step towards testing and analyzing treatments of bone mass disorders in silico, and we are currently improving our model to obtain more accurate predictions. |
Assar R*, Sherman DJ, Montecino MA
*Center for Mathematical Modeling. FONDAP 15090007 Center for Genome Regulation. INRIA Chile Chile |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 61
C61 |
Current guidelines for reproductive toxicity testing require large numbers of experimental animals. The embryonic stem cell test is an animal-free in vitro method for predicting developmental toxicity based on compound-induced inhibition of murine stem cell differentiation. Several studies from our lab used gene expression to study normal stem cell differentiation as well as compound developmental toxicity. We developed a Principal Component Analysis-based algorithm that compares transcriptomics data from compound exposed cells to normal differentiating cells at different time points in a “differentiation track” to predict compound developmental toxicity. To determine the best overall biomarkers for developmental toxicity prediction, we combined data from 3 studies and 23 exposures in a novel analysis. By evaluating predictions of 100,000 randomly selected gene sets, we identified which genes contribute significantly to gene set prediction reliability. Using cross-validation, we identified a set of 52 genes that allows for 83% accuracy in developmental toxicity prediction. This gene set consists mainly of genes involved in stem cell differentiation or other developmental processes. Further work is ongoing to apply this set and the corresponding algorithm to help reduce reproductive toxicity animal testing.
|
Pennings J*, van Dartel D, Robinson J, Pronk T, Piersma A
*RIVM Netherlands |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 63
C63 |
Aggressive non-Hodgkin’s lymphomas are cancers of B cell origin, which account for a significant disease burden in Western Europe and worldwide. They have been divided into classes, including Burkitt’s lymphoma (BL) and diffuse large B-cell lymphoma (DLBCL, which also has a number of sub-classes). Traditionally, class distinctions have been based on cellular morphology, immuno-phenotype and the presence of defined chromosomal translocations (Burkitt’s lymphoma is associated with a translocation of the myc oncogene into the immunoglobulin locus). Recently, however, these classifications have been improved by the use of gene expression information, and statistical or machine learning based classifiers have been developed which use expression information from microarrays. These now contribute significantly to our ability to classify these cancers, revealing for instance a substantial number of cases called ‘molecular Burkitt’s lymphoma’ which on the basis of previous criteria might have been classified as DLBCL. This distinction is particularly important since successful treatment of BL requires much more aggressive chemotherapy than DLBCL. Motivated by the need to use these new methods in a clinical setting, particularly with expression data from formalin-fixed paraffin embedded samples, we have built on previous work to develop new classifiers that match existing classifiers as closely as possible but are optimally transferable between expression measures from different array platforms. We report on these transferable classification algorithms and gene feature sets. Notably, we show that transferable performance can be optimized using only 24 genes as features and a functional trees algorithm. This 24-gene classifier can precisely differentiate BL from DLBCL and performs stably across different datasets. |
Sha C*, Westhead D, Tooze R, Jack A
*Institute of Molecular and Cellular Biology, University of Leeds United Kingdom |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 64
C64 |
Network models have become the premier approach to modeling spreading processes such as epidemics because they capture the heterogeneity in the interaction structure. Traditional measures of a node's influence in a spreading process, such as degree or centrality, provide only a general ranking of nodes, do not account for the dynamics of the spreading process, and underestimate the spreading power of peripheral nodes, which are the most likely points of disease introduction to the network. Path counting has emerged as one remedy to these weaknesses. Yet disease spread is more blob-like than path-like. We propose a novel measure, the Expected Reach (ER), which quantifies a node's spreading power as the expected number of susceptible nodes reached by infected nodes after a given number of infection events seeded from a given individual in an otherwise susceptible population. This measure corresponds directly to the dynamics of the disease transmission rate. We show that ER-3, the expectation after 3 infection events, predicts the time until an initial zombie infects half the network more accurately than path counting metrics. ER also predicts the outcome of a zombie epidemic met by concurrent spread of education in zombie hunting, both in simulated and real-world networks including Slashdot, Facebook, scientific collaboration, and Enron emails. Accurately incorporating the dynamics of the spreading process provides an effective measure of node influence. |
Lawyer G*
*MPII Computational Biology Germany |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 65
C65 |
Parkinson's disease (PD) is the second most widespread neurodegenerative disease in the world and causes a great loss of neurons before the first symptoms appear. Therefore, it is urgent to find reliable biomarkers able to identify the first damage caused by PD. High-throughput data analysis allows PD to be characterized at the functional level and makes it possible to identify candidate biomarkers and relevant functions affected by the disease.
To this aim, we compare the results of two different pipelines in the analysis of the same early PD dataset. The most relevant difference between the two pipelines is the phase of prior knowledge injection: in the Standard pipeline prior knowledge is used after the gene signature identification, while in the alternative schema, Knowledge Driven Variable Selection (KDVS), prior knowledge is used to structure the data matrix before variable selection. In both pipelines, variable selection is performed by l1l2_FS, an embedded variable selection method based on regularization that incorporates feature selection within the classification step. The source of prior knowledge used is the Gene Ontology (GO). Each pipeline identified a gene signature and a list of GO terms. The results were compared via two different procedures, namely Literature characterization and Benchmark analysis. Both pipelines were able to identify relevant gene signatures and GO terms concerning PD, nonetheless the results obtained by KDVS cover much more certified knowledge about the disease with respect to the results from the Standard procedure. |
Squillario M*, Zycinski G, Masecchia S, Verri A, Barla A
*DIBRIS-University of Genoa Italy |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 66
C66 |
We describe an alternative analysis of a High-Throughput Screening (HTS) data set using functionality from the siRna package previously presented at the ISMB 2011 conference. The siRna package provides an automated analysis work-flow that takes in raw HTS experiment data files from the plate reader along with bar code-linked, plate-specific annotation files and generates comprehensive, annotated textual and graphical output. The package has been complemented with a helper function to read arbitrary data files and an alternative aggregation method for replicate plates. A linear model approach is used to estimate the coefficients, that is, the values describing the effects of the different screens. The model is fit to normalized values generated by a novel, statistically more robust, screening data normalization method implemented in R that down-weights outliers on the plate before calculating the loess fit, termed loess-log normalization. After loess correction, data is divided by the median of negative controls and log2-transformed. Intra-plate replicates are summarized by calculating the average. Preliminary results suggest that this approach yields more reliable hits than observed when using a step-wise approach where replicates are treated independently and final results are extracted only after hit finding procedures by filtering. Here, we have applied this approach to a cell-based proliferation screen for selected compounds with known biological targets. The method shows its advantage in detecting good-quality hits especially from the screens where an unusually large number of hits were expected. |
Fey V*, Hongisto V, Nyberg S, Perälä M
*VTT Technical Research Centre of Finland, Medical Biotechnology Finland |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
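The normalization and replicate-summary steps described in poster C66 can be sketched as follows. This is a simplified illustration, not the siRna package itself (which is implemented in R): it assumes the loess correction has already been applied, and the function names and example values are hypothetical.

```python
import math
from statistics import mean, median


def normalize_plate(corrected_values, negative_controls):
    """Divide loess-corrected well values by the median negative control
    and log2-transform the ratios."""
    reference = median(negative_controls)
    return [math.log2(value / reference) for value in corrected_values]


def summarize_replicates(replicate_values):
    """Summarize intra-plate replicates by their average."""
    return mean(replicate_values)


# Hypothetical loess-corrected readings for three wells and three negative controls.
wells = [200.0, 100.0, 50.0]
controls = [90.0, 100.0, 110.0]

normalized = normalize_plate(wells, controls)
replicate_summary = summarize_replicates([1.0, 3.0])
```

After this transformation a value of 0 means the well matches the median negative control, and positive or negative values indicate a signal above or below it.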
C 67
C67 |
Neuroblastoma (NB) is the most frequent pediatric extracranial solid tumor that develops from nervous tissue. NB presents itself as a disseminated disease with a heterogeneous clinical behavior ranging from spontaneous disease regression (stage 4S) to a rapid progression of disease and fatal exitus in patients with stage 4.
Our preliminary analysis aims at investigating the tumorigenesis of NB using array-CGH (aCGH) technology and focusing on the differences between stages 4 and 4S. Modern high-resolution aCGH allows for the identification of numerical and structural aberrations or rearrangements, making it possible to organize the extracted information in a hierarchical structure. The analyzed dataset consisted of 133 samples measured on 4 different Agilent platforms (GPL2873, GPL2878, GPL5477 and GPL4093). For the analysis, we developed a complete pipeline implemented in Python composed of a mixture of original and off-the-shelf steps: - Normalization is performed using CGHnormaliter, an iterative two-step algorithm; - Calling is performed using CGHCall, a mixture model based method for identifying chromosomal aberrations (also used internally by CGHnormaliter); - Alterations are then extracted from the data at a specified resolution (chromosome, arm, band, sub-band); - Tumorigenesis trees are finally produced using two different well-known oncogenesis models (Desper R. et al. (1999, 2000) - Journal of Computational Biology). The pipeline was first tested on synthetic data and then applied to the NB dataset, leading to promising results that are currently undergoing a further postprocessing step for optimal visualization. |
Masecchia S*, Barla A, Squillario M, Coco S, Verri A, Tonini GP
*DIBRIS - Università degli Studi di Genova Italy |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 68
C68 |
Currently, there is a gap between the rapid development of Next-Generation Sequencing (NGS) and the establishment of standards for applying these new massively parallel sequencing technologies in genetic diagnostics. The arrival of targeted enrichment systems to select groups of genes or even the whole exome, as well as the launch of NGS-based bench top machines, has quickly extended the use of NGS platforms into small laboratories. Different “open source” and commercial analysis tools are now available to perform each of the individual steps that make up the whole analysis pipeline in target resequencing studies. Nevertheless, the selection of the right tools and parameters is complex and needs to be carefully studied in order to assess the uncertainty level regarding both the technique and the analysis itself.
Here we report the results of the accuracy and reproducibility quality controls that were carried out to verify and validate the use of gene panels in the diagnosis of heterogeneous diseases. Results using HapMap cell lines as controlled samples reached 99% sensitivity against the HapMap project data (array genotyping data) and 95% sensitivity against 1000Genomes project data (NGS data). |
Rosa-Rosa JM, Trivino JC, Rodriguez-Cruz O, Cantalapiedra D, Santillan S, Carrero R, Rodriguez-dePablos R, Collado C, Fernandez-Pedrosa V, Zuniga S*
*Sistemas Genomicos Spain |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 69
C69 |
Efforts to prevent cardiovascular (CV) complications in patients with renal disease may benefit from noninvasive vascular monitoring and risk model evaluations. We have studied common carotid intima media thickness (CIMT), plaque/stenosis and endothelial brachial flow mediated dilatation (FMD) in renal disease subjects. After one year we evaluated the predictive power of CV risk factors for end point events (EPE): stroke, infarct and death. After five years we aim to determine the impact of noninvasive markers on mortality prediction and CV model performance. Arterial Doppler exams were performed on 67 patients with renal disease and 26 healthy matched controls. We observed EPE in subjects with carotid stenosis, plaque, or IMT above the 75th percentile. EPE prediction was evaluated with an original neural network (NN) model. This used all 93 subjects' data as input: 24 risk factors including traditional and noninvasive markers (P1), repeated without carotid plaque/stenosis features (P2) or without carotid markers and diabetes (P3), by retraining only the elements contributing more than 0.0001. The prediction success rate was significantly greater when carotid structural markers were used: P1=0.81, versus P2=0.5 and P3=0.62. Mortality after five years, so far about 49% of the end stage renal disease patients, can be continuously incorporated into the NN model. In conclusion, carotid markers markedly enhance the performance of a new NN model for CV risk in renal disease. The model can be improved on a larger scale and may find practical use for risk stratification and for selecting personalized treatment aimed at reducing unfavorable CV outcomes. |
Sandu O*, Nastac I, Uribarri J
*AECOM United States of America |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
D 01
D1 |
The Protein Structure Initiative Structural Biology Knowledgebase (SBKB, http://sbkb.org) is the scientific web portal that integrates biological, experimental, and structural data about proteins. SBKB delivers comprehensive information, including 3D structures from the Protein Data Bank, annotations from 150+ open biological resources, target history and protocols from TargetTrack, theoretical models in Protein Model Portal, technology reports from the PSI Technology Portal, PSI articles from the PSI Publications Portal, and research and technical highlights from Nature Publishing Group. We will present several examples on how data found in the SBKB and its portals can enable structural and biological research.
SBKB is supported by the National Institute of General Medical Sciences (U01 GM093324). |
Gabanyi M, Westbrook J, Tao Y*, Shah R, Chen L, Micallef D, Schwede T, Haas J, Bordoli L, McLaughlin W, Julfayev E, Adams P, Gifford L, Minor W, Zimmerman M, Fratczak Z, Berman H
*Center for Integrative Proteomics Research, Rutgers University United States of America |
D - Databases, Ontologies, and Text Mining |
|
D 02
D2 |
Gene Ontology (GO) has established itself as the undisputed standard for protein function annotation; the largest repository of these annotations is the UniProt Gene Ontology Annotation database (UniProt-GOA). Despite its size and widespread use, there have been very few evaluations of its quality: to our knowledge, the most relevant study to date assessed the annotation quality of only 286 human proteins.
We introduced a methodology to systematically and quantitatively evaluate annotations available in UniProt-GOA. We used experimental annotations added in newer releases to confirm or reject electronic or non-experimental curated annotations from older releases. We defined 3 measures of annotation quality for a GO term: 1) reliability measures the proportion of electronic annotations later confirmed by new experimental annotations, 2) coverage measures the power of electronic annotations to predict experimental annotations, and 3) specificity measures how informative the predicted GO terms are. Overall, we found that electronic annotations are more reliable than generally believed, to an extent that they are competitive with annotations inferred by curators when they use evidence other than experiments from primary literature. But we also report significant variations among inference methods, types of annotations, and organisms. This work provides guidance for Gene Ontology users and lays the foundations for improving computational approaches to GO function inference. |
Skunca N*, Altenhoff A, Dessimoz C
*ETH Switzerland |
D - Databases, Ontologies, and Text Mining |
|
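The reliability and coverage measures defined in poster D2 can be illustrated for a single GO term with a minimal sketch. It is a simplification under stated assumptions: annotations are treated as plain sets of proteins, the ontology hierarchy and explicit rejections are ignored, and the function names and protein identifiers are hypothetical.

```python
def reliability(electronic_old, experimental_new):
    """Proportion of older electronic annotations that are confirmed
    by experimental annotations added in a newer release."""
    if not electronic_old:
        return 0.0
    return len(electronic_old & experimental_new) / len(electronic_old)


def coverage(electronic_old, experimental_new):
    """Proportion of newer experimental annotations that had already been
    predicted by the older electronic annotations."""
    if not experimental_new:
        return 0.0
    return len(electronic_old & experimental_new) / len(experimental_new)


# Hypothetical proteins annotated with one GO term.
electronic_old = {"P1", "P2", "P3", "P4"}   # electronic, older release
experimental_new = {"P1", "P2", "P5"}       # experimental, newer release

term_reliability = reliability(electronic_old, experimental_new)
term_coverage = coverage(electronic_old, experimental_new)
```

Here two of the four electronic annotations are confirmed, and two of the three new experimental annotations had been predicted, so the two measures answer different questions about the same overlap.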
D 03
D3 |
Non-coding RNA genes are increasingly acknowledged for their importance in the human genome. However, there is no comprehensive non-redundant database for all such human genes.
We leveraged the effective platform of GeneCards, the human gene compendium, together with the power of fRNAdb, to judiciously unify all non-coding RNA gene entries obtainable from 15 different primary sources. Overlapping entries were clustered to unified locations based on an algorithm employing genomic coordinates. This allowed GeneCards’ gamut of relevant entries to rise ~3 fold, reaching more than 50,000 human non-redundant ncRNAs. Such “grand unification” within a regularly updated data structure will assist future non-coding RNA research. All these non-coding RNAs are included among the ~95,000 entries in GeneCards V3.08, along with pertinent annotation, automatically mined by its built-in pipeline from 100 data sources. This information is available at www.genecards.org. |
Belinky F*, Bahir I, Stelzer G, Rosen N, Nativ N, Dalah I, Iny Stein T, Mituyama T, Safran M, Lancet D
*Weizmann Institute of Science Israel |
D - Databases, Ontologies, and Text Mining |
|
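Clustering overlapping entries by genomic coordinates, as mentioned in poster D3, can be sketched with a simple interval merge. The abstract does not describe the actual GeneCards algorithm, so this is only an assumed approach: entries on the same chromosome and strand are merged when their inclusive coordinates overlap. The field layout and source identifiers are hypothetical.

```python
def cluster_overlapping(entries):
    """Merge entries that overlap on the same chromosome and strand.

    entries: iterable of (chrom, strand, start, end, source_id) tuples
    with inclusive coordinates. Returns a list of cluster dictionaries.
    """
    clusters = []
    # Sorting groups entries by chromosome, then strand, then start position.
    for chrom, strand, start, end, source_id in sorted(entries):
        last = clusters[-1] if clusters else None
        if (last is not None and last["chrom"] == chrom
                and last["strand"] == strand and start <= last["end"]):
            # Overlaps the current cluster: extend it and record the member.
            last["end"] = max(last["end"], end)
            last["members"].append(source_id)
        else:
            clusters.append({"chrom": chrom, "strand": strand,
                             "start": start, "end": end,
                             "members": [source_id]})
    return clusters


# Hypothetical ncRNA entries from three sources.
entries = [
    ("chr1", "+", 100, 200, "srcA:1"),
    ("chr1", "+", 150, 260, "srcB:7"),
    ("chr1", "+", 400, 450, "srcC:3"),
    ("chr1", "-", 120, 180, "srcA:2"),
]
clusters = cluster_overlapping(entries)
```

The two overlapping plus-strand entries collapse into one unified location, while the distant entry and the minus-strand entry remain separate.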
D 04
D4 |
Systems microscopy is an emerging field that applies multiparametric statistical analyses and mathematical modeling to imaging-derived data in order to interrogate biological processes. As with previous 'omics' technologies, systems microscopy aims at integrating data and knowledge from independent studies into a comprehensive understanding of cellular systems. Key to this goal is the development of an infrastructure that facilitates efficient generation, processing and storage of systems microscopy data. This is the scope of the Systems Microscopy Network of Excellence (http://systemsmicroscopy.eu/), a life science project spearheading a key enabling methodology based on live cell imaging for the development of next-generation systems biology.
Within this project, we are developing a prototype database and a web interface for the storage and visualization of data generated from high-throughput discrete perturbation screens (mainly RNAi), aimed at identifying key molecular actors and their functional impact within complex cellular processes. The current prototype is a human, gene-centered, non-relational database, cross-referencing various annotation databases (e.g. Ensembl). The current interface prototype supports four basic types of queries: (i) for a gene, by name or attribute, across studies; (ii) for a reagent or siRNA, by manufacturer or screen internal identifier; (iii) for a phenotype, or multiple phenotypes; (iv) for a gene attribute (ontology term). Ultimately, this new repository will provide easy access to systems microscopy data, facilitate the development of analytical methods for this field and allow integration of independent studies, adding significant value to the hard-earned primary data. |
Kirsanova C*, Rustici G, Brazma A
*EMBL-EBI United Kingdom |
D - Databases, Ontologies, and Text Mining |
|
D 05
D5 |
The elucidation of the complex relationships linking genotypic and phenotypic variations to protein structure is a major challenge in the post-genomic era. We present MSV3d (Database of human MisSense Variants mapped to 3D protein structure), a new database that contains detailed annotation of missense variants of all human proteins (20199 proteins). The multi-level characterization includes details of the physico-chemical changes induced by amino acid modification, as well as information related to the conservation of the mutated residue and its position relative to functional features in the available or predicted 3D model. Major releases of the database are automatically generated and updated regularly in line with the dbSNP (database of Single Nucleotide Polymorphism) and SwissVar releases, by exploiting the extensive Décrypthon computational grid resources. The database (http://decrypthon.igbmc.fr/msv3d) is easily accessible through a simple web interface coupled to a powerful query engine and a standard web service. The content is completely or partially downloadable in XML or flat file formats. |
Luu T, Nguyen H*
*IGBMC France |
D - Databases, Ontologies, and Text Mining |
|
D 06
D6 |
A common problem in organizing integrated biological data is the complexity of the relationships among different data levels. As omics data become more abundant, the complexity of the data increases until a relational database management system can no longer manage them efficiently. To overcome this problem, we developed a database management system designed specifically for the integrated biological data commonly used in systems biology, such as reaction, genomics, transcriptomics, proteomics and metabolomics data. The conceptual data structure was designed using an object-oriented approach, whereas a document-oriented model was used for the physical data structure, with MongoDB as the base management system. To maintain the advantages of the object-oriented data model and achieve the ACID requirements of a database system, a C++ API will be developed on top of a MongoDB API to support data collection with general biological data exchange formats as well as transaction management between database and users. In addition, to support graph analysis of metabolic networks, a graph database system was used as an extension to organize queried metabolic data in a graph structure that consumes less memory for large networks. This extension also provides some built-in basic graph-related algorithms. Moreover, web applications and web services with a graphical user interface will be provided for ease of use. This database system will thereby become a functional database management system that can manage complex biological data to support high-quality biological network reconstructions. |
Pornputtapong N*, Wanichthanarak K, Nookaew I, Nielsen J
*Department of Chemical and Biological Engineering, Chalmers University of Technology Sweden |
D - Databases, Ontologies, and Text Mining |
|
D 07
D7 |
Gene expression patterns (where and when genes are expressed) are a key feature in understanding gene function and evolution. To compare results between different model organisms and human, or to study gene expression evolution, a comparative approach must be used, but no tools allow easy comparison of gene expression across species.
We have thus developed Bgee (Base for Gene Expression Evolution), a database designed to automatically compare expression patterns between animals. This is achieved by i) the aggregation and curation of expression data from different types and sources, to map them to formal representations of anatomies and developments of different species; Bgee release 10 contains curated data for 13,560 Affymetrix chips and 3,364 EST libraries annotated by our curators, as well as 231,992 in situ hybridizations. ii) the analysis of these data by dedicated statistical tests to define high confidence gene expression patterns. iii) the definition of comparison criteria between anatomies of different species; to date, Bgee curators have designed relationships between 5,192 species-specific terms, which map to 1,175 homologous organ groups; the latter are organized in multi-species ontologies (the HOG and vHOG ontologies). Bgee is available at: http://bgee.unil.ch/ |
Bastian FB*, Roux J, Niknejad A, Comte A, Moretti S, Parmentier G, Robinson-Rechavi M
*University of Lausanne and SIB Switzerland |
D - Databases, Ontologies, and Text Mining |
|
D 08
D8 |
The FungProSec database provides systematic and curated information on the presence of protein homorepeats in fungal proteomes and secretomes. We used a sliding window algorithm to scan the fungal proteomes and secretomes for protein homorepeats. The present version of FungProSec contains 58 sequenced fungal proteomes from JGI and NCBI and 51 secretomes. In the present analysis, a threshold of 7 consensus amino acid repeats was required for a protein repeat to be considered a true repeat pattern, in order to avoid false positives. The web interface allows the user to select a proteome or a secretome. The repeat statistics interface reports several repeat statistics, including the repeat unit and its length, the number of repetitions of the repeat unit and the position of the repeat in the protein. The user can also select amino acids on the basis of their physico-chemical properties (non-polar, aliphatic residues, aromatic residues, polar, non-charged residues, positively charged residues, negatively charged residues) and view their repeat statistics in the proteomes and secretomes. The repeat-containing proteins (RCP) were functionally annotated using the BLAST2GO |
Sablok G*, Zottele F, La Porta N, Hietala AM, Fossdal CG, Kajava A
*FEM IASMA Italy |
D - Databases, Ontologies, and Text Mining |
|
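The homorepeat scan described in abstract D8 can be illustrated with a minimal Python sketch. It assumes that a homorepeat is an uninterrupted run of one amino acid and reuses the threshold of 7 from the abstract; the function name `find_homorepeats` is hypothetical, and the actual FungProSec sliding window algorithm may tolerate mismatches or differ in other ways.

```python
from itertools import groupby


def find_homorepeats(sequence, min_len=7):
    """Return (residue, start, length) for every run of a single residue
    that is at least min_len long. Positions are 1-based.

    Assumption: only perfect, uninterrupted runs count as homorepeats.
    """
    repeats = []
    position = 1
    for residue, run in groupby(sequence):
        length = sum(1 for _ in run)
        if length >= min_len:
            repeats.append((residue, position, length))
        position += length
    return repeats


if __name__ == "__main__":
    # Illustrative sequence only, not taken from any proteome.
    print(find_homorepeats("MA" + "Q" * 8 + "KL" + "S" * 7 + "A"))
```

Each tuple carries the repeat unit, its position in the protein and its length, which correspond to the repeat statistics listed in the abstract.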
D 09
D9 |
The ArrayExpress Archive (www.ebi.ac.uk/arrayexpress) is a manually curated public database of functional genomics experiments. This includes gene expression, chromatin immunoprecipitation, comparative genomic hybridization and high throughput sequencing experiments. A subset of ArrayExpress forms the Gene Expression Atlas (www.ebi.ac.uk/gxa), a semantically-enriched database of meta-analysis based summary statistics. Data is collected to MIAME and MINSEQE standards through web-based (MIAMExpress) or spreadsheet (MAGE-TAB) submission tools, or imported from other databases, primarily the Gene Expression Omnibus (GEO). Annotations are mapped to the Experimental Factor Ontology (www.ebi.ac.uk/efo), an application-focused ontology modelling experimental variables in ArrayExpress. Use of EFO has improved annotation consistency and enabled formation of richer queries, using relationships and synonyms.
With the development of sequencing technology and its impact on functional genomics data, ArrayExpress now accepts submission of high throughput sequencing data via our MAGE-TAB submission tool. Non-human and human non-identifiable data can be submitted. Raw data files are transferred to the European Nucleotide Archive (ENA). Experiment metadata and processed data are stored in ArrayExpress. The ENA uses an XML submission format, which is generated from the MAGE-TAB spreadsheet and submitted directly to the ENA. Curators develop online documentation and deliver tutorials at road shows and training events, encouraging communication between curators and submitters. As technologies for next generation sequencing constantly evolve, it is increasingly important to understand how data is produced and analysed. By establishing a line of contact with our submitters, we are able to develop ontologies and submission tools that better serve our users. |
Keays M*, Hastings E, Williams E, Farne A, Tang A, Ternent T, Kurbatova N, Burdett T, Malone J, Faulconbridge A, Welter D, Jupp S, Ison J, Dylag M, Emam I, Kapushesky M, Petryszak R, Kolesnikov N, Mani R, Pilicheva E, Brandizi M, Brazma A, Sarkans U, Parkinson H
*European Bioinformatics Institute United Kingdom |
D - Databases, Ontologies, and Text Mining |
|
D 10
D10 |
Knowledge of the location of a protein in a cell is an important step in discovering its function. Computational prediction offers a quick, economical and automatic method for identifying the subcellular location of proteins.
The inclusion of predicted protein localization data in the 1.4 release of LeishCyc, a metabolic database based on the BioCyc framework, allows a number of questions regarding the accuracy of predictors on Leishmania major and related organisms to be considered. The research investigates the accuracy of the predictions included in the database, as well as predictions from other contemporary programs, and makes recommendations for the use of predicted data in the context of the existing LeishCyc ontologies for cell component and evidence, in order to enhance the curation of this important metabolomic resource on Leishmania. |
Lonsdale A*
*University of Melbourne Australia |
D - Databases, Ontologies, and Text Mining |
|
D 11
D11 |
Microarrays are the main technology for large-scale transcriptional gene expression profiling, but the large bodies of data available in public databases are not directly usable because of their large heterogeneity. We have created a methodology to build expression compendia that are unique in directly combining data from different technological platforms. We have constructed comprehensive organism-specific cross-platform expression compendia for three bacterial model organisms (Escherichia coli, Bacillus subtilis, and Salmonella enterica serovar Typhimurium) and made them publicly available through an access portal, dubbed COLOMBOS, which provides a suite of tools for exploring, analyzing, and visualizing the data within these compendia. It is freely available at http://bioi.biw.kuleuven.be/colombos. The compendia also incorporate extensive annotations for both genes and experimental conditions; these heterogeneous data are functionally integrated in the COLOMBOS analysis tools to interactively browse and query the compendia not only for specific genes or experiments, but also metabolic pathways, transcriptional regulation mechanisms, experimental conditions, biological processes, etc. Additionally, we have invested in the development of a compendia creation and management system: automated retrieval and parsing of experiments from GEO and ArrayExpress, guided sample annotation, and data homogenization consisting of various normalization pipelines. This management system enables us to add compendia for new organisms in future releases, as well as keep the existing ones up to date with more recently published expression data. |
Meysman P*, Engelen K, Fu Q, Sanchez-Rodriguez A, Marchal K
*KU Leuven Belgium |
D - Databases, Ontologies, and Text Mining |
|
D 12
D12 |
A fundamental step in processing biomedical documents at the semantic level is the annotation of documents with ontology concepts, which can also be seen as a classification task. However, the ambiguity of terms is inherent in almost every natural language and frequently affects precision and recall in domain-agnostic approaches for retrieval or classification. In this paper we analyze the role of term ambiguity in biomedical document annotation and we address it efficiently by presenting a new automated and robust method, based on a Maximum Entropy approach, for annotating biomedical literature documents with terms from the Medical Subject Headings (MeSH).
Initial experimental evaluation shows that the suggested approach for annotating biomedical documents with MeSH terms is highly accurate, robust to the ambiguity of terms, and can provide very good performance even when a very small number of training documents is used. More precisely, we show that the proposed algorithm obtained an average F-measure of 92.4% (precision 99.41%, recall 86.77%) for the full range of the explored terms (4,078 MeSH terms), and that the algorithm's performance is resilient to term ambiguity, achieving an average F-measure of 92.42% (precision 99.32%, recall 86.87%) in the explored MeSH terms which were found to be ambiguous according to the Unified Medical Language System (UMLS) thesaurus. Finally, we compared the results of the suggested methodology with a Naive Bayes and a Decision Trees classification approach, and we show that the Maximum Entropy based approach achieved a higher F-measure for both ambiguous and monosemous MeSH terms. |
Tsatsaronis G*, Kissa M, Mönnich J, Schroeder M
*BIOTEC, TU Dresden Germany |
D - Databases, Ontologies, and Text Mining |
|
D 13
D13 |
RNA interference (RNAi) allows the systematic investigation of loss-of-function phenotypes on a genome-wide scale, providing a valuable source of functional gene annotation. Assays applied in RNAi screening experiments range from visible phenotypes to transcriptional readouts, in cell-based or in vivo studies.
The GenomeRNAi database collects and makes available published RNAi phenotype data. Structured annotation guidelines facilitate comparability of the data. Currently (release 7.0) the database provides phenotype data from 124 cell-based experiments in Homo sapiens, as well as 158 screens in Drosophila melanogaster, 50 of which were performed in vivo. The database also contains detailed information on RNAi reagents, including calculations on specificity and efficiency. The GenomeRNAi web interface (www.genomernai.org) allows browsing through RNAi screens or searching by gene, reagent or phenotype. Download options for individual RNAi experiments or the entire set of screens are available. Links to and from external resources are enabled via common identifiers such as Ensembl, UniProt, Flybase or CG. A GenomeRNAi DAS server has been implemented, enabling the visualization of phenotype and reagent data in their genomic context. Based on DAS technology and the Dalliance genome browser, we are implementing a graphical overview over phenotypes and reagents, along with RNASeq expression data from human and Drosophila cell lines. We encourage data submission by authors to facilitate the curation process. To this end we have implemented an author login space, allowing download of the submission template and upload of RNAi screening data. An update on curation progress and new functionalities of the website will be presented. |
Schmidt E*, Pelz O, Buhlmann S, Dhamodaran A, Yserentant K, Kerr G, Koch M, Kling A, Wiegand T, Boutros M
*German Cancer Research Center (DKFZ) Germany |
D - Databases, Ontologies, and Text Mining |
|
D 14
D14 |
Previous work has shown that the statistical analysis of biomedical text can produce novel models for biomarker discovery in several cancer types. The increasing volume of biomedical literature, e.g., PubMed-indexed articles, constitutes a huge source for applying text mining and predicting trends and new biomedical terminology. In this work we explore, for the first time to the best of our knowledge, the application of temporal language models to the PubMed-indexed literature, in order to identify trends in terms. The suggested methodology comprises three steps: (i) training of temporal language models using a parametric window of time, (ii) application of the generated temporal language models to unseen scientific literature in order to mine the properties of the underlying terms and classify the literature in time windows, and (iii) extraction of new biomedical terms that are suggested by the temporal models as terms following an ascending trend, and which may play a very important role in the future, e.g., biomarkers. Initial experimental evaluation for steps (i) and (ii), considering all of the biomedical literature indexed by PubMed since 1970, shows that the suggested temporal language models can train efficiently when time windows of a three-year time span are used, and ten-fold cross validation analysis shows that we can accurately predict the time window of a scientific paper with an F-measure of almost 77%, considering only title, abstract, and MeSH headings. |
Gkorgkas O, Tsatsaronis G*, Varlamis I, Nørvåg K
*BIOTEC, TU Dresden Germany |
D - Databases, Ontologies, and Text Mining |
|
D 15
D15 |
Protein structure and function are mutually dependent. There are studies showing that functional constraints limit structure divergence, however even small changes in structure may significantly influence protein function. The hierarchical structure of Gene Ontology allows quantifying similarity of protein functions by applying algorithms for calculating semantic similarity. Those similarities can be used to discover evolutionary relationships between proteins or aid proper protein classification. However their usefulness is limited since semantic similarity lacks objective ground truth values and the significance thresholds of available algorithms have not been investigated. Hence, the results cannot be properly interpreted and applied.
The goal of the project is to explore the dependence of protein structure and function in various protein families using information which results from semantics of protein GO annotations. Here we present a method for assessing the significance of information held by semantic similarity, based on its reference distribution. The procedure was tested on four large GO annotation datasets and their subsets (with controlled redundancy and GO terms evidence codes). Using this procedure, for Wang and Resnik semantic similarity algorithms, we estimated the limiting values above which the similarity of protein functions is non-random. We showed that the procedure may be used as a benchmark for comparing different approaches to measuring semantic similarity. We investigated the relation between structural and functional similarity for a representative set of protein PDB structures, confirming estimated thresholds. |
Konopka B*, Golda T, Kotulska M
*Institute of Biomedical Engineering and Instrumentation, Wroclaw University of Technology Poland |
D - Databases, Ontologies, and Text Mining |
|
D 17
D17 |
Motivation: Transcription factor (TF) binding site models, often called binding profiles or binding motifs, are crucial for studying transcriptional regulatory networks by bioinformatics methods. Existing collections of transcription factor binding site (TFBS) models, such as JASPAR and TRANSFAC, often contain several redundant models for a single transcription factor. Such redundant models often arise from different types or subsets of experimental data. A model bias from one type of experimental data may be partially corrected by integration of binding sequences from various experiments. At the same time, this would reduce the number of redundant models for a particular TF, which is more convenient for practical use.
Results: We present the HOmo sapiens COmprehensive MOdel COllection (HOCOMOCO) of manually-curated TFBS models constructed by the integration of binding sequences obtained by pregenomic and modern high-throughput methods. To construct position weight matrices we used ChIPMunk software. Motif discovery was performed in different computational modes including those accounting for periodic positional prior associated with DNA helix pitch. We selected only one TFBS model per TF unless there was clear existing evidence of two distinct TFBS models, such as experimentally verified dimerization. Each selected TFBS model was then rated according to its manually checked quality. HOCOMOCO contains 426 systematically checked TFBS models for 401 human TFs. Availability: http://autosome.ru/HOCOMOCO http://cbrc.kaust.edu.sa/hocomoco/ http://autosome.ru/ChIPMunk |
Kulakovskiy I*, Medvedeva Y, Kasianov A, Vorontsov I, Schaefer U, Bajic V, Makeev V
*Vavilov Institute of General Genetics Russian Federation |
D - Databases, Ontologies, and Text Mining |
|
D 18
D18 |
The GENCODE consortium aims to identify all gene features in the human genome, using a combination of computational analysis, manual annotation and experimental validation of selected transcripts. Although the number of protein-coding genes has been relatively steady since the first release, the number of transcripts per protein-coding locus in GENCODE annotation has gradually increased from an average of 4.8 to 6.9, and so has the number of distinct translations (29% increase). GENCODE also contains the most comprehensive annotation of long non-coding RNA (lncRNA) loci publicly available, currently totalling 11,790. Unlike protein-coding loci, the number of lncRNA loci is likely to continue to increase as new RNAseq-derived tissue-specific datasets are incorporated into the annotation.
The number of splice variants per locus is significantly higher in GENCODE in comparison with other public gene sets in both protein-coding and lncRNA genes. GENCODE covers 81% of RefSeq and 67% of UCSC transcripts, whereas 109,000 GENCODE transcripts (68%) are not present in RefSeq or UCSC. Almost 40% of those are protein-coding and give rise to 39,000 unique translations. The GENCODE data release cycle is coupled to the tri-monthly Ensembl releases. Each release contains updated gene sets where new data from the HAVANA manual annotation has been integrated with the refined Ensembl automated gene set. GENCODE is publicly available from the gencodegenes.org website where the main annotation files can be downloaded in GTF format. GENCODE data can also be visualized via the Ensembl and UCSC genome browsers or accessed through the Ensembl databases, Perl API and BioMart. |
Gonzalez JM*, Tapanari E, Harrow J, The GENCODE Consortium
*Wellcome Trust Sanger Institute United Kingdom |
D - Databases, Ontologies, and Text Mining |
|
D 19
D19 |
Biology-focussed e-resources such as databases and software are central to computational biology and in many areas of biological research. It is therefore essential to explore this "resourceome" to understand what resources are available and how they are used. Attempts have been made to support lists of resources used in bioinformatics, but they suffer from a need for manual maintenance and are never complete.
We have developed a named entity recogniser for the recovery of bioinformatics databases and software from the literature enabling the observation of usage patterns, and facilitating cataloguing, within biological data analysis. In an evaluation set of 30 full-text articles, we have successfully recognised bioinformatics databases and software with an F-measure between 65% and 85%. High ambiguity in resource naming, in combination with the on-going introduction of new resources, prevented a higher F-measure. We analysed articles from Genome Biology and BMC Bioinformatics for database and software usage. General patterns reflect the remit of these journals, with BMC Bioinformatics' emphasis on new tools (high temporal fluctuations of resource mentions) and Genome Biology's greater emphasis on data analysis (low fluctuation). More specifically, Genome Biology papers contain a higher proportion of data analysis tool mentions (e.g. Galaxy). Interestingly, R and Gene Ontology have recently joined BLAST and GenBank as the main players in bioinformatics resources. We highlight more of these resource usage trends. Our results demonstrate resource retrieval feasibility on a large-scale from the scientific literature, ultimately enabling more efficient identification, sharing and reuse of scientific "best practice". |
Duck GJ*, Brass A, Nenadic G, Robertson D, Stevens R
*The University of Manchester United Kingdom |
D - Databases, Ontologies, and Text Mining |
|
D 20
D20 |
Thyroid cancer (TC) is the most common endocrine tumour, which has seen a steady increase in incidence in the last decade. However, we are still lacking an understanding of the underlying molecular mechanisms of TC which can influence "targeted therapeutics". A number of biological pathways have been mentioned in the TC-related literature, referring to information about molecular interactions. To systematically identify biological pathways that are involved in different types of TC, we constructed a corpus of 12,050 MEDLINE abstracts through a carefully designed search. We created a generic list of biological keywords related to pathways and extracted 1,402 sentences that contained them. However, some of them can be relatively generic mentions that do not contain information about TC-specific molecular mechanisms. We hypothesized that sentences containing both pathway-related keywords and gene name occurrences were more likely to be useful. We have therefore run two named entity recognition (NER) systems for genes, BANNER and GeneTUKit, and divided the 1,402 pathway-related sentences into two sets: with or without detected gene mentions (1,102 vs. 300). Pathway mentions were then extracted by a rule-based method from the sentences that contained genes. On a randomly selected set of 200 sentences, both precision and recall were around 85%. The main issues that prevented higher performance were gene mentions not recognized by the NER tools and ambiguous references. Still, our overall results indicate that combining gene names with pathway keywords is a promising approach for the systematic detection of molecular interaction mechanisms in biomedical abstracts. |
Wu C*, Nenadic G, Schwartz J
*University of Manchester United Kingdom |
D - Databases, Ontologies, and Text Mining |
|
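The sentence-filtering step in abstract D20 (keep sentences with a pathway keyword, then split them by whether a gene is mentioned) can be sketched in Python as follows. This is only an illustration under stated assumptions: the keyword list and the small gene lexicon are hypothetical stand-ins, and a plain lexicon lookup replaces the BANNER and GeneTUKit NER systems used in the study.

```python
import re

# Hypothetical keyword list; the study's generic keyword list is not given.
PATHWAY_KEYWORDS = ("pathway", "signaling", "signalling", "cascade")

# Illustrative stand-in for the output of a gene NER system.
GENE_LEXICON = {"BRAF", "RET", "RAS", "PTEN"}


def split_by_gene_mention(sentences, keywords=PATHWAY_KEYWORDS, genes=GENE_LEXICON):
    """Keep sentences containing a pathway keyword and split them into
    (with_gene_mentions, without_gene_mentions)."""
    with_genes, without_genes = [], []
    for sentence in sentences:
        lowered = sentence.lower()
        if not any(keyword in lowered for keyword in keywords):
            continue  # not a pathway-related sentence
        tokens = set(re.findall(r"[A-Za-z0-9]+", sentence))
        if tokens & genes:
            with_genes.append(sentence)
        else:
            without_genes.append(sentence)
    return with_genes, without_genes


if __name__ == "__main__":
    demo = [
        "BRAF mutation activates the MAPK pathway.",
        "Several pathways are altered in thyroid cancer.",
        "Incidence has increased in the last decade.",
    ]
    print(split_by_gene_mention(demo))
```

In the study, the rule-based extraction of pathway mentions was then applied only to the first of the two returned sets.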
D 21
D21 |
The present work was carried out towards the creation of an enzyme-specific database highlighting industrial and pharmaceutical applications. We present the development of such a specialized database for lipases (Enzyme Classification: 3.1.1.3; true lipases), the focus of our research. Our true lipase database is a relational database comprising 1045 sequences, 115 structures, and related experimental and bibliographic information on lipases. Classification and structuring of the data is the central point, as classifying an enzyme into a family and further into a sub-family helps to understand or predict the biochemical activity of a newly sequenced or uncharacterized protein. The use of the non-alignment-based sub-family classification implemented in CLUSS adds to the uniqueness and simplicity of our approach, as alignment-based classification still struggles to yield biologically plausible results for hard-to-align sequences. The sub-family classification reflects clustering of sequences based on phylogeny, physico-chemical properties and enzymatic activity. For instance, sequences from staphylococcal sources formed 2 clusters that were grouped together, in which a phospholipase was isolated, and lipases involved in flagellar biosynthesis were grouped as a single cluster. Motif analysis based on the sub-family classification may provide insight into which amino acids make a protein a lipase at the sequence level. Online access to the database is being implemented and has been upgraded to provide homolog-based navigation of the annotation for multiple species concurrently, with comprehensive and integrated information. |
Patra S*, Saravanan P, Chakravorty D
*Indian Institute of Technology, Guwahati India |
D - Databases, Ontologies, and Text Mining |
|
D 22
D22 |
The Protein Circular Dichroism Data Bank (PCDDB) (http://pcddb.cryst.bbk.ac.uk) is a resource for circular dichroism (CD) spectroscopic data and accompanying metadata, with links to sequence and structure data bases and citation references. The PCDDB provides a repository for spectroscopic data to parallel that of the long-established Protein Data Bank (PDB) for crystallographic and NMR data. The PCDDB is a searchable data bank of CD spectra, with associated tools and protocols for spectral matching, analyses and back-calculations available as part of the overall resource. It also includes validation software (ValiDichro) in order to verify and maintain the high quality of data incorporated into the resource. The PCDDB accommodates both conventional (lab-based) CD data and synchrotron radiation circular dichroism (SRCD) data. As a facility for data-sharing, it provides a simple means of fulfilling granting body and journal publication requirements. The data bank has potential for use in a wide range of bioinformatics and molecular biology applications, including methodology development, comparisons of modelled and experimental structures, identification of new protein folds, and comparisons of wild type and mutant proteins. The first PCDDB release (December 2009) consisted of the 71 proteins that comprise the SP175 reference dataset (Lees et al. (2006) Bioinformatics 22:1955-1962) that is widely used for CD analyses; since then new entries have included individual soluble and membrane proteins, as well as thermal denaturation series. More than 90000 files have been downloaded thus far by researchers across a wide range of disciplines. |
Whitmore L*, Woollett B, Miles A, Klose D, Janes R, Wallace B
*Birkbeck College, University of London United Kingdom |
D - Databases, Ontologies, and Text Mining |
|
D 23
D23 |
Cross-contamination of human cell lines is a frequent cause of scientific misrepresentation. A short tandem repeat profile reference standard was proposed in 2001 by the leading cell banks and five large research institutes, who tested 253 human cell lines. The editorial “Obligation for cell line authentication: Appeal for concerted action” published in 2010 in the Intl Journal of Cancer encouraged journals and funding agencies to require proper authentication of all cell lines. STR authentication is now required by many leading oncological journals.
An integrated reference system, able to link real cell lines maintained by different collections with authoritative molecular characterizations, was needed. The Cell Line Integrated Molecular Authentication database (CLIMA) links available authentication data to actual cell lines distributed by cell banks. It was designed in the frame of the Cell Line Data Base, a well-known reference information source on human and animal cell lines (see http://bioinformatics.istge.it/cldb/). CLIMA includes STR profiling obtained by using different platforms; the end users can retrieve data on cell line authentications performed by different cell banks. CLIMA currently includes information on 1,294 cell line names and 1,737 distinct authentication assays. Access to CLIMA is provided at the following URL: http://bioinformatics.istge.it/clima/. Search is possible either by cell line name or by locus values. Outputs are linked to the cell banks where authentication was performed, to the STRdb, where information on loci is available, and to the literature and web sites from which data were retrieved. |
Romano P*, Aresu O, Parodi B
*IRCCS AOU San Martino – IST Istituto Nazionale per la Ricerca sul Cancro Italy |
D - Databases, Ontologies, and Text Mining |
|
D 24
D24 |
The KB-Rank tool is used to identify the functions of protein structures via text query. It is available on the web at the URL <http://protein.tcmedc.org/KB-Rank/> and is part of the set of utilities provided for research with the information provided through the Structural Biology Knowledgebase, <http://sbkb.org>. Queries are enabled in which a user can perform searches for structures associated with specified diseases. In the current study, further information from biomedical databases that house annotations on proteins from humans and model organisms was assembled. The associations between proteins with human health and disease phenotypes were thereby expanded. As a result, disease associations of protein structures can be identified in a more effective and comprehensive manner. Diverse functional and structural annotations of the proteins provide a means to rank the structures retrieved according to their relevance to the queried disease. Protein structures that are highly associated with the disease are displayed relatively high in the results pages. An application of such searches is to identify protein structures that are highly associated with a specified disease. These structures can serve as putative targets in structure-based drug design strategies that are aimed to treat the disease in question. The KB-Rank tool has a user-friendly and interactive interface. For the list of structures retrieved, a color stripe is plotted to gauge their relative relevance to the query. The color can be used to gauge the relevance of functional annotation categories as well. Numerical values of the rank scores of protein structures and annotation categories associated with the query are also provided.
This work was supported in part by general funds provided by The Commonwealth Medical College and by a sub-contract of the PSI Structural Biology Knowledgebase from the National Institute of General Medical Sciences (U01 GM093324). |
McLaughlin W*, Julfayev E, Tao Y, McLaughlin R
*The Commonwealth Medical College United States of America |
D - Databases, Ontologies, and Text Mining |
|
D 25
D25 |
ABSTRACT
Ontology-based annotation of sequence-related information has tremendous potential for intelligent querying of this information, particularly when interfacing with other bio-ontologies. The Sequence Ontology [1] and its “lite” version SOFA were developed for this purpose, and the inclusion of the SOFA field in GFF3 data sets has made ontology-based sequence annotation a reality. Querying bio-ontologies has been shown to be feasible [2] using SPARQL, a W3C standard language for querying RDF-based ontologies. A novel browser has been developed to expose GFF3 data as a SPARQL endpoint, query the data, and display the results in a web-based browser. Performance is kept acceptable for interactive exploration of the data by using the principle of semantic zooming: only the subset of the features that are visible at a certain zoom level is queried. This (subset of) feature data is internally exposed as RDF data by transforming the data into triples and referring to an RDF translation of SOFA. The browser also shows “static” context material, such as ENSEMBL tracks for genes and transcripts. The proof-of-concept browser has restricted functionality, but serves the following purposes in the context of a more elaborate project:
• Demonstrate the feasibility of querying large GFF datasets using SPARQL
• Identify needs for a sequence feature query language that cannot be fulfilled by standard SPARQL
• Identify possibilities for interfacing with other bio-ontologies
The poster will be accompanied by a publicly available companion website (not yet available) where the browser can be tested.
REFERENCES
1. Eilbeck K, Lewis S, Mungall C, Yandell M, Stein L, Durbin R, Ashburner M: The Sequence Ontology: a tool for the unification of genome annotations. Genome Biol 2005, 6:–.
2. Antezana E, Blondé W, Egaña M, Rutherford A, Stevens R, De Baets B, Mironov V, Kuiper M: BioGateway: a semantic systems biology tool for the life sciences. BMC Bioinformatics 2009, 10:S11. |
Devisscher M*, Boulahia C, Trooskens G, De Meyer T, Van Criekinge W, Dawyndt P
*Genohm Belgium |
D - Databases, Ontologies, and Text Mining |
|
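The GFF3-to-triples step and the semantic zooming filter described in the SPARQL browser abstract above can be sketched roughly as follows. This is a minimal illustration, not the browser's actual code: the `ex:` predicates, the `BASE` namespace, the `min_length` zoom rule and both function names are assumptions made for the example.

```python
# Hypothetical namespace for feature URIs; not taken from the poster.
BASE = "http://example.org/feature/"
SO = "http://purl.obolibrary.org/obo/SO_"

# Small lookup of Sequence Ontology accessions for a few feature types.
SO_IDS = {"gene": "0000704", "mRNA": "0000234", "exon": "0000147"}


def visible_lines(lines, view_start, view_end, min_length):
    """Semantic zooming (assumed rule): keep only features that overlap
    the current window and are long enough to be drawn at this zoom level."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip GFF3 directives, comments and blank lines
        cols = line.split("\t")
        start, end = int(cols[3]), int(cols[4])
        overlaps = start <= view_end and end >= view_start
        if overlaps and (end - start + 1) >= min_length:
            yield line


def gff3_to_triples(line):
    """Turn one GFF3 feature line into (subject, predicate, object) triples."""
    cols = line.rstrip("\n").split("\t")
    seqid, _source, ftype, start, end = cols[:5]
    attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if kv)
    subject = BASE + attrs["ID"]
    return [
        (subject, "rdf:type", SO + SO_IDS[ftype]),
        (subject, "ex:seqid", seqid),       # hypothetical predicate
        (subject, "ex:start", int(start)),  # hypothetical predicate
        (subject, "ex:end", int(end)),      # hypothetical predicate
    ]


gff = [
    "chr1\tsrc\tgene\t100\t900\t.\t+\t.\tID=g1",
    "chr1\tsrc\texon\t100\t150\t.\t+\t.\tID=e1",
]
# Zoomed out: the 51 bp exon is too short to draw, so only the gene is converted.
triples = [t for line in visible_lines(gff, 1, 1000, 200)
           for t in gff3_to_triples(line)]
```

Filtering before conversion keeps the number of triples handed to the SPARQL engine proportional to what is on screen, which is the point of semantic zooming as the abstract describes it.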
D 26
D26 |
Comprehensive and accurate annotation of protein post-translational modifications (PTMs) in UniProtKB is an essential requirement in the field of proteomics. To facilitate PTM annotation, we have developed a text mining tool which retrieves PubMed abstracts for the most frequently annotated PTMs in UniProtKB and extracts information on PTM type and site using a pattern-matching and rule-based approach. The method performance was tested on an annotated corpus of abstracts. Depending on the PTM type, the recall varies between 75 and 95% and the precision between 44 and 94%.
The procedure is used to track new publications with PTM information. It provides Swiss-Prot curators with a set of abstracts with highlighted terms such as the type of PTM, the modified amino acid and its position in the protein sequence. It also tags gene/protein mentions in the text and provides a ranked list of candidate proteins carrying the modification. It has also been applied to recover additional supporting evidence for PTMs detected by high-throughput methods. |
Veuthey A*, Bridge A, Bougueleret L, Xenarios I
*Swiss Institute of Bioinformatics Switzerland |
D - Databases, Ontologies, and Text Mining |
|
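A pattern-matching step of the kind described in poster D26 might look like the sketch below. The regular expression, the short list of PTM types and residues, and the function name `extract_ptms` are illustrative assumptions; the actual tool's patterns and rules are not given in the abstract.

```python
import re

# Illustrative pattern only: PTM noun, a linking preposition, then a
# three-letter residue code with an optional separator and a position.
PTM_PATTERN = re.compile(
    r"(?P<ptm>(?:phosphoryl|acetyl|methyl|ubiquitin)ation)"
    r"\s+(?:of|at|on)\s+"
    r"(?P<res>Ser|Thr|Tyr|Lys|Arg)[- ]?(?P<pos>\d+)",
    re.IGNORECASE,
)


def extract_ptms(text):
    """Return (PTM type, residue, position) tuples found in an abstract."""
    return [
        (m.group("ptm").lower(), m.group("res").capitalize(), int(m.group("pos")))
        for m in PTM_PATTERN.finditer(text)
    ]


abstract = ("We show that phosphorylation at Ser-473 activates the kinase, "
            "while acetylation of Lys120 is reduced.")
hits = extract_ptms(abstract)
```

A single rigid pattern like this favours precision over recall; a production rule set would need many more phrasings, which is consistent with the per-PTM variation in recall and precision reported in the abstract.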
D 27
D27 |
Numerous protein-protein interaction (PPI) data are provided by using new high-throughput experimental and computational techniques; they are being collected in different databases. The data generally do not contain phenotypic or even functional or structural information about the interactors, which in many cases are available in other databases. Thus, to have widespread coverage, it is necessary to combine the data from different databases.
For this purpose, we are developing a framework to create and maintain a data warehouse on the basis of a conceptual data model. We then applied an automatic association inference method based on the transitive closure concept. In particular, by leveraging IntAct and MINT PPI data, Entrez protein-encoding gene data and OMIM genetic disorder data, we inferred associations between proteins and genetic disorders and their phenotypes. In our data warehouse, 46,154 human PPIs involving 12,178 distinct human proteins were integrated. These human proteins are encoded by 11,232 different human genes. By applying the transitive closure concept, we identified 1,130 gene networks and found 1,136 human PPIs associated with 628 genetic disorders. The interactions between proteins that the transitive closure method associates with a specific disease will help researchers to focus on the protein interactions relevant to that disease. This may help to reveal diseases caused by malfunctioning protein interactions, and treatment strategies such as synthetic protein engineering could then possibly be applied. This hypothesis illustrates the importance of integrating PPI data with genetic disorder data. |
Canakoglu A*, Masseroli M
*Dipartimento di Elettronica e Informazione, Politecnico di Milano Italy |
D - Databases, Ontologies, and Text Mining |
|
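The transitive inference in poster D27 (interaction, then protein, then encoding gene, then genetic disorder) can be illustrated with a toy example. All identifiers below are made up, and the rule that an interaction inherits the disorders of either interactor is an assumption; the abstract does not state the exact rule used.

```python
# Toy data; identifiers are placeholders, not real IntAct, Entrez or OMIM entries.
ppi = [("P1", "P2"), ("P2", "P3")]
protein_gene = {"P1": "G1", "P2": "G2", "P3": "G3"}
gene_disorder = {"G1": {"D1"}, "G3": {"D1", "D2"}}


def protein_disorders(protein):
    """Follow protein -> gene -> disorder; empty set if any link is missing."""
    return gene_disorder.get(protein_gene.get(protein), set())


def ppi_disorders(interactions):
    """Associate each interaction with the disorders of either interactor
    (assumed rule), keeping only interactions with at least one disorder."""
    result = {}
    for a, b in interactions:
        linked = protein_disorders(a) | protein_disorders(b)
        if linked:
            result[(a, b)] = linked
    return result


associations = ppi_disorders(ppi)
```

In a data warehouse the same chain would typically be expressed as joins across the integrated tables; the dictionaries here only stand in for those relations.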
D 28
D28 |
Nanopublications have been proposed to solve the inherent problems of text mining and manual curation to gather, organize and exploit the data from the ever-growing number of biomedical publications. Nanopublications use RDF to formally represent scientific observations and claims. As a complementary element to traditional narrative articles, they ensure the quality and proper attribution of published data and help to keep track of provenance data. Using available RDF tools, nanopublications can be automatically interpreted, combined, checked and queried.
At the moment, nanopublications are restricted to a relatively narrow set of scientific statements, namely those that can be expressed with the existing established schemas and vocabularies. Many scientific claims (especially those involving uncertainty, intended vagueness, modal and deontic concepts, temporal aspects, and novel ideas) cannot be fully represented. RDF is extensible, but the development of accurate, useful and accepted schemas is a costly and slow process. We propose to extend nanopublications with underspecified statements that would cover any possible statement, no matter how difficult to formalize. The main part of such an underspecified statement is a sentence in plain English, which is given a URI and can be accompanied by a partial RDF representation. Though not fully formal, such statements can be uniquely identified, attributed, related to each other, and possibly be later given a full RDF representation. This extension drastically increases the application range for nanopublications and could boost their impact. We are currently performing a study on a sample of biomedical publications to evaluate the approach. |
Kuhn T*
*Yale University United States of America |
D - Databases, Ontologies, and Text Mining |
|
D 29
D29 |
The UniProt Knowledgebase (UniProtKB) is a freely accessible resource for functional information on proteins with accurate and comprehensive annotation. The number of organisms with completely sequenced and annotated genomes is increasing. UniProt endeavours to provide comprehensive protein-centric views of such genomes in the form of complete proteome sets. UniProt also defines reference proteomes, a subset of UniProtKB complete proteomes, which cover well-studied model organisms and other proteomes of interest for biomedical and biotechnological research and provide broad coverage of the tree of life.
A complete proteome set is defined as the entire set of proteins expressed by a specific organism. Complete proteome sets were first made available in UniProtKB release 2011_05 for human and mouse. This number has increased rapidly and 3097 will be made available in release 2012_07. Currently, the NCBI taxonomy identifier (taxID) of the organism is an integral part of the production process for making complete proteome sets available in UniProtKB. However, there is now an increasing number of cases where the same strain or isolate, having the same taxID, has been completely sequenced by more than one sequencing project. These duplicate sequencing projects pose new challenges for making complete proteome sets available through UniProtKB. This poster discusses the production process for complete and reference proteomes, ways of accessing them and the introduction of a new ‘proteome set’ identifier to distinguish between sequencing projects for the same organism and to group together nuclear and organellar proteomes obtained via separate sequencing projects, in organisms where this is appropriate. |
Jones R*, Bursteinas B, Pichler K, O’Donovan C, Martin M
*European Bioinformatics Institute United Kingdom |
D - Databases, Ontologies, and Text Mining |
|
D 30
D30 |
Background
Primary immunodeficiency diseases (PIDs) are a genetically heterogeneous group of immune disorders characterized by rare, recurrent or persistent infections and by susceptibility to certain tumors and autoimmunity. The main challenge for in silico genotype-phenotype correlation for any genetic disease is to standardize the phenotype ontology terms and the genotype.
Objective
Our main objective is to organize heterogeneous primary immunodeficiency disease (PID) phenotypic terms into systematic ontology structures that integrate gene, PID and mutation data in a semantically well-defined ontology and in standardized formats such as Web Ontology Language (OWL) and Resource Description Framework (RDF), using semantic web technology to share and exchange information freely among user communities.
Results
The phenotypes are collected from the Resource of Asian Primary Immunodeficiency Diseases - RAPID (http://rapid.rcai.riken.jp), a web-based compendium of molecular alterations in primary immunodeficiency diseases. As expected, phenotype data are full of synonymy, polysemy (diversity of meaning), ambiguity and complexity. To overcome these issues, systematic terminology standardization processes, including data collection, retrieval and mapping, are being implemented using an internal logic-based semi-automatic method to fetch selected cross-references to defined PID term-specific concepts from multiple ontologies such as NCI-Metathesaurus, SNOMEDCT, Human Phenotype Ontology (HPO), symptoms, human diseases and so forth. PhenomeR semantically integrates these standardized phenotype terms along with PID, genes and disease-causing variants into a relational ontology for inference of genotype-phenotype correlation. To our knowledge, PID PhenomeR (http://rapid.rcai.riken.jp/ontology/v1/phenomer.php) is the first initiative of this kind to integrate and interpret PID data in a web-based, user-friendly interface, working towards a community-driven semantic web technology. |
Mohan S*, Thankaswamy Kosalai S
*The Institute of Physical and Chemical Research (RIKEN) Japan |
D - Databases, Ontologies, and Text Mining |
|
D 31
D31 |
Manual annotation cannot keep up with enzyme sequence discovery. In this work, we modelled the use of active and guided learning to support enzyme function curation. We evaluated, on 5,750 E. coli proteins, nine strategies to sort instances for curation. We found that selecting sets of InterPro features in order of frequency of occurrence can cut the curation effort by almost two thirds, while maintaining very high accuracy and recall. The method can be applied to real-life datasets of millions of proteins thanks to its limited computational requirements, parallelisation, good coverage of rare classes and flexibility in selecting instances for annotation. |
De Ferrari L*, Mitchell J, Aitken S
*University of St Andrews United Kingdom |
D - Databases, Ontologies, and Text Mining |
|
D 32
D32 |
A substantial part of the AGRON-OMICS consortium is devoted to profiling the growing Arabidopsis leaf in a number of environmental conditions. The TiMet consortium studies the link between circadian clock and metabolism, focused on both primary- and isoprenoid-metabolism. These international multi-institute projects generate a diverse range of quantitative molecular and phenotypic data. Vital to our analytical pipelines are adaptable database integrations that exploit standard and advanced features of the MySQL database engine and tools. These implementations are utilized for the processes of data and meta-data capture, validation, the tracking of provenance, for certain statistical-, mathematical-, and structural data transformations, for integration with R and for generating visualizations. Our systems provide access controlled user workspaces and the ability to run high performance queries across multiple and some high volume data sets. Interpreting novel datasets also requires the integration of pre-existing knowledge and consequently a range of annotations and classifications are included. Where detailed annotations were lacking, the Knowtator tool was used for curating phenotype-genotype-environment relations using ontologies. A number of scientific use-cases are presented that demonstrate the pivotal role that coherent integration can play in data quality control, project management and data analysis. Since the database engine and tools are freely available, the data, code and documentation can be simply and rapidly replicated for community dissemination and/or extension. These developments provide a useful template for a computational platform that has analytical value during a project and beyond. |
Walsh S*, Baerenfaller K, Graf A, Coman D, Hirsch-Hoffmann M, Kartal Ö, Sulpice R, Szakonyi D, Zielinski T, Granier C, Stitt M, Millar A, Hilson P, Gruissem W
*Eidgenössische Technische Hochschule Zürich Switzerland |
D - Databases, Ontologies, and Text Mining |
|
D 33
D33 |
The considerable growth in DNA sequencing efforts has resulted in numerous sequence datasets deposited in public databases describing microbial communities in many natural or human-made environments of industrial importance. However, for understanding biology and interpreting sequence data, the accompanying contextual metadata on sample environmental characteristics deposited along with the sequence data is crucial. Public databases are the source of frequently incomplete datasets that are retrieved by various search strategies, while the metadata are most often compiled from original publications. Following MIMARKS specifications (Yilmaz et al. 2011a, b), a novel repository, MC-LIRE, is being introduced. Quality-checked sequence files with accompanying metadata are organized for querying across organisms, genes, environment types or metadata according to user selection. As a proof of principle, the following compilations were assembled (n > 400): (i) bacterial and archaeal anaerobic microbial communities involved in the methane production industry (biogas as a renewable resource); (ii) rumen (bovine, sheep, goat, wild ruminants); (iii) rice fields; (iv) oligotrophic bacterial communities in deep-subsurface caves; (v) extreme cold-soil and temperate soil bacterial communities. The project fills the gap in the development of an integrated open community resource that enables sharing of organized, complete sequence datasets and accompanying metadata, ready for large-scale comparative and systems biology analyses by the global scientific community and industry.
Yilmaz et al. (2011a) Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications. Nature Biotechnology 29:415-420. Yilmaz et al. (2011b) The genomic standards consortium: bringing standards to life for microbial ecology. ISME J 5:1565-1567. |
Murovec B*, Stres B
*University of Ljubljana / Faculty of Electrical Engineering Slovenia |
D - Databases, Ontologies, and Text Mining |
|
D 34
D34 |
MicroRNAs (miRNAs) are short RNA molecules which are involved in the regulation of gene expression by binding to mRNAs, usually resulting in translational repression or mRNA degradation. Only a small fraction of the expected total number of miRNA-mRNA interactions has been experimentally validated; these validated interactions are stored in dedicated databases. However, in the last few years there has been an intense proliferation of predictive algorithms to determine the targets of these non-coding RNAs. These algorithms take into account the complementarity of the two molecules, their structure and the thermodynamics of the binding process, and assign a score to every predicted interaction. All the details have been grouped and published in public databases.
Using the existing predictive algorithms, we have measured the confidence for each of the interactions by estimating the precision of the prediction when compared to the experimentally validated information. We have also created a new combined predictive database that contains all the predictions calculated by the existing algorithms, giving every interaction a new combined score and its statistical confidence. This global score allows us to combine several databases without low-performing databases dragging down well-performing ones. The combined database uses miRNA targets from four sources containing experimentally validated interactions: Tarbase, miRTarBase, miRWalk and miRecords. Predicted miRNA-mRNA interactions were retrieved from nine different algorithms: EIMMo, DIANA-microT, Microcosm, Microrna.org, TargetScan, Mirtarget, PITA, miRWalk-predictive and TargetSpy. The database is available at: http://mmmRNA.cnb.csic.es |
Tabas-Madrid D*, Martinez-Herrera DJ, Sanchez Caballero I, Pascual-Montano A
*Spanish National Center for Biotechnology - CSIC Spain |
D - Databases, Ontologies, and Text Mining |
|
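To make the idea in poster D34 concrete, the sketch below estimates a per-algorithm precision against a validated set and then combines the algorithms that predict a given interaction. The abstract does not say how the combined score is computed; the noisy-OR rule used here is an assumption chosen because adding a low-precision algorithm can never lower the score. All data and names are invented for illustration.

```python
def precision(predicted, validated):
    """Fraction of an algorithm's predictions found in the validated set."""
    if not predicted:
        return 0.0
    return len(predicted & validated) / len(predicted)


def combined_score(interaction, predictions, precisions):
    """Noisy-OR combination (assumed rule, not the poster's formula):
    1 minus the product of (1 - precision) over the algorithms that
    predict this interaction."""
    miss = 1.0
    for name, pairs in predictions.items():
        if interaction in pairs:
            miss *= 1.0 - precisions[name]
    return 1.0 - miss


# Invented toy data: (miRNA, target) pairs.
validated = {("miR-1", "A"), ("miR-1", "B")}
predictions = {
    "algoX": {("miR-1", "A"), ("miR-1", "B"), ("miR-1", "C"), ("miR-1", "D")},
    "algoY": {("miR-1", "A"), ("miR-1", "C")},
}
precisions = {name: precision(pairs, validated)
              for name, pairs in predictions.items()}
score_c = combined_score(("miR-1", "C"), predictions, precisions)
score_d = combined_score(("miR-1", "D"), predictions, precisions)
```

Because the validated set is small compared with the true number of interactions, precision measured this way is likely an underestimate, so such scores are best read as relative rankings.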
D 35
D35 |
BRENDA (BRaunschweig ENzyme DAtabase, http://www.brenda-enzymes.org/) is the major database for enzyme functional information [1]. The manual collection of data from the primary literature and the curation of the database were started 25 years ago. Since then it has been continuously updated and further developed to meet the requirements of scientists. The database covers many aspects of enzymology such as nomenclature, enzyme-catalyzed reactions, kinetic data for catalysis and enzyme inhibition, enzyme stability, purification, crystallization and mutations. Each data entry is connected to an organism name, to the literature reference and to the protein sequence identifier (if available).
Since the huge number of publications on enzyme properties does not allow manual annotation of the complete literature for all enzymes, additional information is retrieved by text mining procedures. The procedures are based on the interpretation of sentences in which enzyme and organism names, localization, kinetic expressions, and sources and tissues occur in abstracts and titles of the PubMed database [2]. FRENDA (Full Reference ENzyme DAta) aims at providing an exhaustive collection of indexed literature references containing organism-specific enzyme information. AMENDA (Automatic Mining of ENzyme DAta) comprises enzyme-specific information on the enzyme source based on the vocabulary of the BTO and on the subcellular localization based on GO terms. DRENDA (Disease RElated ENzyme DAta) provides broad information on the connection between diseases and enzymes. It is based on the classified enzymes of BRENDA and the MeSH terms for diseases [3]. KENDA is a dictionary-based text mining approach which extracts kinetic parameters of 13 categories. |
Placzek S*, Schomburg I, Chang A, Schomburg D
*Institute of Bioinformatics and Biochemistry, Technische Universität Braunschweig Germany |
D - Databases, Ontologies, and Text Mining |
|
D 36
D36 |
The UniProt Knowledgebase (UniProtKB) is a comprehensive resource for protein sequences and annotation data. UniProtKB consists of two sections: UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. UniProtKB/Swiss-Prot holds protein entries that are manually annotated with information extracted from literature and curator-evaluated computational analyses.
UniProtKB/TrEMBL contains computationally analysed protein entries. Entries in UniProtKB are connected to various external data collections divided into 14 categories, such as the underlying DNA sequence entries, protein structure databases, protein domain and protein family databases, and species-specific and function/feature-specific data collections. As a result, UniProtKB acts as a central hub connecting biomolecular information archived in 147 cross-referenced databases (release 2012_07). Information on all linked databases is provided in files available on the UniProt FTP site. Statistics about cross-references can be found on the release statistics summary web page of each section. A third of the databases are updated at each release: external databases provide UniProtKB with a mapping file linking their stable identifiers to a UniProtKB accession number, and as soon as a new mapping file is provided, it is integrated into UniProtKB. Cross-references are very important because they provide specialized information from external resources about proteins in UniProt. Specific cross-references can be retrieved directly through the UniProt web site. For example, http://www.uniprot.org/uniprot/?query=organism%3A9606+AND+database%3A%28type%3Apdb%29&sort=score will retrieve all proteins with cross-references to the protein structures in PDB for the human proteome. The ID-mapping web page (http://www.uniprot.org/help/mapping) allows users to map a list of identifiers from an external database to UniProtKB and vice versa. Programmatic access to the database mapping service is also available using Perl, Python, Ruby or Java. |
Fazzini F*, Bely B
*EMBL - European Bioinformatics Institute United Kingdom |
D - Databases, Ontologies, and Text Mining |
|
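The example query URL quoted in poster D36 can be assembled programmatically, which makes the encoding of the query string easier to see. The sketch below only builds the URL in the form given in the abstract; the function name and parameters are illustrative, and it makes no claim about whether that URL scheme is still served by the current UniProt web site.

```python
from urllib.parse import urlencode


def uniprot_xref_query_url(taxon_id, xref_db,
                           base="http://www.uniprot.org/uniprot/"):
    """Build a query URL in the form quoted in the abstract:
    entries for one organism that carry cross-references to one database."""
    query = f"organism:{taxon_id} AND database:(type:{xref_db})"
    # urlencode percent-encodes ':' and parentheses and turns spaces into '+'.
    return base + "?" + urlencode({"query": query, "sort": "score"})


# Human (taxon 9606) entries with cross-references to PDB.
url = uniprot_xref_query_url(9606, "pdb")
```

Swapping `xref_db` for another database abbreviation would give the analogous query for that resource, assuming the abbreviation is one the site recognises.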
D 37
D37 |
Due to improved sequencing techniques, the number of identified protein sequences has increased exponentially in recent years. The UniProt Knowledgebase (UniProtKB) is a comprehensive resource that collects these data and their annotation and provides them to scientists all around the world. The database consists of two sections, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. UniProtKB/TrEMBL contains automatically analysed protein entries, whereas UniProtKB/Swiss-Prot holds protein entries that are manually annotated using information extracted from literature.
Every four weeks a new version of UniProtKB is released, containing updated and new data. We update UniProtKB/TrEMBL by taking new sequence information from INSDC (DDBJ, ENA, GenBank), Ensembl/EnsemblGenomes and PDB and integrating it into UniProtKB/TrEMBL. Furthermore, proteins belonging to a complete genome reference are flagged (see: UniProtKB complete/reference proteomes poster). To simplify access to additional information from other sources, we link 147 databases as cross-references to the protein sequences (see UniProtKB Cross-references poster). Additionally, protein annotation is added automatically to UniProtKB/TrEMBL by means of the computational tools SAAS and UniRule (see UniProtKB: Automatic annotation poster). After merging redundant entries, the number of different proteins decreases by about 11%. Independently, UniProtKB/Swiss-Prot is updated and extended from literature by human curators. Next, the data are written into flat files. To enable easy access and evaluation of the data, we process them into different formats (e.g. XML, RDF). Additionally, all services we provide for a large variety of scientific applications (e.g. Website, Blast, Biomart, FTP) are updated and released for public usage. |
Zellner H*, Sousa da Silva A, Bely B
*EMBL - European Bioinformatics Institute United Kingdom |
D - Databases, Ontologies, and Text Mining |
|
D 38
D38 |
The UniProt Knowledgebase (UniProtKB) is a comprehensive resource for protein sequences and annotation data. UniProtKB consists of two sections: UniProtKB/Swiss-Prot and UniProtKB/TrEMBL.
UniProtKB/Swiss-Prot holds protein entries that are manually annotated with information extracted from the experimental literature and curator-evaluated computational analyses, providing accurate sequence and functional annotation along with cross-references to other databases. UniProtKB/TrEMBL contains computationally analysed protein entries sourced from the INSDC databases, Ensembl and RefSeq, enriched with automatic annotation and classification. Although manual annotation of proteins is invaluable for the scientific community, it is labour intensive and, with UniProtKB/TrEMBL now containing more than 23 million entries, it is crucial to have a strategy for annotating these proteins in an efficient and scalable manner. Automatic annotation in UniProt is done through two systems: the Statistical Automatic Annotation System (SAAS) and UniRule. SAAS uses a C4.5 decision tree algorithm to generate rules, while UniRule consists of manually curated annotation rules. These rules are used in the UniProt Automatic Annotation System to propagate predicted annotation to UniProtKB/TrEMBL records. UniRule provides a variety of annotation types including protein names, functions, subcellular locations, interactions with other proteins and controlled vocabularies, sequence annotation including domains and residues of functional importance, and inferred family relationships. Rules are curated by identifying all the common sequence properties and annotation in UniProtKB/Swiss-Prot entries. In addition, these characteristics are mapped to existing InterPro member signatures; if signatures are not available, new ones are suggested to the consortium or, in some cases, we create them ourselves. To aid biocurators, we have developed an intuitive web-based application that facilitates the creation, storage and maintenance of the UniRule rules.
The tool has the built-in capability to access statistics on-the-fly, which indicate whether the modifications have positively or negatively influenced the validity and coverage of a rule. The tool allows biocurators to monitor the conditions and annotations to ensure that the annotation stays accurate for each release. At the time of writing (UniProtKB release 2012_08), around 40% of UniProtKB/TrEMBL entries are annotated automatically. |
Poggioli D*, UniProt consortium
*EBI United Kingdom |
D - Databases, Ontologies, and Text Mining |
|
D 39
D39 |
Biological experiments that deal with gene expression or with protein interactions produce large amounts of data. These data have to be placed in a biological context in order to help a biologist interpret results and draw relevant conclusions. We are developing a method which can characterize a data set with the help of the Gene Ontology (GO). We investigate which categories are most representative of a given data set, how subsets/clusters can be revealed on the basis of ontology terms, and how different sets can be distinguished on the basis of differences between their most representative GO categories.
Proteins are annotated on the basis of currently available knowledge, thus all levels of detail are possible. In order to make sense of annotations it is important to present them to the biologist in a proper way. A methodology is needed which helps biologists to decide which level of detail is important. Analyzing a given dataset, we can calculate how many proteins are annotated with each GO concept. On the basis of these genuine annotations we generate three ontology sub-graphs, one for each global GO hierarchy. These sub-graphs are built by a traversal which starts from the original concepts with which gene products have been annotated; each parental concept (superclass) is visited and enriched with the number of proteins present in its children. An ontology reasoner is used to find the parents, which helps to discover ontology superclasses that are not explicitly declared. When these sub-graphs are visualized, the biologist can choose the most important GO categories. |
Dmitrieva J*, Florea BI, Li N, Vandenbol M, Lins L
*Centre de Biophysique Moléculaire Numérique Université de Liège-Gembloux Agro Bio Tech Belgium |
D - Databases, Ontologies, and Text Mining |
|
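The upward traversal described in poster D39 can be sketched as below: each directly annotated concept passes its proteins to every superclass, and a protein is counted once per concept even when it reaches that concept by several paths. The tiny hierarchy and the term names are invented, and a plain dictionary of explicit parents stands in for the ontology reasoner mentioned in the abstract.

```python
# Invented mini-hierarchy: term -> set of direct parents (a DAG, like GO).
parents = {
    "child_a": {"mid"},
    "child_b": {"mid", "root"},
    "mid": {"root"},
    "root": set(),
}
# Direct ("genuine") annotations: protein -> annotated terms.
annotations = {
    "P1": {"child_a"},
    "P2": {"child_b"},
    "P3": {"child_a", "mid"},
}


def ancestors(term, parent_map):
    """All superclasses of a term, found by walking parent links upward."""
    seen, stack = set(), [term]
    while stack:
        for parent in parent_map.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate_counts(protein_terms, parent_map):
    """Number of distinct proteins at each term, including inherited ones."""
    proteins_at = {}
    for protein, terms in protein_terms.items():
        covered = set(terms)
        for term in terms:
            covered |= ancestors(term, parent_map)
        for term in covered:
            proteins_at.setdefault(term, set()).add(protein)
    return {term: len(members) for term, members in proteins_at.items()}


counts = propagate_counts(annotations, parents)
```

Collecting proteins in sets, not adding child counts, avoids double counting in a DAG where one term can have several parents.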
D 40
D40 |
Protein native state is better represented by an ensemble of conformers in equilibrium describing the conformational diversity or dynamism of a protein. Conformational diversity is a key feature to understand essential properties of proteins like function, enzyme and antibody promiscuity, enzyme catalytic power, signal transduction, protein-protein recognition and the origin of new functions. Crystallographic structures of the same protein obtained in different conditions can be considered as representative conformers of protein native state. This view is supported by the correlation found between the observed structural diversity determined by NMR experiments and those coming from different crystallographic structures.
In order to study how biological properties of proteins are associated with the extent of their conformational diversity, we developed a protein conformational database called CoDNaS. For this purpose we collected the redundant set of crystallized structures from the PDB database and obtained 9474 monomeric and homo-oligomeric proteins (accounting for a total of 40565 structures) representing putative conformers for each corresponding protein. Using an all vs. all structural alignment between the corresponding conformers of each protein, we defined the extent of conformational diversity as the maximum RMSD registered. By cross-linking our proteins with several databases we gathered a broad spectrum of biological and physico-chemical information. Using our practical definition of conformational diversity, it is then easy to relate its extent to different parameters, for example crystallization conditions such as bound/unbound state, mutant/wild-type state, or variations in pH and temperature. |
Monzon A, Parisi G*, Juritz E
*UNQ Argentina |
D - Databases, Ontologies, and Text Mining |
|
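The diversity measure defined in poster D40, the maximum RMSD over all pairs of conformers, can be written down in a few lines. The sketch assumes the conformers are already superimposed and have matching atoms in the same order; the structural alignment step used by CoDNaS is not reproduced here, and the coordinates are invented.

```python
from itertools import combinations
from math import sqrt


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally sized coordinate lists.
    Assumes the structures are already superimposed (no fitting is done)."""
    if len(coords_a) != len(coords_b):
        raise ValueError("conformers must have the same number of atoms")
    total = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return sqrt(total / len(coords_a))


def conformational_diversity(conformers):
    """Return (maximum pairwise RMSD, pair of conformer names giving it)."""
    best_pair = max(combinations(conformers, 2),
                    key=lambda pair: rmsd(conformers[pair[0]], conformers[pair[1]]))
    return rmsd(conformers[best_pair[0]], conformers[best_pair[1]]), best_pair


# Invented two-atom "conformers" of one protein.
conformers = {
    "A": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    "B": [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)],
    "C": [(0.0, 0.0, 0.0), (1.0, 3.0, 0.0)],
}
max_rmsd, pair = conformational_diversity(conformers)
```

Taking the maximum, not the mean, makes the measure sensitive to the single most divergent pair of structures, so it grows or stays the same as more conformers of a protein are added.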
D 41
D41 |
In recent times the volume and diversity of experimental data available to cancer researchers has increased at such a rate as to become overwhelming. canSAR (https://cansar.icr.ac.uk) is an integrated, multidisciplinary resource developed to help researchers make better use of global biological and chemical data, more quickly and efficiently. It integrates genomic and protein data, 3D structure data, and pharmacological and chemical data including drugs, clinical candidates and chemical tools. In one place researchers can view summaries of data relating to a gene, including gene expression, copy number, 3D structures, druggability, protein interactions, chemical tools and more. Similar summaries exist for drugs and other bioactive compounds, covering protein bioactivity profiles, cell line sensitivity profiles and 3D structure binding interactions. The platform is useful to cancer researchers from multiple disciplines. canSAR also includes tools for batch gene and compound analyses to provide summary annotations for gene or compound sets that can be used alongside a user's own experimental data. |
Tym J*, Al-lazikani B, Bulusu K
*Institute of Cancer Research United Kingdom |
D - Databases, Ontologies, and Text Mining |
|
E 03
E3 |
One goal of sequencing-based metagenomic community analysis is the quantitative taxonomic assessment of microbial community compositions. In particular, relative quantification of taxa is of high relevance for metagenomic analyses or microbial community comparison. However, the majority of existing approaches quantify at low resolution (e.g. at phylum level), rely on the existence of special genes (e.g. 16S), or have severe problems discerning species with highly similar genome sequences. We developed Genome Abundance Similarity Correction (GASiC), a versatile method to estimate true genomic abundances in metagenomic datasets at the species level.
Metagenomic sequence reads are first mapped against a set of reference genomes of species potentially present in the dataset. Then, GASiC estimates the pairwise similarities of the reference genomes using a simulation approach. The similarities are then used to correct the mapping results from the first step and to obtain estimates of the true genomic abundances in the dataset. To this end, we formulate the problem as a non-negative LASSO and solve it using a constrained optimization routine. We developed GASiC, a versatile and accurate abundance correction method. By design, it is independent from the underlying mapping tool and data acquisition protocol. We applied GASiC to the metagenomic FAMeS benchmark dataset and compared its performance to existing methods, showing that GASiC is able to reduce the quantitative error by up to 60%. The GASiC source code is freely available at: http://sourceforge.net/projects/gasic |
Lindner MS*, Renard BY
*Robert Koch-Institut Germany |
E - Evolution, Phylogeny, and Comparative Genomics |
|
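The correction step in poster E3 can be illustrated with a toy non-negative LASSO. The sketch assumes the model "observed read counts ≈ similarity matrix × true abundances" and solves it with plain projected gradient descent; the solver, the step size, and the 2×2 numbers are all choices made for this example and are not taken from the GASiC source code.

```python
def correct_abundances(similarity, observed, lam=0.0, step=0.4, iters=5000):
    """Minimise 0.5 * ||S r - c||^2 + lam * sum(r) subject to r >= 0
    by projected gradient descent (a simple stand-in for a constrained
    optimisation routine). `step` must be small enough for the given
    matrix; 0.4 suits the toy example below."""
    n = len(observed)
    r = [0.0] * n
    for _ in range(iters):
        residual = [sum(similarity[i][j] * r[j] for j in range(n)) - observed[i]
                    for i in range(n)]
        grad = [sum(similarity[i][j] * residual[i] for i in range(n)) + lam
                for j in range(n)]
        # Gradient step followed by projection onto the non-negative orthant.
        r = [max(0.0, r[j] - step * grad[j]) for j in range(n)]
    return r


# similarity[i][j]: assumed fraction of reads from genome j that map to
# reference i. Genome 2 is absent, yet half of genome 1's reads hit its reference.
similarity = [[1.0, 0.5],
              [0.5, 1.0]]
observed = [100.0, 50.0]  # raw mapped read counts per reference
corrected = correct_abundances(similarity, observed)
```

In this toy case the raw counts suggest that the second species is present at half the abundance of the first, while the corrected estimate attributes all reads to the first species, which is the kind of shared-read artefact the abstract describes.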
E 04
E4 |
MicroRNAs (miRNAs) are small RNA molecules involved in the regulation of mammalian gene expression. We have quantified expression of miRNAs in five tissues from multiple humans, chimpanzees, and rhesus macaques using high-throughput sequencing. We computed expression differences between miRNAs in species and tissue comparisons and found that transcription factors are significantly more often targeted by differentially expressed miRNAs. Through their regulatory effect on transcription factors, miRNAs may therefore exert an indirect influence on a larger proportion of genes than previously thought.
Furthermore, while more than 1921 miRNAs are annotated in human, annotation is still lacking in other primate species. Using sequence data from two tissues in two gorilla and two orangutan individuals, we annotated with high confidence a large number of additional miRNAs in gorilla, chimpanzee, orangutan and rhesus macaque. These sequence data supporting the expression of miRNAs in multiple primates provide new insights into the complex regulatory networks for genes and provide a resource for future analyses of miRNA gene regulation in primates. |
Dannemann M*, Prüfer K, Lizano E, Burbano HA, Kelso J, Nickel B
*Max Planck Institute for Evolutionary Anthropology Germany |
E - Evolution, Phylogeny, and Comparative Genomics |
|
E 05
E5 |
Smut fungi are major pests of grasses. Their host plants include some of the most important crop plants, such as maize, sorghum, barley, wheat and sugarcane. We are interested in uncovering the molecular events leading to host shifts and specialization to a particular host. The availability of molecular tools, together with five genome sequences of closely related species parasitizing distinct hosts, makes smut fungi a particularly interesting model to unravel the basis of host specificity. Here, we present a computational approach to identify genes showing an interesting evolutionary history, some of them putatively contributing to host specificity. For our predictions, we hypothesize that three major events play a role in the determination of host range: single substitutions, frame shifts or indels, and gene acquisition or loss. We reconstructed families of homologous genes using clustering techniques. From this data set, we called frame shifts using dedicated codon alignment tools. Single substitutions were tested for positive selection with non-homogeneous models of sequence evolution. Gene gains or losses were inferred by detecting orfans and verified by scanning non-syntenic regions. Since secreted proteins play a critical role in establishing a biotrophic interaction, we scrutinized candidate sets for predicted secreted proteins. Overall, we detected 253 positively selected genes, 62 frame shifts and 237 orfans. Among those, 65 proteins (positively selected) and 8 proteins (both frame shifts and orfans) are predicted to be secreted. For interesting genes we are currently determining transcript profiles during discrete stages of development. In addition, we are generating deletion mutants to determine whether these genes have a virulence function. |
Schweizer G*, Dutheil JY, Mannhaupt G, Kahmann R
*Max-Planck-Institute for terrestrial Microbiology, Department of Organismic Interactions Germany |
E - Evolution, Phylogeny, and Comparative Genomics |
|
E 07
E7 |
Despite numerous studies conducted to understand genome conservation and divergence, the conservation of noncoding regions is still unclear. To explore this question, we adopted a genome-wide approach that integrates several genomic features, such as gene organization, histone modification and gene expression.
We compared noncoding regions of the same genomic feature type (specifically promoters, introns and intergenic regions) that vary in conservation, focusing on the differences between highly and less conserved regions. Although the frequencies of the different gene organizations in the fruit fly genome are approximately the same, our results suggest that intergenic regions of different gene organizations tend to have significantly different conservation levels. Moreover, within each gene organization, different intergenic conservation levels reflect enrichment of genes with different biological processes. For example, genes that belong to the Head-to-Head organization with a highly conserved intergenic region are enriched in gene ontology terms linked to epigenetic regulation, such as chromatin organization and chromatin assembly, and in related protein domains such as histone-core and histone-fold. In contrast, genes flanked by less conserved intergenic regions show enrichment in general terms such as nucleic acid metabolic process and cell cycle, and lack protein domain enrichment. Additionally, we found that highly conserved intergenic regions of the Head-to-Head organization are more depleted in H2K27ac, H3K9ac and H3K4me3 and more enriched in H3K4me3 compared to low conservation regions. We also observed a similar tendency in other types of gene organization and noncoding DNA regions. These results provide evidence that noncoding regions of different genomic architectures are under different levels of selective pressure, implying roles for noncoding regions in genome regulation. |
Seridi L*, Ryu T, Ravasi T
*Integrative Systems Biology Lab, Chemical & Life Sciences and Engineering Division, Mathematical and Computer Sciences and Engineering Division, Computational Biosciences Research Center, King Abdullah University of Science and Technology, Thuwal 23955-69 Saudi Arabia |
E - Evolution, Phylogeny, and Comparative Genomics |
|
E 08
E8 |
One of the tasks of bioinformatics is the annotation of whole-genome sequencing results and the identification of novel features in genomic organization by comparative analysis of different strains of a given microorganism. Due to the widespread expansion of next-generation sequencing technologies, an immense number of genomic sequences is being accumulated in the world's repositories. Existing software for analyzing genomic data often requires expert-level computer skills (i.e., databasing, scripting), which bench biologists often lack.
A web service, SRAGEN, has been developed that provides an effective software solution for comparative genome analysis. Query sequences (draft or complete genomes) are aligned to a subject sequence using a suffix tree algorithm. The result is a set of alignment coordinates and positions of single nucleotide polymorphisms (SNPs) between query and reference. In the next step, the user can construct phylogenetic trees using various algorithms (UPGMA, Neighbor-joining, etc.), examine the occurrence of SNPs in genes across all queries, check their synonymy and calculate sequence types using housekeeping genes (MLST). Nucleotide polymorphisms are checked for synonymy using a reference annotation file (.ptt). The Internet portal enables users to run their own genome projects, keeping all information private in a database. Special features of SRAGEN include the capacity to compare genomic data from >1000 Mycobacterium tuberculosis genomic sequences sequenced at the Broad Institute (USA) and the Sanger Institute (UK); this bacterium is a significant target for biological analysis due to its high pathogenicity in humans. SRAGEN allows users to examine genomic features that lead to drug resistance, an interesting and relevant topic. |
Altukhov I*, Alexeev D, Ischenko D, Tyakht A, Kulemin N, Kogan V, Bazaleev N, Shitikov E
*Moscow Institute of Physics and Technology Russian Federation |
E - Evolution, Phylogeny, and Comparative Genomics |
|
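The synonymy check mentioned in the abstract above (deciding whether a SNP changes the encoded amino acid) can be sketched as follows. This is an illustration only and not SRAGEN code: the function names and the idea of passing the affected codon directly are assumptions, and reading gene coordinates from a .ptt annotation file is left out. The codon table is the standard genetic code, whose codon-to-amino-acid assignments are also used by the bacterial translation table.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def translate_codon(codon):
    """Return the one-letter amino acid (or '*' for stop) for a DNA codon."""
    codon = codon.upper()
    if len(codon) != 3 or any(base not in BASES for base in codon):
        raise ValueError("codon must consist of three bases from A, C, G, T")
    index = (16 * BASES.index(codon[0])
             + 4 * BASES.index(codon[1])
             + BASES.index(codon[2]))
    return AMINO_ACIDS[index]


def is_synonymous(reference_codon, position_in_codon, alternative_base):
    """True if replacing one base of the codon leaves the amino acid unchanged.

    position_in_codon is 0, 1 or 2 (hypothetical convention for this sketch).
    """
    mutated = (reference_codon[:position_in_codon]
               + alternative_base
               + reference_codon[position_in_codon + 1:])
    return translate_codon(reference_codon) == translate_codon(mutated)


if __name__ == "__main__":
    print(is_synonymous("CTT", 2, "C"))  # third-position change within leucine codons
    print(is_synonymous("GAA", 1, "T"))  # glutamate codon changed to a valine codon
```

In a real pipeline the codon and the position within it would be derived from the SNP coordinate, the gene start and the strand given in the annotation.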
E 09
E9 |
Bacterial replicative DNA polymerases are multicomponent protein machines, in which the actual DNA synthesis is performed by the catalytic alpha-subunit. The catalytic subunit belongs to the distinct C-family of DNA polymerases. Polymerases of this family fall into two major groups, DnaE and PolC, typified respectively by Escherichia coli DnaE and Bacillus subtilis PolC. Despite the functional importance of C-family polymerases, the knowledge regarding their variability in sequence and domain organization as well as the distribution of different groups in bacterial genomes is still poor.
In order to better characterize this important protein family, we collected and classified all C-family DNA polymerases from 1389 fully sequenced bacterial genomes. Two distinct groups were easily identifiable: PolC, the replicative polymerase of Gram-positive bacteria, and DnaE2, a mutagenic DNA polymerase. The remaining polymerases clustered into two fairly similar subgroups: DnaE1 and DnaE3. Analysis of domain composition revealed that there is significant variability even within the same group. However, the polymerase core domain, the duplex DNA binding domain and the PHP domain were found to be universally conserved. We also found that all mutagenic DnaE2 polymerases lack the C-terminal domain, which is known to mediate the retention of the polymerase within the replisome. Analysis of the distribution of polymerases in genomes revealed that all bacteria have at least one polymerase belonging to either the DnaE1 or the DnaE3 group. Also, PolC is always present together with at least one other DnaE3/DnaE1 polymerase, in agreement with the current knowledge of replication in PolC-containing Gram-positive bacteria. |
Timinskas K*, Balvočiūtė M, Venclovas Č
*Institute of Biotechnology, Vilnius University Lithuania |
E - Evolution, Phylogeny, and Comparative Genomics |
|
E 10
E10 |
Adaptation is one of the driving forces that bring novelties to genomes. Computational tools and models are available to predict adaptation (positive selection) at the molecular level, e.g. for amino acids in each lineage of a gene phylogenetic tree. These methods are powerful but assume that the input multiple sequence alignment (MSA) is biologically correct. However, MSA building methods are far from perfect, and sequences are often misaligned near border, gap, repeat or fast-evolving regions. Moreover, the input data often contain errors due to gene prediction or the choice of alternative transcripts. These MSA artifacts may look like positive selection events. In order to avoid false positive predictions, we need to address the automatic cleaning of MSAs by correcting and/or filtering regions and residues. The purpose of the Selectome database is to compute positive selection predictions for each vertebrate gene family in an automatic way. We combine several methods to filter sequences and improve the alignment in a biologically meaningful way, by realigning sequences and filtering out non-orthologous parts (e.g. non-orthologous exons, fake coding regions). We apply scoring methods at the residue level to mask fast-evolving and border regions. Here we benchmark the final part of this pipeline on real data, by randomly inserting regions of variable lengths simulating a random (non-orthologous exon) or a frameshift (fake CDS) region. We also checked the capacity of the filtering process to keep simulated positively selected sites. The pipeline filters such regions successfully when they are not too small. http://selectome.unil.ch/ |
Moretti S*, Laurenczy B, Salamin N, Robinson-Rechavi M
*Department of Ecology and Evolution, University of Lausanne Switzerland |
E - Evolution, Phylogeny, and Comparative Genomics |
|
E 11
E11 |
DNA transposons make up three percent of the human genome, roughly the same percentage as genes. However, due to their inactivity, they are often ignored in favour of the much more abundant, still active, retroelements. Despite this relative obscurity, there are a number of interesting questions to be asked of these transposon families. We are interested in the proliferation of elements throughout host genomes. Does an ongoing process of turnover occur, or is the process more akin to a life cycle, with elements proliferating rapidly before independent deactivation at a later date?
We answer this question by tracing back to the most recent common ancestor of each modern transposon family, using two different methods. The first method identifies the most recent common ancestor of the species in which a family of transposon fossils can still be found. The second method uses BEAST, a Markov chain Monte Carlo (MCMC) molecular dating method, to predict the age of the most recent common ancestor element from which all elements found in a modern genome are descended. Independent data from five pairs of species are used in the analysis: Human-Chimpanzee, Human-Orangutan, Dog-Panda, Dog-Cat and Cow-Pig. We discover that, in general, the times to element common ancestry for a given family are the same for the different species pairs, suggesting that there has been no order-specific process of turnover. For the majority of families, the age of the common ancestor of the host species and that of the elements are similar, suggesting a life cycle model. |
Hellen E*, Brookfield J
*University of Nottingham United Kingdom |
E - Evolution, Phylogeny, and Comparative Genomics |
|
E 12
E12 |
The availability of affordable next-generation sequencing technologies now allows the routine determination of genomic sequences for hundreds of microbial isolates. These sequences are the basis of a variety of downstream analyses. Most of these analyses rely on a robust phylogeny, the determination of which can be a time-consuming process. Reducing the time required to extract phylogenetically relevant information from next-generation sequencing data is of particular importance when the calculation of a phylogeny is not the primary goal of the study. Here we present a simple tool that can efficiently extract phylogenetically relevant information from any type of sequence data, without the need for sequence assembly or any other data pre-processing. The output of the program is a multiple sequence alignment, which can be analyzed further by any tree reconstruction program. |
Bertels F*, van Nimwegen E, Rainey P, Silander O
*Biozentrum der Universität Basel Switzerland |
E - Evolution, Phylogeny, and Comparative Genomics |
|
E 13
E13 |
Membrane proteins are found in all kingdoms of life and have a diverse set of functions and key roles in many biological systems. Herein, we determine all alpha-helical transmembrane proteins from 24 complete eukaryotic proteomes, which span the four eukaryotic supergroups: chromalveolates, plants, excavates and unikonts. Hence, we are able to investigate the evolutionary history of the membrane proteome in eukaryotes. In total we identify 100955 membrane proteins among the more than 400000 investigated proteins. Using Markov clustering, based on sequence similarity and Pfam protein family affiliation, 91% of the membrane proteins were placed into candidate families. We extend our previous classification of the human membrane proteome to the other investigated organisms and track its evolutionary history. We find that receptors are the largest functional group in animals, dominated by G protein-coupled receptors, receptor kinases and immunoglobulin receptors, but that most of the human receptor families are small or absent in other eukaryotes. Human transporter and membrane protein enzyme families, on the other hand, have a much longer evolutionary history, and several families are present in all investigated species. Moreover, we investigate the sequence features of the transmembrane helices across lineages and discuss the mechanisms that have shaped eukaryotic membrane protein evolution. We provide a comprehensive analysis of the evolution of the eukaryotic membrane proteome, its functions, families and features. |
Sällman Almén M*, Fredriksson R, Schiöth HB
*Uppsala University Sweden |
E - Evolution, Phylogeny, and Comparative Genomics |
|
E 14
E14 |
When comparing whole genomes one often needs to calculate evolutionary distances based on the pairwise number of mutations. Traditionally, this is done by multiple sequence alignment methods, which often do not scale to large genomes. Alignment-free methods are an efficient alternative, though until recently alignment-free distance measures could not be interpreted as mutation distances. A recently published distance measure, K_r, now combines the efficiency of alignment-free methods with the biological relevance of mutation-based distances. It can be interpreted as the well-known Jukes-Cantor estimator of the number of substitutions per site and is based on the lengths of shortest unique substrings between genomes. K_r is implemented in the software Kr2 (Domazet-Lošo 2009), which looks up the required shortest unique substring lengths using enhanced suffix arrays (Abouelhoda 2004). When Kr2 is applied to large genomes, such as those of mammals, these data structures can become very large. For example, Kr2 requires 72 GB of RAM to compare 12 complete Drosophila genomes. We modified the algorithm used in Kr2 such that random access to the enhanced suffix array is eliminated. The resulting algorithm streams the tables of the enhanced suffix array in sequential order, leading to a very small memory peak during the calculation. We implemented the modified algorithm in the genometools-toolkit (http://genometools.org). This provides a scalable solution to computing the K_r distance measure. For example, the 12 Drosophila genomes can be compared in about 3 hours using only 3 GB of RAM, which means a 20-fold space improvement over Kr2 and a comparable running time. |
Willrodt D*, Kurtz S, Haubold B
*Zentrum für Bioinformatik, Universität Hamburg Germany |
E - Evolution, Phylogeny, and Comparative Genomics |
|
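For readers unfamiliar with the Jukes-Cantor estimator that K_r is said to correspond to in the abstract above, a minimal sketch of the classical formula is given here. It does not reproduce how K_r is derived from shortest unique substring lengths; the function name and the input (a proportion of differing sites `p`) are assumptions for illustration. Under the Jukes-Cantor model, the expected number of substitutions per site is d = -(3/4) ln(1 - (4/3) p).

```python
import math


def jukes_cantor_distance(p):
    """Jukes-Cantor estimate of substitutions per site.

    p is the observed proportion of differing sites between two sequences.
    The estimate is undefined for p >= 0.75, where the sequences are
    saturated with substitutions.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


if __name__ == "__main__":
    for p in (0.01, 0.1, 0.3):
        print(p, jukes_cantor_distance(p))
```

The correction is small for closely related sequences and grows as `p` approaches saturation, which is why the raw proportion of differences underestimates the distance between diverged genomes.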
E 15
E15 |
Previous investigations have indicated that the expression level of a gene correlates with the rate of evolutionary change in the corresponding protein: highly expressed proteins tend to evolve more slowly. It is, however, unknown whether this correlation is strong enough to be predictive of protein expression levels in practical terms.
Reconstructing phylogenetic trees for all proteins across 19 diverse bacterial genomes (phylomes) allowed us to test this hypothesis by comparing the terminal branch lengths in the trees to experimentally measured mRNA levels. In addition, we have systematically examined a number of other features extracted from the phylogenetic trees, evaluating the predictive power of each set of features alone and in combination using Random Forest regression. As a baseline, we compared against the prediction accuracy of codon biases - a standard approach to link the genomic sequence to expression levels. We found that, surprisingly, some of the features describing phylogenetic trees predict gene expression levels as well as the codon biases do, with a correlation coefficient of ~0.5 when averaged across the 19 species. In particular, a combination of only two sets of phylogenetic features was highly informative: the terminal branch lengths, and the number and age of duplications. The first finding is consistent with the slower evolution rate of highly expressed proteins, while the second might reflect the divergence in expression levels after duplications. The major part of this correlation remains in place after controlling for the effects of gene functional category, and of the genes’ phylogenetic scope, on expression levels. |
Supek F*, Marcet-Houben M, Gabaldon T
*Centre for Genomic Regulation Spain |
E - Evolution, Phylogeny, and Comparative Genomics |
|
E 16
E16 |
The study of metabolic reconstruction in different organisms exposes the existence of compounds that are crucial for their survival. Examples of these compounds are the enzymes that are responsible for the catalysis of biochemical reactions in metabolic pathways. Unlike homologous enzymes, analogous enzymes catalyze the same reactions without significant sequence similarity at the primary level and possibly with different three-dimensional structures. A detailed study of these enzymes may reveal new catalytic mechanisms, add information about the origin and evolution of biochemical pathways and reveal potential targets for drug development. For many diseases caused by parasites, therapeutic options remain inefficient or nonexistent, requiring the search for new drug targets. These targets may be specific proteins of the parasite (absent in the host) or compounds present in both organisms but with different three-dimensional structures, like analogous enzymes. The AnEnPi tool was able to identify, annotate and compare homologous and analogous enzymes. It was developed and used to computationally reconstruct the metabolic pathways of some model organisms such as trypanosomes. Since the three-dimensional structure is important in the study of analogy, the tool MHOLline was used to obtain 3D models for homologous, analogous and specific proteins of T. cruzi versus Homo sapiens. The strategies used in this study support the concept that structural analysis together with protein functional analysis could be an interesting computational methodology to detect potential targets for structure-based rational drug design. |
Guimarães ACR*, Degrave W, Miranda A
*Fundação Oswaldo Cruz Brazil |
E - Evolution, Phylogeny, and Comparative Genomics |
|
E 17
E17 |
Insects are among the most diverse groups of animals, including more than a million described species. They represent over 90% of the metazoan species and contribute to many ecosystems. The sequenced insect genomes include Diptera genomes (16, e.g., fruit fly and mosquito) and Hymenoptera genomes (9, e.g., ants). Proteomes were downloaded from the Hymenoptera Genome Database and from UniProtKB. Functional assignment was performed by the PFAM, Phobius and Clantox classifiers.
The ProtoNet (www.protonet.cs.huji.ac.il) platform provides an unsupervised hierarchical clustering of all proteins. A total of 300K proteins from 17 complete insect proteomes, with Daphnia pulex as an outgroup, were clustered using ProtoNet. All analyzed proteins were clustered into 10K stable families. We developed a systematic methodology to evaluate the imbalance among species, as represented by their protein families. Such an imbalance is a reflection of the evolutionary history of massive gene loss and duplications. We identified protein families which have gone through an extensive expansion in only some of the species. We noted that the level of imbalance is remarkably high for membranous proteins (with a transmembrane domain) and secreted proteins (with signal peptides). We focus on protein families that are specialized in the two main groups, Diptera and Hymenoptera. Our methodology can be used for identifying functional families that are specialized only for specific species. ProtoBug (www.protobug.cs.huji.ac.il) is a resource and a querying system that supports a species view of protein families and functions. |
Rappoport N*, Linial M
*The Hebrew University of Jerusalem Israel |
E - Evolution, Phylogeny, and Comparative Genomics |
|
E 18
E18 |
The WHO reports no decline in the incidence of HIV-1 infections. We are interested in tracking the dispersal patterns of HIV-1, which is transmitted via a contact network. This study tests the pan-European mixing hypothesis, investigates transmission network structure and estimates epidemic growth rates across HIV-1 subtypes and modes of transmission.
The guiding principle is that evolutionary change occurring on the same time scale as disease spread allows the estimation of transmission linkage between patients. The study data consist of 46,000 HIV-1 pol gene sequences sampled from 30,000 patients, collected by the EuResist consortium. Sequences are subtyped using COMET and REGA and aligned with ClustalW. A transmission graph is inferred from the pairwise distance matrix by thresholding sequence similarity as measured using the LogDet distance. An optimal threshold is identified based on edge density and a novel graph-theoretic metric, graph coagulation. The pan-mixing hypothesis is tested using a modified form of the assortativity coefficient. Social network measures are used to investigate the structure of transmission clusters. Growth rates within connected components of this graph are estimated from divergence times in phylogenetic trees reconstructed using Bayesian MCMC. The optimal threshold is significantly different for each HIV subtype, possibly due to differences in transmission dynamics. Transmission clusters exhibit high country-wise assortativity, suggesting endemic transmission as opposed to pan-mixing. Intravenous drug abusers (IVDA) show high assortativity with other transmission types along with a high node centrality, indicating that they are important bridging elements of the epidemic. |
Kalaghatgi P*
*Max Planck Institute for Informatics Germany |
E - Evolution, Phylogeny, and Comparative Genomics |
|
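The graph construction step in the abstract above (thresholding a pairwise distance matrix and then working on connected components) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name, the use of a union-find structure and the toy distance matrix are assumptions, and the choice of threshold via edge density and graph coagulation is not modeled.

```python
def transmission_clusters(distances, threshold):
    """Group sequences into connected components of the thresholded graph.

    distances: symmetric matrix (list of lists) of pairwise distances.
    threshold: two sequences are linked if their distance is <= threshold.
    Returns a sorted list of clusters, each a list of sequence indices.
    """
    n = len(distances)
    parent = list(range(n))

    def find(i):
        # Find the representative of i's component, with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if distances[i][j] <= threshold:
                parent[find(i)] = find(j)  # merge the two components

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


if __name__ == "__main__":
    # Hypothetical distances between four sequences.
    toy = [
        [0.00, 0.01, 0.20, 0.20],
        [0.01, 0.00, 0.20, 0.20],
        [0.20, 0.20, 0.00, 0.02],
        [0.20, 0.20, 0.02, 0.00],
    ]
    print(transmission_clusters(toy, 0.05))
```

Scanning a range of thresholds with such a function and recording the edge density or component sizes at each value is one simple way to explore how the cluster structure depends on the cutoff.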
E 19
E19 |
The study of evolutionary rates is central to understanding the mechanisms underlying protein molecular evolution. Several factors have been associated with the modulation of the evolutionary rate, such as genomic location, functional importance of the protein, expression level, structural constraints, protein stability and developmental time. However, it was recently established that the gene expression level is the property showing one of the strongest and most consistent correlations between genomic data and evolutionary rate. More recently, different studies have indicated that structure-functional features and translation rates could make comparable contributions to explaining evolutionary rates. In this work we study how the presence of conformational diversity in proteins could influence the rate of evolution. To study this relationship we used a subset of proteins taken from the PCDB database (Protein Conformational Data Base). Each of these proteins was linked to the OMA database to obtain the corresponding set of orthologs, and we then estimated the evolutionary rates using PAML 4. The final set contains 58 proteins corresponding to 67 domains in PCDB, with an average RMSD of 1.23Å and a maximum of 5.07Å. Using this set we have determined that the evolutionary rate correlates negatively with the degree of conformational diversity measured by the RMSDmax between conformers (Spearman correlation = -0.37). Our results indicate that proteins with higher conformational diversity are subject to additional structural constraints and therefore evolve at lower rates. We think that our findings could have important implications for the understanding of the protein evolution process. |
Zea D, Fornasari MS, Marino C, Parisi G*
*Universidad Nacional de Quilmes Argentina |
E - Evolution, Phylogeny, and Comparative Genomics |
|
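The Spearman correlation reported in the abstract above is the Pearson correlation of the ranks of the two variables. A minimal sketch in plain Python follows; the function names and the toy numbers are assumptions for illustration, and the data are not taken from the study.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical values: maximum RMSD between conformers and evolutionary rate.
    rmsd_max = [0.4, 1.1, 2.3, 3.0, 5.0]
    rate = [0.9, 0.7, 0.8, 0.3, 0.2]
    print(spearman(rmsd_max, rate))
```

Because only ranks enter the calculation, the measure captures monotonic relationships and is insensitive to the scale of either variable.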
E 20
E20 |
Post-translational modification of the lysine residues of specific proteins by ubiquitin is involved in the modulation of degradation, localization, and activity of the target proteins. Here, we identified gains of ubiquitylation sites in highly conserved regions of human proteins that might have occurred during human evolution. We analyzed human ubiquitylation site data and multiple alignments of orthologous mammalian proteins including those from humans, primates, other placental mammals, opossum, and platypus. In our analysis, we identified 214 ubiquitylation sites in 190 proteins that first appeared along the human lineage during primate evolution: 4 proteins with three novel sites; 16 proteins with two sites; and the remaining 170 proteins with one site each. PML, which functions in neurodevelopment and neurodegeneration, acquired three sites, two of which are involved in the degradation of PML. Nine human proteins, including ERCC2 (also known as XPD) and NBR1, gained human-specific ubiquitylated lysines after the human-chimpanzee divergence. ERCC2 shows a Lys/Gln polymorphism, the derived (major) allele of which confers enhanced DNA repair capacity and reduced cancer risk compared with the ancestral (minor) allele. NBR1 and seven other proteins that are involved in the human autophagy protein interaction network gained a novel ubiquitylation site. The gain of novel ubiquitylation sites could be implicated in the evolution of the protein degradation and other regulatory networks. Although gains of ubiquitylation sites do not equate to adaptive evolution, they are useful candidates for molecular functional analyses to identify novel advantageous genetic modifications and innovative phenotypes acquired during human evolution. |
Kim DS, Hahn Y*
*Chung-Ang University Korea, South |
E - Evolution, Phylogeny, and Comparative Genomics |
|
E 22
E22 |
Two main hypotheses on the evolution of embryonic development have been put forward so far. First, an early conservation model predicts that the highest conservation occurs at the beginning of embryogenesis. It dates back to Karl von Baer, who postulated that embryos of different species progressively diverge from one another during ontogeny. Second, an hourglass model predicts that the highest conservation can be found during mid-embryogenesis. Nowadays, the hourglass model is commonly accepted, although a formal characterization has been elusive. Recent studies have reported several molecular characteristics supporting the hourglass model. To this end, they compared descriptive statistics of expression values of all genes between developmental time-points. Such a methodology introduces dependencies between the sets of expressed genes which are compared, and consequently can produce results biased by genes expressed in many time-points. To overcome this problem, we used an alternative "modularization" approach to study the evolution of zebrafish development. We first decomposed the genes into different modules, each containing genes that were expressed in only one of six developmental stages (cleavage, gastrula, pharyngula, larva, juvenile and adult). Next, for every module we obtained five characteristics: gene sequence conservation, gene age, gene expression conservation, number of one-to-one orthologs, and non-coding sequence conservation. The first three characteristics suggest that all developmental stages are conserved at the same level. The number of one-to-one orthologs supports the early conservation model, whereas the regulatory region conservation supports the hourglass model. Thus different levels of molecular evolution seem to follow different patterns of developmental constraints. |
Piasecka B*, Lichocki P, Robinson-Rechavi M
*University of Lausanne Switzerland |
E - Evolution, Phylogeny, and Comparative Genomics |
|
E 23
E23 |
Phylogenetic footprinting approaches to motif discovery usually rely on whole genome alignments (WGAs) for the detection of conserved regulatory sites. Due to complex genome rearrangements WGAs are not possible in plants. We therefore analyze the promoter sequences of a large set of orthologous gene families. We do not rely on the alignment of the promoters since it has been shown that regulatory sites are often not aligned correctly.
In this work we study 4 species of the Monocotyledon family. Our dataset contains 17724 gene families, each consisting of 4 orthologous promoter sequences and on average one paralogous sequence. We use the Branch Length Score (BLS) to assess the degree of conservation of a motif inside a gene family. We developed an exhaustive alignment-free algorithm based on generalized suffix trees to discover the conserved motifs in a gene family. We use a 5-character alphabet, including the Any character (N), to represent the motifs. After combining the results from the different gene families we are able to calculate the confidence of each motif-BLS pair. A parallel version of the algorithm has been developed and implemented using the Message Passing Interface (MPI), which significantly reduces runtimes and makes it possible to keep the vast number of candidate motifs in memory. We investigated the algorithm’s ability to recover known rice motifs from Transfac. The results are compared with the results obtained by aligning the promoter sequences with Dialign-TX. Since the alignment quality might depend on the length of the sequences, we compare the results for both the 500 bp promoters and the 2 kb upstream promoter regions. |
De Witte D*, Van Bel M, Demeester P, Dhoedt B, Vandepoele K, Fostier J
*Department of Information Technology (INTEC), Ghent University Belgium |
E - Evolution, Phylogeny, and Comparative Genomics |
|
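The 5-character motif alphabet in the abstract above (A, C, G, T plus the wildcard N) can be illustrated with a small alignment-free occurrence check. This sketch is an assumption-laden illustration: it uses a naive scan instead of the generalized suffix trees of the poster, it does not compute the Branch Length Score, and the function names and sequences are hypothetical.

```python
def motif_matches_at(motif, sequence, start):
    """True if motif matches sequence at position start; N matches any base."""
    if start < 0 or start + len(motif) > len(sequence):
        return False
    return all(
        m == "N" or m == s
        for m, s in zip(motif, sequence[start:start + len(motif)])
    )


def occurs_in(motif, sequence):
    """True if the motif occurs anywhere in the sequence."""
    return any(
        motif_matches_at(motif, sequence, start)
        for start in range(len(sequence) - len(motif) + 1)
    )


def family_support(motif, promoters):
    """Names of the promoters in one gene family that contain the motif.

    promoters: dict mapping a species or gene name to its promoter sequence.
    """
    return sorted(name for name, seq in promoters.items() if occurs_in(motif, seq))


if __name__ == "__main__":
    # Hypothetical promoter fragments for one gene family.
    family = {
        "species_a": "GGACGTAA",
        "species_b": "TTACTTGG",
        "species_c": "GGGGCCCC",
    }
    print(family_support("ACNT", family))
```

In a conservation-based scheme, the set of species returned here would be the input for a score that weighs how much of the phylogenetic tree the supporting species cover.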
E 25
E25 |
Protein evolution is shaped by natural selection, which determines the fixation rate of amino-acid substitutions. While this rate is assumed to be constant under neutral evolution, it is decelerated by negative selection, or accelerated by positive selection when promoting an adaptation to environmental changes. A well-known case of such adaptation is Ribulose bisphosphate carboxylase (RubisCO), the enzyme responsible for fixation of CO2 during photosynthesis. In flowering plants (angiosperms), two forms exist, C3 and C4, the latter being faster in terms of catalytic activity. The C4 form results from convergent evolution in multiple clades and was the result of substitutions under positive selection at only a small subset of sites. However, very little is known about the physicochemical properties of these sites.
Using a phylogenomic framework and homology modeling, we reconstructed in silico the ancestral sequences and their associated 3D structures. With these 3D models, we are able to follow the evolutionary path precisely, by identifying each mutation on each branch, and to describe the mechanistic changes that lead to the differentiation between C3 and C4 forms. Using FoldX (Serrano's lab), we were able to estimate the stability effect of these mutations along the phylogenetic tree. While we found some mutations that stabilise the structure to preserve the global stability, we found other mutations that are slightly destabilising. These mutations are buried inside the core structure and are close to the loop that gives access to the enzymatic cavity. These results are consistent with a “stability-activity trade-off” model. |
Studer R*, Christin P, Orengo C
*University College London United Kingdom |
E - Evolution, Phylogeny, and Comparative Genomics |
|
E 26
E26 |
The large amount of data generated by today’s powerful sequencing technologies has a great impact on many fields of bioinformatics. One of these fields is phylogenetics, where increasing numbers of sequences allow a more detailed look at phylogenetic relationships than ever before.
However, the closer one looks, the more distracting even small distortions in the image are. One of the factors that can cause such distortions is the choice of outgroups, which serve as an evolutionary anchor for the identification of ancestry. Here we look at how differences in the chosen outgroup can influence the accurate representation of the underlying evolutionary processes in the generated phylogenetic tree. We find that especially when analyzing sequences of closely related organisms this choice can have a critical effect on the accuracy of the resulting tree. Based on these results we present a method for the evaluation of an outgroup for the reconstruction of the “true” phylogenetic relationship - even in situations when this ground truth is not known. We implemented our algorithm in a Java-based tool that selects the optimal outgroup for an alignment prior to the calculation of the phylogenetic tree. The software is available as a plugin for the bioinformatics suite Geneious. |
Brünnhäußer J, Zickmann F, Renard BY, Nitsche A, Dabrowski PW*
*Robert Koch Institute Germany |
E - Evolution, Phylogeny, and Comparative Genomics |
|
E 27
E27 |
Validation and benchmarking are challenging tasks in computational evolutionary biology because the evolutionary history of the biological entities studied is usually unknown. Computer programs for simulating sequence evolution in silico are therefore widely used to characterize newly developed models and methods under controlled conditions. However, current simulation packages tend to focus either on gene-level aspects of genome evolution such as character substitutions and indels, on population-level events such as recombination and gene conversion, or on genome-level aspects such as genome rearrangement and speciation events.
We introduce a new simulation program ALF (for artificial life framework), developed with the long-term goal of simulating the entire range of evolutionary forces that act on genomes. In the first release, we primarily focussed on species-level evolution where an ancestral genome, represented by an ordered set of sequences, is evolved along a tree into a number of descendant synthetic genomes. At the gene-level, ALF can simulate evolution at the nucleotide, codon or amino acid level with indels and among-site heterogeneity, supporting most established models of character substitution. Different types of sequences can be mimicked by defining several sequence classes with separate models of substitution, insertion-deletion and among-site rate variation. At the genomic level, ALF simulates GC content amelioration, gene duplication and loss, genome rearrangements and lateral gene transfer. A user-friendly web interface facilitates the setup of new simulations. We illustrate the utility of ALF with an example study demonstrating that lateral gene transfer can dramatically decrease the accuracy of two well established methods for orthology inference. |
Dalquen D*, Anisimova M, Gonnet G, Dessimoz C
*ETH Zurich Switzerland |
E - Evolution, Phylogeny, and Comparative Genomics |
|
E 28
E28 |
The conservation of intron positions comprises information useful for de novo gene prediction as well as for analysing the origin of introns. Multiple sequence alignments can be improved by incorporating information about gene structures. Here, we present GenePainter, a standalone tool for mapping gene structures onto protein multiple sequence alignments (MSA). Gene structures, as provided by WebScipio (http://www.webscipio.org/), are aligned with respect to the exact positions of the introns (down to nucleotide level) and intron phase. Resulting alignments can be displayed in plain text as well as graphically. In the text-based output, exons and introns are denoted by “-” and “|”, respectively. Alternatively, intron phases can be displayed by replacing the general intron indicator “|” with a number representing the actual phase. This information can also be included in the MSA itself, allowing for improvements in the MSA with respect to gene structure information. Besides the pure representation of aligned gene structures, structural elements can be visualized together with intron positions and phases. To this end, a protein structure needs to be provided. Based on an alignment between the protein sequence as given in the PDB and a reference sequence from the MSA, the respective intron positions and phases are mapped onto the structure. GenePainter supplies Python scripts to visualize this information in PyMOL. |
Hammesfahr B, Odronitz F, Mühlhausen S*, Waack S, Kollmar M
*Max Planck Institute for Biophysical Chemistry Germany |
E - Evolution, Phylogeny, and Comparative Genomics |
|
E 29
E29 |
Through alternative splicing, a single gene can produce many different isoforms, thus multiplying its potential functional roles. Alternative splicing is known to affect the great majority of protein-coding genes in many species, but it is unclear what proportion of alternative isoforms is functional. Here, we address this question by studying the evolution of alternative splicing patterns in conjunction with the evolution of splicing regulatory mechanisms. To study the evolution of transcriptomes, we generated an extensive RNA-Seq dataset, comprising 11 species and 8 tissues. Using these data, we defined for each species a complete catalogue of cis-acting regulatory elements of splicing, i.e. splicing enhancers or silencers that are located in intronic or exonic sequences flanking the splice site. We defined these regulatory elements using a previously proposed in silico method, which relies on a principle of compensation between splice site strength and the presence of additional splicing regulators (e.g., constitutive exons with weak splice sites are expected to have more enhancers for compensation). We found that the presence of in silico detected splicing enhancers is correlated with high exon inclusion frequency, thus confirming the validity of our approach. Our analyses revealed that the sets of splicing regulators are highly conserved between species. However, we also detected many species-specific regulatory motifs, which may explain the rapid evolution of alternative splicing patterns. Finally, we searched for substitutions that disrupt or create splicing regulator motifs in splice site flanking regions, and we found that the presence of such substitutions correlates with important changes in the pattern of alternative splicing. |
Bilican A*, Necsulea A, Kaessmann H
*Center for Integrative Genomics Switzerland |
E - Evolution, Phylogeny, and Comparative Genomics |
|
E 30
E30 |
The availability of many annotated proteomes enables the systematic study of the relationships between protein conservation and functionality. In this work, we explore this question based solely on the presence or absence of protein homologues, namely the conservation profile. We study the proteomes of 18 metazoans, and examine them from two distinct points of view: the human’s and the fly’s.
We explore functional enrichment of the “universal proteins”, which have homologues in all 17 other species, and of the non-universal proteins, using the GOrilla gene ontology tool. Many gene ontology terms are strongly enriched in both human and fly universal proteins. These include protein metabolic process and transport, DNA-dependent regulation of transcription and regulation of gene expression. Processes such as immune response, defense response, and response to biotic stimulus are enriched in both fly and human non-universal proteins. Keratinization is enriched in human (but not fly) non-universal proteins, while sensory perception and body morphogenesis are enriched in fly (but not human) non-universal proteins. We also study more complex patterns of conservation profiles, e.g. proteins having homologues in all vertebrates, but none in invertebrates. Finally, we apply Quantum Clustering to the conservation profiles of non-universal proteins. The resulting protein clusters exhibit interesting, significant functional enrichment. Using simple binary conservation profiles, based on two points of view, the human's and the fly's, we show interesting features of metazoan protein conservation. Some of these findings concur with known ones, while others are novel and shed more light on the relations between protein conservation and functionality. |
Pasmanik-Chor M*, Witztum J, Persi E, Horn D, Chor B
*Tel Aviv University Israel |
E - Evolution, Phylogeny, and Comparative Genomics |
|
E 31
E31 |
Gram-negative pathogens are equipped with specialized hair-like appendages (or pili) for attachment to their hosts. Pili play a crucial role in the onset and persistence of bacterial infections and often determine tissue and host tropism of the pathogen. A majority of these attachment pili are assembled by the chaperone-usher (CU) pathway. Though the specific role in virulence is well documented for a number of these pilus types, the overall association of pilus genotypes with strain ecology is poorly understood. The growing number of fully sequenced E. coli genomes in public databases provides an opportunity to probe this association, which can be potentially useful for pathogen surveillance, targeted anti-virulence therapies, as well as in understanding the evolution of these systems. Through HMM modelling of the usher, model-based search of CU gene clusters from around 100 E. coli genomes, phylogenetic analysis, and cluster analysis, we were able to identify the types of CU pili present in E. coli. Our estimate yielded about 60 types of CU pili. Fewer than half of these types are reported in the literature. Some types can be found in all strains while others are more group-specific. Many of the types we identified that are associated with pathogens have not been reported earlier in the literature. We found that pilus genotypes do, in general, largely correlate with E. coli lineage, implying that CU pilus gene clusters comprise a significant part of the core gene sets of E. coli strains across a whole range of ecotypes. |
Taganna J*, de Greve H, Callewaert N, Remaut H
*VIB Laboratory of Structural and Molecular Microbiology Belgium |
E - Evolution, Phylogeny, and Comparative Genomics |
|
E 32
E32 |
Phylogenetic trees of prokaryotes based on the sequence data of single genes are often inconsistent with one another. To overcome this problem, the supertree (concatenated tree) approach was developed, and it yielded a plausible species phylogeny. However, it tends to yield inaccurate relationships, particularly for distantly related species, due to disturbing factors such as horizontal gene transfer and gene loss in out-paralogs. We developed a new method named “Ortho-Gen” to construct ortholog datasets for phylogenetic analysis. We introduced the following four ideas into the method to decrease the influence of horizontal gene transfer and out-paralogs.
HGT filter: Deletion of genes that are derived from horizontal gene transfer from the initial sequence dataset. Out-paralog filter: Deletion of out-paralogs from the BLAST results using the similarity score. Remaining out-paralogs are deleted in the following steps. Classification of the tree data: The trees are classified as monophyletic or polyphyletic based on the position of the outgroup. In the latter case, candidate out-paralogs are detected. Tree split: This program splits a tree into two to remove the out-paralogs from the candidate ortholog dataset, using information on the phylogenetic tree's topology, such as monophyletic or polyphyletic states at the group and species level. Changing threshold: This program cuts a candidate ortholog group (including out-paralogs) into two ortholog groups based on the difference between the evolutionary distances among true ortholog members and those among out-paralogs. |
Horiike T*, Minai R, Miyata D
*Shizuoka University Japan |
E - Evolution, Phylogeny, and Comparative Genomics |
|
E 33
E33 |
The destination or localization of a protein in a cell is one aspect of a protein’s function. Advances in large-scale sequencing techniques have led to a variety of completely sequenced genomes and millions of protein sequences. Unfortunately, many of these sequences still lack detailed functional annotation. To tackle this deficit of data, several computational methods have been developed to predict sub-cellular localization. In this study, we applied our novel method LocTree2 to predict localization in 79 completely sequenced genomes comprising archaeal, bacterial and eukaryotic domains. We found that species as different as Homo sapiens, Drosophila melanogaster and Arabidopsis thaliana have surprisingly similar cellular compositions (e.g. over 50% of all eukaryotic proteins were predicted to localize in the nucleus or the cytosol). We further compared the predicted composition of the compartments to that predicted by three different state-of-the-art methods: Cello v.2.5, YLoc and WoLF PSORT. The methods predominantly disagreed in the predictions of proteins that are difficult to handle experimentally (e.g. plasma membrane and ER localized proteins). Comparing predictions to the experimental annotations in SWISS-PROT, we found experimental annotations to be biased towards localization classes traditionally studied in great detail (e.g. chloroplast for plants and nucleus for other eukaryotes). Additionally, we computed distances from the LocTree2 predicted composition of cellular compartments to reconstruct the phylogenetic relationships between the genomes, confirming accepted relationships and suggesting novel ones. Thus, a high-throughput annotation of sub-cellular localization adds considerable value in answering important biological questions. |
Goldberg T*, Hamp T, Bromberg Y, Vicedo E, Rost B
*Technical University Munich Germany |
E - Evolution, Phylogeny, and Comparative Genomics |
|
F 01
F1 |
Tetramerisation of the oligomeric enzyme dihydrodipicolinate synthase (DHDPS) in mesophilic bacteria has been shown to be essential for stability and function, yet it is not clearly understood why this is the case. Shewanella benthica (Sb) is a psychrophilic bacterium that inhabits cold ocean environments. Studying oligomeric enzymes from this bacterium and comparing their properties to those of their mesophilic counterparts therefore offers the opportunity to explore the relationship between quaternary structure, stability and protein dynamics.
Through biophysical and X-ray structural analyses, we have shown that recombinant Sb-DHDPS exists as an active dimer in solution and in the crystal form at biologically relevant temperatures (4-12 degrees Celsius). However, at mesophilic temperatures, the protein aggregates and possesses significantly attenuated enzymatic activity. The aggregation of Sb-DHDPS hence presents problems in using conventional experimental techniques that measure conformational dynamics. Molecular dynamics simulations have thus been initiated to measure conformational dynamics of the Sb-DHDPS dimer as a function of temperature. Preliminary analysis indicates that the interface between the monomeric units of Sb-DHDPS is more dynamic at temperatures greater than 20 degrees Celsius in comparison to the equivalent interface of DHDPS from the mesophilic bacterium, methicillin-resistant Staphylococcus aureus. Combined, these studies offer insight into the molecular evolution of the quaternary structure, stability and dynamics of bacterial DHDPS, and consequently the importance of quaternary structure to proteins in general. |
Wubben J*, Dogovski C, Paxman J, Wagner J, Downton M, Perugini M
*Department of Biochemistry and Molecular Biology, The University of Melbourne Australia |
F - Macromolecular Structure, Dynamics and Function |
|
F 02
F2 |
Cytochrome P450 1A1 (CYP1A1) is an important isoform of the CYP450 family that is widely studied because it metabolizes a large number of xenobiotics to cytotoxic and/or carcinogenic derivatives. An anticancer role of CYP1A1 has also been reported in the literature: CYP1A1 is involved in the generation of genotoxic metabolites, which are responsible for apoptosis of cancer cells [1]. Expression of CYP1A1 is specifically observed in extrahepatic tissues as well as in various cancerous tissues, such as breast cancer. Macromolecular structural studies of CYP1A1 would help identify the features responsible for the proper binding orientation of substrates, which leads to the formation of genotoxic metabolites of anticancer compounds.
In the present work, homology models of CYP1A1 were built to study the importance of loop flexibility in ligand binding using in silico methodologies. Homology models of CYP1A1 were generated using Modeller9v8 software, as no 3D crystal structure is available in the PDB. Models were validated by Ramachandran plot, ERRAT plot, Verify3D and ProsaII energy plot. The active site of the CYP1A1 model was validated using molecular docking of selected CYP1A1 ligands. Furthermore, molecular dynamics simulation studies were performed using the Amber10 software package. The molecular dynamics studies reveal that the flexibility of the B’-C loop contributes to the orientation of substrates in the CYP1A1 active site. The results were examined and validated for known CYP1A1 substrates 5F-203, 5-Aminoflavone, 7-Ethoxyresorufin, etc. (as antitumor compounds). The study would be helpful in designing novel substrates that will be metabolized by CYP1A1, generate genotoxic metabolites and act as antitumor compounds. |
Nandekar P*, Sangamwar A
*National Institute of Pharmaceutical Education and Research (NIPER) India |
F - Macromolecular Structure, Dynamics and Function |
|
F 03
F3 |
Protein allosteric sites are increasingly attracting the interest of medicinal chemists in the search for new types of targets and strategies for drug development. Given that allostery represents one of the most common and powerful means to regulate protein function, the traditional drug discovery approach of targeting active sites can be extended by targeting allosteric or regulatory protein pockets, which may allow the discovery of not only novel drug-like inhibitors, but activators as well.
Moreover, allosteric sites present additional characteristics, such as modulable activity and less evolutionary pressure, which may facilitate the development of highly-specific allosteric drugs that can prevent side effects and readily complement traditional therapeutics. Continuing with our previous work on allosteric sites from a structural and evolutionary perspective, we have now developed a method to predict the location of allosteric sites on protein structures. The methodology, which is a coarse-grained approach based on protein flexibility, achieves up to 65% accuracy on a set of 58 different protein families. To our knowledge, this work represents the first attempt to predict allosteric sites on multiple protein families. |
Panjkovich A*, Daura X
*Institute of Biotechnology and Biomedicine (IBB-UAB) Spain |
F - Macromolecular Structure, Dynamics and Function |
|
F 04
F4 |
The design of novel protein functions, by the transfer of a functional motif onto an existing scaffold, is a major goal of protein engineering. Computational protein design approaches have led to many successful compounds, but only a few methods propose to search for suitable sites to graft a functional motif: RosettaMatch, Scaffold Selection and AutoMatch. The two main limitations faced by these methods are the huge number of potential grafting sites to explore and the evaluation of the functionality of the grafted motif. To reduce the complexity of the problem, these methods reduce the number of potential sites to explore. As a consequence, motifs are limited to 5 or 6 residues in pockets or on the protein surface.
We propose STAMPS, an approach based on Cα-Cβ distance compatibility and an optimized clique search algorithm, which can screen the entire PDB in a reasonable amount of time to find suitable grafting sites. The quality of the identified sites is then calculated using an RMSD optimization and a measure of steric hindrance with the protein target. Compared to previous methods, STAMPS retrieves functional motifs faster and with better accuracy on the Zanghellini, ActApo and ActFree benchmarks. Moreover, longer motifs (more than 10 residues) can be searched with STAMPS without an explosion of computing time. Finally, STAMPS has been successfully applied to the design of Kv1.2 potassium channel blockers, the analysis of the facial two-histidine one-carboxylate binding motif throughout the entire PDB, and the detection of potential scaffolds for designing artificial inhibitors of metalloenzymes. |
Collet G*, Cuniasse P
*CEA - DSV France |
F - Macromolecular Structure, Dynamics and Function |
|
F 05
F5 |
Protein structural data play an important role in the drug discovery process, notably by reflecting the intricacies of drug binding to the protein. In order to understand the structural basis of function, pathology and drug binding in proteins, this information has to be viewed in the context of wider SAR and biological data. There is also a need to be able to perform an in-depth analysis of conformational mobilities and their determinants on a large scale to obtain a global picture. The process involved is technically challenging, hence no resource exists in the public domain that integrates structural data with chemogenomic information. This project aims to address this issue. A structural knowledge-base, called canSAR 3D, is being built, and the information is used, along with purpose-built tools, to perform large-scale structural analyses on cancer-specific protein families.
Using integrative in-silico techniques, large-scale structural analysis is performed on ~2800 structures of the Protein Kinase family. This is achieved by performing structural comparisons using superpositions and family-based clustering to analyse the effect of conformational changes on function. Ligand-binding footprints will be derived on studying the binding sites and mapped to the chemical structures of small molecule ligands. The knowledge derived will be utilized to build a library of curated and validated 3D models of Cancer-specific proteins and applied to Structure-based Drug Design projects. |
Bulusu KC*, Halling-Brown M, Al-Lazikani B
*Institute of Cancer Research United Kingdom |
F - Macromolecular Structure, Dynamics and Function |
|
F 06
F6 |
Direct inhA inhibitors (DIIs) bind to the substrate binding pocket (SBP), which is closed by the substrate binding loop (SBL). Strongly interacting DIIs are hypothesized to have a smaller dissociation constant (Kd) because they keep the SBL in the closed conformation. In the present work, we have performed a conformational analysis of the SBL in inhA ternary structures to evaluate its flexibility and binding characteristics. Conformational changes between the closed and open forms of the SBL were analyzed using the Chimera and PyMOL molecular visualization programs. Subsequently, an MD simulation (50 ns) was performed on a crystal structure with an open SBL conformation and the ligand Genz10850 docked in the SBP, using CUDA-configured AMBER11. Multiple conformations of the SBL are observed in the 33 superimposed inhA structures. A difference of ~30° is estimated between the most open and the closed conformation of the SBL, and of ~16° between the native and the closed conformation. The MD simulation also showed a transition of ~29° from the open to the closed conformation of the SBL, leading to a decrease in SBP volume. The amino acid residues Ala198 and Leu207 in the SBL are observed to form H-bonds with Thr196 and Ile105, respectively, stabilizing the closed conformation of the SBL. Moreover, the ligand-protein binding energy distribution showed that the electrostatic binding energy gradually increased during this transition, while the van der Waals binding energy remained stable throughout the simulations. This study shows that the SBL directs the size of the SBP to accommodate the ligand. Furthermore, it suggests designing ligands with enhanced electrostatic binding energy with the SBL, which are speculated to prolong the closed state of the SBL. |
Kumar V*
*National Institute of Pharmaceutical Education and Research, S.A.S. Nagar India |
F - Macromolecular Structure, Dynamics and Function |
|
F 07
F7 |
Computational modeling and docking of high-resolution G-protein coupled receptors (GPCRs) are playing an increasingly important role in lead drug discovery and design. Due to the high flexibility and propensity for induced fit at the receptor interface, refinement for GPCR flexibility during docking remains a major challenge in predicting the three-dimensional structure of complexes.
The receptor’s flexibility is observed not only in side chains but also in backbone movements, and both a reasonable starting receptor ensemble and interface refinement during the docking process can benefit the docking models. We propose a combined refinement (coREF) approach focusing on backbone flexibility of the docking interface by integrating static and dynamic refinements. The static refinement algorithm, which parallelizes the backrub movement and two other movements in order to mimic the observed simultaneous fluctuations of various backbone segments near the interface, starts with a receptor ensemble for docking with interface flexibility. The dynamic refinement algorithm, RLEX (RosettaLigand EXtension algorithm), adds a backbone refinement step for interface residues, which extends the full side-chain flexibility of the RosettaLigand docking algorithm. To validate the performance of coREF, we refine twenty GPCR Docking 2008/2010 models submitted by participants. Eight out of the ten CXCR4 targets and half of the D3 targets obtain a lower LRMSD than the corresponding submitted models. Nine out of the twelve results are slightly better than those refined by the RosettaLigand docking algorithm. |
Lv Q*
*School of Computer Science and Technology, Soochow University, China |
F - Macromolecular Structure, Dynamics and Function |
|
F 08
F8 |
The identification of biologically relevant interfaces in the lattice of a macromolecular crystal structure has become an important issue in structural biology. Tackling this problem experimentally typically requires time-consuming site directed mutagenesis or cross-linking experiments, coupled with biophysical characterization. Computationally, the most straightforward approach to classify interfaces seems to be a method based on evolutionary considerations, since biological interfaces bear the fingerprint of evolution while crystal contacts do not. However, the current method of choice in the field, PISA (Krissinel & Henrick, J Mol Biol, 2007), does not rely on evolutionary information and is based on thermodynamic estimations of interface stability. We developed a new method, called EPPIC (Evolutionary Protein Protein Interface Classifier) using three criteria to classify protein interfaces, two of which are evolutionary indicators based on the sequence entropy of homolog sequences. One of the two indicators detects differential selection pressure between interface core and rim, defined as in Schärer et al. (Proteins, 2010). The other indicator compares interface core and rest of the surface with a Z score-like approach.
EPPIC compares favorably to PISA in terms of performance and demonstrates the potential of evolutionary information for solving the interface classification problem. Its performance will further improve over time thanks to the growth of sequence databases. It can be employed in a range of applications, such as validation of structures and homology models and divide-and-conquer approaches to the structure determination of supramolecular complexes. EPPIC has been implemented both as a command-line tool and as a user-friendly web service. |
Duarte J, Srebniak A, Capitani G*
*Paul Scherrer Institut Switzerland |
F - Macromolecular Structure, Dynamics and Function |
|
F 09
F9 |
The relationship between sequence and structural propensities in peptides was investigated at the individual residue level. Our work was based on previously measured experimental residual dipolar couplings (RDCs) and chemical shifts of peptides of sequence EGAAXAASS [1]. We performed molecular dynamics (MD) simulations of these short peptides in explicit solvent with the aim of reproducing and interpreting the experimental data. We focused on the peptides with X = Tyr and Trp, because their experimental RDCs showed a particularly contrasted pattern along the chain, which suggests a propensity for an alpha-helical conformation. For comparison, we also simulated the peptides with X = Gly and Ile, because of their rather flat RDC pattern, which suggests extended peptides. The simulations show the formation of a higher number of hydrogen bonds within the peptide chain with X = Tyr or Trp, stabilizing a helical turn. The driving force leading to such a conformation could arise from the lack of hydration of the peptide chain on either side of bulky aromatic residues.
1. Dames SA, Aregger R, Vajpai N, Bernado P, Blackledge M, et al. (2006) Residual Dipolar Couplings in Short Peptides Reveal Systematic Conformational Preferences of Individual Amino Acids. Journal of the American Chemical Society 128: 13508-13514. |
Bignucolo O*, Grzesiek S, Berneche S
*University of Basel Switzerland |
F - Macromolecular Structure, Dynamics and Function |
|
F 10
F10 |
The synthetic homotetrameric β β α (BBAT1) protein possesses a stable quaternary structure with a β β α fold. Because of its small size (a total of 84 residues), the homotetramer is an excellent model system with which to study the self-assembly and protein-protein interactions.
We find that association to a tetramer precedes and facilitates folding of the four chains. At room temperature the tetramer exists in an ensemble of diverse structures. The crystal structure becomes energetically favored only when the molecule is put in a dense and crystal-like environment. The observed picture of folding promoted by association may mirror the mechanism according to which intrinsically unfolded proteins assume their functional structure. |
Sieradzan A*, Liwo A, Hansmann U
*University of Gdańsk Poland |
F - Macromolecular Structure, Dynamics and Function |
|
F 11
F11 |
Processing of exogenous glycerol esters is an initial step in energy derivation for many bacterial cells. Lipid-rich environments settled by a variety of organisms exert strong evolutionary pressure for establishing enzymatic pathways involved in lipid metabolism. However, a certain number of enzymes involved in this process remain unknown since they do not share detectable sequence similarity with any known protein domains. In this work, using distant homology detection (Meta-BASIC) and fold recognition (3D-Jury), we predicted that bacterial transmembrane proteins, belonging to the uncharacterized DUF2319 family, possess the alpha/beta hydrolase fold domain together with the catalytic triad critical for hydrolysis. The 3D structure modeling combined with a detailed analysis of sequence/structure features and genomic context (STRING, Operon Predictions Tool, OperonDB) indicates that DUF2319 proteins may be involved in lipid metabolism. Therefore, these enzymes are likely to serve as extracellular lipases catalyzing the initial steps in the modifications of various glycerol esters (triglycerides, lysophospholipids or phospholipids), which are essential for ATP production in the cell. DUF2319 lipases may thus be an attractive target for new antibacterial compounds, since blocking the initial step of lipid processing would interrupt the overall cell metabolism leading to microorganism starvation and death.
|
Łaźniewski M*, Steczkiewicz K, Kniżewski Ł, Ginalski K
*Laboratory of Bioinformatics and Systems Biology, CeNT, University of Warsaw Poland |
F - Macromolecular Structure, Dynamics and Function |
|
F 12
F12 |
The in silico design of enzymes catalyzing reactions not found in naturally occurring biocatalysts is of great interest for a wide range of applications. In recent years, the design of such enzymes has been reported; however, this design problem is still far from being solved. The main drawback of all in silico methods is the resulting catalytic efficiency, which is above background but still several orders of magnitude below that of naturally occurring enzymes. Most likely, crucial aspects of enzyme catalysis are still not considered by the design protocol.
In an attempt to overcome these limitations, we developed TransCent, which aims at the transfer of active sites between scaffolds. TransCent balances four constraints, which are assumed to be important for establishing catalytic activity, namely (a) protein stability, (b) ligand binding, (c) protonation states of active site residues, and (d) a hydrogen bond network stabilizing the transition state. TransCent is based on the Rosetta software suite extended by two proven state-of-the-art methods, PROPKA and DSX. A Monte Carlo based heuristic is used to calculate low energy solutions of the energy function consisting of four weighted terms, one for each constraint. The weights of the terms have been optimized for sequence recovery in recapitulation experiments on a set of 54 enzymes. Recently, we have developed a novel module named TransLig. TransLig proposes ligand positions in the yet-to-be-designed active site and allows TransCent to properly place the side chains mediating catalysis and ligand binding. |
Birzer D*, Paulini F, Merkl R
*Universität Regensburg Germany |
F - Macromolecular Structure, Dynamics and Function |
|
F 13
F13 |
One aim of the in silico characterization of proteins is to identify all residue-positions that are crucial for function and structure. Whereas several algorithms predict functionally important sites, none concurrently identifies structurally important residue-positions. Thus, we have implemented CLIPS-1D and CLIPS-3D, which predict a role for residues in catalysis, ligand-binding, or protein structure. By analyzing a multiple sequence alignment (CLIPS-1D) and additionally a 3D structure (CLIPS-3D), the algorithms score conservation and abundance of residues at individual sites and in their local neighborhood, and categorize them by means of a multiclass support vector machine. A cross-validation on 264 proteins confirmed that residue-positions involved in catalysis were identified with state-of-the-art quality; the mean MCC-values for CLIPS-1D and CLIPS-3D were 0.34 and 0.38, respectively. For structurally important sites, prediction quality was considerably higher (mean MCC 0.67 and 0.68). For ligand-binding sites, prediction quality was lower (mean MCC 0.12 and 0.2), because binding sites and structurally important residue-positions share conservation and abundance values, which makes their separation difficult. These data indicate that classification success varies for residues in a class-specific manner and that the additional contribution of 3D data is moderate. Our algorithms compute residue-specific p-values, which allow for the statistical assessment of each individual prediction. The algorithms generate hypotheses about residue-positions important for structure or function of a whole protein family. Because they focus on conservation and abundance signals, the algorithms make it possible to characterize gene products whose function cannot be identified by homology-driven approaches. |
Janda J*, Meier A, Busch M, Kück F, Porfenenko M, Merkl R
*Universität Regensburg Germany |
F - Macromolecular Structure, Dynamics and Function |
|
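The Matthews correlation coefficient (MCC) quoted for CLIPS-1D and CLIPS-3D can be computed from a binary confusion matrix. The sketch below is illustrative only: it assumes each class (catalytic, ligand-binding, structural) is scored one-versus-rest, which the abstract does not state, and the function name `mcc` and the zero-denominator convention are our own choices.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for one binary confusion matrix.

    tp/tn/fp/fn are counts of true positives, true negatives,
    false positives and false negatives (hypothetical inputs).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: report 0 when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom
```

An MCC of 1 means perfect agreement, 0 is no better than chance, and -1 is total disagreement, which puts the reported means of 0.12 to 0.68 in context.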
F 14
F14 |
The ability to evaluate protein models against the experimentally determined reference structure is crucial for the development and benchmarking of protein structure prediction methods. Although a number of evaluation scores have been proposed to date, many aspects of model assessment still lack the desired robustness. To remedy the situation we developed CAD-score (contact area difference score). The new method is based on contacts derived from the Voronoi diagram of spheres that correspond to heavy atoms with their van der Waals radii. The Voronoi diagram of spheres is constructed by an algorithm that is especially suited for processing macromolecular structures. CAD-score considers only physical interatomic interactions within the protein structure and does not use any arbitrary parameters or cutoffs. For single-domain structures our contact-based score shows a strong correlation with GDT-TS, a commonly accepted evaluation score. At the same time CAD-score displays a better agreement with the physical realism of models. In contrast to the methods based on structure superposition, our new method works equally well on single-domain, multi-domain and even multi-chain protein models of varying degrees of accuracy and completeness. Moreover, CAD-score can directly evaluate the accuracy of inter-domain or inter-subunit interfaces. Apart from protein structures, our method works with RNA tertiary structures and various mixed complexes (protein-RNA, protein-DNA). In addition to model evaluation, CAD-score offers an alternative to superposition-based model clustering. |
Olechnovic K*, Venclovas Č
*Vilnius University Lithuania |
F - Macromolecular Structure, Dynamics and Function |
|
F 15
F15 |
A docking procedure generally consists of two main steps. In the first step a large number of 3D models (decoys) are generated, while in the second step the best solutions are selected, typically following RMSD-based clustering. Unfortunately, correctly ranking these ‘best solutions’ to extract native-like ones is still an open problem. Therefore, the normal case in real-life research is to have many different docking solutions, possibly containing native-like ones, with the problem of reliably distinguishing them from the others. Many valuable methods have been developed to this end, based on both energy potentials and normality indices. Their performance has also been assessed in the recent CAPRI (Critical Assessment of PRedicted Interactions) editions, where it was shown that singling out the best docking models remains a challenge. Herein we present an alternative method we have developed to rank multiple docking solutions. Making use of the ‘consensus’ concept, largely successful in many bioinformatics fields, our method is based on the conservation of residue-residue contacts in a given decoy ensemble. We applied it to two protein-protein docking decoy benchmarks, DOCKGROUND and RosettaDock, as well as to several targets from recent CAPRI rounds. Not surprisingly, results show that, although the method performs quite well when applied to the outputs of one specific docking program, its ideal application is on multi-program outputs. |
Vangone A, Cavallo L, Oliva R*
*University of Naples 'Parthenope' Italy |
F - Macromolecular Structure, Dynamics and Function |
|
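The consensus idea in the abstract above, ranking decoys by how well their residue-residue contacts are conserved across the ensemble, can be sketched as follows. This is a hypothetical illustration, not the authors' scoring function: the abstract gives no formula, so the mean-conservation score, the function name `rank_by_contact_consensus`, and the contact representation (pairs of residue labels) are all assumptions.

```python
from collections import Counter


def rank_by_contact_consensus(decoys):
    """Rank docking decoys by the average conservation of their contacts.

    decoys: dict mapping a decoy name to a set of (residue_a, residue_b)
    inter-chain contacts (hypothetical input format).
    Returns (ranking, scores), with ranking sorted best-first.
    """
    n = len(decoys)
    # How many decoys contain each contact.
    counts = Counter(c for contacts in decoys.values() for c in contacts)

    def score(contacts):
        if not contacts:
            return 0.0
        # Mean fraction of decoys sharing each contact of this decoy.
        return sum(counts[c] / n for c in contacts) / len(contacts)

    scores = {name: score(contacts) for name, contacts in decoys.items()}
    # Sort by descending score; ties are broken alphabetically by name.
    ranking = sorted(scores, key=lambda name: (-scores[name], name))
    return ranking, scores
```

A decoy whose contacts recur in many other decoys scores close to 1, while an outlier with unique contacts scores close to 1/n.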
F 16
F16 |
Sequence-specific recognition of DNA by proteins plays a critical role in regulating gene expression. The accurate recognition of target sequences in the genome can be achieved by a combination of two different mechanisms: the direct readout through the direct interactions between protein and DNA bases; and the indirect readout through the sequence-dependent conformation and/or deformability of DNA structure. While the specificity of direct readout has been well characterized, it is rather difficult to assess the contribution of indirect readout to the specificity. In order to quantify the specificity of indirect readout, we have developed a new method. First, we used Bayesian statistics to derive the probability of a particular sequence for a given DNA structure using ensembles obtained by molecular dynamics (MD) simulations of DNAs containing all 136 unique tetramer sequences. Second, we used the information entropy to quantify the specificity of indirect readout. We applied this method to protein-DNA complexes of known structure to examine its validity. We could correctly predict those regions where experiments suggested the involvement of indirect readout, and could indicate new regions where the indirect readout mechanism can make major contributions to the recognition. The present probability/entropy-based approach has an advantage over the previous energy-based approach in that the trajectories of MD simulations can be directly converted into the probability of sequence and the specificity of indirect readout without any approximation to the distributions in the conformational ensembles of DNA, and would serve as a powerful tool to study the mechanism of protein-DNA recognition. |
Yamasaki S, Terada T, Kono H, Shimizu K, Sarai A*
*Kyushu Institute of Technology Japan |
F - Macromolecular Structure, Dynamics and Function |
|
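Two quantities in the abstract above lend themselves to a short worked example. The count of 136 unique tetramers follows from treating a tetramer and its reverse complement as the same double-stranded step. An entropy-based specificity can be expressed as information content in bits. The sketch below is only an illustration: the abstract does not give its entropy formula, so the definition log2(N) minus the Shannon entropy, and the names `unique_tetramers` and `specificity_bits`, are assumptions.

```python
import math
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence written with A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def unique_tetramers():
    """All tetramers, counting a sequence and its reverse complement once."""
    all_tetramers = ("".join(p) for p in product("ACGT", repeat=4))
    return sorted({min(s, reverse_complement(s)) for s in all_tetramers})


def specificity_bits(probs):
    """Assumed specificity measure: log2(N) minus the Shannon entropy (bits).

    probs: non-negative weights over N candidate sequences; they are
    normalized here, so they need not sum to 1.
    """
    total = sum(probs)
    entropy = -sum((p / total) * math.log2(p / total) for p in probs if p > 0)
    return math.log2(len(probs)) - entropy
```

Of the 256 possible tetramers, 16 are their own reverse complement, giving (256 - 16) / 2 + 16 = 136. Under the assumed measure, a uniform distribution yields 0 bits (no specificity) and a distribution concentrated on one sequence yields log2(N) bits.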
F 17
F17 |
Vibrio cholerae uses a quorum sensing communication system to interact with other bacteria and to gauge environmental parameters. This organism dwells equally well in both the human host and aquatic environments. Quorum sensing regulates a multitude of activities and is one of the attractive targets presently pursued in antibacterial drug design to counter virulence. The histidine phosphotransfer protein LuxU and the response regulator LuxO of V. cholerae are known to play important roles in biofilm formation and the virulence machinery. In the present study, we used computational methods to model LuxU and LuxO, and simulated the interactions between them. Since no structural details of the proteins were available, we employed homology modeling to construct the three-dimensional structures and then performed molecular dynamics simulations to study the dynamic behavior of LuxO and LuxU from V. cholerae. The modeled proteins were validated and subjected to molecular docking analyses. This allowed us to predict the binding modes of the proteins and to elucidate probable sites of interference. |
Kumar S*, Turabe Fazil MHU, Singh DV
*Institute of Life Sciences India |
F - Macromolecular Structure, Dynamics and Function |
|
F 18
F18 |
Protein ion channels play a crucial role in various living processes. Numerous channel disorders affect millions of people worldwide. Unfortunately, the availability of solved channel structures is unsatisfactory. However, the permeation characteristics of a single channel can be measured using the patch-clamp technique. Experiments reveal that even single point mutations may significantly alter the current-voltage dependencies of ion channels. The permeation characteristics can also be inferred from the structural model of a channel, e.g. by means of time-efficient 3D Poisson-Nernst-Planck calculations. The correspondence between experimental and calculated characteristics can be utilized in channel structure prediction. We propose a novel method for validation of predicted structural models of channels based on comparison of calculated and experimental current-voltage characteristics.
To test this approach, we designed a computational pipeline for reconstruction of spatial coordinates of proteins from contact maps using state-of-the-art bioinformatics methods, including FT-COMAR, SABBAC and SCWRL. This procedure was used to generate over 2000 model structures of the potassium channel KcsA based on maps containing 30-100% of contacts. A high-quality RMSD (<2.5 Å) was achieved by 38% of the structures. Each model was then validated using our 3D PNP package. The overall RMSD of the reconstructed models, calculated in relation to the original KcsA structure, was significantly correlated with the predicted channel conductance and selectivity to the ionic charge. Moreover, we identified certain residues whose RMSDs were most highly correlated with the permeation parameters. Finally, we showed that 80% of misfolded models can be eliminated by setting an appropriate threshold of calculated conductance and charge selectivity. |
Dyrka W*, Konopka B, Rybicka M, Kotulska M
*Wroclaw University of Technology, Institute of Biomedical Engineering and Instrumentation Poland |
F - Macromolecular Structure, Dynamics and Function |
|
F 19
F19 |
Factor X (FX) is a vitamin K-dependent serine protease, which is one of the major players in the blood coagulation cascade. Upon activation to FXa, it converts prothrombin to thrombin, which in turn converts fibrinogen into fibrin (blood clots). More than two million people die each year of arterial or venous thrombosis in the USA alone. Conversely, FXa deficiency causes bleeding (hemorrhage), such as intracranial bleeding, hemarthrosis and gastrointestinal blood loss. Therefore, FXa is an interesting target for both anticoagulant and anti-bleeding drugs. Experimental in vitro studies showed that the naturally occurring Gly43Asp mutant, identified in patients suffering from severe bleeding episodes, markedly impaired the catalytic activity of FXa. Molecular dynamics simulations were performed for 0.48 microseconds, followed by free energy calculations, on the wild-type and mutant FXa enzymes in order to explain the experimentally observed effect of the Gly43Asp mutation at the atomic level. It was found that the wild-type (WT) enzyme is more flexible than the Gly43Asp mutant, which allows the WT to bind the substrate more efficiently. The selectivity of the conformational sampling is controlled by the hydrogen bond network established within and near the active site. Indeed, replacing Gly43 by Asp favors the formation of new hydrogen bonds and the rupture of those established in the wild type. Our results contribute to the understanding of the FXa structure-function relationship, eventually helping in designing more potent drugs targeting thrombosis and hemostasis disorders. |
Abdel-Azeim S*, Vangone A, Oliva R, De Cristofaro R, Cavallo L
*King Abdullah University of Science and Technology (KAUST) Saudi Arabia |
F - Macromolecular Structure, Dynamics and Function |
|
F 20
F20 |
It is essential for ab initio protein structure prediction (PSP) methods to be able to discriminate correctly folded (native) structures from incorrectly folded and unfolded ones. The modeling of solvation effects is critical to the success of these methods, considering that the solvent plays a key role in the protein folding process. In this work we evaluated four implicit solvation models with respect to their capacity to discriminate native structures from incorrectly folded and unfolded ones. The solvation models EAS, I-SOLV, EEF1 and GBobc were analyzed for 15 proteins. Thermal unfolding via molecular dynamics was performed to generate different protein conformations. Structures were considered to be incorrectly folded if the RMSD was high and the Solvent Accessible Surface Area (SASA) was similar to the native one. To evaluate the capacity to discriminate the native structure from the other conformations, we calculated the differences in the solvation free energies for each model. The models I-SOLV, EEF1 and GBobc were able to discriminate folded from unfolded structures, unlike the EAS model. However, no model was able to satisfactorily discriminate correctly folded from incorrectly folded structures. SASA is a determinant of the calculated energies, and the models reflect the compactness of the protein structures but not the observed RMSD variations relative to the native structures. We concluded that the main advantage of using solvation models in PSP methods is to drive the selection toward more compact structures, although they are less useful for discriminating an incorrect compact fold from the native one. |
Kappaun Rocha G*, Lima Custódio F, Emmanuel Dardenne L
*National Laboratory for Scientific Computing (LNCC) Brazil |
F - Macromolecular Structure, Dynamics and Function |
|
F 21
F21 |
Fragment-based protein structure prediction methods allow empirical data to be imported into the model without depending on homologous structures. This study aims to assess the ability of distinct fragment libraries to correctly reproduce native structures. We developed a program to generate fragment libraries from a non-redundant set of nonhomologous PDB structures. For each amino acid of the target sequence, a list of 200 fragments is ranked according to their likelihood (given by sequence and secondary structure similarities) of reproducing the local structure. The ProFragger software was used to create fragments 9, 6 and 3 residues long. Additionally, a mixed library with fragments of all three sizes was used in the attempt to rebuild the native structure. Separately, fragments were clustered according to structural similarity, allowing a single fragment to represent a larger group. Random positions were sampled throughout the query sequence, as the computational cost of verifying every combination makes an exhaustive approach prohibitive. Fragments that better fit local and global DME criteria are inserted in their respective positions. The mixed libraries consistently yielded better models, as they benefit from the advantages of both larger and smaller fragments: the former reproduce secondary structures more accurately and are used in the early stages of the algorithm, while the latter improve loop and coil regions later in the simulation. Clustering the libraries by structural similarity had no substantial negative effects on the quality of the model and might be an interesting feature for ab initio methods, as it reduces the search space. |
Trevizani R*, Custódio F, Dardenne L
*LNCC Brazil |
F - Macromolecular Structure, Dynamics and Function |
|
F 22
F22 |
The Seattle Structural Genomics Center for Infectious Disease (http://ssgcid.org) is funded by NIAID to solve protein structures from biodefense organisms and emerging infectious diseases. Structure determination efforts are distributed over four collaborating sites: Seattle BioMed, Emerald, UW and PNNL. Community input is actively solicited to identify essential enzymes, virulence factors, drug targets and vaccine candidates of biomedical relevance for our structure determination pipeline. Since project inception in late 2007, SSGCID has deposited over 500 protein structures in the Protein Data Bank (PDB), representing the majority of available structures for several organisms. In addition, materials such as clones and proteins are made available free of charge to the research Community.
The SSGCID bioinformatics infrastructure supports a broad range of tasks, from target selection and tracking to weekly progress reports and impact analysis. This poster will present the core data management system, which records the progress of each target as it passes through the SSGCID pipeline at the four collaborating sites. It relies on a main centralized target tracking database that registers outcome and materials at each step of the structure determination pipeline, with satellite databases feeding summary information from additional processes such as the management of community requests, the selection of potential ligands, the sequence validation of clones and the integration of MetaCyc pathways. The emphasis is put firmly on making raw data available to large public resources (including PDB, BEIR, PATRIC and EuPathDB) to allow results and materials to reach the Community in an efficient and reliable manner. |
Phan I*, Cron L, Subramanian S, Olsen C, Moon W, Ramasamy G, Stacy R, Myler PJ
*Seattle BioMed United States of America |
F - Macromolecular Structure, Dynamics and Function |
|
F 23
F23 |
Myelin basic protein (MBP) is a highly basic multifunctional protein of the central nervous system, whose principal role is to maintain the compactness and integrity of the myelin sheath, acting as a “biological glue”. Because of its involvement in human demyelinating diseases, the investigation of the interaction of MBP with membranes is particularly relevant. The three-dimensional structure of MBP is still unknown; three regions of the protein can potentially form amphipathic alpha-helices and, therefore, are good candidates as interaction sites with membranes. In this study, we investigate the N-terminal segment, ranging from residue Arg29 to Gly48, which also contains a double Phe-Phe site that can strongly contribute to anchoring this region to the membrane. Molecular dynamics simulations were performed on this peptide in the presence of both neutral (DMPC) and mixed neutral-negatively charged (DMPC-DMPS) bilayers. The results show a deeper penetration of the amphipathic helix into the DMPC membrane, anchored by its hydrophobic surface. In the mixed DMPC-DMPS bilayer, instead, the peptide remains closer to the bilayer surface, due to the stronger electrostatic interaction of the MBP basic residues with the negative charges of the headgroups. In both cases, the helix conformation is very stable. Several post-translational modifications (PTMs) are known to be present along the whole protein, acting as a “molecular barcode” that modulates the interaction of MBP with membranes or with signaling proteins. The effects of Arg31 deimination and of Thr33 and Ser44 phosphorylation are currently under investigation in both the neutral and the mixed bilayer systems. |
Polverini E*, De Donatis M, Harauz G
*Department of Physics, University of Parma Italy |
F - Macromolecular Structure, Dynamics and Function |
|
F 24
F24 |
Lysosomal proteins share common features, such as intrinsic stability against proteases and acidic pH, and recognition on the MAN-6-phosphate pathway. Understanding the structural basis for these properties provides a good basis for understanding mutation pathology of lysosomal storage diseases or for improving enzyme preparations for treatment. In addition, these results can be used to design more robust enzymatic tools for industrial proteins, which share similar folds.
Lysosomal proteins are challenging for structural studies because, though stable, they universally undergo post-translational modification including glycosylation, disulphide bridge formation and proteolytic cleavage. Even so, biological databases provide a wealth of information on the available structures. We have collected the experimentally determined 3D structures for lumenal lysosomal proteins from the Protein Data Bank (PDB) and combined these with annotations available in UniProt, CATH and PDBe. From this set of structures, we have analysed the distribution and location of disulphide linkages, salt bridges, glycosylation sites and proline clusters. We analysed evolutionarily conserved structural regions in mammalian lysosomal proteins using a defined sequence set in ConSurf. Typically in our set of structures, the disulphide bridges stabilize the outermost secondary structure to the protein core. Interestingly, although the active site of an enzyme is usually the most flexible part of the protein, our data indicate that lysosomal proteins cluster disulphide bridges in ligand binding areas. All lysosomal proteins are glycosylated and most of the glycosylation sites are occupied. Charged surface residues are not conserved but some structures show characteristic proline clustering on the protein surface. |
Pokharel K*, Repo H, Heikinheimo P
*University of Turku Finland |
F - Macromolecular Structure, Dynamics and Function |
|
F 25
F25 |
Limno-terrestrial tardigrades can withstand almost complete desiccation through a mechanism called anhydrobiosis, and several of these species have been shown to survive the most extreme environments, including exposure to space vacuum. The molecular mechanism of this tolerance has so far been studied in many anhydrobiotic metazoans, leading to the identification of several key factors such as the accumulation and vitrification of trehalose as well as the expression of LEA proteins to prevent protein aggregation. On the other hand, the comprehensive molecular mechanisms and the regulation machinery of metabolic compounds during anhydrobiosis are yet to be explored. To this end, we have conducted a comprehensive metabolome analysis using the tardigrade Ramazzottius varieornatus, which is a potential model species for anhydrobiosis. In order to analyze the metabolic changes between the active and dehydrated states, we measured the metabolome in both conditions using two types of high-throughput mass spectrometry (MS) systems, liquid chromatography time-of-flight MS (LC-TOFMS) for lipids and sugars and capillary electrophoresis TOFMS (CE-TOFMS) for primary metabolites, with three biological replicates for each state. The observed increase, but no significant accumulation, of trehalose in this species suggests a more complex mechanism for anhydrobiosis in comparison to other metazoans. While changes in gene expression profiles between the active and tun states are limited, dynamic changes were observed in the metabolism of this species in response to desiccation. Changes in the metabolic profiles suggested complex intracellular responses to oxidative and osmotic stress. |
Arakawa K*, Ito T, Kunieda T, Horikawa D, Soga T, Tomita M
*Institute for Advanced Biosciences, Keio University Japan |
F - Macromolecular Structure, Dynamics and Function |
|
F 26
F26 |
PPfold is a multithreaded implementation of the Pfold algorithm for RNA secondary structure prediction, which couples a stochastic context-free grammar to an evolutionary model to predict the consensus secondary structure of alignments. We present a new version of PPfold, which extends the evolutionary analysis with a flexible probabilistic model for incorporating auxiliary data, such as data from new, high-throughput and quantitative structure probing experiments. Our tests show that the accuracy of single-sequence secondary structure prediction using experimental data from the SHAPE (selective 2'-hydroxyl acylation analyzed by primer extension) method in PPfold 3.0 is comparable to state-of-the-art software. Furthermore, alignment structure prediction quality is improved even further by the addition of experimental data, making it possible to predict highly accurate structures. PPfold 3.0 is available as a platform-independent Java application, and features an intuitive graphical user interface. The application can be downloaded from http://birc.au.dk/software/ppfold. |
Sukosd Z*, Knudsen B, Kjems J, NS Pedersen C
*Aarhus University Denmark |
F - Macromolecular Structure, Dynamics and Function |
|
F 27
F27 |
Membrane proteins of the ubiquitous Amt/Rh family mediate the transport of ammonium. Despite the availability of different X-ray structures that provide many insights into the ammonium permeation process, the molecular details of its mechanism remain controversial. A variety of permeation mechanisms have been suggested: the passive diffusion of NH3, the antiport of NH4+/H+, the transport of NH4+, or the cotransport of NH3/H+. Based on the structural data for E. coli AmtB, we illustrate the mechanism by which proteins from the Amt family can sustain electrogenic transport. We show that NH4+ can reach a binding site from which it can spontaneously transfer a proton to a pore-lining histidine residue (His168). Then, the substrate diffuses down the pore in the form of NH3 while the excess proton is co-transported through a highly conserved hydrogen-bonded His168-His318 pair. The X-ray structures have revealed that the pores of the prokaryotic AmtB and the eukaryotic RhCG proteins share a similar architecture. However, molecular mechanics simulations of both proteins reveal that small differences in the pore-lining residues might actually alter the properties of the pore. We notably find that the pore of the AmtB transporter can stabilize water molecules to a much greater extent than the pore of RhCG. The possible presence of water molecules in the pore of AmtB opens the door to alternative permeation mechanisms. We discuss the possible permeation mechanisms in both the AmtB and RhCG proteins in light of some recent functional studies, and illustrate how closely related proteins can support quite different mechanisms. |
Baday S*, Wang S, Orabi E, Lamoureux G, Bernèche S
*University of Basel Switzerland |
F - Macromolecular Structure, Dynamics and Function |
|
F 28
F28 |
Suppression of BCR-ABL catalytic activity by the Tyrosine Kinase Inhibitor (TKI) Imatinib Mesylate (IM) has dramatically improved the natural history of Chronic Myeloid Leukemia (CML), ushering in the era of molecular targeted therapy. Despite the unparalleled results achieved by IM, 30% of CML patients become resistant to the compound, mostly because of point mutations that interfere with drug binding, often through mechanisms that remain unclear. Structural insight into the interactions between the BCR-ABL kinase domain and different TKIs will help in defining the mechanisms that allow BCR-ABL mutants to avoid kinase inhibition by most of these drugs. Five amino acids (E286, T315, M318, I360, D381) are critical for IM binding. Two of them (T315 and M318) are maintained in binding the second generation (2G) TKI Dasatinib (DAS), while E286, M318, I360 and D381 are required for the interaction with the third generation (3G) inhibitor Ponatinib (PON). Combining computational and biological approaches, we demonstrate that, with the exception of T315, the four remaining amino acids critical for TKI interaction are pivotal to preserve both BCR-ABL kinase activity and oncogenic potential and therefore cannot be modified. Molecular dynamics simulations of the protein in complex with the three inhibitors mentioned above were performed for the wild type, the clinically relevant mutant T315I and a new mutant, I360T. The simulations highlighted structural evidence of differences in the binding modes and in the perturbations to the dynamical behaviour generated by the pathogenic mutations. We offer a rationale for the success of Ponatinib in overcoming the T315I resistance. |
Buffa P*, Romano C, Pandini A, Vigneri P, Fraternali F
*King's College London United Kingdom |
F - Macromolecular Structure, Dynamics and Function |
|
F 29
F29 |
Lipocalins are small extracellular proteins that share several common molecular recognition properties such as transporting small hydrophobic molecules, binding to specific cell surface receptors and forming macromolecular complexes. Despite high diversity in sequence and function, lipocalins show high structural similarity. The ability of lipocalins to bind ligands is further complicated by the fact that one protein can bind multiple ligands using a similar binding pocket and one ligand can bind to multiple binding pockets in different lipocalins. In this work, our objective is to understand the structural basis for ligand binding in the lipocalin family by quantifying the structural deviations accompanying ligand binding. Both global and local (ligand binding region) structural deviations were quantified in a dataset of 90 bound and unbound pairs. The average global root mean square deviation (RMSD) is found to be 0.9 Å, indicative of small structural variation between bound and unbound forms at the gross level. However, considerable local deviations are observed in the ligand binding region. These local structural deviations are contributed largely by the residues flanking the ligand binding region rather than by the ligand-interacting residues. Our analysis indicates that the ability of lipocalin proteins to bind multiple ligands is largely brought about by local structural changes confined to the ligand binding region and its flanking regions, without significant global changes in the structures. |
Balasubramanian L*, Archunan G, Ramanathan S, Narayanaswamy S
*Bharathidasan University India |
F - Macromolecular Structure, Dynamics and Function |
|
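The global and local deviations described above rest on the root mean square deviation between paired atoms. The sketch below is a minimal illustration that assumes the bound and unbound structures are already superimposed and that atoms are listed in matching order. The abstract does not describe the superposition or atom selection used, and the function name `rmsd` is our own.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) coordinates.

    Assumes the structures are already superimposed and the atoms are
    paired by index (hypothetical preprocessing, not from the abstract).
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))
```

A local RMSD restricted to the ligand binding region would use the same function on the subset of coordinates belonging to that region.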
F 30
F30 |
The evolution of marine mammals from their terrestrial ancestor is one of the most dramatic in the history of vertebrates. Since their divergence ~50 Ma ago, marine mammals have undergone a series of adaptations such as a ~10-30 fold increase in myoglobin (Mb) concentration in skeletal muscle. To better understand the interplay between Mb function and evolution in marine mammals, we integrated state-of-the-art models of convective O2-transport, aerobic dive limits (ADL), and thermochemical data for oxygen binding to mutant Mb in the Weddell seal (Leptonychotes weddellii). We show that conserved WT residues are critical for fitness by prolonging the ADL. For example, His-64 provides up to 14 minutes more dive time under optimal, aerobic conditions. Moreover, Mb shows its full functional proficiency under the physiological conditions of a routine dive, suggesting co-evolution of protein structure, physiology and animal behavior. This work thus quantifies the effect of the increased Mb concentration in marine mammals and justifies the strong purifying selection on the key residues for O2 binding in Mb.
|
Dasmeh P*, Kepp KP
*Technical University of Denmark, DTU Chemistry Denmark |
F - Macromolecular Structure, Dynamics and Function |
|
F 31
F31 |
Motivation. Structure and function are highly correlated. There is an increasing number of RNA structures in the PDB that need to be compared and studied for their biological activity. A number of programs have been developed for alignment, but most of them assume that the structures are rigid; in other words, they penalize according to the RMSD of the alignments. Method. We have implemented a program that uses a sequence of local transformations to evaluate an alignment, instead of a single rigid transformation for the entire matching. It starts by considering several alignments between base fragments, imposing constraints for base pairing. Then, a dynamic programming strategy aligns single bases to single bases, evaluating the rigid transformation of the local neighborhood of each base. A final score is calculated. Results. A benchmark against the ARTS and SARA programs has been carried out and shows overall improvement and some particularly good matchings. The strategy appears promising and can be taken further. |
Rocha J*, Capriotti E
*University of the Balearic Islands Spain |
F - Macromolecular Structure, Dynamics and Function |
|
F 32
F32 |
The alignment of protein structures is essential for determining structural relationships between proteins, so as to predict protein function, classify or identify new folds. The correct assignment of evolutionarily or functionally homologous residues or regions is strongly correlated with the accuracy of the structural alignment method used. Recent studies comparing various structural alignment methods have considered only structures of water-soluble proteins, but not those of integral membrane proteins, which have distinct evolutionary and structural constraints caused by their interaction with the hydrophobic membrane bilayer. Thus, some structure alignment methods may be well suited to particular fold types that are unique to membrane proteins. Moreover, there may be opportunities for improvement that reflect the nature of membrane protein structures. We first compiled a novel data set called HOMEP2012, which contains 198 homologous membrane proteins in 38 different structural families. We then tested the accuracy of 13 recent structural alignment methods on this dataset by comparing homology models generated based on alignments obtained with each of the methods. We evaluated the structural accuracy of the whole structure of the models, as well as their transmembrane segments alone. We will show a comparison of the performance of each of the methods for each of the 38 membrane protein families, and identify those methods that are most suitable for α-helical or for β-barrel membrane proteins, respectively. Finally, we will illustrate the relative drawbacks of the various methods and highlight enhancements that should improve their accuracy specifically for membrane protein structures. |
Stamm M*, Forrest L
*Max Planck Institut of Biophysics Germany |
F - Macromolecular Structure, Dynamics and Function |
|
F 33
F33 |
A correctly annotated proteome is the cornerstone of a successful genome research project, and accurate and reliable tools for this purpose are therefore needed. PANNZER uses summary statistics to select the ‘best’ description from a list of putative homologs generated by BLAST or HMMER3. The matches are (1) rescored using a regression model that weights the contributions from query coverage, target coverage, identity percentage and taxonomic distance in the NCBI species tree and (2) clustered based on functional similarity, measured using the information retrieval statistic term frequency-inverse document frequency (TF-IDF). The best cluster is selected using the Gene Set Z-score (GSZ), a weighted version of the standard hypergeometric Z-score [1].
Our first evaluation set was picked randomly from the HAMAP database and annotated using a UniProt database from which the evaluation dataset had been removed. PANNZER predicted a function identical to the original in 1311 out of 1415 cases (success rate of 93%). Our second evaluation set was an unpublished, manually curated proteome of Leuconostoc gasicomitatum. Blande et al. (unpubl.) showed that RAST outperformed the HAMAP, PGAAP, IMG and IGS annotation pipelines in benchmarking. PANNZER outperformed RAST in 1152 (60%) cases, RAST outperformed PANNZER in 397 (21%) cases, and 364 (19%) cases were tied. Thirdly, PANNZER was ranked 3rd among over 50 methods in the CAFA 2011 competition. 1. Törönen P, Ojala P, Marttinen P, Holm L (2009) Robust extraction of functional signals from gene set analysis using a generalized threshold free scoring function. BMC Bioinformatics 10:307. |
Holm L*, Koskinen P, Nokso-Koivisto J, Toronen P
*University of Helsinki Finland |
F - Macromolecular Structure, Dynamics and Function |
|
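The TF-IDF similarity between hit descriptions mentioned in the preceding abstract can be sketched as follows. This is a generic textbook TF-IDF with cosine similarity, not PANNZER's actual implementation; whitespace tokenization, the weighting formula and the example descriptions are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(descriptions):
    """Return one {term: weight} dict per description (plain TF-IDF)."""
    tokenized = [d.lower().split() for d in descriptions]
    n_docs = len(tokenized)
    # Document frequency: in how many descriptions each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        if not tokens:
            vectors.append({})
            continue
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical hit descriptions for one query protein.
hits = [
    "DNA polymerase III subunit alpha",
    "DNA polymerase III subunit beta",
    "putative membrane transporter",
]
vecs = tfidf_vectors(hits)
related = cosine_similarity(vecs[0], vecs[1])
unrelated = cosine_similarity(vecs[0], vecs[2])
```

Descriptions that share informative terms score higher than those that share none, which is the property a clustering step over descriptions would rely on.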
F 34
F34 |
The increasing number of bioinformatics tools for protein structure prediction calls for quality assessment. Choosing good decoy models is easy only if an experimentally determined structure of the protein, or of a close homolog, is known; otherwise it is very difficult. The most successful model quality assessment programs are based on clustering and comparison of candidate decoy models. These programs typically use structure superposition quality measures, such as root mean square deviation (RMSD) or related measures, which have some flaws. Other interesting measures that can be applied are the energetic or functional accuracy of a model. Ion channels are especially difficult targets for crystallography. Here, we study the usefulness of flow models for assessing structural channel models and correlate the results with standard methods. The electrostatic potential profile of an ion channel plays an important role in ion flow. This study shows how the RMSD of ion channel structural models correlates with functional characteristics, e.g. with the electrostatic potential profile along the channel axis. Energetic stability parameters of model structures and their correlation with flow characteristics of ion channel models and potential profiles are also considered.
To evaluate our method, we tested 2000 different structural models of the KcsA channel. The Kendall correlations between RMSD, root mean square error (RMSE), and other electrostatic parameters of the potential profiles were significant. The area under the receiver operating characteristic curve, based on RMSE, was ca. 0.90 for RMSD = 4 Å. |
Rybicka M*, Dyrka W, Konopka B, Kotulska M
*Institute of Biomedical Engineering and Instrumentation, Wroclaw University of Technology, Wroclaw, Poland Poland |
F - Macromolecular Structure, Dynamics and Function |
|
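The Kendall rank correlation used in the preceding abstract to relate RMSD to profile-based measures can be sketched in a few lines. This is the simple tau-a variant, in which tied pairs count as neither concordant nor discordant; the abstract does not say which variant was used, and the numbers below are made up for illustration.

```python
def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical values: structural RMSD of four models and the RMSE of
# their potential profiles against a reference profile.
model_rmsd = [1.0, 2.0, 3.0, 4.0]
profile_rmse = [0.10, 0.20, 0.30, 0.40]
tau = kendall_tau(model_rmsd, profile_rmse)
```

The double loop is quadratic in the number of models, which is acceptable for a few thousand models; a library routine would be preferable for larger sets.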
F 35
F35 |
HTLV-I encodes a virus-specific aspartic protease (PR) that processes the gag and gag-pro-pol polyproteins, leading to proliferation; this protease is therefore essential for retroviral replication. Recent research reports that anti-HIV-PR inhibitors cannot function as HTLV-PR inhibitors, and several HIV-PR inhibitors have failed to show activity against HTLV-PR, so predicting a potent inhibitor of HTLV-PR is essential for anticancer drug design. Here we concentrate on the binding pocket of HTLV-PR, which lies between the two monomer chains, and generate new libraries based on available compounds, in the manner of de novo drug design. The library compounds are treated with QM/MM for accurate charge calculation, since understanding ligand binding, active conformation and chirality is important for the proper execution of rational drug design. Energetic studies of interaction, binding and cavity-specific energies using QM/MM charges, together with molecular dynamics studies, indicate that the new library compounds interact more efficiently with HTLV-PR. The docked conformation of the inhibitor is preserved during MD simulations, and at least four hydrogen bonds form between the inhibitor and the enzymatic pocket. Our results provide a platform for developing more effective compounds that perform better in interactions, electronic transfer reactions (HOMO/LUMO), energetic studies and molecular dynamics studies. The research provides novel chemical scaffolds for HTLV drug discovery. |
Singh SK*, Selvaraj C
*Alagappa University India |
F - Macromolecular Structure, Dynamics and Function |
|
F 37
F37 |
The Gram-negative bacterium Campylobacter is one of the most common causes of acute bacterial diarrhoea (campylobacteriosis) worldwide, with ingestion of infected poultry products being the principal route of infection. Elucidating the Campylobacter proteins that have the potential to interact with the chicken immune system (i.e. autotransporter proteins) may lead to new approaches for eliminating Campylobacter from chickens and hence reduce the incidence of human infection. Autotransporter proteins often serve as virulence factors involved in pathogenesis, including disabling host defences and colonization.
A range of bioinformatics techniques have been employed to identify autotransporter proteins, which may have propensities for direct interactions with the chicken immune system. Of the approximately 1600 proteins encoded by the genome, up to 51 are predicted to be autotransporter proteins. We have identified a subset of autotransporter protein passenger domains that do not appear to have homologs in other species and may be specific to Campylobacter. These proteins may perform roles that are critical to Campylobacter, and potentially interact with chicken immune system cells. We have employed a variety of fold recognition techniques to predict the structures of these novel autotransporters and their passenger domains, and examined their structures, motions, and possible interactions with lipid membranes using molecular modelling and simulations. |
Mu X*, Hung A, Smooker P
*RMIT University Australia |
F - Macromolecular Structure, Dynamics and Function |
|
F 38
F38 |
Statistical energy functions (SEFs) are an important tool in protein structure science. SEFs have a broad range of applications, such as protein structure prediction, 3D-model assessment and prediction of stability changes upon point mutations. SEFs are usually evaluated on structure decoy sets, such as those collected in the Decoys 'R' Us database, and subsequently applied to the above-mentioned tasks. It remains unclear to what extent the numerous parameters of a given SEF implementation are valid across the different applications.
We enumerate the typical parameters for a pair-SEF and a contact-SEF, optimize the two SEFs for native fold identification in a decoy set, and apply them to predictions of change in stability. We then optimize the SEFs for predictions of change in stability using the ProTherm database and apply them to native fold identification. The parameters tested include distance resolution and scope, pooling of different sequence separations, pooling of different protein sizes, the reference state, and others. We aim to use our SEFs for an in-silico mutagenesis approach. Therefore, we finally compare their predictive power with several available methods on a subset of the ProTherm database. |
Laimer J*, Lackner P
*University of Salzburg Austria |
F - Macromolecular Structure, Dynamics and Function |
|
F 40
F40 |
We previously developed a strategy to compare conformational ensembles of proteins using Self-Organizing Maps (SOMs) combined with hierarchical clustering. This strategy includes three steps: a) Cα’s Cartesian coordinates of the conformations are encoded as input vectors for the SOM; b) the SOM is trained and its neurons are clustered; c) the conformations belonging to each cluster are extracted and compared.
We present two applications of this protocol in order to discuss both the interpretability and the topological properties of the clustered SOM. First, we investigated the role of flexibility in protein binding, with a focus on the differences between bound and unbound forms of transient complexes including members of the RAS “superfamily”. We show the possibility of using conformations sampled by two different methods (Molecular Dynamics and tCONCOORD) as input for the SOM, highlighting how different samplings are reflected by the SOM topology. Second, we identified the specific mutations that most efficiently convert the flexibility of a psychrophilic enzyme (AHA) to that of its mesophilic counterpart (PPA). This was achieved by annotating the PPA conformations with the map trained on some AHA variants. Projecting local features, not used as input information, onto the clustered SOM allowed a clear annotation of the functional differences among the structural clusters. The applications presented here support the use of this approach as a general and exploitable protocol for multiple comparison of protein conformational ensembles and highlight its potential for protein engineering. |
Fraccalvieri D*, Pandini A, Stella F, Bonati L
*Department of Environmental Sciences, University of Milano-Bicocca, Milan, Italy Italy |
F - Macromolecular Structure, Dynamics and Function |
|
F 41
F41 |
There is great interest in predicting the structural and functional effects of point mutations in proteins. Such mutations may occur in nature as variants, with a high impact on human health in the case of genetic pathologies, or they can be engineered to obtain proteins with enhanced properties. In recent years we applied a computational approach to investigate the mutant forms of the GALT enzyme, whose loss of function causes Classic Galactosemia. We have now defined a semi-automatic procedure to apply the same approach to other proteins, with the aim of investigating other human pathologies related to protein mutations. The procedure applies several software tools to generate models of the mutants and to analyze them. In detail, the mutate_model script of Modeller is used to generate models; DSSP to analyze their secondary structure; NACCESS to evaluate solvent exposure, complemented by an original tool that identifies interface residues in oligomeric proteins; HBPLUS for H-bonds; a combination of online tools for stability; and original tools for salt bridges and for detecting residues involved in functional sites. Original Perl scripts automatically extract the information of interest from the output generated by the programs, present the results as html pages, and prepare further html pages for web publication.
Acknowledgements: This work has been performed in the framework of the Italia-USA programme “Farmacogenomica Oncologica”, and also of FLAGSHIP “Interomics” Project (PB.P05) funded and supported by the Italian MIUR and CNR organizations. |
Facchiano A*, Marabotti A
*Institute of Food Science, National Research Council, Italy. Italy |
F - Macromolecular Structure, Dynamics and Function |
|
F 42
F42 |
Intrinsic protein disorder has been proven to play a major role in protein function and in the determination of a protein's biological activity. Even though its importance is nowadays generally agreed upon, there is still a lack of consensus regarding a common definition, terminology and nomenclature. Moreover, a comprehensive resource of intrinsic protein disorder annotations was not available until recently. Here we present an analysis of the datasets available in the MobiDB database, which provides disorder annotations for the complete SwissProt database. MobiDB annotations include multiple data sources, both experimental (DisProt, PDB X-ray structures, PDB NMR structures, and user-submitted annotations) and predicted. The featured predictors, ESpritz and IUPred, are fast and accurate enough to allow genome-scale predictions to be performed. In addition to these direct sources, MobiDB provides a novel “consensus” disorder annotation, based on a user-customizable combination of weighted annotation sources. The ultimate goal of this work is to improve our understanding of the differences between disorder flavours and to identify their particular characteristics. To achieve this, we analyse and compare the available disorder data sources, display results and conclusions, and reflect on potential future research directions. |
Di Domenico T*, Walsh I, Tosatto S
*University of Padua Italy |
F - Macromolecular Structure, Dynamics and Function |
|
F 43
F43 |
The transient receptor potential cation channel, subfamily A, member 1 (TRPA1) belongs to a group of membrane receptors of great importance in the transfer of sensory information. It is a non-selective cation channel mainly expressed in nociceptive neurons. TRPA1 forms homotetramers with a transmembrane domain, which is similar to that of voltage-gated potassium (Kv) channels, and with large cytosolic termini. TRPA1 is regulated by calcium ions, but the mechanisms of this regulation are unclear. Despite its importance, no TRPA1 structure at atomic resolution is available. The only direct structural information to date comes from an electron density map with a resolution of 16 Å (Cvetkov et al., JBC, 2011).
In this study, we tried to extend available structural information using tools of homology modeling and molecular dynamics simulations. First, we demonstrated the ability of acidic residues clustered in the TRPA1 carboxy terminus to bind calcium ions. The TRPA1 calcium binding site was modeled using a structure of the cytoplasmic domain of the calcium-activated potassium (BK) channel. Second, we created a homology model of the transmembrane domain of TRPA1 based on its homology with voltage gated potassium (Kv) channels. This model was further refined using the Molecular Dynamics Flexible Fitting method. |
Zíma V*, Barvík I
*Institute of Physics, Faculty of Mathematics and Physics, Charles University in Prague Czech Republic |
F - Macromolecular Structure, Dynamics and Function |
|
F 44
F44 |
In an attempt to understand why a human monoclonal antibody is not equally efficient at neutralizing all Dengue serotypes, we compared the structures of its complexes with four different antigens. The antibody uses two distinct binding modes and its orientation changes in a range of approximately 60 degrees on the various serotypes, causing steric clashes on the viral surface that prevent neutralization of one of them. We were able to exploit these binding differences to design antibody mutants limiting its cross-reactivity and improving its immunological properties, conferring up to 40 fold increased neutralization. The successful rational design of such mutants is a testament to the accuracy achievable by combining experimental NMR epitope mapping with computational docking. |
Simonelli L*
*I.R.B. Switzerland |
F - Macromolecular Structure, Dynamics and Function |
|
F 46
F46 |
Motivation: Protein disordered domains challenge structure-function relationships developed for globular domains and the corresponding sequence analysis. We use the papillomavirus E7 oncoprotein to develop concepts for evolutionary studies of disordered domains using over 200 available natural sequences. Results: The intrinsically disordered (E7N) and globular (E7C) domains of E7 show similar degrees of conservation and coevolution. Sequence evolution of E7N can be described in terms of conserved and coevolving linear motifs separated by variable linkers. Sequence evolution of E7C is compatible with the known homodimeric structure but also suggests additional activities for the domain. Additional cysteine residues in E7C proximal to the four canonical zinc-binding cysteines may play a role in redox regulation of E7 function. Moreover, we describe a conserved peptide binding site on the surface of E7C and suggest a putative target linear motif. The homodimerization and peptide binding activities of E7C are also present in the distantly related host PHD domains. Finally, we integrate the multiple activities and conformations of E7 into a hierarchy of structure-function relationships.
|
Chemes LB*, Glavina J, Alonso LG, Marino-Buslje C, de Prat-Gay G, Sánchez IE
*Structure-Function and Engineering Laboratory, Fundación Instituto Leloir IIBBA CONICET Argentina |
F - Macromolecular Structure, Dynamics and Function |
|
F 47
F47 |
We present an effective approach for predicting ion binding sites in protein structures using knowledge-based potentials. The potentials are trained on a non-redundant subset of the PDB molecular structure database. A novel reference state definition allows us to obtain potentials for bio-molecular interactions at a very high level of detail, over an unlimited range of contact distances.
The fast CUDA implementation of the algorithm gives a 20x speed gain over the existing CPU implementation and allows larger structures to be analyzed effectively. Our tests demonstrate that the tool can predict the locations of ions bound in protein structures with high accuracy, with a median RMSD of less than 1.0 Å, and that it also provides good discrimination between ions of different types: the correct ion type is predicted in about 90% of cases. The PIONCA (Protein ION CAlculator) software is freely accessible at http://line.bioinfolab.net/ion-calculator-gpu/ |
Uroshlev L, Kulakovskiy I, Esipova N, Makeev V, Rahmanov S*
*Vavilov Institute of General Genetics Russian Federation |
F - Macromolecular Structure, Dynamics and Function |
|
F 50
F50 |
Recent advancement in computational methods for protein structure prediction has made it possible to generate high quality de novo models required for ab initio phasing of crystallographic diffraction data using molecular replacement. Despite those encouraging achievements in ab initio phasing using de novo models, its success is limited only to those targets for which high quality de novo models can be generated. In order to increase the scope of targets for which the ab initio phasing with de novo models can be successfully applied, it is necessary to reduce the errors in the template de novo models. Here, an approach is introduced that can identify and rebuild the residues with larger errors, which subsequently reduces the overall C-alpha root mean square deviations (CA-RMSD) to the native protein structure. The error in a predicted model is estimated by the average pairwise geometric distance per residue computed among selected lowest energy coarse-grained models. This score is subsequently employed to guide a rebuilding process that focuses on more error-prone residues in the coarse-grained models. This rebuilding methodology has been tested on ten protein targets that were unsuccessful with previous methods. The accuracy of coarse-grained models was improved on average by 0.64 Å on CA-RMSD. These rebuilt coarse-grained models were then turned into all-atom models and refined to produce improved de novo models for molecular replacement. Seven diffraction datasets were successfully phased using rebuilt de novo models indicating the improved quality of these rebuilt de novo models and the effectiveness of this rebuilding process. |
Shrestha R*, Simoncini D, Zhang KYJ
*RIKEN, University of Tokyo Japan |
F - Macromolecular Structure, Dynamics and Function |
|
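The per-residue error estimate described in the preceding abstract, the average pairwise distance per residue among selected low-energy models, can be sketched as below. The sketch assumes the models are already superposed and reduced to one C-alpha point per residue; the function names and the rebuild threshold are hypothetical and not taken from the poster.

```python
import math
from itertools import combinations


def per_residue_spread(models):
    """Average pairwise distance per residue across a set of models.

    `models` is a list of models, each a list of (x, y, z) C-alpha
    coordinates of equal length. Superposition is assumed to have been
    done beforehand (an assumption of this sketch).
    """
    n_res = len(models[0])
    if len(models) < 2 or any(len(m) != n_res for m in models):
        raise ValueError("need at least two models of equal length")
    pairs = list(combinations(range(len(models)), 2))
    spread = []
    for r in range(n_res):
        total = sum(math.dist(models[a][r], models[b][r]) for a, b in pairs)
        spread.append(total / len(pairs))
    return spread


def residues_to_rebuild(spread, threshold):
    """Indices of residues whose spread exceeds a (hypothetical) cutoff."""
    return [i for i, value in enumerate(spread) if value > threshold]


# Toy ensemble: residue 0 agrees in all models, residue 1 does not.
ensemble = [
    [(0, 0, 0), (1, 0, 0)],
    [(0, 0, 0), (1, 3, 0)],
    [(0, 0, 0), (1, 0, 4)],
]
spread = per_residue_spread(ensemble)
flagged = residues_to_rebuild(spread, threshold=2.0)
```

Residues on which the low-energy models disagree are flagged as likely error-prone, which is the signal a rebuilding step could then focus on.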
F 51
F51 |
Proteins often assemble into multimeric complexes in order to perform a specific biological function. Trapping these high-order conformations is, however, difficult experimentally. Predicting how proteins assemble using in silico techniques can therefore be of great help. However, the size of the associated conformational space and the fact that proteins are intrinsically flexible structures make this optimization problem extremely challenging. Nonetheless, known experimental spatial restraints can guide the search process and help to model biologically relevant states. We present a swarm intelligence-based optimization algorithm (Protein Optimization Workbench: POW) for the prediction of macromolecular assemblies. We show how, by exploiting a limited set of known experimental restraints, our novel method can successfully (i) predict the arrangement of a variety of protein assemblies according to a predefined symmetry, and (ii) sample the conformation of protein-protein binary complexes. In addition, we show that including the native flexibility of each protein subunit is a key ingredient for the prediction of biologically functional assemblies. As a practical application, we study the beta pore-forming toxin aerolysin. Upon binding to specific receptors on the target cell’s membrane, the protein first heptamerizes into a prepore state, and subsequently inserts into the lipid bilayer by means of large conformational changes. On the basis of available X-ray structures and density maps, we use POW to predict both the prepore and pore structures of aerolysin. Our results are experimentally validated and shed new light on the toxin's mechanism of assembly and membrane insertion. |
Degiacomi M*, Iacovache I, van der Goot G, Dal Peraro M
*EPFL Switzerland |
F - Macromolecular Structure, Dynamics and Function |
|
F 52
F52 |
One of the major challenges in structural biology is to determine the structures of macromolecular complexes and to understand their function and mechanism of action. However, structural characterization of macromolecular assemblies is very difficult. A hybrid computational approach is required that can incorporate spatial information from a variety of experimental methods into the modeling procedure.
We have developed PyRy3D, a method for building low-resolution models of large macromolecular complexes. The components (proteins, nucleic acids and any other type of physical object, including e.g. solid surfaces) can be represented as rigid bodies (e.g. based on atomic coordinates of structures determined experimentally or modeled computationally) or as flexible shapes (e.g. for parts whose structure is dynamic or unknown). The model building procedure applies a Monte Carlo approach to sample the space of solutions. Spatial restraints are used to define components interacting with each other, and a simple scoring function is applied to pack them tightly into the contours of the entire complex (e.g. cryoEM density maps). This approach enables the construction of low-resolution models even for very large macromolecular complexes with components of unknown 3D structure, such as human mitochondrial RNA polymerase gamma. |
Kasprzak JM*, Potrzebowski W, Dobrychłop M, Bujnicki JM
*Faculty of Biology, Adam Mickiewicz University Poland |
F - Macromolecular Structure, Dynamics and Function |
|
F 53
F53 |
Biomass contains abundant amounts of cellulose as crystalline microfibrils. A limiting step to using cellulose as an alternative energy source, however, is the hydrolysis of the biomass and subsequent transformation into fuels. Cellulose is insoluble in most solvents including organic solvents and water, but it is soluble in some ionic liquids like BMIM-Cl.
This project aims to find alternative solvents that are less expensive and more environmentally benign than the ionic liquids. All-atom molecular dynamics simulations were performed on dissociated glucan chains separated by multiple (4-5) solvation shells, in the presence of several novel solvents and solvent mixtures. The solubility of the chains in each solvent was assessed by contact calculations after equilibration of the molecular dynamics. Pyridine and imidazole acted as the best solvents, because their aromatic electronic structure effectively disrupted the inter-sheet interactions among the glucan chains in the axial direction, and because perturbation of the solvent interactions in the presence of glucan chains was minimal. |
Das R*, Chu J
*University of California, Berkeley United States of America |
F - Macromolecular Structure, Dynamics and Function |
|
F 54
F54 |
For a quantitative description of the effect of medium polarity on the tendency of DNA to denature, we used a nonempirical quantum-chemical method, density functional theory (DFT), to calculate the lactam-lactim and amine-imine tautomeric equilibrium constants KT(L) and KT(A), the frequency of mutation (νm), and the activation (ΔE#) and reaction (ΔE) energies of proton transfer processes in guanine-cytosine (GC) and adenine-thymine (AT) complementary pairs, depending on the length of the hydrogen bond triad. There are two different tautomeric equilibria in the cytosine (C) – guanine (G) and adenine (A) – thymine (T) pairs, lactam-lactim (L) and amine-imine (A), and consequently they are described by two different equilibrium constants: KT(L) and KT(A).
These results show that the tautomeric equilibrium constants (KT) are inversely proportional to the lengths of the intermolecular hydrogen bond triads RNHO and RNHN. NMR spectrometry data have shown that polar solvents provoke a decrease of the tautomeric equilibrium constant; that is, KT is inversely proportional to the solvent polarity parameter ET. It follows that RNHO ∝ ET: the lengths of the intermolecular hydrogen bond triads RNHO and RNHN are directly proportional to the solvent polarity parameter ET. Hence, a less polar aqueous medium, formed by the addition of alcohols, ketones and other organic agents, provokes narrowing of the two-stranded double helix, resulting in an increase in the frequency of proton transfer between nucleotide bases, which increases the tendency to denaturation. This model, using experimentally determined values of the ET polarity parameter for the medium surrounding DNA, can be used for a quantitative estimation of its tendency to denaturation. |
Kereselidze J, Kvaraia M*, Pachulia Z
*Sukhumi State University Georgia |
F - Macromolecular Structure, Dynamics and Function |
|
F 55
F55 |
We use molecular dynamics (MD) simulations to investigate the mechanisms and key residues that initiate conformational changes occurring as a result of changes in environmental conditions [1,2]. Using these simulations as a basis and supplementing them with toolkits such as perturbation response scanning (PRS) [3-5] and Protein Geometrical Pathways [6], we probe conformational changes that take place on time scales much slower than those accessible by MD. We find that the E31A mutation reproduces the structural changes observed on the 100-microsecond time scale [7] but reduces the time scale of the conformational change by three orders of magnitude. The structure attained is consistent with those from an NMR ensemble [8,9]. We use constant force steered molecular dynamics (CF-SMD) [10] to analyze the energy barrier between two conformations and the effect of the mutation on this barrier. These findings give clues as to how local perturbations in the protein lead to shifts in the energy landscape, paving the way to other conformational states.
1. Atilgan, A. R. et al. J. Chem. Phys. 2011, 135. 2. Negi, S. et al. J. Phys. Chem. B 2012, 116. 3. Atilgan, C. et al. PLoS Comput. Biol. 2009, 5. 4. Atilgan, C. et al. Biophys. J. 2010, 99. 5. Atilgan, C. et al. Annu. Rev. Biophys. 2012, 41. 6. Farrell, D. W. et al. Proteins: Struct., Funct., Bioinf. 2010, 78. 7. Slaughter, B. D. et al. J. Phys. Chem. B 2004, 108. 8. Gsponer, J. et al. Structure 2008, 16. 9. Bertini, I. et al. J. Am. Chem. Soc. 2010, 132. 10. Izrailev, S. et al. Biophys. J. 1997, 72. |
Aykut AO*, Atilgan C, Atilgan AR
*Biozentrum Switzerland |
F - Macromolecular Structure, Dynamics and Function |
|
F 56
F56 |
α-helices are the major secondary structures in proteins. These helices can include different numbers of amino acids, mostly from 3 to 25. The preferences of the 20 amino acids are known to differ for each position in these structures. There are two plausible paths for the formation of longer helices during evolution: extension, via the addition of one or more amino acids to shorter helices, and/or merging of two short helices in different combinations to make a longer one. In the current work, these two hypotheses have been studied by analysing the positional preferences of amino acids in helices of different sizes. The results indicate that neither merging nor extension can account for the formation of long α-helices from shorter structures during evolution. |
Fallahi H, Yari K*
*Kermanshah University of Medical Sciences Iran |
F - Macromolecular Structure, Dynamics and Function |
|
F 57
F57 |
The interaction between the T-cell receptor (TCR) and the major histocompatibility complex (MHC) is one of the most important events for the elicitation of an adaptive immune response. However, the detailed structural mechanism by which a T-cell determines peptide immunogenicity is still not known. In particular, techniques for the prediction of peptide immunogenicity are only sparsely found in the literature. Most studies use the peptide binding affinity as an approximation of immunogenicity, which is a logical approach since binding is a necessary prerequisite for immunogenicity. However, not every strong binder induces an immune reaction. Hence binding affinity alone does not provide the whole picture.
Molecular Dynamics simulations (MDS) are a computational method to investigate particle motions over time at atomic resolution. This technique allows insight into the structural dynamics of protein complexes, which is not possible with current experimental techniques. In this study we present MDS of 172 altered Epstein-Barr-Virus (EBV) peptides bound between LC13 TCR and HLA-B*08:01. We show that the initial relaxation dynamics during the first 10 ns of the MDS differ between TCR/peptide/MHC complexes of different peptide immunogenicity levels. Although the differences are subtle, we show by the use of permutation tests that they are statistically significant. |
Knapp B*, Dorffner G, Schreiner W
*Medical University of Vienna Austria |
F - Macromolecular Structure, Dynamics and Function |
|
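The permutation test mentioned in abstract F57 can be illustrated with a minimal sketch. The abstract does not specify the test statistic or the data layout, so the choice below (absolute difference of group means over a per-complex scalar such as an early-relaxation measure) and the function name `permutation_test` are assumptions made purely for illustration.

```python
import random
from statistics import fmean

def permutation_test(group_a, group_b, n_permutations=10000, seed=0):
    """Two-sided permutation test on the absolute difference of group means.

    Hypothetical helper: the statistic and inputs are illustrative
    assumptions, not the procedure used in the abstract.
    """
    rng = random.Random(seed)
    observed = abs(fmean(group_a) - fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign the group labels by shuffling the pooled values.
        rng.shuffle(pooled)
        diff = abs(fmean(pooled[:n_a]) - fmean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)
```

A small p-value indicates that a difference as large as the observed one rarely arises when group labels are assigned at random.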
F 58
F58 |
Sequence conservation is used to assess the importance of individual amino acids in proteins. It can be estimated by measuring the degree to which the frequency of amino acids at a given position differs from an expected value in a multiple sequence alignment of a protein family. This approach has been used to find conserved positions in alpha helices. However, it does not address the cooperative nature of amino acids in a protein's structure and function. “Protein sectors” have recently been introduced as a new protein substructure. Using Halabi's approach (Halabi et al., 2009), we have studied highly conserved and coordinated positions in alpha helices of different sizes. We classified the alpha helices according to the number of amino acids. Then we studied the correlated amino acids in each class of helices. Our findings revealed that the conserved positions differ among the classes of alpha helices. While in some helices many positions do not show any coordination, a high degree of coordination was observed in other classes of alpha helices. These findings would be useful in designing alpha helices of a desired size. They could also be helpful for site-directed mutagenesis studies where one wishes to change the properties of alpha helices. |
Fallahi H, Miraghaee S*
*Kermanshah University of Medical Sciences Iran |
F - Macromolecular Structure, Dynamics and Function |
|
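The per-position conservation estimate described in abstract F58 (how far observed amino acid frequencies in an alignment column deviate from an expected value) can be sketched as a relative entropy. The uniform background distribution and the function names are assumptions for illustration; the abstract does not state which background or estimator was used.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_conservation(column, background=None):
    """Relative entropy (in nats) of one alignment column vs. a background.

    Assumes a uniform background unless one is supplied; gaps and other
    non-standard symbols are ignored.
    """
    if background is None:
        background = {aa: 1 / 20 for aa in AMINO_ACIDS}
    residues = [aa for aa in column if aa in background]
    if not residues:
        return 0.0
    total = len(residues)
    score = 0.0
    for aa, count in Counter(residues).items():
        freq = count / total
        score += freq * math.log(freq / background[aa])
    return score

def alignment_conservation(sequences):
    """Score every column of an alignment given as equal-length strings."""
    return [column_conservation(col) for col in zip(*sequences)]
```

Under the uniform background, a fully conserved column scores log(20) and a column containing all 20 residues equally often scores 0.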
F 59
F59 |
Formaldehyde has long been recognized as a hazardous environmental agent highly reactive with DNA. Recently, it has been realized that, due to the activity of histone demethylation enzymes within the cell nucleus, formaldehyde is produced endogenously, in direct vicinity of genomic DNA. Should it lead to extensive DNA damage? We address this question with the aid of a computational mapping method, analogous to X-ray and NMR techniques for observing weakly specific interactions of small organic compounds with a macromolecule in order to establish important functional sites. We concentrate on the leading reaction of formaldehyde with free bases: hydroxymethylation of cytosine amino groups. Our results show that in B-DNA cytosine amino groups are totally inaccessible for the formaldehyde attack. Then we explore the effect of recently discovered transient flipping of Watson-Crick (WC) pairs into Hoogsteen (HG) pairs (“Hoogsteen breathing”). Our results show that the HG base pair formation dramatically affects the accessibility for formaldehyde of cytosine amino nitrogens within WC base pairs adjacent to HG base pairs. The extensive literature on DNA interaction with formaldehyde is analyzed in light of the new findings. The obtained data emphasize the significance of DNA HG breathing. |
Bohnuud T*, Beglov D, Ngan CH, Zerbe B, Hall D, Brenke R, Vajda S, Frank-Kamenetskii M, Kozakov D
*Boston University United States of America |
F - Macromolecular Structure, Dynamics and Function |
|
F 60
F60 |
Human alcohol dehydrogenase class V (ADH5) has been successfully expressed as a fusion protein with green fluorescent protein, and also with glutathione-S-transferase. However, it has never been isolated as a native protein, nor shown any activity towards the traditional alcohol dehydrogenase substrates. We have used computational methods to study structure and properties of this protein. The structure was generated using homology modelling based on multiple ADH structures, and properties were examined using molecular dynamics.
The molecular dynamics simulations implied that the regions involved in dimer interactions behave in a different way than corresponding regions in other ADH enzymes, mainly causing increased structural variability in the central beta sheets. The dimer formation is known to be important for the stability and function of other ADH enzymes. The increased structural variability implies that while the protein is expressed at the transcript level, the stability of the ADH5 dimer is compromised, which in turn would explain the lack of activity and that a dimeric ADH5 has never been isolated. Modelled ADH5 structures with sequence segments modified into those from other ADH enzymes decreased the structural variability, but not down to the level of the other ADH enzymes, implying that the instability is focused to only one part of the sequence. |
Östberg LJ*, Persson B, Höög J
*Department of Medical Biochemistry and Biophysics, Karolinska Institutet Sweden |
F - Macromolecular Structure, Dynamics and Function |
|
F 61
F61 |
The Protein Model Portal (PMP) has been developed to foster effective use of molecular models in biomedical research by providing convenient and comprehensive access to structural information for a specific protein. For the first time both experimental structures and theoretical models for a given protein can be searched simultaneously, and analyzed for structural variation. The current release allows searching 24.2 million model structures for 3.9 million distinct UniProt entries (UP release 2012_07).
Ultimately, the accuracy of a structural model determines its utility for specific applications. Model quality estimation tools allow evaluating the accuracy of generated models. Hence, we present new developments in Protein Model Portal supporting model validation and quality estimation, which consist of (1) service interfaces to several established modeling and model quality estimation tools, (2) a novel analysis tool for protein structure variation for both models and experimental structures, and (3) the CAMEO (Continuous Automated Model EvaluatiOn) system for the continuous evaluation of the modeling servers included in PMP as well as a ligand binding site residues prediction category. By providing a comprehensive view on structural information, the Protein Model Portal not only offers a unique opportunity to apply consistent assessment and validation criteria to the complete set of structural models available for a specific protein, but also allows continuous assessment of the modeling and quality estimation services registered with CAMEO. Visit us at www.proteinmodelportal.org! |
Bordoli L*, Arnold K, Kiefer F, Haas J, Schwede T
*Swiss Institute of Bioinformatics & Biozentrum University of Basel, Switzerland Switzerland |
F - Macromolecular Structure, Dynamics and Function |
|
F 62
F62 |
T-cells play a major role in the adaptive immune response. T-cell receptor molecules (TCRs) are capable of distinguishing between self-peptides and pathogenic peptides presented by major histocompatibility complex molecules (MHC) on cell surfaces. Molecular modeling of TCR:peptide:MHC (TCRpMHC) complexes would allow a better understanding of T-cell signaling and improve the development of immunotherapies and rational vaccine design. On the one hand, TCRs share a common shape; on the other hand, the repertoire of different TCRs is estimated to comprise at least 10 million receptors. Thus, due to the genetic and structural diversity of the receptor, the modeling of the TCRpMHC complex and of the TCR itself, consisting of two chains, is a challenging task. We investigated the structural characteristics of the TCR variable domains and developed an approach to model the TCR geometry. In remodeling studies we were able to correctly predict the TCR’s association geometry in 83% of the cases. |
Hoffmann T*, Antes I
*TU Munich Germany |
F - Macromolecular Structure, Dynamics and Function |
|
F 63
F63 |
CAMEO (http://www.cameo3d.org) is a service for continuous automated assessment of protein structure prediction servers. Protein modeling is widely used in the life science community to build models for proteins for which no experimental structures are available. However, depending on the specific target protein and the applied modeling approach, the accuracy of computational models may vary significantly between different modeling servers.
CAMEO uses the amino acid sequences of the weekly PDB releases to continuously assess the accuracy and reliability of protein structure modeling servers. Retrospective evaluation of prediction accuracy allows users of models to select the most suitable tool for a given modeling problem. CAMEO evaluates prediction accuracy (not "post-dictions"), and hence provides an independent blind benchmark to document the performance of new algorithms. Since the accuracy requirements for different scientific applications vary, CAMEO offers a variety of scores assessing different aspects of a prediction (coverage, local accuracy, completeness, etc.) to reflect these requirements. A second category for continuous assessment is Ligand Binding Site Residue Predictions, which has just opened, along with the possibility to annotate ligands within CAMEO Annotations and thus aid method developers, who in turn will produce more refined predictions for future users of these services. CAMEO is partially supported by the SIB Swiss Institute of Bioinformatics and by grant U01 GM093324-01 from NIGMS for the Nature PSI Structural Biology Knowledgebase. |
Haas J*, Schmidt T, Gallo Cassarino T, Romano V, Bordoli L, Schwede T
*SIB Swiss Institute of Bioinformatics and Biozentrum, University of Basel, Switzerland Switzerland |
F - Macromolecular Structure, Dynamics and Function |
|
F 64
F64 |
The ability to locate in silico, in a protein structure, the active site, the binding site or the initiation site of conformational changes is important for understanding the functional mechanisms and is a prerequisite for the rational prediction of targeted mutations or for protein design. The amino acids in these sites have been optimized during evolution to fulfill a specific function, but not necessarily to improve stability. Often, they correspond to structural weaknesses, that is, to residues that are non-optimal with respect to the structure’s stability. These weaknesses are sometimes compensated by robustnesses, corresponding to residues that bring substantial contributions to the overall stability.
We present a program that automatically detects patches of weaknesses and robustnesses in a protein structure. The stability is estimated using three types of database-derived statistical potentials: a distance potential, a backbone torsion angle potential and a solvent accessibility potential. These potentials allow the identification of stability peculiarities with respect to the tertiary interactions, the local structure or the core/surface pattern, respectively. To illustrate the power of this program, we apply it to bovine ribonuclease, a member of a well-studied enzyme family that binds RNA and undergoes 3D domain swapping under certain conditions. We analyze the location of the functional and structural sites in this protein in relation to the stability patches. |
De Laet M*, Dehouck Y, Gilis D, Rooman M
*BioModeling, BioInformatics & BioProcesses Department, Université Libre de Bruxelles, Belgium Belgium |
F - Macromolecular Structure, Dynamics and Function |
|
F 65
F65 |
In many cellular processes, physical interactions between proteins play crucial roles. For biological functionality, many proteins must organise into protein complexes. On the other hand, biologically non-functional complexes might give rise to a number of pathologies. Existing methods for predicting protein-protein interactions (PPIs) that use statistical learning techniques or other bioinformatic approaches appear to have limited accuracy. We will therefore build a method based on physical principles using simplified molecular models for computational efficiency, while still maintaining the accuracy of calculating the interactions.
We show that we can accurately calculate PPI strength using a coarse-grained forcefield (about four atoms are represented by a unified ‘atom’), yielding almost three orders of magnitude speedup. As a second step, we show that interactions from the interface residues dominate for two different protein complexes: a TCR-pMHC complex and an MP1-p15 scaffolding complex. We performed random mutations on residues in the surface (non-interface), the interface core (buried in complex) and the interface rim (less exposed in complex). Our results show that mutations on the interface core lead to less attractive interactions than mutations on the interface rim, while non-interface surface mutations hardly affect the PPI. We are now able to calculate an interaction potential between any two proteins in a matter of hours. Restricting this to the interface region only will allow us to calculate protein interactions on a genomic scale (tens of millions of PPIs) in about a day on a medium size compute cluster. |
Feenstra KA*, May A, Pool R, Heringa J
*IBIVU Centre for Integrative Bioinformatics, VU University Amsterdam, Department of Computer Science, Faculty of Sciences Netherlands |
F - Macromolecular Structure, Dynamics and Function |
|
F 66
F66 |
The task of predicting binding sites from a protein’s sequence is of high relevance for life science research, ranging from functional characterization of novel proteins to applications in drug design. Consequently, the development of automated methods for predicting ligand-binding sites has received increasing attention over the past years.
To help address relevant biological questions, the predictions need to be specific and accurate. Thus, in CAMEO we continuously assess ligand binding site predictions to evaluate the current state of the art of prediction methods, identify possible bottlenecks, and further stimulate the development of new methods. |
Schmidt T*, Haas J, Gallo Cassarino T, Romano V, Schwede T
*SIB & Biozentrum, University of Basel Switzerland |
F - Macromolecular Structure, Dynamics and Function |
|
F 67
F67 |
Advanced molecular modelling often requires creative combination of multiple complex software packages into a suitable workflow. While computer scientists and IT professionals tend to write sophisticated scripts for the automation of such processes, end users often rely on tedious manual repetition of the individual steps. To provide non-programmers with powerful scripting functionality, graphical workflow toolkits such as KNIME, BioMoby, Taverna, or Galaxy have become increasingly popular over the last decade. The basic idea of these packages is to offer predefined tools (and parameter settings) via graphical (possibly web-based) front ends to a multi-user audience such that the tools can be combined into multi-step workflows. Workflows can then themselves be shared between users, or submitted as supplementary material for a publication.
Here, we present ballaxy, our adaptation of the Galaxy workflow toolkit to the field of molecular modelling. ballaxy integrates the versatile BALL framework, a C++ library of algorithms and data structures for structural bioinformatics. ballaxy aims to make BALL’s rich functionality more easily accessible by exposing it via an intuitive web-based visual programming interface. Apart from basic tools to upload and perform format conversion between standard molecular files, as well as a ligand structure visualization tool, our ballaxy system currently offers 3 non-trivial workflows: BOA Constructor for optimal bond order assignment, NightShift for NMR shift Inference by General Hybrid model training, and Spinster for pure protein chemical shift prediction. ballaxy can be accessed via https://ballaxy.bioinf.uni-sb.de/. Sources of the adapted Galaxy version are available upon request. |
Dehof AK*, Lenhof H, Nickels S, Hildebrandt A
*Center for Bioinformatics, Saarland University Germany |
F - Macromolecular Structure, Dynamics and Function |
|
F 68
F68 |
We are interested in using bioinformatics techniques to gain insight into persistent and variable predicted pockets on protein surfaces as these may represent novel drug targets. We outline our method, Provar, that provides a simple, colour-coded visualisation of a protein's surface in terms of the probability that residues or atoms are involved in predicted pockets. We achieve this visualisation by analysing sets of related structures (such as simulated conformations, sets of homologues, etc), as these are representative of conformational variation that would be observed due to the flexibility of a protein in solution.
This approach allows us to consistently handle diverse homologues and provides visualisation of structural heterogeneity in terms of pocket persistence/variability across members of a given superfamily. Such variations may indicate a pocket's existence on a given protein, even if it is not evident from its crystal structure. We have extended the method to analyse results using simulated conformations of each member of a superfamily and using this 'multi-dimensional' approach we illustrate how pocket-lining probabilities cluster across a set of kinases. Finally, we have extended the method to handle output from protein-ligand docking runs using either GOLD or Autodock Vina. In these cases, Provar-dock allows visualisation of the most persistent ligand contact atoms, across all ligand poses and between multiple protein conformers. We illustrate Provar-dock using simulations of a potassium channel homology model. |
Ashford P*, Moss D, Nobeli I, Williams M
*Institute for Structural and Molecular Biology, Dept of Biological Sciences, Birkbeck, University of London United Kingdom |
F - Macromolecular Structure, Dynamics and Function |
|
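The pocket-lining probability that abstract F68 visualises can be sketched as the fraction of related structures in which a residue lines a predicted pocket. This is a simplified reading of the abstract, not Provar's actual implementation; the function name and the residue identifiers are hypothetical, and the pocket predictions themselves are assumed to come from an external tool.

```python
from collections import Counter

def pocket_lining_probability(pocket_residues_per_structure):
    """Fraction of structures in which each residue lines a predicted pocket.

    Input: one collection of residue identifiers per structure (conformer or
    homologue), assumed to be already mapped to a common numbering.
    """
    n_structures = len(pocket_residues_per_structure)
    if n_structures == 0:
        return {}
    counts = Counter()
    for residues in pocket_residues_per_structure:
        # Count each residue at most once per structure.
        counts.update(set(residues))
    return {res: count / n_structures for res, count in counts.items()}
```

Residues with a probability near 1 would correspond to persistent pockets, and intermediate values to variable ones.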
F 69
F69 |
Fold similarity between surface proteins can be difficult to detect, especially in pathogens, due to highly variable insertions, etc. At the same time, modelled structures can help when such proteins are notoriously difficult to express and refold in vitro and crystal/NMR structures are sometimes pursued for decades - the malarial vaccine candidate antigens Pfs48/45 and Pfs230 in the "6-Cys" (or PGSH) family are a good example. Earlier this year our prediction of a distant evolutionary relationship between these antigens in Plasmodia and the dominant surface protein family in Toxoplasma, the "SAG1-Related Sequences" (SRS) (Gerloff et al. (2005) PNAS 102:13598-13603), was confirmed by a first representative NMR structure (Arredondo et al. (2012) PNAS 109:6692-6697).
However, while the SRS surface proteins and the "6-Cys" (PGSH) family in Plasmodia seem distantly related, their functions and disulfide bonds differ. To find clues about their evolutionary history, we sensitively screened 102,878 predicted proteins from Apicomplexan genomes with merged family HMMs. Among other interesting findings, our searches yielded a surprising one. We identified an atypical SRS/PGSH homolog in the tissue cyst-forming Coccidia (these include Toxoplasma gondii and Neospora caninum) that may resemble the "evolutionary link" we sought, with a predicted "hybrid" disulfide bond pattern. While functional studies are pending, this and molecular modelling help explain how disulfide bonds may "change position" during evolution. Updated 3-D structural models are available in our 6-Cys Domain model database (http://pgsh.soe.ucsc.edu). |
Gerloff DL*, Liaw EY, Draizen E, Kemp FD, Hall DS, Magasin JD
*University of California, Santa Cruz United States of America |
F - Macromolecular Structure, Dynamics and Function |
|
G 01
G1 |
Toxoplasmosis, caused by the parasite Toxoplasma gondii, is the most widespread infection in the world. It is generally benign but leads to drastic consequences in immunocompromised patients and fetuses. The severity of the clinical expression is linked to the infecting strain. Indeed, the genetic make-up of an infecting Toxoplasma gondii strain may be important for the outcome of infection. In order to survey the distribution of different genotypes within clinical manifestations, polymorphism in several congenital toxoplasmosis cases was analysed.
Multilocus SNP identification, from at least 7 biallelic markers, was explored to reveal the polymorphism of the infecting strains and thus to allow the characterization and genotyping of the parasite. The multiple alignment with reference sequences showed the frequent presence of mixed and recombinant strains in Tunisian clinical samples. In these cases, sequence analysis revealed double peaks at known polymorphic sites, indicating the presence of multiple alleles. The construction of an unrooted phylogenetic network, generated from the multiple alignments, demonstrated the intermediate position of the recombinant and mixed strains. A close clustering with environmental African strains is also noted, suggesting that on this continent recombination and mixing of the parasite strains are favored. |
Boughattas S*
*Institut Pasteur Tunis Tunisia |
G - Mutations, Variations, and Population Genomics |
|
G 02
G2 |
Motivation: Genetic Interaction (GI) detection impacts the understanding of human disease and the ability to design personalised treatment. The mapping of every GI in most organisms is far from completion due to the combinatorial number of gene deletions and knockdowns needed. Computational techniques to predict new interactions based only on network topology have been developed but never applied to GI networks.
Methods: We applied several neighbourhood-based and network-embedding techniques to yeast and worm GI networks to predict new links. To investigate the true robustness of each approach, we removed links uniformly at random from the networks and analysed how sparsification impacts prediction. We also tested if a biologically meaningful network topology can be modelled by adding links uniformly at random to the aforementioned sparsified networks.
Results: We show that topological prediction of GIs is possible with high precision. Neighbourhood-based techniques perform better when the network is dense, while network-embedding approaches present similar performance in both dense and sparse networks. Given these results, we propose a new graph distance metric that is able to provide robust prediction in both scenarios. Finally, we demonstrate that a random network re-densification process cannot regenerate the topology shaped by the biological properties of the network components.
Conclusion: Computational prediction of GIs is a strong tool to aid high-throughput GI determination. The graph distance we propose in this article is able to attain precise predictions that reduce the universe of candidate GIs to test in the lab. |
Alanis-Lobato G*, Cannistraci CV, Ravasi T
*Integrative Systems Biology Lab, King Abdullah University of Science and Technology Saudi Arabia |
G - Mutations, Variations, and Population Genomics |
|
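A neighbourhood-based link predictor of the kind referred to in abstract G2 can be sketched with the Jaccard index over shared neighbours. This is a generic textbook technique, not the new graph distance proposed in the abstract; the function names and the toy gene labels in any usage are hypothetical.

```python
from itertools import combinations

def build_adjacency(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency

def rank_candidate_links(edges):
    """Score every non-adjacent node pair by the Jaccard index of its
    neighbourhoods and return the pairs sorted from best to worst."""
    adjacency = build_adjacency(edges)
    scored = []
    for u, v in combinations(sorted(adjacency), 2):
        if v in adjacency[u]:
            continue  # already linked, nothing to predict
        shared = adjacency[u] & adjacency[v]
        union = adjacency[u] | adjacency[v]
        jaccard = len(shared) / len(union) if union else 0.0
        scored.append(((u, v), jaccard))
    # Stable sort: ties keep their alphabetical order.
    scored.sort(key=lambda item: -item[1])
    return scored
```

Because the score depends on shared neighbours, removing links at random tends to erode it quickly, which is consistent with the abstract's observation that such methods do better on dense networks.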
G 03
G3 |
Web-based platform for ortholog sequence data to reveal functional relevance of variants in proteins
The vast amount of sequence data that will become available within the next few years, due to the growing number of genome projects, will initiate new approaches to evaluate the functional importance of naturally occurring gene variants in proteins. Here, we present a new web-based system for comparing sequences of the same protein from different species (orthologs) with a comprehensive set of experimental in vitro data (site-saturation mutagenesis). Thereby, we demonstrate the possibility to predict the functional relevance of individual positions and mutations within a protein by using ortholog sequence data. In this approach, we used a model protein, a G protein-coupled receptor (GPCR) for ADP (P2Y12), but the database can be adapted to other proteins when sequence data are available. Knowing the evolutionary constraints which determine the function of a protein will facilitate the development of new strategies for pharmacological interventions and, in the case of GPCRs, allow gaining deeper insights into the mechanisms of ligand binding, signal transduction and regulation. The P2Y12 mutant library is available for use upon request on the web at http://www.ssfa-7tmr.de/p2y12/. This website has been implemented in PHP, MySQL and Apache, with all major browsers supported. |
Kreuchwig A*, Cöster M, Wittkopf D, Thor D, Kreuchwig F, Worth CL, Schöneberg T, Krause G
*Leibniz-Institut fuer Molekulare Pharmakologie Berlin Germany |
G - Mutations, Variations, and Population Genomics |
|
G 04
G4 |
Biased gene conversion (BGC) is a process linked to recombination which leads to an increase of the GC-content of genomes over evolutionary times. Conversely, repeat induced point mutation (RIP) is a mechanism that detects duplicated genes, mutates cytosines to thymines, preventing invasion of retro-elements.
Evidence of BGC is being reported in an increasing number of taxonomic groups, including amniotes and plants. RIP has been demonstrated in Ascomycetes, notably Neurospora and Leptosphaeria, and RIP-like mutational patterns have recently been found in Basidiomycetes. By comparing five closely related, highly syntenic genomes of smut fungi species (Basidiomycetes), we report evidence of ancient GC-biased gene conversion in this group. We infer the ancestral genome composition to be more GC rich than the extant ones, indicating an independent suppression of BGC in several lineages. In addition, we show that duplicated genes have a lower GC content at their first and second codon position, consistent with a RIP-like mechanism affecting paralogues. On the other hand, the GC content at the third codon position is negatively correlated with the age of duplications: the most recently duplicated genes display a lower GC content than non-duplicated genes of the same age, while ancient duplicates have a higher GC content at this position. We hypothesize that this is due to biased gene conversion occurring between paralogues. Altogether, these results suggest that both RIP and BGC are acting in smut fungi and shape the base composition of the genome, albeit at distinct time scales. |
Dutheil J*, Schweizer G, Mannhaupt G, Kahmann R
*Max Planck Institute for Terrestrial Microbiology Germany |
G - Mutations, Variations, and Population Genomics |
|
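The codon-position GC contents compared in abstract G4 (first, second and third position, often written GC1, GC2 and GC3) can be computed with a few lines. This is a minimal sketch assuming an in-frame coding sequence given as a plain string; the function name is hypothetical.

```python
def gc_by_codon_position(cds):
    """Return (GC1, GC2, GC3): the GC fraction at each codon position.

    Assumes the sequence starts in frame; trailing bases that do not
    complete a codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    fractions = []
    for offset in range(3):
        bases = cds[offset:usable:3]
        gc = sum(1 for base in bases if base in "GC")
        fractions.append(gc / len(bases) if bases else 0.0)
    return tuple(fractions)
```

For example, `gc_by_codon_position("ATGGCCAAG")` examines the codons ATG, GCC and AAG and reports a GC3 of 1.0, since every third-position base is G or C.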
G 05
G5 |
High-throughput phenotyping technologies in combination with genetic variability for the plant model species Arabidopsis thaliana (Arabidopsis) offer an excellent experimental platform to reveal the effects of different gene combinations on phenotypes. These developments have been coupled with computational approaches to extract information not only from the multidimensional data, capturing various levels of biochemical organization, but also from various morphological and growth-related traits. Nevertheless, the existing methods usually focus on data aggregation which may neglect accession-specific effects.
Here we argue that revealing the molecular mechanisms governing a desired set of output traits can be performed by ranking of accessions based on their efficiencies relative to all other analyzed accessions. To this end, we propose a framework for evaluating accessions via their relative efficiencies which relate multidimensional system's inputs and outputs from different environmental conditions. The framework combines data envelopment analysis (DEA) with a novel valency index characterizing the difference in congruence between the efficiency rankings of accessions under various conditions. We illustrate the advantages of the proposed approach for analyzing genetic variability on a data set comprising quantitative data on metabolic and morphological traits for 23 Arabidopsis accessions under three conditions of nitrogen availability. In addition, we extend the proposed framework to identify the set of traits displaying the highest influence on ranking based on the relative efficiencies of the considered accessions. As an outlook, we present how the proposed framework can be combined with well-established statistical techniques to further dissect the relationship between natural variability and metabolism. |
Kleessen S*, Fernie AR, Nikoloski Z
*Max Planck Institute of Molecular Plant Physiology Germany |
G - Mutations, Variations, and Population Genomics |
|
G 06
G6 |
Motivation: Mutations can have an effect on the average number of offspring of a given organism, and as such induce positive or negative selection in a population. The detection of selection pressure in a population has been based so far on the ratio between replacement and silent mutations. However, this ratio requires a baseline probabilistic model for the expected frequencies, which is often unavailable.
Results: We here propose a new measure for the selection pressure in a mutating population, using the imbalance in the number of leaves in branches of lineage trees. The method can very clearly detect positive and negative selection, and their combination, overcoming the difficulties caused by their cross-talk. We apply the method to multiple cases of evolving populations, including viral and mitochondrial sequences, Immunoglobulin sequences and transgenic mice with low and high affinities for a specific antigen, and show that the proposed method can distinguish genes and intergenic regions based on the selection rate, and can detect selection pressures in viral proteins and in the immune response to pathogens. |
Liberman G*
*BIU Israel |
G - Mutations, Variations, and Population Genomics |
|
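The idea of measuring imbalance in leaf counts across the branches of a lineage tree, as in abstract G6, can be illustrated with the classical Colless index. This is a standard tree-balance statistic used here only as a stand-in; the abstract does not define its own measure, and the nested-tuple tree encoding is an assumption for the sketch.

```python
def leaf_count(tree):
    """Number of leaves below a node; a leaf is any non-tuple value."""
    if not isinstance(tree, tuple):
        return 1
    return sum(leaf_count(child) for child in tree)

def colless_imbalance(tree):
    """Colless index of a binary tree encoded as nested 2-tuples:
    the sum over internal nodes of |leaves(left) - leaves(right)|."""
    if not isinstance(tree, tuple):
        return 0
    left, right = tree
    here = abs(leaf_count(left) - leaf_count(right))
    return here + colless_imbalance(left) + colless_imbalance(right)
```

A perfectly balanced four-leaf tree such as `(("A", "B"), ("C", "D"))` scores 0, while the fully ladder-like `((("A", "B"), "C"), "D")` scores 3.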
G 07
G7 |
Background and aims
Genetic variations are changes in a genomic sequence that can cause disease phenotypes. Studying them helps to understand the mechanisms of disease pathogenesis. Several computational methods have been developed for predicting the effects of rapidly expanding variation data. It has been impossible for users to compare the performance of tools due to the lack of benchmarks in this field. This study seeks to address these problems by creating a database with benchmark datasets for genetic variations having a variety of effects.
Materials and Methods
Experimentally verified variation data sets were extracted from literature and relevant databases. Perl scripts were implemented to automate variant position mapping to different levels (protein, RNA, DNA, structure) for the gathered variations, along with identifier mapping to other databases.
Results and Significance
We have developed a benchmark database suite, VariBench, that comprises experimentally validated variation datasets chosen from literature and relevant databases. It also provides variant position mapping to different levels (protein, RNA, DNA, structure) and identifier mapping to relevant databases. VariBench contains the first benchmark datasets for variation effect analysis, a field which is of high importance and where much development is currently under way. VariBench datasets can be used e.g. to test the performance of prediction tools as well as to train novel machine-learning-based tools. The current version of VariBench houses five categories of reference variation datasets, covering variations that affect protein tolerance, protein stability, mismatch repair (MMR) genes, transcription factor binding sites and splice sites. VariBench is freely available at http://structure.bmc.lu.se/VariBench. |
Nair PS*, Vihinen M
*Institute of Biomedical Technology, University of Tampere Finland |
G - Mutations, Variations, and Population Genomics |
|
G 08
G8 |
The tremendous growth of nucleotide diversity information along completely sequenced human genomes makes it possible to reassess the diversity status of receptor proteins in different human individuals. Our study focused on human olfactory receptors (ORs) as a model for personal receptor repertoires. We performed data mining from public and private sources and scored genetic variations in 413 intact OR loci, for which one or more individuals had an intact open reading frame. Using 1000 Genomes Project haplotypes, we identified a total of 4069 polypeptide variants encoded by these OR loci, constituting a lower limit for the effective human OR repertoire. Each individual is found to harbor as many as 600 OR variants, ~50% higher than the locus count. Because OR neuronal expression is allelically excluded, this has a direct effect on the diversity of human smell perception. We further identified 244 OR segregating pseudogenes (SPGs), loci showing both intact and pseudogene forms in the population, twenty-six of which are annotatively “resurrected” from a reference genome pseudogene status. Using a custom SNP microarray, we validated 150 SPGs in a cohort of 468 individuals. Finally, we generated a multi-source compendium of 63 OR loci harboring deletion Copy Number Variations (CNVs).
Our combined data suggest that a total of 271 intact OR loci (66%) are affected by deleterious SNPs/indels and/or CNVs. These results portray a case of unusually high genetic diversity, and suggest that individual humans have a highly personalized inventory of functional olfactory receptors, a conclusion that likely applies to other receptor multigene families. |
Olender T*, Waszak S, Ben-Asher E, Khen M, Lancet D
*The Weizmann Institute Israel |
G - Mutations, Variations, and Population Genomics |
|
G 09
G9 |
The South African Coloured (SAC) population has a complex intercontinental admixture due to the history and geographical position of South Africa. Admixture and local ancestry in admixed populations are commonly inferred using reference populations, and the accuracy of these inferences depends on the choice of non-admixed populations used. The availability of high-throughput genotype data from various populations enables the choice of the best proxy ancestry from a pool of reference populations, helping both the study of population genetics and the mapping of genetic diseases. Using an inaccurate ancestral proxy can result in inaccurately inferred ancestry. To address uncertainty in ancestral populations, we introduce PROXYANC, an approach to select proxy ancestry for recently admixed populations. We implemented two novel algorithms in PROXYANC, based on population genetic differentiation and optimal quadratic programming, respectively. We validate these algorithms through simulation of an admixed population and apply PROXYANC to real data from the SAC. We used genome-wide data from the best proxy ancestors of the SAC determined by PROXYANC and refined the estimates of the contributions of proxy ancestral populations to Khoesan (31%), Bantu-Niger (30%), European (19%), South-Asian (12%) and East-Asian (7%). We also demonstrated that increased linkage disequilibrium (LD) is present in this population, and that the observed LD originates from admixture events, which increased the genetic diversity of this population. We reject the hypothesis that a population bottleneck took place. Our results not only increase our understanding of the evolutionary history of the SAC, but also provide opportunities for designing disease-gene association studies in this recently admixed population. |
Chimusa ER*, Mulder N, Van Helden EH
*University of Cape Town South Africa |
G - Mutations, Variations, and Population Genomics |
|
G 10
G10 |
Computational motif discovery methods based on enumerative strategies typically parse the entire sequence in search of over- or under-represented k-mers. A k-mer is a word of length k. Such a strategy is heavily dependent on the model underlying the definition of the background, or expected, k-mer frequency. Here, we use the human genome as a case study and compare the performance of the standard Bernoulli and Markov chain background models to a model based on k-mer frequencies observed in a large sample of the human population.
We use sequencing data from 1,092 individuals made available by the 1000 Genomes Project. Our data processing pipeline incorporates individual variation in single nucleotide polymorphisms (SNPs), short insertions and deletions (indels), and large deletions into the reference human genome as a proxy for the individual human genomes. Given the large volume of data, the computational load is distributed over several processing nodes in a cloud computing infrastructure. Alternative modelling strategies, such as the use of population-scale sequencing data proposed here, help to overcome the limitation that enumerative strategies have in detecting weak motifs, that is, motifs that are not significantly over-represented. |
Santos S, Rodrigues J, Afreixo V, Garcia S*
*IEETA, University of Aveiro Portugal |
G - Mutations, Variations, and Population Genomics |
|
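As a rough illustration of the enumerative strategy described in poster G 10, the following Python sketch counts overlapping k-mers and compares them with the counts expected under a Bernoulli (independent-base) background. The function names and the toy sequence are assumptions made for this example; the Markov chain and population-based background models discussed in the abstract are not implemented here.

```python
from collections import Counter
from math import prod


def count_kmers(seq, k):
    """Count overlapping k-mers (words of length k) in a DNA sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def bernoulli_expected(seq, k):
    """Return a function giving the expected count of a k-mer when bases
    are drawn independently with the base frequencies observed in seq."""
    seq = seq.upper()
    n = len(seq)
    base_freq = {b: seq.count(b) / n for b in "ACGT"}
    positions = n - k + 1  # number of k-mer start positions
    return lambda kmer: positions * prod(base_freq[b] for b in kmer)


# Toy sequence, purely illustrative.
seq = "ACGTACGTACGTTTTT"
observed = count_kmers(seq, 2)
expected = bernoulli_expected(seq, 2)

# Ratio above 1 suggests over-representation relative to this background.
ratio = observed["AC"] / expected("AC")
print("AC observed:", observed["AC"], "expected:", expected("AC"), "ratio:", ratio)
```

A ratio well above 1 flags a candidate over-represented word under this particular background; a different background model would change the expected counts and hence which words stand out.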
G 11
G11 |
Leishmaniasis is an infectious disease caused by protozoan parasites of the genus Leishmania. Among the disease manifestations, mucocutaneous leishmaniasis (MCL) is one of the most severe forms and is caused by the Leishmania Viannia subgenus, which includes South and Central American species such as L. braziliensis, L. panamensis and L. guyanensis. These species lead to cutaneous leishmaniasis (CL), but some strains have the ability to metastasize to the nasofacial area, producing tissue destruction, deformities, and respiratory obstruction, which is the clinical outcome of MCL. Recent evidence suggests an association between MCL and a double-stranded RNA virus (LRV). It has been shown that metastasizing clones harbour a high density of LRV and induce a hyper-inflammatory response in the host as compared to non-metastasizing parasites (Ives et al. 2011). The goal of this work is to find genetic differences between LRV-infected and non-infected parasites.
For this purpose, we sequenced, de novo assembled and annotated six genomes from Leishmania guyanensis parasites that were isolated from the sand fly vector or from human cutaneous or mucocutaneous lesions and that differ in their LRV load. We applied a variety of comparative analyses, such as orthology search, SNP detection, and detection of chromosomal and copy number variations. Our results showed that no single source of variation explains the genetics of LRV presence. The factors regulating the disease phenotype and its association with LRV are likely polygenic, as can be concluded from the heterogeneity of changes involving SNPs, pseudogenes, gene amplification and chromosome number variation. |
Calderon-Copete SP*, Dickens NJ, Zangger H, Martin R, Dobson DE, Xenarios I, Smith D, Mottram J, Beverley SM, Hertz-Fowler C, Falquet L, Fasel N
*Vital-IT, Swiss Institute of Bioinformatics Switzerland |
G - Mutations, Variations, and Population Genomics |
|
G 12
G12 |
Drug resistance in bacterial pathogens is an increasing problem that stimulates research activity. However, our understanding of drug resistance mechanisms remains incomplete. One promising approach to deepen this understanding is to use whole-genome sequences to identify genetic mutations associated with drug resistance phenotypes in bacterial strains.
In this work, we present a new computational method for identifying drug resistance-associated mutations in bacterial strains. In this approach, the genotype data consist of gene gain/loss profiles, derived from gene families, and point mutation profiles, which are determined from multiple alignments of the considered gene families. The method employs a score, which we call weighted support, and assigns statistical significance to it. We tested our method on genotype and phenotype data collected from over 50 publications for 100 fully sequenced S. aureus strains and 10 commonly used drugs. Our computational experiments show that our approach successfully re-identifies most of the known drug resistance determinants. We also argue that the concept of weighted support, by utilizing phylogenetic information, yields results that better fit the goal of detecting drug resistance-associated mutations than some other scores, such as the odds ratio. By applying our methodology, we also identified some putative novel associations. |
Woźniak M*, Tiuryn J, Wong L
*Faculty of Mathematics, Informatics and Mechanics, Warsaw University Poland |
G - Mutations, Variations, and Population Genomics |
|
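The abstract of poster G 12 names the odds ratio as a baseline score against which weighted support is compared. Since the weighted support score itself is not defined in the abstract, the sketch below only illustrates the baseline: an odds ratio computed from a 2x2 table of mutation presence versus resistance across strains. The function name, the toy profiles and the 0.5 pseudocount (a common continuity correction) are assumptions of this example, and phylogenetic relatedness between strains is deliberately ignored.

```python
def odds_ratio(has_mutation, is_resistant, pseudocount=0.5):
    """Odds ratio for the association between a mutation and resistance.

    has_mutation, is_resistant: equal-length sequences of 0/1 values,
    one entry per strain. The pseudocount avoids division by zero when
    a cell of the 2x2 table is empty (an assumption of this sketch).
    """
    a = b = c = d = 0
    for m, r in zip(has_mutation, is_resistant):
        if m and r:
            a += 1  # mutated, resistant
        elif m:
            b += 1  # mutated, susceptible
        elif r:
            c += 1  # not mutated, resistant
        else:
            d += 1  # not mutated, susceptible
    a, b, c, d = (x + pseudocount for x in (a, b, c, d))
    return (a * d) / (b * c)


# Toy profiles for eight hypothetical strains.
mutation = [1, 1, 1, 0, 0, 0, 0, 1]
resistant = [1, 1, 1, 0, 0, 0, 1, 0]
score = odds_ratio(mutation, resistant)
print("odds ratio:", score)
```

Because every strain is treated as an independent observation, closely related strains sharing a mutation by descent inflate this score, which is the kind of bias a phylogeny-aware score is meant to reduce.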
G 13
G13 |
Background: Tumor development is known to be a stepwise process involving dynamic changes that affect cellular integrity and cellular behavior. This complex interaction between genomic organization and gene, as well as protein expression is not yet fully understood. Tumor characterization by gene expression analyses is not sufficient, since expression levels are only available as a snapshot of the cell status. So far, research has mainly focused on gene expression profiling or alterations in oncogenes, even though DNA microarray platforms would allow for high-throughput analyses of copy number alterations (CNAs).
Results: Here we present a four-stage mouse model addressing copy number alterations in tumorigenesis. We analyzed DNA from mouse mammary gland epithelial cells using the Affymetrix Mouse Diversity Genotyping array (MOUSEDIVm520650) and calculated the CNAs. No significant changes in CNA were identified for non-transgenic mice, but a stepwise increase in CNA was found during tumor development. The segmental copy number alterations revealed informative chromosomal fragmentation patterns. In inter-segment regions (hypothetical breakpoint sites), unique motifs were found. Conclusions: Our analyses suggest genome reorganization as a stepwise process that involves amplifications and deletions of chromosomal regions. We conclude from distinctive fragmentation patterns that conserved as well as individual breakpoints exist which promote oncogenesis. |
Standfuß C*, Pospisil H, Klein A
*Technical University of Applied Sciences, Bioinformatics / HPCLife Germany |
G - Mutations, Variations, and Population Genomics |
|
G 14
G14 |
A review of the available single nucleotide polymorphism (SNP) calling procedures for Illumina high-throughput sequencing (HTS) platform data reveals that most rely mainly on base-calling and mapping quality values as sources of error when calling SNPs. Thus errors not involved in base-calling or alignment, such as those in genomic sample preparation, are not accounted for. A novel method of consensus and SNP calling, Genotype Model Selection (GeMS), is given which accounts for the errors that occur during the preparation of the genomic sample. Simulations and real data analyses indicate that GeMS has the best performance balance of sensitivity and positive predictive value among the tested SNP callers. Future work on multiple sample and mutation library SNP calling is also introduced. |
Murillo G*, You N, Su X, Zeng X, Xu J, Ning K, Zhang S, Zhu J, Cui X
*Department of Statistics United States of America |
G - Mutations, Variations, and Population Genomics |
|
G 15
G15 |
Germline mutations in the mismatch repair (MMR) genes MLH1, MLH3, MSH2, MSH6, PMS1, PMS2 and TGFBR2 cause hereditary gastrointestinal cancer. MMR is a DNA repair system that recognizes and repairs base-base mispairs and insertion-deletion loops arising in DNA replication and recombination. Thousands of MMR variants have been discovered, but their relevance to cancer is usually unknown. Here, we developed a classification tool for MMR missense variants.
We identified from the literature 168 functionally tested MMR missense variants, of which 82 were pathogenic. The InSiGHT and MMR Gene Unclassified Variants databases for gastrointestinal cancer data contain over 600 variants with unknown effect. We used the Pathogenic-Or-Not-Pipeline (http://bioinf.uta.fi/PON-P) for the prediction and analysis of these variants. Since the performance of the individual predictors was not sufficient, we developed a consensus predictor, PON-MMR, based on several tolerance prediction methods. With this predictor, we were able to classify over 200 previously unknown MMR missense variants as pathogenic or neutral. The results can be utilized to prioritize variants for further experimental validation and may help in the diagnosis of Lynch syndrome and other gastric cancers. The novel predictor PON-MMR is freely available at http://bioinf.uta.fi/PON-MMR. |
Ali H*, Olatubosun A, Vihinen M
*Institute of Biomedical Technology/ University of Tampere and BioMediTech Finland |
G - Mutations, Variations, and Population Genomics |
|
G 16
G16 |
Fast and efficient development of microsatellite markers from draft de novo transcriptomes and genomes
Microsatellites are widely used and cost-effective genetic markers for population and conservation genetic studies. However, developing such markers, especially for non-model species, is often hindered by high costs or technical difficulties.
We present a computational workflow consisting of Perl scripts and open-source software tools to identify microsatellite markers in de novo genomes or transcriptomes. The workflow can optionally be applied to data sets of related species in order to increase the probability of finding markers that work in several species. The workflow employs the Tandem Repeats Finder and Primer3 tools to find repeats and construct primers in the flanking regions. BLAST is used to test the primer pairs against the genome/transcriptome to check for uniqueness. Finally, the workflow sorts the microsatellites into PCR fragment size classes and motif lengths. Optionally, primers can be searched with BLAST against, and checked for repeats in, closely related species. We have tested the workflow on draft genomes of Epichloë species (fungal endophytes of grasses). This generated more than 50,000 primer pairs for E. typhina, with about 3200 primer pairs potentially working on other Epichloë species. In a preliminary test, 19 out of 24 selected primer pairs worked on Epichloë samples, and six showed polymorphism among different species. We also tested the workflow on a first draft transcriptome of Dalbergia baronii (Malagasy Rosewood). This resulted in about 120,000 PCR primer pairs. Of 35 tested microsatellites, 13 were polymorphic in D. baronii and 10 also in related species. The workflow can be applied to first draft genomes or transcriptomes and results in high-quality microsatellite markers. The time required from raw sequence reads to tested microsatellite markers is in the range of four to eight weeks, mainly depending on how much time is spent on the de novo assembly step. |
Zoller S*, Hassold S, Schirrmann M, Leuchtmann A, Widmer A
*Genetic Diversity Centre, ETH Zurich Switzerland |
G - Mutations, Variations, and Population Genomics |
|
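The repeat-finding step of the workflow in poster G 16 relies on dedicated tools. As a simplified stand-in, and not a reimplementation of those tools, the following Python sketch locates perfect tandem repeats with a regular expression. The function name, the default thresholds and the toy sequence are assumptions of this example; imperfect repeats, primer design and uniqueness checks are out of scope.

```python
import re


def find_microsatellites(seq, min_unit=2, max_unit=6, min_repeats=5):
    """Return (start, unit, repeat_count) for perfect tandem repeats.

    The repeat unit is matched lazily, so the shortest unit that
    satisfies min_repeats is reported. Note that a homopolymer run
    such as AAAAAAAAAA is reported with the unit "AA".
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        hits.append((m.start(), unit, len(m.group(0)) // len(unit)))
    return hits


# Toy sequence: a (CA)8 repeat between two short flanks.
toy = "GGATT" + "CA" * 8 + "TTGAC"
hits = find_microsatellites(toy)
print(hits)
```

In a real workflow the reported start and length would define the flanking regions handed to a primer design tool; here they are only printed.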
G 17
G17 |
Recent advances in genome-wide profiling of chromatin state and transcription factor (TF) binding have identified specific chromatin signatures related to various classes of functional elements in different cell types. However, their genetic basis and degree of variability across individuals remain largely unknown. We have generated genome-wide enrichment profiles of i) DNA methylation, ii) RNA Pol II, TFIIB, MYC, and PU.1 DNA occupancy, and iii) histone modifications H3K4me1, H4K20me1, H3K27me3, H3K27ac and H3K4me3, in addition to genome-wide mRNA and small-RNA sequencing data, from lymphoblastoid cell lines of two trios that are part of the 1000 Genomes Project. We analyzed allele-specific effects of sequence variation on TF binding, histone modifications, and gene expression, and discovered allele-specific signals for each of the assays (8-31% of accessible sites), the majority of which showed the same allelic direction of effect at overlapping loci (e.g. RNA Pol II sites in RNA-seq, rho=0.84; H3K4me3 sites in RNA-seq, rho=0.75). We additionally used the allele-specific information to study parent-of-origin effects in each assay, and discovered strong patterns of transmission of allelic effects from parents to child (e.g. for PU.1, RNA Pol II, and H3K4me3), ranging from 20% to 70% depending on the assay.
We are currently exploring the effects of such regulatory variation in the context of gene regulatory networks and pathways by combining known information with computational reconstruction approaches. Together, this study will significantly improve our knowledge of the interplay of gene regulatory mechanisms and of how genetic variation affects them. |
Gschwind A*, Kilpinen H, Waszak S, Migliavacca E, Witwicki R, Raghav S, Orioli A, Romano-Palumbo L, Wiederkehr M, Turnheer S, Hacker D, Hernandez N, Deplancke B, Reymond A, Dermitzakis E
*Center for Integrative Genomics, University of Lausanne Switzerland |
G - Mutations, Variations, and Population Genomics |
|
G 18
G18 |
Identifying the characteristic features of mutations that could be used to distinguish deleterious mutations from neutral ones is crucial when interpreting new variants found in genome sequences.
We used a set of disease-causing mutations of Bruton's tyrosine kinase (BTK) to evaluate already known features and tried to find new features that could be exploited in developing new predictors of mutation pathogenicity. Analyses were done for all possible single nucleotide changes (1495) causing an amino acid change (missense mutation) in the BTK kinase domain. All mutated structures were modelled and used for further studies. New structural features and scoring matrices were found to be useful for characterizing the disease-causing mutation set compared to the others. |
Väliaho J*, Ortutay C, Vihinen M
*Institute of Biomedical Technology Finland |
G - Mutations, Variations, and Population Genomics |
|
G 19
G19 |
The MoKCa database (http://strubiol.icr.ac.uk/extra/mokca) has been extended to structurally and functionally annotate, and where possible predict, the phenotypic consequences of mutations in all proteins implicated in cancer. Large-scale re-sequencing of cancer genomes has revealed many protein mutations, and prediction of their functional consequence has become an essential endeavor. Somatic mutation data from tumours and tumour cell lines from the COSMIC database have been mapped onto the crystal structures of the affected protein domains. Positions of the mutated amino acids are highlighted on a sequence-based domain pictogram, on a 3D image of the protein structure, and in a molecular graphics package integrated for interactive viewing. Proteins are linked to functional annotation resources and are annotated with structural and functional features such as domains and phosphorylation sites. The predicted structural impact of mutations is assessed using the SAAP pipeline.
MoKCa aims to provide assessments available from multiple sources and algorithms for each potential cancer-associated mutation, and present these together in a consistent and coherent fashion to facilitate authoritative annotation by cancer biologists and structural biologists, directly involved in the generation and analysis of new mutational data. |
Richardson C, Al-Lazikani B, Martin A, Pearl F*
*The Institute of Cancer Research United Kingdom |
G - Mutations, Variations, and Population Genomics |
|
G 20
G20 |
Understanding the relationship between genetic and phenotypic variation is one of the great outstanding challenges in biology. To meet this challenge, comprehensive genomic variation maps of human as well as of model organism populations are required. Here, we present a nucleotide-resolution catalog of genomic variation in 39 Drosophila melanogaster Genetic Reference Panel inbred lines. Using an integrative, local assembly-based approach to variant discovery, we identify more than 3.6 million distinct variants, among which are more than 800,000 unique indels and complex variants ranging in size from 1 to 6,000 bp. While we found that the SNP density is higher near other variants, we observed that variants themselves are neither mutagenic nor are regions with high variant density particularly mutation-prone. Rather, our data suggest that the elevated SNP density around variants is mainly due to population-level processes. We also provide insights into the regulatory architecture of gene expression variation in adult flies by mapping cis-expression quantitative trait loci (cis-eQTLs) for more than 2,000 genes. We found that indels comprise around 10% of all cis-eQTLs and have larger effect sizes than SNP cis-eQTLs. In addition, we find that most cis-eQTLs are sex-specific, revealing a partial decoupling of the genomic architecture between the sexes as well as the importance of genetic factors in mediating sex-biased gene expression. Finally, we performed RNA-seq-based allelic expression imbalance analyses in the offspring of crosses between sequenced lines to validate cis-eQTLs genome-wide. |
Waszak S*, Massouras A, Albarca M, Hens K, Westphal W, Ayroles J, Dermitzakis E, Stone E, Jensen J, Mackay T, Deplancke B
*École Polytechnique Fédérale de Lausanne Switzerland |
G - Mutations, Variations, and Population Genomics |
|
G 21
G21 |
The objective of this project is a proteome-wide analysis of protein complex structures, using Protein-Protein Interaction Networks (PPINs) as a tool to extract the associations between human proteins. Numerous studies have indicated a correlation between protein complex structures and their functions. In particular, the three-dimensional (3D) properties of protein binding interfaces are thought to play key roles in mediating biological activities and in regulating cellular functions. Biologically relevant information is also growing rapidly thanks to large-scale sequencing projects like the 1000 Genomes Project. The effective integration of all these data will pave the way to more accurate insights into the relationship between genotype and phenotype. It is therefore timely to combine these insights with 3D structural knowledge by studying the occurrence of gene variants on protein structures, for example at protein interfaces. In particular, non-synonymous Single Nucleotide Polymorphisms (nsSNPs) could cause conformational changes or failures in forming protein complexes. To accomplish a comprehensive study of the structural properties of SNPs, an automatic pipeline was developed to generate structure-integrated PPINs at the protein domain level and to map the SNPs onto these structures. The study of inter-domain disordered regions is included in this project, as recent studies have suggested their importance in regulating biological functions. The SNPs are classified by their occurrence in different protein regions. The automatic pipeline and the structural properties of SNPs will be presented. |
Lu H*, Fraternali F
*King's College London United Kingdom |
G - Mutations, Variations, and Population Genomics |
|
G 22
G22 |
Over the last decade, we have witnessed an incredible growth in the field of exome sequencing. Here we have focused on using exome data for the identification of blood cell traits. We have developed a tool to recognize mutations and phenotypes through genome analysis. In particular, we have produced a system that can interpret mutations in several blood types important for transfusions.
In our protocol, we extract the relevant mutations and annotations of a genome with ANNOVAR. These variants are then directly compared with our manually curated database containing the known mutations of the ABO, RH, Duffy, Kell, Diego, Kidd, Lewis, Lutheran, MNS and Bombay blood groups and their phenotypes. Whenever we find a match, we use it to predict the related phenotype. We have tested our application on the publicly available Personal Genome Project genomes and were able to predict the right phenotype for all of them. An example of a rare genetic disorder that benefits from this thorough characterisation is the Bombay blood mutation. This uncommon trait (4 people/million) is characterized by the absence of H antigen expression, and a conventional blood test recognizes it as a 0 blood group. However, an affected individual expresses functional antibodies against the A, B and H antigens. Due to this alteration, Bombay subjects cannot receive blood from any normal donor. Our work shows the power of this analysis and suggests its effectiveness for diagnostics in the age of personalized medicine. |
Giollo M*, Minervini G, Scalzotto M, Leonardi E, Ferrari C, Tosatto S
*University of Padova Italy |
G - Mutations, Variations, and Population Genomics |
|
G 23
G23 |
gp120 is a key protein of Human Immunodeficiency Virus 1 (HIV-1), as it mediates the first steps of entry of the virion into the host cell. Here we connect important aspects of gp120, namely sequence, structure, function, and selection pressure. gp120 sites under selection pressure from interaction with the immune system (antibodies, T-cells) can be identified by analysis of synonymous and non-synonymous substitutions in viral genomes. Non-synonymous substitutions that help the virus to evade the immune response can have deleterious effects on gp120 function, and hence may be accompanied by compensatory mutations at other sites that re-establish function. Mutual Information (MI) has often been used to identify pairs of positions linked in this way, but has been found to yield results that are difficult to interpret. Therefore, we additionally use the novel quantity Direct Information (DI). Remarkably, sites under strong selection pressure are highly significantly enriched in the group of sites with high DI. We map these positions onto the gp120 structure and interpret the results in terms of function. |
Rawi R*, Hoffmann D
*University of Essen Germany |
G - Mutations, Variations, and Population Genomics |
|
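For readers unfamiliar with the Mutual Information measure mentioned in poster G 23, the following Python sketch computes it for a pair of alignment columns from observed residue frequencies. The function name and the toy alignment are assumptions of this example. Direct Information is not implemented: it requires fitting a global statistical model over all columns, which goes beyond a short sketch.

```python
from collections import Counter
from math import log2


def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns,
    estimated from raw frequencies without any finite-sample correction."""
    n = len(col_i)
    p_i = Counter(col_i)
    p_j = Counter(col_j)
    p_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in p_ij.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((p_i[a] / n) * (p_j[b] / n)))
    return mi


# Toy alignment: four sequences, three columns.
alignment = ["ACC", "ACT", "GTC", "GTT"]
columns = list(zip(*alignment))

mi_coupled = mutual_information(columns[0], columns[1])      # columns vary together
mi_independent = mutual_information(columns[0], columns[2])  # columns vary independently
print(mi_coupled, mi_independent)
```

Pairwise MI cannot tell direct couplings from correlations transmitted through intermediate positions, which is one reason its results can be hard to interpret.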
G 24
G24 |
The explosive growth of information from Next-Generation Sequencing (NGS) creates the need for quick processing of huge amounts of data. Moreover, consolidated systems that can process information inclusively and comprehensively, possibly from multiple sources, need to be developed as well. Identification of disease-causing variants from NGS data is currently one of the most acute data analysis problems in human genetics. There is therefore a need for databases and applications that support such analyses on a large scale (thousands of samples) rather than a few tens of samples. To improve the quality and performance of such analyses, we have started to implement a reliable infrastructure for large-scale processing of sequencing variants. Our design aims, on the one hand, at storing information in a compact and efficient way by taking the structure and requirements of the task into account and, on the other hand, at offering a wide range of dynamic reports and outputs based on different kinds of filters via a web application. |
Ardeshirdavani A*, Moreau Y
*ESAT-SCD, KU Leuven / IBBT Future Health Department Belgium |
G - Mutations, Variations, and Population Genomics |
|
G 25
G25 |
The evolution of a biological system is influenced by its structure in the genotype space. Here we apply, for the first time, methods based on the concept of genotype networks to human variation. The space of all possible haplotypes of a gene can be represented as a Hamming graph, in which each node is a haplotype and two nodes are connected if they differ at exactly one position. On this graph, we mark all the haplotypes observed in a population, and call the result the genotype network of that population. Some attributes of the genotype network of a gene in a population are useful for studying its genetic diversity. The average degree and connectivity of the genotype network approximate the robustness of a gene in a population, understood as its ability to withstand mutations without losing functionality; the average path length and the diameter approximate the ability of the system to explore the genotype space.
Here we present preliminary results on the genotype network properties of a subset of genes from the 1000 Genomes data. These genes belong to the well-known N-Glycosylation pathway or have already been described as being under positive selection in humans. We also discuss the implications of this approach for the development of future approaches to the study of human genetic diversity. |
Dall'Olio GM*, Betranpetit J, Wagner A, Laayouni H
*Instituto de Biologia Evolutiva, Universitat Pompeu Fabra Spain |
G - Mutations, Variations, and Population Genomics |
|
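The genotype network construction described in poster G 25 can be sketched in a few lines of Python. The sketch below connects observed haplotypes that differ at exactly one position and reports the average degree. The function names, the binary haplotype encoding and the toy data are assumptions of this example, not the authors' implementation.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def genotype_network(haplotypes):
    """Adjacency sets linking observed haplotypes at Hamming distance 1."""
    nodes = sorted(set(haplotypes))
    network = {h: set() for h in nodes}
    for a, b in combinations(nodes, 2):
        if hamming(a, b) == 1:
            network[a].add(b)
            network[b].add(a)
    return network


def average_degree(network):
    """Mean number of one-mutation neighbours per observed haplotype."""
    return sum(len(nb) for nb in network.values()) / len(network)


# Toy population: haplotypes over three biallelic sites (0 = ref, 1 = alt).
observed = ["000", "001", "011", "111", "100", "000"]
net = genotype_network(observed)
avg = average_degree(net)
print("average degree:", avg)
```

The all-pairs comparison is quadratic in the number of distinct haplotypes, which is acceptable for a single gene but would need a neighbour-enumeration strategy for very large haplotype sets.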
G 27
G27 |
Recent developments in large-scale sequencing techniques are revolutionizing medical genomics by providing efficient means for in-depth analysis of individual genomes in a cheaper and more unbiased way. The enormous sequencing throughput of modern next-generation sequencing (NGS) allows us to efficiently screen for potential causal mutations in >300 individuals suffering from intellectual disability.
After comprehensive annotation of sequence variations (SNVs, insertions/deletions, retrocopies) and subsequent filtering for already known sequence variants, on average 2-3 mutations per individual remain as candidates for further investigation. |
Haas S*, Love M, Emde A, Sun R, Vingron M, Hu H, Zemojtel T, Ropers H, Kalscheuer V
*MPI for Molecular Genetics Germany |
G - Mutations, Variations, and Population Genomics |
|
G 28
G28 |
Genome and exome sequencing is opening new perspectives in the study of genetic variations and in their implications in gene disorders. To efficiently manage and analyze large amounts of DNA sequencing data, new bioinformatic tools are required. Here we present the development of a platform designed to manage and retrieve data from human exome/genome sequencing projects. The platform integrates heterogeneous information to help the association of variations to the pathology/phenotype under study. This information can be related to genes (Gene Ontology, Disease Ontology, OMIM, InterPro annotations) or it can describe the genome context and the CDS-effects of variations (dbSNP, SIFT, amino acid substitution) or variation properties like coverage depth or score.
The platform is accessible through a web interface. Users can upload a file containing the variations (VCF format), and the SNPs are automatically mapped onto the genome and stored in a relational database together with their possible effects on the corresponding transcript and protein. A powerful and flexible query system allows the data to be searched using complex queries, with different criteria set to integrate the information. The results are displayed in a list ranked according to the satisfied criteria, offering a direct assessment of the significance of the results. The web platform and the query system are based on a scalable and easily configurable XML-based language, responding to the increasing complexity of data and databases. |
Forcato C*, Albiero A, Vezzi A, Vitulo N, Valle G
*University of Padova Italy |
G - Mutations, Variations, and Population Genomics |
|
G 29
G29 |
First, a C++ utility for modeling the probability of finding a target enzyme gene among the background sequences was compiled, incorporating as variables the metagenome size, target gene copy number per organism, target gene length, the fraction of organisms carrying the target gene in the community, the size of the community, and the number and size of clones, among others. Second, metagenomic clone libraries (MCL; n>200) tapping various enzymes for commercial use were compiled. Redundancy analysis was used to identify whether the parameters used for construction of the MCL or the nature of the target enzyme explained the success in finding the coding gene. Third, as an example, direct clone libraries describing laccases (EC1.10.3.2; copper-containing oxidases) were retrieved and supplemented with contextual and spatial data. Diversity analyses and variance partitioning were used to identify the key environmental parameters explaining the distribution of target genes in the environment and hence guiding sample selection. The probabilities of finding any laccase gene, or the industrially relevant variant present in a varying number of carrying genomes, within an MCL were modelled. Last, the general activity of metagenomic laccase genes was related to observed protein diversity in natural communities, genetic constraints on protein evolution and the price-performance ratio of established industrially relevant enzyme applications obtained through directed protein evolution. Activities three orders of magnitude higher under 100-times higher substrate concentrations were exhibited only by a tiny fraction of all possible sequences. These traits are selected against in nature and are consequently not present in the natural genetic pool amenable to metagenomics. |
Stres B*, Murovec B
*University of Ljubljana / Biotechnical Faculty Slovenia |
G - Mutations, Variations, and Population Genomics |
|
G 30
G30 |
The bacterium Neisseria meningitidis is a predominant cause of meningitis epidemics; at the same time, it is also a frequent commensal of the human nasopharynx. N. meningitidis displays a high genetic diversity, yet to date whole-genome sequencing has not defined a strict core pathogenome enabling the prediction of virulence on a genetic basis.
Phenotypic consequences of epigenetic modifications are still poorly characterized in bacteria. DNA methylation has been predominantly studied in prokaryotes in the context of restriction modification (R-M) systems and the protection of the bacterial host genome against foreign DNA. Sequence homology to characterized DNA methyltransferase genes may allow predictions of the corresponding methylation target sites for specific genes in N. meningitidis strains. We used methylation-sensitive restriction enzymes and Southern blots to validate predictions of DNA methylation at specific target sites. Despite the strong evolutionary constraints expected for R-M systems, these approaches indicate a variability in presence and positions within the analyzed genomes comparable to that in the rest of the genome. Recent studies revealed a complex genetic system termed the ‘phasevarion’, in which mutations in simple tandem repeats control the expression of DNA methyltransferases, which in turn control the coordinated switching of the expression of multiple genes. Taking advantage of the increasing read lengths in large-scale sequencing assays, we present a novel method to infer the precise number of repeat units at specific tandem repeat loci. The probabilistic approach detects divergent repeat configurations, and therefore functional states of ORFs, within the genomes of closely related strains of N. meningitidis. |
Abdul Sater M, Schmid C*
*Swiss Tropical and Public Health Institute Switzerland |
G - Mutations, Variations, and Population Genomics |
|
H 01
H1 |
Polyketide synthases are the enzyme complexes that synthesise a wide range of natural products of medicinal interest, notably a large number of antibiotics. Type I polyketide synthases can introduce beta-carbon branches into a growing polyketide chain via enzymes encoded by the “HMG-CoA synthase (HCS) cassette”. One of the first polyketide biosynthesis clusters in which the HCS cassette was discovered is responsible for the synthesis of mupirocin by Pseudomonas fluorescens. Mupirocin is a clinically important antibiotic that is effective against certain Gram-positive bacteria, including methicillin-resistant Staphylococcus aureus (MRSA), and is used to treat bacterial skin infections. To understand better what might allow the HCS cassette to recognise β-branch-associated ACPs, genetic, biochemical and computer modelling approaches were used to explore the interaction of the ACPs with MupH, the HMG-CoA synthase homologue from the mupirocin synthesis pathway, and its homologues. Hidden Markov models (HMMs) were used to classify ACPs as branching or non-branching. HMM analysis highlighted essential features for an ACP to behave like a branching ACP. We computationally docked a homology model of MupH with the NMR structure of each of the ACPs mupA3a and mupA3b. The results identified key residues critical for the recognition specificity of the ACPs involved in beta-carbon branching. |
Farmer R*, Winn P, Thomas C
*School of Biosciences, University of Birmingham United Kingdom |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 02
H2 |
Protein-protein interactions encode the wiring diagram of cellular signaling pathways and deregulations of these underlie a variety of diseases, such as cancer. Inhibiting protein-protein interactions with peptide derivatives is a promising way to develop new therapeutics and biological tools. We build a structural database of hundreds of non-natural amino-acid sidechains to probe in silico their insertion into natural peptides. Efficient visualization of the mutants can be done with a PyMOL plug-in. We provide predicted rotamers, as well as all parameters for standard molecular mechanics software to perform more detailed and quantitative analyses, such as binding free energy predictions. Our results on non-natural mutants of a BCL9-derived peptide targeting beta-catenin show very good correlation between predicted and experimental binding free energies, illustrating the relevance of the method. All data can be downloaded free of charge for academic users at http://www.swisssidechain.ch |
Gfeller D*, Michielin O, Zoete V
*Swiss Institute of Bioinformatics Switzerland |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 03
H3 |
Human dendritic cells (the central coordinators of innate and acquired immune responses) generated from healthy donors have been analyzed using LC-MS/MS instrumentation. These cells either elaborate an antiviral state that suppresses HIV-1 infection or they are rendered permissive.
Preliminary results have shown some variability in the protein expression content among multiple experimental conditions; however, quantifying these complex biological mixtures and performing similarity analysis on them, in support of the investigation of antiviral response mechanisms, is a difficult task. To achieve this, we are developing a novel approach that consists in clustering the samples according to the different experimental conditions. Each sample is represented by a fixed-length vector where each dimension corresponds to a protein (from the list of identified proteins) and the value for the corresponding dimension is an estimated quantification. Different identification results and different label-free quantification strategies such as spectral counting and MS1-based profiling (e.g. generated by SuperHirn) are investigated using several clustering algorithms and similarity measures. In addition, the analysis can be extended to peptides in order to study post-translational modification profiles. We expect the results to indicate whether the quantification of protein/peptide content properly characterizes these multiple experimental conditions in order to: i) validate the intra- and inter-donor consistency of protein expression levels under the same condition (i.e. grouped in the same cluster), and ii) identify possible distinctive patterns in each of the different conditions (i.e. samples under different conditions clearly separated in different clusters). This information sets the basis for a closer study of the interactions of characteristic proteins. |
Bilbao A*, Zhang Y, Bottinelli D, Alghanem B, Nikitin F, Luban J, Strambio De Castillia C, Varesio E, Hopfgartner G, Mueller M, Lisacek F
*Proteome Informatics Group, Swiss Institute of Bioinformatics / Life Sciences Mass Spectrometry, School of Pharmaceutical Sciences, University of Geneva, University of Lausanne Switzerland |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 04
H4 |
The reliable detection of protein-protein interactions by affinity purification mass spectrometry (AP-MS) is an important stepping stone for the understanding of biological processes. In a typical AP-MS experiment, true interaction partners need to be separated from contaminants by contrasting counts of proteins binding to specific baits with counts of negative controls.
Several approaches have been proposed for computing scores for potential interaction partners, e.g. the commonly used SAINT software. However, any pure scoring scheme is incomplete in a statistical sense: we show that further pre- and postprocessing of the data significantly improves the reliability of results. Normalization is one of the most important preprocessing steps in data analysis, and we investigate the influence of different normalization methods for AP-MS data. These are motivated by methods used in the analysis of RNA-Seq experiments. The usual output from an AP-MS analysis only provides a ranking of potential interaction partners based on scores indicating true interactions. However, it remains a subjective decision where to set the score cutoff for candidates. Additionally, no control for the expected number of false positives is provided. Our aim is to replace these scores with measures that can be interpreted in a statistical way, such as a family-wise error rate or a false discovery rate, to allow the determination of a significance level. To this end, we propose a permutation methodology that shuffles the sample and control replicate labels. We apply the procedure of Westfall & Young to obtain adjusted p-values. Evaluation was conducted on a real data set as well as on simulation studies. |
Fischer M*, Renard BY
*Research Group Bioinformatics (NG 4), Robert Koch-Institute Germany |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
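As a rough illustration of the label-permutation idea in abstract H4, the sketch below computes single-step maxT adjusted p-values in the style of Westfall & Young. The test statistic (difference of mean counts between bait and control replicates), the exhaustive enumeration of relabelings, and the names `maxt_adjusted_pvalues` and `counts` are assumptions made for this example; the abstract does not specify the statistic or any normalization.

```python
from itertools import combinations


def mean(values):
    return sum(values) / len(values)


def maxt_adjusted_pvalues(counts, n_bait):
    """Single-step maxT adjusted p-values (one-sided, enrichment in bait).

    counts: {protein: [bait replicates..., control replicates...]}
    n_bait: number of leading entries that are bait replicates.
    """
    n = len(next(iter(counts.values())))

    def stat(values, bait_idx):
        # Assumed statistic: mean(bait) - mean(control) for a given labeling.
        bait = [values[i] for i in bait_idx]
        ctrl = [values[i] for i in range(n) if i not in bait_idx]
        return mean(bait) - mean(ctrl)

    observed = {p: stat(v, tuple(range(n_bait))) for p, v in counts.items()}

    # For every relabeling of replicates, keep the maximum statistic
    # over all proteins; this is the null distribution used by maxT.
    max_stats = [
        max(stat(v, idx) for v in counts.values())
        for idx in combinations(range(n), n_bait)
    ]

    return {
        p: sum(m >= t for m in max_stats) / len(max_stats)
        for p, t in observed.items()
    }


example = {"A": [10, 12, 11, 0, 1, 0], "B": [3, 2, 4, 3, 4, 2]}
adjusted = maxt_adjusted_pvalues(example, n_bait=3)
```

With three bait and three control replicates there are only 20 relabelings, so the smallest attainable adjusted p-value is 1/20; real experiments with more replicates would typically sample permutations instead of enumerating them.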
H 06
H6 |
STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) is a database of known and predicted protein-protein interactions. The database contains information from numerous sources, including experimental repositories, computational prediction methods and public text collections. It is freely accessible and regularly updated; the latest version, version 9, contains information on 3 million proteins from more than 1000 species. The STRING database and web resource is available at http://string-db.org/.
In this new updated version, the so-called "Enrichment Tool" has been added to the advanced network view. The Enrichment Tool uncovers Gene Ontology terms that are enriched in the protein network of interest. These are visualized in a popup box and are sorted by significance (p-value). It is possible to select a term to visualize the corresponding proteins in the network (color-coded). Statistical backgrounds are adjusted for specific genome-wide RNAi libraries. In addition, several performance optimizations have been made to the STRING code in order to better cope with the increasing number of species. |
Franceschini A, Simonovic M*, Roth A, Szklarczyk D, Kuhn M, Minguez P, Doerks T, Stark M, Muller J, Bork P, Jensen LJ, von Mering C
*Swiss Institute of Bioinformatics Switzerland |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 07
H7 |
The mechanism of alternative splicing contributes significantly to the repertoire of protein diversity in eukaryotes. Alternative splicing is a process by which multiple proteins can be produced from a single gene. These different protein products of a gene can have unique or different binding sites or domains. The union of these sites or domains contributes to the connectivity of a gene/node in PPI networks. The connectivity of the nodes of PPI networks has been characterized as following a power-law distribution, i.e. most of the nodes have very low connectivity while a few have very high connectivity (called hubs). Here we have investigated the reasons behind the high connectivity of hub genes/nodes in eukaryotic PPI networks with respect to the phenomenon of splice variation. We have found that, in the PPI networks of Homo sapiens, Drosophila melanogaster, Mus musculus and Rattus norvegicus, not only do hub genes have a significantly greater number of splice variants than non-hub genes, but the frequency distribution of their splice variant count is also different from that of non-hub genes. Furthermore, we have also observed that genes with a large number of splice variants tend to give rise to protein products which are unstructured. We also show that the propensity of a node for making a large number of interactions arises as a consequence of structurally disordered splice variants. Our work, therefore, sheds light on the phenomenon of alternative splicing as a significant contributor to the diversity of degree centrality of nodes in a eukaryotic PPI network and hence to its functionality. |
Sinha A*, Nagarajaram HA
*Centre for DNA Fingerprinting and Diagnostics India |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 08
H8 |
Large-scale initiatives for obtaining spatial protein structures by experimental or computational means have accentuated the need for the critical assessment of protein structure determination and prediction methods. These include blind test projects such as the critical assessment of protein structure prediction (CASP) and the critical assessment of protein structure determination by nuclear magnetic resonance (CASD-NMR). It is important to establish structure validation criteria that can reliably assess the accuracy of a new protein structure. Various quality measures derived from the coordinates have been proposed. A universal structural quality assessment method should combine multiple individual scores in a meaningful way, which is challenging because of their different measurement units. Here, we present a method based on a generalized linear model (GLM) that combines diverse protein structure quality scores into a single quantity with intuitive meaning, namely the predicted coordinate root-mean-square deviation (RMSD) value between the present structure and the (unavailable) “true” structure (GLM-RMSD). For two sets of structural models from the CASD-NMR and CASP projects, this GLM-RMSD value was compared with the actual accuracy given by the RMSD value to the corresponding, experimentally determined reference structure from the Protein Data Bank (PDB). The correlation coefficients between actual (model vs. reference from PDB) and predicted (model vs. “true”) heavy-atom RMSDs were 0.69 and 0.76 for the two datasets from CASD-NMR and CASP, respectively, which are considerably higher than those for the individual scores (-0.24 to 0.68). The GLM-RMSD can thus predict the accuracy of protein structures more reliably than individual coordinate-based quality scores. |
Bagaria A*, Jaravine V, Huang Y, Montelione G, Guentert P
*Institute of Biophysical Chemistry, Center for Biomolecular Magnetic Resonance and Frankfurt Institute of Advanced Studies, Goethe University, Frankfurt Germany |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 09
H9 |
Proteins are localized in different compartments of the cell. During the last decade, more and more proteins have been annotated with subcellular localization information, and advanced prediction methods have been released that focus on higher prediction probability or on assigning compartments to more proteins. We combined the available information and present it in a single web service.
We collected the knowledge available in the curated databases SwissProt, SGD, FlyBase, WormBase and MGI and deposited it in our database. Moreover, we used prediction software such as YLoc and PSORT to process 1,684,376 unique protein sequences and stored the results locally. In addition, we used an in-house text-mining tool to find pairs of co-mentioned proteins and subcellular localizations in 22 million PubMed abstracts. Each pair is assigned a score based on a weighted combination of the number of unique co-mentions and the ratio of the observed number of co-occurrences to the expected number of co-occurrences. We categorized the proteins into 13 subsets based on their localization. Upon a query, we visualize the results by painting the compartments in a schematic picture of the cell. Moreover, we present an overview figure in which the color gradient correlates with the reliability of the evidence, to convey the confidence of the information. Link-outs are provided to the above-mentioned databases, and relevant PubMed abstracts are shown in order to support the text-mining results. We thus provide a new resource in which protein localization is presented in a novel way. |
Binder J*, Schneider R, Juhl Jensen L
*EMBL Germany |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
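The text-mining score in abstract H9 is described only as a weighted combination of the co-mention count and the observed-to-expected co-occurrence ratio. One plausible form, shown purely as an assumption, weights the two terms geometrically; the function name `comention_score`, the exponent `alpha` and its default value are hypothetical and are not taken from the abstract.

```python
def comention_score(c_pair, c_protein, c_location, c_total, alpha=0.6):
    """Hypothetical co-mention score for a (protein, localization) pair.

    c_pair:     abstracts mentioning both the protein and the localization
    c_protein:  abstracts mentioning the protein
    c_location: abstracts mentioning the localization
    c_total:    total number of abstracts considered
    alpha:      assumed weight between raw count and observed/expected ratio
    """
    if c_pair == 0:
        return 0.0
    # Expected co-occurrences if protein and localization were independent.
    expected = c_protein * c_location / c_total
    return c_pair ** alpha * (c_pair / expected) ** (1 - alpha)


# A pair co-mentioned often scores higher than one co-mentioned rarely,
# given the same marginal counts.
strong = comention_score(10, 20, 50, 1000)
weak = comention_score(2, 20, 50, 1000)
```

Setting `alpha` to 1 reduces the score to the raw co-mention count, while `alpha` of 0 leaves only the observed-to-expected ratio.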
H 10
H10 |
Drug discovery for autoimmune diseases has recently been recognized as an important task. Previously, 33 proteins were selected such that their gene promoter regions were specifically methylated or demethylated in common across three autoimmune diseases: systemic lupus erythematosus, rheumatoid arthritis, and dermatomyositis. These 33 proteins are AIM2, CARD15, CD82, CSF1R, CSF3, CSF3R, DHCR24, ERCC3, GRB7, HGF, HOXB2, IFNGR2, LCN2, LMO2, LTB4RR, MMP14, MMP8, MPL, PADI4, PECAM1, PI3, RARA, S100A2, SEPT9, SLC22A18, SPI1, SPP1, STAT5A, SYK, TIE1, TM7SF3, TRIP6, and VAMP8. Some of them are known to be related to immune diseases, e.g. through resistance to inflammation. In this study, we perform structure prediction for these selected proteins. FAMS (Full Automatic Modeling System) was employed for this purpose, and we were able to predict three-dimensional structures with sufficiently small P-values. Based upon the annotations attributed to proteins whose structures are similar to the predicted model structures, most of the selected proteins are suggested to be autoimmunity-related proteins. Thus, they will be important drug target candidates. We also found that some previously proposed ligands can bind to some of the selected proteins. Moreover, we successfully predicted complex formation between selected proteins. This suggests the possibility of a new class of drug target, namely the suppression of these protein complex formations. |
Ishida S, Umeyama H, Iwadate M, Taguchi Y*
*Department of Physics, Chuo University Japan |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 11
H11 |
Detecting protein complexes from protein-protein interaction (PPI) networks is a difficult challenge in computational biology. There is ample evidence that many disease mechanisms involve protein complexes, and being able to predict these complexes is important to the characterization of the relevant disease for diagnostic and treatment purposes. This paper introduces a novel method (ProRank) for detecting protein complexes from PPI networks by ranking proteins using the PageRank algorithm. ProRank quantifies the importance of each protein based on the interaction structure and the evolutionary relationships between proteins in the network. A novel way of identifying essential proteins, which are known for their critical role in mediating cellular processes and constructing protein complexes, is proposed and analyzed. We evaluate the performance of ProRank using two PPI networks on two reference sets of protein complexes created from MIPS, containing 81 and 162 known complexes respectively. We compare the performance of ProRank to some of the well-known protein complex prediction methods (ClusterONE, CMC, CFinder, MCL, MCode and Core) in terms of precision and recall. We show that ProRank predicts more complexes correctly at a competitive level of precision and recall. The level of accuracy achieved using ProRank in comparison to other recent methods for detecting protein complexes is a strong argument in favor of the proposed method. |
Zaki N*
*United Arab Emirates University (UAEU) United Arab Emirates |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
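To make the ranking step in abstract H11 concrete, here is a minimal sketch of standard PageRank by power iteration on an undirected interaction network. It covers only the generic PageRank part: the evolutionary-relationship weighting and the complex-detection steps of ProRank are not described in enough detail in the abstract to reproduce, and the damping factor, iteration count and function name are assumptions.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Plain PageRank on an undirected graph given as (protein, protein) pairs."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    n = len(neighbours)
    rank = {p: 1.0 / n for p in neighbours}

    for _ in range(iterations):
        # Each protein shares its current rank equally among its partners.
        rank = {
            p: (1 - damping) / n
            + damping * sum(rank[q] / len(neighbours[q]) for q in neighbours[p])
            for p in neighbours
        }
    return rank


# Toy network: one hub protein with three partners.
ranks = pagerank([("HUB", "A"), ("HUB", "B"), ("HUB", "C")])
top_protein = max(ranks, key=ranks.get)
```

In this toy star network the hub receives the highest rank, which is the intuition behind using such scores to pick out proteins that are central to a complex.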
H 12
H12 |
Translation is an energetically expensive operation in dividing cells. The tRNA Adaptation Index (tAI) is an indirect measure of translation efficiency. This measure considers the relative abundance of tRNAs and the codon–anticodon wobble rules. Specifically, low and high tAI values indicate lower and higher translation rates, respectively. Non-uniform tAI values along the transcript may affect the overall speed of translation via ribosome stalling. Previous studies defined the lower tAI values at the beginning of the coding sequence as a ‘ramp’ which contributes to translation efficiency by preventing ribosomal drop-off and collisions.
In this research, we partitioned the coding sequences into distinct groups of secreted, membranous and cytosolic proteins and analyzed the tAI profile for each group. The secretory proteome comprises a third of all proteins and is translated by ribosomes that are docked at the ER membranes. These proteins have a signal peptide (SP) at their N-terminal region, or a transmembrane domain. We found that proteins with SPs are characterized by a ‘ramp’ at the N-terminal segment. Furthermore, these proteins have a higher global tAI and are shorter in length. In contrast, membranous and cytosolic proteins (lacking SPs) show no evidence of a ‘ramp’. We conclude that the tAI profile reflects an evolutionary refinement of the secreted proteins, whose translation must be tightly controlled. Accordingly, the translation rate of the initial segment is attenuated, allowing a ribosome spacing that minimizes translation drop-off. The reported trends for the secretory proteomes apply to a large number of eukaryotes. |
Mahlab S*, Linial M
*The Hebrew University of Jerusalem Israel |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
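The tAI profile discussed in abstract H12 can be sketched as a sliding-window geometric mean of per-codon weights. The geometric-mean definition of tAI is standard; the window size, the function names and the toy weight values below are assumptions for illustration only, since real weights are derived from tRNA gene copy numbers and wobble rules.

```python
import math


def tai(codons, weights):
    """Geometric mean of the relative adaptiveness weights of the codons."""
    return math.exp(sum(math.log(weights[c]) for c in codons) / len(codons))


def tai_profile(cds, weights, window=3):
    """Local tAI along a coding sequence, one value per window of codons."""
    usable = len(cds) - len(cds) % 3  # drop any trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    return [
        tai(codons[i:i + window], weights)
        for i in range(len(codons) - window + 1)
    ]


# Toy weights (hypothetical): AAA is decoded slowly, GGG quickly.
toy_weights = {"AAA": 0.25, "GGG": 1.0}
profile = tai_profile("AAAAAAGGGGGG", toy_weights, window=2)
```

A profile that starts low and rises, as in this toy sequence, is the pattern the abstract refers to as a ‘ramp’.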
H 13
H13 |
Domains represent the fundamental units of proteins and mediate their physical interactions. Consequently, analysis of domain-domain interactions (DDIs) may improve our understanding of protein-protein interactions (PPIs) and how these impact cellular function, disease, and evolution. Physical interactions between domains are typically identified by analyzing crystal structures of protein complexes; however, currently available experimental DDI data cover only a small fraction of all existing PPIs, and determining which particular DDI mediates any given PPI is a challenge. Herein, we present two contributions to the field of DDI analysis. First, we introduce a novel computational method, parameter-dependent DDI selection (PADDS), which, given a set of protein interactions, identifies a small set of DDIs that can mediate the original set of PPIs. We evaluated PADDS on a multi-organism PPI set and showed that it identified more experimentally detected DDIs than existing computational approaches. Second, we demonstrate that computational DDI identification is heavily dependent on the available protein-domain annotation. Because none of the currently available annotation databases provides comprehensive protein-domain annotation for any one organism, we introduced a computational strategy to systematically merge annotation data from multiple sources, ensuring large and consistent domain annotation for any given organism. Applying this strategy to yeast data from six different annotation databases, we increased the average number of domains per protein from 1.05 to 2.44, bringing it closer to the estimated average value of 3. |
Memisevic V, Wallqvist A, Reifman J*
*Department of Defense Biotechnology High Performance Computing Software Applications Institute United States of America |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 14
H14 |
Proteogenomic approaches have gained increasing popularity; however, it is still difficult to integrate mass spectrometry identifications with genomic data due to differing data formats. To address this difficulty, we developed the ‘integrating Peptide spectrum matches into Genome browser visualizations’ (iPiG) tool. With it, the concurrent analysis of proteomic and genomic data is significantly simplified and proteomic results can be compared directly to genomic data.
The main idea of iPiG is the mapping of peptide spectrum matches (PSMs) to their corresponding gene of origin using a comprehensive set of known genes. The mapping requires a matching of protein identifiers to gene identifiers as well as a string matching of the PSM peptide sequence to the translation of the selected gene. This yields the exact genomic position of the origin of the PSM, derived from the location annotation of the gene. Unmapped PSMs are processed again with a general string matching search against the whole gene set. Finally, the genomic positions of all mapped PSMs are exported as genome annotation tracks in formats such as BED or GFF3. These formats can easily be loaded into most common genome browsers. Once imported into a genome browser, the PSMs are automatically visualized at the right positions by the browser. This allows a genome position-oriented and gene expression-like overview of peptide spectrum matches. iPiG is implemented in Java with a graphical user interface. It is freely available from https://sourceforge.net/projects/ipig/. |
Kuhring M, Renard B*
*Research Group Bioinformatics (NG4), Robert Koch-Institute Germany |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
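The mapping step described in abstract H14 can be illustrated for the simplest case. The sketch below is not iPiG's code: it assumes a single-exon gene whose protein translation is already known, and the function name `psm_to_bed` and its parameters are hypothetical. Coordinates follow the BED convention of 0-based, half-open intervals.

```python
def psm_to_bed(peptide, protein_seq, chrom, cds_start, cds_end, strand, name="PSM"):
    """Map a peptide to genomic coordinates for a single-exon gene.

    cds_start/cds_end: 0-based, half-open genomic interval of the coding
    sequence. Returns one BED line (chrom, start, end, name, score, strand)
    or None if the peptide does not occur in the translation.
    """
    offset = protein_seq.find(peptide)  # position in amino acids
    if offset < 0:
        return None
    if strand == "+":
        start = cds_start + 3 * offset
    else:
        # On the minus strand the protein N-terminus lies at the genomic end.
        start = cds_end - 3 * (offset + len(peptide))
    end = start + 3 * len(peptide)
    return f"{chrom}\t{start}\t{end}\t{name}\t0\t{strand}"


plus_line = psm_to_bed("TAY", "MKTAYIAK", "chr1", 100, 124, "+")
minus_line = psm_to_bed("TAY", "MKTAYIAK", "chr1", 100, 124, "-")
```

Multi-exon genes would additionally require splitting a peptide across exon boundaries into several blocks, which this sketch deliberately leaves out.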
H 15
H15 |
The human protein SCRIB is an important regulator of cell polarity. SCRIB has been termed a tumour suppressor, as its dysregulation (e.g. by oncogenic viruses) can lead to tumorigenesis. Numerous proteins have been identified that bind to SCRIB via one of its four PDZ domains. PDZs are a class of globular domains that recognize C-terminal peptides. Phage display data have frequently been used to study the specificity of PDZ-peptide interactions and to develop PDZ interaction predictors. We assessed a published large-scale phage display data set for its application to PDZ-peptide interaction predictions. We showed that two thirds of the phage display data are biased towards hydrophobic residues, resulting in less reliable predictions of cellular PDZ-binding motifs. Unbiased phage display data have been published for PDZ1, PDZ2, and PDZ3 of SCRIB. We used these data together with structural information to predict cellular binding partners for the four PDZs of SCRIB. We selected 43 predicted peptides for experimental validation using automated HoldUp, a quantitative chromatographic retention assay. Twenty-five peptides bound to at least one of the four PDZs of SCRIB with dissociation constants below 100 µM. We constructed a protein interaction network of these validated new binders, published associated proteins of SCRIB, and their physical interactors. This network was searched for paths that link the new potential interactors with associated proteins of SCRIB. An initial analysis revealed interesting functional links between DOCK2, GUCSA2, YAP1, and SCRIB. |
Luck K*, Charbonnier S, Fournane S, Foltz C, Iv F, Nominé Y, Vincentelli R, Travé G
*Institut de Recherche et de Biotechnologie de Strasbourg France |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 16
H16 |
Proteins interact with each other to perform essential functions in cells. Consequently, identification of their binding interfaces could provide key information for drug design. With the increasing number of experimentally determined protein 3D structures, these structures have become the main source of information for interface prediction methods, which rely on analysing proteins that are structurally similar to the query protein.
Current state-of-the-art methods do not deal adequately with biases generated by the heterogeneous nature of the PDB. First, the presence of complex duplicates, or homologs, biases predictions towards specific configurations, which can negatively affect performance. Second, confidence in the information provided by the interface of a structural neighbour should depend on its degree of homology with the query protein. To address these limitations, we have introduced Weighted Protein Interface Prediction (WePIP), an original framework which predicts protein interfaces from homologous complexes. WePIP takes advantage of a novel weighted score which is not only based on structural neighbours’ information but, unlike current state-of-the-art methods, also takes into consideration the nature of their interaction partners (ligands). Experimental validation demonstrated that our novel weighting scheme significantly improves prediction performance. In particular, we establish the major contribution of ligand diversity quantification. Moreover, application of our framework to a standard dataset shows that WePIP's performance compares favourably with other state-of-the-art methods. |
Esmaielbeiki R*, Nebel J
*Kingston University London United Kingdom |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 17
H17 |
Following multiple testing correction, few SNPs reach genome-wide significance in genome-wide association studies (GWAS). This can leave many SNPs with “suggestive” p-values, with associations not reaching genome-wide significance.
Region Growing Analysis (RGA) is a network-based method that helps separate false negatives from random noise and true negatives by incorporating prior biological information, such as Protein Interaction Networks (PINs), into GWAS data. In doing so, clusters of associated genes and genes with “suggestive” p-values are identified within a PIN. The aim is to identify regions within PINs which represent malfunctioning pathways or protein complexes involved in disease. Results are not interrogated independently gene by gene, but at the protein complex or pathway level. The analysis uses a ranked list of genes, split at a user-defined threshold. Genes reaching the threshold are called ‘seed’ genes. RGA finds regions within a network which are more enriched for seed genes than expected by chance. RGA begins from a ‘seed’ gene whose neighbours are visited one by one. If a ‘seed’ gene’s neighbour is itself a ‘seed’ gene, or has an additional neighbour that is a ‘seed’ gene, it is included in the region. This process continues for all neighbours of all genes within the region and stops when no additional genes can be added to the region. This method allows detection of additional biologically plausible candidate genes contributing to complex disease beyond those found through GWAS alone. A description of the algorithm is presented along with results from WTCCC (2007) data. |
Lehne B, Sutherland R*, Tebbe C, Barkas N, Ahlers V, Sprengel F, Schlitt T
*Department of Medical and Molecular Genetics United Kingdom |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
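The growth rule described in abstract H17 can be sketched as a breadth-first expansion. This is one reading of the abstract, not the authors' implementation: "an additional neighbour that is a seed gene" is interpreted here as a seed neighbour other than the gene from which the candidate was reached, and the enrichment test against chance is omitted. The names `grow_region`, `neighbours` and `seeds` are hypothetical.

```python
from collections import deque


def grow_region(start, neighbours, seeds):
    """Grow a region from a seed gene in a protein interaction network.

    neighbours: {gene: iterable of interacting genes} (symmetric adjacency)
    seeds:      set of genes passing the user-defined p-value threshold
    """
    region = {start}
    queue = deque([start])
    while queue:
        gene = queue.popleft()
        for candidate in neighbours.get(gene, ()):
            if candidate in region:
                continue
            # Does the candidate touch a seed other than the gene we came from?
            links_seed = any(
                other in seeds
                for other in neighbours.get(candidate, ())
                if other != gene
            )
            if candidate in seeds or links_seed:
                region.add(candidate)
                queue.append(candidate)
    return region


network = {
    "S1": ["A", "D"], "A": ["S1", "S2"], "S2": ["A", "B"],
    "B": ["S2", "C"], "C": ["B"], "D": ["S1"],
}
region = grow_region("S1", network, seeds={"S1", "S2"})
```

In this toy network the non-seed gene A is pulled into the region because it bridges two seed genes, whereas B and D, which touch only one seed, are left out.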
H 18
H18 |
Proteins play vitally important roles in living cells, and the various functions they provide depend directly on a properly folded tertiary structure. Since determining the tertiary structure of a protein from its sequence, whether computationally or experimentally, is time-consuming and often unsuccessful, the prediction of secondary structure has come back into focus as a stepping stone to the prediction of tertiary structure.
We here introduce a new method that defines patterns for each structural motif by clustering all experimentally verified structural motifs in our benchmark dataset. This process is done in four steps. In the first step, experimentally verified amino acid sequences are re-encoded using several recently proposed reduced alphabets. The idea is to examine the association between secondary structure formation and evolutionary information, as well as the impact of the biochemical properties of amino acids. In the following steps, amino acid sequences are grouped into subcategories according to their lengths. For each length group, the K-means clustering algorithm is run, and a root mean square metric is computed to assess the quality of the clustering. In the final step, profiles that describe a structural element with certain weights at certain positions are created. These profiles can then be mapped onto a query sequence to determine secondary structure according to the match score. We applied this approach to the prediction of alpha helices, a secondary structure motif which has several related structural motifs such as the 3₁₀-helix. We compared our prediction results to those of other tools, although our predictions are more refined since we differentiate between different helix types. |
Has C*, Toprak M, Allmer J
*Izmir Institute of Technology Turkey |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 19
H19 |
The ability to sequence the whole genome of an organism creates the need to determine its precise gene products and their biological meaning. To reduce the time required for manual annotation of gene products, genome annotations are produced by computational methods. These computational annotations can be validated by mapping mass spectrometric data onto the genomic sequence. In this way, novel protein-coding regions can be identified, current gene models validated, and the upstream and downstream regions of genes determined.
In this study, mass spectrometric data were searched against the six-frame translated human genome. Twenty-five publicly available human blood plasma mass spectrometric data sets were collected from PeptideAtlas. The initial gene models in ENSEMBL were ignored, and peptides identified with 99% confidence were mapped to the six-frame translation of the genome. Some of the identified peptides were found in the current release of the human proteome database. The rest were “orphan” peptides, which are not assigned in the human proteome database but were found in the six-frame translated genome. By intersecting the current gene models with our map, we were able to confirm many of the proposed exons. Moreover, many peptides were found at exon-intron boundaries (overlapping exonic peptides), in intronic regions (intronic peptides) and in intergenic regions (intergenic peptides). Taking into account the number of spectra that support the existence of each peptide, necessary changes to the current models were suggested. For intergenic peptide clusters in our map, homology searches and consideration of gene signals need to be performed to propose new gene models. |
Has C*, Boz S, Allmer J
*Izmir Institute of Technology Turkey |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
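The six-frame mapping described in abstract H19 can be illustrated with a minimal sketch. It assumes the standard genetic code and an uppercase DNA string containing only A, C, G and T; the function names and the toy sequence are illustrative and do not come from the authors' pipeline.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return the six translations keyed by strand and frame, e.g. '+1' or '-3'."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def map_peptide(peptide, dna):
    """List (frame, protein position) for every frame containing the peptide."""
    return [(name, protein.find(peptide))
            for name, protein in six_frames(dna).items()
            if peptide in protein]


hits = map_peptide("MAK", "ATGGCCAAGTAA")  # toy sequence
```

A peptide that maps to a frame but to no annotated protein would correspond to an "orphan" peptide in the sense of the abstract.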
H 20
H20 |
Characterizing the entire human interactome is a key effort in current proteomics research. This challenge is complicated by the dynamic nature of protein-protein interactions (PPIs), which are conditional on the cellular context: most importantly, both interacting proteins must be expressed in the same cell and localized in the same organelle. Furthermore, PPIs are subject to delicate control through signaling pathways. Despite the high degree of cell-state specificity of PPIs, many interactions are measured under artificial conditions, and even when they are detected in a physiological context, this information is missing from the common PPI databases.
We developed a method that assigns context information to PPIs, inferred from attributes of the interacting proteins, and makes the annotated interactions available via the HIPPIE web tool (Human Integrated Protein-Protein Interaction rEference; http://cbdm.mdc-berlin.de/tools/hippie/). We explore the addition of context to the collection of human PPIs using gene expression profiles of 84 human tissues, functional and disease annotations, as well as pathway inference. We demonstrate that context consistency correlates with the experimental reliability of PPIs and that context filters are able to highlight meaningful interactions with respect to various biological questions. We apply our approach to identify lung-specific pathways used by the influenza virus and brain-specific signaling pathways that play a role in Alzheimer’s disease. |
Schaefer MH*, Lopes TJS, Mah N, Shoemaker JE, Matsuoka Y, Fontaine J, Louis-Jeune C, Eisfeld AJ, Neumann G, Perez-Iratxeta C, Kawaoka Y, Kitano H, Andrade-Navarro MA
*Max Delbrück Center for Molecular Medicine, Computational Biology and Data Mining Germany |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
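The idea of a tissue context filter from abstract H20 can be sketched as follows: an interaction is kept only if both partners are expressed in the tissue of interest. This is an illustrative toy, not the HIPPIE implementation or its API; the protein names, tissue sets and function name are assumptions.

```python
# Toy data; protein and tissue names are illustrative only.
ppis = [("P1", "P2"), ("P1", "P3"), ("P2", "P4")]
expression = {
    "P1": {"lung", "brain"},
    "P2": {"lung"},
    "P3": {"brain"},
    "P4": {"liver"},
}


def filter_by_tissue(ppis, expression, tissue):
    """Keep interactions whose two partners are both expressed in `tissue`."""
    return [
        (a, b)
        for a, b in ppis
        if tissue in expression.get(a, set()) and tissue in expression.get(b, set())
    ]


lung_ppis = filter_by_tissue(ppis, expression, "lung")
brain_ppis = filter_by_tissue(ppis, expression, "brain")
```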
H 21
H21 |
Background: As selected reaction monitoring (SRM/MRM) mass spectrometry is increasingly used in the field of proteomics to support hypothesis driven experimentation, consistent data interpretation becomes very important. Manual data integration needs to be replaced with automated analysis, with controlled false discovery rates and well defined criteria of detection.
Methods: By transforming the multiple SRM data channels into their pair-wise ratios, peptides are reliably detected and measured. Measurement quality is calculated using an in silico generated null distribution. Results: We present a new algorithm for performing automated SRM data analysis, focusing on label-free SRM, and compare this with existing alternatives. Typical SRM phenomena that might hinder correct peptide detection and integration are exemplified and discussed. Conclusion: Automating SRM data analysis allows consistent data handling, as well as increasing work-flow throughput by reducing data analysis cost and time. This will in turn enable large scale SRM studies targeting entire pathways, organelles or large sets of clinical material. |
Teleman J*, Malmström J, Levander F
*Dept. of Immunotechnology, LUND UNIVERSITY Sweden |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 22
H22 |
The relationship between mRNA and protein sequences as embodied in the genetic code is a cornerstone of modern-day molecular biology. However, a potential connection between physico-chemical properties of mRNAs and cognate proteins, with implications concerning both code's origin and mRNA-protein interactions, remains unexplored. Here, we compare pyrimidine content of mRNA coding sequences with the propensity of cognate protein sequences to interact with pyrimidines. The latter is captured by polar requirement, a measure of solubility of amino acids in aqueous solutions of pyridines, heterocycles closely related to pyrimidines. By analyzing complete proteomes of 15 different species, we find that the higher the pyrimidine content of a given mRNA, the stronger the average propensity of its cognate protein to interact with pyridines. Remarkably, window-averaged pyrimidine profiles of individual mRNAs strongly mirror polar-requirement profiles of cognate protein sequences for both membrane and cytosolic proteins. For example, 4953 human proteins exhibit a correlation between the two with a Pearson correlation coefficient |R|>0.8. In other words, pyrimidine-rich regions in mRNAs quantitatively correspond to regions in cognate proteins containing amino acids that are soluble in pyrimidine mimetics and vice versa. Moreover, by studying randomized genetic code variants, we show that the natural code is highly optimized to preserve the observed correlations. Overall, our findings refine and reinforce the stereo-chemical hypothesis concerning the code’s origin and provide evidence of direct complementary interactions between mRNAs and cognate proteins in the era before the development of ribosomal decoding, but also in today's cells, especially if both are unstructured. |
Hlevnjak M*, Polyansky AA, Zagrovic B
*Max F. Perutz Laboratories, University of Vienna Austria |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 23
H23 |
A description of the community structure of integrated biological networks provides context to the role of individual gene products and can facilitate a systems level understanding of the underlying processes. Here we explore the community structure of a major plant fungal pathogen Fusarium graminearum in terms of disjoint and overlapping modules. F. graminearum infects wheat as well as other cereals such as maize and barley, significantly affecting yield and also resulting in contamination by harmful mycotoxins. The genome sequence is known (13,718 proteins) and about 100 gene products have been linked to pathogenicity through experimental means. We have constructed an integrated network for F. graminearum, which combines information from gene co-expression, predicted protein-protein interactions and sequence similarity. The disjoint community structure is detected using a greedy agglomerative method that optimises the modularity of the network. This partition is then converted to a set of overlapping modules, in which genes can belong to more than one module, through the application of a mathematical programming method that optimises the community strength of all communities. We examine the general functional and topological properties of the modules, and the specific network characteristics of those gene products known to be associated with pathogenicity. Such approaches can enhance understanding of infection biology and have the potential to improve the development of control strategies. |
Bennett L, Lysenko A, Papageorgiou LGP, Urban M, Hammond-Kosack K, Rawlings C, Saqi M, Tsoka S*
*Department of Informatics, King's College London United Kingdom |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
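Abstract H23 relies on the modularity of a network partition. As a minimal sketch, assuming the usual definition Q = Σ_c [L_c/m − (d_c/2m)²] for an undirected, unweighted graph (L_c internal edges, d_c total degree of community c, m edges overall), the score of a given disjoint partition can be computed as below. The greedy agglomerative search and the overlapping-module step are not shown, and the toy graph is an assumption.

```python
def modularity(edges, community_of):
    """Modularity of a disjoint partition of an undirected, unweighted graph."""
    m = len(edges)
    internal = {}  # number of edges inside each community
    degree = {}    # summed node degree per community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())


# Toy network: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_modules = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_module = {node: "a" for node in range(6)}

q_split = modularity(edges, two_modules)
q_single = modularity(edges, one_module)
```

On this toy graph the two-triangle partition scores 5/14, while placing all nodes in one community scores 0, which is why an agglomerative search would stop at the split.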
H 24
H24 |
Molecular recognition plays a fundamental role in all biological processes, which is why great efforts have been made to understand and predict protein-ligand interactions. Finding a molecule that can potentially bind to a targeted protein is particularly essential in the drug discovery process. In silico tools are frequently used for this purpose, usually to screen molecular libraries in order to identify new lead compounds. If the protein structure is known, various protein-ligand docking programs can be used to perform that task. If not, ligand-based techniques such as CoMFA are employed. Unfortunately, currently available docking software is far from reliably predicting the correct conformation of a ligand in the active site. Moreover, analysis of enrichment factors and of docking scores obtained with scoring functions shows poor correlation with experimentally determined binding values, suggesting that the ‘scoring’ capabilities of the programs are also unsatisfactory. Here we present a complete docking pipeline, from acquiring the molecular target up to calculating ligand binding affinity, utilizing the VoteDock consensus approach. |
Plewczynski D*, Łaźniewski M, Ginalski K
*ICM, University of Warsaw & Warsaw Medical University Poland |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 25
H25 |
We aim to help bioinformaticians and computational biologists who get lost among the large number of available protein-interaction and pathway databases. Our survey provides a valuable tool for researchers to reduce the time necessary to gain a broad overview of PPI-databases and is supported by a graphical representation of data exchange. The graphical representation is made available in cooperation with the team maintaining www.pathguide.org and can be accessed at http://www.pathguide.org/interactions.php in the form of a Cytoscape web implementation. Copies of the original CYS file can be e-mailed upon request.
Key points: 1) The Cytoweb visualization of PPI databases is a convenient way to identify independent PPI resources (available at: http://www.pathguide.org/interactions.php). 2) Querying metamining databases is not equivalent to querying the source databases independently. 3) The term ‘protein–protein interaction’ is ambiguous and can refer to direct physical interactions, membership of the same protein complex or functional (indirect) interactions. |
Klingström T, Plewczynski D*
*University of Warsaw Poland |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 26
H26 |
Elucidating the dynamics of molecular processes in living organisms in response to external perturbations is one of the central goals of systems biology. We investigated the dynamics of protein phosphorylation events in Arabidopsis thaliana (Ath) exposed to six nutrient starvation conditions and subsequent resupply and one control condition. Phosphopeptide levels were measured at five consecutive time points (0, 3, 5, 10, and 30 min after nutrient resupply). When embedded in the protein-protein interaction network (PIN) of Ath, phosphoproteins were found to have a higher PIN degree than average proteins. Based on the obtained time-series data, we reconstructed the molecular interaction network. We assessed the performance of different network inference methods relative to the successful prediction of intra-organellar interactions as annotated in the SUBA and AtPIN databases. Graphical Gaussian models proved to work best. The topology of the inferred networks corresponded to an information dissemination architecture with the average in-degree being smaller than the out-degree. Hub proteins were found to be associated with kinase and transporter functions. Our results demonstrate that modern proteomics technologies allow monitoring time-resolved phosphorylation cascade events and, combined with established network inference methods, novel insight into the molecular signaling events following external perturbations can be obtained. |
Duan G*, Schulze W, Walther D
*Max Planck Institute for Molecular Plant Physiology Germany |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 27
H27 |
Motivation: Most functions within the cell emerge thanks to protein-protein interactions (PPIs), yet their experimental determination is both expensive and time consuming. Prediction of interactions using solely PPI-network topology (topological prediction) is a novel approach that is convenient when prior biological knowledge is absent or unreliable.
Methods: Network-embedding emphasizes relations between network proteins embedded in a low-dimensional space, where protein-pairs closer to each other represent potential candidate interactions. The approach we propose is based on the intuition that the use of the non-centred minimum curvilinear embedding (ncMCE) - first innovation - combined with the shortest-path (SP) distance for assigning likelihood scores in the reduced space - second innovation - might boost the performance in prediction. In addition, we introduce an automatic strategy - third innovation - for the selection of the appropriate embedding dimension. Results: We compared our method against several unsupervised and supervised embedding approaches, and against node-neighbourhood-based techniques. Despite its computational simplicity, ncMCE-SP was the overall leader, outperforming the current methods for topological link prediction. The superiority of ncMCE may be due to its ‘soft-threshold effect’, which boosts the separation between good and bad candidate links. Conclusion: Minimum curvilinearity is a valuable nonlinear framework, which we successfully applied in embedding of protein networks for unsupervised prediction of novel PPIs. The rationale is that a certain level of prior biological knowledge is hidden and memorised in the ‘nonlinear evolutionary relations’ of the network topology and thus can be used for prediction. The predicted PPIs represent good candidates to test in high-throughput experiments. |
Cannistraci CV*, Alanis Lobato G, Ravasi T
*King Abdullah University of Science and Technology Saudi Arabia |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 28
H28 |
Glycosylphosphatidylinositol (GPI) is a post-translational modification molecule which plays important roles in the vital activities of eukaryotic cells. Each GPI binds to a single amino acid at a site called the omega-site in soluble proteins (GPI-anchored proteins, GPI-APs) in the endoplasmic reticulum and is secreted on the surface of the plasma membrane. There are various GPI-APs related to incurable human disorders including bovine spongiform encephalopathy and paroxysmal nocturnal hemoglobinuria, among many others. Thus, identification and functional analysis of GPI-APs are believed to be crucial for the understanding of vital activities of eukaryotic cells and the resolution of the molecular mechanisms of incurable human disorders. Therefore, development of computational methods to predict GPI-APs and their omega-sites with high accuracy from genome/protein sequences is of utmost importance.
In this study, new methods for the detection of GPI-APs and for the prediction of the location of omega-sites by physicochemical properties, position specific scores (PSSs) and back-propagation artificial neural networks (BP-ANNs) were developed. GPI-AP and omega-site datasets were obtained from the Swiss-Prot database. The non-GPI-AP dataset was collected based on the hydropathy of N- and C-terminal sequences, and sequences around A, C, D, G, N and S residues in GPI-AP sequences without omega-sites were extracted as a non-omega-site dataset. PSSs were calculated based on amino acid propensities around the omega-sites. PSSs were applied to BP-ANNs which consist of a three-layered structure. This method could distinguish GPI-APs from non-GPI-APs and also could discriminate omega-sites from non-omega-sites with higher accuracy than other GPI-APs detection tools reported previously. |
Tanaka H*, Konishi T, Sasaki T, Ikeda M, Mukai Y
*Faculty of Electrical Engineering, Graduate School of Science and Technology, Meiji University Japan |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 29
H29 |
Enzymes are proteins that play an important role in biochemical reactions as catalysts. They are classified, based on the reaction they catalyze, in a hierarchical scheme by the International Enzyme Commission (EC). This hierarchical scheme is expressed as a four-level tree structure, and a unique number is assigned to each enzyme class. There are six major classes at the top level according to the reaction they carry out, and sub-classes at the lower levels are more specific reactions of these classes. The aim of this thesis is to build a three-level classification model based on the hierarchical structure of EC classes. The ENZYME database is used to extract the information of EC classes, and enzymes are assigned to these EC classes. Primary sequences of enzymes extracted from the UniProtKB/Swiss-Prot database are used to extract features. A subsequence-based feature extraction method, Subsequence Profile Map (SPMap), is used in this study. SPMap is a method that explicitly models the differences between positive and negative examples. SPMap pays attention to the conserved subsequences of protein sequences in the same class. SPMap generates the feature vector of each sample protein as a probability of fixed-length subsequences of this protein with respect to a probabilistic profile matrix calculated by clustering similar subsequences in the training dataset. In our case, positive and negative training datasets are prepared for each class, at each level of the tree structure. SPMap is used for feature extraction and Support Vector Machines (SVMs) are used for classification. Five-fold cross validation is used to test the performance of the system. The overall sensitivity, specificity and AUC values for the six major EC classes are 93.08%, 98.95% and 0.993, respectively. The results at the second and third levels are also promising. |
Yaman AG*, Atalay V, Cetin-Atalay R
*Middle East Technical University Department of Computer Engineering Turkey |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 30
H30 |
The nitrogen-containing bisphosphonates (N-BPs) are well established as the treatments of choice for disorders of excessive bone resorption, myeloma and bone metastases, and osteoporosis. They inhibit farnesyl pyrophosphate synthase (FPPS), a key enzyme in the mevalonate pathway, resulting in inhibition of the prenylation of small GTP-binding proteins in osteoclasts and disruption of their cytoskeleton, adhesion/spreading, and invasion of cancer cells. Very few examples of the synthesis of α-amino bisphosphonates based on amino acids are known from the literature. In the present work, amino acid esters react with ketophosphonate (or their analog acid or acyl) to afford the desired products, α-iminophosphonates. The reaction of the imine with dimethyl phosphate in the presence of a catalytic amount of I2 gives the ester of the α-aminobisphosphonate as the sole product in good yield. Finally, we used computational docking methods to predict how several α-aminobisphosphonates bind to FPPS and how the substituents R and X influence this binding. Pamidronate, a β-aminobisphosphonate already marketed, was used as reference. These results are of interest since they represent a new and simple way to synthesize α-aminobisphosphonates with a free COOH group increased by R2 functionalisable and opening up the possibility of using molecular docking to facilitate the design of other, novel FPPS inhibitors.
|
Ghalem S*, Mesmoudi M, Daoud I
*Department of Chemistry, Faculty of Sciences, University of Tlemcen- Algeria. Algeria |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 31
H31 |
Purpose/Objectives: The set of essential proteins is the minimal requirement for cellular survival and development. The centrality-lethality rule states that proteins with high connectivity in the protein-protein interaction (PPI) network are more likely to be essential than those selected randomly. However, a significant number of proteins with low connectivity in PPI networks are also essential, and these are overlooked by most current methods.
Materials/Methods: Three PPI datasets of yeast were used to compare the differences between low- and high-connectivity essential proteins. Essential proteins were obtained from the Saccharomyces Genome Deletion Project, and protein complexes were collected from the MIPS database. Results: We compared low- and high-connectivity essential proteins in terms of centrality measures (Betweenness Centrality, Closeness Centrality, Eigenvector Centrality, Information Centrality and Subgraph Centrality), their appearance in protein complexes, the relationships among their neighbors, functional modularity and the IBEP (interactions between essential proteins). The results show that low-connectivity essential proteins have several distinguishing properties. They have higher eigenvector centrality; they tend to interact with essential proteins; they have more neighbors with the same function and their neighbors appear more often in protein complexes; and there are fewer interactions among their neighbors. These differences between low- and high-connectivity essential proteins are statistically significant. Conclusions: Our analysis confirms that there are many essential proteins with low connectivity in PPI networks. Their properties differ from those of high-connectivity essential proteins, which could be used in determining whether unclassified proteins are essential or not. |
Dong Y, Liu Q, Wang Z, Wang G*
*College of Computer, National University of Defense Technology China |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 32
H32 |
The objective of this research is to find novel biologically relevant pathway connections and non-synonymous SNPs (nsSNPs) that could be related to patient outcome upon drug treatment. The Integrated Pathway and Interaction Database (IPID), developed at the Bioinformatics Institute (BII) Singapore, is a compendium of unique PPIs from nine public databases. We have mapped these comprehensive PPI data from IPID to the human metabolic and non-metabolic pathway maps of KEGG in order to find novel biologically relevant pathway connections, which include new interactions between pathway members in the map, new proteins connecting two pathway members in the map, direct cross-pathway interactions and cross-pathway interactions via new proteins that can connect to members from different pathways. Using dbSNP, we also collect information about nsSNPs that were linked to all genes in the extended pathway of interest. To narrow down the identified nsSNPs for further analysis, we rank them by several criteria: minor allele frequencies in the studied population, evolutionary sequence conservation-based predictions of the severity of mutations, estimated effects on protein structural stability, and vicinity to functional sites/ligands. Our approach can be applied to find novel nsSNPs that could be related to patient outcome for any drug pharmacology-related pathway of interest. |
Limviphuvadh V*, Konishi F, Ooi HS, Jenjaroenpun P, Xiang S, Maurer-Stroh S
*Bioinformatics Institute (BII), A*STAR Singapore |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 33
H33 |
The phosphorylation of proteins plays an important role in many biological processes. Correct assignment of the phosphorylation site is a key component of phosphoproteomic analysis by liquid chromatography-coupled tandem mass spectrometry.
Generally the modification site is assigned by finding the position along the peptide backbone where the fragment peaks are shifted by the mass of the modification. Assignment of the phosphorylation site is challenged by the lability of the phosphate group. During collision-induced dissociation the phosphate group can be lost by the neutral loss of HPO3 or H3PO4. Both of these neutral losses yield peaks that are indistinguishable from peaks that are expected for the unmodified fragment. Here we present an algorithm that takes into account the unusual fragmentation pattern of phosphopeptides to improve phosphate localization. The algorithm incorporates a Bayesian network and a Hidden Markov Model (HMM). The Bayesian network is used to classify each position along the peptide sequence as carrying a modification on the b ion, y ion or both ions. The HMM then uses these classifications to resolve the position of the phosphate modification. The algorithm was tested by re-analysing a corpus of doubly and triply charged MS/MS spectra. |
Horlacher O*, Nikitin F, Lisacek F, Müller M
*Swiss Institute of Bioinformatics Switzerland |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
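The mass arithmetic behind the neutral losses mentioned in abstract H33 can be worked through in a few lines. The sketch assumes monoisotopic element masses and a hypothetical unmodified fragment mass of 500.0 Da; it is not part of the authors' algorithm.

```python
# Monoisotopic masses in Da.
H, O, P = 1.00782503, 15.99491462, 30.97376200

HPO3 = H + P + 3 * O       # mass added by phosphorylation, about 79.966 Da
H3PO4 = 3 * H + P + 4 * O  # about 97.977 Da
H2O = 2 * H + O            # about 18.011 Da


def fragment_masses(unmodified_mass):
    """Neutral masses of a phosphorylated fragment before and after neutral loss."""
    phospho = unmodified_mass + HPO3
    return {
        "phosphorylated": phospho,
        "after_HPO3_loss": phospho - HPO3,    # same mass as the unmodified fragment
        "after_H3PO4_loss": phospho - H3PO4,  # unmodified fragment minus water
    }


masses = fragment_masses(500.0)  # hypothetical fragment mass
```

Losing HPO3 returns the mass of the unmodified fragment, and losing H3PO4 gives the mass of the unmodified fragment minus water, so neither product peak carries the +79.966 Da shift that would mark the modified position.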
H 34
H34 |
In recent years a large experimental effort has defined the set of PPIs for some model organisms. However, experimental protocols detect PPIs in environments that differ substantially from those found within the cell, detecting interactions that never occur in vivo, either because the cell never expresses them in the same compartment or because the interactions are out-competed by other interactors. Thus, interactome data informs of the total potential set of interactions but not the set of interactions present in the cell at a given time. In order to produce a more accurate model of the set of PPIs present in yeast under specific sets of conditions, we have combined protein interaction data with both co-localization data and large-scale quantitative proteomic data using a few biologically reasonable assumptions. Our analyses show that the use of integrative models produces a different view of the cell from that obtained using either the interactomic data or the proteomic data alone. Changes in the proteome are buffered through the co-localization of proteins, suggesting a quite robust interactome. PPIs associated with house-keeping functions, such as the ribosome and the proteasome, are amongst the most constant in our model, whereas systems associated with well-characterized dynamic properties, including several transcription factors, are amongst the most variable. Finally, our results rationalize the importance of controlling the expression of interaction “hubs”. In sum, the combination of proteomics and interactomics data provides a better description of the cellular outcome, making biological sense of differences and similarities found in particular experiments. |
Talavera D*, Robertson D, Lovell S
*University of Manchester United Kingdom |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 35
H35 |
Although examples have been reported in which alternative splicing (AS) affects the interaction between two proteins by removing interacting regions, only a few systematic studies have assessed how widespread this regulation is, and they show conflicting results. We aimed at a detailed investigation of how AS can modulate protein interactions, through a statistical analysis of the extent of this regulation and a breakdown of the differential expression of isoforms that do or do not encode the interaction interfaces, linking together protein structures, gene-interaction networks, and expression data. The AS-mediated removal of protein-protein interfaces occurs with a statistically significantly lower frequency than expected; the extent of this protection depends on the size of the interface. Nevertheless, a considerable fraction of human genes for which an interaction is known (24%) encode at least one isoform in which all interface residues are lost due to an AS event. We investigated whether the usage of different splicing isoforms can be regulated to allow or prevent an interaction by analyzing isoform expression levels in RNA-seq panels and drawing tissue-level interaction maps based on the expression of variants that encode or lose the interface. We found that in many cases, and in a tissue-specific fashion, even when two binding partner genes are expressed, the usage of splicing isoforms that do not encode the interface residues prevents the interaction. Our results indicate that AS is a powerful modulator of protein interactions, and that splicing isoform usage is finely tuned to allow or prevent specific interactions. |
Ferrè F*, Colantoni A, Bianchi V, Helmer Citterich M
*Tor Vergata University Italy |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 36
H36 |
The morphogen function of Hedgehog (Hh) proteins is crucial for normal development in all higher animals: Hh is excreted by specific cells, diffuses in multimeric form through extracellular space, and triggers differentiation in receptor cells. Multimerization has been shown to be important for normal Hh function. It is not surprising that mutations in Hh can lead to severe developmental defects.
A conspicuous feature of the Hh protein is its triple metal center with one zinc and two calcium ions. Relatively little is known about the role of the calcium ions, and hence we have studied the effect of the presence and absence of calcium on Hh using molecular modeling techniques. Specifically, we have carried out molecular dynamics simulations of Hh variants with mutations at the calcium binding site. Moreover, we have tried to estimate the effect of calcium ions on the stability of Hh multimers by electrostatics calculations. We find that the calcium ions in Hh stabilize Hh monomers and dimers, and in this way may increase its multimerization propensity. This is true in particular for mutations at the calcium binding site leading to the developmental defect brachydactyly type A1. We conclude that one possible role of calcium ions in Hh is the stabilization of multimers, and that diseases such as brachydactyly type A1 may at the molecular level be defects of Hh multimerization. |
Rebollido-Rios R*, Wilms C, Hoffmann D
*Center for Medical Biotechnology. University of Duisburg-Essen. Department of Bioinformatics Germany |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 37
H37 |
Interspecies protein-protein interactions (PPIs) are the cause of the cellular processes that occur during the infection and maintenance of a pathogen within its host. Computational and experimental approaches can be utilized to further our understanding of the interplay between a host and its pathogen at the molecular level. Here we utilize an approach that is based on a set of machine learning algorithms known as conformal predictors and extensive feature data in order to predict PPIs between Salmonella and its human host. The advantage of this method is that the size of the list of the most suspected interactions can be chosen according to a pre-selected significance level. Secondly, the conformal predictor approach is combined with the results of an interolog PPI prediction approach as well as experimental evidence for human proteins that may be involved in the interplay between Salmonella and human. By this procedure we were able to predict high-confidence Salmonella-human PPIs and comment on their possible role in Salmonella pathogenicity. |
Schleker S, Nouretdinov I, Garcia-Garcia J, Kshirsagar M*, Oliva B, Klein-Seetharaman J, Gammerman A
*School of Computer Science, Carnegie Mellon University United States of America |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 38
H38 |
Ribonucleoprotein (RNP) complexes perform crucial roles in diverse cellular processes including DNA replication, repair, RNA transcription and maturation, and protein synthesis. The ability to accurately determine the absolute quantity of proteins within such complexes provides useful information relating to complex stoichiometry and biogenesis. Here, mass spectrometry-based approaches are being used to characterize and obtain absolute quantitative information of proteins associated with RNPs. By virtue of parallel data acquisition, alternating scans of low-energy collision-induced dissociation (CID) and high-energy CID during liquid chromatography-mass spectrometry analyses reveal both protein identification and quantitation in a single experiment. The low-energy CID mode is used to obtain accurate mass measurements of precursor ions and intensity data for quantitation, while the high-energy CID mode generates peptide fragmentation of all precursor ions for database searching and subsequent protein identification. Because the information obtained in these experiments is related to the number of peptides identified per protein (within the complex), optimization of experimental conditions is necessary to provide an unbiased analysis of all low and high molecular weight components. The approach was developed using E. coli wild-type ribosomes and then applied to ribosome assembly particles that accumulate as a result of perturbation. Ribosomal proteins and extra-ribosomal protein factors associated with these particles were identified. In addition, absolute quantitative information of proteins provided insights on the degree of heterogeneity of these ribonucleoprotein particles. This MS-based platform enables the rapid characterization of RNA-protein complexes including ribosome assembly particles resulting from perturbations, stress conditions and deletion strains of proteins or assembly factors.
|
Dator R*, Limbach P
*University of Cincinnati United States of America |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 39
H39 |
Many protein functions can be described in terms of “linear sequence motifs” of fewer than five function-determining residues. The LxCxE motif interacts with the retinoblastoma tumor suppressor (Rb), which plays a key role in cell cycle progression. The LxCxE motif was identified in several proteins from RNA and DNA viruses, suggesting that it may be present in other viral proteins. We have developed a method to predict the affinity of a sequence stretch for the retinoblastoma protein using a combination of structure- and sequence-based calculations. Structure-based calculations used FoldX, which is an empirical force field for the prediction of the stability of proteins and protein complexes. We used the LxCxE-Rb complex structure to compute a first position-specific scoring matrix. Sequence-based calculations used molecular information theory, which makes use of residue statistics in an alignment of known binding motifs. We used over 200 sequences of LxCxE motifs from the papillomavirus E7 protein to compute a second position-specific scoring matrix. The combination of both matrices is able to reproduce quantitative and semi-quantitative binding experiments from the literature. Finally, we used the new algorithm to scan all known sequences from human viruses. We are able to find the known instances of the LxCxE motif and to give a list of novel putative LxCxE motifs. We discuss the list in the light of the structural and functional properties of the protein containing each motif. |
Glavina J*, Chemes LB, de Prat Gay G, Sánchez IE
*Protein Physiology Laboratory, Departamento de Química Biológica, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires. Argentina |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
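As an illustration of the scoring idea in abstract H39 above, the following is a minimal sketch of combining two position-specific scoring matrices and scanning a sequence with the result. The matrix values, the equal weighting, the zero default score and the example sequence are all hypothetical; the abstract does not state how the FoldX-derived and alignment-derived matrices are actually combined.

```python
from typing import Dict, List, Tuple

# A PSSM is a list with one entry per motif position; each entry maps a
# residue (one-letter code) to a score. Residues not listed score `default`.
Pssm = List[Dict[str, float]]


def combine_pssms(a: Pssm, b: Pssm, weight: float = 0.5) -> Pssm:
    """Weighted average of two PSSMs of equal width (assumed combination rule)."""
    if len(a) != len(b):
        raise ValueError("PSSMs must have the same number of positions")
    combined: Pssm = []
    for col_a, col_b in zip(a, b):
        residues = set(col_a) | set(col_b)
        combined.append(
            {r: weight * col_a.get(r, 0.0) + (1.0 - weight) * col_b.get(r, 0.0)
             for r in residues}
        )
    return combined


def scan(sequence: str, pssm: Pssm, default: float = 0.0) -> List[Tuple[int, str, float]]:
    """Score every window of the sequence; return hits sorted best first."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col.get(res, default) for col, res in zip(pssm, window))
        hits.append((start, window, score))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# Hypothetical toy matrices for the five positions L-x-C-x-E.
structure_pssm: Pssm = [{"L": 2.0}, {}, {"C": 2.0}, {}, {"E": 2.0}]
sequence_pssm: Pssm = [{"L": 1.0, "I": 0.5}, {}, {"C": 1.0}, {}, {"E": 1.0}]
combined_pssm = combine_pssms(structure_pssm, sequence_pssm)
```

Calling `scan("MDLYCYEQL", combined_pssm)` ranks the window `LYCYE` first, since it is the only window that matches all three fixed positions of the toy matrices.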
H 40
H40 |
Background
Isobaric tag for relative and absolute quantitation (iTRAQ) is widely used in quantitative proteomics. Robust statistical methodology is needed to distinguish proteins that are differentially expressed for biological reasons from those whose apparent changes are caused by experimental variation.
Methodology
Samples were created by mixing 5, 10, 15 and 20 µg of E. coli cell lysate with 100 µg of mouse lysate, corresponding to relative fold changes of 0.5, 1, 1.5 and 2. Proteins contained in the samples were identified using a TripleTOF 5600 mass spectrometer and relatively quantified using 8-channel iTRAQ (4 pairs of duplicates). Systematic statistical modelling was carried out on the dataset, resulting in a hierarchical model that estimates a confidence interval of technical variation for each individual peptide/protein. The quality of the model can be evaluated by the proportion of E. coli proteins that are identified as differentially expressed.
Results
123412 peptides were identified with a protein FDR cutoff of 0.05 applied, leading to 2808 mouse proteins and 838 E. coli proteins. The proposed model allows up to 459 E. coli proteins to be identified as differentially expressed, a significant improvement (16.2% to 342.9%) compared with conventional methodology. A maximum decrease of 45.9% in false positive discoveries was also observed.
Conclusion
The proposed model provides a novel solution for the analysis of iTRAQ proteomics data with a significantly improved true positive rate as well as a reduced false positive rate. |
Zhou C*, Walker M, Dive C, Whetton A
*Paterson Institute for Cancer Research, University of Manchester United Kingdom |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 41
H41 |
Prediction of protein function from protein-protein interaction (PPI) networks has received attention in the post-genomic era. A popular strategy has been to cluster the network into functionally coherent groups of proteins and assign the entire cluster a function based on the functions of its annotated members. Traditionally, network research has focused on clustering of nodes. However, why favor nodes over edges, when clustering of edges may be preferred? For example, nodes belong to multiple functional groups, but clustering of nodes typically cannot capture the group overlap, while clustering of edges can. Clustering of adjacent edges that share many neighbors was proposed recently, outperforming different node clustering methods. However, since some biological processes can have characteristic "signatures" throughout the network, not just locally, it may be of interest to consider edges that are not necessarily adjacent. Hence, we design a sensitive measure of the "topological similarity" of edges to quantify the resemblance of their extended (up to 4-deep) network neighborhoods, which can deal with edges that are not necessarily adjacent. We cluster edges that are similar according to our measure in different baker's yeast PPI networks, outperforming existing node and edge clustering approaches. We apply our approach to the human PPI network to predict new pathogen-interacting proteins. This is important, since these proteins represent drug target candidates. We validate 44% of our predictions in the literature. |
Solava R*, Milenkovic T
*University of Notre Dame United States of America |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 42
H42 |
Multiple sclerosis is one of the most common diseases of the central nervous system. Previous reports showed that inflammation may be a consequence of a preceding autoimmune disease and that treating patients with beta interferon may exacerbate damage. A possible physiopathological mechanism may rely on the increased activity of potassium channels, so blocking them is likely to be an important therapeutic strategy. In this regard, previous studies reported that sea anemone neurotoxins are able to efficiently block the potassium channel Kv1.3, which is responsible for the proliferation of T lymphocytes. The aim of this study was to identify and better characterize the key residues involved in the interaction between the Kv1.3 channel and the different sea anemone neurotoxins by molecular docking. Our results suggest that the neurotoxin-Kv1.3 binding interaction may involve the key residues ASP345, GLY344, TYR343, GLY248, TYR247 and TYR55, pointing to a possible regulation of different biological functions driven by these amino acids. This knowledge may be useful for understanding this interaction complex, which in turn will serve as an input for the design of more promising drugs for this disease. |
Sabogal A*, Barreto V, González JC, Ramirez DM, Barreto GE, Gonzalez J, Acevedo O, Morales L
*Pontificia Universidad Javeriana Colombia |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 44
H44 |
Complete proteome discovery projects provide a wealth of unique information, including the true expression level of proteins, post-translational modifications, and ORFs missed in the genome annotation. However, in contrast to well over 3000 completed genome sequences, no complete proteome has been described so far[1].
We have used Bartonella henselae, a model for bacterial-induced tumor growth and host-pathogen interaction, to explore the extent of its expressed proteome, with a particular focus on its membrane proteome. RNA-Seq data identified which genes are actively expressed by bacteria grown under two conditions that mimic the pH-dependent induction of virulence genes in the mammalian host mediated by the BatR/BatS two-component regulatory system. Cytoplasmic, total membrane, inner and outer membrane protein fractions were generated from the same samples, extensively sub-fractionated and analyzed by bottom-up and top-down proteomics on high accuracy mass spectrometers. Using RNA-Seq data as a sensitive endpoint estimate, directed shotgun proteomics guided by our analysis-driven experimentation (ADE) feedback loop[2] was applied to the sub-cellular protein fractions, targeting in particular membrane proteins, short and basic proteins. We identified roughly 85% of the proteins whose genes were expressed, including a similar coverage of membrane proteins, in particular all members of the VirB/D4 type IV secretion system. We will present insights based on strand-specific RNA-Seq, as well as results from a novel, generic proteogenomics approach that identified several previously unannotated ORFs.
References: 1. Ahrens CH et al., Nat Rev Mol Cell Biol., 11:789-801 (2010). 2. Brunner E, Ahrens CH, Mohanty et al., Nat Biotechnol., 25:576-583 (2007). |
Omasits U, Quebatte M, Stekhoven D, Roschitzki B, Fortes C, Wu S, Pasa-Tolic L, Dehio C, Ahrens C*
*IMLS, University of Zurich Switzerland |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 46
H46 |
Endometrial cancer, the most common pelvic gynecological malignancy, is a malignant growth of the endometrium, the lining of the uterus. Although the majority of patients are treated at an early stage, about one fourth of patients in Norway suffer from distant metastases, which are largely incurable and lead to death. Currently, most studies of comprehensive profiling of malignant tissues are based on primary lesions. However, the metastatic lesions may be more relevant for defining markers or targets for new therapeutics for systemic disease. Also, there are still only a small number of studies on metastasis in general and in endometrial cancer in particular. In this study, we propose a network-based approach for identifying metastasis submodules in endometrial cancer. The submodules are built from genes that show distinct changes between the transcriptional profiles of 122 primary tumors and 19 metastatic lesions, based on microarray data. First, we retrieve a protein-protein interaction network from an available database and extract subnetworks. In order to identify subnetworks, we will consider (i) the network topology, by searching for hubs and their neighbor nodes, and/or (ii) correlations of gene expression within each sample group and their overlapping subnetworks. Secondly, we evaluate the identified subnetworks by applying gene set enrichment analysis (GSEA) to indicate the discriminatory power of the gene expression in an input subnetwork between primary tumors and metastases. We will investigate the effects of different gene ranking methods used together with GSEA, including an ensemble approach that combines various ranking lists for consensus results. The candidate subnetworks with high enrichment scores will be provided as the end result, possibly with relationships between subnetworks.
Instead of identifying individual genes with discriminatory patterns between sample groups, as is traditionally done, we consider the expression patterns of sets of genes with validated physical interactions. This facilitates an easier functional interpretation of the end results, because biological processes are known to be driven by functional modules rather than individual genes or proteins, and because functional annotations of the interactions are available in the originating protein-protein interaction database. |
Kusonmano K*, Wik E, Salvesen H, Petersen K
*Computational Biology Unit, Uni Computing, Uni Research AS; Department of Obstetrics and Gynecology, Haukeland University Hospital; Department of Clinical Medicine, University of Bergen Norway |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 47
H47 |
Protein-protein interactions, collectively known as the interactome, drive most biological processes. The area of specific interaction between proteins to form a protein complex is known as the interface. Identifying interface residues involved in protein-protein interactions assists with the functional characterization of protein complexes. Here we present PROTIN_ID (PROTein INterface IDentification), a tool that rapidly identifies putative interface residue clusters by utilizing structural data and sequence conservation of surface patches. PROTIN_ID takes experimentally determined protein structures or homology models as input. For these it collates and aligns similar protein sequences, calculates conservation scores from the multiple sequence alignments, maps conservation scores onto surface residues and applies spatial clustering to generate surface clusters. Surface clusters are ranked based on sequence conservation and size. PROTIN_ID can be used to identify and visualise conserved surface patches for protein structures or homology models. The main usage, however, is to generate interface residue restraints for use as ambiguous restraints in data-driven protein-protein docking such as with the HADDOCK method (Dominguez et al., 2003). PROTIN_ID is available as a stand-alone tool and it has also been implemented as a web server. |
Alameer A*, Schmid R
*University of Leicester United Kingdom |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 48
H48 |
Molecular docking represents a versatile and important computational method for determining the structure of protein-protein complexes. Despite considerable efforts, a general solution to this problem is not yet within reach. One major challenge is the definition of suitable criteria for a scoring function that allows the identification of a good docking solution among many false arrangements.
Our previous work has demonstrated that concepts from information theory can be adapted to treat the biological problem of protein-protein docking: a formalism has been developed, based on the concept of mutual information (MI), to investigate several structural features of protein-protein docking solutions for their information content (Othersen et al. J. Mol. Model., 2012, 18(4): 1285-1297). We have also shown that the MI values can successfully be converted into a scoring function. However, these first “proofs of concept” also highlighted aspects that had to be improved to obtain a robust and widely applicable approach. We present here an extended MI-based approach that relies on a larger dataset and allows a more flexible treatment of the structural features in the scoring function. The new training dataset consists of carefully chosen docking solutions generated with a reparametrized version of the docking program FTDock. We also investigate the role of amino acid diversity by comparing the information content of the structural features when using different hierarchies of amino acid alphabets. A further improvement is the detection of redundancies between different features and the development of a suitable formalism for the estimation of the MI. |
Jardin C*, Stefani A, Othersen O, Johannes H, Sticht H
*Institute for Biochemistry Germany |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
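To make the mutual information (MI) concept in abstract H48 above concrete, here is a minimal sketch of the standard plug-in MI estimate between a discretized structural feature and a label marking docking solutions as near-native or decoy. The feature and label names are hypothetical, and this generic estimator is not the formalism of Othersen et al., which the abstract does not spell out.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def mutual_information(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """Plug-in estimate of I(X;Y) in bits from paired observations."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of (x, y) pairs
    count_x = Counter(xs)          # marginal counts of x
    count_y = Counter(ys)          # marginal counts of y
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


# Hypothetical example: a binary interface feature versus solution quality.
feature = ["contact", "contact", "no_contact", "no_contact"]
quality = ["near_native", "near_native", "decoy", "decoy"]
informative_mi = mutual_information(feature, quality)
```

In this toy case the feature determines the label exactly and both classes are balanced, so the estimate is one bit; a feature that is independent of the label gives zero.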
H 49
H49 |
The enrichment and interpretation of proteomic data are critical steps in understanding the broader context of the regulation of biological processes under different conditions or in different phenotypes. Currently, most of the enrichment approaches are adapted to use microarray results and require parametric data. However, proteomic data are often non-parametric.
In this paper, we demonstrate a suite of interactive tools that can enrich proteomic results with a graphical overview. This facilitates diagnosis and interpretation prior to further analysis. From a list of protein expressions, a network is constructed using a map of the most disrupted biological process and the disease entity is then identified on the basis of clinical data. |
Mezhoud K*
*National Center for Nuclear Sciences and Technologies Tunisia |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 50
H50 |
Aggressive forms of cancer are associated with elevated levels of reactive oxygen species in the cell. One consequence is the formation of covalent adducts between DNA and its associated proteins, a lesion referred to as DNA-protein cross-links (DPCs). DPCs readily form in biologically relevant oxidation systems, yet despite their apparent biological significance, they have received less attention and are one of the least understood types of DNA damage. Our goal is to investigate DPCs by identifying the participating proteins, the sites of cross-linking and the structures of the adducts involved. Because cross-links are inherently complex, we initially identified adducts formed using model systems, from small molecule models consisting of pentapeptides and guanine, to larger systems involving Ribonuclease A and a 27-nucleotide DNA. Structural characterization of these cross-links employed ICP, accurate mass, and tandem MS techniques. Data processing tools including Mascot and ProteinLynx Global Server (PLGS), as well as Excel macros that automatically calculate different m/z ions, were utilized to interpret the complex mass spectral data obtained from LC-MS/MS analysis of these cross-links. Our findings indicate that oxidative cross-links predominantly occur at guanine, with proteins that bind to DNA more likely to participate. Structural analysis shows cross-linking between the amino acids lysine and tyrosine, and products of guanine oxidation: guanidinohydantoin, spiroiminodihydantoin and an aromatic addition product from guanine hyperoxidation. We conclude that a significant level of oxidative cross-linking occurs between guanine and nucleophilic amino acids of a protein. Future work will focus on identifying relevant proteins in a human cell culture model system. |
Solivio MJ*, Catron B, Sallans L, Caruso J, Limbach P, Merino E
*University of Cincinnati United States of America |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 51
H51 |
Heat shock factor 1 (HSF-1) is the major regulator of heat shock protein transcription in eukaryotes. For survival and proliferation, cancer cells require constant repair of heat shock proteins (HSPs). Recent studies showed that inhibition of HSF-1, the key regulator of the stress-activated transcription of HSPs, can reduce the tumorigenic potential of cancer cells. Such a “non-oncogene addiction” phenomenon makes HSF-1 an attractive cancer drug target.
Using virtual screening and molecular docking with the Lamarckian Genetic Algorithm, we attempted to elucidate the extent of specificity of HSF-1 towards different classes of phenothiazines (anti-cancer agents). 3000 molecules, selected on the basis of structural similarity and substructure search of phenothiazine, were studied for binding to HSF-1 (PDB ID 2LDU). Molecular docking of the 3000 molecules yielded binding energies in the range of -12.03 kcal/mol to -2.17 kcal/mol, with a minimum binding energy of -12.03 kcal/mol. The molecule K21 showed a drug-likeness score of -0.46, with a Mol PSA of 11.66 Å² and a MolVol of 205.34 Å³, while molecule K1755 showed a drug-likeness score of -1.03, with a Mol PSA of 33.76 Å² and a MolVol of 367.35 Å³. Both molecules showed H-bond interactions with the binding sites of HSF-1. Further in vitro and in vivo studies of these molecules are required, as the binding modes provide hints for the future design of new HSF-1 inhibitors with higher potency and specificity. |
Karthik G, Ramanathan K*, Maniyodath N, Pyandapal K, Sridhar A, Sairam A, Naqvi A
*St. Joseph’s College of Engineering, Anna University India |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 52
H52 |
The accurate detection and localisation of peptide and protein modifications is central to deciphering biological function and regulation. Proteomics mass spectrometry data is a rich source of information on protein modification. However, tools to identify, localise, and score unexpected modifications are isolated. To address this we have taken a range of different modification detection methods, encompassing blind or error-tolerant searching as well as spectral clustering methods, and combined them in an informatics pipeline. This extensive analysis of possible modifications allows the selection of parameters for a subsequent search to re-score the peptide spectrum matches in a statistical framework using Mascot Percolator. We have also extended the SloMo localisation tool to assess and score the localisation of all modifications. The resulting identifications are mapped into a peptide network showing unmodified, modified and multiply modified peptides. The results can be displayed in a protein-centric, proteome-centric or modification-centric manner.
We have applied ModX to a range of datasets: (i) human lens protein data published by Wilmarth et al., 2006, containing an extensive range of PTMs, (ii) data generated on a recombinant tagged mouse Arid1a protein, and (iii) an E. coli whole cell lysate. This type of analysis is useful for the discovery of modifications and also for the identification of peptides that may mislead quantification because they occur in differently modified forms. We have developed ModX as an open source tool to help combine the outputs from multiple PTM detection tools and to allow validation of detected modifications. |
Wright J*, Choudhary J
*Wellcome Trust Sanger Institute United Kingdom |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 53
H53 |
Analysis and interpretation of biological networks is one of the primary goals of systems biology. In this context identification of sub-networks connecting sets of seed proteins or seed genes plays a crucial role. Whereas finding an arbitrary sub-graph is trivial, retrieval of a minimum size one leads to the classical Steiner tree problem, which is NP-complete. Many approximate solutions have been published and theoretically analyzed in the computer science literature, but far less is known about their practical performance in the bioinformatics field.
Here we conducted a systematic simulation study of four different approximate algorithms and one exact algorithm on a large human protein-protein interaction network with ~14,000 nodes and ~400,000 edges. Moreover, we devised our own algorithm to retrieve a sub-graph comprising several equal-size Steiner trees. The application of our algorithms was demonstrated for two breast cancer signatures and a sub-network playing a role in male pattern baldness. We found that a modified version of the shortest-paths-based approximation algorithm by Takahashi and Matsuyama leads to highly accurate solutions while being several orders of magnitude faster than the exact approach. Our algorithm for merged Steiner trees of equal size, which is a further development of the Takahashi and Matsuyama algorithm, proved to be particularly useful for small seed lists. All our implemented methods are available in the R package SteinerNet on CRAN (www.r-project.org). |
Sadeghi A, Fröhlich H*
*University of Bonn Germany |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
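The shortest-path heuristic mentioned in abstract H53 above can be sketched as follows for an unweighted network: start from one seed node and repeatedly attach the seed closest to the current tree along a shortest path. This is a minimal illustration of the basic Takahashi and Matsuyama idea only. It is not the authors' modified version, it does not reproduce the SteinerNet API, and the function names and toy network are hypothetical.

```python
from collections import deque
from typing import Dict, FrozenSet, Hashable, Iterable, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]


def build_graph(edges: Iterable[Tuple[Hashable, Hashable]]) -> Graph:
    """Undirected adjacency sets from an edge list."""
    graph: Graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def shortest_path_steiner(
    graph: Graph, seeds: Iterable[Hashable]
) -> Tuple[Set[Hashable], Set[FrozenSet[Hashable]]]:
    """Approximate Steiner tree: grow a tree by attaching the nearest seed."""
    seed_list = list(seeds)
    if not seed_list:
        raise ValueError("at least one seed node is required")
    tree_nodes: Set[Hashable] = {seed_list[0]}
    tree_edges: Set[FrozenSet[Hashable]] = set()
    remaining = set(seed_list) - tree_nodes
    while remaining:
        # Multi-source breadth-first search starting from every tree node.
        parent = {node: None for node in tree_nodes}
        queue = deque(tree_nodes)
        found = None
        while queue and found is None:
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in parent:
                    parent[neighbor] = node
                    if neighbor in remaining:
                        found = neighbor  # nearest remaining seed
                        break
                    queue.append(neighbor)
        if found is None:
            raise ValueError("seed nodes are not all connected")
        # Walk back to the tree, adding the path's nodes and edges.
        node = found
        while parent[node] is not None:
            tree_edges.add(frozenset((node, parent[node])))
            tree_nodes.add(node)
            node = parent[node]
        remaining.discard(found)
    return tree_nodes, tree_edges


# Hypothetical toy network: a short route A-B-C, a branch B-D and a detour.
toy = build_graph([("A", "B"), ("B", "C"), ("B", "D"),
                   ("A", "E"), ("E", "F"), ("F", "G"), ("G", "C")])
nodes, edges = shortest_path_steiner(toy, ["A", "C", "D"])
```

On the toy network the three seeds are connected through the single intermediate node B, and the detour through E, F and G is not used.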
H 54
H54 |
Motivation: Knowledge of the sub-cellular localization of a protein can help in elucidating its function, as a protein’s localization can provide hints about its functional role in a cell. Despite advances in high-throughput imaging, localization maps remain substantially incomplete. Several methods have been developed that accurately predict localization, yet many challenges remain to be tackled.
Results: Here, we introduce a new framework, LocTree2, for the prediction of localization in all three domains of life, spanning water-soluble globular and membrane proteins. It predicts three localization classes for Archaea, six classes for Bacteria, and eighteen classes for Eukaryota. LocTree2 uses a hierarchical system of Support Vector Machines (SVMs) implemented to imitate the cascading mechanism of cellular sorting. The classification is made using sequence information only. The method reaches high levels of sustained performance for both Eukaryota (Q18=65%; Q18 is eighteen-state accuracy for classifying proteins to eighteen localization classes) and Bacteria (Q6=84%; Q6 is six-state accuracy). Our method also accurately distinguishes between membrane and non-membrane proteins. LocTree2 works well even for protein fragments; these may result from erroneous assemblies or wrong gene predictions that are common in genome projects. In our hands, LocTree2 compared favorably to other state-of-the-art methods when tested on new data.
Availability: Online via the PredictProtein service (predictprotein.org); as standalone version at http://www.rostlab.org/services/loctree2. |
Goldberg T*, Hamp T, Rost B
*Technical University Munich Germany |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
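The Q18 and Q6 figures quoted in abstract H54 above are multi-state accuracies: the percentage of proteins assigned to the correct class out of all proteins tested. A minimal sketch of that calculation follows; the class labels in the example are hypothetical and unrelated to the LocTree2 test sets.

```python
from typing import Hashable, Sequence


def q_state_accuracy(observed: Sequence[Hashable], predicted: Sequence[Hashable]) -> float:
    """Percentage of proteins whose predicted class equals the observed class."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(1 for obs, pred in zip(observed, predicted) if obs == pred)
    return 100.0 * correct / len(observed)


# Hypothetical four-protein example with three localization classes.
observed_classes = ["nucleus", "cytoplasm", "nucleus", "membrane"]
predicted_classes = ["nucleus", "nucleus", "nucleus", "membrane"]
accuracy = q_state_accuracy(observed_classes, predicted_classes)
```

Three of the four toy predictions are correct, so the accuracy here is 75%.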
H 55
H55 |
Motivation: Microarray data (MA) and protein-protein interaction networks (PPI) are commonly used to analyze the interactions among genes and proteins, on the assumption that proteins whose associated genes show similar expression patterns in experiments are involved in the same task. Updated PPI networks represent the relationships among groups of proteins and their functions in specific organisms. We discuss a method for integrating the MA and PPI network data, and present some results of its application to Arabidopsis thaliana.
Results: Microarray and PPI data were clustered separately; the clusters were then merged into a graph, in which each clique represents a group of functionally related proteins. The relation between a gene/protein and a pathway was evaluated by a score function and used as a predictor of the functional role of a protein. Blind tests show that the method is able to predict the pathway of a protein with a good level of accuracy when data from PPI and MA are integrated. |
Santoni D, Swiercz A, Zmienko A, Kasprzak M, Blazewicz M*, Bertolazzi P, Felici G
*Institute of Computing Science, Poznan University of Technology; Poznan Supercomputing and Networking Center Poland |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 56
H56 |
The number of known protein-protein interactions (PPIs) grows rapidly, yet their molecular details remain largely unknown. Over the last years, structural biologists have addressed this issue with an increased output of structurally resolved hetero complexes. This wealth now enables statistically significant quantitative statements about interface properties. Here, we addressed the question of how interfaces differ when the same protein-protein interaction is observed twice. A new high-quality, non-redundant dataset derived from the entire PDB was analyzed employing different definitions of the “same interaction” and a range of interface similarity measures. The hypothesis was that the interface between the same pair of proteins remains the same irrespective of how often it is measured. Although the results mostly confirm this hypothesis, the surprising finding was how often it was not true: for many comparisons of interfaces, the molecular details of the interaction differed substantially, often without the slightest change of amino acids. In addition, no matter how many “special cases” were sieved out, the essential message remained: interfaces appear immensely plastic. Hand-selected sample structures largely support this view and fall into diverse categories such as conformation-specific binding modes or flexible interfaces that overcome steric constraints in higher order complexes. Our results should be of significant interest to the field of protein docking as we show that there are often multiple solutions to the same problem. Together with the publicly available dataset, we also expect to facilitate and improve various PPI-based prediction methods and interaction databases. |
Hamp T*, Rost B
*TU Muenchen Germany |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 57
H57 |
In various large-scale bioinformatics analyses of whole proteomes, the proper selection of physicochemical and biochemical features of amino acids is crucial for efficient and error-free encoding of short functional sequence motifs in a way that represents their biological significance. In most cases, researchers perform exhaustive manual selection of the most informative amino acid indices. Hence, we have analyzed the present version of the AAindex database of 544 indices using consensus fuzzy clustering, in which recently proposed fuzzy clustering techniques are exploited. After analyzing the AAindex database, we found two new clusters and also resolved the problem of unknown clusters. We then proposed three different sets of High Quality Indices (HQI) to avoid such a heuristic process of searching for indices. The current sets of indices are then used to update the AutoMotif Service (AMS) for predicting a wide selection of 88 different types of single amino acid post-translational modifications (PTMs) in protein sequences. The selection of experimentally confirmed modifications is acquired from the latest UniProt and Phospho.ELM databases for training. For each set of HQI, the method builds an ensemble of Multi-Layer Perceptron (MLP) pattern classifiers, each optimizing a different objective during training (for example recall, precision or area under the ROC curve (AUC)). The consensus is built using brainstorming technology, which combines multi-objective instances of a machine learning algorithm with data fusion of different representations of the training objects, in order to boost the overall prediction accuracy for conserved short sequence motifs. The performance of AMS 4.0 is compared with the accuracy of previous versions, which were constructed using single machine learning methods (artificial neural networks, support vector machines).
Our software improves the average AUC score of the earlier version by close to 7%, as calculated on the test datasets of all 88 PTM types. Moreover, for the selected most difficult sequence motif types it improves prediction performance by almost 32% compared with the previously used single machine learning methods. |
Plewczynski D*, Saha I, Maulik U, Basu S
*ICM, University of Warsaw Poland |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 58
H58 |
Sequence comparison has been widely exploited to derive phylogenetic relationships based on genomic conservation. Analogously, network comparison aims to gain complementary insight into biological function and evolution. Graphlet-based methods, focusing on local network topology, are proving useful in this regard. Recently some doubt has arisen concerning the applicability of graphlet-based measures to low edge density networks. In particular, it was argued that the methods were “unstable” under these conditions. We demonstrate that it is the model networks themselves that are “unstable” at low edge density and that graphlet-based measures correctly reflect this instability. Furthermore, while model network topology may be unstable at low edge density, biological network topology is stable. In particular, one must distinguish between average density and local density. While model networks of low average edge density also yield low local edge density, this is not the case with protein-protein interaction (PPI) networks: real PPI networks show low average edge density, but high local edge density. Hence, local structure and thus graphlet-based measures are stable on PPI networks. Finally, we demonstrate that PPI networks of many species are well-fit by several models not previously tested, countering a recent suggestion that no existing network model is able to match the structure found in real biological networks. In addition, we model several viral PPI networks for the first time and demonstrate an exceptional fit between the data and theoretical models. |
Hayes W, Sun K*, Przulj N
*Department of Computing, Imperial College London United Kingdom |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
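The distinction drawn in abstract H58 above between average and local edge density can be illustrated numerically. The sketch below uses the mean local clustering coefficient as a stand-in for local edge density; that choice is an assumption made for illustration, since the abstract does not define the measure, and the toy network of three disjoint triangles is hypothetical.

```python
from itertools import combinations
from typing import Dict, Hashable, Iterable, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]


def build_graph(edges: Iterable[Tuple[Hashable, Hashable]]) -> Graph:
    """Undirected adjacency sets from an edge list."""
    graph: Graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def average_edge_density(graph: Graph) -> float:
    """Fraction of all possible node pairs that are connected by an edge."""
    n = len(graph)
    if n < 2:
        return 0.0
    m = sum(len(neighbors) for neighbors in graph.values()) // 2
    return 2.0 * m / (n * (n - 1))


def mean_local_clustering(graph: Graph) -> float:
    """Mean over nodes of the edge density among each node's neighbors."""
    if not graph:
        return 0.0
    total = 0.0
    for neighbors in graph.values():
        k = len(neighbors)
        if k < 2:
            continue  # nodes with fewer than two neighbors contribute zero
        links = sum(1 for u, v in combinations(neighbors, 2) if v in graph[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(graph)


# Hypothetical sparse network made of three disjoint triangles.
triangles = build_graph([(1, 2), (2, 3), (1, 3),
                         (4, 5), (5, 6), (4, 6),
                         (7, 8), (8, 9), (7, 9)])
global_density = average_edge_density(triangles)
local_density = mean_local_clustering(triangles)
```

For this toy network only a quarter of all possible node pairs are connected, yet every node's neighborhood is fully connected, which is the kind of contrast the abstract describes for PPI networks.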
H 59
H59 |
In recent years an ever-increasing amount of experimental evidence supporting the existence of physical protein-protein interactions has become available in the public domain. These data represent a potentially useful source of information in determining the interaction topology of a given protein complex where the individual protein components are known, but the atomic structure remains unsolved.
To assess the ability of this set of data to distinguish by homology between direct physical pairwise associations and indirect interactions within complexes, experimental data from multiple sources were combined and analysed using existing gold standard protein-protein interaction datasets from Wang et al. (2012) and Smialowski et al. (2009) together with our own PDB complex derived set of direct and indirect interactions. Similar experiment types were grouped by referencing the PSIMI ontology and the performance of these groups assessed over varied sequence identity cut-offs for detection of homologous interactions. A simple scoring mechanism was used to derive ROC curves for individual experiment groups. The results suggest that better discrimination can be achieved by grouping experiment types according to their ability to distinguish direct from non-direct interactions over using individual experiment types or all available evidence combined. Using a selected grouping of experiment types and an appropriate sequence homology cut-off, we achieved coverage in the range of 67% - 69% for the positive and 5% - 42% for the negative gold standards, with corresponding AUCs ranging from 0.64 - 0.86. |
Corrigan S*, Topf M, Shepherd A
*Birkbeck, University of London United Kingdom |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
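
The ROC analysis mentioned in abstract H59 can be illustrated with a minimal sketch. The abstract does not specify its scoring mechanism, so the scores and labels below are made-up illustrative values (1 = direct interaction, 0 = indirect), and `roc_auc` is a hypothetical helper name. It computes the AUC as the probability that a randomly chosen positive outscores a randomly chosen negative.

```python
def roc_auc(scores, labels):
    """AUC via pairwise comparison; tied scores count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Illustrative values only (assumption): one score per candidate interaction.
scores = [0.9, 0.8, 0.4, 0.35, 0.1]
labels = [1, 1, 0, 1, 0]
auc = roc_auc(scores, labels)
print(round(auc, 3))
```

The pairwise form is quadratic in the number of examples, which is acceptable for a small gold-standard set; a rank-based formulation would scale better.
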
H 60
H60 |
Background: Over the last five years proteogenomics (genome-based proteomics) has emerged as a promising approach to verifying and correcting genome annotation, particularly in identifying protein N-termini in a high-throughput manner. There is a need to establish its potential and limitations using highly characterized and well-annotated model organisms.
Results: We present here the analysis of signal peptides discovered with the help of proteogenomics in Escherichia coli K12, the classical bacterial model organism. We demonstrate that after appropriate filtering 94% of N-terminal cleaved sequences represent signal peptides. In a single run proteogenomics recovered one third of all signal peptides that have been experimentally confirmed during the past three decades. Although E. coli is the most extensively studied bacterium, proteogenomics reported at least thirty signal peptides not known before. Based on a comparison of proteogenomics data with the wealth of experimentally verified and predicted signal peptides, we argue here that the percentage of proteins containing signal peptides in gram-negative bacteria is in the order of 10%, much lower than reported before. Conclusions: Proteogenomics emerges as a major approach to reveal novel signal peptides, especially for those organisms that have not yet been explored experimentally. Our results suggest that in future proteogenomics investigations, after appropriate filtering, one should expect >90% of N-terminal cleaved sequences in gram-negative bacteria to be signal peptides. Proteins having signal peptides comprise ~10% of the proteome in gram-negative bacteria. |
Ivankov D*, Payne S, Galperin M, Bonissoni S, Pevzner P, Frishman D
*Technical University of Munich Germany |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 61
H61 |
The clinical picture of several syndromes is characterized by a very diverse set of clinical manifestations, and the molecular mechanisms underlying this phenomenon are not well understood. Identification of the molecular mechanisms responsible for the co-presence of symptoms would enable both the targeted search for molecular biomarkers and the understanding of the differences in clinical picture between patients with the same syndrome. Here we present an example of using network biology to elucidate the molecular mechanisms underlying the co-presence of diverse symptoms in Noonan syndrome (NS). Noonan syndrome is an autosomal dominant disorder characterized by a diverse set of clinical symptoms including short stature, facial anomalies, webbed neck, sternal deformity, cardiomyopathy due to heart defects and, in males, cryptorchidism (CO). This study addresses the question of the relation between CO and other symptoms in NS. Pathway analysis of the 184 genes associated with CO revealed that eight pathways are associated with the CO candidate gene list. The analysis also revealed the association of cryptorchidism and cardiomyopathy in NS. Additionally, a network-based prediction of candidate genes was performed by identifying the most significant first neighbors in the human protein-protein interaction network (PPIN). A connection between four pathways, “muscle contraction”, “cardiomyopathy”, “focal adhesion” and “RAS signaling”, has been identified through the ACTB-RAC1 protein-protein interaction. The proposed network-based approach elucidated molecular mechanisms underlying the co-presence of CO and cardiomyopathy in NS, and is promising for the detection of diagnostic, prognostic and therapeutic markers. It also has potential for the molecular explanation of the co-presence of diverse clinical manifestations in other syndromes. |
Kunej T*, Cannistraci CV
*University of Ljubljana Slovenia |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 62
H62 |
Phosphorylation governs many cellular processes and is of paramount importance for the proper functioning of a cell. Many studies strive to identify phosphorylation sites by means of mass spectrometry based proteomics yielding thousands of such potential sites which then become available for further analysis by bioinformatics predictors. However, the relationship between phosphorylation, sequence and structural features of proteins such as disorder, accessibility and chain flexibility is still not clearly understood making phosphosite annotation sketchy. Moreover, studies on conservation of phosphosites have been limited only to a few species.
In our study, we explore these relationships in yeast for 18 selected proteins obtained from large-scale yeast MS/MS study. The protein sequences were subjected to multiple sequence alignment and the conservation of phosphosites was investigated. Further, in an attempt to test the hypotheses that phosphosites are often found in disordered regions of proteins and possess higher surface accessibility, over 6 publicly available disorder predictors (PrDOS, IUPred etc) and accessible surface area calculators (Naccess, MSMS, GetArea etc) were employed to quantify these relationships in around 90 phosphosites. Our results indicate partial support for the above hypotheses along with the findings that phosphosites are relatively more conserved than the corresponding non-phosphorylated residues. But disagreement between the various predictors prompted further evaluation of the various bioinformatics predictors used. Along with elucidating the phosphosite-protein features association, we also present which bioinformatics predictors when applied in combination with structural and sequence conservation information are useful to annotate phosphosites and contribute to more robust results. |
Darbha J*, Mueller M, Boeckmann B, Lisacek F
*SIB Swiss Institute of Bioinformatics Switzerland |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 63
H63 |
Protein–protein interactions are crucial in most biological processes. Many protein families contain sub-families that interact with different proteins. Detection of specificity residues is often used to indicate functional residues. If the functional difference is based on protein-protein interactions (PPI), specificity sites can actually correspond to the interface region.
Four kinds of information are involved in the analysis: (i) homodimeric and monomeric proteins from PDBe PISA; (ii) Clusters of Orthologous Groups of homodimers and monomers from COG; (iii) homologous relations between orthologous groups of homodimers and monomers obtained using HHsearch; (iv) structure information of homodimeric proteins from PDBe PISA. We have applied the Sequence Harmony (SH) method for sub-family specific site detection to detect specificity sites that determine the interaction or non-interaction between protein families. Our first results show that, in most cases, the SH signal can distinguish between interface and other surface residues rather well. The distinction between interface and buried (core) residues is much weaker; however, this can be alleviated by adding accessibility prediction to our analysis. This opens up the possibility of using SH to predict interface regions from sequence information alone. |
Hou Q*, Feenstra KA, Heringa J
*IBIVU Centre for Integrative Bioinformatics,VU University Amsterdam Netherlands |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
H 64
H64 |
Proteomics provides additional insights into biological systems that cannot be provided by genomic or transcriptomic approaches (Aebersold and Mann, 2003). In particular, proteomics holds great promise for the identification of biomarkers capable of accurately predicting disease already at a very early stage. For this, accurate identification and quantification of proteins is required. The presented protein quantification approach relies on experimentally identified and quantified peptides.
While several probabilistic models exist for the identification of proteins, (label-free) quantification is often done in a deterministic way. We propose a statistical approach to protein quantification with five main advantages. (i) Peptide intensities are modeled as random quantities, which makes it possible to account for the uncertainty of these measurements. (ii) Our Markovian-type model for bipartite graphs ensures transparent propagation of the uncertainties and reproducible results. (iii) The problem of peptides mapping to several protein sequences (often neglected in other models) is addressed automatically according to our statistical model. (iv) The model can handle various types of input data (i.e. shotgun, directed MS or targeted MS; labeled or unlabeled). (v) The model can be used to reassess the measurements at the peptide level. The performance of our model is shown on several datasets. The software is available as an R package. References: [1] Aebersold and Mann, (2003), Mass spectrometry-based proteomics. Nature 422(6928):198-207. |
Gerster S*, Ludwig C, Matondo M, Aebersold R, Bühlmann P
*SIB Swiss Institute of Bioinformatics - BCF Switzerland |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
I 03
I3 |
Genome-scale metabolic networks allow for predicting the effect of alterations in metabolism for biotechnological applications. Existing studies rely on simulating the knockout of reactions or their transformation from other organisms, but do not account for the possibility of including previously uncharacterized enzymes, together with the reactions they catalyze. However, even in well-studied organisms, the function of many proteins remains unknown.
A new method for mass-balanced randomization of metabolic networks allows estimating the significance of network properties with respect to evolutionary pressure. A novelty of the method is that it allows generating large sets of biochemically feasible reactions satisfying atomic mass-balance and thermodynamic constraints. Thus, an intriguing application is the detection of previously unknown reactions which, when introduced into a metabolic network, redirect fluxes in order to increase growth, or, in general, improve the output of the network according to any given objective. We have modified metabolic networks of different model organisms in silico by inserting the generated reactions, one at a time, and determined the effect using Flux Balance Analysis. From the large sets of introduced chemical reactions, most do not significantly improve the corresponding objective, indicating that the analyzed networks already make efficient use of resources. However, we obtain specific reactions which significantly improve upon the objectives. These reactions can serve as candidates for metabolic engineering via genetic transformation of a corresponding enzyme-coding gene. The method facilitates targeted screening for novel chemical reactions improving an objective of interest in a metabolic network, thus proposing new strategies for metabolic engineering. |
Basler G*, Grimbs S, Nikoloski Z
*Max Planck Institute for Molecular Plant Physiology Germany |
I - Regulation, Pathways, and Systems Biology |
|
I 04
I4 |
In transcription factor binding site (TFBS) motif discovery, the true width of the motif to be discovered is generally not known a priori. The ability to automatically determine the most likely width of a motif is therefore essential in motif discovery. However, this is a challenging problem as a result of the changing model dimensionality (equivalent to the number of free parameters in the model) at different motif widths. Existing general model selection criteria which incorporate adjustments for dimensionality (e.g. the Akaike Information Criterion) have not performed well in this context.
We propose and implement a novel dual heuristic for automatically determining the width of an unknown motif, based on motif containment and information content (MCOIN). Tests on large collections of synthetic data and datasets containing known E. coli TFBS motifs show that the MCOIN heuristic outperforms the E-value of the resulting multiple alignment as a predictor of unknown motif width at higher levels of motif conservation, using mean absolute error. The heuristic is also shown to improve the overall correctness of results based on ROC analysis. Finally, we show that the performance of the MCOIN heuristic as a predictor of unknown motif width will improve as the performance of the core motif discovery algorithm improves. |
Kilpatrick A*, Aitken S, Ward B
*The University of Edinburgh United Kingdom |
I - Regulation, Pathways, and Systems Biology |
|
I 05
I5 |
Large amounts of data about interactions inside the cell are available from research in systems biology. By comparing metabolic pathways of well-characterised organisms with those of less well studied ones, complex mechanisms can be discovered and the biological meaning of groups of metabolic processes can be inferred.
However, available software applications developed for the prediction and comparison of metabolic networks often do not exploit the entire set of available biological knowledge, nor do they take into account some biological diversity among species. A new approach using probabilistic and machine learning techniques is proposed: a finite-state transducer (FST) learning framework. FSTs are two-tape finite-state machines that make it possible to combine different types of data and to learn from a set of training data. FSTs have been used for modelling some biological phenomena involving indels and mismatches. In order to apply FST learning to the metabolic pathway analysis problem, new theoretical formalisms should be developed. This work describes a proposal for the new formalisms, as well as their application to the prediction and comparison of metabolic pathways in bacteria. |
Roche-Lima A*, Domaratzki M, Gordon N, Alvare G, Zhang J, Fristensky B
*Department of Computer Science. University of Manitoba. Canada |
I - Regulation, Pathways, and Systems Biology |
|
I 06
I6 |
The term systems biology comprises several disciplines, from mathematical modeling to semantic data integration. Typically it involves genome-scale data sets of experimentally measured quantitative information about cell components (mRNA, proteins, metabolites, etc.).
At BASF Plant Science, we do systems biological research by applying data interpretation methodologies enabled by bioinformatics technologies to the integrative analysis of quantitative large-scale data sets. Here, we present a suite of systems biology data mining methods and tools that support the discovery of genes that are expected to have beneficial effects on the agronomical performance of crop plants. |
Loevenich S*
*metanomics GmbH Germany |
I - Regulation, Pathways, and Systems Biology |
|
I 07
I7 |
As the available information of a system increases, it becomes problematic to interpret experimental data and evaluate new hypotheses. A way to approach this problem is by developing computational models of the network of interest.
For our network a qualitative modeling approach was chosen. In Boolean models, the nodes of the network are linked by activation or inhibition events. Each node is characterized by discrete values and a set of logical functions defines their regulation. All interactions described in the literature were manually annotated and a database was created with the retrieved information. The model was then simulated using Boolean simulation software, which resulted in a number of steady states of the system. To evaluate the model, we compared the in silico derived steady states to the three experimentally observed phenotypes of cytokinesis (normal septation, multiseptation and no septation). We then proceeded towards optimizing the model. A number of knockout (KO) and overexpression (OE) gene perturbations were selected, for which the phenotype is well established. These perturbations were used as an in silico experimental set. Their simulation outcome was compared to the expected phenotype and the model received a score. We then proceeded to refinement steps, which included several iterations of addition, modification or deletion of regulatory edges. The outcome of this process was an optimized model, which is now used for predictive purposes. We have performed all in silico double KO and OE perturbation experiments and observed the in silico phenotypic outcomes. Currently we are testing the in silico predictions experimentally. |
Xenarios I, Chasapi A*, Niknejad A, Wachowicz P, Schmitter D, Cerutti L, Sage D, Simanis V, Dorier J
*Vital-IT, University of Lausanne Switzerland |
I - Regulation, Pathways, and Systems Biology |
|
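
The steady-state search described in abstract I7 can be sketched on a toy example. Everything below is an assumption made for illustration: the three nodes `A`, `B`, `C` and their rules are invented, the update scheme is synchronous, and the search is a brute-force enumeration of all states rather than whatever the authors' simulation software does. Knockout and overexpression are modelled by clamping a node to 0 or 1.

```python
from itertools import product

# Hypothetical toy rules (not from the abstract): next value of each node.
RULES = {
    "A": lambda s: s["A"],                 # input node, keeps its value
    "B": lambda s: s["A"] and not s["C"],  # A activates B, C inhibits B
    "C": lambda s: not s["B"],             # B inhibits C
}


def step(state, rules, knockout=(), overexpress=()):
    """One synchronous update; perturbed nodes are clamped afterwards."""
    nxt = {node: bool(rule(state)) for node, rule in rules.items()}
    for node in knockout:
        nxt[node] = False
    for node in overexpress:
        nxt[node] = True
    return nxt


def steady_states(rules, knockout=(), overexpress=()):
    """Enumerate all 2^n states and keep those mapped onto themselves."""
    nodes = sorted(rules)
    found = []
    for values in product([False, True], repeat=len(nodes)):
        state = dict(zip(nodes, values))
        if step(state, rules, knockout, overexpress) == state:
            found.append(tuple(int(state[n]) for n in nodes))
    return found


wild_type = steady_states(RULES)                   # states as (A, B, C)
c_knockout = steady_states(RULES, knockout=["C"])  # in silico KO of C
print(wild_type, c_knockout)
```

Explicit enumeration grows as 2^n and is only practical for small networks; scoring a model would then amount to comparing such steady states with the phenotypes expected for each perturbation.
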
I 08
I8 |
Microarray data has allowed biologists to make dramatic discoveries in the mechanism of gene regulation. These experimental data can be used to generate gene regulatory networks, which represent the interactions between genes, revealed by expression levels. The main problem of any microarray data set is that it often contains too few samples (in the order of tens or hundreds) compared to the number of genes (in the order of thousands). In this paper we analyse wheat microarray data sets. The analysis of wheat microarrays is a relatively new topic and consequently knowledge of gene regulation is still very superficial. We aim to identify some basic structure underlying the gene regulation by exploring the similarities and differences of regulatory networks generated from a number of independent studies.
In this paper we explore a combination of efficient network building using glasso (graphical lasso) on a number of different studies of the same system, followed by a comparison of each network using a combination of graph theoretical metrics and a form of sensitivity analysis. Based on the results we derive “consensus” and “differential” networks to better understand the key mechanisms and to detect how these vary under different conditions. |
Bo V*, Lysenko A, Tucker A, Saqi M
*Brunel University United Kingdom |
I - Regulation, Pathways, and Systems Biology |
|
I 09
I9 |
In all vertebrates, the segmental pattern of the body axis is established during embryogenesis as somites, masses of mesoderm distributed along the two sides of the neural tube, are formed sequentially in the anterior-posterior direction. Although it is universally accepted that the mechanism is governed by the clock and wave-front model involving genes from the Wnt, Fgf and Notch pathways, the exact phase and the hierarchy between those pathways are not well understood. Moreover, the exact number of genes involved as well as their timing is a subject of ongoing debate [Krol et al 2011].
Based on the analogy between the clock and wave-front model and wave propagation in physics, we propose a model to explain gene regulation during the process, which allows us to determine the correct phase for each gene. The method, based on maximum entropy deconvolution [Rowicka et al 2007] and the geometry of the system, is applied to the gene expression time-course data sets of [Dequeant 2006, Krol et al 2011]. The results allow us firstly to confirm the existing model and secondly to propose new candidate cycling genes as well as possible interactions between those genes and a hierarchy between the corresponding pathways. We have identified a number of genes which have two expression peaks per cycle, which allows us to postulate a previously unknown role for those genes. We present the timing of expression of the genes involved in somitogenesis, discuss the results in light of the known regulated genes, and propose new candidate genes involved in the process. |
Fongang B*, Kudlicki A
*University of Texas Medical Branch United States of America |
I - Regulation, Pathways, and Systems Biology |
|
I 10
I10 |
Dendritic cell (DC) differentiation and maturation play an essential role in the regulation of inflammatory response in a variety of diseases. However, the knowledge on how these events are regulated and how they affect disease progression or severity is still mostly phenomenological. In particular, an in silico model of the DC gene regulatory network is still missing and would be useful not only to structure known experimental results, but also to predict novel regulations or phenotypes.
Our work presents a model of the DC gene regulatory network, built using a qualitative modeling approach in which the activity of a gene is encoded as a Boolean variable that can be either active (1) or inactive (0). The Boolean modeling framework is interesting since it can cope with comparatively large regulatory networks derived from experimental data. This modeling approach does not suffer from the lack of data on the stoichiometry and kinetics of biochemical reactions. This network was built in two steps: first a biocurated knowledge network was built by extensive manual literature search to obtain biologically relevant interactions between genes/proteins and compounds. This network was then optimized and calibrated until it was able to reproduce a set of experimental data, based on stimulation with TNF-alpha, CD40L, TGFbeta, TSLP and IL13. In silico experiments were performed using BoolSim [1,2], which implements a very efficient method to find all steady states (corresponding to cellular phenotypes) of the network, based on an implicit enumeration of all possible network states. The resulting network was then used to perform additional in silico experiments and predict effects to be tested in future experiments. [1] A. Garg et al, Bioinformatics 24, 1917 (2008). [2] A. Di Cara et al, BMC Bioinformatics 8, 462 (2007). |
Dorier J*, Niknejad A, Berntenis N, Liechti R, Ebeling M, Xenarios I
*Vital-IT, Swiss Institute of Bioinformatics Switzerland |
I - Regulation, Pathways, and Systems Biology |
|
I 12
I12 |
Glycolytic oscillations have been extensively studied in yeast, both experimentally and computationally. Two major hypotheses have been formulated. On the one hand, glycolytic oscillations are believed to be caused by allosteric effects at the enzyme phosphofructokinase. On the other hand, glycolytic oscillations are believed to be caused by the stoichiometry of glycolysis and the participation of one of the pathway products in several reactions in the first phase of glycolysis.
Compared with yeast, glycolysis in bacteria shows a different stoichiometry due to the presence of the phosphoenolpyruvate:carbohydrate phosphotransferase system. To our knowledge E. coli is the only bacterium for which glycolytic oscillations have been studied experimentally and computationally. Here, again, the phosphofructokinase and the stoichiometry are assumed to cause the observed glycolytic oscillations. In contrast to yeast and E. coli, lactic acid bacteria like Streptococcus pyogenes lack the allosteric regulation of phosphofructokinase. However, while fitting the kinetic model of the central carbon metabolism in S. pyogenes to time course data we observed damped oscillations. These oscillations occur for different parameter sets and over a wide and physiological parameter range. Since phosphofructokinase is not allosterically regulated in our model, we propose that the observed glycolytic oscillations are caused by the stoichiometry of glycolysis, especially by the feedback loops introduced by ATP and the use of phosphoenolpyruvate for sugar uptake. Furthermore, we present an optimization method for finding oscillatory areas in chemical reaction networks. In order to investigate and understand the origin of glycolytic oscillations in lactic acid bacteria we perform bifurcation and control analysis. |
Levering J*, Kummer U, Sahle S
*University Heidelberg Germany |
I - Regulation, Pathways, and Systems Biology |
|
I 13
I13 |
Today, the human population faces the challenge of feeding itself despite exponentially increasing numbers, increasing scarcity of fresh water, limited availability of land and fertiliser stocks, and global climate change. To maintain yields, farmers must apply increasing amounts of fertiliser, resulting in environmental damage. Of the essential plant nutrients, phosphate is most crucial for plant growth as it is often limited in the soil, and phosphate fertilisers are projected to run out in the next 50-100 years. Thus, there is an essential need to develop crop varieties that can produce a good yield in low phosphate conditions. In order to understand phosphate homeostasis in plants, a mechanistic dynamic model has been developed representing the cycle regulating the production of the high-affinity transporter, which includes microRNA-mediated degradation of a key transcriptional repressor (PHO2) of phosphate-transporter genes. Model parameters were either taken from the literature or approximated so that steady-state levels of phosphate and regulatory macromolecules are roughly similar to those seen in rice in vivo. Simulations and model analyses showed a range of behaviours and recommended certain gene constructs as generating higher phosphate uptake. However, an analysis of tissue-specific transcriptomic data for the components of the regulatory cycle showed that different components are expressed at very different levels in the epidermis compared to the vascular system. The effect of this is that phosphate regulation exists in different states in the two tissues. To explore this differential behaviour, a multicellular model comprising epidermal, cortex and vascular tissues is being developed. |
Ajmera I*, Hodgman C, Lu C
*Multidisciplinary Centre for Integrative Biology (MyCIB), United Kingdom |
I - Regulation, Pathways, and Systems Biology |
|
I 14
I14 |
piRNAs have long been thought to be primarily responsible for maintaining genome stability in germ-line tissues such as testes by retrotransposon silencing. However, recent studies have observed the expression of piRNAs also in murine hippocampus. To investigate the dynamic expression of piRNAs as well as their regulatory controls and TSS architectures in both embryonic brain and testes, we adopted an integrative genome-wide approach combining RNA-Seq and CAGEscan. We identified 1397 novel piRNA clusters in the testes and 125 in the brain, of which 73 piRNA clusters exhibited expression in both tissues. On average, 30 and 180 known piRNAs aligned to each piRNA cluster in testes and brain respectively. Moreover, all of the aligned piRNAs associate uniquely with piRNA clusters. Our CAGEscan analysis provided evidence of similar TSS architecture between protein-coding genes and piRNA clusters. Among the four known gene TSS architectures, three were visible in the six most highly expressed piRNA clusters in testes. Furthermore, we detected an enrichment of purine/purine, specifically guanine/guanine, at the -1,+1 di-nucleotides relative to the TSS. We found 6 and 5 distinct expression patterns throughout mouse development for testes and brain respectively. The analysis of the regulatory networks of each set of piRNA clusters based on their expression profile collectively identified transcription factors unique to each tissue and common to both. Several transcription factors have previously been implicated in embryogenesis, neural development and gametogenesis. Our integrative analysis not only provides the first evidence of a genome-wide expression of piRNAs during brain development but also provides the first insights into their transcriptional controls and architecture. |
Ghosheh Y*, Ryu T, Clinton M, Carninci P, Faulkner G, Ravasi T
*Integrative Systems Biology Lab, Division of Chemical & Life Sciences and Engineering, Division of Applied Mathematics and Computer Science, King Abdullah University of Science and Technology Saudi Arabia |
I - Regulation, Pathways, and Systems Biology |
|
I 15
I15 |
Motivation: Modeling of time-evolving protein-protein interaction (PPI) networks, called dynamic networks, involves estimation of time-varying topological and functional characteristics of networks for understanding how cellular functions evolve. Most previous methods focused on estimation of topological characteristics of networks, but not functional characteristics. Integration of functional information of the nodes to estimate dynamic networks is desirable to effectively infer temporal transitions of cellular functions.
Results: We present a probabilistic model for reverse-engineering Time-Evolving PPI networks using MultiPle Information (TEMPI), namely time-course gene expression data and Gene Ontology biological processes (GOBPs). TEMPI computes variational posterior distributions over latent variables to jointly estimate time-evolving PPIs and the relative importance of GOBPs over time, permitting interpretation of the temporal transitions of cellular functions using the reconstructed networks. To demonstrate the utility of TEMPI, we applied it to two datasets: 1) a synthetic dataset and 2) time-course gene expression data collected during the cell cycle in yeast. The results showed that TEMPI reconstructed time-evolving PPI networks that represent temporal activations and transitions of GOBPs and associated edges over time, which facilitates more effective interpretation of the activation of key cellular processes over time. Thus, TEMPI serves as a useful tool that can provide detailed hypotheses for the temporal activation of key cellular processes and its underlying mechanisms. |
Kim Y*, Jang J, Hwang D, Choi S
*School of Interdisciplinary Bioscience and Bioengineering, POSTECH Korea, South |
I - Regulation, Pathways, and Systems Biology |
|
I 16
I16 |
Yeast has been used as one of the eukaryotic model organisms to study both replicative and chronological ageing. Several regulatory genes and pro-ageing pathways conserved from yeast to mammals have previously been identified. However, additional genes and molecular mechanisms involved in cell longevity remain to be elucidated. With the availability of genome-scale metabolic models, protein interaction data, and expression data in yeast, we performed an integrated analysis of these datasets to elucidate the underlying mechanisms that drive the changes in gene expression and to uncover additional components of the pathways. Given gene expression data of ageing yeast at three time points, we applied computational approaches to identify active modules in the yeast protein network and to discover important biological features such as TFs, metabolites and biological processes whose genes changed significantly during chronological ageing. |
Wanichthanarak K*, Nookaew I, Nielsen J, Petranovic D
*Chalmers University of Technology Sweden |
I - Regulation, Pathways, and Systems Biology |
|
I 17
I17 |
Inferring the structure of gene networks from expression data is a major theme in systems biology. Despite the large number of methods developed for this purpose, such inference is still an open problem, as demonstrated in a community-wide challenge within the DREAM project. The main findings from this challenge were that the network inference problem is underdetermined and that most methods have difficulty in distinguishing direct from indirect regulation. While this underdetermined property has not been theoretically established, it implies that the solution to the inference problem consists of a set of networks, not just one network.
The new algorithm in this work aims to produce all networks, in the form of directed (acyclic) graphs, that are consistent with the provided expression data. Here, the set of consistent networks is defined by the upper bound, i.e. the network with the most edges, and the lower bound, i.e. the network with the fewest edges. The true network is guaranteed to lie between the upper and lower bounds and thus can be precisely identified when the two bounds coincide. Our algorithm, called TRACE (Transitive Reduction And Closure Estimation), uses the transitive closures and transitive reductions of the wild-type network and possible knock-out networks that can be extracted from the given data. Specifically, the upper bound is assembled by intersecting the transitive closures of wild-type and knock-out networks, while the lower bound is constructed from the union of the transitive reductions. If desired, the set of consistent networks can be straightforwardly enumerated by adding to the lower bound, in a combinatorial fashion, the edges that appear only in the upper bound. We have applied the algorithm to randomly generated gene networks to demonstrate its performance and efficacy and to investigate the inferability of gene networks. |
Ud-Dean SMM*, Gunawan R
*ETH Zurich, Institute for Chemical and Bioengineering Switzerland |
I - Regulation, Pathways, and Systems Biology |
|
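The two graph operations named in the TRACE abstract (I17) above can be illustrated on a toy network. The following is a minimal sketch, not the authors' implementation: it computes the transitive closure and transitive reduction of a single small DAG and enumerates every edge set lying between the two. The gene names and helper function names are hypothetical, and the combination across wild-type and knock-out networks (intersection of closures, union of reductions) is not reproduced here.

```python
from itertools import combinations


def transitive_closure(nodes, edges):
    """Return all (u, v) pairs such that v is reachable from u."""
    reach = {n: set() for n in nodes}
    for u, v in edges:
        reach[u].add(v)
    # Warshall-style update: allow paths that pass through node k.
    for k in nodes:
        for i in nodes:
            if k in reach[i]:
                reach[i] |= reach[k]
    return {(u, v) for u in nodes for v in reach[u]}


def transitive_reduction(nodes, edges):
    """Return the minimal edge set with the same reachability (DAG input only)."""
    closure = transitive_closure(nodes, edges)
    return {
        (u, v)
        for (u, v) in closure
        # Drop (u, v) if some intermediate node w already connects u to v.
        if not any((u, w) in closure and (w, v) in closure for w in nodes)
    }


def enumerate_between(lower, upper):
    """Yield every edge set that contains `lower` and is contained in `upper`."""
    optional = sorted(upper - lower)
    for r in range(len(optional) + 1):
        for extra in combinations(optional, r):
            yield lower | set(extra)


# Toy example (hypothetical genes): A -> B -> C plus a direct A -> C edge
# that cannot be told apart from the indirect path.
nodes = ["A", "B", "C"]
edges = {("A", "B"), ("B", "C"), ("A", "C")}

upper = transitive_closure(nodes, edges)    # network with the most edges
lower = transitive_reduction(nodes, edges)  # network with the fewest edges
candidates = list(enumerate_between(lower, upper))
```

In this toy case the bounds differ by the single edge A -> C, so exactly two candidate networks remain; when the bounds coincide, the enumeration yields one network.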
I 18
I18 |
Mesenchymal Stromal Cells (MSCs) are multipotent cells that express CD105, CD73 and CD90, and lack expression of CD45, CD34, CD14 or CD11b, CD79alpha or CD19 and HLA-DR antigens. MSCs can differentiate to adipocytes, osteocytes and chondrocytes and are widely used in regenerative medicine.
We compared gene expression patterns of phenotypically defined MSCs derived from embryonic stem cells (hES-MSCs), fetal limb (FL-MSCs) and bone marrow (BM-MSCs). Each microarray dataset comprised an undifferentiated control and three differentiated cell types (adipocytes, osteocytes and chondrocytes) taken at an early stage (Day 7) and a terminal stage (Day 14). Differentially expressed genes between the differentiated cells and undifferentiated controls were compared across MSCs derived from the three sources. Only 2-4% of differentially expressed genes were found to be common among MSCs from all three sources. GO functions of these common genes were cell death, cellular movement, cellular growth and proliferation, and cell cycle, all central to cell survival. The top differentially expressed GO functions for differentiated MSCs from each source were very different. For hES-MSCs, the key functions were cellular assembly and organization, and DNA replication and repair. For FL-MSCs, the top functions were cardiovascular system development and function, and angiogenesis. For BM-MSCs, the top functions were tissue and connective tissue development. Thus, although the MSCs derived from each of these sources have the same surface phenotype, they use different mechanisms for differentiation to similar tissues. These differences in gene expression could give clues to explain biological differences in differentiation capacity and kinetics. |
Vaz C*, Tan BBT, Yong DA, Tanavde V
*A*STAR Bioinformatics Institute Singapore |
I - Regulation, Pathways, and Systems Biology |
|
I 19
I19 |
Molecular entities in biological systems are usually depicted as networks. Nodes and edges can represent different things depending on the goal of the network, and studies of biological networks are becoming increasingly popular in systems biology. In particular, genome-scale metabolic networks allow the systems biologist to study the full metabolic capacities of an organism or to analyze its metabolism under different environmental conditions. In our work, a parameter that allows comparisons between different metabolic networks has been defined. We apply this parameter and the Kruskal algorithm to infer evolutionary relationships among organisms and to reconstruct phylogenetic trees purely from the metabolic information contained in the genome-scale metabolic network of each organism. |
Gamermann D*, Montagud A, Conejero A, Reyes R, Fuente D, de Córdoba PF, Urchueguía J
*Universidad Católica de Valencia and Universidad Politécnica de Valencia Spain |
I - Regulation, Pathways, and Systems Biology |
|
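For readers unfamiliar with the Kruskal algorithm mentioned in abstract I19 above, the sketch below builds a minimum spanning tree from pairwise distances. It is only an illustration: the organism labels and distance values are made up, and the abstract does not specify how its comparison parameter is computed or how the phylogenetic tree is derived from the spanning tree.

```python
def kruskal(nodes, weighted_edges):
    """Return the minimum spanning tree as a list of (u, v, weight) tuples.

    `weighted_edges` is an iterable of (weight, u, v) tuples.
    """
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent links to the set representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    # Consider edges from the smallest to the largest distance.
    for w, u, v in sorted(weighted_edges):
        ru, rv = find(u), find(v)
        if ru != rv:          # the edge joins two separate components
            parent[ru] = rv
            tree.append((u, v, w))
    return tree


# Hypothetical pairwise distances between four organisms A-D.
organisms = ["A", "B", "C", "D"]
distances = [
    (1.0, "A", "B"), (2.0, "B", "C"), (3.0, "C", "D"),
    (4.0, "A", "C"), (5.0, "A", "D"), (6.0, "B", "D"),
]
mst = kruskal(organisms, distances)
total_length = sum(w for _, _, w in mst)
```

With these toy distances the tree links A-B, B-C and C-D, i.e. each organism is attached through its closest available neighbour.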
I 20
I20 |
We have recently proposed a promising method for enzyme inference in metabolic networks combining metabolite and microarray time series data. Here we extend this approach to obtain even more robust selection of the enzyme candidates.
The present approach is based on a model in terms of ordinary differential equations (ODEs) with time-dependent coefficients. From metabolite data, we estimate the time-dependent kinetic parameters that best explain the observations. Assuming that the time variations are due to gene expression, we next compare the estimated enzymatic dynamics with those in the gene expression data and select the most promising gene candidates. However, selecting the best candidate per individual reaction does not necessarily yield the optimal result, since the reactions are coupled with each other. For example, it may be that the second or third best candidate actually leads to a better fit. Although simulating all possible sets is impossible due to the combinatorial explosion, with a pre-selected set of the most prominent genes one can run through, say, the 3-5 best candidates per reaction relatively quickly, depending on the size of the network. We apply this post-processing approach to select the glycosyltransferases that are likely to be involved in the glycosylation pathways of flavonoid biosynthesis in tomato seedlings. Earlier, we have shown that in such pathways we may obtain unique, identifiable estimates for the time-dependent kinetic parameters. Based on this analysis, a set of gene candidates is currently being cloned for further experimental validation. |
Astola L*, Gomez Roldan V, Molenaar J
*Wageningen university and research center Netherlands |
I - Regulation, Pathways, and Systems Biology |
|
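The post-processing idea in abstract I20 above, trying combinations of the few best candidates per reaction rather than only the single best one, can be sketched as an exhaustive search over a small Cartesian product. This is an illustrative sketch under stated assumptions: the reaction and gene names are hypothetical, and `toy_error` is a made-up lookup table standing in for the fit error of a simulated ODE model.

```python
from itertools import product


def best_joint_assignment(candidates, joint_error):
    """Pick one candidate gene per reaction so that the joint error is minimal.

    `candidates` maps each reaction to its pre-selected candidate genes;
    `joint_error` takes a tuple of genes (one per reaction, in dict order).
    """
    reactions = list(candidates)
    best = min(product(*(candidates[r] for r in reactions)), key=joint_error)
    return dict(zip(reactions, best))


# Hypothetical example: two coupled reactions, two candidates each,
# each list ordered from best to worst individual match.
candidates = {"R1": ["g1", "g2"], "R2": ["g3", "g4"]}

# Made-up joint fit errors; a real score would come from simulating the model.
toy_errors = {
    ("g1", "g3"): 0.9, ("g1", "g4"): 0.5,
    ("g2", "g3"): 0.2, ("g2", "g4"): 0.7,
}


def toy_error(genes):
    return toy_errors[genes]


assignment = best_joint_assignment(candidates, toy_error)
```

With k candidates for each of R reactions the search visits k^R combinations, which is why the abstract restricts it to a handful of pre-selected genes per reaction. In the toy table the individually top-ranked pair (g1, g3) has the worst joint error, and the search selects g2 with g3 instead.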
I 21
I21 |
Defining active regulatory regions in the genome is a crucial step towards the understanding and modeling of transcriptional regulation. We have recently shown in mouse stem cells and neuronal progenitors that transcription factor binding to active regulatory regions leads to a defined reduction in DNA methylation, allowing for the identification of regulatory regions in otherwise methylated parts of the genome. Here we present an advanced computational method for the unbiased identification of such footprints from large-scale bisulfite-sequencing data. Our approach partitions the genome into fully methylated (FMRs), unmethylated (UMRs) and low-methylated regions (LMRs), while accounting for false methylation calls due to single nucleotide variations. As an additional feature, our approach detects partially methylated domains, which represent a fourth class of methylation pattern that needs to be discriminated from LMRs and UMRs. By applying our method to publicly available mouse and human methylation datasets, we find that many UMRs are active promoters, whereas LMRs correspond to distal regulatory elements. LMRs are highly variable across different tissues, correlate between related cell types and show motif enrichments for tissue-specific transcription factors. These findings extend and generalize our previous results and suggest that the presented method provides a robust and reproducible approach for unbiased segmentation of base-pair-resolution bisulfite methylomes and the discovery of regulatory elements from DNA methylation data. |
Burger L*, Gaidatzis D, Stadler M
*Friedrich Miescher Institute Switzerland |
I - Regulation, Pathways, and Systems Biology |
|
I 22
I22 |
Pathogenic Escherichia coli, such as Enterohemorrhagic E. coli (EHEC) and Enteroaggregative E. coli (EAEC), are globally widespread bacteria. Some may cause the hemolytic uremic syndrome (HUS). Varying strains cause epidemics all over the world. Recently, we observed an epidemic outbreak of a multi-resistant EHEC strain in Western Europe, mainly in Germany. The Robert Koch Institute reports >4,300 infections and >50 deaths (July, 2011). Farmers lost several million EUR since the origin of infection was unclear. Here, we contribute to the currently ongoing research with a computer-aided study of EHEC transcriptional regulatory interactions, a network of genetic switches that control, for instance, pathogenicity, survival and reproduction of bacterial cells. Our strategy is to utilize knowledge of gene regulatory networks from the evolutionary relative E. coli K-12, a harmless strain mainly used for wet lab studies. In order to provide high-potential candidates in human pathogenic E. coli bacteria, such as EHEC, we developed the integrated online database and analysis platform EhecRegNet. We utilize 3,489 known regulations from E. coli K-12 for predictions of yet unknown gene regulatory interactions in 16 human pathogens. For these strains we predict 40,913 regulatory interactions. EhecRegNet is based on the identification of evolutionarily conserved regulatory sites within the DNA of the harmless E. coli K-12 and the pathogens. Identifying and characterizing EHEC’s genetic control mechanism network on a large scale will allow for a better understanding of its survival and infection strategies. This will support the development of urgently needed new treatments. |
Röttger R*, Pauling J, Baumbach J
*Max Planck Institute for Informatics Germany |
I - Regulation, Pathways, and Systems Biology |
|
I 23
I23 |
Statistical learning methods, such as Bayesian Networks, have gained high popularity for inferring cellular networks from high-throughput experiments. However, the inherent noise in experimental data, together with the typically low sample size, limits their performance and leads to many false positives and false negatives. Incorporating prior knowledge into the learning process has thus been identified as a way to address this problem, and in principle a mechanism for doing so has been devised (Mukherjee & Speed, 2008). However, so far little attention has been paid to the fact that prior knowledge is typically distributed among multiple, heterogeneous knowledge sources (e.g. GO, KEGG, HPRD, etc.).
Here we propose two methods for constructing an informative network prior from multiple knowledge sources: Our first model is a latent factor model using Bayesian inference. Our second model is the Noisy-OR model, which assumes that the overall prior is a non-deterministic effect of participating information sources. Both models are compared to a naïve method, which assumes independence of knowledge sources. Extensive simulation studies on artificially created networks as well as full KEGG pathways reveal a significant improvement of both suggested methods compared to the naïve model. The performance of the latent factor model increases with larger network sizes, whereas for smaller networks the Noisy-OR model appears superior. Furthermore, we show that our informative priors significantly enhance the reconstruction accuracy of Bayesian Network and Nested Effects Models. Finally, two examples, one from breast cancer and one from murine stem cell development highlight the utility of our approach. |
Praveen P*, Fröhlich H
*Bonn Aachen Center for IT Germany |
I - Regulation, Pathways, and Systems Biology |
|
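As background for the Noisy-OR model named in abstract I23 above, the textbook noisy-OR rule combines per-source probabilities as 1 minus the product of (1 - p_i): an edge is supported unless every source independently fails to support it. The sketch below shows only this standard rule; the source names and probability values are hypothetical, and the authors' actual model may include further terms that the abstract does not describe.

```python
def noisy_or(source_probs):
    """Combine per-source edge probabilities with the standard noisy-OR rule."""
    p_none = 1.0  # probability that no source supports the edge
    for p in source_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
        p_none *= 1.0 - p
    return 1.0 - p_none


# Hypothetical support for one candidate edge from two knowledge sources.
edge_support = {"GO": 0.5, "KEGG": 0.5}
edge_prior = noisy_or(edge_support.values())
```

Adding a source can only increase the combined prior under this rule, and a single source with probability 1 fixes it at 1.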
I 24
I24 |
MicroRNAs (miRNAs) negatively regulate the levels of mRNA post-transcriptionally. Overexpressing miRNAs in cells revealed hundreds of suppressed genes. Additionally, capturing miRNAs at the RISC complex provides a map of miRNAs and their targets. Using these data, we implemented combinatorial and statistical constraints in the miRror2.0 algorithm. miRror estimates the likelihood of a combinatorial action of miRNAs to explain the observed data. A systematic assessment of 30 transcriptomic datasets and hundreds of miRNA sets shows that miRror is a robust protocol that outperforms a dozen miRNA-target prediction databases. We then questioned the additive contribution of miRNA pairs. We found that miRNAs belonging to a family adopt an overlapping, backup mode of regulation. However, experimental data support an expanding, complementation mode of regulation by most miRNA pairs. Finally, we applied the miRror protocol to transcriptomic data from manipulated cells, and identified instances in which small miRNA sets govern the observed gene expression. We propose that miRNA combinatorial regulation is the chosen strategy in governing cellular homeostasis while overcoming the low specificity assigned to any single miRNA. |
Balaga O*, Linial M, Friedman Y
*Hebrew University of Jerusalem Israel |
I - Regulation, Pathways, and Systems Biology |
|
I 25
I25 |
Molecular phenotyping technologies offer the possibility to simultaneously obtain multiple time-series (MTS) data from different levels of biological systems. As a result, MTS data capture the dynamics of biochemical processes and components whose couplings may involve different scales and exhibit temporal changes. Therefore, it is important to develop methods for determining the time segments in MTS data, corresponding to critical biochemical events which are reflected in the coupling of the system’s components. Here we provide a network-based formalization of the MTS segmentation problem based on temporal dependencies and the covariance structure of the data. We demonstrate that the problem of partitioning MTS data into k segments to maximize a distance function, operating on polynomially computable network properties, can be efficiently solved. To enable biological interpretation, we also propose a confidence index for the outcome of the network-based segmentation. Our empirical analyses of synthetic benchmark data as well as time-resolved transcriptomics data from the metabolic cycle of Saccharomyces cerevisiae demonstrate that the proposed method accurately infers the phases in the temporal compartmentalization of biological processes. In addition, through comparison on the same data sets, we show that the proposed formalization of the MTS segmentation problem outperforms the contending state-of-the-art methods. |
Omranian N*, Klie S, Nikoloski Z
*Potsdam university Germany |
I - Regulation, Pathways, and Systems Biology |
|
I 26
I26 |
Statistical approaches to describing the behaviour, including the complex relationships between input parameters and model outputs, of nonlinear dynamic models (referred to as metamodelling) are gaining more and more acceptance as a means for sensitivity analysis and to reduce computational demand. Understanding such input-output maps is necessary for efficient model construction and validation. Multi-way metamodelling provides the opportunity to retain the block-wise structure of the temporal data generated by dynamic models throughout the analysis. Furthermore, a cluster-based approach to regional metamodelling allows description of highly nonlinear input-output relationships, revealing additional patterns of covariation.
By presenting the N-way Hierarchical Cluster-based Partial Least Squares Regression (N-way HC-PLSR) method, we here combine multi-way analysis with regional cluster-based metamodelling, together making a powerful methodology for extensive exploration of the input-output maps of complex dynamic models. We illustrate the potential of the N-way HC-PLSR by applying it both to predict model outputs as functions of the input parameters, and in the inverse direction (predicting input parameters from the model outputs), to analyse the behaviour of a dynamic model of the mammalian circadian clock. Our results display a more complete cartography of how variation in input parameters is reflected in the temporal behaviour of multiple model outputs than has been previously reported. Variations in the model sensitivity to certain input parameters across the model output space are also illustrated. Our results indicate that the N-way approach allows a more transparent and detailed exploration of the temporal dimension of complex dynamic models, compared to alternative 2-way methods. |
Tøndel K*, Indahl UG, Gjuvsland AB, Omholt SW, Martens H
*Centre for Integrative Genetics (CIGENE), Dept. of Mathematical Sciences and Technology, Norwegian University of Life Sciences Norway |
I - Regulation, Pathways, and Systems Biology |
|
I 27
I27 |
SwissRegulon portal is a repository of databases and bioinformatics tools related to transcription regulatory processes. It includes:
SwissRegulon: A database of genome-wide annotations of regulatory sites. We currently have annotations for 17 prokaryotes and 3 eukaryotes (including human and mouse) in our collection. PhyloGibbs: An algorithm for inferring regulatory motifs and regulatory sites from collections of DNA sequences, including multiple alignments of orthologous sequences from related organisms. MARA: Motif Activity Response Analysis is a free online tool that models genome-wide expression data in terms of our genome-wide annotations of regulatory sites. TCS: A database of predicted two-component signaling interactions across bacterial genomes. SwissRegulon address: www.swissregulon.unibas.ch |
Pachkov M*, Balwierz P, van Nimwegen E
*University of Basel Switzerland |
I - Regulation, Pathways, and Systems Biology |
|
I 28
I28 |
Background: In vascular endothelial cells, IL-6 promotes inflammatory conditions through activation of the JAK/STAT pathway. Failure to suppress these pathways may therefore promote a chronic inflammatory state, which can ultimately lead to cardiovascular disease. Methods: We developed an ODE-based mathematical model of IL-6 signaling that includes crosstalk between the JAK/STAT and ERK signaling cascades. Bayesian statistics and sensitivity analysis were used for parameter estimation and experimental design. Experimental data were obtained from IL-6-stimulated human umbilical vein endothelial cells, and key signalling proteins were analysed by immunoblotting. Results: Model-driven experiments show oscillatory behaviours for STAT3 and ERK phosphorylation and SOCS3 induction, caused by the interplay of different regulatory mechanisms and cycles of production/degradation. The model we obtain after refinement is able to adequately predict the behaviour of these key proteins. Conclusion: A new model for the crosstalk between the JAK/STAT and the ERK pathways is presented. We investigate how different experimental data affect the parameter distributions and explore how information from in-silico data can be used to guide Bayesian experimental design. We then perform selected experiments to improve the model. We show how a successful combination of Bayesian inference with sensitivity analysis enables identification of an optimal experimental design for model improvement. |
Cretella R*, Palmer T, Girolami M, Breitling R
*University of Glasgow United Kingdom |
I - Regulation, Pathways, and Systems Biology |
|
I 29
I29 |
Given the need for new antimalarial drugs, the ethnopharmacological approach offers a new perspective on the discovery of chemical compounds. Previous studies have shown that the drug Cepharanthine, extracted from Stephania rotunda, has interesting antimalarial activities but an unknown mechanism of action. Preliminary studies have shown that Cepharanthine stops the malaria cycle at the ring phase. In this study, we compared the transcriptome of Plasmodium falciparum treated or not with Cepharanthine (dose 5 mM, close to IC90) for 8 hours during this phase, through microarray experiments. We used a 44k custom microarray developed in our lab, composed of 9730 probes corresponding to 5144 genes.
Our results show that the treated samples are closer to the t=0 samples than the control ones (K-means and hierarchical clustering). A two-way ANOVA (FDR, p<0.05) established that 1139 probes are differentially expressed (FC > 2.0) in the treated samples. 296 of those probes reflect specific actions of Cepharanthine on the malarial ring phase. GO analyses of the genes of interest and pathway analyses are currently in progress and are difficult due to the large number of P. falciparum genes with unknown function (close to 50%). Our studies have shown that Cepharanthine has a parasitostatic-like effect on the ring phase. We have determined new potential metabolic targets for this drug and thus a possible mechanism of action. This will improve the understanding of the biology of P. falciparum and help the development of drug candidates by pharmacomodulation. |
Desgrouas C, Chapus C*, Ollivier E, Parzy D, Taudon N
*UMR-MD3, IRBA France |
I - Regulation, Pathways, and Systems Biology |
|
I 30
I30 |
How gene expression develops during cell differentiation is a key factor in understanding the mechanism of development. However, conventional gene expression analysis cannot distinguish the expression of individual cells. Recently, single-cell gene expression technology was invented.
Using this technology, one can see how gene expression in individual cells changes during differentiation. In this paper, we re-analyze single-cell gene expression measurements obtained by next-generation sequencing technology during differentiation from mouse ES cells to MEF. In order to select genes that express the difference between mESC and MEF, we employed SAM. Although only two genes have significant P-values, the gene expression atlas gives us concurring results; most genes are differently expressed between embryo and adult cells. Since SAM could not give us significant P-values, we turned to another feature selection method that is free from P-values, namely principal component analysis (PCA). Some genes selected by SAM or PCA are biologically important. For example, Sox2 is one of the well-known Yamanaka factors used for the generation of iPS cells. However, most selected genes are ribosome-related genes. Although this is reasonable, it is more interesting that other genes are expressed as well. Single-cell gene expression analysis showed that gene expression profiles vary widely between cells, but biologically important genes in mESC are expressed. Suitable treatment of single-cell RNA-seq data turns out to be informative for understanding gene expression regulation during cell differentiation from ES cells. |
Iijima K*, Taguchi Y
*Chuo University Japan |
I - Regulation, Pathways, and Systems Biology |
|
I 31
I31 |
It is increasingly accepted that stochasticity plays an important role in biological systems. However, research based on stochastic kinetic models is still limited by two major challenges. First, the chemical master equation (CME) that governs the time evolution of such models can only be solved in the simplest cases. Second, measurements usually stem from heterogeneous cell populations, which may lead to severe errors in conclusions that are drawn from models that do not account for heterogeneity.
Here, we propose a moment-based framework for stochastic kinetic models that allows the analysis of population measurements despite the presence of heterogeneity. More specifically, we demonstrate how extrinsic variability can be incorporated in a model formulation and show how population moments can be derived from the resulting conditional CME. This allows us to match model predictions to measurements of heterogeneous populations and thus enables model-based studies of systems that are influenced by both intrinsic and extrinsic variability, which opens up new possibilities for quantifying cell-to-cell variability. We then study two biological systems. First, we couple the moment-based framework to parameter inference, which allows us to draw conclusions about the unknown sources of extrinsic variability in the transcriptional response to osmotic stress in budding yeast using measured flow cytometry distributions. Second, for an engineered light-induced gene expression system, we use our framework to find sequences of light pulses that maximize lower bounds on the information that the protein distribution provides about specific properties (for instance time scale or magnitude) of the extrinsic fluctuations. |
Ruess J*
*Automatic Control Laboratory, ETH Zurich Switzerland |
I - Regulation, Pathways, and Systems Biology |
|
I 32
I32 |
Human embryonic stem (hES) cells are characterized by unlimited potential for self-renewal and the capability of differentiation to all somatic cell types. It has been shown that a network of transcription factors such as Nanog, Oct4 and Sox2, together with specific epigenetic features, underpins the pluripotency of hES cells. Changes in gene expression during development are accompanied or caused by epigenetic regulation. Epigenetic regulators such as histone modifications are known to affect the chromatin structure and thus regulate gene expression at the transcriptional level. However, the joint impact of different regulators on the behavior of the regulatory network associated with maintenance of pluripotency has not been well studied. In the current study we integrated the histone modifications H3K4me3 and H3K27me3 into the transcriptional regulatory network associated with pluripotency in hES cells. We reviewed different mechanisms that control pluripotency and combined them in one regulatory network. Due to the complexity of the network, computational models are necessary to understand its behavior and to make predictions about it. To address this, we apply a standardized qualitative dynamical systems approach to the simulation of the reconstructed regulatory network. |
Nikolaeva E*, Peterson H, Vilo J
*University of Tartu, Institute of Computer Science Estonia |
I - Regulation, Pathways, and Systems Biology |
|
I 33
I33 |
When searching for the regulatory cause of a biological signal, one is often given a set of genomic regions, e.g. promoter sequences of differentially expressed genes. Based on these sequences the task is then to detect the Transcription Factors (TFs) that are most likely regulators of the observed targets.
Classical over-representation methods mostly rely on Fisher's exact test and assay each candidate TF individually. This approach is very likely to produce false positive hits, as it does not take into account that another TF in the collection might be a more probable regulator. Furthermore, these methods cannot detect combinations of TFs that together target the sequence set, e.g. two TFs that respond to the same signal but regulate different targets. In this project, we adopt a Bayesian Network framework to predict the TFs whose binding sites are over-represented in the given sequence set. Using predicted binding affinities to estimate the probability that a TF can bind to each sequence, our method outputs the TFs that cover the whole target set with high affinity. The main advantage of this approach is that it takes into account all candidate TFs at the same time. Thus, our method is able to reduce the number of false positive hits. Additionally, our method is able to predict combinations of TFs that act in concert. |
Perner J*, Vingron M
*Max Planck Institute for Molecular Genetics Germany |
I - Regulation, Pathways, and Systems Biology |
|
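The classical baseline that abstract I33 above contrasts itself with, a one-sided Fisher's exact test applied to one TF at a time, reduces to a hypergeometric tail probability. The sketch below shows that baseline only, not the Bayesian Network method of the abstract; the function name and the counts in the example are hypothetical.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """One-sided Fisher's exact test p-value for over-representation.

    N: sequences in the background, K: background sequences with a site,
    n: sequences in the target set, k: target sequences with a site.
    Returns P(X >= k) for X ~ Hypergeometric(N, K, n).
    """
    total = comb(N, n)
    tail = sum(
        comb(K, x) * comb(N - K, n - x)
        for x in range(k, min(K, n) + 1)
    )
    return tail / total


# Hypothetical counts: 20 promoters in total, 5 carry the TF's motif,
# 6 promoters are in the target set, and 3 of those carry the motif.
p = enrichment_p_value(k=3, n=6, K=5, N=20)
```

Because the test is run separately for every TF in a collection, it carries no information about whether a different TF explains the same targets better, which is the limitation the abstract points out.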
I 34
I34 |
Quickly evolving high-throughput technologies allow comprehensive, integrative biological studies which require software tools that support the combined analysis of large datasets from different fields of study. These tools need to merge heterogeneous input data and analyze and present them in a comprehensive and convenient way.
MarVis-Graph is an interactive software tool for the analysis of combined data from untargeted metabolomic, transcriptomic and proteomic studies, e.g. as derived from mass spectrometry, micro-array, or RNA-Seq measurements. The MarVis-Graph software features the import of heterogeneous data from comma-separated-value and spreadsheet files. It maps the experimental data to the corresponding entities in pre-defined metabolic networks (e.g. molecular features from mass spectrometry analysis to molecules, RNA sequences to genes). These networks are projected on data from the KEGG database [1] or the BioCyc database collection [2] and are represented as undirected graphs within MarVis-Graph. Besides reference networks containing all known metabolic reactions from KEGG or BioCyc, the analysis of organism-specific metabolic networks is also supported. MarVis-Graph utilizes algorithms from graph theory and clustering to analyze metabolic networks containing experimental data with the aim of identifying reaction chains or sub-networks with high experimental evidence. Identified sub-networks are ranked based on a variety of scoring methods and can be visualized interactively. In contrast to other tools, MarVis-Graph analyzes complete metabolic networks and is not restricted to pre-defined sub-graphs (e.g. single pathways). In combination with tools from the MarVis-Suite [3], MarVis-Graph has been applied successfully in wound response studies of Arabidopsis thaliana. [1] Kanehisa, M., Goto, S., Sato, Y., Furumichi, M., and Tanabe, M.; KEGG for integration and interpretation of large-scale molecular datasets. Nucleic Acids Res. 40, D109-D114 (2012). [2] Caspi R, Altman T, Dale JM, Dreher K, Fulcher CA, Gilham F, Kaipa P, Karthikeyan AS, Kothari A, Krummenacker M, Latendresse M, Mueller LA, Paley S, Popescu L, Pujar A, Shearer AG, Zhang P, Karp PD; The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Research 38, D473-9 (2010). [3] Kaever, A., Landesfeind, M., Possienke, M., Feussner, K., Feussner, I., and Meinicke, P.; MarVis-Filter: Ranking, Filtering, Adduct and Isotope Correction of Mass Spectrometry Data. Journal of Biomedicine and Biotechnology. Volume 2012 (2012) |
Landesfeind M*, Kaever A, Feussner K, Feussner I, Meinicke P
*Georg-August-University, Institute of Microbiology & Genetics, Department of Bioinformatics Germany |
I - Regulation, Pathways, and Systems Biology |
|
I 35
I35 |
A major difficulty in genome-scale metabolic network reconstruction and comparison is the integration of data from different resources, which may use different nomenclatures and conventions for metabolites and reactions. To address this issue, we have developed MNXref, a precompiled automatic reconciliation of many of the most commonly used metabolic resources (ChEBI, Rhea, KEGG, MetaCyc, BRENDA, BiGG, The SEED, UniPathway, BioPath, HMDB, LipidMaps). Based on MNXref, we designed a web portal (www.metanetx.org) that hosts hundreds of reconciled metabolic networks from various public resources (BiGG, The SEED, BioCyc, YeastNet, InSilicoOrganisms...). Users can also upload their own metabolic models and automatically map them into the MNXref namespace. All models can be interactively combined and compared; results can be visualized using tables or a graphical user interface. The www.metanetx.org portal also offers network analysis tools, such as flux balance analysis, and in the near future it will host model analysis and construction methods developed within MetaNetX, a project supported by the SystemsX.ch initiative. |
Bernard T*, Moretti S, Ganter M, Stelling J, Pagni M
*Swiss Institute of Bioinformatics - Vital-IT Group Switzerland |
I - Regulation, Pathways, and Systems Biology |
|
I 36
I36 |
The retinoblastoma protein (pRb) is a multi-domain tumor suppressor protein. In humans it is part of a protein family encompassing three members. pRb plays many cellular roles, mediated by several downstream effectors and transcriptional targets. The best known role of pRb is the control of cell cycle progression at the G1/S transition, mediated through its interaction with E2F transcription factors required for entrance into S phase. While proteins related to pRb and E2F are conserved in most eukaryotic lineages, in yeast no clear orthologues of pRb or E2F have been found. Nevertheless, a pathway functionally homologous to the pRb/E2F pathway is present.
We present a comparison of structural and regulatory features of the yeast and mammalian proteins. Complex phosphorylation regulatory loops control pRb and Whi5 function. The interaction landscape of pRb and Whi5 is quite large, with more than one hundred proteins interacting either genetically or physically with either protein. A system-level comparison of networks centered on pRb and Whi5 is presented with the aim of probing the evolutionary conservation of the function of Whi5 and pRb and their cognate regulatory circuits. |
Hasan MM*, Sacco E, Papaleo E, Brocca S, de Gioia L, Alberghina L, Vanoni M
*Department of Biotechnology and Biosciences, University of Milano-Bicocca Italy |
I - Regulation, Pathways, and Systems Biology |
|
I 37
I37 |
The reconstruction of regulatory networks and the use of these networks to predict the transcriptional response of a cell to experimental stimuli is a key challenge in systems biology. Identifying causal regulators in these networks can help to develop treatment against diseases, in our case against Burkitt's lymphoma.
The PC-algorithm is a computationally feasible approach to infer the structure of a network even for many nodes, as long as the graph is sparse [1]. More precisely, it estimates the equivalence class of the unknown underlying directed acyclic graph (DAG), representing it as a completed partially directed acyclic graph (CPDAG). Unfortunately, the resulting networks are hard to interpret biologically, since statistical interactions of gene expression levels often do not correspond to physical molecular interactions in a cell. However, these networks also allow for the estimation of causal effects in intervention experiments. Using these intervention effects we can answer the question: by how much would the expression of gene A change if the expression of gene B were artificially reduced by a certain amount? However, the sample version of the PC-algorithm may construct invalid CPDAGs due to the construction of conflicting v-structures, caused by incomplete observations. This problem compromises causal effect estimation as well. Here we describe to what extent the use of information from time series as well as prior knowledge can reduce the number of inner network conflicts and thus improve the estimation of hypothetical intervention effects. References: [1] SPIRTES, P., GLYMOUR, C. and SCHEINES, R. (2000). Causation, Prediction, and Search, 2nd ed. MIT Press, Cambridge, MA. MR1815675 |
Taruttis F*, Engelmann J, Spang R
*University Regensburg, Institute of Functional Genomics Germany |
I - Regulation, Pathways, and Systems Biology |
|
I 38
I38 |
Reprogramming of energy metabolism in cancer cells has become a major focus of research, in part as a way to intervene in their malignant growth behavior. Despite early progress, e.g. on the Warburg effect, the regulation mechanisms of tumor metabolism remain largely unknown. The availability of genome-wide information on cellular metabolic networks has opened the possibility of systematically analyzing cancer metabolism. However, given the complexity of metabolic networks in terms of structure and regulation, understanding cancer metabolism is still a challenging task.
Many cancer-associated signaling pathways induce metabolic reprogramming towards malignant pathways. Our aim is to identify and model the signaling molecules and key metabolic enzymes that are responsible for this transformation. As a first step, we analyze gene expression profiles to identify breast cancer specific regulation of metabolism. Data-sets from breast cancer patients are typically heterogeneous with respect to patient characteristics such as receptor status and treatment. Although the clinical classification of tumor subtypes determines treatment, it often has poor prognostic value in terms of survival. We exploit the fact that not all genes are informative for clinical outcome and classify patients into good and bad prognosis groups using random (survival) forests. We show that the selected set of genes is prognostic on other data-sets. Based on these newly defined groups, we carry out pathway analysis using our tool PathWave to systematically identify discriminative patterns of regulation in metabolic networks. We will discuss gene expression profiles in tumors with good and bad prognosis in relation to cell metabolism. |
Soons Z*, Piro RM, Wolf T, Sonntag J, Eils R, Wiemann S, König R
*Heidelberg University / German Cancer Research Center Germany |
I - Regulation, Pathways, and Systems Biology |
|
I 40
I40 |
Networks are the most popular representation of biological systems. Analysis of their system-level characteristics, such as modularity, is very helpful for uncovering the mechanisms of many biological processes or functions. It has been reported previously that biological networks exhibit functional modularity. Until now, however, effective metrics to measure and analyze this property have been lacking, although several topological modularity indexes exist.
Here, we define modules according to the function annotations of biological network elements. By calculating their cohesiveness and the degree of coupling between each pair of modules, we obtain a scalar value named MCC to measure the functional modularity of biological networks, and we present a rule to judge it. The results show that several types of biological networks, such as the yeast transcription factor regulatory network (TFRN), show weak functional modularity, while the yeast protein-protein interaction network (PPIN) and the metabolic networks (MN) of 19 eukaryote species exhibit significant functional modularity. Analysis of the TFRN and PPIN shows that a network's functional modularity is related to the functional specificity of its elements (TFs for the TFRN, and proteins for the PPIN). A comparison between MCC and Newman’s modularity suggests that MCC is more suitable for measuring the functional modularity of biological networks. |
Liu Q*, Guo H, Zhu Y
*Department of Chemistry and Biology, College of Science, National University of Defense and Technology China |
I - Regulation, Pathways, and Systems Biology |
|
I 41
I41 |
Inference of gene regulation from expression data may help to unravel regulatory mechanisms involved in complex diseases or in the action of specific drugs. A challenging task is to design biological experiments with a limited budget and produce datasets suitable to reconstruct putative regulatory modules.
Here we introduce an original experimental set-up and a customized method of analysis to reconstruct regulatory modules starting from genetic perturbation data, e.g. from siRNA inactivation/activation of target genes. The experimental set-up involves the screening of a panel of genes, pre-selected based on biological knowledge about the pathways of interest and/or on genome-wide transcriptional profiling at a fixed time point. Samples are collected in short time series, useful to reconstruct asynchronous regulations. The analysis method consists of two main steps: a significance analysis to elicit significant modulations following the silencing of each target gene, and an inference procedure to extract regulatory modules, composed of different feed-forward loops (FFLs), including couples of target genes. Our procedure was applied to produce and analyze an as yet unpublished dataset, in which the IFN-α transcriptional response in endothelial cells was investigated by RNAi silencing of candidate IFN-α modulators: STAT1, IFIH1, IRF1, IRF7, GBP1, OAS2 and USP18. Custom TaqMan Low-Density Arrays were used to screen 96 pre-selected genes. Samples were collected at four time points, with double stimulation (IFN-α/wash-out), in biological duplicates. Our approach allowed us to quantify the strength of IFN-α modulators and to identify putative regulatory modules involving them. A few selected FFLs from the inferred regulatory modules are currently being experimentally validated. |
Grassi A*, Di Camillo B, Ciccarese F, Finesso L, Indraccolo S, Toffolo G
*Istituto Oncologico Veneto - IRCCS Italy |
I - Regulation, Pathways, and Systems Biology |
|
I 42
I42 |
This research deals with the distribution of loads in the femoral head of the hip joint. We describe a formalized mathematical model of the distribution of loads in the femoral head (in the normal state and with pathology). A method for measuring the load angle (L-angle), and its importance, are described for the first time.
The L-angle is the angle between the mechanical and anatomical axes of the hip. The value of the L-angle is of specific theoretical and practical interest. An original model for diagnosing the solvency of the lower limb’s kinematic system is proposed. It can be used in medical cybernetics and computational biology, and it can help to protect the functions of the lower limb’s kinematic system from secondary changes and to preserve a fixed position of the limb. |
Loginov A*, Yermakov D
*Luhansk Taras Shevchenko National University Ukraine |
I - Regulation, Pathways, and Systems Biology |
|
I 43
I43 |
Nucleosomes are the basic unit of chromatin, comprising a stretch of DNA of length 147 bp wrapped around a histone octamer. Since 70-90% of the eukaryotic genome is packaged into nucleosomes, they play a crucial role in modulating accessibility of transcription factor binding sites (TFBSs). Consequently, nucleosome positioning has profound effects on gene expression in eukaryotes. Biophysical modeling predicts that competition between nucleosomes and transcription factors (TF) for binding to nearby sites on the genome can induce both positive and negative cooperativity in TF binding. In particular, we show that the cooperative effect depends periodically on the distance between TFBSs, with positive cooperativity for sites less than 40 bp apart, negative cooperativity for larger distances up to one nucleosome length, and again positive cooperativity for distances just above one nucleosome length. A comprehensive statistical analysis of TFBS positioning for 158 TFs of Saccharomyces cerevisiae shows that many pairs of TFs have positioned their binding sites so as to optimize positive cooperativity of their binding. Moreover, this positioning is most significant for a number of TFs that have already been implicated in opening chromatin. In summary, our results show that the "grammar" of the regulatory code in yeast promoters is shaped to a significant extent by nucleosome-mediated cooperativity of TFs. |
Ozonov E*, van Nimwegen E
*Biozentrum and Swiss Institute of Bioinformatics Switzerland |
I - Regulation, Pathways, and Systems Biology |
|
I 44
I44 |
Genistein and daidzein are soy metabolites which have been shown to have positive effects on human health. Those effects depend strongly on the result of varied degradation routes that break down the large aromatic molecules into smaller and simpler phenolics. Some of the reactions are performed by gut bacteria which are unique to a specific individual. This process is monitored by metabolomics experiments on blood and urine, but it is not yet possible to deduce which degradation routes were active. The possible dynamics and behavior of the system also remain unknown. In this work we use Petri Net theory to explore the degradation pathway of genistein and daidzein. We build a Petri Net model using the literature and expert knowledge. We provide an analysis of the model and perform in silico simulation using a maximal parallelism approach. Moreover, we show that t- and p-invariants of the Petri Net model can be used to support multivariate statistical analysis of metabolomics data. |
Reshetova P*, Westerhuis JA, van Kampen AHC, Smilde AK
*Biosystems Data Analysis Group, Swammerdam Institute for Life Sciences, University of Amsterdam Netherlands |
I - Regulation, Pathways, and Systems Biology |
|
I 45
I45 |
Based on a well established micro-fluidic assay it is possible to control the stimulation of ERK activation and monitor the responses at the single cell level for various inputs. The effort is to parametrize the system using dynamical modeling, and to characterize the responses by discriminating the main different behaviors based on critical parameter regimes. |
Fengos G*
*ETH Zurich Switzerland |
I - Regulation, Pathways, and Systems Biology |
|
I 47
I47 |
According to the “World malaria report 2011” by WHO, malaria remains a major healthcare issue that was responsible for 216 million cases and the death of about 655 000 people in 2010. Although efficient and cost-effective antimalarials are available, the drug resistance that the parasites eventually acquire in response to treatment makes the discovery of novel medicines urgent. While experimental studies of P. falciparum remain challenging, new in silico approaches have been developed to reconcile the current knowledge and generate testable hypotheses through computational simulations. To date, several genome-scale metabolic models have been created for P. falciparum. They have predicted a number of vulnerable points in its metabolism, which may serve as potential drug targets. Herein we present our elaboration of the most recent metabolic reconstruction of P. falciparum by implementing thermodynamic constraints. Consequently, our model does not require predefined directionality for the reactions; instead, it is determined by the respective thermodynamic properties and concentration ranges. By comparing the results of analogous simulations with and without thermodynamic constraints, we observed significant differences in the reversibility of the reactions and the essentiality of the genes. For instance, the thermodynamically consistent model predicted more double gene knockouts to be lethal for the parasite than the same model with predefined reaction directionalities. These results highlight the importance of accounting for elemental and charge balance and for the energetic properties of the reactions in metabolic reconstructions. Furthermore, our model is capable of incorporating metabolomic datasets as additional constraints to represent multiple metabolic states of the parasite. |
Tymoshenko S*, Soldati-Favre D, Hatzimanikatis V
*École Polytechnique Fédérale de Lausanne (EPFL) Switzerland |
I - Regulation, Pathways, and Systems Biology |
|
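A note on poster I47: the idea of letting thermodynamics and concentration ranges decide reaction directionality can be sketched as follows. This is a minimal illustration, not the authors' model. It assumes the textbook relation ΔG = ΔG° + RT ln Q; the function names, the example ΔG° values and the concentration bounds are all hypothetical.

```python
import math

R_KJ = 8.314462618e-3  # gas constant in kJ/(mol*K)

def delta_g_range(dg0, substrates, products, temperature=298.15):
    """Return (min, max) of dG = dg0 + R*T*ln(Q) over the concentration ranges.

    substrates, products: lists of (stoichiometry, (c_min, c_max)) in mol/L.
    dg0 is a hypothetical standard Gibbs energy in kJ/mol.
    """
    # ln(Q) is smallest with products at their minimum and substrates at their maximum.
    ln_q_min = (sum(n * math.log(lo) for n, (lo, hi) in products)
                - sum(n * math.log(hi) for n, (lo, hi) in substrates))
    ln_q_max = (sum(n * math.log(hi) for n, (lo, hi) in products)
                - sum(n * math.log(lo) for n, (lo, hi) in substrates))
    rt = R_KJ * temperature
    return dg0 + rt * ln_q_min, dg0 + rt * ln_q_max

def directionality(dg0, substrates, products):
    """Classify a reaction from the sign of its feasible dG range."""
    dg_min, dg_max = delta_g_range(dg0, substrates, products)
    if dg_max < 0:
        return "forward"
    if dg_min > 0:
        return "reverse"
    return "reversible"

# Illustrative concentration bounds (assumed): 10 uM to 10 mM for every metabolite.
bounds = (1e-5, 1e-2)
strongly_favourable = directionality(-30.0, [(1, bounds)], [(1, bounds)])
near_equilibrium = directionality(-5.0, [(1, bounds)], [(1, bounds)])
```

With these assumed bounds, RT ln Q can shift ΔG by roughly ±17 kJ/mol, so the first reaction stays forward-only while the second can run in either direction. A reaction's reversibility is therefore a consequence of the concentration ranges and need not be fixed in advance.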
I 48
I48 |
Alternative mRNA splicing is a major mechanism for gene regulation and transcriptome diversity. Despite the extent of the phenomenon, the regulation and specificity of the splicing machinery are only partially understood. Adenosine to inosine (A-to-I) RNA editing of pre-mRNA by ADAR (Adenosine De-Aminase Acting on RNA) enzymes has been linked to splicing regulation in several cases. In the current study, we used bioinformatics approaches and RNA-seq and exon-specific microarray of ADAR knockdown cells, to globally examine how ADAR and its A-to-I RNA editing activity influence alternative splicing.
While A-to-I RNA editing only rarely targets canonical splicing acceptor, donor and branch sites, it was found to affect splicing regulatory elements (SREs) within exons. Cassette exons were found to be significantly enriched with A-to-I RNA editing sites compared to constitutive exons. RNA-seq and exon-specific microarray revealed that ADAR knockdown in hepatocarcinoma and myelogenous leukemia cell lines leads to global changes in gene expression, with hundreds of genes changing their splicing patterns in both cell lines. Genes that showed significant changes in their splicing pattern are frequently involved in RNA processing and splicing activity. Analysis of recently published RNA-seq data from glioblastoma cell lines showed similar results. Our global analysis reveals that ADAR plays a major role in splicing regulation. Although direct editing of the splicing motifs does occur, we suggest this is not likely to be the primary mechanism for ADAR-mediated regulation of alternative splicing. Rather, alternative splicing regulation is achieved by modulating trans-acting factors involved in the splicing machinery. |
Solomon O*, Safran M, Deshet-Unger N, Akiva P, Jacob-Hirsch J, Cesarkas K, Kabesa R, Amariglio N, Unger R, Rechavi G, Eyal E
*Bar-Ilan University; Cancer Research Center, Sheba Medical Center Israel |
I - Regulation, Pathways, and Systems Biology |
|
I 49
I49 |
For many infectious diseases the initiation of the infection is only poorly understood. A variety of strategies exist for a pathogen to enter the host cell, but there are also common parts of pathways through which bacteria or viruses can trigger this process. Understanding the network of host genes that is involved in the entry mechanism is important for the identification of new potential drug targets.
The InfectX consortium performs genome wide RNAi gene silencing screens of human cells during infection with different pathogens. This effort results in a large image based data source to study and compare the entry mechanism for several viruses and bacteria. We gather the data from many pathogens for an individual and comparative analysis using different statistical models. Specifically, for each pathogen we perform hit selection using stability ranking to identify host genes whose knock-down has an effect on the infection process. We then integrate the results of all screens to find gene sets that are common for different diseases as well as genes that are distinct for individual pathogens. |
Siebourg J*, Beerenwinkel N
*ETH Zürich Switzerland |
I - Regulation, Pathways, and Systems Biology |
|
I 50
I50 |
As comparative expression analysis using techniques such as quantitative proteomics becomes increasingly common, large datasets can now be acquired rapidly for further analysis. Pathway-centric functional analysis of experimental data provides a new perspective on understanding the expression regulation associated with a wide range of biological conditions. Robust and high-throughput pathway analysis of quantitative expression data, however, has its challenges. We identify some of the challenges that stand in the way of the emergence of effective methods, and we introduce a novel method that we have recently developed to tackle some of them.
Our method (called FEvER) aims to identify pathways that are likely targets of differential expression, based on datasets from comparative expression studies. In order to infer an overall likelihood of expression regulation at the pathway level, our approach utilizes two fundamentally different enrichment models in parallel, with complementary statistical tests for evaluating the enrichment significance. The FEvER method contains a number of useful features and is designed to be computationally efficient. We have verified the method using a well studied system in a model organism, which we present in a recent study, together with functional analysis of several complicated datasets using FEvER. |
Kirik U*, Cifani P, Albrekt A, Lindstedt M, Heyden A, Levander F
*Lund University Sweden |
I - Regulation, Pathways, and Systems Biology |
|
I 51
I51 |
Gene transcription by RNA polymerase II (pol-II) is a key step in gene expression. Transcription consists of a number of dynamic events such as recruitment of pol-II to the promoter, elongation and termination.
In this work we present a mathematical model of transcription dynamics which allows us to compute RNA transcription speed and infer pol-II activity at the gene promoter using pol-II occupancy data measured using ChIP-seq. Our approach directly models the movement of pol-II down the gene body and allows us to determine the presence of transcriptionally engaged pol-II. This allows us to differentiate transcriptionally engaged pol-II from pol-II paused at the promoter yielding good estimates of transcription speeds and pol-II promoter activity. By clustering the inferred promoter activity time profiles, we are able to determine genes that respond quickly to stimuli and also group genes that are likely to be acting together. |
Maina C*, Peltonen J, Honkela A, Lawrence N, Rattray M
*University of Sheffield United Kingdom |
I - Regulation, Pathways, and Systems Biology |
|
I 52
I52 |
Motivation: Currently, predominant pathway analysis approaches treat pathways as collections of individual genes and consider all pathway members as equally informative. As a result, at times spurious and misleading pathways are inappropriately identified as statistically significant, solely due to components that they share with the more relevant pathways.
Results: We observed that membership of a gene in multiple pathways is the norm, not the exception, yet in most pathway repositories, the vast majority of co-annotated gene pairs share only one pathway annotation. Based on this observation, we introduce the concept of Pathway Gene-Pair Signatures (Pathway-GPSs) as pairs of genes that, as a combination, are specific to a single pathway. We devised and implemented a novel approach to pathway analysis, Signature Over-representation Analysis (SIGORA), which focuses on the statistically significant enrichment of Pathway-GPSs in a user specified gene list of interest. In a comparative evaluation of several published datasets, SIGORA outperformed traditional methods by delivering biologically more plausible and relevant results. This suggests that our approach provides a useful complementary tool for pathway analysis. Availability: An efficient implementation of SIGORA, as an R package with precompiled GPS data for human and mouse pathway annotations from KEGG, Biocarta, REACTOME and INOH, is freely available for download from CRAN. |
Foroushani ABK*, Brinkman FSL, Lynn DJ
*Simon Fraser University // Teagasc, Ireland |
I - Regulation, Pathways, and Systems Biology |
|
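A note on poster I52: the definition of a Pathway Gene-Pair Signature, a pair of genes whose co-annotation is specific to a single pathway, can be illustrated with a short sketch. This is not the SIGORA implementation and omits the enrichment test; the function name and the toy pathway memberships are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def pathway_gene_pair_signatures(pathways):
    """Map each pathway to the gene pairs that co-occur in that pathway only.

    pathways: dict mapping a pathway name to an iterable of gene identifiers.
    """
    pair_to_pathways = defaultdict(set)
    for name, genes in pathways.items():
        # Sorting makes each unordered pair appear in one canonical order.
        for pair in combinations(sorted(set(genes)), 2):
            pair_to_pathways[pair].add(name)

    signatures = defaultdict(set)
    for pair, names in pair_to_pathways.items():
        if len(names) == 1:  # the pair is specific to exactly one pathway
            signatures[next(iter(names))].add(pair)
    return dict(signatures)

# Toy annotation (assumed): genes b and c belong to both pathways.
toy_pathways = {
    "P1": ["a", "b", "c"],
    "P2": ["b", "c", "d"],
}
gps = pathway_gene_pair_signatures(toy_pathways)
```

In the toy data, neither b nor c alone distinguishes P1 from P2, and the pair (b, c) is shared, so it is discarded. The pairs (a, b) and (a, c) point only to P1, and (b, d) and (c, d) only to P2.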
I 54
I54 |
Knowledge about the structure of regulatory networks is only the first step to understanding their functionality. Construction of dynamical models, however, is a difficult task since the information needed to determine parameters is often incomplete. Discrete models of biological networks, based on qualitative observations, have stirred interest by producing meaningful insights. Despite the high level of abstraction employed, the problem of parameter identification still has to be confronted, since the number of models consistent with a given network structure, though finite, is generally very large. Incomplete information about the system then necessitates approaches allowing to systematically classify the pool of feasible models by studying dynamical properties.
To address this need, we developed a tool with the following functionalities. (1) It enumerates the pool of feasible models by a rich CP based description language including interaction labels, parameter value constraints, focal equations, feedback circuit functionality, time series and attractor constraints. If exhaustive enumeration is not possible, sampling and parameter value perturbation techniques may be employed. (2) It annotates models by class labels for properties like CTL/LTL specifications, number and type of attractors and length of shortest simulation reproducing a time series. A thorough description of the interface between the main program and the classifier algorithms allows researchers to customize the tool by adding their own classifiers. (3) It returns a SQL database of annotated models for analysis. Our contributions are a unified framework of property based model classification, a database to organize results and an open interface to further classifiers. |
Klarner H*, Siebert H
*Freie Universität Berlin Germany |
I - Regulation, Pathways, and Systems Biology |
|
I 55
I55 |
RNA interference (RNAi) is a technology that allows the identification of genes involved in a biological process at a genome-wide scale. However, due to off-target effects and low knock-down efficacy of siRNAs, data from RNAi screens often exhibit a high amount of noise. The identification of potential hits, e.g. for a follow-up in vivo screen, based solely on the screening readout therefore commonly gives rise to a significant amount of false positives and false negatives.
Here, we present a semi-supervised learning framework, integrating readout from RNAi screens with functional association data, for the robust identification and prioritisation of hits. We exploit the fact that directly or indirectly interacting genes are likely to be involved in the same biological process. Our model provides a gene ranking, including genes lacking screening readout, which is sufficiently smooth with respect to both the underlying network structure and the raw readout. We demonstrate that the smoothed ranking obtained by our framework shows a significant increase of true hits among the top-ranking genes when compared to the ranking based on the raw readout only. Moreover, the presented framework can be used to predict novel hit candidates, even in the case of missing screen readout. |
Schmich F*, Mohammadi P, Merdes G, Beerenwinkel N
*ETH Zurich, D-BSSE, Computational Biology Group Switzerland |
I - Regulation, Pathways, and Systems Biology |
|
I 56
I56 |
Motivation: The identification of the information flow through the cell from external stimuli via protein signaling to the transcriptome response is essential to understand how a cell decides to proliferate, migrate or differentiate. Problem statement: Transcriptome-based pathway analyses usually assume a correlation between gene expression and protein abundance. However, in the dynamic context of cellular decisions, this ignores the different time-scales of fast protein signaling and slow gene regulation, wherein fast protein events such as phosphorylation are "hidden" from transcriptome dynamics. Approach: Here we take a time-scale separation approach to infer causal pathways by combining transcript profiling and protein interaction network data. We identify putative transcription factors (TF) from the slow transcriptome response by gene set enrichment analysis and link TF regulation to the respective upstream membrane receptors. Back-tracing is performed using the mean first passage time of a random walk on the protein-protein network to obtain the most probable reaction path. Thereby we overcome the coarseness of the standard shortest-path approaches. Results: We apply our method to the transcriptome dynamics obtained from double paracrine communication between primary human keratinocytes and fibroblasts in vitro. We could link the time-dependent enrichment of transcription factors to a specific regulation of upstream receptors using our novel random walk approach. In contrast, no specific regulatory link could be determined using the shortest path between TFs and receptors. Conclusions: Our random walk approach connects slow transcriptome dynamics with the fast upstream protein signaling pathways. As a sensitive and specific measure it resolves the parallel cellular information flow in an unprecedented manner and sheds new light on the complex orchestration of signaling in cellular decisions. |
Bao J*, Weber S, Nascimento J, Boerries M, Busch H
*FRIAS-LifeNet, University of Freiburg Germany |
I - Regulation, Pathways, and Systems Biology |
|
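A note on poster I56: the mean first passage time (MFPT) of a random walk, used there for back-tracing from transcription factors to receptors, can be computed on a small graph as sketched below. This is a generic illustration assuming an unbiased walk on an undirected, connected network. It is not the authors' pipeline, and the node names (R1, R2, K1, K2, K3, TF) are hypothetical.

```python
def mean_first_passage_times(adjacency, target):
    """Expected number of steps for an unbiased random walk to first hit `target`.

    adjacency: dict mapping node -> list of neighbours.
    Solves m[i] = 1 + sum_j P[i][j] * m[j] for i != target, with m[target] = 0.
    """
    nodes = [n for n in adjacency if n != target]
    index = {n: k for k, n in enumerate(nodes)}
    size = len(nodes)
    # Augmented matrix of (I - Q) m = 1, with Q the walk restricted to non-target nodes.
    rows = [[0.0] * (size + 1) for _ in range(size)]
    for n in nodes:
        i = index[n]
        rows[i][i] += 1.0
        rows[i][size] = 1.0
        step = 1.0 / len(adjacency[n])
        for neighbour in adjacency[n]:
            if neighbour != target:
                rows[i][index[neighbour]] -= step
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        pivot_value = rows[col][col]
        rows[col] = [v / pivot_value for v in rows[col]]
        for r in range(size):
            if r != col and rows[r][col] != 0.0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    times = {n: rows[index[n]][size] for n in nodes}
    times[target] = 0.0
    return times

# Toy network (assumed): R1 - K1 - TF - K3 - K2 - R2
network = {
    "R1": ["K1"], "K1": ["R1", "TF"],
    "TF": ["K1", "K3"],
    "K3": ["TF", "K2"], "K2": ["K3", "R2"], "R2": ["K2"],
}
mfpt = mean_first_passage_times(network, "TF")
closest_receptor = min(["R1", "R2"], key=mfpt.get)
```

On this toy chain the walk needs on average 4 steps from R1 and 9 steps from R2 to reach TF. In a branched network, MFPT can rank candidates differently from shortest-path length because it reflects all routes and node degrees, which is the kind of distinction the abstract draws against shortest-path approaches.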
I 57
I57 |
The immune system is one of the most complex signal processing machineries in biology. The adaptive immune system, consisting of B and T lymphocytes, is activated in response to a large spectrum of pathogen antigens. B cells recognize and bind the antigen through B-cell receptors (BCRs), and this is fundamental for B-cell activation. However, the system response is dependent on BCR-antigen affinity values that span several orders of magnitude. Moreover, the ability of the BCR to discriminate between affinities at the high end (e.g. $10^{9} M^{-1}$ - $10^{10} M^{-1}$) challenges the formulation of a mathematical model able to robustly separate these affinity-dependent responses.
Queuing theory enables the analysis of many related processes, such as those resulting from the stochasticity of protein binding/unbinding events. Here we define a network of queues, consisting of BCR early signaling states and transition rates related to the propensity of molecular aggregates to form or disassemble. By considering the family of marginal distributions of BCRs in a given signaling state, we report a significant separation (measured as JS-divergence) that arises from a broad spectrum of antigen affinities. |
Felizzi F*, Comoglio F
*ETH Zurich Switzerland |
I - Regulation, Pathways, and Systems Biology |
|
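A note on poster I57: the JS-divergence used there to measure the separation between distributions is the Jensen-Shannon divergence. A minimal sketch for discrete distributions follows; the example distributions are made up and are not BCR data.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and bounded by 1 bit with log base 2."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)

# Hypothetical occupancy distributions over two signaling states.
identical = js_divergence([0.5, 0.5], [0.5, 0.5])
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
```

Identical distributions give 0 and distributions with disjoint support give the maximum of 1 bit, so the value is easy to read as a degree of separation.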
I 58
I58 |
The improvement of several areas such as biomedical research and biotechnological production greatly benefits from the results provided by dynamic models of the underlying processes or phenomena. In particular, the field of Metabolic Engineering takes advantage of this type of formulation, namely to find optimal environmental conditions or to discover optimal sets of genetic manipulations for the design of efficient microbial strains. Here we studied the dynamics and the biochemical basis of the central metabolism in Lactococcus lactis, usually used in the dairy industry.
Although several kinetic models have been developed to study the enzymatic regulation [1] and metabolic flux analysis [2,3] in L. lactis, most of these are very simplified and do not take into account the acetoin, butanediol and mannitol production pathways. Therefore, they cannot be useful in the design of an engineering strategy for the production of these industrially important polyol products by L. lactis. In this work, we present the modeling of the main metabolism in L. lactis, expanding and refining the most recent kinetic model, developed by Levering et al. [4], by including the acetoin, butanediol and mannitol pathways. For modeling we used a set of ordinary differential equations (ODEs) with approximate Michaelis-Menten-like kinetics for each reaction. We then parameterized and validated the model using time series metabolite concentration data obtained in our group by Nuclear Magnetic Resonance (NMR). The results show that this model can provide a very promising tool to elucidate the effect of alterations in the whole main metabolism of L. lactis. Reference List 1. Voit E, Neves AR, Santos H: The intricate side of systems biology. Proceedings of the National Academy of Sciences of the United States of America 2006, 103: 9452-9457. 2. Hoefnagel MHN, Starrenburg MJC, Martens DE, Hugenholtz J, Kleerebezem M, Van Swam II et al.: Metabolic engineering of lactic acid bacteria, the combined approach: kinetic modelling, metabolic control and experimental analysis. Microbiology 2002, 148: 1003-1013. 3. Oh E, Lu M, Park C, Oh HB, Lee SY, Lee J: Dynamic modelling of lactic acid fermentation metabolism with Lactococcus lactis. Journal of Microbiology and Biotechnology 2011, 21: 162-169. 4. Levering J, Musters MWJM, Bekker M, Bellomo D, Fiedler T, de Vos WM et al.: Role of phosphate in the central metabolism of two lactic acid bacteria - a comparative systems biology approach. FEBS J 2012, 279: 1274-1290. |
Costa RS*, Hartmann A, Gaspar P, Neves AR, Vinga S
*Instituto de Engenharia de Sistemas e Computadores (INESC-ID) Portugal |
I - Regulation, Pathways, and Systems Biology |
|
I 60
I60 |
Epigenetics is the study of heritable changes in gene function that cannot be explained by changes in the DNA sequence. DNA methylation, one of the most studied epigenetic processes, involves the addition of a methyl group to the carbon-5 position of the cytosine base of DNA. Aberrant DNA methylation patterns are known to be associated with various types of diseases such as cancer. However, the epigenetic mechanism leading from the normal state to the metastatic state of tumorigenesis is poorly understood. In this context, we aim to profile the epigenetic signals giving rise to tumor progression in a mouse carcinogenesis system. To address this, we have analyzed DNA methylation and gene expression datasets of four different cell lines, each representing a tumorigenic state. We have identified the genes enriched in methylation using a model-based and non-parametric approach. We then study differentially methylated regions (DMRs) and correlate them with the gene expression dataset. Gene ontology analyses of genes enriched in methylation are in concordance with tumor progression activity. Partial results suggest a possible functional role for the highly methylated genes in the transition from a non-tumorigenic state to a metastatic state. |
Modhukur V*, Kull M, Vilo J
*University of Tartu Estonia |
I - Regulation, Pathways, and Systems Biology |
|
I 61
I61 |
Proteins are the direct mediators of cellular processes and their abundance determines the physiological state of the cell. Recent genome-scale mRNA and protein measurements in Escherichia coli have estimated that only ~40% of the variation in protein abundance is explained by differential mRNA concentration. Post-transcriptional regulation thus acts as a major determinant of protein levels, and it has been demonstrated that sequence-related features can accurately predict some of its effects. Here, we sought an integrative approach in which mRNA concentration and gene-related features were combined to yield a predictive model explaining more than 60% of the variation in protein abundance. The model suggests a complex organization of E. coli regulation where transcriptional regulation is preferred, followed by translation elongation and translational initiation. We also found that steady-state levels of proteins are partially determined by their function and localization. Furthermore, the model demonstrated good predictive performance on unseen data as well as across different E. coli strains, outperforming the most commonly used protein abundance predictors, such as mRNA concentration and codon adaptation index (CAI). Quantitative models for protein abundance are expected to improve genome-scale metabolic models and accelerate the design of synthetic genetic circuits, with direct implications for biotechnology applications. |
Guimaraes JC*, Rocha M, Arkin AP
*University of California, Berkeley United States of America |
I - Regulation, Pathways, and Systems Biology |
|
I 62
I62 |
The Systems Biology Knowledgebase (KBase) has two central goals. The scientific goal is to produce predictive models, reference datasets and analytical tools and demonstrate their utility in DOE biological research relating to bioenergy, carbon cycle, and the study of subsurface microbial communities. The operational goal is to create the integrated software and hardware infrastructure needed to support the creation, maintenance and use of predictive models and methods in the study of microbes, microbial communities and plants. The microbial communities component will be focused on building the computational infrastructure to understand the community function and ecology through study of genomic and functional data and integration of community models with single-organism models. This will allow for researching community behavior and building predictive models of communities in their role in the environmental processes and the discovery of useful enzymes.
The KBase microbial communities team will integrate both existing and new tools and data into a single, unified framework that is accessible programmatically and through web services. This will allow the construction of sophisticated analysis workflows by facilitating the linkages between data and analysis methods. The standardization, integration and harmonization of diverse data types housed within the KBase and data located on servers maintained by the larger scientific community will allow for a single point of access, ensuring consistency, quality assurance, and quality control checks of data. |
Meyer F*, Chivian D, Wilke A*, Desai N, Wilkening J, Keegan K, Trimble W, Keller K, Dehal P, Cottingham R, Maslov S, Stevens R, Arkin A
*Argonne National Laboratory United States of America *Argonne National Laboratory United States of America |
I - Regulation, Pathways, and Systems Biology |
|
I 63
I63 |
Robustness of Network-based Disease Gene Prioritization Methods Points to Pathophenotypic Plasticity
Complex biological systems usually pose a trade-off between robustness and fragility, where a small number of perturbations can substantially disrupt the system. In this study, we have hypothesized that disease-gene prioritization methods based on the network of protein-protein interactions can be employed to investigate the mechanistic relationships between disease-genes and explain the robustness emerging from these relationships. We have tested the robustness of several network-based disease-gene prioritization methods with respect to perturbations of the system using various disease phenotypes from the Online Mendelian Inheritance in Man database. As the network-based disease-gene prioritization methods are based on the connectivity between known disease-genes, we have further used these methods to understand the plasticity of the pathophenotypes. Our results suggest that pathophenotypes such as breast cancer, diabetes and obesity show more plasticity than the other pathophenotypes examined. |
Guney E*, Oliva B
*Universitat Pompeu Fabra Spain |
I - Regulation, Pathways, and Systems Biology |
|
I 64
I64 |
With the objective to identify novel candidates involved in the regulation of NADP(H)-homeostasis, we are exploring the utility of integrative analysis of different gene/protein/metabolite-based networks constructed using separate data sets. The combined experimental and meta-level analysis is carried out for two different prokaryotic organisms, Escherichia coli and Synechocystis sp. PCC 6803. Although both organisms are official "model" organisms, in practice the former organism is studied more extensively. The relative lack of understanding and online resources for Synechocystis makes the study of this organism more challenging, thereby presenting a test-case for the utility of gene family assignments.
Two network types are central to the meta-analysis of both species: (1) transcriptome and (2) bibliome. Co-expression networks were constructed using a collection of published microarray data. For the bibliome analysis, we commence by building a single, organism-independent network based on text mining all literature accessible in PubMed and PubMed Central Open Access. Thereafter, we distinguish between the two species by creating networks using only those relations that can be linked to genes in each respective organism via gene family assignment [1]. Multiple networks are thereafter merged and searched for clusters and smaller motifs linked to seed genes encoding proteins known to participate in NADPH-metabolism. The meta-level analysis is complemented by physiological and metabolic flux analysis of strains with selected perturbations of NADPH-metabolism. The experimental and computational data are cross-verified in both directions. The combined analyses allowed several novel candidate genes to be identified. [1] Kaewphan et al. BioTxtM 2012. http://bioinformatics.psb.ugent.be/pdf/publications/kaewphan_et_al.pdf |
Kreula S*, Kaewphan S, Van Landeghem S, Van de Peer Y, Ginter F, Jones P
*Bioenergy group, University of Turku Finland |
I - Regulation, Pathways, and Systems Biology |
|
I 65
I65 |
The identification of cis-acting elements on DNA is crucial for the understanding of the complex regulatory networks that govern many cell mechanisms. However, this task is very complex, since it is estimated that there are 1500 different transcription factors (TFs) in the human genome, each of which can bind to multiple loci directly or indirectly. The standard computational approach is the use of a position weight matrix (PWM) to represent the binding preference of a transcription factor and the use of statistical procedures to detect genomic regions with high binding scores. Given the small and degenerate signals of most PWMs, such an approach suffers from a very high number of false positive hits. Current research has shown that genome-wide assays reflecting open chromatin, such as DNase digestion or histone modifications, can improve sequence-based detection of the binding locations of transcription factors that are active in a particular cell type. We propose here a Multivariate Hidden Markov Model that is able to improve the prediction of transcription factor binding locations by integrating DNase digestion and histone modification data. Our methodology improves sensitivity, in comparison to existing methods, with little or no effect on specificity. This study shows that it is possible to improve the predictive power for cis-acting elements by correctly integrating DNase and histone modification data, allowing for more sophisticated studies using a larger set of epigenetic signals. |
Gesteira Costa Filho I*, Gade Gusma E
*RWTH University Hospital Aachen Germany |
I - Regulation, Pathways, and Systems Biology |
|
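The position weight matrix (PWM) scanning step that abstract I 65 describes as the standard approach can be sketched as follows. The count matrix, pseudocount, uniform background and function names are illustrative assumptions; the multivariate HMM that integrates DNase and histone data is not shown.

```python
import math

# Hypothetical count matrix for a 4-bp motif: one dict of base counts per position.
COUNTS = [
    {"A": 8, "C": 1, "G": 0, "T": 1},
    {"A": 0, "C": 9, "G": 1, "T": 0},
    {"A": 1, "C": 0, "G": 9, "T": 0},
    {"A": 0, "C": 1, "G": 0, "T": 9},
]


def log_odds_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert base counts into log2 odds against a uniform background."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2((column[base] + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pwm


def scan(sequence, pwm):
    """Score every window of the sequence; returns (start, score) pairs."""
    width = len(pwm)
    return [
        (start, sum(pwm[j][sequence[start + j]] for j in range(width)))
        for start in range(len(sequence) - width + 1)
    ]


def best_hit(sequence, pwm):
    """Return the (start, score) pair with the highest score."""
    return max(scan(sequence, pwm), key=lambda hit: hit[1])


if __name__ == "__main__":
    print(best_hit("TTACGTAA", log_odds_pwm(COUNTS)))
```

Because short, degenerate motifs score well at many positions by chance, a threshold on these scores alone tends to produce the many false positives noted in the abstract.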
I 66
I66 |
The differentiation of stem cells into fully differentiated cell types is controlled by chromatin modifications and the activity of transcription factors. These regulatory signals often act in a multi-functional manner, with a given DNA or histone modification causing repression or activation of transcription depending on its location in the promoter and on other regulatory signals that may be present. This work aims to analyze the pattern of regulation of gene expression in a set of embryonic stem cells and cells of the hematopoietic system (blood cells), which has traditionally been the subject of intense experimental investigation. For this purpose, we used a mixture model of sparse linear regressions on histone modifications and transcription factor binding sites. Thus, it was possible to estimate coefficients representing the relative importance of each signal for groups of regulatory genes. Furthermore, the coefficients also indicate whether a signal activates or represses transcription. The model was able to improve the prediction of gene expression relative to a simple linear model and also performed a selection of regulatory signals in the analysis of all transcription factors with known motifs (> 600). The method identified interesting routes to histone modifications and a selection of transcription factors related to the development of cells of the hematopoietic system and to chromatin remodeling. |
Rêgo T*, Costa I, Carvalho F, Roider H
*Universidade Federal de Pernambuco Brazil |
I - Regulation, Pathways, and Systems Biology |
|
I 67
I67 |
miRNAs have recently been shown to play a key role in cell senescence by downregulating target genes. Thus, inference of those miRNAs that critically downregulate target genes is important. However, inference of target gene regulation by miRNAs is difficult and is often achieved simply by investigating significant upregulation during cell senescence. Here, we inferred the regulation of target genes by miRNAs, using the recently developed MiRaGE server, together with the change in miRNA expression during fibroblast IMR90 cell senescence. We revealed that the simultaneous consideration of 2 criteria, the up(down)regulation of miRNAs and the down(up)regulation of their target genes, yields more feasible miRNAs, i.e., those that are most frequently reported to be down/upregulated and/or to possess biological backgrounds that induce cell senescence. Thus, when analyzing miRNAs that critically contribute to cell senescence, it is important to consider the level of target gene regulation simultaneously with the change in miRNA expression. |
Taguchi Y*
*Chuo University Japan |
I - Regulation, Pathways, and Systems Biology |
|
I 68
I68 |
Secondary metabolites (SM) are pharmaceutically important substances produced mostly by fungi and bacteria. Genes involved in the SM biosynthesis are often co-regulated and organized in clusters, which can be regulated by cluster-specific transcription factors (csTFs). The problem of the cluster prediction remains a hot topic in the fungal secondary metabolite research.
We suggest a novel, non-similarity-based approach to the prediction of SM clusters based on the density of the binding motifs for the csTFs. Occurrences of the cluster-specific motifs should be more frequent within the clusters and less probable in other parts of the genome. Note that their occurrence outside the cluster is not excluded. The algorithm searches for motif-enriched regions, favoring consecutive promoters that possess motifs but also allowing gaps in the clusters. The algorithm starts with the prediction of over-represented common motifs in an interim set of promoters around the SM synthase. Each significant motif is then searched for in all promoter sequences of the corresponding chromosome. The sequence of promoters is scanned by a sliding window that counts the number of motifs found per frame. The optimal result should be obtained for the frame equal to the cluster length. Considering different frame lengths makes it possible to find the real cluster length. The effectiveness of the method was demonstrated by re-identification of several functionally characterized clusters with known borders. We also show that the method works on completely unknown clusters. Importantly, the method provides additional information about the binding sites for cluster-specific regulators. |
Wolf T, Shelest V, Shelest E*
*Hans Knoell Institute, HKI Germany |
I - Regulation, Pathways, and Systems Biology |
|
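The sliding-window scan over promoters described in abstract I 68 can be sketched as below. The input is assumed to be a 0/1 list marking which consecutive promoters on a chromosome carry the motif. The scoring rule (observed count minus the count expected from the chromosome-wide motif frequency) is an assumption chosen for illustration, not the authors' published statistic.

```python
def window_counts(has_motif, frame):
    """Number of motif-bearing promoters in each window of `frame` promoters."""
    if frame <= 0 or frame > len(has_motif):
        return []
    count = sum(has_motif[:frame])
    counts = [count]
    for i in range(frame, len(has_motif)):
        # Slide by one promoter: add the entering one, drop the leaving one.
        count += has_motif[i] - has_motif[i - frame]
        counts.append(count)
    return counts


def best_window(has_motif, frames):
    """Return (score, start, frame) of the most motif-enriched window.

    Score = observed count minus the count expected from the overall
    motif frequency (an illustrative choice that tolerates gaps).
    """
    background = sum(has_motif) / len(has_motif)
    best = None
    for frame in frames:
        for start, count in enumerate(window_counts(has_motif, frame)):
            score = count - frame * background
            if best is None or score > best[0]:
                best = (score, start, frame)
    return best


# Hypothetical chromosome: 1 = promoter contains the cluster-specific motif.
PROMOTERS = [0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0]

if __name__ == "__main__":
    print(best_window(PROMOTERS, range(2, 11)))
```

With this toy input the best-scoring window starts at promoter 6 and spans 5 promoters, which includes one gap, so trying several frame lengths recovers both the position and the length of the candidate cluster.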
I 69
I69 |
Formation of the Dpp gradient: Restricted Extracellular Diffusion or Receptor Mediated Transcytosis?
Decapentaplegic (Dpp) forms a long range gradient and acts as a morphogen in the early Drosophila wing imaginal disc patterning.
Two opposing mechanisms describing Dpp gradient establishment have long coexisted: Restricted Extracellular Diffusion (RED) and Receptor Mediated Transcytosis (RMT). In the RED mechanism, Dpp diffuses in the extracellular matrix (with local binding to the Thickveins (Tkv) receptors mediating the internalization), whereas in the RMT mechanism transport from the production region is achieved after internalization by cell-to-cell transcytosis. In recent work, we have proposed a mathematical framework which a priori allows for both mechanisms to co-exist. Comparing our theoretical predictions to wild-type and Tkv-mutant experimental data, we were able to conclude that transcytosis can only play a minor role and that the main mechanism leading to the establishment of a long-range Dpp gradient is extracellular diffusion. |
Dalessi S*, Schwank G, Basler K, Bergmann S
*University of Lausanne (UNIL) Switzerland |
I - Regulation, Pathways, and Systems Biology |
|
I 70
I70 |
Motivation: There have been many successful experimental and bioinformatics efforts to elucidate transcription factor (TF)-target networks in several organisms. For many organisms, these annotations are complemented by miRNA-target networks of good quality. Attempts that use these networks in combination with gene expression data to draw conclusions on TF or miRNA activity are, however, still relatively sparse.
Results: In this study, we propose Bayesian inference of regulation of transcriptional activity (BIRTA) as a novel approach to infer both TF and miRNA activities from combined miRNA and mRNA expression data in a condition-specific way. That is, our model explains mRNA and miRNA expression for a specific experimental condition by the activities of certain miRNAs and TFs, hence allowing for differentiating between switches from active to inactive (negative switch) and inactive to active (positive switch) forms. Extensive simulations of our model reveal its good prediction performance in comparison to other approaches. Furthermore, the utility of BIRTA is demonstrated using Escherichia coli data comparing aerobic and anaerobic growth conditions, and human expression data from pancreas and ovarian cancer. Availability and implementation: The method is implemented in the R package birta, which is freely available for Bioconductor (>=2.10) on http://www.bioconductor.org/packages/release/bioc/html/birta.html. |
Fröhlich H*
*University of Bonn Germany |
I - Regulation, Pathways, and Systems Biology |
|
I 71
I71 |
In systems biology many applications are based on predictions of regulatory patterns of different levels, including regulated genes, transcription factors (TFs) and their binding sites (TFBSs). Reliable prediction of TFBSs is one of the crucial points for the transcriptional network reconstruction. Moreover, the state of the art in promoter modeling for higher eukaryotes is predicting not single TFBSs, but their combinations. Thus, it is reasonable to supplement the predictions of binding sites for single TFs by further search of their combinations.
SiTaR-D is a further development and merging of two tools previously developed in our group: SiTaR, an approach to the prediction of single TFBSs, and DistanceScan, a tool for searching for TFBS combinations. Initially, DistanceScan was based on Match output. In SiTaR-D, the DistanceScan approach is applied to the SiTaR search results, which proved to be superior not only to Match, but also to other prominent PWM-based tools. Both SiTaR and DistanceScan are based on modeling random distributions (of motifs (SiTaR) or distances (DS)) and comparing them with the distributions observed in the query sequences. We apply the modified merged version of SiTaR-D to the prediction of functional combinations in secondary metabolite gene clusters in fungi. For this purpose, SiTaR-D is directly linked to our new database of fungal-specific TFs, FunTF. FunTF contains more than 250 experimentally verified fungal TFs and 180 TFBSs and is, to the best of our knowledge, the only manually curated TF/TFBS database developed specifically for fungal species. |
Fazius E, Shelest V, Shelest E*
*Hans Knoell Institute, HKI Germany |
I - Regulation, Pathways, and Systems Biology |
|
I 72
I72 |
It is widely appreciated that the binding of the transcriptional machinery (RNA polymerase II, pol II), transcription and gene expression are controlled by a complex interplay of regulatory factors, including the binding of specific transcription factors in promoter and enhancer regions, and the covalent modification of chromatin and DNA. Using ChIP-seq methods it has recently become possible to quantify these factors with high resolution and sensitivity on a whole-genome scale, motivating the construction of statistical models linking transcriptional outputs to regulatory inputs. The aim of such models is to identify predictive factors that lead to mechanistic hypotheses. Using data from mouse macrophage cells, we constructed stepwise linear regression models predicting pol II binding from histone modifications (H3K4Me3, H3K4Me1 and DNase I hypersensitivity) and transcription factor (CEBPα, CEBPβ, p65, PU1, BCL6) binding. The models are highly predictive, with R² values of around 75%. Intriguingly, and in common with other recent work, we found that simple models involving just chromatin structure and modifications were as predictive as models containing specific transcription factors. We followed up this observation using data from human embryonic stem cells, where a much wider range of data is available (23 transcription factors and 24 histone modifications), necessitating the use of a LASSO regression method to select predictive variables. From this work a more complex picture emerges, with a much broader range of interaction effects between histone modifications and transcription factors on pol II binding. |
Ward J*, Westhead D
*University of Leeds United Kingdom |
I - Regulation, Pathways, and Systems Biology |
|
I 73
I73 |
Signaling pathways are the key means of communication between cells. Deciphering crosstalk mechanisms between signaling pathways is an appealing research direction towards the functional characterization and elucidation of living cells.
We have developed an efficient computational method for inferring crosstalk mechanisms underlying signaling pathway interactions. We investigated 8 active human signaling pathways published in the SignaLink database (http://www.signalink.org/), i.e., EGF/MAPK, Insulin/IGF, TGF-beta, Wingless/WNT, Hedgehog, JAK/STAT, Notch, and Nhr. We reconstructed and analyzed the molecular interaction networks of the 8 pathways. Using the shortest path algorithm, a total of 27,844 paths (each from one ligand to one transcription factor) were found, with lengths varying from l = 0 (direct interaction) to l = 9. The paths were ordered cascades of directed interactions. We then integrated the ordered cascade data and Gene Ontology data by employing the Microsoft Sequence Clustering algorithm, which is a hybrid algorithm combining clustering techniques with Markov chain analysis. We carried out clustering with both the k-means method and the Expectation Maximization (EM) method. The EM method produced better results than the k-means method. The validation of different data combinations showed that the ordered path data enhanced the clustering performance. The proposed method could reveal the crosstalk modules between pathways and also the order of those modules in transmitting signals. In addition to the modules confirmed in the KEGG database and the Cell Signaling database, some inferred crosstalk modules suggest testable biological hypotheses and unveil an essential functionality of crosstalk networks and their biological relevance. |
Nguyen P*, Priami C
*The Microsoft Research -University of Trento Centre for Computational Systems Biology Italy |
I - Regulation, Pathways, and Systems Biology |
|
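The ligand-to-transcription-factor path search in abstract I 73 can be illustrated with a breadth-first search on a directed graph. The toy network and node names below are placeholders, and counting length as the number of intermediates (so that a direct interaction has length 0) is one reading of the abstract rather than a confirmed definition.

```python
from collections import deque

# Hypothetical directed interactions: source -> list of targets.
NETWORK = {
    "LigandA": ["ReceptorA"],
    "ReceptorA": ["Kinase1", "Kinase2"],
    "Kinase1": ["TF1"],
    "Kinase2": ["Kinase1", "TF2"],
    "LigandB": ["TF2"],
}


def shortest_path(network, source, target):
    """Breadth-first search; returns one shortest directed path or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk back through the parent links to rebuild the cascade.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in network.get(node, []):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None


def cascade_length(path):
    """Number of intermediates between ligand and TF (0 = direct interaction)."""
    return len(path) - 2


if __name__ == "__main__":
    path = shortest_path(NETWORK, "LigandA", "TF1")
    print(path, cascade_length(path))
```

Each returned path is an ordered cascade, which is the kind of sequence data that a sequence clustering method can take as input.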
I 74
I74 |
Multimodal oncological strategies which combine chemotherapy or radiotherapy with hyperthermia (i.e. raising the temperature of a region of the body affected by cancer) have the potential to improve the efficacy of non-surgical methods of cancer treatment. Hyperthermia engages the heat-shock response mechanism (HSR), whose main component (the heat-shock proteins) is known to directly prevent the intended apoptosis of cancer cells. Moreover, cancer cells can have an already partially activated HSR, so hyperthermia may be more toxic to them than to normal cells. However, HSR triggers thermotolerance, i.e. hyperthermia-treated cells show reduced susceptibility to a subsequent heat-induced stress. For that reason, the application of combined hyperthermia therapy should be carefully examined.
We adapt the Szymańska & Żylicz (2009) model and propose its stochastic extension, which we then analyze using the approximate probabilistic model checking (APMC) technique. We establish a global function of the level of protein denaturation and a correct protein denaturation rate. Next, we formalize the notion of thermotolerance and compute the size and the duration of the HSR-induced thermotolerance. Finally, we quantify the effect of a combined therapy of hyperthermia and cytotoxic inhibition of protein refolding. By mechanistic modelling of HSR we are able to support the common belief that the combination of cancer treatment strategies increases therapy efficacy. Moreover, our results demonstrate the feasibility and practical potential of APMC in the analysis of stochastic models of signaling pathways. Acknowledgements: This work was partially supported by the Biocentrum Ochota project POIG.02.03.00-00-003/09, and by Polish NSC grants 2011/01/B/NZ2/00864, 2011/01/D/ST1/04133. MR is a scholar within the Human Capital Operational Programme financed by ESF and state budget. |
Rybiński M*, Szymańska Z, Lasota S, Gambin A
*Mossakowski Medical Research Centre, Polish Academy of Sciences Poland |
I - Regulation, Pathways, and Systems Biology |
|
I 75
I75 |
Histone proteins are the building blocks of eukaryotic chromatin, and are essential for genome packaging, function and regulation. The replication-dependent histones must be tightly regulated with the cell cycle as they are primarily required during S-phase. Surprisingly, little is known about the transcriptional regulation of histone gene expression. Here, we conducted a comprehensive computational analysis, based on genome-wide ChIP-seq/ChIP-chip data of over 50 transcription factors and histone modifications in embryonic stem cells. Based on enrichment scores supported by gene expression data of knockout studies, we propose that E2f1 and E2f4 are master regulators of the histone gene family, that CTCF and Zfx are repressors of core and linker histones, respectively, and that Smad1, Smad2, YY1 and Ep300 are restricted or cell-type specific regulators. Our analyses suggest that the regulation of histone gene transcription is significantly more complex than previously perceived, and that a combination of factors orchestrates histone gene regulation, from strict synchronization with S-phase to targeted regulation of specific histone subtypes. |
Gokhman D*, Livyatan I, Sailaja BS, Melcer S, Meshorer E
*The Hebrew University of Jerusalem Israel |
I - Regulation, Pathways, and Systems Biology |
|
I 76
I76 |
A common approach in the analysis of functional genomics data is the identification of differentially regulated cellular pathways by comparing the expression levels of pathway-representing gene or protein sets across different sample groups. Statistical methods for this purpose typically compare measures of the average gene/protein expression in a pathway under different biological conditions, while changes in the variance of expression levels across the pathway members are not investigated.
Here, we present PathVar, a web-application for gene and protein expression data analysis to identify and prioritize pathways displaying deregulation patterns in their member expression variance across samples (unsupervised setting) or predefined sample groups (supervised setting). In particular, we show examples of new pathway deregulations identified on microarray cancer data, which are not detected by a conventional comparison of averaged expression levels. Finally, we discuss how the software exploits information on pathway expression variance within an automated pipeline for robust sample clustering and classification. |
Glaab E*, Schneider R
*University of Luxembourg Luxembourg |
I - Regulation, Pathways, and Systems Biology |
|
I 77
I77 |
A better understanding of biological systems can be obtained through models based on deterministic differential equations. Typically, computational methods are used to find single best-fitting parameter points for a model adopted ad hoc. Such an approach ignores the possibility of alternative models, and it bases its conclusions on single parameter points, which do not account for the uncertainty in the data.
Here, we present a computational framework to discriminate between several candidate models using time series experimental data. For each model we identify the viable space -- the parameter space compatible with the data -- using the HYPERSPACE toolbox. Model improvement is driven by a newly developed computational method based on correlated noise injection, which points out the model terms that poorly explain the data. We then used Bayes factors to select the most likely model given the data. We applied our method to a novel model of cellular transport in Saccharomyces cerevisiae in response to nutrient perturbations. We focused particularly on glutamine transport, carried out by four permeases: Gap1, Gnp1, Agp1 and Dip5. The affinities of these permeases for glutamine vary by at least three orders of magnitude and their regulation is heterogeneous. Such variability of the permeases presents challenges in selecting a specific dynamical model for glutamine transport. Using our method, we identified the most likely model as comprising (1) a Michaelis-Menten uptake in the micromolar range, regulated by nitrogen catabolite repression, and (2) a second Michaelis-Menten uptake in the millimolar range, independent of nitrogen catabolite repression. To our knowledge this is the first comprehensive model for glutamine transport in yeast. Our findings could be incorporated into larger models of nitrogen metabolism regulation. Finally, our computational framework could be used to illuminate the mechanisms that underlie observed dynamical responses of other biological systems. |
Lopez Garcia de Lomana A*, Sunnaker M, Stelling J, Wagner A
*University of Zurich Switzerland |
I - Regulation, Pathways, and Systems Biology |
|
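The two-component uptake model selected in abstract I 77 can be written as a sum of two Michaelis-Menten terms, one with micromolar and one with millimolar affinity. In the sketch below, the parameter values, the `ncr_factor` scaling of the high-affinity term and all function names are hypothetical and serve only to show how each component dominates in a different concentration range.

```python
def michaelis_menten(s, vmax, km):
    """Michaelis-Menten uptake rate v = vmax * s / (km + s)."""
    return vmax * s / (km + s)


# Hypothetical parameters; concentrations in mM, rates in arbitrary units.
HIGH_AFFINITY = {"vmax": 1.0, "km": 0.005}   # Km in the micromolar range
LOW_AFFINITY = {"vmax": 10.0, "km": 5.0}     # Km in the millimolar range


def uptake_components(s, ncr_factor=1.0):
    """Return (high, low) uptake rates at external glutamine concentration s.

    `ncr_factor` in [0, 1] scales only the high-affinity term, standing in
    for its regulation by nitrogen catabolite repression.
    """
    high = ncr_factor * michaelis_menten(s, **HIGH_AFFINITY)
    low = michaelis_menten(s, **LOW_AFFINITY)
    return high, low


def total_uptake(s, ncr_factor=1.0):
    """Total uptake rate as the sum of both components."""
    return sum(uptake_components(s, ncr_factor))


if __name__ == "__main__":
    for conc in (0.001, 0.1, 10.0):
        high, low = uptake_components(conc)
        print(f"S={conc:6.3f} mM  high={high:.4f}  low={low:.4f}")
```

With these toy values the high-affinity term carries almost all uptake at 1 µM, while the low-affinity term dominates at 10 mM.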
I 78
I78 |
Robustness has long been recognized as a distinctive property of living entities.
While a reasonably wide consensus has been achieved regarding the conceptual meaning of robustness, the biomolecular mechanisms underlying this systemic property are still open to many unresolved questions. The goal of this presentation is to provide an overview of existing approaches to characterization of robustness in mathematically sound terms. The concept of robustness is discussed in various contexts including network vulnerability, nonlinear dynamic stability, and self-organization. The second goal is to discuss the implications of biological robustness for individual-target therapeutics and possible strategies for outsmarting drug resistance arising from it. Special attention is paid to the concept of swarm intelligence, a well studied mechanism of self-organization in natural, societal and artificial systems. It is hypothesized that swarm intelligence is the key to understanding the emergent property of chemoresistance. |
Rosenfeld S*
*NIH/National Cancer Institute United States of America |
I - Regulation, Pathways, and Systems Biology |
|
I 79
I79 |
Parameter inference is a complex problem that has yet to be fully addressed in systems biology. In contrast with parameter optimisation, parameter inference computes both the parameter means and their standard deviations (or full posterior distributions), thus yielding important information on the extent to which the data and the model topology constrain the inferred parameter values. The majority of methods commonly used for optimising biochemical pathways yield point estimates of the goodness-of-fit for some specific combination of parameter values. However, these can typically only be guaranteed to be locally optimal, and therefore limit the comparisons that can be made between models.
We report on the application of nested sampling, a statistical approach to computing the Bayesian evidence log Z, to the inference of parameters in two established systems models. We present results for the application of nested sampling to a stochastic model of noisy messenger RNA transcription: parameter means and standard deviations are reliably estimated, and posterior samples show correlations between parameter values. For a model of circadian rhythms, we demonstrate that the coefficient of variation of parameter estimates (standard deviation/mean) varies ten-fold across the model parameters indicating that the data and model combined constrain parameters to significantly different degrees. We further show that the standard deviation of parameter estimates is reduced by the analysis of increasing numbers of circadian cycles of data, up to 4 cycles, but is unaffected by sampling the data more frequently. |
Aitken S*, Akman O
*University of Edinburgh United Kingdom |
I - Regulation, Pathways, and Systems Biology |
|
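The coefficient of variation used in abstract I 79 to compare how tightly different parameters are constrained is the posterior standard deviation divided by the posterior mean. The sketch below computes it from posterior samples. The parameter names and sample values are invented for illustration, and no nested sampling is performed here.

```python
import statistics


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the absolute mean."""
    mean = statistics.mean(samples)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for zero mean")
    return statistics.stdev(samples) / abs(mean)


# Hypothetical posterior samples for two model parameters.
POSTERIOR_SAMPLES = {
    "well_constrained_rate": [0.98, 1.00, 1.02, 1.01, 0.99],
    "poorly_constrained_rate": [0.5, 1.5, 1.0, 2.0, 0.2],
}

if __name__ == "__main__":
    for name, samples in POSTERIOR_SAMPLES.items():
        print(f"{name}: CV = {coefficient_of_variation(samples):.3f}")
```

Because the coefficient of variation is dimensionless, it allows parameters with different units and magnitudes to be compared on a common scale.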
I 80
I80 |
Most of what is presently known about how microRNAs regulate gene expression comes from studies that characterized the regulatory effect of microRNA binding sites located in the 3' untranslated regions (UTRs) of mRNAs. In recent years, there has been increasing evidence that microRNAs also bind in the coding region (CDS), but the implication of these interactions remains obscure because they have a smaller impact on mRNA stability compared to microRNA-target interactions that involve 3' UTRs. Here we show that microRNA-complementary sites located in either the CDS or 3' UTRs are under selection pressure and share the same sequence and structure properties. Analyzing recently published data of ribosome-protected fragment profiles upon microRNA transfection from the perspective of the location of microRNA-complementary sites, we find that sites located in the CDS are more potent at inhibiting translation, while sites located in the 3' UTR are most efficient at triggering mRNA degradation. Our study suggests that microRNAs may combine targeting of CDS and 3' UTR to flexibly tune the time scale and magnitude of their post-transcriptional regulatory effects. |
Hausser J*, Bilen B, Zavolan M
*Biozentrum, Universität Basel and Swiss Institute of Bioinformatics Switzerland |
I - Regulation, Pathways, and Systems Biology |
|
I 81
I81 |
High-throughput drug screens often use focused readouts that quantify the drug effect on the process of interest, e.g. intracellular infection. Although they yield lists of hit compounds with the desired bioactivity, they typically don’t provide clues about the molecular mechanism(s) of drug action. Image-based siRNA screens, on the other hand, provide rich information about the loss-of-function phenotypes of individual genes using multiparametric readouts that allow fine-grained discrimination between different types of functional deficiencies.
Here we propose a novel strategy to link chemical and RNAi screening data for predicting molecular targets of drugs. As a proof of principle, we predicted and validated the target pathways of hits from a drug screen on intracellular mycobacterial infection in macrophages by comparing their phenotypes to the RNAi profiles of the genome-wide RNAi screen on endocytosis (GWSe) that we reported earlier. To this end, we rescreened the hit compounds in the endocytic trafficking assay of GWSe. Clustering of the resulting phenotypic profiles identified three distinct drug groups. For each group we obtained a list of genes in the GWSe with profiles that were significantly similar to the phenotype of that drug cluster. Gene annotation enrichment analysis of the lists led to the prediction that one of the drug groups inhibited the mycobacterial infection through autophagy induction, which was confirmed experimentally. Our approach is uniquely able to predict target pathways solely from phenotypic information, without relying on existing drug target information. |
Samusik N*, Sundaramurthy V, Barsacchi R, Kalaidzidis Y, Zerial M
*Max Planck Institute of Molecular Cell Biology and Genetics Germany |
I - Regulation, Pathways, and Systems Biology |
|
I 82
I82 |
More and more studies generate data using several omics technologies for the same sample. Unraveling cellular processes and systems responses from these multi-omics data sets is a highly complex task that requires the development of sophisticated computational tools. Here we present a novel approach to integrate transcriptome and metabolome time-series data sets based on correlation matrices and enrichment of annotation terms.
The algorithm calculates a matrix of the correlations between transcript and metabolite levels and consists of four phases: (1) extract the sub-correlation matrix for a specific annotation label; (2) generate a histogram of the correlation values in the submatrix; (3) permute molecule labels, extract the sub-correlation matrix and draw a histogram for each permuted instance; (4) use the counts for each bin over all permuted histograms to compute a p-value for each bin. The results of the analysis are visualized as networks, with nodes corresponding to annotation labels and edges corresponding to significantly enriched high correlation values. The method was applied to three publicly available time-series data sets of Arabidopsis thaliana exposed to cold stress [1], elevated CO2 levels [2] and sulphur starvation [3]. Robustness of the method to parameter variations and plausibility of the results were verified by comparison with previous studies. In concordance with the previous studies, tight correlations among amino acids were detected for all data sets. Further, transcriptional deregulation of photosynthesis and primary metabolism, as well as shifts in the global protein content, were in agreement with previous reports. These findings underline the general applicability and usefulness of the novel method. [1] Kaplan F, Kopka J, et al.: Transcript and metabolite profiling during cold acclimation of Arabidopsis reveals an intricate relationship of cold-regulated gene expression with modifications in metabolite content. The Plant Journal 2007, 50(6):967-981. [2] Dutta B, Kanani H, et al.: Time-series integrated "omic" analyses to elucidate short-term stress-induced responses in plant liquid cultures. Biotechnology and Bioengineering 2009, 102:264-279. [3] Hirai MY, Klein M, et al.: Elucidation of gene-to-gene and metabolite-to-gene networks in Arabidopsis by integration of metabolomics and transcriptomics. Journal of Biological Chemistry 2005, 280(27):25590. |
Kopp W, Thallinger GG*
*Institute for Genomics and Bioinformatics, Graz University of Technology Austria |
I - Regulation, Pathways, and Systems Biology |
|
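The four phases of the I82 abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the nested-dictionary layout `corr[transcript][metabolite]`, the bin count, the number of permutations, and the reading of "permute molecule labels" as drawing a random transcript set of the same size. It is not the authors' implementation.

```python
import random

def histogram(values, n_bins=10):
    """Count correlation values (assumed to lie in [-1, 1]) per equal-width bin."""
    counts = [0] * n_bins
    for v in values:
        idx = int((v + 1.0) / 2.0 * n_bins)
        idx = max(0, min(idx, n_bins - 1))  # clamp the edges, e.g. v == 1.0
        counts[idx] += 1
    return counts

def submatrix_values(corr, transcripts, metabolites):
    """Phase 1: flatten the sub-correlation matrix for a set of transcripts."""
    return [corr[t][m] for t in transcripts for m in metabolites]

def bin_pvalues(corr, labelled, metabolites, n_perm=1000, n_bins=10, seed=0):
    """Phases 2-4: empirical per-bin p-values from label permutations."""
    rng = random.Random(seed)
    all_transcripts = list(corr)
    observed = histogram(submatrix_values(corr, labelled, metabolites), n_bins)
    exceed = [0] * n_bins
    for _ in range(n_perm):
        # Assumed permutation scheme: a random transcript set of equal size.
        permuted = rng.sample(all_transcripts, len(labelled))
        perm = histogram(submatrix_values(corr, permuted, metabolites), n_bins)
        for b in range(n_bins):
            if perm[b] >= observed[b]:
                exceed[b] += 1
    # Add-one correction so that no p-value is exactly zero.
    return [(e + 1) / (n_perm + 1) for e in exceed]
```

A small p-value in a high-correlation bin would then correspond to an edge in the network described in the abstract; the significance threshold is left to the user.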
I 83
I83 |
Gene set analysis using biological pathways has become a widely used statistical approach for gene expression analysis. A biological pathway can be represented as a graph in which genes and their interactions are the nodes and edges, respectively. From a biological point of view, only some portions of a pathway are expected to be altered; however, few methods using pathway topology have been proposed, and none of them tries to identify the signal paths within a pathway that are most involved in the biological problem. Here, we present clipper, a novel algorithm for pathway analysis that tries to fill this gap. clipper implements a two-step empirical approach that exploits the decomposition of a graph into a junction tree to reconstruct the most relevant signal path. In the first step, clipper selects significant pathways according to statistical tests on the means and the concentration matrices of the graphs derived from pathway topologies. Then, it identifies within these pathways the signal paths having the greatest association with a specific phenotype. We test our approach on two expression datasets. In both cases, our results demonstrate the efficacy of clipper in identifying signal transduction paths that are fully coherent with the biological problem. |
Martini P, Sales G, Massa S, Chiognia M, Romualdi C*
*University of Padova Italy |
I - Regulation, Pathways, and Systems Biology |
|
I 84
I84 |
Motivation:
Incorporating new hypotheses into an existing (dynamic) model of a biological system often leads to a set of different but similar models. Identifying the best model among them is a complex task. It could be facilitated by using an easily computable similarity score to reveal a model order a priori. Methods: Here, we define (structural) model similarity from the 'graphical' setup only. Models are related by choices such as where and how to start and end feedback loops. Choices can have two or more options and each choice leads to one difference or similarity between any two models. The choices are recorded in a vector for each model, from which we compute the similarity score of two models via the Hamming distance (HD). The HD score can be refined using additional information, such as reaction rates, by incorporating it into weight attributes. We can also employ subvectors to cluster models. Results: We demonstrate the approach on 16 dynamic models of minimal biochemical oscillators. Although the HD is a simple measure, it strongly correlates with the volume estimation (VE) of the oscillatory region in the models’ parameter spaces. Combining the VE with the HD, we identify the most robust option for each design choice. Conclusion: We developed a method for model classification that is easy to understand and adapt. The method directly relates model structure to robust performance and can be improved by giving a weight to each choice. |
Ibig A*, Stelling J
*Department of Biosystems Science and Engineering, ETH Zurich Switzerland |
I - Regulation, Pathways, and Systems Biology |
|
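The choice-vector comparison in the I84 abstract can be sketched as follows. The encoding of choices as small integers, the weight values and the normalised similarity score are assumptions made for illustration; the abstract does not specify them.

```python
def hamming_distance(a, b, weights=None):
    """(Weighted) Hamming distance between two equal-length choice vectors."""
    if len(a) != len(b):
        raise ValueError("choice vectors must have the same length")
    if weights is None:
        weights = [1.0] * len(a)  # unweighted: every choice counts equally
    return sum(w for x, y, w in zip(a, b, weights) if x != y)

def similarity(a, b, weights=None):
    """Similarity in [0, 1]; 1.0 means the two models made identical choices."""
    total = sum(weights) if weights else len(a)
    return 1.0 - hamming_distance(a, b, weights) / total

# Hypothetical models: each position records the option taken for one choice.
model_a = [0, 1, 2]
model_b = [0, 2, 2]
print(hamming_distance(model_a, model_b))             # differs in one choice
print(hamming_distance(model_a, model_b, [1, 3, 1]))  # that choice weighted 3x
```

Clustering on subvectors, as mentioned in the abstract, would amount to calling the same function on slices of the choice vectors.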
I 85
I85 |
Identifying transcription factors (TFs) that regulate a specific target gene is a complex problem. Even simple yeast has more than 200 known TFs, and experimentally validating all of them would be prohibitively time-consuming and expensive. Therefore, it would be useful to computationally identify the most likely regulators. A naïve approach would be to look for the occurrence of binding motifs of potential TFs in the promoter region of the target gene. However, this approach is known to produce a large number of false positives. In this work, we propose a computational method that complements putative binding site information with results from public microarray and ChIP experiments to rank potential regulators of a target gene. We start by searching for microarray data sets in which our gene is differentially expressed. Next, we use Gene Set Analysis to rank TFs according to how enriched their known targets are among the differentially expressed genes in each data set. Finally, we use rank aggregation twice: first to combine ranks from multiple expression data sets into a single ranking, and then to include information from predicted binding sites and large-scale TF deletion experiments. We applied our approach to predict regulators for yeast TPO1. Among the top 6 predictions were Pdr1 and Pdr3, two TFs previously known to up-regulate TPO1 in multiple chemical stresses. Furthermore, we were able to show experimentally that the deletion of several other top predictions leads to altered TPO1 expression in either control conditions or in benzoic acid stress. |
Alasoo K*
*Institute of Computer Science, University of Tartu Estonia |
I - Regulation, Pathways, and Systems Biology |
|
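The two-stage rank aggregation in the I85 abstract can be illustrated with a minimal sketch. The abstract does not name the aggregation method, so mean-rank (Borda-style) aggregation is an assumption here, and the TF names and input rankings are hypothetical placeholders, not data from the study.

```python
def aggregate(rankings):
    """Combine several {tf: rank} dictionaries into one ranking by mean rank.

    A TF missing from a ranking is placed just below that ranking's last entry.
    Ties in mean rank are broken alphabetically to keep the result deterministic.
    """
    tfs = sorted(set().union(*rankings))
    mean_rank = {
        tf: sum(r.get(tf, len(r) + 1) for r in rankings) / len(rankings)
        for tf in tfs
    }
    ordered = sorted(tfs, key=lambda tf: (mean_rank[tf], tf))
    return {tf: position + 1 for position, tf in enumerate(ordered)}

# Stage 1: combine rankings from several expression data sets (hypothetical).
expression_1 = {"TF_A": 1, "TF_B": 2, "TF_C": 3}
expression_2 = {"TF_A": 2, "TF_B": 1, "TF_C": 3}
expression_rank = aggregate([expression_1, expression_2])

# Stage 2: add binding-site and deletion-experiment rankings (hypothetical).
binding_sites = {"TF_B": 1, "TF_A": 2, "TF_C": 3}
deletions = {"TF_B": 1, "TF_C": 2, "TF_A": 3}
final_rank = aggregate([expression_rank, binding_sites, deletions])
print(final_rank)
```

Aggregating in two stages keeps the many expression data sets from outvoting the two other evidence sources, which is one plausible reason for the design described in the abstract.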
I 86
I86 |
Combinatorial interactions among transcription factors (TFs) direct tissue-specific gene regulation. Despite this, previous studies of gene regulatory mechanisms have usually considered single TFs independently.
Here, we provide a prediction of co-regulating TFs in 12 human tissues and 2 human cell lines based on the predicted binding affinity in human promoter regions. We analyzed all possible pairs of 1376 vertebrate TFs from the TRANSFAC database. First, we scanned all promoter regions for single TF-DNA binding affinities with TRAP and ranked, for each TF, all target genes corresponding to the promoters by their predicted binding affinity. We then studied the similarity of pairs of the ranked gene lists stratified by tissue or cell line. We detected candidate co-regulating TFs by applying statistical tests for multiway contingency tables and recently published statistical tests for the intersection of independent lists of genes. We validated our candidates using both known protein-protein interactions (PPIs) and known gene regulatory mechanisms in the selected tissue. Further, we compare the results obtained from the two tests and discuss their sensitivity and false discovery rate. In addition, we control the type-I error rate by comparing with the test results on randomly chosen DNA sequences. |
van Bömmel A*, Chung H, Vingron M
*Max Planck Institute for Molecular Genetics Germany |
I - Regulation, Pathways, and Systems Biology |
|
K 01
K1 |
Next-generation sequencing (NGS) techniques have revolutionized the field of genomics/functional genomics. We have recently sequenced and assembled the genome of the filamentous fungus Sordaria macrospora, a model organism for fungal development, solely from NGS reads (PLoS Genet 6:e1000891). We are currently applying NGS in two approaches for the identification and characterization of developmental genes. (I) With laser microdissection, we can separate young fruiting bodies from surrounding hyphae. RNA isolation and amplification from microdissected samples yield enough material for RNA-seq analysis. The resulting data were compared to RNA-seq data from whole mycelial extracts to characterize the genome-wide spatial distribution of gene expression during sexual development. Additionally, we used the RNA-seq information to improve the predicted S. macrospora gene models, and annotated UTRs for more than 50 % of the genes. (II) We sequenced the genomes of three mutants that were generated by conventional mutagenesis, and identified the three causative mutations through searches for small mutations and large structural variants. One mutant carries a 1.4 kb deletion in the developmental gene pro41. The second, a spore color mutant, has a point mutation in a gene that encodes an enzyme for melanin biosynthesis. In the third mutant, a point mutation in the stop codon of a conserved fungal transcription factor causes sterility. For all three mutants, transformation with a wild-type copy of the affected gene restored the wild-type phenotype. These data show that whole-genome sequencing of mutant strains is a rapid method for the identification of developmental genes. |
Nowrousian M*, Teichert I, Kück U, Wolff G
*Ruhr-University Bochum Germany |
K - Sequencing and Sequence Analysis |
|
K 02
K2 |
In a metagenomic project, all sequencing reads are usually compared against a reference database, on which the result heavily depends. Sequence databases such as the NCBI-NR database offer various alphanumerical or plain-text identifiers for each entry, explaining the origin of the sequence. By definition, the quality of taxonomic and functional assignment depends on the identifier used for mapping. As the reference databases continue to grow rapidly, fewer and fewer sequences have a highly reliable sequence identifier such as a RefSeq-ID.
We present several statistics for the NCBI-NR and NT databases, including taxonomic mapping coverage and quality. SEED- and KEGG-based functional assignments are also reviewed. Based on the results, we suggest various improvements for an enhanced analysis of sequence datasets using reference databases. These include using different mapping methods, such as GI numbers for taxonomic and functional assignments, and NCBI-NR to SEED mappings using the UNIPROT database. |
Weber N*, Huson DH
*Center for Bioinformatics, University of Tuebingen Germany |
K - Sequencing and Sequence Analysis |
|
K 03
K3 |
Heteroplasmy of mitochondrial DNA (mtDNA) is a phenomenon in which sequences vary within and between cells of the same individual. Recent technological advances in sequencing, collectively known as "next-generation sequencing" (NGS), have opened new avenues for the characterization of mtDNA heteroplasmy. A possible barrier to using NGS with total DNA for heteroplasmy detection is the presence of nuclear insertions of mitochondrial fragments (NUMTs), which might be confused with mtDNA sequences. To avoid NUMTs, mtDNA sequencing traditionally begins with some method of obtaining pure organellar DNA (e.g. PCR amplification). Here we show, using simulations and data from the 1000 Genomes Project, that NUMTs have little effect on heteroplasmy detection in mtDNA without specific purification. Using simulated Illumina reads, we examine the mapping of reads originating from NUMTs and their association with false heteroplasmy. We show that if sufficient care is used in mapping and filtering the resulting reads, no false heteroplasmy is detected (i.e. no false positives). Only very few nuclear reads mapped incorrectly to the mtDNA (0.04%), and all but one of these were the result of sequencing errors making the NUMT-originating read more similar to the mtDNA. We also show that heteroplasmy detected in data taken from the 1000 Genomes Project does occur preferentially in mtDNA positions that align to NUMTs. We conclude that NUMT contamination in NGS experiments does not contribute to false heteroplasmy detection when proper filtering is applied, even in the absence of specific pre-amplification of the template. |
Nagar T*, Rubin E
*Ben-Gurion University Israel |
K - Sequencing and Sequence Analysis |
|
K 04
K4 |
Chinese hamster ovary (CHO) cells have been used as an important mammalian cell factory for therapeutic proteins since 1987 [1]. In 2011 the transcriptome and the genome of CHO were published [1], [2]. Because Mus musculus is a close relative of the Chinese hamster (Cricetulus griseus), its sequence was used as a substitute in CHO microarray experiments until the CHO sequences became available. It is therefore interesting to see how similar the published sequences of CHO cell lines are to each other, and how similar each of them is to the M. musculus sequence. For this reason we performed several bioinformatics tests to check the similarity of the available sequences - with an unexpected result: reciprocal BLAST searches, sequence clustering and sequence length ratios reveal large differences even between subsequent assemblies of the same CHO strain.
A number of reasons are conceivable for these results, which warrant further analysis. |
Tatto NE*, Graf AB, Borth N, Mattanovich D
*University of Natural Resources and Life Sciences (IAM) - Vienna/ACIB GmbH Austria |
K - Sequencing and Sequence Analysis |
|
K 05
K5 |
Background:
Cellular respiration is the process by which cells obtain energy from glucose, and is a very important biological process in living cells. To carry out cellular respiration, cells need a pathway to store and transport electrons: the electron transport chain. The function of the electron transport chain is to produce a trans-membrane proton electrochemical gradient as a result of oxidation-reduction reactions. When protons flow back through the membrane, ATP synthase converts this mechanical energy into chemical energy by producing ATP, which provides energy for many cellular processes. Identifying electron transport proteins is therefore an important step in helping biologists better understand the workings of cellular respiration. Methods: In this work, we propose a method based on radial basis function networks using Position Specific Scoring Matrix (PSSM) profiles and amino acid biochemical properties to identify electron transport proteins. Results: We have selected a non-redundant set of 354 electron transport proteins from the UniProt database. The proposed method showed a 5-fold cross-validation accuracy of 92.9% for discriminating electron transport proteins from other transport proteins. We also evaluated the performance of the method with an independent dataset of 71 electron transport proteins, and we obtained an accuracy of 91.3%. In addition, we have systematically analyzed electron transport proteins in five complexes of electron transport chains. Finally, we developed a protocol based on PSSM profiles and biochemical properties for identifying electron transport proteins in new protein sequences. Conclusions: We have developed a novel approach based on PSSM profiles and biochemical properties for identifying electron transport proteins in new protein sequences. The proposed approach could serve as an effective tool for annotating electron transport proteins in genomic sequences. |
Ou Y*
*Yuan Ze University Taiwan |
K - Sequencing and Sequence Analysis |
|
K 07
K7 |
Alternative splicing of pre-mRNA is an important process eukaryotes use to increase their repertoire of different protein products. Several types of alternative splice forms exist including exon skipping, differential splicing of exons at their 3'- or 5'-end, intron retention, and mutually exclusive splicing.
WebScipio is a web-application to reconstruct the exon-intron structure of genes based on a given protein query sequence and a genomic DNA target sequence (www.webscipio.org). We implemented an extension to the WebScipio software to search for mutually exclusive spliced exons. The search is based on the precondition that mutually exclusive exons encode regions of the same structural part of the protein product. This precondition provides restrictions to the search for candidate exons concerning their length, splice site conservation and reading frame preservation, and overall homology. Using the new algorithm, the mutually exclusive exonomes of several species were reconstructed and are accessible via http://www.motorprotein.de/kassiopeia. New genes can originate through multiple mechanisms including gene duplication, gene fusion/fission, exon shuffling, retroposition, horizontal gene transfer, and de novo from noncoding sequences. Most of the new genes are derived through duplications. Gene duplicates are normally classified into dispersed and tandem duplicates. While algorithms have been developed to reconstruct the history and evolution of tandemly arrayed genes, specific programs are not available for the reconstruction of these gene arrays. Here, we present an extension to the WebScipio web-application to search for and predict tandem gene duplicates for a given query sequence. |
Hatje K*, Kollmar M
*Max Planck Institute for Biophysical Chemistry Germany |
K - Sequencing and Sequence Analysis |
|
K 08
K8 |
Classical DNA sequencing by hybridization uses binary information about the presence of oligonucleotides in the analysed DNA sequence: a given oligonucleotide either is or is not part of the sequence. However, the development of DNA chip technology makes it possible to take into account some information about repetitions in the target sequence. Currently, it is not possible to determine exact multiplicities, but even partial multiplicity information should be very useful.
Two simple but realistic multiplicity information models are taken into account. The first one assumes that it is known whether a given oligonucleotide occurs in the analysed sequence once or more than once. According to the second model, it is possible to determine whether a given oligonucleotide appears in the target sequence once, twice or at least three times. A tabu search algorithm has been implemented to verify these models. It solves the problem with any kind of hybridization errors. The results of a computational experiment confirm that the additional information leads to an improvement of the reconstruction process. They also show that the more precise information model increases the quality of the obtained solutions. |
Kwarciak K*, Formanowicz P
*Poznan University of Technology, Institute of Computing Science Poland |
K - Sequencing and Sequence Analysis |
|
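The two multiplicity models in the K08 abstract differ only in how precisely repeated oligonucleotides are counted. The sketch below builds an error-free spectrum under each model; the function name and the `cap` parameter are illustrative assumptions, and the tabu search reconstruction itself is not shown.

```python
from collections import Counter

def spectrum_with_multiplicity(sequence, k, cap):
    """Map each k-mer of the sequence to its occurrence count, capped at `cap`.

    cap=1: classical binary spectrum (present or absent).
    cap=2: first model, "once" versus "more than once".
    cap=3: second model, "once", "twice" or "at least three times".
    """
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return {oligo: min(count, cap) for oligo, count in counts.items()}

# Hypothetical target sequence: AC occurs three times, CA twice.
target = "ACACACGT"
print(spectrum_with_multiplicity(target, 2, cap=2))
print(spectrum_with_multiplicity(target, 2, cap=3))
```

With `cap=2` the oligonucleotides AC and CA are indistinguishable in multiplicity, while `cap=3` separates them, which illustrates why the more precise model carries more information for reconstruction.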
K 09
K9 |
The concept of homology is at the heart of most studies dealing with protein sequence, structure and function. In the absence of protein structure, inference of homology usually has to rely exclusively on sequence data. At present, the most sensitive sequence-based methods use comparison of multiple sequence alignments represented as sequence profiles or profile hidden Markov models. The sensitivity of such methods strongly depends on the algorithms used for profile construction and comparison. We propose profile scoring and comparison based on multivariate t-distribution statistics, which are used to describe the distribution of target profile probabilities. Based on this type of distribution, we develop a new expression of log-odds scores for a pair of profiles. To demonstrate the utility of the new scoring method, we perform a benchmark test on a set of distantly related proteins and compare the results with existing profile comparison methods using ROC analysis. The proposed scoring paradigm has several important and useful features. Using either the multivariate or the matrix-variate t-distribution, the scoring model can easily be extended to the level of profile contexts. Moreover, it can readily be incorporated into Bayesian non-parametric statistics. The latter makes it possible to statistically cluster profile segments, resulting in more sensitive group-oriented profile-pair scores. Funding: This work was supported by Research Council of Lithuania and Howard Hughes Medical Institute. |
Margelevicius M*, Venclovas Č
*Institute of Biotechnology, Vilnius University Lithuania |
K - Sequencing and Sequence Analysis |
|
K 10
K10 |
RNA-seq is a family of next-generation sequencing technologies, which collectively provide a revolutionary tool for studying gene expression. Beginning from a steady-state RNA sample, these methodologies construct a library of millions of differentially abundant short sequence tags or “reads”. Generation of these reads can be thought of as random sampling from an underlying population of cDNA fragments, and the aim of statistical analysis is to infer the relative expression levels of the features (e.g. genes) of a reference genome onto which these reads are mapped. Previously, inference on differential gene expression has been based on modified tests originally devised for analyzing microarray data and on methods developed de novo for the analysis of RNA-seq data.
Building on this volume of work, we investigate the applicability of a novel, explicitly Bayesian approach for the analysis of RNA-seq data. Our point of departure is the use of a hierarchical probabilistic model for modelling overdispersed tag counts and an associated Markov Chain Monte Carlo algorithm for inferring the posterior distribution of model parameters conditional on the observed counts. We demonstrate that the algorithm remains operational even in the absence of biological replicates, a situation not uncommon in practice. At the next stage, we show how the inferred parameter posteriors can be used to test for differential gene expression in a variety of experimental setups, including pairwise and multi-group comparisons. Our analysis is performed on both artificial and actual experimental data obtained from different macaque brain regions. |
Vavoulis DV*, Pardo L, Heutink P, Gough J
*University of Bristol United Kingdom |
K - Sequencing and Sequence Analysis |
|
K 11
K11 |
Advances in sequencing technologies have created new opportunities in cancer genomics. Unbiased sequencing of tumours is increasingly affordable and has resulted in the discovery of new cancer driving genes. Making these discoveries clinically relevant, however, requires highly accurate deep sequencing of a specified subset of the genome. The PacBio SMRT sequencer fills this niche, making it well-suited for re-sequencing studies in large patient cohorts, such as validation of whole genome sequencing studies, and for the validation and application of prognostic and predictive biomarkers.
While optimal data-analysis of next-generation sequencing platforms remains poorly characterized, many informatics tools have been created. Here, we optimize parameters for the entire single-nucleotide variant detection pipeline for PacBio data, including read filtering, read alignment, variant calling and variant filtering. We employ BWA for sequence alignment and the Genome Analysis Toolkit (GATK) for SNP detection. We sequenced 19 known cancer genes in blood and tumour samples from ten prostate cancer patients with PacBio and called variants using GATK. The variants are compared to gold standard datasets produced by OncoScan SNP microarrays and PCR traces, allowing for reliable measurements of sensitivity and specificity. We developed a parameter optimization process, covering millions of data points within the joint PacBio-BWA-GATK parameter space. Using linear modeling, we identify critical parameters and optimize SNP calls for PacBio sequencing data. Identified parameter sets will be validated in an additional ten tumour-normal prostate cancer pairs. |
Lalonde E*, Harding NJ, Brown A, Meng A, Zia A, Trudel D, de Borja R, Buchner N, Starmans MHW, Chan-Seng-Yue MA, Chong T, Denroche R, Fleshner NE, Hennings-Yeomans PH, Sam M, Sendorek D, Timms L, Johns J, Pintilie M, Volik S, Yousif F, Zafarana G, Muthuswamy L, Lambin P, Fraser M, Collins CC, Stein LD, Hudson TJ, van der Kwast T, Beck T, McPherson JD, Bristow RG, Boutros PC
*Ontario Institute for Cancer Research Canada |
K - Sequencing and Sequence Analysis |
|
K 12
K12 |
Recently, several studies have demonstrated that sequencing the exomes of a few patients and subsequent identification of the novel variants they have in common can be sufficient to identify the causal gene responsible for monogenic Mendelian diseases. However, this approach will not always successfully find disease-linked genes because it is limited by genetic heterogeneity, incomplete penetrance and missing data. Assuming that biological networks provide functional information on the underlying disease processes, we present an analysis strategy which utilizes this information to suggest gene complexes that may have a disease-causing role. BioGranat-IG is an efficient implementation of our analysis strategy, which we can state as a combination of graph-search and the classical set-cover problem. We show, using simulated data, that BioGranat-IG is able to recover gene complexes for two inherited diseases known to have a basis of genetic heterogeneity. Additional performance tests assess BioGranat-IG’s utility under various conditions, which include: size and connectedness of the underlying gene complex; number of patients; number of confounding sequence variants, and incompleteness of biological networks. Our tests demonstrate that the number of individuals that need to be sequenced depends on the number of genes involved, the local network topology around the disease genes as well as the penetrance of the disease. We conclude that BioGranat-IG represents an important addition to the exome analysis tools currently available to bioinformaticians and biologists. |
Dand N*, Sprengel F, Ahlers V, Schlitt T
*King's College London United Kingdom |
K - Sequencing and Sequence Analysis |
|
K 13
K13 |
RNA sequencing (RNA-seq) is a recently developed method for transcriptome profiling. RNA-seq increasingly replaces microarrays as the method is not limited to annotated organisms or restricted to selected transcriptomic regions. In principle, RNA-seq also has a wider range of quantification and has the potential of detecting fusion genes. Since RNA-seq is a relatively new method, the analysis and interpretation of the data is still improving. Interpreting RNA-seq has several challenges including duplicate reads, assigning reads to unique transcripts, handling reads that map to multiple locations (e.g. gene families and paralogues) and reads mapping to regions with overlapping genes.
In this work, we analyze the impact of such issues on the interpretation of gene expression readouts from RNA-seq data on 12 breast cancer cell lines. In cases where reads cannot be assigned to specific transcripts, we suggest estimating the expression of “gene complexes” instead. To validate our method, we correlate gene expression estimates from RNA-seq with estimates from microarrays for the same cell lines. We consider it important to understand RNA-seq data in greater detail for it to become the standard for gene expression profiling. |
Flores D*, Belling KC, Elias D, Stenvang J, Jun W, Brünner N, Ditzel H, Calley J, Gupta R
*Center for biological sequence analysis - Technical University of Denmark Denmark |
K - Sequencing and Sequence Analysis |
|
K 14
K14 |
Sequencing of mRNA (RNA-seq) by next-generation sequencing technologies is widely used for analyzing the transcriptomic state of a cell. Here, one of the main challenges is the mapping of a sequenced read to its transcriptomic origin. As a simple alignment to the genome will fail to identify reads crossing splice junctions and a transcriptome alignment will miss novel splice sites, several approaches have been developed for this purpose. Most of these approaches have two drawbacks. First, each read is assigned to a location independently of whether the corresponding gene is expressed or not, i.e. information from other reads is not taken into account. Second, in case of multiple possible mappings, the mapping with the fewest mismatches is usually chosen, which may lead to wrong assignments due to sequencing errors.
To address these problems, we developed ContextMap which efficiently uses information on the context of a read, i.e. reads mapping to the same expressed region. The context information is used to resolve possible ambiguities and, thus, a much larger degree of ambiguities can be allowed in the initial stage in order to detect all possible candidate positions. Although ContextMap can be used as a stand-alone version using either a genome or transcriptome as input, the version presented in this study is focused on refining initial mappings provided by other mapping algorithms. Evaluation results on simulated sequencing reads showed that the application of ContextMap to either TopHat or MapSplice mappings improved the mapping accuracy of both initial mappings considerably. We show that the context of reads mapping to nearby locations provides valuable information for identifying the best unique mapping for a read. Using our method, mappings provided by other state-of-the-art methods can be refined and alignment accuracy can be further improved. Availability: http://www.bio.ifi.lmu.de/ContextMap. |
Bonfert T*, Csaba G, Zimmer R, Friedel CC
*LMU Muenchen Germany |
K - Sequencing and Sequence Analysis |
|
K 15
K15 |
Commonly used RNA folding programs compute the minimum free energy structure of a sequence under the pseudoknot exclusion constraint. They are based on Zuker’s algorithm, which runs in time O(n^3). Recently, it has been claimed that RNA folding can be achieved in average time O(n^2) using a sparsification technique. A proof of quadratic time complexity was based on the assumption that computational RNA folding obeys the "polymer-zeta property". Several variants of sparse RNA folding algorithms were later developed. Here, we present our own version, which is readily applicable to existing RNA folding programs, as it is extremely simple and does not require any new data structure. We applied it to the widely used ViennaRNAfold program, to create sibRNAfold, the first public sparsified version of a standard RNA folding program. To gain a better understanding of the time complexity of sparsified RNA folding in general, we carried out a thorough run time analysis with synthetic random sequences, both in the context of energy minimization and base pairing maximization. Contrary to previous claims, the asymptotic time complexity of a sparsified RNA folding algorithm using standard energy parameters remains O(n^3) under a wide variety of conditions. Consistent with our run-time analysis, we found that RNA folding does not obey the "polymer-zeta property" as claimed previously. Yet, a basic version of a sparsified RNA folding algorithm provides a 15- to 50-fold speed gain. Surprisingly, the same sparsification technique has a different effect when applied to base pairing optimization. There, its asymptotic running time complexity appears to be either quadratic or cubic depending on the base composition. Code availability: http://sibRNAfold.sourceforge.net/ |
Dimitrieva S*, Bucher P
*SIB, EPFL Switzerland |
K - Sequencing and Sequence Analysis |
|
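As an illustration for poster K15, the following is a minimal sketch of the unsparsified, cubic-time base pairing maximization recursion (the classic Nussinov-style dynamic program) that such sparsification techniques aim to speed up. It is not the sibRNAfold code: the function name `max_base_pairs`, the allowed pair set and the `min_loop` parameter are assumptions made for this example.

```python
def max_base_pairs(seq: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs in seq (O(n^3) time, O(n^2) space)."""
    # Assumed pairing rules: Watson-Crick pairs plus G-U wobble pairs.
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    # Fill the table by increasing subsequence span.
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: position j stays unpaired
            # case 2: j pairs with some k, leaving at least min_loop bases between them
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]
```

The inner loop over `k` is what makes the recursion cubic; sparsified variants try to restrict the set of `k` values that need to be examined.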
K 16
K16 |
Proteins belonging to PD-(D/E)XK phosphodiesterases constitute a functionally diverse superfamily with representatives involved in replication, restriction, DNA repair and tRNA-intron splicing. To date there have been several attempts to identify and classify new PD-(D/E)XK phosphodiesterases using remote homology detection methods. Such efforts are complicated, because the superfamily exhibits extreme sequence and structural divergence. Using our highly sensitive, transitive homology detection approach [1], supported with superfamily-wide domain architecture and horizontal gene transfer analyses, we performed a comprehensive reclassification of proteins containing a PD-(D/E)XK domain [2]. The PD-(D/E)XK phosphodiesterases span over 21 900 proteins, which can be classified into 121 different groups of various families. Eleven of them, including DUF4420, DUF3883, DUF4263, COG5482, COG1395, Tsp45I, HaeII, Eco47II, ScaI, HpaII and Replic_Relax, are newly assigned to the PD-(D/E)XK superfamily. Some groups of PD-(D/E)XK proteins are present in all domains of life, whereas others occur within small numbers of organisms. We observed multiple horizontal gene transfers even between human pathogenic bacteria or from Prokaryota to Eukaryota. Uncommon domain arrangements greatly elaborate the PD-(D/E)XK world. These include domain architectures suggesting regulatory roles in Eukaryotes, like stress sensing and cell cycle regulation. Our results may inspire further experimental studies aimed at identification of exact biological functions and specific substrates of these highly diverse proteins.
[1] Ginalski K, von Grotthuss M, Grishin NV, Rychlewski L (2004) Detecting distant homology with Meta-BASIC. NAR 32, W576-81. [2] Steczkiewicz K, Muszewska A, Knizewski L, Rychlewski L, Ginalski K (2012) Sequence, structure and functional diversity of PD-(D/E)XK phosphodiesterase superfamily. NAR, in press. |
Steczkiewicz K*, Muszewska A, Knizewski L, Rychlewski L, Ginalski K
*University of Warsaw Poland |
K - Sequencing and Sequence Analysis |
|
K 17
K17 |
RNA has many pivotal functions especially in the regulation of gene expression by ncRNAs. Identification of their structure is important for understanding their function and often requires a more in-depth analysis of the folding space. Here, the major drawback is the exponential growth of the folding space. Therefore, methods are either limited in the sequence length they can analyze or they make use of heuristics, sampling or abstraction.
With RNAHeliCes, we introduced a position-specific abstraction based on helices which we termed helix index shapes or hishapes for short. Based on this, we developed two methods, one for energy barrier estimation, called HiPath, and one for abstract structure comparison, termed HiTed. Furthermore, we could show the superior performance of HiPath compared to other existing methods and the competitive accuracy of HiTed. Despite polynomial complexity when returning k-best hishapes, the number of possible hishapes is still exponential. This makes it necessary to reduce the number of hishape classes. By modifying candidate selection during the recursive calculation, we are able to slow down the exponential growth. In particular, (1) a substructure A will not be added in the external loop or multiloop if the free energy of A > 0 kcal/mol, or (2) a substructure B will be filtered out if its free energy is higher than another substructure (on the same subword) whose hishape is a subset or superset of hishape of B. The result of such a calculation delivers an approximation of the set of local minima in an energy landscape. |
Huang J*, Voss B
*Genetics & Experimental Bioinformatics, Institute of Biology III, University of Freiburg Germany |
K - Sequencing and Sequence Analysis |
|
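As an illustration for poster K17, the second filtering rule from the abstract can be sketched as a dominance check over candidate substructures on the same subword. This is only a sketch under stated assumptions, not the RNAHeliCes implementation: a hishape is modelled here as a plain set of helix indices, and the name `filter_candidates` and the `(energy, hishape)` tuple layout are hypothetical.

```python
def filter_candidates(cands):
    """Drop candidates dominated by a lower-energy candidate whose hishape
    is a subset or superset of their own.

    cands: list of (free_energy_kcal_per_mol, frozenset_of_helix_indices),
    all assumed to belong to the same subword.
    """
    kept = []
    for energy, hishape in cands:
        dominated = any(
            other_energy < energy and (other_shape <= hishape or other_shape >= hishape)
            for other_energy, other_shape in cands
        )
        if not dominated:
            kept.append((energy, hishape))
    return kept


candidates = [
    (-5.0, frozenset({27.0})),
    (-3.0, frozenset({27.0, 38.0})),  # higher energy, superset of the first: filtered
    (-4.0, frozenset({12.5})),        # unrelated hishape: kept
]
survivors = filter_candidates(candidates)
```

Because the comparison requires a strictly lower energy, a candidate never filters itself, and unrelated hishapes survive regardless of their energy.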
K 19
K19 |
RNA-binding proteins (RBPs) and regulatory, RNA-containing ribonucleoprotein complexes play an essential role in many steps of mRNA metabolism. Recently, in vivo methods have been introduced to identify transcriptome-wide targets of RBPs using crosslinking and immunoprecipitation (CLIP) followed by deep sequencing of crosslinked RNAs. Photoactivatable ribonucleoside-enhanced CLIP (PAR-CLIP) is one of those methods and provides nucleotide resolution because the crosslinking of the photo-activatable nucleoside to the RBP induces a characteristic mutation at the site of the crosslink in the cDNA. Initially, the number of characteristic mutations was used to score the sites [1]. Following this, another approach, based on the enrichment of reads in CLIP sites relative to read levels in mRNA, was introduced and its performance was evaluated by comparing the top scoring sites with the independently estimated affinity of the RBP [2]. Here, we introduce a new probabilistic method that evaluates the density of crosslink-induced mutations at a position or within a region relative to the expected density in a background sample (mRNA-sequencing). We evaluated the performance of our method using Hu-antigen R (HuR) affinity measurements [3]. Initial results indicate that sites with a higher posterior probability of crosslinking have higher HuR affinity compared to the sites having a lower posterior probability and that our method has improved performance compared to the previously introduced methods. We are currently validating this method with different data sets. Our method may contribute to the elucidation of RBP targets by improving specificity especially for sites that have low coverage in the data because the corresponding mRNA abundance is low.
[1] Hafner M. et al. Cell (2010), vol. 141 (1) [2] Kishore S. et al. Nat Meth (2011), vol. 8 (7) [3] Ray D. et al. Nat Biotechnol (2009), vol. 27 (7) |
Bilen B*, Zavolan M
*University of Basel Switzerland |
K - Sequencing and Sequence Analysis |
|
K 20
K20 |
The study of alternative splicing has intensified in recent years with the advent of high throughput sequencing and its application to the characterisation of the transcriptome, known as RNA-seq. Significant evidence has accumulated showing that most human genes have more than one alternative splice form expressed at significant levels and in a regulated fashion (e.g. Wang et al, Nature 2008; Trapnell et al, Nat Biotech 2010; Waks et al, Mol Syst Biol 2011). However, the relative abundance of the different isoforms from a given gene constitutes a fundamental question that remains to be answered. Here we analyse three independent large-scale RNA-seq datasets to address this question, jointly covering a total of 21 different human tissues and cell lines, as well as specifically cytosol and nucleus. We use three different methods for isoform expression quantification (MISO, Cufflinks and mmseq) and include simulated data to test the reliability of our analysis. Overall, our analyses provide consistent evidence that, despite the complexity of the human transcriptome, most genes express one predominant isoform over the rest. In addition, given the diversity of the datasets used, we are able to pursue this question further into the comparison of predominant isoforms across different tissues, cell lines and cellular fractions. |
Gonzàlez-Porta M*, Frankish A, Rung J, Harrow J, Brazma A
*EMBL - European Bioinformatics Institute United Kingdom |
K - Sequencing and Sequence Analysis |
|
K 21
K21 |
Deep sequencing technology, due to its high throughput and low cost, has become a powerful research tool in a wide range of applications, such as RNA-seq and ChIP-seq. In recent years there have been many efforts in the bioinformatics community to provide software in R/Bioconductor to simplify the processing and the biological interpretation of such large data sets. However, until now there has been no integrated start-to-end analysis solution within R that sufficiently abstracts the technical details and would be suitable for use by biologists. In particular, alignments need to be performed outside of R and genome annotation information must be manually incorporated. Here we outline the deep sequencing analysis package QuasR, a further development of the FMI deep sequencing pipeline, built to make efficient use of available hardware resources and to simplify analysis of next generation sequencing data. |
Lerch A*, Gaidatzis D, Stadler M
*Friedrich Miescher Institute for Biomedical Research Switzerland |
K - Sequencing and Sequence Analysis |
|
K 22
K22 |
TATools provides a unique management environment for understanding transcriptome data by merging the results of diverse classical sequence analyses. Additional features and dedicated viewer pages make TATools a valuable solution for highlighting novelty in a single transcriptome as well as for cross-analysis of several transcriptomes in the same environment. As a use case, the in-depth analysis of a venom gland transcriptome is presented. Transcriptome data are submitted to BLAST search, HMM/PSSM search and signal sequence detection. The output results are merged and stored in a dedicated MySQL relational database. A user-friendly interactive web platform allows data submission, visualization and annotation. |
Koua D*, Lisacek F, Mylonas R, Stocklin R
*Swiss Institute of Bioinformatics Switzerland |
K - Sequencing and Sequence Analysis |
|
K 23
K23 |
Massively parallel whole transcriptome sequencing, commonly referred to as RNA-Seq, has become the technology of choice for performing gene expression profiling. However, reconstruction of full-length novel transcripts from RNA-Seq data remains challenging due to the short read length delivered by most existing sequencing technologies. We propose a novel statistical genome-guided method called "Transcriptome Reconstruction using Integer Programming" (TRIP) that incorporates fragment length distribution into novel transcript reconstruction from paired-end RNA-Seq reads. To reconstruct novel transcripts, we create a splice graph based on aligned RNA-Seq reads. We enumerate all maximal paths in the splice graph, which correspond to putative transcripts. To solve the transcriptome reconstruction problem we must select a set of putative transcripts with the highest support from the RNA-Seq reads. We formulate this problem as an integer program (IP) model where the objective is to select the smallest set of putative transcripts that yields a good statistical fit between the fragment length distribution empirically determined during library preparation and fragment lengths implied by mapping read pairs to selected transcripts. Experimental results on both real and synthetic datasets generated with various sequencing parameters and distribution assumptions show that TRIP has increased transcriptome reconstruction accuracy compared to previous methods that ignore fragment length distribution information. |
Mangul S*, Caciula A, Al Seesi S, Brinza D, Mandoiu I, Zelikovsky A
*Department of Computer Science, Georgia State University United States of America |
K - Sequencing and Sequence Analysis |
|
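As an illustration for poster K23, the enumeration of maximal paths in a splice graph can be sketched as a depth-first traversal from every source node to every sink node. This is only a sketch, not the TRIP implementation: the adjacency-dictionary representation, the exon labels and the name `maximal_paths` are assumptions, and the graph is assumed to be acyclic.

```python
def maximal_paths(graph):
    """Enumerate all source-to-sink paths in a directed acyclic splice graph.

    graph: dict mapping a node (e.g. an exon label) to a list of successor nodes.
    """
    successors_of_any = {v for targets in graph.values() for v in targets}
    nodes = set(graph) | successors_of_any
    sources = sorted(nodes - successors_of_any)  # nodes without incoming edges
    paths = []

    def extend(path):
        nxt = graph.get(path[-1], [])
        if not nxt:  # sink reached: the path cannot be extended further
            paths.append(path)
            return
        for node in nxt:
            extend(path + [node])

    for source in sources:
        extend([source])
    return paths


# Hypothetical gene with one alternatively used middle exon.
splice_graph = {"e1": ["e2", "e3"], "e2": ["e4"], "e3": ["e4"]}
putative_transcripts = maximal_paths(splice_graph)
```

Each returned path is one putative transcript; the number of paths can grow exponentially with the number of alternative splicing events, which is one reason a selection step such as the integer program described in the abstract is needed afterwards.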
K 24
K24 |
The Genome Reference Consortium (GRC) is the international collaboration responsible for maintaining the assemblies of the human, mouse and zebrafish reference genomes that are deposited with the INSDC. The GRC is also responsible for improving these genomes by closing remaining gaps, replacing rare variants and correcting sequencing errors.
For these genomes, a single tiling path is insufficient to represent regions with complex allelic diversity. The GRC is working to create assemblies that better represent this diversity and provide more robust substrates for genome analysis. So far, we have produced major releases for the human (GRCh37) and mouse (GRCm38) genomes that convert them to a modernised assembly model which provides one or more "alternate loci" to represent each region with complex allelic diversity. Between major releases, we add minor patch releases that provide additional alternate loci, or correct errors within the assembly; so far, we have produced eight patch releases for the human genome. The latest patch release provides more than 1 Mb of additional sequence. The resulting improved reference assembly provides a better basis for read alignment when the alignment algorithm used takes this assembly model into account. Initial tests have, for example, shown that out of those portions of the YH1 (NGS human YanHuang) assembly initially reported as not corresponding to the human reference assembly, 25% can be aligned to GRCh37 patch release 2. We present here the modernised assembly model and details of recent improvements we have made to these reference genomes. |
Torrance J*, Howe K, on behalf of the Genome Reference Consortium
*Genome Reference Consortium, Wellcome Trust Sanger Institute United Kingdom |
K - Sequencing and Sequence Analysis |
|
K 25
K25 |
The DNA of eukaryotic genomes is organized into chromatin, which synchronizes replication, allows DNA packaging, and coordinates gene activity. The combination of DNase I digestion and high-throughput sequencing (DNase I seq) has been recently used to map chromatin accessibility in a given tissue or cell type on a genome-wide scale. Thus, nucleosome-free regions are identified as DNase I hypersensitive sites (DHSs), providing information about where regulatory proteins are acting and revealing regions of potential and actual protein-DNA binding.
In addition to DHSs, at an adequate sequencing depth, short regions of protected nucleotides known as footprints can be detected, predicting transcription factor binding events, and sometimes identifying specific sequence motifs. To our knowledge, only one computational algorithm (DBFP) is able to detect footprints within digital DNase I data, at a high cost in both time and memory requirements. Here we propose a functional data analysis approach based on cubic spline smoothing and roughness derivative penalization to locate and quantify footprints at a genome-wide scale. First, read coverage at a DHS is represented as a linear combination of basis functions, where the number of basis functions is selected by the Bayesian information criterion. Second, generalized cross validation is used to choose a smoothing parameter for the functional derivative, showing narrow regions of unexpected read depletion, usually placed in the centre of nucleosome-free genomic loci with enhanced chromatin accessibility. This work was supported by the FP7 Marie-Curie ITN SYSFLO (agreement number 237909). |
Madrigal P*, Krajewski P
*Institute of Plant Genetics, Polish Academy of Sciences Poland |
K - Sequencing and Sequence Analysis |
|
K 26
K26 |
It is generally assumed that, apart from cases of programmed translational frameshift, the presence of frameshift-prone sequences in transcripts is inexpedient, and thus subject to negative selection.
We analyzed the distribution of a common, programmed frameshift pattern (X-XXY-YYZ) in the Saccharomyces cerevisiae genome and its artificial, synonymous variants, for which the yeast-characteristic codon bias and protein sequence were preserved. We compared the obtained results with the expected frequency of occurrence of this pattern deduced from combinatorial calculations and identified genes with significant enrichment or depletion of frameshift sites. These genes were further subjected to a computer simulation checking their susceptibility to new frameshift sites. To achieve this, we iteratively introduced random point mutations to the gene's coding sequence until the appearance of a new X-XXY-YYZ site. Point mutations were accumulated in the course of the simulation and their final number constituted a gene's individual score, which was compared with scores obtained for the gene's synonymous sequences. We found that genes with a reduced number of frameshift sites were able to sustain significantly more point mutations than their synonymous equivalents. This indicates that evolution acts not only to keep the quantity of frameshift sites low, but also to prevent it from increasing. On the other hand, the capacity of genes already enriched with frameshift sites for "no effect" point mutations was significantly weakened, suggesting that for these genes the occurrence of new frameshift sites may be evolutionarily supported, and presumably advantageous. |
Siwiak M*, Zielenkiewicz P
*Institute of Biochemistry and Biophysics Poland |
K - Sequencing and Sequence Analysis |
|
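As an illustration for poster K26, the search for X-XXY-YYZ sites and the mutation-accumulation score can be sketched as follows. This is a minimal sketch and not the authors' code: it assumes that the hyphens mark zero-frame codon boundaries, that X and Y may be any nucleotide, and that every substitution counts as a point mutation; the function names and the `max_steps` cap are hypothetical.

```python
import random


def frameshift_sites(cds: str) -> list[int]:
    """Return 0-based start indices of X-XXY-YYZ heptamers in the zero frame."""
    sites = []
    # The heptamer starts on the last base of a codon: index 2, 5, 8, ...
    for i in range(2, len(cds) - 6, 3):
        h = cds[i:i + 7]
        if h[0] == h[1] == h[2] and h[3] == h[4] == h[5]:
            sites.append(i)
    return sites


def mutations_until_new_site(cds: str, rng: random.Random, max_steps: int = 10_000) -> int:
    """Count random substitutions applied until a new X-XXY-YYZ site appears."""
    seq = list(cds)
    known = set(frameshift_sites(cds))
    for step in range(1, max_steps + 1):
        pos = rng.randrange(len(seq))
        # Replace the base at pos with a different base (assumption: any substitution).
        seq[pos] = rng.choice([b for b in "ACGT" if b != seq[pos]])
        if set(frameshift_sites("".join(seq))) - known:
            return step
    return max_steps  # cap reached without creating a new site


# Codons ATA AAT TTC GGG contain the site A-AAT-TTC starting at index 2.
example_sites = frameshift_sites("ATAAATTTCGGG")
score = mutations_until_new_site("ATGGCTGCAGGTCTGGAACGT", random.Random(0))
```

In the spirit of the abstract, the score of a gene would be compared with the scores obtained for its synonymous variants; generating those variants under a fixed codon bias is not covered by this sketch.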
K 27
K27 |
In recent years there has been great interest in the characterization of genomic structural variation, and paired-end mapping (PEM) is the most widely used method to detect different types of these variants. However, compared to insertions and deletions, inversion prediction presents unique challenges. GRIAL is a new algorithm developed specifically to detect and accurately map inversions. It uses geometrical rules derived from expected inversion PEM patterns to cluster individual mappings belonging to each breakpoint locus, merge clusters into inversions, and refine breakpoint locations. Using available fosmid PEM data from 9 different individuals, we have been able to predict 636 inversions in the human genome and have compared the performance of other currently used methods with that of GRIAL, which shows higher sensitivity and precision in breakpoint location. In addition, by creating different quality scores to assess the reliability of the predictions according to the ratio of discordant and concordant mappings and the support for the breakpoints, we have been able to identify misleading PEM patterns and their causes, and to determine that a large fraction of the predicted inversions are false positives. Therefore, GRIAL significantly improves inversion prediction and the value of the biological information obtained. |
Martínez Fundichely A*, Oliva M, González JR, Casillas S, Cáceres M
*Institut de Biotecnologia i de Biomedicina (IBB) Spain |
K - Sequencing and Sequence Analysis |
|
K 28
K28 |
The design of primers and probes (an assay) is an important step in the conception of molecular diagnostic solutions.
Bioinformatics is necessary both for the design of these oligonucleotides and for validating candidate assays in silico before in vitro testing, which is an expensive and time-consuming task. ALDOv1.2 is a bioinformatics pipeline dedicated to the conception of qPCR primers and probes for molecular diagnostic purposes. This sequence analysis workflow provides simplex or multiplex candidate assays; these candidate assays are ranked according to their ability to detect the maximum number of the targeted sequences with a minimum set of primers and probes. The main added value of this pipeline lies in its capacity to validate each candidate assay through a thermodynamics-based simulation of a qPCR reaction. A typical validation of a candidate assay against 10 million entries from the GenBank sequence database costs approximately 3 computation hours on a 96-core server using standard qPCR conditions as input. Moreover, part of the pipeline is used for monitoring assay performance to cope with the emergence of new variants of targeted genes. To our knowledge, ALDOv1.2 is the first tool able to automatically adapt primer and probe sequences in response to new emerging target alleles using continuously updated input sequence databases. We present some applications of this tool in the context of molecular diagnostics of several pathogenic agents. This work was supported both by the bioMérieux company and the University of Rouen (France). |
Paillier F*, Rabearivelo I
*bioMérieux France |
K - Sequencing and Sequence Analysis |
|
K 29
K29 |
The extracellular matrix (ECM) is a major component of tissues of multicellular organisms. It consists of secreted macromolecules, mainly polysaccharides and glycoproteins. Malfunctions of ECM proteins lead to severe disorders such as Marfan syndrome, osteogenesis imperfecta, numerous chondrodysplasias, and skin diseases. ECM proteins hold great promise as therapeutic targets or diagnostic markers. In this work, we report a random forest approach, EcmPred, for the prediction of ECM proteins based on sequence-derived properties.
EcmPred was trained on a dataset containing 300 ECM and 300 non-ECM proteins and tested on a dataset containing 145 ECM and 4187 non-ECM proteins. EcmPred achieved 83.00% accuracy on the training and 77.52% on the test dataset. EcmPred predicted 15 out of 20 experimentally verified ECM proteins. By scanning the entire human proteome, we predicted novel ECM proteins validated with gene ontology and InterPro. The identification of ECM proteins can be helpful for the analysis of ECM-related functions and diseases. |
Kandaswamy KK*, Pugalenthi G, Martinetz T
*Max Planck Institute for Biology of Ageing Germany |
K - Sequencing and Sequence Analysis |
|
K 30
K30 |
De novo genome assembly using short read sequences is a highly challenging problem. The major challenge for the de Bruijn graph-based assemblers is how to reduce the graph complexity, so that the computer memory requirement will not increase exponentially with the read length and, especially, the read count. Here, we propose a new de novo assembly method, called JR-Assembler, that can efficiently assemble a genome using hundreds of millions of Illumina short read sequences with an efficient usage of computer memory. JR-Assembler uses an advanced overlapping consensus approach that includes several innovative ideas. It includes the following major steps: (a) seed selection, in which both the frequency and overlapping information of reads are used to select a set of reads for extension, (b) contig extension by jumping between reads rather than base by base, (c) quality evaluation of extended sequences by remapping unassembled reads at each extension and filtering out incorrect extensions, (d) improvement of read connection by dynamically trimming low quality nucleotides from the 3'-end of a read during extension, (e) repeat identification by breaking the extension at each repeat boundary to reduce the mis-assembly, and (f) merging multi-scale contigs by unassembled reads. Empirical and simulation studies revealed that JR-Assembler achieves not only better contig quality but also better memory usage than many popular methods, such as Velvet and SOAPdenovo. Indeed, for a computer with 256GB RAM, only JR-Assembler can handle several Giga reads within a reasonable time. |
Chu T*, Liu T, Lu C, Lee GC, Li W, Shih AC
*Institute of Information Science, Academia Sinica Taiwan |
K - Sequencing and Sequence Analysis |
|
K 31
K31 |
The annotation of genomes is one of the central topics in sequence research and has inspired various gene finding algorithms, which make use of the sequence information, EST libraries or proteomics to identify genes present in the genome. With the increasing availability of RNA sequencing (RNAseq) by high-throughput technologies, gene finding can also be exclusively driven by RNA reads, which even allows the de novo determination of spliced genes based on a single experiment.
We present GIIRA (Gene Identification Incorporating RNAseq data and Ambiguous reads), a new gene finder that, unlike most other tools, is based exclusively on an RNAseq mapping and inherently includes ambiguously aligned reads. First, GIIRA defines regions of interest supported by a sufficient number of mapped transcripts. Second, it assigns ambiguously mapped reads to their most likely origin using a non-negative least squares approach, which ensures that all possible information is included in the analysis and therefore avoids the exclusion of genes that are predominantly supported by ambiguous reads. In a last step, the potential gene regions are ranked based on a scoring function respecting their assigned reads. GIIRA is independent of any a priori annotation, relies on information not used in previous methods and is able to refine RNAseq mappings. Several simulations and tests on real E. coli data showed promising results and indicated that including ambiguous reads significantly increases the number of identified genes. |
Zickmann F*, Renard BY
*Research Group Bioinformatics (NG4), Robert Koch-Institute Germany |
K - Sequencing and Sequence Analysis |
|
K 32
K32 |
An essential question in the metagenomic analysis of a microbial ecosystem is the identification of the source microorganism of each read or contig, an aspect that is still in its infancy. To assess the performance of different taxonomic profiling tools and to investigate the microbial communities of a sample of a spontaneous Brazilian cocoa bean box fermentation process, both similarity-based and composition-based computational methods were applied. Rarefaction analysis based on these results indicated saturation for both the bacterial and fungal communities, proving that both dominant and rare members of the cocoa bean fermentation ecosystem were captured. In addition, bacteriophages associated with this cocoa bean fermentation sample were retrieved.
Operational taxonomic units that were consistently predicted by the different taxonomic profiling tools were taken into account to avoid a software-dependent outcome. This approach identified Hanseniaspora uvarum, Hanseniaspora opuntiae, Saccharomyces cerevisiae, Lactobacillus fermentum, and Acetobacter pasteurianus as the prevailing microbial species during the cocoa bean fermentation investigated, species that were also obtained by culture-dependent and culture-independent approaches. In addition, a wider community diversity was retrieved by the metagenomic sequencing approach, situated mainly within the γ-Proteobacteria and the fungal members. Hence, a complete and more reliable insight into the community diversity of the cocoa bean fermentation sample studied could be obtained. The wider diversity retrieved in the present study is of importance to generate further insights into the functional roles of bacteria, fungi, and bacteriophages during cocoa bean fermentation, which is necessary to select an appropriate starter culture for homogeneous, fast, and successfully controlled processes. |
Illeghems K, De Vuyst L, Papalexandratou Z, Weckx S*
*Vrije Universiteit Brussel Belgium |
K - Sequencing and Sequence Analysis |
|
K 33
K33 |
Copy Number Variation (CNV) is pervasive in the human genome and has been shown to play a causal role in genetic diseases. Recent advances in sequencing technologies have provided the means for identifying genomic variation at an unprecedented resolution. A single next-generation sequencing (NGS) experiment can offer a multitude of features that can be used to extract different CNV signatures. Achieving a high sensitivity across the CNV spectrum requires taking into account the strengths and weaknesses of each feature and incorporating them into a unified model.
Here, we present cnvHiTSeq, an integrative method for CNV discovery and genotyping that jointly analyzes multiple features of sequence data at the population level. cnvHiTSeq utilizes a population-haplotype framework to incorporate diverse data sources into a single probabilistic model. By organically combining evidence from read depth, read pairs and split reads, cnvHiTSeq achieves sensitive and precise discovery of all CNV classes even from low-coverage data. Using data from the trio and low-coverage phases of the 1000 Genomes Project, our method identified considerably more variants than competing methods, while maintaining a low false discovery rate (FDR). cnvHiTSeq detects ~80% of all CNVs greater than 100 bp with an FDR of 4.3% (9.8%) for deletions (duplications). Our unified approach also benefits from population-level modelling to achieve a high CNV genotyping accuracy of 98.2%. Thus, cnvHiTSeq offers a complete solution to sequencing-based CNV detection and genotyping, aiming to further our understanding of CNV impact on disease and evolution. |
Bellos E*, Coin L
*Imperial College London, Department of Epidemiology & Biostatistics United Kingdom |
K - Sequencing and Sequence Analysis |
|
K 34
K34 |
In higher organisms, chromatin formed by DNA bound to histones and other structural proteins can occur in a condensed form which needs to be opened for transcriptional activities. Several proteins are known to modify chromatin structure, and there is evidence for sequence components involved in the targeting of many of them. However, for many proteins, defining targeting sequences has proven difficult with available methods. The approach presented here identifies sequence words with predictive power and does not depend on the presence of a consensus motif. We use frequencies of short sequence words and build multivariate models to find the sequence variation in protein bound versus unbound sequences. An iterative algorithm then tries to identify potential DNA-motifs by aligning the top sequence words from multivariate models. This procedure has been applied to study sequence signatures involved in targeting the male-specific lethal (MSL) complex to X-chromosomal genes in Drosophila melanogaster. Through this study, separate sequence compositions were identified in different gene features (promoters, 5’ UTRs, coding sequences, introns, 3’ UTRs and intergenic sequences), which will impact motif discovery attempts. By taking this variation into account we find a distinct sequence composition within coding sequences of MSL bound genes. We also find DNA-motifs for the insulator protein BEAF upstream of MSL-bound genes and we verify the presence of BEAF using available mapping data. We can for the first time accurately predict MSL binding along the X-chromosome solely based on DNA sequence. |
Philip P*, Pettersson F, Stenberg P
*Department of Molecular Biology, Umea University Sweden |
K - Sequencing and Sequence Analysis |
|
K 35
K35 |
The human gut microbiota forms an interconnected ecosystem that plays an essential role in host homeostasis. Whole-genome sequencing provides insight into its phylogenetic as well as gene-centric functional composition. The Russian population provides a wealth of interesting cohorts for exploring metagenome variation: its socio-geographic settings range from multi-million-citizen metropolitan cities to sparsely populated rural areas. Together with financial polarization and differences in ethnic composition, lifestyle and diet, this makes such a large territory in the middle of Eurasia an interesting field for metagenomic studies. In order to explore this metagenomic diversity, 132 fecal samples were obtained from inhabitants of various cities and rural areas. Whole-genome sequencing was performed using SOLiD technology. Phylogenetic and functional profiling was performed by read alignment to reference genomes and genes. Coverage depth was normalized and summed across genera to obtain their relative abundance. Clustering analysis showed genera-level dominants of microbiota. Several novel combinations of community drivers were discovered. Correlations across various population cohorts were assessed, including urban versus rural populations, Asians versus Europeans and others. Genotype-level diversity was examined by analysis of consensus SNPs in reference genomes. It demonstrated different variance in rural population metagenomes (which tend to have close genotypes in abundant bacterial species) and metagenomes of large-city populations (which demonstrate high diversity, even when genera-level composition is similar). |
Tyakht A*, Popenko A, Belenikin M, Altukhov I, Ischenko D, Alexeev D, Govorun V
*Research Institute of Physico-Chemical Medicine Russian Federation |
K - Sequencing and Sequence Analysis |
|
K 36
K36 |
Gene fusion events and other types of chromosomal translocations are known to be related to several types of cancers and can be interpreted as hallmarks for the corresponding development stage of the disease. Therefore the detection of these translocations is attracting interest in modern genomics. RNA-seq is one of the most promising approaches to detect gene fusions on a genome-wide scale. However, RNA-seq data analysis can be challenging due to the huge amount of data, high error rate and repetitive nature of the genomes.
Some computational methods designed to deal with these challenges already exist. They use either paired-end information of sequencing reads, or local alignment of read segments, to detect fragments that are aligned to different genes and can form gene fusion candidates. Then, a number of filters are applied to these candidates in order to increase the specificity of the results. However, there is still room for improvement both in terms of accuracy and computational performance. Furthermore, most of the existing methods are not capable of detecting novel fusions (translocations which involve unknown or unannotated genes). Here we present a new method which efficiently combines several approaches to detect possible transcriptome rearrangements, including novel fusions. Our method outperforms existing methods both in terms of accuracy and computational speed. To provide high specificity our method estimates the biological relevance of possible fusion events by taking into account the expression pattern of the involved genes and applying a probabilistic model for multi-mapped reads. |
Okonechnikov K*, Glowinski F, Garcia-Alcalde F
*Max Planck Institute for Infection Biology Germany |
K - Sequencing and Sequence Analysis |
|
K 37
K37 |
RNA sequencing is widely used today for studying gene expression as well as for identifying and quantifying various mRNA processing intermediates in cells. In this study we employed ABI SOLiD 4 sequencing for the global analysis of the nuclear processing of unspliced U12-type introns by the exosome, a ribonuclease complex involved in RNA turnover.
There are two types of introns in multicellular organisms, termed U2- and U12-type introns. These two intron types are removed by separate spliceosomes in the cell nucleus. Earlier studies have shown that the splicing rate of the U12-dependent introns in the nucleus is slower than that of the U2-type, suggesting that the U12-type introns can regulate the levels of a subset of mRNAs in the cell. Consistently, an elevated level of unspliced U12-type introns has been detected in the steady-state mRNA populations in various organisms. Here we investigated the hypothesis that the unspliced mRNAs containing U12-type introns are degraded by the exosome complex. We compared the retention levels of the U12- and U2-type introns in control cells and in cells in which the exosome function has been disabled (e.g. knockdown of the Rrp41 and Dis3 subunits). Analysing more than 100,000,000 paired reads (50 bp + 35 bp, with ~200 bp library) that were mapped to U12-intron-containing genes, we discovered that exosome inactivation stabilizes unspliced U12-type introns as opposed to the U2-type introns in the same genes. Moreover, the effects of Rrp41 and Dis3 knockdowns were not identical, suggesting different regulatory roles for the two subunits. |
Oghabian A*, Niemelä E, Frilander M
*RNA-splicing laboratory, Institute of Biotechnology, University of Helsinki Finland |
K - Sequencing and Sequence Analysis |
|
K 38
K38 |
The binding of transcription factors (TFs) to their specific motifs in genomic regulatory regions is commonly studied in isolation. However, in order to elucidate the mechanisms of transcriptional regulation, it is essential to determine which TFs bind DNA cooperatively as dimers or trimers, and to infer the precise nature of these interactions.
We represented the universe of possible TF complexes by their corresponding motif complexes, and analyzed motif-complex enrichment at 652,036 cell-type-specific DNase I hypersensitive sites in 78 human cell lines. Based on this analysis, we predicted 319 highly significant cell-type-specific TF complexes, the vast majority of which are novel. Our predictions included several known examples of TF dimerization, showed significant overlap with an experimental database of protein-protein interactions inferred from mammalian two-hybrid assays, and were also independently supported by quantitative variation in DNase I digestion patterns. Interestingly, well-known master TFs for specific cell types were easily identifiable as connectivity “hubs” in the corresponding TF cooperativity networks. Our results indicate that chromatin openness profiles are highly predictive of cell-type-specific TF-TF interactions. Moreover, they suggest that cooperative TF dimerization is a widespread phenomenon, and that most cell types are regulated by multiple TF complexes. |
Jankowski A*, Szczurek E, Tiuryn J, Prabhakar S
*Genome Institute of Singapore Singapore |
K - Sequencing and Sequence Analysis |
|
K 39
K39 |
Bacterial small non-coding RNAs (sRNAs) are recognized as novel widespread regulators of gene expression in response to environmental signals. However, despite their importance, their accurate in silico prediction remains a challenging problem. Most of the existing algorithms identify and score potential sRNA structures on the basis of: a) thermodynamic stability, b) probabilistic models, and c) conservation and/or covariance of the sequence alignments. Although they represent a big step forward in the detection of sRNAs, they still generate high numbers of false positives. del Val et al. and Tjaden et al. combined these algorithms in pairs with variable results, leading to the observation that performance may improve when different methods are combined.
Following this line, we propose a search for optimal aggregations of individual methods (zMFold, QRNA, RNAz, Alifoldz, Dynalign, MSARi and vsFold) that reduce the number of false positive predictions and increase the sensitivity. The optimization strategy was carried out using a multi-objective evolutionary optimization algorithm, which maximizes the two contradictory measures of the goodness of an aggregation. The summarized prediction is obtained by combining the best aggregations through a majority vote multiclassifier. We trained our strategy on the genome of Salmonella typhimurium LT2, and validated the results in S. meliloti and on the multispecies dataset proposed by Tjaden et al., 2012. The results demonstrate that the proposed methodology minimizes the false-positive rate while simultaneously maximizing specificity and sensitivity when compared to individual and more sophisticated methods. Moreover, the predictions reach the same accuracy across genomes ranging from low to high GC content. |
Arnedo-Fernández J*, Romero-Zaliz R, Zwir I, delVal C
*University of Granada Spain |
K - Sequencing and Sequence Analysis |
|
K 40
K40 |
The recent surge in high-throughput sequencing data has provided the ability to process tables of read counts for differential expression analysis in an unbiased way. However, many genomic loci, despite having similar read counts across multiple samples, can exhibit differential read processing patterns. A read processing pattern originates from the positional arrangement of reads when mapped to a specific locus in the reference genome. In this study, we are developing a computational framework involving a previously published tool, deepBlockAlign, to compare the read processing patterns across total RNA-seq datasets from 11 human tissues. To nullify the effect of sequencing depth on read processing patterns, we have followed appropriate normalization steps. Preliminary analysis of 33,095 loci where expression is observed in all 11 samples has revealed ~1000 loci with differential read processing patterns, i.e., a specific read processing pattern for a specific set of tissues and another pattern for the rest of the tissues. The diversity in read processing patterns at a specific locus, especially in 3'UTRs, may reveal the diversity present in cis-regulatory mechanisms. |
Pundhir S*, Gorodkin J
*Center for non-coding RNA in Technology and Health, University of Copenhagen Denmark |
K - Sequencing and Sequence Analysis |
|
K 41
K41 |
Molecular biology is currently undergoing a fundamental transformation, from being largely wet-lab experiment driven to information driven, and from being qualitative to quantitative. DNA sequence analysis is at the center of this transformation.
The main problems of DNA sequence analysis are the assessments of sequence similarity and variation. Traditionally, these tasks have been approached by a variety of computer science methods. The hypothesis of this research is that the analysis of DNA sequence can be significantly improved by application of algebraic, combinatorial and information-theoretic methods. Utilizing these methods, we focus on the development of fundamental mathematical models of the DNA sequence and verification of their utility for biological function discovery and for comparative genomics. We postulate that these models should be applicable across organisms and species, and across vastly varying genome and genome library sizes. As a technical approach we propose to analyze the DNA sequence in terms of combinations of random strings and repetitions, associated with a certain special combinatorial construction known as cyclic difference sets. Results of an initial investigation of bacterial genomes demonstrate that this approach is biologically meaningful and computationally efficient. |
Brodzik A*
*MITRE United States of America |
K - Sequencing and Sequence Analysis |
|
K 42
K42 |
Utilising structural-bioinformatic tools to analyse cell-type expression datasets, we present a general method to annotate transcribed sequences with their corresponding protein-domain information. This unlocks a more detailed understanding of the evolutionary development and molecular basis of cell functions.
Such a structural perspective provides for more sensitive detection of genes homologous to transcribed sequences from across the tree of life, allowing for the creation of cell-type specific evolutionary profiles of domain usage. These profiles identify times of evolutionary descent where a given cell type appears to have utilised more or less of the protein innovation at that time than average. By clustering cell types on these profiles, we find groups that share a common protein evolutionary history. This highlights important domain architectures that define evolutionary shifts and functional innovations. Further, it also elucidates the order in which cell types could have come into existence and allows for effective comparisons of the molecular basis of the functional differences between cell types. |
Sardar A*, Rackham O, Oates M
*University of Bristol, Department of Computer Science United Kingdom |
K - Sequencing and Sequence Analysis |
|
K 43
K43 |
Tools for identifying microbial environmental sequences usually target bacterial 16S sequences. Using the internal transcribed spacer (ITS) region in Fungi presents several challenges, including the absence of a reliable curated reference set and taxonomic uncertainty arising from dual nomenclature (different names for the sexual and asexual states of single species). In our experience, these two factors contribute to the assignment of Fungal environmental sequences at high taxonomic ranks when Lowest Common Ancestor (LCA) approaches are applied. We briefly describe a pipeline for identifying Fungal ITS metagenomic sequences, and propose two heuristic variants of the LCA approach resulting in more specific identifications. The “%-support” method rejects BLAST matches to discordant taxa (possibly misidentified reference sequences) until a minimum percentage of matches are included and the “assign-to-rank” method rejects matches as required to reach the target rank. As a first step towards establishing an identification confidence value, the percentage support for each identification is reported. The impact of dual nomenclature was assessed and a method for handling these names will be reported.
The proposed heuristics were applied to a metagenomics dataset of approximately 20M Fungal ITS sequences generated by 454 pyrosequencing using a GS FLX instrument and Titanium reagents. The sequences were produced from 355 air and rain samples collected at regular intervals from 24 Canadian sites over a 4 year period. Bidirectional sequencing yielded both ITS1 and ITS2 sequences in approximately equal proportions. Our identification approach compensates for large scale chimeric sequences occurring at the conserved 5.8S region. |
Lewis CT*, Chen W, Bilkhu S, Zhang J, Seifert KA, Hambleton S, Sharpe AG, Lévesque CA
*Agriculture and Agri-food Canada Canada |
K - Sequencing and Sequence Analysis |
|
K 44
K44 |
Quantitative trait loci (QTL) mapping has been widely used to discover underlying genetic factors that can explain particular phenotypes of interest, for example the expression level of genes (eQTL). The recent advent of next-generation sequencing enables genetic profiling of chromatin traits. For example, Degner et al. (Nature 2012. 482,390-394.) utilized DNase-seq to map chromatin accessibility at millions of loci to HapMap-based genotypes across 70 human individuals. We carried out FAIRE-seq to profile chromatin accessibility in 96 yeast strains for which gene expression data are also available, and compared the patterns of chromatin QTL with eQTL. We found that genetic perturbations affect few open chromatin sites, while a large number of genetic factors have significant associations with gene expression. On the other hand, some genetic regulatory loci regulate large numbers of open chromatin sites. Based on these findings in yeast, we sought to dissect the genetic architecture of chromatin regulation using the public data of Degner et al. We find that the patterns of chromatin QTL and eQTL in human are similar to those in yeast, suggesting that chromatin structure is relatively robust to genetic perturbations and that there exist multi-target regulatory loci, or hot-spots, that influence a large number of chromatin traits. |
Kim K*
*KAIST (Korea Advanced Institute of Science and Technology) Korea, South |
K - Sequencing and Sequence Analysis |
|
K 45
K45 |
Large-scale analysis of gene expression via quantitation of mRNA levels is now routinely performed via sequencing (RNA-seq). RNA-seq can quantify expression with higher sensitivity and with larger dynamic range than microarrays at similar cost. However, both methods require the native RNA molecules in biological samples to be converted to cDNA and amplified prior to quantitation. Each step has the potential to introduce biases and artefacts into downstream analyses. Direct RNA sequencing (DRS) from Helicos BioSciences (Cambridge MA) avoids these problems by sequencing individual molecules of RNA directly from the sample, thereby providing higher fidelity data.
DRS data differ markedly from data produced by other sequencing technologies, which brings additional challenges in their analysis. In particular, DRS reads are shorter, have more sequencing errors and are very sensitive to mis-annotation of 3’-UTRs in the genome. We describe novel strategies for dealing with these differences, thereby producing cleaner data for gene expression analysis. We have applied DRS to a case/case study of atopic eczema in a unique collection of Irish children. 4 mm punch biopsies of skin were collected from 26 paediatric eczema cases and RNA was extracted. The cases were stratified by their FLG genotype (a strong genetic risk factor for eczema) as wild type (n=7), heterozygous (n=13) or compound heterozygous (n=6). Controlling for confounding factors was performed in edgeR, where nine genes were found to be significantly (FDR < 0.05) differentially expressed. Gene Ontology analysis revealed significant terms of relevance to the pathomechanisms of atopic eczema, which warrant further investigation. |
Cole C*, Brown S, Irvine A, McLean I, Barton G
*College of Life Sciences, University of Dundee United Kingdom |
K - Sequencing and Sequence Analysis |
|
K 46
K46 |
The genome of barley (Hordeum vulgare) has a size of more than 5 Gb and is characterized by a high content of repetitive DNA (~80%). A physical map and a reference sequence of the genome are being constructed by the International Barley Genome Sequencing Consortium (IBSC, http://barleygenome.org). Currently, ~80% (3.9 Gb) of the barley genome is represented by a genetically anchored physical map.
To improve the quality of the reference sequence, the IBSC decided to evaluate approaches of sequencing overlapping BAC clones derived from the minimal tiling path of the barley physical map comprising about 70,000 BAC clones. First results from BAC pooling approaches using different 2nd generation sequencing platforms and assembly methods will be presented. |
Felder M*, Taudien S, Himmelbach A, Steuernagel B, Schmutzer T, Mascher M, Ariyadasa R, Poursarebani N, Nussbaumer T, Gundlach H, Mayer KF, Platzer M, Scholz U, Stein N
*Leibniz Institute for Age Research - Fritz Lipmann Institute (FLI) Germany |
K - Sequencing and Sequence Analysis |
|
K 47
K47 |
The “ASSET” consortium (Analysing and Striking the Sensitivities of Embryonic Tumours) is focussing on unravelling network vulnerabilities in three major childhood tumours: neuroblastoma, medulloblastoma and Ewing sarcoma (ES).
Here, we apply a multi-layered approach to develop knowledge on the genomic landscape of the ES family of tumours. Ewing sarcoma refers to a family of highly malignant primary bone tumours, which arise either in bone or soft tissues, affecting children and adolescents. The genetic hallmark of ES is the presence of the chromosomal translocation t(11;22)(q24;q12) that generates the EWS-FLI-1 fusion gene. The corresponding fusion functions as an aberrant transcription factor that plays a crucial role in the pathogenesis of ES. In this work, we sequenced the exome of the Asp14 Ewing sarcoma cell line with and without inactivation of EWS/FLI by RNA interference to survey its exact mutational status. Identified mutations were prioritised using structural and functional level analysis approaches, in an attempt to gain mechanistic understanding. Combined results from the mutational status, the drug sensitivities and pathway knowledge analyses enable the investigation of the genetic signature of EWS/FLI1 as well as the identification of genes that are possibly involved in ES pathogenesis or in drug resistance mechanisms. |
Tsafou K*, Belling K, Kouskoumvekaki I, Herrero D, Kovar H, Gupta R
*Technical University of Denmark, Department of Systems Biology, Center for Biological Sequence Analysis Denmark |
K - Sequencing and Sequence Analysis |
|
K 48
K48 |
Alternative splicing is an important cellular process that increases transcriptome diversity, but aberrant splicing events often have pathological consequences. The emergence of next generation RNA sequencing (RNA-seq) provides an exciting new technology to analyze alternative splicing on a large scale. However, there are still no standard solutions available to analyze these data with respect to alternative splicing, and computational methods, in particular for the analysis of differential expression of genes and isoforms, are only beginning to emerge. We developed a new method to predict genes that are differentially spliced between two different conditions using RNA-seq data. Our method is based on geometric angles between high dimensional read count vectors. It is able to predict genes that undergo differential splicing even if these events are of higher complexity. We applied our approach to two case studies. We compared high-throughput sequences of the transcriptomes from human brain and liver samples as well as from neuroblastoma patients with low-risk and high-risk tumors. We verified our predictions by several methods including complementary in silico analyses of the data itself. We found additional evidence of splicing diversity in normalized read coverage plots and in reads that span exon-exon junctions. Furthermore, we found significant numbers of regulatory splicing factor motifs and a substantial number of publications linking our predicted genes to alternative splicing. Splicing differences can serve as powerful biomarkers to discriminate tissues or patients and have great potential to improve existing stratification methods for cancer patients. We could successfully exploit splicing information to cluster tissues and patients, and we were able to improve classification performance by combining gene expression based features with splicing features. |
Aschoff M, Hotz-Wagenblatt A*, Glatting K, Fischer M, Eils R, König R
*German Cancer Research Center (DKFZ) Germany |
K - Sequencing and Sequence Analysis |
|
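As a rough illustration of the geometric idea mentioned in poster K48, the angle between two read count vectors can be computed as shown below. This is only a sketch under stated assumptions: the function name `angle_between`, the per-exon count vectors and the choice of plain exon-level counts are hypothetical, and the abstract does not specify how the authors construct or test their vectors.

```python
import math


def angle_between(u, v):
    """Return the angle in degrees between two read-count vectors.

    Hypothetical helper: the poster does not describe its exact formulation.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("angle is undefined for an all-zero count vector")
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_theta))


# Made-up per-exon read counts for one gene in two conditions.
brain = [120, 80, 0, 95]
liver = [60, 40, 55, 48]

# Doubling every count changes expression level but not exon usage,
# so the angle stays (numerically) at zero.
scaled_brain = [2 * c for c in brain]

print(angle_between(brain, scaled_brain))
print(angle_between(brain, liver))
```

The appeal of an angle in this setting is that it ignores uniform scaling of the counts, so a change in overall gene expression alone does not register, whereas a change in the relative usage of exons does.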
K 49
K49 |
A tandem repeat is defined as a sequence of two or more contiguous, similar copies of a pattern of nucleotides. Tandem repeat sequences have many fields of application, including microbial typing, disease diagnosis, mapping studies, DNA fingerprinting in forensics, sequence homology, and population studies. In this poster, a new algorithm for finding variable number tandem repeats (VNTR) is described. The algorithm consists of three stages. The first finds the seed-unit pair with the maximum alignment score, the second extends the seed unit to the candidate unit with the maximum score, and the third enlarges the candidate unit to determine the tandem repeat region and its score. The global alignment method is used in each stage to find the tandem repeat region. The predicted VNTR regions are integrated into the local genome database and can be viewed through the VNTR viewer within the microbial genome annotation system WeGAS. |
Lee D*
*Agency for Defense Development Korea, South |
K - Sequencing and Sequence Analysis |
|
K 50
K50 |
In this poster we describe a “stream-as-you-go” approach that minimizes the data transfer time of data- and compute-intensive scientific applications deployed in the cloud, by making them incrementally processable and working completely in-memory through a data streaming engine, in this case the IBM InfoSphere Streams computing platform deployed over Amazon EC2. We describe our prototypes of two sample use cases, NGS read alignment with SHRiMP and SNP calling with Bowtie and soapSNP. Finally, we compare performance against trivial and Hadoop-based solutions. |
Kienzler R*, Tatbul N, Bruggmann R
*IBM Innovation Center Zurich Switzerland |
K - Sequencing and Sequence Analysis |
|
K 51
K51 |
To understand the regulatory properties of RNA binding proteins (RBPs) that are involved in translation control, it is important to identify their putative binding sites in their target RNA. The binding specificity of an RBP is predominantly determined by the nucleotide sequence of its binding motif, but since the binding motifs of RBPs are rather short, other binding properties might also contribute to their specificity, such as the accessibility of binding sites or clustering of binding sites allowing for binding of RBP multimers. Here, we propose a simple biophysical model to explain the binding specificities of GLD-1, an RBP that is involved in translation control in the germ line of Caenorhabditis elegans. Our analysis is based on different data from CLIP (cross-linking and immunoprecipitation) and RIP-Chip experiments, and on measured binding affinities to short nucleotide sequences. While the binding sites in UTRs seem to be mostly accessible, the prediction of binding sites in the coding region of mRNAs might be improved by taking accessibility into account. |
Brümmer A*, Zavolan M
*Biozentrum, Universität Basel Switzerland |
K - Sequencing and Sequence Analysis |
|
K 52
K52 |
We developed a novel method for single nucleotide polymorphisms (SNPs) detection in Next Generation Sequencing (NGS) data, which includes quality scores (QSs) from sequencing and mapping steps to accurately discriminate true polymorphisms from erroneous ones.
Cross-sample comparison of NGS data is a common approach to identify variant positions (e.g. SNPs). Our method compares two aligned read stacks to pinpoint variant positions with significant nucleotide differences based on as few prior assumptions as possible. Several applications fall into this framework, such as variant detection in forward genetic screens, medical resequencing, population allele frequency estimation or identifying editing events. We employ a Bayesian approach to perform a goodness-of-fit test of allele frequencies in sequenced samples that we model by the Dirichlet distribution. During a head-to-head comparison of two samples the prior distribution is calculated from the QSs that are provided by the employed NGS platform to define the posterior distributions for each pileup. SNPs are identified by means of a posterior ratio test, while mapping and sequencing errors are controlled for by adjusting a threshold. To validate our method, we developed a series of benchmarks that simulate typical read stacks from the aforementioned application scenarios. We synthetically generate sequencing data with polymorphic sites from varying mixing ratios and coverage combinations. Our method (ACCUSA2) shows performance similar to that of state-of-the-art solutions like samtools in typical SNP calling settings (e.g. the comparison of two diploid genomes). ACCUSA2 performs better in settings where allele frequencies and read stack depths are highly variable (allele frequency: 0.1 and sample coverage: 4x (mutant alleles) and 20x (reference); samtools sensitivity: 6.78% and precision: ~100.00%; ACCUSA2 sensitivity: 36.67% and precision: 92.10%). Furthermore, ACCUSA2 outperforms samtools/bcftools in terms of net running time. |
Piechotta M*, Dieterich C
*BIMSB @ MDC Germany |
K - Sequencing and Sequence Analysis |
|
K 53
K53 |
The Ibis base-caller provides fast, robust base-calling for the Illumina platform. The inclusion of a proprietary support vector machine (SVM) in the initial Ibis release, and a number of changes in the Illumina instrument and the chemistry have motivated us to develop Ibis further. We introduce here freeIbis, which now incorporates LIBOCAS, an open source multiclass support vector machine. The new version provides support for all current Illumina sequencing platforms, and improved sequence accuracy compared to the default basecalling provided by Illumina. A major improvement is the modification of quality score calculation. Quality scores now show a very high correlation to the observed error rates independent of nucleotide and cycle.
Ibis is freely available for download under the GPL from: https://bioinf.eva.mpg.de/freeIbis/ |
Renaud G*, Kircher M, Stenzel U, Kelso J
*Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology Germany |
K - Sequencing and Sequence Analysis |
|
K 54
K54 |
InterProScan is a bioinformatics tool that provides a one-stop-shop for automated sequence analysis of both protein and nucleic acid, the latter via a full six-frame translation. It offers the ability to identify both structural and functional regions of interest, based upon methods and models that have been generated by a large number of member groups ('member databases'). These member databases use a variety of different bioinformatic techniques and algorithms, optimised for specific feature types. InterProScan is therefore able to offer the researcher the ability to quickly characterize a new or novel sequence with considerable confidence. InterProScan 5 is a complete rewrite of the existing version, InterProScan 4, which was written in Perl. Version 5 has been developed to provide all of the existing functionality of the public InterProScan service with improved support for annotating nucleic acid sequences, mapping of features to GO and mapping of features to pathways. InterProScan 5 takes advantage of several modern Java technologies, utilising the Spring dependency injection framework and leveraging technologies such as Java Message Service (JMS) and BerkeleyDB to provide a fast and efficient pre-calculated match lookup service.
|
Jones P*, Fraser M, Quinn A, Scheremetjew M, McAnulla C, Hunter S
*EMBL-EBI United Kingdom |
K - Sequencing and Sequence Analysis |
|
K 55
K55 |
Whole-Exome Sequencing (WES) has emerged as a powerful sequencing strategy to identify novel mutations and genes associated with Mendelian disorders and complex traits. Managing sequence data and results from WES studies is a computationally challenging task. We present resExomeDB, an online catalog we developed for WES results and related publications. resExomeDB provides access to mutations and genes underlying human disease that have been identified recently by WES. In addition, our catalog relates WES results to 1) external gene-centered resources, 2) disease-oriented databases such as OMIM and the International Classification of Diseases (ICD), 3) genome-wide association study catalogs, and 4) external pathway and functional annotation databases. resExomeDB centralizes publications, software and platforms related to exome/whole genome sequencing and provides functionalities that allow browsing and searching the catalog by mutation, gene, study, or publication. The current release of the resExomeDB database contains more than 300 mutations in 259 genes associated with 175 diseases/traits, manually curated from more than 600 publications. resExomeDB is freely available for academic, non-commercial use at: http://www.exomedb.org |
Chami ME*, Benatchba K, Koudil M, Tazir M, Maouche S
*Bioinformatics-ESI Group, Ecole nationale Supérieure d'Informatique (ESI) Algeria |
K - Sequencing and Sequence Analysis |
|
K 56
K56 |
Assembling a genome de novo has become common practice in bioinformatics, with different technologies, algorithms and strategies currently in use and being developed. However, the question of how to assess the quality and accuracy of an assembly remains open. The most common metrics are N50, maximum and mean contig or scaffold length, and other size-related measures. These are measures of contiguity and do not assess accuracy. Although mapping known sequence collections, such as full-length cDNAs or genes, provides useful quality metrics, it only validates a subset of the assembly. Moreover, even when generating assemblies of assemblies or consensus graphs there is still no clear way of assessing which version of a differently assembled region is more accurate.
We have developed k-mer content based approaches to measure the information content of sequenced reads and assemblies. We have also assessed whether k-mer content is congruent with or clearly deviates from the known figures such as GC-content. Using the k-mer distribution of the reads and resulting assemblies we can explore information incongruence and bias between samples, libraries and runs, and in relation to the reference or the final assembly, as well as information coming from each set and how much of it is used, discarded or simply lost in the assembly procedure. We have used our methods with a range of species including relatively simple prokaryotes, yeasts and complex polyploid crops, and different sequencing technologies. |
Clavijo B*, Ayling S, Caccamo M
*The Genome Analysis Centre United Kingdom |
K - Sequencing and Sequence Analysis |
|
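To make the k-mer idea in poster K56 more concrete, the sketch below counts k-mers in reads and contigs and reports what fraction of the distinct read k-mers also appears in an assembly. It is a minimal illustration only: the function names, the toy sequences and the metric itself are assumptions rather than the authors' method, and reverse complements are deliberately ignored for brevity.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every k-mer made up only of A, C, G and T in the sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


def fraction_of_read_kmers_in_assembly(reads, contigs, k):
    """Fraction of distinct read k-mers that also occur in the contigs.

    Hypothetical summary statistic; strand is not taken into account.
    """
    read_kmers = kmer_counts(reads, k)
    assembly_kmers = kmer_counts(contigs, k)
    if not read_kmers:
        return 0.0
    shared = sum(1 for kmer in read_kmers if kmer in assembly_kmers)
    return shared / len(read_kmers)


# Toy data: five distinct 3-mers occur in the reads, four of them in the contig.
reads = ["ACGTAC", "GTACGG"]
contigs = ["ACGTACG"]
print(fraction_of_read_kmers_in_assembly(reads, contigs, 3))  # 0.8
```

A value well below 1 on real data would suggest that information present in the reads did not make it into the assembly, although sequencing errors also produce read-only k-mers and would need to be filtered out first, for example with a minimum count threshold.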
K 57
K57 |
Cancer is characterized by early gatekeeper mutations and the accumulation of later driving mutations. Understanding the evolution of those mutations from patient samples is critical for determining disease progression and requires distinguishing non-tumoral contaminating cells from sub-clonal tumoral cells. The algorithm DeCCoS (Determination of Clonality and Contamination from Sequencing) infers the progression of somatic mutations using DNA Next Generation Sequencing (DNASeq) data from tumor and matched germline samples.
Considering sets of representative SNPs (rSNPs) across clonal hemizygous somatic deletions, a 100% pure tumor has reads from only the undeleted allele. Therefore, reads supporting the deleted allele denote contamination or sub-clonality. The probability distribution of the reads supporting rSNPs results from the convolution of a Bernoulli and a binomial distribution representing reads from cells with and without the deletion, respectively. The deviation of the observed reads from the expectation characterizes the contamination/clonality in the deletion. Assuming uniform contamination across a sample, outlier(s) within a sample denote sub-clonal lesion(s) (i.e. a fraction of tumor cell population harbors the deletion). The level of sub-clonality potentially indicates the time at which the deletion event occurred, i.e., the higher the clonality the earlier the deletion emerged. We evaluated DNAseq data from seven human cancer samples with 27 recurrent somatic deletions that we have recently published. A median of 14 deletions per tumor were called with 22% of the deletions showing evidence for sub-clonality. Further in situ assays will be used to validate sub-clonality assessments and larger datasets will be considered to construct the evolution of driver somatic events. |
Prandi D*, Romanel A, Rubin M, Demichelis F
*Centre for Integrative Biology, University of Trento Italy |
K - Sequencing and Sequence Analysis |
|
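The reasoning in the abstract above can be made concrete with a simple point estimate. This is a simplified sketch under stated assumptions, not the DeCCoS model itself, which works with a full probability distribution. Assume a heterozygous SNP inside a hemizygous deletion, cells without the deletion contributing one copy of each allele, cells with the deletion contributing one copy of the retained allele only, and read counts proportional to DNA copies. If a fraction c of cells lacks the deletion, the expected share of reads supporting the deleted allele is β = c / (1 + c), hence c = β / (1 − β). The function name `admixture_from_reads` is hypothetical.

```python
def admixture_from_reads(deleted_reads, retained_reads):
    """Estimate the fraction of cells WITHOUT the deletion from allele read counts.

    Assumes beta = c / (1 + c), where beta is the share of reads supporting
    the deleted allele and c is the fraction of cells lacking the deletion.
    """
    total = deleted_reads + retained_reads
    if total == 0:
        raise ValueError("no reads cover the SNP")
    beta = deleted_reads / total
    # Sampling noise can push beta above 0.5; clamp so that c stays within [0, 1].
    beta = min(beta, 0.5)
    return beta / (1.0 - beta)


# A pure, fully clonal tumor shows no reads from the deleted allele.
print(admixture_from_reads(0, 50))
# 20 of 100 reads from the deleted allele: beta = 0.2.
print(admixture_from_reads(20, 80))
```

A deletion whose estimate stands out from the other deletions in the same sample would, following the abstract, be a candidate sub-clonal lesion rather than evidence of uniform contamination.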
K 58
K58 |
Since the announcement of the 1000 Genomes project in 2008, every year brings new sequencing projects targeting thousands of new species (e.g. 10K for vertebrates, i5K for insects). The avalanche of data generated by such projects raises an important challenge for their analysis and visualization. We present a method to systematically evaluate the quality of genome assemblies based on the consistency of transcriptome to genome alignments. Graph algorithms applied to a network representing such alignments allow selecting the optimal assembly strategy. Furthermore, the visualization in Cytoscape enables rapid confirmation of intra-scaffold connections between contigs and detection of possible scaffold extensions and inconsistencies. |
Riba-Grognuz O*, Keller L, Falquet L, Xenarios I, Wurm Y
*University of Lausanne Switzerland |
K - Sequencing and Sequence Analysis |
|
K 60
K60 |
The sequencing of the human genome was recently complemented by whole-genome assessment of other mammals; species used as experimental models (e.g. mouse, rat, rhesus macaque), animals positioned at key evolutionary junctures of the mammalian kingdom (e.g. opossum, platypus, chimpanzee) and/or for conservation purposes (Tasmanian devil). The dramatic drop in sequencing costs and increase in sequencing efficiency have allowed the Genome 10K Community of Scientists to aim to assemble a genomic zoo—a collection of DNA sequences representing the genomes of 10,000 vertebrate species, approximately one for every vertebrate genus. The trajectory of cost reduction in DNA sequencing suggests that this project is feasible now. Capturing the genetic diversity of vertebrate species would create an unprecedented resource for the life sciences and for worldwide conservation efforts. To achieve this goal the scientific community should also concomitantly develop improved annotation and assembly technologies adapted to short-read de novo sequencing.
We have recently sequenced and assembled de novo the genome of the chamois (Rupicapra rupicapra), an emblematic species of the Alps. We produced a scaffold that totals about 2.7 Gb of sequence (scaffold N50 is 14702 bp, scaffold N90 is 1412 bp). The current accepted taxonomy recognizes two species on the basis of morphological and behavioral characters. However, it is controversial as different authors suggested that one, three or even six species should be considered for all the chamois populations. The aim of the present proposal is to determine the population structure of the Alpine chamois, as well as resolve the controversy about the taxonomy of the Rupicaprina. |
Sarkar N*, Reymond A, Robinson-Rechavi M
*University of Lausanne and Swiss Institute of Bioinformatics Switzerland |
K - Sequencing and Sequence Analysis |
|
K 61
K61 |
Over the past few years, massively parallel DNA sequencing platforms have become widely available, reducing the cost of DNA sequencing by over five orders of magnitude. We used two of these methods to reveal the genome sequence of a highly abundant cyanobacterium in biological desert crust. These rapidly evolving next-generation sequencing technologies pose challenges for bioinformaticians in terms of sequence quality scoring, alignment, assembly and more, which makes de novo assembly difficult.
We are working on solving the genome of a desert cyanobacterium from biological sand crusts. Biological sand crusts are found in many deserts around the world. They play an important role in stabilizing sandy areas and affect the vegetation composition. The crusts are formed by adhesion of the sand to extracellular polysaccharides excreted mainly by filamentous cyanobacteria. Their destruction by human activities is considered an important promoter of desertification. Using the SOLiD™ System, we were able to recover most of the genes in the genome; however, the short reads produced by the SOLiD™ technique assembled into short contigs, which failed to assemble into scaffolds, and the draft output was highly fragmented. By adding a second sequencing method, 454, we obtained much longer contigs, which assembled into scaffolds. However, the 454 method introduced sequence contamination. The fragmented SOLiD™ data helped us to filter out these sequence contaminants. Only the combination of the two methods enabled us to produce a pure draft genome. The identity and uniqueness of the microbe will be shown. |
Shotland Y*
*Chemical Engineering, Shamoon College of Engineering Israel |
K - Sequencing and Sequence Analysis |
|
K 62
K62 |
While splicing differences between tissues, sexes and species are well documented, very little is known about the extent and the nature of splicing changes that take place during development and aging of humans and other mammalian species. Here, using high-throughput transcriptome sequencing, we have characterized splicing changes that take place over the whole human lifespan in two brain regions: prefrontal cortex and cerebellum. Identified changes were confirmed using an independent RNA-seq dataset, exon arrays and PCR, and were detected at the protein level using mass spectrometry. Splicing changes across the lifespan were abundant in both of the brain regions studied, affecting more than a third of the genes expressed in the human brain. Approximately 15% of these changes differed between the two brain regions. Across the lifespan, splicing changes followed discrete patterns that could be linked to neural functions and were associated with the expression profiles of the corresponding splicing factors. More than 60% of all splicing changes represented a single splicing pattern reflecting preferential inclusion of gene segments potentially targeting transcripts for nonsense-mediated decay in infants and the elderly. |
Mazin P, Xiong J*, Liu X, Khaitovich P
*Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Chinese Academy of Sciences China |
K - Sequencing and Sequence Analysis |
|
K 63
K63 |
Approximate Tandem Repeats (ATRs) in a genomic sequence are contiguous, inexact repetitions of a pattern of nucleotides. They play a role in the regulation of gene expression and transcription and are widely employed in DNA mapping, forensic analysis and many evolutionary studies. Their genesis, meaning and functions are not well understood and remain a subject of ongoing research. An extremely important aspect of this investigation is the development of tools for efficient and accurate identification of ATRs. Their location in genomes can be detected using experimental techniques. However, given the power of direct sequencing technology, methods for discovering ATRs with text mining techniques are coming to the fore.
We applied our approach, based on the Burrows-Wheeler transform of the input sequence, to detect ATRs in the whole human genome. Furthermore, we compared the obtained results with the output of two other tools designed to identify ATRs: mreps and Tandem Repeats Finder. A quantitative comparison of all, common and unique ATRs found by the distinct tools was made, classifying the repeats found according to motif length. Moreover, we tried to identify biologically significant ATRs by looking for a variant number of copies of discovered ATRs in the alternative human genome. The results demonstrate differences and similarities in the ATRs discovered and reveal that all three methods are able to find meaningful repeats. Acknowledgments. This work was supported by the European Union from the European Social Fund (grant agreement number: UDA-POKL.04.01.01-00-106/09). |
Danek A*, Pokrzywa R, Polański A
*Silesian University of Technology Poland |
K - Sequencing and Sequence Analysis |
|
K 64
K64 |
Next generation sequencing technology has considerably improved over the past few years: in particular, short-read sequencing is very easy and cheap. On the other hand, the procedures for read assembly are still unsatisfactory, as they generate a vast number of relatively short contigs that would require suitable physical maps and scaffolding procedures to be further assembled into a draft genomic sequence. Unfortunately, physical maps are still very difficult to produce and the current methods take little advantage of next generation sequencing technology. The aim of our project is to develop a strategy to overcome this problem.
We obtained a BAC library of more than 11,000 clones from the Nannochloropsis gaditana genome, with an average insert size of 120 kb. High coverage of the genome was also produced by an independent shotgun project. By selecting random clones from the BAC library we produced 64 pools, each covering about 40% of the genome; each pool was sequenced with the SOLiD-5500XL. Non-repeated sequences can be considered as genetic markers, and each should be represented in about 40% of the pools. By looking at their presence or absence in each pool, we could create a profile for each unique genomic sequence. The rationale for producing a physical map is that two neighboring sequences should share very similar profiles, which makes it possible to position them in a high-density, high-quality physical map. The preliminary results of this assembly indicate that this approach can lead to the production of reliable physical maps. |
De Pascale F*, Schiavon R, Corteggiani Carpinelli E, Telatin A, Vezzi A, Valle G
*University of Padua, CRIBI, Department of Biology Italy |
K - Sequencing and Sequence Analysis |
|
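The profile idea in the abstract above can be sketched in a few lines: each unique sequence gets a presence/absence vector across the pools, and sequences with nearly identical vectors are candidate neighbors. This is only an illustration under assumptions: the abstract does not state which similarity measure is used, so Hamming distance is a stand-in, the marker names and 8-pool profiles are invented (the study used 64 pools), and `nearest_neighbors` is a hypothetical helper.

```python
def hamming(p, q):
    """Number of pools in which two presence/absence profiles disagree."""
    return sum(a != b for a, b in zip(p, q))


def nearest_neighbors(profiles):
    """For each marker, return the other marker with the most similar profile."""
    result = {}
    for name, prof in profiles.items():
        candidates = [
            (hamming(prof, other_prof), other)
            for other, other_prof in profiles.items()
            if other != name
        ]
        result[name] = min(candidates)[1]  # smallest distance wins
    return result


# Toy profiles over 8 pools (1 = sequence detected in the pool).
profiles = {
    "m1": (1, 0, 1, 1, 0, 0, 1, 0),
    "m2": (1, 0, 1, 1, 0, 1, 1, 0),
    "m3": (0, 1, 0, 0, 1, 1, 0, 1),
    "m4": (0, 1, 0, 1, 1, 1, 0, 1),
}
neighbors = nearest_neighbors(profiles)
print(neighbors)
```

With 64 pools and roughly 40% coverage per pool, unrelated sequences are unlikely to share a profile by chance, which is what makes the profiles usable as positional evidence.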
K 65
K65 |
Tandem repeats represent one of the most prevalent features of genomic sequences. Due to their abundance and functional significance, a plethora of detection tools has been devised over the last two decades. Despite the longstanding interest, tandem repeat detection is still not resolved.
Our large-scale tests reveal that current detectors produce different, often non-overlapping inferences, reflecting characteristics of the underlying algorithms rather than the true distribution of tandem repeats in genomic data. Our simulations show that the power to detect tandem repeats depends on the degree of their divergence and on repeat characteristics such as the length of the minimal repeat unit and the number of units in tandem. To reconcile the diverse predictions of current algorithms, we propose and evaluate several statistical criteria for measuring the quality of predicted repeat units. In particular, we propose a model-based phylogenetic classifier, entailing a maximum likelihood estimation of the repeat divergence. Applied in conjunction with state-of-the-art detectors, our statistical classification scheme for inferred repeats allows false positive predictions to be filtered out. Since different algorithms appear to specialize in predicting tandem repeats with certain properties, we advise applying multiple detectors with subsequent filtering to obtain the most complete set of genuine repeats. |
Schaper E*, Kajava A, Hauser A, Anisimova M
*ETH Zürich Switzerland |
K - Sequencing and Sequence Analysis |
|
K 66
K66 |
Next-generation sequencing (NGS) provides unprecedented single-base-level information on the human genome and transcriptome and opens up previously inaccessible biological questions. Natural though computationally challenging problems include understanding where transcription arises predominantly from one allele and which genomic aberrations reflect genomic mosaicism. We developed a multi-threaded per-base SNP pileup application (based on the samtools API) that speeds up the most computationally intensive task in single-base-level problems. Its execution time increases linearly with the number of SNPs and decreases logarithmically with the number of cores. Using 32 cores of an in-house machine on a 30x whole-genome data file (~200Gb) and 500,000 random SNPs, we measured a 17x speed-up with respect to the canonical samtools implementation (mpileup). In addition, the output is downsized and its format is designed to simplify post-processing computations.
We implemented the PaPI approach (Parallel Pileup Implementation) to optimize the study of allele-specific expression (ASE) in the context of human prostate cancer. Given a gene list of interest and a set of heterozygous gene coding SNPs, our application computes ASE estimates from RNA-seq NGS data and performs a test that indicates the ASE statistical significance. The multi-threaded application PaPI can be used within any NGS pipeline that requires SNPs per-base pileup and runs on computer systems with multiple CPUs, CPUs with multiple cores, or across a cluster of machines. |
Romanel A*, Prandi D, Demichelis F
*University of Trento (CIBIO) Italy |
K - Sequencing and Sequence Analysis |
|
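The abstract above does not say which test is used to assess the significance of allele-specific expression. One common choice, shown here purely as an assumption, is an exact two-sided binomial test of the reference and alternative read counts at a heterozygous SNP against an expected ratio of 0.5. The function name `binomial_two_sided_p` is hypothetical, and the sketch uses only the standard library rather than the samtools API.

```python
from math import exp, lgamma, log


def binomial_two_sided_p(ref_reads, alt_reads, p=0.5):
    """Exact two-sided binomial p-value for allelic imbalance (0 < p < 1).

    Sums the probabilities of all outcomes that are no more likely than
    the observed one under Binomial(n, p), with n = ref_reads + alt_reads.
    """
    n = ref_reads + alt_reads
    log_p, log_q = log(p), log(1.0 - p)
    # Work in log space so that deep coverage does not overflow or underflow.
    pmf = [
        exp(lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log_p + (n - k) * log_q)
        for k in range(n + 1)
    ]
    observed = pmf[ref_reads]
    total = sum(x for x in pmf if x <= observed * (1.0 + 1e-9))
    return min(1.0, total)


# Balanced expression: no evidence of allele-specific expression.
print(binomial_two_sided_p(5, 5))
# All ten reads from one allele.
print(binomial_two_sided_p(10, 0))
```

When many SNPs or genes are tested, the resulting p-values would still need a multiple-testing correction before being interpreted.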
K 67
K67 |
Tobacco (N. tabacum) is an allotetraploid plant species, thought to be the result of a recent (0.2 MyA) hybridization of the ancestors of two other members of the Solanaceae family, N. sylvestris and N. tomentosiformis. Analyzing the transcriptomes of these species should shed light on the differences in the genetic make-up of tobacco’s ancestor species and pave the way for analyzing the transcriptome of this commercially and scientifically important plant.
We have sequenced the transcriptomes of leaves and roots from N. sylvestris and N. tomentosiformis and assembled them using de novo computational techniques. From the 56 to 87 million reads, we assembled between 145k and 247k transcript sequences from 92k to 196k gene models. From these we predict up to 20k proteins in each of the plant organs. Here, we present the annotation of the transcriptome using a variety of methods and show the functional breakdown of the protein complement in terms of enzyme classification (E.C. numbers) and GO terms. We further give an overview of the inter-species and inter-tissue differences. |
Battey J*, Sierro N, Ouadi S, Goepfert S, Peitsch MC, Ivanov NV
*Philip Morris International R&D, Philip Morris Products SA Switzerland |
K - Sequencing and Sequence Analysis |
|
K 68
K68 |
In the past few years, thousands of long intergenic non-coding RNAs (lincRNAs) have been discovered in humans and other species. Although the functions of lincRNAs remain largely unknown, several lincRNAs have been shown to act as important regulators, e.g. XIST, HOTAIR and HOTTIP. To gain insight into the roles of lincRNAs in human brain development and aging, we sequenced the human brain transcriptome at 14 stages of development and aging distributed over the entire human lifespan using high-throughput sequencing (RNA-seq).
Currently there are more than 10,000 human lincRNAs annotated by Ensembl release 64 and in the work of Cabili et al. To further improve the lincRNA annotation, we successfully reconstructed another 552 novel lincRNA gene candidates with our data. Taking all into account, more than 1,300 lincRNAs with evident expression in the human brain were identified. Over one third of them showed significant expression changes with age. Based on their expression patterns, we tried to infer lincRNA functions. Furthermore, we identified both cis- and trans-regulatory effects of these age-related lincRNAs on the expression of protein-coding genes. The results are also supported by macaque prefrontal cortex transcriptome time series data and by the effect of lincRNA knockdown experiments in mouse. The consistency of expression profiles and effects between human lincRNAs and their orthologs in other species implies the functional conservation and importance of lincRNAs. |
He Z*, Bammann H, Hu H, Khaitovich P
*CAS-MPG Partner Institute for Computational Biology China |
K - Sequencing and Sequence Analysis |
|
K 69
K69 |
Finding genes that are differentially expressed between conditions is an integral part of understanding the molecular basis of phenotypic abnormalities. During the last decades, DNA microarrays have been used extensively to quantify the abundance of mRNA corresponding to different genes, and more recently high-throughput sequencing of cDNA (RNA-seq) has emerged as a powerful alternative. As the costs of sequencing decrease, it is conceivable that the use of RNA-seq for differential expression analysis will increase rapidly. To address the challenges posed by the fundamental differences between the data obtained from microarray and RNA-seq experiments, a number of new software packages have been developed especially for differential expression analysis of RNA-seq data. We conduct an extensive evaluation and comparison of ten of these packages in terms of their ability to find differentially expressed genes and control the rate of false discoveries under different conditions. All evaluated methods are freely available within the R framework and take as input a matrix of counts, i.e. the number of reads mapping to each genomic feature of interest in each of a number of samples. The results show that very small sample sizes, which are still common in RNA-seq experiments, pose problems, albeit of different types, for all evaluated methods. For larger sample sizes, there are considerable differences in performance among the methods. |
Soneson C*
*Bioinformatics Core Facility, Swiss Institute of Bioinformatics Switzerland |
K - Sequencing and Sequence Analysis |
|
K 70
K70 |
HiCUP is the first publicly available pipeline tailored for processing data generated by Hi-C, a new technique to investigate the three-dimensional organization of the genome. Hi-C involves formaldehyde fixation of chromatin to preserve genome structure, followed by restriction enzyme digestion, ligation and then sonication to generate a population of short DNA sequences, termed di-tags, which reflect the spatial arrangement of the genome at the time of fixation.
HiCUP takes paired-end FASTQ files as input and then independently maps forward and reverse reads using Bowtie. Independent mapping is essential since di-tags, unlike conventional paired-end reads, do not represent one continuous sequence. HiCUP then pairs mapped forward and reverse reads, removing artefacts and other uninformative di-tags by positioning reads on an in silico digestion of the reference genome. Removing experimental Hi-C artefacts is important since even a small number of invalid di-tags could lead to incorrect conclusions being drawn concerning the structure of the genome. HiCUP reports valid di-tags in BAM/SAM format, readily amenable for post-pipeline visualisation and analysis. Furthermore, the pipeline is flexible, allowing users to specify numerous parameters and provides statistics summarising the results, which may help to improve the experimental protocol. |
Wingett S*
*The Babraham Institute United Kingdom |
K - Sequencing and Sequence Analysis |
|
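The in silico digestion step described in the abstract above can be illustrated with a small sketch. It is not HiCUP's code and makes several simplifying assumptions: a single toy chromosome, fragments cut at the first base of each recognition site rather than at the enzyme's real cut position, and only one artefact class (both reads of a di-tag falling in the same restriction fragment). The HindIII site `AAGCTT` is used as an example, and all function names are hypothetical.

```python
import bisect


def digest(sequence, site):
    """Return the sorted 0-based start coordinates of the restriction fragments."""
    starts = [0]
    pos = sequence.find(site)
    while pos != -1:
        if pos > 0:  # a site at position 0 does not open a new fragment
            starts.append(pos)
        pos = sequence.find(site, pos + 1)
    return starts


def fragment_index(starts, position):
    """Index of the fragment that contains a mapped read position."""
    return bisect.bisect_right(starts, position) - 1


def classify_ditag(starts, pos_fwd, pos_rev):
    """Flag di-tags whose two reads map to the same restriction fragment."""
    same = fragment_index(starts, pos_fwd) == fragment_index(starts, pos_rev)
    return "same_fragment" if same else "different_fragments"


# Toy reference with two HindIII sites.
genome = "GGGG" + "AAGCTT" + "CCCCCC" + "AAGCTT" + "TTTT"
starts = digest(genome, "AAGCTT")

print(starts)
print(classify_ditag(starts, 5, 12))  # both reads inside the middle fragment
print(classify_ditag(starts, 2, 20))  # reads in the first and last fragments
```

A di-tag within a single fragment cannot come from a ligation between two distinct fragments, which is why such pairs are uninformative about genome structure.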
K 72
K72 |
With currently available ultra-deep, replicated RNA-Seq data, transcript-by-transcript analysis of reproducibility has shown that expression estimates for most genes are very noisy. Here we introduce MapAl, a tool for fast and straightforward expression profiling by RNA-Seq that builds on existing tools. In the post-processing of RNA-Seq reads, MapAl incorporates gene models already at the stage of read alignment, consistently increasing the number of reliably measured known transcripts by 50%. Adding genes identified de novo then allows a reliable assessment of double the total number of transcripts compared to other popular pipelines. This substantial improvement is of general relevance: measurement precision determines the power of any analysis to reliably identify relevant signals or changes, such as in screens for differential expression, independent of whether replicates are employed or not.
MapAl supports both users and further development by giving a free choice of combining alternative steps at different stages of the process. In particular, a wide range of read mappers supporting the standard SAM format can be employed, e.g. Bowtie and SHRiMP2. With the new release we have also improved the handling of exon junctions, especially for reads spanning multiple splice junctions. These reads are particularly powerful in the discrimination of specific splice forms and, with the read lengths of modern platforms ever increasing, reads spanning multiple splice junctions are becoming more frequent. |
Labaj PP, Sykacek P*, Kreil DP
*Chair of Bioinformatics, Boku University Vienna Austria |
K - Sequencing and Sequence Analysis |
|
K 73
K73 |
MicroRNAs are post-transcriptional regulators that bind to complementary sequences on target mRNAs, usually resulting in translational repression or gene silencing. The best characterized features determining miRNA-target recognition are short, 6-nt seed sites, which perfectly complement the 5' end of the miRNA (positions 2-7). However, this strategy suffers from both false-positive (~40-66%) and false-negative (~50-70%) predictions and cannot identify noncanonical target sites. Moreover, it has been experimentally validated that perfectly matched miRNA seeds are neither necessary nor sufficient for all functional miRNA-target interactions.
Here, we found a group of microRNAs that contain a CG sequence in the seed region. Compared with canonical target sites, mRNAs harboring U/A-bulge sites were both evolutionarily more conserved and less conserved in their neighboring context. Meanwhile, such U/A-bulge targets show a similarly significant repression effect in miRNA over-expression and knock-down experiments. In summary, we found a new class of miRNA target sites containing nucleation bulges and an alternative mode of miRNA target recognition. |
Yan Z*
*Chinese Academy of Science China |
K - Sequencing and Sequence Analysis |
|
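For reference, the canonical seed rule that the abstract above contrasts with bulge sites can be written down directly: take miRNA positions 2-7, reverse-complement them, and search the mRNA for that 6-mer. The sketch below shows only this canonical baseline, not the authors' bulge-site detection; the function names and the toy mRNA are hypothetical, and the let-7a sequence is used merely as a familiar example.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def seed_site(mirna):
    """Return the mRNA 6-mer (5'->3') that pairs perfectly with miRNA positions 2-7."""
    seed = mirna[1:7]  # positions 2-7 in 1-based numbering
    return "".join(COMPLEMENT[base] for base in reversed(seed))


def find_seed_sites(mirna, mrna):
    """Return the 0-based start positions of all perfect 6-mer seed matches."""
    site = seed_site(mirna)
    positions = []
    pos = mrna.find(site)
    while pos != -1:
        positions.append(pos)
        pos = mrna.find(site, pos + 1)  # allow overlapping matches
    return positions


let7a = "UGAGGUAGUAGGUUGUAUAGUU"  # example miRNA, written 5'->3'
toy_mrna = "AAACUACCUCAGGG"       # hypothetical 3' UTR fragment

print(seed_site(let7a))
print(find_seed_sites(let7a, toy_mrna))
```

A bulged site would fail this exact-match search by construction, which illustrates why seed-only prediction misses noncanonical targets.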
K 74
K74 |
The analysis of deep sequencing (DS) data requires scalable computational infrastructure, as hundreds of samples may result in 10 TB of DS data. A promising technological answer is cloud services, where execution can be effectively parallelized and disk speed can be optimized. We have extended the freely available Anduril framework, our previously developed bioinformatics workflow engine for large-scale data analysis (Ovaska et al. Genome Medicine 2010), into a scalable solution for analyzing various types of DS data. Anduril is now usable both in cloud services and on single machines. For scalability, Anduril includes: 1) a divide-and-conquer scheme for splitting and iterating datasets, 2) optimized execution and automatic parallelization, 3) connections to public resources, 4) third-party implementations, 5) remote execution to distribute the calculation, and 6) automatic installation for distributing Anduril to various infrastructures or cloud nodes.
Here we show the utility of Anduril in cloud infrastructure in processing DS data by analyzing 171 breast cancer patients (171 tumor exomes and corresponding blood exomes, 19 RNA-seq and 16 miRNA-seq samples) available in The Cancer Genome Atlas (TCGA) database. Our complete DS pipeline can be effectively parallelized, which allows fast execution of hundreds of samples. The pipeline contains variant analysis using various existing methods, such as VarScan and GATK, as well as a custom filtering scheme. Here we also show an approach to first query variants in different chromosomal locations, such as known or predicted binding sites, and combine these results with RNA-seq and miRNA-seq data. |
Karinen S*, Lindell R, Louhimo R, Ovaska K, Rogojin V, Laakso M, Lahesmaa-Korpinen A, Cervera A, Chen P, Núñez-Fontarnau J, Hautaniemi S
*University of Helsinki Finland |
K - Sequencing and Sequence Analysis |
|
L 01
L1 |
Pheno2geno is a package for the computational generation of markers and construction of genetic maps from molecular phenotypes in segregating inbred line populations. This is possible because phenotypes with a clearly separated bimodal or trimodal expression distribution can be used as genetic markers. Pheno2geno offers de novo map construction, saturation of existing maps and detection of sample mix-ups.
Pheno2geno starts by finding the phenotypes that are suitable as markers. These phenotypes should show differential expression between founders and are selected using a Student’s t-test or RankProd. In the next step, mixture modeling is used to select phenotypes that show bimodal (e.g. for RIL, BC) or trimodal (e.g. for F2) expression patterns with mixing proportions comparable to the expected segregation frequencies (e.g. 1 to 1 in RIL) across all the offspring individuals. Only these phenotypes are transformed from continuous measurements into discrete markers. After marker detection, the markers are used to create a de novo map. Additional information, such as known physical and/or genetic positions for some or all of the markers, can be used to improve the quality of the resulting map. When a genetic map is available, pheno2geno can be used to saturate it. Each marker is assessed for whether it has a single significant QTL. If so, it is placed on the map at the position of this peak; if not, it is dropped from the analysis. Pheno2geno can also be used to point out sample mix-ups and errors. Using QTL information, the package compares the observed phenotype value with the expected value. This yields a mismatch score and phenotype-based idealized genotypes. Tests on 180 individuals from an A. thaliana inbred population show that pheno2geno is able to point out wrongly measured samples and saturate the genetic map. Saturation decreased the average distance between markers more than fourfold. |
Zych K*, Arends D, Jansen RC
*Jagiellonian University Poland |
L - Technology and Software |
|
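The mixture-modeling step in the abstract above can be sketched as follows. This is a minimal illustration in Python, not pheno2geno's implementation: it fits a two-component Gaussian mixture with a plain EM loop, accepts the phenotype only if the mixing proportion is close to the expected 1:1 segregation, and then converts each individual into a discrete marker call. The function names, the tolerance of 0.15 and the posterior cut-off of 0.9 are illustrative assumptions.

```python
import math


def fit_two_gaussians(values, iterations=50):
    """Fit a two-component 1D Gaussian mixture with EM; return weights and posteriors."""
    lo, hi = min(values), max(values)
    means = [lo, hi]  # deterministic start: one component at each extreme
    variances = [max(((hi - lo) / 4) ** 2, 1e-6)] * 2
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iterations):
        # E-step: posterior probability of each component for each value.
        resp = []
        for x in values:
            dens = [
                w * math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            if nj < 1e-9:
                continue
            weights[j] = nj / len(values)
            means[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, values)) / nj
            variances[j] = max(spread, 1e-6)
    return weights, resp


def call_marker(values, expected=0.5, tolerance=0.15, min_posterior=0.9):
    """Return 0/1 marker calls, or None if the phenotype does not segregate as expected."""
    weights, resp = fit_two_gaussians(values)
    if abs(weights[0] - expected) > tolerance:
        return None
    return [0 if r[0] >= min_posterior else 1 if r[1] >= min_posterior else None
            for r in resp]


bimodal = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1]
skewed = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 5.0]
print(call_marker(bimodal))
print(call_marker(skewed))
```

Individuals whose posterior is below the cut-off for both components are left as missing calls, which keeps ambiguous measurements out of map construction.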
L 02
L2 |
The β-transmembrane (βTM) proteins, embedded in the outer membranes of gram-negative bacteria, mitochondria, chloroplasts, and cell wall of gram-positive bacteria, perform critical roles in active and passive transport of small molecules and nutrients in bacteria. Consequently, they are also important as targets for several antimicrobial drugs. However, these proteins representing only ~0.27% of the PDB structures remain largely unexplored due to experimental difficulty. Therefore, a detailed structural and functional analysis of the βTM proteins is required. An initial step towards this direction is predicting the transmembrane regions. The limited availability of data is the major hindrance in developing such prediction tools. Most of the existing algorithms are based on hydropathy scales and evolutionary information.
Here we present a β-transmembrane region prediction algorithm based on sequence information alone. The βTM protein sequences are encoded mathematically into an amino acid adjacency matrix. The corresponding matrix invariant is then used as the descriptor set to develop a data-driven algorithm for the prediction of transmembrane regions using the non-linear modeling method Counter Propagation Neural Network. The algorithm is therefore independent of any evolutionary data and physicochemical property indices. The sensitivity of the classification model in the self-consistency test is 91.02%. When tested with an external validation set, the sensitivity is 77.87%. Statistical constraints are then applied to the results obtained from the classification model to obtain the final transmembrane regions. The algorithm performs with a sensitivity of 73.75% when challenged with 35 βTM proteins that were not used in the development of the prediction model. |
Roy Choudhury A*, Novič M
*National Institute of Chemistry Slovenia |
L - Technology and Software |
|
L 03
L3 |
Thanks to structural biology and omics technologies, biologists have now collected large amounts of data regarding individual biological molecules. The time has come to put individual molecules together and set them back into their cellular context. We describe Graphite-LifeExplorer, an open source 3D modeling tool for building a system view of living organisms. In addition to the 3D modeling of protein arrangements, one of its main features is the ability to model, in an intuitive manner, DNA molecules of unlimited length and at the resolution of one base pair. Its objective is to offer an easy way to add DNA to a 3D environment containing proteins dispersed or arranged into complexes. A full-atom export capability allows a multi-component 3D assembly model made with Graphite-LifeExplorer to be reused with the various computational methodologies available today. Among the next developments will be the ability to model single-stranded nucleic acids and therefore to represent structural arrangements of RNA and proteins.
http://www.lifeexplorer.eu/ http://www.loria.fr/~shornus/FFG/gle.html |
Lariviere D*, Hornus S, Levy B, Fourmentin E
*Fourmentin-Guilbert Scientific Foundation France |
L - Technology and Software |
|
L 04
L4 |
Jalview, a widely used sequence alignment, editing, visualization and analysis tool, provides access to state-of-the-art open-source bioinformatics algorithms via the Java Bioinformatics Analysis Web Services (JABAWS) system. Jalview 2.8, released in Summer 2012, allows access to JABAWS 2, which was released in December 2011 and includes services for AACons, a fast implementation of several multiple alignment conservation methods; Clustal Omega, the latest in the Clustal family of multiple alignment programs; and the protein disorder predictors DISEMBL, IUPred, GlobPlot and JRONN, a new open source version of RONN.
JABAWS 2 is available for local download or as an Amazon Machine Image deployed on EC2, and supports local and cluster-based job execution. Services may be run from the command line or through the client integrated in Jalview. Jalview 2.8 also introduces support for RNA secondary structure annotation and base-pair conservation, and integrates VARNA (varna.lri.fr), an RNA secondary structure viewer and editor. The Desktop also allows access to RFAM, and employs JDAS to access Distributed Annotation System servers. Jalview and JABAWS have been developed with support from the UK’s Biotechnology and Biological Sciences Research Council (BBSRC), with additional contributions from students supported by the Google Summer of Code. Jalview is available under the GPL at http://www.jalview.org, and JABAWS is available under the Apache License and can be downloaded complete with all third-party software at http://www.compbio.dundee.ac.uk/jabaws |
Procter J*, Troshin P, Lui L, Engelhardt J, Ponty Y, Barton G
*University of Dundee United Kingdom |
L - Technology and Software |
|
L 05
L5 |
Research institutes cooperate and perform Collaborative Animal Trials (CAT) in order to save costs and material and to apply methods from systems biology. In a CAT, experiments are performed by separate working groups with samples from shared animals. Measurements comprise quantitative and qualitative information on genes, transcripts, proteins, metabolites, cells, bacteria, viruses etc., or physiological data of the animals. Sharing data files between working groups and applying mathematical approaches require a central data file management system.
Several commercial systems provide web-based file storage. These closed systems provide dynamic user administration functionalities but fail to reproduce the structure of a CAT. In contrast, specialized open source systems exist. Most of these systems provide structured data storage but have only limited user administration functionalities or are not connected with an active open source community. Until now, no system has existed that is part of a large open source project and combines convenient functionalities with structured data storage. To fill this gap, we provide CATRIA, a Drupal module for central file management in a CAT. Drupal is a common open source content management platform. The Drupal core installation includes user management functionalities and a graphical user interface. CATRIA expands the Drupal core with functionalities to store the structure and data sets of a CAT, functionalities to handle data sharing between users, and user interfaces for large-scale data administration. |
Twardziok S*, Kleffe J, Wrede P
*Institut für Molekularbiologie und Bioinformatik, Charite-Universitätsmedizin Berlin Germany |
L - Technology and Software |
|
L 06
L6 |
Background
As the “omics” revolution unfolds, the growth in data quantity and diversity is bringing about the need for pioneering bioinformatics software, capable of significantly improving the research workflow. To cope with these computer science demands, biomedical software engineers are adopting emerging semantic web technologies that better suit the life sciences domain. The latter’s complex relationships are easily mapped into semantic web graphs, enabling a superior understanding of collected knowledge. Despite increased awareness of semantic web technologies in bioinformatics, their use is still limited.
Results
COEUS is a new semantic web framework, aiming at a streamlined application development cycle and following a “semantic web in a box” approach. The framework provides a single package including advanced data integration and triplification tools, base ontologies, a web-oriented engine and a flexible exploration API. Resources can be integrated from heterogeneous XML, CSV, SQL or SPARQL data sources and mapped directly to one or more ontology predicates. Advanced interoperability features include REST services, a SPARQL endpoint and LinkedData publication. These enable the creation of multiple applications for web, desktop or mobile environments.
Conclusions
The platform, targeted at biomedical application developers, provides a complete application skeleton ready for rapid application deployment, enhancing the creation of new semantic information systems focused on niche fields. COEUS is available as open source at http://bioinformatics.ua.pt/coeus/. |
Lopes P*, Oliveira JL
*DETI/IEETA, University of Aveiro Portugal |
L - Technology and Software |
|
L 07
L7 |
The popular "Burrows Wheeler Aligner" (BWA, Li 2009) has been our mapper of choice for Illumina data. Using it on ever growing data sets, especially with more permissive parameters, made it impractical to run it on a single computer. Distributing BWA over a cluster would have required unappealing amounts of scripting.
Instead, we modified BWA to include a streamlined, network-aware workflow. This networked mode can dynamically add and remove compute nodes from a running job. It therefore handles disappearing nodes (e.g. crashed machines) without further intervention. While other approaches to distributed mapping (e.g. pBWA, http://pbwa.sf.net) typically require a dedicated compute cluster and will not survive the failure of a single node, the increased flexibility of networked BWA makes it practical to use "off-peak" computational resources, such as idle workstations or "spot" instances in a compute cloud. For our internal sequence data processing, which consumes and produces BAM files, we have replaced regular BWA with this new version. This allows us to complete within days tasks that were formerly hampered by long run times, intermittent failures and high memory consumption. Since scripts to split the work and keep track of the intermediate files are no longer needed, the integration of the mapper was simplified considerably. Our communication mechanism is lightweight and can readily be adapted to other compute-intensive tasks commonly encountered in the analysis of high-throughput sequencing data. Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics, 25:1754-60. [PMID: 19451168] |
Stenzel U*
*Max Planck Institute for Evolutionary Anthropology Germany |
L - Technology and Software |
|
L 08
L8 |
Sequencing equipment based on the Sanger method is currently the most widely used in clinical applications. We developed a new method of sequence analysis, the Base Calling with Vocabulary (BCV) software package (http://basecv.sourceforge.net/), which is intended for the analysis of direct (population) sequencing chromatograms using known vocabulary sequences similar to the target DNA. The BCV pipeline has the following functionalities: base calling - determining the sequence of IUPAC codes for a chromatogram; indel detection - detecting insertions or deletions (indels) in components of the complex DNA sample; mixture deconvolution - determining the sequences of the sample DNA components. BCV performs base calling by looking for the best grouping of peaks into positions (corresponding to IUPAC symbols) and filtering out artifacts. Next, the program makes a greedy deconvolution of the obtained chromatogram sequences, aligning sequences from the vocabulary with a profile of true peak amplitudes and decreasing the amplitudes of the peaks recruited into the alignment. The obtained sequences are then joined into clusters. The clustering strategy depends on the use case. If mixture deconvolution is used, the predicted sequences are clustered into the most divergent subgroups. If the indel detection function is selected, the algorithm collects into a single main group those sequences that can be aligned relative to each other without indels and that together explain a sufficient portion of the peak amplitudes. The software detects indels relative to the consensus sequence of the main group. Indel analysis may be performed on all chromatograms obtained simultaneously for a single sample. |
Neverov A*, Fantin Y, Favorov A, Mironov A, Chulanov V
*Lomonosov Moscow State University Russian Federation |
L - Technology and Software |
|
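The base-calling output described in the BCV abstract above (L 08) is a sequence of IUPAC codes. As a minimal sketch of that representation only, the following maps the set of bases detected at each chromatogram position to the standard IUPAC nucleotide code. The function names `iupac_code` and `call_sequence` and the input format (one string of detected bases per position) are hypothetical and are not taken from BCV; peak grouping and artifact filtering are not modeled here.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases present.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def iupac_code(bases):
    """Return the IUPAC code for the bases detected at one position.

    `bases` is a string such as "AG" (hypothetical input format).
    """
    key = frozenset(bases.upper())
    if key not in IUPAC:
        raise ValueError(f"no IUPAC code for bases {bases!r}")
    return IUPAC[key]


def call_sequence(positions):
    """Join per-position IUPAC codes into one ambiguous sequence."""
    return "".join(iupac_code(p) for p in positions)
```

For a mixed sample in which one position shows both C and T peaks, `call_sequence(["A", "CT", "G"])` would give `"AYG"`.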
L 09
L9 |
We have developed a pragmatic analysis method for the interpretation of a large, multi-dimensional, genomic data set derived from a bovine F2 cross-breed pedigree. The PleioTrapper set of scripts integrates physiological phenotypes, gene expression data, and high density SNP-chip genotypes to automate identification of candidate causative genes and trait pleiotropy.
The dataset consists of 1625 Friesian-Jersey cross-breed animals with >650k genotypes imputed from the Illumina BovineHD 777k platform. The 864 F2 cows of the pedigree have been assessed for a vast array of phenotypes, including measures of milk production and composition, disease, growth, behavior, responses to hormone challenges, infra-red and mass spectrometry analyses of blood serum and milk, and many other physiological characteristics. Genome-wide expression analysis has also been conducted on a subset of the F2 animals, utilizing RNA from fat (n=359) and liver (n=470) biopsies. We have used this phenotypic dataset, containing more than 100,000 raw and statistically-derived traits, to assemble a searchable, pre-computed database of phenotypic and genotypic associations. Analyses include pairwise phenotype correlations, linkage QTL, and genotype associations using single-SNP and Bayesian GWAS methods. This database can be queried by traits, genes, or sequence variants to return a ranked list of co-associated variables. We are currently sequencing several hundred F1 parent genomes, with the intention to recompute associations using the complete catalogue of variation in this population. Using these methods, we have discovered many novel associations. We will present our approach, and give examples of the power of these tools to identify new and unexpected findings. |
Obolonkin V*, Snell R, Littlejohn M
*LIC Ltd. New Zealand |
L - Technology and Software |
|
L 10
L10 |
Comparative approaches are fundamental in biology and medicine; many of the principal achievements in both fields originated from studies conducted on model organisms. Homology relations are extremely important to successfully generalize experimental observations and verify relationships among taxa. High-throughput and in particular next-generation sequencing techniques produce massive amounts of data and are applied to many different organisms. The need to integrate data, inferring novel insights by comparing different experiments, often involves the identification of orthologs between species and the conversion of identifiers between various database formats.
HOMECAT (HOmology Mapper for Enrichment and Comparative Analysis with Translation) is a Cytoscape plugin for comparative investigations based on homology. Starting from a list of identifiers or a network, the plugin searches for the best consensus orthologs and paralogs from the principal homology databases and conveniently represents them as metanodes, where homologs are satellites of the input nodes. Data provided by the user or collected from various sources, like ArrayExpress or GEO, can be mapped on collapsed metanodes and visualized as pie charts, allowing visual inspection of similarities. The plugin is interfaced with EBI PICR and BridgeDB; it seamlessly converts user-provided identifiers to query homology databases and converts the results back to the user's format. Data integration is also facilitated by the possibility for the user to convert identifiers prior to the mapping directly from the plugin. A case study is presented to illustrate the plugin functionalities. |
Zorzan S*, Laudanna C, Lorenzetto E, Ettorre M, Bolognini S, Pontelli V, Buffelli M
*The Center for Biomedical Computing, University of Verona Italy |
L - Technology and Software |
|
L 11
L11 |
Proteins are the building blocks of cells and the executors of nearly all cellular functions. Their structure is of paramount importance to understand their dynamics and function, as well as their interactions with other molecules. In this work, we apply Generalized Simulated Annealing (GSA) to the prediction of protein structures in a new software package. GSA is a stochastic search algorithm employed in energy minimization and used in global optimization problems, such as gravity models, fitting of numerical data and conformation optimization of small molecules. Our software applies the analytical inverse of the probability distribution from GSA, a new method to apply rotations to the phi and psi angles of the peptide bonds and side chains, a faster connection with NAMD for potential energy calculation and the possibility of parallel execution, granting a new take on ab initio protein structure prediction. The new design also allows for an easier inclusion of knowledge-derived potentials, based on experimentally determined protein structures. We present results for the 14 amino acid protein mastoparan-X. The chain folds with an RMSD of 3.0 angstroms after 500,000 GSA steps. Currently, for this system, the software calculates 5 million GSA steps in under 6 hours using 4 processors. Predicted structures can be refined with molecular dynamics simulations and used to study proteins whose conformation cannot be determined with experimental methods. |
Melo M*, Bernardi R, Pascutti P
*Laboratório de Modelagem e Dinâmica Molecular, Universidade Federal do Rio de Janeiro Brazil |
L - Technology and Software |
|
L 12
L12 |
Correct interpretation of many biological experiments is currently based on the consistency of biomolecular annotation databases. Such databases are very useful for the scientific community but, unfortunately, incomplete by definition. To improve their consistency, computational methods able to supply ranked lists of predicted annotations are hence extremely useful.
We started from previous work on the automatic prediction of Gene Ontology (GO) annotations based on the truncated Singular Value Decomposition (SVD) of the annotation matrix, where every matrix element corresponds to the association of a biomolecular entity with a GO term. We then developed a new method in which the truncation choice is based on analysis of the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curves for different truncations. To evaluate our method, we used annotations of genes from different organisms available in July 2009 in an old version of the GO Annotation databases. In the analysis of Gallus gallus annotations between genes and Biological Process terms, the best truncation parameter suggested by the algorithm led to better results than other truncation levels: from all the input annotations, the SVD method with the best truncation predicted the highest number of annotations whose presence was confirmed in a more recent GO database version (October 2011). In contrast, other truncation levels, related to higher AUC values, led to worse prediction results. Our SVD best-truncation choice method thus proved very effective and reliable for obtaining more correct biomolecular annotation predictions. Furthermore, since our approach is not limited to specific annotation types, it can be applied to any controlled annotation. |
Chicco D*, Masseroli M
*Politecnico di Milano Italy |
L - Technology and Software |
|
L 13
L13 |
Interactions of RNA with other molecules play crucial roles in many biological processes. Mono- and divalent cations drive proper folding of RNA and stabilize its secondary and tertiary structures. Ions also act as essential cofactors in many reactions catalyzed by RNAs. The hammerhead ribozyme, group I and group II introns as well as ribonuclease P ribozymes are examples of RNA that need ions to cleave phosphodiester bonds. Moreover, many RNA molecules are essential elements of cellular or viral physiology, and they serve as targets for small-molecule ligands that may act as drugs. For example, the majority of antibiotics target ribosomal RNA.
We have developed novel bioinformatics tools for predicting RNA interactions with metal ions and small molecules. MetalionRNA determines sites around a user-specified RNA 3D structure where cations are most likely to bind. LigandRNA scores and ranks user-defined RNA-ligand complexes. Both methods employ a grid-based algorithm and a knowledge-based potential derived from metal ion and ligand binding sites in experimentally solved PDB structures. Our methods can be used to assist X-ray crystallographic structure determination, or can be used in a fully predictive mode to identify or to design drugs targeting RNA (e.g. novel antibiotics targeting bacterial ribosomes or novel inhibitors affecting riboswitch structure and function). MetalionRNA can also be used to propose metal positions for structural models that typically lack coordinates of ions, e.g. RNA structures determined by NMR or theoretical models. The MetalionRNA and LigandRNA programs are available free of charge as web servers at http://genesilico.pl/ |
Philips A*, Milanowska K, Lach G, Boniecki M, Bujnicki JM
*Adam Mickiewicz University Poland |
L - Technology and Software |
|
L 14
L14 |
Molecular Modeling and visualization are crucial components of application scenarios in structural bioinformatics. Even more, molecular visualization is a cornerstone of every subject dealing with molecular structures, such as structural biology, material sciences or condensed matter physics.
Typically, the graphical interfaces to molecular visualization and modeling software tools are very domain-specific and require an extensive learning period. Often, in scenarios such as teaching, presentations and demonstrations, it would be highly preferable to have an intuitive but flexible environment for showcasing molecular models or actions accompanied by supplementary information. Here, we present the PresentaBALL framework, which uses established web technology standards to provide a freely configurable interface to the extensive modeling and visualization capabilities of BALLView. BALLView is the graphical front-end to the Biochemical Algorithms Library (BALL), a versatile C++ class library, which provides a rich set of data structures and algorithms for structural bioinformatics. PresentaBALL is a novel WebKit-based extension embedded into the GUI of BALLView, and allows for easy creation of lessons or protocols using HTML, as well as additional preparation and connection of molecular showcases. Moreover, by providing complete access to the Python scripting interface of BALL via HTML, PresentaBALL offers the possibility to interact directly with the molecular scene. With PresentaBALL, users are able to easily set up sophisticated scientific presentations, as well as elaborate academic tutorials, even allowing for interaction and progress monitoring. |
Nickels S*, Mueller SC, Stöckel D, Dehof AK, Lenhof H, Hildebrandt A
*Intel Visual Computing Institute, Saarland University Germany |
L - Technology and Software |
|
L 15
L15 |
Development of high-throughput techniques has resulted in large amounts of experimental data of which the storage and analysis has become a key bioinformatics challenge. This evolution presents many opportunities, but the data volume makes the analysis of biological datasets increasingly complex. Patterns hidden within these datasets can only be detected by using specialized pattern extraction techniques. Powerful pattern mining methods, such as frequent itemset mining, are still underutilized in common life science problems. This can be attributed to shortcomings of the currently existing implementations that often suffer from 1) a lack of user-friendliness; 2) their “black box” mode of operation; 3) complicated interpretability of the results and 4) difficulties in post-processing of the results. The MIME (Making Itemset Mining Easy) framework was developed to address several of these issues. It provides an intuitive graphical interface linked to a framework that incorporates various algorithms for rapid pattern detection and post-processing. MIME offers various quality measures that are often used in a data mining context and allows for iterative mining. Results can be exported to text files that are easy to edit. This tool exhibits various features that drastically facilitate motif discovery in various bioinformatics challenges. We demonstrate the power of this approach through a case study on the discovery of structural motifs related to post-translational modifications. |
Naulaerts S*, Moens S, Goethals B, Laukens K
*University of Antwerp Belgium |
L - Technology and Software |
|
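The MIME abstract above (L 15) refers to frequent itemset mining without describing an algorithm. As a minimal illustration of the general technique, the sketch below implements a plain level-wise (Apriori-style) search in pure Python. It is not MIME's implementation; the function name `frequent_itemsets` and the representation of transactions as sets of labels are assumptions made for this example.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for itemsets found in >= min_support transactions.

    Level-wise search: frequent k-itemsets are joined into (k+1)-candidates,
    and a candidate is kept only if all of its k-subsets are frequent.
    """
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    result = {}
    while current:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)
        keys = list(frequent)
        if not keys:
            break
        k = len(keys[0]) + 1
        # Join step: unions of two frequent itemsets that have exactly k items.
        candidates = {a | b for a, b in combinations(keys, 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = [
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        ]
    return result
```

In a motif-discovery setting, each transaction could be the set of features observed around one modification site, so that frequently co-occurring features surface as itemsets; that encoding is a hypothetical illustration rather than the one used in the case study.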
L 16
L16 |
Finding where transcription factors (TFs) bind to the DNA is of key importance to decipher gene regulation at a transcriptional level. Classically, computational prediction of TF binding sites (TFBSs) is based on position weight matrices (PWMs) reflecting the preferred binding motifs associated to corresponding TFs. Such models make the strong assumption that each nucleotide within a TFBS participates independently in the corresponding DNA-protein interaction and do not allow flexible length motifs (e.g. variable spacing between half-sites). We propose to use hidden Markov models (HMMs) to predict TFBSs. HMMs are flexible and can model position interdependence within TFBSs as well as variable length motifs. The availability of thousands of experimentally validated DNA-TF interaction sequences coming from ChIP-sequencing allows us to construct and train HMMs to reflect the TFBS properties observed in experimental data. We developed a new graphical representation of the modelled motifs to convey properties of position interdependence. HMMs have been assessed on human and mouse ChIP-seq data sets coming from the ENCODE project, revealing that the new HMM-based method performs better than PWMs and dinucleotide PWMs in discriminating motifs within ChIP-seq sequences from background sequences. We hypothesized that ChIP-seq signal values are correlated with the affinity of the TF to bind to the DNA and found that HMM scores correlate with ChIP-seq peak signals. Moreover, we have shown that HMM scoring better correlates than PWMs with published DNA-binding affinities of the extensively characterized TF Max. These results demonstrate the capacity of a new HMM-based approach to model DNA-protein interactions. |
Mathelier A*, Wasserman W
*Centre for Molecular Medicine and Therapeutics - University of British Columbia Canada |
L - Technology and Software |
|
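The independence assumption of position weight matrices criticized in the abstract above (L 16) is visible directly in how a PWM scores a site: the score is a sum of per-position log-odds terms, so no term can depend on a neighbouring base. The sketch below shows this baseline only, not the authors' HMM method. The function names, the pseudocount and the uniform background frequency are assumptions chosen for illustration.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM (list of per-position dicts) from aligned sites."""
    length = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def score(pwm, seq):
    """Score one site: each position contributes independently to the sum."""
    return sum(column[base] for column, base in zip(pwm, seq))


def best_hit(pwm, sequence):
    """Return (score, start) of the best-scoring window in `sequence`."""
    width = len(pwm)
    return max(
        (score(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )
```

Because every column is scored separately, a model of this kind cannot represent a preference such as "G at position 5 only when C is at position 4", nor variable spacing between half-sites; these are the limitations the abstract proposes to address with HMMs.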
L 17
L17 |
Even though biological databases have addressed the problem of providing consistent and formal descriptions of the functions and roles of genes and proteins, quantifying the functional similarity between genes or proteins using these data is still a challenge. Several semantic similarity measures for Gene Ontology annotations have been proposed in recent years; however, for many of these measures an implementation is still lacking, or the available ones are not flexible enough to be easily integrated into bioinformatics analysis pipelines.
In this work we introduce FastSemSim, a new software package enabling fast and easy evaluation of semantic similarity measures on GO annotations. The software was designed to address the open issues of the existing tools: it allows the use of the most up-to-date or custom annotation corpora and ontologies, and it scales well when millions of similarity scores have to be evaluated. Moreover, the software is both a helpful Python framework that can be used to easily implement new semantic similarity measures and a tool with a user-friendly graphical interface. Currently the software provides 12 different semantic similarity measures, including an enhanced version of the Resnik measure combined with the max-mixing strategy (Guzzi et al., 2011), especially designed to speed up the calculation of similarities. In different case studies, FastSemSim proved to enable the systematic evaluation of the semantic measures, meeting the requirements of handling huge amounts of data in order to support genome/proteome-wide analyses. Extensive documentation and source code are available at https://sourceforge.net/p/fastsemsim/. |
Mina M*, Sanavia T
*University of Padova Italy |
L - Technology and Software |
|
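The abstract above (L 17) mentions the Resnik measure combined with a max mixing strategy. As a rough sketch of the textbook form of that measure on a toy ontology, the code below computes information content from annotation frequencies, takes the Resnik similarity of two terms as the information content of their most informative common ancestor, and scores a gene pair by the maximum over all term pairs. This is not FastSemSim's API, and the enhanced, faster variant mentioned in the abstract is not reproduced; all function names and the dictionary-based ontology format are assumptions.

```python
import math


def ancestors(term, parents):
    """Return `term` together with all of its ancestors in the ontology DAG."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(annotations, parents):
    """IC(t) = -log(fraction of genes annotated to t or to a descendant of t)."""
    genes_per_term = {}
    for gene, terms in annotations.items():
        for term in terms:
            for anc in ancestors(term, parents):
                genes_per_term.setdefault(anc, set()).add(gene)
    n_genes = len(annotations)
    return {t: -math.log(len(g) / n_genes) for t, g in genes_per_term.items()}


def resnik(t1, t2, parents, ic):
    """Resnik similarity: IC of the most informative common ancestor."""
    common = ancestors(t1, parents) & ancestors(t2, parents)
    return max((ic.get(t, 0.0) for t in common), default=0.0)


def gene_similarity_max(g1, g2, annotations, parents, ic):
    """Max mixing strategy: best Resnik score over all pairs of annotated terms."""
    return max(
        resnik(a, b, parents, ic)
        for a in annotations[g1]
        for b in annotations[g2]
    )
```

Precomputing the ancestor sets and information content once per ontology, rather than per gene pair, is the obvious first optimization when millions of scores are needed; whether FastSemSim does this is not stated in the abstract.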
L 18
L18 |
The functions of RNAs, particularly of the non-coding variety, are strongly informed by their secondary structure. As such, particular attention has been paid to RNA structure prediction, and successively faster and more accurate prediction software has emerged in the last three decades. A primary class of such software approaches the problem as an optimization of the structure based on minimum free energy.
One of the more difficult aspects of RNA structure prediction involves a structural motif termed the 'pseudoknot', where unpaired bases within a loop pair with bases outside said loop. For single-sequence minimum free energy methods, the general problem of prediction with pseudoknots is known to be NP-hard. In this study, we report the results of applying stochastic swarm-based optimization via Lévy flights to the problem. This approach allows us to treat the problem as black-box optimization, thus eschewing many of the assumptions necessary in structure prediction in the presence of pseudoknots. Furthermore, the nature of swarm-based algorithms allows a high degree of parallelization on newer hardware such as general-purpose GPUs, providing significant speedup. We present results on the accuracy and speed of the approach in relation to earlier work on pseudoknot prediction based on probabilistic models (DotKnot), integer programming (IPKnot), and domain heuristics (HotKnots), as well as the classical Rivas-Eddy algorithm. |
Acar A*
*Middle East Technical University Turkey |
L - Technology and Software |
|
L 19
L19 |
Comprehensive analysis of a protein of interest involves aggregation of numerous annotated protein features. To also check for interactions between features, visualization is mandatory.
A wide range of public databases exists with experimentally annotated protein features, like post translational modifications, natural variations or structural features. There are even more bioinformatic tools available predicting protein features, like transmembrane topology, sequence motifs or binding sites. Usually, each database or prediction tool comes with a unique web-based query interface. It is therefore quite laborious to gather annotated and predicted features to achieve a comprehensive overview for a protein of interest. Protein centered databases - like UniProt - partly address this issue by combining and re-annotating protein features from primary literature as well as from other databases. However, the variety of available sources and the possibility for visual feature inspection is very limited. Therefore we present Protter, an open-source web-based tool we developed to visualize a protein's sequence, topology and a variety of feature annotations. It readily integrates with results from bottom up proteomics experiments, where peptide identifications are mapped to proteins in sequence databases. Moreover, computational, experimental and manual annotations of protein features can be added and visualized in publication quality graphics. |
Omasits U*, Ahrens C, Wollscheid B
*ETH Zurich Switzerland |
L - Technology and Software |
|
L 20
L20 |
Correct identification of causative genes for an important agronomic trait can be very valuable for effective marker assisted breeding. Forward genetic approaches such as linkage analysis and association mapping can determine genomic regions (QTL) associated with a phenotype. However, even well-defined QTL often span genomic regions that contain dozens to hundreds of genes. Evaluation of potential functional candidacy of positional candidate genes is often time-consuming and requires the integration of multi-source data such as functional annotations, biochemical pathways, gene expression data, comparative information from related organisms, gene knock-out and over-expression and the scientific literature. For organisms with a sequenced genome, we can use the free Ondex framework (www.ondex.org) to integrate data gathered from multiple databases into a labelled and directed multi-graph. Subsequently, data mining tools are required to mine integrated Ondex knowledgebases in a systematic and automated manner for trait related candidate genes.
We have developed QTLNetMiner, a web application that queries an integrated Ondex knowledgebase with a given term (e.g. early flowering, disease resistance) and shows associated candidate genes and QTL on chromosome and table views. The relevance of a gene to a particular query is weighted using information retrieval and network inference methods. The supporting evidence networks for selected candidate genes are visualized in the Ondex Web Java-applet. QTLNetMiner is designed in a generic way and can be installed for any organism with a sequenced genome and an integrated Ondex knowledgebase. We are currently developing QTLNetMiner instances for Arabidopsis, poplar, pig, cattle and chicken, see http://ondex.rothamsted.ac.uk/QTLNetMinerPoplar/. |
Hassani-Pak K, Zorc M*, Taubert J, Rawlings C
*Rothamsted Research United Kingdom |
L - Technology and Software |
|
L 21
L21 |
One of the most challenging tasks in macromolecular crystallography (MX) is the determination of the 3D structure of proteins for which crystals do not diffract to high resolution. Computational approaches for automatic model building in MX have traditionally been focused on high-resolution data. Thus their application to data at resolutions worse than ~2.5 Å is limited and typically results in incomplete and highly fragmented models. Therefore, robust methods are urgently needed for automated determination of low-resolution MX structures to high levels of completeness and accuracy.
FittOFF (Fitting OF Fragments), utilising the experience accumulated within the ARP/wARP [1] and OpenStructure [2] projects, combines a knowledge-based backbone fragment library with the pattern matching capabilities of ARP/wARP to identify and complete structural gaps in partially built models. In contrast to loop-building approaches commonly used in MX, FittOFF does not require the built fragments to be sequence-assigned. Gap identification is achieved by docking built fragments to a secondary structure predicted from the amino-acid sequence. Identified gaps are filtered for false-positives using another knowledge-based approach that relates the number of residues missing in a gap to the distances between the anchoring residues. The application of FittOFF to ARP/wARP model building was tested on a set of ten structures containing up to 300 residues and solved at resolutions between 3.0 Å and 3.8 Å. We observed a noticeable increase in model completeness and doubling of the average fragment length. [1] G. Langer, et al., Nature Protocols 2008, 3, 1171-1179. [2] M.Biasini, et al., Bioinformatics 2010, 26, 2626-2628. |
Wiegels T, Biasini M*, Schwede T, Lamzin V
*Swiss Institute for Bioinformatics, Biozentrum, Universitaet Basel Switzerland |
L - Technology and Software |
|
L 22
L22 |
The availability of complete genome sequences of many organisms, coupled with advances in biotechnology and IT, has resulted in an exponential growth of available biomedical data. These data are heterogeneous, both clinical (phenotypic, behavioral, environmental) and genomic (differential expression, interactions, …). Today, understanding biological systems and exploiting these high-throughput data is a big challenge. It is therefore crucial to develop methods that take advantage of this amount of data, integrate them and compare them to prioritize the best candidates associated with biological processes. For this reason we developed a new gene prioritization tool, called GEPeTTO (GEne PrioriTization TOol). It was designed for application to age-related macular degeneration, but it is fully adaptable to other pathologies or biological processes. It consists in selecting known genes and comparing candidate genes to them, in order to score and rank the candidates from the most similar to the known genes to the least similar. Such ranking methods use criteria like interactome, transcriptome, sequence or text-mining, but GEPeTTO is an innovative approach using new evaluation criteria like genomic context or evolutionary data. |
Walter V*, Nguyen HN
*IGBMC - Laboratoire de Bioinformatique et de Génomique Intégrative France |
L - Technology and Software |
|
L 23
L23 |
The study of genomic variation is driving the need to find new ways of structuring and representing genomic variation data. Commonly, variation data are structured into SQL table relationships on disk. This presents a problem since the sequence interval query performance of SQL databases declines dramatically as the number of variants increases. Since even modest scale projects can feasibly produce one billion variant features, we developed the RQFS, which is a simple high performance file system strategy that scales without performance degradation.
The RQFS is a simple file system tree containing fixed size "bins" which represent nucleotide positions along the length of a reference chromosome. Variation data is written as extended attributes on files according to start position. For example the query for Chr1:1000000..1100000 requires no database server, index scanning or formal schema and can be read directly (in parallel if necessary) from the file system tree. We have created and populated RQFS trees for Arabidopsis and human genomes. Our results show that the RQFS outperforms an SQL database by 8 to 20 fold using a 256 million feature simulated dataset for Arabidopsis using the same storage media. The RQFS can scale to one billion variant features in Arabidopsis without performance degradation and can send 100000 features per second from storage media to the network wire. The RQFS facilitates the development of diverse web services such as a dynamic variant feature track generator for the genome browser JBrowse. |
Karcz S*, Links M, Parkin I
*Agriculture and AgriFood Canada - Science and Technology Branch Canada |
L - Technology and Software |
|
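The fixed-size bin layout described in the RQFS abstract (L 23) can be sketched as a mapping from an interval query to file system paths. This is an illustration under stated assumptions, not the authors' implementation: the bin width, the directory layout (`root/chromosome/bin_start`) and all function names are hypothetical, and the sketch only computes paths; it does not read or write extended attributes.

```python
from pathlib import PurePosixPath

BIN_SIZE = 100_000  # hypothetical bin width in bp; the abstract does not state one


def bin_start(position, bin_size=BIN_SIZE):
    """Return the first 1-based nucleotide position of the bin containing `position`."""
    if position < 1:
        raise ValueError("positions are 1-based")
    return ((position - 1) // bin_size) * bin_size + 1


def bins_for_interval(root, chrom, start, end, bin_size=BIN_SIZE):
    """List the bin paths (assumed layout) that a query chrom:start..end must read."""
    if end < start:
        raise ValueError("end must not be smaller than start")
    first = bin_start(start, bin_size)
    last = bin_start(end, bin_size)
    return [
        PurePosixPath(root) / chrom / str(b)
        for b in range(first, last + 1, bin_size)
    ]
```

With these assumed settings, the query Chr1:1000000..1100000 from the abstract resolves to two bins, which could be read independently and in parallel. Because features are filed by start position, a real query might also need to inspect earlier bins for features that begin before the interval and extend into it; the abstract does not say how the RQFS handles that case.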
L 24
L24 |
Exchange of data between various frameworks, or between successive tools in a workflow, benefits from using common formats. Textual tab-separated formats are convenient when used in text-processing frameworks, while RDF is advantageous within the Semantic Web. In parallel, XML Schema-based formats have the advantage of enabling both textual and binary representation, and are convenient to use in object-oriented frameworks or with Web services.
BioXSD has been developed to fill the niche of a missing canonical XML Schema-based exchange format for basic bioinformatics data: sequences, alignments, features, references to resources, and identifiers. The canonical format can either be consumed and produced by tools directly, or serve as an intermediate format that other formats can be losslessly serialized to and from. The current version 1.1 of BioXSD includes improvements in optimizing the volume of data, and in allowing highly flexible annotation -- including for example feature semantics, complex scoring, and provenance metadata. In addition, the semantic annotation of the BioXSD Schema itself has been improved. It currently includes annotations with EDAM, Dublin Core, and RDFS, supporting semantic reasoning when integrating data and translating between formats. Translation between BioXSD and other formats is indeed the main focus of our future work on BioXSD, together with improving the representation of variation data and genome-scale alignments, and engaging with the community for efficient feedback and implementations. |
Kalas M*, Puntervoll P, Karosiene E, Ison J, Blanchet C, Rapacki K, Jonassen I
*Uni Computing, University of Bergen Norway |
L - Technology and Software |
|
L 25
L25 |
Most genomics datasets can be viewed as features along the genome (genome tracks). Our aim is to provide a user interaction layer on top of a dynamic genomics data viewer running inside the web browser: the Genome Data Viewer (GDV). It was built to be faster and more intuitive to browse. Moreover, it incorporates a set of tools to explore and analyze your tracks. |
Jarosz Y*, Sinclair L, David FP, Leleu M, Rougemont J
*EPFL BBCF Switzerland |
L - Technology and Software |
|
L 26
L26 |
The rapid increase in genomics data generation, especially with the expansion of High Throughput Sequencing (HTS) technologies, requires biologists and bioinformaticians to find solutions for efficiently managing and storing the resulting volume of data.
BioRepo is a Biological data Repository which addresses those needs. Similarly to a Laboratory Information Management System (LIMS), its main goals are to store, to manage and to share data among collaborators, but also to allow their direct visualisation in genome browsers. |
Mouscaz Y*, Kapopoulou A, Leleu M, Jarosz Y, Rougemont J, Duboule D, Trono D
*EPFL - SV - GHI - LVG Switzerland |
L - Technology and Software |
|
L 27
L27 |
The CBS-KNAW Fungal Biodiversity Centre in the Netherlands produces approximately 50,000 DNA sequences annually for the classification and identification of fungal species. In order to manage such a large amount of sequencing data, there is a need for robust tools to automatically generate and keep track of the whole experimental procedure. To this end, we have developed a laboratory information management system. The system has improved the quality of the sequence data by preventing the users from making mistakes. User actions and samples are traceable. As a result, it has sped up the whole sequencing workflow and allowed us to identify species more efficiently. |
Vu D*, Robert V
*CBS-KNAW Fungal Biodiversity Centre Netherlands |
L - Technology and Software |
|
L 28
L28 |
In recent years, the unprecedented throughput, experimental complexity, and rapid change in the proteomics field have created unique challenges for a Laboratory Information Management System (LIMS). It is now fundamental for a researcher to be able to track and store original raw data, processed files, and experimental information, and to retrieve them at any time. For these reasons, a properly defined workflow is essential, not only because it facilitates this massive data handling, but also because it helps to meet the quality assurance requirements demanded for data publication.
ProteoWiki is a Semantic MediaWiki implementation for the management of a proteomics service laboratory. Users of the service can enter information about a sample and the desired analysis by using a semantic-enabled form built on top of a wiki page. After an on-line request is submitted, a workflow is created, and different experimental tasks are assigned to the lab operators. Users and operators, according to their profile and granted permissions, can track the state of the requests and the associated experiments at any time. The final output is a report that summarizes all the steps and parameters used in analyzing the proteomics samples, while keeping track of the original files. User management, sample tracking, task assignment and lab scheduling are all automated processes in ProteoWiki, thus greatly improving laboratory efficiency, increasing throughput and ensuring data traceability for the final user. |
Mancuso FM*, Hermoso Pulido A, Roma G, Molina H, Lowy E, Sabido E
*Centro de Regulación Genómica (CRG) Spain |
L - Technology and Software |
|
L 30
L30 |
Next Generation Sequencing technology has become a standard method for investigating a wide spectrum of biological mechanisms. Deep sequencing studies have broadened our view of processes like transcriptional regulation, DNA damage repair or splicing and provide deeper insight into cellular events like differentiation or tumourigenesis. However, the analysis of this massive amount of data is a challenging task. Many studies investigate different cellular or experimental conditions, requiring the comparative analysis of two or more samples. If we aim to identify genomic regions that differ between two conditions and to interpret the results in a biologically meaningful way, we need a precise computational interpretation of genome-wide NGS profiles. Up to now, only a few tools have been developed for the direct, high-resolution comparison of NGS profiles coming from techniques like ChIP-Seq or RNA-Seq.
To overcome this problem, we introduce a new method. In comparison to other approaches, it allows for a more precise positional detection of differences between NGS profiles. It distinguishes two kinds of differences: a) whether a peak is missing in one of the profiles and b) whether existing peaks in a genomic region have altered heights in different conditions. Finally, our approach can make use of replicates, as well as control data. Taken together, our method gives a more detailed view of profile differences and enables a more precise prediction of genomic regions important for biological interpretation, as well as for further experimental analysis. |
Klein C*, Meiler A, Habermann B
*MPI for Biology of Ageing Germany |
L - Technology and Software |
|
L 31
L31 |
BiocaNET is a bioinformatics platform composed of six nodes and a central node (Costa Rica), whose main mission is to provide support and development services in bioinformatics in the different areas of genomics, proteomics and molecular medicine in Central America. To this end, it also provides a technological infrastructure based on the provision of bioinformatic methods, bioinformatic systems and workflows in computational biology.
The principal effort of BiocaNET has been dedicated to building a solid platform for developing algorithms, methods, services and bioinformatics procedures, with the objective of supporting omics science projects that involve studies of biodiversity and molecular clinical medicine in the region. BiocaNET has developed this infrastructure based on web services, including one supercomputing cluster (HPC “Nelly” repository with Rock Mamba server Linux 6.0) for the analysis, storage and processing of biological data. On the tools and technology side, the platform maintains a set of common, standard software tools, including: 1) GALAXY cloud (NGS data), 2) EMBOSS, 3) Bioconductor and 4) TAVERNA server (workflows). Since 2012, BiocaNET, with the associated participation of the Government of Costa Rica, has been designing and implementing technological programs on bioinformatics in genomic medicine. Each subject in the program offers a technological solution developed by BiocaNET, exemplified with case studies (molecular oncology and pharmacogenomics) within Costa Rica’s hospital sector and elsewhere. This wide technological offering makes it possible, within its complementary training activities, to provide education on bioinformatics programs for the Latin American scientific community. |
Orozco A*, Morera J, Jimenez S
*Bioinformatics Research Laboratory (BREL), School of Medicine, University of Costa Rica Costa Rica |
L - Technology and Software |
|
L 32
L32 |
SWISS-MODEL is a widely used web service for comparative protein structure modelling. Around 2000 models are built every day by scientists around the world. Here we present current efforts to improve the quality of homology models built for target sequences by using an improved template library and a web interface focusing on an interactive template selection step. |
Biasini M, Arnold K*, Bienert S, Waterhouse A, Schwede T
*Biozentrum University of Basel Switzerland |
L - Technology and Software |
|
L 33
L33 |
We present software that segments subcellular objects with subpixel resolution in 2D or 3D fluorescence microscopy images. It relies on a model of the objects and of image formation. We show that this segmentation provides a noise-robust basis for colocalization analysis compared to pixel-based correlation between different fluorescence channels. The software also outputs the location, size, length, perimeter and intensity of segmented objects, which can be used in conjunction with colocalization analysis or to classify objects. The software is evaluated on the colocalization of five RAB GTPases with subcellular markers for early endosomes, late endosomes, recycling endosomes and the endoplasmic reticulum. The software is available as an ImageJ plugin and runs on desktop computers, where it can parallelize computation across the available processors, or on computer clusters. |
Rizk A*, Sbalzarini I, Berger P
*Paul Scherrer Institute Switzerland |
L - Technology and Software |
|
L 34
L34 |
We designed a completely new algorithm with the aim of dramatically speeding up the NGS read alignment process. Initial testing on human data with paired-end reads of 100 bp revealed that the alignment speed is indeed ultra-fast: more than 600 gigabases per day per CPU. This is at least two orders of magnitude faster than the currently most frequently used tools. At the time of the meeting we will have a complete set of test results, which we can share with the audience. |
Lunenberg J*, Tolhuis B, Karten H
*Genalice BV Netherlands |
L - Technology and Software |
|
L 35
L35 |
Handling a single multiplexed HiSeq2000 run, including the metadata of QC and sample processing steps up to the generated sequencing data files (e.g. FASTQ files), can become a very tedious task without suitable software. We offer software called openBIS, which provides out-of-the-box data management for the standard output of Illumina HiSeq2000 runs. All metadata accumulating during processing in the lab can be tracked and browsed by all lab members, which makes the whole library generation transparent and easy to follow. External people can also follow the progress. For full traceability we offer a property history which shows who changed what and at which time. Integration into the Illumina pipeline is supported via an API which allows the automatic generation of the SampleSheet.csv, based on the entries made in openBIS. Dual-indexed runs in particular are easily demultiplexed. The generated FASTQ files are attached to the metadata based on their names. The flexible (Jython-based) setup and easy-to-use drop boxes for getting data into openBIS give full control over which data are put into openBIS and in which structure. Creating another GUI besides the default one is simply done by using the JSON-RPC interface and allows the representation of data and metadata to be customized. |
Kohler M*, Beisel C, Nissen I, Barillari C, Elmer F, Glyzewski P, Kupczyk P, Luomi A, Ramakrishnan C, Straszewski J, Rinn B
*ETH Zürich Switzerland |
L - Technology and Software |
|
L 36
L36 |
The value of intuitive, highly visual interfaces to functional genomic data is substantial, particularly to users without computational and/or modern biology training (laboratory biologists, high school students, etc). The widespread “genome browser”-like displays acknowledge this somewhat, but there is much room for improvement. MaGnET (www.malariagenomeexplorer.org) and the new BaGET (www.oralgenomeexplorer.org, from September 2012) are prototype tools developed in our research group to facilitate what we call “exploration-style” analysis of many different types of data pertaining to the genes of their target organisms: Plasmodium falciparum (a human malarial parasite) and 13 bacterial species relevant to oral disease.
Exploration-style analysis can be viewed as taking “browsing” to the next level, through:
• highly visual, interactive interfaces (viewers), and direct (one-click) access from each viewer to the others
• integrated displays visualising several data types simultaneously (e.g. interaction and gene expression)
• working with two modifiable selections (each a user-defined set of genes) during exploration (i.e. genes can be added or removed without going back to a search page)
Both tools are Java programs that can communicate with MySQL databases housed at UCSC. They are publicly and freely available for use via the Internet. Particularly attractive to users with confidential data is that we offer standalone versions allowing users to work with their data locally. Tutorials and application examples at the respective websites help users get started. We find that our tools are valuable to biologists when forming and pursuing hypotheses that make non-obvious functional connections between several gene products. |
Sharman JL, Orton RJ, Xie G, Gerloff DL*
*University of California, Santa Cruz United States of America |
L - Technology and Software |
|
L 37
L37 |
The OmniLog® Phenotype Microarray (PM) can simultaneously record the respiration kinetics of bacteria under up to 2,000 environmental challenges spotted in sets of 96-well microtiter plates. Beyond these standard settings, the user may mimic further ecological challenges by modifying, e.g., ion type and concentrations in the 96-well microtiter plates. Furthermore, incubation temperature and gaseous phase can be set according to the needs of the user, ultimately allowing an enormous number of ecological challenges to be tested. For each challenge, valuable biological information is encoded in typical curve parameters such as lag phase, steepness of slope, maximum value and area under the respiration curve. Unfortunately, the proprietary OmniLog software does not enable one to analyse this wealth of data appropriately.
The opm package for the free statistical software environment R offers tools for storing the curve kinetics, aggregating the curve parameters, and integrating associated metadata on organisms and experimental settings, as well as methods for analysing these highly complex data sets graphically and statistically. The package also includes 95% confidence plots and enhanced heatmap graphics for comparing the estimated curve parameters. Querying and subsetting functions allow any combination of 96-well plates, individual wells, time-points and metadata to be selected and analysed. It is also possible to discretize these parameters and to export them for reports in taxonomic journals or for inferring phylogenies with external programs. Export and import in the YAML format facilitate data exchange between labs. The presentation will demonstrate the functionality of the R package through custom user-designed examples of the ecological characterization of bacteria. It is expected that the in-depth exploration of PM data via the opm package will become a valuable tool in systems biology. |
Vaas LAI*, Sikorski J, Buddruhs N, Fiebig A, Klenk H, Göker M
*Centraalbureau voor Schimmelcultures Netherlands |
L - Technology and Software |