
Computational Biology and Bioinformatics

Recent discoveries in biology have attracted the attention of researchers from other fields such as mathematics, physics, statistics and computer science. The interaction between these researchers has resulted in a multidisciplinary field that deals with a range of problems: building tools and databases for processing biological data, developing computational techniques for simulating biological systems, and applying the theory of computation to various aspects of genetics and molecular biology.

The great diversity of problems inspired by biology leads to confusion among researchers about how to classify their own work and what to expect from other researchers whose projects involve biology. The terms “bioinformatics” and “computational biology” were created to bring some order to this context. Although there is no universally accepted definition of either term (and they are often mistakenly used interchangeably), the following definitions are used by a good part of the community:

  • Bioinformatics: the development or application of computational tools to extend the use of biological data. This area encompasses the processes to acquire, store, organize, analyze and visualize biological data.

  • Computational Biology: the development of theoretical approaches from computer science, of algorithmic techniques to solve difficult problems based on biology, and of mathematical models to study biological systems.

Generally speaking, Computational Biology is commonly related to any application of computation in the biological area, while Bioinformatics is frequently associated with the sequence analysis of genomes, transcriptomes and proteomes. One goal of these areas is to deal with the huge amount of data obtained in recent years. DNA sequencing is the process of determining the order of nucleotides within a DNA molecule. The speed attained by modern DNA sequencing technology has been fundamental to the sequencing of genomes, which generates a huge amount of data. In turn, all this data has made the manual analysis of DNA sequences infeasible. Therefore, the complete sequencing of genomes would not be possible without informatics, and molecular biology would not be as advanced as it is today were it not for the computational resources we have. The need to process all this information has created new problems that are interdisciplinary in nature. In what follows, we present some of these problems.

Sequence Alignment

A sequence alignment is a way of arranging DNA, RNA, or protein sequences to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. If more than two sequences are involved, we call it a Multiple Sequence Alignment. Alignments are typically represented as rows within a matrix. Gaps are inserted within the sequences so that identical or similar characters are aligned in successive columns.


Dynamic programming is commonly used to produce alignments, but other methods exist.
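As an illustration of the dynamic programming approach, the following is a minimal sketch of global pairwise alignment in the style of the Needleman-Wunsch algorithm. The function name `global_align` and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for the example, not values prescribed by the text.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of strings a and b by dynamic programming.

    Returns (score, aligned_a, aligned_b), where '-' marks a gap.
    The scoring scheme is an illustrative assumption.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Score of aligning character a[i-1] against b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two characters
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    s, row_a, row_b = global_align("ACGT", "AGT")
    print(s)      # 2
    print(row_a)  # ACGT
    print(row_b)  # A-GT
```

The table has (n+1)(m+1) cells, so time and memory grow with the product of the sequence lengths. This is one reason other methods are used for long sequences or for many sequences at once.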

Sequence Assembly

Sequence assembly refers to aligning and merging fragments of a much longer DNA sequence in order to reconstruct the original sequence. This is needed because DNA sequencing technology cannot read whole genomes at once: depending on the technology, each read covers between 20 and 1000 bases. The difficulties of this problem include sequencing errors, the unknown size of the original sequence, repeated regions in the original sequence, regions that lack coverage, and the unknown orientation of the fragments. This makes the problem very difficult to solve, but there are some well-known tools for it, such as Phrap and CAP3.
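To make the idea of merging overlapping fragments concrete, here is a toy greedy sketch. It assumes error-free reads that all come from the same strand and a sequence without repeats, so it sidesteps exactly the difficulties listed above. It is not how Phrap or CAP3 work; the names `overlap`, `greedy_assemble` and the `min_len` threshold are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Overlaps shorter than min_len (assumed >= 1) are ignored and reported as 0.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap.

    Returns the list of sequences left when no overlap of at least
    min_len remains (one sequence if everything could be merged).
    """
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are entirely contained in another read.
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # the remaining sequences do not overlap
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


if __name__ == "__main__":
    fragments = ["TGCAATCG", "ATGGCGTG", "GCGTGCAA"]
    print(greedy_assemble(fragments))  # ['ATGGCGTGCAATCG']
```

With repeated regions, the longest overlap can join fragments that are not adjacent in the original sequence, which is one way a greedy strategy like this can go wrong.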

Phylogenetic Trees

Phylogenetic trees represent a common approach to infer evolutionary relationships among various biological species or other entities. Each leaf is a biological object and internal nodes are hypothetical ancestors. The distance between two objects may serve as a measure of their relationship.
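One way to build such a tree from pairwise distances is agglomerative clustering. The sketch below uses UPGMA, a technique chosen here only as an example and not named in the text. The function name `upgma`, the nested-tuple tree representation and the distance values are all illustrative assumptions.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a rooted tree by UPGMA (average-linkage clustering).

    names: list of leaf labels.
    dist:  dict mapping frozenset({x, y}) to the distance between leaves x, y.
    Returns the tree as nested tuples; each tuple is a hypothetical ancestor.
    """
    clusters = {name: 1 for name in names}  # subtree -> number of leaves
    d = dict(dist)                          # working copy of the distances
    while len(clusters) > 1:
        # Pick the two closest clusters and join them under a new internal node.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        # Distance to the new node is the size-weighted average of its children.
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    (tree,) = clusters
    return tree


if __name__ == "__main__":
    # Made-up distances between four hypothetical species.
    pairs = {
        ("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
        ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 10,
    }
    dist = {frozenset(k): v for k, v in pairs.items()}
    print(upgma(["A", "B", "C", "D"], dist))  # ('D', ('C', ('A', 'B')))
```

In the result, A and B share the most recent hypothetical ancestor, C joins them next, and D is the most distant leaf. UPGMA assumes a constant rate of evolution along all branches, an assumption that real data often violate.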


Metagenomics

The genome sequencing of isolated organisms has its limits. First, one needs to be able to culture the organism and isolate single clones, yet only a fraction of microbes can be cultivated. Second, microbes rarely live in isolation, since species interact with other species and with their habitat. New sequencing technologies have made it possible to obtain genomic information directly from microbial communities. This data is called a metagenome, and Metagenomics is its study. In other words, Metagenomics is the sequencing of DNA extracted from an environment rather than from a single organism. Metagenomics is a recent field with many computational challenges, such as the lack of tools to deal with metagenomic data. The fragments produced by sequencing have to be assembled in a different manner, since there may be countless organisms in the sample. Another problem is that nothing may be known about some of these organisms, which often leads to the discovery of new ones.

Genome Rearrangements

In the literature on genome rearrangements that focuses on mathematical models of genomes, chromosomes are usually represented as sequences of segments called syntenic blocks, which we assume to be shared by the genomes being compared. Letting n be the total number of shared segments, we assign a unique number in the set {1, …, n} to each segment, so that chromosomes can be regarded as permutations. We sort a given permutation by applying successive operations that transform it into the permutation whose elements are all in ascending order. The main goal of sorting problems is to find the minimum number of such operations. Reversals and transpositions are two of the most common operations that affect real genomes. A reversal occurs when a block of elements in the permutation is reversed, while a transposition occurs when two adjacent blocks of elements exchange positions. They lead to two challenging NP-hard problems in the genome rearrangement field: the Sorting by Reversals Problem and the Sorting by Transpositions Problem.
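The two operations can be written down directly on permutations stored as tuples. The sketch below also computes the reversal distance of a small unsigned permutation by exhaustive breadth-first search; this is only an illustration of the definition, it takes exponential time, and it is not an algorithm from the literature. The function names and the 0-based index conventions are assumptions made for the example.

```python
from collections import deque


def reversal(perm, i, j):
    """Reverse the block perm[i..j] (0-based, both ends included)."""
    return perm[:i] + perm[i:j + 1][::-1] + perm[j + 1:]


def transposition(perm, i, j, k):
    """Exchange the adjacent blocks perm[i:j] and perm[j:k], with i < j < k."""
    return perm[:i] + perm[j:k] + perm[i:j] + perm[k:]


def reversal_distance(perm):
    """Minimum number of reversals that sort perm, found by brute force.

    Breadth-first search over all permutations reachable by reversals;
    only feasible for very small n.
    """
    start = tuple(perm)
    target = tuple(sorted(start))
    steps = {start: 0}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if cur == target:
            return steps[cur]
        n = len(cur)
        for i in range(n):
            for j in range(i + 1, n):
                nxt = reversal(cur, i, j)
                if nxt not in steps:
                    steps[nxt] = steps[cur] + 1
                    queue.append(nxt)


if __name__ == "__main__":
    print(reversal((1, 4, 3, 2, 5), 1, 3))          # (1, 2, 3, 4, 5)
    print(transposition((1, 4, 5, 2, 3), 1, 3, 5))  # (1, 2, 3, 4, 5)
    print(reversal_distance((3, 1, 2)))             # 2
```

The search visits up to n! permutations, which is why it is limited to toy inputs.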

Scaffold Construction

With the decreasing costs of DNA sequencing, it is now very common for prokaryotic genomes to be sequenced to “draft” status only. This means that the generated sequence will be a set of contigs (a contig is a substring of the string over the DNA alphabet that represents the genome sequence). The number of contigs depends on the sequencing fold coverage and on the DNA sequencing technology, and typically ranges from half a dozen to a few hundred. One technique that can be used to improve the sequencing results is to generate a contig scaffold from the contig set. A scaffold is an ordered set of contigs, with the desired order being the correct genome order. Various techniques can help to create a scaffold, such as paired-end reads and physical or optical maps. Another technique, which has become popular in the last few years, is to use a reference genome. In this case, we assume the query genome has a close phylogenetic relative that has been fully sequenced, and this genome can be used to guide the assembly of the query genome and to generate a scaffold as well. When using a reference genome B to create a scaffold for a query genome A, one possible problem is the existence of rearrangements in A with respect to B.
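A reference-guided scaffold can be sketched as ordering contigs by the position at which they match the reference. The toy version below assumes every contig matches the reference exactly, on either strand, and that there are no rearrangements between the two genomes; real tools rely on approximate alignment instead. The function names and the sequences are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def scaffold_by_reference(contigs, reference):
    """Order contigs by where they occur in the reference genome.

    contigs:   dict mapping contig name -> sequence.
    reference: sequence of the reference genome (genome B in the text).
    Returns (ordered, unplaced): ordered is a list of (name, strand) pairs in
    reference order, unplaced lists contigs with no exact match.
    """
    placed, unplaced = [], []
    for name, seq in contigs.items():
        strand = "+"
        pos = reference.find(seq)
        if pos < 0:
            # Try the opposite strand before giving up.
            strand = "-"
            pos = reference.find(reverse_complement(seq))
        if pos < 0:
            unplaced.append(name)
        else:
            placed.append((pos, name, strand))
    placed.sort()
    return [(name, strand) for _, name, strand in placed], unplaced


if __name__ == "__main__":
    reference = "ATGGCGTGCAATCGGATTCCAGT"
    contigs = {
        "c1": "CGGATTC",
        "c2": "ATGGCG",
        "c3": "TTGCAC",    # reverse complement of GTGCAA
        "c4": "GGGGGGGG",  # absent from the reference
    }
    order, unplaced = scaffold_by_reference(contigs, reference)
    print(order)     # [('c2', '+'), ('c3', '-'), ('c1', '+')]
    print(unplaced)  # ['c4']
```

If A contains rearrangements with respect to B, the order recovered this way follows B rather than A, which is the problem the paragraph above points out.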



Professors (2):

  • Guilherme Pimentel Telles
  • Zanoni Dias