Directed Dot Plots

Directed dot plots are a visual enhancement of the traditional dot plots used to compare two sequences. The enhancement comes from plotting small glyphs rather than dots at each location in the plot. Each glyph consists of small diagonal lines directed at plus or minus 45 degrees. The length of a diagonal indicates the matching score of two words. A plus 45 degree direction indicates that a word from the sequence on the horizontal axis was compared to a word from the sequence on the vertical axis. A minus 45 degree direction indicates that the horizontal word was compared to the inverted word from the vertical axis. Crossing diagonals form X's, and white X's indicate the presence of palindromes.
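As a rough illustration of the idea, the following Python sketch computes the two scores behind each glyph. It is not the software described on this page. It assumes that the matching score is the number of positions at which two equal-length words agree, and that the "inverted" word is simply the vertical word read backwards; the function names, the word length k, and the text rendering are all hypothetical choices made for the example.

```python
def match_score(a, b):
    """Count the positions at which two equal-length words agree (assumed scoring)."""
    return sum(1 for x, y in zip(a, b) if x == y)


def directed_dot_plot(horizontal, vertical, k=3):
    """Return {(i, j): (forward, reverse)} for every pair of k-letter words.

    forward: score of the horizontal word against the vertical word
             (the plus 45 degree diagonal).
    reverse: score of the horizontal word against the vertical word read
             backwards (the minus 45 degree diagonal; plain reversal is an
             assumption about what "inverted" means).
    """
    glyphs = {}
    for i in range(len(horizontal) - k + 1):
        h_word = horizontal[i:i + k]
        for j in range(len(vertical) - k + 1):
            v_word = vertical[j:j + k]
            glyphs[(i, j)] = (match_score(h_word, v_word),
                              match_score(h_word, v_word[::-1]))
    return glyphs


def render(glyphs, k):
    """Crude text rendering that shows only full-length matches.

    '/' marks a full forward match, '\\' a full reverse match, 'X' both,
    and '.' anything weaker. A real plot would instead scale the length of
    each diagonal by its score.
    """
    if not glyphs:
        return ""
    n_cols = max(i for i, _ in glyphs) + 1
    n_rows = max(j for _, j in glyphs) + 1
    lines = []
    for j in range(n_rows):
        row = []
        for i in range(n_cols):
            forward, reverse = glyphs[(i, j)]
            if forward == k and reverse == k:
                row.append("X")
            elif forward == k:
                row.append("/")
            elif reverse == k:
                row.append("\\")
            else:
                row.append(".")
        lines.append("".join(row))
    return "\n".join(lines)


if __name__ == "__main__":
    seq = "ACGTTGCA"
    print(render(directed_dot_plot(seq, seq, k=3), k=3))
```

Under these assumptions, a word that reads the same in both directions, such as ACA compared with itself, scores fully on both diagonals and so appears as an X.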

Symbolic Scatter Plots


Click here for a web version of symbolic scatter plots. The software is preliminary; more features will be added over time. For questions or comments, contact david AT

This image is a symbolic scatter plot.  It graphically represents a small portion of the human Y-chromosome.

Traditionally, analysis of DNA, RNA, and protein sequences has relied almost exclusively on statistical algorithms. Regions of DNA found to be statistically similar across several sequences (often from multiple species) are viewed as biologically significant. However, deciding when sequences are statistically similar is still something of an art. Assumptions are made about the number and types of mismatches that can be tolerated. Indeed, two DNA sequences that "match" can contain very different sets of nucleotides because the assumptions may allow a high number of mismatches. Knowing when these mismatches can be tolerated is key to isolating regions of DNA that are biologically similar and, thus, important.

My interest is in analyzing biological sequences such as DNA using visualization algorithms. My goal is to be able to look at a DNA sequence to see what it can tell us without relying on comparisons with other DNA sequences. What intrinsic information does DNA contain that can help us to identify biologically important regions? Can we visualize this information? If so, is there any correlation between the function of a region and its graphical representation? For example, what do promoters look like? What do exons look like? What do introns look like? What do CpG islands look like? What do tandem repeats and approximate tandem repeats look like? How do cancer genes compare to each other visually? Can visualizations help to partition DNA sequences into biologically meaningful units?

Visualization is not new. Statisticians have used it for years to discover relationships in data. Perhaps the most widely used visualization is the scatter plot, which is routinely used to discover linear and non-linear relationships in data and to see when data is highly correlated and when it is not. My Ph.D. research focuses on applying scatter plots to symbolic sequences such as DNA, RNA, and protein. I refer to these as symbolic scatter plots to distinguish them from those used to analyze purely numeric data.

The above symbolic scatter plot is one example illustrating a very distinct pattern in the sequence. Within the human genome there are thousands of other such patterns. These patterns have sharp boundaries and stand out, begging one to discover what they represent. Are they sites where proteins bind to the DNA to regulate transcription, translation, or both? Are they sites that control the folding of DNA? Perhaps they are involved in the replication of DNA as a cell divides? Part of my Ph.D. research is an attempt to provide some preliminary answers to these questions.

To learn more about how symbolic scatter plots are created, click here.  Included will be links to software that you can download to create your own plots.