The scatterplot is one of the oldest and most widely used methods for visualizing data. It typically represents two variables of a dataset graphically. Each individual record in the dataset is drawn as a point in a Cartesian coordinate system whose axes are the data dimensions chosen as spatial reference. This usually results in a plot of scattered points. An example of a scatterplot can be seen in Figure 2.1. Two variables of a dataset have been chosen for a 2D spatial reference. One can see the individual points, read their approximate values from the axes, and look for clusters, correlations, and outliers.
A single scatterplot usually has only two axes and can therefore encode only two variables in space. To display more than two variables at a time, several scatterplots can be combined to form a scatterplot matrix, which consists of a number of scatterplots arranged in rows and columns. The scatterplot matrix in Figure 2.2 displays four variables pairwise against each other. The matrix has four rows and four columns; each row and each column is assigned one of the variables.
In general, a scatterplot matrix has $n \times n$ individual scatterplots, where $n$ is the number of variables in the dataset. That is more than the number of unique combinations of variables, which is $n(n-1)/2$. If $n$ is four, the number of unique combinations is six. Only six of the 16 plots in Figure 2.2 have unique pairs of variables. They are either the six plots in the lower left part or the six plots in the upper right part. Those two groups of plots show the same pairs of variables, but with the horizontal and vertical axes swapped, so that each plot is mirrored along its own diagonal. The plots along the diagonal of the matrix, from upper left to bottom right, have the same variable on both axes. They all contain a more or less irregularly dotted diagonal line. This provides distribution information about the variables, but it can be hard to read if many points lie close together.
If those partly redundant plots are not needed, they can be omitted. In this case the visualization is called a triangular matrix, as opposed to the full matrix described above. One can choose either the upper or the lower triangular matrix. Figure 2.3 shows an example of a lower triangular matrix.
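The layout of a lower triangular matrix can be sketched in a few lines of Python. This is a minimal illustration, not the implementation behind Figure 2.3: the function name `lower_triangle`, the variable names, and the convention that a column fixes the horizontal axis while a row fixes the vertical axis are all assumptions made for the example.

```python
def lower_triangle(variables):
    """Return the plots of a lower triangular matrix, row by row.

    Each plot is an (x_variable, y_variable) pair. Assumed convention:
    the column fixes the horizontal axis, the row fixes the vertical axis.
    """
    rows = []
    # Row 0 would contain only a diagonal plot, so start at row 1.
    for r in range(1, len(variables)):
        rows.append([(variables[c], variables[r]) for c in range(r)])
    return rows


layout = lower_triangle(["a", "b", "c", "d"])  # hypothetical variable names
for row in layout:
    print(row)
```

For four variables this yields rows of one, two, and three plots, which together are the six unique pairs; the diagonal and the mirrored upper part are left out.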
The problem with scatterplot matrices is that the overall relationship of the entire dataset is not easily visible [3], and it is difficult to see patterns that are present only when three or more dimensions are taken into account [33]. A possible solution is to use not only space but also color to encode more than two data dimensions in one plot. One can, for instance, use the red, green, and blue components of color to include three additional dimensions in a scatterplot [33]. Figure 2.4 shows an example. On the left is a scatterplot matrix without color. It is hard to see anything interesting, partly because many points overlap. The center image is a single scatterplot with color, and on the right is the same scatterplot matrix as on the left, this time with color. Clusters are now much easier to identify (using color and space to distinguish them), although the plots remain difficult to interpret. Color also makes partly overlapping points easier to tell apart.
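One way to picture the color encoding is to rescale three additional data columns to the range [0, 1] and use them as red, green, and blue components. The sketch below assumes a simple linear min-max scaling, which is not necessarily the mapping used in [33]; the function names and sample values are hypothetical.

```python
def normalize(values):
    """Linearly rescale values to the range [0, 1] (assumed mapping)."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rgb_colors(red_var, green_var, blue_var):
    """Combine three data columns into one (r, g, b) tuple per record."""
    return list(zip(normalize(red_var), normalize(green_var), normalize(blue_var)))


# Hypothetical values of three extra variables for three records.
colors = rgb_colors([1, 2, 3], [10, 30, 20], [5, 5, 5])
print(colors)
```

Each record then carries one color in addition to its position, so a single plot shows five variables at once.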
When using color in a scatterplot, one should make sure to use the space coordinates for the two most important variables in the dataset, and the colors for the less important dimensions, because space is the most valuable and most meaningful way of encoding data [3].
A variety of interaction techniques can be applied to scatterplots and scatterplot matrices to support exploration and analysis of datasets. Among them are selecting and changing the variables for the axes of plots, and windowing, which means enlarging a part of a plot to fill all of the available area. Another commonly used interaction technique is linking and brushing, which is discussed in Section 2.3.
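Windowing as described above can be sketched as a filter followed by a rescaling: points outside the chosen data range are dropped, and the remaining ones are mapped onto the full drawing area. The function name `window`, the pixel dimensions, and the sample points are assumptions for illustration only.

```python
def window(points, x_range, y_range, width=400, height=300):
    """Keep points inside the window and scale them to a width x height area."""
    (x0, x1), (y0, y1) = x_range, y_range
    visible = []
    for x, y in points:
        if x0 <= x <= x1 and y0 <= y <= y1:
            px = (x - x0) / (x1 - x0) * width
            # Screen y grows downward, so flip the vertical axis.
            py = (1 - (y - y0) / (y1 - y0)) * height
            visible.append((px, py))
    return visible


points = [(1, 1), (2, 3), (5, 5), (9, 2)]  # hypothetical data points
zoomed = window(points, x_range=(0, 4), y_range=(0, 4))
print(zoomed)
```

Here only the two points inside the 0 to 4 range survive, and they are spread over the whole 400 by 300 area.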
An example of an application of scatterplots that also uses color is the starfield display [1]. It uses the two axes of a scatterplot, for instance, for the release year and the length of movies. The color of the points encodes other properties.
A technique similar to the scatterplot matrix is the hyperslice [31]. It also consists of a matrix of two-dimensional plots of two variables each. Unlike the scatterplot matrix, however, it is used not for unstructured data but for visualizing scalar functions of many variables. When the number of variables is higher than three, visualization techniques such as simple 2D graphs, mountain plots, and volume rendering cannot display all variables in one visualization while treating them identically. The hyperslice solves this problem by slicing the data and plotting only two variables in each sub-plot. Each plot shows only a part of the function (a window), and users can change this window by direct manipulation. All sub-plots are linked, so that when the window of one plot is changed, the windows of the other plots that share a variable with it change accordingly.
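The slicing idea can be illustrated with a simplified sketch. It is not the method of [31]: it assumes a shared focus point and a square window of equal size for every pair of variables. For each pair, the function is sampled on a small grid while all other variables stay fixed at the focus. The names `slice_2d`, `hyperslice`, `focus`, and `half_width` are hypothetical, as is the example function.

```python
from itertools import combinations


def slice_2d(f, focus, i, j, half_width, steps=5):
    """Sample f on a grid over variables i and j; all others stay at the focus."""
    def axis(center):
        return [center - half_width + 2 * half_width * k / (steps - 1)
                for k in range(steps)]

    grid = []
    for vj in axis(focus[j]):
        row = []
        for vi in axis(focus[i]):
            point = list(focus)
            point[i], point[j] = vi, vj
            row.append(f(point))
        grid.append(row)
    return grid


def hyperslice(f, focus, half_width, steps=5):
    """Return one 2D slice per unordered pair of variables."""
    return {(i, j): slice_2d(f, focus, i, j, half_width, steps)
            for i, j in combinations(range(len(focus)), 2)}


def f(p):
    """Hypothetical scalar function of four variables."""
    return sum(x * x for x in p)


slices = hyperslice(f, focus=[0.0, 0.0, 0.0, 0.0], half_width=1.0, steps=3)
print(slices[(0, 1)])
```

Because every slice is computed from the same focus point, moving the focus and recomputing updates all sub-plots together, which loosely mirrors the linked windows described above.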