Tutorial

Gadfly is an implementation of a "grammar of graphics" style statistical graphics system for Julia. This tutorial will outline general usage patterns and will give you a feel for the overall system.

To begin, we need some data. Gadfly works best when the data is supplied in a DataFrame. In this tutorial, we'll pick and choose some examples from the RDatasets package.

Let us use Fisher's iris dataset as a starting point.

using Gadfly
using RDatasets

iris = dataset("datasets", "iris")

The plot function in Gadfly is of the form:

plot(data::DataFrame, mapping::Dict, elements::Element...)

The first argument is the data to be plotted, the second is a dictionary mapping "aesthetics" to columns in the data frame, and the remaining arguments are any number of elements: the nouns and verbs, so to speak, that form the grammar. The mapping can also be given as keyword arguments, which is the form used in the examples that follow.
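As a minimal sketch of the two call forms, assuming the installed Gadfly version accepts both, the dictionary and keyword versions below should describe the same plot. The variable names p_dict and p_kw are only illustrative.

```julia
using Gadfly
using RDatasets

iris = dataset("datasets", "iris")

# Explicit dictionary: aesthetic name => column name.
p_dict = plot(iris, Dict(:x => :SepalLength, :y => :SepalWidth), Geom.point);

# Equivalent keyword form.
p_kw = plot(iris, x=:SepalLength, y=:SepalWidth, Geom.point);
```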

Let's get to it.

p = plot(iris, x=:SepalLength, y=:SepalWidth, Geom.point);

This produces a Plot object. It can be saved to a file by drawing to one or more backends using draw.

img = SVG("iris_plot.svg", 6inch, 4inch)
draw(img, p)

Now we have the following charming little SVG image.

[Figure: scatter plot of SepalWidth against SepalLength.]
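The same Plot object can be drawn to more than one backend. The sketch below assumes the optional Cairo and Fontconfig packages are installed, which Gadfly needs for PNG and PDF output; the file names are placeholders.

```julia
using Gadfly
using RDatasets
import Cairo, Fontconfig  # assumed installed; needed for the PNG and PDF backends

iris = dataset("datasets", "iris")
p = plot(iris, x=:SepalLength, y=:SepalWidth, Geom.point);

# Draw the same plot to two different raster/vector backends.
draw(PNG("iris_plot.png", 6inch, 4inch), p)
draw(PDF("iris_plot.pdf", 6inch, 4inch), p)
```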

If you are working at the REPL, a quicker way to see the image is to omit the semicolon trailing plot. This automatically renders the image to your default multimedia display, typically a web browser. There is no need to capture the return value in this case.

plot(iris, x=:SepalLength, y=:SepalWidth, Geom.point)

Alternatively, one can call display on a Plot object manually. This workflow is necessary when display would not otherwise be called automatically, for example when plots are created inside a function.

function get_to_it(d)
  ppoint = plot(d, x=:SepalLength, y=:SepalWidth, Geom.point)
  pline = plot(d, x=:SepalLength, y=:SepalWidth, Geom.line)
  ppoint, pline
end
ps = get_to_it(iris)
map(display, ps)

For the rest of the demonstrations, we'll simply omit the trailing semicolon for brevity.

In this plot we've mapped the x aesthetic to the SepalLength column and the y aesthetic to the SepalWidth column. The last argument, Geom.point, is a geometry element, which takes bound aesthetics and renders delightful figures. Adding other geometries produces layers, which may or may not result in a coherent plot.

plot(iris, x=:SepalLength, y=:SepalWidth,
     Geom.point, Geom.line)

[Figure: SepalWidth against SepalLength, drawn with both points and a connecting line.]

This is the grammar of graphics equivalent of "colorless green ideas sleep furiously". It is valid grammar, but not particularly meaningful.

Color

Let's add something meaningful by mapping the color aesthetic.

plot(iris, x=:SepalLength, y=:SepalWidth, color=:Species,
     Geom.point)

[Figure: scatter plot of SepalWidth against SepalLength, colored by Species (setosa, versicolor, virginica).]

Ah, a scientific discovery: Setosa has short but wide sepals!

Color scales in Gadfly by default are produced from perceptually uniform colorspaces (LUV/LCHuv or LAB/LCHab), though it supports RGB, HSV, HLS, XYZ, and converts arbitrarily between these. Of course, CSS/X11 named colors work too: "old lace", anyone?
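Named colors can also be supplied by hand. As a sketch, assuming Scale.color_discrete_manual is available in the installed Gadfly version, the call below assigns one named color to each level of Species. The particular colors and the variable name p_named are arbitrary choices for illustration.

```julia
using Gadfly
using RDatasets

iris = dataset("datasets", "iris")

# One named color per level of the Species column.
p_named = plot(iris, x=:SepalLength, y=:SepalWidth, color=:Species,
               Geom.point,
               Scale.color_discrete_manual("orange", "purple", "green"));
```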

Scale transforms

Scale transforms also work as expected. Let's look at some data where this is useful.

mammals = dataset("MASS", "mammals")
plot(mammals, x=:Body, y=:Brain, label=:Mammal, Geom.point, Geom.label)

[Figure: labeled scatter plot of Brain against Body for each mammal, on linear axes.]

This is no good: the large animals are ruining things for us by squeezing every other point into one corner. Putting both axes on a log scale clears things up.

plot(mammals, x=:Body, y=:Brain, label=:Mammal,
     Geom.point, Geom.label, Scale.x_log10, Scale.y_log10)

[Figure: labeled scatter plot of Brain against Body for each mammal, with both axes on a log10 scale.]
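To see why the log scale helps, it is enough to compare how much of the axis separates two small values before and after the transform. The sketch below uses a handful of hand-picked body masses (illustrative numbers for this example, not read from the dataset) and plain Julia, with no plotting involved.

```julia
# Illustrative body masses in kg spanning several orders of magnitude
# (hand-picked for this sketch, not loaded from the mammals dataset).
body = [0.005, 0.1, 1.0, 62.0, 6654.0]

# Total extent of the axis on each scale.
raw_span = maximum(body) - minimum(body)
log_span = log10(maximum(body)) - log10(minimum(body))

# Fraction of the axis that separates the two smallest values.
raw_frac = (body[2] - body[1]) / raw_span
log_frac = (log10(body[2]) - log10(body[1])) / log_span

println("linear axis fraction: ", raw_frac)
println("log10 axis fraction:  ", log_frac)
```

On the linear axis the two smallest values are practically on top of each other, while on the log10 axis they are separated by a visible share of the axis.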

Discrete scales

Since all continuous analysis is just degenerate discrete analysis, let's take a crack at the latter using some fuel efficiency data.

gasoline = dataset("Ecdat", "Gasoline")

plot(gasoline, x=:Year, y=:LGasPCar, color=:Country,
     Geom.point, Geom.line)

[Figure: LGasPCar against Year, with points and lines colored by Country.]

We could have added Scale.x_discrete explicitly, but this is detected and the right default is chosen. This is the case with most of the elements in the grammar: we've omitted Scale.x_continuous and Scale.y_continuous in the previous plots, as well as Coord.cartesian, and guide elements such as Guide.xticks, Guide.xlabel, and so on. As much as possible, the system tries to fill in the gaps with reasonable defaults.
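As a rough illustration of what those defaults stand in for, the sketch below builds the first iris plot twice: once relying on defaults and once naming some of the omitted elements explicitly. It assumes these elements can be passed to plot in the forms shown; the variable names are illustrative, and the explicit version is not guaranteed to be pixel-identical to the default one.

```julia
using Gadfly
using RDatasets

iris = dataset("datasets", "iris")

# Relying on defaults.
p_implicit = plot(iris, x=:SepalLength, y=:SepalWidth, Geom.point);

# Naming the scales, coordinate system, and axis labels explicitly.
p_explicit = plot(iris, x=:SepalLength, y=:SepalWidth, Geom.point,
                  Scale.x_continuous, Scale.y_continuous,
                  Coord.cartesian(),
                  Guide.xlabel("SepalLength"), Guide.ylabel("SepalWidth"));
```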

Rendering

Gadfly uses a custom graphics library called Compose, which is an attempt at a more elegant, purely functional take on the R grid package. It allows mixing of absolute and relative units and complex coordinate transforms. The primary backend is a native SVG generator (almost native: it uses Pango to precompute text extents), though there is also a Cairo backend. See Backends for more details.

Building graphics declaratively lets you do some fun things, like sticking two plots together:

fig1a = plot(iris, x="SepalLength", y="SepalWidth", Geom.point)
fig1b = plot(iris, x="SepalWidth", Geom.bar)
fig1 = hstack(fig1a, fig1b)

[Figure: scatter plot of SepalWidth against SepalLength and bar chart of SepalWidth, side by side.]

Ultimately this will make more complex visualizations easier to build. For example, facets, plots within plots, and so on. See Layers and Stacks for more details.
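Stacking works vertically as well, and the stacked result can be drawn to a backend like a single plot. The following is a sketch under the assumption that vstack and draw behave as in current Gadfly releases; the output file name and the name fig2 are placeholders.

```julia
using Gadfly
using RDatasets

iris = dataset("datasets", "iris")

fig1a = plot(iris, x=:SepalLength, y=:SepalWidth, Geom.point);
fig1b = plot(iris, x=:SepalWidth, Geom.bar);

# Stack vertically instead of horizontally, then save the combined figure.
fig2 = vstack(fig1a, fig1b);
draw(SVG("iris_stack.svg", 6inch, 8inch), fig2)
```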

Interactivity

One advantage of generating our own SVG is that the files are much more compact than those produced by Cairo, by virtue of having a higher-level API. Another advantage is that we can annotate our SVG output and embed JavaScript code to provide some level of dynamism.

Though not a replacement for full-fledged custom interactive visualizations of the sort produced by d3, this sort of mild interactivity can improve a lot of standard plots. The fuel efficiency plot becomes clearer when some of the countries are toggled off, for example. To do so, simply click or shift-click on the colored squares in the table of keys to the right.
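To keep this interactivity in a saved file, the plot has to be written with a backend that embeds the JavaScript. The sketch below assumes the SVGJS backend exported by Gadfly is the one that does so; the file name is a placeholder, and the resulting file would need to be opened in a web browser for the toggling to work.

```julia
using Gadfly
using RDatasets

gasoline = dataset("Ecdat", "Gasoline")

p_gas = plot(gasoline, x=:Year, y=:LGasPCar, color=:Country,
             Geom.point, Geom.line);

# SVGJS (assumed here) writes an SVG with embedded JavaScript.
draw(SVGJS("gasoline_interactive.svg", 6inch, 4inch), p_gas)
```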