Recipe 4: geom_cat_lm(), geom_cat_fitted(), and geom_cat_residuals()

Example recipe #4: geom_cat_lm()

In this next recipe, we use panel-wise computation again to visualize a linear model that is estimated using both a continuous and a categorical variable, i.e. lm(y ~ x + cat). This may feel a bit like geom_smooth(method="lm") + aes(group="cat"). However, since geom_smooth does group-wise computation, the data is broken up before model estimation when a discrete variable is mapped like aes(color="sex") – meaning a model is estimated for each category. Let’s see how we might visualize a single model that includes a categorical variable.

Our first goal is to be able to specify a plot with a newly created geom_cat_lm() (we’ll also look at defining geom_cat_fitted() and geom_cat_residuals()):

(
    ggplot(penguins_clean)
    + aes(x="bill_depth_mm", y="bill_length_mm", cat="species")
    + geom_point()
    + geom_cat_lm()
)

Let’s get started!

Step 0: use base plotnine to get the job done

It’s a good idea to look at how you’d get things done without a Stat extension first, just using ‘base’ plotnine. The computational moves you make here can serve as a reference for building your extension.

Use ggplot.layer_data() to inspect the render-ready data inside the plot. Your stat will help prepare data that looks something like this.

        x          y  group  PANEL  alpha  size  linetype   color
136  15.5  34.833358      1      1      1   0.5     solid  maroon
118  15.9  35.393983      1      1      1   0.5     solid  maroon
 96  16.0  35.534140      1      1      1   0.5     solid  maroon
 72  16.1  35.674296      1      1      1   0.5     solid  maroon
 92  16.1  35.674296      1      1      1   0.5     solid  maroon

Step 1: Define compute. Test.

Now you are ready to begin building your extension function. The first step is to define the compute that should be done under-the-hood when your function is used. We’ll define this in a function called compute_panel_cat_lm(). The data input will look similar to the plot data. You will also need to include a scales argument, which plotnine uses internally.
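A minimal sketch of such a compute function follows. It assumes statsmodels’ formula interface for the model fit; the body is an illustration of the idea and may differ from the tutorial’s exact definition.

```python
import statsmodels.formula.api as smf


def compute_panel_cat_lm(data, scales):
    """Fit y ~ x + C(cat) on one panel's data and return fitted values as y."""
    # `scales` is not used here, but plotnine passes it to compute functions.
    model = smf.ols("y ~ x + C(cat)", data=data).fit()
    # Keep one row per observation so a line geom can connect the fitted points.
    return data.assign(y=model.fittedvalues)
```

Returning the input frame with y overwritten keeps any other columns intact, which the plot still needs downstream.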

Note: You may have noticed …
  1. … the scales argument in the compute definition, which is used internally in plotnine. While it won’t be used in your test (up next), you do need it so that the computation will work in the plotnine setting.

  2. … that the compute function can only be used with data with variables x, y, and cat. These aesthetic variable names, relevant for building the plot, are generally not found in the raw data inputs for the plot.

  3. … that we use C(cat) in the formula. This tells statsmodels to treat cat as a categorical variable in the model, equivalent to R’s automatic factor handling in lm().

Test compute.
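A sketch of such a test is below. It uses a hypothetical toy data frame rather than penguins_clean, so its numbers will differ from the output shown in the text; the compute function is included so the snippet runs on its own.

```python
import pandas as pd
import statsmodels.formula.api as smf


def compute_panel_cat_lm(data, scales):
    # Sketch of the Step 1 compute: fitted values of y ~ x + C(cat).
    model = smf.ols("y ~ x + C(cat)", data=data).fit()
    return data.assign(y=model.fittedvalues)


# Hypothetical toy data (made-up values), standing in for penguins_clean.
penguins_toy = pd.DataFrame({
    "bill_depth_mm": [18.7, 17.4, 18.0, 19.3, 14.1, 15.0, 13.5, 15.9],
    "bill_length_mm": [39.1, 39.5, 40.3, 36.7, 46.1, 50.0, 45.2, 49.8],
    "species": ["Adelie"] * 4 + ["Gentoo"] * 4,
})

# Rename raw columns to the aesthetic names the compute function expects.
renamed = penguins_toy.rename(
    columns={"bill_depth_mm": "x", "bill_length_mm": "y", "species": "cat"}
)

fitted = compute_panel_cat_lm(renamed, scales=None)
print(fitted.head())
```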

      x          y     cat
0  18.7  39.318360  Adelie
1  17.4  37.496328  Adelie
2  18.0  38.337265  Adelie
4  19.3  40.159297  Adelie
5  20.6  41.981329  Adelie

Note: You may have noticed …

… that we prepare the data to have columns with names x, y, and cat before testing compute_panel_cat_lm. Computation will fail if these names are not present given our function definition. Internally in a plot, columns are named based on aesthetic mapping, e.g. aes(x="bill_depth_mm", y="bill_length_mm", cat="species").

Step 2: Define new Stat. Test.

Next, we define a new stat class — which will let us do computation under the hood while building our plot.

Define Stat.

Note: You may have noticed …
  1. … that the naming convention for the stat class uses snake_case, e.g. stat_cat_lm.

  2. … that we inherit from the stat class. In fact, your class is a subclass — you are inheriting class properties from plotnine’s stat.

  3. … that the compute_panel_cat_lm function is called in the compute_panel method. This means that data will be transformed panel-wise (by facet), not group-wise.

  4. … that setting REQUIRED_AES to {"x", "y", "cat"} is consistent with compute requirements. The compute assumes data to be a DataFrame with columns x, y, and cat. Specifying REQUIRED_AES in your stat can improve your user interface because standard plotnine error messages will be raised when required aesthetics are not specified, e.g. ‘stat_cat_lm() requires the following missing aesthetics: x.’

Test Stat.

You can test out your stat by using it in plotnine geom_*() functions.

Note: You may have noticed …

… that we pass the class stat_cat_lm to the stat argument. You could also write geom_point(stat="cat_lm") which will direct to your new stat_cat_lm under the hood.

Test panel-wise behavior

You might be thinking, what we’ve done would already be pretty useful to me. Can I just use my stat as-is within geom_*() functions?

The short answer is ‘yes’! If you just want to use the stat yourself locally in a script, there might not be much reason to go on to Step 3, user-facing functions. But if you have a wider audience in mind, e.g. colleagues within your organization or users of an open-source package, a more succinct expression of the functionality you deliver will probably be useful, i.e. write the user-facing functions.

Instead of using a geom_*() function, you might prefer to use the layer() function in your testing step. Occasionally, you must go this route; for example, geom_vline() contains no stat argument, but you can use geom_vline in layer(). If you are teaching this content, using layer() may help you better connect this step with the next, defining the user-facing functions.

A test of stat_cat_lm using this method follows. You can see it is a little more verbose, as there is no default for the position argument, and setting the size must be handled with a little more care.

Step 3: Define user-facing functions. Test.

In this next section, we define user-facing geom_* classes. In plotnine, creating a new geom is as simple as subclassing an existing geom and setting DEFAULT_PARAMS with your new stat.

Define geom_*() class

‘Most ggplot2 users are accustomed to adding geoms, not stats, when building up a plot.’ ggplot2: Elegant Graphics for Data Analysis.

Because plotnine users may be more accustomed to using layers that have the geom_ prefix, you might define a geom_ class with the same properties as the stat_. Here we subclass geom_line and set the default stat to "cat_lm".

Note: You may have noticed …
  1. … that geom_cat_lm inherits from geom_line. This means the layer will render lines by default.

  2. … that DEFAULT_PARAMS includes all of geom_line’s parameters plus "stat": "cat_lm". In plotnine, DEFAULT_PARAMS is fully replaced (not merged) when subclassing, so every parent parameter must be redeclared.

Test/Enjoy functions

Below we use the new geom_cat_lm() and contrast it with geom_smooth(method="lm"): geom_cat_lm() draws parallel slopes across categories, whereas geom_smooth() fits each group separately, so its slopes are generally not parallel.

And check out conditionality

Note that because panel-wise (facet-wise) computation is specified, faceting the plot, e.g. by sex, yields two separately estimated models, one for female and one for male. If the model should be computed across all of the data, it’s worth considering layer-wise computation, i.e. overriding the compute_layer method (not yet covered in these tutorials).
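To see what panel-wise estimation implies numerically, the sketch below mimics it outside of plotnine by fitting the same formula once per facet subset and once on the pooled data. The data frame is made up and noise-free, built so that the true slope is 1 for female and 3 for male; slope_of_cat_lm is a hypothetical helper.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical noise-free data: slope 1 for female, slope 3 for male.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0] * 4,
    "y": [1.0, 2.0, 3.0, 3.0, 4.0, 5.0, 3.0, 6.0, 9.0, 4.0, 7.0, 10.0],
    "cat": ["a"] * 3 + ["b"] * 3 + ["a"] * 3 + ["b"] * 3,
    "sex": ["female"] * 6 + ["male"] * 6,
})


def slope_of_cat_lm(data):
    """Slope on x from the model y ~ x + C(cat)."""
    return smf.ols("y ~ x + C(cat)", data=data).fit().params["x"]


# Panel-wise: one model per facet value, as compute_panel would see the data.
per_panel = {sex: slope_of_cat_lm(panel) for sex, panel in toy.groupby("sex")}

# Layer-wise: a single model across all rows.
pooled = slope_of_cat_lm(toy)

print(per_panel)  # slopes close to 1.0 (female) and 3.0 (male)
print(pooled)     # a single slope close to 2.0
```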

Done! Time for a review.

Here is a quick review of the classes we’ve covered, dropping tests and discussion.

Note: Review

Your Turn: Write geom_cat_fitted() and geom_cat_residuals()

Using the geom_cat_lm Recipe #4 as a reference, try to create geom_cat_fitted() and geom_cat_residuals(), which draw fitted values and segments between observed and fitted values, respectively, for a linear model with a categorical variable.

Hint: consider what aesthetics are required for segments. We’ll give you Step 0 this time…

Step 0: use base plotnine to get the job done

Step 1: Write compute. Test.

Step 2: Write Stat.

Step 3: Write user-facing geom classes.

Congratulations!

If you’ve finished all four recipes, you should have a good feel for writing stats, and stat_*() and geom_*() classes.