Functional Programming: filter

Mon, Sep 14, 2020 3-minute read

Another famous higher-order function is filter. Although it is rather handy for some programming languages, it might be less useful in the R world. I will talk about it for the sake of completeness (I wrote two posts about higher-order functions: one about map and another about reduce). Then I will explain why we don’t need it in R

What is filter?

Like the other higher-order functions we talked about, filter takes a function and a collection. The result is a subset of the original collection. Elements are kept or discarded, not surprisingly, according to the input function:

  • The input function takes an element of the collection and produces a boolean/logical, which is called a predicate.
  • For each element of the collection, the predicate is called. If the predicate returns TRUE the element is kept, and if it returns FALSE (you guessed it) the element is discarded.

How to implement it in R?

If you already know map and reduce, filter should be very easy. It is available in base R with a capital F: Filter().

Let’s say you have a vector c(1,2,3,4,5) and you want to filter only even numbers. The only trick you need to do is build a predicate, which we’ll call is_even().

is_even <- function(x) {
    x %% 2 == 0
}

All what’s left is calling Reduce() with the predicate and the vector.

Filter(is_even, 1:5)
# [1] 2 4

Side note: negating a predicate

Although you could easily negate a logical vector with the exclamation/bang sign, that doesn’t make sense for a predicate: you can’t negate a function. But R has a built-in function called Negate() that takes a predicate and return another predicate with the logic flipped. (Notice that this is a higher-order function that is different from the ones we discussed because it returns a function, the new predicate.)

Why is it not very useful?

One word: vectorization.

In case you don’t know what that means, it is the fact that when you do an arithmetic or logical operation on a vector. R automatically does it element-wise. This is not the default behavior of most programming languages in which it doesn’t make sense to add a number to a collection. But R is a language “made by statisticians for statisticians”, and making vectorization a default behavior is (I think) a smart move for a language meant to be used by non-developers.

So, when I ask if the remainder of dividing “the vector” by two is zero, I get back a vector of TRUEs and FALSEs.

1:5 %% 2 == 0
# [1] FALSE  TRUE FALSE  TRUE FALSE

Which brings us to the second smart move: any vector can be indexed with logical vectors, so that values matching the TRUEs are kept and those matching the FALSEs are discarded.

Here is another example in which I will write both methods without any intermediate definitions, and it’s up to you to decide which is better to your taste.

Filter(function(x) x %% 1 == 0, iris$Sepal.Length)
#  [1] 5 5 5 5 5 5 5 5 7 5 6 6 6 6 5 6 6

iris$Sepal.Length[iris$Sepal.Length %% 1 == 0]
#  [1] 5 5 5 5 5 5 5 5 7 5 6 6 6 6 5 6 6

Personally, I feel slightly annoyed by the need to repeat the vector’s name to use the indexing-based method. But I also know that these higher-order functions are kind of obscure in the R world.

Let’s also not forget that R is meant for data analysis, and Filter() doesn’t work with data frames unless you convert them to a list. But why do that when you have indexing-based filtering? or dplyr?


That ends our glance over the higher-order filter and, more importantly, why R programmers don’t need it.