Thursday, February 12, 2015

Parsing Months in R

As part of a recent analytics project, I needed to convert strings containing (English) month names to the corresponding month numbers (1 for January, ..., 12 for December). The strings came from a CSV file, and R converted them to a factor when the file was read. The factor had more than 12 levels: to the literal-minded (which includes R), "August" and "August " (the latter with a trailing space) are different months.
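To make the extra-levels problem concrete, here is a minimal sketch using made-up data (the vector below is an illustrative assumption, not the project's actual CSV contents). It only shows how a trailing space produces a separate factor level and how trimming whitespace merges the levels again.

```r
# Made-up sample data: two spellings of August, one with a trailing space.
months_raw <- factor(c("August", "August ", "May"))
nlevels(months_raw)    # 3: "August" and "August " are separate levels

# Trimming whitespace before rebuilding the factor merges them.
months_clean <- factor(trimws(as.character(months_raw)))
nlevels(months_clean)  # 2
```

Trimming alone does nothing about capitalization or abbreviations, which is why I wanted something more forgiving.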

I wanted a solution that was moderately robust to extra spaces, inconsistent capitalization, and abbreviations. A Google search turned up several solutions involving string manipulation, none of which entirely appealed to me, so I rolled my own, which I'm posting here. As usual, the code is licensed under a Creative Commons license (see the right-hand margin for details).

A few notes about the code:
  • I used the lubridate package to provide a function (month()) for extracting the month index from a date object. I know that some people dislike loading packages they don't absolutely need (memory consumption, namespace clashes, ...). I find the lubridate::month() function pleasantly robust, but if you want to avoid loading lubridate, I suggest you try one of the other methods posted on the Web.
  • My code loads the magrittr package so that I can "pipeline" commands. If you load a package (such as dplyr) that in turn loads magrittr, you're covered. If you prefer the pipeR package, a minimal amount of tweaking should produce a version that works with pipeR. If you just want to avoid loading anything, the same logic will work; you just need to change the piping into nested function calls.
  • I make no claim that this is the most efficient, most robust or most elegant solution. It just seems to work for me.
The code includes a small example of its use.

#
# Load libraries.
#
library(lubridate)
library(magrittr)
#
# Function monthIndex converts English-language string
# representations of a month name to the equivalent
# month number (1 for January, ..., 12 for December).
# Month names are matched using the current locale,
# so this assumes an English LC_TIME setting.
#
# Argument:
#   x  a character vector, or object that can be
#      coerced to a character vector
#
# Value:
#   a numeric vector of the same length as x,
#   containing the numbers of the months named
#   in x (NA if the entry in x cannot be deciphered)
monthIndex <-
  function(x) {
    x                                 %>%
      # strip any periods
      gsub("\\.", "", .)              %>%
      # turn it into a full date string
      paste0(" 1, 2001")              %>%
      # turn the full string into a date (%t skips leading
      # whitespace; %B also matches abbreviated names)
      as.Date(format = "%t%B %d, %Y") %>%
      # extract the month number
      month()
  }
#
# Unit test.
#
x <- c("Sep", "May", " July ", "huh?",
       "august", "dec ", "Oct. ")
monthIndex(x) # 9 5 7 NA 8 12 10
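
For readers who would rather not load anything, here is a rough sketch of the same logic with the pipes unrolled into ordinary function calls and lubridate::month() swapped for base R's as.POSIXlt(). The name monthIndexBase is my own placeholder, and I have only reasoned about it against the example above, so treat it as a starting point rather than a tested replacement. Like the original, it assumes an English locale for month names.

```r
# Hypothetical base-R variant of monthIndex(); the name monthIndexBase
# is an illustrative placeholder.
monthIndexBase <- function(x) {
  # strip periods, then append a day and year to form a full date string
  s <- paste0(gsub(".", "", as.character(x), fixed = TRUE), " 1, 2001")
  # parse the string; entries that cannot be parsed become NA
  d <- as.Date(s, format = "%t%B %d, %Y")
  # as.POSIXlt() stores months as 0-11, so add 1
  as.POSIXlt(d)$mon + 1L
}

x <- c("Sep", "May", " July ", "huh?",
       "august", "dec ", "Oct. ")
monthIndexBase(x) # 9 5 7 NA 8 12 10 in an English locale
```

One small difference: this version returns an integer vector, since as.POSIXlt() stores the month as an integer.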
