Introduction

The New York City Council is the lawmaking body of the City of New York. The Council has 35 committees with oversight of various functions of the city government. Each council member sits on at least three standing, select or subcommittees. The standing committees must meet at least once a month unless the Charter mandates otherwise.

This brief write-up seeks to explore the following questions:

With this information, we can hold our government bodies accountable and determine the best (and worst) times of the year for legislative action.

The Data

We scraped our data from legistar.council.nyc.gov, which uses an .aspx framework to house the minutes and legislative documents related to the New York City Council. The site offers an export to Excel, but unfortunately, this feature is broken. As such, I have included the HTML export of the site here for convenience.

Set-Up

Libraries

First, let’s load the relevant libraries into the R environment and set the current working directory. Use the development version of ggvis to ensure that the interactive plot is embedded in the knitted html document.

# install.packages("devtools")
# devtools::install_github('rstudio/ggvis', build_vignettes = FALSE)
library(dplyr)
library(stringr)
library(XML)
library(ggplot2)
library(ggthemes)
# change to your working directory
setwd("/Users/Quan/mailman/qmssviz/hw1/_posts/")

Functions

Next, let’s define some functions to help us get and process the data more easily. HTMLtoDF parses the HTML tree using the XML package, retrieves the values enclosed in the <td> tags of the first table, and plops them into a data.frame. RemoveSpaces uses a regular expression to remove leading and trailing spaces.

HTMLtoDF <- function(file){
    doc <- htmlParse(file)
    tables <- readHTMLTable(doc, 
                            stringsAsFactors = FALSE,
                            na.strings = "")
    df <- tables[[1]]
    return(df)
}

# trim leading and trailing whitespace from each element of a character vector
RemoveSpaces <- function(x) {
    gsub("^\\s+|\\s+$", "", x)
}
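
To see what RemoveSpaces does, here is a quick check on a few made-up cell values (the strings are illustrative, not taken from the scraped table). Note that only the ends are trimmed; spaces inside a value are left alone.

```r
# same regular expression as RemoveSpaces above
RemoveSpaces <- function(x) gsub("^\\s+|\\s+$", "", x)

# made-up cell values with stray whitespace
cells <- c("  Committee on Finance ", "Deferred\t", " ")
trimmed <- RemoveSpaces(cells)
print(trimmed)
```

A cell that contains only whitespace comes back as an empty string, which is why the processing step later converts blank strings to NA.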

Loading the Data

df <- HTMLtoDF("nycc-meetings.html")

Data Processing

We clean the data to create more consistency in variable names and to ensure that variables are the correct data type for our analysis.

# remove leading and trailing spaces
df <- data.frame(sapply(df, RemoveSpaces))
# convert blank strings to NA
df[df == ""] <- NA
# change column names
names(df) <- c("Name", "Date", "Time", "Location", "Topic")
df$Date <- as.POSIXct(df$Date, format = "%m/%d/%Y")
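
As a sanity check, the same cleaning steps can be run on a tiny stand-in table. This is only a sketch: the toy data frame and its values are invented for illustration, and stringsAsFactors = FALSE is added here so the columns stay character regardless of R version.

```r
RemoveSpaces <- function(x) gsub("^\\s+|\\s+$", "", x)

# toy stand-in for the scraped table (values are made up)
toy <- data.frame(V1 = c(" Committee on Finance ", "Committee on Aging"),
                  V2 = c("6/26/2013", " 1/15/2012"),
                  V3 = c("10:00 AM", ""),
                  stringsAsFactors = FALSE)

# trim every column, then rebuild the data frame
toy <- data.frame(sapply(toy, RemoveSpaces), stringsAsFactors = FALSE)
# convert blank strings to NA
toy[toy == ""] <- NA
names(toy) <- c("Name", "Date", "Time")
# parse month/day/year strings into date-times
toy$Date <- as.POSIXct(toy$Date, format = "%m/%d/%Y")
str(toy)
```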

We use the dplyr library to filter out observations that will not help us answer our overarching questions. As such, we remove non-standing committees and ignore meetings that were deferred. We also create new variables to help us cut the data on month and year with greater ease.

# Remove inactive committees, subcommittees, task forces and town halls
dfPlot <- df %>%
  # filter out inactive and small committees
  filter(!str_detect(Name, ignore.case("Inactive")),
         !str_detect(Name, ignore.case("Subcommittee")),
         !str_detect(Name, ignore.case("Task")),
         !str_detect(Name, ignore.case("Town"))) %>%
  # create new variables
  mutate(DateTime = paste(Date,Time),
         # clean "name" variable
         Name = str_replace(Name, "Committee on ", ""),
         Name = str_replace(Name, ",.*$", ""),
         Name = str_replace(Name, "and Solid Waste Management", ""),
         Name = str_replace(Name, "Justice Services", "Justice"),
         # create new columns for month and year with proper format
         Month = factor(format(Date, format = "%b"), levels = month.abb),
         Year = factor(format(Date, format = "%Y")),
         # create new column for Status
         Status = ifelse(Time == "Deferred", "Deferred", "Calendared")) %>%
  # filter for meetings that were actually held
  filter(Year %in% c(2000:2013),
         Status == "Calendared")
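
For readers who want to check the filtering logic without stringr, here is a rough base R equivalent on a made-up table. The toy rows and the combined drop_pattern are assumptions for illustration; grepl(..., ignore.case = TRUE) stands in for str_detect with ignore.case().

```r
# made-up rows for illustration only
toy <- data.frame(
  Name = c("Committee on Finance", "Committee on Aging",
           "Subcommittee on Zoning and Franchises", "Town Hall Meeting"),
  Time = c("10:00 AM", "Deferred", "9:30 AM", "6:00 PM"),
  stringsAsFactors = FALSE
)

# drop inactive committees, subcommittees, task forces and town halls
drop_pattern <- "Inactive|Subcommittee|Task|Town"
toy <- toy[!grepl(drop_pattern, toy$Name, ignore.case = TRUE), ]

# shorten the committee name and flag deferred meetings
toy$Name <- sub("Committee on ", "", toy$Name)
toy$Status <- ifelse(toy$Time == "Deferred", "Deferred", "Calendared")

# keep only meetings that were actually held
held <- toy[toy$Status %in% "Calendared", ]
print(held)
```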

Visualizations

Let’s make some plots.

Committee Meetings Held by Month

# Total Committee Meetings Held by Month
dfPlot %>%
  group_by(Month, Year) %>%
  summarise(Count = n()) %>%
  group_by(Month) %>%
  summarise(Count = sum(Count/length(Year))) %>%
  
  # line plot
  ggplot(aes(x = Month, y = Count, group = 1)) +
  geom_line() +
  ylab("Mean Number of Meetings Held, 2000-2013") +
  # add a title
  ggtitle("City Council Meetings Held by Month") +
  theme_bw()

[Figure: line plot of the mean number of City Council meetings held in each month, 2000-2013]

Here we look at the mean number of City Council meetings held in each month from 2000 to 2013. From this graph, we can see that council members take their summers seriously. Good luck trying to schedule a hearing during July or August. There is also an upward trend from the beginning of the year until June, the end of the fiscal cycle. Perhaps council members are spurred by deadlines just as much as the rest of us.
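
The pipeline above computes sum(Count / length(Year)) within each month, which is just the mean of the yearly counts. A small sketch with invented numbers makes that explicit, and also shows one caveat: a month with no meetings in a given year contributes no row, so it is left out of the average rather than counted as zero.

```r
# invented month-by-year counts; July has a row for 2012 only
counts <- data.frame(
  Month = c("Jan", "Jan", "Jun", "Jun", "Jul"),
  Year  = c(2012, 2013, 2012, 2013, 2012),
  Count = c(20, 24, 40, 44, 3),
  stringsAsFactors = FALSE
)

# the expression used in the pipeline, applied per month
by_sum  <- tapply(counts$Count, counts$Month, function(x) sum(x / length(x)))
# the plain mean, applied per month
by_mean <- tapply(counts$Count, counts$Month, mean)

print(all.equal(by_sum, by_mean))
print(by_mean)
```

In this toy table, July averages 3 rather than 1.5, because 2013 has no July row. If that matters for the real data, the missing month-year combinations would need to be filled in with zeros before averaging.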

Distribution of the Number of Meetings Held By Year

# generate plot
dfPlot %>%
  # aggregate by year, get counts
  group_by(Name, Year) %>%
  summarise(Count = n()) %>%
  
  # boxplot
  ggplot(aes(x = Year, y = Count)) +
  stat_boxplot(geom = "errorbar") + # add error bars
  geom_boxplot() +
  
  # line graph through the mean of distribution
  stat_summary(aes(group = 1), fun.y = mean, geom = "line", size = 2) +
  
  # adjusting axes and titles
  scale_x_discrete(breaks = c(2000:2013)) + 
  ylab("Number of Meetings Held") +
  ggtitle("New York City Council Meetings, 2000-2013") +
  theme_economist() # use the economist theme

[Figure: boxplots of the number of meetings held per committee in each year, 2000-2013, with a line through the yearly means]

Here the solid black line goes through the mean of the distribution for each year. From it, we can approximate that the number of meetings held by the New York City Council has remained fairly consistent over the last decade. However, it is important to note that some committees were well below their quota of one meeting per month. By subsetting for the committees that held fewer than 12 meetings in a year and counting the years in which each one fell short, we can see who the most frequent offenders were:

dfPlot %>%
  # aggregate by year and
  # get meeting counts per committee
  group_by(Year, Name) %>%
  summarise(Count = n()) %>%
  filter(Count < 12) %>% # subset those falling behind quota
  # get top 5 most frequent offenders
  group_by(Name) %>%
  summarise(Freq = n()) %>%
  arrange(desc(Freq)) %>%
  head(5)
## Source: local data frame [5 x 2]
## 
##                            Name Freq
## 1                Youth Services   13
## 2  Oversight and Investigations   12
## 3          Standards and Ethics   12
## 4              Higher Education   11
## 5 State and Federal Legislation   11

So internal and youth-oriented committees have consistently held fewer than 12 meetings per year. We should check the Charter to verify whether these committees were explicitly allowed to do so.

The boxplot also shows several outliers. Since I am at a loss as to how to label the outliers using ggplot2, let’s take a look at the committee that held the most meetings in each year:

dfPlot %>%
  group_by(Year, Name) %>%
  summarise(Count = n()) %>%
  filter(Count == max(Count)) %>%
  arrange(desc(Count))
## Source: local data frame [14 x 3]
## 
##    Year     Name Count
## 1  2000  Finance    35
## 2  2001  Finance    39
## 3  2002  Finance    32
## 4  2003  Finance    34
## 5  2004  Finance    35
## 6  2005  Finance    41
## 7  2006  Finance    35
## 8  2007  Finance    36
## 9  2008  Finance    47
## 10 2009 Land Use    42
## 11 2010  Finance    36
## 12 2011  Finance    46
## 13 2012  Finance    39
## 14 2013  Finance    42

The Committee on Finance held the most meetings in every year except 2009, when Land Use led. This makes sense, since all committees must go through the Committee on Finance to approve any changes to the budget. Money is power.
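
On the outlier-labeling question above, one possible approach is to flag the outliers ourselves and then label only those rows. The sketch below uses the usual boxplot rule (more than 1.5 times the interquartile range beyond the quartiles), which I believe matches the default in geom_boxplot. The is_outlier helper and the counts are hypothetical and cover a single made-up year.

```r
# hypothetical helper: TRUE for values beyond coef * IQR from the quartiles
is_outlier <- function(x, coef = 1.5) {
  q <- unname(quantile(x, c(0.25, 0.75)))
  iqr <- q[2] - q[1]
  x < q[1] - coef * iqr | x > q[2] + coef * iqr
}

# made-up meeting counts for one year
counts <- data.frame(
  Name  = c("Finance", "Land Use", "Aging", "Health",
            "Parks", "Education", "Housing"),
  Count = c(42, 30, 12, 14, 13, 15, 11),
  stringsAsFactors = FALSE
)

# keep only the rows flagged as outliers
outliers <- counts[is_outlier(counts$Count), ]
print(outliers)
```

With the real data, the helper would be applied within each year, and the flagged rows could then be passed to something like geom_text(data = outliers, aes(label = Name)) on top of the boxplot. I have not tested that against this data set.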

Wish List

Session Info

## R version 3.1.1 (2014-07-10)
## Platform: x86_64-apple-darwin13.1.0 (64-bit)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggthemes_1.7.0   ggplot2_1.0.0    XML_3.98-1.1     stringr_0.6.2   
## [5] dplyr_0.2.0.9000
## 
## loaded via a namespace (and not attached):
##  [1] assertthat_0.1   colorspace_1.2-4 DBI_0.3.0        digest_0.6.4    
##  [5] evaluate_0.5.5   formatR_1.0      grid_3.1.1       gtable_0.1.2    
##  [9] htmltools_0.2.6  knitr_1.6        labeling_0.3     magrittr_1.0.1  
## [13] MASS_7.3-34      munsell_0.4.2    parallel_3.1.1   plyr_1.8.1      
## [17] proto_0.3-10     Rcpp_0.11.2      reshape2_1.4     rmarkdown_0.3.3 
## [21] scales_0.2.4     tools_3.1.1      yaml_2.1.13