The New York City Council is the lawmaking body of the City of New York. The Council has 35 committees with oversight of various functions of the city government. Each council member sits on at least three standing, select or subcommittees. The standing committees must meet at least once a month unless the Charter mandates otherwise.
This brief write-up seeks to explore the following questions:
With this information, we can hold our government bodies accountable and determine the best (and worse) times of the year for legislative action.
We scraped our data from legistar.council.nyc.gov, which uses an .aspx framework to house the minutes and legislative documents related to the New York City Council. We could export the document to an Excel document, but unfortunately, this feature is broken. As such, I have included the HTML export of the site here for convenience.
First, let’s load the relevant libraries into the R environment and set the current working directory. Use the development version of ggvis
to ensure that the interactive plot is embedded in the knitted html document.
# install.packages("devtools")
# devtools::install_github('rstudio/ggvis', build_vignettes = FALSE)
library(dplyr)
library(stringr)
library(XML)
library(ggplot2)
library(ggthemes)
# change to your working directory
setwd("/Users/Quan/mailman/qmssviz/hw1/_posts/")
Next, let’s define some functions to help us get and process the data more easily. HTMLtoDF
parses the HTML tree using the XML
package, retreives values enclosed in the </td>
tag, and plops them into a data.frame
. RemoveSpaces
uses a regular expression to remove leading and lagging spaces.
HTMLtoDF <- function(file){
doc <- htmlParse(file)
tables <- readHTMLTable(doc,
stringsAsFactors = FALSE,
na.strings = "")
df <- tables[[1]]
return(df)
}
RemoveSpaces <- function(df) {
gsub("^\\s+|\\s+$", "", df)
}
df <- HTMLtoDF("nycc-meetings.html")
We clean the data to create more consistency in variable names and to ensure that variables are the correct data type for our analysis.
# remove leading and lagging spaces
df <- data.frame(sapply(df, RemoveSpaces))
# convert blank strings to NA
df[df == ""] <- NA
# change column names
names(df) <- c("Name", "Date", "Time", "Location", "Topic")
df$Date <- as.POSIXct(df$Date, format = "%m/%d/%Y")
We use the dplyr
library to filter out observations that will not help us answer our overarching questions. As such, we remove non-standing committees and ignore meetings that were deferred. We also create new variables to help us cut the data on month and year with greater ease.
# Remove defunct committees & Committee on Finance (outliers)
dfPlot <- df %>%
# filter out inactive and small committees
filter(!str_detect(Name, ignore.case("Inactive")),
!str_detect(Name, ignore.case("Subcommittee")),
!str_detect(Name, ignore.case("Task")),
!str_detect(Name, ignore.case("Town"))) %>%
# create new variables
mutate(DateTime = paste(Date,Time),
# clean "name" variable
Name = str_replace(Name, "Committee on ", ""),
Name = str_replace(Name, ",.*$", ""),
Name = str_replace(Name, "and Solid Waste Management", ""),
Name = str_replace(Name, "Justice Services", "Justice"),
# create new columns for month and year with proper format
Month = factor(format(Date, format = "%b"), levels = month.abb),
Year = factor(format(Date, format = "%Y")),
# create new column for Status
Status = ifelse(Time == "Deferred", "Deferred", "Calendared")) %>%
# filter for meetings that were actually held
filter(Year %in% c(2000:2013),
Status == "Calendared")
Let’s make some plots.
# Total Committee Meetings Held by Month
dfPlot %>%
group_by(Month, Year) %>%
summarise(Count = n()) %>%
group_by(Month) %>%
summarise(Count = sum(Count/length(Year))) %>%
# bar plot
ggplot(aes(x = Month, y = Count, group = 1)) +
geom_line() +
ylab("Mean Number of Meetings Held, 2000-2013") +
# create more breaks
ggtitle("City Council Meetings Held by Month") +
theme_bw()
Here we look at the proportion of City Council meetings held each month from 2000 to 2013. From this graph, we can see that council members take their summers seriously. Good luck trying to schedule a hearing during July or August. Also, you can see an upward trend starting from the beginning of the year until the month of June, the end of the fiscal cycle. Perhaps council members are spurred by deadlines just as much as the rest of us.
# generate plot
dfPlot %>%
# aggregate by year, get counts
group_by(Name, Year) %>%
summarise(Count = n()) %>%
# boxplot
ggplot(aes(x = Year, y = Count)) +
stat_boxplot(geom = "errorbar") + # add error bars
geom_boxplot() +
# line graph through the mean of distribution
stat_summary(aes(group = 1), fun.y = mean, geom = "line", size = 2) +
# adjusting axes and titles
scale_x_discrete(breaks = c(2000:2013)) +
ylab("Number of Meetings Held") +
ggtitle("New York City Council Meetings, 2000-2013") +
theme_economist() # use the economist theme
Here the solid black line goes through the mean of the distribution for each year. We can approximate that the total number of meetings held by the New York City Council has remained fairly consistent over the last decade. However, it is important to note that some committees were well below their quota of one meeting per month. By subsetting for the committees that held less than 12 meetings in one year, we can see who the most frequent offenders were:
dfPlot %>%
# aggregate by year and
# get meeting counts per committee
group_by(Year, Name) %>%
summarise(Count = n()) %>%
filter(Count < 12) %>% # subset those falling behind quota
# get top 5 most frequent offenders
group_by(Name) %>%
summarise(Freq = n()) %>%
arrange(desc(Freq)) %>%
head(5)
## Source: local data frame [5 x 2]
##
## Name Freq
## 1 Youth Services 13
## 2 Oversight and Investigations 12
## 3 Standards and Ethics 12
## 4 Higher Education 11
## 5 State and Federal Legislation 11
So internal and youth-oriented committees have consistently held less than 12 meetings per year. We should use the Charter to verify if these committees were explicitly allowed to do so.
The boxplot also shows several outliers. Since I am at lost at how to label the outliers using ggplot2
, let’s take a look at the committees that had the most meetings held by year:
dfPlot %>%
group_by(Year, Name) %>%
summarise(Count = n()) %>%
filter(Count == max(Count)) %>%
arrange(desc(Count))
## Source: local data frame [14 x 3]
##
## Year Name Count
## 1 2000 Finance 35
## 2 2001 Finance 39
## 3 2002 Finance 32
## 4 2003 Finance 34
## 5 2004 Finance 35
## 6 2005 Finance 41
## 7 2006 Finance 35
## 8 2007 Finance 36
## 9 2008 Finance 47
## 10 2009 Land Use 42
## 11 2010 Finance 36
## 12 2011 Finance 46
## 13 2012 Finance 39
## 14 2013 Finance 42
The Committee on Finance
consistently holds the most meetings per year. This makes sense since all committees must go through the finance committee to approve any changes to the budget. Money is power.
ggvis
stacked bar plot## R version 3.1.1 (2014-07-10)
## Platform: x86_64-apple-darwin13.1.0 (64-bit)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggthemes_1.7.0 ggplot2_1.0.0 XML_3.98-1.1 stringr_0.6.2
## [5] dplyr_0.2.0.9000
##
## loaded via a namespace (and not attached):
## [1] assertthat_0.1 colorspace_1.2-4 DBI_0.3.0 digest_0.6.4
## [5] evaluate_0.5.5 formatR_1.0 grid_3.1.1 gtable_0.1.2
## [9] htmltools_0.2.6 knitr_1.6 labeling_0.3 magrittr_1.0.1
## [13] MASS_7.3-34 munsell_0.4.2 parallel_3.1.1 plyr_1.8.1
## [17] proto_0.3-10 Rcpp_0.11.2 reshape2_1.4 rmarkdown_0.3.3
## [21] scales_0.2.4 tools_3.1.1 yaml_2.1.13