Showing posts with label visualization. Show all posts

Saturday, August 9, 2014

Visualizing CDC's Morbidity and Mortality Weekly Report (MMWR) on Infrequently Reported Diseases


Hello Readers,

Here we will download, organize, and visualize disease data from the Morbidity and Mortality Weekly Report (MMWR) published by the Centers for Disease Control and Prevention (CDC). Recent news from Texas and other states includes a cyclosporiasis outbreak traced to allegedly imported Mexican produce. A cyclosporiasis outbreak in June 2013 sickened 631 people, so this year's outbreak is not new.


Cyclospora cayetanensis Parasite
This post acknowledges and thanks Aaron Kite-Powell of the Armed Forces Health Surveillance Center for collaborating on the R code. Check them out if you're interested in disease surveillance!


Notifiable Diseases and Mortality Tables


Looking at the CDC link above for notifiable diseases in Table I, we see over 55 diseases, some with no reported cases for the week (anthrax) and some with more than a few (measles). Diseases are deemed infrequent when fewer than 1,000 cases were reported in the past year to the National Notifiable Diseases Surveillance System (NNDSS). Clicking the export data button brings us to the data.cdc.gov page, where we can see the cases reported for each week of this year. We will download these data as a CSV file and plot the cases by disease in R. Selecting the Export button offers several download formats: we can download the CSV file manually, or we can copy the link and use R to retrieve the CSV file. The R code below demonstrates both methods.


Figure 1. Infrequently Reported Diseases Data


Accessing the Data in R


Now that we have the CSV file downloaded, and the link address copied, we can attempt both methods to load the data into R.

Manual CSV Download
First, we load the libraries we will use ("plyr", "ggplot2", "RCurl") and set the file location. Then we read in the CSV file with "read.csv()", making sure that strings are not converted to factors with "stringsAsFactors=F". Note that we have 20 columns and 1650 rows of disease data across the weeks.

CSV Code:
> ### CDC MMWR Scraping/Formatting
> ### https://data.cdc.gov/NNDSS/NNDSS-Table-I-infrequently-reported-notifiable-dis/wcwi-x3uk
> setwd("~/YOURDIRECTORY")
> # load libraries
> library(plyr)
> library(ggplot2)
> library(RCurl)
> 
> # CSV read-in method
> file <- "NNDSS_-_Table_I._infrequently_reported_notifiable_diseases.csv"
> nndss <- read.csv(file, stringsAsFactors=F)
> dim(nndss) # 1650 rows by 20 columns
[1] 1650   20
> names(nndss) # column names
 [1] "Disease"                                         
 [2] "MMWR.year"                                       
 [3] "MMWR.week"                                       
 [4] "Current.week"                                    
 [5] "Current.week..flag"                              
 [6] "Cum.2014"                                        
 [7] "Cum.2014..flag"                                  
 [8] "X5.year.weekly.averageâ.."                       
 [9] "X5.year.weekly.averageâ....flag"                 
[10] "Total.cases.reported..2013"                      
[11] "Total.cases.reported..2013..flag"                
[12] "Total.cases.reported.2012"                       
[13] "Total.cases.reported.2012..flag"                 
[14] "Total.cases.reported.2011"                       
[15] "Total.cases.reported.2011..flag"                 
[16] "Total.cases.reported.2010"                       
[17] "Total.cases.reported.2010..flag"                 
[18] "Total.cases.reported.2009"                       
[19] "Total.cases.reported.2009..flag"                 
[20] "States.reporting.cases.during.current.week..No.."
> 


R URL Download
Next we will use R and "download.file()" to retrieve the target CSV file. After specifying the destination file name ("reportedDiseases.csv") and the method ("method="curl""), we can read in the CSV with "read.csv()". Looking at the dimensions, we again see 20 columns, but now 1705 rows. This is because the URL always points to the most recent data from the CDC, whereas the CSV file we downloaded manually predates the latest week of data. The URL method is therefore preferable, since it retrieves the most recent data (1650 old rows + 55 diseases for the new week = 1705). The current week is thus 1705/55 = week 31.

URL Method Code:

> # URL Method
> url <- "https://data.cdc.gov/api/views/wcwi-x3uk/rows.csv?accessType=DOWNLOAD"
> download.file(url, destfile="reportedDiseases.csv", method="curl")
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  125k    0  125k    0     0  74997      0 --:--:--  0:00:01 --:--:-- 75658
> nndss.1 <- read.csv("reportedDiseases.csv", strip.white=T, stringsAsFactors=F)
> dim(nndss.1)
[1] 1705   20
> names(nndss.1)
 [1] "Disease"                                         
 [2] "MMWR.year"                                       
 [3] "MMWR.week"                                       
 [4] "Current.week"                                    
 [5] "Current.week..flag"                              
 [6] "Cum.2014"                                        
 [7] "Cum.2014..flag"                                  
 [8] "X5.year.weekly.averageâ.."                       
 [9] "X5.year.weekly.averageâ....flag"                 
[10] "Total.cases.reported..2013"                      
[11] "Total.cases.reported..2013..flag"                
[12] "Total.cases.reported.2012"                       
[13] "Total.cases.reported.2012..flag"                 
[14] "Total.cases.reported.2011"                       
[15] "Total.cases.reported.2011..flag"                 
[16] "Total.cases.reported.2010"                       
[17] "Total.cases.reported.2010..flag"                 
[18] "Total.cases.reported.2009"                       
[19] "Total.cases.reported.2009..flag"                 
[20] "States.reporting.cases.during.current.week..No.."
> nndss.1$Disease <- factor(nndss.1$Disease)
> 

Also remember to transform the "$Disease" column into a factor, because we have 55 diseases, not 1705.
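As a quick sanity check on the week arithmetic above (assuming "nndss.1" is loaded and its "$Disease" column converted to a factor as in the code block), the row count divided by the number of diseases should give the number of weeks reported so far:

```r
# rows per disease = number of MMWR weeks reported so far
nrow(nndss.1) / nlevels(nndss.1$Disease)   # 1705 / 55 = 31
```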


Infrequently Reported Diseases


Now that we have our data, they are organized by "$MMWR.week", but we want each disease grouped together, progressing week by week, so we can evaluate the cases. This is where we use "ddply()" to wrangle the data into the shape we need. First specify the data set, "nndss.1", then the grouping variables, ".(Disease, MMWR.week)", and finally the value we pull, the "Current.week" counts. So for week 1 it pulls the week 1 cases and pushes them to the output, and so on for each week. Again, we transform "$MMWR.week" into a factor variable, since it takes on discrete values from 1 to 31.

Data Re-Arrangement Code:
> # combine measure by disease type
> # for each week
> d <- ddply(nndss.1, .(Disease, MMWR.week), summarize,
+            count = Current.week)
> d$MMWR.week <- factor(d$MMWR.week)
>
> summary(d)
                                                               Disease    
 Anthrax                                                           :  31  
 Arboviral diseases, California serogroup virus disease§,¶       :  31  
 Arboviral diseases, Eastern equine encephalitis virus disease§,¶:  31  
 Arboviral diseases, Powassan virus disease§,¶                   :  31  
 Arboviral diseases, St. Louis encephalitis virus disease§,¶     :  31  
 Arboviral diseases, Western equine encephalitis virus disease§,¶:  31  
 (Other)                                                           :1519  
   MMWR.week        count       
 1      :  55   Min.   : 1.000  
 2      :  55   1st Qu.: 1.000  
 3      :  55   Median : 2.000  
 4      :  55   Mean   : 3.787  
 5      :  55   3rd Qu.: 4.000  
 6      :  55   Max.   :87.000  
 (Other):1375   NA's   :1255   
> 
> # rename levels ####
> levels(d$Disease)[3] <- "Arbo,EEE"
> levels(d$Disease)[2] <- "Arbo,CA serogroup"
> levels(d$Disease)[4] <- "Arbo,Powassan"
> levels(d$Disease)[5] <- "Arbo,St Louis"
> levels(d$Disease)[6] <- "Arbo,WEE"
> levels(d$Disease)[9] <- "Botulism other"
> levels(d$Disease)[14] <- "Cyclosporiasis"
> levels(d$Disease)[16] <- "H flu <5 non-b"
> levels(d$Disease)[17] <- "H flu <5 b"
> levels(d$Disease)[18] <- "H flu <5 unknown"
> levels(d$Disease)[19] <- "Hansen disease"
> levels(d$Disease)[20] <- "Hantavirus PS"
> levels(d$Disease)[21] <- "HUS,postdiarrheal"
> levels(d$Disease)[22] <- "HBV,perinatal"
> levels(d$Disease)[23] <- "Influenza ped mort"
> levels(d$Disease)[26] <- "Measles"
> levels(d$Disease)[27] <- "Mening a,c,y,w-135"
> levels(d$Disease)[28] <- "Mening other"
> levels(d$Disease)[29] <- "Mening serogroup b"
> levels(d$Disease)[30] <- "Mening unknown"
> levels(d$Disease)[31] <- "Novel influenza A"
> levels(d$Disease)[33] <- "Polio nonparalytic"
> levels(d$Disease)[35] <- "Psittacosis"
> levels(d$Disease)[38] <- "Q fever, total"
> levels(d$Disease)[40] <- "Rubella"
> levels(d$Disease)[42] <- "SARS-CoV"
> levels(d$Disease)[43] <- "Smallpox"
> levels(d$Disease)[44] <- "Strep toxic shock synd"
> levels(d$Disease)[45] <- "Syphilis congenital <1yr"
> levels(d$Disease)[47] <- "Toxic shock synd staph"
> levels(d$Disease)[51] <- "Vanco Interm Staph A"
> levels(d$Disease)[52] <- "Vanco Resist Staph A"
> levels(d$Disease)[53] <- "Vibrio non-cholera"
> levels(d$Disease)[54] <- "Viral hemorr fever"
> # levels finish
> 

Afterwards, we look at the names of the diseases and realize they are a mess, format-wise. So we rename each level in the "$Disease" factor variable as needed. Thanks Aaron for making the disease names more readable!
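One caution: positional renaming like "levels(d$Disease)[14] <- ..." depends on the alphabetical level order staying fixed between data pulls. A name-keyed sketch is more robust; the original level string below is illustrative, not copied from the data:

```r
# hypothetical mapping from an original CDC level name to a short label
rename.map <- c("Hansen disease (leprosy)" = "Hansen disease")
# find where the original names sit among the factor levels
idx <- match(names(rename.map), levels(d$Disease))
# rename only the levels that were actually found
levels(d$Disease)[idx[!is.na(idx)]] <- rename.map[!is.na(idx)]
```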



Plotting Disease Cases


The next step in R creates the visualizations we need to see the temporal pattern of the reported disease cases and possible outbreaks. Using the layers in "ggplot()", we place "MMWR.week" on the x-axis and "count" on the y-axis. Each panel is a histogram via "geom_histogram()", and we plot all 55 diseases in the same figure with "facet_wrap()". The 55 plots divide evenly into 11 rows of 5 columns. The "theme()" call removes the grid lines, and "scale_x_discrete()" sets the break points and labels on the week x-axis.

Plot Code:
> # plotting diseases
> # use d dataset
> plot <- ggplot(d, aes(x=MMWR.week, y=count)) +
+   geom_histogram(stat="identity") +
+   ggtitle("MMWR Infrequently Reported Diseases") +
+   facet_wrap(~Disease, nrow=11, ncol=5, scales="free") +
+   theme(panel.grid.minor=element_blank(), panel.grid.major=element_blank()) +
+   scale_x_discrete(breaks=c("1", "10", "20", "30"), labels=c("1","10","20","30"))
> 
> # write to file
> png(file="MMWR Infreq Diseases.png", bg="white",
+ width=850, height=850)
> # plot plots in png file
> plot
> # turns off png device
> dev.off()
RStudioGD 
        2 
> 

With "png()" we can write the plot directly to a .png file while specifying its dimensions. Calling the plot object "plot" while the png device is active writes it to the designated destination, and "dev.off()" turns the device off, returning control to the original RStudio plot device. Other formats are available, such as "jpeg()" or "bmp()"; "png()" was chosen because it preserves image quality.
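An alternative to managing the png device by hand is "ggsave()" from ggplot2, which opens and closes the device in one call. A minimal sketch, with the filename chosen here for illustration:

```r
# ggsave() infers the output format from the file extension
# width/height are in inches by default
ggsave("MMWR_Infreq_Diseases.png", plot=plot, width=8.5, height=8.5, dpi=100)
```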



Figure 2. Plot of Infrequently Reported Diseases

We can see the cyclosporiasis plot in the third row, second from the right. Note the steep rise in cases in recent weeks, owing to the outbreak in Texas and other states mentioned at the beginning of the post.


Visualizing the Cyclosporiasis Outbreak

By isolating the cyclosporiasis plot, we gain a better view of the outbreak in reported cases. Those infected begin to develop symptoms of diarrhea, fever, and nausea from two days to two weeks after infection, lasting from a few weeks to two months. The parasite spreads through feces-contaminated food and water, and is thought to be present on some imported produce.

Figure 3. Cyclosporiasis Reported Cases
As we can see, the reported cases rise dramatically in the month of July and hit a high of 58 cases in early August. As of the NNDSS report, there have been 221 cases of cyclosporiasis; not all of them are due to this specific outbreak, but the majority are.
Plotting Cyclosporiasis Code:
> levels(d$Disease)
 [1] "Anthrax"                  "Arbo,CA serogroup"        "Arbo,EEE"                
 [4] "Arbo,Powassan"            "Arbo,St Louis"            "Arbo,WEE"                
 [7] "Botulism, foodborne"      "Botulism, infant"         "Botulism other"          
[10] "Botulism, total"          "Brucellosis"              "Chancroid"               
[13] "Cholera"                  "Cyclosporiasis"           "Diphtheria"              
[16] "H flu <5 non-b"           "H flu <5 b"               "H flu <5 unknown"        
[19] "Hansen disease"           "Hantavirus PS"            "HUS,postdiarrheal"       
[22] "HBV,perinatal"            "Influenza ped mort"       "Leptospirosis"           
[25] "Listeriosis"              "Measles"                  "Mening a,c,y,w-135"      
[28] "Mening other"             "Mening serogroup b"       "Mening unknown"          
[31] "Novel influenza A"        "Plague"                   "Polio nonparalytic"      
[34] "Poliomyelitis, paralytic" "Psittacosis"              "Q fever, acute"          
[37] "Q fever, chronic"         "Q fever, total"           "Rabies, human"           
[40] "Rubella"                  "Rubella†††"         "SARS-CoV"                
[43] "Smallpox"                 "Strep toxic shock synd"   "Syphilis congenital <1yr"
[46] "Tetanus"                  "Toxic shock synd staph"   "Trichinellosis"          
[49] "Tularemia"                "Typhoid fever"            "Vanco Interm Staph A"    
[52] "Vanco Resist Staph A"     "Vibrio non-cholera"       "Viral hemorr fever"      
[55] "Yellow fever"            
> 
> # plot cyclosporiasis
> cyclo <- ggplot(d[d$Disease==levels(d$Disease)[14],],
+                 aes(x=MMWR.week, y=count)) +
+   geom_histogram(stat="identity") +
+   ggtitle("Cyclosporiasis Reported Cases in 2014") +
+   scale_x_discrete(breaks=c(1,6,10,14,19,23,27,31), # first weeks of each month
+                    labels=month.name[1:8]) # month names
> 
> # write to file
> png(filename="cyclosporiasis_plot.png", bg="white",
+     width=550, height=550)
> cyclo
> dev.off()
RStudioGD 
        2 
> 


And there we have it, folks! We gathered reported disease data from the CDC's MMWR and loaded it into R. Then we cleaned and transformed the data and visualized the case reports by disease. Additionally, we focused on cyclosporiasis and saw the rise in reported cases coinciding with the imported-produce outbreak in Texas and other states.


By looking at the health data, we can track cases in near real time and even anticipate future cases. Refer to the MMWR if you would like weekly reports on the latest outbreaks!


Thanks for reading,

Wayne
@beyondvalence
LinkedIn

Thursday, June 26, 2014

Why LeBron James Should Leave Miami: A Look At Win Shares


Hey Basketball Fans,

The NBA 2014 postseason is finally over, and the San Antonio Spurs have emerged victorious in the Finals over the Miami Heat in five games with stunning decisiveness. This sums to five Larry O'Brien Championship Trophies for the Spurs if you were counting. Let Tim Duncan help you out:

1, 2, 3, 4, and 5. Maybe a 6th? Maybe next year.

San Antonio never forgot Game 6 last year, and played with a vengeance in this year's Finals, taking down the defending champions, the Heat. Last year's Game 6? Consider that demon banished. 
The differences between last year's Spurs and this year's Spurs are glaring. To counter the Heat's athleticism, the Spurs relied on strings of passing plays to create mismatches or find the open man for the easy bucket, awing many analysts. The Spurs set two defining Finals records in this series: 75.8% shooting from the field in the first half of Game 3 to win 111-92, and a 70-point differential between the teams over 5 games.

While the Spurs celebrated in San Antonio, the Heat retreated back to Miami, regrouping and focusing on the 2014-2015 season. Now in the off season, the Heat roster is bound to change, especially with major player contracts ending. What I am talking about is LeBron's Decision 2.0, since he opted out of the final year of his contract with the Heat to explore free agency. Would King James stay in Miami, or would he choose to 'bring his talent' to another team?

Going into a potential season-ending Game 5, LeBron made several remarks on his performance and the team's performance in the first four games in this pregame interview. He says "it's team basketball", and maybe increasing his shot attempts and shooting percentage might help the Heat win. However, looking at the win shares through the season, LeBron has been shouldering the majority of the responsibility to win. The Heat's supporting cast was not doing much supporting, even during the Finals, namely the other two of the Big Three. Outside the Big Three, somebody find Mario Chalmers!

(Note: Win shares, as per basketball reference, estimate the number of wins contributed by a player.)

Which is why I think LeBron should leave Miami to play with a stronger, younger supporting cast. As one of the oldest teams in the NBA, the Heat struggled to stop the Spurs machine, exposing their reliance on LeBron, Wade, and Bosh. The Spurs' team basketball unmasked the weak impact of Wade and Bosh on the court, despite Wade resting during the regular season. But this was not a sudden phenomenon in the playoffs; the win shares in the regular season show the unbalanced contributions from the Heat's Big Three.


The HEAT Big Three: James, Wade, Bosh


Heat Big Three

Taking the win share data from basketball-reference, we can create a win share visual for the Miami Heat Big Three, starting from the 2010-2011 season to the season that just ended, 2013-2014. For all the R code in gathering and plotting the data, see the Appendix at the end of the post.

For each of the four seasons the Big Three have been together, we plot their individual win shares for the regular season, along with their average win shares with the heavy red dotted line.


Fig. 1

LeBron clearly takes the cake, with win shares around 15 for each season, peaking at 19.3 in 2012-2013. The King certainly adds many wins to the Heat team on paper. While Dwyane and Chris started above double digits in their first year together, they regressed over the seasons, down to 5.5 and 8.0 in 2013-14, respectively. The gap in win shares among the Big Three in '13-'14 illustrates the Heat's reliance on LeBron and the decline of Dwyane and Chris. The star power fades without LeBron, and as we saw in the Finals, he can only contribute so much. With two of the Big Three in decline, LeBron must look to another team with scoring and point guard options, unless the Heat can infuse their roster with those role players.


The difference has been Dwyane Wade's performance. Even though he rested during the regular season for the playoffs, he had a lackluster stat line in the Finals: 15.2 PPG, 3.8 RPG, 2.6 APG, while playing 34.5 minutes per game. Dwyane's offensive rating (points per 100 possessions) was 89, compared to LeBron's 120 and Chris Bosh's 119.
His regular season win shares have declined from 12.8 in '10-'11 to 7.7, 9.6, and 5.5 in subsequent years. The fall in win shares is attributable to him sitting out regular-season games to keep his knees healthy. Dwyane played in only 54 regular-season games compared to LeBron's 77.

Though Dwyane Wade has had an excellent career, his win share contribution in the regular season and his performance in the Finals have fallen dramatically. If LeBron chooses to stay, I would call them the Big Two, not the Big Three. Sorry Dwyane, but if LeBron signs with another team, I would not blame him.


The SPURS Big Three: Duncan, Ginobili, Parker


Spurs Big Three

Let's take a look at the NBA champions from the West. Winning consistency has been a hallmark of the Spurs for 17 seasons, beginning the year after Gregg Popovich became head coach in 1996-97. From the '97-'98 season onwards, the Spurs have never dropped below .500 in their regular season record; in fact, their lowest winning percentage was .610. Check out the Spurs' gear throughout those successful years (I spy 5 trophies):


Nice Set of Bling, San Antonio

So how consistent was the Spurs trio of Duncan, Parker, and Ginobili after Miami formed their Big Three? Pretty consistent and similar:


Fig. 2

Looking at the Heat in Fig. 1 and the Spurs in Fig. 2, we see a few differences:


1. Not one of the Spurs trio had a win share above 10,
2. the three stayed close to the mean (each other),
3. and there was no large gap in win shares like LeBron and Co. (point 2 rephrased).

Why do we see such similar numbers for the Spurs? We can attribute them to their mantra of team basketball and passing. On any given night, the Spurs might have 4 or 5 players in double figures in points, highlighting their balanced scoring options. In the Finals, the Spurs averaged 5 players in double figures: Tony Parker 18.0, Kawhi Leonard 17.8, Tim Duncan 15.4, Manu Ginobili 14.4, and Patty Mills 10.2. Compare that to the Heat, who had 3 players do the same.


Also, the Spurs managed their players' minutes down to a science: not a single player averaged more than 30 minutes per game. Good thing Tim Duncan opted in to the last year of his contract; he can make another run for his 6th trophy in 2015 alongside rising Spurs star Kawhi Leonard, the Finals MVP.


If LeBron James wants a fine set of NBA Finals bling like Tim Duncan, he should decide to sign with another team. King James won 2 rings in 4 Finals trips with Miami, but all good things come to an end.


Thanks for Reading,

Wayne

@beyondvalence
LinkedIn

If you were curious to examine the win shares of the stars on both the Heat and Spurs, here you go:


Fig. 3


R Code Appendix


Creating data.frame Code:
> # data ####
> names <- c("Player", "Team", 
>            "2010-2011",
>            "2011-2012",
>            "2012-2013",
>            "2013-2014")
> # LeBron James
> lebron <- c("LeBron James", "MIA", 15.6, 14.5, 19.3, 15.9)
> # Dwyane Wade
> dwade <- c("Dwyane Wade", "MIA", 12.8, 7.7, 9.6, 5.5)
> # Chris Bosh
> cbosh <- c("Chris Bosh", "MIA", 10.3, 6.9, 9.0, 8.0)
> 
> # Tim Duncan
> tduncan <- c("Tim Duncan", "SAS", 7.7, 5.9, 8.3, 7.4)
> # Tony Parker
> tparker <- c("Tony Parker", "SAS", 8.2, 7.1, 9.3, 5.9)
> # Manu Ginobili
> ginobili <- c("Manu Ginobili", "SAS", 9.9, 4.2, 4.5, 5.7)
> 
> # rbind player rows into a matrix
> ws <- rbind(lebron, dwade, cbosh, tduncan, tparker, ginobili)
> # create data.frame
> ws <- as.data.frame(ws, stringsAsFactors=FALSE)
> names(ws) <- names
> # format columns
> ws[,2] <- as.factor(ws[,2])
> ws[,3] <- as.numeric(ws[,3])
> ws[,4] <- as.numeric(ws[,4])
> ws[,5] <- as.numeric(ws[,5])
> ws[,6] <- as.numeric(ws[,6])
> ws
                Player Team 2010-2011 2011-2012 2012-2013 2013-2014
lebron    LeBron James  MIA      15.6      14.5      19.3      15.9
dwade      Dwyane Wade  MIA      12.8       7.7       9.6       5.5
cbosh       Chris Bosh  MIA      10.3       6.9       9.0       8.0
tduncan     Tim Duncan  SAS       7.7       5.9       8.3       7.4
tparker    Tony Parker  SAS       8.2       7.1       9.3       5.9
ginobili Manu Ginobili  SAS       9.9       4.2       4.5       5.7

Plotting Win Shares Together Code:

> # win share means
> heat.mean <- apply(ws[ws$Team=="MIA",3:6], 2, mean)
> spurs.mean <- apply(ws[ws$Team=="SAS",3:6], 2, mean)
> 
> # plotting win shares both teams ####
> plot(1:4, ws[1,3:6], type="l", 
+      main="Win Shares in Big Three Era",
+      xlab="Season", xaxt="n",
+      ylab="Win Shares", ylim=c(0,20),
+      col="red", lwd=2)
> axis(1, at=1:4, labels=names(ws)[3:6])
> for (i in 2:nrow(ws)) {
+   
+   if (i < 4) {
+     lty=i
+   } else lty=i-3
+     if (ws[i,2]=="MIA") {
+       col="red"
+     } else col="grey"
+     lines(1:4, ws[i,3:6], type="l",
+           col=col, lwd=2, lty=lty)
+ }
> 
> # add big 3 averages
> lines(1:4, heat.mean, lwd=6, lty=3, col="red")
> lines(1:4, spurs.mean, lwd=6, lty=3, col="grey")
> 
> # add legend
> col <- c()
> for (i in 1:nrow(ws)) {
+   if (ws[i,2]=="MIA") {
+     col <- c(col, "red")
+   } else col <- c(col, "grey")
+ }
> legend("bottomleft", col=col, lwd=rep(2), legend=ws[,1],
+        lty=c(1:3), cex=0.85)
> legend("bottomright", col=c("white", "red", "grey"), 
+        lty=3, lwd=6,
+        legend=c("Average of Big 3", "Miami Heat", "San Antonio Spurs"))

Plotting Win Shares by Team Code:

> # plotting win shares for heat ####
> plot(1:4, ws[1,3:6], type="l", 
+      main="Win Shares in Big Three Heat Era",
+      xlab="Season", xaxt="n",
+      ylab="Win Shares", ylim=c(0,20),
+      col="red", lwd=2)
> axis(1, at=1:4, labels=names(ws)[3:6])
> 
> # add lines
> lines(1:4, ws[2,3:6], lwd=2, lty=2, col="red")
> lines(1:4, ws[3,3:6], lwd=2, lty=3, col="red")
> lines(1:4, heat.mean, lwd=6, lty=3, col="red")
> 
> # add legend
> legend("bottomleft", col="red", lwd=c(2,2,2,6), 
+        legend=c(ws[1:3,1], "Big 3 Average"),
+        lty=c(1:3,3), cex=0.85)
> 
> # plotting win shares for spurs ####
> plot(1:4, ws[4,3:6], type="l", 
+      main="Win Shares in Big Three Heat Era- Spurs",
+      xlab="Season", xaxt="n",
+      ylab="Win Shares", ylim=c(0,20),
+      col="grey", lwd=2)
> axis(1, at=1:4, labels=names(ws)[3:6])
> 
> # add lines
> lines(1:4, ws[5,3:6], lwd=2, lty=2, col="grey")
> lines(1:4, ws[6,3:6], lwd=2, lty=3, col="grey")
> lines(1:4, spurs.mean, lwd=6, lty=3, col="grey")
> 
> # add legend
> legend("bottomleft", col="grey", lwd=c(2,2,2,6), 
+        legend=c(ws[4:6,1], "Big 3 Average"),
+        lty=c(1:3,3), cex=0.85)

Disclosure: I support the Spurs.


Monday, June 23, 2014

Predicting Capital Bikeshare Demand in R: Part 1. Data Exploration


Hello Readers,

In order to promote alternative public transportation, many major cities in the U.S. have established bike sharing programs. These systems use a network of kiosks where users rent and return bikes on an as-needed basis. Users can rent a bike at one kiosk and return it to another across town. The automated kiosks gather all sorts of bike usage data, including rental duration and departure and arrival locations. These data points act as proxy measures for analysts to estimate city mobility. (Check out the YouTube video in the middle of the post.)


Capital Bikeshare Station

This "Bike Sharing" R series involves predicting bike rentals over 2011 and 2012 for the Capital Bikeshare program in Washington, D.C. The CSV data for forecasting can be obtained from the Kaggle Knowledge Competition.


Capital Bikeshare Data


The training data are the first 19 days of each month from January 2011 to December 2012; the test data, from which we aim to predict the bike rental numbers, are the remaining days of each month. The variables include "datetime", seasonal data, temperature, humidity, and wind speed measures. Because Kaggle gives us this information along with the time stamps, we will have to evaluate whether a model with the weather data or a time series model without it better predicts the bike rental counts.

Before we get ahead of ourselves and start modeling, we need to understand the data first. Remember to point your working directory in R to the proper location. Load the training data with "read.csv", and get a glimpse of the data with "head" and "summary":


First Look Code:
> # set working directory
> setwd("~/Documents/Kaggle/BikeSharing")
> 
> # load libraries ####
> library(xts)
> library(gbm)
>
> # load train csv
> train <- read.csv("train.csv", stringsAsFactors=FALSE)
> head(train)
             datetime season holiday workingday weather temp  atemp
1 2011-01-01 00:00:00      1       0          0       1 9.84 14.395
2 2011-01-01 01:00:00      1       0          0       1 9.02 13.635
3 2011-01-01 02:00:00      1       0          0       1 9.02 13.635
4 2011-01-01 03:00:00      1       0          0       1 9.84 14.395
5 2011-01-01 04:00:00      1       0          0       1 9.84 14.395
6 2011-01-01 05:00:00      1       0          0       2 9.84 12.880
  humidity windspeed casual registered count
1       81    0.0000      3         13    16
2       80    0.0000      8         32    40
3       80    0.0000      5         27    32
4       75    0.0000      3         10    13
5       75    0.0000      0          1     1
6       75    6.0032      0          1     1
>
> summary(train)
   datetime             season         holiday          workingday    
 Length:10886       Min.   :1.000   Min.   :0.00000   Min.   :0.0000  
 Class :character   1st Qu.:2.000   1st Qu.:0.00000   1st Qu.:0.0000  
 Mode  :character   Median :3.000   Median :0.00000   Median :1.0000  
                    Mean   :2.507   Mean   :0.02857   Mean   :0.6809  
                    3rd Qu.:4.000   3rd Qu.:0.00000   3rd Qu.:1.0000  
                    Max.   :4.000   Max.   :1.00000   Max.   :1.0000  
    weather           temp           atemp          humidity     
 Min.   :1.000   Min.   : 0.82   Min.   : 0.76   Min.   :  0.00  
 1st Qu.:1.000   1st Qu.:13.94   1st Qu.:16.66   1st Qu.: 47.00  
 Median :1.000   Median :20.50   Median :24.24   Median : 62.00  
 Mean   :1.418   Mean   :20.23   Mean   :23.66   Mean   : 61.89  
 3rd Qu.:2.000   3rd Qu.:26.24   3rd Qu.:31.06   3rd Qu.: 77.00  
 Max.   :4.000   Max.   :41.00   Max.   :45.45   Max.   :100.00  
   windspeed          casual         registered        count      
 Min.   : 0.000   Min.   :  0.00   Min.   :  0.0   Min.   :  1.0  
 1st Qu.: 7.002   1st Qu.:  4.00   1st Qu.: 36.0   1st Qu.: 42.0  
 Median :12.998   Median : 17.00   Median :118.0   Median :145.0  
 Mean   :12.799   Mean   : 36.02   Mean   :155.6   Mean   :191.6  
 3rd Qu.:16.998   3rd Qu.: 49.00   3rd Qu.:222.0   3rd Qu.:284.0  
 Max.   :56.997   Max.   :367.00   Max.   :886.0   Max.   :977.0 

With our initial look at the data, we can make a few observations:


1. The "datetime" variable is formatted "year-month-day hour:minute:second",
2. "season", "holiday", "workingday", and "weather" are categorical variables,
3. our target variable, "count", is composed of "casual" and "registered" users,
4. and each row entry increments by the hour for 10,886 observations.

Now we need to change "datetime" into a date type, and "season", "holiday", "workingday", and "weather" into factor variables. Keep in mind that our prediction model need not include all the variables.


Variable Reformatting Code:
> # set categorical variables ####
> # season, holiday, workingday, weather
> train$season <- factor(train$season, c(1,2,3,4), ordered=FALSE)
> train$holiday <- factor(train$holiday, c(0,1), ordered=FALSE)
> train$workingday <- factor(train$workingday, c(0,1), ordered=FALSE)
> train$weather <- factor(train$weather, c(4,3,2,1), ordered=TRUE)
> # set datetime ####
> train$datetime <- as.POSIXct(train$datetime, format="%Y-%m-%d %H:%M:%S")
> str(train)
'data.frame': 10886 obs. of  12 variables:
 $ datetime  : POSIXct, format: "2011-01-01 00:00:00" "2011-01-01 01:00:00" ...
 $ season    : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
 $ holiday   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ workingday: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ weather   : Ord.factor w/ 4 levels "4"<"3"<"2"<"1": 1 1 1 1 1 2 1 1 1 1 ...
 $ temp      : num  9.84 9.02 9.02 9.84 9.84 ...
 $ atemp     : num  14.4 13.6 13.6 14.4 14.4 ...
 $ humidity  : int  81 80 80 75 75 75 80 86 75 76 ...
 $ windspeed : num  0 0 0 0 0 ...
 $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
 $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
 $ count     : int  16 40 32 13 1 1 2 3 8 14 ...

In our training data.frame, we have 12 reformatted variables, and the "str" function shows the changes reflected in the variable types. I set "weather" as an ordinal factor, where order matters, instead of a regular factor variable. Looking at the data dictionary, you can see the categorical "weather" variable describes the severity of the weather conditions, with 1 being clear or partly cloudy, and 4 indicating thunderstorm, heavy rain, sleet, etc. So 1 is the best weather condition and 4 is the worst.
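To see why the ordering matters, here is a minimal sketch on made-up values (not the training data): an ordered factor supports meaningful comparisons, whereas a plain factor does not.

```r
# Toy example: levels ordered 4 < 3 < 2 < 1, mirroring the weather severity scale,
# so "better" weather compares as greater.
w <- factor(c(1, 3, 2, 4), levels = c(4, 3, 2, 1), ordered = TRUE)
w[1] > w[2]   # TRUE: weather 1 (clear) ranks above weather 3
min(w)        # the worst weather condition present, here level "4"
```

With a regular (unordered) factor, the `>` comparison would return NA with a warning instead.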


Now let us examine the distribution of our bike sharing data.



Exploring Count Data


Since we aim to predict the bike share demand, the obvious place to begin is with our target variable, "count".  We can stratify the "count" distribution as boxplots for the categorical variables, and draw the "count" and numeric variables in another plot. We group the two sets of plots together, and designate their plotting order with "layout".


Count Distribution Code:
> # count is our target variable
> # plot count distribution
> # by categorical var
> layout(matrix(c(1,1,2,3,4,5),2,3,byrow=FALSE))
> boxplot(train$count, main="count")
> boxplot(train$count ~ train$weather, main="weather")
> boxplot(train$count ~ train$season, main="season")
> boxplot(train$count ~ train$holiday, main="holiday")
> boxplot(train$count ~ train$workingday, main="workingday")
> 
> # by numeric var
> layout(matrix(c(1,2,3,4),2,2,byrow=TRUE))
> plot(train$temp, train$count, main="temp")
> plot(train$atemp, train$count, main="feelslike temp")
> plot(train$windspeed, train$count, main="windspeed")
> plot(train$humidity, train$count, main="humidity")

The R code above will yield the two "count" distribution graphics below:

Count Distribution by Categorical Variables

Observe the five "count" boxplots above, with the larger plot showing the overall "count" distribution. The median count hovers around 150 units, with many outlier counts above 600; the full range runs from 0 to just under 1000 units. When stratified by "weather", aside from extreme weather (category 4), the median count rises as conditions improve. There is not much difference beyond the outliers between holiday and non-"holiday" days, nor between working and non-"workingday" days. We see higher median counts for "season" 2 and 3, summer and fall respectively.
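The per-group medians behind those boxplots can also be pulled out numerically with `tapply()`. A sketch on toy values (column names follow the post; the numbers are made up, not from the real data):

```r
# Toy data frame mimicking the count/season columns
toy <- data.frame(count  = c(40, 90, 160, 230, 210, 120),
                  season = factor(c(1, 1, 2, 3, 3, 4)))
# Median count per season level
tapply(toy$count, toy$season, median)
```

On the real data the equivalent call would be `tapply(train$count, train$season, median)`, which gives the exact values the season boxplots summarize.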

Count Distribution by Numeric Variables

Now we move to the numeric variables. Looking at the distributions, we see a general trend of higher "count" values for temperatures from the mid-20s to low 30s (in Celsius). Not surprisingly, counts run higher at lower "windspeed" values. "humidity" made little difference, as values from 30 to 70 produced similar "count" values.
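That temperature impression can be quantified by binning "temp" with `cut()` and averaging "count" per bin. A sketch on toy values (the numbers are invented for illustration):

```r
# Toy temp/count pairs; bins are 9-degree-Celsius intervals from 0 to 45
toy <- data.frame(temp  = c(5, 12, 18, 27, 30, 38),
                  count = c(30, 80, 150, 400, 380, 200))
tapply(toy$count, cut(toy$temp, breaks = seq(0, 45, by = 9)), mean)
```

On the real data, something like `tapply(train$count, cut(train$temp, breaks = seq(0, 45, 5)), mean)` would show mean usage per 5-degree band.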



Take a Break: How the Capital Bikeshare Program Works





Examining All Variables


Since we looked at the relationship between the "count" variable and the other covariates, what about the relationships among the covariates themselves? An obvious guess would be correlation between temperature and season, and between weather and wind speed. To visualize their relationships, we create a pairwise plot with the size of the correlation font in the upper panel proportional to the correlation strength. A best fit regression line is also added to give an idea of trend.

Pairwise Plot Code:
> # pairwise plot
> ## put (absolute) correlations on the upper panels,
> ## with size proportional to the correlations.
> panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
+ {
+   usr <- par("usr"); on.exit(par(usr))
+   par(usr = c(0, 1, 0, 1))
+   r <- abs(cor(x, y))
+   txt <- format(c(r, 0.123456789), digits = digits)[1]
+   txt <- paste0(prefix, txt)
+   if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
+   text(0.5, 0.5, txt, cex = cex.cor * r)
+ }
> pairs(~count+season+holiday+workingday+weather+temp+humidity+windspeed,
+       data=train, 
+       main="Bike Sharing",
+       lower.panel=panel.smooth, 
+       upper.panel=panel.cor)
> 

Below we have the resulting pairwise plot with correlations:



Pairwise Plot

I suggest enlarging the plot to view the graphic in all its complexity.


Yes, we do see the seasonal temperature variation; however, we do not see much variation in wind speed except with humidity. We also see higher humidity with increased weather severity, which makes sense, since rain, fog, sleet, and snow all involve precipitation. And yes, if it is a holiday, then it probably is not a working day (though not all non-working days are holidays).


Good. Our pairwise plot revealed no unexpected surprises.


A Closer Look at Count


Let's go back and analyze "count" again, but this time with the "datetime" variable to incorporate a temporal aspect. Recall that we reformatted the variable so now it is recognized as a date type (POSIXct), by year-month-day hour:minute:second format.
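Because "datetime" is now POSIXct, individual components can be pulled out with `format()`, which is handy for time-based features later. A small self-contained sketch (the example timestamp is illustrative, not from the data):

```r
# Parse one timestamp in the same format as the training data
dt <- as.POSIXct("2011-01-01 13:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
format(dt, "%H")   # hour of day, "13"
format(dt, "%Y")   # year, "2011"
weekdays(dt)       # day of week
```

On the training data, `format(train$datetime, "%H")` would give the hour for every observation in one call.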


So let us visualize the counts over the time of our data. Additionally, since we have the breakdown of casual and registered users, we can examine the percentage of registered users using the Capital Bikeshare each hour over the two years.

Count and Percentage Plot Code:
> # plot
> # count of bike rentals by datetime
> plot(train$datetime, train$count, type="l", lwd=0.7,
+      main="Count of Bike Rentals")
> 
> # percentage of registered users
> percentage <- train$registered/train$count*100
> plot(train$datetime, percentage, type="l", 
+      main="Percentage of Registered Users")
> 
> summary(percentage)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   74.66   85.53   82.90   93.83  100.00 

Immediately we see differences within and between year 2011 and 2012. Observe the seasonal spike in the summer and fall in each year, and also the general elevated counts in 2012 compared to 2011. It appears that the Capital Bikeshare program became more popular in 2012 relative to 2011! We see our maximum count is located in 2012.


Note the gaps between the months: they are the days after the 19th of each month, the entries we aim to predict with a model built from the first 19 days.



Next we have the percentage of registered users from the count. We see drops in the summer and fall, which could be attributed to tourists who are only visiting Washington D.C. to see the sights and have no need to register with Capital Bikeshare. 


From our summary of "percentage" in the code, we see that the median percentage of registered users is about 85.5%. 



So while there are casual users of Capital Bikeshare, the majority of users are registered. Also, the number of occasions when casual users outnumbered registered users decreased from 2011 to 2012.
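That majority-casual claim could be checked by counting, per year, the hours where casual riders exceed registered riders. A sketch on toy data (the values are invented; on the real set, the year would come from `format(train$datetime, "%Y")`):

```r
# Toy rows standing in for hourly observations with a year label
toy <- data.frame(year       = c("2011", "2011", "2012"),
                  casual     = c(30, 5, 10),
                  registered = c(10, 50, 40))
# Number of majority-casual hours per year
tapply(toy$casual > toy$registered, toy$year, sum)
```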


OK folks, in this R post we explored the Capital Bikeshare data from Kaggle to prepare to predict bike share demand from various weather and day-type variables. So stay tuned for Part 2, where we start using regression to examine how well each covariate predicts the bike count.


Thanks for reading,

Wayne

@beyondvalence
LinkedIn