Pages

Saturday, August 9, 2014

Visualizing CDC's Morbidity and Mortality Weekly Report (MMWR) on Infrequently Reported Diseases


Hello Readers,

Here we will download, organize, and visualize disease data the Morbidity and Mortality Weekly Report (MMWR) published by the Centers for Disease Control and Prevention (CDC). Recent news from Texas and other states include a cyclosporiasis outbreak originating from alleged imported Mexican produce. In June of 2013, there were 631 people sickened by a cyclosporiasis outbreak, so this year's outbreak is not new.


Cyclospora cayetanensis Parasite
This post acknowledges and thanks Aaron Kite-Powell of the Armed Forces Health Surveillance Center with collaboration in writing the R code. Check them out if you're interested in disease surveillance!


Notifiable Diseases and Mortality Tables


Looking at the CDC link above for notifiable diseases in Table I, we see over 55 diseases- some with no reported cases for the week (anthrax) and some with more than a few (measles). The diseases are deemed infrequent from having less than 1000 cases reported in the past year from the National Notifiable Diseases Surveillance System (NNDSS). Clicking the export data button brings us to the data.cdc.gov page where we can see the cases reported for each week of this year. We will download this data in a CSV file and plot the cases by disease in R. Selecting the Export button, we have several choices of download formats, and we can manually download the CSV file, or we can copy the link and use R to retrieve the CSV file. The R code will demonstrate both methods later.


Figure 1. Infrequently Reported Diseases Data


Accessing the Data in R


Now that we have the CSV file downloaded, and the link address copied, we can attempt both methods to load the data into R.

Manual CSV Download
First, we load all the libraries we will use ("plyr", "ggplot2", "RCurl"), and load the file location. Then using "read.csv()" we read in the CSV file, making sure that the strings are not factors with "stringsAsFactors=F". Note that we have 20 columns and 1650 rows of disease data in various weeks.

CSV Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
> ### CDC MMWR Scraping/Formatting
> ### https://data.cdc.gov/NNDSS/NNDSS-Table-I-infrequently-reported-notifiable-dis/wcwi-x3uk
> setwd("~/YOURDIRECTORY")
> # load libraries
> library(plyr)
> library(ggplot2)
> library(RCurl)
> 
> # CSV read-in method
> file <- "NNDSS_-_Table_I._infrequently_reported_notifiable_diseases.csv"
> nndss <- read.csv(file, stringsAsFactors=F)
> dim(nndss) # 1650 rows by 20 columns
[1] 1650   20
> names(nndss) # column names
 [1] "Disease"                                         
 [2] "MMWR.year"                                       
 [3] "MMWR.week"                                       
 [4] "Current.week"                                    
 [5] "Current.week..flag"                              
 [6] "Cum.2014"                                        
 [7] "Cum.2014..flag"                                  
 [8] "X5.year.weekly.averageâ.."                       
 [9] "X5.year.weekly.averageâ....flag"                 
[10] "Total.cases.reported..2013"                      
[11] "Total.cases.reported..2013..flag"                
[12] "Total.cases.reported.2012"                       
[13] "Total.cases.reported.2012..flag"                 
[14] "Total.cases.reported.2011"                       
[15] "Total.cases.reported.2011..flag"                 
[16] "Total.cases.reported.2010"                       
[17] "Total.cases.reported.2010..flag"                 
[18] "Total.cases.reported.2009"                       
[19] "Total.cases.reported.2009..flag"                 
[20] "States.reporting.cases.during.current.week..No.."
> 


R URL Download
Next we will use R and "download.file()" to access our target CSV file. After specifying the file name ("reportedDiseases.csv") and the method, ("method="curl""), we can read in the CSV with "read.csv()". Looking at the dimensions, we see 20 columns, but 1705 rows. This is because the URL refers to the most recent data from the CDC, whereas the CSV file we downloaded manually came before another week of data was added. Therefore, using the data retrieved from the URL method is more optimal since it has more recent data (1650 old rows + 55 new week= 1705). So the current week is 1705/55 = week 31.

URL Method Code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
> # URL Method
> url <- "https://data.cdc.gov/api/views/wcwi-x3uk/rows.csv?accessType=DOWNLOAD"
> download.file(url, destfile="reportedDiseases.csv", method="curl")
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  125k    0  125k    0     0  74997      0 --:--:--  0:00:01 --:--:-- 75658
> nndss.1 <- read.csv("reportedDiseases.csv", strip.white=T, stringsAsFactors=F)
> dim(nndss.1)
[1] 1705   20
> names(nndss.1)
 [1] "Disease"                                         
 [2] "MMWR.year"                                       
 [3] "MMWR.week"                                       
 [4] "Current.week"                                    
 [5] "Current.week..flag"                              
 [6] "Cum.2014"                                        
 [7] "Cum.2014..flag"                                  
 [8] "X5.year.weekly.averageâ.."                       
 [9] "X5.year.weekly.averageâ....flag"                 
[10] "Total.cases.reported..2013"                      
[11] "Total.cases.reported..2013..flag"                
[12] "Total.cases.reported.2012"                       
[13] "Total.cases.reported.2012..flag"                 
[14] "Total.cases.reported.2011"                       
[15] "Total.cases.reported.2011..flag"                 
[16] "Total.cases.reported.2010"                       
[17] "Total.cases.reported.2010..flag"                 
[18] "Total.cases.reported.2009"                       
[19] "Total.cases.reported.2009..flag"                 
[20] "States.reporting.cases.during.current.week..No.."
> nndss.1$Disease <- factor(nndss.1$Disease)
> 

Also remember to transform the "$Disease" column into a factor, because we have 55 diseases, not 1705.


Infrequently Reported Diseases


Now that we have our data, the data are all separated and organized by "$MMWR.week", but we want each diseased grouped together, progressing by each week so we can evaluate the cases. This is where we use "ddply()" to wrestle our data and transform it to our needs. First specify the data set, "nndss.1", then the variables by which we will organize ".(Disease, MMWR.week)", and our value that we pull is the "Current.week" counts. So for week 1, it will pull the week 1 cases, and push it to the output, and so on for each week. Again, we transform "$MMWR.week" into a factor type variable, since it takes on discrete values from 1 to 31.

Data Re-Arrangement Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
> # combine measure by disease type
> # for each week
> d <- ddply(nndss.1, .(Disease, MMWR.week), summarize,
+            count = Current.week)
> d$MMWR.week <- factor(d$MMWR.week)
>
> summary(d)
                                                               Disease    
 Anthrax                                                           :  31  
 Arboviral diseases, California serogroup virus disease§,¶       :  31  
 Arboviral diseases, Eastern equine encephalitis virus disease§,¶:  31  
 Arboviral diseases, Powassan virus disease§,¶                   :  31  
 Arboviral diseases, St. Louis encephalitis virus disease§,¶     :  31  
 Arboviral diseases, Western equine encephalitis virus disease§,¶:  31  
 (Other)                                                           :1519  
   MMWR.week        count       
 1      :  55   Min.   : 1.000  
 2      :  55   1st Qu.: 1.000  
 3      :  55   Median : 2.000  
 4      :  55   Mean   : 3.787  
 5      :  55   3rd Qu.: 4.000  
 6      :  55   Max.   :87.000  
 (Other):1375   NA's   :1255   
> 
> # rename levels ####
> levels(d$Disease)[3] <- "Arbo,EEE"
> levels(d$Disease)[2] <- "Arbo,CA serogroup"
> levels(d$Disease)[4] <- "Arbo,Powassan"
> levels(d$Disease)[5] <- "Arbo,St Louis"
> levels(d$Disease)[6] <- "Arbo,WEE"
> levels(d$Disease)[9] <- "Botulism other"
> levels(d$Disease)[14] <- "Cyclosporiasis"
> levels(d$Disease)[16] <- "H flu <5 non-b"
> levels(d$Disease)[17] <- "H flu <5 b"
> levels(d$Disease)[18] <- "H flu <5 unknown"
> levels(d$Disease)[19] <- "Hansen disease"
> levels(d$Disease)[20] <- "Hantavirus PS"
> levels(d$Disease)[21] <- "HUS,postdiarrheal"
> levels(d$Disease)[22] <- "HBV,perinatal"
> levels(d$Disease)[23] <- "Influenza ped mort"
> levels(d$Disease)[26] <- "Measles"
> levels(d$Disease)[27] <- "Mening a,c,y,w-135"
> levels(d$Disease)[28] <- "Mening other"
> levels(d$Disease)[29] <- "Mening serogroup b"
> levels(d$Disease)[30] <- "Mening unknown"
> levels(d$Disease)[31] <- "Novel influenza A"
> levels(d$Disease)[33] <- "Polio nonparalytic"
> levels(d$Disease)[35] <- "Psittacosis"
> levels(d$Disease)[38] <- "Q fever, total"
> levels(d$Disease)[40] <- "Rubella"
> levels(d$Disease)[42] <- "SARS-CoV"
> levels(d$Disease)[43] <- "Smallpox"
> levels(d$Disease)[44] <- "Strep toxic shock synd"
> levels(d$Disease)[45] <- "Syphilis congenital <1yr"
> levels(d$Disease)[47] <- "Toxic shock synd staph"
> levels(d$Disease)[51] <- "Vanco Interm Staph A"
> levels(d$Disease)[52] <- "Vanco Resist Staph A"
> levels(d$Disease)[53] <- "Vibrio non-cholera"
> levels(d$Disease)[54] <- "Viral hemorr fever"
> # levels finish
> 

Afterwards, we see the names of the diseases, and realize that they are a mess, format-wise. So we rename each level in the "$Disease" factor variable as they need renaming. Thanks Aaron for making the disease names more readable!



Plotting Disease Cases


The next step in R creates the visualizations necessary to understand and see the temporal aspect of the reported disease cases and possible outbreaks. Utilizing the layers in "ggplot()", we place the "MMWR.week" on the x-axis and the "count" on the y-axis. Each plot will be a histogram through "geom_histogram()", and we plot all of the 55 diseases in the same plot with "facet_wrap()". 55 plots is divided perfectly into 11 rows of 5 columns. The "theme()" designates no grid lines, and "scale_x_discrete()" provides the break points and labels on the week x-axis.

Plot Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
> # plotting diseases
> # use d dataset
> plot <- ggplot(d, aes(x=MMWR.week, y=count)) +
+   geom_histogram(stat="identity") +
+   ggtitle("MMWR Infrequently Reported Diseases") +
+   facet_wrap(~Disease, nrow=11, ncol=5, scales="free") +
+   theme(panel.grid.minor=element_blank(), panel.grid.major=element_blank()) +
+   scale_x_discrete(breaks=c("1", "10", "20", "30"), labels=c("1","10","20","30"))
> 
> # write to file
> png(file="MMWR Infreq Diseases.png", bg="white",
+ width=850, height=850)
> # plot plots in png file
> plot
> # turns off png device
> dev.off()
RStudioGD 
        2 
> 

With "png()" we can write the plot directly to a .png file, while specifying the dimensions. Call the plot name "plot" while the png device is active and it will write with that device to the designated destination. "dev.off()" will turn the device off, returning it to the original RStudio plot device. Other formats can be used, such as "jpg()" or "bmp()". "png()" was chosen because it preserves the image quality.



Figure 2. Plot of Infrequently Reported Diseases

We can see the cyclosporiasis plot, in the third row, second from the right. Note the steep right in cases in recent weeks- owed to outbreak in Texas and other states mentioned at the beginning of the post. 


Visualizing the Cyclosporiasis Outbreak

By isolating the cyclosporiasis plot, we gain better perspective in observing the outbreak in reported cases. Those infected begin to develop symptoms of diarrhea, fever, and nausea, from two days to two weeks after infection, lasting from a few weeks to a two months. The vector is through contaminated feces, and thought to be present on some imported produce.

Figure 3. Cyclosporiasis Reported Cases
As we can see, the reported cases rise dramatically in the month of July, and hit a high of 58 cases in early August. As of the NNDSS report, there have been 221 cases of cyclosporiasis, however, not all of them are due to this specific outbreak- but the majority of them are.
Plotting Cyclosporiasis Code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
> levels(d$Disease)
 [1] "Anthrax"                  "Arbo,CA serogroup"        "Arbo,EEE"                
 [4] "Arbo,Powassan"            "Arbo,St Louis"            "Arbo,WEE"                
 [7] "Botulism, foodborne"      "Botulism, infant"         "Botulism other"          
[10] "Botulism, total"          "Brucellosis"              "Chancroid"               
[13] "Cholera"                  "Cyclosporiasis"           "Diphtheria"              
[16] "H flu <5 non-b"           "H flu <5 b"               "H flu <5 unknown"        
[19] "Hansen disease"           "Hantavirus PS"            "HUS,postdiarrheal"       
[22] "HBV,perinatal"            "Influenza ped mort"       "Leptospirosis"           
[25] "Listeriosis"              "Measles"                  "Mening a,c,y,w-135"      
[28] "Mening other"             "Mening serogroup b"       "Mening unknown"          
[31] "Novel influenza A"        "Plague"                   "Polio nonparalytic"      
[34] "Poliomyelitis, paralytic" "Psittacosis"              "Q fever, acute"          
[37] "Q fever, chronic"         "Q fever, total"           "Rabies, human"           
[40] "Rubella"                  "Rubella†††"         "SARS-CoV"                
[43] "Smallpox"                 "Strep toxic shock synd"   "Syphilis congenital <1yr"
[46] "Tetanus"                  "Toxic shock synd staph"   "Trichinellosis"          
[49] "Tularemia"                "Typhoid fever"            "Vanco Interm Staph A"    
[52] "Vanco Resist Staph A"     "Vibrio non-cholera"       "Viral hemorr fever"      
[55] "Yellow fever"            
> 
> # plot cyclosporiasis
> cyclo <- ggplot(d[d$Disease==levels(d$Disease)[14],],
+                 aes(x=MMWR.week, y=count)) +
+   geom_histogram(stat="identity") +
+   ggtitle("Cyclosporiasis Reported Cases in 2014") +
+   scale_x_discrete(breaks=c(1,6,10,14,19,23,27,31), # first weeks of each month
+                    labels=month.name[1:8]) # month names
> 
> # write to file
> png(filename="cyclosporiasis_plot.png", bg="white",
+     width=550, height=550)
> cyclo
> dev.off()
RStudioGD 
        2 
> 


And there we have it folks! We gathered reported disease data from the CDC's MMWR and loaded into R. Then we cleaned, and transformed the data, and visualized the cases report by disease. Additionally, we focused on cyclosporiasis, and we were able to see the rise in reported cases coinciding with the imported produce outbreak in Texas and other states.


By looking at the health data, we can track near real-time and even predict future cases. Refer to the MMWR if you would like weekly reports on the latest outbreaks!


Thanks for reading,

Wayne
@beyondvalence
LinkedIn

2 comments:


  1. www.bolavita.fun situs Judi Online Deposit via Go Pay !

    Terbukti aman, dan sudah terpercaya, Minimal Deposit 50ribu ...

    Tersedia Pasaran Lengkap seperti SBOBET - MAXBET - CBET

    Informasi selengkapnya hubungi :
    WA : +62812-2222-995
    BBM : BOLAVITA

    Keluaran Togel Hari Ini terbaru 2019

    ReplyDelete
  2. This comment has been removed by the author.

    ReplyDelete