Wednesday, April 25, 2012

Big Data, R and HANA: Analyze 200 Million Data Points and Later Visualize in HTML5 Using D3 - Part II


In my last blog, Big Data, R and SAP HANA: Analyze 200 Million Data Points and Later Visualize Using Google Maps, I analyzed historical airlines performance data set using R and SAP HANA and put the aggregated analysis on Google Maps.  Undoubtedly, Map is a pretty exciting canvas to view and analyze big data sets. One could draw shapes (circles, polygons) on the map under a marker pin, providing pin-point information and display aggregated information in the info-window when a marker is clicked.  So I enjoyed doing all of that, but I was craving for some old fashion bubble charts and other types of charts to provide comparative information on big data sets.  Ultimately, all big data sets get aggregated into smaller analytical sets for viewing, sharing and reporting.  An old fashioned chart is the best way to tell a visual story!

On bubble charts, one could display four dimensional data for comparative analysis. In this blog analysis, I used the same data-set which had 200M data points and went deeper looking at finer slices of information.  I leveraged D3, R and SAP HANA for this blog post.  Here I am publishing some of this work:  

In this first graphics, the performance of top airlines is compared for 2008.  As expected, Southwest, the largest airlines (when using total number of flights as a proxy), performed well for its size (1.2M flights, 64 destinations but average delay was ~10 mins.)  Some of the other airlines like American and Continental were the worst performers along with Skywest.  Note, I didn't remove outliers from this analysis.  Click here to interact with this example (view source to get D3 code).


In the second analysis, I replaced airlines dimension with airports dimension but kept all the other dimensions the same.  To my disbelief, Newark airport is the worst performing airport when it comes to departure delays.  Chicago O'Hare, SFO and JFK follow.  Atlanta airport is the largest airport but it has the best performance. What are they doing differently at ATL?  Click here to interact with this example (view source to get D3 code).


It was hell of a fun playing with D3, R and HANA, good intellectual stimulation if nothing else!  Happy Analyzing and remember possibilities are endless!

As always, my R modules are fairly simple and straightforward:
###########################################################################################  
#ETL - Read the AIRPORT Information, get major aiport informatoin extracted and upload this 
#transfromed dataset into HANA
###########################################################################################
major.airports <- data.table(read.csv("MajorAirports.csv",  header=TRUE, sep=",", stringsAsFactors=FALSE))
setkey(major.airports, iata)

all.airports <- data.table(read.csv("AllAirports.csv",  header=TRUE, sep=",", stringsAsFactors=FALSE)) 
setkey(all.airports, iata)

airports.2008.hp <- data.table(read.csv("2008.csv",  header=TRUE, sep=",", stringsAsFactors=FALSE)) 
setkey(airports.2008.hp, Origin, UniqueCarrier)

#Merge two datasets
airports.2008.hp <- major.airports[airports.2008.hp,]


###########################################################################################  
# Get airport statisitics for all airports
###########################################################################################
airports.2008.hp.summary <- airports.2008.hp[major.airports,     
    list(AvgDepDelay=round(mean(DepDelay, na.rm=TRUE), digits=2),
    TotalMiles=prettyNum(sum(Distance, na.rm=TRUE), big.mark=","),
    TotalFlights=length(Month),
    TotalDestinations=length(unique(Dest)),
    URL=paste("http://www.fly", Origin, ".com",sep="")), 
                    by=list(Origin)][order(-TotalFlights)]
setkey(airports.2008.hp.summary, Origin)
#merge two data tables
airports.2008.hp.summary <- major.airports[airports.2008.hp.summary, 
                                                     list(Airport=airport, 
                                                          AvgDepDelay, TotalMiles, TotalFlights, TotalDestinations, 
                                                          Address=paste(airport, city, state, sep=", "), 
                                                          Lat=lat, Lng=long, URL)][order(-TotalFlights)]

airports.2008.hp.summary.json <- getRowWiseJson(airports.2008.hp.summary)
writeLines(airports.2008.hp.summary.json, "airports.2008.hp.summary.json")                 
write.csv(airports.2008.hp.summary, "airports.2008.hp.summary.csv", row.names=FALSE)

9 comments:

  1. What other R packages you use aside from Rjson?

    I think Rjson package is the secret weapon to R and D3 for visualization.

    http://cran.r-project.org/web/packages/rjson/index.html

    Thanks.

    Noli

    ReplyDelete
  2. In one of your blog, you are using D3 with Sencha Touch for iPad for visualization of SAP Hana. Is SAP Hana live i.e. on-line in this case when you do the querying?

    However, you have 2 D3 examples using .csv and .json files from SAP Hana data mining results using R. I guess they are also hosted in AWS as well probably outside the SAP Hana server, right?

    They reason, I am asking these questions are, I am interested in Hybrid apps using SAP Hana (e.g. results from data mining and or live queries from SAP Hana). I have created a Hybrid app for iOS using Cartodb (Cartodb.com) i.e. GIS PostgreSQL database / PostGIS in the cloud.

    Here are the screenshots of the cartodb

    https://picasaweb.google.com/116847891529748214201/LeafletCartoDBProtectedPlanetIPhone

    Here are some of screenshots for D3 and g.Raphael which I intended for SAP Hana mobile hybrid apps using PhoneGap.

    https://picasaweb.google.com/116847891529748214201
    /D3AndGRaphaelForSAPHanaHybridMobile

    I am interested to see some example with D3, SAP Hana and PhoneGap. I think you can easily package your example (e.g. Sencha Touch with D3 json data). If only I can get the exact URL of D3.json data, I think can compile your example in iOS phonegap.

    Hope to see some replies.

    Thanks.

    Noli

    ReplyDelete
  3. Have you seen the relatively new fread function in the data.table package? It is easier to use than read.csv and loaded the 2008.csv data set about 6 times faster.

    > system.time(airports.2008.hp <- fread("./2008.csv"))
    user system elapsed
    17.44 0.29 17.81
    > system.time(airports.2008.hp <- data.table(read.csv("2008.csv", header=TRUE, sep=",", stringsAsFactors=FALSE)))
    user system elapsed
    99.72 3.29 103.87

    ReplyDelete
  4. thanks for sharing important information on SAP HANA

    ReplyDelete
  5. The information which you have provided is very good and easily understood.
    It is very useful who is looking for sap hana Online Training.

    ReplyDelete
    Replies
    1. Thank you Sudheer, I am glad this blog helped...

      Delete
  6. The Information you provided is very much useful for SAP HANA Learners. This Information was very Intersting, We also provide SAP HANA Online training in India.

    ReplyDelete
  7. Appreciation for nice Update, I found something new and folks can get useful information about sap hana Online Training

    ReplyDelete
  8. Thank you provide valuable informations and iam seacrching same informations,and saved my time SAS Online Training

    ReplyDelete