Wednesday, May 23, 2012

If You are a R Developer, Then You Must Try SAP HANA for Free.



This is a guest blog from 
Alvaro Tejada Galindo, my colleague and fellow R and SAP HANA enthusiast.  I am thankful to Alvaro for coming and posting on "AllThingsR".

Are you an R developers? Have ever heard of SAP HANA? Would you like to test SAP HANA for free?

SAP HANA is an In-Memory Database Technology allowing developers to analyze big data in real-time.

Processes that took hours now take seconds due to SAP HANA's power to keep everything on RAM memory.

As announced in SAP Sapphire Now event in Orlando, Florida, SAP HANA is free for developers. You just need to download and install both the SAP HANA Client and the SAP HANA Studio, and create an SAP HANA Server on the Amazon Web Services as described in the following document:
Get your own SAP HANA DB server on Amazon Web Services - http://scn.sap.com/docs/DOC-28294

Why should this interest you? Easy...SAP HANA is an agent of change bringing speed to its limits and it can also be integrated with R as described in the following blog:

Want to know more about SAP HANA? Read everything you need here: http://developers.sap.com

You're convinced but don't want to pay for the Amazon Web Services? No problem. Just leave a comment including your name, company and email. We will reach you and send you an Amazon Gift Card so you can get started. Of course, your feedback would be greatly appreciated. Of course, we only a limited set of gift cards, so be quick or be out.

Author Alvaro Tejada Galindo, mostly known as "Blag" is a Development Expert working for the Technology Innovation and Developer Experience team in SAP Labs.  He can be contacted at a.tejada.galindo@sap.com.

Alvaro's background in his own words: I used to be an ABAP Consultant for 11 years. I worked in implementations on Peru and Canada. I’m also a die hard developer using R, Python, Ruby, PHP, Flex and many more languages. Now, I work for SAP Labs and my main roles are evangelize SAP technologies by writing blogs, articles, helping people on the forums, attending SAP events, besides many other “Developer engagement” activities.
I maintain a blog called “Blag’s bag of rants” at blagrants.blogspot.com

Wednesday, May 2, 2012

Big Data, R and SAP HANA: Analyze 200 Million Data Points and Later Visualize in HTML5 Using D3 - Part III


Mash-up Airlines Performance Data with Historical Weather Data to Pinpoint Weather Related Delays

For this exercise, I combined following four separate blogs that I did on BigData, R and SAP HANA.  Historical airlines and weather data were used for the underlying analysis. The aggregated output of this analysis was outputted in JSON which was visualized in HTML5, D3 and Google Maps.  The previous blogs on this series are:
  1. Big Data, R and SAP HANA: Analyze 200 Million Data Points and Later Visualize in HTML5 Using D3 - Part II
  2. Big Data, R and HANA: Analyze 200 Million Data Points and Later Visualize Using Google Maps
  3. Getting Historical Weather Data in R and SAP HANA 
  4. Tracking SFO Airport's Performance Using R, HANA and D3
In this blog, I wanted to mash-up disparate data sources in R and HANA by combining airlines data with weather data to understand the reasons behind the airport/airlines delay.  Why weather - because weather is one of the commonly cited reasons in the airlines industry for flight delays.  Fortunately, the airlines data breaks up the delay by weather, security, late aircraft etc., so weather related delays can be isolated and then the actual weather data can be mashed-up to validate the airlines' claims.  However, I will not be doing this here, I will just be displaying the mashed-up data.

I have intentionally focused on the three bay-area airports and have used last 4 years of historical data to visualize the airport's performance using a HTML5 calendar built from scratch using D3.js.  One can use all 20 years of data and for all the airports to extend this example.  I had downloaded historical weather data for the same 2005-2008 period for SFO and SJC airports as shown in my previous blog (For some strange reasons, there is no weather data for OAK, huh?).  Here is how the final result will look like in HTML5:



Click here to interact with the live example.  Hover over any cell in the live example and a tool tip with comprehensive analytics will show the break down of the performance delay for the selected cell including weather data and correct icons* - result of a mash-up.  Choose a different airport from the drop-down to change the performance calendar. 
* Weather icons are properties of Weather Underground.

As anticipated, SFO airport had more red on the calendar than SJC and OAK.  SJC definitely is the best performing airport in the bay-area.  Contrary to my expectation, weather didn't cause as much havoc on SFO as one would expect, strange?

Creating a mash-up in R for these two data-sets was super easy and a CSV output was produced to work with HTML5/D3.  Here is the R code and if it not clear from all my previous blogs: I just love data.table package.


###########################################################################################  

# Percent delayed flights from three bay area airports, a break up of the flights delay by various reasons, mash-up with weather data

###########################################################################################  

baa.hp.daily.flights <- baa.hp[,list( TotalFlights=length(DepDelay), CancelledFlights=sum(Cancelled, na.rm=TRUE)), 

                             by=list(Year, Month, DayofMonth, Origin)]
setkey(baa.hp.daily.flights,Year, Month, DayofMonth, Origin)

baa.hp.daily.flights.delayed <- baa.hp[DepDelay>15,
                                     list(DelayedFlights=length(DepDelay), 
                                      WeatherDelayed=length(WeatherDelay[WeatherDelay>0]),
                                      AvgDelayMins=round(sum(DepDelay, na.rm=TRUE)/length(DepDelay), digits=2),
                                      CarrierCaused=round(sum(CarrierDelay, na.rm=TRUE)/sum(DepDelay, na.rm=TRUE), digits=2),
                                      WeatherCaused=round(sum(WeatherDelay, na.rm=TRUE)/sum(DepDelay, na.rm=TRUE), digits=2),
                                      NASCaused=round(sum(NASDelay, na.rm=TRUE)/sum(DepDelay, na.rm=TRUE), digits=2),
                                      SecurityCaused=round(sum(SecurityDelay, na.rm=TRUE)/sum(DepDelay, na.rm=TRUE), digits=2),
                                      LateAircraftCaused=round(sum(LateAircraftDelay, na.rm=TRUE)/sum(DepDelay, na.rm=TRUE), digits=2)), by=list(Year, Month, DayofMonth, Origin)]
setkey(baa.hp.daily.flights.delayed, Year, Month, DayofMonth, Origin)

# Merge two data-tables
baa.hp.daily.flights.summary <- baa.hp.daily.flights.delayed[baa.hp.daily.flights,list(Airport=Origin,
                           TotalFlights, CancelledFlights, DelayedFlights, WeatherDelayed, 
                           PercentDelayedFlights=round(DelayedFlights/(TotalFlights-CancelledFlights), digits=2),
                           AvgDelayMins, CarrierCaused, WeatherCaused, NASCaused, SecurityCaused, LateAircraftCaused)]
setkey(baa.hp.daily.flights.summary, Year, Month, DayofMonth, Airport)

# Merge with weather data
baa.hp.daily.flights.summary.weather <-baa.weather[baa.hp.daily.flights.summary]
baa.hp.daily.flights.summary.weather$Date <- as.Date(paste(baa.hp.daily.flights.summary.weather$Year, 
                                                           baa.hp.daily.flights.summary.weather$Month, 
                                                           baa.hp.daily.flights.summary.weather$DayofMonth, 
                                                           sep="-"),"%Y-%m-%d")
# remove few columns
baa.hp.daily.flights.summary.weather <- baa.hp.daily.flights.summary.weather[, 
            which(!(colnames(baa.hp.daily.flights.summary.weather) %in% c("Year", "Month", "DayofMonth", "Origin"))), with=FALSE]

#Write the output in both JSON and CSV file formats
objs <- baa.hp.daily.flights.summary.weather[, getRowWiseJson(.SD), by=list(Airport)]
# You have now (Airportcode, JSONString), Once again, you need to attach them together.
row.json <- apply(objs, 1, function(x) paste('{\"AirportCode\":"', x[1], '","Data\":', x[2], '}', sep=""))
json.st <- paste('[', paste(row.json, collapse=', '), ']')
writeLines(json.st, "baa-2005-2008.summary.json")                 
write.csv(baa.hp.daily.flights.summary.weather, "baa-2005-2008.summary.csv", row.names=FALSE)


Happy Coding!