All Things R: Simulating Map-Reduce in R for Big Data Analysis Using Flights Data

Friday, June 14, 2013

Simulating Map-Reduce in R for Big Data Analysis Using Flights Data

We are constantly crunching through large amounts of data and designing ways to process large datasets on a single node, turning to distributed computing only when single-node computing becomes too time-consuming or inefficient.

We are happy to share with the R community one such map-reduce-like approach that we designed in R for a single node to process flights data (available here), which has ~122 million records and occupies 12 GB of space when uncompressed. We used Matt Dowle's data.table package heavily to load and analyze large datasets.

It took us a few days to stabilize and optimize this approach, and we are very proud to share it and its source code with you. The full source code can be found and downloaded from datadolph.in's git repository.

Here is how we approached this problem: First, before loading the datasets in R, we compressed each of the 22 CSV files using gzip for faster reading in R. The read.csv function can read gzip-compressed files faster than it can read uncompressed files:

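# load list of all files (pattern is a regular expression, not a glob)
flights.files <- list.files(path=flights.folder.path, pattern="\\.csv\\.gz$")

# read one file (index i) into a data.table
flights <- data.table(read.csv(paste(flights.folder.path, flights.files[i], sep=""), stringsAsFactors=FALSE))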
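
As a minimal sketch of the compressed-read step, the snippet below writes a tiny gzip-compressed CSV and reads it back with read.csv. The toy data frame and the temporary file name are made up for illustration and are not part of the flights data; it only shows that read.csv accepts a .csv.gz path directly, and makes no claim about speed.

```r
# Toy data standing in for one year's flights file (hypothetical values)
toy <- data.frame(year = c(1987L, 1987L),
                  uniquecarrier = c("AA", "UA"),
                  depdelay = c(12L, -3L),
                  stringsAsFactors = FALSE)

# Write a gzip-compressed CSV through a gzfile connection
gz.path <- file.path(tempdir(), "1987.csv.gz")
con <- gzfile(gz.path, "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv is given the compressed path and decompresses on the fly
toy.read <- read.csv(gz.path, stringsAsFactors = FALSE)
print(toy.read)
```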

Next, we mapped the analysis we wanted to run to extract insights from each of the datasets. This step computed flight-level, airline-level and airport-level aggregates and saved them as intermediate results. Here is example code to get stats for each airline by year:

getFlightsStatusByAirlines <- function(flights, yr){
  # by Year
  if(verbose) cat("Getting stats for airlines:", '\n')
  airlines.stats <- flights[, list(
    dep_airports=length(unique(origin)),
    flights=length(origin),
    flights_cancelled=sum(cancelled, na.rm=TRUE),
    flights_diverted=sum(diverted, na.rm=TRUE),
    flights_departed_late=length(which(depdelay > 0)),
    flights_arrived_late=length(which(arrdelay > 0)),
    total_dep_delay_in_mins=sum(depdelay[which(depdelay > 0)]),
    avg_dep_delay_in_mins=round(mean(depdelay[which(depdelay > 0)])),
    median_dep_delay_in_mins=round(median(depdelay[which(depdelay > 0)])),
    miles_traveled=sum(distance, na.rm=TRUE)
  ), by=uniquecarrier][, year:=yr]
  # change col order so that year comes first
  setcolorder(airlines.stats, c("year", colnames(airlines.stats)[-ncol(airlines.stats)]))
  # save this data
  saveData(airlines.stats, paste(flights.folder.path, "stats/5/airlines_stats_", yr, ".csv", sep=""))
  # clear up space
  rm(airlines.stats)
  # continue.. see git full code
}
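
To make the grouping pattern concrete, here is a small sketch of the same `[, list(...), by=...]` idiom on a five-row toy table. The values are invented, only a few of the columns from the function above are reproduced, and it assumes the data.table package is installed. Note that which() drops NA comparisons, which is why the delay sum needs no na.rm.

```r
library(data.table)

# Made-up flights: three for carrier AA, two for UA
toy.flights <- data.table(
  uniquecarrier = c("AA", "AA", "AA", "UA", "UA"),
  origin        = c("JFK", "LAX", "JFK", "SFO", "SFO"),
  depdelay      = c(10, -5, NA, 30, 0),
  cancelled     = c(0, 0, 1, 0, 0)
)

# One row of summary statistics per carrier
toy.stats <- toy.flights[, list(
  dep_airports            = length(unique(origin)),
  flights                 = length(origin),
  flights_cancelled       = sum(cancelled, na.rm = TRUE),
  flights_departed_late   = length(which(depdelay > 0)),
  total_dep_delay_in_mins = sum(depdelay[which(depdelay > 0)])
), by = uniquecarrier]

print(toy.stats)
```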

Here is a copy of the map function:

#map all calculations
mapFlightStats <- function(){
  # one iteration per yearly file; period is the number of files to process
  for(j in 1:period) {
    # extract the year from the file name
    yr <- as.integer(gsub("[^0-9]", "", gsub("(.*)(\\.csv)", "\\1", flights.files[j])))
    flights.data.file <- paste(flights.folder.path, flights.files[j], sep="")
    if(verbose) cat(yr, ": Reading : ", flights.data.file, "\n")
    flights <- data.table(read.csv(flights.data.file, stringsAsFactors=FALSE))
    # key the table on the columns used for grouping
    setkeyv(flights, c("year", "uniquecarrier", "dest", "origin", "month"))
    # call functions
    getFlightStatsForYear(flights, yr)
    getFlightsStatusByAirlines(flights, yr)
    getFlightsStatsByAirport(flights, yr)
  }
}
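
The nested gsub call that derives the year can be tried in isolation. The sketch below wraps it in a helper named extractYear, which is a hypothetical name introduced here, and applies it to example file names that assume a <year>.csv.gz naming convention.

```r
# Hypothetical file names following a <year>.csv.gz convention (an assumption)
example.files <- c("1987.csv.gz", "2008.csv.gz")

extractYear <- function(file.name) {
  # Inner gsub drops the ".csv" part: "1987.csv.gz" becomes "1987.gz"
  no.csv <- gsub("(.*)(\\.csv)", "\\1", file.name)
  # Outer gsub removes every non-digit character, leaving "1987"
  as.integer(gsub("[^0-9]", "", no.csv))
}

years <- extractYear(example.files)
print(years)
```

Because every non-digit character is stripped, any other digits in a file name would be concatenated into the result, so this relies on the year being the only number in the name.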

As one can see, we generate intermediate results by airline (and by airport and flight) for each year and store them on disk. The map function took less than 2 hours to run on a MacBook Pro with a 2.3 GHz dual-core processor and 8 GB of memory, and generated 132 intermediate datasets containing aggregated analysis.

Finally, we call the reduce function to aggregate the intermediate datasets into the final output (for flights, airlines and airports):

#reduce all results
reduceFlightStats <- function(){
  # one folder of intermediate results per analysis
  n <- 1:6
  folder.path <- paste("./raw-data/flights/stats/", n, "/", sep="")
  print(folder.path)
  for(i in n){
    # pattern is a regular expression, not a glob
    filenames <- paste(folder.path[i], list.files(path=folder.path[i], pattern="\\.csv$"), sep="")
    # read every intermediate file and stack the rows
    dt <- do.call("rbind", lapply(filenames, read.csv, stringsAsFactors=FALSE))
    print(nrow(dt))
    saveData(dt, paste("./raw-data/flights/stats/", i, ".csv", sep=""))
  }
}
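
The reduce step can be sketched end to end on two made-up intermediate files. The post does not show saveData, so saveDataSketch below is a hypothetical stand-in that simply wraps write.csv; the real helper in the repository may differ. The folder name and the values are also invented for illustration.

```r
# Hypothetical stand-in for saveData(); the real helper may differ
saveDataSketch <- function(data, file.name) {
  write.csv(data, file.name, row.names = FALSE)
}

# Two made-up intermediate results, one per year, in a temporary folder
stats.dir <- file.path(tempdir(), "stats_sketch")
dir.create(stats.dir, showWarnings = FALSE)
saveDataSketch(data.frame(year = 1987L, uniquecarrier = c("AA", "UA"), flights = c(10L, 20L)),
               file.path(stats.dir, "airlines_stats_1987.csv"))
saveDataSketch(data.frame(year = 1988L, uniquecarrier = c("AA", "UA"), flights = c(15L, 25L)),
               file.path(stats.dir, "airlines_stats_1988.csv"))

# Reduce: read every intermediate file and stack the rows into one table
filenames <- list.files(path = stats.dir, pattern = "\\.csv$", full.names = TRUE)
combined <- do.call("rbind", lapply(filenames, read.csv, stringsAsFactors = FALSE))
print(combined)
```

With many intermediate files, data.table's rbindlist is a possible alternative to do.call("rbind", ...), though the code above keeps to the base R call used in the post.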

Comments:

Daewoo Choi (최대우), July 20, 2013 at 4:04 PM
Is saveData your function?
