We are constantly crunching through large amounts of data, designing innovative ways to process large datasets on a single node, and turning to distributed computing only when single-node processing becomes too slow or inefficient.
We are happy to share with the R community one such map-reduce-like approach we designed in R to process, on a single node, the flights dataset (available here), which has ~122 million records and occupies 12 GB of space uncompressed. We relied heavily on Matthew Dowle's data.table package to load and analyze these large datasets.
It took us a few days to stabilize and optimize this approach, and we are proud to share it, along with the source code, with you. The full source code can be downloaded from datadolph.in's git repository.
Here is how we approached the problem. First, before loading the datasets into R, we compressed each of the 22 CSV files with gzip for faster reading in R: read.csv can read gzipped files directly, and it reads them faster than it reads the uncompressed files:
# load data.table (install.packages("data.table") if needed)
library(data.table)
# list all gzipped CSV files (the pattern argument is a regular expression)
flights.files <- list.files(path=flights.folder.path, pattern="\\.csv\\.gz$")
# read the i-th file into a data.table (the folder path is prepended in the map step below)
flights <- data.table(read.csv(flights.files[i], stringsAsFactors=F))
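To sanity-check the compressed-read claim on your own machine, a quick timing comparison along these lines works; the file names below are hypothetical:
# time a plain vs. gzipped read of the same year's data (illustrative file names)
system.time(read.csv("2008.csv", stringsAsFactors=F))
system.time(read.csv("2008.csv.gz", stringsAsFactors=F))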
Next, we mapped out the analyses we wanted to run against each of the datasets. These included flight-level, airline-level and airport-level aggregations, each producing intermediate results. Here is example code to get the stats for each airline by year:
getFlightsStatusByAirlines <- function(flights, yr){
  # by Year
  if(verbose) cat("Getting stats for airlines:", '\n')
  airlines.stats <- flights[, list(
      dep_airports=length(unique(origin)),
      flights=length(origin),
      flights_cancelled=sum(cancelled, na.rm=T),
      flights_diverted=sum(diverted, na.rm=T),
      flights_departed_late=length(which(depdelay > 0)),
      flights_arrived_late=length(which(arrdelay > 0)),
      total_dep_delay_in_mins=sum(depdelay[which(depdelay > 0)]),
      avg_dep_delay_in_mins=round(mean(depdelay[which(depdelay > 0)])),
      median_dep_delay_in_mins=round(median(depdelay[which(depdelay > 0)])),
      miles_traveled=sum(distance, na.rm=T)
    ), by=uniquecarrier][, year:=yr]
  # change col order: move year to the front
  setcolorder(airlines.stats, c("year", colnames(airlines.stats)[-ncol(airlines.stats)]))
  # save this intermediate result to disk
  saveData(airlines.stats, paste(flights.folder.path, "stats/5/airlines_stats_", yr, ".csv", sep=""))
  # clear up space
  rm(airlines.stats)
  # continued... see the git repository for the full code
}
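For a single year, a call would look like this (a hypothetical usage sketch; it assumes the file sits in the working directory and that column names have been lower-cased so the fields referenced above exist):
# hypothetical usage for one year of data
flights <- data.table(read.csv("2008.csv.gz", stringsAsFactors=F))
getFlightsStatusByAirlines(flights, 2008L)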
Here is a copy of the map function:
#map all calculations
mapFlightStats <- function(){
  for(j in 1:period) {
    # extract the year from the file name (e.g. "1987.csv.gz" -> 1987)
    yr <- as.integer(gsub("[^0-9]", "", gsub("(.*)(\\.csv)", "\\1", flights.files[j])))
    flights.data.file <- paste(flights.folder.path, flights.files[j], sep="")
    if(verbose) cat(yr, ": Reading : ", flights.data.file, "\n")
    flights <- data.table(read.csv(flights.data.file, stringsAsFactors=F))
    # set a key for fast grouping and lookups
    setkeyv(flights, c("year", "uniquecarrier", "dest", "origin", "month"))
    # call the aggregation functions, which write intermediate results to disk
    getFlightStatsForYear(flights, yr)
    getFlightsStatusByAirlines(flights, yr)
    getFlightsStatsByAirport(flights, yr)
  }
}
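Note that mapFlightStats() relies on a few globals (flights.folder.path, flights.files, period and verbose) that are defined elsewhere in the script; a minimal setup, with illustrative values of our own, would be:
library(data.table)
flights.folder.path <- "./raw-data/flights/"   # illustrative path
flights.files <- list.files(path=flights.folder.path, pattern="\\.csv\\.gz$")
period <- length(flights.files)                # 22 yearly files in our case
verbose <- TRUE
mapFlightStats()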
As one can see, we generate intermediate results by airline (and likewise by airport and by flight) for each year and store them on disk. The map function takes under two hours to run on a MacBook Pro with a 2.3 GHz dual-core processor and 8 GB of memory, and it generates 132 intermediate datasets of aggregated results.
And finally, we call the reduce function to aggregate the intermediate datasets into the final outputs (for flights, airlines and airports):
#reduce all results
reduceFlightStats <- function(){
  # the six stats folders hold the intermediate datasets, one folder per analysis type
  n <- 1:6
  folder.path <- paste("./raw-data/flights/stats/", n, "/", sep="")
  print(folder.path)
  for(i in n){
    # collect all intermediate CSVs in this folder and stack them into one data frame
    filenames <- paste(folder.path[i], list.files(path=folder.path[i], pattern="\\.csv$"), sep="")
    dt <- do.call("rbind", lapply(filenames, read.csv, stringsAsFactors=F))
    print(nrow(dt))
    saveData(dt, paste("./raw-data/flights/stats/", i, ".csv", sep=""))
  }
}
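One optional tweak we have not applied here: data.table's rbindlist() is typically much faster than do.call("rbind", ...) for stacking a list of data frames, so the binding line above could be swapped for:
# drop-in replacement for the do.call("rbind", ...) line above
dt <- rbindlist(lapply(filenames, read.csv, stringsAsFactors=F))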
You could look into using data.table::fread instead of read.csv. Discussion here http://lists.r-forge.r-project.org/pipermail/datatable-help/2013-May/001677.html
Also in the aggregation function, I believe [, year:=yr] should be replaced with setcolnames(airlines.stats, yr, year)
See data.table documentation for other speed-optimized methods.
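For reference, the swap the commenter suggests would look roughly like this (fread in the data.table versions of the time read plain, uncompressed CSVs, so the files would need to be gunzipped first; the file name is hypothetical):
library(data.table)
# fread returns a data.table directly, so the data.table(read.csv(...)) wrapper goes away
flights <- fread("2008.csv")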
Is saveData your function?
ReplyDeletesaveData is a simple CSV writer function which takes data.table as one of the arguments...