Monday, October 29, 2012

Pull Yahoo Finance Key-Statistics Instantaneously Using XML and XPath in R


The two-part blog post I published a day ago required key stats from Yahoo Finance for all the companies in the control group I created for my research.  I wanted all the key stats pulled, arranged in a data frame, and then presented side by side so I could form my opinions.

The quantmod package has a "getQuote" function which should return the desired metrics.  The number and names of the metrics can be controlled via "yahooQF" (see the script below).  Unfortunately, this method seems to be broken, as the resulting data frame had a large number of null values for some metrics.  Here is the script nonetheless, if one wishes to experiment:

#######################################################################
# Script to download key metrics for a set of stock tickers using the quantmod package
#######################################################################
require(quantmod)
require("plyr")
what_metrics <- yahooQF(c("Price/Sales", 
                          "P/E Ratio",
                          "Price/EPS Estimate Next Year",
                          "PEG Ratio",
                          "Dividend Yield", 
                          "Market Capitalization"))

tickers <- c("AAPL", "FB", "GOOG", "HPQ", "IBM", "MSFT", "ORCL", "SAP")
# Not all the metrics are returned by Yahoo.
metrics <- getQuote(paste(tickers, sep="", collapse=";"), what=what_metrics)

#Add tickers as the first column and drop getQuote's first column, which holds the trade time stamp
metrics <- data.frame(Symbol=tickers, metrics[,2:length(metrics)]) 

#Change colnames
colnames(metrics) <- c("Symbol", "Revenue Multiple", "Earnings Multiple", 
                       "Earnings Multiple (Forward)", "Price-to-Earnings-Growth", "Div Yield", "Market Cap")

#Persist this to the csv file
write.csv(metrics, "FinancialMetrics.csv", row.names=FALSE)

#######################################################################
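
To see which metrics actually came back empty, a quick per-column NA count on the resulting data frame makes the problem obvious (a minimal check, assuming the script above has already run and "metrics" is still in the workspace):

 sapply(metrics, function(x) sum(is.na(x)))

Columns with a count close to the number of tickers are the ones getQuote failed to populate.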

After some digging around and staring at the raw HTML of Yahoo's Key Statistics page for some time, I decided to use the XML package and XPath expressions to get all the nodes that host the key stats (names and values).  This turned out to be a lot simpler.  Let's walk through it in three easy steps:

1) The CSS class name for the HTML nodes that host the name of each metric, such as Market Cap or Enterprise Value, is "yfnc_tablehead1".  This made it quite easy to grab all the elements in the HTML tree with this class name:
 nodes <- getNodeSet(html_text, "/*//td[@class='yfnc_tablehead1']")

2) Now all I needed to do was call the xmlValue function on each of these nodes to get the name of the metric (Enterprise Value, for example):
 measures <- sapply(nodes, xmlValue)

3) Next, to get the value of each metric, I used the getSibling function to reach the adjacent node (i.e. the sibling) and again applied xmlValue to extract its text:
values <- sapply(nodes, function(x)  xmlValue(getSibling(x)))
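
Putting the three steps together, here is a minimal sketch for a single ticker (an illustration only, assuming the XML package is loaded and Yahoo's key-stats page still uses the yfnc_tablehead1 class); it returns the raw metric names and values as a named character vector:

 library(XML)
 html_text <- htmlParse("http://finance.yahoo.com/q/ks?s=AAPL", encoding="UTF-8")
 nodes <- getNodeSet(html_text, "/*//td[@class='yfnc_tablehead1']")
 measures <- sapply(nodes, xmlValue)                           # metric names
 values <- sapply(nodes, function(x) xmlValue(getSibling(x)))  # metric values
 setNames(values, measures)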

That's it.  I then used a few other common functions to clean up the column names and constructed a data frame to arrange the key stats in a columnar fashion.  Here is the final script; the result is shown in the graphics below.  Please feel free to use this and share it with other R enthusiasts:

#######################################################################
##Alternate method to download all key stats using XML and XPath - PREFERRED WAY
#######################################################################

# Set the working directory for the output CSV (adjust this path as needed)
setwd("C:/Users/i827456/Pictures/Blog/Oct-25")
require(XML)
require(plyr)
getKeyStats_xpath <- function(symbol) {
  yahoo.URL <- "http://finance.yahoo.com/q/ks?s="
  html_text <- htmlParse(paste(yahoo.URL, symbol, sep = ""), encoding="UTF-8")

  #search for <td> nodes anywhere that have class 'yfnc_tablehead1'
  nodes <- getNodeSet(html_text, "/*//td[@class='yfnc_tablehead1']")
  
  if(length(nodes) > 0 ) {
   measures <- sapply(nodes, xmlValue)
   
   # Clean up the metric names: strip parenthetical qualifiers and footnote
   # markers, e.g. "Market Cap (intraday)5:" becomes "Market Cap"
   measures <- gsub(" *[0-9]*:", "", gsub(" \\(.*?\\)[0-9]*:","", measures))
   
   # De-duplicate metric names (Yahoo repeats a few labels) by appending an index
   dups <- which(duplicated(measures))
   for(i in seq_along(dups))
     measures[dups[i]] <- paste(measures[dups[i]], i, sep=" ")
   
   # Use getSibling to reach the adjacent node and xmlValue to read the metric's value
   values <- sapply(nodes, function(x)  xmlValue(getSibling(x)))
   
   df <- data.frame(t(values))
   colnames(df) <- measures
   return(df)
  } else {
    # No key-stats nodes found for this symbol; return NULL so ldply skips it
    return(NULL)
  }
}

tickers <- c("AAPL")
stats <- ldply(tickers, getKeyStats_xpath)
rownames(stats) <- tickers
write.csv(t(stats), "FinancialStats_updated.csv",row.names=TRUE)  

#######################################################################
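
To build the side-by-side comparison for the whole control group from the first script, the same call works unchanged; just pass the full ticker vector to ldply (assuming each symbol returns a valid key-stats page, since any that do not will be dropped and the row names will no longer line up):

 tickers <- c("AAPL", "FB", "GOOG", "HPQ", "IBM", "MSFT", "ORCL", "SAP")
 stats <- ldply(tickers, getKeyStats_xpath)
 rownames(stats) <- tickers
 write.csv(t(stats), "FinancialStats_updated.csv", row.names=TRUE)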




Happy Analyzing!
All Things R &
All Things Analytics (http://goo.gl/CBYQI)


Comments:

  1. have you looked at yql? no xpath involves ;)

    http://developer.yahoo.com/yql/console/?q=show%20tables&env=store://datatables.org/alltableswithkeys#h=SELECT%20*%20FROM%20yahoo.finance.keystats%20WHERE%20symbol%20%3D%20%20%27fb%27

  2. G'day
    Enjoyed reading your code and comments.
    Tried to replicate your solution on the SMH website - eg: http://www.smh.com.au/business/markets/quotes/summary/BHP
    But ran into some trouble separating the table elements using:
    (nodes <- getNodeSet(html_text, "/*//table[@class='business-table']"))
    Any advice at all?
