Data

How to Scrape Data for Over 1,900 ETFs

At the investment management firm I worked at, we had teams of people that devoted lots of time just trying to accurately calculate the total return of assets. Why? Having accurate return data is a prerequisite to creating a trading system or model. You can train your model over various market regimes and environments which increases the likelihood that you are actually modeling something real.

Since I’m just a poor retail investor that can’t afford an actual data feed, I decided to scrape data for over 1,900 ETFs as a proxy for some high-quality return data across multiple asset classes. I thought this was an easy way to get return data for all the major asset class and country combinations in one go. The downside is that most ETFs are pretty new, so there’s some pretty limited backhistory. Most ETFs are only a few years old.

1-008-number-of-etfs

How I Got a List of ETFs With Metadata

I took a look at what existing free data sets are out there, and I think ETF.com is the best resource. I took a look at their website and figured out that they have an unpublished API, so I scraped that. In R, this can be accomplished in a few lines of code. I also cleaned some of the fields using some regular expressions.

# 1. www.etf.com API is in JSON format. The 0 refers to begin at the 0th element. 
# The 2000 represents number of elements to retrieve. Not sure what the 1 means. 
url <- GET("http://www.etf.com/etf-finder-funds-api//-aum/0/2000/1")
etf <- fromJSON(content(url, as = "text"), flatten = TRUE) %>% 
 mutate(fundBasics.issuer = str_replace_all(fundBasics.issuer, "(\\<.*\\>)(.*)(\\<.*\\>)", "\\2"), 
 fundBasics.segment = str_replace_all(fundBasics.segment, "\\s+", " "), 
 fundBasics.expenseRatio.value = fundBasics.expenseRatio.value / 10000)

Here are the fields that they offer:

 Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1928 obs. of 59 variables:
 $ productId : int 521 523 ...
 $ fund : chr "SPDR S&P 500 ETF Trust" ...
 $ ticker : chr "SPY" ...
 $ inceptionDate : Date, format: "1993-01-22" ...
 $ launchDate : Date, format: "1993-01-22" ...
 $ hasSegmentReport : chr "true" ...
 $ genericReport : chr "false" ...
 $ hasReport : chr "true" ...
 $ fundsInSegment : int 105 105 ...
 $ economicDevelopment : chr "Developed Markets" ...
 $ totalRows : int 1928 1928 ...
 $ fundBasics.issuer : chr "State Street Global Advisors" ...
 $ fundBasics.segment : chr "Equity: U.S. - Large Cap" ...
 $ fundBasics.expenseRatio.value : num 0.000945 0.0007 ...
 $ fundBasics.aum.value : num 1.94e+11 ...
 $ fundBasics.spreadPct.value : num 4.59e-05 ...
 $ performance.priceTrAsOf : Date, format: "2016-09-15" ...
 $ performance.priceTr1Mo.value : num -0.0174 ...
 $ performance.priceTr3Mo.value : num 0.0417 ...
 $ performance.priceTr1Yr.value : num 0.108 ...
 $ performance.priceTr3YrAnnualized.value : num 0.105 ...
 $ performance.priceTr5YrAnnualized.value : num 0.145 ...
 $ performance.priceTr10YrAnnualized.value: num 0.0718 ...
 $ analysis.analystPick : chr NA ...
 $ analysis.opportunitiesList : chr NA ...
 $ analysis.letterGrade : chr "A" ...
 $ analysis.efficiencyScore : num 98.9 ...
 $ analysis.tradabilityScore : num 99.3 ...
 $ analysis.fitScore : num 97.2 ...
 $ analysis.leveragedFactor : chr NA ...
 $ analysis.exposureReset : chr NA ...
 $ analysis.avgDailyDollarVolume : num 1.6e+10 ...
 $ analysis.avgDailyShareVolume : num 7.4e+07 ...
 $ analysis.fundClosureRisk : chr "Low" ...
 $ analysis.spread.value : num 0.00996 ...
 $ fundamentals.dividendYield.value : num 0.0203 ...
 $ fundamentals.equity.pe : num 25.4 ...
 $ fundamentals.equity.pb : num 2.86 ...
 $ fundamentals.fixedIncome.duration : num NA NA ...
 $ fundamentals.fixedIncome.creditQuality : chr NA ...
 $ fundamentals.fixedIncome.ytm.value : num NA NA ...
 $ classification.assetClass : chr "Equity" ...
 $ classification.region : chr "North America" ...
 $ classification.geography : chr "U.S." ...
 $ classification.category : chr "Size and Style" ...
 $ classification.focus : chr "Large Cap" ...
 $ classification.niche : chr "Broad-based" ...
 $ classification.inverse : chr "false" ...
 $ classification.leveraged : chr "false" ...
 $ classification.etn : chr "false" ...
 $ classification.selectionCriteria : chr "Committee" ...
 $ classification.weightingScheme : chr "Market Cap" ...
 $ classification.activePerSec : chr "false" ...
 $ classification.underlyingIndex : chr "S&P 500" ...
 $ classification.brand : chr "SPDR" ...
 $ tax.legalStructure : chr "Unit Investment Trust" ...
 $ tax.maxLtCapitalGainsRate : num 20 20 ...
 $ tax.maxStCapitalGainsRate : num 39.6 39.6 ...
 $ tax.taxReporting : chr "1099" ...

How I Got ETF Price Data

I used Yahoo! Finance’s API. In my experience, the Yahoo! Finance data is of pretty high quality for a free data source. For these ETFs, I did some basic checks to see how complete the data was and tests for outliers. The data set is pretty clean. Overall, the code takes around 40 minutes to scrape everything because I added some waiting time in between each request to prevent getting rate limited:

#2. Download data. Some ETFs with limited historical data produce an error. 
ticker_list <- etf$ticker
df <- data.frame()
n <- 1
for (ticker in ticker_list) {
 print(paste0("ETF number ", n, ": ", ticker))
 df_temp <- tryCatch(getSymbolsYahoo(ticker), 
 error = function(e) {
 print(paste0("Could not download data for ", ticker))
 return(data.frame())
 })
 df <- bind_rows(df, df_temp)
 n <- n + 1
 Sys.sleep(runif(1, 1, 2))
}

Here are the fields that they offer:

Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2837633 obs. of 8 variables:
 $ open : num 44 ...
 $ high : num 44 ...
 $ low : num 43.8 ...
 $ close : num 43.9 ...
 $ volume : int 1003200 480500 ...
 $ adjusted_close: num 28.2 ...
 $ date : Date, format: "1993-01-29" ...
 $ ticker : chr "SPY" ...

I hope this is helpful for some people. You can also take a look at the code I used to scrape this data at my Github repository. The scraping can be done in 30 lines of code.

I have an email list where I occasionally send updates to readers on the trading systems that I’m developing. If you are interested, please enter your email below.

Get updates from Signal Plot in your inbox.

About

I am interested in building investing systems, and this blog contains my research and analysis on this topic. I previously worked as an analyst at Bridgewater Associates, a hedge fund that utilizes a systematic, global macro investing style.

12 comments on “How to Scrape Data for Over 1,900 ETFs

  1. Harry Espino

    Yahoo finance API is not available anymore. I have moved to MarketXLS after this change, much more reliable data.

    • The Yahoo Finance API was broken for a while, but by now all the packages that interface with it have fixed it and it’s working again.

  2. Alex Antal

    Kevin, very useful article, thanks a lot.

    Any idea where to get historical NAVs for ETFs?
    e.g. for CEFS, NAV symbol on yahoo finance is ^CEFS-IV:
    But in historical prices page they only show previous day data:
    https://finance.yahoo.com/quote/%5ECEFS-IV/history?p=%5ECEFS-IV

    Thanks!

    • I’ve been asked this question before, and I think it’s tough unless you’re subscribed to some institutional level data provider. But ETF.com does provide historical weekly fund flows, so you could download the flows and use the current NAV to back out the historical NAV.

  3. Hi there,

    very nice post! I have two questions:
    1) Did you find out any way of scraping the price series directly from etf.com?
    2) Did you use the raw yahoo timeseries, or did you apply some cleaning? If yes, which?

    Best,
    Florian

    • Thanks. I didn’t scrape any of the price series from ETF.com. I used Yahoo! Finance to scrape the price series. I just downloaded the data as is and didn’t apply any cleaning. I’ve used this data for subsequent analyses and have found the data to be of high quality with few errors.

  4. Great info

    Thanks

  5. Very interesting indeed.
    I tried to have a look at the code too but it looks like the function getSymbolsYahoo is not defined…maybe is defined within the source(“./Posts/1.001 Initial Functions and Libraries.R”)?

    • Thanks for reading. Yes, the getSymbolsYahoo function is just a wrapper function I wrote for quantmod’s getSymbols function to make it a bit cleaner for my purposes. I hope this is helpful.

      getSymbolsYahoo < - function(ticker) { 
        df <- getSymbols(ticker, src = "yahoo", auto.assign = FALSE, from = "1900-01-01")
        df <- as.data.frame(df) %>% 
          mutate(date = index((df)), 
                 ticker = ticker)
        colnames(df) < - c("open", "high", "low", "close", "volume", "adjusted_close", "date", "ticker")
        return(df)
      }
      
  6. Pingback: Quantocracy's Daily Wrap for 09/27/2016 | Quantocracy

Leave a Reply