Data Quality – Portfolio and Risk Data: Why It’s So Hard To Do It Right
When evaluating risk and portfolio analytics platforms, a critical part of the story is often overlooked or even ignored: data and data quality. Robust, reliable, complete datasets are essential to any system’s calculation engines, and it is impossible to produce usable results without them. Anyone who has worked with data will be familiar with the phrase “garbage in, garbage out”. Even the most advanced risk system will fail to provide accurate analytics without the high-quality datasets needed to power it.
The data component of risk analytics may be overlooked because data seems easier to understand than the quantitative techniques underlying the model. People may take it for granted that the data will be there, and assume that all risk platforms are basically the same when it comes to the breadth and depth of data they supply. Or, firms may assume that it would be straightforward to collect and store their own data, to give them control over it. To show the reality that belies those notions, we offer this look behind the scenes at what is involved in building and maintaining the comprehensive datasets that power Imagine’s multi-asset class, multi-currency risk and portfolio management system.
We break this undertaking into three categories:
- Underlying data
- Derived data and risk factors
- Quality assurance
Collecting the Data
We begin with collecting the underlying data, which includes prices and terms and conditions for stocks and bonds, as well as market indexes, swap rates, options, futures, FX rates (spot and forward), and so on.
Stocks and Bonds
There are 50-60 major stock exchanges in the world, covering an estimated 630,000 securities. Equity data includes not just prices and dividends, but also the number of shares issued and corporate actions (ex-dividend dates, stock splits, repurchases, mergers and so on), all of which are needed to compute returns and market capitalizations. On some of the smaller exchanges, many issues do not trade every day, so a pricing feed may have gaps or may repeat stale prices, which causes problems when constructing time series (discussed below).
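As an illustration of the gap and stale-price problem, the sketch below scans one security's daily feed for missing business days and for prices that repeat suspiciously often. It is a minimal example, not a description of any production check: the function name, the dict-based feed format and the `max_repeats` threshold are all assumptions made for the example.

```python
from datetime import date, timedelta


def find_gaps_and_stale(prices, business_days, max_repeats=3):
    """Return (missing_days, stale_days) for one security.

    prices: dict mapping date -> closing price (hypothetical feed format).
    business_days: ordered list of dates on which a price is expected.
    max_repeats: assumed threshold; a price seen on more than this many
                 consecutive reported days is flagged as possibly stale.
    """
    missing, stale = [], []
    last_price, run_length = None, 0
    for day in business_days:
        price = prices.get(day)
        if price is None:
            # No price delivered for an expected day: record the gap.
            # The repeat counter is deliberately not reset by a gap.
            missing.append(day)
            continue
        if price == last_price:
            run_length += 1
        else:
            last_price, run_length = price, 1
        if run_length > max_repeats:
            stale.append(day)
    return missing, stale


# Usage: one trading week with a gap on Thursday and an unchanged price.
days = [date(2021, 3, 1) + timedelta(days=i) for i in range(5)]
feed = {days[0]: 10.0, days[1]: 10.0, days[2]: 10.0, days[4]: 10.0}
missing, stale = find_gaps_and_stale(feed, days, max_repeats=2)
```

A flagged day is only a candidate for review; an illiquid stock can legitimately close at the same price several days in a row.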
The bond market is an entirely different beast. In the U.S. alone, there are well over 1 million fixed income securities, including Treasuries (nominal and inflation-protected), agencies, corporates, mortgage-backed and asset-backed securities and taxable and tax-exempt municipals. The terms and conditions needed to compute even basic measures for bonds make collecting equity data look easy – for starters, we need coupon rate and type (fixed, floating, step-up), maturity date, interest payment frequency, day-count, call schedule, and currency, and possibly more.
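To make the list of terms and conditions concrete, here is one way such a record might be laid out. This is a sketch only: the class name, field names and the `coupon_per_period` helper are hypothetical, and a real security master would carry many more fields.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Tuple


@dataclass
class BondTerms:
    """Illustrative container for the basic terms listed above."""
    identifier: str
    currency: str
    coupon_rate: float        # annual rate as a decimal, e.g. 0.045
    coupon_type: str          # "fixed", "floating" or "step-up"
    maturity: date
    payments_per_year: int    # interest payment frequency
    day_count: str            # e.g. "30/360" or "ACT/ACT"
    # Call schedule as (call date, call price) pairs; empty if not callable.
    call_schedule: List[Tuple[date, float]] = field(default_factory=list)

    def coupon_per_period(self, face: float = 100.0) -> float:
        """Cash coupon per payment date, defined here for fixed coupons only."""
        if self.coupon_type != "fixed":
            raise ValueError("coupon amount is only known in advance for fixed coupons")
        return face * self.coupon_rate / self.payments_per_year


# Usage with made-up terms for a semi-annual fixed-rate bond.
bond = BondTerms(
    identifier="EXAMPLE-BOND",
    currency="USD",
    coupon_rate=0.045,
    coupon_type="fixed",
    maturity=date(2031, 6, 15),
    payments_per_year=2,
    day_count="30/360",
)
coupon = bond.coupon_per_period()
```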
There is no official repository for this information, so it must be scraped from official documents and organized into a usable format, or obtained from a vendor who handles that. Since bonds are not exchange-traded there are no “official” prices, and vendors use different types of prices for a given bond. Many are derived from “similar” issues, as weeks can go by without having a traded price reported for a bond. You need to understand the vendor’s price types and make choices about which ones to use and what to discard.
Indexes and Derivatives
We expect this data collection effort already looks daunting, and we are just getting started. A platform needs market index data, which can be obtained from index purveyors for a non-trivial fee per index. Swap rates for major currencies are also available from various vendors, for a fee. Note that as the swap market transitions away from LIBOR to overnight risk-free rates such as the Secured Overnight Financing Rate (SOFR), that data transition must be managed carefully. Obtaining futures prices from the exchanges is not difficult, but constructing a time series from those prices requires logic to smooth out the discontinuities that occur when the nearest contract expires and the price “jumps” to the next contract. Option data is also available from exchanges, but with all of the possible combinations of underlyings, strikes and expiries it can be so voluminous that one must decide which data to store (perhaps using a volume threshold).
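One common way to remove the roll discontinuity in a futures series is difference back-adjustment: at each roll, every earlier price is shifted by the gap between the new and the old contract on the roll day. The sketch below illustrates that technique under stated assumptions; the source does not say which method is actually used, and the `segments` input format and the `next_on_roll` key are invented for the example.

```python
def back_adjust(segments):
    """Build a continuous futures series by difference back-adjustment.

    segments: list of dicts in chronological order, one per front contract:
        "prices": list of (day, price) while that contract is the front one
        "next_on_roll": price of the NEXT contract on this segment's last
                        day, or None for the most recent segment
    Returns a list of (day, adjusted_price).
    """
    adjusted = []
    offset = 0.0
    # Walk backwards so the most recent contract keeps its actual prices.
    for seg in reversed(segments):
        prices = seg["prices"]
        if seg["next_on_roll"] is not None:
            # Gap between new and old contract on the roll day.
            offset += seg["next_on_roll"] - prices[-1][1]
        adjusted = [(day, px + offset) for day, px in prices] + adjusted
    return adjusted


# Usage: the old contract closes at 101 while the next one trades at 104.
series = back_adjust([
    {"prices": [("d1", 100.0), ("d2", 101.0)], "next_on_roll": 104.0},
    {"prices": [("d3", 105.0), ("d4", 106.0)], "next_on_roll": None},
])
# series == [("d1", 103.0), ("d2", 104.0), ("d3", 105.0), ("d4", 106.0)]
```

Day-to-day price changes are preserved, but the adjusted historical levels no longer match traded prices. Ratio adjustment is an alternative when percentage returns matter more than price differences.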
Imagine collects hundreds of thousands of individual data items every day. Since no single vendor offers the breadth of coverage across instruments, asset classes and global regions that our clients need, we source data across many vendors whose file formats must be reconciled and harmonized. Those who may have been thinking of doing this in-house have probably changed their minds by now, and we have not yet discussed calculating and storing derived measures from the data.
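Reconciling vendor formats largely comes down to mapping each vendor's field names and conventions onto one canonical record. The sketch below shows the idea with two made-up vendors; the vendor names, field names and canonical schema are all hypothetical and do not describe any real feed.

```python
# Hypothetical per-vendor mappings: vendor field name -> canonical field name.
FIELD_MAPS = {
    "vendor_a": {"Ticker": "id", "Close": "price", "Ccy": "currency"},
    "vendor_b": {"symbol": "id", "px_last": "price", "curr": "currency"},
}


def harmonize(vendor, record):
    """Convert one vendor record (a dict) into the canonical layout."""
    mapping = FIELD_MAPS[vendor]
    out = {canonical: record[source] for source, canonical in mapping.items()}
    # Normalize types and conventions so records from different vendors compare equal.
    out["price"] = float(out["price"])
    out["currency"] = out["currency"].upper()
    out["source"] = vendor  # keep provenance for later reconciliation
    return out


# Usage: the same quote delivered in two different layouts.
rec_a = harmonize("vendor_a", {"Ticker": "ABC", "Close": "12.50", "Ccy": "usd"})
rec_b = harmonize("vendor_b", {"symbol": "ABC", "px_last": 12.5, "curr": "USD"})
```

Keeping the `source` field makes it possible to compare vendors against each other once their records share a layout.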
Sensitivities, Time Series and Risk Measures
The raw data described above are the inputs to the risk measures and portfolio analytics that are the end goal of Imagine’s data management activities. To reach that goal, we must construct time series from which we can derive volatilities and correlations. Options are the exception: there we typically use implied vols, which we must therefore compute. Note that two time series must have the same length and frequency before a correlation between them can be computed.
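The requirement that series match in length and frequency can be seen in a small example. The sketch below aligns two price series on their common dates, converts them to log returns, and computes an annualized volatility and a correlation. The function names and the 252-day annualization factor are assumptions for illustration, not a description of the platform's calculations.

```python
import math


def align(series_a, series_b):
    """Keep only dates present in both series (dicts of date -> price)."""
    common = sorted(set(series_a) & set(series_b))
    return [series_a[d] for d in common], [series_b[d] for d in common]


def log_returns(prices):
    return [math.log(later / earlier) for earlier, later in zip(prices, prices[1:])]


def annualized_vol(returns, periods_per_year=252):
    """Sample standard deviation of returns, scaled to one year."""
    n = len(returns)
    mean = sum(returns) / n
    variance = sum((r - mean) ** 2 for r in returns) / (n - 1)
    return math.sqrt(variance * periods_per_year)


def correlation(x, y):
    """Pearson correlation; refuses series of different lengths."""
    if len(x) != len(y):
        raise ValueError("series must have the same length")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Usage: series_b has one extra date, which alignment drops.
series_a = {1: 100.0, 2: 102.0, 3: 101.0, 4: 105.0}
series_b = {1: 50.0, 2: 51.0, 3: 50.5, 4: 52.5, 5: 53.0}
prices_a, prices_b = align(series_a, series_b)
rho = correlation(log_returns(prices_a), log_returns(prices_b))
vol_a = annualized_vol(log_returns(prices_a))
```

Dropping unmatched dates is the simplest alignment rule. It means a return can span more than one day when a date is missing from either series, so a production process would need an explicit policy for such cases.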
For equity markets (individual stocks and market indices), we compute volatilities from returns. For bond markets, we must construct yield curves and build time series from the rates that comprise those curves – Treasury/sovereign curves, inflation curves and swap curves. We also compute the sensitivities (durations and spreads) of individual instruments to changes in those curves. Note that there is no single “correct” way to build a yield curve; that is why different sources produce different curves for the same market on the same day.
Imagine extracts par bond rates by tenor for each curve to create time series of “constant maturity” par bond rates. We use 12 tenors ranging from 1 month to 50 years, where available, to provide sufficient detail, and cover sovereign, IBOR-style, and OIS curves, where possible, for every currency we handle. Rates for yield curves are extracted twice daily; a preliminary snapshot is generated in the afternoon and final rates are published later that evening, based on official closing prices.
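For readers unfamiliar with par rates, the standard textbook relationship is that the par coupon for a given maturity is the coupon at which a bond discounted on the curve prices exactly at par. The sketch below applies that formula to a list of discount factors. It is a generic illustration, not Imagine's curve-building method; the function name and inputs are assumptions.

```python
def par_rate(discount_factors, frequency=1):
    """Annualized par coupon rate for a bond maturing at the last date.

    discount_factors: discount factor for each coupon date up to and
                      including maturity, in chronological order.
    frequency: coupon payments per year.

    Solves  1 = (c / frequency) * sum(DF_i) + DF_n  for c.
    """
    return frequency * (1.0 - discount_factors[-1]) / sum(discount_factors)


# Usage: on a flat 5% annually compounded curve the 10-year par rate
# should come back as approximately 5%.
flat_curve = [1.05 ** -t for t in range(1, 11)]
ten_year_par = par_rate(flat_curve)
```

Repeating this calculation for each tenor on each day gives one “constant maturity” time series per tenor.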
For credit markets, we derive spread volatilities from time series of single name credit default swaps, where available; for issuers without single name CDS, we use CDS indices based on industry and credit rating. We also construct time series to compute volatilities for FX rates and all listed commodities. In all, we maintain well over 250,000 separate time series.
To compute risk, we must identify a portfolio’s exposures to different asset classes and link them to the appropriate volatilities. To do this, we must decompose each security into its relevant risk factors. For example, a convertible bond has exposure to interest rates, credit spreads, equity returns, equity volatility, and potentially FX.
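A decomposition of this kind can be pictured as a lookup from instrument type to risk factors, with FX added when the security's currency differs from the portfolio's base currency. The sketch below is purely illustrative: the table, factor labels and function name are hypothetical, and only the convertible bond entry is taken from the text above.

```python
# Hypothetical mapping from instrument type to its risk factors.
BASE_FACTORS = {
    # From the example in the text.
    "convertible_bond": [
        "interest_rate", "credit_spread", "equity_return", "equity_volatility",
    ],
    # Assumed entry, added so the table has more than one row.
    "equity": ["equity_return"],
}


def risk_factors(instrument_type, currency, base_currency):
    """List the risk factors a position is exposed to."""
    factors = list(BASE_FACTORS[instrument_type])
    if currency != base_currency:
        # FX exposure arises only when the security is not in the base currency.
        factors.append("fx")
    return factors


# Usage: a EUR convertible held in a USD portfolio picks up FX exposure.
eur_convert = risk_factors("convertible_bond", "EUR", "USD")
usd_convert = risk_factors("convertible_bond", "USD", "USD")
```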
Last, but definitely not least, all of this data must be reviewed. Gaps, input errors and missing files are inevitable, and feeds must be ingested at the right time every day, recognizing the 24-hour business day that is the reality of global markets.
Imagine’s Data Quality team checks every time series for every risk factor, in what is close to real time. Of course, this must be automated, as it would be an impossible task to do manually. We use a highly scalable “elastic engine” that is uploaded into an elastic cloud (an approach frequently used by high tech companies) to permit ultra-fast queries, since this volume of data would choke a typical system.
We use moving averages to detect irregularities. Observations that are a certain number of standard deviations away from the moving average are either fixed automatically, by receiving updated values from vendors, or corrected manually. We also check certain elements of vendors’ data against our own databases. We use dashboards to monitor quality, including the number of outliers and missing values in a time series. For implied vols, we make sure the number of strikes and expiries is fairly constant.
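The moving-average check can be sketched in a few lines: compare each observation with the mean and standard deviation of a trailing window and flag it if it lies too far away. The window length and the threshold below are assumed values chosen for the example, not the settings used in production, and the function name is hypothetical.

```python
import statistics


def flag_outliers(values, window=20, n_sigmas=4.0):
    """Return indices of observations far from their trailing moving average.

    values: chronological list of observations for one time series.
    window: number of prior observations in the moving average (assumed).
    n_sigmas: distance, in trailing standard deviations, beyond which
              an observation is flagged (assumed).
    """
    flagged = []
    for i in range(window, len(values)):
        trailing = values[i - window:i]
        mean = statistics.mean(trailing)
        sd = statistics.stdev(trailing)
        # A zero standard deviation (flat window) is skipped rather than
        # flagged, since any move at all would otherwise look infinite.
        if sd > 0 and abs(values[i] - mean) > n_sigmas * sd:
            flagged.append(i)
    return flagged


# Usage: a stable series with one bad print at position 25.
series = [100.0 + (i % 3) for i in range(30)]
series[25] = 150.0
suspects = flag_outliers(series)
```

Note that an outlier inflates the trailing standard deviation for the next `window` observations, so a practical implementation would exclude or correct flagged points before they enter later windows.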
Due to the nature of financial market data, there are always gaps in time series. For example, not all equities have 12 years of history, so running VaR analyses with data that goes back as far as 2008 is not as straightforward as it may sound. Imagine has “fall-back” models that are used to address any data gaps for a given factor. This requires some thought – for example, companies that went public in the latter half of 2020, thereby missing the market plunge in March due to Covid-19 lockdowns, would seem to have unusually low volatility; our model corrects for that (we are refining these models, making them more granular and transparent, and therefore easier to override if needed).
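To illustrate what a fall-back for short histories might look like, the sketch below scales the full-history volatility of a proxy (for example, a sector index) by the ratio of the security's volatility to the proxy's volatility over the period where both exist. This is one simple, assumed approach chosen for the example; the text does not describe the actual fall-back models, and the function name, inputs and `min_obs` threshold are hypothetical.

```python
import statistics


def fallback_vol(asset_returns, proxy_returns, min_obs=250):
    """Estimate per-period volatility for a security with a short history.

    asset_returns: the security's available returns, most recent last.
    proxy_returns: full-history returns of a proxy, aligned so that its
                   last len(asset_returns) entries cover the same dates.
    min_obs: assumed minimum history needed to trust the security's own data.
    """
    if len(asset_returns) >= min_obs:
        return statistics.stdev(asset_returns)
    overlap = proxy_returns[-len(asset_returns):]
    # How much more (or less) volatile the security is than its proxy
    # over the period where both have data.
    scale = statistics.stdev(asset_returns) / statistics.stdev(overlap)
    return scale * statistics.stdev(proxy_returns)


# Usage: the proxy lived through a turbulent period that the newly
# listed security missed; the security moves twice as much as the proxy.
proxy = [0.05, -0.05] * 5 + [0.01, -0.01] * 5
new_listing = [0.02, -0.02] * 5
estimate = fallback_vol(new_listing, proxy)
own_history_only = statistics.stdev(new_listing)
```

In this example the estimate is higher than the volatility computed from the short history alone, which is the direction of correction described above for companies that listed after March 2020.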
Value added from high quality data and data management
Most risk systems do not supply this volume of data, or pay as much attention to validating the data inputs and outputs every day. Indeed, many systems require clients to supply their own data and/or prices, which entails negotiating contracts with various data vendors, paying ongoing fees per data feed, and addressing numerous data integration challenges. Firms that consider building and maintaining their own in-house system face immense data costs as well as ongoing maintenance and quality assurance efforts. With Imagine, all of the data described above is natively included in the platform. We believe this is an important consideration in evaluating a risk system.
Clients trust Imagine’s ability to handle this significant data effort for them and appreciate that it is all part of the service we offer.