Roland Jutasi
September 12, 2022
Filling in the Gaps - Going from NaN to Wham
#data transparency#data quality#data pipeline

In the world of data, asset managers are pampered.

Every financial metric is available in an instant, updated real time, and fully standardized - nowhere is there a debate on the definition of Gross Profit Margin, or discussing whether Return on Assets covers all assets or only some 70% of them that has been measured.

These are issues however that plague ESG metrics.



Data Availability & Reliability Challenges


First there is the lack of available data.

Up until recently, reporting was effectively up to the good will of companies and any pressures that shareholder activism could muster. Mandatory disclosures are now beginning to catch up, but it is a long road to match the level of compliance as financial disclosures.


But even for the fortunate cases of reporting, we have a reliability issue.

Different companies have different competing methodologies to report the same datapoint and interpret the definition of certain metrics differently. They also may only have the capacity to report for a certain segment of their company or operations. Even with these in place though, without the proper auditing or transparency, it can be incredibly difficult to understand where these values come from. This can lead to apparently absurd situations where different ESG data providers will disagree on a company’s reported impact metrics, even outside of their own estimations to fill the gaps.



From Uncleaned Raw Data to Complete Dataset


Data providers, however, often have the unenviable job of selling a complete dataset - something that is usable to client investors in an actionable way - and this means having to impute and fill the gaps. This goes in direct counter to the needs of other investors, who need to know how reliable the datapoint they have is to make their investment decision - be it for risk management or to drive impact.


So how do we tackle these challenges at RGS, how do we go from uncleaned raw data to a complete and reliable dataset?


Meet the RGS Data Pipeline


The methodology with which we estimate datapoints is a living one, and one that is continuously being improved for reliability and specificity – in this article we are giving a sneak-peek into our data pipeline and how we address quality control processes within.


Data Collection


In our data pipeline the first step is Data Collection - the sourcing for the raw data from which all else is derived.


There are 3 types of data sources we collect from, and every data source is given significant time with human analysts to ensure that the data is consistent, and the source reliable:

●      Government Databases - for some KPIs, the data is required by law to be listed by a government agency. These databases are publicly accessible for all and are often comprehensive and complete. A good example of such a database is the FDA Medical Device Recalls database.
For these databases there is an analyst-in-the-loop for every company-year-datapoint value, and every collected datapoint has a citation associated. The collected data is converted into standardized units, ensuring the reported datapoint is comparably universal for all companies.

●      Direct Company Reporting - data that is provided by the company itself. This is usually available in separate Sustainability or Corporate Responsibility Reports, or otherwise integrated within other data sources such as SEC filings. These reports are often accompanied by an Index, explaining where each datapoint is discussed with respect to GRI or SASB or other reporting guideline’s definitions.

●      Trusted External Data Providers - In some cases, we use data from 3rd-party providers. These providers have access to information that is not universally available, due to their ability to leverage privileged relationships with companies.

Here we still (where possible) rely solely on the data that is raw from a company and impute ourselves for the estimated datapoints that enables us to track and maintain a solid domain knowledge of each datapoint we utilize.

Also, a long time is spent examining and comparing the data to other sources where possible and engaging with the data provider to ensure that both the methodology and any discrepancies across data sources are understood.



The different data sources mean that whilst we have significant control of many of the datapoints we collect, we must undergo a process to join and merge files. We must also engage in the work to hunt out the outliers in our data sets. This is our Pre-processing stage. We join the various datasets needed for a given impact or framework pipeline and standardize namings for columns and companies. From this structured form begins our Quality Control Process.


Data Type Validation


Within Pre-processing, our first task is Data Type Validation - this ensures that the data type is correct for the column, so numbers are represented as numbers, and not a string, and namings of companies and ISIN’s are consistent throughout their use.


Data QC

With the data types verified, we then move to our Input Data QC, which utilizes 3 layers of “defense” against incorrect inputs:

  1. Clipping - this is the simple removal of unphysical numbers, or impossible values (such as percentages above 100%) from our dataset and estimating instead for the specific company-year. One may initially hope that such mistakes are not common, but when seeing reported % of women in a company as 115%, one realizes the sheer necessity of these clipping steps.
  1. Automated Data Descriptive Statistics & Quality Sheets - The automated creation of a series of dataframes for both manual review and as an input into our automated outlier detection.
    These sheets calculate and display information about the raw datapoint with respect to the time series it lies within, and information on where the datapoint sits within the company, and the various sector/industry categorizations. These sheets also display a range of descriptive statistics per categorization - statistics such as the mean, median, standard deviation, minimum and maximum value. Manual review across these sheets hunts both for outliers and provides insights into reported data for different industry categorizations and any systemic behaviors in company reporting.

  2. Automated Outlier Detection - Utilizing methods specifically designed to catch outliers in datasets to flag both spikes in reported impact metrics (indicative of a possible one-off reporting error) or spikes in the Year-on-Year changes in the impact metric (indicative of a change in calculation methodology by the company or reporting scope or financial behaviors like mergers). Via the resulting data descriptions, we create a cost function to automatically hunt for outliers relative to other datapoints within the company, sub-industry, Year-on-Year Growth rate, and more. A suspicion list is produced, and manually investigated to confirm if these are genuinely problematic datapoints or there is a legitimate reason for this. This is used to train a Machine Learning model to predict outliers better, leading to a self-improving outlier detection algorithm.

Using the suspicion list we develop from methods 2 and 3, we then engage in manual examination for each of these company-year-datapoints to verify the reported value and investigate the outlier behavior. From these, those that are unexplained and unexcused are marked as true outliers. This is a separate input that we then feed into the pre-processing to give us a manual list of datapoints that should be dropped and estimated instead, or otherwise corrected in the raw data where reasonable to do so.




With our now cleaned data, we move onto the task of Imputations. Depending on the datapoint, we may have reporting coverage as high as 85%, or as low as 1.8%, and the task of estimating the rest must come with its challenges. We classify imputations into 2 types, depending on the statistical strength of the assumptions we can enforce with our estimations.


1.     High-Quality Imputations are those where we estimate while imposing strict requirements such as minimum numbers of datapoints to base any estimations on, or extrapolation limits to prevent extending too far on trends without sufficient supporting evidence.


2.     Other Imputations are reserved as the last line for estimation. where the fundamental methods of estimation remain the same here as well.


Information about which imputations we consider to be made with high confidence and those where we consider there to be more uncertainty is preserved throughout this process, for both internal tracking during quality control, and so we can effectively quantify our confidence in our impact calculations, and the uncertainty associated with each value we estimate.


We are in the process of integrating the same automated data statistics sheets and outlier detection model as in the pre-processing stage to catch potential outliers - and mark these suspicious points for our investigation. Not only does this act as another filter for catching potentially problematic reported datapoints, but it also enables us to manage the quality of our imputations and examine the merit of different estimation methods per datapoint.


Per datapoint, we calculate the percentage of datapoints that are reported, imputed with statistical rigor, and imputed under relaxed conditions, and the relative effectiveness of each imputation method. This is used to then adjust our imputation logic.



Calculations and Results


Now with a completed dataset of input KPIs, the next stage along the data pipeline is our impact calculations. With each calculation we engage in significant academic research, saving both the bibliography, derivations, and logic justifications for any and all calculations we utilize.

This then undergoes an internal peer review of the research undergone, a viva-like defense for the logic behind the calculation, and blind rederivation of any involved mathematical formulae to prevent arithmetic errors. This research goes towards both the impact measurement and monetization calculations themselves, and to any and all social costs or other global assumptions utilized in the calculation.


Once this peer review has been completed and passed, we move onto our Results QC. We begin with an examination of our automatically created Data Statistics Sheets, identical in form to those discussed before, but applied onto the monetized impacts, as well as the intermediate drivers and input KPIs we use. This is done both for the individual monetized impacts, and their aggregations (such as Total RIFT™).

Manual examination not only acts to augment the calculation review described above, by contextualizing the calculation logic and enabling cross-impact comparisons, it also is the input towards our outlier detection - both manual and automated. The outlier detection behaves similar to those methods described above.


Across all the parts of the data pipeline, we have collected and investigated a list of suspicious company-year-datapoints. Those that do not have a valid explanation for their unexpected behavior, or that otherwise cannot be verified, are marked into the input that goes into the preprocessing, thus leading to our self-improvement model that ensures data quality remains as high as possible for our clients.


Transparency Scoring


Beyond our transparency towards our process, we are also developing methods to directly communicate and quantify the transparency (or lack thereof) of the company under assessment.

These Data Transparency & Reliability Scores are important augmenters to investment decisions. Understanding and quantifying impact uncertainty is fundamental for developing true impact-risk management techniques.