Applying corrections on SDC New Issues database

Copyright Thomson Reuters
Reading Time: 3 minutes

Thomson Reuters SDC Platinum Database does have some errors. Usually, the IPO literature adapts corrections supplied by Prof. Jay Ritter from the University of Florida. He provides his corrections in PDF format in a text-like style, which is hardly parseable in an automated manner.

In this blog post, we provide a short Python script which applies the corrections (downloadable here) to any dataset obtained from SDC Platinum.

We manually convert the PDF into a CSV format, which is publicly downloadable here. It possesses five columns cik, col, val, info, name, and issuedate. The Central Index Key (CIK) is the first column. col corresponds to the column name of the SDC dataset. In our studies, we convert the raw SDC column names into a more convenient format by deleting special characters and white spaces. For those interested, here is the code snippet we use:

nice_string = ['\n',' ','(',')','&','.', '/','\'] 
for i in nice_string: 
    sdc.columns = sdc.columns.str.replace(i, '')

However, if you would rather use the raw names, just change the entries of col by your SDC column names. The entries of val correspond to the corrected values, name to the issuer names and issuedate to the issue dates.

To match rows from the corrections CSV to the SDC data, we use the CIK as the main identifier. However, if the CIK is not provided, we proceed by matching the issuer name. If there are multiple matches, we also compare the issue dates, if available. In the case of there being multiple matches on both issuer name and issue date, we apply the correction to all of these. If the date cannot be matched, we allow for a two-day discrepancy between the dates, which the user of the script can also set to zero (if, for example, only perfect matches should be allowed). Non-matchable entries are captured in a list, which the function returns together with the corrected SDC data frame.

The following code should do it. All remaining errors are our own.

import pandas as pd
import os
import datetime as dt

def apply_corrections(sdc, corrections, ddelta=2):
    We apply the corrections to the SDC data suggested by Prof. Jay R. Ritter which are available
    on his website as a pdf file. We transform them into a csv file and alter SDC data via a script.

    sdc : pandas dataframe
    corrections : path to corrections csv as string
    ddelta : allowed discrepancy between issue dates

    sdc : corrected sdc dataframe
    fail_list : list of companies that could not be matched

    corrections = pd.read_csv(corrections, parse_dates = True)
    for i, row in corrections.iterrows():
        # skip if column to be corrected not in sdc file
        if row["col"] not in sdc.columns:
        # if cik available, search by cik
        if not pd.isnull( row['cik'] ):
            sdc.loc[ (sdc["CIK"] == row["cik"]) & (sdc["IPO"] == "Yes"), row["col"] ] = row["val"]
        # if cik not available, search by issuer name
            idx = (sdc["Issuer"] == row["name"]) & (sdc["IPO"] == "Yes")
            # if match is unambiguous
            if sum(idx) == 1:
                sdc.loc[idx, row['col']] = row['val']
            # in case of multiple matches:
            # if issue date is the same for all rows, apply changes on all rows
            elif len( set(sdc.loc[idx, "IssueDate"]) ) == 1:
                n = len( sdc.loc[idx, row['col']] )
                sdc.loc[idx, row['col']] = [ row['val'] for j in range(n) ]
            # compare issue date in corrections with issue date in sdc
                # add margin to issue date, issue date not the same
                date = [dt.datetime.strptime(j, "%Y-%m-%d") for j in sdc.loc[idx, "IssueDate"]]
                margin = dt.timedelta(days = ddelta)
                date_range = [j - margin <= dt.datetime.strptime(row["issuedate"], "%Y-%m-%d") <= j + margin for j in date]
                if sum(date_range) == 1:
                    # correct only the row with matching issue date
                    idx = [y for x,y in zip(date_range,sdc.loc[idx, "IssueDate"].index) if x]
                    sdc.loc[idx, row['col']] = row['val']
                    # not unambiguous
    return sdc, fail_list
Print Friendly, PDF & Email
+ posts
+ posts