Thomson Reuters SDC Platinum Database does have some errors. Usually, the IPO literature adapts corrections supplied by Prof. Jay Ritter from the University of Florida. He provides his corrections in PDF format in a text-like style, which is hardly parseable in an automated manner.
In this blog post, we provide a short Python script which applies the corrections (downloadable here) to any dataset obtained from SDC Platinum.
We manually convert the PDF into a CSV format, which is publicly downloadable here. It possesses five columns cik, col, val, info, name, and issuedate. The Central Index Key (CIK) is the first column. col corresponds to the column name of the SDC dataset. In our studies, we convert the raw SDC column names into a more convenient format by deleting special characters and white spaces. For those interested, here is the code snippet we use:
nice_string = ['\n',' ','(',')','&','.', '/','\'] for i in nice_string: sdc.columns = sdc.columns.str.replace(i, '')
However, if you would rather use the raw names, just change the entries of col by your SDC column names. The entries of val correspond to the corrected values, name to the issuer names and issuedate to the issue dates.
To match rows from the corrections CSV to the SDC data, we use the CIK as the main identifier. However, if the CIK is not provided, we proceed by matching the issuer name. If there are multiple matches, we also compare the issue dates, if available. In the case of there being multiple matches on both issuer name and issue date, we apply the correction to all of these. If the date cannot be matched, we allow for a two-day discrepancy between the dates, which the user of the script can also set to zero (if, for example, only perfect matches should be allowed). Non-matchable entries are captured in a list, which the function returns together with the corrected SDC data frame.
The following code should do it. All remaining errors are our own.
import pandas as pd import os import datetime as dt def apply_corrections(sdc, corrections, ddelta=2): """ We apply the corrections to the SDC data suggested by Prof. Jay R. Ritter which are available on his website as a pdf file. We transform them into a csv file and alter SDC data via a script. Parameters ---------- sdc : pandas dataframe corrections : path to corrections csv as string ddelta : allowed discrepancy between issue dates Returns ------- sdc : corrected sdc dataframe fail_list : list of companies that could not be matched """ corrections = pd.read_csv(corrections, parse_dates = True) fail_list=[] for i, row in corrections.iterrows(): # skip if column to be corrected not in sdc file if row["col"] not in sdc.columns: continue # if cik available, search by cik if not pd.isnull( row['cik'] ): sdc.loc[ (sdc["CIK"] == row["cik"]) & (sdc["IPO"] == "Yes"), row["col"] ] = row["val"] # if cik not available, search by issuer name else: idx = (sdc["Issuer"] == row["name"]) & (sdc["IPO"] == "Yes") # if match is unambiguous if sum(idx) == 1: sdc.loc[idx, row['col']] = row['val'] # in case of multiple matches: # if issue date is the same for all rows, apply changes on all rows elif len( set(sdc.loc[idx, "IssueDate"]) ) == 1: n = len( sdc.loc[idx, row['col']] ) sdc.loc[idx, row['col']] = [ row['val'] for j in range(n) ] # compare issue date in corrections with issue date in sdc else: # add margin to issue date, issue date not the same date = [dt.datetime.strptime(j, "%Y-%m-%d") for j in sdc.loc[idx, "IssueDate"]] margin = dt.timedelta(days = ddelta) date_range = [j - margin <= dt.datetime.strptime(row["issuedate"], "%Y-%m-%d") <= j + margin for j in date] if sum(date_range) == 1: # correct only the row with matching issue date idx = [y for x,y in zip(date_range,sdc.loc[idx, "IssueDate"].index) if x] sdc.loc[idx, row['col']] = row['val'] else: # not unambiguous fail_list.append(row["name"]) return sdc, fail_list