Frequent pattern mining in EDGAR log files

Frequent pattern mining, or association rule learning, is a machine learning technique for discovering strong rules in large databases. A popular application is the analysis of supermarket transactions, for example {toothbrush} → {toothpaste}: people who buy toothbrushes are likely to buy toothpaste as well. Another example could be {butter, bread, milk} → {cheese}.
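
To make "strong" concrete, the three standard measures for a rule X → Y can be written as follows. The notation (T for the set of all transactions, and the names supp, conf and lift) is introduced here for illustration and does not come from the original post.

```latex
\begin{align*}
\operatorname{supp}(X) &= \frac{|\{\, t \in T : X \subseteq t \,\}|}{|T|} \\[4pt]
\operatorname{conf}(X \rightarrow Y) &= \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)} \\[4pt]
\operatorname{lift}(X \rightarrow Y) &= \frac{\operatorname{conf}(X \rightarrow Y)}{\operatorname{supp}(Y)}
\end{align*}
```

The support of the rule itself is supp(X ∪ Y). A lift above 1 means that X and Y appear together more often than they would if they were independent.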

After identifying these rules, supermarket managers place the products close to each other so that customers can easily buy them together. One can also observe this approach in online shops such as Amazon: “Customers who bought this item also bought …”. Other applications include web usage mining, intrusion detection, continuous production, and bioinformatics.

Although the idea sounds pretty simple, the implementation isn’t trivial. Probably the best-known algorithm is the Apriori algorithm proposed by Agrawal and Srikant in 1994. Another popular algorithm is FP-growth, proposed by Han, Pei and Yin in 2000. Although the implementation of these techniques is quite impressive, I will skip the technical details in this blog post.
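
For readers who still want a feel for the core idea, here is a minimal and unoptimized sketch of the level-wise search behind Apriori. The function name `apriori`, the `min_support` parameter and the toy baskets are made up for illustration. This is neither the code used for this post nor a faithful reproduction of the original paper.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Share of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = {}
    k = 1
    while current:
        frequent.update(current)
        k += 1
        # Join step: unite frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
    return frequent


# Hypothetical supermarket baskets, for illustration only.
baskets = [
    {"butter", "bread", "milk", "cheese"},
    {"butter", "bread", "milk"},
    {"bread", "milk", "cheese"},
    {"toothbrush", "toothpaste"},
]
result = apriori(baskets, min_support=0.5)
```

The pruning step is what keeps the search manageable: an itemset can only be frequent if all of its subsets are frequent, so most candidates are discarded before their support is ever counted.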

Let us now take a look at the Electronic Data Gathering, Analysis and Retrieval system (EDGAR) of the U.S. Securities and Exchange Commission (SEC), where U.S. companies file important documents such as annual or quarterly reports. Nils already introduced EDGAR in a previous blog post. In 2003, EDGAR started logging all search traffic on its system; the logs are available for download in an anonymized form that looks as follows:

| ip | date | zone | cik | accession | ... |
|---|---|---|---|---|---|
| 171.66.213.bge | 2005-03-19 | 00:00:00 | 60086 | 0000060086-97-000009 | ... |
| 203.167.124.gjj | 2005-03-19 | 00:00:00 | 752737 | 0000935069-05-000317 | ... |
| 64.62.140.djf | 2005-03-19 | 00:00:00 | 752737 | 0000732717-04-000671 | ... |

The dots indicate columns I omitted for space reasons. Visit the EDGAR website for a complete list of variables. The IP addresses are anonymized by replacing the last three digits with characters. However, the same user always gets the same characters, which enables us to match EDGAR users by their IP addresses. The whole database comprises terabytes of information, making it hard for an average user to handle on an ordinary computer.

The structure of the log file makes it a promising application of frequent pattern mining. Which company filings do investors study together? What do EDGAR users read after reviewing the latest annual reports of Google, Facebook, Amazon and Microsoft? These are just a few of the questions one could answer.

Before identifying rules, the log files need to be cleaned of robot and crawler searches; I follow the procedure of Ryans (2017). After that, one has to decide how to specify the “customers” and the “items”. As customers I use the IP addresses; as items I use the central index key (CIK), a company identifier assigned by the SEC. Another possible item choice would be the accessed filing (CIK and accession number merged).
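
The mapping from log rows to "customers" and "items" can be sketched as follows. This is a minimal illustration that assumes the log is a CSV file with the columns `ip`, `cik` and `accession`, and that robot traffic has already been removed. The function name `build_baskets`, the `item` switch and all sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def build_baskets(rows, item="cik"):
    """Group log rows into one basket (a set of items) per anonymized IP."""
    baskets = defaultdict(set)
    for row in rows:
        if item == "filing":
            # Alternative item definition: CIK and accession number merged.
            key = f'{row["cik"]}_{row["accession"]}'
        else:
            key = row["cik"]
        baskets[row["ip"]].add(key)
    return dict(baskets)


# Made-up rows for illustration only; these are not real EDGAR log entries.
sample = io.StringIO(
    "ip,date,cik,accession\n"
    "10.0.0.abc,2016-01-15,1001,0000001001-16-000001\n"
    "10.0.0.abc,2016-01-15,1002,0000001002-16-000007\n"
    "10.0.0.abc,2016-01-15,1001,0000001001-15-000042\n"
    "10.0.0.xyz,2016-01-15,1002,0000001002-16-000007\n"
)
rows = list(csv.DictReader(sample))

by_cik = build_baskets(rows)                    # items are companies
by_filing = build_baskets(rows, item="filing")  # items are individual filings
```

Because each basket is a set, repeated visits by one IP to the same company count only once. For real log files one would pass the `csv.DictReader` directly instead of materializing a list, so that rows are streamed rather than held in memory.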

Although there is quite a wide range of interesting questions to answer with this procedure, I limit the analysis to some descriptives. Let us take a look at the 15th of each month in 2016. For each day, I identify sets of three companies that are frequently studied together. I report only the sets with the highest support and confidence, i.e. the highest probability of being viewed together on that day; support and confidence are reported in the fourth and fifth columns.

| Date | {#, #} | → {#} | Support | Confidence | Lift |
|---|---|---|---|---|---|
| 15.01.2016 | Microsoft Corp, Google Inc | Apple Inc | 0.0016 | 0.8031 | 55.72 |
| 15.02.2016 | HSBC USA Inc, Sterling Bancshares Inc | JPMorgan Chase | 0.0017 | 1.0000 | 494.17 |
| 15.03.2016 | Cel Sci Corp, PharmaCyte Biotech Inc | Agrium Inc | 0.0007 | 0.9230 | 68.58 |
| 15.04.2016 | World Fuel Services Corp, U.S. Auto Parts Network Inc | NCI Building Systems Inc | 0.0017 | 1.0000 | 550.22 |
| 15.05.2016 | General Electric Co, Altaba Inc | Unitedhealth Group Inc | 0.0006 | 0.9394 | 748.27 |
| 15.06.2016 | Turbodyne Technologies Inc, UBS Series Funds | Micro Imaging Technology Inc | 0.0016 | 1.0000 | 550.44 |
| 15.07.2016 | Barclays Bank PLC, Morgan Stanley | JPMorgan Chase | 0.0003 | 0.5897 | 36.48 |
| 15.08.2016 | Icahn Carl C, Pershing Square Cap. Man. | Greenlight Capital Inc | 0.0004 | 0.6667 | 214.62 |
| 15.09.2016 | HSBC USA INC, UBS AG | JPMorgan Chase | 0.0005 | 0.6769 | 49.71 |
| 15.10.2016 | Allied Motion Technologies Inc, Calix Inc | Puma Biotechnology Inc | 0.0033 | 1.0000 | 244.77 |
| 15.11.2016 | Frost Phillip MD et al, Mips Technologies Inc | Netflix Inc | 0.0002 | 0.8235 | 439.06 |
| 15.12.2016 | Barclays Bank PLC, Morgan Stanley | JPMorgan Chase | 0.0017 | 0.8898 | 57.98 |
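
Rules of the form {a, b} → {c} and their three measures could be computed along the following lines. This is a brute-force sketch under the usual textbook definitions: support is the share of baskets containing all three items, confidence is that share divided by the share containing the antecedent, and lift is confidence divided by the share containing the consequent. The function name, the toy baskets and the ranking are assumptions for illustration, not the code behind the table.

```python
from collections import Counter
from itertools import combinations


def rules_of_size_three(baskets, min_support=0.0):
    """List rules {a, b} -> {c} as (antecedent, consequent, support, confidence, lift)."""
    baskets = [frozenset(b) for b in baskets]
    n = len(baskets)

    # Count every 1-, 2- and 3-item subset of every basket (brute force).
    counts = Counter()
    for basket in baskets:
        for size in (1, 2, 3):
            for combo in combinations(sorted(basket), size):
                counts[frozenset(combo)] += 1

    rules = []
    for itemset, count in counts.items():
        if len(itemset) != 3 or count / n < min_support:
            continue
        for consequent in itemset:
            antecedent = itemset - {consequent}
            support = count / n
            confidence = count / counts[antecedent]
            lift = confidence / (counts[frozenset([consequent])] / n)
            rules.append(
                (tuple(sorted(antecedent)), consequent, support, confidence, lift)
            )

    # Rank by support first, then by confidence, highest first.
    rules.sort(key=lambda r: (r[2], r[3]), reverse=True)
    return rules


# Hypothetical baskets (one per IP address), for illustration only.
toy_baskets = [
    {"Microsoft", "Google", "Apple"},
    {"Microsoft", "Google", "Apple"},
    {"Microsoft", "Google"},
    {"Apple"},
    {"JPMorgan"},
]
rules = rules_of_size_three(toy_baskets)
```

Enumerating all triples per basket grows cubically with basket size, so on real log files a dedicated Apriori or FP-growth implementation would be the more realistic choice.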

Interestingly, the patterns indeed seem to be non-random, highlighting the potential of the collective wisdom of EDGAR users. Most of the linked firms are indeed industry peers. However, there are also rules linking firms from different industries, revealing networks other than those captured by standard industry classifications such as GICS or NAICS. In a similar approach, Lee, Ma, and Wang (2015) show that peer firms based on these rules dominate GICS6 industry peers in explaining cross-sectional variation in base firms' out-of-sample stock returns, valuation multiples, growth rates, R&D expenditures, leverage, and profitability ratios.

References:

  • Agrawal, R. and Srikant, R., 1994, September. Fast algorithms for mining association rules. In Proc. 20th Int. Conf. Very Large Data Bases, VLDB (Vol. 1215, pp. 487-499).
  • Han, J., Pei, J. and Yin, Y., 2000, May. Mining frequent patterns without candidate generation. In ACM SIGMOD Record (Vol. 29, No. 2, pp. 1-12). ACM.
  • Ryans, J., 2017. Using the EDGAR Log File Data Set. Available at SSRN: https://ssrn.com/abstract=2913612 or http://dx.doi.org/10.2139/ssrn.2913612
  • Lee, C.M., Ma, P. and Wang, C.C., 2015. Search-based peer firms: Aggregating investor perceptions through internet co-searches. Journal of Financial Economics, 116(2), pp. 410-431.