In this blog post, we forecast the Bitcoin price from text data collected on Twitter and Reddit. The observed Bitcoin price is set by supply and demand; if we model the demand side and assume that the supply side stays roughly stable, we may end up with strong forecasting results. Social media data is used extensively in the financial industry and requires algorithms that scale. However, it is unstructured and noisy, and supervised learning techniques are strongly domain dependent and need a massive amount of labeled training data to generalize well. We tackle this problem by mapping the vectorized text data and its sentiment directly to future price movements of Bitcoin. Economic theory holds that the price of an asset combines its utility value and its speculative value. In 2017, we watched the crypto-currency market skyrocket; in the absence of a blockchain killer application so far, it is safe to assume that this rally was driven roughly 90% by speculation and only 10% by utility. This assumption strongly motivates our project.
In the first part, we collect text data from Twitter (#bitcoin) and the Bitcoin subreddit. The price data for Bitcoin is downloaded via the coindesk.com API. The vocabulary we obtain from Twitter and Reddit is preprocessed using a prebuilt dictionary of the top 10k English words; every word not included in that dictionary is dropped. In the final step, we vectorize the text data: each word is translated into a vector such that words sharing common contexts in the corpus lie close to one another in vector space. This vectorization gives us a machine-readable representation of words that we can correlate with the price data. In addition, we label the price data from coindesk.com with 1 each time we observe a positive price change and with 0 for each downward movement.
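As a sketch, the two preprocessing steps above fit in a few lines of Python. The `top_10k` vocabulary here is a tiny hypothetical stand-in for the real 10k-word list, and `filter_tokens`/`label_price_moves` are illustrative names, not code from our repository:

```python
import numpy as np

# Hypothetical mini-vocabulary; in practice this is loaded from a 10k-word list.
top_10k = {"bitcoin", "price", "is", "going", "to", "the", "buy", "sell", "up"}

def filter_tokens(tokens, vocab):
    """Drop every token that is not in the reference vocabulary."""
    return [t for t in tokens if t.lower() in vocab]

def label_price_moves(prices):
    """Label each price change: 1 for an upward move, 0 otherwise."""
    changes = np.diff(prices)
    return (changes > 0).astype(int)

tweet = "bitcoin price is going to the moooon".split()
print(filter_tokens(tweet, top_10k))  # "moooon" is not in the vocabulary and is dropped

prices = [6500.0, 6550.0, 6400.0, 6700.0]
print(label_price_moves(prices).tolist())  # [1, 0, 1]
```

The same binary labels later serve as the ground truth the network is trained against.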
In the second part, we build our model as a Recurrent Neural Network (RNN). Our RNN of choice processes sequential patterns with Long Short-Term Memory (LSTM) cells. An LSTM cell carries a memory state that allows it to write, read, and forget information.
The memory state is protected by three gates: the input gate, the output gate, and the forget gate. Each gate is composed of a sigmoid neural-net layer and a point-wise multiplication operation. Depending on the output of the sigmoid layer, which acts as a gatekeeper, the cell determines how much information to let through. If you want to master LSTMs entirely, I suggest building an RNN-LSTM network from scratch; no worries, we will cover this in a future post.
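To make the gate mechanics concrete, here is a minimal NumPy sketch of a single LSTM forward step. The weight shapes and the `lstm_step` function are illustrative assumptions, not the internals of any particular library; Keras handles all of this for us:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step: three sigmoid gates decide what the memory state c
    forgets, what it writes, and what it exposes to the hidden state h."""
    z = W @ x + U @ h_prev + b      # joint pre-activation, split four ways
    f, i, o, g = np.split(z, 4)
    f = sigmoid(f)                  # forget gate: what to erase from memory
    i = sigmoid(i)                  # input gate: what to write into memory
    o = sigmoid(o)                  # output gate: what to read out
    g = np.tanh(g)                  # candidate values to write
    c = f * c_prev + i * g          # updated memory state
    h = o * np.tanh(c)              # new hidden state (point-wise products)
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = rng.normal(size=(4 * n_hid, n_in))
U = rng.normal(size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.normal(size=n_in), h, c, W, U, b)
print(h.shape, c.shape)  # (4,) (4,)
```

Note how the forget gate multiplies the old memory point-wise: a value near 0 erases that memory slot, a value near 1 keeps it intact.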
Given the vectorized text data, we feed it in batches into our model. We build the model in Keras, which lets us create complex models on the TensorFlow backend easily. For each iteration, the network predicts a future price movement and compares it with the ground-truth value. The deviation between the two is the error, which gradient descent uses to update the network's weights: the gradient tells us how much to change each weight to improve the network's forecasting performance. To avoid over-fitting, which is very likely given the high model complexity, one should use standard regularization techniques such as dropout and L2.
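What Keras does for us on each batch can be sketched in plain NumPy for a single sigmoid output unit: predict, compare with the ground truth, and follow the gradient of the loss (with an L2 penalty) downhill. The function name, the random batch, and the hyper-parameters are all illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(w, x, y, lr=0.1, l2=0.01):
    """One gradient-descent update for a single sigmoid unit."""
    p = sigmoid(x @ w)              # predicted probability of an upward move
    grad = x.T @ (p - y) / len(y)   # gradient of the cross-entropy loss
    grad += l2 * w                  # L2 penalty pulls the weights toward zero
    return w - lr * grad

rng = np.random.default_rng(1)
x = rng.normal(size=(32, 8))             # one batch of 32 vectorized texts
y = (rng.random(32) > 0.5).astype(float) # ground-truth up/down labels
w = np.zeros(8)
for _ in range(100):                     # repeated updates shrink the error
    w = sgd_step(w, x, y)
```

Dropout works in the same spirit as the L2 term here: both constrain the weights so the model cannot simply memorize the training batches.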
The preliminary results show a forecasting accuracy higher than 50%, which implies that, in theory, we might be able to derive profitable trade signals from our model. The code can be found in our git repository; feel free to play around with the model architecture and hyper-parameters.