
Hierarchical Multi-Scale Gaussian Transformer for Stock Movement Prediction

Source: Notion | Last edited: 2023-10-31 | ID: c9e00da3-840...


@May I have come across a 2020 paper on Transformer models that addresses the same problems we have been working on:

  • Multi-timeframe (e.g. 15M + 30M + 60M OHLCV) training
  • Non-continuous time series due to trading gaps (e.g. weekends)

Transformer employs a multi-head self-attention mechanism to learn the relationships among different positions globally, thereby enhancing its capacity to learn long-term dependencies. Nevertheless, the canonical Transformer is designed for natural language tasks, and therefore it has a number of limitations in tackling finance prediction: (1) Locality imperception: the global self-attention mechanism in the canonical Transformer is insensitive to local context, whose dependencies are especially important in financial time series. (2) Hierarchy poverty: the point-wise dot-product self-attention mechanism lacks the capability of utilizing the hierarchical structure of financial time series (e.g. learning intra-day, intra-week and intra-month features independently). Intuitively, addressing those drawbacks will improve the robustness of the model and lead to better performance in financial time series prediction.

In this paper, we propose a new Transformer-based method for stock movement prediction. The primary highlight of the proposed model is the capability of capturing long-term, short-term as well as hierarchical dependencies of financial time series. For these aims, we propose several enhancements for the Transformer-based model:

(1) Multi-Scale Gaussian Prior enhances the locality of Transformer.

(2) Orthogonal Regularization avoids learning redundant heads in the multi-head self-attention mechanism.

(3) Trading Gap Splitter enables Transformer to learn intra-day features and intra-week features independently.

Numerical comparisons with other competitive time-series methods show the advantages of the proposed method.

In summary, the main contributions of our paper include:

  • We propose a Transformer-based method for stock movement prediction. To the best of our knowledge, this is the first work using Transformer model to tackle financial time series forecasting problems.
  • We propose several enhancements for the Transformer model, including Multi-Scale Gaussian Prior, Orthogonal Regularization, and Trading Gap Splitter.
  • In experiments, the proposed Transformer-based method significantly outperforms several state-of-the-art baselines, such as CNN, LSTM and ALSTM, on two real-world exchange markets.

OCR Version of Hierarchical Multi-Scale Gaussian Transformer for Stock Movement Prediction


https://snip.mathpix.com/amonic/pdfs/ba294ffc-39fc-456e-ab59-d3500a639bb0/view

1Password link valid 2023-06-22 → 2023-07-21: https://share.1password.com/s#rul1d8vJg9IB6J1L38SspTXCcgFOIDuX3j7456brKNg

Originally published:

https://www.ijcai.org/proceedings/2020/0640.pdf


\begin{abstract}
Predicting the price movement of finance securities like stocks is an important but challenging task, due to the uncertainty of financial markets. In this paper, we propose a novel approach based on the Transformer to tackle the stock movement prediction task. Furthermore, we present several enhancements for the proposed basic Transformer. Firstly, we propose a Multi-Scale Gaussian Prior to enhance the locality of Transformer. Secondly, we develop an Orthogonal Regularization to avoid learning redundant heads in the multi-head self-attention mechanism. Thirdly, we design a Trading Gap Splitter for Transformer to learn hierarchical features of high-frequency finance data. Compared with other popular recurrent neural networks such as LSTM, the proposed method has the advantage of mining extremely long-term dependencies from financial time series. Experimental results show our proposed models outperform several competitive methods in stock price prediction tasks for the NASDAQ exchange market and the China A-shares market.
\end{abstract}
\section{Introduction}
With the development of stock markets all around the world, the overall capitalization of stock markets worldwide had exceeded 68 trillion U.S. dollars by 2018. In recent years, more and more quantitative researchers have become involved in predicting the future trends of stocks, and they help investors make profitable decisions using state-of-the-art trading strategies. However, the uncertainty of stock prices makes this an extremely challenging problem in the field of data science.
Prediction of stock price movement belongs to the area of time series analysis, which models rich contextual dependencies using statistical or machine learning methods. Traditional approaches for stock price prediction are mainly based on fundamental factors, technical indices or statistical time series models, which capture explicit or implicit patterns from historical financial data. However, the performance of those methods is limited in two aspects. Firstly, they usually require expertise in finance. Secondly, these methods only capture simple patterns and simple dependence structures of financial time series. With the rise of artificial intelligence technology, more and more researchers attempt to solve this problem using machine learning algorithms, such as SVM [Cortes and Vapnik, 1995], Nearest Neighbors [Altman, 1992], and Random Forest [Breiman, 2001]. Recently, since deep neural networks have empirically exhibited powerful capabilities in solving highly uncertain and nonlinear problems, stock prediction research based on deep learning techniques has become more and more popular and shows significant advantages over traditional approaches.
Stock prediction research based on deep learning techniques can roughly be grouped into two categories: (1) fundamental analysis, and (2) technical analysis. Fundamental analysis constructs prediction signals using fundamental information such as news text, finance reports and analyst reports. For example, [Schumaker and Chen, 2009; Xu and Cohen, 2018; Chen et al., 2019] use natural language processing approaches to predict stock price movement by extracting latent features from market-related text information, such as news, reports, and even rumors. On the other hand, technical analysis predicts financial markets using historical data of stocks. One natural choice is the RNN family, such as RNN [Rumelhart et al., 1986], LSTM [Hochreiter and Schmidhuber, 1997], Conv-LSTM [Xingjian et al., 2015], and ALSTM [Qin et al., 2017]. However, the primary drawback of these methods is that the RNN family struggles to capture extremely long-term dependencies [Li et al., 2019], such as the dependencies across several months in financial time series.
Recently, a well-known sequence-to-sequence model called Transformer [Vaswani et al., 2017] has achieved great success on natural machine translation tasks. Distinct from RNN-based models, Transformer employs a multi-head self-attention mechanism to learn the relationships among different positions globally, thereby enhancing its capacity to learn long-term dependencies. Nevertheless, the canonical Transformer is designed for natural language tasks, and therefore it has a number of limitations in tackling finance prediction: (1) Locality imperception: the global self-attention mechanism in the canonical Transformer is insensitive to local context, whose dependencies are especially important in financial time series. (2) Hierarchy poverty: the point-wise dot-product self-attention mechanism lacks the capability of utilizing the hierarchical structure of financial time series (e.g. learning intra-day, intra-week and intra-month features independently). Intuitively, addressing those drawbacks will improve the robustness of the model and lead to better performance in financial time series prediction.
In this paper, we propose a new Transformer-based method for stock movement prediction. The primary highlight of the proposed model is the capability of capturing long-term, short-term as well as hierarchical dependencies of financial time series. For these aims, we propose several enhancements for the Transformer-based model: (1) Multi-Scale Gaussian Prior enhances the locality of Transformer. (2) Orthogonal Regularization avoids learning redundant heads in the multi-head self-attention mechanism. (3) Trading Gap Splitter enables Transformer to learn intra-day features and intra-week features independently. Numerical results comparing with other competitive methods for time series show the advantages of the proposed method.
In summary, the main contributions of our paper include:
- We propose a Transformer-based method for stock movement prediction. To the best of our knowledge, this is the first work using Transformer model to tackle financial time series forecasting problems.
- We propose several enhancements for the Transformer model, including Multi-Scale Gaussian Prior, Orthogonal Regularization, and Trading Gap Splitter.
- In experiments, the proposed Transformer-based method significantly outperforms several state-of-the-art baselines, such as CNN, LSTM and ALSTM, on two real-world exchange markets.
\section{Related Work}
Fundamental Analysis Machine learning for fundamental analysis developed with the explosion of alternative finance data, such as news, location and e-commerce data. [Schumaker and Chen, 2009] proposes a predictive machine learning approach for financial news article analysis using several different textual representations. [Weng et al., 2017] outlines a novel methodology to predict future movements in the value of securities after tapping data from disparate sources. [Xu and Cohen, 2018] uses a stochastic recurrent model (SRM) with an extra discriminator and an attention mechanism to address the adaptability of stock markets. [Chen et al., 2019] proposes to learn event extraction and stock prediction jointly.
Technical Analysis On the other hand, technical analysis methods extract price-volume information from historical trading data and use machine learning algorithms for prediction. For instance, [Lin et al., 2013] proposes an SVM-based approach for stock market trend prediction. Meanwhile, the LSTM neural network [Hochreiter and Schmidhuber, 1997] has been employed to model stock price movement. [Nelson et al., 2017] proposes an LSTM model to predict stock movement based on technical analysis indicators. [Zhang et al., 2017] proposes an LSTM model on historical data to discover multi-frequency trading patterns. [Wang et al., 2019] proposes a ConvLSTM-based Seq2Seq framework for stock movement prediction. [Qin et al., 2017] proposes an Attentive-LSTM model with an attention mechanism to predict stock price movement, and [Feng et al., 2019] further introduces a data augmentation approach based on the idea of adversarial training. However, [Li et al., 2019] points out that LSTM can only distinguish about 50 nearby positions, with an effective context size of about 200. This means that LSTM-based models suffer from difficulty in capturing extremely long-term dependencies in time series. To tackle this issue, we propose a Transformer-based method to better mine the intrinsic long-term and complex structures in financial time series.
\section{Problem Formulation}
Since the exact price of a stock is extremely hard to predict accurately, we follow the setup of [Walczak, 2001] and predict the stock price movement instead. Usually the stock movement prediction is treated as a binary classification problem, i.e., discretizing the stock movement into two classes (Rise or Fall). Formally, given the stock features $\mathbf{X}=\left[\mathbf{x}_{T-\Delta t+1}, \mathbf{x}_{T-\Delta t+2}, \ldots, \mathbf{x}_{T}\right] \in \mathbb{R}^{\Delta t \times F}$ (also written as $\mathbf{X}=\left[\mathbf{x}_{1}, \mathbf{x}_{2}, \ldots, \mathbf{x}_{N}\right] \in \mathbb{R}^{N \times F}$ in the rest of the paper for simplicity) in the latest $\Delta t$ time-steps, the prediction model $f_{\theta}(\mathbf{X})$ with parameters $\theta$ outputs the predicted movement label $y=\mathbb{I}\left(p_{T+k}>p_{T}\right)$, where $T$ denotes the target trading time, $F$ denotes the dimension of stock features and $p_{t}$ denotes the close price at time-step $t$. Briefly, the proposed model utilizes the historical data of a stock $s$ in the lag $[T-\Delta t+1, T]$ (where $\Delta t$ is a fixed lag size) to predict the movement class $y$ (0 for Fall, 1 for Rise) of the future $k$ time-steps.
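As a concrete illustration, the windowing and labeling above can be sketched in a few lines of NumPy. The price series and feature values below are toy numbers for illustration only, not real market data:

```python
import numpy as np

def movement_label(prices: np.ndarray, T: int, k: int) -> int:
    """y = I(p_{T+k} > p_T): 1 for Rise, 0 for Fall."""
    return int(prices[T + k] > prices[T])

def make_window(features: np.ndarray, T: int, lag: int) -> np.ndarray:
    """X = [x_{T-lag+1}, ..., x_T], shape (lag, F)."""
    return features[T - lag + 1 : T + 1]

# toy close prices and per-step features (illustrative values only)
prices = np.array([10.0, 10.2, 10.1, 10.4, 10.3, 10.6])
features = np.random.randn(6, 4)        # F = 4 features per time-step

X = make_window(features, T=4, lag=3)   # window over steps 2..4, shape (3, 4)
y = movement_label(prices, T=4, k=1)    # p_5 = 10.6 > p_4 = 10.3, so y = 1
```

Each training example is thus a fixed-length window paired with a binary label derived from the close price $k$ steps ahead.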


\section{Proposed Method}
In this section, we first describe the basic Transformer model we designed. Then we introduce the proposed enhancements of Transformer for financial time series.
\subsection{Basic Transformer for Stock Movement Prediction}
In our work, we instantiate $f_{\theta}(\cdot)$ with a Transformer-based model. To adapt to the stock movement prediction task, which takes time series as inputs, we design a variant of Transformer with an encoder-only structure which consists of $L$ blocks of multi-head self-attention layers and position-wise feed forward layers (see Figure 1). Given the input time series $\mathbf{X}=\left[\mathbf{x}_{1}, \mathbf{x}_{2}, \ldots, \mathbf{x}_{N}\right] \in \mathbb{R}^{N \times F}$, we first add the position encoding and adopt a linear layer with a $\tanh$ activation function as follows:
$$
\overline{\mathbf{X}}=\sigma_{\tanh}\left(\mathbf{W}^{(I)}\left[\operatorname{PositionEncoding}(\mathbf{X})\right]\right) .
$$
Then multi-head self-attention layers take $\overline{\mathbf{X}}$ as input, and are computed by
$$
\mathbf{Q}_{h}=\mathbf{W}_{h}^{(Q)} \overline{\mathbf{X}}, \quad \mathbf{K}_{h}=\mathbf{W}_{h}^{(K)} \overline{\mathbf{X}}, \quad \mathbf{V}_{h}=\mathbf{W}_{h}^{(V)} \overline{\mathbf{X}}
$$
where $h=1, \ldots, H$ and $\mathbf{W}_{h}^{(Q)}, \mathbf{W}_{h}^{(K)}$ and $\mathbf{W}_{h}^{(V)}$ are learnable weight matrices for Query, Key and Value, respectively (refer to [Vaswani et al., 2017] for more details). Then the attention score matrix $\mathbf{a}_{h} \in \mathbb{R}^{N \times N}$ of the $h^{\text{th}}$ head is computed by
$$
\mathbf{a}_{h}=\operatorname{softmax}\left(\frac{\mathbf{Q}_{h} \mathbf{K}_{h}^{T}}{\sqrt{d_{k}}} \cdot \mathbf{M}\right)
$$
where $\mathbf{M}$ is a position-wise mask matrix to filter out temporal attention, so as to avoid future information leakage. Afterwards, the output of the $h^{\text {th }}$ head is a weighted sum defined as follows:
$$
\left[\mathbf{O}_{h}\right]_{i}=\sum_{j=1}^{N}\left(\mathbf{a}_{h}\right)_{i, j} \cdot\left[\mathbf{V}_{h}\right]_{j} .
$$
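The masked single-head computation in Eqs. 2–4 can be sketched in NumPy as follows. This is a minimal illustration, not the paper's implementation: the row-vector convention ($\mathbf{Q}_h = \overline{\mathbf{X}}\mathbf{W}_h^{(Q)}$), the weight shapes, and the use of an additive $-\infty$ mask for $\mathbf{M}$ are all assumptions made for the sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(X_bar, Wq, Wk, Wv):
    """One attention head: scaled dot-product scores with a causal mask
    that blocks future information leakage, then a weighted sum over V."""
    N = X_bar.shape[0]
    Q, K, V = X_bar @ Wq, X_bar @ Wk, X_bar @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    causal = np.tril(np.ones((N, N), dtype=bool))   # the mask M
    scores = np.where(causal, scores, -np.inf)      # future positions masked out
    a = softmax(scores, axis=-1)                    # attention matrix a_h, N x N
    return a @ V, a                                 # head output O_h and a_h

rng = np.random.default_rng(0)
N, F, d_k = 6, 4, 8
X_bar = rng.normal(size=(N, F))
Wq, Wk, Wv = (rng.normal(size=(F, d_k)) for _ in range(3))
O_h, a_h = masked_self_attention(X_bar, Wq, Wk, Wv)
# each row of a_h sums to 1 and puts zero weight on future positions
```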
The final output of the multi-head attention layers is the concatenation of all heads, $\mathbf{O}=\left[\mathbf{O}_{1}, \mathbf{O}_{2}, \ldots, \mathbf{O}_{H}\right]$. Afterward, the position-wise feed forward layer takes $\mathbf{O}$ as input and transforms it to $\mathbf{Z}$ by two fully-connected layers and a ReLU activation layer. Upon the output $\mathbf{z}_{i}$ of the last self-attention layer, a temporal attention layer [Qin et al., 2017] is deployed to aggregate the latent features from each position as $\mathbf{m}=\sum_{i=1}^{N} \alpha_{i} \mathbf{z}_{i}$. Then the scalar prediction score $\hat{y}$ is computed by a fully-connected layer and a sigmoid transformation:
$$
\hat{y}=\operatorname{sigmoid}\left(\mathbf{W}_{f c} \mathbf{m}\right)
$$
Our ultimate goal is to maximize the log-likelihood between $\hat{y}$ and $y$ via the following loss function:
$$
\mathcal{L}_{C E}=(1-y) \log (1-\hat{y})+y \log (\hat{y})
$$
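The objective in Eq. 5 is the (unnegated) binary cross-entropy; a minimal sketch:

```python
import math

def log_likelihood(y: int, y_hat: float) -> float:
    """L_CE = (1 - y) log(1 - y_hat) + y log(y_hat); training maximizes this
    (equivalently, minimizes its negation, the binary cross-entropy)."""
    return (1 - y) * math.log(1 - y_hat) + y * math.log(y_hat)

# a confident correct prediction scores close to 0; a wrong one is heavily penalized
good = log_likelihood(1, 0.9)   # log(0.9) ≈ -0.105
bad = log_likelihood(1, 0.1)    # log(0.1) ≈ -2.303
```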
\subsection{Enhancing Locality with Multi-Scale Gaussian Prior}
Recently, Transformer has exhibited a powerful capability of extracting global patterns in the natural language processing field. However, the self-attention mechanism in Transformer considers global dependencies with very weak position information. Note that the position information encodes the temporally varying patterns in time series, which are highly important. To address this, we incorporate a Multi-Scale Gaussian Prior into the canonical multi-head self-attention mechanism, with the intuition that the relevance between two positions decays with the temporal distance between them.
To pay more attention to closer time-steps, we add Gaussian prior biases to the attention score matrices, based on the assumption that such scores obey Gaussian distributions. Note that this operation is equivalent to multiplying the original attention distribution by a Gaussian distribution mask (see [Guo et al., 2019] for the proof). In detail, we transform Eq. 3 to the following form by adding Gaussian biases:
$$
\mathbf{a}_{h}=\operatorname{softmax}\left[\left(\frac{\mathbf{Q}_{h} \mathbf{K}_{h}^{T}}{\sqrt{d_{k}}}+\mathbf{B}_{h}^{(G)}\right) \cdot \mathbf{M}\right]
$$
where $\mathbf{B}_{h}^{(G)} \in \mathbb{R}^{N \times N}$ is a matrix computed by
$$
\left[\mathbf{B}_{h}^{(G)}\right]_{i, j}= \begin{cases}\exp \left(-\frac{(j-i)^{2}}{2 \sigma_{h}^{2}}\right) & j \leq i ; \\ 0 & j>i .\end{cases}
$$
Note that we allow $\sigma_{h}$ in $\mathbf{B}_{h}^{(G)}$ to differ across heads of the multi-head self-attention layer.
Besides, we also give an empirical rule for setting $\sigma_{h}$. Suppose we want to pay more attention to the $D_{h}$ closest time-steps; the scale can then be empirically set as $\sigma_{h}=D_{h}$. In this way, we allow different $D_{h}$ in different attention heads in order to provide the Multi-Scale Gaussian Prior.
In finance, the temporal features from the last 5, 10, 20 or 40 days are usually considered in trading strategies. That means, for a 4-head self-attention layer, we can empirically assign the window-size set $\mathbf{D}=\{5,10,20,40\}$ to $\sigma_{h}$ with $h=1, \ldots, 4$, respectively, as shown in Figure 2. In conclusion, the proposed Multi-Scale Gaussian Prior enables Transformer to learn multi-scale localities from financial time series.
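The bias matrix of Eq. 6 and its multi-scale instantiation can be sketched directly (a minimal NumPy version; the choice of $N=130$ below is an illustrative window of one trading week of 15-minute bars):

```python
import numpy as np

def gaussian_bias(N: int, sigma: float) -> np.ndarray:
    """B_h^(G): exp(-(j-i)^2 / (2 sigma^2)) for j <= i, and 0 for j > i (Eq. 6)."""
    i = np.arange(N)[:, None]
    j = np.arange(N)[None, :]
    B = np.exp(-((j - i) ** 2) / (2.0 * sigma**2))
    return np.where(j <= i, B, 0.0)

# multi-scale priors for a 4-head layer: sigma_h = D_h in {5, 10, 20, 40}
N = 130  # one trading week of 15-minute bars (26 * 5)
biases = [gaussian_bias(N, D) for D in (5, 10, 20, 40)]
# head 1 (sigma=5) decays fastest around the diagonal; head 4 (sigma=40)
# spreads its bias over the widest local window
```

Adding these biases to the pre-softmax scores boosts nearby (past) positions by an amount that shrinks with temporal distance, at a different rate per head.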
\subsection{Orthogonal Regularization for Multi-Head Self-Attention Mechanism}
With the proposed Multi-Scale Gaussian Prior, we let different heads learn different temporal patterns in the multi-head attention layer. However, some previous research [Tao et al., 2018; Li et al., 2018; Lee et al., 2019] claims that the canonical multi-head self-attention mechanism tends to learn redundant heads. To enhance the diversity between heads, we introduce an orthogonal regularization with regard to the weight tensor $\mathbf{W}_{h}^{(V)}$ in Eq. 2. Specifically, we first form the tensor $\mathbf{W}^{(V)}=\left[\mathbf{W}_{1}^{(V)}, \mathbf{W}_{2}^{(V)}, \ldots, \mathbf{W}_{H}^{(V)}\right]$ by concatenating $\mathbf{W}_{h}^{(V)}$ of all heads. Note that the size of $\mathbf{W}^{(V)}$ is $H \times F \times d_{v}$, where $d_{v}$ denotes the last dimension of $\mathbf{V}_{h}$. Then we flatten the tensor $\mathbf{W}^{(V)}$ to a matrix $\mathbf{A}$ of size $H \times\left(F * d_{v}\right)$, and further normalize it as $\widetilde{\mathbf{A}}=\mathbf{A} /\|\mathbf{A}\|_{2}$. Finally, the penalty loss is computed by
$$
\mathcal{L}_{p}=\left\|\widetilde{\mathbf{A}} \widetilde{\mathbf{A}}^{T}-I\right\|_{F}
$$
where $\|\cdot\|_{F}$ denotes the Frobenius norm of a matrix and $I$ stands for an identity matrix. We then add the penalty loss to the original loss with a trade-off hyper-parameter $\gamma$ as follows:
$$
\mathcal{L}=\mathcal{L}_{C E}+\gamma \mathcal{L}_{p}
$$
For simplicity of exposition, we omit the index of the multi-head self-attention layer above. In our experiments, we sum up the penalty losses from each multi-head self-attention layer as the final penalty loss:
$$
\mathcal{L}_{p}=\mathcal{L}_{p}^{(1)}+\mathcal{L}_{p}^{(2)}+\ldots+\mathcal{L}_{p}^{(L)}
$$
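A minimal sketch of the per-layer penalty, assuming $\|\mathbf{A}\|_{2}$ denotes the matrix spectral norm (a per-row normalization would be an equally plausible reading of the paper's notation):

```python
import numpy as np

def orthogonal_penalty(W_v: np.ndarray) -> float:
    """L_p = || A_tilde A_tilde^T - I ||_F, where A is the (H, F*d_v)
    flattening of the per-head value weights and A_tilde = A / ||A||_2
    (||.||_2 read here as the matrix spectral norm -- an assumption)."""
    H = W_v.shape[0]
    A = W_v.reshape(H, -1)
    A_tilde = A / np.linalg.norm(A, 2)
    gram = A_tilde @ A_tilde.T
    return float(np.linalg.norm(gram - np.eye(H), ord="fro"))

# redundant (identical) heads are penalized; orthogonal heads are not
W_same = np.ones((2, 3, 4))      # two identical heads
W_orth = np.zeros((2, 2, 2))     # two heads with orthonormal flattened rows
W_orth[0, 0, 0] = 1.0
W_orth[1, 0, 1] = 1.0
```

With `W_orth` the flattened rows are orthonormal, so the Gram matrix equals the identity and the penalty vanishes; with `W_same` it is strictly positive, which is what pushes heads apart during training.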
\subsection{Trading Gap Splitter}
The input of the model is nominally a continuous time series. However, due to trading gaps, the input time series is essentially NOT continuous. Take the 15-minute data from the NASDAQ stock market as an example: one trading day contains 26 15-minute time-steps and one trading week contains 5 trading days. This means there are inter-day and inter-week trading gaps. However, when the basic Transformer model is applied to this data, the self-attention layer inside treats all time-steps equally and omits the implicit inter-day and inter-week trading gaps. To solve this problem, we design a new hierarchical self-attention mechanism for the Transformer model to learn the hierarchical features of stock data (see Figure 3 (a)).
Take a 3-block Transformer model as an example: we aim to learn the hierarchical features of stock data in the order "intra-day $\rightarrow$ intra-week $\rightarrow$ global". To do so, we add two extra position-wise masks to the first and second self-attention blocks respectively, in order to limit their attention scopes. Formally, we modify Eq. 7 to the following form:
$$
\mathbf{a}_{h}=\operatorname{softmax}\left[\left(\frac{\mathbf{Q}_{h} \mathbf{K}_{h}^{T}}{\sqrt{d_{k}}}+\mathbf{B}_{h}^{(G)}\right) \cdot \mathbf{M}^{(H)} \cdot \mathbf{M}\right]
$$
where $\mathbf{M}^{(H)}$ is an $N \times N$ matrix filled with $-\infty$ whose diagonal is composed of contiguous sub-matrices filled with 0. The $\mathbf{M}^{(H)}$ for the first and second self-attention blocks are shown in Figure 3 (c). Specifically, the size of the sub-matrices in $\mathbf{M}^{(H)}$ for the first block is $26 \times 26$, since one trading day contains 26 15-minute time-steps, and the size of the sub-matrices for the second block changes to $130 \times 130$ ($26 \times 5$), since one trading week contains 5 trading days. In this way, the first and second self-attention blocks learn the intra-day and intra-week features of stock data, respectively. Moreover, for the last self-attention block, we keep the original attention mechanism without $\mathbf{M}^{(H)}$ to learn global features of stock data. As a result, the Transformer model with the proposed hierarchical attention mechanism avoids suffering from the trading gaps. Note that all attention heads in the same multi-head self-attention layer share the same $\mathbf{M}^{(H)}$.
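The block-diagonal structure of $\mathbf{M}^{(H)}$ can be sketched as below. Note the equation above writes the mask multiplicatively; the sketch uses the standard equivalent of adding $-\infty$ entries to the scores before the softmax:

```python
import numpy as np

def hierarchical_mask(N: int, block: int) -> np.ndarray:
    """M^(H): zero sub-matrices of size `block` along the diagonal, -inf
    elsewhere. Added to the attention scores before the softmax, it confines
    each time-step's attention to its own day (or week)."""
    M = np.full((N, N), -np.inf)
    for start in range(0, N, block):
        end = min(start + block, N)
        M[start:end, start:end] = 0.0
    return M

N = 130                               # one trading week: 26 * 5 fifteen-minute bars
M_day = hierarchical_mask(N, 26)      # first block: intra-day attention scope
M_week = hierarchical_mask(N, 130)    # second block: intra-week attention scope
# the third block applies no M^(H), so its attention scope is global
```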

Let’s formalize the concepts, framework, and techniques mentioned in the provided text using logical and mathematical expressions.

  1. Sequence-to-Sequence Model (Transformer): The Transformer model is a sequence-to-sequence model that employs multi-head self-attention mechanisms to capture global relationships among different positions in a sequence. It is represented by the function $f_{\theta}(\cdot)$, which takes input $\mathbf{X}$ (a sequence) and produces an output $\hat{y}$ (predicted movement label). The model parameters are denoted by $\theta$.

  2. Local Context Sensitivity: The canonical Transformer model is insensitive to local context, which is crucial for analyzing financial time series. To address this limitation, enhancements are proposed.

  3. Hierarchy Utilization: The point-wise dot-product self-attention mechanism in the Transformer model lacks the ability to utilize hierarchical structures present in financial time series. The proposed enhancements aim to capture intraday, intra-week, and intra-month features independently.

  4. Multi-Scale Gaussian Prior: The Multi-Scale Gaussian Prior is introduced to enhance the locality of the Transformer model. It incorporates biases of Gaussian prior into the attention scores to emphasize the relevance of closer time steps. The attention score matrix $\mathbf{a}_{h}$ is computed by adding Gaussian biases to the original attention distribution.

  5. Orthogonal Regularization: Orthogonal Regularization is employed to increase diversity between different heads in the multi-head self-attention mechanism. It involves inducing an orthogonal regularization penalty $\mathcal{L}_{p}$ on the weight tensor $\mathbf{W}_{h}^{(V)}$ to avoid learning redundant heads.

  6. Trading Gap Splitter: To handle the trading gaps in financial time series, a hierarchical self-attention mechanism is introduced. The model is designed to learn intra-day, intra-week, and global features by employing position-wise masks in different self-attention blocks.

  7. Stock Movement Prediction: The goal of the proposed method is to predict the movement (rise or fall) of stock prices. The prediction is treated as a binary classification problem, where the model utilizes historical data of a stock to predict the movement label of future time steps.

  8. Model Loss and Evaluation: The model’s performance is evaluated using a loss function $\mathcal{L}_{CE}$, which computes the log-likelihood between the predicted movement $\hat{y}$ and the actual movement $y$. The final loss function $\mathcal{L}$ incorporates an orthogonal regularization penalty term $\mathcal{L}_{p}$ with a trade-off hyper-parameter $\gamma$. The proposed method is compared with state-of-the-art baselines, such as CNN, LSTM, and ALSTM, using numerical results.

These formalizations provide a structured representation of the concepts and techniques mentioned in the text, allowing for a clearer understanding of the proposed approach.

In the given context, “local context” refers to the information or dependencies within a specific region or vicinity of a sequence. In the case of financial time series, it refers to the importance of considering the relationships and dependencies among neighboring time steps or positions in the sequence.

In the context of the Transformer model, which is primarily designed for natural language tasks, the global self-attention mechanism used in the canonical Transformer is insensitive to local context. It focuses more on capturing long-range dependencies across the entire sequence. However, in financial time series, the relationships and dependencies among nearby time steps are often crucial for accurate prediction.

The mentioned limitation of the canonical Transformer in perceiving local context implies that the model may not effectively capture the short-term dependencies and patterns that are significant in financial time series analysis. Therefore, enhancing the model to consider local context becomes important in the domain of financial prediction to improve the robustness and performance of the model.