Stock Price Prediction with Market Style Clustering

Clustering methods to detect and trade market regimes and incorporating news volume and sentiment into trading strategies.

Hi! Here's Iván with this week's exciting newsletter, brimming with insights and discoveries on building robust risk models and trading strategies using Machine Learning. This edition focuses on using clustering methods to detect and trade market regimes, as well as incorporating news volume and sentiment into our trading strategies.

  • 🕹️ 2 Academic Articles: Dive into groundbreaking research featuring actionable ideas that are reshaping our understanding of how to apply ML/DL in creating successful investment and trading strategies.

  • 💊 Video Insight: A user-friendly guide on how the DBSCAN clustering method works.

  • ✔️ Market Insight: Best market reflection shared on my LinkedIn/Twitter during the last week: the financial market is always first!

  • 🥐 Quick Learning Methodologies: A brief summary of the clustering methods that you must include in your quantitative arsenal.

Academic Insights

“Stock Price Prediction with Market Style Clustering”

In my experience with various investment firms, one of the most useful techniques is incorporating unsupervised learning (clustering) into market-timing time series trading strategies, typically before the prediction phase.

This paper illustrates how to effectively combine price inputs and sentiment data with clustering prior to forecasting.

Let me try to sum up the paper's main ideas 👇

💡 Input Features:

---> First (Price Data): The researchers used past prices and technical indicators derived from those prices as the initial set of inputs (“Prices”).

---> Second (Sentiment Data): By analyzing news, they obtained a second set of inputs, deriving sentiment from news articles (“News”). They utilized two standard dictionaries for sentiment features: Loughran and McDonald, and AffectiveSpace, transforming news articles into sentiment embeddings.

💡 Model

---> First (Clustering): The authors combined the “Prices” and “News” inputs. Using both, they clustered each observation of “n” past prices and corresponding news into market styles (clusters) using a standard hierarchical clustering method, along with silhouette scores to choose the cluster hyperparameters (see image below).

---> Second (Training): They assumed that the market style on day t + 1 would be the same as on day t. The past "m" samples sharing the same market style as day t were used to train the models. This approach led to the training of a distinct model for each market style. In their empirical results, they chose m=250 (one trading year), meaning they selected one year of data with the same market style to train a stock prediction model.

---> Third (Dependent Variable): With “Prices&News” as inputs, they targeted a binary classification problem {0,1}, labeling an observation positive (“1”) when the price trend rises over the following "w" time steps and negative (“0”) when it falls. They used a kernel-based SVM for this prediction task, which involved predicting the direction of the next trading day’s return (w = 1).
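To make the three steps more tangible, here is a minimal sketch in Python with scikit-learn. It is not the authors' code: the function names (`pick_market_styles`, `predict_next_day`), the candidate range of cluster counts, the feature scaling, the single-class fallback, and the synthetic data are all my own assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def pick_market_styles(features, k_candidates=range(2, 7)):
    """Hierarchical clustering; keep the cluster count with the best silhouette."""
    scaled = StandardScaler().fit_transform(features)
    best_k, best_score, best_labels = None, -np.inf, None
    for k in k_candidates:
        labels = AgglomerativeClustering(n_clusters=k).fit_predict(scaled)
        score = silhouette_score(scaled, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels


def predict_next_day(features, up_next_day, styles, m=250):
    """Train an RBF SVM on the last m past rows sharing today's style.

    features: (n, d) array, last row is today.
    up_next_day: (n - 1,) array of 0/1 labels for every row except today.
    styles: (n,) array of market-style labels.
    """
    today = len(features) - 1
    same_style = np.flatnonzero(styles[:today] == styles[today])[-m:]
    y_train = up_next_day[same_style]
    if np.unique(y_train).size < 2:
        # Assumed fallback: an SVM needs both classes, so return the
        # majority label (of this style if available, else of all history).
        source = y_train if y_train.size else up_next_day
        return int(round(source.mean()))
    scaler = StandardScaler().fit(features[same_style])
    model = SVC(kernel="rbf").fit(scaler.transform(features[same_style]), y_train)
    return int(model.predict(scaler.transform(features[[today]]))[0])


# Synthetic stand-in for the combined "Prices" and "News" features.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [6.0, 6.0], [-6.0, 6.0]])
regime = rng.integers(0, 3, size=400)
features = centers[regime] + rng.normal(size=(400, 2))
up_next_day = (rng.random(399) < 0.5).astype(int)

k, styles = pick_market_styles(features)
print("market styles:", k, "| prediction for tomorrow:",
      predict_next_day(features, up_next_day, styles))
```

In a real backtest you would refit the clustering using only data available up to each prediction date; otherwise the market styles leak future information.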

💡 Results

In this study, results are compared against other published benchmark models, with the authors demonstrating "better performance".

However, in my view, the main takeaway is how they combine price and sentiment data into clusters to train different “sub-models.”

“Portfolio Selection using News Volume and Sentiment” 

The paper presents useful hints on how to include both news volume and sentiment in trading strategies by leveraging unsupervised machine learning.

I'd say it is especially relevant for long-short strategies, and, of course, to include both measures (volume and sentiment) as additional features for predictive models alongside your already working inputs.

Let me summarize the paper in a nutshell: 👇

🔔 Data

-->The study uses data from 29 DJIA companies (all 30 DJIA companies except Apple, as it was not part of the index during the entire study period).

-->They collected news data (including volume and sentiment) directly from the RavenPack News Analytics database.

🔔 Main Idea

-->The main idea is to assume that the return in week 't' depends on information from volume and sentiment in week 't-1'.

-->To do this, they observe the number of news articles and sentiment data for each stock in the test dataset, and then search for the 'k' weeks in the training dataset with the most similar number of news articles and sentiment using the k-nearest neighbors (kNN) algorithm.

-->They calculate the average return for the following week in each of the 'k' weeks with similar news and sentiment in the training set. Finally, they use this average return as the forecasted return for the next week in the test data.

-->Lastly, they assign weights based on the expected returns for the 29 stocks in the portfolio and rebalance it on a weekly basis.
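The steps above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names (`knn_forecast`, `long_only_weights`), the feature scaling, the value of k, and especially the weighting rule are hypothetical, since the summary above does not spell out how the paper maps expected returns to weights. The data is synthetic, not RavenPack data.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler


def knn_forecast(train_features, train_next_returns, query_features, k=5):
    """Average next-week return of the k most similar training weeks.

    train_features: (n_weeks, 2) array of [news volume, sentiment] in week t-1.
    train_next_returns: (n_weeks,) array with the return in week t.
    query_features: (n_queries, 2) array of weeks to forecast.
    """
    train_next_returns = np.asarray(train_next_returns, dtype=float)
    scaler = StandardScaler().fit(train_features)
    nn = NearestNeighbors(n_neighbors=k).fit(scaler.transform(train_features))
    _, idx = nn.kneighbors(scaler.transform(query_features))
    return train_next_returns[idx].mean(axis=1)


def long_only_weights(forecasts):
    """Assumed rule: weights proportional to positive forecasts."""
    positive = np.clip(np.asarray(forecasts, dtype=float), 0.0, None)
    total = positive.sum()
    if total == 0.0:
        # No positive forecast: fall back to equal weights.
        return np.full(len(positive), 1.0 / len(positive))
    return positive / total


# Synthetic example: one forecast per stock, then portfolio weights.
rng = np.random.default_rng(0)
n_stocks, n_weeks = 4, 200
forecasts = []
for _ in range(n_stocks):
    history = rng.normal(size=(n_weeks, 2))          # [volume, sentiment]
    next_returns = rng.normal(scale=0.02, size=n_weeks)
    this_week = rng.normal(size=(1, 2))
    forecasts.append(knn_forecast(history, next_returns, this_week, k=10)[0])

print(long_only_weights(forecasts))
```

Repeating this every week with the latest news data gives the weekly rebalancing described above.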

🔔 Results

They use different out-of-sample periods. For instance, in the image, the out-of-sample kNN portfolio has grown to 108.8% of its initial value, representing a cumulative profit of 8.8%. In contrast, the DJIA index has only gained 1.8% in cumulative profit, making the kNN strategy approximately five times more profitable.

AI-Essentials: Step-by-Step Tutorial

🚀 A user-friendly guide on how the DBSCAN method works. Sometimes, video explanations are very helpful for visualizing these clustering algorithms.

The post: Market Ideas

Welcome to the 2024 election cycle...

Will we ride the same cycle? 👇

Quick: Learning

“Clustering methods for Quant Trading Strategies” 

It is a fact that clustering is a must in many trading strategies.

To name a few, it is useful for 👇

--> Market Timing Time Series Strategies: Here, the idea is to cluster time series states and then use ML to predict the position of the time series in the next period (acting as a kind of denoising method).

--> Statistical Arbitrage: Clustering groups assets that tend to move together; the more coherent the clusters, the more effective the long/short strategy built on them.

--> Portfolio Optimization: Hierarchical clustering is widely used to improve out-of-sample portfolio performance, for instance, with Hierarchical Risk Parity (HRP).
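As a small illustration of the portfolio use case, here is a sketch of the clustering step that HRP-style methods start from: turn the correlation matrix into a distance matrix and build a hierarchical tree on it. The function name `correlation_clusters`, the choice of single linkage, and the synthetic two-factor returns are my assumptions; this covers only the tree-building step, not the full HRP allocation.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, leaves_list, linkage
from scipy.spatial.distance import squareform


def correlation_clusters(returns, n_clusters):
    """Cluster assets hierarchically from their return correlations.

    returns: (n_periods, n_assets) array.
    Yields flat cluster labels and the tree's leaf order (the asset
    ordering that places similar assets next to each other).
    """
    corr = np.clip(np.corrcoef(returns, rowvar=False), -1.0, 1.0)
    dist = np.sqrt(0.5 * (1.0 - corr))   # correlation distance in [0, 1]
    np.fill_diagonal(dist, 0.0)
    link = linkage(squareform(dist, checks=False), method="single")
    labels = fcluster(link, t=n_clusters, criterion="maxclust")
    return labels, leaves_list(link)


# Synthetic returns: assets 0-2 follow one factor, assets 3-5 another.
rng = np.random.default_rng(0)
factor_a, factor_b = rng.normal(size=(2, 500))
noise = 0.1 * rng.normal(size=(500, 6))
returns = np.column_stack([factor_a] * 3 + [factor_b] * 3) + noise

labels, order = correlation_clusters(returns, n_clusters=2)
print("cluster labels:", labels)
print("leaf order:", order)
```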

For these reasons, let me summarize in a nutshell the main methods you should include in your quant arsenal:

🏷 K-Means Clustering: This method groups data into a specified number (K) of distinct clusters. It's fast, easy to understand, and scales well to large datasets, but it requires knowing the number of clusters beforehand and may struggle with clusters of different sizes.

🏷 Hierarchical Clustering: This technique doesn't need a predetermined number of clusters. It gradually forms clusters, either by combining smaller ones (agglomerative) or splitting larger ones (divisive), and represents them in a tree-like diagram. While it offers a clear view of data structure, it's less suitable for very large datasets.

🏷 Gaussian Mixture Models (GMM): GMMs treat data as if it comes from multiple Gaussian distributions, allowing for overlapping clusters and varied cluster shapes. They're flexible but may become too complex for data with many features.

🏷 Density-Based Clustering (DBSCAN): DBSCAN identifies clusters based on the density of data points, effectively finding clusters of any shape and distinguishing outliers. It automatically determines the number of clusters but requires careful setting of density parameters.

🏷 Spectral Clustering: This method uses relationships between data points (similarity matrices) for clustering. It's good for complex, irregular clusters but can be demanding computationally, especially for very large or unevenly distributed datasets.
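If you want to try the five methods side by side, the sketch below runs each of them with scikit-learn on the same toy dataset. The dataset and every hyperparameter value (three clusters, `eps=0.3`, `min_samples=5`, and so on) are arbitrary choices of mine for illustration, not recommendations for market data.

```python
import numpy as np
from sklearn.cluster import (AgglomerativeClustering, DBSCAN, KMeans,
                             SpectralClustering)
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Toy data: 300 points drawn around 3 centers, then standardized.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)
X = StandardScaler().fit_transform(X)

labels = {
    # Needs the number of clusters up front.
    "k-means": KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X),
    # Agglomerative (bottom-up) tree, cut here at 3 clusters.
    "hierarchical": AgglomerativeClustering(n_clusters=3).fit_predict(X),
    # Mixture of 3 Gaussians; each point gets its most likely component.
    "gmm": GaussianMixture(n_components=3, random_state=0).fit_predict(X),
    # Density-based: no cluster count, but eps and min_samples matter a lot.
    "dbscan": DBSCAN(eps=0.3, min_samples=5).fit_predict(X),
    # Clusters a similarity graph built from the points (RBF affinity).
    "spectral": SpectralClustering(n_clusters=3, random_state=0).fit_predict(X),
}

for name, lab in labels.items():
    n_found = len(set(lab.tolist()) - {-1})   # DBSCAN marks outliers as -1
    n_noise = int(np.sum(lab == -1))
    print(f"{name:>12}: {n_found} clusters, {n_noise} points flagged as noise")
```

Note that only DBSCAN can flag points as noise (label -1); the other four assign every point to a cluster.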

If you're enjoying our newsletter and want to support us, please recommend it to anyone you know who's interested in AI and Finance. Your referrals are the biggest compliment and help us grow! 🌟🤖💼