ReJOOSp: Reinforcement learning for join order optimization in SPARQL
The selection of an optimal join order significantly impacts the performance of SPARQL queries, as it determines the number of intermediate results generated. However, finding the best join order is an NP-hard problem, because the number of possible orders grows exponentially with the number of inputs. Machine learning addresses this challenge with stochastic methods that deliver promising results efficiently, even for hard problem instances. In our experiments, the learned execution plans perform exceptionally well in approximately 80% of cases on synthetic datasets, particularly when the number of joins during learning matches the number during evaluation. On real-world datasets, however, only 50% of the generated execution plans reach the same level of effectiveness. Other join order optimizers do not achieve good results more often, even when given additional computation time.
Query generation
To train and evaluate machine learning models, we use SPARQL queries that contain exactly a given number of join operations. The rationale is to simplify the interpretation of results, as queries consisting solely of join operators are more straightforward to analyze. Real queries, in contrast, pose challenges for machine learning because of their limited diversity: they are often either too similar, differing only in a few constants, or too dissimilar, containing various SPARQL features in addition to joins. Obtaining a sufficient number of real queries for training is also difficult. Moreover, while some queries contain many joins, the applications that issue such queries often lack the necessary variety, leaving too few suitable queries for training purposes.
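As an illustration, the sketch below builds a path-shaped SPARQL query with exactly k joins by chaining k+1 triple patterns over shared variables. The predicate list and the path shape are assumptions made for this example, not the generator used in our experiments.

```python
import random

def generate_path_query(predicates, k, seed=None):
    """Build a SPARQL query whose WHERE clause chains k+1 triple patterns,
    i.e. exactly k joins, by sharing a variable between consecutive patterns."""
    rng = random.Random(seed)
    patterns = []
    for i in range(k + 1):
        # Consecutive patterns share the variable ?v<i>, producing one join each.
        pred = rng.choice(predicates)
        patterns.append(f"?v{i} <{pred}> ?v{i + 1} .")
    return "SELECT * WHERE {\n  " + "\n  ".join(patterns) + "\n}"

# Example: a query with exactly 5 joins over hypothetical predicates.
preds = [f"http://example.org/p{i}" for i in range(10)]
print(generate_path_query(preds, k=5, seed=42))
```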

Reinforcement learning framework
We chose the Stable Baselines 3 framework for our project because its open-source community and number of active contributors surpass those of other frameworks. In particular, a community add-on extends the built-in Proximal Policy Optimization (PPO) model with support for masking the available actions, which is essential for the model to generate optimal join trees effectively.
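As an illustration, the sketch below shows how such masking can be wired up, assuming the add-on is sb3-contrib's MaskablePPO and a hypothetical join-ordering environment; the environment, its action encoding, and its mask logic are placeholders rather than our actual implementation.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class JoinOrderEnv(gym.Env):
    """Hypothetical environment: each action selects the next pair of inputs to join."""

    def __init__(self, num_actions=32, obs_dim=64):
        super().__init__()
        self.action_space = spaces.Discrete(num_actions)
        self.observation_space = spaces.Box(-1.0, 1.0, shape=(obs_dim,), dtype=np.float32)
        self._valid = np.ones(num_actions, dtype=bool)

    def action_masks(self):
        # sb3-contrib's MaskablePPO queries this method and only samples
        # actions whose mask entry is True, i.e. joins that are still valid.
        return self._valid

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self._valid[:] = True
        return self.observation_space.sample(), {}

    def step(self, action):
        self._valid[action] = False  # an already-joined pair cannot be chosen again
        obs = self.observation_space.sample()
        reward = 0.0  # placeholder; the real reward comes from the statistics described below
        terminated = not self._valid.any()
        return obs, reward, terminated, False, {}
```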
To keep the training phase efficient and the approach practical for real-world applications, we use the default Multi-Layer Perceptron (MLP) policy with two layers of 64 nodes each. This configuration is consistent with other work on join order optimization and allows us to keep training periods short. A deeper network would require more training steps, because the model must learn how to exploit the additional hidden layers. Short training phases are important because they mitigate the impact of dataset changes, so a newly trained model remains effective despite fluctuations in the data.
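Continuing the hypothetical environment above, the corresponding model configuration could look as follows; the net_arch value spells out the two hidden layers of 64 units explicitly, which matches the library default in recent Stable Baselines 3 versions.

```python
from sb3_contrib import MaskablePPO

env = JoinOrderEnv()
# Two hidden layers with 64 units each for both the policy and the value network.
model = MaskablePPO(
    "MlpPolicy",
    env,
    policy_kwargs=dict(net_arch=dict(pi=[64, 64], vf=[64, 64])),
    verbose=0,
)
model.learn(total_timesteps=1_000_000)  # roughly the training budget reported below
```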
Reward calculation
The reward of a join tree reflects its quality relative to previously executed join trees. For joins with up to 5 inputs, it is feasible to enumerate all possible join trees and maintain statistics on their performance. For larger joins, however, collecting statistics on all possible trees becomes impractical. We therefore use the default optimizer to initialize the statistics for each SPARQL query; these statistics contain the query and the total number of intermediate results, but no specific join orders.
During training, these statistics are continually updated: whenever the model generates a new join plan, the statistics are extended with the number of intermediate results produced by that plan. The statistics are stored in a separate SQL DBMS. After the training phase, they are removed to free up memory, as they become outdated and are no longer needed.
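A minimal sketch of such a statistics store is shown below, assuming a plain SQLite table keyed by query; the schema and helper names are illustrative and do not reflect the DBMS or schema used in our system.

```python
import sqlite3

con = sqlite3.connect("join_stats.db")
con.execute("""
    CREATE TABLE IF NOT EXISTS join_stats (
        query_id TEXT PRIMARY KEY,
        query    TEXT,
        best     INTEGER,   -- lowest number of intermediate results seen so far
        worst    INTEGER    -- highest number of intermediate results seen so far
    )
""")

def init_stats(query_id, query, default_optimizer_results):
    """Seed the statistics with the intermediate-result count of the default optimizer."""
    con.execute(
        "INSERT OR IGNORE INTO join_stats VALUES (?, ?, ?, ?)",
        (query_id, query, default_optimizer_results, default_optimizer_results),
    )
    con.commit()

def update_stats(query_id, intermediate_results):
    """Extend the statistics with the result count of the join plan just benchmarked."""
    con.execute(
        "UPDATE join_stats SET best = MIN(best, ?), worst = MAX(worst, ?) WHERE query_id = ?",
        (intermediate_results, intermediate_results, query_id),
    )
    con.commit()

def drop_stats():
    """After training, the statistics are no longer needed and can be discarded."""
    con.execute("DROP TABLE IF EXISTS join_stats")
    con.commit()
```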
The reward calculation must handle potentially large variations between different join trees. We therefore employ a logarithmic reward function with a lower bound: good join trees are rewarded, while trees with very high intermediate-result counts, or trees for which the DBMS aborted the benchmark due to excessive execution time, are penalized. Although this estimation does not capture the true best and worst cases, it is sufficient to train the model effectively, providing rewards that encourage the model to learn good join strategies in a practical and efficient manner.
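One way to realize such a logarithmic, bounded reward is sketched below: the intermediate-result count of the current plan is compared on a log scale against the best and worst counts recorded in the statistics (cf. the store sketched above), and aborted benchmarks receive the minimum reward. The exact scaling and bounds are assumptions for illustration, not the precise function used in ReJOOSp.

```python
import math

MIN_REWARD = -1.0  # floor for aborted runs and very poor join trees (assumed value)

def reward(intermediate_results, best, worst, aborted=False):
    """Map an intermediate-result count to a reward in [MIN_REWARD, 1].

    `best` and `worst` are the lowest/highest counts recorded in the statistics
    for this query; `aborted` marks benchmarks cancelled by the DBMS timeout.
    """
    if aborted:
        return MIN_REWARD
    if worst <= best:
        return 1.0  # no spread observed yet, so any valid plan gets the full reward
    # Logarithmic scaling compresses the huge spread between join trees:
    # plans near `best` score close to 1, plans near `worst` close to 0.
    span = math.log(worst + 1) - math.log(best + 1)
    score = (math.log(worst + 1) - math.log(intermediate_results + 1)) / span
    # Plans worse than anything seen so far may fall below 0; clip at the minimum.
    return max(MIN_REWARD, min(1.0, score))
```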

Results
After approximately 1 million training steps, the learned model generates highly effective join trees in about 80% of cases on synthetic data. In contrast, no traditional optimizer achieves good join orders for more than 50% of the tested queries on real data. In addition, the machine learning approach computes these join trees in linear time, providing significant efficiency gains over traditional join tree optimizers.
In contrast to most models in the literature, which learn run times directly, our model predicts the number of intermediate results. Because the number of intermediate results is closely linked to execution time, the generated query plans execute efficiently. Furthermore, the join trees produced by our model are exceptionally memory-friendly, further contributing to their practicality and effectiveness in real-world scenarios.