Abstract

The PLSpredict algorithm was developed by Shmueli et al. (2016). The method uses training and holdout samples to generate and evaluate predictions from PLS path model estimations.

Description

The research by Shmueli et al. (2016) proposes a set of procedures for prediction with PLS path models and the evaluation of their predictive performance. These procedures are combined in the PLSpredict package https://github.com/ISS-Analytics/pls-predict for the statistical software R. They allow researchers to generate different out-of-sample and in-sample predictions (e.g., case-wise and average predictions), which facilitate the evaluation of predictive performance on new data (i.e., data that was not used to estimate the PLS path model). The analysis serves as a diagnostic for possible overfitting of the PLS path model to the training data.

Based on the procedures suggested by Shmueli et al. (2016), the current implementation of the PLSpredict algorithm in the SmartPLS software allows researchers to obtain k-fold cross-validated prediction errors and prediction error summary statistics, such as the root mean square error (RMSE), the mean absolute error (MAE), and the mean absolute percentage error (MAPE), to assess the predictive performance of their PLS path model for the manifest variables (MV or indicators) and the latent variables (LV or constructs). Note that all three criteria are available for the MV results, while only the RMSE and MAE can be computed for the LV results. These criteria allow researchers to compare the predictive performance of alternative PLS path models.
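
For illustration, the following Python sketch shows how such prediction error summary statistics can be computed from observed values and their out-of-sample predictions. It is a simplified stand-in, not the SmartPLS computation; the arrays, values, and function name are illustrative only.

    import numpy as np

    def prediction_error_summaries(y_true, y_pred):
        # y_true: observed (holdout) values of one indicator or construct score
        # y_pred: the corresponding out-of-sample predictions
        errors = y_true - y_pred
        rmse = np.sqrt(np.mean(errors ** 2))           # root mean square error
        mae = np.mean(np.abs(errors))                  # mean absolute error
        mape = 100 * np.mean(np.abs(errors / y_true))  # mean absolute percentage error (requires nonzero y_true)
        return {"RMSE": rmse, "MAE": mae, "MAPE": mape}

    y_true = np.array([3.0, 4.0, 5.0, 2.0, 4.5])
    y_pred = np.array([2.8, 4.3, 4.6, 2.4, 4.1])
    print(prediction_error_summaries(y_true, y_pred))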

Sharma et al.’s (2019) Monte Carlo simulation shows that the RMSE and the mean absolute deviation (MAD) are particularly suitable when the aim is to select the best predictive model among a set of competing models. Researchers need to compare the RMSE and MAD values of alternative model set-ups and select the model that minimizes the RMSE and MAD values of the latent variable scores, as illustrated in the sketch below.
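
As a minimal illustration of this selection step, assume the following cross-validated error summaries for three competing model set-ups; the model names and values are invented for the example:

    # Hypothetical cross-validated error summaries for three competing model set-ups
    model_errors = {
        "Model A": {"RMSE": 0.92, "MAD": 0.71},
        "Model B": {"RMSE": 0.88, "MAD": 0.69},
        "Model C": {"RMSE": 0.95, "MAD": 0.74},
    }

    # Select the model set-up that minimizes the out-of-sample RMSE (analogously for MAD)
    best_model = min(model_errors, key=lambda name: model_errors[name]["RMSE"])
    print(best_model)  # -> Model B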

In addition, to assess the results of a specific PLS path model, researchers can compare its predictive performance against two naïve benchmarks (Shmueli et al., 2019):

(1) The Q² value in PLSpredict compares the prediction errors of the PLS path model against simple mean predictions. For this purpose, it uses the mean value of the training sample to predict the outcomes of the holdout sample. The interpretation of the Q² value is similar to the assessment of Q² values obtained by the blindfolding procedure in PLS-SEM. If the Q² value is positive, the prediction error of the PLS-SEM results is smaller than the prediction error of simply using the mean values. In that case, the PLS-SEM model offers better predictive performance.
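
A minimal sketch of this benchmark comparison, assuming holdout observations, their PLS-based predictions, and the training-sample mean are already available (the formula follows the usual Q² definition from the comparison described above; details of the SmartPLS computation may differ):

    import numpy as np

    def q2_predict(y_holdout, y_pred, y_train_mean):
        # Sum of squared errors of the PLS-based predictions
        sse_pls = np.sum((y_holdout - y_pred) ** 2)
        # Sum of squared errors of the naive benchmark (training-sample mean)
        sse_naive = np.sum((y_holdout - y_train_mean) ** 2)
        # Positive values: the PLS path model predicts better than the mean value
        return 1.0 - sse_pls / sse_naive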

(2) The linear regression model (LM) offers prediction errors and summary statistics that ignore the specified PLS path model. Instead, the LM approach regresses all exogenous indicator variables on each endogenous indicator variable to generate predictions. A comparison with the PLS-SEM results thereby indicates whether using a theoretically established path model improves (or at least does not worsen) the predictive performance relative to using the available indicator data alone. The PLS-SEM results should have a lower prediction error (e.g., in terms of the RMSE or MAE) than the LM. Note that the LM prediction errors are only available for the manifest variables and not for the latent variables.
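
The LM benchmark can be sketched as follows, assuming matrices of exogenous and endogenous indicator values from the training and holdout samples; this is a simplified illustration using scikit-learn, not the SmartPLS code, and the function and variable names are assumptions for the example:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def lm_benchmark_predictions(X_train, Y_train, X_holdout):
        # X_train / X_holdout: exogenous indicators (n x p); Y_train: endogenous indicators (n x q)
        # Each endogenous indicator is regressed on all exogenous indicators,
        # ignoring the structure of the PLS path model.
        predictions = np.empty((X_holdout.shape[0], Y_train.shape[1]))
        for j in range(Y_train.shape[1]):
            lm = LinearRegression().fit(X_train, Y_train[:, j])
            predictions[:, j] = lm.predict(X_holdout)
        return predictions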

The out-of-sample predictions used in PLSpredict assist researchers in evaluating the predictive capabilities of their model. Therefore, PLSpredict should be included in the evaluation of PLS-SEM results (Hair et al., 2019, 2022).

Additional procedures and extensions are under development and may become part of future SmartPLS versions. The most recent extension is the cross-validated predictive ability test (CVPAT), which can be used to test the predictive ability of the model (Liengaard et al., 2021; Sharma et al., 2023). CVPAT results are available in the PLSpredict results report in SmartPLS.

PLSpredict Settings in SmartPLS

Number of Folds

Default: 10

In k-fold cross-validation, the algorithm splits the full dataset into k equally sized subsets (folds). The algorithm then predicts each fold (the holdout sample) using the remaining k-1 subsets, which, in combination, become the training sample. For example, when k equals 10 (i.e., 10 folds), a dataset of 200 observations will be split into 10 subsets with 20 observations each. The algorithm then predicts each of the ten folds using the nine remaining subsets.
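
The splitting logic can be illustrated with the following Python sketch; it is a generic example of k-fold splitting, not the SmartPLS implementation:

    import numpy as np
    from sklearn.model_selection import KFold

    data = np.arange(200)                      # stands in for 200 observations
    kfold = KFold(n_splits=10, shuffle=True)   # k = 10 equally sized folds

    for fold, (train_idx, holdout_idx) in enumerate(kfold.split(data), start=1):
        # 180 observations form the training sample, 20 form the holdout fold
        train_sample, holdout_fold = data[train_idx], data[holdout_idx]
        print(f"Fold {fold}: training n = {len(train_sample)}, holdout n = {len(holdout_fold)}")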

Number of Repetitions

Default: 10

The number of repetitions indicates how often the PLSpredict algorithm runs the k-fold cross-validation on random splits of the full dataset into k folds.

Traditionally, cross-validation uses only one random split into k folds. However, with a single random split, the predictions depend strongly on this random assignment of observations to the k folds. Due to the random partitioning of the data, executions of the algorithm at different points in time may vary in their predictive performance results (e.g., RMSE, MAPE).

Repeating the k-fold cross-validation with different random data partitions and computing the average across the repetitions provides a more stable estimate of the predictive performance of the PLS path model.
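
The repetition logic can be illustrated as follows; the sketch uses an ordinary linear regression as a stand-in for the PLS path model estimation and invented data, so the numbers are illustrative only:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import RepeatedKFold

    rng = np.random.default_rng(42)
    X = rng.normal(size=(200, 3))                                    # illustrative predictor data
    y = X @ np.array([0.5, 0.3, 0.2]) + rng.normal(scale=0.5, size=200)

    rkf = RepeatedKFold(n_splits=10, n_repeats=10)                   # 10 folds, 10 repetitions
    rmse_values = []
    for train_idx, holdout_idx in rkf.split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])   # stand-in for the model estimation
        errors = y[holdout_idx] - model.predict(X[holdout_idx])
        rmse_values.append(np.sqrt(np.mean(errors ** 2)))

    # Averaging over the 10 x 10 = 100 fold-level results stabilizes the estimate
    print(f"Mean RMSE across repetitions: {np.mean(rmse_values):.3f}")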

References

Please always cite the use of SmartPLS!

Ringle, Christian M., Wende, Sven, & Becker, Jan-Michael. (2024). SmartPLS 4. Bönningstedt: SmartPLS. Retrieved from https://www.smartpls.com