Benchmarks

We benchmark GPflux’s Deep GP on several UCI datasets. The code to run the experiments can be found in benchmarking/main.py, and the results are stored in benchmarking/runs/*.json. In this script we aggregate and plot the outcomes.

We report the mean and standard deviation of the mean squared error (MSE) and the negative log predictive density (NLPD), measured by running the experiment on 5 different splits. Each split uses 90% of the data for training and the remaining 10% for testing. The output is normalised to have zero mean and unit variance. For both metrics, lower is better.
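To make the two metrics concrete, here is a minimal sketch of how they could be computed for one split. It is not taken from benchmarking/main.py: the helper names (`normalise`, `mse`, `gaussian_nlpd`) are hypothetical, the normalisation statistics are assumed to come from the training outputs only, and the predictive distribution is assumed to be Gaussian with a per-point mean and variance.

```python
import numpy as np


def normalise(y_train, y_test):
    """Scale outputs to zero mean and unit variance.

    Assumption: the statistics are computed on the training outputs only
    and then reused for the test outputs.
    """
    mean, std = y_train.mean(), y_train.std()
    return (y_train - mean) / std, (y_test - mean) / std


def mse(y_true, pred_mean):
    """Mean squared error between targets and predictive means."""
    return float(np.mean((y_true - pred_mean) ** 2))


def gaussian_nlpd(y_true, pred_mean, pred_var):
    """Negative log predictive density, averaged over test points.

    Assumption: each prediction is a univariate Gaussian N(pred_mean, pred_var).
    """
    log_density = -0.5 * (
        np.log(2.0 * np.pi * pred_var) + (y_true - pred_mean) ** 2 / pred_var
    )
    return float(-np.mean(log_density))


# Made-up targets, used only to exercise the functions.
y_train = np.array([1.0, 2.0, 3.0, 4.0])
y_test = np.array([2.0, 3.5])
y_train_n, y_test_n = normalise(y_train, y_test)

# A trivial baseline that always predicts N(0, 1) on the normalised scale.
baseline_mean = np.zeros_like(y_test_n)
baseline_var = np.ones_like(y_test_n)
print("MSE: ", mse(y_test_n, baseline_mean))
print("NLPD:", gaussian_nlpd(y_test_n, baseline_mean, baseline_var))
```

Because the outputs have unit variance, a model that always predicts the mean scores an MSE of roughly 1, which gives a scale for reading the numbers in the table.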

                split       mse                 nlpd
                count      mean       std       mean       std
dataset  model
Concrete dgp-1      5  0.103785  0.014586   0.526873  0.231547
         dgp-2      5  0.093612  0.003917   0.388471  0.200387
         dgp-3      5  0.103213  0.019258   0.624335  0.409077
Energy   dgp-1      5  0.003866  0.001660  -0.991852  0.065885
         dgp-2      5  0.004071  0.001542  -1.089672  0.039099
         dgp-3      5  0.004063  0.001521  -1.091651  0.039407
Kin8mn   dgp-1      5  0.098581  0.006733   0.263775  0.019575
         dgp-2      5  0.061714  0.002321   0.040491  0.026879
         dgp-3      5  0.064156  0.002981   0.144311  0.045383
Power    dgp-1      5  0.056407  0.004272  -0.009102  0.045228
         dgp-2      5  0.044380  0.006752  -0.129386  0.078303
         dgp-3      5  0.042464  0.005769  -0.113741  0.040804
Yacht    dgp-1      5  0.005899  0.005309  -0.908563  0.095456
         dgp-2      5  0.002389  0.002963  -1.084093  0.071270
         dgp-3      5  0.002420  0.002879  -1.085658  0.069810
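
A table with this layout could be assembled with a pandas group-by along the following lines. This is a sketch, not the notebook's actual code: the field names (`dataset`, `model`, `split`, `mse`, `nlpd`) mirror the column labels above, but the JSON layout (one flat object per file) and the `load_records` helper are assumptions, and the example records are made up.

```python
import glob
import json

import pandas as pd


def load_records(pattern):
    """Read one record per JSON file matching `pattern`.

    Assumption: each file holds a single flat JSON object with the keys
    "dataset", "model", "split", "mse" and "nlpd".
    """
    records = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            records.append(json.load(f))
    return records


def aggregate(records):
    """Summarise per-split results by dataset and model."""
    df = pd.DataFrame.from_records(records)
    return df.groupby(["dataset", "model"]).agg(
        {"split": ["count"], "mse": ["mean", "std"], "nlpd": ["mean", "std"]}
    )


# Made-up records standing in for load_records("benchmarking/runs/*.json").
example_records = [
    {"dataset": "Yacht", "model": "dgp-1", "split": 0, "mse": 0.1, "nlpd": -0.5},
    {"dataset": "Yacht", "model": "dgp-1", "split": 1, "mse": 0.3, "nlpd": -0.7},
    {"dataset": "Yacht", "model": "dgp-2", "split": 0, "mse": 0.2, "nlpd": -0.9},
    {"dataset": "Yacht", "model": "dgp-2", "split": 1, "mse": 0.2, "nlpd": -1.1},
]

example_table = aggregate(example_records)
print(example_table)
```

Note that pandas computes `std` as the sample standard deviation (`ddof=1`) by default, so with only 5 splits the reported spread is slightly larger than the population value would be.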