Benchmarks

We benchmark GPflux's Deep GP on several UCI datasets. The code to run the experiments can be found in benchmarking/main.py, and the results are stored in benchmarking/runs/*.json. In this script we aggregate and plot the outcomes.
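The aggregation step can be sketched with pandas as follows. This is a minimal sketch, not the notebook's actual code: it assumes that each JSON file holds a single dictionary with the keys dataset, model, split, mse and nlpd (names inferred from the table columns), and the helpers load_records and aggregate are hypothetical.

```python
import glob
import json

import pandas as pd


def load_records(pattern):
    """Read one result dictionary per JSON file matching `pattern`.

    Assumption: each file contains a flat dict with the keys
    dataset, model, split, mse and nlpd.
    """
    records = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            records.append(json.load(f))
    return records


def aggregate(records):
    """Group per-split results by dataset and model and summarise them."""
    df = pd.DataFrame.from_records(records)
    return df.groupby(["dataset", "model"]).agg(
        {"split": ["count"], "mse": ["mean", "std"], "nlpd": ["mean", "std"]}
    )


# Toy records with made-up numbers, only to show the shape of the result.
toy = [
    {"dataset": "Toy", "model": "dgp-1", "split": 0, "mse": 0.1, "nlpd": 0.5},
    {"dataset": "Toy", "model": "dgp-1", "split": 1, "mse": 0.3, "nlpd": 0.7},
]
toy_table = aggregate(toy)
```

Under the same assumptions, aggregate(load_records("benchmarking/runs/*.json")) would give a table indexed by dataset and model. Note that pandas computes the sample standard deviation (ddof=1) by default.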

We report the mean and standard deviation of the mean squared error (MSE) and the negative log predictive density (NLPD), measured by running each experiment on 5 different splits. Each split uses 90% of the data for training and the remaining 10% for testing. The output is normalised to have zero mean and unit variance.
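For reference, the two metrics can be sketched in NumPy as below. This is an illustration under a stated assumption, not the code in benchmarking/main.py: it treats the prediction at each test point as a single Gaussian with a given mean and variance, whereas the benchmark script may evaluate the predictive density differently. The function names mse and gaussian_nlpd are hypothetical.

```python
import numpy as np


def mse(y_true, mean):
    """Mean squared error between targets and predictive means."""
    return float(np.mean((y_true - mean) ** 2))


def gaussian_nlpd(y_true, mean, var):
    """Average negative log density of y_true under N(mean, var).

    Assumption: one independent Gaussian prediction per test point.
    """
    log_density = -0.5 * (np.log(2 * np.pi * var) + (y_true - mean) ** 2 / var)
    return float(-np.mean(log_density))


# Toy check: perfect means with unit variance give NLPD = 0.5 * log(2 * pi).
y = np.array([0.0, 1.0])
perfect_nlpd = gaussian_nlpd(y, mean=y, var=np.ones_like(y))
```

Because the NLPD is the negative logarithm of a density rather than of a probability, it is negative whenever the predictive density at the test targets exceeds 1 on average, which is why some entries in the table are below zero.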

[4]:

table

[4]:

                 split  mse                  nlpd
                 count  mean      std        mean      std
dataset   model
Concrete  dgp-1  5      0.103785  0.014586   0.526873  0.231547
          dgp-2  5      0.093612  0.003917   0.388471  0.200387
          dgp-3  5      0.103213  0.019258   0.624335  0.409077
Energy    dgp-1  5      0.003866  0.001660  -0.991852  0.065885
          dgp-2  5      0.004071  0.001542  -1.089672  0.039099
          dgp-3  5      0.004063  0.001521  -1.091651  0.039407
Kin8mn    dgp-1  5      0.098581  0.006733   0.263775  0.019575
          dgp-2  5      0.061714  0.002321   0.040491  0.026879
          dgp-3  5      0.064156  0.002981   0.144311  0.045383
Power     dgp-1  5      0.056407  0.004272  -0.009102  0.045228
          dgp-2  5      0.044380  0.006752  -0.129386  0.078303
          dgp-3  5      0.042464  0.005769  -0.113741  0.040804
Yacht     dgp-1  5      0.005899  0.005309  -0.908563  0.095456
          dgp-2  5      0.002389  0.002963  -1.084093  0.071270
          dgp-3  5      0.002420  0.002879  -1.085658  0.069810