GPTIPS 2 - Tutorial 1 - "A Hard Problem for Standard GP".
Some of the functions used in the tutorial may result in errors if run in versions of MATLAB 2018a and up. This will be fixed soon in the release of GPTIPS 2.01
This tutorial shows how GPTIPS may be used to solve a 'difficult' symbolic regression problem using the MGGP model discovery engine.
It also shows how the Symbolic Math Toolbox may be used to visualise the problem and the GPTIPS solutions.
The problem ('called the Spatial Evolution' problem) has been used as a GP benchmark.
The function to be approximated is:
For this benchmark:
The input range is –5 <= x1 <= 5 and -5 < = x2 <= 5 "with test data 0.4 apart, which gives a total of 676 data points."
"The shape of the graph between these points is quite smooth. This is a hard problem for standard GP -- none of the papers below have been able to get GP to solve this (using a Koza style 676 hits predicate)."
According to Harper (2009, page 175)
A hit means that "the error for a given data point is less than 0.01 and 676 hits are required for a complete solution" (see http://www.cs.bham.ac.uk/~wbl/biblio/gp-html/Harper_thesis.html ).
In this tutorial the 676 evenly spaced points will be used to generate a training data set as in Harper (2009), a further 2601 evenly spaced points 0.2 units apart are generated in the range -5 to 5 to create a test data set (not used in learning the regression models).
The 676 hits (absolute error < 0.01) predicate on the training data will not be used in the fitness function (for simplicity, the default RMS error on the training data will be used) but the solutions will be examined to see if they meet the criterion for a 'complete solution' as defined above.
Before using GPTIPS to model this problem, it is useful to use the Symbolic Math Toolbox to visualise it using the ezsurf function. We can do this because this is a known synthetic problem - for real world applications involving data from systems where the 'true' structure is unknown this would not be possible.
First, we can create a symbolic math object of the function using the sym command.
Next, generate a 3-D surface plot of the function using the ezsurf command plus some additional commands to improve the visualisation. The additional arguments to ezsurf generate a 150 x 150 grid in the -5 to 5 range for x1 and x2.
Hence, we can see that the function is smooth and significantly non-linear with a local minimum at x1 = 0, x2 = 0.
Next we can create a GPTIPS configuration file from scratch to model data from this function.
Create an m file function (spatial_evo_config.m) that accepts and returns the argument gp. All GPTIPS config files must adhere to this format.
Next, add some basic run control parameters. Here, 2 runs will be performed, each with a population of 250. At the end the results of the 2 runs are merged to form a final population of 500 models.
Each run is forced to terminate after 60 seconds, thus dedicating a total of 2 minutes of computational effort to the problem.
Note that, because we have not specified otherwise, the default fitness function performing multigene regression (regressmulti_fitfun.m) will automatically be used.
Next, set the selection tournament size at 20 (as a first guess I usually set it just under 10 % of the population size).
Encourage less complex models by setting 30 % of tournaments to be Pareto tournaments.
Also set the max number of genes per model to 10 (this maybe a bit high but in terms of raw function approximation performance will probably give better results on the training data than fewer genes - there is a greater risk of overfitting however).
Next, we generate the 676 training data points as well as the additional data points for the test data set. We populate the gp.userdata fields with the generated data points. This is where GPTIPS expects to find data.
Then we add some function nodes to build the trees forming the models. This is pretty much the standard palette of functions to try if you don't know which are best for a certain problem.
Next save the file and go to the command line to run GPTIPS on the problem.
Analysis of results
Because GPTIPS uses a non-deterministic algorithm - results vary from run to run. Here the results of an entirely typical run I performed are shown.
First - use the popbrowser function to examine the overall predictive performance and complexity of models in the final merged population on the training data set. In general, models with low complexity and high R2 (predictive performance) are preferred.
The popbrowser window shows that many of the models have an R2 close to 1, indicating a good fit on the training data. The 'best' model on the training data (in terms of predictive performance) is indicated with a red circle. Clicking on any circle in the popbrowser also shows the equation representing the model.
To evaluate the 'best' model (again, in terms of predictive performance) on the training and test data sets, the runtree function can be used.
This generates a number of graphs, including trend and scatter plots of the model predictions against the actual data for the training and test sets.
From these graphs it can be seen that the model predictions on the training data are highly accurate with an R2 of nearly 1 (the reported value of 1 on the graphs is actually 0.999997 but this is only displayed to a lower precision).
Similarly, on the test data the R2 is very high at 0.99908 indicating a good generalisation ability by the model.
To see the overall simplified model equation the gppretty command can be used.
This does not bear much apparent relation to the function that generated the data (there's no reason it should) but it does accurately approximate the function in the region of interest.
Next, the number of 'hits' on the training data (i.e. the number of training data points for which the absolute prediction error is < 0.01) can be computed by use of the gpmodel2struct function. This creates a data structure that contains - amongst other things - the predictions of the model on the supplied data sets.
The .train field of the variable modelStruct returned by gpmodel2struct contains details of model performance on the training data as well as a vector of prediction errors (similarly the .test field provides prediction metrics on the test data set).
From this we can see that the maximum absolute error (modelStruct.train.maxe) is 0.00119687. Alternatively, we can also find the number of prediction errors on the training data >= 0.01 as follows
Hence - there are no errors >= 0.01 and so using 2 mins of computational effort (on a 2014 vintage dual core laptop) the model has provided a 100% accurate solution (676 hits) in terms of the original problem formulation.
The model solution can be visualised by extracting it as a symbolic math object using the gpmodel2sym function as shown below. In symbolic math objects real numeric values (e.g. the tree/gene weights) are represented as the ratio of large integers.
This can be a little much to visually digest so it's often a good idea to use the vpa (variable precision arithmetic) command to see a lower precision version of the model as also shown below.
Next we can generate a 3-D surface plot of the model's symbolic math object using the ezsurf function and some additional commands to improve the visualisation.
It can be seen that the model provides a very close approximation to the problem function over the region of interest.
We can also add the test data to the plot to show how well the model fits it.
Finally, because this is a synthetic example and we know the function that generated the model, we can create an explicit symbolic model of the difference (error) between the evolved model and the true function. We can also plot this model error surface as follows.
The model error surface is interesting because it clearly shows model errors that are greater than 0.01, even though errors of this magnitude did not appear when predicting the training data.
To see why this is, the model prediction training set errors can be superimposed on the model error surface plot as follows.
It can be seen the model error surface is small (< 0.01) where the training data points are located but in some regions grows larger in the 'gaps' between the training instances. So, the take home message here is that the algorithm only learns from the data you actually feed to to it and it does not know or care what happens in between those data points. This is pretty much true of any data driven modelling method so be careful in constructing your data sets and always examine your assumptions.
GPTIPS was used to successfully solve a difficult 2-D synthetic symbolic regression problem. It was demonstrated how to examine and visualise model performance using a variety of GPTIPS and Symbolic Math Toolbox functions. Only 1 model was examined in this case (for the sake of brevity) however in practice the population would be examined (e.g. using popbrowser and paretoreport) to find less complex models offering similar high predictive accuracy.
Harper, R. Enhancing Grammatical Evolution, PhD thesis, School of Computer Science and Engineering, The University of New South Wales, 2009.
Text of config file.
(c) Dominic Searson 2015 - 2021
You are welcome to use this tutorial as a basis for teaching either in part or in its entirety if you kindly provide the following citation:
Searson D.P. (2015) GPTIPS 2: An Open-Source Software Platform for Symbolic Data Mining. In: Gandomi A., Alavi A., Ryan C. (eds) Handbook of Genetic Programming Applications. Springer, Cham. https://doi.org/10.1007/978-3-319-20883-1_22