Saturday, October 24, 2009

Surrogate Models for Classification

From our FAQ:

Question: Does the SUMO Toolbox support classification problems?
Short Answer: Yes, now it does
Long Answer: see below

At the SUMO Lab we spend most of our time on the problem of generating an accurate surrogate model (metamodel) for a given data set or simulation code with a minimum number of data points (= adaptive sampling, sequential design, active learning). The goal is to make this process as efficient and pain-free as possible.

To aid this work we developed the Matlab SUMO Toolbox, which implements a number of frameworks and abstractions to facilitate:
  • model selection
  • model complexity selection (= hyperparameter optimization)
  • adaptive sampling (= active learning)
  • Design of Experiments (DoE)
  • data visualization
  • data interfacing
  • distributed execution of simulations
The work has always revolved around regression and function approximation problems. However, many of the algorithms and subproblems we encountered are equally applicable to classification. So wouldn't it be possible to leverage the SUMO framework for classification problems as well? Encouraged by some comments from one of the toolbox users, I looked into this.

It turned out that with only 30 minutes of work I had a first demo ready. Since some of the model types inside SUMO already support classification internally (e.g., the SVM models), I just needed to add some extra options and tweak the model plotting code somewhat.
The result is that now you can use the SUMO plugins for hyperparameter optimization, model selection, adaptive sampling, etc. and apply them to classification problems. The code will become available in version 6.3. If you want to play around with it earlier, just let me know.

As a proof-of-principle example I took the classical two-spiral problem and configured SUMO to use SVM models (parameters optimized with DIRECT) and the density-based sample selection algorithm. The resulting movie generated by SUMO is given below:


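For readers who want to try something similar outside the toolbox, here is a minimal Python sketch of the two ingredients that do not depend on SUMO: generating two-spiral data and selecting new samples where the existing ones are sparsest. This is not SUMO code. The function names (two_spirals, pick_sparsest, sequential_design) are made up for illustration, and the greedy "farthest from its nearest neighbour" rule is only an assumed stand-in for a density-based criterion, not the toolbox's actual algorithm. Fitting and tuning the classifier is left out.

```python
import math
import random


def two_spirals(n_per_class=100, turns=1.5, noise=0.0, seed=0):
    """Return (points, labels) for two interleaved spirals with labels 0 and 1."""
    rng = random.Random(seed)
    points, labels = [], []
    for i in range(n_per_class):
        t = (i + 1) / n_per_class          # position along the arm, in (0, 1]
        angle = 2.0 * math.pi * turns * t
        x, y = t * math.cos(angle), t * math.sin(angle)
        for label, sign in ((0, 1.0), (1, -1.0)):
            # The second arm is the first one rotated by 180 degrees.
            points.append((sign * x + rng.gauss(0.0, noise),
                           sign * y + rng.gauss(0.0, noise)))
            labels.append(label)
    return points, labels


def pick_sparsest(candidates, samples):
    """Return the candidate whose nearest existing sample is farthest away."""
    def nearest_distance(c):
        return min(math.dist(c, s) for s in samples)
    return max(candidates, key=nearest_distance)


def sequential_design(pool, n_initial=4, n_total=20):
    """Greedily grow a sample set from pool, one point per iteration."""
    samples = list(pool[:n_initial])
    remaining = list(pool[n_initial:])
    while len(samples) < n_total and remaining:
        best = pick_sparsest(remaining, samples)
        remaining.remove(best)
        samples.append(best)
    return samples
```

In a full loop you would refit the classifier after each new point and stop once its accuracy is good enough. The sketch only shows how the sample set grows.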

Note that while the basic support for classification is there, our focus remains on the classic surrogate modeling problem (regression). So don't expect major developments in this area anytime soon. Rather, the purpose of this was just to show that it can be done quite easily. The basic support is there, and now it's up to somebody interested to pick it up and improve or extend it as needed :)

Note also that exactly the same could be done for time series prediction. If I turn out to have a spare hour here or there I might do a similar post on that as well :)

--Dirk
