{
  "nbformat": 4,
  "metadata": {
    "kernelspec": {
      "name": "python3",
      "language": "python",
      "display_name": "Python 3"
    },
    "language_info": {
      "file_extension": ".py",
      "name": "python",
      "mimetype": "text/x-python",
      "nbconvert_exporter": "python",
      "codemirror_mode": {
        "version": 3,
        "name": "ipython"
      },
      "pygments_lexer": "ipython3",
      "version": "3.5.2"
    }
  },
  "cells": [
    {
      "cell_type": "code",
      "source": [
        "%matplotlib inline"
      ],
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "collapsed": false
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "\n# How to fit a basic model\n\n\nThese examples show how to fit a model using MonoBoost. There are two model\ntypes: `MonoBoost` and `MonoBoostEnsemble`. `MonoBoost` sequentially fits\n`num_estimators` partially monotone cone rules to the dataset using gradient\nboosting. `MonoBoostEnsemble` fits a sequence of `MonoBoost` classifiers, each\nof size `learner_num_estimators` (up to a total of `num_estimators`), using\ngradient boosting. The advantage of `MonoBoostEnsemble` is that it allows\nstochastic subsampling (of fraction `sample_fract`) after every\n`learner_num_estimators` cones.\n\n"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
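        "import numpy as np\nimport monoboost as mb\n# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,\n# so this example needs an older scikit-learn release.\nfrom sklearn.datasets import load_boston"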
      ],
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "collapsed": false
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Load the data\n----------------\n\nFirst we load the standard\n`Boston Housing\n<https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html>`_ dataset.\nIts target is real valued (regression); further below we convert it to a\nbinary classification target with a roughly 50-50 class distribution:\n\n\n"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "data = load_boston()\ny = data['target']\nX = data['data']\nfeatures = data['feature_names']"
      ],
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "collapsed": false
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Specify the monotone features\n-----------------------------\nThere are 13 predictors for house price in the Boston dataset:\n\n"
      ],
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": [
        "#. CRIM - per capita crime rate by town\n#. ZN - proportion of residential land zoned for lots over 25,000 sq.ft.\n#. INDUS - proportion of non-retail business acres per town.\n#. CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)\n#. NOX - nitric oxides concentration (parts per 10 million)\n#. RM - average number of rooms per dwelling\n#. AGE - proportion of owner-occupied units built prior to 1940\n#. DIS - weighted distances to five Boston employment centres\n#. RAD - index of accessibility to radial highways\n#. TAX - full-value property-tax rate per $10,000\n#. PTRATIO - pupil-teacher ratio by town\n#. B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town\n#. LSTAT - % lower status of the population\n\nThe output is MEDV, the median value of owner-occupied homes in $1000's. We\nconvert it to a binary `y` in {-1, +1}: -1 if MEDV is below 21 (that is,\n$21,000) and +1 otherwise:\n\n"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
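        "# Convert the real-valued target to a roughly 50-50 binary label (in place).\n# Order matters: the -1 written by the first line never satisfies `>= 21`.\ny[y < 21] = -1\ny[y >= 21] = +1"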
      ],
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "collapsed": false
      }
    },
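    {
      "cell_type": "markdown",
      "source": [
        "As a quick sanity check on the \"roughly 50-50\" claim, the class balance of\nthe binarised target can be inspected directly. This is only an illustrative\nsketch; it reuses `np` and `y` from the cells above:\n\n"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# Count how many samples fall in each class after binarisation\nclasses, counts = np.unique(y, return_counts=True)\nfor label, count in zip(classes, counts):\n    print('class %+d: %d samples (%.1f%%)' % (label, count, 100.0 * count / len(y)))"
      ],
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "collapsed": false
      }
    },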
    {
      "cell_type": "markdown",
      "source": [
        "We suspect that the number of rooms (6. RM) and the highway \naccessibility (9. RAD) would, if anything, increase the price of a house\n(all other things being equal). Likewise we suspect that crime rate (1.\nCRIM), distance from employment (8. DIS) and percentage of lower status\nresidents (13. LSTAT) would be likely to, if anything, decrease house prices.\nSo we have:\n\n"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# Feature indices follow the numbered list above (1-based)\nincr_feats = [6, 9]  # RM, RAD\ndecr_feats = [1, 8, 13]  # CRIM, DIS, LSTAT"
      ],
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "collapsed": false
      }
    },
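    {
      "cell_type": "markdown",
      "source": [
        "The index lists can be cross-checked against the feature names loaded\nearlier. This is a small sketch, assuming (as the numbering in the text\nsuggests) that the indices are 1-based; it reuses `features`, `incr_feats`\nand `decr_feats` from the cells above:\n\n"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# Indices follow the 1-based numbering of the feature list, so subtract 1\n# to index into the 0-based `features` array.\nprint('increasing:', [features[i - 1] for i in incr_feats])\nprint('decreasing:', [features[i - 1] for i in decr_feats])"
      ],
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "collapsed": false
      }
    },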
    {
      "cell_type": "markdown",
      "source": [
        "Fit a MonoBoost model\n-------------------------\nWe now fit our classifier. To understand the hyperparameters, please\nrefer to the original paper available \n`here <http://staffhome.ecm.uwa.edu.au/~19514733/>`_:\n\n"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# Specify hyperparams for model solution\nvs = [0.01, 0.1, 0.2, 0.5, 1]\neta = 0.25\nlearner_type = 'two-sided'\nnum_estimators = 10\n# Solve model\nmb_clf = mb.MonoBoost(n_feats=X.shape[1], incr_feats=incr_feats,\n                      decr_feats=decr_feats, num_estimators=num_estimators,\n                      fit_algo='L2-one-class', eta=eta, vs=vs,\n                      verbose=False, learner_type=learner_type)\nmb_clf.fit(X, y)\n# Assess the model (accuracy on the training data)\ny_pred = mb_clf.predict(X)\nacc = np.sum(y == y_pred) / len(y)\nprint('Training accuracy: %.3f' % acc)"
      ],
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "collapsed": false
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Fit a MonoBoostEnsemble model\n-----------------------------\nWe now fit the ensemble classifier. To understand the additional\nhyperparameters, please refer to the original paper available\n`here <http://staffhome.ecm.uwa.edu.au/~19514733/>`_:\n\n"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# Specify hyperparams for model solution\nvs = [0.01, 0.1, 0.2, 0.5, 1]\neta = 0.25\nlearner_type = 'one-sided'\nnum_estimators = 10\nlearner_num_estimators = 2\nlearner_eta = 0.25\nlearner_v_mode = 'random'\nsample_fract = 0.5\nrandom_state = 1\nstandardise = True\n# Solve model\nmb_clf = mb.MonoBoostEnsemble(\n    n_feats=X.shape[1],\n    incr_feats=incr_feats,\n    decr_feats=decr_feats,\n    num_estimators=num_estimators,\n    fit_algo='L2-one-class',\n    eta=eta,\n    vs=vs,\n    verbose=False,\n    learner_type=learner_type,\n    learner_num_estimators=learner_num_estimators,\n    learner_eta=learner_eta,\n    learner_v_mode=learner_v_mode,\n    sample_fract=sample_fract,\n    random_state=random_state,\n    standardise=standardise)\nmb_clf.fit(X, y)\n# Assess the model (MonoBoostEnsemble, accuracy on the training data)\ny_pred = mb_clf.predict(X)\nacc = np.sum(y == y_pred) / len(y)\nprint('Training accuracy: %.3f' % acc)"
      ],
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "collapsed": false
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Final notes\n-----------------------\nIn a real scenario we would use a hold-out technique such as cross-validation\nto tune the hyperparameters `v`, `eta` and `num_estimators`, but this is\nstandard practice and is not covered in these basic examples. Note that for\ntuning `num_estimators` we can use `predict(X,cum=True)` because, as is\nstandard for boosting, the stagewise predictions are stored. Enjoy!\n\n"
      ],
      "metadata": {}
    }
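    ,
    {
      "cell_type": "markdown",
      "source": [
        "As a rough illustration of the hold-out idea, the sketch below keeps back\npart of the data and measures accuracy on it, rather than on the data used\nfor fitting. It is a single train/test split, not full cross-validation; the\n30% split size and the names `holdout_clf` and `test_acc` are illustrative\nchoices, and the `MonoBoost` arguments simply repeat those of the first\nexample:\n\n"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "from sklearn.model_selection import train_test_split\n\n# Hold out 30% of the samples (an arbitrary, illustrative choice)\nX_train, X_test, y_train, y_test = train_test_split(\n    X, y, test_size=0.3, random_state=1, stratify=y)\n# Same MonoBoost settings as in the first example above\nholdout_clf = mb.MonoBoost(n_feats=X.shape[1], incr_feats=incr_feats,\n                           decr_feats=decr_feats, num_estimators=10,\n                           fit_algo='L2-one-class', eta=0.25,\n                           vs=[0.01, 0.1, 0.2, 0.5, 1], verbose=False,\n                           learner_type='two-sided')\nholdout_clf.fit(X_train, y_train)\n# Accuracy on data the model has not seen during fitting\ntest_acc = np.mean(holdout_clf.predict(X_test) == y_test)\nprint('Hold-out accuracy: %.3f' % test_acc)"
      ],
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "collapsed": false
      }
    }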
  ],
  "nbformat_minor": 0
}