Prediction, Classification
Estimated reading time: 15 minutes
Welcome! This section highlights important business machine learning models. Many of these models are not code-complete and simply provide excerpted, pseudo-like code. The code on this website is written in Python.
This six-part documentation identifies:
- State-of-the-art classification models (this page).
- Continuous value prediction problems.
- The use of Natural Language Processing.
- Important time series solutions.
- The core principles of recommender systems.
- Experimental image and voice technologies.
This section is easier to explore with some knowledge of Python, especially of its data science libraries. If you need any help with the models, feel free to get in touch for a consultation.
Binary Classification
Many classification problems are binary in nature, such as predicting whether a stock price will rise or fall, predicting a customer's gender, or predicting whether a prospective client will buy your product.
Binary Business Prediction:
- Future direction of commodity, stock, and bond prices.
- Predicting a customer demographic.
- Predict whether customers will respond to direct mail.
- Predict the probability of damage in a home inspection.
- Predict the likelihood that a grant application will succeed.
- Predict job success using a 10-part questionnaire.
- Predict those most likely to donate to a cause.
Data Type | Description | Examples |
---|---|---|
Categorical | Data that can be discretely classified. | Country, Exchange, Currency, Dummy Variable, State, Industry. |
Continuous | Data that changes incrementally in value. | Past Asset Price, Interest Rate, Competitor's Price. |
Stepped | Similar to continuous but changes infrequently. | P/E Ratio, Quarterly Revenue. |
Transformed Category | A different data type converted to categorical. | Traded inside one standard deviation - yes/no. P/E above 10 - yes/no. |
Models | Predictions from other models used as inputs. | ARIMA, AR, MA. |
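As a quick illustration, categorical and transformed-category features of the kinds listed above can be prepared with pandas before modelling; the column names below are hypothetical, and this is only a minimal sketch.
import pandas as pd
df = pd.DataFrame({"industry": ["mining", "tech", "tech"],
                   "pe_ratio": [8.0, 25.0, 14.0]})
df = pd.get_dummies(df, columns=["industry"])            # categorical -> dummy variables
df["pe_above_10"] = (df["pe_ratio"] > 10).astype(int)    # transformed category: P/E above 10 - yes/no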
premodel
#Load Data:
import pandas as pd
train = pd.read_csv("../input/train_1.csv")
#Explore For Insights:
import matplotlib.pyplot as plt
plt.plot(mean_group)
plt.show()
#Split Data in Three Sets:
from sklearn.model_selection import train_test_split
holdout_size = int(len(X) * 0.1)   # reserve ~10% as a final holdout set (fraction assumed)
X_holdout, y_holdout = X.iloc[:holdout_size], y.iloc[:holdout_size]
X_rest, y_rest = X.iloc[holdout_size:], y.iloc[holdout_size:]
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.3, random_state=0)
#Add Additional Features:
mean = X_train[col].mean()
model
import lightgbm as lgbm
learning_rate = 0.8
num_leaves = 128
min_data_in_leaf = 1000
feature_fraction = 0.5
bagging_freq = 1000
num_boost_round = 1000
params = {"objective": "binary",
"boosting_type": "gbdt",
"learning_rate": learning_rate,
"num_leaves": num_leaves,
"feature_fraction": feature_fraction,
"bagging_freq": bagging_freq,
"verbosity": 0,
"metric": "binary_logloss",
"nthread": 4,
"subsample": 0.9
}
dtrain = lgbm.Dataset(X_train, y_train)
dvalid = lgbm.Dataset(X_test, y_test, reference=dtrain)
bst = lgbm.train(params, dtrain, num_boost_round, valid_sets=[dvalid],
                 verbose_eval=100, early_stopping_rounds=100)
y_prob = bst.predict(X_test, num_iteration=bst.best_iteration)  # predicted probabilities
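For the binary objective, the LightGBM booster returns probabilities rather than labels; a minimal sketch of converting them to hard classes (0.5 cut-off assumed, tune as needed):
import numpy as np
y_pred = (y_prob > 0.5).astype(int)   # hard class labels from predicted probabilities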
import xgboost as xgb
model = xgb.XGBClassifier(objective='binary:logistic',
                          learning_rate=0.037, max_depth=5,
                          min_child_weight=20, n_estimators=180,
                          reg_lambda=0.8, booster='gbtree',
                          subsample=0.9, nthread=-1)
model.fit(X_train, y_train)
pred = model.predict(X_test)
postmodel
#Predict:
y_pred = model.predict(X_test)           # the fitted classifier from above
# y_pred = sc.inverse_transform(y_pred)  # only if the target was scaled with a fitted scaler sc
#Assess Success of Prediction:
ROC AUC
TP/TN
F1
Confusion Matrix
#Tweak Parameters to Optimise Metrics:
#Select A new Model
#Repeat the process.
#Final Showdown
Measure the performance of all models against the holdout set.
And pick the final model.
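A rough sketch of the assessment and final-showdown steps with scikit-learn, assuming the fitted classifier, its predictions, and the holdout split created above:
from sklearn.metrics import roc_auc_score, f1_score, confusion_matrix
print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))  # ROC AUC needs probabilities
print(f1_score(y_test, y_pred))                                  # F1 score on hard labels
print(confusion_matrix(y_test, y_pred))                          # TP / FP / FN / TN counts
holdout_pred = model.predict(X_holdout)   # score each candidate on the untouched holdout set
print(f1_score(y_holdout, holdout_pred))  # repeat for every model, then pick the winner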
premodel
#Load Data:
import pandas as pd
train = pd.read_csv("../input/train_1.csv")
#Explore For Insights:
import matplotlib.pyplot as plt
plt.plot(mean_group)
plt.show()
#Split Data in Three Sets:
from sklearn.model_selection import train_test_split
holdout_size = int(len(X) * 0.1)   # reserve ~10% as a final holdout set (fraction assumed)
X_holdout, y_holdout = X.iloc[:holdout_size], y.iloc[:holdout_size]
X_rest, y_rest = X.iloc[holdout_size:], y.iloc[holdout_size:]
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.3, random_state=0)
#Add Additional Features:
mean = X_train[col].mean()
model
from keras.models import Sequential
from keras.layers import Dense, Flatten, Conv1D, MaxPooling1D
from keras.optimizers import SGD
from keras.wrappers.scikit_learn import KerasClassifier

def create_model():
    conv = Sequential()
    # input_shape = (timesteps, features); assumes X_train has been reshaped to 3-D
    conv.add(Conv1D(20, 4, input_shape=X_train.shape[1:3], activation='relu'))
    conv.add(MaxPooling1D(2))
    conv.add(Dense(50, activation='relu'))
    conv.add(Flatten())
    conv.add(Dense(1, activation='sigmoid'))
    sgd = SGD(lr=0.1, momentum=0.9, decay=0, nesterov=False)
    conv.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])
    return conv

model = KerasClassifier(build_fn=create_model, batch_size=500, epochs=20, verbose=1,
                        class_weight=class_weight)  # class_weight: optional dict for imbalanced targets
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
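Conv1D expects three-dimensional input of shape (samples, timesteps, features). A minimal sketch, assuming a flat tabular matrix where each column is treated as one timestep with a single feature; the reshaped arrays would then replace X_train and X_test in the fit and predict calls above:
import numpy as np
X_train_3d = np.asarray(X_train).reshape(len(X_train), -1, 1)
X_test_3d = np.asarray(X_test).reshape(len(X_test), -1, 1)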
postmodel
#Predict:
y_pred = model.predict(X_test)           # the fitted classifier from above
# y_pred = sc.inverse_transform(y_pred)  # only if the target was scaled with a fitted scaler sc
#Assess Success of Prediction:
ROC AUC
TP/TN
F1
Confusion Matrix
#Tweak Parameters to Optimise Metrics:
#Select A new Model
#Repeat the process.
#Final Showdown
Measure the performance of all models against the holdout set.
And pick the final model.
premodel
#Load Data:
import pandas as pd
train = pd.read_csv("../input/train_1.csv")
#Explore For Insights:
import matplotlib.pyplot as plt
plt.plot(mean_group)
plt.show()
#Split Data in Three Sets:
from sklearn.model_selection import train_test_split
holdout_size = int(len(X) * 0.1)   # reserve ~10% as a final holdout set (fraction assumed)
X_holdout, y_holdout = X.iloc[:holdout_size], y.iloc[:holdout_size]
X_rest, y_rest = X.iloc[holdout_size:], y.iloc[holdout_size:]
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.3, random_state=0)
#Add Additional Features:
mean = X_train[col].mean()
model
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier

def create_baseline():
    # create model; input_dim should match the number of input features (30 assumed here)
    model = Sequential()
    model.add(Dense(10, input_dim=30, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
    # Compile model with the logarithmic (binary cross-entropy) loss and the Adam optimizer.
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

model = KerasClassifier(build_fn=create_baseline, epochs=100, batch_size=5, verbose=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
postmodel
#Predict:
y_pred = model.predict(X_test)           # the fitted classifier from above
# y_pred = sc.inverse_transform(y_pred)  # only if the target was scaled with a fitted scaler sc
#Assess Success of Prediction:
ROC AUC
TP/TN
F1
Confusion Matrix
#Tweak Parameters to Optimise Metrics:
#Select A new Model
#Repeat the process.
#Final Showdown
Measure the performance of all models against the holdout set.
And pick the final model.
Multi-class Classification
This section relates to predictions over multiple classes. Machine learning has significantly enhanced the quality and accuracy of multi-class and multi-label predictions.
Multi-class Prediction
- Item-specific sales prediction, i.e. units of sales.
- Predicting store sales.
- Predict the units of sales for multiple items.
- Predicting the likelihood of certain crimes occurring at different locations and times.
- What, when, where, and at what severity the flu will strike.
- Predict the level of access new employees require.
- Predict the most pressing community issue.
- Predict which customers will purchase which policy.
- Predict which shoppers are most likely to make a repeat purchase.
- Predict which blog post from a selection would be most popular.
- Predict the destination of a taxi from its initial partial trajectory.
Data Type | Description | Examples |
---|---|---|
Categorical | Data that can be discretely classified. | Country, Exchange, Currency, Dummy Variable, State, Industry. |
Continuous | Data that changes incrementally in value. | Past Asset Price, Interest Rate, Competitor's Price. |
Stepped | Similar to continuous but changes infrequently. | P/E Ratio, Quarterly Revenue. |
Transformed Category | A different data type converted to categorical. | Traded inside one standard deviation - yes/no. P/E above 10 - yes/no. |
Models | Predictions from other models used as inputs. | ARIMA, AR, MA. |
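For multi-class models the target usually needs to be integer-encoded and the number of classes recorded. A minimal sketch with scikit-learn, assuming the training dataframe loaded below and a hypothetical 'category' target column:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(train["category"])   # map class names to 0..n_classes-1
num_class = len(le.classes_)              # passed to LightGBM below as "num_class"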
premodel
#Load Data:
import pandas as pd
train = pd.read_csv("../input/train_1.csv")
#Explore For Insights:
import matplotlib.pyplot as plt
plt.plot(mean_group)
plt.show()
#Split Data in Three Sets:
from sklearn.model_selection import train_test_split
holdout_size = int(len(X) * 0.1)   # reserve ~10% as a final holdout set (fraction assumed)
X_holdout, y_holdout = X.iloc[:holdout_size], y.iloc[:holdout_size]
X_rest, y_rest = X.iloc[holdout_size:], y.iloc[holdout_size:]
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.3, random_state=0)
#Add Additional Features:
mean = X_train[col].mean()
model
import lightgbm as lgbm
learning_rate = 0.8
num_leaves = 128
min_data_in_leaf = 1000
feature_fraction = 0.5
bagging_freq = 1000
num_boost_round = 1000
params = {"objective": "multiclass",
"boosting_type": "gbdt",
"learning_rate": learning_rate,
"num_leaves": num_leaves,
"feature_fraction": feature_fraction,
"bagging_freq": bagging_freq,
"verbosity": 0,
"metric": "multi_logloss",
"nthread": 4,
"subsample": 0.9
}
dtrain = lgbm.Dataset(X_train, y_train)
dvalid = lgbm.Dataset(X_test, y_test, reference=dtrain)
bst = lgbm.train(params, dtrain, num_boost_round, valid_sets=[dvalid],
                 verbose_eval=100, early_stopping_rounds=100)
y_prob = bst.predict(X_test, num_iteration=bst.best_iteration)  # one probability column per class
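For the multiclass objective the prediction above is a probability matrix with one column per class; a minimal sketch of recovering class labels:
import numpy as np
y_pred = np.argmax(y_prob, axis=1)        # index of the most probable class
# labels = le.inverse_transform(y_pred)   # back to original class names if a LabelEncoder was used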
import xgboost as xgb
model = xgb.XGBClassifier(objective='multi:softmax',
                          learning_rate=0.037, max_depth=5,
                          min_child_weight=20, n_estimators=180,
                          reg_lambda=0.8, booster='gbtree',
                          subsample=0.9, nthread=-1)
model.fit(X_train, y_train)
pred = model.predict(X_test)
postmodel
#Predict:
y_pred = model.predict(X_test)           # the fitted classifier from above
# y_pred = sc.inverse_transform(y_pred)  # only if the target was scaled with a fitted scaler sc
#Assess Success of Prediction:
ROC AUC
TP/TN
F1
Confusion Matrix
#Tweak Parameters to Optimise Metrics:
#Select A new Model
#Repeat the process.
#Final Showdown
Measure the performance of all models against the holdout set.
And pick the final model.
premodel
#Load Data:
import pandas as pd
train = pd.read_csv("../input/train_1.csv")
#Explore For Insights:
import matplotlib.pyplot as plt
plt.plot(mean_group)
plt.show()
#Split Data in Three Sets:
from sklearn.model_selection import train_test_split
holdout_size = int(len(X) * 0.1)   # reserve ~10% as a final holdout set (fraction assumed)
X_holdout, y_holdout = X.iloc[:holdout_size], y.iloc[:holdout_size]
X_rest, y_rest = X.iloc[holdout_size:], y.iloc[holdout_size:]
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.3, random_state=0)
#Add Additional Features:
mean = X_train[col].mean()
model
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier

def create_baseline():
    # create model; the output layer needs one unit per class with a softmax activation
    model = Sequential()
    model.add(Dense(10, input_dim=30, kernel_initializer='normal', activation='relu'))
    model.add(Dense(num_class, kernel_initializer='normal', activation='softmax'))
    # Compile model with the categorical (multi-class) log loss and the Adam optimizer.
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

model = KerasClassifier(build_fn=create_baseline, epochs=100, batch_size=5, verbose=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
postmodel
#Predict:
y_pred = model.predict(X_test)           # the fitted classifier from above
# y_pred = sc.inverse_transform(y_pred)  # only if the target was scaled with a fitted scaler sc
#Assess Success of Prediction:
ROC AUC
TP/TN
F1
Confusion Matrix
#Tweak Parameters to Optimise Metrics:
#Select A new Model
#Repeat the process.
#Final Showdown
Measure the performance of all models against the holdout set.
And pick the final model.
get started, prediction, classification, model, keras, concepts, supervised, learning