Predicting Resource Access Permission for Amazon Employees¶
In this article, we consider the Amazon Employee Access Challenge hosted by Kaggle. In this challenge, we are asked to predict the resource access permissions required for a new employee, given historical employee access data. The problem is motivated by the difficulties new employees face when they join the company. In particular, new employees require access to resources such as web portals and printers, and traditionally these access permissions are granted manually by the employee's manager as the employee encounters roadblocks. This process wastes a considerable amount of time for both new employees and their managers.
Our goal here is to automate this process by building a model from previously collected employee access data. The model should learn common access patterns from employee attributes such as the employee's role, the manager's role, and so on. Such a model can then be used to predict the resource access a new employee will require, given their attributes.
Exploratory Data Analysis¶
To gain a better understanding of the data, we first conduct an exploratory data analysis. We start by loading the data into pandas dataframes:
import pandas as pd
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
Here are the first five rows of the train data:
train.head()
As we can see, the train data provides attributes such as the resource in question, information about the employee's role, the manager's id, etc., as well as the action (granted or not) taken for that employee. Here is a detailed description of the data attributes:
| Column Name | Description |
|---|---|
| ACTION | ACTION is 1 if the resource was approved, 0 if the resource was not |
| RESOURCE | An ID for each resource |
| MGR_ID | The EMPLOYEE ID of the manager of the current EMPLOYEE ID record; an employee may have only one manager at a time |
| ROLE_ROLLUP_1 | Company role grouping category id 1 (e.g. US Engineering) |
| ROLE_ROLLUP_2 | Company role grouping category id 2 (e.g. US Retail) |
| ROLE_DEPTNAME | Company role department description (e.g. Retail) |
| ROLE_TITLE | Company role business title description (e.g. Senior Engineering Retail Manager) |
| ROLE_FAMILY_DESC | Company role family extended description (e.g. Retail Manager, Software Engineering) |
| ROLE_FAMILY | Company role family description (e.g. Retail Manager) |
| ROLE_CODE | Company role code; this code is unique to each role (e.g. Manager) |
The test data has a similar structure, except that, as expected, the ACTION field is not present:
test.head()
First, let's check whether the train or test data contain any missing values:
train.isnull().sum(axis=0)
test.isnull().sum(axis=0)
Both the train and test data are already clean: there are no missing values.
Now let's get some statistics about the data:
train.describe()
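In addition to these summary statistics, we can check the class balance of the target directly:

train.ACTION.value_counts(normalize=True)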
The ACTION field takes only the values 0 and 1, with about 94% of the values being 1. The other fields take on a much wider range of values. For example, here are 10 random samples from the RESOURCE field:
train.RESOURCE.sample(10)
It appears that this field consists of categorical values that have been anonymized using a label encoding. In fact, the same seems to be true of all the other fields. To confirm this suspicion, let's find the number of unique values in each field:
import numpy as np

# For each column, report the value range and the number of distinct values.
df = pd.concat([train.agg(['min', 'max']).T,
                train.nunique().to_frame('n_unique')], axis=1).T
df
The features are anonymized, but we can still figure out their relative importance, which helps build better intuition about the data. To do this, we first train a Random Forest classifier on the train data:
from sklearn.ensemble import RandomForestClassifier

# The first column is the target (ACTION); the remaining columns are features.
X_train = train.iloc[:, 1:]
y_train = train.iloc[:, 0]
clf_rf = RandomForestClassifier()
clf_rf.fit(X_train, y_train)

# Collect the importances in a one-row dataframe, sorted in decreasing order.
feature_importances = pd.DataFrame(clf_rf.feature_importances_.reshape(1, -1),
                                   index=['feature_importance'],
                                   columns=train.columns[1:])
feature_importances = feature_importances.T.sort_values(by=['feature_importance'],
                                                        ascending=False).T
feature_importances
import seaborn as sns
from matplotlib import pyplot as plt
g = sns.barplot(data=feature_importances,
color="b")
g.set_xticklabels(feature_importances.columns, rotation=90)
plt.show()
As we can see, the RESOURCE field carries the most information, followed by the manager id and role department fields.
Training Models and Cross-Validation¶
In the next step, we train several classifiers on the data and measure their performance using cross-validation. The goal here is to get an idea of the performance we can expect from different classifiers, so we use default classifier parameters and leave performance tuning for later.
Before we can train the classifiers, we need to prepare the data. First, as we observed earlier, the data is categorical, and a standard practice for categorical data is one-hot encoding. Note that we fit the encoder on the union of the train and test data, so that both are encoded against a consistent set of categories:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

def one_hot_encode(X_train, X_test):
    # Fit the encoder on train and test together so that categories
    # appearing only in the test set are still encoded consistently.
    ohe = OneHotEncoder()
    X_total = np.vstack([X_train, X_test])
    X_total_ohe = ohe.fit_transform(X_total)
    n_train = X_train.shape[0]
    X_train_ohe = X_total_ohe[:n_train, :]
    X_test_ohe = X_total_ohe[n_train:, :]
    return X_train_ohe, X_test_ohe

X_train = train.iloc[:, 1:]
y_train = train.iloc[:, 0]
X_test = test.iloc[:, 1:]
X_train_ohe, X_test_ohe = one_hot_encode(X_train, X_test)
Second, we note that the competition's evaluation metric is the area under the ROC curve. Since ROC AUC is computed from the ranking of predicted scores, our scoring function passes class probabilities rather than hard predictions to roc_auc_score:
from sklearn.metrics import roc_auc_score
def scoring(estimator, X, y):
y_pred = estimator.predict_proba(X)
return roc_auc_score(y, y_pred[:, 1])
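As a quick sanity check, we can apply this scorer to the random forest fitted earlier (the training-set score is of course optimistic, since the model has already seen these samples):

scoring(clf_rf, X_train, y_train)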
Given this information, we can train different classifiers and evaluate their performance. Here, we choose learning curves as the performance measure, since they provide more intuition about the model and the data. sklearn provides convenient tools for computing learning curves:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
def plot_learning_curve(estimator, title, ax, X, y, ylim=None, cv=None, scoring=None,
n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5), fontsize=20):
plt.title(title, fontsize=19)
if ylim is not None:
plt.ylim(*ylim)
plt.xlabel("Training examples", fontsize=16)
plt.xticks(fontsize=16)
plt.ylabel("Score", fontsize=16)
plt.yticks(fontsize=16)
train_sizes, train_scores, test_scores = learning_curve(
estimator, X, y, scoring=scoring, cv=cv, n_jobs=n_jobs,
train_sizes=train_sizes)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.1,
color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
label="Cross-validation score")
plt.legend(loc="lower right", fontsize=16)
plt.grid(True)
return plt
We now obtain learning curves for a number of well-known classifiers: Nearest Neighbors, Linear SVM, Decision Tree, Random Forest, Neural Net, AdaBoost, and Logistic Regression. Since learning-curve calculations are computationally expensive, we first reduce the data dimensionality with truncated SVD to speed up the computation.
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.decomposition import TruncatedSVD
names = ["Nearest Neighbors", "Linear SVM", "Decision Tree", "Random Forest",
"Neural Net", "AdaBoost", "Logistic Regression"]
classifiers = [
KNeighborsClassifier(),
SVC(probability=True),
DecisionTreeClassifier(),
RandomForestClassifier(),
MLPClassifier(),
AdaBoostClassifier(),
LogisticRegression()]
svd = TruncatedSVD(n_components=10)
X_train_ohe_svd = svd.fit_transform(X_train_ohe)
plt.figure(figsize=(20, 30))
index = 1
for name, clf in zip(names, classifiers):
ax = plt.subplot(4, 2, index)
index += 1
    plot_learning_curve(clf, name, ax, X_train_ohe_svd, y_train,
                        scoring=scoring, n_jobs=3, cv=5)
plt.subplots_adjust(hspace=0.33, wspace=0.15)
plt.show()
This experiment reveals that some estimators, such as SVM, perform poorly on this data and should be avoided. It also shows that estimators like Random Forest and Decision Tree tend to overfit heavily. Finally, although the Logistic Regression classifier does not achieve the best score here, it is the fastest among the seven classifiers we tried.
We can also calculate the cross-validation score for each of these classifiers. This is computationally much less intensive than computing learning curves, so we do not need dimensionality reduction here.
from sklearn.model_selection import cross_val_score

scores = {}
for name, clf in zip(names, classifiers):
    score = cross_val_score(clf, X_train_ohe, y_train, n_jobs=4, scoring=scoring)
    scores[name] = score.mean()
pd.Series(scores).sort_values(ascending=False)
Here we get a somewhat different picture from the learning curves: Logistic Regression now provides the best performance. Let's see whether we can improve the Logistic Regression classifier further by optimizing its regularization parameter:
from sklearn.model_selection import GridSearchCV
clf = LogisticRegression()
gs = GridSearchCV(clf, param_grid={'C': np.logspace(-1, 1, 10)},
scoring=scoring, n_jobs=4)
gs.fit(X_train_ohe, y_train)
gs.best_params_
clf = gs.best_estimator_
score = cross_val_score(clf, X_train_ohe, y_train, n_jobs=4, scoring=scoring)
score.mean()
We get a marginal improvement over the previous result of 0.856455, which was obtained with $C = 1$. Next, we turn to feature engineering to see if larger performance improvements are possible.
Feature Engineering¶
So far we have used each input feature in isolation from the others. We can use feature engineering to generate new features that encode relationships among different features. For example, a particular resource may appear frequently in combination with certain manager ids or role departments.
To extract these types of relationships, we can generate new features from all possible subsets of length k of the input features. The idea is similar in essence to generating polynomial features for numerical data; however, since we are dealing with categorical data, we generate feature combinations instead. The following code is a generator that produces all k-combinations of the input sequence s:
def generate_combs(s, k):
    # Recursively generate all k-combinations of the sequence s.
    if k > len(s):
        return
    if k == 0:
        yield []
        return
    for i in range(len(s) - k + 1):
        for c in generate_combs(s[i+1:], k-1):
            yield [s[i]] + c
list(generate_combs(range(5), 3))
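As an aside, the standard library's itertools.combinations produces exactly the same combinations (as tuples rather than lists), so the generator above is mainly for exposition:

from itertools import combinations
list(combinations(range(5), 3))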
We can use the above generator to compute the desired feature interactions of a given degree:
import numpy as np

def generate_high_order_features(data, degree=2):
    """
    Generate new features by concatenating existing features
    from data. This is similar to polynomial feature generation
    but can be used with categorical data as well.
    Parameters:
    -----------
    data: numpy array, contains data with features to be concatenated
    degree: int, number of features to concatenate to generate each new
        feature. For example, if there are three features 'a', 'b', and 'c'
        and degree=2, the generated features are 'ab', 'ac', 'bc'.
    """
    m, n = data.shape
    feature_combs = generate_combs(range(n), degree)
    # Hash each tuple of feature values so that every combination is
    # encoded as a single categorical value.
    return np.vstack([
        np.array([hash(tuple(row)) for row in data[:, comb].tolist()])
        for comb in feature_combs
    ]).T
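A quick check on a toy array confirms the output shape; with three input features and degree 2, we expect one new column per feature pair:

toy = np.array([[1, 2, 3],
                [1, 2, 4],
                [5, 2, 3]])
# Columns correspond to the pairs (0, 1), (0, 2), and (1, 2).
generate_high_order_features(toy, degree=2).shape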
It is much nicer to package our feature generator as a sklearn transformer; this way, we can use it in data pipelines should we need to. We call this transformer HighOrderFeatures, since it produces high-order interactions of the existing features.
from sklearn.base import BaseEstimator, TransformerMixin

class HighOrderFeatures(BaseEstimator, TransformerMixin):
    def __init__(self, degree=2):
        self.degree = degree
    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn.
        return self
    def transform(self, X):
        if isinstance(X, pd.DataFrame):
            X = X.to_numpy()
        elif isinstance(X, list):
            X = np.array(X)
        # Keep the original features and append interactions of every
        # order from 2 up to self.degree.
        X_poly_features = [X]
        for d in range(2, self.degree + 1):
            X_poly_features.append(generate_high_order_features(X, degree=d))
        return np.hstack(X_poly_features)
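Because the transformer implements the standard fit/transform interface, it composes with other sklearn steps. Here is a minimal sketch of how it could sit inside a pipeline (hypothetical; not used in the rest of this article):

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

# A tree model can consume the hashed interaction columns directly,
# so this toy pipeline skips the one-hot encoding step.
pipe = Pipeline([('hof', HighOrderFeatures(degree=2)),
                 ('clf', RandomForestClassifier())])
pipe.fit(X_train, y_train)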
We instantiate a HighOrderFeatures transformer with degree = 2, which means it generates features of order one (the original features) as well as order two (combinations of every two features). We then apply this transformer to the train and test data:
hof = HighOrderFeatures(degree=2)
X_train_hof = hof.fit_transform(X_train)
X_test_hof = hof.transform(X_test)
Unfortunately, the OneHotEncoder provided by sklearn (at least in the version used here) has trouble with the output of our feature generator, whose hashed values are arbitrary, possibly negative integers spread over an enormous range. So we have to write our own!
from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin

class OneHotEncoder(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.fitted = False
    def fit(self, X, y=None):
        # For each column, map every distinct value to a contiguous index.
        col_uniques = [np.unique(c) for c in X.T]
        self.map = [dict(zip(c, range(len(c)))) for c in col_uniques]
        self.fitted = True
        return self
    def transform(self, X):
        if not self.fitted:
            raise Exception("Transformer not fitted.")
        n_samples, n_features = X.shape
        if len(self.map) != n_features:
            raise Exception("X must have the same number of columns "
                            "as the matrix used to fit the transformer.")
        res = []
        for i, col in enumerate(X.T):
            # Build a sparse indicator matrix for this column: one row per
            # sample, one column per distinct value seen during fit.
            col_indices = np.vectorize(self.map[i].get)(col)
            row_indices = np.arange(n_samples)
            ones = np.ones(n_samples)
            col_encoded = sparse.coo_matrix(
                (ones, (row_indices, col_indices)),
                shape=(n_samples, len(self.map[i])))
            res.append(col_encoded)
        return sparse.hstack(res).tocsr()
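A quick sanity check on a toy matrix shows the expected result: two distinct values in each of the two columns yield four indicator columns:

toy = np.array([[1, 10],
                [2, 10],
                [1, 20]])
OneHotEncoder().fit(toy).transform(toy).shape  # (3, 4)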
We then apply one-hot encoding to the high-order features generated in the previous step:
def one_hot_encode(X_train, X_test):
    # As before, fit on train and test together for consistent categories;
    # this time the encoder is our own implementation defined above.
    ohe = OneHotEncoder()
    X_total = np.vstack([X_train, X_test])
    X_total_ohe = ohe.fit_transform(X_total)
    n_train = X_train.shape[0]
    X_train_ohe = X_total_ohe[:n_train, :]
    X_test_ohe = X_total_ohe[n_train:, :]
    return X_train_ohe, X_test_ohe

X_train_ohe, X_test_ohe = one_hot_encode(X_train_hof, X_test_hof)
Finally, we are ready to train a model on the new features. As in the previous section, we use a Logistic Regression classifier, so we can see whether the high-order features have yielded a performance improvement.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score
clf = LogisticRegression()
gs = GridSearchCV(clf, param_grid={'C': np.logspace(-1, 1, 10)}, scoring=scoring)
gs.fit(X_train_ohe, y_train)
gs.best_params_
clf = gs.best_estimator_
cross_val_score(clf, X_train_ohe, y_train, n_jobs=4, scoring=scoring).mean()
The cross-validation score has improved from 0.8585 to 0.8698, which is a considerable improvement. Next, we create a submission file using the tuned classifier and submit it to Kaggle for evaluation:
def make_submission_file(y_test_pred, filename=None):
    submit_data = pd.DataFrame(columns=['ID', 'ACTION'])
    submit_data['ID'] = range(1, len(y_test_pred) + 1)
    submit_data['ACTION'] = y_test_pred
    if not filename:
        filename = 'submit.csv'
    submit_data.to_csv(filename, index=False)

# ROC AUC depends only on the ranking of the predictions, so submitting
# log-probabilities is equivalent to submitting probabilities.
y_test_pred = clf.predict_log_proba(X_test_ohe)[:, 1]
make_submission_file(y_test_pred, 'submit/submit.csv')
We receive scores of 0.89919 and 0.89322 on the public and private leaderboards, respectively. Interestingly, the leaderboard scores are higher than our local cross-validation score, which is encouraging.
The private leaderboard score determines the winning solution, which in this competition scored 0.92360. Our score is around 97% of the winning score, which is promising. Closing the remaining 3% would require more complex models and more extensive hyper-parameter tuning; we present some ideas along these lines in the next section.
End Notes¶
In this article, we considered the problem of predicting access requirements for new Amazon employees. We first performed an exploratory data analysis, which helped us become familiar with the data. We learned that all the features are categorical, which motivated the use of one-hot encoding when building our model. We then compared the performance of several classification algorithms and picked Logistic Regression for its good performance and simplicity. Finally, we achieved a strong score using feature engineering together with cross-validation and hyper-parameter tuning.
To further improve our model, we could try the other classification algorithms that showed promising results in our performance comparison. In particular, a natural next step is to try the Random Forest and AdaBoost classifiers, which may offer better performance at the expense of higher complexity; a starting point is sketched below.
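As a minimal starting point (an untested sketch; the choice of n_estimators is arbitrary), both estimators accept the sparse one-hot matrix directly:

from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

for clf in [RandomForestClassifier(n_estimators=200), AdaBoostClassifier()]:
    score = cross_val_score(clf, X_train_ohe, y_train, scoring=scoring, n_jobs=4)
    print(type(clf).__name__, score.mean())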