Predicting Resource Access Permission for Amazon Employees

In this article, we consider the Amazon Employee Access Challenge hosted by Kaggle. In this challenge, we are asked to predict the resource access permissions required for a new employee, given historical employee access data. The problem is motivated by the difficulties new employees face as they enter the company. In particular, new employees require access to resources such as web portals, printers, etc., and traditionally these access permissions are granted manually by the employee's manager as the employee encounters roadblocks. This process wastes a considerable amount of time for both new employees and their managers.

Our goal here is to automate this process by building a model from previously available employee access data. We aim to build a model that learns common access patterns from employee attributes such as the employee's role, the role of the employee's manager, etc. Such a model can then be used to predict the resource accesses required by a new employee given their attributes.

Exploratory Data Analysis

To gain a better understanding of the data, we first conduct an exploratory data analysis. We start by loading the data into pandas dataframes:

import pandas as pd
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
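
It is useful to note the dataset sizes up front (the train row count also appears in the describe() output later):

train.shape, test.shape
# train: (32769, 10) -- ACTION plus 9 attribute columns;
# test has the same nine attribute columns, with id in place of ACTION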

Here are the first five rows of the train data:

train.head()
ACTION RESOURCE MGR_ID ROLE_ROLLUP_1 ROLE_ROLLUP_2 ROLE_DEPTNAME ROLE_TITLE ROLE_FAMILY_DESC ROLE_FAMILY ROLE_CODE
0 1 39353 85475 117961 118300 123472 117905 117906 290919 117908
1 1 17183 1540 117961 118343 123125 118536 118536 308574 118539
2 1 36724 14457 118219 118220 117884 117879 267952 19721 117880
3 1 36135 5396 117961 118343 119993 118321 240983 290919 118322
4 1 42680 5905 117929 117930 119569 119323 123932 19793 119325

As we can see, the train data provides us with attributes such as the resource in question, information about the employee's role, the manager id, etc., as well as the action (grant or deny) that was taken for this employee. Here is a detailed description of the data attributes:

Column Name Description
ACTION ACTION is 1 if the resource was approved, 0 if the resource was not
RESOURCE An ID for each resource
MGR_ID The EMPLOYEE ID of the manager of the current EMPLOYEE ID record; an employee may have only one manager at a time
ROLE_ROLLUP_1 Company role grouping category id 1 (e.g. US Engineering)
ROLE_ROLLUP_2 Company role grouping category id 2 (e.g. US Retail)
ROLE_DEPTNAME Company role department description (e.g. Retail)
ROLE_TITLE Company role business title description (e.g. Senior Engineering Retail Manager)
ROLE_FAMILY_DESC Company role family extended description (e.g. Retail Manager, Software Engineering)
ROLE_FAMILY Company role family description (e.g. Retail Manager)
ROLE_CODE Company role code; this code is unique to each role (e.g. Manager)

The test data has a similar structure, except that, as expected, the ACTION field is not present:

test.head()
id RESOURCE MGR_ID ROLE_ROLLUP_1 ROLE_ROLLUP_2 ROLE_DEPTNAME ROLE_TITLE ROLE_FAMILY_DESC ROLE_FAMILY ROLE_CODE
0 1 78766 72734 118079 118080 117878 117879 118177 19721 117880
1 2 40644 4378 117961 118327 118507 118863 122008 118398 118865
2 3 75443 2395 117961 118300 119488 118172 301534 249618 118175
3 4 43219 19986 117961 118225 118403 120773 136187 118960 120774
4 5 42093 50015 117961 118343 119598 118422 300136 118424 118425

First, let's see whether the train or test data have any missing values:

train.isnull().sum(axis=0)
ACTION              0
RESOURCE            0
MGR_ID              0
ROLE_ROLLUP_1       0
ROLE_ROLLUP_2       0
ROLE_DEPTNAME       0
ROLE_TITLE          0
ROLE_FAMILY_DESC    0
ROLE_FAMILY         0
ROLE_CODE           0
dtype: int64
test.isnull().sum(axis=0)
id                  0
RESOURCE            0
MGR_ID              0
ROLE_ROLLUP_1       0
ROLE_ROLLUP_2       0
ROLE_DEPTNAME       0
ROLE_TITLE          0
ROLE_FAMILY_DESC    0
ROLE_FAMILY         0
ROLE_CODE           0
dtype: int64

The train and test data are already clean; there are no missing values.

Now let's get some statistics about the data:

train.describe()
ACTION RESOURCE MGR_ID ROLE_ROLLUP_1 ROLE_ROLLUP_2 ROLE_DEPTNAME ROLE_TITLE ROLE_FAMILY_DESC ROLE_FAMILY ROLE_CODE
count 32769.000000 32769.000000 32769.000000 32769.000000 32769.000000 32769.000000 32769.000000 32769.000000 32769.000000 32769.000000
mean 0.942110 42923.916171 25988.957979 116952.627788 118301.823156 118912.779914 125916.152644 170178.369648 183703.408893 119789.430132
std 0.233539 34173.892702 35928.031650 10875.563591 4551.588572 18961.322917 31036.465825 69509.462130 100488.407413 5784.275516
min 0.000000 0.000000 25.000000 4292.000000 23779.000000 4674.000000 117879.000000 4673.000000 3130.000000 117880.000000
25% 1.000000 20299.000000 4566.000000 117961.000000 118102.000000 118395.000000 118274.000000 117906.000000 118363.000000 118232.000000
50% 1.000000 35376.000000 13545.000000 117961.000000 118300.000000 118921.000000 118568.000000 128696.000000 119006.000000 118570.000000
75% 1.000000 74189.000000 42034.000000 117961.000000 118386.000000 120535.000000 120006.000000 235280.000000 290919.000000 119348.000000
max 1.000000 312153.000000 311696.000000 311178.000000 286791.000000 286792.000000 311867.000000 311867.000000 308574.000000 270691.000000
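
Most of these statistics are not very meaningful, since the fields are anonymized IDs rather than true numerical quantities. One number that is meaningful is the mean of ACTION, which we can cross-check with a direct count of the class balance:

train.ACTION.value_counts(normalize=True)
# expect roughly 0.94 for class 1 and 0.06 for class 0,
# consistent with the ACTION mean of 0.942110 from describe()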

The ACTION field can only be 0 or 1, with about 94% of the values being 1, so the classes are quite imbalanced. The other fields take on a much wider range of values. For example, here are 10 random samples from the RESOURCE field:

train.RESOURCE.sample(10)
8450     75047
25700    35376
3541     37672
13164    23921
22847     1581
16571    20281
17258    23096
9272     15818
23962    79092
28790     7543
Name: RESOURCE, dtype: int64

It seems that this field consists of categorical values that have been anonymized using a label encoding. In fact, the same seems to be true of all the other fields. To confirm this suspicion, let's find out the number of unique values in each field.

import numpy as np

# count the number of unique values in each column
def n_unique(x):
    return len(np.unique(x))

df = pd.concat([train.agg([min, max]).T, train.agg(n_unique).to_frame()], axis=1).T
df.index = ['min', 'max', 'n_unique']
df
ACTION RESOURCE MGR_ID ROLE_ROLLUP_1 ROLE_ROLLUP_2 ROLE_DEPTNAME ROLE_TITLE ROLE_FAMILY_DESC ROLE_FAMILY ROLE_CODE
min 0 0 25 4292 23779 4674 117879 4673 3130 117880
max 1 312153 311696 311178 286791 286792 311867 311867 308574 270691
n_unique 2 7518 4243 128 177 449 343 2358 67 343
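
Interestingly, ROLE_TITLE and ROLE_CODE both have 343 unique values, which hints that they may encode the same underlying role. A quick diagnostic (a sketch; a maximum of 1 would mean every title maps to exactly one code):

train.groupby('ROLE_TITLE').ROLE_CODE.nunique().max()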

The features are anonymized, but we can still figure out their relative importance, which helps us build better intuition about the data. To do this, we first train a Random Forest classifier on the train data.

from sklearn.ensemble import RandomForestClassifier

X_train = train.iloc[:, 1:]
y_train = train.iloc[:, 0]

clf_rf = RandomForestClassifier()
clf_rf.fit(X_train, y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
feature_importances = pd.DataFrame(clf_rf.feature_importances_.reshape(1, -1), 
                                   index=['feature_importance'], 
                                   columns=train.columns[1:])

feature_importances = feature_importances.T.sort_values(by=['feature_importance'], 
                                                            ascending=False).T
feature_importances
RESOURCE MGR_ID ROLE_DEPTNAME ROLE_FAMILY_DESC ROLE_ROLLUP_2 ROLE_CODE ROLE_TITLE ROLE_ROLLUP_1 ROLE_FAMILY
feature_importance 0.379566 0.18012 0.114294 0.102391 0.073208 0.045134 0.040788 0.034449 0.03005
import seaborn as sns
from matplotlib import pyplot as plt

g = sns.barplot(data=feature_importances,
            color="b")
g.set_xticklabels(feature_importances.columns, rotation=90)
plt.show()

As we can see, the RESOURCE field carries the most information, followed by the MGR_ID and ROLE_DEPTNAME fields.

Training Models and Cross-Validation

In the next step, we train several classifiers on the data and measure their performance using cross-validation. The goal here is to get an idea of the performance we can expect from different classifiers, so we use default classifier parameters and leave performance tuning for later.

But before we can train classifiers, we need to prepare the data. First, as we observed earlier, the data is categorical. A standard practice when dealing with categorical data is one-hot encoding:

from sklearn.preprocessing import OneHotEncoder
import numpy as np

def one_hot_encode(X_train, X_test):
    # fit the encoder on train and test stacked together, so that both
    # end up with the same set of one-hot columns
    ohe = OneHotEncoder()
    X_total = np.vstack([X_train, X_test])
    X_total_ohe = ohe.fit_transform(X_total)
    n_train = X_train.shape[0]
    X_train_ohe = X_total_ohe[:n_train, :]
    X_test_ohe = X_total_ohe[n_train:, :]
    return X_train_ohe, X_test_ohe

X_train = train.iloc[:, 1:]
y_train = train.iloc[:, 0]

X_test = test.iloc[:, 1:]
X_train_ohe, X_test_ohe = one_hot_encode(X_train, X_test)
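
To make the encoding concrete, here is a toy illustration (with made-up values) of what one-hot encoding does to a single categorical column:

toy = np.array([[0], [2], [1], [0]])
OneHotEncoder().fit_transform(toy).toarray()
# array([[1., 0., 0.],
#        [0., 0., 1.],
#        [0., 1., 0.],
#        [1., 0., 0.]])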

Second, we note that the performance evaluation metric for the competition is the area under the ROC curve (AUC), so we define a scoring function accordingly:

from sklearn import model_selection
from sklearn.metrics import roc_auc_score

def scoring(estimator, X, y):
    # AUC is computed from the predicted probability of the positive
    # class, not from hard 0/1 predictions
    y_pred = estimator.predict_proba(X)
    return roc_auc_score(y, y_pred[:, 1])
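
As an aside, sklearn also ships a built-in AUC scorer; for binary problems, passing scoring='roc_auc' to the cross-validation utilities should be equivalent to the function above (assuming a reasonably recent sklearn version):

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# both calls should yield identical results
cross_val_score(LogisticRegression(), X_train_ohe, y_train, scoring='roc_auc')
cross_val_score(LogisticRegression(), X_train_ohe, y_train, scoring=scoring)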
    

Given this information, we can train different classifiers and evaluate their performance. Here, we choose learning curves as the performance measure, since they provide more intuition about the model and the data. sklearn provides great tools for obtaining learning curves:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve


def plot_learning_curve(estimator, title, ax, X, y, ylim=None, cv=None, scoring=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5), fontsize=20):

    plt.title(title, fontsize=19)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples", fontsize=16)
    plt.xticks(fontsize=16)
    plt.ylabel("Score", fontsize=16)
    plt.yticks(fontsize=16)
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, scoring=scoring, cv=cv, n_jobs=n_jobs, 
        train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="lower right", fontsize=16)
    return plt

We now obtain learning curves for a number of well-known classifiers: Nearest Neighbors, Linear SVM, Decision Tree, Random Forest, Neural Net, AdaBoost, and Logistic Regression. Since learning curve calculations are computationally expensive, we use a truncated SVD decomposition to reduce the data dimensionality and speed up the computation.

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.decomposition import TruncatedSVD

names = ["Nearest Neighbors", "Linear SVM", "Decision Tree", "Random Forest", 
         "Neural Net", "AdaBoost", "Logistic Regression"]

classifiers = [
    KNeighborsClassifier(),
    SVC(probability=True),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    MLPClassifier(),
    AdaBoostClassifier(),
    LogisticRegression()]

svd = TruncatedSVD(n_components=10)
X_train_ohe_svd = svd.fit_transform(X_train_ohe)

plt.figure(figsize=(20, 30))
index = 1
for name, clf in zip(names, classifiers):
    ax = plt.subplot(4, 2, index)
    index += 1
    plot_learning_curve(clf, name, ax, X_train_ohe_svd, y_train,
                    scoring=scoring, n_jobs=3, cv=5)

plt.subplots_adjust(hspace=0.33, wspace=0.15)

plt.show()
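
As a quick check on how much of the one-hot matrix the 10 SVD components retain, we can inspect the explained variance ratio (a small diagnostic; the exact value depends on the data and random state):

svd.explained_variance_ratio_.sum()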

This experiment reveals that some estimators, such as SVM, perform poorly on this data and should be avoided. Estimators like Random Forest and Decision Tree tend to heavily overfit: their training scores stay far above their cross-validation scores. We also learn that although the Logistic Regression classifier does not have the best score here, it is the fastest of the seven classifiers we tried.

We can also compute the cross-validation score for all of these classifiers. This is computationally much less intensive than obtaining the learning curves, so we don't have to reduce the dimensionality first.

from sklearn.model_selection import cross_val_score

scores = {}
for name, clf in zip(names, classifiers):
    score = cross_val_score(clf, X_train_ohe, y_train, n_jobs=4, scoring=scoring)
    scores[name] = score.mean()

pd.Series(scores).sort_values(ascending=False)
Logistic Regression    0.856455
Neural Net             0.848133
Random Forest          0.823005
Nearest Neighbors      0.812712
Linear SVM             0.747521
AdaBoost               0.725165
Decision Tree          0.699022
dtype: float64

As we can see, training on the full one-hot-encoded data (rather than the SVD-reduced version) gives a different performance picture from the learning curves, and Logistic Regression now provides the best performance. Let's find out whether we can improve the Logistic Regression classifier by optimizing its regularization parameter:

from sklearn.model_selection import GridSearchCV
clf = LogisticRegression()
gs = GridSearchCV(clf, param_grid={'C': np.logspace(-1, 1, 10)}, 
                  scoring=scoring, n_jobs=4)
gs.fit(X_train_ohe, y_train)
GridSearchCV(cv=None, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=4,
       param_grid={'C': array([ 0.1    ,  0.16681,  0.27826,  0.46416,  0.77426,  1.29155,
        2.15443,  3.59381,  5.99484, 10.     ])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=<function scoring at 0x7f21786eaaa0>, verbose=0)
gs.best_params_
{'C': 2.1544346900318834}
clf = gs.best_estimator_
score = cross_val_score(clf, X_train_ohe, y_train, n_jobs=4, scoring=scoring)
score.mean()
0.8584785888043406

We get a marginal improvement over the previous result of 0.856455, which was obtained with $C = 1$. Next, we turn to feature engineering to see whether larger performance improvements are possible.

Feature Engineering

So far we have used each input feature in isolation from the others. We can use feature engineering to generate new features that encode relationships among the existing ones. For example, a particular resource may appear frequently in combination with certain manager ids or role departments.

To extract these types of relationships, we generate new features from all possible subsets of length k of the input features. This idea is similar in essence to generating polynomial features for numerical data; however, since we are dealing with categorical data, we generate feature combinations instead. The following code is a generator that produces all k-combinations of the input sequence s:

def generate_combs(s, k):
    """Yield all k-combinations of the sequence s, as lists."""
    if k == 0:
        yield []
        return
    if k > len(s):
        # no combinations of size k exist
        return

    for i in range(len(s)-k+1):
        for c in generate_combs(s[i+1:], k-1):
            yield [s[i]] + c

list(generate_combs(range(5), 3))
[[0, 1, 2],
 [0, 1, 3],
 [0, 1, 4],
 [0, 2, 3],
 [0, 2, 4],
 [0, 3, 4],
 [1, 2, 3],
 [1, 2, 4],
 [1, 3, 4],
 [2, 3, 4]]
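
Note that the standard library offers the same functionality via itertools.combinations (which yields tuples rather than lists); we keep the explicit generator above to make the logic transparent:

from itertools import combinations
list(combinations(range(5), 3))  # the same 10 subsets, as tuples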

We can use the above generator to compute the desired feature interactions of a given degree:

import numpy as np

def generate_high_order_features(data, degree=2):
    """
    Generate new features by concatenating existing features
    from data. This is similar to polynomial feature generation
    but can be used with categorical data as well.
    
    Parameters:
    -----------
    data: numpy array, contains data with features to be concatenated
    
    degree: int, number of features to concatenate to generate each new
    feature. For example, if there are three features 'a', 'b', and 'c'
    and degree=2, the generated features are 'ab', 'ac', and 'bc'.
    
    """
    m, n = data.shape
    
    feature_combs = generate_combs(range(n), degree)
    
    # each column combination becomes one new feature: the hash of the
    # tuple of values acts as a label for the combined category
    # (hash collisions are possible in principle, but rare enough here)
    return np.vstack([
        np.array([hash(tuple(row)) for row in data[:, comb].tolist()])
        for comb in feature_combs
    ]).T

It is nicer to package our feature generator as a sklearn transformer; this way, we can use it in data pipelines should we need to. We call this transformer HighOrderFeatures, since it produces high-order interactions of the existing features.

from sklearn.base import BaseEstimator, TransformerMixin

class HighOrderFeatures(BaseEstimator, TransformerMixin):
    def __init__(self, degree=2):
        self.degree = degree
    
    def fit(self, X, y=None):
        # stateless: there is nothing to learn from the data
        return self
        
    def transform(self, X):
        if isinstance(X, pd.DataFrame):
            X = X.values
        elif isinstance(X, list):
            X = np.array(X)
            
        # keep the original features and append interactions of every
        # order from 2 up to self.degree
        X_features = [X]
        for d in range(2, self.degree+1):
            X_features.append(generate_high_order_features(X, degree=d))

        return np.hstack(X_features)
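
Since HighOrderFeatures follows the transformer interface, it can be dropped into a Pipeline alongside other steps. A sketch of the idea (we do not actually use a pipeline in this article, and the generated hash values would still need one-hot encoding, as handled below):

from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('hof', HighOrderFeatures(degree=2)),
    # ('ohe', ...),  # a suitable one-hot encoding step would go here
    ('clf', LogisticRegression()),
])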

We instantiate a HighOrderFeatures transformer with degree = 2, which means it generates features of order one (the original features) as well as order two (combinations of every two features). We then apply this transformer to the train and test data:

hof = HighOrderFeatures(degree=2)
X_train_hof = hof.fit_transform(X_train)
X_test_hof = hof.fit_transform(X_test)
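
As a sanity check, with 9 input features and degree 2 we expect the 9 original columns plus C(9, 2) = 36 pairwise combinations, 45 columns in total:

X_train_hof.shape  # expect (32769, 45)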

Unfortunately, the OneHotEncoder provided by sklearn (at least in the version used here) expects non-negative integer inputs, while our feature generator produces arbitrary hash values, which can be negative and very large. So we have to write our own!

from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin

class OneHotEncoder(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.fitted = False
    
    def fit(self, X, y=None):
        # map each column's unique values to consecutive integer indices
        col_uniques = [np.unique(c) for c in X.T]
        self.map = [dict(zip(c, range(len(c)))) for c in col_uniques]
        self.fitted = True
        return self
        
    def transform(self, X):
        if not self.fitted:
            raise Exception("Transformer not fitted.")
        
        n_samples, n_features = X.shape
        if len(self.map) != n_features:
            raise Exception("X must have the same number of columns "
                            "as the matrix used to fit the transformer.")
        
        # build one sparse indicator block per input column, then stack
        # the blocks horizontally; note that values not seen during fit
        # are not handled and will raise an error here
        res = []
        for i, col in enumerate(X.T):
            col_indices = np.vectorize(self.map[i].get)(col)
            row_indices = np.arange(n_samples)
            ones = np.ones(n_samples)
            col_encoded = sparse.coo_matrix((ones, (row_indices, col_indices)))
            res.append(col_encoded)
            
        return sparse.hstack(res).tocsr()
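
A quick check on toy data (with made-up values) confirms the encoder behaves as expected:

enc = OneHotEncoder().fit(np.array([[1, 5], [2, 5], [1, 7]]))
enc.transform(np.array([[2, 7]])).toarray()
# array([[0., 1., 0., 1.]])  (2 is the 2nd category of column 0,
#                             7 the 2nd category of column 1)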
            

We then apply one-hot encoding to the high-order features generated in the previous step; the helper below is the same as before, except that it now picks up our custom OneHotEncoder:

def one_hot_encode(X_train, X_test):
    # identical to the earlier helper, but now resolves to our
    # custom OneHotEncoder defined above
    ohe = OneHotEncoder()
    X_total = np.vstack([X_train, X_test])
    X_total_ohe = ohe.fit_transform(X_total)
    n_train = X_train.shape[0]
    X_train_ohe = X_total_ohe[:n_train, :]
    X_test_ohe = X_total_ohe[n_train:, :]
    return X_train_ohe, X_test_ohe

X_train_ohe, X_test_ohe = one_hot_encode(X_train_hof, X_test_hof)

Finally, we are ready to train a model on our data. We use a Logistic Regression classifier, as in the previous section, to see whether the high-order features yield any performance improvement.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score

clf = LogisticRegression()
gs = GridSearchCV(clf, param_grid={'C': np.logspace(-1, 1, 10)}, scoring=scoring)
gs.fit(X_train_ohe, y_train)
GridSearchCV(cv=None, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'C': array([ 0.1    ,  0.16681,  0.27826,  0.46416,  0.77426,  1.29155,
        2.15443,  3.59381,  5.99484, 10.     ])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=<function scoring at 0x7fc53752f848>, verbose=0)
gs.best_params_
{'C': 0.2782559402207124}
clf = gs.best_estimator_
cross_val_score(clf, X_train_ohe, y_train, n_jobs=4, scoring=scoring).mean()
0.8698137523547452

The cross-validation score has improved from 0.8585 to 0.8698, which is a considerable improvement. Next, we create a submission file using the obtained classifier and submit it to Kaggle for evaluation:

def make_submission_file(y_test_pred, filename=None):
    submit_data = pd.DataFrame(columns=['ID', 'ACTION'])
    submit_data['ID'] = range(1, len(y_test_pred)+1)
    submit_data['ACTION'] = y_test_pred
    if not filename:
        filename = 'submit.csv'
    submit_data.to_csv(filename, index=False)

# log-probabilities are a monotonic transform of probabilities, so the
# ranking, and hence the AUC score, is the same as with predict_proba
y_test_pred = clf.predict_log_proba(X_test_ohe)[:, 1]
make_submission_file(y_test_pred, 'submit/submit.csv')

We receive scores of 0.89919 and 0.89322 on the public and private leaderboards, respectively. Interestingly, the leaderboard scores are higher than our local cross-validation score, which is encouraging.

The private leaderboard score is used to determine the winning solution, which in this competition scored 0.92360. Our score is around 97% of the winning one, which is promising. Closing the remaining gap would require more complex models and more extensive hyper-parameter tuning; we present some ideas for this in the next section.

End Notes

In this article, we considered the problem of predicting access requirements for new Amazon employees. We first performed an exploratory data analysis, which helped us become familiar with the data; in particular, we learned that all features are categorical, which motivated the use of one-hot encoding when building our models. We then compared the performance of several classification algorithms and picked Logistic Regression for its good performance and simplicity. Using feature engineering together with cross-validation and hyper-parameter tuning, we achieved a score within a few percent of the winning solution.

To further improve our model, we could try the other classification algorithms that showed promising results in our comparison. In particular, a natural next step is to tune the Random Forest and AdaBoost classifiers, which may offer better performance at the expense of higher complexity.