Assignment Task
Problem 1. Reading the dataset
Q1. Read the first 10,000 rows from the credit card dataset provided in the assignment_data folder
Name your DataFrame df
Rename the column ‘PAY_0’ to ‘PAY_1’ and the column ‘default payment next month’ to ‘payment_default’
Delete the ‘ID’ column
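The three steps above can be sketched on a toy table. This is only an illustration: the real file name and format are not stated here, so an in-memory CSV with made-up values stands in for it, and nrows=3 plays the role of 10,000. If the dataset turns out to be an Excel file, pd.read_excel also accepts nrows.

```python
import io

import pandas as pd

# Toy stand-in for the real file; the values and the reduced column set are made up.
csv_text = (
    "ID,LIMIT_BAL,PAY_0,default payment next month\n"
    "1,20000,2,1\n"
    "2,120000,-1,1\n"
    "3,90000,0,0\n"
    "4,50000,0,0\n"
)

# nrows caps the number of data rows that are read.
df = pd.read_csv(io.StringIO(csv_text), nrows=3)

# Rename the two columns, then drop the identifier column.
df = df.rename(columns={
    "PAY_0": "PAY_1",
    "default payment next month": "payment_default",
})
df = df.drop(columns="ID")

print(df.shape)
print(df.columns.tolist())
```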
Q2. List which features are numeric, ordinal, and nominal variables, and how many features of each kind there are in the dataset. To answer this question:
Find the definitions of the variables provided elsewhere in the course material (hint: make sure you do weekly tutorials)
Find the definitions of numeric, ordinal and nominal variables
Carefully consider the values of the data itself as well as the output of df.info().
Q3. Missing Values.
Print out the number of missing values for each variable in the dataset and comment on your findings.
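A minimal sketch of counting missing values per column, using a small made-up DataFrame in place of the assignment data:

```python
import numpy as np
import pandas as pd

# Made-up data with a few gaps; the real df comes from Problem 1.
df = pd.DataFrame({
    "LIMIT_BAL": [20000.0, np.nan, 90000.0, 50000.0],
    "EDUCATION": [2.0, 2.0, np.nan, np.nan],
    "AGE": [24, 26, 34, 37],
})

# isnull() marks missing cells; sum() counts them column by column.
missing_counts = df.isnull().sum()
print(missing_counts)
```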
Problem 2. Cleaning data and dealing with categorical features
Q1.
Use an appropriate pandas function to impute missing values, choosing one of the following two strategies for each variable: mean or mode.
– Take into consideration the type of each variable and the best practices we discussed in class/lecture notes
Explain what data imputation is, how you have done it here, and what decisions you had to make.
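One way to impute with pandas is fillna(), sketched here on made-up data. Which columns receive the mean and which the mode is an assumption made purely for illustration; in the assignment that choice depends on the variable types identified in Problem 1.

```python
import numpy as np
import pandas as pd

# Made-up data: AGE is treated as numeric, EDUCATION as a category code.
df = pd.DataFrame({
    "AGE": [24.0, np.nan, 34.0, 38.0],
    "EDUCATION": [2.0, 2.0, np.nan, 1.0],
})

# Mean imputation for the numeric column.
df["AGE"] = df["AGE"].fillna(df["AGE"].mean())

# Mode imputation for the categorical column; mode() returns a Series,
# so take its first entry.
df["EDUCATION"] = df["EDUCATION"].fillna(df["EDUCATION"].mode().iloc[0])

print(df)
```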
Q2.
Print value_counts() of the ‘SEX’ column and add a dummy variable named ‘SEX_FEMALE’ to df using get_dummies()
Carefully explain what the values of the new variable ‘SEX_FEMALE’ mean
Make sure the variable ‘SEX’ is deleted from df
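A possible sketch of the dummy-variable step on made-up data. It assumes that code 2 in ‘SEX’ denotes female, which should be checked against the variable definitions in the course material.

```python
import pandas as pd

# Made-up data; the coding 1 = male, 2 = female is an assumption to verify.
df = pd.DataFrame({"SEX": [1, 2, 2, 1, 2], "AGE": [24, 26, 34, 37, 57]})

print(df["SEX"].value_counts())

# get_dummies creates one indicator column per code: SEX_1 and SEX_2.
dummies = pd.get_dummies(df["SEX"], prefix="SEX")

# Keep only the indicator for the assumed female code, stored as 0/1.
df["SEX_FEMALE"] = dummies["SEX_2"].astype(int)

# Remove the original column.
df = df.drop(columns="SEX")

print(df)
```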
Q3. Print value_counts() of the ‘MARRIAGE’ column and carefully comment on what you notice in relation to the definition of this variable.
Q4.
Apply get_dummies() to the ‘MARRIAGE’ feature and add the dummy variables ‘MARRIAGE_MARRIED’, ‘MARRIAGE_SINGLE’, and ‘MARRIAGE_OTHER’ to df.
Carefully consider how to allocate all the values of ‘MARRIAGE’ across these 3 newly created features
Explain what decisions you had to make
Make sure that ‘MARRIAGE’ is deleted from df
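The mechanics of this step might look like the sketch below, on made-up data. The mapping from codes to labels, and in particular sending the code 0 to the OTHER group, is a hypothetical allocation used only to show the technique; the actual allocation is the decision the question asks you to make and justify.

```python
import pandas as pd

# Made-up data containing every code used in this illustration.
df = pd.DataFrame({"MARRIAGE": [1, 2, 3, 0, 2]})

# Hypothetical allocation of codes to the three target groups.
labels = {1: "MARRIED", 2: "SINGLE", 3: "OTHER", 0: "OTHER"}
marriage = df["MARRIAGE"].map(labels)

# One 0/1 indicator column per label, named MARRIAGE_<label>.
dummies = pd.get_dummies(marriage, prefix="MARRIAGE").astype(int)

# Replace the original column with the three indicators.
df = pd.concat([df.drop(columns="MARRIAGE"), dummies], axis=1)

print(df)
```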
Q5. In the column ‘EDUCATION’, convert the values {0, 5, 6} to the value 4.
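A short sketch of recoding several values at once with replace(), shown on a made-up column:

```python
import pandas as pd

# Made-up column containing every code from 0 to 6.
df = pd.DataFrame({"EDUCATION": [0, 1, 2, 3, 4, 5, 6]})

# Map 0, 5 and 6 to 4; all other values are left unchanged.
df["EDUCATION"] = df["EDUCATION"].replace({0: 4, 5: 4, 6: 4})

print(df["EDUCATION"].tolist())
```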
Problem 3. Preparing X and y arrays
Q1.
Create a numpy array y from the first 8,000 observations of the ‘payment_default’ column in df
Create a numpy array X from the first 8,000 observations of all the remaining variables in df
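A sketch of building the two arrays, using a made-up 10-row DataFrame in which n = 8 plays the role of 8,000:

```python
import numpy as np
import pandas as pd

# Made-up data; the real df is the cleaned DataFrame from Problem 2.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "LIMIT_BAL": rng.integers(10000, 500000, size=10),
    "AGE": rng.integers(21, 70, size=10),
    "payment_default": rng.integers(0, 2, size=10),
})

n = 8  # stands in for 8,000

# Target: the first n rows of the label column.
y = df["payment_default"].iloc[:n].to_numpy()

# Features: the first n rows of every other column.
X = df.drop(columns="payment_default").iloc[:n].to_numpy()

print(X.shape, y.shape)
```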
Q2.
Use an appropriate sklearn library we used in class to create y_train, y_test, X_train and X_test by splitting the data into 75% train and 25% test datasets
– Set random_state to 4 and stratify the subsamples so that train and test datasets have roughly equal proportions of the target’s class labels
Standardise the data to mean zero and variance one using an appropriate sklearn library
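The split and the standardisation can be sketched with train_test_split and StandardScaler on synthetic arrays. Fitting the scaler on the training data only and reusing it on the test data is a common convention that is assumed here; the task text does not spell it out.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for X and y, with a 75/25 class imbalance.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = np.array([0] * 150 + [1] * 50)

# 75% train, 25% test, stratified on the class labels.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=4, stratify=y
)

# Learn the mean and standard deviation from the training set only,
# then apply the same transformation to both sets.
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

print(X_train_std.shape, X_test_std.shape)
print(y_train.mean(), y_test.mean())
```

With this convention the training features have mean zero and unit variance exactly, while the test features are only approximately standardised.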
Problem 4. Support Vector Classifier and Accuracies
Q1.
Train a Support Vector Classifier on the standardised data
– Use the rbf kernel and set random_state to 3 (don’t change any other parameters)
Compute and print training and test dataset accuracies
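A minimal sketch of fitting an RBF-kernel SVC and reporting both accuracies. The data come from make_classification rather than the credit card dataset, so the printed numbers say nothing about the assignment results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic classification data standing in for the prepared X and y.
X, y = make_classification(n_samples=400, n_features=6, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=4, stratify=y
)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

# RBF kernel with random_state=3; every other parameter keeps its default.
svm = SVC(kernel="rbf", random_state=3)
svm.fit(X_train_std, y_train)

# score() returns the mean accuracy on the given data.
train_acc = svm.score(X_train_std, y_train)
test_acc = svm.score(X_test_std, y_test)
print(f"Training accuracy: {train_acc:.3f}")
print(f"Test accuracy: {test_acc:.3f}")
```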
Q2.
Extract 2 linear principal components from the standardised features using an appropriate sklearn library
Train a Support Vector Classifier on the 2 principal components computed above
– Use the rbf kernel and set random_state to 3 (don’t change any other parameters)
Compute and print training and test dataset accuracies
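The same workflow with a PCA step in front of the classifier might look like this sketch, again on synthetic data. Fitting the PCA on the training features only and then projecting the test features is an assumed convention, mirroring how the scaler is usually handled.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic classification data standing in for the prepared X and y.
X, y = make_classification(n_samples=400, n_features=6, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=4, stratify=y
)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

# Extract 2 principal components from the standardised features.
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)

# Same classifier settings as before, now trained on 2 components.
svm_pca = SVC(kernel="rbf", random_state=3)
svm_pca.fit(X_train_pca, y_train)

train_acc_pca = svm_pca.score(X_train_pca, y_train)
test_acc_pca = svm_pca.score(X_test_pca, y_test)
print(f"Training accuracy (2 PCs): {train_acc_pca:.3f}")
print(f"Test accuracy (2 PCs): {test_acc_pca:.3f}")
```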
Q3.
Assess the suitability of the two classifiers for predicting credit card defaults by commenting on (and comparing) the accuracies computed in the last two questions.
Make comparisons both within and across the two questions