Category: R studio

R studio

Capstone Project guidelines released    29th April 2020
Project summaries released    1st May 2020
Selection of projects by the learner    3rd May 2020
Projects Data Release date    4th May 2020

Project Notes I submission deadline    17th May 2020
First milestone session    23rd/24th May 2020
Project Notes II submission deadline    1st June 2020
Second milestone session    6th/7th June 2020
Project Notes III submission deadline    22nd June 2020
Third milestone session    27th/28th June 2020

Final Report Submission    6th July 2020
Final Presentation Submission    9th July 2020
Final Capstone Presentation    11th July 2020


Applying classification algorithms to a data set derived from Twitter text data

I need the R code, the interpretation of the results and variables, and the reasoning for choosing particular parameters for four classifiers: Naive Bayes, J48, logistic regression, and SVM. I'm having issues with some of the functions, and getting help here would let me dedicate time to more important areas of my paper.
Data set: the first 24 features are predictors; the last column is the target variable (0 = true news, 1 = fake news).
If it matters, I would prefer the caret package.
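A minimal caret sketch for the four requested models, not a submitted solution: it assumes a data frame `news` (simulated noise below as a stand-in) whose first 24 columns are predictors and whose last column, `label`, is a two-level factor. The caret method codes require extra packages: "nb" needs klaR, "J48" needs RWeka (and Java), "svmRadial" needs kernlab.

```r
library(caret)

#Hypothetical stand-in data; replace with the real Twitter-derived data set
set.seed(1)
news <- data.frame(matrix(rnorm(500 * 24), ncol = 24))
news$label <- factor(sample(c("true", "fake"), 500, replace = TRUE))

#70/30 stratified split on the target
set.seed(123)
idx      <- createDataPartition(news$label, p = 0.7, list = FALSE)
train.df <- news[idx, ]
test.df  <- news[-idx, ]

#10-fold cross-validation for tuning-parameter selection
ctrl <- trainControl(method = "cv", number = 10)

fit.nb  <- train(label ~ ., data = train.df, method = "nb",  trControl = ctrl)
fit.j48 <- train(label ~ ., data = train.df, method = "J48", trControl = ctrl)
fit.lr  <- train(label ~ ., data = train.df, method = "glm",
                 family = binomial, trControl = ctrl)
fit.svm <- train(label ~ ., data = train.df, method = "svmRadial",
                 trControl = ctrl, preProcess = c("center", "scale"))

#Evaluate any fitted model on the held-out data
confusionMatrix(predict(fit.svm, test.df), test.df$label)
```

Centering and scaling matter for the SVM (distance-based), and cross-validation is what caret uses to pick parameters such as the SVM cost `C` and the Laplace smoothing for Naive Bayes.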

Hypothesis Testing

#One-sided confidence intervals

#For p
library(regclass)
data("CUSTREACQUIRE")
summary(CUSTREACQUIRE)

summary(CUSTREACQUIRE$Reacquire)
mean( CUSTREACQUIRE$Reacquire == "Yes" )

#The old reacquisition policy won back 60% of churned customers. The new one is cheaper, but may not be as effective.
#Ho: p = 0.6 vs. HA: p < 0.6
binom.test(295, 500, alternative = "less", p = 0.6)
# 95 percent confidence interval:
#  0.0000000 0.6267122
#60% is still a plausible value for p (it lies inside the interval), so retain Ho

#Is the average Lifetime2 value larger than the Lifetime1 value?
#Ho: mu2 = mu1 vs. HA: mu2 > mu1
#Equivalently, Ho: mu2 - mu1 = 0 vs. HA: mu2 - mu1 > 0
SUB <- subset(CUSTREACQUIRE, Reacquire == "Yes")
summary(SUB)
t.test(SUB$Lifetime2, SUB$Lifetime1, paired = TRUE, alternative = "greater")
# 95 percent confidence interval:
#  110.8438 Inf
#The interval excludes 0, so reject Ho: Lifetime2 is larger on average

#Median Age < 53?
median(CUSTREACQUIRE$Age)
# < alternative wants (-Inf, quantile(,.95) )
# > alternative wants ( quantile(,.05), Inf )
#Ho: median = 53
#HA: median < 53

#Bootstrap the sampling distribution of the median
boot.medians <- c()
for (i in 1:4999) {
  boot.sample <- sample( CUSTREACQUIRE$Age, replace = TRUE )
  boot.medians[i] <- median(boot.sample)
}
hist(boot.medians)

#One-sided 95% bootstrap confidence interval matching HA: median < 53
c( -Inf, quantile(boot.medians, .95) )
#If 53 lies inside this interval, retain Ho; otherwise reject in favor of median < 53
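A self-contained illustration of the same bootstrap recipe on simulated ages (an assumption for reproducibility; swap in CUSTREACQUIRE$Age for the real analysis):

```r
set.seed(2020)
ages <- round(rnorm(500, mean = 50, sd = 10))   #hypothetical customer ages

#4999 bootstrap resamples; record the median of each
boot.medians <- replicate(4999, median(sample(ages, replace = TRUE)))

#One-sided 95% CI matching HA: median < 53 -> ( -Inf, 95th percentile )
ci <- c(-Inf, quantile(boot.medians, .95))
ci

#Retain Ho if 53 is at or below the upper bound, i.e. 53 is still plausible
53 <= ci[2]
```

The direction of the interval follows the rule in the notes: a "<" alternative pairs with ( -Inf, 95th percentile ), a ">" alternative with ( 5th percentile, Inf ).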

#Clustering

set.seed(471); DATA <- data.frame(x=runif(100,30,40),y=runif(100,70,80))
set.seed(472);DATA[1:7,] <- data.frame(x=runif(7,31.1,32.9),y=runif(7,71.1,72.9))
set.seed(473);DATA[21:28,] <- data.frame(x=runif(8,35.1,36.8),y=runif(8,76.0,77.8))
plot(DATA,pch=20,cex=2)

#Is there clustering? Or is there not?

#Step 1
#Ho: distribution of points IS random (=)
#HA: distribution of points IS NOT random (not =)

#Step 2
#Test statistic: the average distance to each point's 5th nearest neighbor
DISTANCE <- as.matrix( dist(DATA) )
#Distance to the 5th nearest neighbor for each point (index 6 because the first
#entry of each sorted column is the point's distance to itself, which is 0)
apply(DISTANCE, 2, function(x) sort(x)[6])
#Average 5th nearest neighbor distance: the observed test statistic (about 1.29358)
mean( apply(DISTANCE, 2, function(x) sort(x)[6]) )

#What is the distribution “under the null” of the test statistic
#i.e. what is the distribution of the average 5th nearest neighbor distance when points
#are randomly placed on this square?
#Simulate this with a Monte Carlo simulation
null.stat <- c()
for (i in 1:500) {
  #Place 100 points in the same square uniformly at random
  #(use a new name, RANDOM, so the original DATA is not overwritten)
  RANDOM <- data.frame(x = runif(100, 30, 40), y = runif(100, 70, 80))
  null.stat[i] <- mean( apply(as.matrix(dist(RANDOM)), 2, function(x) sort(x)[6]) )
}
hist(null.stat)

#Step 3: get pvalue

mean( null.stat <= 1.29358 )
#p-value is about 0.106 (1.29358 is the observed statistic computed in Step 2)

#Interpretation of this p-value
#The probability that we would measure an average 5th nearest neighbor distance of 1.29358
#or something smaller (i.e. more evidence from the alternative) when the distribution of points
#is purely random is 10.6%.

#There’s a 10.6% chance of observing our data or data that is even more evidence for the
#alternative (i.e. that there’s clustering) when the points are actually being placed
#at random (i.e. the null is true).

#Decision rule: if p-value < 5%, reject Ho; if 5% or larger, retain Ho
#0.106 >= 0.05, so WE RETAIN Ho: no significant evidence of clustering
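The procedure above can be wrapped into a reusable function (a sketch: `k`, the number of simulations `B`, and the square's bounds become parameters instead of hard-coded values):

```r
#Average distance to each point's k-th nearest neighbor
avg.knn.dist <- function(pts, k = 5) {
  D <- as.matrix(dist(pts))
  #In each sorted column, position 1 is the point itself (distance 0),
  #so the k-th nearest neighbor sits at position k + 1
  mean( apply(D, 2, function(x) sort(x)[k + 1]) )
}

#Monte Carlo test: one-sided p-value, small distances = evidence of clustering
clustering.test <- function(pts, xlim, ylim, k = 5, B = 500) {
  obs <- avg.knn.dist(pts, k)
  null.stat <- replicate(B, avg.knn.dist(
    data.frame(x = runif(nrow(pts), xlim[1], xlim[2]),
               y = runif(nrow(pts), ylim[1], ylim[2])), k))
  mean(null.stat <= obs)
}

#Usage on the data built above (run after the original DATA is created):
#set.seed(474); clustering.test(DATA, xlim = c(30, 40), ylim = c(70, 80))
```

Parameterizing the test also makes it easy to check sensitivity to the choice of k (3rd vs. 5th vs. 10th nearest neighbor).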

R studio

Description
Thera Bank – Loan Purchase Modeling
This case is about a bank (Thera Bank) with a growing customer base. The majority of these customers are liability customers (depositors) with deposits of varying size, while the number of customers who are also borrowers (asset customers) is quite small. The bank is interested in expanding this base rapidly to bring in more loan business and, in the process, earn more through interest on loans. In particular, management wants to explore ways of converting liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio with a minimal budget. The department wants to build a model that identifies the potential customers with a higher probability of purchasing the loan, increasing the success ratio while reducing the cost of the campaign.

The dataset covers 5000 customers and includes customer demographic information (age, income, etc.), the customer's relationship with the bank (mortgage, securities account, etc.), and the customer's response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (9.6%) accepted the personal loan offered in the earlier campaign.
Link to the case file:
Thera Bank_Personal_Loan_Modelling-dataset-1.xlsx
You are brought in as a consultant and your job is to build the best model which can classify the right customers who have a higher probability of purchasing the loan. You are expected to do the following:
    EDA of the data available. Showcase the results using appropriate graphs – (10 Marks)
    Apply appropriate clustering on the data and interpret the output(Thera Bank wants to understand what kind of customers exist in their database and hence we need to do customer segmentation) – (10 Marks)
    Build appropriate models on both the test and train data (CART & Random Forest). Interpret all the model outputs and do the necessary modifications wherever eligible (such as pruning) – (20 Marks)
    Check the performance of all the models that you have built (test and train). Use all the model performance measures you have learned so far. Share your remarks on which model performs the best. – (20 Marks)
Hint:
library(caTools)  #sample.split comes from the caTools package
split <- sample.split(Thera_Bank$`Personal Loan`, SplitRatio = 0.7)
#we are splitting the data so that 70% is the train data and 30% is the test data

train <- subset(Thera_Bank, split == TRUE)
test  <- subset(Thera_Bank, split == FALSE)

Note the backticks around `Personal Loan`: they are required because the column name contains a space.
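If you prefer not to depend on caTools, a base-R split works too. This sketch uses a hypothetical stand-in frame with the same target column name; unlike sample.split, plain sample() does not stratify on the target, so the 9.6% positive rate is only matched in expectation:

```r
#Hypothetical stand-in for the real Thera Bank data (5000 rows, ~9.6% positives)
set.seed(42)
Thera_Bank <- data.frame(`Personal Loan` = rbinom(5000, 1, 0.096),
                         Income = rnorm(5000, 70, 30),
                         check.names = FALSE)

#Random 70/30 row split
idx   <- sample(nrow(Thera_Bank), size = 0.7 * nrow(Thera_Bank))
train <- Thera_Bank[idx, ]
test  <- Thera_Bank[-idx, ]

nrow(train)   # 3500
nrow(test)    # 1500
```

For a rare target like this one, a stratified split (sample.split or caret's createDataPartition) is usually the safer choice.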

Please note the following:
1.    There are two parts to the submission:
      1.    The output/report in any file format - the key part of the output is the set of observations and insights from the exploration and analysis
      2.    Commented R code in .R or .Rmd
2.    Please don't share only your R code and/or outputs; we expect some verbiage/story too - a meaningful output that you can share in a business environment
3.    Any assignment found copied/plagiarized from other groups will not be graded and will be awarded zero marks
4.    Please ensure timely submission, as post-deadline assignments will not be accepted
Thanks


Scoring guide (Rubric) – Project 4
Criteria    Points
1. EDA – basic data summary, univariate and bivariate analysis, graphs    10
2.1 Apply clustering algorithm <type, rationale>    5
2.2 Clustering output interpretation <dendrogram, number of clusters, remarks to make it meaningful to understand>    5
3.1 Applying CART <plot the tree>    5
3.2 Interpret the CART model output <pruning, remarks on pruning, plot the pruned tree>    5
3.3 Applying Random Forests <plot the tree>    5
3.4 Interpret the RF model output <with remarks, making it meaningful for everybody>    5
4.1 Confusion matrix interpretation    5
4.2 Interpretation of other model performance measures <KS, AUC, GINI>    10
4.3 Remarks on model validation exercise <which model performed the best>    5
Total Points    60