StatInference_project/final_draft.Rmd at master · srive326/StatInference_project · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
---
title: "Statistical Inference Project"
Name: Xiaotai Chai, Lilianne Raud, Stephanie Rivera
output: html_document
---
##Introduction

In this project, we're using sales data from Safeway Eastern's floral department. This data only shows the flowers/bouquets sold at Safeway Eastern supermarkets that were bought from Intergreen USA, a flower importing company which supplies different supermarkets across the US. There are a total of 9 distinct products sold and the sales occurred between April and October 2017.

##Literature Review/ Background and Methods
The Florist industry in the US retails cut flowers, floral arrangements and potted plants. Industry operators purchase these goods from domestic and international flower farms and wholesalers and sell them to the general public. This industry excludes operators that primarily function as electronic shopping websites and mail-order companies. Discounted prices for comparable goods online and in supermarkets have led consumers to buy fewer flowers from industry operators. So the supermarkets and retail florist industry is actually growing.


##Research Question(s)

We're interested in seeing the relationships between several predictor variables and sales to determine if there's a particular independent variable that drives or hurts flower sales.
One of these variables is the closing price of the S&P 500 for a given day. We chose this variable because the S&P 500 includes stocks from large companies that sell consumer staples like Johnson and Johnson, Heinz, Colgate, and Coca-Cola. Many of their products are sold in supermarkets, so our reasoning is that the S&P will reflect the movement in supermarkets. Similarly, since the flowers in the data are being sold in supermarkets this could be a good predictor for flower sales.
Oil price is another variable that we expect to have a negative relationship with flower sales.
Chocolate selling
Weather, average temperature, wind speed, precipitation


##Data Collection & Experimental Design

Data was provided by Safeway Eastern


##Data Analysis

```{r message=FALSE, warning=FALSE, paged.print=FALSE}
library('plyr')
library('bit64')
library('dplyr')
library('reshape2')
library('ggplot2')
library('data.table')
library('leaps')
library('minpack.lm')
library('xts')
library('lmtest')
library('onewaytests')
library('car')

project_dir <- getwd()
setwd(project_dir)
```

The code below imports the flower sales data and combines it with the S&P data.

```{r}
#Sidenote: I saved excel file as a txt before reading it as a table
flower_data <- read.table("flower_data.txt", header = TRUE)
flower_data_apr <- fread("april_june.csv", header = TRUE)


#line below ensures column is date and ordered
flower_data$Txn_Date <- as.Date(flower_data$Txn_Date, format = "%m/%d/%Y")
flower_data_apr$Day <- as.Date(flower_data_apr$Day, format = "%m/%d/%Y")


#Since there are multiple sales on a given day, I want to aggregate the sales based on the date
date_net <- aggregate(Net ~ Txn_Date, flower_data,sum)
colnames(flower_data_apr)[6] <- "Net"
date2_net <- aggregate(Net ~ Day, flower_data_apr,sum)

#renaming columns to make joining easier later
#colnames(date2_net)[2] <- "Net"
colnames(date2_net)[1] <- "Date"

#read in financial data
stock_info <- fread("stockdata.csv")

stock_info2 <- fread("april_june_finance.csv")
stock_info2$Date <- as.Date(stock_info2$Date, format = "%m/%d/%Y")
#change column name so i can do a join
colnames(date_net)[1] <- "Date"
combodata <- join(date_net,stock_info, type="right")
combodata2 <- join(date2_net, stock_info2, type = "right")
combodata2$Date <- as.character(combodata2$Date)
#combo3 <- merge(x = date2_net, y = stock_info2, by = "Date", all= TRUE)

#bind 2 data frames (data from april-june and sep-oct)
results <- rbind(combodata2, combodata)


```


Next, we looked at plots of flower sales against the S&P open, close, high, adjusted close, low, and volume individually to see which plots looked linear and could make it to the next stage of the data analysis process.

```{r}
par(mfrow = c(3,2))
plot(results$Open, results$Net)
plot(results$Close, results$Net)
plot(results$High, results$Net)
plot(results$`Adj Close`, results$Net)
plot(results$Low, results$Net)
plot(results$Volume, results$Net)
```

The relationship between net sales and adjusted close and net sales and open looks linear based on the scatter-plot. The next step is to run diagnostics to check if the assumptions of linearity hold.


```{r}
l.model_close <- lm(results$Net ~ results$`Adj Close`)
summary(l.model_close)
anova(l.model_close)
#plots
plot(l.model_close)
```

One of the assumptions of linearity is that the error terms are independent and not correlated. The plot of the residuals shows that there isn't a significant pattern between the residuals and fitted values but they aren't very spread and random either. There are also a few outliers.

The normal q-q plot is used to see if the error terms are normally distributed, if the points lie along the dotted line this is an indicator that the error terms are normally distributed. This q-q plot looks light tailed and there are outliers at the right end which are concerning although most points do lie on the dotted line between -1 and 1.

The scale location checks for homoscedasticity, the assumption of equal variance. In this case the points look spread out.

The residual vs leverage plot helps to find influential cases, values in the top right corner would be
an influential outlier, we don't seem to have any in this case.

Since the scatter plot of net sales and the S&P close seems to curve downward near the right of the graph, and because the q-q plot shows that the error terms at the right end are farther from the dotted line, it may be a good idea to perform a transformation on the predictor variable to improve the linear fit of the data.

Next we combine Net sales with some weather data.

```{r}
net_weather <- fread("Enhanced_weather_movingavg.csv")

par(mfrow = c(2,2))
plot(net_weather$Net,net_weather$avg_temp)
plot(net_weather$Net,net_weather$avg_pressure)
plot(net_weather$Net,net_weather$avg_wind)
plot(net_weather$Net,net_weather$sum_precip)

```

The plots do not seem to show any linear relationship, but we will look if combining some variables perhaps will give us some predictive relationships.


```{r}
attach(net_weather)

allpossreg <- regsubsets(Net ~ avg_temp + avg_pressure + avg_wind +
    sum_precip, nbest = 6, data = net_weather)

aprout <- summary(allpossreg)

with(aprout, round(cbind(which, rsq, adjr2, cp, bic), 3))

#I will be adding few more plots

```

It is clear from the results that the strongest association is between net sales and average wind, but the score is so low, that it can all be attributed to noise.

```{r}
attach(net_weather)

par(mfrow = c(1,2))
plot(net_weather$Net,net_weather$avg7_day)
plot(net_weather$Net,net_weather$avg30_day)

allpossreg1 <- regsubsets(Net ~ avg7_day + avg30_day
, nbest = 6, data = net_weather)

aprout1 <- summary(allpossreg1)

with(aprout1, round(cbind(which, rsq, adjr2, cp, bic), 3))

#I will be adding few more plots

```

we are looking for the potential relationship between oil price and flower sales. Since transportation cost would affect flower price, which will influnce flower sales, we expect there's a negative relationship between oil price and flower sales. We use the Crude oil price data, since crude oil is refined to produce gasoline.
```{r}
#read in oil price data
oil_info <- fread("oil_price.csv")
oil_info$Date <- as.Date(oil_info$Date, format = "%m/%d/%Y")
oil_info <- as.xts(oil_info)
d <- merge(oil_info, xts(,seq(start(oil_info),end(oil_info),"days")))
oil_info <- na.locf(d)
oil_info <- data.frame(Date = index(oil_info),coredata(oil_info))

#change column name so i can do a join
date3_net <- merge(date_net, date2_net, all = TRUE)
oil_combo <- join(date3_net,oil_info, type="right")
oil_combo$Date <- as.character(oil_combo$Date)

```

Let's look at the plot of flower sales against oil price.
```{r}
plot(oil_combo$`Price.Close`,oil_combo$Net)
```
There seems to be a negative relationship between flower sales against oil price if ignore the weeks when oil price is $46.22 (the week from 5/5 to 5/11). We can only get weekly oil price so we estimate oil prices are the same the whole week. That's why the price here seems like a categorical predictor. The week of 5/5 doesn't follow the estimation that lower oil price ha higher flower sales, which could be due to other factoers such as weather and season. Now let's see if there's a linear relationshp between oil price and flower sales.


```{r}
# linear model
fit_oil1 <- lm(Net ~ `Price.Close`, data = oil_combo)
summary(fit_oil1)
anova(fit_oil1)

#plots
plot(fit_oil1)
```
For the linear model, the price is significant. However, only 7.83% of the variance can be explained by the model, which is relatively low. we would like to find a model that fits better to the data. Since the outline of the data looks like it's a polynomial, let's try to fit the polynomial model.
```{r}
fit_oil2 <- lm(Net ~ I(`Price.Close`)+ I(`Price.Close`^2), data = oil_combo)
summary(fit_oil2)
anova(fit_oil2)

#plots
plot(fit_oil2)
```
It seems the polynomial model fits the data better than the linear model since the adjusted r-squared is higher than the linear model's. Also, all coefficients are siginificant. From the QQ plot, the residuals have a slight positively skewed distribution, however, the distribution is not strongly skewed so we don't need to make a transformation of the data. Compare the linear regression model and the polynomial regression model, we found that the polynomial regression model has a higher adjusted R-squared value and has a lower p-value.


We are also interensted in the the relationship between chocolate industry performance and flower sales since chocolate and flowers can be complementary goods. We expect a positive association between chocolate sales and flower sales. The Hershey Company is the largest chocolate manufacturer in North America and has 21.2% market share in the chocolate industry, so we are using Hershey's data. Becuase we are not able to get Hershey's daily sales data, we are comparing Hershey's stock price to the flower sales.
```{r}
#read in hershey data
hershey_info <- fread("hershey.csv")
hershey_info$Date <- as.Date(hershey_info$Date, format = "%m/%d/%Y")

#change column name so i can do a join
hershey_combo <- join(date3_net,hershey_info, type="left")
hershey_combo$Date <- as.character(hershey_combo$Date)
```

Let's look at the plot of flower sales against Hershey stock price.
```{r}
par(mfrow = c(3,2))
plot(hershey_combo$open, hershey_combo$Net)
plot(hershey_combo$close, hershey_combo$Net)
plot(hershey_combo$high, hershey_combo$Net)
plot(hershey_combo$low, hershey_combo$Net)
plot(hershey_combo$volume, hershey_combo$Net)
```

Let's build the linear model
```{r}
fit_hershey <- lm(hershey_combo$Net ~ hershey_combo$open)
summary(fit_hershey)
anova(fit_hershey)

#plots
plot(fit_hershey)
```
From both the plots and the model, there isn't a strong evidence showing that there's a linear relationship between Hershey's stock price and the flower sales. However, the plot looks there's a slight polynomial relationship between stock price and flower sales. So let's try polynomial model.

```{r}
fit_hershey2 <- lm(Net ~ I(open) + I(open^2), data =hershey_combo)
summary(fit_hershey2)
anova(fit_hershey2)

#plots
plot(fit_hershey2)
```
It seems the polynomial model fits much better than the linear model. all coefficients are significant and 17.97% of the variance can be explained by the model.


```{r}
all_data <- fread("all_data.csv", header = TRUE)
all_data$Date <- as.Date(all_data$Date, format = "%m/%d/%Y")

fullmodel <- lm(Net ~ Close +Date+ Open + avg7_day+ avg30_day+avg_temp + sum_precip + Oil + Hershey, data = all_data)
summary(fullmodel)

str(all_data)

anova(fullmodel)


```


```{r}
stepAIC(fullmodel, direction = "backward")
```

```{r}
newmodel <- lm(Net ~ Close + Oil, data = all_data)
summary(newmodel)
```

```{r}
plot(newmodel)
```

Now we divide the whole dataset into two parts - training_set and test_set. The training set is from 4/5 to 6/6 and the test set is from 9/20 to 10/3.
```{r}
training_set <- all_data[1:43,]

test_set <- all_data[44:53,c(2,6, 15)]


training_model <- lm(Net ~ Close + Oil, data = training_set)
summary(training_model)
anova(training_model)

predictions <- predict(training_model, test_set[,c(2,3)])

delta <- test_set[,1] - predictions

(msep <- sum((delta)^2)/length(test_set[,1]))

```
The training set doesn't predict the value for the test set well. This may be because of the seasonality of the flower industry.


```{r}


pairs(all_data[,c(6,15)])
#test for multicolinearity between predictors
vif(newmodel)
#test for normality
shapiro.test(residuals(newmodel))

#test for heteroskedacity
df <- data.frame(x = all_data$Oil, y = all_data$Net, Oil = ifelse(all_data$Oil <= median(all_data$Oil), "small", "large"))

bf.test(y ~ Oil, data = df)

df2 <- data.frame(x = all_data$Close, y = all_data$Net, Close = ifelse(all_data$Close <= median(all_data$Close), "small", "large"))

bf.test(y ~ Close, data = df2)

#check for correlated errors
plot(1:length(all_data$Date), residuals(newmodel))
durbinWatsonTest(newmodel)

```
There is moderate evidence of multicolinearity between the closing price of the S&P 500 and the price of crude oil.

Shapiro-Wilk test for normality rejects null hypothesis that errors are normally distributed.

This lead us to use the Brown-Forsythe test to test for heteroskedacity, which works well when the normality assumption is violated. The p-value for the price of crude oil was not statistically significant so we conclude that the errors have constant variance.
The p-value for the S&P Close was statistically significant so the variance of the errors are not equal in this case.

The time series plot of the error terms show a pattern and suggests they may be correlated. The Durbin-Watson Test supports this conclusion with a p-value of 0.004 we can reject the null hypothesis that the error terms are not correlated.


##Conclusions

Current data-set is sparse and some of the association that may exist, are not strongly apparent because of it. Currently we are in the process or trying to acquire more data. with couple of years worth of data, we could investigate if there are any associations with holidays or other seasonal trends. We would also be able to apply time series based methods, once we learn to apply them. After running informal and formal diagnostics on our linear model, we conclude that a linear model may not be the best for this data. Our tests showed evidence of multicollinearity in the predictor variables, a violation of the normality assumption and constant variance, and the errors are correlated. We'd like to revisit this data set in the future when we have more observations and have learned time-series modeling.


## Works Cited

Weather Data source https://www.wunderground.com/history/airport/KDCA

S&P 500 data source  https://finance.yahoo.com/quote/%5EGSPC/history?period1=1480482000&period2=1512018000&interval=1d&filter=history&frequency=1d

Oil price source http://www.fedprimerate.com/crude-oil-price-history.htm

Hershey stock price data http://www.nasdaq.com/symbol/hsy/historical

Florists in the US http://clients1.ibisworld.com.baypath.idm.oclc.org/reports/us/industry/default.aspx?entid=1096