diff --git a/New.qmd b/New.qmd
new file mode 100644
index 000000000..6d735a15d
--- /dev/null
+++ b/New.qmd
@@ -0,0 +1,26 @@
+---
+title: "About"
+format: html
+editor: visual
+---
+
+## Quarto
+
+Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto see <https://quarto.org>.
+
+## Running Code
+
+When you click the **Render** button a document will be generated that includes both content and the output of embedded code. You can embed code like this:
+
+```{r}
+1 + 1
+```
+
+You can add options to executable code like this
+
+```{r}
+#| echo: false
+2 * 2
+```
+
+The `echo: false` option disables the printing of code (only output is displayed).
diff --git a/STA6257_Project.Rproj b/STA6257_Project.Rproj
index 8e3c2ebc9..827cca17d 100644
--- a/STA6257_Project.Rproj
+++ b/STA6257_Project.Rproj
@@ -11,3 +11,5 @@
 Encoding: UTF-8
 RnwWeave: Sweave
 LaTeX: pdfLaTeX
+
+BuildType: Website
diff --git a/_testsite.yml b/_testsite.yml
new file mode 100644
index 000000000..61f0ce54e
--- /dev/null
+++ b/_testsite.yml
@@ -0,0 +1,7 @@
+website:
+  navbar:
+    background: primary
+    search: true
+    left:
+      - text: "Home"
+        file: index.qmd
diff --git a/index.html b/index.html
index 9e8c19192..6ec3586fd 100644
--- a/index.html
+++ b/index.html
@@ -2,14 +2,14 @@
-
+
-
-
+
+
-Sample Report - Data Science Capstone
+Bayesian Linear Regression
+
-
-
-
@@ -3156,7 +3025,7 @@
-

Sample Report - Data Science Capstone

+

Bayesian Linear Regression

@@ -3166,14 +3035,14 @@

Sample Report - Data Science Capstone

Author
-

Student name

+

Kayla Liana Mota

Published
-

January 9, 2024

+

April 15, 2024

@@ -3183,28 +3052,335 @@

Sample Report - Data Science Capstone

-
-

Introduction

-
-

What is “method”?

-

This is an introduction to kernel regression, a non-parametric estimator of the conditional expectation of one random variable given another. The goal of kernel regression is to discover the non-linear relationship between two random variables; kernel estimation, or kernel smoothing, is the main non-parametric method for estimating that curve. In a kernel estimator, the weight function is known as the kernel function (Efromovich 2008). See also principal component analysis (Bro and Smilde 2014) and generalized estimating equations (Wang 2014).

-

This is my work and I want to add more work…

+

Packages & Library

+
+
+Code +
# Install and load required packages
+#install.packages("readxl")  # Install if not already installed
+#install.packages("corrplot")  # Install if not already installed
+library(readxl)
+library(corrplot)
+
+# install.packages("rstanarm") 
+library(rstan)
+library(rstanarm)
+library(ggplot2)
+library(bayesplot)
+
+# this option uses multiple cores if they're available
+options(mc.cores = parallel::detectCores())
+
+
+

Data Input

+
+
+Code +
library(readxl)
+ledata <- read_excel("C:/Users/kayla/Downloads/Life-Expectancy-Data-Updated.xlsx", col_names = FALSE)
+names(ledata) <- c("Country",   "Region",   "Year", "Infant_deaths",    "Under_five_deaths",    "Adult_mortality",  "Alcohol_consumption",  "Hepatitis_B",  "Measles",  "BMI",  "Polio",    "Diphtheria",   "Incidents_HIV",    "GDP_per_capita",   "Population_mln",   "Thinness_ten_nineteen_years",  "Thinness_five_nine_years", "Schooling",    "Economy_status_Developed", "Economy_status_Developing", "Life_expectancy")
+
+revledata <- na.omit(ledata)
+
+var <- c("Infant_deaths",   "Under_five_deaths",    "Adult_mortality",  "Alcohol_consumption",  "Hepatitis_B",  "Measles",  "BMI",  "Polio",    "Diphtheria",   "Incidents_HIV",    "GDP_per_capita",   "Population_mln",   "Thinness_ten_nineteen_years",  "Thinness_five_nine_years", "Schooling",    "Economy_status_Developed", "Economy_status_Developing", "Life_expectancy")
+
+ndata1 <- revledata[, var]
+
+cdata1 <- na.omit(ndata1)
+print(cdata1)
+
+
+
# A tibble: 2,864 × 18
+   Infant_deaths Under_five_deaths Adult_mortality Alcohol_consumption
+           <dbl>             <dbl>           <dbl>               <dbl>
+ 1          11.1              13             106.                 1.32
+ 2           2.7               3.3            57.9               10.4 
+ 3          51.5              67.9           201.                 1.57
+ 4          32.8              40.5           222.                 5.68
+ 5           3.4               4.3            58.0                2.89
+ 6           9.8              11.2            95.2                4.19
+ 7           6.6               8.2           223                  8.06
+ 8           8.7              10.1           193.                12.2 
+ 9          22                26.1           130.                 0.52
+10          15.3              17.8           218.                 7.72
+# ℹ 2,854 more rows
+# ℹ 14 more variables: Hepatitis_B <dbl>, Measles <dbl>, BMI <dbl>,
+#   Polio <dbl>, Diphtheria <dbl>, Incidents_HIV <dbl>, GDP_per_capita <dbl>,
+#   Population_mln <dbl>, Thinness_ten_nineteen_years <dbl>,
+#   Thinness_five_nine_years <dbl>, Schooling <dbl>,
+#   Economy_status_Developed <dbl>, Economy_status_Developing <dbl>,
+#   Life_expectancy <dbl>
+
+
+Code +
summary(cdata1)
+
+
+
 Infant_deaths    Under_five_deaths Adult_mortality  Alcohol_consumption
+ Min.   :  1.80   Min.   :  2.300   Min.   : 49.38   Min.   : 0.000     
+ 1st Qu.:  8.10   1st Qu.:  9.675   1st Qu.:106.91   1st Qu.: 1.200     
+ Median : 19.60   Median : 23.100   Median :163.84   Median : 4.020     
+ Mean   : 30.36   Mean   : 42.938   Mean   :192.25   Mean   : 4.821     
+ 3rd Qu.: 47.35   3rd Qu.: 66.000   3rd Qu.:246.79   3rd Qu.: 7.777     
+ Max.   :138.10   Max.   :224.900   Max.   :719.36   Max.   :17.870     
+  Hepatitis_B       Measles           BMI            Polio        Diphtheria   
+ Min.   :12.00   Min.   :10.00   Min.   :19.80   Min.   : 8.0   Min.   :16.00  
+ 1st Qu.:78.00   1st Qu.:64.00   1st Qu.:23.20   1st Qu.:81.0   1st Qu.:81.00  
+ Median :89.00   Median :83.00   Median :25.50   Median :93.0   Median :93.00  
+ Mean   :84.29   Mean   :77.34   Mean   :25.03   Mean   :86.5   Mean   :86.27  
+ 3rd Qu.:96.00   3rd Qu.:93.00   3rd Qu.:26.40   3rd Qu.:97.0   3rd Qu.:97.00  
+ Max.   :99.00   Max.   :99.00   Max.   :32.10   Max.   :99.0   Max.   :99.00  
+ Incidents_HIV     GDP_per_capita   Population_mln    
+ Min.   : 0.0100   Min.   :   148   Min.   :   0.080  
+ 1st Qu.: 0.0800   1st Qu.:  1416   1st Qu.:   2.098  
+ Median : 0.1500   Median :  4217   Median :   7.850  
+ Mean   : 0.8943   Mean   : 11541   Mean   :  36.676  
+ 3rd Qu.: 0.4600   3rd Qu.: 12557   3rd Qu.:  23.688  
+ Max.   :21.6800   Max.   :112418   Max.   :1379.860  
+ Thinness_ten_nineteen_years Thinness_five_nine_years   Schooling     
+ Min.   : 0.100              Min.   : 0.1             Min.   : 1.100  
+ 1st Qu.: 1.600              1st Qu.: 1.6             1st Qu.: 5.100  
+ Median : 3.300              Median : 3.4             Median : 7.800  
+ Mean   : 4.866              Mean   : 4.9             Mean   : 7.632  
+ 3rd Qu.: 7.200              3rd Qu.: 7.3             3rd Qu.:10.300  
+ Max.   :27.700              Max.   :28.6             Max.   :14.100  
+ Economy_status_Developed Economy_status_Developing Life_expectancy
+ Min.   :0.0000           Min.   :0.0000            Min.   :39.40  
+ 1st Qu.:0.0000           1st Qu.:1.0000            1st Qu.:62.70  
+ Median :0.0000           Median :1.0000            Median :71.40  
+ Mean   :0.2067           Mean   :0.7933            Mean   :68.86  
+ 3rd Qu.:0.0000           3rd Qu.:1.0000            3rd Qu.:75.40  
+ Max.   :1.0000           Max.   :1.0000            Max.   :83.80  
+
+
+
+
+Code +
qplot(Life_expectancy, Schooling, data = cdata1)
+
+
+

+
+
+
+
+Code +
glm_post1 <- stan_glm(Life_expectancy~Schooling, data=cdata1, family=gaussian)
+
+stan_trace(glm_post1, pars=c("(Intercept)","Schooling","sigma"))
+
+
+

+
+
+Code +
summary(glm_post1)
+
+
+

+Model Info:
+ function:     stan_glm
+ family:       gaussian [identity]
+ formula:      Life_expectancy ~ Schooling
+ algorithm:    sampling
+ sample:       4000 (posterior sample size)
+ priors:       see help('prior_summary')
+ observations: 2864
+ predictors:   2
+
+Estimates:
+              mean   sd   10%   50%   90%
+(Intercept) 52.3    0.3 51.9  52.3  52.7 
+Schooling    2.2    0.0  2.1   2.2   2.2 
+sigma        6.4    0.1  6.3   6.4   6.5 
+
+Fit Diagnostics:
+           mean   sd   10%   50%   90%
+mean_PPD 68.9    0.2 68.6  68.9  69.1 
+
+The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
+
+MCMC diagnostics
+              mcse Rhat n_eff
+(Intercept)   0.0  1.0  4168 
+Schooling     0.0  1.0  4156 
+sigma         0.0  1.0  3554 
+mean_PPD      0.0  1.0  3810 
+log-posterior 0.0  1.0  1836 
+
+For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
+
+
+Code +
pp_check(glm_post1) #The dark blue line shows the observed data while the light blue lines are simulations from the posterior predictive distribution.
+
+
+

+
+
+
+
+Code +
# another way to look at posterior predictive checks
+ppc_intervals(
+  y = cdata1$Life_expectancy,
+  yrep = posterior_predict(glm_post1),
+  x = cdata1$Schooling)
+
+
+

+
+
+
+
+Code +
stan_hist(glm_post1, pars=c("Schooling"), bins=40)
+
+
+

+
+
+
+
+Code +
post_samps_Schooling <- as.data.frame(glm_post1, pars=c("Schooling"))[,"Schooling"]
+mn_Schooling <- mean(post_samps_Schooling) # posterior mean 
+ci_Schooling <- quantile(post_samps_Schooling, probs=c(0.05, 0.95)) # posterior 90% interval 
+
+print(mn_Schooling)
+
+
+
[1] 2.171649
+
+
+Code +
print(ci_Schooling)
+
+
+
      5%      95% 
+2.108951 2.233107 
+
+
+
+
+Code +
glm_fit <- glm(Life_expectancy~Schooling, data=cdata1, family=gaussian)
+summary(glm_fit)
+
+
+

+Call:
+glm(formula = Life_expectancy ~ Schooling, family = gaussian, 
+    data = cdata1)
+
+Coefficients:
+            Estimate Std. Error t value Pr(>|t|)    
+(Intercept) 52.27708    0.31190  167.61   <2e-16 ***
+Schooling    2.17227    0.03774   57.56   <2e-16 ***
+---
+Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+(Dispersion parameter for gaussian family taken to be 41.0151)
+
+    Null deviance: 253277  on 2863  degrees of freedom
+Residual deviance: 117385  on 2862  degrees of freedom
+AIC: 18768
+
+Number of Fisher Scoring iterations: 2
+
+
+Code +
prior_summary(glm_post1)
+
+
+
Priors for model 'glm_post1' 
+------
+Intercept (after predictors centered)
+  Specified prior:
+    ~ normal(location = 69, scale = 2.5)
+  Adjusted prior:
+    ~ normal(location = 69, scale = 24)
+
+Coefficients
+  Specified prior:
+    ~ normal(location = 0, scale = 2.5)
+  Adjusted prior:
+    ~ normal(location = 0, scale = 7.4)
+
+Auxiliary (sigma)
+  Specified prior:
+    ~ exponential(rate = 1)
+  Adjusted prior:
+    ~ exponential(rate = 0.11)
+------
+See help('prior_summary.stanreg') for more details
+
+
+Code +
#It can also be helpful to juxtapose intervals from the prior distribution and the posterior distribution to see how the observed data has changed the parameter estimates.
+posterior_vs_prior(glm_post1, group_by_parameter = TRUE, pars=c("(Intercept)"))
+
+
+

+
+
+Code +
posterior_vs_prior(glm_post1, group_by_parameter = TRUE, pars=c("Schooling","sigma"))
+
+
+

+
+
+
+
+Code +
glm_post2 <- stan_glm(Life_expectancy~Schooling, data=cdata1, family=gaussian, prior=normal(2, 0.5, autoscale=FALSE))
+
+posterior_vs_prior(glm_post2, pars=c("Schooling"), group_by_parameter = TRUE)
+
+
+

+
+
+
+

Summary of Articles

+
+

An introduction to using Bayesian linear regression with clinical data

+

Within the realm of psychology, the tests utilized in research can be deceptive. This article explains how opting for the Bayesian method could yield more precise outcomes compared to frequentist methods. Throughout the article, the authors use data from an electroencephalogram (EEG) and anxiety study to illustrate Bayesian models.

+

Specifically focusing on psychology, the article highlights the substantial reliance on p-values. This emphasis draws attention away from precise predictions and from whether the results are actually valid. The Bayesian approach requires that researchers make predictions prior to data analysis. While concerns have been raised about potential subjectivity in these predictions, the article contends that research inherently involves subjective decisions throughout the process. Whether rooted in scientific research or not, these decisions are influenced by the existing literature and the chosen methods.

+

The Bayesian linear method involves multiple steps: specifying prior values, incorporating the collected data, and obtaining probability in the form of a distribution (the posterior distribution). This posterior typically does not match any standard closed-form distribution, which is where the Markov Chain Monte Carlo (MCMC) method becomes important: it simulates random draws from the posterior distribution.
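The prior-to-posterior updating described here can be written compactly. For a linear regression with coefficients \(\beta\) and noise scale \(\sigma\), Bayes' rule gives

\[
p(\beta, \sigma \mid y)
= \frac{p(y \mid \beta, \sigma)\, p(\beta, \sigma)}
       {\int p(y \mid \beta, \sigma)\, p(\beta, \sigma)\, d\beta\, d\sigma}
\;\propto\; p(y \mid \beta, \sigma)\, p(\beta, \sigma).
\]

The denominator (the marginal likelihood) is usually intractable, which is exactly why MCMC is used to draw samples from the posterior rather than computing it in closed form.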

+

In assessing the adequacy of a model, Bayesian methods typically rely on two commonly used metrics: the widely applicable information criterion (WAIC) and leave-one-out cross-validation (LOO). When examining data, assumptions such as normality can be relaxed under the Bayesian approach. Despite its advantages, there are several challenges associated with using Bayesian methods, primarily their complexity and the need for a deeper understanding of the computational tools and methodologies involved.
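Both criteria are available for rstanarm fits; as a minimal sketch, assuming a fitted model object such as `glm_post1` from the analysis above:

```r
# Model adequacy via leave-one-out CV and WAIC (loo package, loaded by rstanarm)
library(rstanarm)

loo1  <- loo(glm_post1)    # PSIS-LOO estimate of expected log predictive density
waic1 <- waic(glm_post1)   # widely applicable information criterion

print(loo1)
print(waic1)
```

Lower `looic`/`waic` (or higher `elpd`) indicates better expected out-of-sample fit, and `loo_compare()` can rank several competing models.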

+
+
+

Linear regression model using Bayesian approach for energy performance of residential building

+

The article discusses the two commonly used methods for estimating regression model parameters: the frequentist method (OLS or MLE) and the Bayesian approach. The Bayesian method is characterized by viewing parameters as random variables, introducing the concept of prior, likelihood, and posterior distributions.

+

The research uses an energy efficiency dataset to predict cooling equipment needs and applies linear regression modeling with both Ordinary Least Squares (OLS) and Bayesian approaches. Correlation tests reveal relationships between the independent and dependent variables. The OLS model, while statistically significant, fails typical assumptions, prompting consideration of the Bayesian approach. Bayesian modeling is conducted using the Gibbs sampler and Markov Chain Monte Carlo (MCMC) methods. The comparison of the OLS and Bayesian models involves criteria such as RMSE, MAD, and MAPE. The study concludes that Bayesian regression is more suitable when standard assumptions are not met.
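For reference, the three comparison criteria can be written, for observations \(y_i\) and predictions \(\hat{y}_i\) (using the common forecasting definitions, with MAD as the mean absolute deviation of the errors), as

\[
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \qquad
\mathrm{MAD} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|, \qquad
\mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|.
\]

All three shrink as predictions improve; MAPE is scale-free, which helps when comparing models fit to differently scaled outcomes.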

+
+
+

What to believe: Bayesian methods for data analysis

+

The article highlights the flexibility of Bayesian data analysis, which lets researchers adapt models to various data types, such as dichotomous or metric data, or designs involving multiple treatment groups. Unlike traditional null-hypothesis significance testing (NHST), Bayesian analysis focuses on estimating parameters in a descriptive model, such as mean accuracy, without committing to specific cognitive mechanisms. Bayesian methods eliminate the need for p-values and offer richer information on parameters, including correlations and trade-offs; credible intervals are inherent to the posterior. The discussion also covers coherent computation of power and replication probability in Bayesian analysis, which provides a more realistic estimate than traditional NHST power analysis, and the article encourages a shift to Bayesian data analysis in cognitive science.

+
+
+

Bayesian Analysis Reporting Guidelines (BARG)

+

In Bayesian analysis, even when utilizing representative or informed priors, it is crucial to perform a sensitivity analysis to assess the impact of different prior specifications on the posterior results. This aims to ensure that the findings are not dependent on the choice of prior. If results remain consistent across various reasonable priors, this improves the robustness of the findings; if they are highly sensitive to the prior, the outcomes depend heavily on the assumed prior conditions.
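Such a sensitivity analysis can be sketched with rstanarm by refitting the same model under a few reasonable priors and comparing posterior intervals; the specific prior choices below are illustrative, not recommendations:

```r
library(rstanarm)

# Refit the Schooling model under three coefficient priors
priors <- list(
  default = normal(0, 2.5, autoscale = TRUE),   # rstanarm's weakly informative default
  tight   = normal(2, 0.5, autoscale = FALSE),  # informative, as in glm_post2 above
  wide    = normal(0, 10,  autoscale = FALSE)   # deliberately vague
)

fits <- lapply(priors, function(p)
  stan_glm(Life_expectancy ~ Schooling, data = cdata1,
           family = gaussian, prior = p, refresh = 0))

# Robust findings: these 90% intervals should be nearly identical
lapply(fits, posterior_interval, prob = 0.90, pars = "Schooling")
```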

+

The widely employed Markov Chain Monte Carlo (MCMC) computational method in Bayesian analysis involves generating samples from the posterior distribution of parameters. Confirming the convergence of MCMC chains is important for result reliability. Evidence of convergence, often indicated by statistics like the Potential Scale Reduction Factor (PSRF), provides confidence in the validity of the samples. High Effective Sample Size (ESS) signifies precise estimates, while low ESS may result in imprecise estimates. Reporting ESS for each parameter or derived value aids in assessing result reliability.
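These diagnostics can be read straight off a fitted rstanarm object; a small sketch, using the same summary matrix shown in the output above:

```r
# Convergence diagnostics for each parameter of the fitted model
diagnostics <- summary(glm_post1)        # includes mcse, Rhat, n_eff columns
print(round(diagnostics[, c("Rhat", "n_eff")], 3))

# A common rule of thumb before trusting the estimates
stopifnot(all(diagnostics[, "Rhat"] < 1.01))
```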

+

For decision-making based on continuous-parameter posterior distribution, defining and justifying the limits of the Region of Practical Equivalence (ROPE) is crucial. The ROPE determines the range of parameter values considered practically equivalent and guides decisions on the practical importance of parameter estimates. Providing transparency about the computational approach guarantees that users are informed about potential limitations.
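As an illustration, a ROPE decision reduces to counting posterior draws; the limits below are purely hypothetical and would need substantive justification for real use:

```r
# Hypothetical ROPE: schooling effects within +/- 0.1 years treated as negligible
rope <- c(-0.1, 0.1)

draws <- as.data.frame(glm_post1, pars = "Schooling")[, "Schooling"]

# Share of the posterior inside the ROPE, and a 95% central interval
mean(draws > rope[1] & draws < rope[2])
quantile(draws, c(0.025, 0.975))
```

If the interval falls entirely outside the ROPE, the effect is declared practically different from zero.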

+
+
+

The Application of Bayesian Analysis to Issues in Developmental Research

+

The article explores the advantages of Bayesian analysis over traditional frequentist methods in statistical inference. It emphasizes Bayesian analysis as a comprehensive and unified framework, capable of handling uncertainties with accuracy and interpretability. Moreover, it highlights Bayesian analysis’s compatibility with hierarchical models and its ability to incorporate prior information effectively, contrasting it with the less optimal nature of frequentist approaches. The discussion extends to the computational ease afforded by Bayesian analysis, particularly with the advent of algorithms like MCMC, showcasing its versatility and applicability across various analytical contexts.

+

The article then discusses the practical application of hierarchical Bayesian models, particularly in a moral reasoning study with hierarchical data structures. Through the introduction of mixed models with random effects and the description of hierarchical Bayesian models, it illustrates their ability to accommodate heterogeneity and to evaluate linear relationships in developmental research. The text concludes with a nod to Lindley’s prediction, suggesting that Bayesian methods are poised to dominate the statistical world in the twenty-first century.
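In rstanarm, hierarchical models of this kind are fit with `stan_glmer`; a sketch, assuming the `Region` column from the full data set loaded earlier (`revledata`):

```r
library(rstanarm)

# A random intercept per Region lets life-expectancy baselines vary across
# regions while partially pooling information between them
hier_fit <- stan_glmer(Life_expectancy ~ Schooling + (1 | Region),
                       data = revledata, family = gaussian, refresh = 0)

print(hier_fit, digits = 2)
```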

-

Methods

-

The common non-parametric regression model is \(Y_i = m(X_i) + \varepsilon_i\), where the response \(Y_i\) is the value of an unknown regression function \(m\) at \(X_i\) plus an error term \(\varepsilon_i\). With this definition, \(m(x)\) can be estimated by local averaging: averaging the \(Y_i\) whose \(X_i\) lie near \(x\). In other words, we fit a curve through the data points with the help of the surrounding data points. The estimation formula is printed below (R Core Team 2019):

-

\[
M_n(x) = \sum_{i=1}^{n} W_n (X_i) Y_i \tag{1}
\] where the \(W_n(X_i)\) are real-valued weights, typically normalized to sum to one. The weights are positive and small if \(X_i\) is far from \(x\).

-

Another equation:

-

\[
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i
\]

+

~ Bayesian Linear Regression ~

Analysis and Results

@@ -3214,19 +3390,19 @@

Data and Visualizat
Code -
# loading packages 
-library(tidyverse)
-library(knitr)
-library(ggthemes)
-library(ggrepel)
-library(dslabs)
+
# loading packages 
+library(tidyverse)
+library(knitr)
+library(ggthemes)
+library(ggrepel)
+library(dslabs)
Code -
# Load Data
-kable(head(murders))
+
# Load Data
+kable(head(murders))
@@ -3287,21 +3463,21 @@

Data and Visualizat
Code -
ggplot1 = murders %>% ggplot(mapping = aes(x=population/10^6, y=total)) 
-
-  ggplot1 + geom_point(aes(col=region), size = 4) +
-  geom_text_repel(aes(label=abb)) +
-  scale_x_log10() +
-  scale_y_log10() +
-  geom_smooth(formula = "y~x", method=lm,se = F)+
-  xlab("Populations in millions (log10 scale)") + 
-  ylab("Total number of murders (log10 scale)") +
-  ggtitle("US Gun Murders in 2010") +
-  scale_color_discrete(name = "Region")+
-      theme_bw()
+
ggplot1 = murders %>% ggplot(mapping = aes(x=population/10^6, y=total)) 
+
+  ggplot1 + geom_point(aes(col=region), size = 4) +
+  geom_text_repel(aes(label=abb)) +
+  scale_x_log10() +
+  scale_y_log10() +
+  geom_smooth(formula = "y~x", method=lm,se = F)+
+  xlab("Populations in millions (log10 scale)") + 
+  ylab("Total number of murders (log10 scale)") +
+  ggtitle("US Gun Murders in 2010") +
+  scale_color_discrete(name = "Region")+
+      theme_bw()
-

+

@@ -3312,25 +3488,11 @@

Statistical Modeling<

Conclusion

-
- - +
+

References

-

References

-
-Bro, Rasmus, and Age K Smilde. 2014. “Principal Component Analysis.” Analytical Methods 6 (9): 2812–31. -
-
-Efromovich, S. 2008. Nonparametric Curve Estimation: Methods, Theory, and Applications. Springer Series in Statistics. Springer New York. https://books.google.com/books?id=mdoLBwAAQBAJ. -
-
-R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org. -
-
-Wang, Ming. 2014. “Generalized Estimating Equations in Longitudinal Data Analysis: A Review and Recent Developments.” Advances in Statistics 2014. -
-
+