19 changes: 19 additions & 0 deletions BenBrandt_DATA400_Idea.md
@@ -0,0 +1,19 @@
# DATA400 Mini-Project Idea 1
## Ben Brandt

For my mini-project, I plan on scraping housing data from Zillow for my zip code and some neighboring zip codes in order to predict house prices.

### Tractable Data
I want the typical information that someone looking for a home might consider when deciding whether to buy a house, all of which is available directly on the Zillow website. This includes numerical values such as the number of bedrooms, number of bathrooms, square footage, and lot acreage. There is also a categorical variable I would like to include: the type of house (e.g., townhouse, single-family residence). Together, these serve as the predictor variables for the response variable, which is the price of the house.
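As a sketch of how one listing might be encoded, the numeric fields pass through directly and the categorical home type is one-hot encoded. The field names here are my own illustration, not Zillow's actual schema:

```python
# Hypothetical feature encoding for one listing. Field names are
# illustrative placeholders, not Zillow's real schema.
HOME_TYPES = ["townhouse", "single_family", "condo"]

def encode_listing(listing):
    """Turn a raw listing dict into (features, price).

    Numeric predictors pass through unchanged; the categorical
    home type is one-hot encoded against HOME_TYPES.
    """
    features = [
        listing["bedrooms"],
        listing["bathrooms"],
        listing["sqft"],
        listing["lot_acres"],
    ]
    # One-hot encode the home type.
    features += [1 if listing["home_type"] == t else 0 for t in HOME_TYPES]
    return features, listing["price"]

example = {
    "bedrooms": 3, "bathrooms": 2.5, "sqft": 1850,
    "lot_acres": 0.12, "home_type": "townhouse", "price": 540_000,
}
x, y = encode_listing(example)
print(x, y)  # → [3, 2.5, 1850, 0.12, 1, 0, 0] 540000
```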

### Data Retrieval
I anticipate getting this data by scraping the website myself. Using a list of zip codes, I can navigate through all the available homes in each zip code and use XPath expressions to extract the fields I need. In practice, this would be very similar to the DATA200 GoodReads project.
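The per-listing extraction step could look roughly like the sketch below, using the limited XPath support in Python's standard-library `ElementTree`. The markup and paths here are placeholders; Zillow's real page structure differs, and a live scraper would likely need `lxml` or Selenium to fetch and parse actual pages:

```python
# Sketch of extracting fields from one listing card with XPath-style
# queries. SAMPLE_CARD is invented, well-formed markup for illustration.
import xml.etree.ElementTree as ET

SAMPLE_CARD = """
<article>
  <span class="price">$540,000</span>
  <ul>
    <li class="beds">3</li>
    <li class="baths">2</li>
    <li class="sqft">1850</li>
  </ul>
</article>
"""

def parse_card(fragment):
    """Extract price, beds, baths, and sqft from one listing card."""
    root = ET.fromstring(fragment)
    price_text = root.find(".//span[@class='price']").text
    return {
        "price": int(price_text.replace("$", "").replace(",", "")),
        "beds": int(root.find(".//li[@class='beds']").text),
        "baths": int(root.find(".//li[@class='baths']").text),
        "sqft": int(root.find(".//li[@class='sqft']").text),
    }

print(parse_card(SAMPLE_CARD))
# → {'price': 540000, 'beds': 3, 'baths': 2, 'sqft': 1850}
```

Looping this over every listing page for each zip code in the list would build up the full dataset.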

### Specification of Model
I think the best model for this project is a random forest. This model is versatile and usable in most situations, and in this case its feature-importance scores will highlight which variables matter most. Since random forests are relatively robust to multicollinearity, there is also less concern about predictors being correlated with one another.
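A minimal sketch of the modeling step, assuming scikit-learn and a dataset already encoded as numeric features; the synthetic data below is a stand-in for the real scraped listings:

```python
# Random forest price model on synthetic stand-in data (not real
# Zillow listings), showing the feature-importance readout.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.integers(1, 6, n),        # bedrooms
    rng.integers(1, 4, n),        # bathrooms
    rng.uniform(800, 4000, n),    # square footage
    rng.uniform(0.05, 1.0, n),    # lot acreage
])
# Toy price: driven mostly by square footage, plus bedrooms and noise.
y = 200 * X[:, 2] + 50_000 * X[:, 0] + rng.normal(0, 20_000, n)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, y)

# Importances sum to 1; larger values mean the variable mattered more.
for name, imp in zip(["beds", "baths", "sqft", "lot"], model.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

On real data, this readout is what would show whether, say, square footage or lot size drives prices in a given zip code.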

### Implications for Stakeholders
There are many stakeholders who could be affected by this project. The first and most obvious are people currently looking for a home in the Northern Virginia area who want an estimated price given their specifications. A second group is people selling their homes: they can use the model to see what their home is worth and set a price accordingly. A third is the realty companies selling these homes, which can use the model to better understand what types of homes are being sold and at what price points, and adjust their strategy accordingly.

### Ethical, Legal, and Societal Implications
There are some legal and societal implications of this project, both positive and negative. In terms of legality, Zillow likely does not welcome people scraping its site, and doing so may violate its terms of service. There is also the question of homeowners' privacy, but as long as addresses and photos are not included, direct exposure of individual homeowners should be minimal. Potential negative societal impacts include reduced affordability in certain areas and the encouragement of bulk investing. However, there are also societal upsides worth considering, such as price transparency and helping buyers and renters make better-informed decisions about the kind of home they buy.
16 changes: 16 additions & 0 deletions BenBrandt_DATA400_Idea.md.html
@@ -0,0 +1,16 @@
<!DOCTYPE html><html><head><meta charset="utf-8"><title>BenBrandt_DATA400_Idea.md</title><style></style></head><body id="preview">
<h1 class="code-line" data-line-start=0 data-line-end=1 ><a id="DATA400_MiniProject_Idea_1_0"></a>DATA400 Mini-Project Idea 1</h1>
<h2 class="code-line" data-line-start=1 data-line-end=2 ><a id="Ben_Brandt_1"></a>Ben Brandt</h2>
<p class="has-line-data" data-line-start="3" data-line-end="4">For my mini-project, I plan on scraping housing data from Zillow for my zip code and some neighboring zip codes in order to predict house prices.</p>
<h3 class="code-line" data-line-start=5 data-line-end=6 ><a id="Tractable_Data_5"></a>Tractable Data</h3>
<p class="has-line-data" data-line-start="6" data-line-end="7">I want the typical information that someone looking for a home might consider when deciding whether to buy a house, all of which is available directly on the Zillow website. This includes numerical values such as the number of bedrooms, number of bathrooms, square footage, and lot acreage. There is also a categorical variable I would like to include: the type of house (e.g., townhouse, single-family residence). Together, these serve as the predictor variables for the response variable, which is the price of the house.</p>
<h3 class="code-line" data-line-start=8 data-line-end=9 ><a id="Data_Retrieval_8"></a>Data Retrieval</h3>
<p class="has-line-data" data-line-start="9" data-line-end="10">I anticipate getting this data by scraping the website myself. Using a list of zip codes, I can navigate through all the available homes in each zip code and use XPath expressions to extract the fields I need. In practice, this would be very similar to the DATA200 GoodReads project.</p>
<h3 class="code-line" data-line-start=11 data-line-end=12 ><a id="Specification_of_Model_11"></a>Specification of Model</h3>
<p class="has-line-data" data-line-start="12" data-line-end="13">I think the best model for this project is a random forest. This model is versatile and usable in most situations, and in this case its feature-importance scores will highlight which variables matter most. Since random forests are relatively robust to multicollinearity, there is also less concern about predictors being correlated with one another.</p>
<h3 class="code-line" data-line-start=14 data-line-end=15 ><a id="Implications_of_Stakeholders_14"></a>Implications for Stakeholders</h3>
<p class="has-line-data" data-line-start="15" data-line-end="16">There are many stakeholders who could be affected by this project. The first and most obvious are people currently looking for a home in the Northern Virginia area who want an estimated price given their specifications. A second group is people selling their homes: they can use the model to see what their home is worth and set a price accordingly. A third is the realty companies selling these homes, which can use the model to better understand what types of homes are being sold and at what price points, and adjust their strategy accordingly.</p>
<h3 class="code-line" data-line-start=17 data-line-end=18 ><a id="Ethical_Legal_and_Societal_Implications_17"></a>Ethical, Legal, and Societal Implications</h3>
<p class="has-line-data" data-line-start="18" data-line-end="19">There are some legal and societal implications of this project, both positive and negative. In terms of legality, Zillow likely does not welcome people scraping its site, and doing so may violate its terms of service. There is also the question of homeowners' privacy, but as long as addresses and photos are not included, direct exposure of individual homeowners should be minimal. Potential negative societal impacts include reduced affordability in certain areas and the encouragement of bulk investing. However, there are also societal upsides worth considering, such as price transparency and helping buyers and renters make better-informed decisions about the kind of home they buy.</p>

</body></html>
28 changes: 28 additions & 0 deletions presentations/test.Rmd
@@ -0,0 +1,28 @@
---
title: "My First Presentation"
subtitle: "⚔<br/>with xaringan"
author: "Ben Brandt"
institute: "Dickinson College"
date: "`r Sys.Date()`"
output:
  xaringan::moon_reader:
    css: xaringan-themer.css
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---
```{r xaringan-themer, include=FALSE, warning=FALSE}
library(xaringanthemer)
style_mono_accent(
base_color = "#1c5253",
header_font_google = google_font("Josefin Sans"),
text_font_google = google_font("Montserrat", "300", "300i"),
code_font_google = google_font("Fira Mono")
)
```
# My First Slide in Xaringan
**test slide**
---

172 changes: 172 additions & 0 deletions presentations/test.html
@@ -0,0 +1,172 @@
<!DOCTYPE html>
<html lang="" xml:lang="">
<head>
<title>My First Presentation</title>
<meta charset="utf-8" />
<meta name="author" content="Ben Brandt" />
<script src="libs/header-attrs-2.30/header-attrs.js"></script>
<link rel="stylesheet" href="xaringan-themer.css" type="text/css" />
</head>
<body>
<textarea id="source">
class: center, middle, inverse, title-slide

.title[
# My First Presentation
]
.subtitle[
## ⚔<br/>with xaringan
]
.author[
### Ben Brandt
]
.institute[
### Dickinson College
]
.date[
### Sys.Date()
]

---


# My First Slide in Xaringan
**test slide**
---

</textarea>
<style data-target="print-only">@media screen {.remark-slide-container{display:block;}.remark-slide-scaler{box-shadow:none;}}</style>
<script src="https://remarkjs.com/downloads/remark-latest.min.js"></script>
<script>var slideshow = remark.create({
"highlightStyle": "github",
"highlightLines": true,
"countIncrementalSlides": false
});
if (window.HTMLWidgets) slideshow.on('afterShowSlide', function (slide) {
window.dispatchEvent(new Event('resize'));
});
(function(d) {
var s = d.createElement("style"), r = d.querySelector(".remark-slide-scaler");
if (!r) return;
s.type = "text/css"; s.innerHTML = "@page {size: " + r.style.width + " " + r.style.height +"; }";
d.head.appendChild(s);
})(document);

(function(d) {
var el = d.getElementsByClassName("remark-slides-area");
if (!el) return;
var slide, slides = slideshow.getSlides(), els = el[0].children;
for (var i = 1; i < slides.length; i++) {
slide = slides[i];
if (slide.properties.continued === "true" || slide.properties.count === "false") {
els[i - 1].className += ' has-continuation';
}
}
var s = d.createElement("style");
s.type = "text/css"; s.innerHTML = "@media print { .has-continuation { display: none; } }";
d.head.appendChild(s);
})(document);
// delete the temporary CSS (for displaying all slides initially) when the user
// starts to view slides
(function() {
var deleted = false;
slideshow.on('beforeShowSlide', function(slide) {
if (deleted) return;
var sheets = document.styleSheets, node;
for (var i = 0; i < sheets.length; i++) {
node = sheets[i].ownerNode;
if (node.dataset["target"] !== "print-only") continue;
node.parentNode.removeChild(node);
}
deleted = true;
});
})();
// add `data-at-shortcutkeys` attribute to <body> to resolve conflicts with JAWS
// screen reader (see PR #262)
(function(d) {
let res = {};
d.querySelectorAll('.remark-help-content table tr').forEach(tr => {
const t = tr.querySelector('td:nth-child(2)').innerText;
tr.querySelectorAll('td:first-child .key').forEach(key => {
const k = key.innerText;
if (/^[a-z]$/.test(k)) res[k] = t; // must be a single letter (key)
});
});
d.body.setAttribute('data-at-shortcutkeys', JSON.stringify(res));
})(document);
(function() {
"use strict"
// Replace <script> tags in slides area to make them executable
var scripts = document.querySelectorAll(
'.remark-slides-area .remark-slide-container script'
);
if (!scripts.length) return;
for (var i = 0; i < scripts.length; i++) {
var s = document.createElement('script');
var code = document.createTextNode(scripts[i].textContent);
s.appendChild(code);
var scriptAttrs = scripts[i].attributes;
for (var j = 0; j < scriptAttrs.length; j++) {
s.setAttribute(scriptAttrs[j].name, scriptAttrs[j].value);
}
scripts[i].parentElement.replaceChild(s, scripts[i]);
}
})();
(function() {
var links = document.getElementsByTagName('a');
for (var i = 0; i < links.length; i++) {
if (/^(https?:)?\/\//.test(links[i].getAttribute('href'))) {
links[i].target = '_blank';
}
}
})();
// adds .remark-code-has-line-highlighted class to <pre> parent elements
// of code chunks containing highlighted lines with class .remark-code-line-highlighted
(function(d) {
const hlines = d.querySelectorAll('.remark-code-line-highlighted');
const preParents = [];
const findPreParent = function(line, p = 0) {
if (p > 1) return null; // traverse up no further than grandparent
const el = line.parentElement;
return el.tagName === "PRE" ? el : findPreParent(el, ++p);
};

for (let line of hlines) {
let pre = findPreParent(line);
if (pre && !preParents.includes(pre)) preParents.push(pre);
}
preParents.forEach(p => p.classList.add("remark-code-has-line-highlighted"));
})(document);</script>

<script>
slideshow._releaseMath = function(el) {
var i, text, code, codes = el.getElementsByTagName('code');
for (i = 0; i < codes.length;) {
code = codes[i];
if (code.parentNode.tagName !== 'PRE' && code.childElementCount === 0) {
text = code.textContent;
if (/^\\\((.|\s)+\\\)$/.test(text) || /^\\\[(.|\s)+\\\]$/.test(text) ||
/^\$\$(.|\s)+\$\$$/.test(text) ||
/^\\begin\{([^}]+)\}(.|\s)+\\end\{[^}]+\}$/.test(text)) {
code.outerHTML = code.innerHTML; // remove <code></code>
continue;
}
}
i++;
}
};
slideshow._releaseMath(document);
</script>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
var script = document.createElement('script');
script.type = 'text/javascript';
script.src = 'https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-MML-AM_CHTML';
if (location.protocol !== 'file:' && /^https?:/.test(script.src))
script.src = script.src.replace(/^https?:/, '');
document.getElementsByTagName('head')[0].appendChild(script);
})();
</script>
</body>
</html>