76 changes: 9 additions & 67 deletions notebooks/02.06-Boolean-Arrays-and-Masks.ipynb
@@ -8,28 +8,15 @@
"<img align=\"left\" style=\"padding-right:10px;\" src=\"figures/PDSH-cover-small.png\">\n",
Owner Author

First comment on book cover

Owner Author

@amit1rrr Jul 3, 2019

2nd comment "Added below deleting 2 cells"

reply

Owner Author

Third comment -

import numpy as np

Owner Author

4th comment Seaborn

Owner Author

5th comment - Lots of print stmts

Comment on last cell

"*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*\n",
"\n",
"*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!--NAVIGATION-->\n",
"< [Computation on Arrays: Broadcasting](02.05-Computation-on-arrays-broadcasting.ipynb) | [Contents](Index.ipynb) | [Fancy Indexing](02.07-Fancy-Indexing.ipynb) >"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Comparisons, Masks, and Boolean Logic"
"*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*\n",
"Added above deleting 2 cells"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Added below deleting 2 cells\n",
"This section covers the use of Boolean masks to examine and manipulate values within NumPy arrays.\n",
"Masking comes up when you want to extract, modify, count, or otherwise manipulate values in an array based on some criterion: for example, you might wish to count all values greater than a certain value, or perhaps remove all outliers that are above some threshold.\n",
"In NumPy, Boolean masking is often the most efficient way to accomplish these types of tasks."
@@ -68,27 +55,9 @@
"# use pandas to extract rainfall inches as a NumPy array\n",
"rainfall = pd.read_csv('data/Seattle2014.csv')['PRCP'].values\n",
"inches = rainfall / 254.0 # 1/10mm -> inches\n",
"inches.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The array contains 365 values, giving daily rainfall in inches from January 1 to December 31, 2014.\n",
"\n",
"As a first quick visualization, let's look at the histogram of rainy days, which was generated using Matplotlib (we will explore this tool more fully in [Chapter 4](04.00-Introduction-To-Matplotlib.ipynb)):"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"inches.shape\n",
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"import seaborn; seaborn.set() # set plot styles"
"import matplotlib.pyplot as plt"
]
},
{
@@ -108,6 +77,7 @@
}
],
"source": [
"import seaborn; seaborn.set() # set plot styles\n",
"plt.hist(inches, 40);"
]
},
@@ -827,36 +797,7 @@
}
],
"source": [
"x[x < 5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What is returned is a one-dimensional array filled with all the values that meet this condition; in other words, all the values in positions at which the mask array is ``True``.\n",
"\n",
"We are then free to operate on these values as we wish.\n",
"For example, we can compute some relevant statistics on our Seattle rain data:"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Median precip on rainy days in 2014 (inches): 0.194881889764\n",
"Median precip on summer days in 2014 (inches): 0.0\n",
"Maximum precip on summer days in 2014 (inches): 0.850393700787\n",
"Median precip on non-summer rainy days (inches): 0.200787401575\n"
]
}
],
"source": [
"x[x < 5]\n",
"# construct a mask of all rainy days\n",
"rainy = (inches > 0)\n",
"\n",
@@ -1177,6 +1118,7 @@
],
"metadata": {
"anaconda-cloud": {},
"celltoolbar": "Raw Cell Format",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
@@ -1192,7 +1134,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
"version": "3.6.8"
}
},
"nbformat": 4,
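
For anyone reviewing the merged cells above without the notebook open, the masking workflow the notebook text describes (build a Boolean mask from a comparison, then index or count with it) can be sketched roughly as follows. This is an illustration, not the notebook's code: the `inches` values are made up and stand in for the rainfall array loaded from `data/Seattle2014.csv`, and the names `light`, `n_rainy`, `n_light`, and `median_rainy` are hypothetical.

```python
import numpy as np

# Hypothetical stand-in for the rainfall data: daily totals in inches.
inches = np.array([0.0, 0.12, 0.0, 0.45, 0.9, 0.0, 0.03, 1.2])

# A comparison yields a Boolean array with the same shape as the input.
rainy = inches > 0

# Indexing with the mask returns a 1-D array of the selected values.
rainy_values = inches[rainy]

# Masks combine element-wise with & (and), | (or), and ~ (not).
light = rainy & (inches < 0.5)

# True counts as 1, so summing a mask counts the matching entries.
n_rainy = int(np.sum(rainy))
n_light = int(np.count_nonzero(light))
median_rainy = float(np.median(rainy_values))

print(n_rainy, n_light, median_rainy)
```

Note the parentheses around `inches < 0.5`: `&` binds more tightly than comparison operators, so omitting them changes the meaning of the expression.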
173 changes: 19 additions & 154 deletions notebooks/02.08-Sorting.ipynb
@@ -113,32 +113,7 @@
"output_type": "execute_result"

Test comment

}
],
"source": [
"x = np.array([2, 1, 4, 3, 5])\n",
"bogosort(x)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This silly sorting method relies on pure chance: it repeatedly applies a random shuffling of the array until the result happens to be sorted.\n",
"With an average scaling of $\\mathcal{O}[N \\times N!]$, (that's *N* times *N* factorial) this should–quite obviously–never be used for any real computation.\n",
"\n",
"Fortunately, Python contains built-in sorting algorithms that are *much* more efficient than either of the simplistic algorithms just shown. We'll start by looking at the Python built-ins, and then take a look at the routines included in NumPy and optimized for NumPy arrays."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Fast Sorting in NumPy: ```np.sort``` and ``np.argsort``\n",
"\n",
"Although Python has built-in `sort` and ``sorted`` functions to work with lists, we won't discuss them here because NumPy's ``np.sort`` function turns out to be much more efficient and useful for our purposes.\n",
"By default ``np.sort`` uses an $\\mathcal{O}[N\\log N]$, *quicksort* algorithm, though *mergesort* and *heapsort* are also available. For most applications, the default quicksort is more than sufficient.\n",
"\n",
"To return a sorted version of the array without modifying the input, you can use ``np.sort``:"
]
"source": []
},
{
"cell_type": "code",
@@ -159,7 +134,10 @@
],
"source": [
"x = np.array([2, 1, 4, 3, 5])\n",
"np.sort(x)"
"bogosort(x)\n",
"x = np.array([2, 1, 4, 3, 5])\n",
"np.sort(x)\n",
"# 2 lines got copied below, top one has array output (intact) this one should have error output"
]
},
{
@@ -274,55 +252,12 @@
"source": [
"rand = np.random.RandomState(42)\n",
"X = rand.randint(0, 10, (4, 6))\n",
"print(X)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[2, 1, 4, 0, 1, 5],\n",
" [5, 2, 5, 4, 3, 7],\n",
" [6, 3, 7, 4, 6, 7],\n",
" [7, 6, 7, 4, 9, 9]])"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(X)\n",
"# sort each column of X\n",
"np.sort(X, axis=0)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[3, 4, 6, 6, 7, 9],\n",
" [2, 3, 4, 6, 7, 7],\n",
" [1, 2, 4, 5, 7, 7],\n",
" [0, 1, 4, 5, 5, 9]])"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.sort(X, axis=0)\n",
"# sort each row of X\n",
"np.sort(X, axis=1)"
"np.sort(X, axis=1)\n",
"# source from following 2 cells got appended here, output as is"
]
},
{
@@ -416,22 +351,6 @@
"Using the standard convention, we'll arrange these in a $10\\times 2$ array:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"X = rand.rand(10, 2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get an idea of how these points look, let's quickly scatter plot them:"
]
},
{
"cell_type": "code",
"execution_count": 15,
@@ -449,10 +368,12 @@
}
],
"source": [
"X = rand.rand(10, 2)\n",
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"import seaborn; seaborn.set() # Plot styling\n",
"plt.scatter(X[:, 0], X[:, 1], s=100);"
"plt.scatter(X[:, 0], X[:, 1], s=100);\n",
"# line is copied down, should have image output (previous cell had array output, got removed)"
]
},
{
@@ -573,67 +494,6 @@
"dist_sq.diagonal()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It checks out!\n",
"With the pairwise square-distances converted, we can now use ``np.argsort`` to sort along each row. The leftmost columns will then give the indices of the nearest neighbors:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[0 3 9 7 1 4 2 5 6 8]\n",
" [1 4 7 9 3 6 8 5 0 2]\n",
" [2 1 4 6 3 0 8 9 7 5]\n",
" [3 9 7 0 1 4 5 8 6 2]\n",
" [4 1 8 5 6 7 9 3 0 2]\n",
" [5 8 6 4 1 7 9 3 2 0]\n",
" [6 8 5 4 1 7 9 3 2 0]\n",
" [7 9 3 1 4 0 5 8 6 2]\n",
" [8 5 6 4 1 7 9 3 2 0]\n",
" [9 7 3 0 1 4 5 8 6 2]]\n"
]
}
],
"source": [
"nearest = np.argsort(dist_sq, axis=1)\n",
"print(nearest)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that the first column gives the numbers 0 through 9 in order: this is due to the fact that each point's closest neighbor is itself, as we would expect.\n",
"\n",
"By using a full sort here, we've actually done more work than we need to in this case. If we're simply interested in the nearest $k$ neighbors, all we need is to partition each row so that the smallest $k + 1$ squared distances come first, with larger distances filling the remaining positions of the array. We can do this with the ``np.argpartition`` function:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"K = 2\n",
"nearest_partition = np.argpartition(dist_sq, K + 1, axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In order to visualize this network of neighbors, let's quickly plot the points along with lines representing the connections from each point to its two nearest neighbors:"
]
},
{
"cell_type": "code",
"execution_count": 2,
@@ -652,6 +512,10 @@
}
],
"source": [
"nearest = np.argsort(dist_sq, axis=1)\n",
"print(nearest)\n",
"K = 2\n",
"nearest_partition = np.argpartition(dist_sq, K + 1, axis=1)\n",
"plt.scatter(X[:, 0], X[:, 1], s=100)\n",
"\n",
"# draw lines from each point to its two nearest neighbors\n",
@@ -661,7 +525,8 @@
" for j in nearest_partition[i, :K+1]:\n",
" # plot a line from X[i] to X[j]\n",
" # use some zip magic to make it happen:\n",
" plt.plot(*zip(X[j], X[i]), color='black')"
" plt.plot(*zip(X[j], X[i]), color='black')\n",
"# source from earlier 2 cells got merged here. Earlier cell+outputs deleted, error output for this one."
]
},
{
@@ -729,7 +594,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
"version": "3.6.8"
}
},
"nbformat": 4,
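
The merged cell in the last hunks runs `np.argsort` and `np.argpartition` back to back, so a reviewer may want to confirm that both give the same neighbor sets. The sketch below is one way to check that under stated assumptions: `dist_sq` is rebuilt here with a broadcasting expression that is not shown in this diff, the seed is arbitrary, and the names `full`, `part`, and `same_neighbors` are hypothetical.

```python
import numpy as np

rng = np.random.RandomState(42)  # arbitrary seed, chosen for reproducibility
X = rng.rand(10, 2)              # ten random points in the unit square

# Pairwise squared distances via broadcasting: shape (10, 10).
# Assumed construction; the diff only shows dist_sq being used.
diff = X[:, np.newaxis, :] - X[np.newaxis, :, :]
dist_sq = np.sum(diff ** 2, axis=-1)

K = 2

# Full sort: every row is completely ordered by distance.
nearest = np.argsort(dist_sq, axis=1)

# Partial sort: the first K + 1 entries of each row hold the K + 1 smallest
# distances (the point itself plus K neighbors), in no guaranteed order.
nearest_partition = np.argpartition(dist_sq, K + 1, axis=1)

# Compare the two as sets by sorting the selected indices within each row.
full = np.sort(nearest[:, :K + 1], axis=1)
part = np.sort(nearest_partition[:, :K + 1], axis=1)
same_neighbors = bool(np.array_equal(full, part))

print(same_neighbors)
```

Because `np.argpartition` does not order the entries within the first `K + 1` slots, a direct element-wise comparison against the `np.argsort` result could fail even when the neighbor sets agree, which is why both are sorted before comparing.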