Delete and merge cells #46
base: master
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -8,28 +8,15 @@ | |
| "<img align=\"left\" style=\"padding-right:10px;\" src=\"figures/PDSH-cover-small.png\">\n", | ||
| "*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*\n", | ||
| "\n", | ||
| "*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "<!--NAVIGATION-->\n", | ||
| "< [Computation on Arrays: Broadcasting](02.05-Computation-on-arrays-broadcasting.ipynb) | [Contents](Index.ipynb) | [Fancy Indexing](02.07-Fancy-Indexing.ipynb) >" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "# Comparisons, Masks, and Boolean Logic" | ||
| "*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*\n", | ||
| "Added above deleting 2 cells" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "Added below deleting 2 cells\n", | ||
| "This section covers the use of Boolean masks to examine and manipulate values within NumPy arrays.\n", | ||
| "Masking comes up when you want to extract, modify, count, or otherwise manipulate values in an array based on some criterion: for example, you might wish to count all values greater than a certain value, or perhaps remove all outliers that are above some threshold.\n", | ||
| "In NumPy, Boolean masking is often the most efficient way to accomplish these types of tasks." | ||
|
|
@@ -68,27 +55,9 @@ | |
| "# use pandas to extract rainfall inches as a NumPy array\n", | ||
| "rainfall = pd.read_csv('data/Seattle2014.csv')['PRCP'].values\n", | ||
| "inches = rainfall / 254.0 # 1/10mm -> inches\n", | ||
| "inches.shape" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "The array contains 365 values, giving daily rainfall in inches from January 1 to December 31, 2014.\n", | ||
| "\n", | ||
| "As a first quick visualization, let's look at the histogram of rainy days, which was generated using Matplotlib (we will explore this tool more fully in [Chapter 4](04.00-Introduction-To-Matplotlib.ipynb)):" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": 2, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "inches.shape\n", | ||
| "%matplotlib inline\n", | ||
| "import matplotlib.pyplot as plt\n", | ||
| "import seaborn; seaborn.set() # set plot styles" | ||
| "import matplotlib.pyplot as plt" | ||
| ] | ||
| }, | ||
| { | ||
|
|
@@ -108,6 +77,7 @@ | |
| } | ||
| ], | ||
| "source": [ | ||
| "import seaborn; seaborn.set() # set plot styles\n", | ||
| "plt.hist(inches, 40);" | ||
| ] | ||
| }, | ||
|
|
@@ -827,36 +797,7 @@ | |
| } | ||
| ], | ||
| "source": [ | ||
| "x[x < 5]" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "What is returned is a one-dimensional array filled with all the values that meet this condition; in other words, all the values in positions at which the mask array is ``True``.\n", | ||
| "\n", | ||
| "We are then free to operate on these values as we wish.\n", | ||
| "For example, we can compute some relevant statistics on our Seattle rain data:" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": 29, | ||
| "metadata": {}, | ||
| "outputs": [ | ||
| { | ||
| "name": "stdout", | ||
| "output_type": "stream", | ||
| "text": [ | ||
| "Median precip on rainy days in 2014 (inches): 0.194881889764\n", | ||
| "Median precip on summer days in 2014 (inches): 0.0\n", | ||
| "Maximum precip on summer days in 2014 (inches): 0.850393700787\n", | ||
| "Median precip on non-summer rainy days (inches): 0.200787401575\n" | ||
| ] | ||
| } | ||
| ], | ||
| "source": [ | ||
| "x[x < 5]\n", | ||
| "# construct a mask of all rainy days\n", | ||
| "rainy = (inches > 0)\n", | ||
| "\n", | ||
|
|
@@ -1177,6 +1118,7 @@ | |
| ], | ||
| "metadata": { | ||
| "anaconda-cloud": {}, | ||
| "celltoolbar": "Raw Cell Format", | ||
| "kernelspec": { | ||
| "display_name": "Python 3", | ||
| "language": "python", | ||
|
|
@@ -1192,7 +1134,7 @@ | |
| "name": "python", | ||
| "nbconvert_exporter": "python", | ||
| "pygments_lexer": "ipython3", | ||
| "version": "3.6.5" | ||
| "version": "3.6.8" | ||
| } | ||
| }, | ||
| "nbformat": 4, | ||
|
|
||
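For reference when reviewing the merged cells above: the notebook text touched by this diff describes Boolean masking, where a comparison such as `inches > 0` yields a Boolean array that is then used to index the data. The following is a minimal sketch of that pattern, not the notebook's own cell. The `inches` values are made up for illustration (the real notebook loads them from `data/Seattle2014.csv`), and the `heavy` threshold of 0.5 is an arbitrary assumption.

```python
import numpy as np

# Hypothetical daily rainfall in inches (illustrative values, NOT the Seattle data)
inches = np.array([0.0, 0.2, 0.0, 1.1, 0.05, 0.0, 0.6])

# A comparison produces a Boolean mask with the same shape as the data
rainy = inches > 0
heavy = inches > 0.5  # assumed threshold, chosen only for this sketch

# Indexing with a mask returns a 1-D array of the values where the mask is True
rainy_values = inches[rainy]

# True counts as 1, so summing a mask counts the matching entries
n_rainy = int(np.sum(rainy))
median_rainy = float(np.median(rainy_values))

# Masks combine with the bitwise operators & (and), | (or), ~ (not)
light_rainy = inches[rainy & ~heavy]

print("rainy days:", n_rainy)
print("median on rainy days:", median_rainy)
print("light rainy days:", light_rainy)
```

Because each statement depends on the mask built before it, merging these steps into one cell keeps the final output but drops the intermediate ones, which is the behavior the test comments in this diff are probing.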
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -113,32 +113,7 @@ | |
| "output_type": "execute_result" | ||
| } | ||
| ], | ||
| "source": [ | ||
| "x = np.array([2, 1, 4, 3, 5])\n", | ||
| "bogosort(x)" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "This silly sorting method relies on pure chance: it repeatedly applies a random shuffling of the array until the result happens to be sorted.\n", | ||
| "With an average scaling of $\\mathcal{O}[N \\times N!]$, (that's *N* times *N* factorial) this should–quite obviously–never be used for any real computation.\n", | ||
| "\n", | ||
| "Fortunately, Python contains built-in sorting algorithms that are *much* more efficient than either of the simplistic algorithms just shown. We'll start by looking at the Python built-ins, and then take a look at the routines included in NumPy and optimized for NumPy arrays." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## Fast Sorting in NumPy: ```np.sort``` and ``np.argsort``\n", | ||
| "\n", | ||
| "Although Python has built-in `sort` and ``sorted`` functions to work with lists, we won't discuss them here because NumPy's ``np.sort`` function turns out to be much more efficient and useful for our purposes.\n", | ||
| "By default ``np.sort`` uses an $\\mathcal{O}[N\\log N]$, *quicksort* algorithm, though *mergesort* and *heapsort* are also available. For most applications, the default quicksort is more than sufficient.\n", | ||
| "\n", | ||
| "To return a sorted version of the array without modifying the input, you can use ``np.sort``:" | ||
| ] | ||
| "source": [] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
|
|
@@ -159,7 +134,10 @@ | |
| ], | ||
| "source": [ | ||
| "x = np.array([2, 1, 4, 3, 5])\n", | ||
| "np.sort(x)" | ||
| "bogosort(x)\n", | ||
| "x = np.array([2, 1, 4, 3, 5])\n", | ||
| "np.sort(x)\n", | ||
| "# 2 lines got copied below, top one has array output (intact) this one should have error output" | ||
| ] | ||
| }, | ||
| { | ||
|
|
@@ -274,55 +252,12 @@ | |
| "source": [ | ||
| "rand = np.random.RandomState(42)\n", | ||
| "X = rand.randint(0, 10, (4, 6))\n", | ||
| "print(X)" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": 10, | ||
| "metadata": {}, | ||
| "outputs": [ | ||
| { | ||
| "data": { | ||
| "text/plain": [ | ||
| "array([[2, 1, 4, 0, 1, 5],\n", | ||
| " [5, 2, 5, 4, 3, 7],\n", | ||
| " [6, 3, 7, 4, 6, 7],\n", | ||
| " [7, 6, 7, 4, 9, 9]])" | ||
| ] | ||
| }, | ||
| "execution_count": 10, | ||
| "metadata": {}, | ||
| "output_type": "execute_result" | ||
| } | ||
| ], | ||
| "source": [ | ||
| "print(X)\n", | ||
| "# sort each column of X\n", | ||
| "np.sort(X, axis=0)" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": 11, | ||
| "metadata": {}, | ||
| "outputs": [ | ||
| { | ||
| "data": { | ||
| "text/plain": [ | ||
| "array([[3, 4, 6, 6, 7, 9],\n", | ||
| " [2, 3, 4, 6, 7, 7],\n", | ||
| " [1, 2, 4, 5, 7, 7],\n", | ||
| " [0, 1, 4, 5, 5, 9]])" | ||
| ] | ||
| }, | ||
| "execution_count": 11, | ||
| "metadata": {}, | ||
| "output_type": "execute_result" | ||
| } | ||
| ], | ||
| "source": [ | ||
| "np.sort(X, axis=0)\n", | ||
| "# sort each row of X\n", | ||
| "np.sort(X, axis=1)" | ||
| "np.sort(X, axis=1)\n", | ||
| "# source from following 2 cells got appended here, output as is" | ||
| ] | ||
| }, | ||
| { | ||
|
|
@@ -416,22 +351,6 @@ | |
| "Using the standard convention, we'll arrange these in a $10\\times 2$ array:" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": 14, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "X = rand.rand(10, 2)" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "To get an idea of how these points look, let's quickly scatter plot them:" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": 15, | ||
|
|
@@ -449,10 +368,12 @@ | |
| } | ||
| ], | ||
| "source": [ | ||
| "X = rand.rand(10, 2)\n", | ||
| "%matplotlib inline\n", | ||
| "import matplotlib.pyplot as plt\n", | ||
| "import seaborn; seaborn.set() # Plot styling\n", | ||
| "plt.scatter(X[:, 0], X[:, 1], s=100);" | ||
| "plt.scatter(X[:, 0], X[:, 1], s=100);\n", | ||
| "# line is copied down, should have image output (previous cell had array output, got removed)" | ||
| ] | ||
| }, | ||
| { | ||
|
|
@@ -573,67 +494,6 @@ | |
| "dist_sq.diagonal()" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "It checks out!\n", | ||
| "With the pairwise square-distances converted, we can now use ``np.argsort`` to sort along each row. The leftmost columns will then give the indices of the nearest neighbors:" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": 21, | ||
| "metadata": {}, | ||
| "outputs": [ | ||
| { | ||
| "name": "stdout", | ||
| "output_type": "stream", | ||
| "text": [ | ||
| "[[0 3 9 7 1 4 2 5 6 8]\n", | ||
| " [1 4 7 9 3 6 8 5 0 2]\n", | ||
| " [2 1 4 6 3 0 8 9 7 5]\n", | ||
| " [3 9 7 0 1 4 5 8 6 2]\n", | ||
| " [4 1 8 5 6 7 9 3 0 2]\n", | ||
| " [5 8 6 4 1 7 9 3 2 0]\n", | ||
| " [6 8 5 4 1 7 9 3 2 0]\n", | ||
| " [7 9 3 1 4 0 5 8 6 2]\n", | ||
| " [8 5 6 4 1 7 9 3 2 0]\n", | ||
| " [9 7 3 0 1 4 5 8 6 2]]\n" | ||
| ] | ||
| } | ||
| ], | ||
| "source": [ | ||
| "nearest = np.argsort(dist_sq, axis=1)\n", | ||
| "print(nearest)" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "Notice that the first column gives the numbers 0 through 9 in order: this is due to the fact that each point's closest neighbor is itself, as we would expect.\n", | ||
| "\n", | ||
| "By using a full sort here, we've actually done more work than we need to in this case. If we're simply interested in the nearest $k$ neighbors, all we need is to partition each row so that the smallest $k + 1$ squared distances come first, with larger distances filling the remaining positions of the array. We can do this with the ``np.argpartition`` function:" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": 22, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "K = 2\n", | ||
| "nearest_partition = np.argpartition(dist_sq, K + 1, axis=1)" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "In order to visualize this network of neighbors, let's quickly plot the points along with lines representing the connections from each point to its two nearest neighbors:" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": 2, | ||
|
|
@@ -652,6 +512,10 @@ | |
| } | ||
| ], | ||
| "source": [ | ||
| "nearest = np.argsort(dist_sq, axis=1)\n", | ||
| "print(nearest)\n", | ||
| "K = 2\n", | ||
| "nearest_partition = np.argpartition(dist_sq, K + 1, axis=1)\n", | ||
| "plt.scatter(X[:, 0], X[:, 1], s=100)\n", | ||
| "\n", | ||
| "# draw lines from each point to its two nearest neighbors\n", | ||
|
|
@@ -661,7 +525,8 @@ | |
| " for j in nearest_partition[i, :K+1]:\n", | ||
| " # plot a line from X[i] to X[j]\n", | ||
| " # use some zip magic to make it happen:\n", | ||
| " plt.plot(*zip(X[j], X[i]), color='black')" | ||
| " plt.plot(*zip(X[j], X[i]), color='black')\n", | ||
| "# source from earlier 2 cells got merged here. Earlier cell+outputs deleted, error output for this one." | ||
| ] | ||
| }, | ||
| { | ||
|
|
@@ -729,7 +594,7 @@ | |
| "name": "python", | ||
| "nbconvert_exporter": "python", | ||
| "pygments_lexer": "ipython3", | ||
| "version": "3.6.4" | ||
| "version": "3.6.8" | ||
| } | ||
| }, | ||
| "nbformat": 4, | ||
|
|
||
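For reference when reviewing the sorting notebook: the markdown cells removed in this diff explain that `np.sort` returns a sorted copy, `np.argsort` returns the indices that would sort the array, and `np.argpartition` finds the smallest `K + 1` entries per row without a full sort. The sketch below illustrates the relationship between those three calls. It uses a seed of 0 and the helper name `diff` as assumptions for this example; the notebook itself draws its points from `RandomState(42)` after an earlier `randint` call, so the actual values differ.

```python
import numpy as np

x = np.array([2, 1, 4, 3, 5])
sorted_x = np.sort(x)   # sorted copy; x itself is left unchanged
order = np.argsort(x)   # indices such that x[order] is sorted

# Ten random 2-D points (seed chosen for this sketch only)
rng = np.random.RandomState(0)
X = rng.rand(10, 2)

# Pairwise squared distances via broadcasting: result has shape (10, 10)
diff = X[:, np.newaxis, :] - X[np.newaxis, :, :]
dist_sq = np.sum(diff ** 2, axis=-1)

K = 2
# Full sort of each row: column 0 is the point itself (distance 0)
nearest = np.argsort(dist_sq, axis=1)
# Partial sort: the first K + 1 columns hold the K + 1 smallest distances,
# in no guaranteed order within that group
nearest_partition = np.argpartition(dist_sq, K + 1, axis=1)

print(sorted_x, order)
print(nearest[:, :K + 1])
print(nearest_partition[:, :K + 1])
```

The first `K + 1` columns of the two results contain the same indices per row, possibly in a different order, which is why the plotting loop in the notebook can iterate over `nearest_partition[i, :K+1]` directly.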
First comment on book cover