{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "IyuBmjJqKxBc", "slideshow": { "slide_type": "slide" } }, "source": [ "# Automatic Differentiation\n", "\n", "Parts of this notebook are inspired by [this tutorial](https://github.com/DataScienceUB/DeepLearningMaster2019/blob/master/2.%20Automatic%20Differentiation.ipynb)\n", "\n", "In many applications, such as machine learning and optimization, it is crucial to compute derivatives. The most basic idea is to approximate a partial derivative of a function (here with respect to $x_1$) by a forward finite difference\n", "$$\n", "\\frac{f(x_1+h,x_2,\\ldots,x_n)-f(x_1,x_2,\\ldots,x_n)}{h}\n", "$$\n", "or, with a smaller truncation error ($O(h^2)$ instead of $O(h)$), by a centered difference\n", "$$\n", "\\frac{f(x_1+h,x_2,\\ldots,x_n)-f(x_1-h,x_2,\\ldots,x_n)}{2h}\n", "$$\n", "\n" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "# Relative error of the centered difference for f = exp at x\n", "def diff_f(x,h):\n", " fvec = np.exp(x)\n", " y = np.abs((np.exp(x+h)-np.exp(x-h))/(2*h)-fvec)/np.abs(fvec)\n", " return y\n", "\n", "# Relative error of the forward difference for f = exp at x\n", "def diff2_f(x,h):\n", " fvec = np.exp(x)\n", " y = np.abs((np.exp(x+h)-np.exp(x))/(h)-fvec)/np.abs(fvec)\n", " return y" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# red: centered difference, blue: forward difference\n", "# solid: x=5, dashed: x=2, dash-dot: x=150\n", "x1 = np.logspace(-10, 1, 100, endpoint=True) # range of h parameter\n", "y = diff_f(5,x1)\n", "y2 = diff2_f(5,x1)\n", "plt.loglog(x1, y,color='red')\n", "plt.loglog(x1, y2,color='blue')\n", "y = diff_f(2,x1)\n", "y2 = diff2_f(2,x1)\n", "plt.loglog(x1, y,color='red',linestyle='--')\n", "plt.loglog(x1, y2,color='blue',linestyle='--')\n", "y = diff_f(150,x1)\n", "y2 = diff2_f(150,x1)\n", "plt.loglog(x1, y,color='red',linestyle='-.')\n", "plt.loglog(x1, y2,color='blue',linestyle='-.')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Finite differences can be slow (each partial derivative needs at least one extra function evaluation) and are prone to rounding errors when $h$ is small. The following figure illustrates the main idea behind backpropagation, which is based on automatic differentiation.\n", "![](https://miro.medium.com/max/771/1*DcLWqOojI1b9jzQaLibUkQ.png)\n", "\n", "> The **backpropagation** algorithm was originally introduced in the 1970s, but its importance wasn't fully appreciated until a famous 1986 paper by David Rumelhart, Geoffrey Hinton, and Ronald Williams." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "> **Backpropagation** is the key algorithm that makes training deep models computationally tractable. For modern neural networks, it can make training with gradient descent as much as ten million times faster, relative to a naive implementation. That’s the difference between a model taking a week to train and taking 200,000 years. (Christopher Olah, 2016)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We have seen that in order to optimize our models we need to compute the derivative of the loss function with respect to all model parameters. 
\n", "\n", "The computation of derivatives in computer models is addressed by four main methods: \n", "\n", "+ Manually working out derivatives and coding the result (as in the original paper describing backpropagation); \n", "\n", "![alt text](https://github.com/DataScienceUB/DeepLearningMaster2019/blob/master/images/back.png?raw=1)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "+ Numerical differentiation (using finite difference approximations); \n", "+ Symbolic differentiation (using expression manipulation in software, such as `SymPy`); \n", "+ Automatic differentiation (AD)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "**Automatic differentiation** (AD) works by systematically applying the **chain rule** of differential calculus at the elementary operator level.\n", "\n", "Let $ y = f(g(x)) $ be our target function. In its basic form, the chain rule states:\n", "\n", "$$ \\frac{\\partial y}{\\partial x} = \\frac{\\partial f}{\\partial g} \\frac{\\partial g}{\\partial x} $$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "or, if there is more than one intermediate variable $g_i$ between $y$ and $x$ (e.g., 
if $f$ is a function of two variables such as $f(g_1(x), g_2(x))$), then:\n", "\n", "$$ \\frac{\\partial f}{\\partial x} = \\sum_i \\frac{\\partial f}{\\partial g_i} \\frac{\\partial g_i}{\\partial x} $$\n", "\n", "> See http://tutorial.math.lamar.edu/Classes/CalcIII/ChainRule.aspx" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Now, let's see how AD allows the accurate evaluation of derivatives at machine precision, with only a small constant factor of overhead.\n", "\n", "In its most basic description, AD relies on the fact that all numerical computations\n", "are ultimately compositions of a finite set of elementary operations for which derivatives are known." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "HRNAJqxYKxBf", "slideshow": { "slide_type": "slide" } }, "source": [ "For example, let's consider the computation of the derivative of the sigmoid\n", "\n", "$$\n", " f(x) = \\frac{1}{1 + e^{- ({w}^T \\cdot x + b)}} \n", "$$\n", "\n", "\n", "First, let's write how to evaluate $f(x)$ via a sequence of primitive operations (here for scalar $w$ and $x$):\n", "\n", "\n", "```python\n", "x = ?\n", "f1 = w * x\n", "f2 = f1 + b\n", "f3 = -f2\n", "f4 = 2.718281828459 ** f3\n", "f5 = 1.0 + f4\n", "f = 1.0/f5\n", "```" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "HRNAJqxYKxBf", "slideshow": { "slide_type": "slide" } }, "source": [ "The question mark indicates that $x$ is a value that must be provided. \n", "\n", "This *program* computes the value of $f$ and, along the way, **populates the program variables**, i.e. assigns a value to each intermediate variable $f_1, \\ldots, f_5$. \n", "\n", "We can evaluate $\\frac{\\partial f}{\\partial x}$ at some $x$ by applying the chain rule to each line in order, propagating derivatives alongside the values. This is called **forward-mode differentiation**. 
" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "HRNAJqxYKxBf", "slideshow": { "slide_type": "slide" } }, "source": [ "In principle we follow\n", "$$\n", "\\frac{\\partial y}{\\partial x}=\\frac{\\partial y}{\\partial w_{n-1}}\\frac{\\partial w_{n-1}}{\\partial x}=\\frac{\\partial y}{\\partial w_{n-1}}\\left(\\frac{\\partial w_{n-1}}{\\partial w_{n-2}}\\frac{\\partial w_{n-2}}{\\partial x}\\right)=\\ldots\n", "$$\n", "\n", "In our case:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 69 }, "colab_type": "code", "id": "xuWlp5RzKxBg", "outputId": "7be084e4-7745-46b8-8ac4-0f3a2768a0b0", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "def f(x,w,b):\n", " f1 = w * x\n", " f2 = f1 + b\n", " f3 = -f2\n", " f4 = 2.718281828459 ** f3\n", " f5 = 1.0 + f4\n", " return 1.0/f5" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "def dfdx_forward(x, w, b):\n", " f1 = w * x\n", " p1 = w # p1 = df1/dx\n", " f2 = f1 + b\n", " p2 = 1.0*p1 # p2 = df2/df1*p1 \n", " f3 = -f2\n", " p3 = -1.0*p2 # p3 = df3/df2*p2\n", " f4 = 2.718281828459 ** f3\n", " p4 = 2.718281828459 ** f3 *p3 # p4 = df4/df3*p3\n", " f5 = 1.0 + f4\n", " p5 = 1.0*p4 # p5 = df5/df4*p4\n", " f6 = 1.0/f5\n", " dfx = -1.0 / f5 ** 2.0*p5 # df/dx = df6/df5*p5\n", " return f6, dfx" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 69 }, "colab_type": "code", "id": "xuWlp5RzKxBg", "outputId": "7be084e4-7745-46b8-8ac4-0f3a2768a0b0", "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Value of the function at (3, 2, 1): 0.9990889488055992\n", "df/dx Derivative (fin diff) at (3, 2, 1): 0.0018204406870836465\n", "df/dx Derivative (aut diff) at (3, 2, 1): 0.0018204423602438654\n", "The difference between both is given as: 
1.6731602188804762e-09\n" ] } ], "source": [ "h = 0.000001\n", "der = (f(3+h, 2, 1) - f(3, 2, 1))/h\n", "\n", "print(\"Value of the function at (3, 2, 1): \",f(3, 2, 1))\n", "print(\"df/dx Derivative (fin diff) at (3, 2, 1): \",der)\n", "print(\"df/dx Derivative (aut diff) at (3, 2, 1): \",dfdx_forward(3, 2, 1)[1])\n", "print(\"The difference between both is given as:\",dfdx_forward(3, 2, 1)[1]-der)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Next, we consider a function of two inputs, \n", "$$\n", "g(x_1,x_2)=\\sin(x_1)+x_1x_2\n", "$$" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import math\n", "\n", "def g(x1,x2):\n", " f1 = x1\n", " f2 = x2\n", " f3 = f1*f2\n", " f4 = math.sin(f1)\n", " f5 = f4+f3\n", " \n", " return f5" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# Forward sweep seeded to differentiate with respect to x1\n", "def dgdx1_forward(x1,x2):\n", " f1 = x1\n", " p1 = 1 # p1 = df1/dx1\n", " f2 = x2\n", " p2 = 0 # p2 = df2/dx1\n", " f3 = f1*f2\n", " p3 = p1*f2+p2*f1 # p3 = f1'f2+f1f2'\n", " f4 = math.sin(f1)\n", " p4 = math.cos(f1)*p1 # p4 = cos(f1)*f1'\n", " f5 = f4+f3\n", " p5 = p4+p3 # p5 = f4'+f3'\n", " return p5" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# Same sweep, seeded to differentiate with respect to x2\n", "def dgdx1_forward2(x1,x2):\n", " f1 = x1\n", " p1 = 0 # p1 = df1/dx2\n", " f2 = x2\n", " p2 = 1 # p2 = df2/dx2\n", " f3 = f1*f2\n", " p3 = p1*f2+p2*f1 # p3 = f1'f2+f1f2'\n", " f4 = math.sin(f1)\n", " p4 = math.cos(f1)*p1 # p4 = cos(f1)*f1'\n", " f5 = f4+f3\n", " p5 = p4+p3 # p5 = f4'+f3'\n", " return p5" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Value of the function at (3, 1): 3.1411200080598674\n", "dg/dx1 Derivative (fin diff) at (3, 1): 0.010007497053265979\n", "dg/dx1 
Derivative (aut diff) at (3, 1): 0.010007503399554585\n", "The difference between both is given as: 6.346288605740824e-09\n" ] } ], "source": [ "h = 0.0000001\n", "der = (g(3+h, 1) - g(3,1))/h\n", "print(\"Value of the function at (3, 1): \",g(3, 1))\n", "print(\"dg/dx1 Derivative (fin diff) at (3, 1): \",der)\n", "print(\"dg/dx1 Derivative (aut diff) at (3, 1): \",dgdx1_forward(3, 1))\n", "print(\"The difference between both is given as:\",dgdx1_forward(3, 1)-der) " ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Value of the function at (3, 1): 3.1411200080598674\n", "dg/dx2 Derivative (fin diff) at (3, 1): 2.9999999995311555\n", "dg/dx2 Derivative (aut diff) at (3, 1): 3.0\n", "The difference between both is given as: 4.688445187639445e-10\n" ] } ], "source": [ "h = 0.0000001\n", "der = (g(3, 1+h) - g(3,1))/h\n", "print(\"Value of the function at (3, 1): \",g(3, 1))\n", "print(\"dg/dx2 Derivative (fin diff) at (3, 1): \",der)\n", "print(\"dg/dx2 Derivative (aut diff) at (3, 1): \",dgdx1_forward2(3, 1))\n", "print(\"The difference between both is given as:\",dgdx1_forward2(3, 1)-der) " ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "2cdOSRU0KxBn", "slideshow": { "slide_type": "slide" } }, "source": [ "It is interesting to note that this *program* can be automatically derived if we have access to **subroutines implementing the derivatives of primitive functions** (such as $\\exp{(x)}$ or $1/x$) and all intermediate variables are computed in the right order. 
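One common way to derive such a program automatically is to use dual numbers: every value carries its derivative, and every primitive operation updates both. The following is a minimal sketch for the function $g$ above, not a full AD implementation. The names `Dual`, `dual_sin` and `g_dual` are hypothetical and introduced only for this illustration; they are not part of this notebook or of any library.

```python
import math

class Dual:
    # Hypothetical helper: a value plus its derivative w.r.t. one chosen input
    def __init__(self, val, dot=0.0):
        self.val = val
        self.dot = dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # sum rule: (u + v)' = u' + v'
        return Dual(self.val + other.val, self.dot + other.dot)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: (u * v)' = u' * v + u * v'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

    __rmul__ = __mul__

def dual_sin(u):
    # derivative subroutine for the primitive sin: (sin u)' = cos(u) * u'
    return Dual(math.sin(u.val), math.cos(u.val) * u.dot)

def g_dual(x1, x2):
    # same expression as g, evaluated on Dual objects
    return dual_sin(x1) + x1 * x2

# Seed x1 with derivative 1 and x2 with derivative 0 to obtain dg/dx1 at (3, 1)
out = g_dual(Dual(3.0, 1.0), Dual(1.0, 0.0))
print(out.val, out.dot)
```

Seeding `x2` with derivative 1 (and `x1` with 0) yields the derivative with respect to $x_2$ instead, so one sweep is needed per input.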
\n", "\n", "It is also interesting to note that AD allows the accurate evaluation of derivatives at **machine precision**, with only a small constant factor of overhead.\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "D-2EJLmpKxBt", "slideshow": { "slide_type": "slide" } }, "source": [ "The above can be viewed as differentiation of each equation with respect to some yet-to-be-given variable $x$ via the chain schematically given as\n", "$$\n", "\\frac{\\partial y}{\\partial x}=\\frac{\\partial y}{\\partial w_{n-1}}\\frac{\\partial w_{n-1}}{\\partial x}=\\frac{\\partial y}{\\partial w_{n-1}}\\left(\\frac{\\partial w_{n-1}}{\\partial w_{n-2}}\\frac{\\partial w_{n-2}}{\\partial x}\\right)=\\ldots\n", "$$\n", "where we can substitute $x=x_1$ or $x=x_2$ to obtain the desired derivative. This forward differentiation is efficient for functions $f : \\mathbb{R}^n \\rightarrow \\mathbb{R}^m$ with $n \\ll m$ (only $O(n)$ sweeps are necessary) as we need to call the function for $n$ different values of $x$. What if we want to avoid this?" 
] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "D-2EJLmpKxBt", "slideshow": { "slide_type": "slide" } }, "source": [ "Alternatively, we can consider a backward version of the differentiation by proceeding as follows\n", "\\begin{align}\n", "\\frac{\\partial y}{\\partial x}=\\frac{\\partial y}{\\partial w_1}\\frac{\\partial w_1}{\\partial x}=\\left(\\frac{\\partial y}{\\partial w_2}\\frac{\\partial w_2}{\\partial w_1}\\right)\\frac{\\partial w_1}{\\partial x}=\\ldots\n", "\\end{align}\n", "using the function $g$ from above, written as\n", "\n", "```python\n", " w1 = x1\n", " w2 = x2\n", " w3 = w1*w2\n", " w4 = math.sin(w1)\n", " w5 = w4+w3\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Let’s take a second look at the chain rule we used to derive forward-mode AD for a generic variable $t$\n", "\n", "$$\n", "\\frac{\\partial y}{\\partial t}\n", "=\n", "\\sum_i\n", "\\left(\n", "\\frac{\\partial y}{\\partial w_i}\n", "\\frac{\\partial w_i}{\\partial t}\n", "\\right)\n", "=\n", "\\frac{\\partial y}{\\partial w_1}\n", "\\frac{\\partial w_1}{\\partial t}\n", "+\n", "\\frac{\\partial y}{\\partial w_2}\n", "\\frac{\\partial w_2}{\\partial t}\n", "+\\ldots\n", "$$\n", "\n", "To calculate the gradient using forward-mode AD, we had to perform two substitutions: one with $t=x_1$\n", "and another with $t=x_2$.\n", "\n", "This means we have to **run the code twice.**" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "However, the chain rule is symmetric: it doesn’t care what’s in the “numerator” or the “denominator”. 
So let’s rewrite the chain rule but turn the derivatives upside down:\n", "\n", "$$\n", "\\frac{\\partial z}{\\partial w_j}\n", "=\n", "\\sum_i\n", "\\left(\n", "\\frac{\\partial w_i}{\\partial w_j}\n", "\\frac{\\partial z}{\\partial w_i}\n", "\\right)\n", "=\n", "\\frac{\\partial w_1}{\\partial w_j}\n", "\\frac{\\partial z}{\\partial w_1}\n", "+\n", "\\frac{\\partial w_2}{\\partial w_j}\n", "\\frac{\\partial z}{\\partial w_2}\n", "+\\ldots\n", "$$\n", "\n", "Here $z$ is the output variable, and we obtain its derivative with respect to an intermediate variable $w_j$; the sum runs over all variables $w_i$ that depend directly on $w_j$. In our example\n", "$$\n", "z=g(x_1,x_2)=x_1x_2+\\sin(x_1)\n", "$$\n", "and we assume there is only one output." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "D-2EJLmpKxBt", "slideshow": { "slide_type": "slide" } }, "source": [ "Reverse mode automatic differentiation proceeds by computing for $z=w_5$ that\n", "$$\n", "\\frac{\\partial z}{\\partial z}=\\frac{\\partial z}{\\partial w_5}=1.\n", "$$\n", "Now we know that $w_5=w_4+w_3$ and hence we get\n", "$$\n", "\\frac{\\partial w_5}{\\partial w_4}=1,\\quad\\frac{\\partial w_5}{\\partial w_3}=1.\n", "$$\n", "Now using the chain rule we get\n", "\\begin{align}\n", "\\frac{\\partial z}{\\partial w_3}&=\\frac{\\partial z}{\\partial w_5}\\frac{\\partial w_5}{\\partial w_3}=1\\\\\n", "\\frac{\\partial z}{\\partial w_4}&=\\frac{\\partial z}{\\partial w_5}\\frac{\\partial w_5}{\\partial w_4}=1.\\\\\n", "\\end{align}" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "D-2EJLmpKxBt", "slideshow": { "slide_type": "slide" } }, "source": [ "Now, we remember that $w_3=w_2w_1$ and get $\\frac{\\partial w_3}{\\partial w_2}=w_1$ and $\\frac{\\partial w_3}{\\partial w_1}=w_2$. 
This can be combined to obtain the following derivative\n", "$$\n", "\\frac{\\partial z}{\\partial w_2}=\\frac{\\partial z}{\\partial w_3}\\frac{\\partial w_3}{\\partial w_2}=1\\cdot w_1=x_1.\n", "$$\n", "Also, $w_1$ contributes to $z$ through both $w_3$ and $w_4$, and we get\n", "\\begin{align}\n", "\\frac{\\partial z}{\\partial w_1}&=\\frac{\\partial z}{\\partial w_3}\\frac{\\partial w_3}{\\partial w_1}+\\frac{\\partial z}{\\partial w_4}\\frac{\\partial w_4}{\\partial w_1}\\\\\n", "&=w_2+\\mathrm{cos}(w_1)\\\\\n", "&=x_2+\\mathrm{cos}(x_1)\\\\\n", "\\end{align}" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "D-2EJLmpKxBt", "slideshow": { "slide_type": "slide" } }, "source": [ "This is called *reverse-mode differentiation* or *backpropagation*. The reverse pass starts at the end (i.e. $\\frac{\\partial z}{\\partial z} = 1$) and propagates backward to all dependencies." ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "colab_type": "code", "id": "af4RMYaIKxBu", "outputId": "bd858897-e072-4c76-d0b4-55aa98e63715", "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "def dgdx_backward(x1, x2):\n", " # forward pass: populate the intermediate variables\n", " w1 = x1\n", " w2 = x2\n", " w3 = w1*w2\n", " w4 = math.sin(w1)\n", " w5 = w4+w3\n", " z = w5\n", " \n", " # reverse pass: propagate dz/dw_i from the output to the inputs\n", " dz = 1 # dz/dz\n", " d5d4 = dz*1.0 # dz/dw4 = dz/dw5 * dw5/dw4\n", " d5d3 = dz*1.0 # dz/dw3 = dz/dw5 * dw5/dw3\n", " d3d2 = w1 # dw3/dw2\n", " d3d1 = w2 # dw3/dw1\n", " d4d1 = math.cos(w1) # dw4/dw1\n", " dzd1 = d5d3*d3d1+d5d4*d4d1 # dz/dx1 = dz/dw3 * dw3/dw1 + dz/dw4 * dw4/dw1\n", " dzd2 = d5d3*d3d2 # dz/dx2 = dz/dw3 * dw3/dw2\n", " return dzd1, dzd2" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "colab_type": "code", "id": "af4RMYaIKxBu", "outputId": "bd858897-e072-4c76-d0b4-55aa98e63715", "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", 
"text": [ "Finite difference approximation: 2.540302261877514 0.999999993922529\n", "dg/dx Derivative (aut diff) at (1, 2): (2.5403023058681398, 1.0)\n" ] } ], "source": [ "h = 0.0000001\n", "der1 = (g(1+h, 2) - g(1,2))/h\n", "der2 = (g(1, 2+h) - g(1,2))/h\n", "\n", "print(\"Finite difference approximation:\",der1,der2)\n", "print(\"dg/dx Derivative (aut diff) at (1, 2): \",\n", " dgdx_backward(1, 2))" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "f0I-yLHWKxB6", "slideshow": { "slide_type": "slide" } }, "source": [ "In practice, reverse-mode differentiation is a two-stage process. In the first stage the original function code is run forward, populating the $w_i$ variables. In the second stage, derivatives are calculated by propagating in reverse, from the outputs to the inputs.\n", "\n", "The most important property of reverse-mode differentiation is that it is **cheaper than forward-mode differentiation for functions with a high number of input variables**. For a function of the form $f : \\mathbb{R}^n \\rightarrow \\mathbb{R}$, a single application of the reverse mode is sufficient to compute the full gradient of the function $\\nabla f = \\big( \\frac{\\partial f}{\\partial x_1}, \\dots ,\\frac{\\partial f}{\\partial x_n} \\big)$. This is the situation in deep learning, where the loss is a scalar and the number of parameters is very large. " ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "f0I-yLHWKxB6", "slideshow": { "slide_type": "slide" } }, "source": [ "> As we have seen, AD relies on the fact that all numerical computations\n", "are ultimately compositions of a finite set of elementary operations for which derivatives are known. 
\n" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "0jadWz_IKxB7", "slideshow": { "slide_type": "slide" } }, "source": [ "## Autograd\n", "\n", "Autograd is a Python module (with only one function) that implements automatic differentiation.\n", "\n", "Autograd can automatically differentiate Python and Numpy code:\n", "\n", "+ It can handle most of Python’s features, including loops, if statements, recursion and closures.\n", "+ Autograd allows you to compute gradients of many types of data structures (Any nested combination of lists, tuples, arrays, or dicts).\n", "+ It can also compute higher-order derivatives.\n", "+ Uses reverse-mode differentiation (backpropagation) so it can efficiently take gradients of scalar-valued functions with respect to array-valued or vector-valued arguments.\n", "+ You can easily implement your custom gradients (good for speed, numerical stability, non-compliant code, etc)." ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 52 }, "colab_type": "code", "id": "Eao5AQjiKxB8", "outputId": "8a36acaa-f380-45f6-b586-4490c4dbf8db", "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(5.50,2.28)\n", "(2.20,5.42)\n" ] } ], "source": [ "import autograd.numpy as np\n", "from autograd import grad\n", "\n", "x1 = np.array([2, 5], dtype=float)\n", "x2 = np.array([5, 2], dtype=float)\n", "\n", "def test(x):\n", " if x[0]>3:\n", " return np.log(x[0]) + x[0]*x[1] - np.sin(x[1])\n", " else:\n", " return np.log(x[0]) + x[0]*x[1] + np.sin(x[1])\n", " \n", "grad_test = grad(test)\n", "\n", "\n", "print(\"({:.2f},{:.2f})\".format(grad_test(x1)[0],grad_test(x1)[1]))\n", "print(\"({:.2f},{:.2f})\".format(grad_test(x2)[0],grad_test(x2)[1]))" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "zhLa7xHMKxCA", "slideshow": { "slide_type": "slide" } }, "source": [ "The ``grad`` 
function:\n", "\n", "```\n", "grad(fun, argnum=0, *nary_op_args, **nary_op_kwargs)\n", "\n", "Returns a function which computes the gradient of `fun` with respect to positional argument number `argnum`. The returned function takes the same arguments as `fun`, but returns the gradient instead. The function `fun` should be scalar-valued. The gradient has the same type as the argument.\n", "```" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "QvIOvBsiKxCC", "slideshow": { "slide_type": "slide" } }, "source": [ "With this, a simple logistic regression model (without a bias term) for $n$-dimensional data,\n", "\n", "$$ f(x) = \\frac{1}{1 + e^{-\\mathbf{w}^T \\mathbf{x}}} $$\n", "\n", "can be implemented and trained as follows:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import autograd.numpy as np\n", "from autograd import grad\n", "\n", "def sigmoid(x):\n", " return 1 / (1 + np.exp(-x))\n", "\n", "def logistic_predictions(weights, inputs):\n", " return sigmoid(np.dot(inputs, weights))\n", "\n", "def training_loss(weights, inputs, targets):\n", " preds = logistic_predictions(weights, inputs)\n", " # probability that the model assigns to the true label of each sample\n", " loss = preds * targets + (1 - preds) * (1 - targets)\n", " # negative log-likelihood\n", " return -np.sum(np.log(loss))\n", "\n", "def optimize(inputs, targets, training_loss):\n", " # Optimize weights using gradient descent.\n", " gradient_loss = grad(training_loss)\n", " weights = np.zeros(inputs.shape[1])\n", " print(\"Initial loss:\", training_loss(weights, inputs, targets))\n", " for i in range(100):\n", " weights -= gradient_loss(weights, inputs, targets) * 0.01\n", " print(\"Final loss:\", training_loss(weights, inputs, targets))\n", " return weights" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Initial loss: 2.772588722239781\n", "Final loss: 
1.0672706757870165\n", "Weights: [ 0.48307366 -0.37057217 1.06937395]\n" ] } ], "source": [ "# Build a toy dataset with 3d data\n", "inputs = np.array([[0.52, 1.12, 0.77],\n", " [0.88, -1.08, 0.15],\n", " [0.52, 0.06, -1.30],\n", " [0.74, -2.49, 1.39]])\n", "targets = np.array([True, True, False, True])\n", "\n", "weights = optimize(inputs, targets, training_loss)\n", "print(\"Weights:\", weights)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "accelerator": "GPU", "celltoolbar": "Slideshow", "colab": { "include_colab_link": true, "name": "2. Automatic Differentiation.ipynb", "provenance": [], "version": "0.3.2" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" } }, "nbformat": 4, "nbformat_minor": 2 }