## Transcripts

1. ANN L4 Redo: Hey, guys, and welcome back to Data Science: Deep Learning in Python, Part 1. In this lecture, we're going to cover the problem: what happens if your output has more than two categories? So previously we only discussed binary classification, for which there are many real-world applications. For example, we could take as inputs the humidity, whether or not the ground is wet, the month, and the geographical coordinates, and try to predict whether or not it will rain. Another example is we could take inputs like how frequently someone exercises, their age, their BMI, and what types of food they eat, and try to predict whether or not they will get a certain disease. So these are typically yes-or-no questions. But in many cases, there are more than just two classes to choose from. Suppose you're Facebook, and you'd like to tag all of the photos that get uploaded to your database. So far, Facebook is able to tag faces, but we could tag other things as well: cars, wedding dresses, what environment the photo is in, like a bar or outside in nature. This would be very valuable data for Facebook because they use the things on your profile to determine what advertisements to show you. There is also the famous MNIST dataset, where you have to try and categorize each image as a digit between zero and nine. So let's talk about how we can extend the logistic unit to handle more than two classes. Recall that when we have two classes, we only need one output. This is because it can represent the probability of one class, and then the other class can be represented by one minus that probability. But we can also do this another way. We could have two output nodes and we could just normalize them so that they sum to one. In fact, this is exactly what we'll do. We'll also exponentiate them first to make sure that they're both positive. So this is how you would calculate the output of class one.
And this is how you would calculate the output of class two. Notice that the weights are now no longer a vector, as they were in logistic regression. Since every input node has to be connected to every output node, and you have D input nodes and two output nodes, the total number of weights is two times D, and the weights are stored in a matrix of size D by two. Notice that this is very easy to extend to K classes. We have a new weight matrix of dimension D by K, and we calculate the outputs by first exponentiating every output and then normalizing it. The output, which I've called a, is usually called the activation.
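The K-class calculation described above can be sketched in NumPy roughly as follows. This is a minimal illustration, not the code from the lecture: the sizes `N`, `D`, `K` and the random values are assumptions, and subtracting the row maximum before exponentiating is an added numerical-stability step that the lecture does not mention (it leaves the result unchanged).

```python
import numpy as np

def softmax(a):
    """Row-wise softmax: exponentiate, then normalize so each row sums to one."""
    # Assumption (not in the lecture): subtract the row max to avoid overflow.
    # This cancels out in the ratio, so the probabilities are unchanged.
    expa = np.exp(a - a.max(axis=1, keepdims=True))
    return expa / expa.sum(axis=1, keepdims=True)

# Hypothetical sizes and random values, chosen only for illustration.
rng = np.random.default_rng(0)
N, D, K = 4, 3, 5
X = rng.standard_normal((N, D))  # N samples, D input features
W = rng.standard_normal((D, K))  # weight matrix of size D by K

A = X.dot(W)     # activations, shape (N, K)
P = softmax(A)   # class probabilities, shape (N, K)
print(P.sum(axis=1))  # each row sums to one, up to floating point error
```

Exponentiating guarantees every entry is positive, and dividing by the row sum guarantees each row is a valid probability distribution over the K classes.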
2. ANN SigmoidvsSoftmax: Hey, guys, welcome back to this class, Deep Learning in Python, Part 1. In this lecture we're going to discuss when the sigmoid and softmax are equivalent. We're going to work through a little bit of the math to show you the equivalence of the sigmoid and softmax. First, let's write out the equation for softmax. We know that the probability that y equals one given x is the exponential of w one transpose x, divided by the exponential of w one transpose x plus the exponential of w zero transpose x. The probability of y equals zero given x is just one minus that. So let's divide the first equation by the exponential of w one transpose x on the top and the bottom. Try this yourself and see what you get. I want you guys to pause this video and see if you can find the answer yourself before I go to the next slide. What you get is one over one plus the exponential of w zero minus w one, transpose x. Now this is in the same form as the sigmoid, with the weight vector w equal to w one minus w zero. So what does this mean? It means that having two sets of weights is actually redundant when there are only two classes. From a software design perspective, it's safer to just always use softmax, since it can handle not only the sigmoid case where K equals two, but any value of K.
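A quick numerical check of this equivalence might look like the sketch below. The vectors `x`, `w0`, and `w1` are random, hypothetical values used only to test the identity; they are not taken from the lecture.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical input and weight vectors, random for illustration only.
rng = np.random.default_rng(1)
D = 3
x = rng.standard_normal(D)
w0 = rng.standard_normal(D)
w1 = rng.standard_normal(D)

# Two-class softmax: p(y=1 | x) = exp(w1.x) / (exp(w1.x) + exp(w0.x))
p_softmax = np.exp(w1.dot(x)) / (np.exp(w1.dot(x)) + np.exp(w0.dot(x)))

# Sigmoid with the single weight vector w = w1 - w0
p_sigmoid = sigmoid((w1 - w0).dot(x))

print(np.isclose(p_softmax, p_sigmoid))  # True
```

Only the difference `w1 - w0` matters for the two-class output, which is why the second weight vector is redundant.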
3. FeedforwardRemedial1: Hey everyone, and welcome back to this class, Data Science: Deep Learning in Python, Part 1. In the previous lectures, we were working under the assumption that you're familiar with the concept of making predictions and getting probabilities and multiplying vectors and matrices. So it was a little fast-paced. In order for us to build on what we already know, in this lecture I'm going to be a little more lenient and walk you through some numerical examples to solidify these ideas. Don't worry if you don't understand everything in this lecture right away. It might help to see some code first and then to return to this lecture later, once you have some context for what we're trying to do. If you are already comfortable with the feedforward operation and doing calculations in matrix form, you can skip this lecture. Otherwise, keep watching. So let's set up the problem. Our training data, as you know, is a combination of inputs and targets. We call the inputs X and the targets Y. In this specific case, we're going to have N equals three and D equals two. That means in our experiment we collected three samples of data and there are two input features. For this example, first we'll consider binary classification and then we'll consider multi-class classification using softmax. Our data is as follows. The two input features are "has technical degree" and "hours of time spent per day studying deep learning". The binary output feature is whether or not the subject will succeed in deep learning. We can see that one of the input features, "has technical degree", is binary, so it's either going to be one if true or zero if false. We can see that "hours of time spent per day studying deep learning" has to be a positive real number. For example, you can spend 1.5 hours per day studying deep learning, but you can't spend minus one hours per day studying deep learning.
So in this case, subject number one does not have a technical degree but spends 3.5 hours per day studying deep learning. Subject number one succeeds. Subject number two does have a technical degree and spends two hours per day studying deep learning. Subject number two succeeds. Subject number three also has a technical degree but only spends 0.5 hours per day studying deep learning. Subject number three does not succeed. And so this is our dataset. One important thing to keep in mind: when we're doing prediction, we don't make use of the targets. The targets are only used during training. All we're doing during prediction is taking the input and calculating some output using the neural network. We hope that the predictions are close to the targets, and in the next section we'll look at training. And the purpose of training is to make the predictions close to the targets. The structure of our neural network will be as follows. As you know, the input layer size is fixed at two because we have two input features. The output layer size is fixed at one because we have one binary output prediction. The output is going to predict the probability that the subject succeeds at deep learning. The hidden layer is of size three. You might be wondering, how can we choose the right hidden layer size? This is in general an advanced and non-trivial concept, so you'll need to wait till later in the course for discussion on this topic. Let's for now just assume that a hidden layer of size three is fine. Let's now look at the weights in the neural network. Since every input node has to be connected to every hidden layer node, there should be two times three, or six, weights from the input to the hidden layer. We can write them out all individually or as a matrix of size two by three. One neat thing about the weight indices is that they tell you the nodes that they connect. So weight i j denotes that on the input side it's coming from node i, and on the output side it's going to node j.
So weight i j connects node i from the previous layer to node j in the next layer. We can also have bias terms, and the bias terms are applied after the weight multiplication. Therefore, we need three bias terms. Let's call these b. Again, b can be represented as three separate scalars or as a vector of size three. One thing to remember about these weights: I just chose them arbitrarily. They're completely random. Only in the next section, when we talk about training, will we be able to discuss how to appropriately set the weights to best solve our problem. Next, let's look at the hidden-to-output weights. Since every hidden node must be connected to every output node, and there are three hidden nodes and one output node, we only need three weights in the hidden-to-output layer. Let's call this v. Again, we can represent each element separately or as a vector. And remember that we can also specify a bias term. Since the bias term applies after we multiply by v, the number of bias terms we need is only one, since there is only one output node. In other words, it's a scalar. Let's call it c. Okay, so now that we have all the numbers we need, let's do some calculations. Let's take the first sample, so x equals zero and 3.5, and calculate the prediction, p of y equals one given x. Initially, we're going to calculate this using summations. We're also going to use the tanh activation function. You might be wondering at this point, how do we choose which activation function to use? Like the number of hidden units, this is also an advanced concept, so we'll be discussing it later in the course. Notice how each of the terms in the summation corresponds to an edge in the neural network. Also notice how the two indices of w correspond to which input node and which output node it connects. So w one one connects x one to z one, w two one connects x two to z one, and we can do the same process to find z two and z three. And here are the answers with actual numbers.
So at this point, we have all of the values for the hidden layer: 0.993, minus 0.74, and 0.604. What we would like to do next is calculate the output, p of y equals one given x. To do this, we need to multiply by v, add c, and take the sigmoid. Similar to the input-to-hidden weights, each of the v's also connects one node in the hidden layer to the output node. So v one connects z one to the output node, v two connects z two to the output node, and v three connects z three to the output node. And so our prediction is that subject number one has a 70% chance of succeeding at deep learning. So that was a lot of work just to calculate one sample. As you may remember from my NumPy course and other prerequisites to this course, vectorized operations in NumPy are preferable. So we'd like to do vector operations rather than individual scalar multiplications, if possible. Notice how this is just equal to W transpose x plus b, using the vector and matrix forms. To prove to yourself that this works, let's plug in the numbers for W, x, and b. We see that we arrive at the expected answer. Now, of course, we can do the same thing for the hidden-to-output layer. So let's do that. If you plug in the numbers, you see that we arrive at the expected answer, 0.70. Now that we're on the theme of making things more efficient by vectorizing each operation, the next obvious question is, can we do this for multiple samples at the same time? Recall that we only looked at the first sample, but we have three. So, well, we could calculate y one, y two, and y three separately. This is clearly not what we want to do. Instead, we can combine these operations so that we can calculate y one, y two, and y three all at the same time. Remember that the sigmoid and tanh functions apply element-wise. So the thing that you pass into tanh or sigmoid can be of any size. The weird thing about this is that W and X, and V, seem to have switched sides. So why do we have X W instead of W transpose x?
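The node-by-node summation and the vector form for a single sample can be compared with a sketch like the one below. The actual weights were shown on the lecture slides and are not in this transcript, so `W`, `b`, `v`, and `c` here are hypothetical stand-ins; the outputs will therefore not match the 0.993, -0.74, 0.604 and 0.70 quoted in the lecture. Only the input `x` comes from the lecture.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

x = np.array([0.0, 3.5])  # first sample from the lecture

# Hypothetical weights (NOT the slide values), for illustration only.
W = np.array([[0.5, 0.1, -0.3],
              [0.7, -0.3, 0.2]])  # input-to-hidden, shape (D, M) = (2, 3)
b = np.array([0.4, 0.1, 0.0])     # hidden bias, size M = 3
v = np.array([0.8, 0.1, -0.1])    # hidden-to-output weights, size M = 3
c = 0.2                           # output bias, a scalar

# Node by node: z_j = tanh(sum_i W[i, j] * x[i] + b[j])
D, M = W.shape
z_loop = np.zeros(M)
for j in range(M):
    total = b[j]
    for i in range(D):
        total += W[i, j] * x[i]  # one term per edge in the network
    z_loop[j] = np.tanh(total)

# Vector form: z = tanh(W^T x + b), then p = sigmoid(v^T z + c)
z = np.tanh(W.T.dot(x) + b)
p = sigmoid(v.dot(z) + c)

print(np.allclose(z, z_loop))  # the two forms agree
print(p)                       # p(y = 1 | x) under these made-up weights
```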
The key to understanding why the weights get switched around is to remember how each sample is stored. When we're talking about a data matrix with multiple samples and multiple input features, it's an N by D matrix, so each row is one sample. But when we talk about that sample as an individual vector, it's a column, because vectors in linear algebra are by convention column vectors. So you have to remember that in the full data matrix, each sample goes horizontally. When a sample is by itself, it goes vertically. To further convince you why this works, let's consider a neural network of arbitrary size. The number of inputs is D, and the number of hidden units is M. Because every input must be connected to every hidden unit, W must be of size D by M. If we have N samples, then X must be of size N by D. And remember the golden rule of matrix multiplication: the inner dimensions must match. So the only way for this to be possible is if we multiply X times W. Now the inner dimensions are both D, which makes this valid. The output is of size N by M, which makes sense because we should have an M-sized vector for every sample at the hidden layer. So hopefully everything so far has made sense to you. What we did was build things up from basic scalar multiplication, where we considered the value at each node one at a time, to vector operations, where we calculated an entire layer of nodes at the same time, to matrix operations, where we calculated an entire layer of nodes for every sample at the same time. As promised, we're going to move from sigmoid to softmax. When we go from sigmoid to softmax, it's helpful to look at what the targets will look like as well. Since when we use softmax we have multiple output nodes, the output becomes a vector. Specifically, when we have K classes, we have K output nodes, and hence the output vector is of size K.
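The shape argument above can be checked with a short sketch. The data matrix `X` follows the three subjects described in the lecture, taking the third subject's study time as 0.5 hours; `W` and `b` are hypothetical values, since the slide weights are not in the transcript.

```python
import numpy as np

# Lecture dataset: each ROW is one sample (has technical degree, hours per day).
# The third sample's hours value is taken to be 0.5.
X = np.array([[0.0, 3.5],
              [1.0, 2.0],
              [1.0, 0.5]])  # shape (N, D) = (3, 2)

# Hypothetical weights (NOT the slide values), for illustration only.
W = np.array([[0.5, 0.1, -0.3],
              [0.7, -0.3, 0.2]])  # shape (D, M) = (2, 3)
b = np.array([0.4, 0.1, 0.0])     # size M = 3

# Batch form: inner dimensions (D and D) match, result is N by M.
Z = np.tanh(X.dot(W) + b)
print(Z.shape)  # (3, 3)

# Each row of Z equals the single-sample form tanh(W^T x + b),
# where the sample x is treated as an individual vector.
for n in range(X.shape[0]):
    z_n = np.tanh(W.T.dot(X[n]) + b)
    print(np.allclose(Z[n], z_n))
```

Trying `W.dot(X)` instead would raise a shape error here, because the inner dimensions (3 and 3 would be needed, but X has 3 rows only by coincidence of N equal to M) do not correspond to the feature dimension D; `X.dot(W)` is the form whose inner dimension is D.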
This also means that when we're considering multiple inputs at the same time, where the input is a matrix of size N by D, then the output is also a matrix, but of size N by K. This is interesting because it suggests that if we use softmax, our outputs should be two-dimensional. But as you recall from our original dataset, the outputs are one-dimensional. So how can we reconcile this difference? The answer is that the targets in a neural network with softmax output are more conveniently expressed as an indicator matrix. This is also equivalent to one-hot encoding, which you first saw in my linear regression course. Before we look at one-hot encoding, we should look at label representations at a high level. Labels have real-life meaning. For example, your dataset might consist of images of BMWs and Jaguars, and your classification task might be to differentiate between BMWs and Jaguars based on images of the cars. But as you know, when we're doing binary classification, we can encode these as zero and one. By the way, it should be clear that it doesn't matter which one is one and which one is zero. BMW can be zero and Jaguar can be one, or vice versa. It doesn't matter. This format is convenient because the output of a sigmoid is always between zero and one. And so our prediction becomes whatever the output is closest to. For example, 0.75 means we should predict one; 0.49 means we should predict zero. But what happens when we have more than two labels? When we have more than two labels, the convention is to simply continue counting up from zero and one. So if we have three labels, the targets would be 0, 1, and 2. For example, if I have three labels, BMW, Jaguar, and Volkswagen, then in my code I would refer to them as 0, 1, and 2. If I have four labels, the targets would be 0, 1, 2, and 3. The reason why we do this is different than the situation we have with binary classification. We're no longer rounding, so I'll never have an output like 2.7 and then round that to three.
That is not how softmax works. Remember, softmax outputs probabilities, so they'll always be between zero and one. The reason is that we'll be using these targets to index arrays. How exactly we're going to do that will be explained later in this lecture. For now, it's sufficient to remember that if we have an array of size K, call it a, then the first element can be accessed at a of zero, the second element can be accessed at a of one, and so on, up to a of K minus one. Let's now get back to one-hot encoding. If you don't remember what one-hot encoding looks like, let's recap with this example. Let's say I have a list of target categories: 0, 5, 1, 3, 1, 4, 2, 0. It should be clear that there are six distinct categories here, since the numbers from 0 to 5 are present, and eight samples, since the size of this array is eight. Remember that we name the categories zero to K minus one by convention only, so zero could represent BMW, one could represent Jaguar, two could represent Volkswagen, and so on. But clearly we can't use BMW, Jaguar, and Volkswagen to index a vector or matrix, because they're strings. Therefore we encode them using the numbers zero to K minus one. So if we have six different categories, then we'll have six different outputs in the neural network, represented by the six columns you see here. If you have ever used the scikit-learn library, you may remember that it used to be the case that you had to encode your targets as integers from zero to K minus one. These days, the library also supports strings as targets, but internally it's still converting those strings to integers from zero to K minus one. It's important to keep in mind that while we encode the targets using the numbers zero to K minus one, that doesn't mean these numbers have any meaning relative to each other. Let's say zero means BMW, one means Jaguar, and three means Volkswagen.
We know that one is closer to zero than three is to zero. But does that indicate to us that Jaguar is more like BMW than Volkswagen is like BMW? Well, of course not, because we could have encoded these some other way as well. For example, we could have made one Volkswagen and three Jaguar. It's only a number that's representing the true category, but the numbers have no meaning relative to themselves. They are just distinct symbols. Okay, so back to one-hot encoding. We have these eight samples and six classes. What does this mean? It means sample number one is labeled as category zero, sample number two is labeled as category five, sample number three is labeled as category one, and so on. We would like to one-hot encode these, and that means we need to put them in a matrix of size eight by six. What the original target tells us is where the one goes in the indicator matrix. So since the first row is category zero, that means the zeroth element in the first row of the indicator matrix should be a one. Since the second row is category five, that means the element at index five in the second row of the indicator matrix should be a one. Remember, an indicator matrix only has ones and zeros. Another way of writing this compactly is: indicator of n comma k equals one if y of n equals k, otherwise zero. To help you solidify this idea, let's write some pseudocode to convert a 1-D array of targets numbered zero to K minus one into a one-hot encoded target indicator matrix. We'll be seeing this later in the code, so it's good to review now. As input, we take a 1-D array, y_in. It contains labels from zero to K minus one. We can then retrieve N, the number of samples, which is the length of y_in. We can also find K, since the max of y_in is K minus one. We then initialize a matrix of zeros of size N by K. Next, we loop through each target in y_in, and we set the corresponding entry of y_out to one.
It should be clear now why the values in y_in must be from zero to K minus one. It's because they're indexes into the y_out indicator matrix in the second dimension, which is of size K. One question you might have is, why does turning the targets into an indicator matrix make sense? Well, remember that the output of a neural network is a list of probabilities: the probability that the input belongs to each class. Remember that probabilities are for telling us about things that are unknown. Well, what about things that are known? Well, in that case, the probability should be one or zero, because there is no uncertainty. For example, what's the probability that you are watching the video for a deep learning class? The probability is one, because it's something that has already happened. So the probability that you are watching a video for a deep learning class is one. The probability that you are not watching the video for a deep learning class is zero. And so for the target indicator matrix, which is actually a matrix of probabilities, the probability that the label is the true label, given the data, is one, and the probability that the label is any other label is zero. This is because they're targets; they're already known.
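The pseudocode described in this lecture can be sketched in Python as below. The function name `one_hot` is a hypothetical choice, and inferring K from the largest label assumes that the highest category actually appears in the data; the targets are the eight values from the example.

```python
import numpy as np

def one_hot(y_in):
    """Convert a 1-D array of labels 0..K-1 into an N by K indicator matrix."""
    y_in = np.asarray(y_in)
    N = len(y_in)           # number of samples
    K = y_in.max() + 1      # assumes the largest label present is K - 1
    y_out = np.zeros((N, K))
    for n in range(N):
        # The label is used as a column index, which is why it must be 0..K-1.
        y_out[n, y_in[n]] = 1
    return y_out

targets = [0, 5, 1, 3, 1, 4, 2, 0]  # eight samples, six categories
T = one_hot(targets)
print(T.shape)  # (8, 6)
print(T)
```

Each row of `T` contains exactly one 1, located at the column given by that sample's label.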
4. FeedforwardRemedial2: The output probabilities from a neural network, of course, will be numbers between zero and one, since they're predictions, not certainties. But the goal is that after training, the probability of the true target is higher than the probabilities for any other label. So, for example, the target indicator might be 0, 0, 1, 0, 0, which means the target label is category two, and after training we'd want an output prediction something like 0.1, 0.1, 0.5, 0.2, 0.1. So the maximum probability corresponds to the target label. That's what our goal will be during training. One way to solidify this idea further is to write out an output matrix and mark down what each of the probabilities means. As usual, each row represents what sample we're looking at. Each column tells us which category we're getting the probability for. This is why the output of a neural network, when it's processing multiple samples at the same time, will be an N by K matrix. So, for example, this entry represents the probability that y one belongs to category zero, given x one. This entry represents the probability that y two belongs to category one, given x two. Let's do a simple example to determine whether or not a set of predictions matches the targets. On the left, we have a matrix of probabilities, which represent predictions. Notice how every row adds up to one. On the right, we have the targets. Technically speaking, every row adds up to one here too, because they're also probabilities. In the first row, 0.7 is the maximum number, and it corresponds to the location of the one in the target table. So this prediction is correct. In the second row, 0.4 is the max and it's in the left column. But that corresponds to a zero in the target table, so this prediction is incorrect. In the third row, 0.6 is the max and it's in the right column, and this corresponds to the one in the target table. So this prediction is correct.
In the fourth row, 0.5 is the max and it corresponds to a zero in the target table. So this prediction is incorrect. In total, we got two out of four correct. We call this the classification rate or classification accuracy, and that is equal in this case to 50%. Note that if we were to do this operation in code, it would actually involve the argmax rather than the max. Remember that the max tells us the biggest value in an array, but the argmax tells us the location of the biggest value. In NumPy, this could be done as follows. Axis equals one means we take the argmax over the columns rather than the entire matrix. One important thing to note is that the argmax is the inverse of turning an array of integers with the values zero to K minus one into an indicator matrix. So one way to verify that your one-hot encoding is correct is that you can one-hot encode your labels, then take the argmax, and the result should be the same as the original. Now that you've seen how to turn a list of targets into an indicator matrix, let's do this with our original dataset. Recall that we have three samples and two classes. That means our indicator matrix should be of size three by two, and the elements are shown here. Since sample one has the label one, there should be a one in the target indicator matrix at index one. Since sample two has the label one, there should be a one in the target indicator matrix at index one. Since sample three has the label zero, there should be a one in the target indicator matrix at index zero. Okay, so since this is our target indicator matrix, remember that after we do training, which we'll cover in the next section, our output probabilities should be something close to this target. So what we would like to get is something that looks like this.
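The argmax comparison can be sketched as follows. The full matrices were only shown on the slide, so the values below are assumptions: only the row maxima (0.7, 0.4, 0.6, 0.5), their described positions, and the correct/incorrect pattern follow the lecture, and the three-column layout is a guess that is consistent with it. The second part uses the labels 1, 1, 0 from the lecture's three-sample dataset.

```python
import numpy as np

# Hypothetical prediction matrix; each row sums to one.
P = np.array([[0.7, 0.2, 0.1],
              [0.4, 0.3, 0.3],
              [0.1, 0.3, 0.6],
              [0.2, 0.3, 0.5]])

# Hypothetical target indicator matrix chosen to give the described outcome.
T = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1],
              [0, 1, 0]])

pred = np.argmax(P, axis=1)    # location of the biggest value in each row
labels = np.argmax(T, axis=1)  # argmax undoes the one-hot encoding
accuracy = np.mean(pred == labels)
print(pred, labels, accuracy)  # [0 0 2 2] [0 1 2 1] 0.5

# Original dataset: labels 1, 1, 0 for the three subjects.
y = np.array([1, 1, 0])
T_small = np.eye(2)[y]  # 3 by 2 indicator matrix, built by row indexing
print(T_small)
print(np.array_equal(np.argmax(T_small, axis=1), y))  # True
```

Indexing the identity matrix with the labels is one compact way to build the indicator matrix; it gives the same result as looping over the samples.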
One thing that we have to be mindful of is that the numbering here is a little strange, because for some things we're counting with a zero-based count. For example, the classes are numbered 0, 1, 2, and so on. But then some things we're counting with a one-based count. For example, we've called the components of x one and two, and the components of z one, two, and three. This is further complicated by the fact that some languages, like MATLAB, actually use a one-based count, whereas most computer languages use a zero-based count. So why do we do this? Well, it stems from the fact that machine learning is sort of a combination of math and programming. If you pick up any math or statistics textbook, you'll notice that the counts by convention start at one. If you see a summation, most likely you'll see i counting from one to N rather than zero to N minus one. But if you pick up any programming textbook, you'll notice that the count starts at zero. Since machine learning is a combination of these, you're going to see both. Typically, when we're looking at equations, we're starting counts at one. But if we're talking about implementation details, we're starting counts at zero. And so that's why for labels, because they need to index arrays, they refer to implementation, and so we start the count at zero. The moral of the story is you should use the numbering that corresponds to the programming language that you're writing in. Since we are writing in Python, counting starts at zero. And finally, we're going to get back to how to actually calculate the softmax output. Since we don't know how to do training yet, I've just picked some random weights. Let's also assume that the input-to-hidden layer is the same as before. So we have three units in the hidden layer. Remember that now we have two output nodes. So the hidden-to-output weight matrix V has to be of size three by two. And the bias c, which applies to every output node, is now a vector of size two.
Since we don't know what these weights are yet, I've defined them here on this slide. For this particular example, we're going to again use the first sample, x equals zero and 3.5. Since W and b are the same as before, this means that z is also the same, so we can start the calculation from that point. Remember that z is 0.993, minus 0.74, and 0.604. Let's again do this calculation the naive way, going node by node. Instead of expressing the entire output equation directly, we can do it in parts. The activation a for output node one, which actually refers to class zero, is the top equation, and the activation a for output node two, which actually refers to class one, is the bottom equation. Remember that these a's are what we get right before going into the softmax. So let's calculate a one and a two. This is the contribution from hidden node one to output node one. This is the contribution from hidden node two to output node one. This is the contribution from hidden node three to output node one. This is the contribution from hidden node one to output node two. This is the contribution from hidden node two to output node two. And this is the contribution from hidden node three to output node two. Now that we have a one and a two, we can calculate the softmax and hence the output probabilities. Not surprisingly, we end up with the same output probabilities as before, because I chose V and c such that this would be the case. And remember, I taught you in the lecture sigmoid versus softmax how to choose the output weights so that using the softmax would be equivalent to using the sigmoid. So if you want to know how I did it, you may want to review that lecture and confirm that my new V and c follow the rule that I derived there. Same as before, we could have done that all in one step using a vectorized operation. So if you want to confirm to yourself that this is true, you may want to write this down on paper and do some of it by hand.
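One way to construct a two-output softmax layer that reproduces a given sigmoid output is sketched below. The hidden values `z` are the ones quoted in the lecture, but `v` and `c` are hypothetical (the slide values are not in the transcript), and placing all the weight in the class-one column is just one choice that satisfies the rule that only the difference between the two columns matters.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

z = np.array([0.993, -0.74, 0.604])  # hidden values quoted in the lecture

# Hypothetical sigmoid output weights (NOT the slide values).
v = np.array([0.8, 0.1, -0.1])
c = 0.2
p_sigmoid = sigmoid(v.dot(z) + c)

# Softmax version: V is 3 by 2, c_vec has size 2.
# Column 0 is class zero, column 1 is class one.
# Choosing column 0 = 0 and column 1 = v makes (column 1 - column 0) = v.
V = np.column_stack([np.zeros(3), v])
c_vec = np.array([0.0, c])

a = z.dot(V) + c_vec      # activations right before the softmax, size 2
expa = np.exp(a)
p = expa / expa.sum()     # softmax over the two output nodes

print(p)                              # [p(y=0 | x), p(y=1 | x)]
print(np.isclose(p[1], p_sigmoid))    # True
```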
And of course, we can go further than this and calculate the output for all of the samples at the same time. But at this point, it starts just looking like the equation we had before with sigmoid. Now the softmax just replaces the sigmoid. Okay, so now you know how to calculate the output of a neural network as a batch matrix operation using both sigmoid and softmax outputs. Now there is one more nuance we have to talk about. Recall that if our input dimensionality is D and the number of hidden units is M, then W is a D by M matrix. And since the bias term b is applied after multiplying by W, it is a vector of size M. So if we're trying to calculate the output for a batch of N samples, then X will be of size N by D. Now what happens after we multiply X and W together? Well, first, let's recall why this is a valid matrix multiplication. It's because the inner dimension, which is D, matches on both sides. The result of multiplying X and W together is an N by M matrix, but b is only a vector of size M. Recall that matrix addition is element-wise. That means when you add two matrices together, they have to be the same size. So doing X W plus b, which we showed earlier, doesn't actually work mathematically. We know that what should happen is the bias b should be applied to all N input samples. For that to happen, we need to flip b so that it's horizontal and then repeat it N times. This would give us an N by M matrix, and then we'd have two N by M matrices that we could add together. Luckily, NumPy doesn't require you to do this explicitly. NumPy already knows that it should add the same b to every row of the result. So in the end, no additional work is required in NumPy beyond just writing X dot W plus b.
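The explicit "repeat b N times" construction and NumPy's implicit version can be compared with a short sketch. The sizes and random values are assumptions for illustration; `np.tile` is used here as one way to build the repeated matrix.

```python
import numpy as np

# Hypothetical sizes and random values, for illustration only.
rng = np.random.default_rng(2)
N, D, M = 5, 2, 3
X = rng.standard_normal((N, D))  # batch of N samples
W = rng.standard_normal((D, M))  # D by M weight matrix
b = rng.standard_normal(M)       # bias vector of size M

# Explicit version: lay b out as a row and repeat it N times -> N by M.
B = np.tile(b, (N, 1))
explicit = X.dot(W) + B

# Implicit version: NumPy broadcasting adds the same b to every row.
broadcast = X.dot(W) + b

print(B.shape)                           # (5, 3)
print(np.allclose(explicit, broadcast))  # True
```

Broadcasting avoids materializing the repeated matrix, so the implicit form is both shorter to write and lighter on memory.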