
python & DS

conclusion

shannon. 2022. 7. 31. 18:05

Compare the three methods

 
- [Instructor] In this video, we're going to take a step back and compare the three ensemble learning techniques that we've learned about in prior chapters. We'll review what we've learned about each, directly compare them all, and then we'll set the appropriate context to compare our best models on the validation and test data in the next video.

We'll start with Boosting. Remember that the weak models, or base models, in Boosting are learned sequentially because each successive model is trained on the errors of the models before it. These weak models are typically very shallow, underfit, high bias, low variance trees. Then these are all combined using weighted voting for classification or averaging for regression. Gradient Boosting and Adaptive Boosting are two examples of algorithms that leverage Boosting.

Next up is Bagging. Unlike Boosting, Bagging trains independent weak models in parallel, where each weak model is a deep, often overfit, low bias, high variance tree. Then all the weak models are combined using standard voting or averaging. Random forest is the most popular algorithm that leverages Bagging. The only tweak on standard Bagging that random forest implements is that each weak model is built not only on a sample of the data but also on a sample of the features. This ensures that each weak model is independent, uncorrelated, and focused on a certain aspect of the data.

Last up is Stacking. Like Bagging, Stacking trains independent base models in parallel, where each base model is often low bias, high variance. In Stacking, each base model is trained on the full data set, unlike in Bagging, and those base models do not have to be trees. Also, unlike in Boosting and Bagging, where the meta-model is simply voting or averaging, Stacking involves a real, trained meta-model on top of the predictions from the base models.

One final reminder: Boosting starts with high bias, low variance models and reduces the bias by having each model train on the errors of the models before it. Remember our dartboard analogy. High bias, low variance means a very tight range of outcomes, but not centered on the bullseye. The process of Boosting allows each successive model to learn in what way the prior models were missing the bullseye, allowing it to move closer and closer to the bullseye, which in machine learning means it is effectively reducing bias. Bagging and Stacking both start with low bias, high variance models and reduce the variance by allowing each base model to contribute to the final answer, where their variance effectively tends to cancel out. Recall our dartboard analogy, where all these independent models are roughly centered around the bullseye but with a wider range of outcomes. When each gets a voice and they work together, that variance effectively cancels out to find the bullseye.

Okay, now we're going to attempt to compare these three methods directly across a few dimensions. It's worth noting here that we're speaking in extremely general terms. Just because we mark Boosting as being able to make faster predictions than Bagging, it doesn't mean that will be true in every single scenario. It'll depend on the problem and how you configure your model. First up, in Boosting, the training data is sampled so that misclassified examples from prior models are overweighted, and each successive model learns from the mistakes of the prior models. Bagging just takes a random sample for each base model to try to make sure each model is independent and uncorrelated.
Lastly, there's no sampling for Stacking. Each base model is trained on the full training set. Next up is the type of base models each ensemble technique relies on. Boosting starts with high bias, low variance base models, while both Bagging and Stacking start with low bias, high variance base models.

Training speed can be pretty directly tied to whether training can be parallelized and to the complexity of the base models. Boosting has simple base models, but you can't parallelize the training, which really slows it down. Bagging starts with complex base models, but it can be trained in parallel, so it's moderately fast. Stacking is the same as Bagging: it starts with complex base models, but it can be trained in parallel, so it's also moderately fast. Again, this is likely the dimension that fluctuates most depending on how you're structuring your model, the data, and even the architecture of the computer that you're training it on. Some computers are better set up to run the parallelization that you can do with Bagging and Stacking. So these labels are broadly true, but it should be noted that they can vary from one problem to the next.

For prediction speed, Boosting is very fast because the base models are so simple, and while training cannot be parallelized, prediction can be. Bagging and Stacking are not the slowest models out there for prediction time, but they're also not the fastest. Prediction can be parallelized, but the base models are more complex than the base models for Boosting, so they tend to be a bit slower. Lastly, none of these techniques generate models that are terribly transparent. If interpretability is important to you, you may want to consider a different algorithm.

Now, before we jump into evaluating our three models on the validation and test data, let's revisit where we are in the standard machine learning pipeline. This diagram will look familiar from our earlier review chapter. We started back in the second chapter by exploring and cleaning our data. Then we split the data into training, validation, and test sets, and we wrote these out to ensure that we're using the same exact data for each ensemble technique. Then, over the last three chapters, we fit models using five-fold cross-validation and used grid search to find the best hyperparameter settings. Then we saved out the best model from each of our three techniques. This is the stage we're at now. We have our three best models based on the training set; now we want to evaluate those three models against each other on data they have never seen before, in the validation set. Then we'll select the best model based on performance on the validation data and evaluate it on the test data to give us one final, completely unbiased view of how our model will do on unseen data. This evaluation on the test data is particularly important if you're planning on putting your model into production, where it'll run in some automated fashion on data it's never seen before. This is the best estimation of how it will perform. With all that covered, let's jump into evaluating our final models on the validation and test set in the next video.
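To make the comparison above concrete, here is a minimal sketch of how the three ensemble families map onto scikit-learn estimators. The class names are real scikit-learn APIs; the specific base models and hyperparameter values are illustrative placeholders, not the settings tuned in the course.

```python
from sklearn.ensemble import (
    GradientBoostingClassifier,  # boosting: shallow trees learned sequentially on prior errors
    RandomForestClassifier,      # bagging: deep trees trained in parallel on row + feature samples
    StackingClassifier,          # stacking: base models plus a trained meta-model
)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Boosting: many high-bias, low-variance base trees, each correcting the errors before it.
gb = GradientBoostingClassifier(n_estimators=100, max_depth=3, learning_rate=0.1)

# Bagging (random forest): low-bias, high-variance trees on bootstrap samples, each also
# restricted to a random subset of features so the trees stay decorrelated.
rf = RandomForestClassifier(n_estimators=100, max_features='sqrt', n_jobs=-1)

# Stacking: heterogeneous base models trained on the full data set, with a logistic
# regression meta-model learning how to combine their predictions.
stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(n_estimators=50)),
                ('svm', SVC(probability=True))],
    final_estimator=LogisticRegression(),
)

# Each is then fit and used the same way: model.fit(X_train, y_train); model.predict(X_new)
```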
 

Compare all models on validation set

 
- [Instructor] So now we've learned about three different ensemble learning techniques, how they minimize error, and why they're so powerful. Then we actually built some models and used GridSearchCV to do hyperparameter tuning and find the best model for each ensemble learning technique. In this video, we're going to pick up those three models that were fit on the training set and evaluate them against one another on the validation set. This will give us a view of how the best models generated by each ensemble learning technique perform on data that they were not fit on. Then we'll select the best model based on performance on the validation set and evaluate it on the holdout test set to get an unbiased view of how this model will perform on new data.

Let's start by importing the packages that we'll need. I'll call out that we're importing the accuracy, precision, and recall score calculators from sklearn.metrics. These are the standard metrics used to evaluate performance on a classification problem. Lastly, we'll read in the features and labels for both our validation set and our test set.

Now let's read in the models that we have stored. Just a reminder that these are models that have already been fit and optimized on the training data and are ready to make predictions on data that they haven't seen before. So let's load them from the models directory using joblib. First, we have to go up a few levels into the models directory, then we'll load GB_model.pkl. Then we just copy this path down, since these are all stored in the models directory, and update the model name to RF_model for random forest and stacked_model for the stacked model. We run that, and now we've loaded these fit models, ready to make predictions, into these Python objects: gb_mdl, rf_mdl, and stacked_mdl.

Now, before diving into the results, a quick review of the evaluation metrics. If you want to learn more, take a look at my Foundations course, where we cover them in more detail, but here's a quick primer. The accuracy score is just the number correctly predicted over the total number of examples. Precision is the number predicted as surviving that actually survived, divided by the total number predicted to survive. In other words, it says: when the model predicted somebody would survive, how often did they actually survive? Recall is the complement to that. It's the number predicted as surviving that actually survived, divided by the total number that actually survived. In other words, it says: given that somebody actually survived, what is the likelihood that the model correctly predicted that they would survive?

Okay, so now we have this function that's going to help us evaluate each of our three models on the validation set. The function is called evaluate_model, and it accepts the following arguments: the model object, the features, and then the labels. I'll just mention that we're going to use this time function, which stores the time when the given command was run. We call it once before the predict step and once after the predict step, so if you take the difference between this end and start, it gives us the amount of time it took for the given model to make its predictions. The next three lines then generate the accuracy, precision, and recall scores using the labels and predictions. Then, at the very end, it prints out all of those performance metrics along with the time it took to make predictions.
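A rough sketch of what's described above, assuming the variable and file names used in the narration; the relative paths, the CSV file names, and the exact print format and rounding are assumptions, since the original notebook isn't shown here.

```python
import time

import joblib
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Validation and test features/labels written out earlier in the course
# (file names and paths are assumptions).
val_features = pd.read_csv('../data/val_features.csv')
val_labels = pd.read_csv('../data/val_labels.csv').values.ravel()
te_features = pd.read_csv('../data/test_features.csv')
te_labels = pd.read_csv('../data/test_labels.csv').values.ravel()

# Load the fit, tuned models saved out in the previous chapters.
gb_mdl = joblib.load('../models/GB_model.pkl')
rf_mdl = joblib.load('../models/RF_model.pkl')
stacked_mdl = joblib.load('../models/stacked_model.pkl')


def evaluate_model(model, features, labels):
    """Print accuracy, precision, recall, and prediction latency for one fit model."""
    start = time.time()
    pred = model.predict(features)
    end = time.time()

    accuracy = round(accuracy_score(labels, pred), 3)
    precision = round(precision_score(labels, pred), 3)
    recall = round(recall_score(labels, pred), 3)
    print('{} -- Accuracy: {} / Precision: {} / Recall: {} / Latency: {}ms'.format(
        type(model).__name__, accuracy, precision, recall, round((end - start) * 1000, 1)))
```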
So now, since we just want to pass each model into the function we just created, let's create a loop where we call each model from a list and pass it into the function. We do that by saying for mdl in a list containing gb_mdl, rf_mdl, and stacked_mdl. Then we pass each one into our function, evaluate_model. The first argument is the actual fit model, and then we need to pass in our validation features and our validation labels. So let's run the cell with our function in it, and then let's evaluate all of our models; the loop itself is sketched below.

Before digging into the results, there's one important thing to note. Previously I mentioned how, if I ran random forest twice, for instance, I would get different results. It's important to understand that that was only true in the training phase. You can run the training twice on the same exact data and get two slightly different models. What we're dealing with here are stored, fit, concrete models. So I can run this cell as many times as I want and the results will be exactly the same, except for latency, and even that shouldn't vary too much.

Okay, so let's look at the results. A couple of things to note here. The random forest model is generating the best performance, as it has the best accuracy, the best precision, and the best recall. Random forest also takes the longest to make predictions. Remember we talked about how gradient boosting was very slow to train but quick to make predictions? We see evidence of that here, as it's by far the fastest model to make predictions.

So this brings us to a conversation about trade-offs. There are really two types of trade-offs. The first is precision versus recall. In many cases, you have to give up some recall for gains in precision, and vice versa. Which model you choose really comes down to the problem you're trying to solve or the business use case. For instance, if this is a spam detector, we would want to optimize for precision. In other words, if the model says it's spam, it had better be spam, or else we're blocking real emails that people would want to see. On the other side, if this is a fraud detection model, you're likely to optimize for recall, because missing one of those fraudulent transactions could cost thousands or maybe even millions of dollars. In our case, random forest has the best precision and the best recall. But if you compare gradient boosting with the stacking classifier, stacking has better precision, while gradient boosting has better recall. That's where you have to consider what problem you're trying to solve and what matters more.

The second trade-off is between model performance, in terms of precision and recall, and latency. In our case, our best performing model in terms of precision and recall is actually the slowest. So if we were deploying this model in a real-time environment, we would certainly factor latency into the equation and consider whether the improvement in performance is worth the extra time it takes to make predictions. In other words, maybe the random forest model would be too slow for our environment and we would have to choose gradient boosting. Since latency isn't really a consideration for this course, we're just going to select the model that did best on precision and recall and evaluate that on the test set.
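A minimal sketch of that validation loop, reusing the assumed names from the snippet above:

```python
# Evaluate all three stored models on the same validation data.
for mdl in [gb_mdl, rf_mdl, stacked_mdl]:
    evaluate_model(mdl, val_features, val_labels)
```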
So let's go ahead and evaluate the random forest model on the test set. Just like before, we're going to use the evaluate_model function and pass in our random forest model object, and then, instead of the validation features and labels, we'll pass in the test features and the test labels. As I run this, I'll note that we should see performance that aligns fairly closely with the validation set. The reason we evaluate on both the validation set and the test set is that, effectively, we used the performance on the validation set to select our best model. So in a sense, the validation set played a role in training our model, or at least in selecting the best model. This test set was not used for any model selection, so it gives a completely unbiased view of how we can expect this model to perform on new data moving forward. Again, ideally, we're just looking for performance that is relatively close to what we saw on the validation set.

And we can see that the performance is relatively close to what we saw on the validation set. Accuracy is 81.6%. Precision is 86.4%. Recall is 67.1%. And the latency is 28.4 milliseconds.

Great. So now we've explored around 200 candidate models across three different ensemble learning techniques to try to find the best model for this Titanic data set. We finally narrowed it down to this random forest model with 250 estimators and a max depth of eight. We've robustly tested this best model by evaluating it on completely unseen data, and we know that it generated an accuracy score of 82.0% in our cross-validation, 82.6% on the validation set, and 81.6% on the test set. So now we have a pretty good feel for the likely performance of this model on totally new data, and we can be confident in proposing it as the best model for predicting whether people aboard the Titanic would survive or not. You've now gathered the foundational knowledge about these three ensemble learning techniques, as well as the ability to implement, optimize, and evaluate these models. Now you're ready to take this knowledge and skill and apply it to real problems outside of this course.
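For completeness, here is what that final test-set evaluation looks like under the same assumptions, together with the rough results the narration reports:

```python
# One final, unbiased check of the selected model on the held-out test set.
evaluate_model(rf_mdl, te_features, te_labels)
# Reported in the video: accuracy ~0.816, precision ~0.864, recall ~0.671, latency ~28ms
# (latency in particular will vary with hardware).
```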
 

How to continue advancing your skills

 
- [Derek] Congratulations. You now know how three of the most powerful ensemble learning techniques are implemented, how they reduce overall error, what some of the key hyperparameters are for each, and their relative strengths and weaknesses. You're now ready to apply these techniques to new problems that you encounter. Understanding what drives each of these algorithms, and how and when to use them, will allow you to truly optimize a model that is tailored to the specific problem in front of you. This ability to deliver a powerful, tailored solution is truly invaluable, but don't stop here. There's still so much more to learn.

Here are a few next steps you could take. First, if you want to learn more about some of the foundations of machine learning that generalize to all problems, check out one of my other courses in this series, Applied Machine Learning: Foundations. Second, we only explored ensemble learning techniques in this course. While those techniques are very powerful, there are a lot of other powerful machine learning algorithms as well. If you want to learn about those algorithms, like logistic regression and support vector machines, take a look at one of my other courses in this series, Applied Machine Learning: Algorithms. Lastly, one of the absolute best machine learning resources out there is fast.ai. It was started by Jeremy Howard, formerly the president of Kaggle. They have blog posts and really specialize in making deep learning as practical and tangible as possible.

But above all, don't stop here. There's no substitute for actually getting your hands dirty and doing this work yourself. That hands-on experience will further hone your skills and techniques and unlock brand new doors for you.
 

Derek Jedamski

