

05

shannon. 2022. 7. 31. 18:02

What is bagging?

 
- [Instructor] Now that we've learned a little bit about boosting in the last chapter, let's dig into bagging. Bagging is an ensemble method that creates one strong model from a number of independent weak models, often trees, trained in parallel. This sounds pretty similar to boosting, which isn't all that surprising since they both fall under the umbrella of ensemble learners.

Let's highlight a couple of things in this definition. First, bagging often leverages decision trees as its base models. This is one similarity that bagging has with boosting. The second thing I want to highlight is that bagging leverages weak models, just like boosting, but the type of weak model is very different. In boosting, the weak models are forced to be simple, high bias, low variance models. In bagging, the weak models are usually more complex, low bias, high variance models. So in boosting, we started with low variance in our weak models and then reduced the overall bias by sequentially training the models on the errors of the prior models. In bagging, we start with low bias in our weak models and then reduce the variance by combining the weak models. Don't worry about fully grasping that right now; we'll dig more into those details in the next video. I just wanted to introduce the idea that the type of weak models we'll be using in bagging is different from the type used in boosting.

Again, remember that boosting models are trained sequentially, so each weak model is dependent on the one before it, which makes those models very slow to train. In bagging, the weak models are independent, meaning they can be trained in parallel. So training a model that uses bagging is typically much faster than training a model that leverages boosting. This is a huge perk of models that use bagging.

The term bagging comes from bootstrap aggregating. Bootstrap is a statistical term that means estimating a quantity from random sampling with replacement. So bootstrap aggregation, or bagging, just means aggregating or combining a bunch of samples. This will probably make more sense if we look at a diagram of what bagging actually does. We'll start with what training looks like, and then we'll look at what it's like to test this model.

In training, the algorithm will take n samples from the training data. Remember, this is sampling with replacement, which means that a single example can both appear multiple times within a single sample and appear across multiple samples. Then, for each sample of training data, the algorithm will build a decision tree to generate the most accurate results. It's important to remember a couple of things here. First, these are deep decision trees. Recall that with boosting, we used really shallow trees to keep them simple and underfit, meaning high bias, low variance. These trees are much deeper. Again, this diagram is just for illustration; they'll usually be much deeper than what you see here. The second thing to remember is that these decision trees are all developed on their own. They do not know what the other trees are doing. And again, this is different from boosting, where each tree depends on the ones before it. So in summary, we would have n trees that are built independently on different subsets of the data. It's important that these are each independent because you want the decision trees to be uncorrelated, and you want each tree to key in on different relationships in the data. This is what will generate the best prediction for this overall framework.

Speaking of predictions, let's see how we get there. All I did was copy the trees that were fit on the training data in the last slide. These represent your stored, fit model: just a collection of decision trees. So we would take an example from the test data; we'll just call it example one. Then you pass example one into each of the n decision trees that were built in training, and the example will traverse down each tree based on its features. Then each of the n trees will generate a prediction. Remember, the sample of data each tree was trained on will be different, which means each tree will be different, so the trees can generate different predictions for the same example because they were all trained on different data. Now you have n predictions for example one, one prediction from each tree. The meta model then aggregates all of the predictions together and, based on voting, makes the final prediction. So in this case, we'll just say class A.
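To make that training and prediction flow concrete, here is a minimal sketch using scikit-learn's BaggingClassifier on a synthetic dataset. The dataset and parameter values are purely illustrative; by default the base estimator is a full-depth decision tree, which matches the deep, low-bias trees described above.

```python
# Minimal bagging sketch: bootstrap n samples, grow one deep tree per sample
# in parallel, then predict by majority vote across the trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real training set (illustrative only).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

bag = BaggingClassifier(
    n_estimators=50,    # n independent weak models (default base model: a deep decision tree)
    bootstrap=True,     # sample training rows with replacement for each tree
    n_jobs=-1,          # the trees are independent, so train them in parallel
    random_state=0,
)
bag.fit(X_train, y_train)

# Each tree votes on the test example; the ensemble returns the majority class.
print(bag.predict(X_test[:1]))
print(bag.score(X_test, y_test))
```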
 

How does bagging reduce overall error?

 
- [Instructor] Let's dig into how bagging reduces overall error. Let's start by revisiting the plot we looked at previously. Total model error is roughly made up of model bias plus model variance. Overly simple and overly complex models both result in high error: overly simple error is driven by high bias, while overly complex error is driven by high variance. We're aiming to find the right trade-off between bias and variance to achieve the lowest total error possible.

In the last chapter, we talked about how boosting starts with extremely simple models that are underfit; the process of boosting, where we train on the errors of prior models, allows us to drive down that bias to achieve the optimal trade-off and reduce overall error. Bagging takes the opposite approach. It starts with more complex models that tend to be low bias, high variance, and then the process of bagging drives down the variance to achieve the optimal trade-off.

Let's revisit the diagram from the last video to better understand how bagging achieves that. We've taken n samples and generated n deep decision trees representing our weak models. Each of these trees represents a weak model whose total error is driven by high variance. The upside of these models is that they do have low bias.

Let's circle back for a second to our dartboard analogy to recall what high variance and low bias actually mean. Variance is an algorithm's sensitivity to small fluctuations in the training data, and high variance comes from an algorithm fitting to the random noise in the training data. In other words, it's due to the model overfitting to the training data rather than fitting only the true trends in that data. On a dartboard, low bias means the throws will be centered around the bullseye, but high variance means they'll rarely actually hit the bullseye. So what bagging does is reduce the variance while maintaining that low bias.

Now let's jump back to our diagram. How does bagging take all of these high variance weak models and combine them into one low variance strong model? By simply giving each of them a voice. Thinking about it practically, and again remembering our dartboard analogy, each point represents a low bias, high variance outcome; each point is essentially one of our weak models. If you allow all of these points to work together to find the bullseye, it can be really powerful. To put it another way, if you average all of these points, or find their center point, you'll end up with a point in or near the bullseye. In technical terms, that means you reduce the variance of the overall framework to find a low bias, low variance solution in or near the bullseye. Bagging works the same way. With enough votes, the variance of the individual models effectively cancels out, resulting in a low bias, low variance model.

Another great perk is that bagging protects against overfitting to your data. Individual decision trees are prone to overfitting, which is why they're high variance, but by giving each tree a voice to vote on the final outcome, you nearly eliminate the risk of overfitting. Compare this to boosting, where you're consistently learning from the mistakes of prior models on the training data: if you do that too many times, you end up overfitting to the training data by correcting every mistake you made on it.
So bagging is an incredibly powerful tool that allows you to efficiently find a low bias, low variance model. In the next lesson, we'll talk about when you should use it and when you should not use it.
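As a quick numerical illustration of the dartboard argument, the toy simulation below (purely synthetic numbers, not from the course) shows that averaging many unbiased but noisy "throws" shrinks the variance roughly by a factor of n when the throws are uncorrelated.

```python
# Averaging many low-bias, high-variance "throws" gives a low-bias,
# low-variance estimate of the bullseye.
import numpy as np

rng = np.random.default_rng(0)
bullseye = 0.0          # the true target
n_models = 100          # number of independent weak models
n_trials = 10_000       # repeat the experiment to measure variance

# Each weak model is unbiased (centered on the bullseye) but noisy.
throws = rng.normal(loc=bullseye, scale=1.0, size=(n_trials, n_models))

print("variance of a single model:  ", throws[:, 0].var())
print("variance of the n-model mean:", throws.mean(axis=1).var())
# For uncorrelated models, the averaged variance is roughly 1/n of a single
# model's variance, which is why the ensemble lands in or near the bullseye.
```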
 
 

When should you consider using bagging?

 
- [Instructor] In this lesson, we're going to talk about when you should and should not use bagging. Just like we saw with boosting, bagging works quite well for problems with both categorical and continuous target variables. Again, similar to boosting, one of my favorite things about bagging is that it has an attribute that lays out the importance of each feature you're using. This is really nice for understanding the relationships between the features and the thing you're trying to predict. Bagging is extremely flexible, relatively quick to train, and it tends to perform quite well on most problems, which makes it a great candidate to be your first quick benchmark model. And oftentimes this may very well be your best model. On the note of flexibility, when you have really messy data with missing values, outliers, skewed data, and so on, bagging is often a great choice because it deals really well with that messy data.

So when should we not use it? While bagging is fairly fast, quite flexible, and performs well, it's not necessarily likely to be the best model for any given problem. If you want a quick model that gets you 90% of the way there, this probably is the right tool for the job. If you really need to get 100% of the way there and extract all possible signal from the data, bagging might not be the right tool; there are other, more powerful algorithms out there. And while it's nice to understand the features, and decision trees on their own can be fairly easy to understand, with a collection of hundreds of decision trees it can be quite hard to interpret what's happening across the entire model. So if you need to see the details within the model, bagging might not be the right choice. Remember how in the last chapter we referred to boosting as a black box algorithm; the same applies to bagging. Finally, while bagging models are relatively quick to train because they are parallelizable, they are not quite as quick to make predictions as boosting models, because the trees are so much deeper than they are for boosting.

Overall, bagging is a tremendously flexible, relatively fast tool that works pretty well for almost any problem. That's a really nice Swiss Army knife to have in your tool set.
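The feature-importance attribute mentioned above corresponds to feature_importances_ on scikit-learn's fitted tree ensembles. Here is a rough sketch; the file paths stand in for the course's Titanic training data and are assumptions, not the course's actual file names.

```python
# Sketch: rank features by importance using a fitted random forest.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Placeholder paths for the Titanic-style training data (illustrative only).
tr_features = pd.read_csv('train_features.csv')
tr_labels = pd.read_csv('train_labels.csv')

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(tr_features, tr_labels.values.ravel())

# Pair each feature name with its importance and sort descending.
importances = pd.Series(rf.feature_importances_, index=tr_features.columns)
print(importances.sort_values(ascending=False))
```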
 

What are examples of algorithms that use bagging?

 
- [Instructor] Now that we've defined bagging, learned how it reduces error, and covered when you should and should not use it, let's talk about by far the most popular algorithm that uses bagging: the random forest algorithm. Random forest is an ensemble method that builds many independent deep decision trees using the bagging technique, then combines their outputs, using averaging for regression and voting for classification, for more accurate and stable predictions.

I just want to highlight two things in this definition. The first is to emphasize again that it's important that these are independent decision trees, and that matters for two reasons. One, the trees being trained independently allows us to parallelize their training, so the training process is much faster than it is for boosting, where the trees are not independent and training can't be parallelized. Two, the goal is for each tree to really represent the trends or information in a certain section of the data. Another way to put this is that we want the trees to be uncorrelated. If they're correlated, then they're representing similar data and the trees are effectively redundant; that will have a negative impact on your model variance and make for a worse model. So it's critical for these weak models, or decision trees, to be independent, both for speed of training and for performance.

The second thing to highlight is that random forest can be used for both regression and classification problems. You build the individual, independent decision trees and each one generates an output, then you combine those for the final output of the random forest model. For regression problems, it simply averages the responses coming out of all the decision trees. For classification problems, which is what we focus on in this course, it gives each tree a vote and the majority wins.

Now that we've defined random forest, let's revisit our bagging diagram, as there's one wrinkle that random forest introduces to the bagging process. In standard bagging, we start with our training data, take n samples, again with replacement, and then build a deep decision tree in parallel on each of our n samples. Let's jump back to our prior slide so I can show you the wrinkle random forest introduces. We still take our n data samples with replacement. Remember how important it is for these individual deep decision trees to be independent and uncorrelated: we want each tree to fit closely to a subsection of the data, which is how we get high variance models. In standard bagging, we sample the data, that is, the examples in our data. The wrinkle random forest introduces is that for each data sample, it also samples the features. As an example, that may mean for each sample we're selecting 50% of the data, or 50% of the rows, as well as 50% of the features, or the columns. Then we train a deep tree on that section of the data. So now we have a sample of the examples and a sample of the features to train each tree on, which adds an additional layer forcing each tree to focus on a small section of the data and really learn the trends in that data. That ensures a high variance tree, and it ensures that the trees are uncorrelated with each other. Other than that added wrinkle, random forest follows the exact bagging process we laid out throughout this chapter.
That includes the guidelines for when to use random forest and when not to use it. Overall, random forest is one of the most flexible, popular algorithms in a machine learning practitioner's toolbox. It easily handles messy data with different data types, trains relatively quickly, and generally provides well-performing models with little overfitting risk.
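Here is a hedged sketch of how that wrinkle maps onto scikit-learn's RandomForestClassifier. Note that scikit-learn draws the random feature subset at each split rather than once per tree, and the 50% values below simply mirror the example in the text; they are not the library defaults.

```python
# Random forest = bagging over the rows, plus random feature subsetting.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,   # number of independent deep trees
    bootstrap=True,     # sample the rows with replacement for each tree
    max_samples=0.5,    # each tree sees ~50% of the rows (illustrative, not a default)
    max_features=0.5,   # each split considers ~50% of the columns (illustrative)
    max_depth=None,     # let each tree grow deep: low bias, high variance
    n_jobs=-1,          # trees are independent, so train them in parallel
    random_state=0,
)
# rf.fit(X_train, y_train) would then train the forest as usual.
```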
 

Explore bagging algorithms in Python

 
- [Instructor] In the last video, we talked about how random forest is the most popular algorithm that leverages bagging. So in this video, we'll explore some of the key hyperparameters for random forest. Again, we're only looking at the classifier here since our Titanic dataset is a classification problem, but there's also a RandomForestRegressor in scikit-learn.

We'll start by importing RandomForestClassifier from sklearn.ensemble, and then we'll call up the parameters the same way we did with the boosting models. So copy down RandomForestClassifier, leave empty parentheses, then call the get_params method. Here are all the potential hyperparameters we could tune, and this might look a lot like gradient boosting. That's true: a lot of the hyperparameters are the same because they're both tree-based methods. With that said, we're only going to focus on two hyperparameters here: n_estimators, the number of estimators, and max_depth. These are defined the same way they were for gradient boosting: n_estimators is the number of base models, or individual decision trees, and max_depth is the depth of each individual decision tree.

One difference I want to call out is that max_depth for gradient boosting was set to three, while max_depth for random forest is set to None. That means each decision tree can grow as deep as it needs to in order to find the optimal model. So by default, random forest will build 100 independent, really deep trees in parallel. Gradient boosting by default will build 100 trees with a max depth of three, and those trees are built sequentially, with each tree trained on the mistakes of the prior trees. Remember, this all comes back to the bias-variance trade-off, where boosting starts with high bias, low variance base models and reduces the bias through boosting, whereas bagging starts with low bias, high variance models and reduces the variance through bagging. In the next lesson, we'll actually fit a random forest model.
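For reference, the exploration described above boils down to something like the snippet below; the values noted in the comments are scikit-learn's current defaults for the two hyperparameters discussed.

```python
# List the tunable hyperparameters of a random forest classifier.
from sklearn.ensemble import RandomForestClassifier

print(RandomForestClassifier().get_params())
# Among the output, the two hyperparameters discussed here are:
#   'n_estimators': 100   -> number of independent trees
#   'max_depth': None     -> each tree grows as deep as it needs to
```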
 

Implement a bagging model

 
- [Instructor] In this final video of the bagging chapter, we're going to try to build the best random forest model we can on the Titanic dataset, using the same process we did for gradient boosting. We'll search for the best hyperparameter settings for the random forest model using GridSearchCV.

Let's start by importing the same packages we imported in the last chapter: joblib to save out our model, pandas to read in our data, and then our classifier and GridSearchCV from sklearn. Then we'll read in our training features and training labels. Recall the helper function we used in the last chapter to print out the average accuracy score and the standard deviation of the accuracy score across the five folds of our cross-validation, for each hyperparameter combination.

Now let's walk through this code again for GridSearchCV. This should look familiar, as it's basically the same as what we ran for gradient boosting. We have the RandomForestClassifier object and we store it as rf. Then we define our hyperparameter dictionary. The keys in the dictionary align with the names of the hyperparameters that would be passed into RandomForestClassifier, so that's n_estimators and max_depth. Then we just need to define the list of settings we want to test for each hyperparameter. For the number of estimators we'll build out 5, 50, 250, and 500 trees, and for max_depth we'll test 4, 8, 16, 32, and None. Just a reminder, None indicates that each tree should be built as deep as it needs to be, until it reaches the training error tolerance defined within RandomForestClassifier. You'll notice we're testing deeper trees here than we did for gradient boosting, and that's expected given the way these two algorithms optimize the bias-variance trade-off: random forest starts with deep trees that have high variance and low bias.

Now let's create our GridSearchCV object. We call GridSearchCV, pass in our model object, then pass in our hyperparameter dictionary. Lastly, we define the number of folds in our cross-validation; we'll keep it at five, just like we did in the last chapter. We assign all of that to cv. Then, just as we do for all sklearn objects, we call the .fit method and pass in the training features and training labels. One more reminder: the training labels are stored as a column vector, and we need to convert them to an array for sklearn, so we call .values.ravel(). Again, GridSearchCV will run five-fold cross-validation for each hyperparameter combination to determine the best model. Then we just print out the results using the print_results function, so we copy that down here and pass in our GridSearchCV object. Go ahead and run that.

Before we dig into these results, I want to call out that even if I ran this exact same cell again with the exact same training set, I would get different results. That's because each time you run random forest, it randomly samples the rows and columns internally, like we discussed earlier in this chapter, to build each decision tree. So you could get different results each time you run random forest. Also remember that these results are on unseen data, thanks to the way the cross-validation built into GridSearchCV splits up the data.
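A sketch of the grid search described above, under the assumption of the course's training-feature and training-label CSV files (the paths here are illustrative) and the print_results helper defined in an earlier chapter.

```python
# Grid search over random forest hyperparameters with 5-fold cross-validation.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

tr_features = pd.read_csv('../train_features.csv')   # path is illustrative
tr_labels = pd.read_csv('../train_labels.csv')       # path is illustrative

rf = RandomForestClassifier()
parameters = {
    'n_estimators': [5, 50, 250, 500],
    'max_depth': [4, 8, 16, 32, None],
}

cv = GridSearchCV(rf, parameters, cv=5)               # 5 folds per combination
cv.fit(tr_features, tr_labels.values.ravel())         # labels flattened to a 1-D array

# print_results(cv)  # helper from the previous chapter: mean/std accuracy per combination
```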
Now looking at the results, the best results use 250 estimators with a max depth of eight, which gives us an overall accuracy score of 82%. Feel free to dig through these results to better understand how each of our two hyperparameters is impacting the results. Then let's ask the GridSearchCV object to print out the best model based on performance on unseen data. Just like we did in the last chapter, we call cv.best_estimator_, and it tells us it's the model with a max depth of eight and 250 estimators. Lastly, we write out the fit model using joblib.dump: we call joblib.dump, tell it what we want to save out (our best estimator), and tell it where to save it. So we go up a couple of directories, save it to models, and call it RF_model.pkl. In the next chapter, we'll cover the last of our ensemble techniques: stacking.
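Continuing the sketch above, inspecting and saving the winning model might look like the following; the output path follows the transcript's description (up a couple of directories, into a models folder) and is otherwise an assumption.

```python
# Inspect and persist the best model found by the grid search above.
import joblib

print(cv.best_estimator_)   # e.g. RandomForestClassifier(max_depth=8, n_estimators=250)

# Save the fit model; adjust the relative path to match your own folder layout.
joblib.dump(cv.best_estimator_, '../../models/RF_model.pkl')
```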
 
 