04

What is boosting?

- [Narrator] Let's dig into the first of three ensemble techniques that we'll be covering in this course. Boosting is an ensemble method that sequentially trains a number of weak models, often trees, to create one strong model. This sounds a lot like our general definition for ensemble learning, but there are two things that I want to emphasize here. The first is that boosting often leverages decision trees as its base models. And the second is that boosting really enforces that these base models are weak. Again, by weak model, I mean one that is only a little better than guessing. And boosting forces these base models to be weak by capping the maximum depth of the trees that make up those weak models. This is an effective way of capping the power of those base models. There's one critically important factor that makes boosting so powerful, and it separates it from most other ensemble learning techniques. That is that these weak models are trained sequentially, and what boosting does is evaluate each weak model to identify what it's getting wrong. And then, the next weak model focuses on that aspect of the data. So while each individual model is weak, together they're incredibly powerful, because they're intentionally trained to pick up on what the other models are missing. This is unique to boosting, and it makes it really powerful. Now, let's take a look at a diagram to see how boosting works. So you'll start with your training data and you take a sample of data from it. With that sample, you'll build a simple decision tree. And I do want to call out that this tree is very shallow. Again, this is because we want these individual base models to be weak learners, and a shallow tree is more likely to be a weak model. Then, the algorithm will evaluate the performance of that first model, and it will resample while overweighting the examples that were misclassified by the prior model.
Essentially, it assumes that this model has a good handle on any examples that are correctly classified, so the next model should focus on the ones that the first model couldn't quite figure out. So once this next sample is set with a higher proportion of examples that were misclassified by the prior model, it builds a new weak model that hopefully keys in on what the prior model missed. Then, it repeats this evaluation and sampling process over and over. So each model will attempt to learn from the mistakes of the prior models. So by the end, you have n relatively weak models, but as a whole, they've learned from the mistakes of all prior iterations, so they represent a very strong model. With that in mind, let's look at how boosting aggregates all of these individual models to generate one prediction. One thing I want to call out here is you cannot train these base models in parallel, because each model builds on the prior model. This is one of the main drawbacks of boosting: it's expensive to train due to its sequential nature. But in prediction, you can parallelize it, because all the individual trees or models are already built. So that's why you'll learn that boosting is very slow to train, but it's fast to make predictions. So we start with our n weak models that we built in training, and then, we'll feed in the first example from the test data and each weak model ends up generating a prediction. Now, remember in the last chapter, we talked about there being some metamodel to create a final prediction, and with bagging and boosting that metamodel is often something as simple as averaging or voting. For boosting, it's a weighted voting based on how well each individual base model performed in training. Remember, after each weak model was built in training, we evaluated it to figure out which examples it got wrong. That performance is then used for weighting these final votes, so the best individual models get more say in the final prediction.
Ultimately, the weighted vote ends up generating the final prediction. This model's ability to learn from its own mistakes and then calibrate the weighted voting based on the performance of each weak model makes this a tremendously powerful algorithm, and one of the most used algorithms in machine learning.
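The train, evaluate, reweight, and vote loop described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the course's code: it uses synthetic data from `make_classification` instead of the Titanic dataset, and it passes per-example weights to each stump through `sample_weight` rather than literally drawing a new oversampled dataset each round. The names `stumps`, `alphas`, and `predict` are made up for this sketch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; the course uses the Titanic dataset).
X, y01 = make_classification(n_samples=400, n_features=8, random_state=0)
y = np.where(y01 == 1, 1, -1)  # labels in {-1, +1} make the weighted vote simple

n_rounds = 50
weights = np.full(len(y), 1 / len(y))  # every example starts equally important
stumps, alphas = [], []

for _ in range(n_rounds):
    # Weak learner: a depth-1 tree, so each base model stays deliberately simple.
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=weights)
    pred = stump.predict(X)

    # Weighted training error of this weak model (clipped to avoid log(0)).
    err = np.clip(weights[pred != y].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # more accurate models get a bigger vote

    # Increase the weight of misclassified examples, decrease the rest.
    weights *= np.exp(-alpha * y * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)


def predict(X_new):
    """Weighted vote of all weak models (hypothetical helper for this sketch)."""
    votes = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
    return np.where(votes >= 0, 1, -1)


first_stump_acc = (stumps[0].predict(X) == y).mean()
ensemble_acc = (predict(X) == y).mean()
print(f"first stump: {first_stump_acc:.3f}, weighted ensemble: {ensemble_acc:.3f}")
```

Note that the training loop has to run in order, because each round's weights depend on the previous round, while the vote in `predict` only needs the already-built stumps.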

How does boosting reduce overall error?

- [Instructor] Let's dig into how boosting reduces overall error. In order to do that, let's quickly revisit this plot we looked at when we talked about the bias-variance trade-off. So just a reminder, total model error is roughly made up of model bias plus model variance. So as model complexity goes up, bias goes down but variance goes up, and vice versa. So there's this trade-off between bias and variance, and we're aiming to find the right trade-off to generate the optimal model with the lowest total error. Very simple models on the left side of this plot will be underfit to the data, and they'll have high bias and low variance, which will result in high total error. On the bottom of this slide, you can see that this results in a decision boundary that is extremely simple and may do a decent job splitting your classes, but it certainly could be better, and that's why we refer to it as underfit. We call the errors it's making here bias errors because the decision boundary is not complex enough to properly fit the trends in the data. On the right side of this plot we have very complex models that are overfit to the training data, and they have very low bias but high variance, resulting in high total error. The reason they have high variance is because they're so closely fit to the training data. Sometimes they're simply memorizing the training data. So any slight fluctuations in that training data will negatively impact the model predictions because it didn't fit to the trends in the data. It just memorized the training data. And that results in a decision boundary on the bottom right of this slide here that's very complex, and it's trying to classify every single training example correctly. Then in the middle we have a properly fit model with relatively low bias and low variance, resulting in low total error.
You can see at the bottom of this slide that this model will pick up on the signal in the data, but it will not overfit to the training data and try to properly classify every single point in the training data. This type of model is what we're aiming for. So just putting this all together, here's the full range of outcomes. So let's apply this to boosting. Remember we started with a data sample from the training data. Then we built a very simple model on that sample of training data. Again, often it's tree based. We know this model is simple because the tree is so shallow, and because the model is so simple, it typically means it's underfitting the data. Remember that means we'll be on the left side of this plot, where this first model has low variance but high bias. Because it's not complex enough to identify the trends in the data, it results in this extremely simple decision boundary that is too simple to properly classify a lot of the data in the training sample. Now recall that what boosting does that's so powerful is it evaluates the model to identify what it's getting wrong. Then for the next data sample, it oversamples the examples that the prior model got wrong. Remember we call these errors bias errors because they're errors due to the model being too simple to properly fit to the trends in the data. What that means is that this next model will focus on the areas that the prior model struggled with, these bias errors. So now when the second weak model is built, it reduces the bias of the overall model by just a little bit, because it's focusing on the bias errors from the prior model and building on that to make the overall model just a little bit more complex so that it makes fewer bias errors. Then we'll repeat this process n times, where we are sequentially building models that are learning from the mistakes of the prior models by oversampling the misclassified examples.
This results in a highly optimized model where we're taking N very simple high bias, low variance models and we maintain that low variance by keeping the base models very simple but through the process of boosting, we drive down the bias by allowing each sequential weak model to learn from the mistakes or the bias errors of the models before it. Now in the next video, we'll talk about when you should and should not be using boosting.
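One way to watch the bias shrink as weak models are added is to track the error after each boosting stage. The sketch below is an assumption-laden illustration: it uses synthetic data, and it uses scikit-learn's `GradientBoostingClassifier` with depth-1 trees (which fits each stage to the errors of the previous stages) rather than the oversampling scheme described in the video. The `staged_predict` method yields the ensemble's predictions after 1, 2, ..., n stages.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the course dataset).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# max_depth=1 keeps every base model a high-bias, low-variance stump.
gb = GradientBoostingClassifier(n_estimators=100, max_depth=1, random_state=1)
gb.fit(X_tr, y_tr)

# Error of the partial ensemble after each additional stump.
train_err = [(pred != y_tr).mean() for pred in gb.staged_predict(X_tr)]
test_err = [(pred != y_te).mean() for pred in gb.staged_predict(X_te)]

for n in (1, 10, 50, 100):
    print(
        f"{n:>3} stumps: train error {train_err[n - 1]:.3f}, "
        f"test error {test_err[n - 1]:.3f}"
    )
```

If the held-out error starts rising while the training error keeps falling, that is the overfitting tendency discussed later in this chapter.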

When should you consider using boosting?

- [Instructor] We've talked quite a bit in this chapter about the different attributes of boosting and how powerful it is. With that said, you shouldn't approach any machine learning problem with just a single hammer. You should understand which algorithms apply to which types of problems. So in this video, we're going to talk about when you should and should not use boosting. Boosting can be used for classification and regression problems, meaning it can be used to predict categorical outcomes and continuous outcomes. It's important to note that this is not true of all algorithms. This makes boosting quite flexible in terms of the types of problems it can be applied to. One of my favorite things about boosting is it's pretty much ready to go right out of the box. It can handle different data types for features. It handles missing values and so on. There's very little data handling that's really required before passing your raw data into a boosting algorithm. One of the other great things about boosting is that it has an attribute that lays out the importance of each feature you're using. Here's an example of what this looks like using the Titanic data set. The values on the x-axis aren't terribly important. It just helps you understand the relative importance of each feature. And this is really nice to understand the relationships and the power of each feature in your model. Now, we talked about this earlier in the chapter as well. But remember, while the training was sequential, the prediction for boosting was in parallel. And since each individual base model is really simple, this makes boosting quite fast at prediction time. So if prediction time is really important, boosting should be one of the first tools that you look at. For all the power and benefits that boosting brings to the table, it's not without its flaws. So when should we not use it? 
Boosting models contain so many base models under the hood that they can be really difficult to understand. While each individual weak model is quite simple, when you have tens or hundreds of them, it can be difficult to wrap your head around. For that reason, boosting models are not very transparent. It's not easy to understand why a model is making a certain prediction. So if transparency is critically important to the problem you're trying to solve, you might want to avoid boosting. In machine learning, we call this a black box model. You pass data in, it runs a bunch of operations inside the black box, then it outputs an answer, but it's not always totally clear what happened inside that black box to come up with this answer. The sequential nature of training the weak models, where each successive model learns from the mistakes of the previous models, makes this a really powerful algorithm, but that sequential nature also makes it really slow to train. So if you have limited time or compute power, boosting might not be the first option for you. One of the main issues with boosting is it has a tendency to overfit because it constantly learns from its own mistakes. By design, a model should not fit to noisy data or outliers. But if a model's always trying to fix its own mistakes, it might learn to fit to those outliers. So if your data is really noisy, I won't say that you can't use boosting, but you need to be careful and be aware of its tendency to try to fit to that noise. In summary, boosting is one of the most flexible, powerful tools out there. And honestly, you should consider it for almost every problem. With that said, you do need to consider that it'll take a long time to fit and you have to be careful of its tendency to overfit.
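The feature-importance attribute mentioned earlier in this video can be inspected roughly as follows. This is a sketch on synthetic data, so the feature names are hypothetical placeholders rather than the Titanic columns shown on the slide. In scikit-learn the attribute is `feature_importances_` on a fitted model.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data; the feature names below are hypothetical.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=2
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

gb = GradientBoostingClassifier(random_state=2).fit(X, y)

# Relative importance of each feature; the values are normalized to sum to 1,
# so only their relative size matters.
importances = pd.Series(gb.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```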

What are examples of algorithms that use boosting?

- [Narrator] In the last two videos in this chapter, we're going to actually write the code to implement a boosting model. To prepare for that, let's discuss two commonly used machine learning algorithms that leverage boosting, adaptive boosting and gradient boosting. These two algorithms are very similar, but in this video we're going to focus on a few of the differences that make them distinct algorithms. It's worth noting that the diagrams we've been looking at earlier in this chapter are based on adaptive boosting, but would only require a couple very small tweaks to apply them to gradient boosting. It's also worth noting we'll be going into pretty deep technical detail in this lesson, but it's not really necessary to fully understand all of these differences to be able to implement these algorithms. We could spend an entire course on these technical details. So if you find a particular detail or difference interesting, feel free to continue your own research. The first difference between these two algorithms is how they directly reduce bias. As we talked about earlier in this chapter, adaptive boosting reduces bias by oversampling misclassified examples from prior weak models. Gradient boosting does this by training on the actual errors or residuals rather than oversampling the misclassified examples. The second difference is that adaptive boosting can technically accept a variety of algorithms as its base models. Though it's worth noting that it still does usually use trees as we saw earlier in this chapter. You'll see when we move to the implementation stage that adaptive boosting accepts a base estimator parameter that allows you to dictate which algorithm you want to use for the weak learners. In contrast, gradient boosting always uses decision trees. 
Earlier in this chapter, we talked about how our metamodel, or our strong model, makes the final prediction using weighted voting, where each weak model is weighted based on how well it performed in evaluation. That's how adaptive boosting optimizes the strong learner. Gradient boosting optimizes a little bit differently by using gradient descent optimization, which is where it gets its name from. The last difference we'll discuss is the loss function. Every machine learning algorithm has some loss function that it seeks to optimize to achieve the best model. Some loss functions are better than others for certain types of problems. Adaptive boosting uses an exponential loss function, whereas gradient boosting is a little bit more flexible and accepts either an exponential or a deviance loss. One final note: you may have heard of XGBoost, as that's another popular algorithm. XGBoost stands for extreme gradient boosting and it's a speed-optimized implementation of gradient boosting. So you can assume most of the details are the same as gradient boosting. Hopefully, so far this chapter has given you a good idea of what boosting is, how it reduces errors, when you should use it, and the different types of algorithms that use boosting. In the next video, we're going to explore these boosting algorithms in Python.
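To make "training on the residuals" concrete, here is a minimal hand-rolled sketch of gradient boosting for a regression problem. It assumes a squared-error loss, for which the negative gradient is simply the residual, and it uses made-up synthetic data; it is not how scikit-learn implements its classifier. Each shallow tree is fit to what the current ensemble still gets wrong, and its prediction is added with a small learning rate.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up one-dimensional regression data for illustration.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
n_rounds = 100

prediction = np.full_like(y, y.mean())  # start from a constant model
trees = []
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(n_rounds):
    # For squared error, the negative gradient is the residual.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)

    # Take a small step in the direction that reduces the remaining error.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"MSE before boosting: {mse_history[0]:.4f}, after: {mse_history[-1]:.4f}")
```

A side note on naming: in recent scikit-learn releases the deviance loss of `GradientBoostingClassifier` is called `log_loss`, so the option you see may depend on your installed version.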

Explore boosting algorithms in Python

- [Instructor] In this video, we're going to import the gradient boosting and adaptive boosting classifiers in Python and explore some of the key hyperparameters to tune. We're only looking at classifiers here since our Titanic dataset that we'll be working with is a classification problem, but there are also equivalent tools in Python for regression. So let's start by importing both the gradient boosting classifier and the adaptive boosting classifier. We'll do that by importing them from sklearn.ensemble. We'll say import GradientBoostingClassifier and adaptive boosting, which is named AdaBoostClassifier. And then, we're going to start by exploring the hyperparameters for gradient boosting, and then, we'll look at adaptive boosting in just a minute. So we can view all available hyperparameters by calling the get_params method. So I'm just going to copy down this GradientBoostingClassifier, add open and close parentheses, and then call the get_params method, again with parentheses. Now there are a lot of hyperparameters that we could explore here and I encourage you to take a look at those in the documentation linked up above in this notebook. In the interest of time, I'm just going to focus on some of the most important hyperparameters. Now remember, gradient boosting is made up of a lot of really shallow decision trees. So this n_estimators parameter controls the number of trees that the algorithm should create. The default is set to 100. And then max_depth is the maximum depth of those trees. And the default is set to three, which is quite shallow. So by default, gradient boosting would build 100 trees of depth three that would represent its base models. Lastly, there's a learning_rate parameter that effectively controls how quickly the algorithm attempts to optimize. If learning_rate is really low, it may take longer for the algorithm to fit and it may not end up finding the optimal model.
It may get stuck in what's called a local minimum, meaning it finds a pretty good answer, but not the best answer, which would be called the global minimum. On the other side, if you set learning_rate really high, the model may fit quicker, but again, you run the risk of finding a suboptimal model. So in other words, learning_rate allows you to control the balance between the time to fit and how well the model fits. So now let's take a look at the hyperparameters for the AdaBoostClassifier by calling that same get_params method. So let's copy down our classifier that we imported, add open and close parentheses, and then we'll call the get_params method. So you'll see right off the bat that there's a much smaller set of hyperparameters here. So let's start with algorithm. This is the actual boosting algorithm used and the default is the true boosting algorithm that we already talked about and you'll want to leave it as that. The base_estimator refers to the type of weak models that you'll use. Remember, one of the differences between gradient boosting and adaptive boosting is that gradient boosting requires the use of decision trees for the base models, but you don't need to use decision trees in adaptive boosting. You can use, for the most part, whatever algorithm you want. Then learning_rate is the same as it is for gradient boosting. It controls the trade-off between the time to fit and how well the model fits. And n_estimators is the number of base or weak models that you want. By default, it's set to build 50 of whatever model you declare in base_estimator. In the next and final video in this chapter, we're going to actually fit a gradient boosting model.
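The exploration described in this video boils down to a few calls. This sketch prints only the hyperparameters discussed above. The exact set of keys returned by `get_params` depends on the installed scikit-learn version (for example, the base estimator parameter of `AdaBoostClassifier` has been renamed across releases), so treat the full listing as version-specific.

```python
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

# get_params() returns a dict of every hyperparameter and its current value.
gb_params = GradientBoostingClassifier().get_params()
ada_params = AdaBoostClassifier().get_params()

# The three gradient boosting hyperparameters highlighted in the video.
for name in ("n_estimators", "max_depth", "learning_rate"):
    print(f"GradientBoostingClassifier {name} = {gb_params[name]}")

# Adaptive boosting exposes a much shorter list of hyperparameters.
print("AdaBoostClassifier hyperparameters:", sorted(ada_params))
print("AdaBoostClassifier n_estimators =", ada_params["n_estimators"])
```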

Implement a boosting model

- [Instructor] In this final video of this chapter, we'll fit the best gradient boosting model we can using GridSearchCV to tune three key hyperparameters. If you want to learn more about GridSearchCV, you should take my Applied Machine Learning: Foundations course, which talks more about the proper framework to fit and evaluate models. In short, GridSearchCV allows us to easily search through a number of different hyperparameter setting combinations to find the one that generates the best performance on unseen data. So we'll find the best boosting model we can in this video. Then we'll save that fit model. Then we'll do the same in the bagging and stacking chapters later on. Then in the final chapter of this course, we'll compare the best boosting, bagging, and stacking models against each other on the validation set to see which model performs best. So let's jump into the code. We're going to start by importing a few packages. Joblib will help us save our fit model at the end of this video. Pandas will help read our data into a data frame. And then, GradientBoostingClassifier and GridSearchCV will help us fit and evaluate a model from scikit-learn. Lastly, we'll just read in our training features and our labels that we created earlier in this course. Now, the GridSearchCV method stores a lot of information about model performance, but it can be kind of difficult to pick through to find what you need. So I wrote a quick little function here for us to use to print out the results a little more cleanly. Effectively, what this function does is for every hyperparameter combination that we test, it will print out the average accuracy score, and standard deviation of that accuracy score, across the five folds built into our cross-validation. This will give us the information we need to compare performance for each hyperparameter combination, to then select the best one. So we'll run that cell, and move on to the actual GridSearchCV code.
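The transcript does not show the helper function itself, so the following is a reconstruction under assumptions rather than the instructor's exact code. It reads the mean and standard deviation of the cross-validated accuracy from the `cv_results_` attribute of a fitted `GridSearchCV` object. The tiny demonstration grid and synthetic data at the bottom are only there to make the sketch runnable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV


def print_results(results):
    """Print mean accuracy and its standard deviation per hyperparameter combination."""
    print(f"BEST PARAMS: {results.best_params_}\n")
    means = results.cv_results_["mean_test_score"]
    stds = results.cv_results_["std_test_score"]
    for mean, std, params in zip(means, stds, results.cv_results_["params"]):
        print(f"{mean:.3f} (+/- {std:.3f}) for {params}")


# Small demonstration on synthetic data (an assumption, not the course data).
X, y = make_classification(n_samples=200, random_state=3)
demo_cv = GridSearchCV(
    GradientBoostingClassifier(random_state=3),
    {"n_estimators": [5, 50]},
    cv=5,
)
demo_cv.fit(X, y)
print_results(demo_cv)
```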
So a quick reminder of the three key hyperparameters that we'll be looking at. And you can find these three hyperparameters here in this parameters dictionary. First up, the number of estimators simply represents how many individual decision trees will be built. And those decision trees represent our weak models. Max_depth dictates how deep each of those individual trees can go. Lastly, learning_rate controls how quickly this algorithm will try to find the optimal model. Too large, and it'll never find the optimal solution. Too small, and it may also not find the optimal solution. And even if it does, it may take a long time to do so. So let's get into it. We'll start by calling the GradientBoostingClassifier object, and we'll store it as gb. If we want to hardcode any hyperparameter values, then we would enter them in these parentheses. Otherwise, GradientBoostingClassifier will just use the defaults, which we saw in the last lesson. We don't want to hardcode any values at this moment because we want to let GridSearchCV test different hyperparameter settings. So now the first thing we need to do for GridSearchCV is define our parameters dictionary. So again, the hyperparameters we want to tune are n_estimators, max_depth, and learning_rate. And all the other hyperparameters that we saw in the last video will just be set to their default values. So now we just need to populate these lists with the different settings that we want to test out. So for number of estimators, we'll test out 5 trees, 50 trees, 250 trees, and 500 trees. Then for max_depth, we'll test out a depth of 1, which is called a decision stump, 3, 5, 7, and 9. Lastly, for learning_rate, we'll test out 0.01, 0.1, 1, 10, and 100. So now that we have our model object established, and our parameters dictionary, we can then call GridSearchCV. Then we'll pass in our model object, and then we'll pass in our parameters dictionary.
And the last thing we need to do is we need to tell GridSearchCV how many folds to use as part of the cross-validation. We're just going to do five folds. Then we'll store this object as cv. And then just as we do for every scikit-learn object, we need to fit it. So we'll call the .fit method, and then we'll pass in our features and our labels. So just as a reminder, what this will do is it'll fit a model with each hyperparameter combination, and then it'll evaluate them to see which is the best one. One final note here, tr_labels is stored as a column vector type, but what scikit-learn really wants it to be is an array. So we're just going to convert it from a column vector to an array using .values.ravel. Then the last thing we're going to do is we're going to use our print results function to show how each model performed. So we'll just copy that down. And then we just need to pass in our GridSearchCV fit object. Now we can run that cell. Before we dive into the results, I want to note two things here. The first is that these results are still on unseen data. The cross-validation built into GridSearchCV splits the training data that we fed into it into k parts. It trains the model on k - 1 parts, and then it evaluates the model on the last chunk of data that it was not trained on. Secondly, if you're running this code along with me at home, you might have a different training set than I do. And there's some randomization built into some of these algorithms. So it's very possible, and actually, even likely, that you'll get different results than I do. In fact, I could even run this cell again and get slightly different results. Now, there are a lot of hyperparameter combinations here. Remember, we tested four levels of the number of estimators, five levels of max_depth, and five levels of learning_rate. So that makes for a total of 100 hyperparameter combinations, each of which was fit and evaluated across the five folds.
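The grid search setup described above can be sketched as follows. The `parameters` dictionary holds the values named in the video, and `ParameterGrid` is used only to count the combinations. Because fitting the full grid is slow, this sketch fits a reduced grid (`small_parameters`, a name introduced here) on synthetic data; both are assumptions made to keep the example quick and self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid

# The grid described in the video: 4 x 5 x 5 settings.
parameters = {
    "n_estimators": [5, 50, 250, 500],
    "max_depth": [1, 3, 5, 7, 9],
    "learning_rate": [0.01, 0.1, 1, 10, 100],
}
n_combinations = len(ParameterGrid(parameters))
print(n_combinations, "combinations,", n_combinations * 5, "fits with 5 folds")

# Reduced grid and synthetic data so the sketch runs in seconds (assumptions).
small_parameters = {
    "n_estimators": [5, 50],
    "max_depth": [1, 3],
    "learning_rate": [0.01, 0.1],
}
X, y = make_classification(n_samples=300, n_features=8, random_state=4)

gb = GradientBoostingClassifier(random_state=4)
cv = GridSearchCV(gb, small_parameters, cv=5)

# With the course data the labels would be flattened first, as described
# above, by passing tr_labels.values.ravel() as the second argument.
cv.fit(X, y)

print(cv.best_params_, round(cv.best_score_, 3))
```

Setting `random_state` on the classifier is one way to reduce the run-to-run variation mentioned above, although results will still differ from the video because the data is different.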
So scanning through these results, we'll find that the best model is the one with a learning_rate of 0.01, a max_depth of 3, and 500 total estimators. And that combination is generating an accuracy of 82.6%. Now, feel free to scan through these results yourself. Two things I'll point out though. The high learning_rate is generating really poor results across the board. And that's indicating that it's kind of jumping across the loss curve too quickly, and not finding the optimal model. Secondly, models with 5 estimators are pretty consistently the worst model. But depending on other settings, we have 50, 250, and 500 estimators that are all generating fairly good models. So again, feel free to dig through these results yourself. Take your time. See what other insights you may be able to pull out, and that will help build your intuition for future model builds. Next thing we're going to do is we're going to call the best_estimator_ attribute from our fit GridSearchCV object. So we'll call cv and best_estimator_. Don't forget the trailing underscore. And what this returns is the fit model that performed best on unseen data. So again, you'll notice it says learning_rate equal to 0.01, and 500 estimators. You'll notice it's not telling us what the optimal max_depth is. The reason it's not is because, as we just discussed, the max_depth for the optimal model is 3. And if you recall from our last video, the default value for max_depth is 3. So it's just leaving it out here because it's set to the default. All right. Lastly, let's write this model out so that we can compare it to the other models in the last chapter of this course. We'll do that by calling joblib.dump. And then we're going to pass in the cv.best_estimator_. So that's telling it what we want to write out. And then lastly, let's tell it where to write out. So we need to go up a few levels in our directory, then save it into that models folder. And we'll name it GB_model.pkl. 
So it's important to remember that this is saving your model that has been fit on training data. So once it's saved, then we can read it back into Python and immediately start making predictions on data it's never seen before. So hopefully this chapter has given you a pretty good grasp on boosting, and how to implement it in Python. In the next chapter, we're going to take a look at a different type of ensemble learning, called bagging.
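Saving and reloading a fitted model with joblib looks roughly like this. The video writes `cv.best_estimator_` to a models folder whose exact path is not given in the transcript, so this sketch writes `GB_model.pkl` into a temporary directory instead and uses a small model fit on synthetic data as a stand-in.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Stand-in for cv.best_estimator_ (an assumption for this sketch).
X, y = make_classification(n_samples=200, random_state=5)
model = GradientBoostingClassifier(n_estimators=50, random_state=5).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "GB_model.pkl")
    joblib.dump(model, model_path)      # write the fitted model to disk
    restored = joblib.load(model_path)  # read it back in

# The restored model is already fit, so it can predict immediately.
same_predictions = (restored.predict(X) == model.predict(X)).all()
print("restored model matches original:", same_predictions)
```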
