
06

shannon. 2022. 7. 31. 18:03

What is stacking?

 
- [Instructor] Now let's talk about the last of the three ensemble techniques that we'll cover in this course, and that's stacking. Stacking is an ensemble method that creates one strong metamodel that's trained on the predictions of several independent base models. On the surface, that may sound awfully similar to boosting and bagging, and the reality is that these are similar, since they all fall under the umbrella of ensemble learners. So let's dig into the ways in which stacking is different from both boosting and bagging.

The first is that the base models and the metamodel are all trained on the same dataset, the full training set. Remember that in boosting and bagging, we use sampling techniques to generate the data for our base models to train on. In part, that's to make sure each model is different from the others and keying in on different trends in the data. We don't need to do that with stacking, because the base models can be different algorithms. So instead of sampling the data to allow our base models to find different trends in the data, we use different algorithms altogether, which will naturally find different trends or information in the data. The idea is that we could have one base model that's logistic regression, one that's a random forest, and so on, and each of those algorithms has certain strengths that will allow it to look at the data in different ways and find different trends.

Another difference is that in boosting and bagging, we typically have a large number of base models. In stacking, we usually have far fewer, maybe as few as three or four, compared to sometimes hundreds in boosting and bagging.

The last difference is that boosting and bagging use very naive metamodels. They usually just do some sort of averaging or voting over the weak models to make the final prediction. Stacking builds an actual model on top of the predictions from the base models to generate the final prediction. This allows the metamodel to learn which base models should be prioritized and weighted differently. Sometimes we also feed extra data into the metamodel in addition to the base model predictions. In all, the idea with stacking is that using different algorithms for the base models allows each of them to contribute certain unique insights based on the way each algorithm is optimized, and then the metamodel can learn on top of those unique insights to understand which models should be leaned on for the final prediction. This creates quite a powerful framework.

Let's look at a visual representation of this framework, but before doing that, let's quickly revisit what the framework for boosting looks like. Again, some things to notice here: these weak models are all trees, and they are intentionally very shallow, weak trees. The models are all trained on samples from the training data, and the models are all trained sequentially based on their need to learn from prior models. Now let's jump to bagging and notice that, again, these are all trees as well, and these trees are all trained on samples from the training data. Now let's get into what this looks like for stacking. We start with the full training set and then we train N models on top of that training set. Again, these can all be different algorithms, so the first could be logistic regression, then random forest, and so on. Each of these models generates a prediction.
It's worth calling out here that while these models are all trained on the same exact data, they can and will generate different predictions on individual examples, because model training is not always deterministic. You can train the same exact algorithm on the same exact data and get two slightly different answers. In addition to that, if we use different algorithms for these base models, each algorithm will handle the data in different ways, optimize in different ways, and so on, leading to different answers in some cases. Then we take all those predictions and feed them into a metamodel. It's worth noting that sometimes there will be some supplemental data fed into the metamodel on top of the predictions from the base models. This really just depends on the problem and whether the base models are fully able to capture all of the information required to make accurate predictions. Then finally, the metamodel is trained to generate a final prediction. This creates a really powerful framework where the base models can twist and turn the training data to look at it in different ways, and then we can combine all the outputs of those models into a single model that can learn how to weight those insights in making a final prediction.
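To make the framework concrete, here is a minimal, hand-rolled sketch of the idea on synthetic data. The dataset, the choice of base models, and the simple train/validation split are illustrative assumptions, not the course's code; note also that scikit-learn's StackingClassifier, covered later in this chapter, uses cross-validated predictions for the metamodel rather than the naive approach shown here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic placeholder data standing in for the full training set
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

# Train N base models (different algorithms) on the same full training set
base_models = [LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=42)]
for model in base_models:
    model.fit(X_train, y_train)

# Each base model generates a prediction; stack those predictions as features
meta_features_train = np.column_stack([m.predict_proba(X_train)[:, 1] for m in base_models])
meta_features_val = np.column_stack([m.predict_proba(X_val)[:, 1] for m in base_models])

# The metamodel learns how to weight the base models' predictions
meta_model = LogisticRegression()
meta_model.fit(meta_features_train, y_train)
print(meta_model.score(meta_features_val, y_val))
```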
 

How does stacking reduce overall error?

 
- [Instructor] Let's dig into how stacking reduces overall error. Just a reminder: boosting starts on the left side of this plot with simple, underfit base models with high bias and low variance. Then the process of boosting drives down the bias to achieve the optimal bias-variance trade-off. Bagging starts on the right side of this plot with more complex, overfit models with low bias and high variance. Then the process of bagging drives down the variance to achieve the optimal bias-variance trade-off. The way stacking reduces error is essentially the same as how bagging reduces error: we have complex base models, and we reduce the variance by giving them all a voice and combining them through a metamodel.

Let's revisit the diagram from the last video in order to better understand how stacking reduces error. Again, we'll be moving through this quickly since it's essentially the same as bagging. So we've trained N complex models on the training set, and all of these models represent our weak models, where the total error is driven by the high variance. Then, just like with bagging, stacking reduces the variance by taking all these complex, high-variance models and simply giving them each a voice. Remember, high variance means the models are most likely overfit to the training data. And because we use different algorithms for our base models, they're all different and likely overfit to different things in our training data. So by giving each a voice and allowing the metamodel to learn which models to rely on for which information, we reduce the variance, because the metamodel is not overfit to the training data in the way that our base models were. Stacking is a powerful tool that allows you to dig deep into the data and extract as much value as possible without overfitting. In the next lesson, we'll talk about when you should use stacking and when you should not use stacking.
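One rough way to see this effect for yourself is to compare cross-validated accuracy of each complex base model on its own against the stacked ensemble. This is a sketch on synthetic data with illustrative model choices; whether stacking actually wins depends on your problem.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic placeholder data
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=42)

# Two relatively complex, potentially high-variance base models
base_models = [('rf', RandomForestClassifier(random_state=42)),
               ('gb', GradientBoostingClassifier(random_state=42))]

# The stacked ensemble trains a logistic regression metamodel on the base models' predictions
stack = StackingClassifier(estimators=base_models, final_estimator=LogisticRegression())

# Compare each base model alone against the stacked ensemble on held-out folds
for name, model in base_models + [('stack', stack)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy={scores.mean():.3f}, std across folds={scores.std():.3f}")
```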
 

When should you consider using stacking?

 
- [Instructor] In this video, we're going to talk about when you should and should not use stacking. Just like we saw with the other two methods, stacking works quite well for problems with both categorical and continuous target variables. If you have a very complex, novel problem that you're trying to solve with really nuanced, complicated data, then stacking might be a good option for you, because you can tailor it so much to the problem that you're trying to solve, and you can allow different algorithms to try to extract the complicated trends in the data. This is actually the main reason you would use stacking: it's not terribly easy or fast to implement, but it can be really powerful.

On that note, let's talk about a few instances when you should not use stacking. You should not use stacking if you're looking for a quick benchmark model. We talked about bagging being a good option for a benchmark model because it's quick to implement and train and it's flexible in handling different data types. Implementing stacking can be a time-consuming process, as you're training and evaluating many different types of models, so consider looking elsewhere if you need a quick benchmark model. Just like boosting and bagging, stacking is also a black-box algorithm. If transparency is important, if you want to be able to easily explain the model and the decisions it's making, you shouldn't use stacking. As previously mentioned, it's really complicated. It can be difficult to explain the choices made even by a single model, let alone a bunch of models combined together with a new model trained on top of all those models. If transparency is important, look to one of the other tools you have. Lastly, stacked models can take a long time to train, and they can take a long time to make predictions with as well. That's because you're training so many complicated models and then training a new model on top of those models, as opposed to the simple averaging or voting we saw with both boosting and bagging. So if you need fast training or prediction, you may want to consider other options. Stacking is a tremendously powerful technique, but it should really only be used to solve very complex problems where all you care about is the performance of the model, regardless of how transparent it is or how long it takes to train.
 

What are examples of algorithms that use stacking?

 
- [Instructor] So let's talk about algorithms you can use that leverage stacking. For boosting, we talked about gradient boosting and adaptive boosting as two algorithms that leverage boosting. These two algorithms implement the sampling technique that allows the weak models to learn from the mistakes of the prior models. For bagging, we talked about random forest as the most popular algorithm that leverages bagging. Random forest implements the data and feature sampling required. So for both boosting and bagging, there are sampling techniques that are unique to those tools and that require an algorithm to execute the overall process.

For stacking, it's a little bit different. There's no sampling or training procedure that is unique to stacking. Stacking is much more general: it's just a framework where we set up our models in a certain way to extract as much value out of the data as possible. So there's not really a named algorithm like gradient boosting or random forest that implements stacking. With that said, implementing stacking manually can be quite time-consuming, since you have to train N models and then aggregate their predictions to train a metamodel on them. For that reason, there is a tool in scikit-learn that helps you set up your stacked framework. That tool is called StackingClassifier, in the ensemble package from scikit-learn. There's also a StackingRegressor tool if you're working on a regression problem.

So instead of talking about the specifics of a given algorithm like we did for boosting and bagging, we're going to talk about some of the parameters that'll be passed into StackingClassifier, and we'll highlight those parameters by referring to the stacking diagram we reviewed earlier in this chapter. The first parameter you'll pass in is called estimators, and this is a list of the algorithms you want to use for your base models. So you could pass in, say, logistic regression, random forest, and gradient boosting, and the StackingClassifier object would then train one logistic regression model on the training data, one random forest model on the training data, and one gradient boosting model on the training data. If you want three models, you pass in a list of three algorithms; if you want five models, you pass in a list of five algorithms. Now that you have your base models set, you have to define the algorithm you want to use for your metamodel, and you do that with an argument called final_estimator. This is the algorithm that will be used to train the metamodel on the predictions from the base models. The default here is logistic regression. Lastly, there's an argument called passthrough, which is just a boolean indicating whether you want to train the metamodel on only the base model predictions, or whether you also want to pass through some supplemental data from the training set. Now, those are not the only parameters in StackingClassifier, but they are the main parameters you'll need to know when constructing a stacked model in Python using scikit-learn's StackingClassifier. This context will be useful in the next few videos, as we move towards implementing an actual stacked model.
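Putting those three parameters together, a minimal sketch might look like this. The synthetic data and the specific base models are illustrative assumptions, not the course's example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (StackingClassifier, RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression

# Placeholder data just to make the sketch runnable
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# estimators: a list of (name, algorithm) tuples defining the base models
estimators = [
    ('lr', LogisticRegression(max_iter=1000)),
    ('rf', RandomForestClassifier()),
    ('gb', GradientBoostingClassifier()),
]

stack = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(),  # the metamodel; logistic regression is also the default
    passthrough=False,                     # False: metamodel sees only the base model predictions
)

stack.fit(X, y)
print(stack.score(X, y))
```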
 

Explore stacking algorithms in Python

 
- [Instructor] Just as we did for boosting and bagging previously, in this video we'll explore some of the key hyperparameters for the StackingClassifier in scikit-learn. Let's start by importing the StackingClassifier from sklearn.ensemble, and here you'll see one difference between this algorithm and the others we've looked at previously. Let's try to call the parameters in the same way we did with prior models, by calling the get_params method. You'll see we get an error here that basically says it expects an estimators argument to be passed into StackingClassifier. The GradientBoosting and RandomForest classifiers had no such requirement. Recall from the last video that the estimators argument is a list of algorithms to use for the base models. You can check out the docs linked above if you want to read more about this argument.

So let's pass in a list containing the two algorithms that we've already looked at earlier in this course. First we need to import the GradientBoostingClassifier and RandomForestClassifier. Then we define a list of estimators, and what this list actually is, is a list of tuples. The first entry in each tuple is what you want to name the model, and the second entry is the actual algorithm. Let's start with GradientBoosting and we'll just name it gb, like we did in prior chapters. Then we call the GradientBoostingClassifier and leave the parentheses empty, so it'll use the default settings for each hyperparameter. For our next model, we'll use RandomForest and name it rf, and we'll do the same thing, leaving the default hyperparameter settings. Then we just need to add the estimators argument to the StackingClassifier: estimators is the name of the hyperparameter, and we tell it we're passing in the list named estimators. Now we should be able to run this, and it'll give us our list of hyperparameter settings.

There are a lot of potential hyperparameters we could tune here, so let's start at the bottom and work our way up. You'll notice all of the arguments at the bottom of the list begin with the rf prefix. These are all the hyperparameters for the RandomForest model. So you aren't only tuning hyperparameters for the stacking algorithm, you're also tuning the hyperparameters for the base models, RandomForest in this case. If you think about it, we did the same thing with GradientBoosting and RandomForest: when we set the max depth for those two algorithms, that was setting a hyperparameter for the base model. It's just that now, with stacking, we're building on top of base models that have a lot more hyperparameters. So we see all these hyperparameters for RandomForest, and as we scroll up, you'll see all the ones with the gb prefix; these are all the hyperparameters for the GradientBoosting model. Then we get to the actual stacking hyperparameters, and we're going to focus on three hyperparameters here that we already went over in the last video. The first is estimators, which is just the list of tuples that defines the base models you want to use. You can include multiple models of the same type of algorithm with different hyperparameters, so we could have one RandomForest model with 50 trees and another with 150 if we wanted. The second hyperparameter is final_estimator. This is the metamodel that is trained on the output of all of the base models. Lastly is passthrough.
This parameter controls whether the original training data is passed into the metamodel, the final_estimator. If it's set to False, the metamodel will train only on the predictions from the base models. If it's set to True, the original training data is passed into the final_estimator in addition to the predictions from the base models.
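Pulled together, the steps described in this walkthrough look roughly like the following sketch, with the base models left at their default settings as in the video.

```python
from sklearn.ensemble import (StackingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)

# Calling StackingClassifier() with no arguments raises a TypeError,
# because estimators is a required argument.

# estimators is a list of (name, algorithm) tuples defining the base models
estimators = [
    ('gb', GradientBoostingClassifier()),  # default hyperparameter settings
    ('rf', RandomForestClassifier()),      # default hyperparameter settings
]

stack = StackingClassifier(estimators=estimators)

# Base model hyperparameters appear prefixed with the model name (e.g. gb__n_estimators,
# rf__max_depth), alongside the stacking-level parameters: estimators, final_estimator,
# and passthrough.
for name in stack.get_params():
    print(name)
```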
 

Implement a stacking model

 
- [Instructor] In this final video in the stacking chapter, we're going to build the best stacked model we can, using the same process we've gone through in prior chapters. Let's start by importing our packages. We have the same joblib package to save out our model, pandas to read in our data, and GridSearchCV from sklearn, which we've used in prior chapters to execute the grid search cross-validation. We're also going to import our StackingClassifier from sklearn, and then we have to import the objects that we'll use to fit our base models, so we're going to import GradientBoosting and RandomForest. Again, those will represent our base models, and we'll stick with those since we're pretty familiar with them from earlier in this course. Then we're also going to import LogisticRegression from sklearn.linear_model. That logistic regression will be our metamodel. If you want to learn more about logistic regression and its key hyperparameters, take a look at my algorithms course in this Applied Machine Learning series. Lastly, we'll read in the training features and the training labels. Then we'll run this cell with our helper function that calculates the average accuracy score and the standard deviation of the accuracy score across the five folds built into our cross-validation, for each hyperparameter combination.

Okay, that brings us to the GridSearchCV code. We've gone through this step already with gradient boosting and random forest, so we'll go through it fairly quickly this time. You'll remember that we instantiate a model object, then we create a dictionary of hyperparameter settings, and then GridSearchCV loops through the different hyperparameter combinations, fits a model for each, and finds the best one. We're doing the same thing here with very slight tweaks, only because our StackingClassifier has slightly different requirements. Let's start by creating our StackingClassifier object in the same way we did in the last video. We'll define our estimators, with one random forest model and one gradient boosting model, and we'll leave the parentheses empty because we want to define those hyperparameters in our parameters dictionary. So this is a very, very simple stacking model with only two base models. We create our StackingClassifier, and remember it requires us to pass in that list of estimators, so we assign that list to the estimators hyperparameter. Then let's call the get_params method to refresh our memory on the names of the parameters we want to tweak. In the interest of time, we're going to keep it really simple. We know there are a number of hyperparameters we could tweak for gradient boosting and random forest, but we're going to focus only on n_estimators; in other words, we'll only tweak the number of trees that they use. Those parameters are represented by gb__n_estimators and rf__n_estimators. So let's move on to our parameters dictionary and start populating the lists. We're going to test out 50 and 100 trees for gradient boosting and 50 and 100 trees for random forest. The next parameter we want to set is final_estimator. This will be our metamodel, and as I mentioned before, we're going to use logistic regression, so I'll pass in LogisticRegression. The key hyperparameter to tune for logistic regression is the C parameter, which controls the amount of regularization, or how closely it fits to the training data.
So we're going to try C equal to 0.1, and then we're just going to copy this down to create the other options we want to test out as part of the grid search. We'll copy it down one more time and change the C value from 0.1 to 1 and then 10. So again, for the final_estimator, our metamodel, we'll test out logistic regression with C equal to 0.1, C equal to 1, and C equal to 10. The last hyperparameter we want to define here is the passthrough hyperparameter. Again, this is the one that controls whether the metamodel fits only on the output of the base models or whether it also uses the original training data. We'll test out True, which means include the training data, and False, which means only include the output from the two base models. The rest of this code looks exactly the same as in prior chapters: grid search using our StackingClassifier object with five-fold cross-validation, then we call .fit with our training features and our labels, and then we print out the results using our print_results function.

So let's go ahead and run this. First of all, we see this warning pertaining to the logistic regression metamodel not converging in the given number of iterations, and it suggests scaling our data. We see so many of these warnings because one is returned each time GridSearchCV tries to fit a logistic regression model, meaning once for each hyperparameter combination. Logistic regression does not strictly require your data to be scaled, but it does perform a bit better when you scale the training data. So feel free to play around with scaling the data and exploring whether it improves your performance at all. In the interest of time, we're going to move forward with the unscaled data. If you want a refresher on how to scale your training data, check out my foundations course in this Applied Machine Learning series.

Now, looking at the results, we can see that the best results use a C value of 0.1 for our metamodel, with 50 trees for gradient boosting, 100 trees for random forest, and without passing any training data through, and this model generates an accuracy of 82.8%. Lastly, we just need to write out the fit model using joblib.dump. We pass in the best_estimator_ attribute of the GridSearchCV object, and then we tell it to write out to the model directory, saving it as stacked_model.pkl. Now, in the final chapter, we're going to review everything that we've learned so far and compare the best model from each of our three ensemble techniques on the validation data.
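For reference, the workflow described above might look something like the sketch below. The CSV file names, the exact print_results helper, and the output path are assumptions standing in for the course's files, not verbatim course code.

```python
import joblib
import pandas as pd
from sklearn.ensemble import (StackingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Placeholder paths: substitute the training files used in earlier chapters
tr_features = pd.read_csv('train_features.csv')
tr_labels = pd.read_csv('train_labels.csv')

def print_results(results):
    """Print mean accuracy and its spread across the 5 folds for each combination."""
    print('BEST PARAMS: {}\n'.format(results.best_params_))
    means = results.cv_results_['mean_test_score']
    stds = results.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, results.cv_results_['params']):
        print('{} (+/-{}) for {}'.format(round(mean, 3), round(std * 2, 3), params))

# Two base models; their hyperparameters are set through the parameters grid below
estimators = [
    ('gb', GradientBoostingClassifier()),
    ('rf', RandomForestClassifier()),
]
sc = StackingClassifier(estimators=estimators)

# Base model parameters are prefixed with the estimator name and a double underscore
parameters = {
    'gb__n_estimators': [50, 100],
    'rf__n_estimators': [50, 100],
    'final_estimator': [LogisticRegression(C=0.1),
                        LogisticRegression(C=1),
                        LogisticRegression(C=10)],
    'passthrough': [True, False],
}

cv = GridSearchCV(sc, parameters, cv=5)
cv.fit(tr_features, tr_labels.values.ravel())
print_results(cv)

# Save the best fitted model for comparison against the boosting and bagging models
joblib.dump(cv.best_estimator_, 'stacked_model.pkl')
```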
 
 