quantificationlib.multiclass.regression module¶
- class REG(bag_generator=<quantificationlib.bag_generator.PriorShift_BagGenerator object>, n_bins=8, bin_strategy='equal_width', regression_estimator=None, verbose=0, **kwargs)[source]¶
Bases:
object
REG base class for REGX and REGy
The idea of these quantifiers is to learn a regression model able to predict the prevalences. To learn said regression model, these objects generate a training set of bags of examples using a selected kind of shift (prior probability shift, covariate shift, or a mix of both). The training set contains a collection of pairs (PDF, prevalences) in which each pair is obtained from a bag of examples; the PDF tries to capture the distribution of the bag.
- Parameters:
bag_generator (BagGenerator object (default=PriorShift_BagGenerator())) – Object to generate the bags with a selected shift
n_bins (int (default=8)) – Number of bins to compute the PDF of each distribution
bin_strategy (str (default='equal_width')) –
- Method to compute the boundaries of the bins:
’equal_width’: bins of equal length (it could be affected by outliers)
’equal_count’: bins of equal counts (considering the examples of all classes)
- ’binormal’: (Only for binary quantification)
It is inspired by the method devised by (Tasche, 2019, Eq (A16b)). The cut points, \(-\infty < c_1 < \ldots < c_{b-1} < \infty\), are computed as follows, based on the assumption that the features follow a normal distribution:
\(c_i = \frac{\sigma^+ + \sigma^{-}}{2} \ \Phi^{-1}\bigg(\frac{i}{b}\bigg) + \frac{\mu^+ + \mu^{-}}{2} , \quad i=1,\ldots,b-1\)
where \(\Phi^{-1}\) is the quantile function of the standard normal distribution, and \(\mu\) and \(\sigma\) of the normal distribution are estimated as the average of those values for the training examples of each class.
- ’normal’: The idea is that each feature follows a normal distribution. \(\mu\) and \(\sigma\) are estimated as the weighted mean and standard deviation from the training distribution. The cut points \(-\infty < c_1 < \ldots < c_{b-1} < \infty\) are computed as follows (a minimal sketch of this computation appears after the parameter list):
\(c_i = \sigma \ \Phi^{-1}\bigg(\frac{i}{b}\bigg) + \mu , \quad i=1,\ldots,b-1\)
regression_estimator (estimator object (default=None)) – A regression estimator object. If the value is None, the regression estimator used is a Generalized Linear Model (GLM) from the statsmodels package with logit link and Binomial family as parameters (see Baum 2008). It is used to learn a regression model able to predict the prevalence for each class, so the method fits as many regression estimators as there are classes in multiclass problems and just one for binary problems.
verbose (int, optional, (default=0)) – The verbosity level. The default value, zero, means silent mode
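The 'normal' strategy above has a direct numerical reading. The following is a minimal sketch of that formula only (it is not the library's compute_bincuts implementation, and it assumes a single feature with unweighted estimates of \(\mu\) and \(\sigma\)):

import numpy as np
from scipy.stats import norm

def normal_bincuts(x, n_bins=8):
    # Cut points c_i = sigma * Phi^{-1}(i / b) + mu, i = 1, ..., b - 1,
    # padded with -inf and +inf so that every value falls into some bin.
    mu, sigma = np.mean(x), np.std(x)
    inner = mu + sigma * norm.ppf(np.arange(1, n_bins) / n_bins)
    return np.concatenate(([-np.inf], inner, [np.inf]))

x = np.random.default_rng(0).normal(loc=2.0, scale=1.5, size=1000)
print(normal_bincuts(x, n_bins=8))   # 9 boundaries defining 8 bins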
- bag_generator¶
Object to generate the bags with a selected shift
- Type:
BagGenerator object
- n_bins¶
Number of bins to compute the PDF of each distribution
- Type:
int
- bin_strategy¶
Method to compute the boundaries of the bins
- Type:
str
- regression_estimator¶
A regression estimator object
- Type:
estimator object, None
- verbose¶
The verbosity level
- Type:
int
- dataX_¶
X data for training REGX/REGy’s regressor model. Each row corresponds to the collection of histograms (one per input feature) of the corresponding bag
- Type:
array-like, shape (n_bags, n_features * n_bins)
- dataY_¶
Y data for training REGX/REGy’s regressor model. Each value corresponds to the prevalences of the corresponding bag
- Type:
array-like, shape(n_bags, n_classes)
- bincuts_¶
Bin cuts for each feature
- Type:
ndarray, shape (n_features, n_bins + 1)
- estimators_¶
It stores the estimators. For multiclass problems, the method learns an individual estimator for each class
- Type:
array of estimators, shape (n_classes, ) for multiclass problems, (1, ) for binary quantification
- models_¶
This is the fitted regressor model for each class. It is needed when regression_estimator is None and GLM models are used (these objects do not store the fitted model).
- Type:
array of models, i.e., fitted estimators, shape (n_classes, )
- n_classes_¶
The number of classes
- Type:
int
References
Christopher F. Baum: Stata tip 63: Modeling proportions. The Stata Journal 8.2 (2008): 299-303
- create_training_set_of_distributions(X, y, att_range=None)[source]¶
Create a training set for REG objects. Each example corresponds to a histogram of a bag of examples generated from (X, y). The size of the complete histogram is n_features * n_bins, because it is formed by concatenating the histograms of each input feature. This method computes the values for the dataX_, dataY_ and bincuts_ attributes (a sketch of the histogram construction follows the parameter list).
- Parameters:
X (array-like, shape (n_examples, n_features)) – Data
y (array-like, shape (n_examples, )) – True classes
att_range (array-like, (2,1)) – Min and Max possible values of the input feature x. These values might not coincide with the actual Min and Max values of vector x. For instance, when x represents a set of probabilistic predictions, these values will be 0 and 1. These values may be needed by the compute_bincuts function
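A minimal sketch of the representation described above (a hypothetical helper, not the library's own code): one relative-frequency histogram per input feature, computed with the stored bin cuts and then concatenated into a single row of length n_features * n_bins:

import numpy as np

def bag_histogram(X_bag, bincuts):
    # X_bag: (n_examples, n_features) bag of examples
    # bincuts: (n_features, n_bins + 1) boundaries, as stored in bincuts_
    n_examples, n_features = X_bag.shape
    n_bins = bincuts.shape[1] - 1
    pieces = []
    for j in range(n_features):
        # np.digitize against the inner cut points assigns each value a bin in [0, n_bins - 1]
        idx = np.digitize(X_bag[:, j], bincuts[j][1:-1])
        counts = np.bincount(idx, minlength=n_bins)
        pieces.append(counts / n_examples)
    return np.concatenate(pieces)   # length n_features * n_bins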
- predict_bag(bagX)[source]¶
This method makes a prediction for a testing bag represented by its PDF (the bagX parameter); a sketch follows the return value.
- Parameters:
bagX (array-like, shape (n_bins * n_classes, ) for REGy and (n_bins * n_features, ) for REGX) – Testing bag’s PDF
- Returns:
prevalences – Contains the predicted prevalence for each class
- Return type:
ndarray, shape(n_classes, )
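A rough sketch of the idea behind predict_bag, under the assumption (not confirmed by this page) that each fitted per-class regressor maps the bag's PDF to a raw prevalence that is then clipped and renormalized:

import numpy as np

def predict_bag_sketch(models, bagX):
    # models: one fitted regressor per class (see estimators_ / models_ above)
    raw = np.array([m.predict(bagX.reshape(1, -1))[0] for m in models])
    raw = np.clip(raw, 0.0, 1.0)           # prevalences must lie in [0, 1]
    total = raw.sum()
    return raw / total if total > 0 else np.full(len(raw), 1.0 / len(raw))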
- class REGX(bag_generator=<quantificationlib.bag_generator.PriorShift_BagGenerator object>, n_bins=8, bin_strategy='normal', regression_estimator=None, verbose=False)[source]¶
Bases:
WithoutClassifiers, REG
The idea is to learn a regression model able to predict the prevalences given a PDF distribution. In this case, the distributions are represented using PDFs of the input features (X). To learn such a regression model, this object generates a training set of bags of examples using a selected kind of shift (prior probability shift, covariate shift, or a mix of both)
- Parameters:
bag_generator (BagGenerator object (default=PriorShift_BagGenerator())) – Object to generate the bags with a selected shift
n_bins (int (default=8)) – Number of bins to compute the PDF of each distribution
bin_strategy (str (default='normal')) –
- Method to compute the boundaries of the bins:
’equal_width’: bins of equal length (it could be affected by outliers)
’equal_count’: bins of equal counts (considering the examples of all classes)
- ’binormal’: (Only for binary quantification)
It is inspired by the method devised by (Tasche, 2019, Eq (A16b)). The cut points, \(-\infty < c_1 < \ldots < c_{b-1} < \infty\), are computed as follows, based on the assumption that the features follow a normal distribution:
\(c_i = \frac{\sigma^+ + \sigma^{-}}{2} \ \Phi^{-1}\bigg(\frac{i}{b}\bigg) + \frac{\mu^+ + \mu^{-}}{2} , \quad i=1,\ldots,b-1\)
where \(\Phi^{-1}\) is the quantile function of the standard normal distribution, and \(\mu\) and \(\sigma\) of the normal distribution are estimated as the average of those values for the training examples of each class.
- ’normal’: The idea is that each feature follows a normal distribution. \(\mu\) and \(\sigma\) are estimated as the weighted mean and standard deviation from the training distribution. The cut points \(-\infty < c_1 < \ldots < c_{b-1} < \infty\) are computed as follows:
\(c_i = \sigma \ \Phi^{-1}\bigg(\frac{i}{b}\bigg) + \mu , \quad i=1,\ldots,b-1\)
regression_estimator (estimator object (default=None)) – A regression estimator object. If the value is None, the regression estimator used is a Generalized Linear Model (GLM) from the statsmodels package with logit link and Binomial family as parameters (see Baum 2008). It is used to learn a regression model able to predict the prevalence for each class, so the method fits as many regression estimators as there are classes in multiclass problems and just one for binary problems (a minimal sketch of the default GLM follows this parameter list).
verbose (int, optional, (default=0)) – The verbosity level. The default value, zero, means silent mode
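As noted above, when regression_estimator is None the default model is a statsmodels GLM with Binomial family and logit link, fitted on proportions (Baum, 2008). The snippet below is only an illustration of that kind of model on dataX_/dataY_-shaped arrays, not the exact call the library performs; the data here are random placeholders:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
dataX = rng.random((200, 16))        # e.g. 200 bags, 2 features x 8 bins
dataY = rng.random(200)              # prevalence of one class, in [0, 1]

exog = sm.add_constant(dataX)                              # intercept column
glm = sm.GLM(dataY, exog, family=sm.families.Binomial())   # logit is the default link
results = glm.fit()
print(results.predict(exog[:1]))     # predicted prevalence for the first bag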
- bag_generator¶
Object to generate the bags with a selected shift
- Type:
BagGenerator object
- n_bins¶
Number of bins to compute the PDF of each distribution
- Type:
int
- bin_strategy¶
Method to compute the boundaries of the bins
- Type:
str
- regression_estimator¶
A regression estimator object
- Type:
estimator object, None
- verbose¶
The verbosity level
- Type:
int
- dataX_¶
X data for training REGX’s regressor model. Each row corresponds to the collection of histograms (one per input feature) of the corresponding bag
- Type:
array-like, shape (n_bags, n_features * n_bins)
- dataY_¶
Y data for training REGX’s regressor model. Each value corresponds to the prevalences of the corresponding bag
- Type:
array-like, shape(n_bags, n_classes)
- bincuts_¶
Bin cuts for each feature
- Type:
ndarray, shape (n_features, n_bins + 1)
- estimators_¶
It stores the estimators. For multiclass problems, the method learns an individual estimator for each class
- Type:
array of estimators, shape (n_classes, ) for multiclass problems, (1, ) for binary quantification
- models_¶
This is the fitted regressor model for each class. It is needed when regression_estimator is None and GLM models are used (these objects do not store the fitted model).
- Type:
array of models, i.e., fitted estimators, shape (n_classes, )
- n_classes_¶
The number of classes
- Type:
int
References
Christopher F. Baum: Stata tip 63: Modeling proportions. The Stata Journal 8.2 (2008): 299-303
- fit(X, y)[source]¶
This method has two steps: 1) it computes a training dataset formed by a collection of bags of examples (using create_training_set_of_distributions), and 2) it trains a regression model over said training set by calling fit_regressor, an inherited method from the REG base class
- Parameters:
X (array-like, shape (n_examples, n_features)) – Data
y (array-like, shape (n_examples, )) – True classes
- predict(X)[source]¶
This method computes the histogram for the testing set X, using the bincuts for each input feature computed by the fit method, and then makes a prediction by applying the regression model via the inherited method predict_bag
- Parameters:
X (array-like, shape (n_examples, n_features)) – Testing bag
- Returns:
prevalences – Contains the predicted prevalence for each class
- Return type:
ndarray, shape(n_classes, )
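A minimal usage sketch for REGX; the dataset and parameter values are illustrative, and the import path follows this module's name:

from sklearn.datasets import make_classification
from quantificationlib.multiclass.regression import REGX

# Illustrative data: X_test plays the role of the testing bag whose
# prevalences we want to estimate
X_train, y_train = make_classification(n_samples=500, n_features=10, random_state=0)
X_test, _ = make_classification(n_samples=200, n_features=10, random_state=1)

regx = REGX(n_bins=8, bin_strategy='normal')
regx.fit(X_train, y_train)
prevalences = regx.predict(X_test)    # ndarray, shape (n_classes, )
print(prevalences)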
- class REGy(estimator_train=None, estimator_test=None, bag_generator=<quantificationlib.bag_generator.PriorShift_BagGenerator object>, n_bins=8, bin_strategy='equal_width', regression_estimator=None, verbose=False)[source]¶
Bases:
UsingClassifiers, REG
The idea is to learn a regression model able to predict the prevalences given a PDF distribution. In this case, the distributions are represented using PDFs of the predictions (y) from a classifier. To learn such a regression model, this object first trains a classifier using all data and then generates a training set of bags of examples (in this case, the predictions of each example) using a selected kind of shift (prior probability shift, covariate shift, or a mix of both)
- Parameters:
estimator_train (estimator object (default=None)) – An estimator object implementing fit and predict_proba. It is used to train a classifier using the examples of the training set. This classifier is used to obtain the predictions for the training examples and to compute the PDF of each class individually using such predictions
estimator_test (estimator object (default=None)) – An estimator object implementing fit and predict_proba. It is used to classify the examples of the testing set and to compute the distribution of the whole testing set
bag_generator (BagGenerator object (default=PriorShift_BagGenerator())) – Object to generate the bags with a selected shift
n_bins (int (default=8)) – Number of bins to compute the PDF of each distribution
bin_strategy (str (default='equal_width')) –
- Method to compute the boundaries of the bins:
’equal_width’: bins of equal length (it could be affected by outliers)
’equal_count’: bins of equal counts (considering the examples of all classes)
- ’binormal’: (Only for binary quantification)
It is inspired by the method devised by (Tasche, 2019, Eq (A16b)). The cut points, \(-\infty < c_1 < \ldots < c_{b-1} < \infty\), are computed as follows, based on the assumption that the features follow a normal distribution:
\(c_i = \frac{\sigma^+ + \sigma^{-}}{2} \ \Phi^{-1}\bigg(\frac{i}{b}\bigg) + \frac{\mu^+ + \mu^{-}}{2} , \quad i=1,\ldots,b-1\)
where \(\Phi^{-1}\) is the quantile function of the standard normal distribution, and \(\mu\) and \(\sigma\) of the normal distribution are estimated as the average of those values for the training examples of each class.
- ’normal’: The idea is that each feature follows a normal distribution. \(\mu\) and \(\sigma\) are estimated as the weighted mean and standard deviation from the training distribution. The cut points \(-\infty < c_1 < \ldots < c_{b-1} < \infty\) are computed as follows:
\(c_i = \sigma \ \Phi^{-1}\bigg(\frac{i}{b}\bigg) + \mu , \quad i=1,\ldots,b-1\)
regression_estimator (estimator object (default=None)) – A regression estimator object. If it is None, the regression estimator used is a Generalized Linear Model (GLM) from the statsmodels package with logit link and Binomial family as parameters. It is used to learn a regression model able to predict the prevalence for each class, so the method fits as many regression estimators as there are classes in multiclass problems and just one for binary problems.
verbose (int, optional, (default=0)) – The verbosity level. The default value, zero, means silent mode
- estimator_train¶
Estimator used to classify the examples of the training set
- Type:
estimator
- estimator_test¶
Estimator used to classify the examples of the testing bag
- Type:
estimator
- bag_generator¶
Object to generate the bags with a selected shift
- Type:
BagGenerator object
- needs_predictions_train¶
It is True because REGy quantifiers need to estimate the training distribution
- Type:
bool, True
- probabilistic_predictions¶
This means that predictions_train_/predictions_test_ contain probabilistic predictions
- Type:
bool, True
- n_bins¶
Number of bins to compute the PDF of each distribution
- Type:
int
- bin_strategy¶
Method to compute the boundaries of the bins
- Type:
str
- regression_estimator¶
A regression estimator object
- Type:
estimator object, None
- verbose¶
The verbosity level
- Type:
int
- predictions_train_¶
Predictions of the examples in the training set
- Type:
ndarray, shape (n_examples, n_classes) (probabilities)
- predictions_test_¶
Predictions of the examples in the testing bag
- Type:
ndarray, shape (n_examples, n_classes) (probabilities)
- classes_¶
Class labels
- Type:
ndarray, shape (n_classes, )
- dataX_¶
X data for training REGy’s regressor model. Each row corresponds to the predictions histogram for the examples of the corresponding bag
- Type:
array-like, shape (n_bags, n_bins * n_classes)
- dataY_¶
Y data for training REGy’s regressor model. Each value corresponds to the prevalences of the corresponding bag
- Type:
array-like, shape(n_bags, n_classes)
- bincuts_¶
Bin cuts for each feature
- Type:
ndarray, shape (n_features, n_bins + 1)
- estimators_¶
It stores the estimators. For multiclass problems, the method learns an individual estimator for each class
- Type:
array of estimators, shape (n_classes, ) for multiclass problems, (1, ) for binary quantification
- models_¶
This is the fitted regressor model for each class. It is needed when regression_estimator is None and GLM models are used (these objects do not store the fitted model).
- Type:
array of models, i.e., fitted estimators, shape (n_classes, )
- n_classes_¶
The number of classes
- Type:
int
References
Christopher F. Baum: Stata tip 63: Modeling proportions. The Stata Journal 8.2 (2008): 299-303
- fit(X, y, predictions_train=None)[source]¶
This method has two steps: 1) it computes a training dataset formed by a collection of bags of examples (using create_training_set_of_distributions), and 2) it trains a regression model over said training set by calling fit_regressor, an inherited method from the REG base class
- Parameters:
X (array-like, shape (n_examples, n_features)) – Data
y (array-like, shape (n_examples, )) – True classes
predictions_train (ndarray, shape (n_examples, n_classes)) – Predictions of the examples in the training set
- Raises:
ValueError – When estimator_train and predictions_train are both None
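A sketch of the predictions_train path, which avoids passing estimator_train: cross-validated probabilistic predictions are computed externally and handed to fit. Classifier choice and data are illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from quantificationlib.multiclass.regression import REGy

X_train, y_train = make_classification(n_samples=500, n_features=10, random_state=0)

# Probabilistic predictions for the training examples, computed out of fold
preds_train = cross_val_predict(RandomForestClassifier(random_state=0),
                                X_train, y_train, cv=5, method='predict_proba')

# estimator_test is still provided so that predict can classify testing bags later
regy = REGy(estimator_test=RandomForestClassifier(random_state=0), n_bins=8)
regy.fit(X_train, y_train, predictions_train=preds_train)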
- predict(X, predictions_test=None)[source]¶
This method first computes the histogram for the testing set X, using the bincuts computed by the fit method and the predictions for the testing bag. These predictions may be explicitly given in the predictions_test parameter. Then it makes a prediction by applying the regression model via the inherited method predict_bag
- Parameters:
X (array-like, shape (n_examples, n_features)) – Testing bag
predictions_test (ndarray, shape (n_examples, n_classes) (default=None)) –
They must be probabilities (the estimator used must have a predict_proba method)
If predictions_test is not None, they are copied into predictions_test_ and used. If predictions_test is None, predictions for the testing examples are computed using the predict_proba method of estimator_test (it must be an actual estimator)
- Raises:
ValueError – When estimator_test and predictions_test are both None
- Returns:
prevalences – Contains the predicted prevalence for each class
- Return type:
ndarray, shape(n_classes, )
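A minimal end-to-end usage sketch for REGy using estimator_train and estimator_test; classifier and data choices are illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from quantificationlib.multiclass.regression import REGy

X_train, y_train = make_classification(n_samples=500, n_features=10, random_state=0)
X_test, _ = make_classification(n_samples=200, n_features=10, random_state=1)

regy = REGy(estimator_train=RandomForestClassifier(random_state=0),
            estimator_test=RandomForestClassifier(random_state=0),
            n_bins=8)
regy.fit(X_train, y_train)
prevalences = regy.predict(X_test)    # ndarray, shape (n_classes, )
print(prevalences)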
- set_fit_request(*, predictions_train='$UNCHANGED$')¶
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note: This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
predictions_train (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the predictions_train parameter in fit.
self (REGy) –
- Returns:
self – The updated object.
- Return type:
object
- set_predict_request(*, predictions_test='$UNCHANGED$')¶
Request metadata passed to the predict method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note: This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
predictions_test (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the predictions_test parameter in predict.
self (REGy) –
- Returns:
self – The updated object.
- Return type:
object