quantificationlib.multiclass.regression module

class REG(bag_generator=<quantificationlib.bag_generator.PriorShift_BagGenerator object>, n_bins=8, bin_strategy='equal_width', regression_estimator=None, verbose=0, **kwargs)[source]

Bases: object

REG base class for REGX and REGy

The idea of these quantifiers is to learn a regression model able to predict the prevalences. To learn said regression model, these objects generate a training set of bags of examples using a selected kind of shift (prior probability shift, covariate shift, or a mix of both). The training set contains a collection of pairs (PDF distribution, prevalences) in which each pair is obtained from a bag of examples. The PDF tries to capture the distribution of the bag.

Parameters:
  • bag_generator (BagGenerator object (default=PriorShift_BagGenerator())) – Object to generate the bags with a selected shift

  • n_bins (int (default=8)) – Number of bins to compute the PDF of each distribution

  • bin_strategy (str (default='equal_width')) –

    Method to compute the boundaries of the bins:
    • ’equal_width’: bins of equal length (it could be affected by outliers)

    • ’equal_count’: bins of equal counts (considering the examples of all classes)

    • ’binormal’: (Only for binary quantification)

      It is inspired by the method devised by Tasche (2019, Eq. (A16b)). The cut points, \(-\infty < c_1 < \ldots < c_{b-1} < \infty\), are computed as follows, based on the assumption that the features follow a normal distribution:

      \(c_i = \frac{\sigma^+ + \sigma^{-}}{2} \ \Phi^{-1}\bigg(\frac{i}{b}\bigg) + \frac{\mu^+ + \mu^{-}}{2} , \quad i=1,\ldots,b-1\)

      where \(\Phi^{-1}\) is the quantile function of the standard normal distribution, and \(\mu\) and \(\sigma\) of the normal distribution are estimated as the average of those values for the training examples of each class.

    • ’normal’: The idea is that each feature follows a normal distribution. \(\mu\) and \(\sigma\) are

      estimated as the weighted mean and std from the training distribution. The cut points \(-\infty < c_1 < \ldots < c_{b-1} < \infty\) are computed as follows:

      \(c_i = \sigma \, \Phi^{-1}\bigg(\frac{i}{b}\bigg) + \mu , \quad i=1,\ldots,b-1\)

  • regression_estimator (estimator object (default=None)) – A regression estimator object. If the value is None, the regression estimator used is a Generalized Linear Model (GLM) from the statsmodels package with a logit link and Binomial family (see Baum 2008). It is used to learn a regression model able to predict the prevalence for each class, so the method fits as many regression estimators as classes in multiclass problems and just one for binary problems.

  • verbose (int, optional, (default=0)) – The verbosity level. The default value, zero, means silent mode
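The 'normal' bin strategy above can be sketched in a few lines. This is an illustrative helper, not the library's implementation (for simplicity it uses the plain mean and standard deviation rather than the weighted estimates mentioned above):

```python
import numpy as np
from scipy.stats import norm

def normal_bincuts(x, n_bins=8):
    """Bin boundaries under the 'normal' strategy: the inner cut points
    c_i = sigma * Phi^{-1}(i / b) + mu are the quantiles of a fitted
    normal distribution, with -inf/+inf added at the extremes."""
    mu, sigma = np.mean(x), np.std(x)
    i = np.arange(1, n_bins)                   # i = 1, ..., b-1
    inner = sigma * norm.ppf(i / n_bins) + mu  # Phi^{-1} is the normal quantile function
    return np.concatenate(([-np.inf], inner, [np.inf]))
```

By construction the middle cut point equals the estimated mean (since \(\Phi^{-1}(1/2) = 0\)), and the bins are narrower near the mean where a normal feature concentrates its mass.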

bag_generator

Object to generate the bags with a selected shift

Type:

BagGenerator object

n_bins

Number of bins to compute the PDF of each distribution

Type:

int

bin_strategy

Method to compute the boundaries of the bins

Type:

str

regression_estimator

A regression estimator object

Type:

estimator object, None

verbose

The verbosity level

Type:

int

dataX_

X data for training REGX/REGy’s regressor model. Each row corresponds to the collection of histograms (one per input feature) of the corresponding bag

Type:

array-like, shape (n_bags, n_features * n_bins)

dataY_

Y data for training REGX/REGy’s regressor model. Each value corresponds to the prevalences of the corresponding bag

Type:

array-like, shape(n_bags, n_classes)

bincuts_

Bin cuts for each feature

Type:

ndarray, shape (n_features, n_bins + 1)

estimators_

It stores the estimators. For multiclass problems, the method learns an individual estimator for each class

Type:

array of estimators, shape (n_classes, ) multiclass (1, ) binary quantification

models_

This is the fitted regressor model for each class. It is needed when regression_estimator is None and GLM models are used (these objects do not store the fitted model).

Type:

array of models, i.e., fitted estimators, shape (n_classes, )

n_classes_

The number of classes

Type:

int

References

Christopher F. Baum: Stata tip 63: Modeling proportions. The Stata Journal 8.2 (2008): 299-303

create_training_set_of_distributions(X, y, att_range=None)[source]

Create a training set for REG objects. Each example corresponds to the histogram of a bag of examples generated from (X, y). The size of the complete histogram is n_features * n_bins, because it is formed by concatenating the histograms for each input feature. This method computes the values of the dataX_, dataY_ and bincuts_ attributes

Parameters:
  • X (array-like, shape (n_examples, n_features)) – Data

  • y (array-like, shape (n_examples, )) – True classes

  • att_range (array-like, (2,1)) – Min and max possible values of the input feature x. These values might not coincide with the actual min and max values of vector x. For instance, if x represents a set of probabilistic predictions, these values will be 0 and 1. These values may be needed by the compute_bincuts function
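The per-feature histogram concatenation described above can be sketched as follows (bag_pdf is a hypothetical helper written for illustration, not a library function):

```python
import numpy as np

def bag_pdf(X_bag, bincuts):
    """Represent a bag by concatenating one normalized histogram per input
    feature, yielding a vector of length n_features * n_bins."""
    hists = []
    for j in range(X_bag.shape[1]):
        # bincuts[j] holds the n_bins + 1 boundaries for feature j
        counts, _ = np.histogram(X_bag[:, j], bins=bincuts[j])
        hists.append(counts / len(X_bag))  # normalize counts to a PDF
    return np.concatenate(hists)
```

Each row of dataX_ would be one such vector, paired with the bag's true prevalences in dataY_.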

fit_regressor()[source]

This method trains the regressor model using dataX_ and dataY_
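For multiclass problems this amounts to fitting one regressor per class column of dataY_ (cf. the estimators_ attribute). A rough sketch of that idea, substituting a hypothetical Ridge regressor for the default GLM:

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_per_class_regressors(dataX, dataY):
    """Fit one regressor per class column of dataY (multiclass case);
    a binary problem would need only a single regressor."""
    return [Ridge().fit(dataX, dataY[:, c]) for c in range(dataY.shape[1])]
```

At prediction time, each fitted regressor yields the prevalence estimate for its class from the testing bag's PDF.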

predict_bag(bagX)[source]

This method makes a prediction for a testing bag represented by its PDF, bagX parameter

Parameters:

bagX (array-like, shape (n_bins * n_classes, ) for REGy and (n_bins * n_features, ) for REGX) – Testing bag’s PDF

Returns:

prevalences – Contains the predicted prevalence for each class

Return type:

ndarray, shape(n_classes, )

class REGX(bag_generator=<quantificationlib.bag_generator.PriorShift_BagGenerator object>, n_bins=8, bin_strategy='normal', regression_estimator=None, verbose=False)[source]

Bases: WithoutClassifiers, REG

The idea is to learn a regression model able to predict the prevalences given a PDF distribution. In this case, the distributions are represented using PDFs of the input features (X). To learn such a regression model, this object generates a training set of bags of examples using a selected kind of shift (prior probability shift, covariate shift, or a mix of both)

Parameters:
  • bag_generator (BagGenerator object (default=PriorShift_BagGenerator())) – Object to generate the bags with a selected shift

  • n_bins (int (default=8)) – Number of bins to compute the PDF of each distribution

  • bin_strategy (str (default='normal')) –

    Method to compute the boundaries of the bins:
    • ’equal_width’: bins of equal length (it could be affected by outliers)

    • ’equal_count’: bins of equal counts (considering the examples of all classes)

    • ’binormal’: (Only for binary quantification)

      It is inspired by the method devised by Tasche (2019, Eq. (A16b)). The cut points, \(-\infty < c_1 < \ldots < c_{b-1} < \infty\), are computed as follows, based on the assumption that the features follow a normal distribution:

      \(c_i = \frac{\sigma^+ + \sigma^{-}}{2} \ \Phi^{-1}\bigg(\frac{i}{b}\bigg) + \frac{\mu^+ + \mu^{-}}{2} , \quad i=1,\ldots,b-1\)

      where \(\Phi^{-1}\) is the quantile function of the standard normal distribution, and \(\mu\) and \(\sigma\) of the normal distribution are estimated as the average of those values for the training examples of each class.

    • ’normal’: The idea is that each feature follows a normal distribution. \(\mu\) and \(\sigma\) are

      estimated as the weighted mean and std from the training distribution. The cut points \(-\infty < c_1 < \ldots < c_{b-1} < \infty\) are computed as follows:

      \(c_i = \sigma \, \Phi^{-1}\bigg(\frac{i}{b}\bigg) + \mu , \quad i=1,\ldots,b-1\)

  • regression_estimator (estimator object (default=None)) – A regression estimator object. If the value is None, the regression estimator used is a Generalized Linear Model (GLM) from the statsmodels package with a logit link and Binomial family (see Baum 2008). It is used to learn a regression model able to predict the prevalence for each class, so the method fits as many regression estimators as classes in multiclass problems and just one for binary problems.

  • verbose (int, optional, (default=0)) – The verbosity level. The default value, zero, means silent mode

bag_generator

Object to generate the bags with a selected shift

Type:

BagGenerator object

n_bins

Number of bins to compute the PDF of each distribution

Type:

int

bin_strategy

Method to compute the boundaries of the bins

Type:

str

regression_estimator

A regression estimator object

Type:

estimator object, None

verbose

The verbosity level

Type:

int

dataX_

X data for training REGX’s regressor model. Each row corresponds to the collection of histograms (one per input feature) of the corresponding bag

Type:

array-like, shape (n_bags, n_features * n_bins)

dataY_

Y data for training REGX’s regressor model. Each value corresponds to the prevalences of the corresponding bag

Type:

array-like, shape(n_bags, n_classes)

bincuts_

Bin cuts for each feature

Type:

ndarray, shape (n_features, n_bins + 1)

estimators_

It stores the estimators. For multiclass problems, the method learns an individual estimator for each class

Type:

array of estimators, shape (n_classes, ) multiclass (1, ) binary quantification

models_

This is the fitted regressor model for each class. It is needed when regression_estimator is None and GLM models are used (these objects do not store the fitted model).

Type:

array of models, i.e., fitted estimators, shape (n_classes, )

n_classes_

The number of classes

Type:

int

References

Christopher F. Baum: Stata tip 63: Modeling proportions. The Stata Journal 8.2 (2008): 299-303

fit(X, y)[source]

This method has two steps: 1) it computes a training dataset formed by a collection of bags of examples (using create_training_set_of_distributions) and 2) it trains a regression model over said training set by calling fit_regressor, an inherited method from the REG base class

Parameters:
  • X (array-like, shape (n_examples, n_features)) – Data

  • y (array-like, shape (n_examples, )) – True classes

predict(X)[source]

This method computes the histogram for the testing set X, using the bin cuts for each input feature computed by the fit method, and then makes a prediction by applying the regression model via the inherited method predict_bag

Parameters:

X (array-like, shape (n_examples, n_features)) – Testing bag

Returns:

prevalences – Contains the predicted prevalence for each class

Return type:

ndarray, shape(n_classes, )

class REGy(estimator_train=None, estimator_test=None, bag_generator=<quantificationlib.bag_generator.PriorShift_BagGenerator object>, n_bins=8, bin_strategy='equal_width', regression_estimator=None, verbose=False)[source]

Bases: UsingClassifiers, REG

The idea is to learn a regression model able to predict the prevalences given a PDF distribution. In this case, the distributions are represented using PDFs of the predictions (y) from a classifier. To learn such a regression model, this object first trains a classifier using all the data and then generates a training set of bags of examples (in this case, the predictions of each example) using a selected kind of shift (prior probability shift, covariate shift, or a mix of both)

Parameters:
  • estimator_train (estimator object (default=None)) – An estimator object implementing fit and predict_proba. It is used to train a classifier using the examples of the training set. This classifier is used to obtain the predictions for the training examples and to compute the PDF of each class individually using such predictions

  • estimator_test (estimator object (default=None)) – An estimator object implementing fit and predict_proba. It is used to classify the examples of the testing set and to compute the distribution of the whole testing set

  • bag_generator (BagGenerator object (default=PriorShift_BagGenerator())) – Object to generate the bags with a selected shift

  • n_bins (int (default=8)) – Number of bins to compute the PDF of each distribution

  • bin_strategy (str (default='equal_width')) –

    Method to compute the boundaries of the bins
    • ’equal_width’: bins of equal length (it could be affected by outliers)

    • ’equal_count’: bins of equal counts (considering the examples of all classes)

    • ’binormal’: (Only for binary quantification)

      It is inspired by the method devised by Tasche (2019, Eq. (A16b)). The cut points, \(-\infty < c_1 < \ldots < c_{b-1} < \infty\), are computed as follows, based on the assumption that the features follow a normal distribution:

      \(c_i = \frac{\sigma^+ + \sigma^{-}}{2} \ \Phi^{-1}\bigg(\frac{i}{b}\bigg) + \frac{\mu^+ + \mu^{-}}{2} , \quad i=1,\ldots,b-1\)

      where \(\Phi^{-1}\) is the quantile function of the standard normal distribution, and \(\mu\) and \(\sigma\) of the normal distribution are estimated as the average of those values for the training examples of each class.

    • ’normal’: The idea is that each feature follows a normal distribution. \(\mu\) and \(\sigma\) are

      estimated as the weighted mean and std from the training distribution. The cut points \(-\infty < c_1 < \ldots < c_{b-1} < \infty\) are computed as follows:

      \(c_i = \sigma \, \Phi^{-1}\bigg(\frac{i}{b}\bigg) + \mu , \quad i=1,\ldots,b-1\)

  • regression_estimator (estimator object (default=None)) – A regression estimator object. If it is None, the regression estimator used is a Generalized Linear Model (GLM) from the statsmodels package with a logit link and Binomial family. It is used to learn a regression model able to predict the prevalence for each class, so the method fits as many regression estimators as classes in multiclass problems and just one for binary problems.

  • verbose (int, optional, (default=0)) – The verbosity level. The default value, zero, means silent mode

estimator_train

Estimator used to classify the examples of the training set

Type:

estimator

estimator_test

Estimator used to classify the examples of the testing bag

Type:

estimator

bag_generator

Object to generate the bags with a selected shift

Type:

BagGenerator object

needs_predictions_train

It is True because REGy quantifiers need to estimate the training distribution

Type:

bool, True

probabilistic_predictions

This means that predictions_train_/predictions_test_ contain probabilistic predictions

Type:

bool, True

n_bins

Number of bins to compute the PDF of each distribution

Type:

int

bin_strategy

Method to compute the boundaries of the bins

Type:

str

regression_estimator

A regression estimator object

Type:

estimator object, None

verbose

The verbosity level

Type:

int

predictions_train_

Predictions of the examples in the training set

Type:

ndarray, shape (n_examples, n_classes) (probabilities)

predictions_test_

Predictions of the examples in the testing bag

Type:

ndarray, shape (n_examples, n_classes) (probabilities)

classes_

Class labels

Type:

ndarray, shape (n_classes, )

dataX_

X data for training REGy’s regressor model. Each row corresponds to the predictions histogram for the examples of the corresponding bag

Type:

array-like, shape (n_bags, n_bins * n_classes)

dataY_

Y data for training REGy’s regressor model. Each value corresponds to the prevalences of the corresponding bag

Type:

array-like, shape(n_bags, n_classes)

bincuts_

Bin cuts for each feature

Type:

ndarray, shape (n_features, n_bins + 1)

estimators_

It stores the estimators. For multiclass problems, the method learns an individual estimator for each class

Type:

array of estimators, shape (n_classes, ) multiclass (1, ) binary quantification

models_

This is the fitted regressor model for each class. It is needed when regression_estimator is None and GLM models are used (these objects do not store the fitted model).

Type:

array of models, i.e., fitted estimators, shape (n_classes, )

n_classes_

The number of classes

Type:

int

References

Christopher F. Baum: Stata tip 63: Modeling proportions. The Stata Journal 8.2 (2008): 299-303

fit(X, y, predictions_train=None)[source]

This method has two steps: 1) it computes a training dataset formed by a collection of bags of examples (using create_training_set_of_distributions) and 2) it trains a regression model over said training set by calling fit_regressor, an inherited method from the REG base class

Parameters:
  • X (array-like, shape (n_examples, n_features)) – Data

  • y (array-like, shape (n_examples, )) – True classes

  • predictions_train (ndarray, shape (n_examples, n_classes)) – Predictions of the examples in the training set

Raises:

ValueError – When estimator_train and predictions_train are both None

predict(X, predictions_test=None)[source]

This method first computes the histogram for the testing set X, using the bin cuts computed by the fit method and the predictions for the testing bag. These predictions may be explicitly given via the predictions_test parameter. Then it makes a prediction by applying the regression model using the inherited method predict_bag

Parameters:
  • X (array-like, shape (n_examples, n_features)) – Testing bag

  • predictions_test (ndarray, shape (n_examples, n_classes) (default=None)) –

    They must be probabilities (the estimator used must have a predict_proba method)

    If predictions_test is not None, they are copied into predictions_test_ and used. If predictions_test is None, predictions for the testing examples are computed using the predict_proba method of estimator_test (it must be an actual estimator)

Raises:

ValueError – When estimator_test and predictions_test are both None

Returns:

prevalences – Contains the predicted prevalence for each class

Return type:

ndarray, shape(n_classes, )

set_fit_request(*, predictions_train='$UNCHANGED$')

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • predictions_train (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_train parameter in fit.

  • self (REGy) –

Returns:

self – The updated object.

Return type:

object

set_predict_request(*, predictions_test='$UNCHANGED$')

Request metadata passed to the predict method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • predictions_test (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_test parameter in predict.

  • self (REGy) –

Returns:

self – The updated object.

Return type:

object