quantificationlib.multiclass.energy module

Multiclass versions for quantifiers based on the Energy Distance

class CvMy(estimator_train=None, estimator_test=None, distance=<function manhattan_distances>, verbose=0)[source]

Bases: UsingClassifiers

Multiclass CvMy method

As described in (Castaño et al 2019), the predicted prevalences can be calculated analytically by solving an optimization problem (this library uses quadprog.solve_qp). All ED-based methods share several functions in distribution_matching.utils. These functions compute the elements of the optimization problem (compute_ed_param_train, compute_ed_param_test) and solve it (solve_ed).

This class (as every other class based on distribution matching using classifiers) works in two different ways:

  1. Two estimators are used to classify training examples and testing examples in order to compute the distribution of both sets. Estimators can be already trained

  2. You can directly provide the predictions for the examples in the fit/predict methods. This is useful for synthetic/artificial experiments

The idea in both cases is to guarantee that all methods based on distribution matching use exactly the same predictions when you compare quantifiers of this kind (and others that also employ an underlying classifier, for instance, CC/PCC and AC/PAC). In the first case, the estimators are trained only once and can be shared by several quantifiers of this kind.
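To illustrate why sharing predictions matters, the following minimal sketch computes two classifier-based estimates (CC and PCC) from one and the same matrix of posterior probabilities. It uses NumPy only and does not call this library; the array predictions_test and both helper functions are hypothetical names introduced for the example.

```python
import numpy as np

# Hypothetical posterior probabilities for a testing bag of 5 examples and
# 3 classes, in the layout a classifier's predict_proba would return.
predictions_test = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
    [0.3, 0.4, 0.3],
])


def classify_and_count(probs):
    """CC: fraction of examples assigned to each class by argmax."""
    n_classes = probs.shape[1]
    labels = probs.argmax(axis=1)
    return np.bincount(labels, minlength=n_classes) / len(probs)


def probabilistic_classify_and_count(probs):
    """PCC: average posterior probability of each class."""
    return probs.mean(axis=0)


# Both estimates are derived from exactly the same predictions.
cc = classify_and_count(predictions_test)
pcc = probabilistic_classify_and_count(predictions_test)
print(cc, pcc)
```

Because both quantifiers read the same predictions, any difference between their outputs comes from the quantification method and not from the underlying classifier.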

Parameters:
  • estimator_train (estimator object (default=None)) – An estimator object implementing fit and predict_proba. It is used to classify the examples of the training set and to compute the distribution of each class individually

  • estimator_test (estimator object (default=None)) – An estimator object implementing fit and predict_proba. It is used to classify the examples of the testing set and to compute the distribution of the whole testing set. For some experiments both estimator_train and estimator_test could be the same

  • distance (distance function (default=manhattan_distances)) – It is the function used to compute the distance between every pair of examples

  • verbose (int, optional, (default=0)) – The verbosity level. The default value, zero, means silent mode

estimator_train

Estimator used to classify the examples of the training set

Type:

estimator

estimator_test

Estimator used to classify the examples of the testing bag

Type:

estimator

predictions_train_

Predictions of the examples in the training set

Type:

ndarray, shape (n_examples, n_classes) (probabilities)

predictions_test_

Predictions of the examples in the testing bag

Type:

ndarray, shape (n_examples, n_classes) (probabilities)

needs_predictions_train

It is True because CvMy quantifiers need to estimate the training distribution

Type:

bool, True

probabilistic_predictions

This means that predictions_train_/predictions_test_ contain probabilistic predictions

Type:

bool, True

classes_

Class labels

Type:

ndarray, shape (n_classes, )

y_ext_

Repmat of true labels of the training set. When CV_estimator is used with averaged_predictions=False, predictions_train_ will have a larger dimension (factor=n_repetitions * n_folds of the underlying CV) than y. In other cases, y_ext_ == y. y_ext_ is used in predict method whenever the true labels of the training set are needed, instead of y

Type:

ndarray, shape (len(predictions_train_), 1)

distance

Function used to compute the distance between every pair of examples

Type:

distance function

train_n_cls_i_

Number of the examples of each class in the training set. Used to compute average distances

Type:

ndarray, shape(n_classes, 1)

train_distrib_

Each key is associated with an ndarray containing the predictions, with shape (train_n_cls_i_[i], 1) (binary quantification problems) or (train_n_cls_i_[i], n_classes) (multiclass quantification problems)

Type:

Dict, the keys are the labels of the classes (classes_)

test_distrib_

The distribution of the testing bag

Type:

ndarray, shape(n_examples, )

K_

Average distance between the examples in the training set of each pair of classes

Type:

ndarray, shape (n_classes, n_classes)

G_, C_, b_

These variables are precomputed in the fit method and are used for solving the optimization problem using quadprog.solve_qp. See compute_ed_param_train function

Type:

variables of different kinds used to define the optimization problem

a_

This one is computed in the predict method, just before solving the optimization problem

Type:

another variable of the optimization problem

verbose

The verbosity level

Type:

int

Notes

Within each pair, estimator_train/predictions_train and estimator_test/predictions_test, at least one of the two must be not None. If both are None, a ValueError exception is raised. If both are not None, predictions_train/predictions_test are used

References

Alberto Castaño, Laura Morán-Fernández, Jaime Alonso, Verónica Bolón-Canedo, Amparo Alonso-Betanzos, Juan José del Coz: An analysis of quantification methods based on matching distributions

fit(X, y, predictions_train=None)[source]

This method performs the following operations: 1) fits the estimators for the training set and the testing set (if needed), and 2) computes predictions_train_ (probabilities) if needed. Both operations are performed by the fit method of its superclass. After that, the method stores the true classes in y_train_ attribute.

Parameters:
  • X (array-like, shape (n_examples, n_features)) – Data

  • y (array-like, shape (n_examples, )) – True classes

  • predictions_train (ndarray, shape (n_examples, n_classes) (probabilities)) – Predictions of the examples in the training set

Raises:

ValueError – When estimator_train and predictions_train are both None

predict(X, predictions_test=None)[source]

Predict the class distribution of a testing bag

First, predictions_test_ are computed (if needed, when predictions_test parameter is None) by super().predict() method.

Then, the method computes all the elements of the optimization problem after computing the combined ranking of the predictions for the training examples and the testing examples using rankdata function

Parameters:
  • X (array-like, shape (n_examples, n_features)) – Testing bag

  • predictions_test (ndarray, shape (n_examples, n_classes) (default=None)) –

    They must be probabilities (the estimator used must have a predict_proba method)

    If predictions_test is not None, they are copied to predictions_test_ and used. If predictions_test is None, predictions for the testing examples are computed using the predict method of estimator_test (it must be an actual estimator)

Raises:

ValueError – When estimator_test and predictions_test are at the same time None or not None

Returns:

prevalences – Contains the predicted prevalence for each class

Return type:

ndarray, shape(n_classes, )
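The combined ranking step mentioned in predict can be sketched as follows. This is an illustration under stated assumptions, not the library's code: it stacks training and testing predictions and ranks each column jointly using a NumPy double argsort, which gives ordinal ranks. The rankdata function from scipy.stats averages tied ranks by default, so the two only coincide when there are no ties. The function name combined_ranks is hypothetical.

```python
import numpy as np


def combined_ranks(predictions_train, predictions_test):
    """Rank training and testing predictions together, column by column."""
    combined = np.vstack([predictions_train, predictions_test])
    # A double argsort yields 1-based ordinal ranks within each column.
    ranks = np.argsort(np.argsort(combined, axis=0), axis=0) + 1
    n_train = len(predictions_train)
    # Split the joint ranks back into the training part and the testing part.
    return ranks[:n_train], ranks[n_train:]


predictions_train = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
predictions_test = np.array([[0.7, 0.3], [0.1, 0.9]])
ranks_train, ranks_test = combined_ranks(predictions_train, predictions_test)
print(ranks_train)
print(ranks_test)
```

Ranking both sets together places training and testing examples on a common scale, so the distances computed afterwards depend on the relative order of the predictions and not on their raw values.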

set_fit_request(*, predictions_train='$UNCHANGED$')

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • predictions_train (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_train parameter in fit.

  • self (CvMy) –

Returns:

self – The updated object.

Return type:

object

set_predict_request(*, predictions_test='$UNCHANGED$')

Request metadata passed to the predict method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • predictions_test (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_test parameter in predict.

  • self (CvMy) –

Returns:

self – The updated object.

Return type:

object

class EDX(distance=<function euclidean_distances>, verbose=0)[source]

Bases: WithoutClassifiers

Multiclass EDX method

Parameters:
  • distance (distance function (default=euclidean_distances)) – It is the function used to compute the distance between every pair of examples

  • verbose (int, optional, (default=0)) – The verbosity level. The default value, zero, means silent mode

distance_

The distance function used for computing the distance between every pair of examples

Type:

distance function

classes_

Class labels

Type:

ndarray, shape (n_classes, )

train_n_cls_i_

Number of the examples of each class in the training set. Used to compute average distances

Type:

ndarray, shape(n_classes, 1)

train_distrib_

Each key is associated with an ndarray containing the predictions, with shape (train_n_cls_i_[i], 1) (binary quantification problems) or (train_n_cls_i_[i], n_classes) (multiclass quantification problems)

Type:

Dict, the keys are the labels of the classes (classes_)

K_

Average distance between the examples in the training set of each pair of classes

Type:

ndarray, shape (n_classes, n_classes)

G_, C_, b_

These variables are precomputed in the fit method and are used for solving the optimization problem using quadprog.solve_qp. See compute_ed_param_train function

Type:

variables of different kinds used to define the optimization problem

verbose

The verbosity level

Type:

int

References

Hideko Kawakubo, Marthinus Christoffel Du Plessis, and Masashi Sugiyama. 2016. Computationally efficient class-prior estimation under class balance change using energy distance. Transactions on Information and Systems 99, 1 (2016), 176–186.

fit(X, y)[source]

This method computes all the elements of the optimization that involve just the training data: K_, G_, C_ and b_.

Parameters:
  • X (array-like, shape (n_examples, n_features)) – Data

  • y (array-like, shape (n_examples, )) – True classes
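As an illustration of what K_ holds, the sketch below computes the average Euclidean distance between the training examples of every pair of classes using NumPy only. It is an assumption about the general idea, not a copy of compute_ed_param_train: the helper names are hypothetical, and the within-class averages here include the zero distance of each example to itself, which the library may treat differently.

```python
import numpy as np


def pairwise_euclidean(A, B):
    """Euclidean distance between every row of A and every row of B."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


def average_class_distances(X, y):
    """K[i, j] = mean distance between examples of class i and class j."""
    classes = np.unique(y)
    K = np.zeros((len(classes), len(classes)))
    for i, ci in enumerate(classes):
        for j, cj in enumerate(classes):
            K[i, j] = pairwise_euclidean(X[y == ci], X[y == cj]).mean()
    return classes, K


# Tiny hypothetical training set: two examples per class.
X = np.array([[0.0, 0.0], [0.0, 2.0], [3.0, 0.0], [3.0, 4.0]])
y = np.array([0, 0, 1, 1])
classes, K = average_class_distances(X, y)
print(K)
```

K depends only on the training data, which is why it can be computed once in fit and reused for every testing bag.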

predict(X)[source]

Predict the class distribution of a testing bag

This method computes a, the only element of the optimization problem that needs the testing data. Then, it solves the optimization problem using quadprog.solve_qp in solve_ed function

Parameters:

X (array-like, shape (n_examples, n_features)) – Testing bag

Returns:

prevalences – Contains the predicted prevalence for each class

Return type:

ndarray, shape(n_classes, )
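The prediction step can be sketched for the binary case as follows. Under the energy distance, a mixture of the class distributions with prevalences p is compared against the testing bag; dropping the term that does not depend on p leaves the objective 2 p·a − pᵀKp, where a[i] is the mean distance between the training examples of class i and the testing examples. The sketch minimizes that objective with a plain grid search over p instead of quadprog.solve_qp, so it is a simplified stand-in and not the library's solve_ed; all function names are hypothetical.

```python
import numpy as np


def mean_distance(A, B):
    """Mean Euclidean distance between every row of A and every row of B."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2)).mean()


def ed_prevalences_binary(X_train, y_train, X_test, n_grid=101):
    """Grid-search the prevalences that minimize the energy distance."""
    classes = np.unique(y_train)
    groups = [X_train[y_train == c] for c in classes]
    # K needs only training data; a is the only term that uses the test bag.
    K = np.array([[mean_distance(gi, gj) for gj in groups] for gi in groups])
    a = np.array([mean_distance(g, X_test) for g in groups])
    best_p, best_value = None, np.inf
    for q in np.linspace(0.0, 1.0, n_grid):
        p = np.array([q, 1.0 - q])
        value = 2.0 * p @ a - p @ K @ p
        if value < best_value:
            best_p, best_value = p, value
    return best_p


X_train = np.array([[0.0], [1.0], [10.0], [11.0]])
y_train = np.array([0, 0, 1, 1])
# Testing bag built with 75% of class-0-like and 25% of class-1-like examples.
X_test = np.array([[0.0], [1.0], [0.0], [1.0], [0.0], [1.0], [10.0], [11.0]])
prevalences = ed_prevalences_binary(X_train, y_train, X_test)
print(prevalences)
```

In this toy bag the testing examples are exact copies of the training examples in a 3:1 proportion, so the minimizer coincides with the true prevalences. A quadratic programming solver handles the same objective for any number of classes while enforcing that the prevalences are non-negative and sum to one.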

class EDy(estimator_train=None, estimator_test=None, distance=<function manhattan_distances>, verbose=0)[source]

Bases: UsingClassifiers

Multiclass EDy method

As described in (Castaño et al 2019), the predicted prevalences can be calculated analytically by solving an optimization problem (this library uses quadprog.solve_qp). All ED-based methods share several functions in distribution_matching.utils. These functions compute the elements of the optimization problem (compute_ed_param_train, compute_ed_param_test) and solve it (solve_ed).

This class (as every other class based on distribution matching using classifiers) works in two different ways:

  1. Two estimators are used to classify training examples and testing examples in order to compute the distribution of both sets. Estimators can be already trained

  2. You can directly provide the predictions for the examples in the fit/predict methods. This is useful for synthetic/artificial experiments

The idea in both cases is to guarantee that all methods based on distribution matching use exactly the same predictions when you compare quantifiers of this kind (and others that also employ an underlying classifier, for instance, CC/PCC and AC/PAC). In the first case, the estimators are trained only once and can be shared by several quantifiers of this kind.

Parameters:
  • estimator_train (estimator object (default=None)) – An estimator object implementing fit and predict_proba. It is used to classify the examples of the training set and to compute the distribution of each class individually

  • estimator_test (estimator object (default=None)) – An estimator object implementing fit and predict_proba. It is used to classify the examples of the testing set and to compute the distribution of the whole testing set. For some experiments both estimator_train and estimator_test could be the same

  • distance (distance function (default=manhattan_distances)) – It is the function used to compute the distance between every pair of examples

  • verbose (int, optional, (default=0)) – The verbosity level. The default value, zero, means silent mode

estimator_train

Estimator used to classify the examples of the training set

Type:

estimator

estimator_test

Estimator used to classify the examples of the testing bag

Type:

estimator

predictions_train_

Predictions of the examples in the training set

Type:

ndarray, shape (n_examples, n_classes) (probabilities)

predictions_test_

Predictions of the examples in the testing bag

Type:

ndarray, shape (n_examples, n_classes) (probabilities)

needs_predictions_train

It is True because EDy quantifiers need to estimate the training distribution

Type:

bool, True

probabilistic_predictions

This means that predictions_train_/predictions_test_ contain probabilistic predictions

Type:

bool, True

classes_

Class labels

Type:

ndarray, shape (n_classes, )

y_ext_

Repmat of true labels of the training set. When CV_estimator is used with averaged_predictions=False, predictions_train_ will have a larger dimension (factor=n_repetitions * n_folds of the underlying CV) than y. In other cases, y_ext_ == y. y_ext_ is used in fit method whenever the true labels of the training set are needed, instead of y

Type:

ndarray, shape (len(predictions_train_), 1)

train_n_cls_i_

Number of the examples of each class in the training set. Used to compute average distances

Type:

ndarray, shape(n_classes, 1)

train_distrib_

Each key is associated with an ndarray containing the predictions, with shape (train_n_cls_i_[i], 1) (binary quantification problems) or (train_n_cls_i_[i], n_classes) (multiclass quantification problems)

Type:

Dict, the keys are the labels of the classes (classes_)

K_

Average distance between the examples in the training set of each pair of classes

Type:

ndarray, shape (n_classes, n_classes)

G_, C_, b_

These variables are precomputed in the fit method and are used for solving the optimization problem using quadprog.solve_qp. See compute_ed_param_train function

Type:

variables of different kinds used to define the optimization problem

a_

This one is computed in the predict method, just before solving the optimization problem

Type:

another variable of the optimization problem

verbose

The verbosity level

Type:

int

Notes

Within each pair, estimator_train/predictions_train and estimator_test/predictions_test, at least one of the two must be not None. If both are None, a ValueError exception is raised. If both are not None, predictions_train/predictions_test are used

References

Alberto Castaño, Laura Morán-Fernández, Jaime Alonso, Verónica Bolón-Canedo, Amparo Alonso-Betanzos, Juan José del Coz: An analysis of quantification methods based on matching distributions

fit(X, y, predictions_train=None)[source]

This method performs the following operations: 1) fits the estimators for the training set and the testing set (if needed), and 2) computes predictions_train_ (probabilities) if needed. Both operations are performed by the fit method of its superclass. After that, the method computes all the elements of the optimization problem that involve just the training data: K_, G_, C_ and b_.

Parameters:
  • X (array-like, shape (n_examples, n_features)) – Data

  • y (array-like, shape (n_examples, )) – True classes

  • predictions_train (ndarray, shape (n_examples, n_classes) (probabilities)) – Predictions of the examples in the training set

Raises:

ValueError – When estimator_train and predictions_train are both None

predict(X, predictions_test=None)[source]

Predict the class distribution of a testing bag

First, predictions_test_ are computed (if needed, when predictions_test parameter is None) by super().predict() method.

After that, the method computes a, the only element of the optimization problem that needs the testing data

Parameters:
  • X (array-like, shape (n_examples, n_features)) – Testing bag

  • predictions_test (ndarray, shape (n_examples, n_classes) (default=None)) –

    They must be probabilities (the estimator used must have a predict_proba method)

    If predictions_test is not None, they are copied to predictions_test_ and used. If predictions_test is None, predictions for the testing examples are computed using the predict method of estimator_test (it must be an actual estimator)

Raises:

ValueError – When estimator_test and predictions_test are at the same time None or not None

Returns:

prevalences – Contains the predicted prevalence for each class

Return type:

ndarray, shape(n_classes, )

set_fit_request(*, predictions_train='$UNCHANGED$')

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • predictions_train (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_train parameter in fit.

  • self (EDy) –

Returns:

self – The updated object.

Return type:

object

set_predict_request(*, predictions_test='$UNCHANGED$')

Request metadata passed to the predict method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • predictions_test (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_test parameter in predict.

  • self (EDy) –

Returns:

self – The updated object.

Return type:

object