quantificationlib.baselines.ac module

Multiclass versions of AC and PAC quantifiers

class AC(estimator_train=None, estimator_test=None, distance='HD', verbose=0)[source]

Bases: UsingClassifiers

Multiclass Adjusted Count method

This class works in two different ways:

  1. Two estimators are used to classify the examples of the training set and the testing set in order to compute the confusion matrix of both sets. The estimators can be provided already trained

  2. You can directly provide the predictions for the examples in the fit/predict methods. This is useful for synthetic/artificial experiments

The idea in both cases is to guarantee that all methods based on distribution matching use exactly the same predictions when you compare quantifiers of this kind (and others that also employ an underlying classifier, for instance, CC/PCC). In the first case, the estimators are only trained once and can be shared by several quantifiers of this kind
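For instance, a minimal usage sketch of the first way of working (not part of the library documentation; the dataset and the classifier are illustrative placeholders, and a single sklearn classifier is shared between both estimator roles) could look like this:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    from quantificationlib.baselines.ac import AC

    # Synthetic data: the first half acts as the training set, the rest as an unlabelled testing bag
    X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6, random_state=0)
    X_train, y_train, X_test = X[:1000], y[:1000], X[1000:]

    # The same (untrained) classifier plays both roles; AC fits it once and reuses its predictions
    estimator = LogisticRegression(max_iter=1000)
    ac = AC(estimator_train=estimator, estimator_test=estimator, distance='HD')
    ac.fit(X_train, y_train)            # trains the estimator and computes cm_
    prevalences = ac.predict(X_test)    # estimated class prevalences, shape (n_classes, )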

Parameters:
  • estimator_train (estimator object (default=None)) – An estimator object implementing fit and predict. It is used to classify the examples of the training set and to compute the confusion matrix

  • estimator_test (estimator object (default=None)) – An estimator object implementing fit and predict. It is used to classify the examples of the testing set and to obtain their predictions. For some experiments both estimators could be the same

  • distance (str, representing the distance function (default='HD')) – It is the name of the distance used to compute the difference between the mixture of the training distribution and the testing distribution. Only used in multiclass problems. Distances supported: ‘HD’, ‘L2’ and ‘L1’

  • verbose (int, optional, (default=0)) – The verbosity level. The default value, zero, means silent mode

estimator_train

Estimator used to classify the examples of the training set

Type:

estimator

estimator_test

Estimator used to classify the examples of the testing bag

Type:

estimator

needs_predictions_train

It is True because AC quantifiers need to estimate the training distribution

Type:

bool, True

probabilistic_predictions

It is False, which means that predictions_test_ contains crisp predictions

Type:

bool, False

distance

A string with the name of the distance function (‘HD’/’L1’/’L2’)

Type:

str

predictions_train_

Predictions of the examples in the training set

Type:

ndarray, shape (n_examples, ) (crisp estimator)

predictions_test_

Predictions of the examples in the testing bag

Type:

ndarray, shape (n_examples, ) (crisp estimator)

classes_

Class labels

Type:

ndarray, shape (n_classes, )

y_ext_

Repetition (repmat) of the true labels of the training set. When CV_estimator is used with averaged_predictions=False, predictions_train_ will have a larger dimension (factor = n_repetitions * n_folds of the underlying CV) than y. In other cases, y_ext_ == y. y_ext_ is used in the fit method, instead of y, whenever the true labels of the training set are needed

Type:

ndarray, shape (len(predictions_train_), 1)

cm_

Confusion matrix. The true classes are in the rows and the predicted classes in the columns. So, for the binary case, the count of true negatives is cm_[0,0], false negatives is cm_[1,0], true positives is cm_[1,1] and false positives is cm_[0,1].

Type:

ndarray, shape (n_classes, n_classes)

G_, C_, b_

These variables are precomputed in the fit method and are used to solve the optimization problem with quadprog.solve_qp. See the compute_l2_param_train function

Type:

variables of different kinds used to define the optimization problem

problem_

This attribute is set to None in the fit() method. The first time a testing bag is predicted, this attribute will contain the corresponding cvxpy Problem object (if that library is used, i.e., for ‘L1’ and ‘HD’). For the remaining testing bags, this object is reused to allow a warm start, which makes the solving process faster.

Type:

a cvxpy Problem object

verbose

The verbosity level

Type:

int

Notes

Notice that at least one of estimator_train/predictions_train and at least one of estimator_test/predictions_test must not be None. If both members of a pair are None, a ValueError exception will be raised. If both are not None, predictions_train/predictions_test are used

References

George Forman. 2008. Quantifying counts and costs via classification. Data Mining Knowledge Discovery 17, 2 (2008), 164–206.

fit(X, y, predictions_train=None)[source]

This method performs the following operations: 1) fits the estimators for the training set and the testing set (if needed), and 2) computes predictions_train_ (crisp values) if needed. Both operations are performed by the fit method of its superclass. Finally the method computes the confusion matrix of the training set using predictions_train_

Parameters:
  • X (array-like, shape (n_examples, n_features)) – Data

  • y (array-like, shape (n_examples, )) – True classes

  • predictions_train (ndarray, shape (n_examples, ) or (n_examples, n_classes)) – Predictions of the examples in the training set. If shape is (n_examples, n_classes) predictions are converted to crisp values by super().fit()

Raises:

ValueError – When estimator_train and predictions_train are both None

predict(X, predictions_test=None)[source]

Predict the class distribution of a testing bag

First, predictions_test_ are computed (if needed, i.e., when the predictions_test parameter is None) by the super().predict() method.

After that, the prevalences are computed solving a system of linear scalar equations:

cm_.T * prevalences = CC(X)

For binary problems the system is directly solved using the original AC algorithm proposed by Forman

p = (p_0 - fpr) / (tpr - fpr)

For multiclass problems, the system may not have a solution. Thus, instead we propose to solve an optimization problem of this kind:

Min distance ( cm_.T * prevalences, CC(X) )

s.t. sum(prevalences) = 1, prevalences_i >= 0

in which distance can be ‘HD’ (the default value), ‘L1’ or ‘L2’
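The following sketch illustrates both adjustments with generic numpy/scipy code. It is grounded only in the formulas above and is not the library's implementation (which relies on quadprog/cvxpy); in particular, the row-normalization of the confusion matrix into rates is an assumption:

    import numpy as np
    from scipy.optimize import minimize

    def binary_ac(cm, p0):
        """Forman's binary adjustment: p = (p_0 - fpr) / (tpr - fpr)."""
        tpr = cm[1, 1] / cm[1].sum()     # true positive rate on the training set
        fpr = cm[0, 1] / cm[0].sum()     # false positive rate on the training set
        p = (p0 - fpr) / (tpr - fpr)
        return float(np.clip(p, 0.0, 1.0))   # clipped to a valid prevalence

    def multiclass_ac_l2(cm, cc):
        """Multiclass case with the L2 distance: minimize ||cm_rates.T @ p - cc||_2
        subject to sum(p) = 1 and p >= 0 (a generic solver replaces quadprog)."""
        n = cm.shape[0]
        cm_rates = cm / cm.sum(axis=1, keepdims=True)    # assumed row-normalized rates
        objective = lambda p: np.sum((cm_rates.T @ p - cc) ** 2)
        constraint = {'type': 'eq', 'fun': lambda p: p.sum() - 1.0}
        res = minimize(objective, np.full(n, 1.0 / n),
                       bounds=[(0.0, 1.0)] * n, constraints=[constraint])
        return res.x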

Parameters:
  • X (array-like, shape (n_examples, n_features)) – Testing bag

  • predictions_test (ndarray, shape (n_examples, n_classes) (default=None)) –

    They must be probabilities (the estimator used must have a predict_proba method)

    If predictions_test is not None they are copied on predictions_test_ and used. If predictions_test is None, predictions for the testing examples are computed using the predict method of estimator_test (it must be an actual estimator)

Raises:

ValueError – When estimator_test and predictions_test are both None

Returns:

prevalences – Contains the predicted prevalence for each class

Return type:

ndarray, shape(n_classes, )

set_fit_request(*, predictions_train='$UNCHANGED$')

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • predictions_train (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_train parameter in fit.

  • self (AC) –

Returns:

self – The updated object.

Return type:

object
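A brief illustration of how this request might be enabled (a hedged sketch assuming sklearn >= 1.3 with metadata routing turned on globally; it only marks the metadata as requested and does not fit anything):

    import sklearn

    from quantificationlib.baselines.ac import AC

    # Metadata routing must be enabled globally for the request to take effect
    sklearn.set_config(enable_metadata_routing=True)

    # Ask a surrounding meta-estimator to forward `predictions_train` to this quantifier's fit
    ac = AC().set_fit_request(predictions_train=True)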

set_predict_request(*, predictions_test='$UNCHANGED$')

Request metadata passed to the predict method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • predictions_test (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_test parameter in predict.

  • self (AC) –

Returns:

self – The updated object.

Return type:

object

class PAC(estimator_test=None, estimator_train=None, distance='L2', verbose=0)[source]

Bases: UsingClassifiers

Multiclass Probabilistic Adjusted Count method

This class works in two different ways:

  1. Two estimators are used to classify the examples of the training set and the testing set in order to compute the (probabilistic) confusion matrix of both sets. The estimators can be provided already trained

  2. You can directly provide the predictions for the examples in the fit/predict methods. This is useful for synthetic/artificial experiments

The idea in both cases is to guarantee that all methods based on distribution matching use exactly the same predictions when you compare quantifiers of this kind (and others that also employ an underlying classifier, for instance, CC/PCC). In the first case, the estimators are only trained once and can be shared by several quantifiers of this kind
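As an illustrative sketch of the second way of working (not taken from the library documentation; the data and the classifier are placeholders), precomputed probabilistic predictions can be passed straight to fit/predict so that every quantifier in an experiment consumes exactly the same predictions:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    from quantificationlib.baselines.ac import PAC

    X, y = make_classification(n_samples=2000, random_state=0)
    X_train, y_train, X_test = X[:1000], y[:1000], X[1000:]

    # Compute the probabilistic predictions once, outside the quantifier
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    probs_train = clf.predict_proba(X_train)
    probs_test = clf.predict_proba(X_test)

    # Both fit and predict consume the precomputed predictions; no estimators are needed
    pac = PAC(distance='L2')
    pac.fit(X_train, y_train, predictions_train=probs_train)
    prevalences = pac.predict(X_test, predictions_test=probs_test)

In a real experiment the training predictions would usually come from cross-validation rather than from resubstitution as in this sketch.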

Parameters:
  • estimator_train (estimator object (default=None)) – An estimator object implementing fit and predict_proba. It is used to classify the examples of the training set and to compute the confusion matrix

  • estimator_test (estimator object (default=None)) – An estimator object implementing fit and predict_proba. It is used to classify the examples of the testing set and to obtain the confusion matrix of the testing set. For some experiments both estimators could be the same

  • distance (str, representing the distance function (default='L2')) – It is the name of the distance used to compute the difference between the mixture of the training distribution and the testing distribution. Only used in multiclass problems. Distances supported: ‘HD’, ‘L2’ and ‘L1’

  • verbose (int, optional, (default=0)) – The verbosity level. The default value, zero, means silent mode

estimator_train

Estimator used to classify the examples of the training set

Type:

estimator

estimator_test

Estimator used to classify the examples of the testing bag

Type:

estimator

distance

A string with the name of the distance function (‘HD’/’L1’/’L2’)

Type:

str

predictions_train_

Predictions of the examples in the training set

Type:

ndarray, shape (n_examples, n_classes) (probabilistic estimator)

predictions_test_

Predictions of the examples in the testing bag

Type:

ndarray, shape (n_examples, n_classes) (probabilistic estimator)

needs_predictions_train

It is True because PAC quantifiers need to estimate the training distribution

Type:

bool, True

probabilistic_predictions

It is True, which means that predictions_test_ contains probabilistic predictions

Type:

bool, True

classes_

Class labels

Type:

ndarray, shape (n_classes, )

y_ext_

Repetition (repmat) of the true labels of the training set. When CV_estimator is used with averaged_predictions=False, predictions_train_ will have a larger dimension (factor = n_repetitions * n_folds of the underlying CV) than y. In other cases, y_ext_ == y. y_ext_ is used in the fit method, instead of y, whenever the true labels of the training set are needed

Type:

ndarray, shape (len(predictions_train_), 1)

cm_

Confusion matrix

Type:

ndarray, shape (n_classes, n_classes)

G_, C_, b_

These variables are precomputed in the fit method and are used to solve the optimization problem with quadprog.solve_qp. See the compute_l2_param_train function

Type:

variables of different kinds used to define the optimization problem

problem_

This attribute is set to None in the fit() method. The first time a testing bag is predicted, this attribute will contain the corresponding cvxpy Problem object (if that library is used, i.e., for ‘L1’ and ‘HD’). For the remaining testing bags, this object is reused to allow a warm start, which makes the solving process faster.

Type:

a cvxpy Problem object

verbose

The verbosity level

Type:

int

Notes

Notice that at least one of estimator_train/predictions_train and at least one of estimator_test/predictions_test must not be None. If both members of a pair are None, a ValueError exception will be raised. If both are not None, predictions_train/predictions_test are used

References

Antonio Bella, Cèsar Ferri, José Hernández-Orallo, and María José Ramírez-Quintana. 2010. Quantification via probability estimators. In Proceedings of the IEEE International Conference on Data Mining (ICDM’10). IEEE, 737–742.

fit(X, y, predictions_train=None)[source]

This method performs the following operations: 1) fits the estimators for the training set and the testing set (if needed), and 2) computes predictions_train_ (probabilities) if needed. Both operations are performed by the fit method of its superclass. Finally, the method computes the (probabilistic) confusion matrix using predictions_train_

Parameters:
  • X (array-like, shape (n_examples, n_features)) – Data

  • y (array-like, shape (n_examples, )) – True classes

  • predictions_train (ndarray, shape (n_examples, n_classes)) – Predictions of the training set

Raises:

ValueError – When estimator_train and predictions_train are both None

predict(X, predictions_test=None)[source]

Predict the class distribution of a testing bag

First, predictions_test_ are computed (if needed, i.e., when the predictions_test parameter is None) by the super().predict() method.

After that, the prevalences are computed solving a system of linear scalar equations:

cm_.T * prevalences = PCC(X)

For binary problems the system is directly solved using the original PAC algorithm proposed by Bella et al.

p = (p_0 - PA(negatives)) / (PA(positives) - PA(negatives))

in which PA stands for probability average.

For multiclass problems, the system may not have a solution. Thus, instead we propose to solve an optimization problem of this kind:

Min distance ( cm_.T * prevalences, PCC(X) )

s.t. sum(prevalences) = 1, prevalences_i >= 0

in which distance can be ‘HD’, ‘L1’ or ‘L2’ (default value)
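A minimal numpy sketch of the binary adjustment, grounded only in the formula above (labels are assumed to be {0, 1}, the positive class is column 1, and the result is clipped to a valid prevalence; this is not the library's implementation):

    import numpy as np

    def binary_pac(probs_train, y_train, probs_test):
        """Bella et al.: p = (p_0 - PA(negatives)) / (PA(positives) - PA(negatives))."""
        pa_neg = probs_train[y_train == 0, 1].mean()   # average positive probability over true negatives
        pa_pos = probs_train[y_train == 1, 1].mean()   # average positive probability over true positives
        p0 = probs_test[:, 1].mean()                   # PCC estimate on the testing bag
        return float(np.clip((p0 - pa_neg) / (pa_pos - pa_neg), 0.0, 1.0))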

Parameters:
  • X (array-like, shape (n_examples, n_features)) – Testing bag

  • predictions_test (ndarray, shape (n_examples, n_classes) (default=None)) –

    They must be probabilities (the estimator used must have a predict_proba method)

    If predictions_test is not None they are copied on predictions_test_ and used. If predictions_test is None, predictions for the testing examples are computed using the predict method of estimator_test (it must be an actual estimator)

Raises:

ValueError – When estimator_test and predictions_test are both None

Returns:

prevalences – Contains the predicted prevalence for each class

Return type:

ndarray, shape(n_classes, )

set_fit_request(*, predictions_train='$UNCHANGED$')

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • predictions_train (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_train parameter in fit.

  • self (PAC) –

Returns:

self – The updated object.

Return type:

object

set_predict_request(*, predictions_test='$UNCHANGED$')

Request metadata passed to the predict method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • predictions_test (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_test parameter in predict.

  • self (PAC) –

Returns:

self – The updated object.

Return type:

object