quantificationlib.multiclass.df module¶
Multiclass versions of quantifiers based on representing the distributions using CDFs/PDFs
- class DFX(distribution_function='PDF', n_bins=8, bin_strategy='equal_width', distance='HD', tol=1e-05, verbose=0)[source]¶
Bases:
WithoutClassifiers
Generic Multiclass DFX method
The idea is to represent the mixture of the training distribution and the testing distribution (using CDFs/PDFs) of the features of the input space (X). The difference between both is minimized using a distance/loss function. Originally, (González-Castro et al. 2013) proposed the combination of PDF and Hellinger Distance, but CDF and any other distance/loss function could also be used, like L1 or L2.
The class has two parameters to select:
the method used to represent the distributions (CDFs or PDFs)
the distance used.
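As an illustration of this matching idea, the following sketch evaluates the loss for a candidate prevalence vector. The binned distributions and the Hellinger implementation (using the \(\sqrt{1-BC}\) convention) are hypothetical stand-ins, not the library's internals:

```python
import numpy as np

# Hypothetical binned PDFs: 8 bins, 3 classes (columns) and one testing bag.
rng = np.random.default_rng(0)
train_distrib = rng.random((8, 3))
train_distrib /= train_distrib.sum(axis=0)   # one PDF per class
test_distrib = rng.random(8)
test_distrib /= test_distrib.sum()           # PDF of the testing bag

def hellinger(p, q):
    """Hellinger distance, sqrt(1 - Bhattacharyya coefficient)."""
    return np.sqrt(max(0.0, 1.0 - np.sum(np.sqrt(p * q))))

def loss(prevalences):
    """Distance between the class mixture and the testing distribution."""
    mixture = train_distrib @ prevalences    # weighted mixture of class PDFs
    return hellinger(mixture, test_distrib)

# A quantifier of this family returns the prevalence vector (non-negative,
# summing to 1) that minimizes this loss.
print(loss(np.array([0.2, 0.3, 0.5])))
```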
- Parameters:
distribution_function (str, (default='PDF')) – Type of distribution function used. Two types are supported ‘CDF’ and ‘PDF’
n_bins (int) – Number of bins to compute the PDFs
bin_strategy (str (default='equal_width')) –
- Method to compute the boundaries of the bins:
’equal_width’: bins of equal length (it could be affected by outliers)
’equal_count’: bins of equal counts (considering the examples of all classes)
- ’binormal’: (Only for binary quantification)
It is inspired by the method devised by (Tasche, 2019, Eq (A16b)). The cut points, \(-\infty < c_1 < \ldots < c_{b-1} < \infty\), are computed as follows, based on the assumption that the feature follows a normal distribution within each class:
\(c_i = \frac{\sigma^+ + \sigma^{-}}{2} \ \Phi^{-1}\bigg(\frac{i}{b}\bigg) + \frac{\mu^+ + \mu^{-}}{2} , \quad i=1,\ldots,b-1\)
where \(\Phi^{-1}\) is the quantile function of the standard normal distribution, and \(\mu\) and \(\sigma\) of the normal distribution are estimated as the average of those values for the training examples of each class.
- ’normal’: The idea is that each feature follows a normal distribution. \(\mu\) and \(\sigma\) are
estimated as the weighted mean and standard deviation from the training distribution. The cut points \(-\infty < c_1 < \ldots < c_{b-1} < \infty\) are computed as follows (a short numeric sketch appears after this parameter list):
\(c_i = \sigma \ \Phi^{-1}\bigg(\frac{i}{b}\bigg) + \mu , \quad i=1,\ldots,b-1\)
distance (str, representing the distance function (default='HD')) – It is the name of the distance used to compute the difference between the mixture of the training distribution and the testing distribution
tol (float, (default=1e-05)) – The precision of the solution when search is used to compute the prevalence
verbose (int, optional, (default=0)) – The verbosity level. The default value, zero, means silent mode
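A short numeric sketch of the ’normal’ strategy's cut points; \(\mu\), \(\sigma\) and \(b\) here are made-up values:

```python
import numpy as np
from scipy.stats import norm

mu, sigma, b = 0.5, 0.1, 8                       # illustrative estimates
# Interior cut points c_i = sigma * Phi^{-1}(i / b) + mu, i = 1, ..., b - 1
cuts = sigma * norm.ppf(np.arange(1, b) / b) + mu
# Together with -inf and +inf, these boundaries define b bins of equal
# probability under the assumed normal distribution.
```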
- classes_¶
Class labels
- Type:
ndarray, shape (n_classes, )
- distribution_function¶
Type of distribution function used. Two types are supported ‘CDF’ and ‘PDF’
- Type:
str
- n_bins¶
The number of bins to compute the PDFs
- Type:
int
- bin_strategy¶
Method to compute the boundaries of the bins
- Type:
str
- distance¶
A string with the name of the distance function (‘HD’/’L1’/’L2’) or a distance function
- Type:
str or a distance function
- tol¶
The precision of the solution when search is used to compute the prevalence
- Type:
float
- bincuts_¶
Bin cuts for each input feature
- Type:
ndarray, shape (n_features, b+1)
- train_distrib_¶
The PDF for each class in the training set
- Type:
ndarray, shape (n_bins * n_features, n_classes)
- test_distrib_¶
The PDF for the testing bag
- Type:
ndarray, shape (n_bins * n_features, 1)
- problem_¶
This attribute is set to None in the fit() method. The first time a testing bag is predicted, this attribute stores the corresponding cvxpy Problem object (when that library is used, i.e. in the case of ‘L1’ and ‘HD’). For subsequent testing bags, this object is passed to allow a warm start, which makes the solving process faster.
- Type:
a cvxpy Problem object
- mixtures_¶
Contains the mixtures for all the prevalences in the range [0, 1] step=0.01. This speeds up the prediction for a collection of testing bags
- Type:
ndarray, shape (101, n_quantiles)
- verbose¶
The verbosity level
- Type:
int
References
Víctor González-Castro, Rocío Alaiz-Rodríguez, and Enrique Alegre: Class Distribution Estimation based on the Hellinger Distance. Information Sciences 218 (2013), 146–164.
Aykut Firat. 2016. Unified Framework for Quantification. arXiv preprint arXiv:1606.00868 (2016).
Dirk Tasche: Confidence intervals for class prevalences under prior probability shift. Machine Learning and Knowledge Extraction, 1(3), (2019) 805-831.
- fit(X, y)[source]¶
This method just computes the PDFs for all the classes in the training set. The values are stored in train_distrib_
- Parameters:
X (array-like, shape (n_examples, n_features)) – Data
y (array-like, shape (n_examples, )) – True classes
- predict(X)[source]¶
Predict the class distribution of a testing bag
First, the method computes the PDF for the testing bag.
After that, the prevalences are computed using the corresponding function according to the value of the distance attribute
- Parameters:
X (array-like, shape (n_examples, n_features)) – Testing bag
- Returns:
prevalences – Contains the predicted prevalence for each class
- Return type:
ndarray, shape (n_classes, )
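A minimal usage sketch for DFX; the data is synthetic and only illustrates the fit/predict API documented above:

```python
import numpy as np
from quantificationlib.multiclass.df import DFX

rng = np.random.default_rng(42)
X_train = rng.normal(size=(300, 5))              # synthetic training data
y_train = rng.integers(0, 3, size=300)           # 3 hypothetical classes
X_test = rng.normal(size=(100, 5))               # a testing bag

dfx = DFX(distribution_function='PDF', n_bins=8, distance='HD')
dfx.fit(X_train, y_train)                        # computes train_distrib_
prevalences = dfx.predict(X_test)                # one value per class, sums to 1
```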
- class DFy(estimator_train=None, estimator_test=None, distribution_function='PDF', n_bins=8, bin_strategy='equal_width', distance='HD', tol=1e-05, verbose=0)[source]¶
Bases:
UsingClassifiers
Generic Multiclass DFy method
The idea is to represent the mixture of the training distribution and the testing distribution (using CDFs/PDFs) of the predictions given by a classifier (y). The difference between both is minimized using a distance/loss function. Originally, (González-Castro et al. 2013) proposed the combination of PDF and Hellinger Distance, but CDF and any other distance/loss function could also be used, like L1 or L2. In fact, Forman (2005) proposed using CDFs and a measure equivalent to L1.
The class has two parameters to select:
the method used to represent the distributions (CDFs or PDFs)
the distance used.
This class (as every other class based on distribution matching using classifiers) works in two different ways:
Two estimators are used to classify the training examples and the testing examples in order to compute the distribution of both sets. The estimators can be already trained
You can directly provide the predictions for the examples in the fit/predict methods. This is useful for synthetic/artificial experiments
The goal in both cases is to guarantee that all methods based on distribution matching use exactly the same predictions when this kind of quantifiers are compared (along with others that also employ an underlying classifier, for instance, CC/PCC and AC/PAC). In the first case, the estimators are trained only once and can be shared by several quantifiers of this kind
- Parameters:
estimator_train (estimator object (default=None)) – An estimator object implementing fit and predict_proba. It is used to classify the examples of the training set and to compute the distribution of each class individually
estimator_test (estimator object (default=None)) – An estimator object implementing fit and predict_proba. It is used to classify the examples of the testing set and to compute the distribution of the whole testing set. For some experiments both estimator_train and estimator_test could be the same
distribution_function (str, (default='PDF')) – Type of distribution function used. Two types are supported ‘CDF’ and ‘PDF’
n_bins (int (default=8)) – Number of bins to compute the CDFs/PDFs
bin_strategy (str (default='equal_width')) –
- Method to compute the boundaries of the bins:
’equal_width’: bins of equal length (it could be affected by outliers)
’equal_count’: bins of equal counts (considering the examples of all classes)
- ’binormal’: (Only for binary quantification)
It is inspired by the method devised by (Tasche, 2019, Eq (A16b)). The cut points, \(-\infty < c_1 < \ldots < c_{b-1} < \infty\), are computed as follows, based on the assumption that the feature follows a normal distribution within each class:
\(c_i = \frac{\sigma^+ + \sigma^{-}}{2} \ \Phi^{-1}\bigg(\frac{i}{b}\bigg) + \frac{\mu^+ + \mu^{-}}{2} , \quad i=1,\ldots,b-1\)
where \(\Phi^{-1}\) is the quantile function of the standard normal distribution, and \(\mu\) and \(\sigma\) of the normal distribution are estimated as the average of those values for the training examples of each class.
- ’normal’: The idea is that each feature follows a normal distribution. \(\mu\) and \(\sigma\) are
estimated as the weighted mean and standard deviation from the training distribution. The cut points \(-\infty < c_1 < \ldots < c_{b-1} < \infty\) are computed as follows:
\(c_i = \sigma \ \Phi^{-1}\bigg(\frac{i}{b}\bigg) + \mu , \quad i=1,\ldots,b-1\)
distance (str, representing the distance function (default='HD')) – It is the name of the distance used to compute the difference between the mixture of the training distribution and the testing distribution
tol (float, (default=1e-05)) – The precision of the solution when search is used to compute the prevalence
verbose (int, optional, (default=0)) – The verbosity level. The default value, zero, means silent mode
- classes_¶
Class labels
- Type:
ndarray, shape (n_classes, )
- estimator_train¶
Estimator used to classify the examples of the training set
- Type:
estimator
- estimator_test¶
Estimator used to classify the examples of the testing bag
- Type:
estimator
- predictions_train_¶
Predictions of the examples in the training set
- Type:
ndarray, shape (n_examples, n_classes) (probabilities)
- predictions_test_¶
Predictions of the examples in the testing bag
- Type:
ndarray, shape (n_examples, n_classes) (probabilities)
- needs_predictions_train¶
It is True because DFy quantifiers need to estimate the training distribution
- Type:
bool, True
- probabilistic_predictions¶
This means that predictions_train_/predictions_test_ contain probabilistic predictions
- Type:
bool, True
- bin_strategy¶
Method to compute the boundaries of the bins
- Type:
str
- distance¶
A string with the name of the distance function (‘HD’/’L1’/’L2’) or a distance function
- Type:
str or a distance function
- bincuts_¶
Bin cuts for each input feature
- Type:
ndarray, shape (n_features, b+1)
- tol¶
The precision of the solution when search is used to compute the prevalence
- Type:
float
- y_ext_¶
Replication of the true labels of the training set. When CV_estimator is used with averaged_predictions=False, predictions_train_ will have a larger dimension (factor = n_repetitions * n_folds of the underlying CV) than y. In other cases, y_ext_ == y. y_ext_ is used in the fit method whenever the true labels of the training set are needed, instead of y
- Type:
ndarray, shape (len(predictions_train_), 1)
- distribution_function¶
Type of distribution function used. Two types are supported ‘CDF’ and ‘PDF’
- Type:
str
- n_bins¶
The number of bins to compute the CDFs/PDFs
- Type:
int
- train_distrib_¶
The CDF/PDF for each class in the training set
- Type:
ndarray, shape (n_bins * 1, n_classes) for binary quantification or (n_bins * n_classes, n_classes) for multiclass quantification
- test_distrib_¶
The CDF/PDF for the testing bag
- Type:
ndarray, shape (n_bins * 1, 1) for binary quantification or (n_bins * n_classes, 1) for multiclass quantification
- G_, C_, b_
These variables are precomputed in the fit method and are used for solving the optimization problem using quadprog.solve_qp. See compute_l2_param_train function
- Type:
variables of different kind for defining the optimization problem
- problem_¶
This attribute is set to None in the fit() method. The first time a testing bag is predicted, this attribute stores the corresponding cvxpy Problem object (when that library is used, i.e. in the case of ‘L1’ and ‘HD’). For subsequent testing bags, this object is passed to allow a warm start, which makes the solving process faster.
- Type:
a cvxpy Problem object
- mixtures_¶
Contains the mixtures for all the prevalences in the range [0, 1] step=0.01. This speeds up the prediction for a collection of testing bags
- Type:
ndarray, shape (101, n_quantiles)
- verbose¶
The verbosity level
- Type:
int
Notes
Notice that at least one of estimator_train/predictions_train and estimator_test/predictions_test must not be None. If both are None, a ValueError exception will be raised. If both are not None, predictions_train/predictions_test are used
References
Víctor González-Castro, Rocío Alaiz-Rodríguez, and Enrique Alegre: Class Distribution Estimation based on the Hellinger Distance. Information Sciences 218 (2013), 146–164.
George Forman: Counting positives accurately despite inaccurate classification. In: Proceedings of the 16th European conference on machine learning (ECML’05), Porto, (2005) pp 564–575
Aykut Firat. 2016. Unified Framework for Quantification. arXiv preprint arXiv:1606.00868 (2016).
Dirk Tasche: Confidence intervals for class prevalences under prior probability shift. Machine Learning and Knowledge Extraction, 1(3), (2019) 805-831.
- fit(X, y, predictions_train=None)[source]¶
This method performs the following operations: 1) fits the estimators for the training set and the testing set (if needed), and 2) computes predictions_train_ (probabilities) if needed. Both operations are performed by the fit method of its superclass. After that, the method computes the CDFs/PDFs for all the classes in the training set
- Parameters:
X (array-like, shape (n_examples, n_features)) – Data
y (array-like, shape (n_examples, )) – True classes
predictions_train (ndarray, shape (n_examples, n_classes)) – Predictions of the examples in the training set
- Raises:
ValueError – When estimator_train and predictions_train are both None
- predict(X, predictions_test=None)[source]¶
Predict the class distribution of a testing bag
First, predictions_test_ are computed (if needed, i.e. when the predictions_test parameter is None) by the super().predict() method.
After that, the method computes the CDF/PDF for the testing bag.
Finally, the prevalences are computed using the corresponding function according to the value of the distance attribute
- Parameters:
X (array-like, shape (n_examples, n_features)) – Testing bag
predictions_test (ndarray, shape (n_examples, n_classes) (default=None)) –
They must be probabilities (the estimator used must have a predict_proba method)
If predictions_test is not None, they are copied into predictions_test_ and used. If predictions_test is None, predictions for the testing examples are computed using the predict method of estimator_test (it must be an actual estimator)
- Raises:
ValueError – When estimator_test and predictions_test are both None
- Returns:
prevalences – Contains the predicted prevalence for each class
- Return type:
ndarray, shape (n_classes, )
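A minimal usage sketch for DFy, assuming a scikit-learn classifier with predict_proba; sharing one trained estimator for both roles is one of the configurations described above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from quantificationlib.multiclass.df import DFy

rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 5))              # synthetic training data
y_train = rng.integers(0, 3, size=300)           # 3 hypothetical classes
X_test = rng.normal(size=(100, 5))               # a testing bag

estimator = LogisticRegression()                 # must implement predict_proba
dfy = DFy(estimator_train=estimator, estimator_test=estimator,
          distribution_function='PDF', n_bins=8, distance='HD')
dfy.fit(X_train, y_train)
prevalences = dfy.predict(X_test)                # one value per class, sums to 1
```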
- set_fit_request(*, predictions_train='$UNCHANGED$')¶
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
predictions_train (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_train parameter in fit.
self (DFy) –
- Returns:
self – The updated object.
- Return type:
object
- set_predict_request(*, predictions_test='$UNCHANGED$')¶
Request metadata passed to the predict method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
predictions_test (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_test parameter in predict.
self (DFy) –
- Returns:
self – The updated object.
- Return type:
object
- class HDX(n_bins=8, bin_strategy='equal_width', tol=1e-05)[source]¶
Bases:
DFX
Multiclass HDX method
This class is a wrapper. It just uses all the inherited methods of its superclass (DFX)
References
Víctor González-Castro, Rocío Alaiz-Rodríguez, and Enrique Alegre: Class Distribution Estimation based on the Hellinger Distance. Information Sciences 218 (2013), 146–164.
- class HDy(estimator_train=None, estimator_test=None, n_bins=8, bin_strategy='equal_width', tol=1e-05, verbose=0)[source]¶
Bases:
DFy
Multiclass HDy method
This class is just a wrapper that uses all the inherited methods of its superclass (DFy)
References
Víctor González-Castro, Rocío Alaiz-Rodríguez, and Enrique Alegre: Class Distribution Estimation based on the Hellinger Distance. Information Sciences 218 (2013), 146–164.
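The page describes HDy only as a wrapper; given its name and the reference above, the sketch below assumes (this page does not state it explicitly) that HDy corresponds to DFy with the PDF representation and the Hellinger distance:

```python
from quantificationlib.multiclass.df import DFy, HDy

# Assumed (not stated on this page): HDy(...) behaves like the DFy below.
hdy = HDy(n_bins=8)
dfy_equivalent = DFy(distribution_function='PDF', n_bins=8, distance='HD')
```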
- set_fit_request(*, predictions_train='$UNCHANGED$')¶
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
predictions_train (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_train parameter in fit.
self (HDy) –
- Returns:
self – The updated object.
- Return type:
object
- set_predict_request(*, predictions_test='$UNCHANGED$')¶
Request metadata passed to the predict method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
predictions_test (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_test parameter in predict.
self (HDy) –
- Returns:
self – The updated object.
- Return type:
object
- class MMy(estimator_train=None, estimator_test=None, n_bins=8, bin_strategy='equal_width', tol=1e-05, verbose=0)[source]¶
Bases:
DFy
Multiclass MM method
This class is just a wrapper that uses all the inherited methods of its superclass (DFy)
References
George Forman: Counting positives accurately despite inaccurate classification. In: Proceedings of the 16th European conference on machine learning (ECML’05), Porto, (2005) pp 564–575
- set_fit_request(*, predictions_train='$UNCHANGED$')¶
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
predictions_train (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_train parameter in fit.
self (MMy) –
- Returns:
self – The updated object.
- Return type:
object
- set_predict_request(*, predictions_test='$UNCHANGED$')¶
Request metadata passed to the predict method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
predictions_test (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_test parameter in predict.
self (MMy) –
- Returns:
self – The updated object.
- Return type:
object
- compute_bincuts(x, y=None, classes=None, n_bins=8, bin_strategy='equal_width', att_range=None)[source]¶
Compute the bincuts to calculate a histogram with the values in x. These bincuts depend on the bin strategy
- Parameters:
x (array-like, shape (n_examples, )) – Input feature
y (array-like, shape (n_examples, ), (default=None)) – True classes. It is needed when bin_strategy is ‘binormal’. In other cases, it is ignored
classes (ndarray, shape (n_classes, )) – Class labels
n_bins (int, (default=8)) – Number of bins
bin_strategy (str (default='equal_width')) –
- Method to compute the boundaries of the bins:
’equal_width’: bins of equal length (it could be affected by outliers)
’equal_count’: bins of equal counts (considering the examples of all classes)
- ’binormal’: (Only for binary quantification)
It is inspired by the method devised by (Tasche, 2019, Eq (A16b)). The cut points, \(-\infty < c_1 < \ldots < c_{b-1} < \infty\), are computed as follows, based on the assumption that the feature follows a normal distribution within each class:
\(c_i = \frac{\sigma^+ + \sigma^{-}}{2} \ \Phi^{-1}\bigg(\frac{i}{b}\bigg) + \frac{\mu^+ + \mu^{-}}{2} , \quad i=1,\ldots,b-1\)
where \(\Phi^{-1}\) is the quantile function of the standard normal distribution, and \(\mu\) and \(\sigma\) of the normal distribution are estimated as the average of those values for the training examples of each class.
- ’normal’: The idea is that each feature follows a normal distribution. \(\mu\) and \(\sigma\) are
estimated as the weighted mean and standard deviation from the training distribution. The cut points \(-\infty < c_1 < \ldots < c_{b-1} < \infty\) are computed as follows:
\(c_i = \sigma \ \Phi^{-1}\bigg(\frac{i}{b}\bigg) + \mu , \quad i=1,\ldots,b-1\)
att_range (array-like, (2,1)) – Min and max possible values of the input feature x. These values might not coincide with the actual min and max values of the vector x. For instance, when x represents a set of probabilistic predictions, these values will be 0 and 1
- Returns:
bincuts – Bin cuts for input feature x
- Return type:
ndarray, shape (n_bins + 1, )
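A usage sketch for compute_bincuts with illustrative values; att_range=[0, 1] follows the probabilistic-predictions example mentioned above:

```python
import numpy as np
from quantificationlib.multiclass.df import compute_bincuts

x = np.random.default_rng(1).random(500)         # e.g., probabilistic predictions
bincuts = compute_bincuts(x, n_bins=8, bin_strategy='equal_width',
                          att_range=[0, 1])      # n_bins + 1 boundaries
hist, _ = np.histogram(x, bins=bincuts)          # counts used to build a PDF
```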