quantificationlib.ensembles.eoq module¶
- class EoQ(base_quantifier=None, n_quantifiers=100, bag_generator=<quantificationlib.bag_generator.PriorShift_BagGenerator object>, combination_strategy='mean', ensemble_estimator_train=None, ensemble_estimator_test=None, distribution_function='PDF', n_bins=100, bin_strategy='equal_width', distance_bags='euclidean', percentage_of_selected_models=0.5, verbose=0)[source]¶
Bases: WithoutClassifiers
This class implements Ensembles of Quantifiers for any kind of quantifier. All the quantifiers of the ensemble are of the same class and use the same parameters.
- Parameters:
base_quantifier (quantifier object, optional, (default=None)) – The quantifier used for each model of the ensemble
n_quantifiers (int, (default=100)) – Number of quantifiers in the ensemble
bag_generator (BagGenerator object (default=PriorShift_BagGenerator())) – Object to generate the bags (with a selected shift) for training each quantifier
combination_strategy (str, (default='mean')) – Strategy used to combine the predictions of the quantifiers
ensemble_estimator_train (estimator object, optional, (default=None)) – Estimator used to classify the examples of the training bags when a base_quantifier of class UsingClassifiers is used. A regular estimator can be used; this implies that a single classifier is shared by all the quantifiers in the ensemble. If the user prefers that each quantifier use an individual classifier, then an estimator of the class EnsembleOfClassifiers must be passed here
ensemble_estimator_test (estimator object, optional, (default=None)) – Estimator used to classify the examples of the testing bags. A regular estimator can be used; this implies that a single classifier is shared by all the quantifiers in the ensemble. If the user prefers that each quantifier use an individual classifier, then an estimator of the class EnsembleOfClassifiers must be passed here
distribution_function (str, (default='PDF')) – Method to estimate the distributions of training and testing bags. Possible values 'PDF' or 'CDF'. This is used only for the distribution_similarity combination strategy. This strategy compares the PDFs/CDFs of the training bags with the PDF/CDF of the testing bag, selecting those quantifiers trained over the most similar distributions. To compute the distribution, EoQ employs the input features (Xs) for quantifiers derived from the WithoutClassifiers class and the predictions (Ys) for quantifiers derived from UsingClassifiers
n_bins (int, (default=100)) – Number of bins to estimate the distributions of training and testing bags. This is needed for the distribution_similarity combination strategy.
bin_strategy (str, (default='equal_width')) –
Method to compute the boundaries of the bins used to estimate the distributions of training and testing bags when the distribution_similarity combination strategy is used. Possible values:
'equal_width': bins of equal width (it could be affected by outliers)
'equal_count': bins of equal counts (considering the examples of all classes)
- 'binormal': (only for binary quantification) inspired by the method devised by Tasche (2019, Eq. (A16b)); see the sketch after this parameter list. The cut points, \(-\infty < c_1 < \ldots < c_{b-1} < \infty\), are computed as follows, based on the assumption that the features follow a normal distribution:
\(c_i = \frac{\sigma^+ + \sigma^{-}}{2} \ \Phi^{-1}\bigg(\frac{i}{b}\bigg) + \frac{\mu^+ + \mu^{-}}{2}, \quad i=1,\ldots,b-1\)
where \(\Phi^{-1}\) is the quantile function of the standard normal distribution, and the \(\mu\) and \(\sigma\) of the normal distribution are estimated as the average of the class-wise values (\(\mu^{\pm}\), \(\sigma^{\pm}\)) computed from the training examples of each class.
- 'normal': assumes that each feature follows a normal distribution; \(\mu\) and \(\sigma\) are estimated as the weighted mean and standard deviation of the training distribution. The cut points \(-\infty < c_1 < \ldots < c_{b-1} < \infty\) are computed as follows:
\(c_i = \sigma \ \Phi^{-1}\bigg(\frac{i}{b}\bigg) + \mu, \quad i=1,\ldots,b-1\)
distance_bags (str, (default='euclidean')) – Distance used to compute distribution similarity
percentage_of_selected_models (float, value in [0, 1], (default=0.5)) – Percentage of selected models for distribution similarity and prevalence similarity strategies
verbose (int, optional, (default=0)) – The verbosity level. The default value, zero, means silent mode
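The following minimal sketch (not part of the library) illustrates the 'binormal' cut-point formula quoted above for a binary problem. The function name binormal_bincuts and the toy data are assumptions chosen for illustration; this is not the library's internal implementation.

import numpy as np
from scipy.stats import norm

def binormal_bincuts(scores, y, n_bins=100):
    # Interior cut points c_1, ..., c_{b-1} following the binormal rule:
    # c_i = (sigma+ + sigma-)/2 * Phi^{-1}(i/b) + (mu+ + mu-)/2
    mu_pos, sigma_pos = scores[y == 1].mean(), scores[y == 1].std()
    mu_neg, sigma_neg = scores[y == 0].mean(), scores[y == 0].std()
    i = np.arange(1, n_bins)
    return (sigma_pos + sigma_neg) / 2 * norm.ppf(i / n_bins) + (mu_pos + mu_neg) / 2

# Toy example: one feature, two roughly normal class-conditional distributions
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.3, 0.1, 500), rng.normal(0.7, 0.1, 500)])
y = np.concatenate([np.zeros(500, dtype=int), np.ones(500, dtype=int)])
print(binormal_bincuts(scores, y, n_bins=10))   # 9 interior boundaries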
- base_quantifier¶
The quantifier used for each model of the ensemble
- Type:
quantifier object
- n_quantifiers¶
Number of quantifiers in the ensemble
- Type:
int
- bag_generator¶
Object to generate the bags for training each quantifier
- Type:
BagGenerator object
- combination_strategy¶
Strategy used to combine the predictions of the ensemble quantifiers; see the sketch after this attribute list
- Type:
str
- ensemble_estimator_train¶
Estimator used to classify the examples of the training bags when a base_quantifier of class UsingClassifiers is used
- Type:
estimator object
- ensemble_estimator_test¶
Estimator used to classify the examples of the testing bags
- Type:
estimator object
- distribution_function¶
Method to estimate the distributions of training and testing bags
- Type:
str
- n_bins¶
Number of bins to estimate the distributions of training and testing bags
- Type:
int
- bin_strategy¶
Method to compute the boundaries of the bins used to estimate the distributions of training and testing bags
- Type:
str
- distance_bags¶
Distance used to compute distribution similarity
- Type:
str
- percentage_of_selected_models¶
Percentage of selected models for distribution similarity and prevalence similarity strategies
- Type:
float
- quantifiers_¶
This vector stores the quantifiers of the ensemble
- Type:
ndarray, shape (n_quantifiers,)
- prevalences_¶
It contains the prevalence of each training bag used to fit each quantifier of the ensemble
- Type:
ndarray, shape (n_quantifiers,)
- indexes_¶
The indexes of the training examples that compose each training bag. The number of training examples used in each bag is fixed through the bag_generator parameter
- Type:
ndarray, shape (n_examples_of_training_bags, n_quantifiers)
- bincuts_¶
Bin cuts for each feature used to estimate the training/testing distributions for the distribution similarity strategy. The total number of features depends on the kind of base_quantifier used and on the quantification problem. For quantifiers derived from WithoutClassifiers, n_features is the dimension of the input space. For quantifiers derived from UsingClassifiers, n_features is 1 for binary quantification tasks and n_classes for multiclass/ordinal problems
- Type:
ndarray, shape (n_features, n_bins + 1)
- distributions_¶
It contains the estimated distribution for each quantifier
- Type:
ndarray, shape (n_quantifiers, n_features * n_bins)
- classes_¶
Class labels
- Type:
ndarray, shape (n_classes, )
- verbose¶
The verbosity level
- Type:
int
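As a rough illustration of the default 'mean' combination strategy (an assumption about its behaviour, not the library's code): each quantifier of the ensemble produces a prevalence vector for the testing bag, and the ensemble prediction averages them.

import numpy as np

# Hypothetical per-quantifier predictions for a 3-class problem,
# shape (n_quantifiers, n_classes); each row sums to 1.
per_quantifier_prevalences = np.array([[0.2, 0.3, 0.5],
                                       [0.1, 0.4, 0.5],
                                       [0.3, 0.3, 0.4]])
combined = per_quantifier_prevalences.mean(axis=0)   # 'mean' combination
print(combined, combined.sum())                      # average still sums to 1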
- fit(X, y, predictions_train=None, prevalences=None, indexes=None)[source]¶
This method performs the following tasks:
It generates the training bags using a Bag_Generator object
It fits the quantifiers of the ensemble.
In the case of quantifiers derived from the class UsingClassifiers, there are 3 possible ways to do this:
train a classifier for each bag; to do this, an object of the class EnsembleOfClassifiers must be passed as ensemble_estimator
train a classifier for the whole training set using an estimator from another class
use the predictions given in the predictions_train parameter (these predictions are usually obtained by applying an estimator over the whole training set, as in the previous case)
- Parameters:
X (array-like, shape (n_examples, n_features)) – Data
y (array-like, shape (n_examples, )) – True classes
predictions_train (ndarray, optional) – Predictions of the examples in the training set: shape (n_examples, 1) for crisp predictions, shape (n_examples, n_classes) for probabilistic predictions with a regular estimator, or shape (n_examples, n_estimators, n_classes) with an instance of EnsembleOfClassifiers
prevalences (array-like, shape (n_classes, n_bags)) – The i-th column contains the true prevalences of the i-th bag
indexes (array-like, shape (bag_size, n_bags)) – The i-th column contains the indexes of the examples of the i-th bag
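A minimal usage sketch follows. The EoQ constructor arguments match the signature above; the import path of the base quantifier (CC) is an assumption and may differ in your installed version of quantificationlib, and LogisticRegression here plays the role of a single classifier shared by all quantifiers of the ensemble.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

from quantificationlib.ensembles.eoq import EoQ
from quantificationlib.bag_generator import PriorShift_BagGenerator
from quantificationlib.baselines.cc import CC   # assumed import path

X, y = make_classification(n_samples=2000, n_classes=2, random_state=0)

eoq = EoQ(base_quantifier=CC(),
          n_quantifiers=20,
          bag_generator=PriorShift_BagGenerator(),
          combination_strategy='mean',
          ensemble_estimator_train=LogisticRegression(),   # shared classifier
          ensemble_estimator_test=LogisticRegression())
eoq.fit(X, y)   # training bags are generated internally by bag_generator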
- predict(X, predictions_test=None)[source]¶
- Parameters:
X (array-like, shape (n_examples, n_features)) – Testing bag
predictions_test (ndarray, (default=None) shape (n_examples, n_classes) if ensemble_estimator_train is not COMPLETE) – Predictions for the testing bag
- Returns:
prevalences – Each value contains the predicted prevalence for the corresponding class: shape (n_classes,) if an individual combination strategy is selected, or a dictionary with the predictions for all strategies if 'all' is selected.
- Return type:
ndarray, shape(n_classes, ) or dict
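Continuing the fit sketch above (same assumptions apply): with combination_strategy='mean' a single prevalence vector is returned, while 'all' would return a dictionary keyed by strategy name.

X_test = X[:500]                      # pretend this is an unlabeled testing bag
prevalences = eoq.predict(X_test)     # ndarray of shape (n_classes,)
print(prevalences)                    # predicted prevalence per class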
- set_fit_request(*, indexes='$UNCHANGED$', predictions_train='$UNCHANGED$', prevalences='$UNCHANGED$')¶
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
indexes (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for indexes parameter in fit.
predictions_train (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_train parameter in fit.
prevalences (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for prevalences parameter in fit.
self (EoQ) –
- Returns:
self – The updated object.
- Return type:
object
- set_predict_request(*, predictions_test='$UNCHANGED$')¶
Request metadata passed to the predict method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
predictions_test (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_test parameter in predict.
self (EoQ) –
- Returns:
self – The updated object.
- Return type:
object