quantificationlib.ensembles.eoq module

class EoQ(base_quantifier=None, n_quantifiers=100, bag_generator=<quantificationlib.bag_generator.PriorShift_BagGenerator object>, combination_strategy='mean', ensemble_estimator_train=None, ensemble_estimator_test=None, distribution_function='PDF', n_bins=100, bin_strategy='equal_width', distance_bags='euclidean', percentage_of_selected_models=0.5, verbose=0)[source]

Bases: WithoutClassifiers

This class implements Ensembles of Quantifiers for all kinds of quantifiers. All the quantifiers in the ensemble belong to the same class and use the same parameters.

Parameters:
  • base_quantifier (quantifier object, optional, (default=None)) – The quantifier used for each model of the ensemble

  • n_quantifiers (int, (default=100)) – Number of quantifiers in the ensemble

  • bag_generator (BagGenerator object (default=PriorShift_BagGenerator())) – Object to generate the bags (with a selected shift) for training each quantifier

  • combination_strategy (str, (default='mean')) – Strategy used to combine the predictions of the quantifiers

  • ensemble_estimator_train (estimator object, optional, (default=None)) – Estimator used to classify the examples of the training bags when a base_quantifier of class UsingClassifiers is used. A regular estimator can be used; this implies that a single classifier is shared by all the quantifiers in the ensemble. If the user prefers that each quantifier uses an individual classifier, then an estimator of the class EnsembleOfClassifiers must be passed here

  • ensemble_estimator_test (estimator object, optional, (default=None)) – Estimator used to classify the examples of the testing bags. A regular estimator can be used; this implies that a single classifier is shared by all the quantifiers in the ensemble. If the user prefers that each quantifier uses an individual classifier, then an estimator of the class EnsembleOfClassifiers must be passed here

  • distribution_function (str, (default='PDF')) – Method to estimate the distributions of training and testing bags. Possible values: ‘PDF’ or ‘CDF’. This is used only for the distribution_similarity combination strategy. This strategy is based on comparing the PDFs or CDFs of the training bags with the PDF/CDF of the testing bag, selecting those quantifiers trained on the most similar distributions. To compute the distribution, EoQ employs the input features (Xs) for quantifiers derived from the WithoutClassifiers class and the predictions (Ys) for quantifiers derived from UsingClassifiers

  • n_bins (int, (default=100)) – Number of bins used to estimate the distributions of training and testing bags. This is needed for the distribution_similarity combination strategy.

  • bin_strategy (str, (default='equal_width')) –

    Method to compute the boundaries of the bins used to estimate the distributions of training and testing bags when the distribution_similarity combination strategy is used. Possible values:

    • ’equal_width’: bins of equal length (it could be affected by outliers)

    • ’equal_count’: bins of equal counts (considering the examples of all classes)

    • ’binormal’: (Only for binary quantification) It is inspired by the method devised by

      (Tasche, 2019, Eq (A16b)). The cut points, \(-\infty < c_1 < \ldots < c_{b-1} < \infty\), are computed as follows, based on the assumption that the features follow a normal distribution:

      \(c_i = \frac{\sigma^+ + \sigma^{-}}{2} \ \Phi^{-1}\bigg(\frac{i}{b}\bigg) + \frac{\mu^+ + \mu^{-}}{2} , \quad i=1,\ldots,b-1\)

      where \(\Phi^{-1}\) is the quantile function of the standard normal distribution, and \(\mu\) and \(\sigma\) of the normal distribution are estimated as the average of those values for the training examples of each class.

    • ’normal’: It assumes that each feature follows a normal distribution. \(\mu\) and \(\sigma\) are

      estimated as the weighted mean and standard deviation of the training distribution. The cut points \(-\infty < c_1 < \ldots < c_{b-1} < \infty\) are computed as follows:

      \(c_i = \sigma \, \Phi^{-1}\bigg(\frac{i}{b}\bigg) + \mu , \quad i=1,\ldots,b-1\)

  • distance_bags (str, (default='euclidean')) – Distance used to compute distribution similarity

  • percentage_of_selected_models (float, value in [0, 1], (default=0.5)) – Percentage of selected models for distribution similarity and prevalence similarity strategies

  • verbose (int, optional, (default=0)) – The verbosity level. The default value, zero, means silent mode
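As a quick illustration of the ’normal’ bin_strategy formula above, the cut points can be computed with the quantile function of the standard normal distribution. The sketch below uses only the Python standard library; the helper name `normal_bincuts` is illustrative and not part of quantificationlib:

```python
from statistics import NormalDist

def normal_bincuts(mu, sigma, n_bins):
    """Cut points c_i = sigma * Phi^{-1}(i / b) + mu, for i = 1, ..., b - 1,
    as in the 'normal' bin_strategy formula above (illustrative helper)."""
    phi_inv = NormalDist().inv_cdf  # quantile function of the standard normal
    return [sigma * phi_inv(i / n_bins) + mu for i in range(1, n_bins)]

# three interior cut points splitting N(0, 1) into 4 equal-probability bins
cuts = normal_bincuts(mu=0.0, sigma=1.0, n_bins=4)
```

With mu=0 and sigma=1 the middle cut is 0 and the outer cuts are symmetric, since \(\Phi^{-1}(1/2)=0\) and \(\Phi^{-1}(1/4)=-\Phi^{-1}(3/4)\).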

base_quantifier

The quantifier used for each model of the ensemble

Type:

quantifier object

n_quantifiers

Number of quantifiers in the ensemble

Type:

int

bag_generator

Object to generate the bags for training each quantifier

Type:

BagGenerator object

combination_strategy

Strategy used to combine the predictions of the ensemble quantifiers

Type:

str

ensemble_estimator_train

Estimator used to classify the examples of the training bags when a base_quantifier of class UsingClassifiers is used

Type:

estimator object

ensemble_estimator_test

Estimator used to classify the examples of the testing bags

Type:

estimator object

distribution_function

Method to estimate the distributions of training and testing bags

Type:

str

n_bins

Numbers of bins to estimate the distributions of training and testing bags

Type:

int

bin_strategy

Method to compute the boundaries of the bins used to estimate the distributions of training and testing bags

Type:

str

distance_bags

Distance used to compute distribution similarity

Type:

str

percentage_of_selected_models

Percentage of selected models for distribution similarity and prevalence similarity strategies

Type:

float

quantifiers_

This vector stores the quantifiers of the ensemble

Type:

ndarray, shape (n_quantifiers,)

prevalences_

It contains the prevalence of each training bag used to fit each quantifier of the ensemble

Type:

ndarray, shape (n_quantifiers,)

indexes_

The indexes of the training examples that compose each training bag. The number of training examples used in each bag is fixed through the bag_generator parameter

Type:

ndarray, shape (n_examples_of_training_bags, n_quantifiers)

bincuts_

Bin cuts for each feature used to estimate the training/testing distributions for the distribution similarity strategy. The total number of features depends on the kind of base_quantifier used and on the quantification problem. For quantifiers derived from WithoutClassifiers, n_features is the dimension of the input space. For quantifiers derived from UsingClassifiers, n_features is 1 for binary quantification tasks and n_classes for multiclass/ordinal problems

Type:

ndarray, shape (n_features, n_bins + 1)

distributions_

It contains the estimated distribution for each quantifier

Type:

ndarray, shape (n_quantifiers, n_features * n_bins)

classes_

Class labels

Type:

ndarray, shape (n_classes, )

verbose

The verbosity level

Type:

int

fit(X, y, predictions_train=None, prevalences=None, indexes=None)[source]

This method does the following tasks:

  1. It generates the training bags using a Bag_Generator object

  2. It fits the quantifiers of the ensemble.

In the case of quantifiers derived from the class UsingClassifiers, there are 3 possible ways to do this:

  • train a classifier for each bag. To do this, an object of the class EnsembleOfClassifiers must be passed in ensemble_estimator_train

  • train a classifier for the whole training set using an estimator from any other class

  • use the predictions given in the predictions_train parameter (these predictions are usually obtained by applying an estimator over the whole training set, as in the previous case)

Parameters:
  • X (array-like, shape (n_examples, n_features)) – Data

  • y (array-like, shape (n_examples, )) – True classes

  • predictions_train (ndarray, optional) – Predictions of the examples in the training set: shape (n_examples, 1) (crisp predictions), shape (n_examples, n_classes) (probabilities from a regular estimator), or shape (n_examples, n_estimators, n_classes) (with an instance of EnsembleOfClassifiers)

  • prevalences (array-like, shape (n_classes, n_bags)) – i-th row contains the true prevalences of each bag

  • indexes (array-like, shape (bag_size, n_bags)) – i-th column contains the indexes of the examples for the i-th bag
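The layout of the indexes parameter can be made concrete with a small sketch. The helper below is hypothetical and only mirrors the documented shape (bag_size, n_bags), where column i selects the training examples of the i-th bag:

```python
def bag_examples(X, indexes, i):
    """Return the examples of the i-th training bag: column i of `indexes`
    holds the row indices into X (sketch of the documented layout)."""
    return [X[row[i]] for row in indexes]

X = [[0.1], [0.2], [0.3], [0.4]]   # 4 training examples, 1 feature
indexes = [[0, 2],                 # shape (bag_size=3, n_bags=2):
           [1, 3],                 # column 0 -> bag 0, column 1 -> bag 1
           [3, 0]]
bag0 = bag_examples(X, indexes, 0)  # examples 0, 1 and 3 form the first bag
```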

predict(X, predictions_test=None)[source]
Parameters:
  • X (array-like, shape (n_examples, n_features)) – Testing bag

  • predictions_test (ndarray, optional, shape (n_examples, n_classes), (default=None)) – Predictions for the testing bag

Returns:

prevalences – Each value contains the predicted prevalence for the corresponding class: an ndarray of shape (n_classes, ) if a single combination strategy is selected, or a dictionary with the predictions for all strategies if ‘all’ is selected.

Return type:

ndarray, shape(n_classes, ) or dict
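For intuition, the default ‘mean’ combination strategy amounts to averaging the prevalence vectors predicted by the individual quantifiers. The following is a minimal sketch of that idea, not the library’s implementation:

```python
def combine_mean(prevalence_list):
    """Average the per-quantifier prevalence vectors and renormalize so the
    combined prediction sums to 1 (sketch of the 'mean' strategy)."""
    n = len(prevalence_list)
    n_classes = len(prevalence_list[0])
    avg = [sum(p[c] for p in prevalence_list) / n for c in range(n_classes)]
    total = sum(avg)  # renormalize to guard against rounding drift
    return [v / total for v in avg]

# three quantifiers, two classes: the combined prevalence is their mean
p = combine_mean([[0.2, 0.8], [0.4, 0.6], [0.3, 0.7]])  # ~[0.3, 0.7]
```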

set_fit_request(*, indexes='$UNCHANGED$', predictions_train='$UNCHANGED$', prevalences='$UNCHANGED$')

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • indexes (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for indexes parameter in fit.

  • predictions_train (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_train parameter in fit.

  • prevalences (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for prevalences parameter in fit.

  • self (EoQ) –

Returns:

self – The updated object.

Return type:

object

set_predict_request(*, predictions_test='$UNCHANGED$')

Request metadata passed to the predict method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • predictions_test (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for predictions_test parameter in predict.

  • self (EoQ) –

Returns:

self – The updated object.

Return type:

object