quantificationlib.estimators.cross_validation module

Estimator object based on Cross Validation

class CV_estimator(estimator, groups=None, cv='warn', n_jobs=None, fit_params=None, pre_dispatch='2*n_jobs', averaged_predictions=True, voting='hard', verbose=0)[source]

Bases: BaseEstimator, ClassifierMixin

Cross Validation Estimator

The idea is to have an estimator whose model is formed by the models trained in a cross-validation (CV). This object is needed to estimate the distributions of the training set and of the testing sets. Its fit method trains the models of the CV, and the usual predict and predict_proba methods compute the predictions using those models. As a result, this object can be used by any distribution matching method that requires an estimator to represent the distributions.

Parameters (mainly the same as those of the cross_validate function in sklearn):

  • estimator (estimator object implementing fit) – The object to use to fit the data.

  • groups (array-like, with shape (n_samples,), optional) – Group labels for the samples used while splitting the dataset into train/test set.

  • cv (int, cross-validation generator or an iterable, optional) –

    Determines the cross-validation splitting strategy. Possible inputs for cv are:

    • None, to use the default 3-fold cross validation,

    • integer, to specify the number of folds in a (Stratified)KFold,

    • CV splitter,

    • An iterable yielding (train, test) splits as arrays of indices.

    For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.

  • n_jobs (int or None, optional (default=None)) – The number of CPUs to use to do the computation. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

  • fit_params (dict, optional) – Parameters to pass to the fit method of the estimator.

  • pre_dispatch (int, or string, optional) –

    Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:

    • None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs

    • An int, giving the exact number of total jobs that are spawned

    • A string, giving an expression as a function of n_jobs, as in 2*n_jobs

  • averaged_predictions (bool, optional (default=True)) – If True, predict and predict_proba methods average the predictions given by estimators_ for each example

  • voting (str, {'hard', 'soft'} (default='hard')) – Only used when averaged_predictions is True. If ‘hard’, the predict and predict_proba methods apply majority rule voting. If ‘soft’, the class label is predicted from the argmax of the sums of the predicted probabilities, which is recommended for an ensemble of well-calibrated classifiers.

  • verbose (integer, optional (default=0)) – The verbosity level.

estimator

The estimator to fit each model of the CV

Type:

An estimator object

estimators_

The list of estimators trained by the fit method. The number of estimators equals the number of folds times the number of repetitions.

Type:

list of trained estimators

averaged_predictions

Determines whether the predictions for each example given by estimators_ are averaged or not

Type:

bool

voting

How predictions are aggregated:

  • ‘hard’, applying majority rule voting

  • ‘soft’, based on the argmax of the sums of the predicted probabilities

Type:

str, {‘hard’, ‘soft’} (default=‘hard’)
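
The difference between the two voting modes can be illustrated with a minimal, library-independent sketch. The helper names hard_vote and soft_vote, the class labels and the probabilities are hypothetical and chosen only for illustration; this is not the code used by CV_estimator.

```python
from collections import Counter


def hard_vote(labels_per_model):
    """Majority rule: return the label predicted by most models."""
    return Counter(labels_per_model).most_common(1)[0][0]


def soft_vote(probs_per_model, classes):
    """Sum the per-class probabilities over the models and take the argmax."""
    sums = [sum(column) for column in zip(*probs_per_model)]
    return classes[sums.index(max(sums))]


classes = ["neg", "pos"]

# Hypothetical probabilities given by three fold models for ONE example.
probs = [[0.55, 0.45], [0.60, 0.40], [0.10, 0.90]]

# Crisp label of each model: the class with the highest probability.
labels = [classes[p.index(max(p))] for p in probs]

hard = hard_vote(labels)         # two of the three models say "neg"
soft = soft_vote(probs, classes)  # summed probabilities favour "pos"
```

In this example the two modes disagree: two weakly confident models outvote one strongly confident model under ‘hard’, while ‘soft’ lets the confident model dominate. This is why ‘soft’ is usually suggested only when the probabilities are well calibrated.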

le_

Used to compute the class labels

Type:

a LabelEncoder fitted object

classes_

Class labels

Type:

ndarray, shape (n_classes, )

X_train_

Training data. It is needed to obtain the predictions over the training set itself

Type:

array-like, shape (n_examples, n_features)

y_train_

True classes. They are needed to obtain the predictions over the training set itself

Type:

array-like, shape (n_examples, )

verbose

The verbosity level.

Type:

integer

fit(X, y)[source]

Fit the models. It calls cross_validate to fit the models and saves them in the estimators_ attribute. It also stores some attributes needed by predict and predict_proba, namely le_, classes_, X_train_ and y_train_

Parameters:
  • X (array-like, shape (n_examples, n_features)) – Data

  • y (array-like, shape (n_examples, )) – True classes
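
As a rough sketch of the mechanism described above, scikit-learn's cross_validate can return the fitted model of every split when return_estimator=True. The snippet below uses only scikit-learn, not quantificationlib. The synthetic dataset, the LogisticRegression base estimator and the 3-fold, 2-repetition splitter are assumptions made for the example, and the actual implementation of fit may differ in its details.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

# Synthetic binary problem, used only for illustration.
X, y = make_classification(n_samples=120, n_features=5, random_state=0)

# 3 folds repeated 2 times give 6 train/test splits, hence 6 fitted models.
cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=2, random_state=0)

results = cross_validate(
    LogisticRegression(max_iter=1000),
    X,
    y,
    cv=cv,
    return_estimator=True,  # keep the model fitted on each split
)

fold_models = results["estimator"]  # plays the role of estimators_
n_models = len(fold_models)         # folds times repetitions
```

Each element of fold_models is a fitted classifier, so the collection can then be queried with predict or predict_proba, one model at a time.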

predict(X)[source]

Returns the crisp predictions given by a CV estimator

Parameters:

X (array-like, shape (n_examples, n_features)) – Test data

Returns:

preds – Crisp predictions for the examples in X

Training set:
  • averaged_predictions == True, shape(n_examples, )

  • averaged_predictions == False, shape(n_examples * n_reps, )

Testing set:
  • averaged_predictions == True, shape(n_examples, )

  • averaged_predictions == False, shape(n_examples * n_reps * n_folds, )

Return type:

array-like, shape depends on whether the examples belong to the training set or to a testing set, and on the value of averaged_predictions
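
The row counts listed above can be summarised with a small helper. The function expected_rows is hypothetical and only restates the documented shapes. The interpretation given in its comments (a training example is held out once per repetition, while a test example is scored by every model) is an assumption about why the shapes differ.

```python
def expected_rows(n_examples, n_folds, n_reps, averaged_predictions, training_set):
    """Number of rows returned by predict, following the documented shapes."""
    if averaged_predictions:
        # One aggregated prediction per example in both cases.
        return n_examples
    if training_set:
        # Assumed reading: each training example is predicted once per
        # repetition, by the model for which it was held out.
        return n_examples * n_reps
    # Assumed reading: each test example is predicted by every model.
    return n_examples * n_reps * n_folds


train_rows = expected_rows(100, n_folds=5, n_reps=2,
                           averaged_predictions=False, training_set=True)
test_rows = expected_rows(100, n_folds=5, n_reps=2,
                          averaged_predictions=False, training_set=False)
```

With 100 examples, 5 folds and 2 repetitions, the non-averaged output has 200 rows for the training set and 1000 rows for a testing set.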

predict_proba(X)[source]

Returns probabilistic predictions given by a CV estimator

Parameters:

X (array-like, shape (n_examples, n_features)) – Test data

Returns:

preds – Probabilistic predictions for the examples in X.

Training set:
  • averaged_predictions == True, shape(n_examples, n_classes)

  • averaged_predictions == False, shape(n_examples * n_reps, n_classes)

Testing set:
  • averaged_predictions == True, shape(n_examples, n_classes)

  • averaged_predictions == False, shape(n_examples * n_reps * n_folds, n_classes)

Return type:

array-like, shape depends on whether the examples belong to the training set or to a testing set, and on the value of averaged_predictions
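
To make the effect of averaged_predictions on probabilistic outputs concrete, the following library-independent sketch combines the probability matrices of several models in two ways. The function combine_probabilities and the numbers are hypothetical. The row order of the stacked output is an assumption, and the sketch ignores the voting parameter.

```python
def combine_probabilities(per_model_probs, averaged=True):
    """Combine per-model matrices of shape (n_examples, n_classes).

    averaged=True  -> one row per example, the mean over the models.
    averaged=False -> all rows stacked, model after model.
    """
    if averaged:
        n_models = len(per_model_probs)
        return [
            [sum(column) / n_models for column in zip(*rows)]
            for rows in zip(*per_model_probs)  # rows of one example, all models
        ]
    return [row for probs in per_model_probs for row in probs]


# Two hypothetical models scoring the same two test examples (two classes).
per_model = [
    [[0.8, 0.2], [0.4, 0.6]],
    [[0.6, 0.4], [0.2, 0.8]],
]

averaged = combine_probabilities(per_model, averaged=True)   # 2 rows
stacked = combine_probabilities(per_model, averaged=False)   # 4 rows
```

Averaging keeps one distribution per example, whereas stacking keeps every individual prediction, which may suit distribution matching methods that benefit from more samples of the predicted distribution.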

set_score_request(*, sample_weight='$UNCHANGED$')

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). See the scikit-learn User Guide for how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.


Returns:

self – The updated object.

Return type:

object