art.estimators.poison_mitigation

This module implements all poison mitigation models in ART.

Keras Neural Cleanse Classifier

class art.estimators.poison_mitigation.KerasNeuralCleanse(model: keras.models.Model | tf.keras.models.Model, use_logits: bool = False, channels_first: bool = False, clip_values: CLIP_VALUES_TYPE | None = None, preprocessing_defences: Preprocessor | List[Preprocessor] | None = None, postprocessing_defences: Postprocessor | List[Postprocessor] | None = None, preprocessing: PREPROCESSING_TYPE = (0.0, 1.0), input_layer: int = 0, output_layer: int = 0, steps: int = 1000, init_cost: float = 0.001, norm: int | float = 2, learning_rate: float = 0.1, attack_success_threshold: float = 0.99, patience: int = 5, early_stop: bool = True, early_stop_threshold: float = 0.99, early_stop_patience: int = 10, cost_multiplier: float = 1.5, batch_size: int = 32)

Implementation of methods in Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks. Wang et al. (2019).

__init__(model: keras.models.Model | tf.keras.models.Model, use_logits: bool = False, channels_first: bool = False, clip_values: CLIP_VALUES_TYPE | None = None, preprocessing_defences: Preprocessor | List[Preprocessor] | None = None, postprocessing_defences: Postprocessor | List[Postprocessor] | None = None, preprocessing: PREPROCESSING_TYPE = (0.0, 1.0), input_layer: int = 0, output_layer: int = 0, steps: int = 1000, init_cost: float = 0.001, norm: int | float = 2, learning_rate: float = 0.1, attack_success_threshold: float = 0.99, patience: int = 5, early_stop: bool = True, early_stop_threshold: float = 0.99, early_stop_patience: int = 10, cost_multiplier: float = 1.5, batch_size: int = 32)

Create a Neural Cleanse classifier.

Parameters:
  • model – Keras model (keras or tf.keras) to be wrapped by the classifier.

  • use_logits (bool) – True if the output of the model is logits; False for probabilities or any other type of output. Logits output should be favored when possible to ensure attack efficiency.

  • channels_first (bool) – Set channels first or last.

  • clip_values – Tuple of the form (min, max) of floats or np.ndarray representing the minimum and maximum values allowed for features. If floats are provided, these will be used as the range of all features. If arrays are provided, each value will be considered the bound for a feature, thus the shape of clip values needs to match the total number of features.

  • preprocessing_defences – Preprocessing defence(s) to be applied by the classifier.

  • postprocessing_defences – Postprocessing defence(s) to be applied by the classifier.

  • preprocessing – Tuple of the form (subtrahend, divisor) of floats or np.ndarray of values to be used for data preprocessing. The first value will be subtracted from the input. The input will then be divided by the second one.

  • input_layer (int) – The index of the layer to consider as input for models with multiple input layers. The layer with this index will be considered for computing gradients. For models with only one input layer this value is not required.

  • output_layer (int) – Which layer to consider as the output when the model has multiple output layers. The layer with this index will be considered for computing gradients. For models with only one output layer this value is not required.

  • steps (int) – The maximum number of steps to run the Neural Cleanse optimization

  • init_cost (float) – The initial value for the cost tensor in the Neural Cleanse optimization

  • norm – The norm to use for the Neural Cleanse optimization, can be 1, 2, or np.inf

  • learning_rate (float) – The learning rate for the Neural Cleanse optimization

  • attack_success_threshold (float) – The threshold at which the generated backdoor is successful enough to stop the Neural Cleanse optimization

  • patience (int) – How long to wait for changing the cost multiplier in the Neural Cleanse optimization

  • early_stop (bool) – Whether or not to allow early stopping in the Neural Cleanse optimization

  • early_stop_threshold (float) – How close values need to come to max value to start counting early stop

  • early_stop_patience (int) – How long to wait to determine early stopping in the Neural Cleanse optimization

  • cost_multiplier (float) – How much to change the cost in the Neural Cleanse optimization

  • batch_size (int) – The batch size for optimizations in the Neural Cleanse optimization
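
Example (a minimal construction sketch, not a definitive recipe): the snippet below assumes backdoored_model is an already-trained Keras model whose inputs are scaled to [0, 1]; the remaining arguments shown are the documented defaults.

    # Assumption: `backdoored_model` is a compiled, trained keras/tf.keras model.
    from art.estimators.poison_mitigation import KerasNeuralCleanse

    cleanse = KerasNeuralCleanse(
        model=backdoored_model,
        use_logits=False,
        clip_values=(0.0, 1.0),   # feature range of the input data
        steps=1000,               # maximum optimization steps
        learning_rate=0.1,
        norm=2,                   # mask regularization norm: 1, 2, or np.inf
        batch_size=32,
    )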

abstain() ndarray

Abstain from a prediction.

Returns:

A numpy array of zeros.

backdoor_examples(x_val: ndarray, y_val: ndarray) Tuple[ndarray, ndarray, ndarray]

Generate reverse-engineered backdoored examples using validation data.

Parameters:
  • x_val (ndarray) – Validation data.

  • y_val (ndarray) – Validation labels.

Returns:

A tuple containing (clean data, backdoored data, labels).
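
Example (usage sketch): assuming cleanse is the classifier constructed above and x_val / y_val are held-out validation arrays; the recovered trigger can also be checked with check_backdoor_effective.

    # Reverse-engineer backdoored examples from validation data, then test
    # whether any recovered trigger actually changes the model's predictions.
    clean_x, backdoored_x, backdoored_y = cleanse.backdoor_examples(x_val, y_val)
    effective = cleanse.check_backdoor_effective(backdoored_x, backdoored_y)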

property channels_first: bool

Returns:

Boolean to indicate index of the color channels in the sample x.

check_backdoor_effective(backdoor_data: ndarray, backdoor_labels: ndarray) bool

Check if supposed backdoors are effective against the classifier

Return type:

bool

Parameters:
  • backdoor_data (ndarray) – data with the backdoor added

  • backdoor_labels (ndarray) – the correct label for the data

Returns:

True if any of the backdoors are effective on the model.

class_gradient(x: ndarray, label: int | List[int] | ndarray | None = None, training_mode: bool = False, **kwargs) ndarray

Compute per-class derivatives of the original classifier w.r.t. x.

Return type:

ndarray

Parameters:
  • x (ndarray) – Sample input with shape as expected by the model.

  • label – Index of a specific per-class derivative. If an integer is provided, the gradient of that class output is computed for all samples. If multiple values are provided, the first dimension should match the batch size of x, and each value will be used as target for its corresponding sample in x. If None, then gradients for all classes will be computed for each sample.

  • training_mode (bool) – True for model set to training mode and False for model set to evaluation mode.

Returns:

Array of gradients of input features w.r.t. each class in the form (batch_size, nb_classes, input_shape) when computing for all classes, otherwise shape becomes (batch_size, 1, input_shape) when label parameter is specified.

property clip_values: CLIP_VALUES_TYPE | None

Return the clip values of the input samples.

Returns:

Clip values (min, max).

clone_for_refitting() KerasClassifier

Create a copy of the classifier that can be refit from scratch. The copy will inherit the same architecture, optimizer, and initialization as the cloned model, but without its weights.

Returns:

new classifier

compute_loss(x: ndarray, y: ndarray, reduction: str = 'none', **kwargs) ndarray

Compute the loss of the neural network for samples x.

Parameters:
  • x (ndarray) – Samples of shape (nb_samples, nb_features) or (nb_samples, nb_pixels_1, nb_pixels_2, nb_channels) or (nb_samples, nb_channels, nb_pixels_1, nb_pixels_2).

  • y (ndarray) – Target values (class labels) one-hot-encoded of shape (nb_samples, nb_classes) or indices of shape (nb_samples,).

  • reduction (str) – Specifies the reduction to apply to the output: ‘none’ | ‘mean’ | ‘sum’. ‘none’: no reduction will be applied; ‘mean’: the sum of the output will be divided by the number of elements in the output; ‘sum’: the output will be summed.

Returns:

Loss values.

Return type:

Format as expected by the model

compute_loss_from_predictions(pred: ndarray, y: ndarray, **kwargs) ndarray

Compute the loss of the estimator for predictions pred.

Return type:

ndarray

Parameters:
  • pred (ndarray) – Model predictions.

  • y (ndarray) – Target values.

Returns:

Loss values.

custom_loss_gradient(nn_function, tensors, input_values, name='default')

Returns the gradient of the nn_function with respect to model input

Parameters:
  • nn_function (a Keras tensor) – an intermediate tensor representation of the function to differentiate

  • tensors (list) – the tensors or variables to differentiate with respect to

  • input_values (list) – the inputs to evaluate the gradient

  • name (str) – The name of the function. Functions of the same name are cached

Returns:

the gradient of the function w.r.t. the provided tensors

Return type:

np.ndarray

fit(*args, **kwargs)

Fit the classifier on the training set (x, y).

Parameters:
  • x – Training data.

  • y – Target values (class labels) one-hot-encoded of shape (nb_samples, nb_classes) or index labels of shape (nb_samples,).

  • batch_size – Size of batches.

  • nb_epochs – Number of epochs to use for training.

  • verbose – Display training progress bar.

  • kwargs – Dictionary of framework-specific arguments. These should be parameters supported by the fit_generator function in Keras and will be passed to this function as such. Including the number of epochs or the number of steps per epoch as part of this argument will result in an error.

fit_generator(generator: DataGenerator, nb_epochs: int = 20, verbose: bool = False, **kwargs) None

Fit the classifier using the generator that yields batches as specified.

Parameters:
  • generator – Batch generator providing (x, y) for each epoch. If the generator can be used for native training in Keras, it will be used for that purpose.

  • nb_epochs (int) – Number of epochs to use for training.

  • verbose (bool) – Display training progress bar.

  • kwargs – Dictionary of framework-specific arguments. These should be parameters supported by the fit_generator function in Keras and will be passed to this function as such. Including the number of epochs as part of this argument will result in an error.

generate_backdoor(x_val: ndarray, y_val: ndarray, y_target: ndarray) Tuple[ndarray, ndarray]

Generates a possible backdoor for the model, returning the pattern and the mask.

Returns:

A tuple of the pattern and mask for the model.
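
Example (sketch): assuming cleanse, x_val and y_val from the earlier examples; target_class is a hypothetical class index, and art.utils.to_categorical is used only to build the one-hot optimization target.

    # Recover a candidate (pattern, mask) pair for a single target class.
    from art.utils import to_categorical

    target_class = 0  # assumption: the class suspected of being the backdoor target
    y_target = to_categorical([target_class], cleanse.nb_classes)
    pattern, mask = cleanse.generate_backdoor(x_val, y_val, y_target)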

get_activations(x: ndarray, layer: int | str, batch_size: int = 128, framework: bool = False) ndarray

Return the output of the specified layer for input x. layer is specified by layer index (between 0 and nb_layers - 1) or by name. The number of layers can be determined by counting the results returned by calling layer_names.

Return type:

ndarray

Parameters:
  • x (ndarray) – Input for computing the activations.

  • layer – Layer for computing the activations.

  • batch_size (int) – Size of batches.

  • framework (bool) – If true, return the intermediate tensor representation of the activation.

Returns:

The output of layer, where the first dimension is the batch size corresponding to x.

get_params() Dict[str, Any]

Get all parameters and their values of this estimator.

Returns:

A dictionary of string parameter names to their value.

property input_layer: int

The index of the layer considered as input for models with multiple input layers. For models with only one input layer the index is 0.

Returns:

The index of the layer considered as input for models with multiple input layers.

property input_shape: Tuple[int, ...]

Return the shape of one input sample.

Returns:

Shape of one input sample.

property layer_names: List[str] | None

Return the names of the hidden layers in the model, if applicable.

Returns:

The names of the hidden layers in the model; input and output layers are ignored.

Warning

layer_names tries to infer the internal structure of the model. This feature comes with no guarantees on the correctness of the result. The intended order of the layers tries to match their order in the model, but this is not guaranteed either.

loss_gradient(x: ndarray, y: ndarray, training_mode: bool = False, **kwargs) ndarray

Compute the gradient of the loss function w.r.t. x.

Return type:

ndarray

Parameters:
  • x (ndarray) – Sample input with shape as expected by the model.

  • y (ndarray) – Target values (class labels) one-hot-encoded of shape (nb_samples, nb_classes) or indices of shape (nb_samples,).

  • training_mode (bool) – True for model set to training mode and False for model set to evaluation mode.

Returns:

Array of gradients of the same shape as x.

mitigate(x_val: ndarray, y_val: ndarray, mitigation_types: List[str]) None

Mitigates the effect of poison on a classifier

Parameters:
  • x_val (ndarray) – Validation data to use to mitigate the effect of poison.

  • y_val (ndarray) – Validation labels to use to mitigate the effect of poison.

  • mitigation_types – The types of mitigation method to apply; can include ‘unlearning’, ‘pruning’, or ‘filtering’.

Returns:

None.
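
Example (usage sketch): assuming cleanse, x_val and y_val as above; the mitigation types listed are the documented options and can be applied individually or together.

    # Run all three documented mitigation methods on clean validation data.
    cleanse.mitigate(x_val, y_val, mitigation_types=["unlearning", "pruning", "filtering"])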

property model

Return the model.

Returns:

The model.

property nb_classes: int

Return the number of output classes.

Returns:

Number of classes in the data.

outlier_detection(x_val: ndarray, y_val: ndarray) List[Tuple[int, ndarray, ndarray]]

Returns a list of suspected poison labels together with their masks and patterns.

Returns:

A list of tuples containing the class index, mask, and pattern for suspected labels.
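
Example (usage sketch): assuming cleanse, x_val and y_val as above, the returned tuples can be inspected directly.

    # Flag classes whose reverse-engineered trigger mask is anomalously small.
    suspects = cleanse.outlier_detection(x_val, y_val)
    for class_idx, mask, pattern in suspects:
        print(f"class {class_idx} suspected; mask L1 size = {mask.sum():.2f}")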

property output_layer: int

The index of the layer considered as output for models with multiple output layers. For models with only one output layer the index is 0.

Returns:

The index of the layer considered as output for models with multiple output layers.

predict(*args, **kwargs)

Perform prediction of the given classifier for a batch of inputs, potentially filtering suspicious input

Parameters:
  • x – Input data to predict.

  • batch_size – Batch size.

  • training_mode – True for model set to training mode and False for model set to evaluation mode.

Returns:

Array of predictions of shape (nb_inputs, nb_classes).
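
Example (sketch): assuming cleanse has already been mitigated with the ‘filtering’ option and x_test is a batch of test inputs; abstained rows are all-zero vectors, following the convention of abstain().

    import numpy as np

    # Predict and count the inputs the defence abstained on (all-zero rows).
    preds = cleanse.predict(x_test, batch_size=32)
    abstained = np.all(preds == 0, axis=1)
    print(f"abstained on {int(abstained.sum())} of {len(preds)} inputs")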

reset()

Reset the state of the defense.

save(filename: str, path: str | None = None) None

Save a model to file in the format specific to the backend framework. For Keras, .h5 format is used.

Parameters:
  • filename (str) – Name of the file where to store the model.

  • path – Path of the folder where to store the model. If no path is specified, the model will be stored in the default data location of the library ART_DATA_PATH.

set_params(**kwargs) None

Take a dictionary of parameters and apply checks before setting them as attributes.

Parameters:

kwargs – A dictionary of attributes.

property use_logits: bool

A boolean representing whether the outputs of the model are logits.

Returns:

a boolean representing whether the outputs of the model are logits.

Mixin Base Class STRIP

class art.estimators.poison_mitigation.STRIPMixin(predict_fn: Callable[[ndarray], ndarray], num_samples: int = 20, false_acceptance_rate: float = 0.01, **kwargs)

Implementation of STRIP: A Defence Against Trojan Attacks on Deep Neural Networks (Gao et al., 2020)

__init__(predict_fn: Callable[[ndarray], ndarray], num_samples: int = 20, false_acceptance_rate: float = 0.01, **kwargs) None

Create a STRIP defense

Parameters:
  • predict_fn – The predict function of the original classifier

  • num_samples (int) – The number of samples to use to test entropy at inference time

  • false_acceptance_rate (float) – The acceptable false acceptance rate.

abstain() ndarray

Abstain from a prediction.

Returns:

A numpy array of zeros.

mitigate(x_val: ndarray) None

Mitigates the effect of poison on a classifier

Parameters:

x_val (ndarray) – Validation data to use to mitigate the effect of poison.

property nb_classes: int

Return the number of output classes.

Returns:

Number of classes in the data.

predict(*args, **kwargs)

Perform prediction of the given classifier for a batch of inputs, potentially filtering suspicious input

Parameters:

x – Input samples.

Returns:

Array of predictions of shape (nb_inputs, nb_classes).
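
Example (sketch of the STRIP inference-time flow): strip_classifier is assumed to be a classifier that already has STRIPMixin applied (for example through ART's STRIP transformer defence), and x_clean / x_test are assumed NumPy arrays of benign calibration data and test inputs.

    import numpy as np

    # Calibrate the entropy threshold on clean data, then predict while
    # abstaining on suspected trojaned inputs (returned as all-zero rows).
    strip_classifier.mitigate(x_clean)
    preds = strip_classifier.predict(x_test)
    flagged = np.all(preds == 0, axis=1)
    print(f"STRIP abstained on {int(flagged.sum())} suspicious inputs")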