ADR 0019 - Deletion Protection with Validating Webhook

Author

Simon Beck

Owner

Schedar

Reviewers

Schedar

Date Created

2024-03-14

Date Updated

2024-03-18

Status

implemented

Tags

framework,service

Summary

We use a custom validation webhook to implement deletion protection for service instances.

Problem

To ensure that no PVCs and buckets are deleted without explicit intention, we need various systems that create deletion protection at different levels. There are multiple layers of deletion protection to consider:

  • Accidental deletion by an end-user → for example deleting a claim without intending to

  • Accidental deletion on the backend → for example a bug in the composition-function unintentionally removes something

Currently, there’s a deletion protection for PostgreSQL in place with a separate controller. However, this controller only protects against accidental deletion of the claim, nothing else. It also lacks support for other services and contains a few bugs which prevent it from cleaning up the resources after the grace period.

Goals

  • Easy to use end-user deletion protection

  • Easy to extend and integrate backend deletion protection

Non-Goals

  • N/A

Proposals

The proposals are split between the end-user and backend deletion protection.

End-user Proposal 1: Protect claim deletion via webhooks

The deletion of the claim is blocked via a webhook. There will be a new field in every claim that controls the deletion protection. For example .spec.parameters.backup.deletionProtection, if set to true, the webhook will not allow any deletion operations against the claim. To delete a claim, the field has to be explicitly set to a specific value (like false), and be applied with that value first.

Advantages
  • No actual deletion happens

  • No messing around with finalizers

  • No undelete process necessary, no deletion happened to begin with

  • Directly provides feedback to the user with a helpful message

  • Apart from the webhooks no separate controller is necessary for the implementation

Disadvantages
  • Breaks some k8s expected behavior

  • Changing the value in a GitOps environment, as the claim needs to be committed with the disabled deletion protection first

  • Not possible to implement a grace period, as no deletionTimestamp is set

End-user Proposal 2: Protect claims via finalizers and separate controller

There’s a controller which watches all claims and sets an additional finalizer on them. Additionally, all relevant objects within the service also get a finalizer, for example the namespace or pvcs. This finalizer stays on the objects as long as the deletion protection is active. It’s controlled via the claim for example via .spec.parameters.backup.deletionProtection.

Advantages
  • A grace period can be implemented, after which the controller removes the finalizers

  • The existing controller can be extended with the features

Disadvantages
  • Each service will need its own configuration to specify which objects need finalizers

  • As Crossplane doesn’t trigger any reconciles after the deletionTimestamp is set, a separate controller is necessary

  • The deletionTimestamp gets set on the objects, there’s no way to undelete the object at this point. Any important data or manifest needs to be copied away

  • Prone to bugs, as the current implementation shows

Backend Proposal 1: Set finalizers on specific objects

Every service specifies a set of objects that will receive additional finalizers. The field to control this protection should be the same as for the end-user protection (.spec.parameters.backup.deletionProtection).

Advantages
  • A grace period can be implemented, after which the controller removes the finalizers

  • The existing controller can be extended with the features

Disadvantages
  • Each service will need its own configuration to specify which objects need finalizers

  • As Crossplane doesn’t trigger any reconciles after the deletionTimestamp is set, a separate controller is necessary

  • The deletionTimestamp gets set on the objects, there’s no way to undelete the object at this point. Any important data or manifest needs to be copied away

Backend Proposal 2: Add additional logic to composition functions to ensure important objects aren’t removed

Have a composition function which checks at the end, some specified objects from the observed state are still present in the desired state. Either the function could simply add the observed object back, or simply abort the composition function with a fatal return.

Advantages
  • No actual deletion happens

  • No messing around with finalizers

  • No undelete process necessary, no deletion happened to begin with

Disatvantages
  • Increases complexity of the composition functions

  • Every service needs specific settings

  • Doesn’t protect against deletion from outside AppCat

  • It’s within composition functions itself

Backend Proposal 3: Deletion protection by custom validation webhook

A validation webhook that will check on namespaces and pvcs. If a namespace/pvc belongs to an active AppCat instance, it will block the deletion of the object. An active AppCat instance is determined by the existence of the composite. So as long as the composite is not deleted, then the deletion of the namespace or pvc belonging to that composite will be blocked by the validation webhook.

Advantages
  • Can be implemented very generic, only core objects and metadata of composites are needed

  • Not part of the AppCat composition functions

  • No actual deletion happens

  • No messing around with finalizers

  • No undelete process necessary, no deletion happened to begin with

Disatvantages
  • No grace period time

Complementary Proposal: Use canary instances on the lab

One instance of each service runs on the lab. Each new release will be deployed on the lab and tested extensively. After all tests succeed and no issues on the existing instances is observed, the release continues

Advantages
  • Some of the issues can be detected during testing on the lab

  • Rather easy to implement

Disadvantages
  • Would not have protected from the previous incident

  • More expensive, as lab resources would be used permanently

Decision

  • End-user Proposal 1: Protect claim deletion via webhooks

  • Backend Proposal 3: Deletion protection by custom validation webhook

  • Minimal backend proposal 2: Make a mini function that makes sure that the namespace is always in the desired state.

Rationale

While we loose the ability to delay the deletion with Webhooks, they are less complex to implement. Patching and managing finalizers on objects that are already managed by another controller leads to race conditions and bugs, as the current implementation shows. Webhooks also provide a much nicer user experience all around, as the user will get direct feedback from the system, if an invalid deletion command is issued. As the deletion will not hit the actual Kubernetes API, the deletionTimestamp is not set on the object, which effectively results in a no-op. Adding a small function which ensures that the namespace is always present doesn’t increase the complexity too much.