[SIP-149] Proposal for Kubernetes Operator for Apache Superset
Motivation
Apache Superset's Helm chart [1] [2] is widely used and receives regular contributions, reflecting the popularity of Kubernetes-based deployments within the community. However, Helm's reliance on static templates, its duplicated code, its lack of a built-in testing framework, and its limited support for advanced lifecycle management make the chart opaque and error-prone to maintain, and pose significant downtime risks for large-scale deployments that rely on it.
This proposal introduces a Kubernetes Operator [3] (hereafter referred to as "the Operator"), offering a Kubernetes-native approach to managing Superset deployments. The Operator will provide similar configuration options to the Helm chart, while addressing its limitations and introducing features like better testing, observability and automation. This proposal aligns with the approach taken by other Apache projects, such as Apache Flink [4] [5] and Apache Druid [6] [7], whose communities have embraced operators to manage their deployments more effectively.
Proposed Change
The Operator will introduce a Custom Resource Definition (CRD) [8] for managing Superset deployments declaratively. Key features include:
- Helm-Aligned Configuration: A configuration model similar to the Helm chart, exposing commonly needed configuration options.
- Enhanced Observability: Built-in support for metrics collection, making it easier to monitor key operator-related metrics (reconciliation successes/failures, durations, etc.); see the metrics sketch after this list.
- Improved Lifecycle Management: All resources created by the operator will include an owner reference to the custom resource (CR), ensuring that the Kubernetes Garbage Collector automatically cleans up these resources when the CR is deleted. The operator also lays the groundwork for advanced features like staged upgrades, rollbacks, and downgrades, which are currently not possible using the Helm chart.
- Enhanced Testing: The Operator will leverage the Operator SDK [9] testing framework, making it easier to validate bug fixes and improvements while ensuring greater reliability and maintainability over time.
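To illustrate the observability point above: custom operator metrics of this kind are typically registered with controller-runtime's global metrics registry so they are exposed on the operator's /metrics endpoint. A minimal sketch, assuming controller-runtime and the Prometheus client as dependencies (the metric names are illustrative, not final):

```go
// metrics.go — hypothetical sketch of custom operator metrics, registered with
// controller-runtime's global registry so they appear on the /metrics endpoint.
package controller

import (
	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

var (
	// reconcileErrors counts failed reconciliations per Superset resource.
	reconcileErrors = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "superset_operator_reconcile_errors_total",
			Help: "Number of failed reconciliations, labelled by resource.",
		},
		[]string{"namespace", "name"},
	)

	// reconcileDuration tracks how long each reconciliation loop takes.
	reconcileDuration = prometheus.NewHistogram(
		prometheus.HistogramOpts{
			Name:    "superset_operator_reconcile_duration_seconds",
			Help:    "Duration of reconciliation loops in seconds.",
			Buckets: prometheus.DefBuckets,
		},
	)
)

func init() {
	// Register with controller-runtime's registry rather than the default
	// Prometheus registry, so these appear alongside the built-in metrics.
	metrics.Registry.MustRegister(reconcileErrors, reconcileDuration)
}
```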
The Operator would be placed in a separate repo under the Apache GitHub org, preferably apache/superset-kubernetes-operator. While the ASF doesn't explicitly stamp Docker images or other convenience releases, the constraints that are imposed by the ASF on their repos will provide an added layer of security, ensuring that the built and released artifacts are legitimate and have met the strict requirements of the ASF. Having a dedicated repo would also make it easier to maintain dedicated CI workflows, and would decrease traffic on the main repo by keeping its own set of Releases, PRs, Issues and Discussions.
New or Changed Public Interfaces
- Kubernetes CRDs: A Superset CRD for declarative configuration of the deployment. This structure will be similar to the values.yaml in the current Helm chart. For key components in the Superset deployment, like the workers and scheduler, there will be separate CRDs to capture specific details of those components (a hypothetical sketch of the underlying Go types follows this list).
- Operator Image: Docker image for the Operator, built using the Go-based Operator SDK.
- Deployment Artifacts: YAML manifests for deploying the Operator, with optional OLM support.
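As a rough illustration of the CRD structure, the Go types backing the Superset CRD could look something like the sketch below. All type and field names are hypothetical and simply mirror commonly used values.yaml keys; the actual schema would be finalized during implementation.

```go
// api/v1alpha1/superset_types.go — hypothetical sketch of a Superset CRD schema,
// mirroring commonly used values.yaml keys from the Helm chart.
package v1alpha1

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// SupersetSpec defines the desired state of a Superset deployment.
type SupersetSpec struct {
	// Image is the Superset image to deploy, e.g. "apache/superset:4.1.1".
	Image string `json:"image"`

	// ConfigOverrides holds snippets merged into superset_config.py,
	// analogous to configOverrides in the Helm chart's values.yaml.
	ConfigOverrides map[string]string `json:"configOverrides,omitempty"`

	// WebWorker, AsyncWorker and AsyncScheduler configure the core components.
	WebWorker      ComponentSpec `json:"webWorker,omitempty"`
	AsyncWorker    ComponentSpec `json:"asyncWorker,omitempty"`
	AsyncScheduler ComponentSpec `json:"asyncScheduler,omitempty"`
}

// ComponentSpec captures per-component settings shared by the component CRDs.
type ComponentSpec struct {
	Replicas  *int32                      `json:"replicas,omitempty"`
	Resources corev1.ResourceRequirements `json:"resources,omitempty"`
}

// SupersetStatus reports the observed state of the deployment.
type SupersetStatus struct {
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status

// Superset is the Schema for the supersets API.
type Superset struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   SupersetSpec   `json:"spec,omitempty"`
	Status SupersetStatus `json:"status,omitempty"`
}
```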
Figure 1. A Superset deployment based on the current Helm chart, where Helm renders manifests based on the values.yaml file and Helm chart, and applies them to the target namespace.
Figure 2. Diagram depicting the proposed operator based flow, where the Operator is deployed in its own namespace, and continuously reconciles the desired state in the custom Superset resources. The CRD ensures that the Superset manifests are valid and applies defaults as needed.
The Operator will consist of a main controller that reconciles the main Superset CR, and component-specific controllers to manage the core Superset components (currently the Web Worker, Async Worker and Async Scheduler). The main controller will reconcile common shared resources, like Ingress/Gateway, Service and ConfigMaps, and will also reconcile CRs for the core Superset components. These will then in turn be reconciled by their respective controllers, which ultimately create the Deployments running the application components (a rough sketch of this pattern follows the list below). Having a main Superset CRD with separate CRDs for the main components has the following advantages:
- Simplicity: Having a single CR will ensure shared properties are defined only once.
- Modularity: As new components are introduced, these will be introduced as new CRDs and controllers that focus on reconciling a single component. This gives admins the freedom to define Superset components directly without interacting via the main Superset CRD. This would be similar to creating a ReplicaSet without first creating a Deployment.
- Transparency: Having separate resources will make it easier to check the specific status of each component, as they may fail independently.
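The sketch below illustrates the parent/child pattern described above: the main reconciler builds a component CR and sets an owner reference so that the Kubernetes garbage collector cleans it up when the Superset CR is deleted. The SupersetWebWorker types and the module path are assumptions for illustration, following standard controller-runtime conventions.

```go
// superset_controller.go — hypothetical sketch of the main reconciler creating a
// SupersetWebWorker child CR owned by the Superset CR, so that deleting the
// Superset resource lets the Kubernetes garbage collector clean up its children.
package controller

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

	supersetv1alpha1 "github.com/example/superset-operator/api/v1alpha1" // hypothetical module path
)

// SupersetReconciler reconciles the top-level Superset custom resource.
type SupersetReconciler struct {
	client.Client
	Scheme *runtime.Scheme
}

func (r *SupersetReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Fetch the parent Superset CR; if it was deleted, owned children are
	// garbage-collected automatically thanks to their owner references.
	var superset supersetv1alpha1.Superset
	if err := r.Get(ctx, req.NamespacedName, &superset); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Build the desired web worker CR from the shared parent spec.
	webWorker := &supersetv1alpha1.SupersetWebWorker{
		ObjectMeta: metav1.ObjectMeta{
			Name:      superset.Name + "-web",
			Namespace: superset.Namespace,
		},
		Spec: supersetv1alpha1.SupersetWebWorkerSpec{
			Image:    superset.Spec.Image,
			Replicas: superset.Spec.WebWorker.Replicas,
		},
	}

	// The owner reference ties the child's lifecycle to the parent CR.
	if err := controllerutil.SetControllerReference(&superset, webWorker, r.Scheme); err != nil {
		return ctrl.Result{}, err
	}

	// Create the child if it doesn't exist yet; diffs/updates are handled similarly.
	var existing supersetv1alpha1.SupersetWebWorker
	err := r.Get(ctx, types.NamespacedName{Name: webWorker.Name, Namespace: webWorker.Namespace}, &existing)
	if apierrors.IsNotFound(err) {
		return ctrl.Result{}, r.Create(ctx, webWorker)
	}
	return ctrl.Result{}, err
}
```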
Changes to SIP and Release Process
To ensure breaking changes to Superset are handled by the Operator, the following changes would need to be made to existing processes:
- SIP process: A section for required changes to the Operator would be added to the SIP template. Most changes don't impact the infrastructure deployment process, but some do, such as the addition or removal of workers or other critical components like the Celery scheduler. These changes would need to be made to the development version of the Operator before the changes are made generally available.
- Release process: If a breaking change is introduced that requires changes to the Operator, the official Operator images would be updated as follows (a sketch of these bound checks follows this list):
- Previous release: Checks for upper bounds for the Superset version would be added to the Operator to report a warning/error if an unsupported version is chosen. This would be reported both in the CR's status and via a metric.
- Next release: Similar lower bound checks would be added to the Operator if the new version is incapable of supporting a deprecated/removed feature on an old Superset version.
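The sketch below shows how such bound checks could be implemented. The helper, condition type, metric name and version bounds are all hypothetical; semver parsing is assumed to use the Masterminds/semver library.

```go
// versioncheck.go — hypothetical sketch of upper/lower bound checks on the
// requested Superset version, surfaced both as a status condition and a metric.
package controller

import (
	"fmt"

	"github.com/Masterminds/semver/v3" // assumed dependency for version parsing
	"github.com/prometheus/client_golang/prometheus"
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// unsupportedVersion would be registered with the operator's metrics registry
// (see the earlier metrics sketch); shown here only for completeness.
var unsupportedVersion = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "superset_operator_unsupported_version",
		Help: "Set to 1 when the requested Superset version is outside the supported range.",
	},
	[]string{"namespace", "name", "version"},
)

// checkVersionBounds validates the requested version against the range baked
// into this Operator release and records the result in the CR's status
// conditions and in a metric, without blocking reconciliation.
func checkVersionBounds(requested string, conditions *[]metav1.Condition, namespace, name string) error {
	constraint, err := semver.NewConstraint(">=3.1.0, <5.0.0") // illustrative bounds
	if err != nil {
		return err
	}
	version, err := semver.NewVersion(requested)
	if err != nil {
		return fmt.Errorf("invalid Superset version %q: %w", requested, err)
	}

	status := metav1.ConditionTrue
	reason := "VersionSupported"
	if !constraint.Check(version) {
		status = metav1.ConditionFalse
		reason = "VersionUnsupported"
		unsupportedVersion.WithLabelValues(namespace, name, requested).Set(1)
	} else {
		unsupportedVersion.WithLabelValues(namespace, name, requested).Set(0)
	}

	// Surface the result on the CR so admins can see it via kubectl describe.
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:    "VersionSupported",
		Status:  status,
		Reason:  reason,
		Message: fmt.Sprintf("Superset version %s checked against supported range", requested),
	})
	return nil
}
```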
To keep the releases of Superset and the Operator aligned, we would ensure that all currently supported Superset versions are backed by an Operator release. As we officially maintain "the latest minor of the last two majors" [10], the Operator would also support these; at the time of writing, that would mean 4.1 and 3.1. Note that the Operator version would not track the official Superset version, as breaking changes that require changes to the Operator are fairly uncommon.
New dependencies
The Operator will rely on the Go-based Operator SDK [11] for its implementation and testing framework. Beyond this, it will share the same core dependencies as the existing Helm chart, such as Kubernetes APIs and configurations, but without requiring Helm as a dependency.
Migration Plan and Compatibility
Migrating from the Helm chart to the Operator will be straightforward, as the Operator’s CRD will closely align with the structure of the current values.yaml
used in the Helm chart. Additionally, the resources created by the Operator will closely mimic those generated by the Helm chart, ensuring consistency and familiarity. Administrators already familiar with managing Superset via Helm will find the transition intuitive.
Benefits
- Kubernetes-Native Management: A set of clean CRDs and continuous reconciliation provide a more natural Kubernetes experience.
- Dynamic Lifecycle Features: The Operator lays the foundation for advanced features like staged upgrades and automated recovery. These are difficult to achieve using the current Helm-based approach.
- Enhanced Observability: Prometheus-compatible metrics make it easy to monitor the Operator and Superset deployments.
- Improved Testing: Operator SDK enables comprehensive testing, both full integration tests and lightweight unit tests, improving reliability.
- Helm Independence: Users can deploy Superset without relying on Helm.
Proposed Operator Scope and Deprecation of Helm Chart
We propose deprecating the Helm chart once the Operator is deemed stable to avoid the burden of maintaining both. The Operator will also exclude reconciliation support for PostgreSQL and Redis. Users can continue using Helm for these services or adopt dedicated operators [12] [13], ensuring a more focused approach for managing Superset.
Rejected Alternatives
- Enhancing the Helm Chart: Helm is limited in its ability to support advanced lifecycle features, testing, dynamic reconciliation, and observability.
- Standalone Scripts: Scripts lack maintainability and alignment with Kubernetes-native workflows.
- Existing operators: No open-source operators provide a clean CRD or are aligned with Superset’s Helm chart configurations.
Comment From: michael-s-molina
Thank you for the proposal @villebro. Do we plan to officially support this operator for official releases? If yes, could you enhance the SIP explaining how the Release Process would be affected?
Comment From: villebro
@michael-s-molina thanks for the feedback. Version support has not been a major issue in the current Helm chart, as it's mostly decoupled from the Superset release process. However, you're right that major changes, like the introduction/removal of new worker types, would definitely cause a breaking change in the operator, too. I will add a section to cover this.
Comment From: michael-s-molina
I will add a section to cover this.
Thanks. Please consider any necessary changes to RELEASING/README.md.
Comment From: villebro
I will add a section to cover this.
Thanks. Please consider any necessary changes to RELEASING/README.md.
@michael-s-molina I think it's actually mostly relevant for the SIP process, rather than the release process. Any major breaking changes or new advanced features that affect how Superset is deployed may affect how the Docker image is built, our Docker Compose flows, and ultimately the Kubernetes deployment model. A few examples:
- Global Async Queries using WebSockets: there is an unofficial oneacrefund/superset-websocket image that requires an extra deployment on Kubernetes, which is currently supported by Helm. In retrospect, [SIP-43] should have addressed how this would be supported in all the currently existing deployment models.
- Addition/removal of a critical component: Assuming we were to replace Celery Beat with another scheduler, that would need to be considered during the SIP review, as it would likely require changing what the scheduler Deployment looks like.
Therefore, major changes should be handled as follows:
1. Infra-related breaking changes will need to be raised during the SIP process to ensure they're considered before the vote.
2. Support for accepted SIPs will need to be added to the dev version of the Operator, so that a clear warning/error can be emitted if the chosen Superset version is unsupported. Note that this will be easier to support in the Operator, but more difficult in Helm, as Helm doesn't easily support this type of logic.
3. Once the new version that introduces the breaking change is released, the affected versions of the Operator should be patched with logic to check whether they are compatible with the newly introduced version.
Comment From: mistercrunch
Do these typically live in mono-repo or in their own repo?
Comment From: villebro
@mistercrunch I would place this in a separate repo, similar to what Flink is doing: https://github.com/apache/flink-kubernetes-operator (I would suggest following this pattern: apache/superset-kubernetes-operator). Then we wouldn't have to burden the main repo's CI, and could let both repos evolve in their own directions as needed.
Edit: I added a note about this in the proposal.
Comment From: mistercrunch
Probably fine to use the https://github.com/apache-superset/ org for this; that way you get admin rights and we don't have to consider this tool/repo as an ASF-sanctioned thing that provides all the ASF-related-type constraints & guarantees.
Comment From: mistercrunch
In some ways this would also make it such that we don't really require a SIP or the SIP process.
Comment From: villebro
Some pros/cons that come to mind:
- Since this is directly tied to Apache Superset, it seems logical to have it reside under the ASF umbrella to give the strong quality guarantees that the ASF process provides. Some orgs may not be able to use the code/assets unless they're governed by the ASF.
- There's definitely extra overhead for setting this up under the ASF. However, since Flink has been able to get it working, I'm sure we can make it work, too.
- Druid has decided to go the way of a non-ASF repo (https://github.com/datainfrahq/druid-operator), so that's apparently ok, too.
I would personally vote to keep this under the ASF GitHub org, but I'm not super opinionated, so I can probably be convinced the other way, too.
Comment From: mistercrunch
Makes sense, though from my understanding the ASF and its participants can't really officially stamp things like a Docker image, since it includes all sorts of other binaries that we can't/shouldn't certify for legal reasons. The only binaries that are official are the tarballs. As long as it's a "recipe" and not a meal it's fine, meaning a Dockerfile is fair game, but the Docker image itself, with a bunch of other binaries in it, is something we can't officially certify or distribute. Guessing the k8s Operator would be mostly a recipe, which would be fine.
Comment From: Synarcs
@villebro, I believe this proposal is also moving away from Helm finalizers and adding custom finalizers and owner references for each deployable Superset manifest, for complete lifecycle management of the state as described in the CRD. In addition, what are the plans for gateway networking? I believe it would be agnostic to the underlying ingress routing policy in the cluster, but will the operator also support deploying Ingress resources for clusters running the older Ingress API, as well as resources for clusters running the Kubernetes Gateway API (Istio, Contour, Traefik, etc.)?
Comment From: villebro
I believe this proposal is also moving away from Helm finalizers and adding custom finalizers and owner references for each deployable Superset manifest, for complete lifecycle management of the state as described in the CRD.
@Synarcs I didn't mention it in the SIP, but you're right, the custom resource would be the owner of all spawned resources. In other words, when you delete the CR, it will let the k8s GC remove all child resources. I added this clarification to the SIP.
Custom finalizers could be introduced as needed. However, for the first version, I think we can get by without any custom finalizers, as the GC will handle the majority of cleanup (most admins will probably prefer to clean up their metastore etc manually).
In addition, what are the plans for gateway networking? I believe it would be agnostic to the underlying ingress routing policy in the cluster, but will the operator also support deploying Ingress resources for clusters running the older Ingress API, as well as resources for clusters running the Kubernetes Gateway API (Istio, Contour, Traefik, etc.)?
I see no reason why we can't support both Gateway and Ingress, the latter of which is currently supported by the Helm chart. For this we could just add an optional gateway property on the CRD, making it possible to choose that one instead of ingress where needed. To get this effort started, I think we can add it to the Helm chart first (disabled by default), and then just use a similar structure on the CRD as in the values.yaml. Over time we would then deprecate ingress, and only support gateway once feature parity is firmly established.
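For illustration, a hedged sketch of what optional ingress/gateway configuration could look like in the CRD's Go types; the field and type names are assumptions, with gatewayv1 referring to the Gateway API types in sigs.k8s.io/gateway-api:

```go
// networking_types.go — hypothetical sketch of mutually optional ingress/gateway
// configuration on the Superset CRD, mirroring the ingress block in values.yaml.
package v1alpha1

import (
	networkingv1 "k8s.io/api/networking/v1"
	gatewayv1 "sigs.k8s.io/gateway-api/apis/v1"
)

// NetworkingSpec lets admins choose either a classic Ingress or a Gateway API
// route; leaving both nil means the operator only creates the Service.
type NetworkingSpec struct {
	// Ingress configures a networking.k8s.io/v1 Ingress (current Helm behaviour).
	Ingress *IngressSpec `json:"ingress,omitempty"`

	// Gateway configures an HTTPRoute attached to an existing Gateway.
	Gateway *GatewaySpec `json:"gateway,omitempty"`
}

type IngressSpec struct {
	IngressClassName *string                   `json:"ingressClassName,omitempty"`
	Hosts            []string                  `json:"hosts,omitempty"`
	TLS              []networkingv1.IngressTLS `json:"tls,omitempty"`
}

type GatewaySpec struct {
	// ParentRef points at the Gateway the HTTPRoute should attach to.
	ParentRef gatewayv1.ParentReference `json:"parentRef"`
	Hostnames []gatewayv1.Hostname      `json:"hostnames,omitempty"`
}
```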
Comment From: villebro
Probably fine to use the https://github.com/apache-superset/ org for this; that way you get admin rights and we don't have to consider this tool/repo as an ASF-sanctioned thing that provides all the ASF-related-type constraints & guarantees.
@mistercrunch I've carefully considered the ASF vs non-ASF repo option, and here's some further thoughts:
- For scripts and other general recipes like a Helm chart, which are easy to audit, hosting on a non-ASF repo should be fine. However, a Golang based operator, with a complex build process that results in a Docker image, will not be easy to audit before use.
- While the ASF doesn't explicitly stamp Docker images or other convenience releases, the battle tested constraints & guarantees that are imposed by the ASF on their repos will provide an added layer of security for ensuring that the built and released artifacts are legit, and have met the strict requirements of the ASF.
- Kubernetes operators are deployed on highly critical infrastructure, and are able to both provision and destroy resources not only on the Kubernetes cluster, but even outside the cluster (e.g. load balancers via Ingress/Gateway resources). For this reason, many orgs will have security and compliance policies preventing them from deploying OSS operators that originate from non-ASF sanctioned repos.
- The ASF already has a number of actively maintained operator repos for top projects. This implies that this is a typical pattern for ASF projects, and will likely not receive pushback from the ASF.
For the reasons listed above I maintain my recommendation of hosting the operator on the ASF GitHub org. I added a summary of the above in the SIP.
Comment From: rusackas
@villebro I think we defer to you as to where this lives, and otherwise, this seems ready for a VOTE :)
Comment From: villebro
Thanks @rusackas! Let me ping the dev list once more to let people know the discussion has settled and to urge anyone who hasn't spoken up yet to do so now.
Edit: Sent an email to the Dev list: https://lists.apache.org/thread/ylycmvy3vbygnmk6xw5jqmndryvt02om
Comment From: mistercrunch
I'm cool with on or off apache or apache-superset, but feel strongly about keeping it out of the "monorepo". If we do start a new repo for that purpose though, do you think helm/ should also live there? In its own repo?
Comment From: villebro
I'm cool with on or off apache or apache-superset, but feel strongly about keeping it out of the "monorepo". If we do start a new repo for that purpose though, do you think helm/ should also live there? In its own repo?
@mistercrunch the proposal is to deprecate the Helm chart once the operator reaches maturity to avoid duplicated effort. At the end of the day they do the same thing - the operator is just more "Kubernetes native" than the Helm chart. The community would be free to fork the Helm chart and start maintaining it in a separate repo if they so wish.
Comment From: Synarcs
For ease of use, I feel each Superset architecture component in the operator should be managed by its own isolated controller in controller-runtime, with a single controller (superset-controller) managing all of these components and their respective controllers via owner references. This would allow isolation and let the controller for each component scale as the individual components evolve with Superset versions.
Comment From: villebro
For ease of use, I feel each Superset architecture component in the operator should be managed by its own isolated controller in controller-runtime, with a single controller (superset-controller) managing all of these components and their respective controllers via owner references. This would allow isolation and let the controller for each component scale as the individual components evolve with Superset versions.
@Synarcs are you proposing having separate CRDs and controllers for each component, like the web worker, async worker, scheduler etc? In theory this does sound appealing. However, I see the following problems:
- The workers currently map fairly cleanly onto vanilla Deployment resources. Having separate CRDs and controllers for each worker would only add an extra layer to the chain of resources (WebWorker -> Deployment -> ReplicaSet -> Pod) without adding that much value. It would essentially be a worker-specific Deployment wrapper.
- The workers and scheduler will need identical image tags, environment variables, mounts, service references etc. Having separate CRDs (and consequently CRs) for all of these will cause significant duplication and make the management of the cluster as a whole more complex.
- Separating the individual components would also make it more logical to split out Ingress/Gateway and Services. However, the exact structure of these, especially the main service, tends to be uninteresting to the typical admin. Breaking these out would make it more difficult to get a Superset deployment up and running.
If any of the above changes in the future, I see no reason why we can't break up the operator into separate CRDs and controllers. However, given the current architecture, I feel having a single CRD is the most convenient means of abstracting Superset as a whole.
Comment From: Synarcs
@villebro this makes sense, keeping the operator as a single monolith, with a single controller handling reconciliation for all the various combinations end users may select to deploy Superset in their environment. Some thoughts that may be useful in the future:
- We need to consider that, with a single CRD, all currently offered options would be converted into spec sections, specifically breaking the resource spec into chunks to ensure each resource is configurable (following the same hierarchy we keep in the values for the Superset Helm chart). For each section, any child resources it creates would be added using proper owner references and finalizers, for better garbage collection by the Superset operator.
- Each of these spec subsections, representing a Superset architecture component, can have its own status, as is standard for operators, to ensure the Superset operator properly reconciles and deploys each architecture component of Superset.
Finally, if any subsection of the resource spec needs a lot of custom configuration for deployment, I feel it would be better to break it out into an isolated CRD with its own controller, managed by the main Superset controller, so that the parent CRD does not become too large and difficult to manage, and so the isolated component CRD can evolve with its own custom configuration.
Comment From: villebro
@Synarcs you raised a good point about having separate statuses for the different components. I agree that it will be clearer if we can split out the individual components into separate CRs, even if we have a single monolith at the top. So let's say we would have the following hierarchy (each node representing a separate resource type):
Superset
  SupersetWebWorker
    Deployment
      ReplicaSet
        Pod
  SupersetAsyncWorker
    Deployment
      ReplicaSet
        Pod
  SupersetAsyncScheduler
    Deployment
      ReplicaSet
        Pod
  Ingress/Gateway
  Service
  ConfigMap
  ...
This would make it possible to do kubectl describe superset my-superset to see the status at a high level, but then drill down into the individual workers by doing kubectl describe supersetwebworker my-superset etc., where a more detailed status could be seen. This aligns with how Deployment branches out into a single ReplicaSet, which initially may seem like unnecessary overhead, but makes sense once you consider what they represent.
Therefore, we'd still have a single monolithic CRD for Superset, but we'd introduce separate CRDs and controllers for the various worker types and reconcile them in separate reconciliation loops. So the Superset controller/reconciler would create SupersetWebWorker, and that one would then in turn create the actual Deployment etc. While most admins would just use the Superset CRD, this way you'd also have the option of spawning a SupersetWebWorker without a top-level Superset CR if you really want to.
If you agree with this approach I'll update the proposal to reflect this structure.
Comment From: Synarcs
@villebro this looks clean and can evolve as the different core Superset components evolve. For now, I believe that, in the same way Kubebuilder or Operator SDK provides a way to package the operator into a Helm chart, the main parent controller of the Superset operator can be packaged into Helm, providing initial interfaces/templates for users to deploy Superset with the operator in line with how it was done with Helm before. This would make for an easy transition for Helm users exploring the operator to deploy resources, and in the future the Helm chart could be completely decommissioned if required, as the operator ensures proper deployments of each Superset release.
Comment From: villebro
@Synarcs I agree, a Helm chart could indeed be included that makes it easier to deploy the operator. However, I think Operator SDK mainly provides manifests to deploy the operator (I haven't checked if it can also generate a Helm chart). But at any rate, we'll definitely aim to make the deployment process as easy as possible, both by adding good documentation, and providing good deployment templates (be it a Helm chart or manifests).
Comment From: villebro
Vote sent to mailing list: https://lists.apache.org/thread/j32styh48lwrjzkpvqk4mnvrmvklvz3d
Comment From: abhioncbr
This sounds interesting to me. I want to contribute to building the operator.
Comment From: villebro
The vote has PASSED with 6 binding +1, 2 non-binding +1, and 0 -1 votes.
Result thread: https://lists.apache.org/thread/cdol8xrokk04klvg4tpxmk3y245zwx80
Comment From: abhioncbr
Since the vote has PASSED, what would be the next steps?
Comment From: villebro
Since the vote has PASSED, what would be the next steps?
I will start preparations for establishing the initial codebase for the operator. I'll drop a note on the dev list when we have the next steps ready.
Comment From: michael-s-molina
Closing the SIP as it was approved.
Comment From: Damanotra
Hi team, does anyone know how to contribute to the Superset Operator?
Comment From: villebro
Hi @Damanotra , work to implement the operator is expected to start shortly. I will update here when the work starts in earnest.