PRD: Refactor Jenkins Pipeline Library — Shared Abstractions and Pipeline Consolidation #585

@floatingman

Description

Problem Statement

The rancher/tests repository contains 37 Jenkinsfiles totaling ~5,700 lines of pipeline code with zero local shared abstractions. The same boilerplate — checkout logic, credential loading, Docker cleanup, tofu lifecycle calls, S3 artifact management — is copy-pasted across files with minor variations. This creates three compounding problems:

  • Maintenance burden: A change to repo URLs, credential handling, or the tofu module path requires editing every affected file independently. During the airgap pipeline development, the same checkout block was manually duplicated across 3 files.
  • Onboarding difficulty: New team members face 37 differently named Jenkinsfiles with inconsistent patterns. Some use the qa-jenkins-library shared library; most use raw scripted pipelines. Parameter names for the same concept vary between files (DESTROY_ON_FAILURE vs DESTROY_AFTER_TESTS).
  • Reliability gaps: Error handling is inconsistent — some pipelines silently swallow errors in try/catch blocks, teardown logic is duplicated with subtle differences, and the same Docker cleanup sequence is written slightly differently in each file.

The airgap pipelines (developed most recently) are the best candidates for initial refactoring because they are the most modern and already use qa-jenkins-library, but the same patterns apply to the broader pipeline ecosystem.

Solution

Create a two-tier shared library architecture:

  1. qa-jenkins-library (external, PR-based contribution): Infrastructure primitives — tofu lifecycle operations, Ansible runners, S3 artifact helpers. These are generic and reusable across any pipeline that manages infrastructure.

  2. Local vars/ directory (in this repo): Orchestration functions — checkout patterns, credential loading, Docker cleanup, parameter resolution, and pipeline templates. These are domain-specific to the rancher-tests CI workflow.

Then consolidate pipelines by similarity group:

  • Merge airgap setup + destroy into a single parameterized infrastructure pipeline
  • Keep airgap go-tests as a separate pipeline that shares the infra abstractions
  • Collapse 4 nearly-identical simple test runners into a single parameterized pipeline
  • Delete the deprecated tfp/Jenkinsfile.airgap.tests
  • Convert all refactored pipelines to Declarative Pipeline syntax

The refactored system should enable creating a new pipeline in under 30 minutes instead of days.

User Stories

Airgap Infrastructure Pipeline

  1. As a QA engineer, I want setup and destroy to be a single pipeline with an ACTION parameter, so that I don't need to maintain two nearly-identical files
  2. As a QA engineer, I want the infra pipeline to support ACTION=setup, ACTION=destroy, and ACTION=setup-and-test, so that I can choose the right granularity for my use case
  3. As a QA engineer, I want the DEPLOY_RANCHER parameter to remain optional on setup, so that I can choose to deploy only the Kubernetes infrastructure without Rancher
  4. As a QA engineer, I want terraform.tfvars to be uploaded to S3 during setup and downloaded during destroy via a shared function, so that the destroy pipeline can tear down exactly what was created
  5. As a QA engineer, I want the tofu lifecycle (init, workspace create/select, apply, destroy, delete workspace) to be orchestrated by a shared function, so that I don't need to remember the correct sequence in every pipeline
  6. As a QA engineer, I want Ansible variable configuration to use a shared function that handles SSH key paths, inventory rendering, and variable substitution, so that the same pattern is used consistently
  7. As a QA engineer, I want the checkout block (clone tests + qa-infra-automation) to be a single shared function call, so that repo URLs and branch handling are defined in one place
  8. As a QA engineer, I want infrastructure details (bastion DNS, load balancer hostnames) to be output consistently after setup, so that I can access the deployed environment easily
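The tofu lifecycle in story 5 above could be wrapped roughly as follows. This is a sketch only — the function and parameter names (`provision`, `teardown`, `modulePath`) are assumptions, not the final qa-jenkins-library interface:

```groovy
// Illustrative shared tofu lifecycle wrapper. Encapsulates the sequence
// init -> workspace create/select -> apply (and destroy -> delete workspace)
// so no pipeline has to remember the ordering itself.
def provision(String workspaceName, String modulePath, String tfvarsFile) {
    dir(modulePath) {
        sh "tofu init"
        // Select the workspace if it exists, otherwise create it
        sh "tofu workspace select ${workspaceName} || tofu workspace new ${workspaceName}"
        sh "tofu apply -auto-approve -var-file=${tfvarsFile}"
    }
}

def teardown(String workspaceName, String modulePath, String tfvarsFile) {
    dir(modulePath) {
        sh "tofu init"
        sh "tofu workspace select ${workspaceName}"
        sh "tofu destroy -auto-approve -var-file=${tfvarsFile}"
        // A workspace cannot be deleted while selected, so switch first
        sh "tofu workspace select default"
        sh "tofu workspace delete ${workspaceName}"
    }
}
```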

Airgap Test Pipeline

  1. As a QA engineer, I want the go-tests pipeline to share the same checkout and infra setup functions as the infra pipeline, so that changes propagate to both
  2. As a QA engineer, I want gotestsum invocation to be standardized with consistent flags, output format, and artifact archiving, so that test results are always captured the same way
  3. As a QA engineer, I want Qase reporting to be an optional stage that uses the same pattern across all test pipelines, so that I can enable/disable it without code changes
  4. As a QA engineer, I want the teardown logic (destroy on failure, destroy after tests) to use the same shared function as the destroy pipeline, so that cleanup is always consistent
  5. As a QA engineer, I want cattle-config generation to be a shared function, so that the token extraction and yq patching pattern is not duplicated

Simple Test Runners

  1. As a QA engineer, I want the 4 nearly-identical test runners (validation, e2e, harvester, vsphere) to be a single parameterized pipeline, so that bug fixes apply everywhere
  2. As a QA engineer, I want to select the target environment via a NODE_LABEL parameter, so that I can run tests against any environment without a dedicated Jenkinsfile
  3. As a QA engineer, I want credential sets to be loaded based on the target environment, so that the right cloud provider credentials are always available

Shared Library Foundation

  1. As a pipeline developer, I want a resolvePipelineParams() function that parses the job name and resolves branch, repo, and timeout with standard defaults, so that I don't need to write the same 10-line block in every file
  2. As a pipeline developer, I want a standardDockerCleanup() function that handles container stop, image removal, and volume cleanup, so that the 15-line cleanup sequence is defined once
  3. As a pipeline developer, I want a standardCheckout() function in qa-jenkins-library that handles dual-repo cloning with parameterized branches, so that new pipelines get the correct checkout behavior automatically
  4. As a pipeline developer, I want S3 artifact upload/download to be shared functions with a consistent path pattern, so that the S3 URL construction is not duplicated across files
  5. As a pipeline developer, I want all refactored pipelines to use Declarative Pipeline syntax, so that parameter definitions, post conditions, and stage structure are more readable and get better Jenkins UI integration

Migration and Coexistence

  1. As a QA engineer, I want old and new pipelines to coexist during migration, so that production Jenkins jobs are not disrupted
  2. As a QA engineer, I want new Jenkinsfiles to use simplified naming (e.g., airgap-rke2-infra, airgap-rke2-tests), so that the pipeline purpose is immediately clear from the filename
  3. As a pipeline developer, I want parameters to be harmonized across the refactored pipelines (e.g., consistent DESTROY_ON_FAILURE semantics), so that the same parameter name always means the same thing

Developer Experience

  1. As a new team member, I want a README in the pipeline directory explaining the shared library structure and how to create a new pipeline, so that I can onboard quickly
  2. As a new team member, I want each shared function to have clear parameter documentation, so that I know what to pass without reading the implementation
  3. As a pipeline developer, I want to create a new airgap variant (e.g., airgap K3s) in under 30 minutes by using the shared abstractions, so that we can quickly add coverage for new scenarios

Implementation Decisions

Architecture: Two-Tier Library

  • qa-jenkins-library (external repo, PR-based contribution): Receives new infrastructure primitives — airgap.standardCheckout, airgap.teardownInfrastructure, s3.uploadArtifact, s3.downloadArtifact. These are generic operations that any pipeline consuming qa-infra-automation could use.
  • Local vars/ directory (in rancher-tests): Contains rancher-tests-specific orchestration — airgapInfraPipeline, airgapTestPipeline, simpleTestPipeline, standardDockerCleanup, resolvePipelineParams. These compose the qa-jenkins-library primitives into pipeline patterns specific to this repository's workflow.
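A consuming Jenkinsfile would then be little more than a library import plus one orchestration call. The library alias `local-lib` and the parameter names below are illustrative assumptions:

```groovy
// Sketch of a refactored Jenkinsfile consuming both tiers: the external
// qa-jenkins-library primitives and this repo's local vars/ directory
// (loaded here under the hypothetical alias 'local-lib').
@Library(['qa-jenkins-library', 'local-lib']) _

// The local orchestration function composes the external primitives
// (airgap.*, s3.*) into the repo-specific pipeline pattern.
airgapInfraPipeline(
    clusterType: 'rke2',   // hypothetical parameters
    defaultTimeout: 120
)
```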

Pipeline Consolidation

  • Airgap infrastructure: Jenkinsfile.setup.airgap.rke2 + Jenkinsfile.destroy.airgap.rke2 merge into a single Jenkinsfile.airgap-rke2-infra with ACTION parameter (setup/destroy). The destroy action downloads tfvars from S3, runs teardown, and cleans up.
  • Airgap tests: Jenkinsfile.airgap.go-tests becomes Jenkinsfile.airgap-rke2-tests, consuming shared infra functions for setup and teardown while retaining its own test execution stages.
  • Simple test runners: validation/Jenkinsfile, validation/Jenkinsfile.e2e, validation/Jenkinsfile.harvester, validation/Jenkinsfile.vsphere collapse into a single Jenkinsfile.validation with NODE_LABEL parameter.
  • Deprecation: tfp/Jenkinsfile.airgap.tests is deleted (uses completely different patterns and is no longer actively maintained).
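The merged infra pipeline might dispatch on ACTION with Declarative `when` conditions, along these lines (stage bodies and the `airgapInfraPipeline.setup`/`destroy` calls are placeholders, not the final implementation):

```groovy
// Hypothetical shape of Jenkinsfile.airgap-rke2-infra: one pipeline,
// ACTION parameter selects the lifecycle operation.
pipeline {
    agent { label 'airgap' }
    parameters {
        choice(name: 'ACTION', choices: ['setup', 'destroy', 'setup-and-test'],
               description: 'Lifecycle action to perform')
        booleanParam(name: 'DEPLOY_RANCHER', defaultValue: true,
                     description: 'Deploy Rancher on top of the Kubernetes infra')
    }
    stages {
        stage('Setup') {
            when { expression { params.ACTION in ['setup', 'setup-and-test'] } }
            steps { script { airgapInfraPipeline.setup(params) } }
        }
        stage('Destroy') {
            when { expression { params.ACTION == 'destroy' } }
            // Downloads tfvars from S3, runs teardown, cleans up
            steps { script { airgapInfraPipeline.destroy(params) } }
        }
    }
}
```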

Syntax Migration

  • All refactored pipelines convert from Scripted Pipeline (node { ... }) to Declarative Pipeline (pipeline { ... }).
  • Complex teardown logic (the go-tests destroy-on-failure pattern) uses post { failure { ... } } and post { always { ... } } blocks in Declarative syntax.
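In Declarative syntax, that teardown pattern could map onto post conditions roughly as below; `runGoTests` and `destroyInfrastructure` stand in for the shared functions and are naming assumptions:

```groovy
// Sketch: destroy-on-failure and destroy-after-tests expressed as
// Declarative post conditions instead of scripted try/catch/finally.
pipeline {
    agent any
    stages {
        stage('Run Tests') {
            steps { script { runGoTests() } }
        }
    }
    post {
        failure {
            // Only tear down on failure if the operator asked for it
            script { if (params.DESTROY_ON_FAILURE) { destroyInfrastructure() } }
        }
        success {
            script { if (params.DESTROY_AFTER_TESTS) { destroyInfrastructure() } }
        }
        always {
            // Docker cleanup runs regardless of build result
            script { standardDockerCleanup() }
        }
    }
}
```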

Parameter Harmonization

  • DESTROY_ON_FAILURE is the unified parameter name (used in both infra and test pipelines).
  • DESTROY_AFTER_TESTS remains specific to the go-tests pipeline.
  • All pipelines use the same parameter defaults for QA_JENKINS_LIBRARY_BRANCH, TESTS_BRANCH, QA_INFRA_BRANCH.
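The harmonized set could be expressed as one shared `parameters` block; the default values shown here are placeholders, not the agreed defaults:

```groovy
// Illustrative shared parameter block used by every refactored pipeline,
// so the same name always carries the same semantics.
parameters {
    string(name: 'QA_JENKINS_LIBRARY_BRANCH', defaultValue: 'main')
    string(name: 'TESTS_BRANCH', defaultValue: 'main')
    string(name: 'QA_INFRA_BRANCH', defaultValue: 'main')
    booleanParam(name: 'DESTROY_ON_FAILURE', defaultValue: true,
                 description: 'Tear down infrastructure if the run fails')
}
```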

Dockerfile Strategy

  • Dockerfile.infra and Dockerfile.airgap-go-tests remain separate — consolidating them would increase infra pipeline build time due to the Go toolchain.
  • Docker image building is extracted into shared functions so the build/tag pattern is consistent.

S3 Artifact Pattern

  • s3.uploadArtifact(workspaceName, localPath, s3Key) and s3.downloadArtifact(workspaceName, s3Key, localPath) wrap the current aws s3 cp Docker pattern into reusable functions.
  • The path pattern env:/${workspaceName}/terraform.tfvars is encapsulated within these functions.
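A minimal sketch of the wrappers, assuming the bucket name lives in an environment variable (`QA_ARTIFACT_BUCKET` is a hypothetical name) and the current dockerized `aws s3 cp` pattern is retained:

```groovy
// Possible qa-jenkins-library implementation of the S3 helpers.
// The URL construction (including the env:/ prefix) lives only here.
def uploadArtifact(String workspaceName, String localPath, String s3Key) {
    def dest = "s3://${env.QA_ARTIFACT_BUCKET}/env:/${workspaceName}/${s3Key}"
    // \$(pwd) is escaped so the shell, not Groovy, expands it
    sh "docker run --rm -v \$(pwd):/work -w /work amazon/aws-cli s3 cp ${localPath} ${dest}"
}

def downloadArtifact(String workspaceName, String s3Key, String localPath) {
    def src = "s3://${env.QA_ARTIFACT_BUCKET}/env:/${workspaceName}/${s3Key}"
    sh "docker run --rm -v \$(pwd):/work -w /work amazon/aws-cli s3 cp ${src} ${localPath}"
}
```

Credential injection (e.g. mounting AWS credentials into the container) is omitted here; the real implementation would carry that over from the existing pipelines.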

Shared Function Interfaces

  • resolvePipelineParams(): Parses job name, resolves BRANCH/REPO/TIMEOUT with standard defaults. Returns a map of resolved values.
  • standardDockerCleanup(containerNames, imageNames, volumeNames): Runs docker stop/rm/rmi/volume rm for the specified resources.
  • airgapInfraPipeline: Shared parameters, credential list, path constants, checkout stages, tofu lifecycle, and S3 artifact management for airgap infrastructure operations.
  • airgapTestPipeline: Extends infra pipeline with Go test parameters, gotestsum invocation pattern, and Qase reporting.
  • simpleTestPipeline: Shared parameters and stage flow for the simple test runner group (checkout, configure, build, test, report).
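The two leaf helpers might look like this; signatures follow the interfaces listed above, while bodies and defaults are illustrative assumptions:

```groovy
// vars/standardDockerCleanup.groovy — stop/remove the named resources,
// tolerating absent resources so cleanup never fails the build.
def call(List containerNames = [], List imageNames = [], List volumeNames = []) {
    containerNames.each { name ->
        sh "docker stop ${name} || true"
        sh "docker rm ${name} || true"
    }
    imageNames.each { img -> sh "docker rmi ${img} || true" }
    volumeNames.each { vol -> sh "docker volume rm ${vol} || true" }
}
```

```groovy
// vars/resolvePipelineParams.groovy — parse the job name and return a map
// of resolved values with standard defaults (the defaults are placeholders).
def call() {
    def jobParts = env.JOB_NAME.tokenize('/')
    return [
        branch : params.TESTS_BRANCH ?: 'main',
        repo   : params.REPO ?: 'https://github.com/rancher/tests',
        timeout: params.TIMEOUT ?: '6h',
        jobBase: jobParts.last()
    ]
}
```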

Testing Decisions

Testing Strategy: Live Pipeline Testing

Given that Jenkins pipelines are notoriously difficult to unit test, the primary validation method is live pipeline execution against development/staging infrastructure:

  • Each refactored pipeline is created alongside the original (parallel coexistence) and triggered manually to verify functional equivalence.
  • The old pipelines remain active until the new ones have been verified across at least 2 successful runs.

What Makes a Good Test

  • A test verifies that the refactored pipeline produces the same infrastructure output as the original — same resources provisioned, same configuration applied, same artifacts archived.
  • Test runs should cover the full lifecycle: setup, optional Rancher deploy, test execution, teardown.
  • Error scenarios should be tested: failed tofu apply triggers cleanup, failed tests trigger optional destroy.

Modules to Test via Live Execution

  • Jenkinsfile.airgap-rke2-infra with ACTION=setup — verify infra creation matches original setup pipeline
  • Jenkinsfile.airgap-rke2-infra with ACTION=destroy — verify teardown matches original destroy pipeline
  • Jenkinsfile.airgap-rke2-tests — verify full test lifecycle matches original go-tests pipeline
  • Jenkinsfile.validation — verify each node label variant produces same results as the 4 original files

Modules NOT Tested via Live Execution

  • resolvePipelineParams — pure parameter parsing, verified by inspection
  • standardDockerCleanup — straightforward Docker commands, verified by inspection
  • Shared library additions to qa-jenkins-library — tested by the pipelines that consume them

Migration Verification Checklist

  • New infra pipeline ACTION=setup produces identical AWS resources
  • New infra pipeline ACTION=destroy cleanly removes all resources
  • New test pipeline runs same test packages and archives same artifacts
  • New validation pipeline runs tests on each node label correctly
  • Old pipelines still work during coexistence period
  • S3 tfvars upload/download works across setup→destroy cycle

Out of Scope

  • Phase 2+ pipeline groups: Elemental pipelines (Group D), upgrade pipelines (Group C), TFP pipelines (Group E), Neuvector consolidation (Group G), and Harvester bare-metal pipeline are not included in Phase 1. The shared library foundation will support their future migration.
  • qa-jenkins-library structural changes: We are adding new functions, not refactoring existing ones. The existing tofu.*, property.*, config.* interfaces remain unchanged.
  • Jenkinsfile.rc: The RC pipeline is complex enough (378 lines, corral packages, multi-SCM) to warrant its own dedicated refactor.
  • Shell scripts in validation/pipeline/scripts/: The existing shell scripts continue to work as-is. They may be refactored in a future phase to call the shared library functions.
  • CI/CD job configuration in Jenkins: This PRD covers Jenkinsfile content only, not Jenkins job configuration, credentials setup, or plugin management.
  • Non-pipeline files: Go test code, Dockerfiles (except potentially minor adjustments), and other repository content are not in scope.

Further Notes

Implementation Order

  1. Create local vars/ directory with shared functions (resolvePipelineParams, standardDockerCleanup)
  2. Submit PR to qa-jenkins-library adding airgap.standardCheckout, airgap.teardownInfrastructure, s3.uploadArtifact, s3.downloadArtifact
  3. Build airgapInfraPipeline and airgapTestPipeline in local vars/
  4. Create Jenkinsfile.airgap-rke2-infra and Jenkinsfile.airgap-rke2-tests
  5. Build simpleTestPipeline in local vars/
  6. Create Jenkinsfile.validation
  7. Verify all new pipelines via live execution
  8. Delete tfp/Jenkinsfile.airgap.tests
  9. Add README documentation for the shared library structure

Parallel Coexistence Plan

Old and new Jenkinsfiles coexist during migration. New files use the simplified naming convention. Once the new pipelines are verified:

  1. Update Jenkins job configurations to point to new Jenkinsfiles
  2. Monitor for 1-2 weeks
  3. Archive old Jenkinsfiles (move to deprecated/ directory or delete)

Risks

  • qa-jenkins-library PR timeline: Since it requires external PR review, the airgap pipeline refactor may be blocked if the library PR is delayed. Mitigation: the local vars/ functions can temporarily include the logic that would eventually move to qa-jenkins-library.
  • Declarative pipeline limitations: Some complex scripted patterns (like the destroyInfrastructure closure in go-tests) may require script { } blocks within Declarative syntax. This is acceptable and well-documented.
  • Build time regression: Consolidating Dockerfiles was rejected specifically to avoid increasing the infra pipeline's build time. This should be monitored.

Metadata

Labels

enhancement (New feature or request), team/pit-crew (slack notifier for pit crew)
