GitLab is a robust version control system (VCS) with an array of continuous integration (CI) features. On one project, however, it fell short: developers were deferring their merge requests because pipeline wait times had grown so long that the 'continuous' part of the workflow was breaking down.
The aim was to run multiple pipelines concurrently, each on a temporary instance hosted on AWS.
Integration of GitLab and Jenkins
Although GitLab provides directions for this kind of setup, our team hit substantial challenges during implementation. For instance, GitLab's suggested approach relies on a fork of the now-deprecated Docker Machine, and our requirements were also more specific.
Jenkins, a long-standing CI/CD orchestrator, excels at managing tasks across many nodes, whether in a dynamically provisioned 'cloud' or on static nodes. Jenkins takes more configuration than GitLab CI and its interface is somewhat dated, but it proved better suited to our goals.
My final arrangement encompasses the following attributes:
- Ability to execute an essentially limitless number of CI pipelines concurrently
- Direct triggering of pipelines by merge requests (MRs) in GitLab
- Inclusion of pipeline status within the MR, mirroring GitLab's pipeline behavior
- Dedicated node allocation for each pipeline's execution
- Dynamic creation of new nodes by Jenkins when pipeline jobs are in queue, and subsequent removal of nodes when no longer necessary
- Use of 'spot instances' for nodes, which tap spare cloud capacity and are therefore ephemeral but more cost-effective than regular instances
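Put together, a declarative Jenkinsfile matching this arrangement might look like the following sketch (the `spot-agent` label and the build/test scripts are illustrative, not our actual ones):

```groovy
// Illustrative Jenkinsfile: each run claims a dedicated node by label.
// 'spot-agent' is assumed to be the label the spot-instance cloud advertises.
pipeline {
    agent { label 'spot-agent' }   // whole pipeline runs on one dedicated node
    options {
        timeout(time: 45, unit: 'MINUTES')  // don't hold a spot node forever
    }
    stages {
        stage('Build') {
            steps {
                sh './scripts/build.sh'   // hypothetical build script
            }
        }
        stage('Test') {
            steps {
                sh './scripts/test.sh'    // hypothetical test script
            }
        }
    }
}
```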
Resource Orchestration by Jenkins
I was pleasantly surprised by Jenkins' capacity to manage the orchestration of runners. This resulted in the formulation of policies aimed at averting resource depletion while maintaining relatively 'fresh' instances:
- Each node runs five builds before termination, mitigating disk space exhaustion
- A single node persistently runs, prepared to accept a build
- A maximum of ten nodes are active simultaneously
- A node is terminated if it remains idle for over an hour
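With the EC2 Fleet plugin, these policies correspond to a handful of cloud settings. Expressed as a Jenkins Configuration-as-Code fragment they look roughly like this; treat the key names and values as a sketch to check against the plugin's own schema, not a drop-in config:

```yaml
jenkins:
  clouds:
    - eC2Fleet:
        name: "spot-agents"
        fleet: "jenkins-agents-asg"   # the auto-scaling group Jenkins controls
        labelString: "spot-agent"
        minSize: 1         # keep one warm node ready to accept a build
        maxSize: 10        # cap on simultaneously active nodes
        idleMinutes: 60    # terminate a node idle for over an hour
        maxTotalUses: 5    # retire a node after five builds (disk hygiene)
        numExecutors: 1    # one pipeline per node
```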
Consequently, during periods of heightened development activity, 'warm' nodes are promptly available to run CI pipelines, reducing the latency between the pending and running states compared with strategies that launch nodes only once a pipeline starts.
Moreover, the agents (referred to as 'runners') launch from a 'golden AMI' with the required Docker images already pulled. This speeds up pipeline scripts by having the images needed for integration testing (e.g., Node.js, Elasticsearch, MySQL 8) ready on disk.
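The image-warming step amounts to a series of `docker pull`s run while baking the AMI, for instance as a Packer or user-data provisioning script. A minimal sketch, with illustrative image tags:

```shell
#!/usr/bin/env bash
# Pre-pull the Docker images integration tests depend on, so the
# golden AMI boots with them already in the local image cache.
set -euo pipefail

images=(
  node:18              # illustrative tags; pin whatever the test suite uses
  elasticsearch:8.8.0
  mysql:8
)

for image in "${images[@]}"; do
  docker pull "$image"
done
```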
To realize this Jenkins setup on AWS, the following system was devised:
- A Jenkins controller node, which is a single point of failure and must stay operational. It runs in an ECS Fargate container, so it is replaced automatically on failure, and is backed by an EFS file system for data persistence.
- For the agents, a 'golden AMI' furnished with the required resources, plus a launch template connected to an auto-scaling group. This allows suitable instance sizes to be chosen from a pool of available spot instances, with Jenkins scaling the auto-scaling group to match its resource requirements.
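On the AWS side, the wiring reduces to a launch template referencing the golden AMI and an auto-scaling group spanning several spot instance types. Sketched with the AWS CLI, where every name, AMI ID, subnet, and instance type is a placeholder:

```shell
# Launch template pointing at the golden AMI (placeholder IDs throughout).
aws ec2 create-launch-template \
  --launch-template-name jenkins-agent \
  --launch-template-data '{"ImageId":"ami-0123456789abcdef0","InstanceType":"m5.xlarge"}'

# Auto-scaling group that Jenkins will scale up and down; the mixed-instances
# policy lets AWS pick from a pool of spot instance types.
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name jenkins-agents-asg \
  --min-size 0 --max-size 10 \
  --vpc-zone-identifier "subnet-aaaa1111,subnet-bbbb2222" \
  --mixed-instances-policy '{
    "LaunchTemplate": {
      "LaunchTemplateSpecification": {"LaunchTemplateName": "jenkins-agent"},
      "Overrides": [{"InstanceType": "m5.xlarge"}, {"InstanceType": "m5a.xlarge"}]
    },
    "InstancesDistribution": {
      "SpotAllocationStrategy": "capacity-optimized",
      "OnDemandPercentageAboveBaseCapacity": 0
    }
  }'
```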
Integrating Jenkins with GitLab
The integration of our Jenkins pipelines with GitLab was a straightforward process, facilitated by readily available Jenkins plugins for GitLab and EC2 fleet control.
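The GitLab plugin supplies both the MR trigger and the commit-status reporting. A hedged sketch of the relevant Jenkinsfile pieces, where the connection name 'my-gitlab' stands in for whatever was configured under the Jenkins system settings:

```groovy
pipeline {
    agent { label 'spot-agent' }        // illustrative node label
    options {
        gitLabConnection('my-gitlab')   // connection configured in Jenkins settings
    }
    triggers {
        // Fire when the GitLab webhook reports merge-request events
        gitlab(triggerOnMergeRequest: true, triggerOnPush: false)
    }
    stages {
        stage('Test') {
            steps {
                // Wraps the work so the MR shows a running/success/failed status
                gitlabCommitStatus(name: 'jenkins') {
                    sh './scripts/test.sh'   // hypothetical test script
                }
            }
        }
    }
    post {
        failure {
            updateGitlabCommitStatus(name: 'jenkins', state: 'failed')
        }
    }
}
```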
Efforts at Optimization
Jenkins provides a wealth of control and orchestration capabilities, including the ability to:
- Execute pipeline stages on different nodes
- Run pipeline stages in parallel
- Run parallel pipeline stages on distinct nodes
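For example, parallel stages can each claim their own node by declaring a per-stage agent:

```groovy
pipeline {
    agent none   // no global node; each parallel branch claims its own
    stages {
        stage('Tests') {
            parallel {
                stage('Unit') {
                    agent { label 'spot-agent' }   // illustrative label
                    steps { sh './scripts/unit-tests.sh' }        // illustrative script
                }
                stage('Integration') {
                    agent { label 'spot-agent' }
                    steps { sh './scripts/integration-tests.sh' } // illustrative script
                }
            }
        }
    }
}
```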
I experimented with partitioning the tests and running the partitions in parallel on separate nodes. Although this worked, it did not yield faster pipeline runs, for several reasons:
- Application/VCS files must be copied to the other nodes and results returned to the parent stage, which involves slow tar and gzip steps before the parallel stages can start
- Escalation of complexity in the declarative pipeline script
- Occurrence of CPU and memory depletion when running multiple stages on a single machine in parallel
- Inconsistent execution times when shifting stages between nodes during simultaneous pipeline runs
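The copy overhead comes from Jenkins' stash/unstash mechanism, which tars and gzips files to move them between nodes. The partitioned pattern looks roughly like this (the label and the `--partition` flag on the test script are illustrative):

```groovy
pipeline {
    agent none
    stages {
        stage('Checkout') {
            agent { label 'spot-agent' }
            steps {
                checkout scm
                // tar+gzip the workspace so other nodes can fetch it;
                // this is the slow step that ate the parallel speed-up
                stash name: 'source', includes: '**'
            }
        }
        stage('Partitioned tests') {
            parallel {
                stage('Partition 1') {
                    agent { label 'spot-agent' }
                    steps {
                        unstash 'source'                     // copy workspace onto this node
                        sh './scripts/test.sh --partition 1' // hypothetical partition flag
                    }
                }
                stage('Partition 2') {
                    agent { label 'spot-agent' }
                    steps {
                        unstash 'source'
                        sh './scripts/test.sh --partition 2'
                    }
                }
            }
        }
    }
}
```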
This approach may become viable if our pipeline runs grow beyond their current duration (typically around 20 minutes), at which point the stash/unstash overhead becomes less significant relative to each partitioned run's duration.