Gene Solutions SLURM HPC Deployment

Gene Solutions

By harnessing the power of technology and research, Gene Solutions develops cutting-edge genetic tests that improve screening, diagnosis and treatment at affordable costs for Vietnamese patients.

Brief results of the collaboration:

  • After analyzing Gene Solutions’ (GSL) specific business needs around Artificial Intelligence (AI) research and execution, HADII consultants developed Ansible Playbooks that automate the deployment of a fully configured SLURM cluster on the customer’s on-premises High Performance Computing (HPC) servers.

  • Additional consulting covered the analysis and selection of cost-efficient, performant flash-based Network Attached Storage (NAS) solutions for the cluster, which processes terabytes of data per day.

SLURM cluster architecture

The Need

Without a centralized system to run machine learning jobs, researchers had to find idle servers on which to run experimental jobs and pipelines. As the data science team grew and its practices matured, a number of challenges emerged:

  1. Scalability - Long-running machine learning jobs would typically be executed on a single server. A more efficient method was needed for splitting larger jobs into multiple smaller ones that could be distributed across the remaining servers to reduce job execution times.

  2. Resource Management - Figuring out which servers were idle and had enough available memory and CPUs to run jobs was tedious and time-consuming. Additionally, without intelligent distribution of jobs, many resources, including GPUs, CPUs and memory, were wasted.

  3. Standardization - Dependencies and experimental procedures were stored in different locations on each server, making traceability, documentation, and the reproduction of experimental results a major challenge.
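
To illustrate the scalability point, the sketch below shows one common way to split a large job into smaller pieces that a scheduler can spread across servers. It assumes the work is submitted as a SLURM job array with indices 0 to N-1, so that each task can read its position from the SLURM_ARRAY_TASK_ID and SLURM_ARRAY_TASK_COUNT environment variables. The sample names and the shard helper are hypothetical and are not taken from the GSL pipelines.

```python
import os


def shard(items, task_id, task_count):
    """Return the subset of `items` that one array task should process.

    Items are dealt out round-robin, so every item is handled by exactly
    one task and the shards differ in size by at most one item.
    """
    if task_count < 1 or not 0 <= task_id < task_count:
        raise ValueError("task_id must satisfy 0 <= task_id < task_count")
    return items[task_id::task_count]


def main():
    # Hypothetical work list; a real pipeline would list its input files here.
    samples = [f"sample_{i:03d}" for i in range(10)]

    # Outside of a SLURM job array these variables are unset, so the script
    # falls back to a single task that processes everything.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    task_count = int(os.environ.get("SLURM_ARRAY_TASK_COUNT", "1"))

    for sample in shard(samples, task_id, task_count):
        print(f"task {task_id} processing {sample}")


if __name__ == "__main__":
    main()
```

With four array tasks, task 0 would handle samples 0, 4 and 8, while task 3 would handle samples 3 and 7, so the pieces can run in parallel on whichever servers the scheduler assigns.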

The Challenge

Due to cost constraints, production machine learning pipelines need to run reliably at a fixed interval in the same cluster used to execute experimental jobs. This is challenging because the two kinds of workload compete for resources, and because it is not obvious which jobs to terminate when resources must be freed. Additionally, researchers required the ability to launch Docker containers via SLURM, which poses numerous security issues that need to be addressed appropriately.
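
The question of which jobs to terminate can be made concrete with a small sketch. It is only an illustration of one possible policy, not the logic SLURM or the deployed cluster uses: production jobs are never touched, and experimental jobs with the least elapsed runtime are stopped first so that the least work is lost. The Job fields and the pick_jobs_to_preempt function are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    cpus: int
    production: bool
    runtime_minutes: int


def pick_jobs_to_preempt(running, cpus_needed):
    """Choose experimental jobs to stop so that `cpus_needed` CPUs are freed.

    Returns an empty list when nothing needs to be freed, or when the
    experimental jobs alone cannot free enough CPUs.
    """
    # Only experimental jobs are candidates; youngest jobs go first.
    candidates = sorted(
        (job for job in running if not job.production),
        key=lambda job: job.runtime_minutes,
    )

    chosen = []
    freed = 0
    for job in candidates:
        if freed >= cpus_needed:
            break
        chosen.append(job)
        freed += job.cpus

    if freed < cpus_needed:
        return []  # not enough experimental capacity to reclaim
    return chosen


if __name__ == "__main__":
    running = [
        Job("nightly-pipeline", cpus=8, production=True, runtime_minutes=100),
        Job("exp-a", cpus=4, production=False, runtime_minutes=50),
        Job("exp-b", cpus=4, production=False, runtime_minutes=5),
        Job("exp-c", cpus=2, production=False, runtime_minutes=20),
    ]
    for job in pick_jobs_to_preempt(running, cpus_needed=6):
        print(f"would preempt {job.name} ({job.cpus} CPUs)")
```

In practice a scheduler would also weigh memory, GPUs and fairness between users; the sketch considers CPUs only to keep the trade-off visible.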

The Solution

In collaboration with the GSL data science team, a SLURM cluster design was proposed to solve the problems around resource sharing and scalability, and to streamline the team's ability to validate and deploy machine learning technology. The deployment and configuration of the SLURM cluster were then automated using Ansible Playbooks, which the GSL team used to deploy the cluster in their on-premises HPC environment. The playbook has been open sourced and is available below.
