Summary: Boise State University School of Nursing is seeking an HPC Systems Engineer 1 to provide high-performance computing support for Idaho’s academic research community. The role involves server administration, evaluation, planning, configuration, and installation of HPC environments, along with user support and system maintenance.

Responsibilities:
- Administer production HPC systems supporting university researchers, research centers, and sponsored projects, maintaining system availability, integrity, and performance
- Evaluate and recommend cluster-based solutions for research workloads, including capacity planning for compute, storage, backup and restore, and data movement requirements
- Collaborate with research computing staff and scientific support specialists to support researchers across diverse domains and application stacks
- Manage user accounts, allocations, and group memberships; provide first- and second-tier support for job submission, scheduler troubleshooting, storage quotas, and Linux command-line assistance
- Install, build, and maintain scientific software, from vendor-supplied packages to user-requested builds from source, using Spack and manual compilation as appropriate
- Triage and resolve technical issues spanning the Linux OS, networking, parallel filesystems, interconnects, and clustered scientific applications
- Create, update, and close tickets in the research computing ticketing system according to established service standards
- Develop and maintain system documentation, runbooks, knowledge base articles, and user-facing training materials using wikis and knowledge management tools
- Maintain inventory and lifecycle tracking of HPC equipment, including procurement support, receiving, decommissioning, and warranty records
- Participate in on-call rotation for after-hours maintenance windows and service-down scenarios
- Configure and maintain Slurm scheduling systems, including partitions, accounts, QoS, fairshare, and preemption policies
- Operate and extend cluster provisioning systems (Warewulf) and environment module systems (Lmod)
- Manage and extend Open OnDemand deployments and interactive computing interfaces for research users
- Develop and maintain infrastructure-as-code using configuration management tools (Ansible, Chef) and version-controlled repositories
- Implement and monitor system performance, utilization, and health using monitoring and metrics platforms; analyze results and implement tuning or capacity changes
- Support implementation of System Security Plans, including access controls, patching cadence, audit logging, and compliance documentation for regulated data environments where applicable
- Propose, maintain, and enforce operational policies, practices, and security procedures in coordination with OIT and the security team
- Perform other duties as assigned

Required Qualifications:
- Bachelor's Degree or equivalent experience
- Familiarity with HPC cluster architecture, including compute nodes, high-speed interconnects (InfiniBand or Omni-Path), parallel filesystems (Lustre, GPFS, BeeGFS), and distributed storage systems (Ceph, NFS)
- Familiarity with HPC workload managers, particularly Slurm, including job submission, partition and QoS configuration, and fairshare scheduling
- Familiarity with cluster provisioning and configuration tools such as Warewulf, Bright, or equivalent stateless/stateful provisioning systems
- Familiarity with environment module systems (Lmod, Environment Modules) and user-facing HPC portals such as Open OnDemand
- Familiarity with cloud computing platforms and hybrid HPC deployments, particularly AWS services for research computing (EC2, S3, ParallelCluster)
- Demonstrated proficiency with Linux system administration, including command-line tools, package management, system service management, user and permissions management, file systems, and log analysis
- Demonstrated proficiency in shell (Bash) and Python scripting for automation, with the ability to develop, deploy, and schedule scripts in production environments
- Experience installing and maintaining scientific applications on Linux, including building from source with autotools, CMake, and scientific software stack managers (Spack)
- Experience with HPC compiler toolchains (GCC, Intel oneAPI, NVIDIA HPC SDK) and MPI implementations (OpenMPI, MPICH, Intel MPI)
- Familiarity with GPU computing ecosystems (CUDA Toolkit) and workload scheduling of GPU resources
- Exposure to containerized HPC workloads using Apptainer or equivalent container technologies
- Experience with version control using Git, including branching workflows and collaborative platforms (GitHub, GitLab)
- Familiarity with configuration management and automation tools such as Ansible or Chef for infrastructure-as-code practices
- Strong verbal and written communication skills, with the ability to translate technical concepts for research users across a wide range of technical backgrounds
- Ability to manage, prioritize, and make progress on multiple concurrent projects with minimal supervision
- Customer-service orientation when supporting a broad research user community with patience and professionalism
- Problem-solving and critical thinking skills to diagnose and resolve complex technical issues methodically

Required Skills: HPC systems administration, HPC cluster architecture, InfiniBand, Omni-Path, Lustre, GPFS, BeeGFS, Ceph, NFS, Slurm workload manager, Slurm job submission, Slurm partition configuration, Slurm QoS configuration, Slurm fairshare scheduling, Warewulf cluster provisioning, Bright provisioning system, Lmod environment modules, Open OnDemand HPC portal, AWS EC2, AWS S3, AWS ParallelCluster, Linux system administration, Linux command-line tools, Linux package management, Linux system service management, Linux user and permissions management, Linux file systems, Linux log analysis, Bash scripting, Python scripting, Scientific software installation on Linux, Autotools, CMake, Spack software stack manager, GCC compiler toolchain, Intel oneAPI compiler, NVIDIA HPC SDK, OpenMPI, MPICH, Intel MPI, CUDA Toolkit, Apptainer container technology, Git version control, GitHub, GitLab, Ansible configuration management, Chef configuration management, Customer-service orientation

Benefits:
- 12 paid holidays AND the University is closed between Christmas and New Year's (requires use of 3 vacation days)
- Between 12-24 annual paid vacation days for full-time Professional and Classified staff depending on position type and years of service
- 10.76% University contribution to your ORP retirement fund (Professional and Faculty employees)
- 11.96% University contribution to your PERSI retirement fund (Classified employees)
- Excellent medical, dental and other health-related insurance coverages
- Tuition fee waiver benefits for employees, spouses and their dependents
