
    Senior SRE Software Engineer, Storage and Data - Bandar Baru Bangi, Malaysia - NVIDIA

    NVIDIA
    NVIDIA Bandar Baru Bangi, Malaysia

    5 days ago

    Full time
    Description

    SRE at NVIDIA ensures that our DGX Cloud platform continues to be reliable and performant to meet the needs of our users. You will play a critical role in ensuring the reliability, availability, and performance of storage infrastructure for NVIDIA DGX GPU cloud platforms. You will collaborate with cross-functional teams to design, build, and maintain scalable and fault-tolerant storage solutions that support our mission-critical applications and services. Your expertise in storage systems and reliability engineering will be instrumental in minimizing downtime, improving system efficiency, and enhancing the overall user experience.

    SRE is also a mindset and a set of engineering approaches to running efficient production systems, with a focus on eliminating manual work through modern automation practices and performance tuning. We promote self-direction to work on meaningful projects while striving to build an environment that provides the support and mentorship needed to learn and grow.

    What You Will Be Doing:

  • Develop strategies to ensure the reliability and availability of storage systems, including redundancy, failover, and disaster recovery plans.
  • Continuously analyze and fine-tune storage systems for optimal performance, including throughput optimization, caching, and latency reduction. Identify and resolve performance bottlenecks to enhance overall system efficiency.
  • Develop and maintain automation scripts and tools to streamline storage provisioning, configuration, and maintenance tasks.
  • Implement monitoring and alerting systems to proactively identify and address issues.
  • Participate in the on-call rotation to respond promptly to storage-related incidents, conduct root cause analysis of outages, and implement preventive measures.
  • Collaborate with cross-functional teams, including Compute SRE, development, and networking, to ensure seamless integration of large-scale storage solutions.
  • Work with AI/ML workloads to capture and correlate behavior in large clusters and workflows, which are otherwise hard to understand.

    What We Need To See:

  • BS degree in Computer Science or related technical field involving coding (e.g., physics or mathematics), with 5+ years equivalent practical experience.
  • Proven experience in storage system administration and site reliability engineering.
  • Experience with Git, RESTful APIs, Linux service operation, networking, complexity analysis, AWS S3, software design, and maintaining large-scale Linux-based systems.
  • Experience in one or more of the following languages: Ansible, Bash, Python, Go, YAML, Java
  • Good knowledge of infrastructure configuration management tools like Ansible, Chef, Puppet, and Terraform.
  • Experience in using observability and tracing-related tools like InfluxDB, Prometheus, Grafana, and the Elastic (OpenSearch) stack.

    Ways to stand out from the crowd:

  • Experience with storage solutions like OpenStack Swift (object), AWS S3 (object), DDN, and Lustre.
  • Strong Linux and network troubleshooting skills, using a range of commands and tools.
  • Demonstrated SRE mindset, customer-first approach, and passion for ensuring customer success.
  • Interest in crafting, analyzing, and fixing large-scale distributed systems. Strong debugging skills with a systematic problem-solving approach to identify complex problems.
  • Experience in using or running large private and public cloud systems based on Kubernetes, OpenStack, and Docker.


  • Quantum Corporation Bangi, Malaysia Full time

    As a valuable member of the CatDV team, your role will involve actively contributing to our software release process and enhancing our automation testing efforts. Your contributions will play a pivotal role in ensuring consistent and high-quality product releases, ultimately elev ...


  • Quantum Corporation Bangi, Malaysia Full time

    The Internal Audit Senior will play a crucial role in supporting comprehensive audit programs, focusing on operational and compliance effectiveness, including adherence to the Sarbanes-Oxley Act (SOX), across the organization. This position involves actively participating in risk ...


  • Quantum Corporation Bangi, Malaysia Full time

    The Internal Audit Manager will play a pivotal role in overseeing comprehensive audit programs, focusing on both operational and compliance effectiveness, including adherence to the Sarbanes-Oxley Act (SOX), across the organization. They will be instrumental in conducting risk as ...


  • BayWa r.e. Putrajaya, Malaysia Full time

    Engineering (Construction) Intern · BayWa r.e. is the home for change makers. We r.e.think energy – how it is produced, stored and can be best used to enable the global renewable energy transition that is essential to the future of our planet. At BayWa r.e. we effect change globally. With ...

  • Logicalis Australia

    Business Analyst

    5 days ago


    Logicalis Australia Cyberjaya, Malaysia Full time

    We are seeking a Business Analyst based in Cyberjaya, Malaysia. You will be dedicated to monitoring the operations of the professional services business and helping deliver profitable outcomes through analysis, resource utilization and continuous improvement. · In this ...

  • Logicalis Australia

    Shift Team Lead

    5 days ago


    Logicalis Australia Cyberjaya, Malaysia Full time

    We are looking for a Shift Team Lead to be based in our Cyberjaya Office. This role is a key leadership role that is responsible for overseeing and coordinating the activities of a designated shift within the organization. Your core objective is to continually measure and improve ...

  • NTT

    MS Engineer L1

    6 days ago


    NTT Cyberjaya, Malaysia Full time

    NTT is a leading global IT solutions and services organisation that brings together people, data and things to create a better and more sustainable future. · In today's 'iNTTerconnected' world, connections matter more now than ever. By bringing together talented people, world-cla ...

  • Nityo Infotech

    Server Engineer

    5 days ago


    Nityo Infotech Puchong, Malaysia

    Excellent communication and presentation skills. · Result oriented & customer focused. · Mature, self-motivated, proactive, aggressive, and independent. · Required to speak and write English and Malay fluently. · Fresh graduates are encouraged to apply. · Candidate should ...

  • Hunters International Sdn Bhd

    Warehouse Assistant

    5 days ago


    Hunters International Sdn Bhd Puchong, Malaysia Full time

    Job Description · Unloading, controlling, installing and transferring goods in the warehouse · Loading and unloading trucks and containers · Operating forklift · Driving lorry to transport goods · Transporting orders · Processing and sorting incoming goods · Tracking the inventor ...


  • AXA Puchong, Malaysia Full time

    JOB PURPOSE · AXA Group Operation Malaysia is seeking a Cloud Infrastructure Specialist to join us in Kuala Lumpur. In this role, you will build, implement, and operate IT solutions based on various vibrant technologies such as virtualization, container, JEE based middleware, AP ...

  • Mercedes-Benz Services Malaysia Sdn Bhd

    Full Stack

    5 days ago


    Mercedes-Benz Services Malaysia Sdn Bhd Puchong, Malaysia

    Mercedes-Benz Mobility supports the sales of Mercedes-Benz Group's automobile brands with financial services, which range from financing and leasing to insurance, car rental and fleet management. · Following the strategy of the Mercedes-Benz Mobility AG to digitize and automate ...


  • Tentacle Sso Sdn Bhd Kuala Lumpur, Malaysia Part time

    Position: Data Protection Engineer · Experience: 3-10 Years · Eligible: Apply from anywhere · Visa: Provided by company · Budget: MYR 11.5k/Month · Location: Menara Maybank, KL · Domain: Banking Industry · Job Description: Senior Data Protection Engineer Level 3 (7yrs-10yrs Mus ...


  • Johnson Matthey Malaysia, Kuala Lumpur Full time

    Vacancy: Infrastructure Engineer · Location: Kuala Lumpur, Malaysia · Job Family: IT · The Infrastructure Engineer will support the Infrastructure function globally by implementing and administering storage and datacentre infrastructure (SAN, server, storage etc) installed across ...


  • Jobs via eFinancialCareers Malaysia, Kuala Lumpur Full time

    Job Description · Looking for an Infrastructure Solution Architect to make intuitive high level decisions for Infrastructure Solution Design. · Own infrastructure architecture solution design. · Job Responsibilities · To evaluate, recommend, collect and confirm requirements of se ...

  • Hibiscus Petroleum Berhad

    Production Analyst

    3 days ago


    Hibiscus Petroleum Berhad Malaysia, Kuala Lumpur Full time

    Duties And Responsibilities · Key role in assisting the team in delivering production, injection and emission target. · Responsible for production data gathering, data integrity and checks, data storage and filing, and routine reporting to various stakeholders. · Generate reports ...

  • Tek Infotree Sdn Bhd

    Data Scientist

    3 days ago


    Tek Infotree Sdn Bhd Kuala Lumpur, Malaysia Part time

    We are hiring a Data Scientist. · Long-term Contracts with the possibility of permanent employment · Local Malaysians and Expat are welcome · Max salary budget: RM10,000 · Location KL / WFO · Functional Group: Technical Team · REPORTS TO · • Project Manager – Software and ...

  • Ambition

    AWS Cloud Engineer

    4 days ago


    Ambition Malaysia, Kuala Lumpur Full time

    Responsibilities · Job description · Understand AWS technologies (WebApp, Azure API, frontdoor, Managed Instance, Compute, network, storage, and data combination: · IAAS and PAAS; VM creations, networks (vnet, traffic manager, load balancer, application gateway) · Function app/ stor ...


  • LPS Malaysia, Kuala Lumpur Full time

    Job description: · Backup Administrator: · In this role, you are responsible for supporting the backup operation of the client's network. You will also be involved in new project implementation. You must possess hands-on experience in backup solutions and architectur ...


  • Agensi Pekerjaan Great Pyramid Sdn Bhd Kuala Lumpur, Malaysia Full time

    Job Summary: · As System Administrator, you are responsible to plan, implement and maintain system infrastructure and related projects as well as providing level-2/3 support on the day-to-day operations and ensuring system infrastructure which inclusive of Servers, Storage, Backu ...


  • TC Management Services Corporation Sdn Bhd Kuala Lumpur, Malaysia Full time

    Job Summary: · As System Administrator, you are responsible to plan, implement and maintain system infrastructure and related projects as well as providing level-2/3 support on the day-to-day operations and ensuring system infrastructure which inclusive of Servers, Storage, Backup a ...