Software Engineer, AI Training and Infrastructure
Skild AI
Company Overview
At Skild AI, we are building the world's first general-purpose robotic intelligence that is robust and adapts to unseen scenarios without failing. We believe massive scale through data-driven machine learning is the key to unlocking these capabilities for the widespread deployment of robots within society. Our team consists of individuals with varying levels of experience and backgrounds, from new graduates to domain experts. Relevant industry experience is important, but ultimately less so than your demonstrated abilities and attitude. We are looking for passionate individuals who are eager to explore uncharted waters and contribute to our innovative projects.
Position Overview
We are looking for a Software Engineer to work at the forefront of developing and optimizing the software infrastructure and tools necessary for training cutting-edge AI models. You will focus on building robust, scalable, and efficient training pipelines and frameworks that support the entire machine learning lifecycle, from data preparation to model deployment. You will collaborate with researchers and machine learning engineers to ensure seamless integration and operation of training systems, pushing the boundaries of what AI can achieve in real-world robotics applications. You will explore new ways to efficiently make use of many types of data in our training pipeline.
Responsibilities
- Develop and maintain robust, scalable, and distributed training pipelines (data preprocessing, training orchestration, and model evaluation) and frameworks for large-scale AI models.
- Optimize training processes for performance and resource utilization, ensuring scalability and reliability.
- Collaborate with researchers and machine learning engineers to integrate state-of-the-art algorithms and techniques into training pipelines.
- Monitor and analyze training runs, identifying bottlenecks and proposing solutions to improve efficiency and performance.
- Ensure the robustness and reliability of the training infrastructure, including automated testing and continuous integration.
Preferred Qualifications
- BS, MS, or higher degree in Computer Science, Robotics, Engineering, or a related field, or equivalent practical experience.
- Proficiency in Python, C++, or a similar language, and in at least one deep learning library such as PyTorch, TensorFlow, or JAX.
- Strong background in distributed computing, parallel processing techniques, large-scale dataset handling, and data preprocessing.
- Deep understanding of state-of-the-art machine learning techniques and models.
- Experience with cloud-based training environments (AWS, Google Cloud, Azure).
- Experience in developing and maintaining software tooling and infrastructure for machine learning.
- Deep understanding of, and practical experience with, software engineering principles, including algorithms, data structures, and system design.
- Experience with continuous integration and automated testing frameworks.