Senior AI Platform Engineer
Expel
Software Engineering, Data Science
Remote
You believe great ML systems don't just work — they scale, they recover gracefully, and they give data scientists the confidence to iterate quickly. At Expel, you'll be a key contributor in building and maturing the infrastructure that powers our machine learning and generative AI capabilities. From end-to-end training pipelines to the specialized infrastructure behind production agentic applications, your work will directly shape how fast we can innovate and how reliably our AI systems run.
You'll work closely with senior and principal engineers, data scientists, and cross-functional teams to operationalize ML at scale. You bring strong hands-on expertise and a genuine drive to continuously improve the systems and practices around you.
What Expel can do for you
- Give you hard, meaningful problems — building the infrastructure that lets defenders win using AI
- Connect you with a collaborative team of engineers, data scientists, and researchers who care about doing it right
- Offer unlimited PTO (that leadership models and encourages), up to 24 weeks of parental leave, and really excellent health benefits
- Pay you monthly fitness and cell phone stipends (no receipts required)
- Support your professional growth with a conference benefit and continuous learning opportunities
- Offer full remote flexibility — work from wherever you do your best work
What you can do for Expel
Build and scale ML infrastructure
- Architect and maintain end-to-end machine learning training pipelines on AWS (SageMaker, EKS, Step Functions) to ensure reliable and reproducible model development and deployment
- Build and maintain infrastructure for production agentic applications using Amazon Bedrock and Bedrock AgentCore — including agent runtimes, memory, secure gateways, and observability at scale
- Contribute to the architectural evolution of our ML platform, including evaluating MLOps tooling and participating in buy vs. build decisions
Operationalize with rigor
- Implement AI/ML governance best practices for model versioning, testing, validation, maintenance, and security
- Integrate MLOps best practices with Expel's SDLC, security, and infrastructure standards, working alongside SRE, Platform Engineering, and Security teams
- Drive quality, reliability, and scalability improvements through thoughtful engineering and monitoring
Collaborate and enable
- Partner with data scientists, software engineers, and stakeholders to operationalize ML models reliably and at scale
- Mentor and support junior engineers; foster a culture of engineering excellence
- Create and maintain documentation, internal tooling, and enablement resources so practitioners across Expel can work effectively with ML systems
- Stay current with the MLOps landscape and bring relevant innovations back to the team
What you should bring with you
Collaboration & communication
- Clear communicator — able to write documentation and explain technical concepts to both engineering and non-technical audiences
- Strong collaborator with engineers, product managers, and business stakeholders
- Demonstrated ability to mentor others and invest in the growth of the people around you
- Balances near-term delivery with longer-term technical quality
Technical depth
- Strong Python proficiency; familiarity with other languages (Go, JS) is a plus
- Solid experience with CI/CD pipelines, infrastructure-as-code, and containerization for ML workloads
- Hands-on experience with cloud-based ML platforms — AWS (SageMaker, Bedrock, Bedrock AgentCore) strongly preferred; GCP (Vertex AI) experience also valued
- Proven experience operationalizing LLMs and building infrastructure for complex agentic applications — agent orchestration, memory, tool calling, RAG architectures
- Familiarity with ML frameworks including Scikit-Learn, PyTorch, Spark, and TensorFlow
- Working knowledge of continuous retraining, concept drift monitoring, and data drift detection in production
Education & experience
- 5+ years of relevant software engineering experience with meaningful focus on ML operations and infrastructure
- Degree in Computer Science, Mathematics, Statistics, Engineering, or a related technical field preferred (or a compelling story)
- Demonstrated track record of delivering impactful ML infrastructure or MLOps projects
- Experience contributing to team practices, standards, or tooling in a collaborative environment
Additional information
The base salary range for this role is $142,900 to $207,200 USD, plus bonus eligibility and equity.
We believe in paying transparently and equitably. Your salary will ultimately be based on factors such as your experience, skills, team equity, and market data. You’ll also be eligible for unlimited PTO (which we model and encourage), work location flexibility, up to 24 weeks of parental leave, and really excellent health benefits.
We’re only hiring those authorized to work in the United States.
We’re an Equal Opportunity Employer: You’ll receive consideration for employment without regard to race, sex, color, religion, sexual orientation, gender identity, national origin, protected veteran status, or on the basis of disability.
We will ensure that individuals with disabilities are provided reasonable accommodation to participate in the job application or interview process, to perform essential job functions, and to receive other benefits and privileges of employment. Please contact us to request accommodation.