Research Engineer, Companion

    Job description

    About the Role

    NEO is a home robot that handles chores and provides personalized assistance. It's meant to be controlled naturally: you talk, it understands, it responds, it acts.

    As a Research Engineer, you'll own the speech and language stack that makes NEO feel like a calm, capable presence: streaming ASR, low-latency TTS, speech-to-speech interaction, and task-level NLU, all tuned for robotics constraints (latency, noise, far-field audio, on-device budgets, safety, and reliability).

    Job requirements

    In this role, you will:

    • Build and deploy production speech models (ASR, TTS, and/or speech-to-speech) optimized for NEO's on-device compute and real-time latency requirements.

    • Own the full pipeline from data collection and model training through deployment and monitoring.

    • Solve hard acoustic problems that come with putting a robot in someone's living room: far-field recognition, noise robustness, barge-in handling, and multi-speaker environments.

    • Design voice interaction that feels human: natural prosody, appropriate turn-taking, responses that match context and emotional tone.

    • Build evaluation infrastructure, define quality metrics that actually correlate with user experience, and use real-world feedback to prioritize improvements.

    • Collaborate with hardware and robotics teams to integrate speech capabilities with NEO's vision, memory, and physical behavior systems.


    You might thrive in this role if you:

    • Have 3+ years building speech systems, especially systems currently running in production.

    • Have deep expertise in at least one of: automatic speech recognition, text-to-speech synthesis, spoken language understanding, or speech-to-speech modeling.

    • Are fluent in modern deep learning (transformers, diffusion models, autoregressive generation) and can train, debug, and optimize models end-to-end in PyTorch or JAX.

    • Have deployed models under real constraints: on-device inference, latency budgets, memory limits, or edge hardware.

    • Have published at top venues (ICASSP, Interspeech, NeurIPS, ICML) or built equivalent systems at companies known for voice AI.

    On-site
    • San Carlos, California, United States
    $150,000 - $250,000 per year
    Artificial Intelligence (AI)