What you’ll do
At Doctolib, we're on a mission to transform healthcare through the power of AI. As a Senior Data Engineer, you'll play a key role in building and optimizing the data foundations within the AI Team to deliver safe, scalable, and impactful models.
You will join a dedicated team working on data infrastructure for LLM-, VLM-, and RAG-based systems, powering our new AI Medical Companion.
Your work will ensure that our engineers and data scientists can train, evaluate, and deploy AI models efficiently on high-quality, well-structured, and compliant data.
Your responsibilities include, but are not limited to:
- Ensure high standards of data quality for AI model inputs.
- Design, build, and maintain scalable data pipelines on Google Cloud Platform (GCP) for AI and machine learning use cases.
- Implement data ingestion and transformation frameworks that power Retrieval systems and training datasets for LLMs and multimodal models.
- Architect and manage NoSQL and vector databases to store and retrieve embeddings, documents, and model inputs efficiently.
- Collaborate with ML and platform teams to define data schemas, partitioning strategies, and governance rules that ensure privacy, scalability, and reliability.
- Integrate unstructured and structured data sources (text, speech, image, documents, metadata) into unified data models ready for AI consumption.
- Optimize performance and cost of data pipelines using GCP native services (BigQuery, Dataflow, Pub/Sub, Cloud Storage, Vertex AI).
- Contribute to data quality and lineage frameworks, ensuring AI models are trained on validated, auditable, and compliant datasets.
- Continuously evaluate and improve our data stack to accelerate AI experimentation and deployment.
Who you are
You could be our next teammate if you have:
- Master’s or Ph.D. degree in Computer Science, Data Engineering, or a related field.
- 5+ years of experience in Data Engineering, ideally supporting AI or ML workloads.
- Strong experience with the GCP data ecosystem.
- Proficiency in Python and SQL, with experience in data pipeline orchestration (e.g., Airflow, Dagster, Cloud Composer).
- Deep understanding of NoSQL systems (e.g., MongoDB) and vector databases (e.g., FAISS, Vector Search).
- Experience designing data architectures for RAG, embeddings, or model training pipelines.
- Knowledge of data governance, security, and compliance for sensitive or regulated data.
- Familiarity with W&B, MLflow, Braintrust, or DVC for experiment tracking and dataset versioning (dataset snapshots, change tracking, reproducibility).
- Familiarity with containerized environments (Docker, Kubernetes) and CI/CD for data workflows.
- A collaborative mindset and passion for building the data foundations of next-generation AI systems.
What we offer
- Free health insurance for you and your children
- Parent Care Program: receive one additional month of leave on top of the legal parental leave
- Free mental health and coaching services through our partner Moka.care
- For caregivers and workers with disabilities, a package including remote policy adaptations, extra days off, and psychological support
- Work from EU countries and the UK for up to 10 days per year, thanks to our flexibility days policy
- Works Council subsidy to refund part of your sports club membership or creative classes
- Up to 14 days of RTT (additional rest days under France's reduced working time scheme)
- Lunch voucher with Swile card
The interview process
- HR Screen
- Technical Deep Dive
- System Design
- Behavioral Interview
- Reference check and criminal records check
- Offer!