Responsibilities
- Design and maintain robust data pipelines that ingest data from a wide range of sources, including APIs, documents, websites, and raw sensor data
- Integrate and optimize ETL/ELT processes developed by machine learning engineering (MLE) colleagues, improving performance, reliability, and long-term maintainability
- Own the full dataset lifecycle, from raw ingestion through cleaning, validation, and delivery as training-ready data
- Define and enforce data quality standards and governance practices across the Data Foundry team
- Build and maintain labeling pipeline infrastructure for ML applications, working closely with the annotation team
- Participate in architectural decisions, code reviews, and technical mentorship within the team
- Document data sources, pipeline logic, and processing decisions for reproducibility and team alignment
Requirements
- 3+ years of experience in data engineering
- Degree in Computer Science, Data Engineering, Computer Engineering, Information Systems, or equivalent technical background
- Solid understanding of the ML training lifecycle and what properties make a dataset suitable for model training
- Familiarity with data architecture patterns such as the layered Medallion Architecture (Bronze/Silver/Gold) or Data Mesh
- Proficiency in Python, with a focus on data manipulation, pipeline development, and automation
- Workflow orchestration using code-based tools such as Temporal, Airflow, Prefect, Dagster, or equivalent
- Distributed data processing with Spark, Databricks, or similar
- REST and gRPC API integration
- Strong SQL skills, both for data modeling and query optimization
- Experience with streaming systems and event-driven pipelines (Kafka, Kinesis, or equivalent)
Soft Skills
- Comfortable jumping into existing codebases and optimizing work built by others, without needing to start from scratch
- Technology-agnostic: you evaluate tools based on what the project needs, adopt new ones quickly, and don't get attached to a specific stack
- At ease in fast-moving environments where priorities shift and the right answer isn't always obvious
- Engineering-first mindset: you think in pipelines, own outcomes, and care about the quality of what you ship
- Driven by curiosity and innovation, not by comfort with a known toolset
Nice to Have
- Experience making architectural decisions and contributing to the technical growth of a team, formally or informally
- Go, for high-performance pipeline components
- dbt for transformation layer modeling
- Open table formats: Delta Lake, Apache Iceberg, or Hudi
- Data quality frameworks such as Great Expectations or Soda
- Cloud experience, preferably OCI (our current migration target); a background in AWS, GCP, or Azure is also valued
- Rapid prototyping with Streamlit or similar tools. The use of LLMs and GenAI to speed up internal tooling and experimentation is actively encouraged
- Experience with data annotation workflows or training dataset pipelines