Data Engineer - Data Foundry Engineer

🌐 Global | Remote

Posted Apr 2, 2026 | Updated May 20, 2026

Data Science at TRACTIAN

The Data Science team at TRACTIAN focuses on extracting valuable insights from vast amounts of industrial data. Using advanced statistical methods, algorithms, and data visualization techniques, this team transforms raw data into actionable intelligence that drives decision-making across engineering, product development, and operational strategies. The team constantly works on optimizing prediction models, identifying trends, and providing data-driven solutions that directly enhance the company’s operational efficiency and the quality of its products.

What you'll do

We're looking for a Data Engineer with a strong engineering foundation and comfort with AI workflows to join our Data Foundry team. In this role, you'll be the bridge between our model training and data annotation teams, building the pipelines and infrastructure that turn raw, messy data into gold-standard datasets ready for AI consumption.

Responsibilities

Design and maintain robust data pipelines to ingest from a wide range of sources, including APIs, documents, websites, and raw sensor data

Integrate and optimize ETL/ELT processes developed by our machine learning engineers (MLEs), improving performance, reliability, and long-term maintainability

Own the full dataset lifecycle, from raw ingestion through cleaning, validation, and delivery as training-ready data

Define and enforce data quality standards and governance practices across the Data Foundry team

Build and maintain labeling pipeline infrastructure for ML applications, working closely with the annotation team

Participate in architectural decisions, code reviews, and technical mentorship within the team

Document data sources, pipeline logic, and processing decisions for reproducibility and team alignment

Requirements

3+ years of experience in data engineering

Degree in Computer Science, Data Engineering, Computer Engineering, Information Systems, or equivalent technical background

Solid understanding of the ML training lifecycle and what properties make a dataset suitable for model training

Familiarity with layered data architecture patterns such as Medallion Architecture (Bronze/Silver/Gold) or Data Mesh

Proficiency in Python, with focus on data manipulation, pipeline development, and automation

Workflow orchestration using code-based tools such as Temporal, Airflow, Prefect, Dagster, or equivalent

Distributed data processing with Spark, Databricks, or similar

REST and gRPC API integration

Strong SQL skills, both for data modeling and query optimization

Experience with streaming systems and event-driven pipelines (Kafka, Kinesis, or equivalent)

Soft Skills

Comfortable jumping into existing codebases and optimizing work built by others, without needing to start from scratch

Technology-agnostic: you evaluate tools based on what the project needs, adopt new ones quickly, and don't get attached to a specific stack

At ease in fast-moving environments where priorities shift and the right answer isn't always obvious

Engineering-first mindset: you think in pipelines, own outcomes, and care about the quality of what you ship

Driven by curiosity and innovation, not by comfort with a known toolset

Nice to Have

Experience making architectural decisions and contributing to the technical growth of a team, formally or informally

Go, for high-performance pipeline components

dbt for transformation layer modeling

Open table formats: Delta Lake, Apache Iceberg, or Hudi

Data quality frameworks such as Great Expectations or Soda

Cloud experience, preferably OCI (our current migration target); AWS, GCP, or Azure background is also valued

Rapid prototyping with Streamlit or similar tools; using LLMs and GenAI to speed up internal tooling and experimentation is actively encouraged

Experience with data annotation workflows or training dataset pipelines
