Engineering · Property Finder UAE
Junior Data Engineer
About the Role
We are looking for a Data Engineer with a software engineering mindset to join our Data & AI team. This is not just a role for writing SQL scripts; it is an opportunity to build robust, scalable, and observable data infrastructure on the cloud.
You will work with a modern tech stack (Dagster, dbt, ClickHouse, AWS) to build the pipelines that power our analytics, machine learning, and GenAI products. If you care about code quality, automation, and "Data as a Product," this role is for you.
Our Tech Stack
- Languages: SQL, Python
- Orchestration: Dagster (migrating from Airflow)
- Data Stores: Redshift, ClickHouse, S3
- Ingestion & Transformation: Fivetran, dbt
- Cloud & Infra: AWS (ECS/EKS, Glue, Lambda, Athena)
- IaC: Terraform with Terragrunt
- AI/GenAI: AWS Bedrock, LangChain, LLMs.
Key Responsibilities
- Build & Orchestrate: Develop and maintain reliable ETL/ELT pipelines using SQL and Python. You will use Dagster to orchestrate dependencies, ensuring data flows correctly from source to destination.
- Data Transformation: Use dbt to model raw data into clean, business-ready datasets (Star Schema) that enable stakeholders to self-serve.
- Quality & Observability: Own the quality of your data. Implement tests (dbt tests, unit tests) and monitoring to prevent silent failures. When pipelines break, you will troubleshoot them and fix the root cause.
- Cloud Engineering: Work with AWS services (S3, DMS, Glue) and containerized environments (Docker/Kubernetes) to deploy your code.
- Collaborate: Partner with Data Scientists and Product Managers to understand their data needs and deliver high-quality solutions.
- Innovation: Support the team in integrating GenAI capabilities (LLMs, LangChain) into our engineering workflows.
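To make the "transform, then test" loop above concrete, here is a minimal, self-contained sketch in Python using the stdlib `sqlite3` module. All table and column names (`raw_listings`, `fct_listings_by_city`) are hypothetical, not Property Finder's actual schema; in production, the transformation would live in a dbt model and the checks in dbt tests, orchestrated by Dagster.

```python
import sqlite3

# Hypothetical sketch: model raw listing rows into a business-ready
# aggregate table, then run a quality gate so failures are loud, not silent.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_listings (
        listing_id INTEGER,
        city       TEXT,
        price      REAL
    );
    INSERT INTO raw_listings VALUES
        (1, 'Dubai',     1200000.0),
        (2, 'Dubai',      950000.0),
        (3, 'Abu Dhabi',  800000.0);
""")

# Transformation step (the kind of model dbt would own): aggregate raw
# events into a clean, analysis-ready dataset.
conn.execute("""
    CREATE TABLE fct_listings_by_city AS
    SELECT city,
           COUNT(*)   AS listing_count,
           AVG(price) AS avg_price
    FROM raw_listings
    GROUP BY city
""")

# Quality gate (analogous to a dbt test): fail loudly if the transform
# dropped rows or produced NULL metrics.
raw_count = conn.execute("SELECT COUNT(*) FROM raw_listings").fetchone()[0]
fct_rows = conn.execute(
    "SELECT city, listing_count, avg_price "
    "FROM fct_listings_by_city ORDER BY city"
).fetchall()

assert sum(r[1] for r in fct_rows) == raw_count, "row counts diverged"
assert all(r[2] is not None for r in fct_rows), "NULL avg_price produced"

print(fct_rows)  # [('Abu Dhabi', 1, 800000.0), ('Dubai', 2, 1075000.0)]
```

The point of the quality gate is the mindset the role calls for: a pipeline that produces wrong data should fail visibly, not succeed quietly.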
What We Look For (Essential Experience)
- Experience: 2–3 years of hands-on experience in Data Engineering.
- Engineering Mindset: You treat data pipelines like software products. You are comfortable with Version Control (Git), code reviews, and testing.
- SQL Mastery: You can write complex, efficient queries and understand data modeling concepts (e.g., joins, window functions, normalization).
- Python Proficiency: You can write clean Python scripts for data manipulation and automation (beyond just "notebook scripting").
- Cloud Native: Familiarity with cloud data warehouses (Redshift, Snowflake, or BigQuery) and core cloud concepts (S3, IAM, Compute).
- Modern ETL: Experience with modern transformation tools (like dbt).
- Problem Solver: Excellent ability to investigate data issues. You don't just restart the job; you dig into the logs to find why it failed.
- GenAI Interest: Familiarity with LLMs, AWS Bedrock, or LangChain is a huge plus.
- CI/CD: Experience automating deployments using GitHub Actions, Jenkins, or similar.
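As a small, hedged illustration of the SQL fluency listed above, the snippet below runs a window-function query through Python's stdlib `sqlite3` (the schema and sample data are invented for the example): ranking listings by price within each city using `RANK() OVER (PARTITION BY ...)`.

```python
import sqlite3

# Illustrative only: rank listings by price within each city.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE listings (listing_id INTEGER, city TEXT, price REAL);
    INSERT INTO listings VALUES
        (1, 'Dubai',     1200000.0),
        (2, 'Dubai',      950000.0),
        (3, 'Abu Dhabi',  800000.0),
        (4, 'Abu Dhabi',  600000.0);
""")

# Window function: rank resets per city (PARTITION BY), highest price first.
rows = conn.execute("""
    SELECT city,
           listing_id,
           RANK() OVER (PARTITION BY city ORDER BY price DESC) AS price_rank
    FROM listings
    ORDER BY city, price_rank
""").fetchall()

print(rows)
# [('Abu Dhabi', 3, 1), ('Abu Dhabi', 4, 2), ('Dubai', 1, 1), ('Dubai', 2, 2)]
```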
Desired Experience (The "Nice to Haves")
- Infrastructure as Code: Familiarity with Terraform or Terragrunt.
- Containerization: Experience running code in Docker or Kubernetes (EKS/ECS).
- Streaming: Exposure to real-time data frameworks (AWS Kinesis, Kafka, SQS/SNS).
- Data Orchestration: Experience with orchestration tools such as Dagster, Airflow, or AWS Step Functions.
- Department: Engineering
- Role: Data Platform
- Locations: Property Finder UAE