Vakul Muvvala

Dorchester

Summary

Results-driven Data Engineer with over 4 years of experience designing and implementing scalable data pipelines across the banking and healthcare domains. Proficient in Python, PySpark, and SQL for processing large-scale structured and semi-structured data, with hands-on expertise in Apache Spark, Kafka, Airflow, and Flume. Skilled in building end-to-end ETL workflows using AWS Glue, Azure Data Factory, and Databricks, and in managing data lakes on AWS S3 and Azure Data Lake Storage. Experienced in modeling data with Star and Snowflake schemas and in optimizing performance on Snowflake, Redshift, Synapse, PostgreSQL, and SQL Server. Adept at real-time data processing using Kafka Streams and Spark Structured Streaming for use cases such as fraud detection. Strong knowledge of data quality checks with Great Expectations, metadata management via Apache Atlas and the Glue Data Catalog, and CI/CD pipelines with GitHub Actions and Jenkins. Developed REST APIs using Flask and FastAPI and deployed scalable jobs on EMR and Kubernetes. Enforces enterprise-grade data governance with HIPAA/SOX compliance and robust monitoring using CloudWatch and Datadog, and collaborates with architects, analysts, and data scientists on ML feature pipelines. Passionate about clean architecture, reusable design, and mentoring peers in data engineering best practices.

Overview

5 years of professional experience
2 Certifications

Work History

Data Engineer

BNY Mellon
Boston
08.2021 - Current
  • Designed and maintained PySpark pipelines to ingest and transform financial transactions and market feed data from multiple systems, applying enrichment and validation logic before persisting into AWS S3 and Redshift.
  • Created Airflow DAGs to orchestrate daily batch jobs for reconciliations, trade settlements, and account balances, adding dynamic scheduling, SLAs, and alerting mechanisms for operational confidence.
  • Developed real-time Kafka streaming applications using Spark Structured Streaming to detect suspicious activities, such as high-frequency trades, large withdrawals, and failed login attempts, enabling fraud prevention in near real time (a minimal sketch of this pattern follows the list below).
  • Built ETL jobs that extracted portfolio data from legacy Oracle systems, transformed it into target schemas using Spark SQL, and loaded it into Snowflake for downstream analytics and regulatory reporting.
  • Implemented fine-grained row-level security in Snowflake using Secure Views and RBAC, restricting access to sensitive financial records based on department, user role, and asset class permissions.
  • Created data quality checks using Great Expectations, validating schema structure, numeric boundaries, duplicate detection, and null values to ensure clean input for financial dashboards and reports.
  • Collaborated with data governance team to document and maintain data lineage using Apache Atlas, tracking field-level transformations and enhancing auditability for internal and external compliance checks.
  • Built CI/CD workflows with Jenkins and GitHub, automating deployments of PySpark jobs and Airflow DAGs across dev, QA, and prod environments with rollback support and version control.
  • Used AWS Glue Catalog and Crawlers to automatically classify incoming datasets, define schemas, and make metadata accessible to data analysts via Athena.
  • Performed optimization on Spark jobs by broadcasting dimension tables, reducing shuffle operations, tuning memory configurations, and compressing intermediate files using Parquet and Snappy.
  • Integrated Databricks Delta Lake to maintain ACID compliance, enabling scalable transaction processing, schema evolution, and rollback support on critical account data pipelines.
  • Wrote reusable transformation functions using Python modules, handling date normalization, currency conversion, and instrument mappings across multiple pipeline stages, reducing code duplication across teams.
  • Worked with PostgreSQL and SQL Server for intermediate data staging, and wrote advanced SQL queries using CTEs, window functions, and case logic for quarterly audit reports.
  • Implemented anomaly detection metrics using Prometheus and visualized them through Grafana, proactively identifying delayed jobs and data volume mismatches in production ETL workflows.
  • Managed S3 lifecycle policies and storage class transitions to control storage costs while maintaining retention policies required for compliance and regulatory reporting (in some cases, over 7 years).
  • Participated in weekly Agile ceremonies and collaborated with stakeholders to convert financial reporting requirements into well-defined user stories with technical specs and acceptance criteria.
  • Wrote Spark UDFs to handle complex tax calculation logic based on international rules, residency status, and transaction type, supporting downstream systems like tax withholding engines.
  • Integrated historical data sources into curated zones by building a backfill framework with checkpoints, deduplication logic, and audit trails, ensuring completeness without duplication.
  • Applied KMS-based encryption, VPC routing, and private subnet configurations for all data movement and processing layers to enforce security and comply with banking standards like SOX and PCI-DSS.
  • Supported internal finance and risk teams by delivering data extracts for stress testing, liquidity tracking, and capital adequacy using fully validated and version-controlled datasets.
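
The streaming detection bullet above follows a common Kafka plus Spark Structured Streaming pattern. The code below is a minimal sketch of that pattern, not the production job: the topic name, broker address, payload fields, window sizes, and thresholds are all hypothetical placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("suspicious-activity-sketch").getOrCreate()

# Hypothetical payload schema for a "transactions" topic.
schema = StructType([
    StructField("account_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Read the raw stream from Kafka and parse the JSON value column.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker address
       .option("subscribe", "transactions")
       .load())

events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(F.from_json("json", schema).alias("e"))
          .select("e.*"))

# Flag accounts with repeated large withdrawals inside a 5-minute window (illustrative thresholds).
flagged = (events
           .where(F.col("amount") > 10000)
           .withWatermark("event_time", "10 minutes")
           .groupBy(F.window("event_time", "5 minutes"), "account_id")
           .agg(F.count("*").alias("large_tx_count"))
           .where(F.col("large_tx_count") >= 3))

# Console sink for the sketch; a real pipeline would write to Kafka, S3, or an alerting system.
query = (flagged.writeStream
         .outputMode("update")
         .format("console")
         .start())
query.awaitTermination()
```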

Data Engineer

Intermountain Healthcare
Hyderabad
06.2020 - 07.2022
  • Developed robust ETL pipelines using Apache Spark and Python to process patient records, admission logs, and lab results, enabling a centralized analytics platform for real-time clinical decision support.
  • Created ingestion workflows from Cerner and Epic EHR systems into Azure Data Lake Storage (ADLS) using Azure Data Factory, standardizing file formats and schema structures for downstream processing.
  • Designed Delta Lake-based storage layers for curated patient data, enabling ACID-compliant transactions, schema evolution, and reproducible clinical research pipelines across multiple departments.
  • Built PySpark transformation jobs for cleansing and anonymizing PHI/PII data, applying hashing, tokenization, and masking based on compliance rules defined by HIPAA and internal security policies (see the sketch after this list).
  • Configured Apache Airflow to orchestrate nightly batch jobs that aggregated patient vitals, medication orders, and encounter notes into curated clinical dashboards used by healthcare executives and quality teams.
  • Wrote complex SQL scripts in Azure Synapse and SQL Server to transform and join datasets across lab, pharmacy, and billing domains, supporting metrics like readmission rates and treatment costs.
  • Integrated HL7 and FHIR APIs using Python to receive updates from third-party labs and wearable devices, harmonizing incoming data and storing structured results in centralized patient history tables.
  • Enabled data access through Power BI by creating optimized views, columnar tables, and indexed aggregations in Azure SQL, powering clinical KPIs and operational reporting with minimal refresh latency.
  • Implemented Great Expectations validations to ensure data quality across hundreds of fields, including checks for null values, reference integrity, code standardization, and duplicate medical identifiers.
  • Worked closely with data governance teams to define data lineage and metadata policies using Azure Purview, allowing traceability for research-grade datasets and supporting clinical audit requests.
  • Configured secure, role-based access to data pipelines using Azure Key Vault, RBAC, and Managed Identity authentication, ensuring only authorized teams could interact with sensitive datasets.
  • Monitored pipeline execution using Log Analytics and Azure Monitor, creating dashboards for SLA compliance, execution time trends, and failure root cause patterns.
  • Used Azure DevOps to manage code repositories, execute release pipelines, and track bugs or enhancements through work items linked to ETL modules and transformations.
  • Migrated legacy SSIS packages to ADF/Spark pipelines, reducing dependency on on-premise infrastructure and improving runtime by 60% for high-volume ingestion workflows.
  • Built data mart models for hospital operations teams to analyze occupancy, discharge delays, surgery turnarounds, and staff efficiency using star schema design with historical versioning.
  • Supported research teams with reproducible datasets for clinical trials, building version-controlled data snapshots, audit logs, and secure delivery mechanisms using Azure Blob Storage with encryption.
  • Collaborated with Business Analysts to capture logic for key indicators like patient wait times, lab result delays, and medication adherence, translating them into SQL-based transformations and DAX logic.
  • Used Python-based UDFs in Spark to parse and transform unstructured clinical notes into structured formats, feeding into downstream NLP engines for classification and insight extraction.
  • Provided support for critical downtime scenarios, analyzing Spark job logs, cluster metrics, and data anomalies, contributing to fast root cause identification and data recovery procedures.
  • Created internal Confluence pages documenting pipeline designs, error handling logic, security rules, and SLAs for various datasets, improving knowledge transfer across shifts and teams.
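
The PHI/PII anonymization bullet above can be illustrated with a minimal PySpark sketch. The DataFrame, column names, and salt value below are hypothetical, assuming salted SHA-256 hashing for identifiers that must stay joinable and static masking for direct identifiers; a real pipeline would pull the salt from a secret store such as Azure Key Vault.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("phi-masking-sketch").getOrCreate()

# Hypothetical patient records; real data would come from the raw zone in ADLS.
df = spark.createDataFrame(
    [("Jane Doe", "123-45-6789", "MRN001")],
    ["patient_name", "ssn", "mrn"],
)

SALT = "replace-with-secret-from-key-vault"  # never hard-code a salt in a real pipeline

masked = (df
          # One-way hash identifiers so records remain joinable without exposing PHI.
          .withColumn("ssn_hash", F.sha2(F.concat(F.col("ssn"), F.lit(SALT)), 256))
          .withColumn("mrn_hash", F.sha2(F.concat(F.col("mrn"), F.lit(SALT)), 256))
          # Mask direct identifiers that downstream analytics never needs.
          .withColumn("patient_name", F.lit("***MASKED***"))
          .drop("ssn", "mrn"))

masked.show(truncate=False)
```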

Education

Master of Science - Information Technology

University of Massachusetts, Boston
Boston, MA

Bachelor of Technology - Computer Science and Engineering

KL University
India

Skills

Programming Languages: Python, SQL, Scala (basic), Shell Scripting

Big Data Technologies: Apache Spark, PySpark, Hadoop, Delta Lake, Spark Structured Streaming

Data Integration: Apache Airflow, AWS Glue, Azure Data Factory, Informatica (basic)

Cloud Platforms: AWS (S3, Redshift, Glue, EMR, KMS, Athena), Azure (ADLS, Synapse, Key Vault, ADF)

Data Warehouses: Snowflake, Redshift, Azure Synapse Analytics, SQL Server, PostgreSQL, Oracle

Data Modeling: Star Schema, Snowflake Schema, Dimensional Modeling

Streaming & Messaging: Apache Kafka, Kafka Streams, Flume

Data Catalog/Lineage: AWS Glue Catalog, Apache Atlas, Azure Purview

Workflow Orchestration: Apache Airflow, Oozie

DevOps & CI/CD: Git, GitHub Actions, Bitbucket, Jenkins, Azure DevOps

Monitoring & Logging: CloudWatch, Datadog, Prometheus, Azure Monitor, Log Analytics

Testing & Quality: Great Expectations, PyTest, UnitTest, Data Validation Scripts

Visualization Tools: Power BI, Tableau (for data outputs only)

Security & Compliance: IAM, KMS, HIPAA, SOX, Role-Based Access Control (RBAC)

File Formats: Parquet, ORC, CSV, JSON, Avro, XML

Documentation & Tools: Confluence, JIRA, Postman, Swagger, MS Excel

Certification

  • Microsoft Certified: Azure Data Engineer Associate
  • AWS Certified Data Engineer – Associate (DEA-C01)

Projects

Predicting House Prices Using Machine Learning (Python, ML, Scikit-Learn), Jan 2020 - Jun 2020 (a minimal modeling sketch follows this section)
  • Achieved an R² score of 0.92, meaning the model explained 92% of the variance in house prices. This level of accuracy allowed real estate agents to make more data-driven pricing decisions, leading to more competitive pricing strategies and faster property sales.
  • Reduced prediction error by 25% compared to traditional manual methods, enabling real estate professionals to avoid overpricing or underpricing properties. This resulted in a 15% improvement in the average time to sell properties, optimizing the sales cycle.

Real-time Analytics and Optimization of E-commerce Platform using Big Data (Spark, Hive, Power BI), Jan 2023 - May 2023
  • Reduced data processing time by 40% through the implementation of Apache Spark for distributed computing, allowing real-time user behavior analysis and quick decision-making across marketing teams.
  • Increased e-commerce sales by 15% within 6 months by deploying personalized product recommendations based on real-time analysis of user activity, boosting conversion rates and customer retention.

Economic Forecasting and Policy Impact Analysis for Consumer Spending Behavior, Aug 2022 - Dec 2022
  • Improved model prediction accuracy by 20% through model optimization and the incorporation of external factors such as government stimulus programs and interest rate changes, resulting in a more reliable consumer spending forecast.
  • Quantified the potential impact of monetary policy changes on consumer spending, providing actionable insights for policymakers; the analysis showed that a 1% reduction in interest rates could lead to a 3% increase in consumer spending, influencing future fiscal policy decisions.

Optimizing Sales Forecasting & Market Analysis for Conagra (Python, PySpark, Tableau), Jan 2024 - May 2024
  • Developed and deployed a linear regression model in Python to predict sales trends, achieving a 35% improvement in forecast accuracy over previous methods.
  • Leveraged Tableau for real-time data visualization, enabling a 25% improvement in strategic decision-making through better consumer trend analysis.
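
A minimal sketch of the modeling approach behind the house-price project, using synthetic data in place of the original dataset; the feature names, generated values, and resulting score are illustrative only, not the project's actual results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Generate a small synthetic housing dataset (square footage and bedroom count).
rng = np.random.default_rng(42)
n = 500
sqft = rng.uniform(500, 4000, n)
bedrooms = rng.integers(1, 6, n)
price = 50_000 + 150 * sqft + 10_000 * bedrooms + rng.normal(0, 20_000, n)

X = np.column_stack([sqft, bedrooms])
X_train, X_test, y_train, y_test = train_test_split(X, price, test_size=0.2, random_state=42)

# Fit a linear regression and score it on held-out data, as in the project write-up.
model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out data:", r2_score(y_test, model.predict(X_test)))
```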
