KodeKloud Notes

Welcome back, AWS Solutions Architects. In this article, presented by Michael Forrester, we look at data transformation with Glue DataBrew, a service that differs markedly from Glue ETL.

Glue DataBrew is a visual data preparation tool that lets you clean and normalize data without writing code. Unlike Glue ETL, where you write Python (PySpark) or Scala code to transform data, DataBrew is built around a graphical user interface.

How Glue DataBrew Works

The workflow of Glue DataBrew is straightforward:

Create a Project: Establish a workspace to interact, analyze, explore, and perform data preparation tasks.
Select Datasets and/or Data Sources: Import data from sources such as S3, Redshift, or other services, much as you would in Glue ETL.
Choose Recipes: Recipes are sets of visual data transformation steps, including operations like filtering rows and converting data types (e.g., string to number). All operations are applied from an intuitive menu without the need for coding.
Run the Recipe: When you run the recipe as a job, DataBrew applies all specified transformations to the complete dataset, whereas the interactive project view works on a sample. The processed data is then written to Amazon S3 for consumption by other services.
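The same four steps can also be driven through the DataBrew API. The sketch below builds the request payloads for the boto3 `databrew` client and issues them in order. It is a minimal illustration, not the article's method: the bucket, keys, role ARN, resource names, and column names are all hypothetical, and the recipe operation names and parameters are assumptions that should be checked against the DataBrew recipe action reference. Note that the API needs the dataset and recipe to exist before the project that refers to them, so the call order differs slightly from the console wording above.

```python
# Hypothetical placeholders: none of these names come from the article.
ROLE_ARN = "arn:aws:iam::123456789012:role/ExampleDataBrewRole"


def dataset_request(name, bucket, key):
    """Select a data source: register an S3 object as a DataBrew dataset."""
    return {
        "Name": name,
        "Format": "CSV",
        "Input": {"S3InputDefinition": {"Bucket": bucket, "Key": key}},
    }


def recipe_request(name):
    """Choose a recipe: an ordered list of transformation steps.

    Operation names and parameters are illustrative assumptions.
    """
    return {
        "Name": name,
        "Steps": [
            # Drop rows where the (hypothetical) 'amount' column is missing.
            {"Action": {"Operation": "REMOVE_MISSING",
                        "Parameters": {"sourceColumn": "amount"}}},
            # Upper-case the (hypothetical) 'country' column.
            {"Action": {"Operation": "UPPER_CASE",
                        "Parameters": {"sourceColumn": "country"}}},
        ],
    }


def project_request(name, dataset_name, recipe_name, role_arn=ROLE_ARN):
    """Create a project that ties a dataset to a recipe."""
    return {"Name": name, "DatasetName": dataset_name,
            "RecipeName": recipe_name, "RoleArn": role_arn}


def recipe_job_request(name, project_name, bucket, prefix, role_arn=ROLE_ARN):
    """Run the recipe: a job that writes the full output to S3."""
    return {
        "Name": name,
        "ProjectName": project_name,
        "RoleArn": role_arn,
        "Outputs": [{"Location": {"Bucket": bucket, "Key": prefix},
                     "Format": "CSV"}],
    }


def run_workflow(client):
    """Issue the calls in dependency order and return the job run ID."""
    client.create_dataset(
        **dataset_request("orders-raw", "example-raw-data", "orders/2024.csv"))
    client.create_recipe(**recipe_request("orders-cleanup"))
    client.create_project(
        **project_request("orders-project", "orders-raw", "orders-cleanup"))
    client.create_recipe_job(
        **recipe_job_request("orders-job", "orders-project",
                             "example-clean-data", "orders/"))
    return client.start_job_run(Name="orders-job")["RunId"]
```

With AWS credentials configured, `run_workflow(boto3.client("databrew"))` would send the requests. Keeping the payload builders separate from the client calls makes them easy to inspect or test without touching an AWS account.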

Serverless Advantage

One significant advantage of Glue DataBrew is its serverless nature. You do not need to provision, secure, or scale servers yourself, and you can monitor job runs through services such as CloudWatch.
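Because there is no cluster to manage, working with a job from code comes down to starting it and watching its state. The sketch below polls a job run until it reaches a terminal state. It assumes `client` is a boto3 `databrew` client (or any object with a compatible `describe_job_run` method). The function name, polling interval, and poll limit are illustrative choices, not recommendations from the article.

```python
import time

# States after which a DataBrew job run will not change again.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"}


def wait_for_job_run(client, job_name, run_id, poll_seconds=30.0, max_polls=120):
    """Poll a job run until it finishes and return its final state.

    Raises TimeoutError if the run is still active after max_polls checks.
    """
    for _ in range(max_polls):
        response = client.describe_job_run(Name=job_name, RunId=run_id)
        state = response["State"]
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_name!r} run {run_id!r} did not finish in time")
```

A caller would typically pass the `RunId` returned by `start_job_run` and then branch on whether the result is `"SUCCEEDED"`.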

Data sources for Glue DataBrew include the Glue Data Catalog, various database services, and S3, all of which can be integrated directly into your workflows.

The image is a diagram showing AWS Glue DataBrew connected to Amazon S3, Amazon Redshift, Amazon RDS, and AWS Glue.
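To make the choice of data source concrete, here is a small sketch of how two of these sources might be expressed as the `Input` of a DataBrew dataset request. The helper names are hypothetical, and the database, table, bucket, and key values are placeholders. The dictionary keys follow the shape of the boto3 `create_dataset` call as I understand it, so treat them as assumptions to verify against the API reference.

```python
def s3_input(bucket, key):
    """Input definition for a file (or prefix) stored in Amazon S3."""
    return {"S3InputDefinition": {"Bucket": bucket, "Key": key}}


def catalog_input(database, table):
    """Input definition for a table registered in the Glue Data Catalog."""
    return {
        "DataCatalogInputDefinition": {
            "DatabaseName": database,
            "TableName": table,
        }
    }


def dataset_request(name, source):
    """Combine a dataset name with one of the input definitions above."""
    return {"Name": name, "Input": source}


# Two datasets that differ only in where the data lives (placeholder names).
from_s3 = dataset_request("orders-s3", s3_input("example-raw-data", "orders/"))
from_catalog = dataset_request("orders-catalog", catalog_input("sales_db", "orders"))
```

Either dictionary could then be passed to the client as `client.create_dataset(**from_s3)`. The rest of the workflow does not depend on which source was chosen.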

Example Workflow

Consider a workflow where data is sourced from S3 and ingested into Glue DataBrew, which applies pre-built transformations. The output is written back to S3, where Athena can query it, typically through a table registered in the Glue Data Catalog. The query results are then available to QuickSight for analysis by data and business analysts.

The image is a flowchart illustrating the data processing workflow using AWS services, including AWS Glue DataBrew, AWS Glue, Athena, and QuickSight, with roles for Data Analyst and Business Analyst.
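Athena queries data in place in S3, so the Athena step of this workflow amounts to running SQL against a table defined over the DataBrew output. The sketch below builds the arguments for the boto3 Athena client's `start_query_execution` call. The database, table, and results location are hypothetical, and it assumes a table over the output already exists in the Glue Data Catalog.

```python
def athena_query_request(database, table, results_location, limit=10):
    """Build keyword arguments for athena.start_query_execution.

    database, table: hypothetical Glue Data Catalog names for the cleaned data.
    results_location: an s3:// URI where Athena writes query results.
    """
    if not results_location.startswith("s3://"):
        raise ValueError("results_location must be an s3:// URI")
    return {
        "QueryString": f'SELECT * FROM "{table}" LIMIT {int(limit)}',
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": results_location},
    }


# Placeholder names for a preview query over the cleaned output.
request = athena_query_request(
    "sales_db", "orders_clean", "s3://example-athena-results/preview/")
```

With credentials configured, `boto3.client("athena").start_query_execution(**request)` would submit the query. QuickSight can then use the same Athena table as a data source.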

Under the hood, Glue DataBrew uses AWS Glue to perform data transformations, and its output can feed machine learning workflows. For example, you can prepare data in DataBrew and export the processed output for use by a service such as SageMaker.

The image is a diagram showing the integration of AWS services, including Amazon S3, Amazon Redshift, AWS Glue, AWS Glue DataBrew, and Amazon SageMaker. It illustrates a data processing workflow from storage to machine learning.

Key Features of Glue DataBrew

Feature	Description
Visual Data Preparation	Clean and transform data through a graphical interface, with no coding required.
Data Profiling	Automatically generate metadata statistics to identify outliers, anomalies, missing values, and inconsistencies.
Scalability	Automatically scales with your data preparation workload without manual intervention.
Integration with AWS Data Stores	Seamlessly integrates with services such as Aurora, Redshift, and RDS.
Job Scheduling and Reusability	Schedule jobs to run on demand or at set times, and reuse recipes across datasets and projects.

The image lists five features: Visual Data Preparation, Data Profiling, Scalability and Performance, Integration with AWS Data Stores, and Job Scheduling and Reusability. Each feature is represented with an icon and a gradient background.
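To give a feel for what data profiling reports, the following sketch computes a few of the statistics mentioned in the table (missing values and outliers) for a single numeric column in plain Python. It is a conceptual illustration only. The function name is hypothetical, and the 1.5 × IQR outlier rule is a common convention chosen here for the example, not a description of how DataBrew computes its profile.

```python
from statistics import mean, quantiles


def profile_column(values):
    """Summarize one numeric column; None marks a missing value."""
    present = sorted(v for v in values if v is not None)
    missing = len(values) - len(present)
    if len(present) < 2:
        # Too few values to estimate quartiles.
        return {"count": len(values), "missing": missing,
                "mean": None, "outliers": []}

    # Quartiles and the conventional 1.5 * IQR fences.
    q1, _, q3 = quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    return {
        "count": len(values),
        "missing": missing,
        "mean": mean(present),
        "outliers": [v for v in present if v < low or v > high],
    }


# A small sample column with one missing value and one extreme value.
report = profile_column([10, 12, 11, 13, 12, 11, 95, None])
```

Here `report["missing"]` is 1 and `report["outliers"]` is `[95]`. That is the kind of signal a profile surfaces before you decide which recipe steps to apply.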

In Summary

Glue DataBrew offers a streamlined, visual, and code-free solution for data transformation that harnesses the power of AWS Glue behind the scenes. By simplifying the data preparation process, DataBrew makes it accessible for users who prefer not to write code and accelerates the journey from raw data to actionable insights.

Thank you for reading—see you in the next article.
