This guide shows how to export SageMaker Data Wrangler transformations to a Canvas dataset, to Amazon S3, or as a Jupyter notebook plus a .flow file for reproducible Processing jobs, with step-by-step examples and tips.
At this stage the data-flow transforms are already defined and ordered correctly: drop column → impute → scale values → ordinal encode → one-hot encode. The remaining step is to export the transformed dataset so it can be persisted and consumed later (for training, reporting, or downstream pipelines). This guide shows three export targets and how to reproduce the same Data Wrangler flow as code:
Export to a Canvas dataset (low-code, Canvas-managed)
Export to Amazon S3 (for downstream processing or storage)
Export a Jupyter Notebook + .flow file (reproducible Processing job)
Each section explains the steps and examples you can adapt for your environment.
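Before walking through the export targets, the transform order above can be sketched in plain Python. This is a toy, pure-Python analogue of the Data Wrangler steps, not the Data Wrangler implementation; the column names and values are made up for illustration:

```python
# Toy dataset: each row is a dict; "sqft" has a missing value
rows = [
    {"id": 1, "sqft": 1000, "condition": "good", "city": "A"},
    {"id": 2, "sqft": None, "condition": "fair", "city": "B"},
    {"id": 3, "sqft": 2000, "condition": "excellent", "city": "A"},
]

def drop_column(rows, col):
    return [{k: v for k, v in r.items() if k != col} for r in rows]

def impute_mean(rows, col):
    vals = [r[col] for r in rows if r[col] is not None]
    mean = sum(vals) / len(vals)
    return [{**r, col: r[col] if r[col] is not None else mean} for r in rows]

def min_max_scale(rows, col):
    vals = [r[col] for r in rows]
    lo, hi = min(vals), max(vals)
    return [{**r, col: (r[col] - lo) / (hi - lo)} for r in rows]

def ordinal_encode(rows, col, order):
    mapping = {v: i for i, v in enumerate(order)}
    return [{**r, col: mapping[r[col]]} for r in rows]

def one_hot(rows, col):
    cats = sorted({r[col] for r in rows})
    out = []
    for r in rows:
        nr = {k: v for k, v in r.items() if k != col}
        for c in cats:
            nr[f"{col}_{c}"] = 1 if r[col] == c else 0
        out.append(nr)
    return out

# Apply the transforms in the same order as the flow
rows = drop_column(rows, "id")
rows = impute_mean(rows, "sqft")
rows = min_max_scale(rows, "sqft")
rows = ordinal_encode(rows, "condition", ["fair", "good", "excellent"])
rows = one_hot(rows, "city")
```

The point is only to make the ordering concrete: imputation must happen before scaling, and encodings replace their source columns.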
From the end of the Data Wrangler flow, click the plus sign → Export → Canvas dataset.
Give the dataset a clear, descriptive name (for example, “KodeKloud house price data”).
Optionally choose whether to process a sample (fast iteration) or the entire dataset. For demos, a subset is fine.
Run the export. Data Wrangler will choose where to execute the transform:
If it fits within the Canvas managed instance limits, it runs locally there.
Otherwise, Data Wrangler runs the transform on a managed Spark backend (for example, EMR).
Once created, you can view the dataset metadata on the SageMaker Datasets page (name, size, creation time, status).
Tip: Rename the flow to something readable (for example, “KodeKloud house price flow”). Timestamps and auto-generated IDs in flow names are often awkward when referenced in code.
Open your Data Wrangler flow and add a destination node: Export → Amazon S3.
Choose a descriptive dataset name (example: “KodeKloud dataset house price”) and select an S3 bucket and path.
For production/full runs choose to process the entire dataset. Click Export and wait for the job to finish.
After the job completes, verify the output in the S3 bucket. The export job creates an output prefix (directory), and if the job ran on Spark, that folder contains one or more partitioned CSV objects with file names like part-00000-....
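The check can also be scripted. A hedged sketch using boto3: the bucket and prefix below are examples, and the `is_spark_part_file` helper is illustrative, not part of the export job:

```python
def is_spark_part_file(key: str) -> bool:
    """Return True for Spark-style output objects like .../part-00000-<uuid>.csv."""
    name = key.rsplit("/", 1)[-1]
    return name.startswith("part-")

try:
    import boto3

    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(
        Bucket="kodekloud-sagemaker-demystified",  # example bucket
        Prefix="export-flow-example/output/",      # example export prefix
    )
    for obj in resp.get("Contents", []):
        if is_spark_part_file(obj["Key"]):
            print(obj["Key"], obj["Size"])
except Exception:
    # Listing requires boto3 plus AWS credentials with s3:ListBucket on the bucket
    pass
```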
Exporting the flow as a Jupyter Notebook gives a reproducible artifact that sets up a SageMaker Processing job to apply the same transformations. This is ideal for handing work to a developer or integrating into CI/CD. From the Data Wrangler flow:
Add Export → Jupyter Notebooks → Amazon S3 and choose an S3 destination.
Data Wrangler will store a .ipynb and a .flow file in the chosen location. The notebook contains boilerplate code to create and run a SageMaker Processing job that uses the .flow file as the transformation spec.
When the export completes, you will see a confirmation in the Canvas UI.
Best practice: Sign out of SageMaker Canvas when finished to stop the managed instance and avoid charges. Canvas will warn you if background jobs are still running.
SageMaker Studio/JupyterLab does not show S3 objects directly in the file browser. Use the AWS CLI in a terminal to copy the exported notebook and .flow file into the Studio filesystem. First, verify you can list buckets:
```shell
sagemaker-user@default:~$ aws s3 ls
```
Then copy the notebook and flow file (quote URIs that contain spaces):
```shell
sagemaker-user@default:~$ aws s3 cp "s3://kodekloud-sagemaker-demystified/output_1746186367/kk-house-price-flow.ipynb" sagemaker-demystified/
download: s3://kodekloud-sagemaker-demystified/output_1746186367/kk-house-price-flow.ipynb to sagemaker-demystified/kk-house-price-flow.ipynb
sagemaker-user@default:~$ aws s3 cp "s3://kodekloud-sagemaker-demystified/output_1746186367/kk-house-price-flow.flow" sagemaker-demystified/
download: s3://kodekloud-sagemaker-demystified/output_1746186367/kk-house-price-flow.flow to sagemaker-demystified/kk-house-price-flow.flow
```
You should now see both files in the JupyterLab file browser.
The exported notebook sets up ProcessingInput(s) for the flow and inputs, and ProcessingOutput(s) for S3. Below are representative snippets you will find (adapt as needed). Typical imports used in the exported notebook:
```python
# Typical imports used in the exported notebook
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.dataset_definition.inputs import (
    AthenaDatasetDefinition,
    DatasetDefinition,
    RedshiftDatasetDefinition,
)
import time
import uuid
import boto3
import sagemaker
import os
import json
from pprint import pprint
```
Define the S3 bucket and create a unique export prefix:
```python
# Define the S3 bucket used to store export outputs
bucket = "kodekloud-sagemaker-demystified"

# Create a unique export name and S3 prefix
flow_export_id = f"{time.strftime('%d-%H-%M-%S', time.gmtime())}{str(uuid.uuid4())[:8]}"
flow_export_name = f"flow-{flow_export_id}"

# output_name is auto-generated from the select node's ID + output name from the flow file.
output_name = "3e5d1454-c172-4b12-b0c9-99b2e4d040d1.default"
s3_output_prefix = f"export-{flow_export_name}/output"
s3_output_base_path = f"s3://{bucket}/{s3_output_prefix}"
print(f"Processing output base path: {s3_output_base_path}\nThe final output location will contain additional subdirectories.")
```
Configure the ProcessingOutput that writes results to S3:
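The original snippet for this step is not reproduced above, so here is a representative sketch. The `output_name` value and the S3 path are example placeholders standing in for the variables defined in the earlier cell:

```python
# Assumed values standing in for the earlier cell's variables
output_name = "3e5d1454-c172-4b12-b0c9-99b2e4d040d1.default"
s3_output_base_path = "s3://kodekloud-sagemaker-demystified/export-flow-example/output"

try:
    from sagemaker.processing import ProcessingOutput

    # Map the container's local output directory to the S3 destination
    processing_job_output = ProcessingOutput(
        output_name=output_name,
        source="/opt/ml/processing/output",  # where the container writes results
        destination=s3_output_base_path,     # where SageMaker uploads them
        s3_upload_mode="EndOfJob",
    )
except ImportError:
    # Requires the SageMaker Python SDK (pip install sagemaker)
    pass
```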
Provide the exported .flow file as a ProcessingInput:
```python
# The flow file's S3 location (from the Data Wrangler export)
flow_s3_uri = "s3://kodekloud-sagemaker-demystified/output_1746186367/kk-house-price-flow.flow"
print(f"Data flow is located at {flow_s3_uri}")

# Provide the flow file as a ProcessingInput to the SageMaker Processing job
flow_input = ProcessingInput(
    source=flow_s3_uri,
    destination="/opt/ml/processing/flow",
    input_name="flow",
    s3_data_type="S3Prefix",
    s3_input_mode="File",
    s3_data_distribution_type="FullyReplicated",
)
```
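The notebook's remaining cells construct a Processor around the Data Wrangler container and run the job with these inputs and outputs. A minimal sketch of that section, with the role ARN, container URI, and job name as placeholders (not values from the export):

```python
import time

# Placeholder values -- the exported notebook fills these in for your account/region
iam_role = "arn:aws:iam::111122223333:role/service-role/AmazonSageMaker-ExecutionRole"
container_uri = "<data-wrangler-container-image-uri>"
instance_type = "ml.m5.4xlarge"  # typically suggested in the .flow metadata
processing_job_name = f"data-wrangler-flow-processing-{int(time.time())}"

try:
    from sagemaker.processing import Processor

    processor = Processor(
        role=iam_role,
        image_uri=container_uri,
        instance_count=1,
        instance_type=instance_type,
    )
    # processor.run(...) launches the job against the flow_input and
    # ProcessingOutput defined earlier; it requires valid AWS credentials.
except Exception:
    # SDK not installed, or no AWS configuration available in this environment
    pass
```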
The .flow file is JSON and contains the pipeline metadata, nodes, operators, and parameters. Loading and pretty-printing it in the notebook helps you review the exact transformations and execution settings:
```python
import json
from pprint import pprint

with open("kk-house-price-flow.flow") as f:
    data = json.load(f)

pprint(data)
```
Key contents you will typically see:
metadata (for example, an instance_type suggestion like “ml.m5.4xlarge”)
nodes list with SOURCE and TRANSFORM nodes (infer_and_cast_type, drop column, impute, scale, ordinal encode, one-hot encode)
operator implementations (often under sagemaker.spark.*)
transform parameters and any trained parameters (e.g., learned imputations or encodings)
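As a quick illustration, the node list can be summarized in a few lines of Python. The structure below is a simplified mock of a .flow file, not the real schema, which carries many more fields:

```python
# Simplified mock of a .flow file's structure (illustrative only)
flow = {
    "metadata": {"instance_type": "ml.m5.4xlarge"},
    "nodes": [
        {"node_id": "a1", "type": "SOURCE", "operator": "sagemaker.s3_source_0.1"},
        {"node_id": "b2", "type": "TRANSFORM", "operator": "sagemaker.spark.infer_and_cast_type_0.1"},
        {"node_id": "c3", "type": "TRANSFORM", "operator": "sagemaker.spark.manage_columns_0.1"},
    ],
}

# Summarize the pipeline: one line per node
for node in flow["nodes"]:
    print(f'{node["type"]:10s} {node["operator"]}')

# Collect just the transform operators, in order
transforms = [n["operator"] for n in flow["nodes"] if n["type"] == "TRANSFORM"]
```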
When copying S3 objects whose keys contain spaces, quote the S3 URI (or URL-encode the path) so the CLI treats it as a single argument. Also make sure the IAM role or credentials used by your notebook or Studio environment have access to the bucket.
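For scripted copies, the standard library's shlex.quote handles keys with spaces safely. A small sketch; the bucket and key below are examples:

```python
import shlex

# Example S3 key containing spaces
s3_uri = "s3://my-bucket/KodeKloud dataset house price/part-00000.csv"

# Quote the URI so a shell command treats it as a single argument
cmd = f"aws s3 cp {shlex.quote(s3_uri)} ./local-dir/"
print(cmd)
```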