AZ-400: Designing and Implementing Microsoft DevOps Solutions

Configuring and Managing Repositories

Purge Data from Source Control

Purging data from source control is essential for maintaining a clean, efficient, and secure codebase. In this guide, we’ll define purging in the context of Git repositories, explain why it matters, compare the top tools, and walk through hands-on examples.

What Is Purging?

Purging a repository means removing unwanted or sensitive files from its commit history. This process helps you:

  • Reclaim disk space
  • Eliminate accidental commits
  • Protect secrets from exposure

Why Purge Files?

By cleaning up your Git history, you can:

  • Optimize Performance: Smaller repos clone and check out faster.
  • Eliminate Mistakes: Remove large files or other content that was committed by accident.
  • Protect Secrets: Expunge API keys, passwords, and other sensitive data.
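
Before purging for size, it helps to know which objects are actually inflating the repository. The following is a minimal sketch, assuming it runs inside a Git working copy. The function name `largest_blobs` is hypothetical; it relies only on stock Git plumbing commands.

```shell
#!/usr/bin/env bash
# largest_blobs [N]: print the N largest blobs reachable from any ref,
# one per line as "<size-in-bytes> <path>", largest first.
# (Hypothetical helper name; not part of Git itself.)
largest_blobs() {
  local count="${1:-10}"
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk '$1 == "blob" { print substr($0, 6) }' |  # drop the leading "blob "
    sort -k1,1nr |
    head -n "$count"
}
```

Running `largest_blobs 5` at the repository root lists the five biggest files ever committed, including ones that were later deleted from the working tree.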

Note

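Always back up your repository before rewriting history. Once the rewritten history has replaced the original and the old objects are garbage-collected, the purged data cannot be recovered from the repository.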
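
One way to take that backup is a mirror clone, which copies every branch, tag, and other ref into a bare repository. This is a minimal sketch; the function name `backup_repo` and the timestamped naming scheme are illustrative assumptions.

```shell
#!/usr/bin/env bash
# backup_repo <repo-path-or-url> <backup-dir>
# Creates <backup-dir>/<name>-<timestamp>.git as a bare mirror clone
# and prints the path of the new backup. (Hypothetical helper.)
backup_repo() {
  local src="$1" dest_dir="$2"
  local name stamp target
  name="$(basename "${src%.git}")"
  stamp="$(date +%Y%m%d-%H%M%S)"
  target="${dest_dir}/${name}-${stamp}.git"
  mkdir -p "$dest_dir" &&
    git clone --quiet --mirror "$src" "$target" &&
    printf '%s\n' "$target"
}
```

If the rewrite goes wrong, the original history can be restored by cloning from the backup directory.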

Repository Cleanup Tools

Here’s a quick comparison of the two leading Git history-rewriting tools:

Tool | Use Case | Documentation
Git filter-repo | Recommended by the Git project as the replacement for git filter-branch; highly configurable, fine-grained | Git filter-repo
BFG Repo-Cleaner | Fast, simple syntax for common cleanup tasks | BFG Repo-Cleaner

Practical Examples

1. Deleting Large or Unwanted Files

Remove a file named archive.tar.gz:

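# Using BFG Repo-Cleaner (matches the file name in any directory;
# the latest commit is protected by default, so delete the file there first):
bfg --delete-files archive.tar.gz

# Or with Git filter-repo (--path matches the exact path from the repository root):
git filter-repo --path archive.tar.gz --invert-paths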
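
After either command finishes, it is worth confirming that no commit still references the file. The following is a minimal sketch using only stock Git; the helper name `path_in_history` is hypothetical, and the path is interpreted relative to the current directory.

```shell
#!/usr/bin/env bash
# path_in_history <path>: exits 0 if any commit on any ref touches <path>,
# non-zero otherwise. (Hypothetical helper name.)
path_in_history() {
  local path="$1"
  [ -n "$(git log --all --format=%H -- "$path" | head -n 1)" ]
}

# Example check after a purge:
# if path_in_history archive.tar.gz; then echo "still in history"; else echo "purged"; fi
```

Note that deleting a file in a new commit is not enough to make this check pass, because the earlier commits that added it remain in history.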

2. Removing Sensitive Content

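First, list the strings to scrub in passwords.txt, one per line. Each entry is matched as literal text and, by default, replaced with ***REMOVED***, so the two entries below stand in for the actual secret values: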

PASSWORD
API_KEY
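
The file can hold more than plain strings. The sketch below is based on the expression syntax documented for git filter-repo; BFG accepts a similar format, but check its documentation before relying on that. The helper name `write_expressions_file` is hypothetical, and every value it writes is a made-up placeholder.

```shell
#!/usr/bin/env bash
# write_expressions_file <path>: writes a sample expressions file.
# All values are invented placeholders, not real credentials.
write_expressions_file() {
  cat > "$1" <<'EOF'
hunter2
literal:s3cr3t-token==>REDACTED_TOKEN
regex:AKIA[0-9A-Z]{16}==>REDACTED_AWS_KEY
EOF
}
```

The first line is a literal string that gets the default replacement, the second supplies a custom replacement after `==>`, and the third matches by regular expression.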

Then run:

# Using BFG Repo-Cleaner:
bfg --replace-text passwords.txt

# Or with Git filter-repo:
git filter-repo --replace-text passwords.txt
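
To check that a scrubbed value is really gone, Git's pickaxe search (`git log -S`) can look for any commit that added or removed the string. This is a minimal sketch; the helper name `secret_in_history` is hypothetical.

```shell
#!/usr/bin/env bash
# secret_in_history <string>: exits 0 if any commit on any ref changed the
# number of occurrences of <string>, i.e. added or removed it.
# (Hypothetical helper name.)
secret_in_history() {
  local needle="$1"
  [ -n "$(git log --all --format=%H -S"$needle" | head -n 1)" ]
}

# Example check after a purge:
# if secret_in_history 'hunter2'; then echo "still in history"; else echo "clean"; fi
```

Run the check once per secret value. If the secret was ever pushed, it is still safest to rotate it, since clones and forks made before the purge keep the old history.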

Warning

Force-pushing rewritten history will overwrite the remote. Coordinate with your team to avoid conflicts.

Final Steps

After rewriting history, complete these actions:

  1. Force-push the cleaned history:
    git push --force
    
  2. Notify your team to re-clone, or to reset their local copies (this discards local changes; replace main with the affected branch):
    git fetch --all
    git reset --hard origin/main
    

Note

Ensure everyone is on the same page to prevent divergent histories.
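
The publishing step above can be sketched as a small helper. It assumes a remote named `origin` by default and that you are permitted to force-push; the function name `finalize_purge` is hypothetical. Note that git filter-repo removes the `origin` remote as a safety measure, so it may need to be re-added with `git remote add` first.

```shell
#!/usr/bin/env bash
# finalize_purge [remote]: expire reflogs, prune objects that nothing
# references anymore, then overwrite every branch and tag on the remote.
# (Hypothetical helper; the remote defaults to "origin".)
finalize_purge() {
  local remote="${1:-origin}"
  git reflog expire --expire=now --all &&
    git gc --quiet --prune=now &&
    git push --quiet --force --all "$remote" &&
    git push --quiet --force --tags "$remote"
}
```

Pushing all branches and tags matters because a history rewrite usually touches more than the branch that is currently checked out.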
