Finding Duplicate Documents in SharePoint using PowerShell

This script looks through all the documents stored in a SharePoint site collection and finds duplicate files based on document contents rather than document names. It was written for SharePoint 2010, but should find duplicate documents in SharePoint 2007 as well with very little modification.

Over time, it is quite possible that the same document will be uploaded to numerous SharePoint libraries. Keeping track of duplicate content spread across multiple libraries can be practically impossible.

Building on the article at http://blog.codeassassin.com/2007/10/13/find-duplicate-files-with-powershell/, which details duplicate checking on file shares, the following PowerShell script scans all the document libraries within a site collection for duplicate content by calculating an MD5 hash of each file's contents. The script groups identical hashes and produces a list of all duplicated files, detailing the full URL to the item and the file name.

To run the script, copy the contents into Notepad and save it as a .ps1 file on one of your SharePoint servers. Then launch a PowerShell console and run the .ps1 file.
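For example, assuming the script was saved as Find-DuplicateDocuments.ps1 (the file name here is just an illustration), it could be run from a console on the SharePoint server like this:

    # Allow locally created scripts to run, if the current execution policy blocks them
    Set-ExecutionPolicy RemoteSigned

    # Run the script from the folder it was saved to
    .\Find-DuplicateDocuments.ps1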

[Screenshot: PowerShell console output showing the duplicate files found]

The function returns the full paths of all duplicated content.

This information could be piped back into SharePoint or exported to Excel for analysis (see the example below). You could even set this up as a recurring job. In a future article, I will package this functionality into a timer job feature.
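As a sketch of the export idea, and assuming the function returns its duplicate records as objects (as in the version of the script below), the results could be written straight to a CSV file that opens in Excel; the site URL and output path here are placeholders:

    # Export the duplicate file details to CSV for analysis in Excel (placeholder URL and path)
    Get-DuplicateFiles "http://yoursitecollection" | Export-Csv -Path "C:\Temp\DuplicateFiles.csv" -NoTypeInformation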

At present, the script stores all results in memory while it is running, so it may not scale very well over a large site collection; it may be worth streaming the results into a SQL table or similar. The script also only evaluates content on a site collection basis, but it could be scoped to a web application or a whole farm if required, as sketched below.
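As a rough sketch of widening the scope, and assuming the Get-DuplicateFiles function from the script below is available, a wrapper could simply loop over every site collection in a web application (the URL is a placeholder):

    # Check every site collection in a web application for duplicate documents (placeholder URL)
    $webApp = Get-SPWebApplication "http://yourwebapplication"
    foreach ($site in $webApp.Sites)
    {
        Get-DuplicateFiles $site.Url
        $site.Dispose()
    }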

Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

function Get-DuplicateFiles ($RootSiteUrl)
{
    $spSite = Get-SPSite -Identity $RootSiteUrl
    $Items = @()
    $duplicateFiles = @()

    foreach ($spWeb in $spSite.AllWebs)
    {
        Write-Host "Checking" $spWeb.Title "for duplicate documents"

        foreach ($list in $spWeb.Lists)
        {
            # Only scan document libraries, skipping hidden/system libraries and the Site Pages library
            if ($list.BaseType -eq "DocumentLibrary" -and $list.RootFolder.Url -notlike "_*" -and $list.RootFolder.Url -notlike "SitePages*")
            {
                foreach ($item in $list.Items)
                {
                    if ($item.File.Length -gt 0)
                    {
                        # Hash the raw file contents so duplicates are found regardless of file name
                        $fileBytes = $item.File.OpenBinary()
                        $hash = Get-MD5 $fileBytes

                        $record = New-Object -TypeName System.Object
                        $record | Add-Member NoteProperty ContentHash ($hash)
                        $record | Add-Member NoteProperty FileName ($item.File.Name)
                        $record | Add-Member NoteProperty FullPath ($spWeb.Url + "/" + $item.Url)
                        $Items += $record
                    }
                }
            }
        }

        $spWeb.Dispose()
    }

    $spSite.Dispose()

    # Group identical hashes; any group with more than one member is duplicated content
    $duplicateHashes = $Items | Group-Object ContentHash | Where-Object { $_.Count -gt 1 }

    foreach ($duplicateHash in $duplicateHashes)
    {
        $duplicateFiles += $Items | Where-Object { $_.ContentHash -eq $duplicateHash.Name }
    }

    return $duplicateFiles
}

function Get-MD5 ($fileBytes = $(throw 'Usage: Get-MD5 [byte[]]'))
{
    # Compute an MD5 hash of the file contents and return it as a hex string
    $hashAlgorithm = New-Object System.Security.Cryptography.MD5CryptoServiceProvider
    $hashByteArray = $hashAlgorithm.ComputeHash($fileBytes)
    return [System.BitConverter]::ToString($hashByteArray)
}

Get-DuplicateFiles "<your sharepoint site>" | Format-Table FullPath


8 thoughts on “Finding Duplicate Documents in SharePoint using PowerShell”

  1. Ian Woodgate

    Great article Tom. Do be aware though that the same Word document in two different locations may contain different properties, because SharePoint synchronises the properties with SharePoint metadata. So the two copies of the Word document may not show up as a match. If the script could do the comparison while ignoring the document properties that would be really cool! Maybe leverage the Open Office XML SDK, at least for 2007+ documents.

  2. Pingback: Good De-Dup tools for SharePoint | Q&A System

  3. SharePointSupporter

    Great post!
    Is there a way to delete all the duplicates out of SharePoint with a simple edit of this script?

  4. Brian Newman

    I am concerned that the Birthday paradox might increase the odds of false positives (files that aren’t duplicates appearing as duplicates). I’ve done something similar to what you’ve done here (but I’ve done it in C#), but I think after running this script, an additional step – bitwise comparison – is needed for all the results found using the MD5 hash approach. The MD5 hash approach is a good way to reduce the number of files that need to be compared using the bitwise approach, but I don’t think it should be the final step.

  5. Pingback: Finding Duplicate Documents in SharePoint 2010 | Spinsight Blog

  6. Pingback: Finding Duplicate Items and The Duplicates Keyword – Enabling collaboration solutions in the enterprise

  7. Pingback: Finding Duplicate Documents in SharePoint 2010 | SharePointDevWiki.com
