Maybe you had a lot of files scattered around on different drives, and you added them all into a single git-annex repository. Some of the files are surely duplicates of others.

While git-annex stores the file contents efficiently, it would still help in cleaning up this mess if you could find, and perhaps remove, the duplicate files.

Here's a command line that will show duplicate sets of files grouped together:

git annex find --include '*' --format='${file} ${escaped_key}\n' | \
    sort -k2 | uniq --all-repeated=separate -f1 | \
    sed 's/ [^ ]*$//'
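
For illustration, the output groups the files in each duplicate set together, separated by blank lines (the filenames below are hypothetical):

    photos/beach.jpg
    backup/beach-copy.jpg

    docs/report.pdf
    old/report.pdf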

Here's a command line that will remove one of each duplicate set of files:

git annex find --include '*' --format='${file} ${escaped_key}\n' | \
    sort -k2 | uniq --repeated -f1 | sed 's/ [^ ]*$//' | \
    xargs -d '\n' git rm
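
To preview first, the same pipeline without the final `xargs` step simply lists the files that would be removed, a useful dry run before committing to anything:

    git annex find --include '*' --format='${file} ${escaped_key}\n' | \
        sort -k2 | uniq --repeated -f1 | sed 's/ [^ ]*$//'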

--Joey

problems with spaces in filenames

Spaces and other special characters can make filename handling ugly. If you don't need to keep the exact filenames, it might be easiest to just get rid of the problematic characters.

#!/bin/bash

# Recursively rename files and directories, stripping characters that
# complicate shell pipelines: [ ] and ' are removed, spaces become _.

shopt -s nullglob  # skip the loop entirely in empty directories

process() {
    local dir="$1"
    echo "processing $dir"
    pushd "$dir" >/dev/null 2>&1

    for fileOrDir in *; do
        nfileOrDir=$(echo "$fileOrDir" | sed -e 's/\[//g' -e 's/\]//g' -e 's/ /_/g' -e "s/'//g")
        if [ "$fileOrDir" != "$nfileOrDir" ]; then
            echo "renaming $fileOrDir to $nfileOrDir"
            git mv "$fileOrDir" "$nfileOrDir"
        else
            echo "skipping $fileOrDir, no need to rename."
        fi
    done

    # Recurse into subdirectories (already listed under their new names).
    find ./ -mindepth 1 -maxdepth 1 -type d | while IFS= read -r d; do
        process "$d"
    done
    popd >/dev/null 2>&1
}

process .

Maybe you can run something like this before checking for duplicates.
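
For comparison, here is a minimal bottom-up sketch of the same idea, assuming GNU find and bash: `-depth` emits a directory's contents before the directory itself, so renaming a parent never invalidates paths that are still queued.

    # rename anything containing [ ] ' or a space, deepest entries first
    find . -depth -name "*[][ ']*" | while IFS= read -r f; do
        n=$(dirname "$f")/$(basename "$f" | tr -d "[]'" | tr ' ' '_')
        git mv "$f" "$n"
    done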

Comment by mhameed Wed Sep 5 04:38:56 2012