Analyze your Git repositories
Published on Friday, January 12, 2018 5:47:09 AM UTC in Tools
Over Christmas I wanted to analyze some of my Git repositories with regard to commit count, contributions by author and other details. I found some nice and simple tools for that, one in particular being Git Fame, written in Python. The tool is great, but it had a few problems. First of all, it apparently wasn't written for the average Windows user like myself: it had problems with author names and non-ASCII characters, which in turn needed workarounds on Windows that resulted in misshapen console output.
Second, I wasn't quite satisfied with the feature set and performance. My powerful machine was idling along with little CPU and disk load, yet the estimated remaining time for the larger repositories was in the upper 20s of hours(!). The output was only written to the console, so for further processing, e.g. in Excel, I had to copy it into a file and format it manually to make it usable. Additionally, statistics per file extension and per author could not be combined; you could have either one, but not both together. Even worse, depending on the machine I commit from, I apparently use slightly different author names (I really should fix that), resulting in multiple entries for the same author. Some of the metrics were hence useless. For example, there's a "number of files contributed to" metric that you cannot aggregate across multiple author entries without knowing the exact files, and the tool does not output that information.
I'm no Python expert, but a quick look at the sources revealed that the tool was a simple wrapper around the Git command line, which made me think...
My own implementation, GitFame#, has basically the same feature set as the tool above, with the following features added:
- It uses the maximum number of CPU cores available to maximize performance. Some extreme cases, like a huge repo that would take almost 30 hours of processing by Git Fame on my machine, were done in less than 6 hours. You can set the CPU core count from the command line if the automatic setting brings your machine to its knees and you want to use it for something else while analyzing.
- The result is written to a CSV file by default, so it can easily be processed in other tools afterwards.
- You get real "by file extension per author" statistics that allow deeper analysis.
- There's an option that lets you aggregate multiple aliases of the same author into a single bucket. This is helpful when you commit from different machines with different author names (a configuration mistake) or if, for example, non-ASCII characters in your name result in multiple author entries depending on the system you're working on.
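To make the first and third points more concrete, here is a minimal Python sketch of the general technique: run `git blame --line-porcelain` for many files in parallel and count blamed lines per author and file extension. This is not the code of GitFame# (which is a .NET application). The function names, the use of a thread pool and the `blame` parameter are assumptions made for illustration only.

```python
import os
import subprocess
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_authors(porcelain: str) -> Counter:
    """Count blamed lines per author in `git blame --line-porcelain` output."""
    counts = Counter()
    for line in porcelain.splitlines():
        # Every blamed line has one "author <name>" header. File content
        # lines start with a tab, so they can never match this prefix.
        if line.startswith("author "):
            counts[line[len("author "):]] += 1
    return counts


def blame_file(repo: str, path: str) -> str:
    """Run git blame for a single file and return the raw porcelain text."""
    result = subprocess.run(
        ["git", "-C", repo, "blame", "--line-porcelain", "HEAD", "--", path],
        capture_output=True, encoding="utf-8", errors="replace", check=True,
    )
    return result.stdout


def blame_stats(repo, paths, workers=None, blame=blame_file):
    """Blame files in parallel; return line counts keyed by (author, extension)."""
    paths = list(paths)
    # Assumption: default to one worker per CPU core, like the auto-setting.
    workers = workers or os.cpu_count() or 1
    totals = Counter()
    # Threads are sufficient here because the heavy work happens in the
    # external git processes, not in Python itself.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outputs = pool.map(lambda p: blame(repo, p), paths)
        for path, text in zip(paths, outputs):
            extension = os.path.splitext(path)[1].lower()
            for author, lines in count_authors(text).items():
                totals[(author, extension)] += lines
    return totals
```

Passing a list of tracked files (for example the output of `git ls-files`) to `blame_stats` would yield one counter entry per author and extension, which is the shape of data a combined "by file extension per author" statistic needs.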
You can find it on GitHub: https://github.com/MisterGoodcat/GitFameSharp
Command Line usage
The tool is a .NET Core command line application, meaning you need the .NET Core runtime installed to use it. At the moment, the tool does not print any help to the console, but here are the currently available options:
|--GitDir="[path]"||The path to the Git directory to analyze. Default: "."|
|--Branch="[branch]"||The branch to analyze. Default: "HEAD"|
|--Exclude="[RegEx]"||A regular expression (.NET flavor) to determine which files or folders to exclude. Default: [empty]|
|--Include="[RegEx]"||A regular expression (.NET flavor) to determine which files to include. Only inspects the files that have not been excluded by the --Exclude option. Default: [empty]|
|--ParallelBlameProcesses=[number]||The number of CPU cores to use in parallel. Default: [Number of cores on your machine]|
|--Output="[path]"||The target file the results should be written to in CSV format. Default: "result.csv"|
|--AuthorsToMerge="[list of aliases]"||Multiple author aliases to be merged into a single statistic. Syntax: Put each group of aliases into brackets and use the pipe symbol to separate aliases. Example: "[Author A alias 1|Author A alias 2][Author B alias 1|Author B alias 2]". The first alias entry is used as the author name of the aggregated result. You can use a non-existent author alias as the first entry to beautify the author name. Default: [empty]|
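
The alias syntax of `--AuthorsToMerge` described in the table can be illustrated with a short Python sketch. It only demonstrates how such a specification could be parsed and applied to per-author counts; it is not taken from the tool's sources. The function names and the trimming of whitespace around aliases are assumptions.

```python
import re
from collections import Counter


def parse_alias_groups(spec: str) -> dict:
    """Map every alias in "[A|A2][B|B2]" to the first alias of its group."""
    canonical = {}
    for group in re.findall(r"\[([^\]]*)\]", spec):
        # Assumption: surrounding whitespace is ignored, empty entries skipped.
        aliases = [alias.strip() for alias in group.split("|") if alias.strip()]
        for alias in aliases:
            canonical[alias] = aliases[0]
    return canonical


def merge_authors(counts: dict, spec: str) -> Counter:
    """Sum per-author values into one bucket per alias group."""
    canonical = parse_alias_groups(spec)
    merged = Counter()
    for author, value in counts.items():
        # Authors that appear in no group keep their own name.
        merged[canonical.get(author, author)] += value
    return merged
```

Summing like this works for additive metrics such as line counts. A metric like "number of files contributed to" would need the underlying sets of files per alias instead, because the same file may appear under several aliases.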
Please note that since this uses Git commands, it obeys any Git configuration. For example, you don't need to exclude files and folders that are already covered by your .gitignore file; they will be treated as expected automatically.
dotnet GitFameSharp.dll --GitDir="E:\Projects\Something\Something\Dark\Side" --Exclude="(^lib/.*|\.dll$)" --Include="\.(cs|ts)$" --AuthorsToMerge="[Beautiful Jackemail@example.com]"
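
The interplay of `--Exclude` and `--Include` in the command above can be sketched in a few lines of Python: exclusions are applied first, and the include pattern is only checked for files that survived them. This is a hedged illustration, not the tool's implementation. It uses Python's `re` module rather than the .NET regex engine, and it assumes repository-relative paths with forward slashes, unanchored matching, and that an empty pattern disables the respective filter.

```python
import re


def select_files(paths, exclude="", include=""):
    """Drop paths matching `exclude`, then keep only those matching `include`."""
    # Assumption: an empty pattern means "no filter", mirroring the
    # "[empty]" defaults listed in the options table.
    exclude_re = re.compile(exclude) if exclude else None
    include_re = re.compile(include) if include else None
    selected = []
    for path in paths:
        if exclude_re and exclude_re.search(path):
            continue  # excluded files are never considered for inclusion
        if include_re and not include_re.search(path):
            continue  # not excluded, but does not match the include pattern
        selected.append(path)
    return selected
```

With the patterns from the command line above, a file such as `lib/vendor.cs` is dropped by the exclude pattern even though its extension matches the include pattern, while `README.md` is dropped because it matches neither `.cs` nor `.ts`.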