How to git diff epub files

Track epub, docx, pptx, and sqlite files in Git

I'm not a big fan of binary formats because it's hard to track changes to them. Fortunately, there are ways to handle binary files like epub, docx, pptx, and even SQLite databases in Git and make them more manageable. In this article, we will explore techniques to git diff these binary formats and enable version control for them.

Git Configuration for Different Binary Formats

Tracking ZIP-Based File Formats

As mentioned in an article from tante.cc, some binary formats like docx and pptx are essentially zipped packages of XML files, which means some form of diffing is possible. To enable Git to handle these formats, follow these steps:

Windows users: the path ~/.config/git/attributes corresponds to %USERPROFILE%\.config\git\attributes, where %USERPROFILE% is the home directory of your Windows user account (located under C:\Users\).

  1. Open your ~/.gitconfig file (create it if it does not exist) and add the following stanza:

    [diff "zip"]
    textconv = unzip -c -a

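    This tells Git to run unzip whenever it diffs such a file: -c extracts every archive member to standard output, and -a converts text members to the local line-ending convention, so Git can compare the contents as plain text.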

  2. Create or edit YOUR_GIT_REPOSITORY/.gitattributes (applies to that repository only) or ~/.config/git/attributes (applies to all of your repositories) and add the following lines:

    *.pptx diff=zip
    *.docx diff=zip
    *.epub diff=zip
    *.odt diff=zip

    These lines specify which file extensions should be treated as zip-diffing formats by Git.

Now, when you use git diff, Git will automatically unzip these files and show you the ASCII differences, making it easier to track changes in these binary formats.
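
As a quick sanity check, the setup above can be tried in a scratch repository. This is a minimal sketch that assumes only that git is installed; the file name book.epub is a hypothetical example and does not need to exist, because git check-attr only evaluates the attribute patterns.

```shell
# Work in a throwaway repository so no global settings are touched.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Same driver as the [diff "zip"] stanza, written to this repository's .git/config.
git config diff.zip.textconv "unzip -c -a"

# Same attribute lines as above, stored in the repository-level .gitattributes.
printf '%s\n' '*.pptx diff=zip' '*.docx diff=zip' '*.epub diff=zip' '*.odt diff=zip' > .gitattributes

# Ask Git which diff driver it would pick for a given path.
git check-attr diff -- book.epub
```

If the attributes are picked up, the last command reports zip as the value of the diff attribute for book.epub.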

Handling Older Microsoft Office Files

The older binary Microsoft Office formats (.doc, .xls, and .ppt) are not ZIP archives, so they need dedicated converters. Here are the steps:

  1. Add the following lines to your $HOME/.config/git/attributes file:

    *.doc diff=doc
    *.xls diff=xls
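    *.xlsx diff=zip
    *.ppt diff=ppt

    This associates each extension with a named diff driver. Note that .xlsx is a ZIP-based format that xls2csv from the catdoc package does not read, so it is routed to the zip driver defined earlier rather than to xls.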

  2. Add the following to your global configuration file at $HOME/.gitconfig or $HOME/.config/git/config:

    [diff "doc"]
        textconv = catdoc
        binary = true
    [diff "xls"]
        textconv = xls2csv
        binary = true
    [diff "ppt"]
        textconv = catppt
        binary = true

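    Each driver routes its format through a dedicated text converter, and the driver name in the stanza must match the name used after diff= in the attributes file. catdoc, xls2csv, and catppt all ship with the catdoc package, which must be installed and on your PATH.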
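
Because a textconv driver only works when its converter can actually be executed, it can help to check for the tools first. The following is a small sketch; check_converters is a hypothetical helper name introduced here for illustration, not part of Git or catdoc.

```shell
# Report which of the given converter commands are available on PATH.
check_converters() {
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool: found"
    else
      echo "$tool: missing"
    fi
  done
}

# The three converters referenced by the stanzas above.
check_converters catdoc xls2csv catppt
```

Any tool reported as missing needs to be installed before the matching diff driver will produce readable output.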

Managing Older Open Office Files

If you are using OpenOffice, you can take the same approach as for the Microsoft Office files. Here's how:

  1. In your attributes file, add:

    *.odt diff=odt

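    This associates the .odt extension with the odt diff driver. If your attributes file also contains the *.odt diff=zip line from the ZIP section, keep only one of the two mappings.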

  2. In your config file, add:

    [diff "odt"]
        textconv = odt2txt
        binary = true

    This configuration uses odt2txt to convert .odt files for diffing purposes.
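
This article maps *.odt twice: once to the zip driver and once to the odt driver. When both lines end up in the same attributes file, the later matching line takes precedence. The sketch below illustrates this in a scratch repository; notes.odt is a hypothetical file name and does not need to exist.

```shell
# Throwaway repository for the experiment.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Both mappings from this article, in the order they were introduced.
printf '%s\n' '*.odt diff=zip' '*.odt diff=odt' > .gitattributes

# Within one attributes file, the later matching line wins.
git check-attr diff -- notes.odt
```

Here the command reports odt, so odt2txt would be used rather than unzip.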

Handling PDF Files

Even PDF files can be managed with Git. To achieve this, make the following changes in your attributes file:

    *.pdf diff=pdf

And in your config file:

    [diff "pdf"]
        textconv = pdf2txt.py
        binary = true

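This configuration lets Git extract the text from each PDF before diffing. pdf2txt.py is the command-line tool that comes with pdfminer, so that package needs to be installed and on your PATH.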
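
Text extraction from PDFs can be slow, and Git re-runs the converter each time it needs the text. One option, not part of the original setup, is Git's cachetextconv setting, which asks Git to cache the converter's output. A minimal sketch in a scratch repository:

```shell
# Throwaway repository so the settings stay local.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Same driver as the [diff "pdf"] stanza above.
git config diff.pdf.textconv pdf2txt.py
git config diff.pdf.binary true

# Optional addition: cache converted text so unchanged blobs are not converted again.
git config diff.pdf.cachetextconv true

# List everything configured for the pdf driver.
git config --get-regexp '^diff\.pdf\.'
```

Whether caching is worthwhile depends on how large the PDFs are and how often you diff them.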

Tracking SQLite Database Changes

SQLite databases are binary files, and tracking changes to them can be challenging. However, you can still get meaningful diffs for SQLite databases in Git. Here's how:

  1. Add a diff type called "sqlite3" to your config:

    git config diff.sqlite3.binary true
    git config diff.sqlite3.textconv "echo .dump | sqlite3"

    Alternatively, add this snippet to your ~/.gitconfig or .git/config in your repository:

    [diff "sqlite3"]
        binary = true
        textconv = "echo .dump | sqlite3"

  2. Create a file called .gitattributes if it's not already present and add this line:

    *.sqlite diff=sqlite3

Now, when you run git diff or any other Git command that produces a diff on an SQLite file, you'll see a nicely formatted diff of the changes, making it easier to track changes in SQLite databases.
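
To see the effect end to end, the steps above can be replayed in a scratch repository. This sketch assumes the sqlite3 command-line tool is installed; the database name demo.sqlite, the notes table, and the commit identity are hypothetical examples.

```shell
# Throwaway repository with a local commit identity.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.invalid"

# Steps 1 and 2 from above, applied to this repository only.
git config diff.sqlite3.binary true
git config diff.sqlite3.textconv "echo .dump | sqlite3"
echo '*.sqlite diff=sqlite3' > .gitattributes

# Create a small database and commit it.
sqlite3 demo.sqlite "CREATE TABLE notes(id INTEGER PRIMARY KEY, body TEXT); INSERT INTO notes(body) VALUES ('first');"
git add .gitattributes demo.sqlite
git commit -q -m "Add demo database"

# Change the data, then diff: Git compares the SQL dumps instead of raw bytes.
sqlite3 demo.sqlite "INSERT INTO notes(body) VALUES ('second');"
git diff -- demo.sqlite
```

The diff should contain an added INSERT statement for the new row rather than a "Binary files differ" message.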

Conclusion

In this article, we've explored various techniques to handle binary formats such as epub, docx, pptx, and SQLite databases in Git. By configuring Git to use specific text converters and diffing strategies, you can make these binary files more manageable and gain better visibility into their changes. With these approaches, you can effectively track and version control a wider range of file formats in your Git repositories.

For more information and further customization options, refer to the provided sources and Git documentation: