Managing a large multimedia database

Nowadays, most of my stuff is in digital format: music, movies, and photos. I probably have about 50GB of music, more movies and TV series than I can count, and maybe 10GB of photos. Lately I started thinking about how to handle all of that stuff, including backups and sharing with other computers. By the way, when I say “database” I don’t mean MySQL or the like, just a collection of data. Here is what I would like to have:

  • Repository: The base should be just files on a drive. No database like MySQL and no reliance on strange properties of the filesystem, since I’d like it to be accessible from both Windows and Linux (and Mac, why not?). There may be extraneous stuff in the directories, which I will just ignore while browsing for content, but it should not take up too much space. Since I will be working with hundreds of gigabytes, an overhead of 10% would be acceptable, but anything more than that is out.
  • Partial checkout: I should be able to download just a portion of it and push my changes back to the server. It should be smart at detecting changes, so as to minimize data transfers.
  • Backups: It should take care of making automatic incremental backups.

There are different approaches to this problem, and I investigated a few.

  • Flat file repo + rsync: This is the easiest solution. Just keep your files in a directory exported by Samba or whatever, so the wasted space in the repo is 0%. To do partial checkouts you can just copy the files you are interested in, and to keep directories in sync you can use rsync. The problem is that rsync is not that smart at detecting changes: if I just rename a directory, it will retransfer all the files in that directory, and let me remind you that we are talking about gigabytes of transfers. Backups are, again, a manual job.
  • Files inside a tar or iso + rsync: Create a “virtual filesystem”, i.e., store the files inside an iso and mount it. This should be transparent to dumb media players, which don’t care, and rsync now has a much better chance of detecting renames or moves automatically and very efficiently. The downside is that partial checkout doesn’t really work anymore. Backups, on the other hand, get a bit easier, since one can store just the deltas of the iso file.
  • Use git or subversion: Just stuff all the files inside a git or subversion repository, and partial checkouts and file updates become very efficient! Well, for now git cannot do partial checkouts (though I guess it’s coming), while svn can; on the other hand, git is very smart at finding renames and moves inside the repo. This last bit matters for the repo size: both tools make a distinction between the repo and a working copy (the stuff that the dumb media player will read), so there is wasted space. They do compress their repos, but since we are already dealing with compressed files the savings are negligible, and we end up with around 100% space overhead on the server. Git has the same overhead on each client as well, since you (essentially) always transfer the whole repo and not just a working copy. But if you start moving things around, a git repo grows as little as possible, while an svn repo grows a lot. Mercurial (just to throw another one in) has the good property that there is a clean working copy inside its repo, but its size increases a lot when moving things around, so it’s not very good for my purpose. On the backup side, the repo is automatically an incremental backup.
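
To make the rename problem concrete, here is a rough Python sketch of what "being smart at detecting changes" could look like: index both sides by content hash, and treat any file whose content already exists on the other side as a local move instead of a transfer. This is only an illustration under my own assumptions, not how rsync, git, or svn work internally; the function names (hash_file, index_tree, plan_sync) are made up for this example.

```python
import hashlib
import os


def hash_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_tree(root):
    """Map each file path (relative to root) to its content hash."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            index[os.path.relpath(full, root)] = hash_file(full)
    return index


def plan_sync(source_index, dest_index):
    """Split the sync work into local moves and real transfers.

    A source file whose content already exists on the destination under
    a different path is listed as (old_path, new_path); only content the
    destination has never seen ends up in the transfer list.
    """
    dest_by_hash = {}
    for path, digest in dest_index.items():
        dest_by_hash.setdefault(digest, []).append(path)

    moves, transfers = [], []
    for path, digest in sorted(source_index.items()):
        if dest_index.get(path) == digest:
            continue  # same path, same content: nothing to do
        candidates = dest_by_hash.get(digest)
        if candidates:
            moves.append((candidates[0], path))  # reuse data already there
        else:
            transfers.append(path)  # genuinely new content
    return moves, transfers
```

With this, renaming a directory on the source shows up as a list of moves and an empty transfer list, so nothing has to cross the network. The price is that every file has to be read once on each side to compute the hashes, which is not free for hundreds of gigabytes; a real tool would want to cache them.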
