rsync is both an interesting program and an interesting idea. rsync the program does a lot of interesting and useful things but what has always been most interesting to me is the rsync algorithm - its ability to update two versions of the same file over a network, one being local and the other being remote while efficiently sending little more than just the data that has changed. rsync the program uses this ability to efficiently copy or sync file changes between networked computers. Many others have built upon this idea to build general backup tools and services.

What has not been so great about rsync is that as a UNIX program it has never really worked well on Windows. Sure there are 'ports' but by not being a native Windows application, many things were ignored or left by the wayside like file ACLs, permissions, NTFS alternative data streams, etc.… You can get it to work and some have done a pretty good job of fixing it up to make it more Windows-friendly but at its core it just doesn't have access to the information or APIs needed to truly integrate into the Windows ecosystem.

rsync itself is pretty old – it predates much of the Internet as we know it, dating back to 1996. The ideas and algorithms have been tweaked over the years but are stable at this point. Newer ideas, such as Multiround rsync have been proposed to better deal with bandwidth-constrained devices such as mobile devices but these are not currently used in rsync.

Years ago I remember reading about a new feature built into Windows Server 2003 R2 called Remote Differential Compression (RDC). At the time it was a technology only used as part of Windows DFS and not really applicable to anything else. Recently while working on a side project I started to wonder how hard it would be to port the rsync algorithm for use in my project. While researching this I came across RDC again and realized that it has since been built into every version of Windows since Vista as a low-level COM API that anyone can use. This RDC API is data source agnostic meaning that it can be used for files, databases, or any other form of raw data that needs to be efficiently transferred. Since Vista's release I suspect that RDC or the ideas behind it are at the core of Windows features like block-level Offline Files updating and BranchCache.

On the surface RDC seems a lot like rsync but it has been researched and developed by Microsoft in significant new ways. Mostly notably it is a recursive algorithm. What this means is, instead of exchanging one set of block hashes or signatures, RDC recursively processes the block signatures to generate smaller sets of signatures to send over the wire. It's like if RDC took the set of signatures for a file and wrote them to disk as a new file and then understanding that both the local and remote signatures are probably similar (since the original files should be similar), used the RDC algorithm to recursively generate a new set of signatures based on the previous set of signatures, each time cutting the size of block signatures to a fraction of what they first were. Now instead of sending the original large set of signatures, it send this reduced set and then uses the RDC algorithm to build the larger set. In the end this means that even less data needs to be transferred over the wire.

Another novel idea in RDC is the idea what you can use more than just the two versions of the same file to generate signatures. In fact, you can likely use parts of other files to further reduce the data that is transferred over the network. The idea comes from something like this: you may have made multiple copies of the same files or even cut/paste similar data between files. Some files may even have blocks that are mostly the same despite being totally unrelated. Since transferring block signatures uses less bandwidth than sending the actual data, why not take advantage of this fact to reduce the transferred data even more. The problem is how do you tell which parts of other local files may match parts of the remote files you want to transfer. You can't reasonably compare every file to every other file as that would be too long of a process. The answer to this problem is solved by pre-analyzing "traits" of each file and storing them so that they can be quickly compared as candidates. This "trait" data is extremely small, being around 96-bits per file and can easily be stored in the file's metadata, in a database or even in memory.

It is no surprise to me that RDC remains unknown to most. Even though it has been a Windows feature with a public API for years now the MSDN documentation on how to use RDC is severely lacking – mostly notably on how RDC recursively generates signatures and how to transfer, reconstruct, and use them. In my extensive searching I have only come up with two examples of how to use the RDC APIs (which are COM APIs, btw). Both of these are complex with one being completely broken, despite being published as an official MSDN sample. The only truly working sample I could find is the one that ships as part of the Windows SDK, which you'll have to install to get to it. You'll find that sample in the Samples\winbase\rdc folder of your Windows SDK installation (at least in SDK v7.1 – I have not checked the Windows 8 SDK version).

As I mentioned, these sample are complex – the working C++ sample uses DCOM so that you can use a local and remote machine. This means customs interfaces and levels of abstraction that make it hard to get to the core of what is really going on, not to mention it being all cross-process. The second, broken sample is built for .NET and is written in C# and uses COM interop and a custom Web Service so that you can test against a local and remote machine. Both of these samples btw, can be run on the same machine which makes it a bit easier to follow but you're still dealing with cross-process debugging. The .NET one is a little easier to follow though as it uses less custom abstraction to hide the complexities of working with RDC. However it is broken.

It was only by understanding the SDK's C++ RDC sample that I could understand how the .NET sample was broken. Remember how I said that RDC recursively generates signatures to cut down on the amount of bandwidth needed? Well, as it turns out the .NET sample was generating these signatures both locally and remotely just fine. What it was not doing was transferring these signatures correctly. As best I can tell it was assuming that there was only one level of signatures to be transferred – in fact it would work if there was only one level which is typically what happens for very small files (in an unrelated bug it would simply crash with files over 64kb which leaves me to believe it was never tested with larger files). However RDC by default decides how many levels of recursion should take place and when it does it means that multiple levels of signatures need to be transferred (but by using the RDC algorithm so the overall bandwidth stays low). Handling these multiple levels of transfers was what was broken. I was able to fix that and test that it works no matter how many levels of signatures where being used or the size of the file. In the process I added more logging, fixed other bugs, and cleaned a few things up as well. Here is that newer, fixed up sample RDC .NET project. It is by no means ready for production (or even useful as an rsync alternative) but it does show how to use RDC with .NET.

RDC.NET sample on Bitbucket.org: https://bitbucket.org/davidjade/rdc.net 

Note: If you're not a Mercurial user and you just want the source as a .zip file, just go to the Downloads | Branches section of the project and select your archive format.

While complex to use RDC works really well. In my sample data I used two bitmap files of the same photograph, one of which had been altered by writing the word "changed" in a 72pt font across it. Since these files were saved as uncompressed bitmaps (as compared to jpegs) they were mostly the same. The total file size was 2.4mb but by using RDC to transfer the changed version from a remote location, the total RDC bandwidth needed as only 217kb making for a 92% reduction in bandwidth needed to update the file. The RDC overhead is likely too great to use for smaller files but for larger files it can really pay off, especially if you are bandwidth constrained.

RDC was developed several years ago – 2006 being the year that the research paper on RDC was published with the samples dating to shortly after that time as well. I would encourage you to check out that research paper if you really want to understand how RDC functions. It is a great example of how Microsoft Research really is producing novel and significant advances in computing. It is also the missing documentation that is needed to best make use of the minimal MSDN documentation on the RDC COM API.

One last note, as sometimes unfortunately happens in the world of Windows, misinformation has spread among Windows users (the type that think every service they don't use directly should be turned off) that has led users to think that RDC should be uninstalled on Windows so that it somehow fixes slow network file transfers. Besides this making totally no sense whatsoever and besides the fact the RDC isn't used for anything in a client Windows OS, it may be something you have to contend with if you build an app that uses RDC. The myth is so widespread that Microsoft had to write about it.


Scheme (the hard way)

Published 10/24/2012 by David Jade in Programming
Tags: ,

TLDR; I took a lot of complex Java source code for a complete R5RS Scheme interpreter and converted it to C# and it worked! It was much harder than this little TLDR makes it sound.

I’ve been on a Scheme learning kick lately because it was something I always wanted to do for many various reasons. My original goal was to just read and work through the classic SICP book. I’m not very far along on that front because at some point I needed a Scheme system and I got sidetracked and sucked into the totally ridiculous idea of porting a Java-based R5RS compliant Scheme interpreter (SISC) to C# and .NET. I’m not really sure why, it just sounded challenging and interesting to me and I got sucked in by the somewhat naïve and seemingly possible notion that it could mostly be done mechanically. Plus there really are no full-featured Scheme interpreters or compilers for .NET* – the few that exist leave out features like support for full continuations and tail calls - stuff that would be useful for really learning Scheme. Although it hasn’t been updated for a few years, SISC claims to be a fully R5RS compliant Scheme interpreter.

I started with a machine translation of the java source code (using Sharpen). Even getting Sharpen to work was a unusual chore (my fault mostly) and what C# it split out would not build. SISC is comprised of 203 java class files – 18,000+ lines of actual source statements. What Sharpen gave me is currently much more C# code because it uses a lot of custom classes and shims to match Java’s intentions, functionality, etc.… It’s (hopefully) a lot of extra layers that I will be able to mostly remove with careful, targeted re-writes to native C#.

Getting it to build wasn’t too hard, but it took time. Sharpen’s conversion doesn’t have a shim for everything so sometimes the code it translates calls class methods that don’t exist in .NET. Other times it entirely skips and omits lines of Java code altogether (I’m still manually checking for those cases here and there). So I had to write more shim code and hand translate what was missing. Getting it from simply building to actually working though was far harder.

Oh, did I mention that I am not really a Java programmer? I’ve dabbled here and there before but nothing as complex as this and many of Java’s conventions and idioms were less than obvious to me.

The hardest part was that the SISC code base is complex and hard to understand. It seems to be very well written though. There are two primary ways to write a Scheme system: an interpreter or a compiler (either to native code, a byte code VM, or cross compilation to another language like C). Both are complex things to write and understand, especially if you did not write them. SISC is an interpreter which means that much of the code is just plain Java. But that code looks up symbols, builds expression trees, applies procedures, manages registers and call stacks – all abstractions of the Scheme language it is interpreting. In doing so the source code for SISC leverages some features of Java that have to be carefully understood when translating to another language.

First off, Java has some different notions about things like equality from .NET. For instance, in Java if you have two Lists classes and use equals() to compare them, the contents of the lists are checked to determine equality. In .NET this is rarely the case as Equals() usually just settles for reference equality. So more shim code to fix that.

Collection classes can also be pretty different, with Java supporting things like mutable views into Lists instances via subList(). Some Java collection classes have no equivalents in .NET. The Java SISC source was also not written to use generics and instead liberally uses Object which makes it hard to tell what is intended to go into any collection instance (and often it is a mixed bag).

Then there’s things like WeakHashMap or ReferenceQueue which not only do not exist in .NET, they are near impossible to accomplish since the CLR doesn’t publically expose the garbage collector hooks necessary to write .NET equivalents that follow Java semantics. To create similar types of things in .NET you have to liberally use WeakReferences together with polling and/or class finalizers. It works but is not ideal.

Finally there’s exceptions, which normally wouldn’t have been a big deal but the SISC code base unfortunately relies on using exceptions for non-exceptional things, like rewinding the interpreter’s execution when failing to convert a piece of text that might be a number (or a symbol that just looks like a number) into a number. Catching exceptions happens polymorphically  which means that if your exceptions class hierarchies are out of sync it’s going to be a world of hurt. Telling the “normal” exceptions apart from the true failure exceptions was tough. Luckily, it turned out to be not as bad as it could have been but I had to spend considerable time studying both Java’s and .NET’s exception class hierarchies to see where there were out of sync and fix up the converted code so that the original intentions were preserved. I am sure there are more similar Java to C# gotcha’s waiting for me in this converted code base that I have yet to uncover.

The final complication came in the form of Scheme itself. The core of an R5RS Scheme language is a very small, simple language containing just 23 constructs, 11 of which can be built from combinations of the other 12. The entire R5RS language spec is less than 50 pages. To have a functioning Scheme language that does something interesting though requires much more, like a set of primitive functions (math, string functions, basic I/O APIs, etc.…) and a Scheme library so that it can interface with input and output devices, files systems, etc... Most of these primitives are contained within the SISC interpreter as Java (and now C#) native code but much of the library is actual Scheme code that uses these lower level native primitives to build a more complete and useful library. This Scheme library code amounts to more that 70,000 lines of Scheme code that gets built into a heap image during the build process. The trick is you need a functioning Scheme system to build this Scheme heap. However, SISC was designed to be self-bootstrapping, which meant that it could load and parse a small set of special Scheme source files and split out a binary heap image to a file that could then be reloaded when running the full interpreter.

All throughout this project I had the nagging question of “how will I know when it’s really working?” It’s an interpreter so unless it is interpreting lots of code it would never be tested very well. However at this point I could barely understand Scheme source much less write any. The surprise answer came when I finally understood that the self-bootstrapping process was in fact executing lots and lots of actual Scheme code, parsing it and building internal representations of Scheme s-expressions. Moving beyond having it simply not blow up, verifying that the heap image it was creating turned out to be very difficult.

The heap is structured as a memory image of serialized s-expressions with lots of pointer indirection in the form of dynamically generated integer IDs that index into tables of class types, expressions, data fragments, and BER encoded integers. To make matters worst, most of the internal data structures that get serialized are stored internally in hash tables which means that when serializing these sets they usually get written out in a slightly different order each time (i.e enumerating hash set entries with dynamically generated symbol names is not guaranteed to be deterministically ordered from run to run). In the end I got the bugs squashed by modifying both code bases to be completely deterministic when writing heap images and output copious amounts of trace logs. This way I could do a binary comparison of the heap image files and logs as well as step through each code base side-by-side to see where they differed when reloading the heap. it was the most frustrating part of the project but peace of mind in the end to know that internally, they were producing the exact same set of data structures. This also meant that they were interpreting lots of Scheme code identically too to produce those data structures. This was as close to a testing framework as I was likely to get, comparing one working Java version to a converted C# version when processing Scheme source code.

All in all it took a few weeks but I got it working – not complete, but complete enough to be useful (some parts of the library are not ported yet, like networking functions, the FFI, etc.…). It was one of those things where it seemed possible at first glance, then impossible once I got into it, then suddenly and surprisingly possible again the next moment – flip-flop, flip-flop with each little bit of progress I made. I came close to stopping several times but then I’d have a little breakthrough and it would seem like the finish line was very near (only to be not). In the end is just suddenly mostly started working with minor little bugs here and there. I can’t quite prove that it completely works though but I’m able to learn Scheme and use little bits of Scheme snippets and test cases I find around the net. I still don’t fully understand how the interpreter works though – it is very hard to mentally trace through the code.

All so I could learn stuff like this: http://www.willdonnelly.net/blog/scheme-syntax-rules/

Scheme’s ability to extend the language syntax is fascinating to me. You can write a whole new language right inside of Scheme with new syntax rules and everything.

Unfortunately, SISC.net is a ways from being put out there for anyone else to use but someday I hope to do so. The issue is, it contains some source code that is Ok for me to use but not so Ok for me to redistribute (Reflector *cough* *cough*). When I’ll be able to do so I don’t quite know. This was my fun little fall learning project and I’m done for a while now. If you’re a Seasoned Schemer and this is something that would be useful to you though, knowing full well that it is experimental at best, leave me a comment to prod me along to get it into a more stable release state.

P.S. I also wrote a Forth threaded interpreter in .NET but it was much, much simpler than any of this and I had an old but good reference for how to do so. It’s the project I never completed from my early days of computing. It’s not quite finished yet either (it needs a bit of a re-write to solve some issues) but I’m working on getting it out as well. Forth is a magical little language.

* The exception is the Common Larceny project but its future does not look too bright: https://lists.ccs.neu.edu/pipermail/larceny-users/2010-October/000829.html


ExpoDev version 1.1 has been submitted for App Store approval and is waiting for release. Version 1.1 is a minor update that includes a few fixes and enhancements, most notably being that the Bellows Extension Factor calculator has been completely revamped. As you can see below, this new version now supports bellows extension factor compensation by using either the subject distance, the actual bellows extension, or a magnification factor.

Factors

When working with subject distances you can use either meters or feet and inches, as measured from the subject to the lens board. For Extension, the actual bellows extension is measured in millimeters. Magnification is a decimal magnification factor with 1.0 being a 1:1 reproduction factor.

ExpoDev will now directly tell you the calculated bellows factor and as such, ExpoDev no longer tries to determine whether a subject is a close-up or not. This leaves the photographer in charge of determining whether or not to apply a bellows factor as they see fit. ExpoDev does place some reasonable limits on factors though: if the calculated bellows factor results in less than 1/6th of a stop then no factor will be applied. 1/6th of a stop is well within the margin of error, all things considered. ExpoDev will also place a limit on the maximum factor of 65,536 (about 16 stops). In both of these cases, ExpoDev will let you know what is going on so that you, the photographer can make informed choices.

A few other minor improvements are included in this release. It is now possible to use lenses up to 2000mm and the minimum aperture has been increased to f/2048 (for pinhole photographers).


Today I am happy to announce that BTZS ExpoDev for iOS v1.0 has been approved and officially released. It won’t be searchable in iTunes for a few days or so, but in the meantime you can use these direct links:

Official app website with links to the app in iTunes, feature information, screenshots, the user manual, and info on how to get support:  http://TinyOctopus.net/ExpoDev

Direct link to BTZS ExpoDev in iTunes:
http://itunes.apple.com/us/app/btzs-expodev/id507385770


ExpoDev’s Depth of Field calculator was designed with the large format photographer in mind. There are many takes on DOF calculators but a lot of them are either too simplistic or make it needlessly complex for everyday use. ExpoDev’s DOF calculator strives to remain simple to use yet flexible enough for large format photographers who really want to maintain the clarity and visual integrity in their prints. Depth of Field is a complex topic though and I’m not going to go too deeply into it here. Instead, I’ll explain how ExpoDev’s DOF calculator works and how to use it to the greatest advantage. First though, I can’t escape a little background on DOF:

One key aspect of an effective DOF calculator is letting the photographer choose one very important parameter, the Circle on Confusion size (CoC). CoC values are typically expressed in millimeters. ExpoDev offers a selection of CoC values along with the suggested film format size they should be used with*.

There are a lot of opinions though on what CoC values should be used for particular film formats but it’s really a matter of personal choice and greatly depends on the enlargement factor to be used when making prints from your negatives. For instance with contact printing, a larger CoC can be used since there is no enlargement factor. With smaller film formats that will be enlarged, a smaller CoC should be used. The greater the enlargement factor, the smaller the CoC value needs to be.

Typically though, we don’t always think about enlargement factors while working in the field. To make things simpler (but not necessarily less flexible) ExpoDev makes one general assumption: that the larger the print size, the farther away the typical viewing distance will be. ExpoDev therefore assumes that the typical viewing distance will be roughly 1.5 times the diagonal measurement of the print size. This generally holds true for most viewers, whether it is a large print on a wall or a smaller print in their hands. This assumption allows ExpoDev to use one CoC value for each film size without making you think too much about enlargement factors. If however you want to increase the perceived sharpness of a print it is as simple as choosing a smaller CoC value in ExpoDev. This method retains the flexibility without making the user interface needlessly complex.

The important take-away here is that choosing the correct CoC value is what determines final perceived print sharpness and since ExpoDev gives you that control, you can make the DOF calculator work to suit your needs.

Now that we got that out of the way, we can get back to the features of ExpoDev’s Depth of Field calculator.

ExpoDev has three DOF calculators. The first one, Check mode, is the simplest and can be thought of more as a DOF checker. You simply tell ExpoDev what aperture you are using and the subject distance and it will calculate the Total DOF, the Near and Far planes, and the Hyperfocal distance for you.

The next DOF mode, Distance mode, allows you to tell ExpoDev the Near and Far focal planes and it will then tell you the Minimum Aperture to that can be used along with the Focal plane and total DOF. It also tells your one more important thing, the Optimal Aperture.

The Optimal aperture is the concept that print sharpness is really a balance between perceived sharpness and diffraction. There is both a minimal aperture to get the perceived print sharpness you want but there is also a point at which a smaller aperture starts to degrade the sharpness due to diffraction. The Optimal aperture is the aperture which balances these two effects. It is the midpoint where you get the greatest DOF with the least amount of diffraction. You can use either aperture when making your exposure to get the DOF you need, but for optimal print sharpness you sometimes might want to go with the optimal aperture vs. the minimum aperture.

The final DOF mode is called Focus Spread mode and is probably the most practical one to use as it requires no estimations or measurements of subject distances. Focus Spread mode is a way of easily focusing a view camera by measuring the distance that the focus rail has traveled between focusing on both the nearest and farthest points. If you measure and enter this distance (in mm), ExpoDev can then calculate both the minimum and optimal apertures that will make everything from the nearest point to the farthest point fall within the DOF. This makes it really easy to focus a view camera as it does not require you to stop down the lens at all until you are ready to make an exposure. You do all focusing while the lens is wide-open. The really great thing about this method is that it automatically takes into consideration any view camera movements that you are using. This makes figuring out which movements are improving the DOF as simple as measuring the changes in focus rail travel (i.e. if the distance is getting smaller, the movements are helping). You can find more information about this method here, http://www.largeformatphotography.info/how-to-focus.html in the “Procedure II” section.

Once ExpoDev has calculated an aperture to use for the desired DOF, it makes using that aperture easy by automatically settings the Exposure aperture to that value (if you are using Aperture exposure mode). There is also an app setting to tell ExpoDev which aperture to use when it sets one for you, either the minimum aperture or the optimal aperture.

Of course you can change the automatically selected aperture to use on the exposure tab at any time but if you do and that change negatively affects the calculated DOF (i.e. it decreases the DOF), ExpoDev will warn you when you are doing so.

That about covers the DOF calculator in ExpoDev for iOS. It is a significant improvement over the older Palm version of ExpoDev in many ways. It is quite powerful and flexible yet remains as easy to use as ever.

* Right now ExpoDev offers a few commonly used CoC values for different film formats. In the next version, ExpoDev will let you create your own set of CoC values to choose from much like it does for filters and lenses.


Flux and Mutability

The mutable notebook of David Jade