redis: Sorted Sets as iterable key-value work queues

I’m in the process of building an internal search engine at work, and first on the block are our network share drives. During a typical week, we have about 46,000 unique documents that are written to in our Lexington office. This number only represents things like Word docs, Excel spreadsheets, PowerPoint presentations, PDFs, and so forth. (A count of all touches is 3-4x higher.)

This presents some challenges: crawling the network shares at regular intervals is slow and inefficient, and at any given time a large portion of the search index may be out of date, which is bad for the user experience. So I decided an incremental approach would be better: if I re-index each document as it’s touched, it eliminates unnecessary network and disk IO, and the search index is updated in close to realtime.

Redis

My second-level cache/work queue is a redis instance that I interact with using Booksleeve. The keys are paths that have changed, and the values are serialized message objects stored as byte arrays that contain information about the change. The key-value queue structure is important, because the only write operation that matters is the one that happened most recently. (Why try to index a document if it gets deleted a moment later?)

This is great in theory, but somewhere along the way, I naively assumed it was possible to iterate over redis keys… but you can’t. Not easily or efficiently, anyway. (The KEYS command walks the entire keyspace and blocks the server while it runs, so using it in a production environment is dangerous and should be avoided.)

Faking iteration

The solution was relatively simple, however, and like all problems in software development, it was solved by adding another level of indirection: using the redis Sorted Set data type.

For most use cases, the main feature that differentiates a Sorted Set from a Set is the notion of a score, which keeps the members in order. That ordering is what matters here: because the members are ordered, you can ask for a range of them by rank (the first n, for example) instead of fetching the whole set. In my case, each member returned is the key to a key-value pair representing some work to be done.

Implementing this is as easy as writing the path to the Sorted Set at the same time as the key-value work item is added, which can be done transactionally:

using (var transaction = connection.CreateTransaction())
{
    int i = 0;  //dummy value representing some "work"
    foreach (var word in WORDS)
    {
        transaction.SortedSets.Add(REDIS_DB, set, word, SCORE);
 
        //Set the key => message (int i) value
        transaction.Strings.Set(REDIS_DB, word, i.ToString());
        i++;
    }
    transaction.Execute();
}

Downstream, my consumer fills its L1 cache by reading n elements from the Sorted Set:

var pairs = new Dictionary<string, string>();
using (var transaction = connection.CreateTransaction())
{
    //Get n keys from the set into the Dictionary. The range bounds are inclusive, so stop at LIMIT - 1
    var keyList = connection.Wait(connection.SortedSets.RangeString(REDIS_DB, set, 0, LIMIT - 1));
 
    foreach (var key in keyList)
    {
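        //The values were written as strings, so decode them as UTF-8 rather than the machine's ANSI code page
        var value = Encoding.UTF8.GetString(connection.Wait(connection.Strings.Get(REDIS_DB, key.Key)));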
        pairs.Add(key.Key, value);
 
        //Remove the key from the SortedSet
        transaction.SortedSets.Remove(REDIS_DB, set, key.Key);
        //Remove the key from the Keys
        transaction.Keys.Remove(REDIS_DB, key.Key);
    }
    transaction.Execute();
}

And there we have fake key “iteration” in redis.
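To make the two-structure pattern easier to see without a redis server, here is a minimal in-memory sketch. It is an illustration only, not Booksleeve code: KeyedWorkQueue and its members are hypothetical names, a SortedSet<string> stands in for the redis Sorted Set (members with equal scores are ordered lexicographically), and a Dictionary stands in for the string keys.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

//Hypothetical in-memory stand-in for the Sorted Set + key-value pair pattern
class KeyedWorkQueue
{
    //Plays the role of the Sorted Set: an ordered index of pending paths
    private readonly SortedSet<string> _index = new SortedSet<string>(StringComparer.Ordinal);
    //Plays the role of the string keys: path => most recent message
    private readonly Dictionary<string, string> _messages = new Dictionary<string, string>();

    public int Count { get { return _index.Count; } }

    public void Enqueue(string path, string message)
    {
        _index.Add(path);           //No-op if the path is already pending
        _messages[path] = message;  //The most recent write replaces the older one
    }

    public Dictionary<string, string> DequeueBatch(int limit)
    {
        var batch = new Dictionary<string, string>();
        //Copy the first n paths before removing anything from the index
        foreach (var path in _index.Take(limit).ToList())
        {
            batch[path] = _messages[path];
            _index.Remove(path);
            _messages.Remove(path);
        }
        return batch;
    }
}

class Program
{
    static void Main()
    {
        var queue = new KeyedWorkQueue();
        queue.Enqueue("a.docx", "modified");
        queue.Enqueue("b.xlsx", "created");
        queue.Enqueue("a.docx", "deleted");     //Supersedes "modified"

        Console.WriteLine(queue.Count);         //2
        foreach (var pair in queue.DequeueBatch(1))
        {
            Console.WriteLine(pair.Key + " => " + pair.Value);  //a.docx => deleted
        }
        Console.WriteLine(queue.Count);         //1
    }
}
```

The point of the sketch is that the index never holds a path twice, so a document that is touched ten times before the consumer gets to it is still indexed only once, with the latest message.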

How to install iCloud on Windows Server

Symptom
iCloud sync stops working, or iCloud refuses to start and shows this error message: ‘The procedure entry point _objc_init_image could not be located in the dynamic link library objc.dll’.

Cause
Apple has configured newer versions of iCloud (version 3+, I believe) to only work on Windows 7 or 8, but there’s no reason you can’t use it on Windows Server operating systems.

Fix
You’ll need about 3 minutes, and two utilities.

  1. Install 7-zip.
  2. Install Orca, a Microsoft-provided MSI editor.
    • Orca is bundled with the Windows SDK, and getting it out of that bundle isn’t straightforward, so I’ve extracted it, and zipped it up so you can get it as a standalone program.
  3. Download the iCloud control panel installer, if you haven’t already
  4. Open iCloudSetup.exe with 7-zip.
    1. Right-click it
    2. Select 7-zip > Open with 7-zip
  5. Extract the appropriate version of iCloud somewhere (usually this is iCloud64)
  6. Open Orca, and open the iCloud MSI you just extracted
  7. Go to the LaunchCondition table
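  8. Change this line (MsiNTProductType is 1 on workstation editions of Windows, 2 on domain controllers, and 3 on servers):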
    • From: (VersionNT >= 601) AND (MsiNTProductType = 1)
    • To: (VersionNT >= 601) AND (MsiNTProductType = 3)
  9. Save and quit

You should then be able to install iCloud on your Windows Server OS using the MSI you just modified.

A little syntactic sugar to have reference semantics for value types

In C#, structs and the built-in primitives have value semantics. (Strings behave the same way in practice: they are technically reference types, but they’re immutable and compare by value.) But sometimes it’s useful to have reference semantics when dealing with what would otherwise be a value type. (Referencing primitives in a singleton object, for example.)

Here are two ways of doing that.

Delegates

Delegate syntax can seem a little weird, particularly when you’re working with primitives, because delegates are basically typed function pointers. They look more like methods than variables.

Here’s an example:

class Program
{
    static void Main(string[] args)
    {
        var i = SomeInteger.SomeNumber;     //Value type won't change when SomeNumber changes
        Console.WriteLine(i);               //100
        Func<int> del = () => SomeInteger.SomeNumber;
        Console.WriteLine(del());           //100 (Same as del.Invoke())
 
        SomeInteger.SomeNumber = 42;
        Console.WriteLine(i);               //Still 100
        Console.WriteLine(del());           //42
    }
}
 
static class SomeInteger
{
    public static int SomeNumber = 100;
}
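The same trick works for local variables, because a C# lambda captures the variable itself rather than a copy of its current value. A minimal sketch (CaptureDemo and Run are hypothetical names used only for illustration):

```csharp
using System;

static class CaptureDemo
{
    //Returns the value read through the delegate before and after the local changes
    public static Tuple<int, int> Run()
    {
        int counter = 100;
        Func<int> read = () => counter;     //Captures the variable, not the value 100

        int before = read();
        counter = 42;
        int after = read();

        return Tuple.Create(before, after);
    }

    static void Main()
    {
        var result = Run();
        Console.WriteLine(result.Item1);    //100
        Console.WriteLine(result.Item2);    //42
    }
}
```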

Wrapper class using generics

My preferred way is to combine a wrapper class with C# generics. It requires a little more setup, but the syntax is nicer and the results are a little clearer, in my opinion.

Here’s an example:

class Program
{
    static void Main(string[] args)
    {
        //Assume SomeInteger.SomeNumber is 100 again
        var foo = new Reference<int>(() => SomeInteger.SomeNumber);
        Console.WriteLine("foo.Value: {0}", foo.Value);     //100
 
        SomeInteger.SomeNumber = 42;
        Console.WriteLine("foo.Value: {0}", foo.Value);     //42
    }
}
 
class Reference<T>
{
    private readonly Func<T> _theValue;
    public T Value
    {
        get
        {
            //If no delegate was supplied, return the default value of type T
            return _theValue != null ? _theValue.Invoke() : default(T);
        }
    }
 
    public Reference(Func<T> theValue)
    {
        _theValue = theValue;
    }
}

And there we have type-safe, reference semantics for primitives in C#. (Works for nullable types, too.)
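If you also need to write through the reference, one possible extension is to pair the getter with a setter delegate. This is a sketch under my own assumptions rather than part of the wrapper above: MutableReference<T> and the Settings class are hypothetical names introduced for the example.

```csharp
using System;

//Hypothetical stand-in for some shared state
static class Settings
{
    public static int Timeout = 100;
    public static int? Retries = null;
}

//Hypothetical read/write variant of the wrapper
class MutableReference<T>
{
    private readonly Func<T> _getter;
    private readonly Action<T> _setter;

    public MutableReference(Func<T> getter, Action<T> setter)
    {
        if (getter == null) throw new ArgumentNullException("getter");
        if (setter == null) throw new ArgumentNullException("setter");
        _getter = getter;
        _setter = setter;
    }

    public T Value
    {
        get { return _getter(); }       //Always reads the current value
        set { _setter(value); }         //Writes through to the original location
    }
}

class Program
{
    static void Main()
    {
        var timeout = new MutableReference<int>(() => Settings.Timeout, v => Settings.Timeout = v);
        timeout.Value = 42;
        Console.WriteLine(Settings.Timeout);        //42

        //Nullable types work the same way
        var retries = new MutableReference<int?>(() => Settings.Retries, v => Settings.Retries = v);
        Console.WriteLine(retries.Value.HasValue);  //False
        retries.Value = 3;
        Console.WriteLine(Settings.Retries);        //3
    }
}
```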

Publisher confirms with RabbitMQ and C#

RabbitMQ lets you handle messages that didn’t send successfully, without resorting to full-on transactions. It provides this capability in the form of publisher confirms. Using publisher confirms requires just a couple of extra lines of C#.

If you’re publishing messages, you probably have a method that contains something like this:

using (var connection = FACTORY.CreateConnection())
{
    var channel = connection.CreateModel();
    channel.ExchangeDeclare(QUEUE_NAME, ExchangeType.Fanout, true);
    channel.QueueDeclare(QUEUE_NAME, true, false, false, null);
    channel.QueueBind(QUEUE_NAME, QUEUE_NAME, String.Empty, new Dictionary<string, object>());
 
    for (var i = 0; i < numberOfMessages; i++)
    {
        var message = String.Format("{0}\thello world", i);
        var payload = Encoding.Unicode.GetBytes(message);
        channel.BasicPublish(QUEUE_NAME, String.Empty, null, payload);
    }
}

But you’re out of luck if you want:

  1. A guarantee that your message was safely preserved in the event that the broker goes down (i.e. written to disk)
  2. Acknowledgement from the broker that your message was received, and written to disk

For many use cases, you want these guarantees. Fortunately, getting them is relatively straightforward:

//Mark the message as persistent so it survives a broker restart
var messageProperties = channel.CreateBasicProperties();
messageProperties.SetPersistent(true);

//Ask the broker to acknowledge each message once it has been persisted to disk
channel.BasicAcks += channel_BasicAcks;
channel.ConfirmSelect();
 
//...
 
//Begin loop
channel.BasicPublish(QUEUE_NAME, QUEUE_NAME, messageProperties, payload);
channel.WaitForConfirmsOrDie();
//End loop

(You’ll have to implement event handlers for acks and nacks.)
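As a rough idea of what those handlers might look like, here is a minimal sketch that just logs each confirm. It assumes a version of RabbitMQ.Client that still exposes the IModel interface with the BasicAcks and BasicNacks events (newer releases may differ), and ConfirmLogging.Attach is a hypothetical helper name. Lambdas are used so the handler signature is inferred from whatever delegate type your client version declares.

```csharp
using System;
using RabbitMQ.Client;

//Hypothetical helper: call it once, before channel.ConfirmSelect()
static class ConfirmLogging
{
    public static void Attach(IModel channel)
    {
        //Raised when the broker confirms one or more messages.
        //If Multiple is true, every message up to and including DeliveryTag is confirmed.
        channel.BasicAcks += (sender, args) =>
            Console.WriteLine("ack  tag={0} multiple={1}", args.DeliveryTag, args.Multiple);

        //Raised when the broker could not take responsibility for one or more messages
        channel.BasicNacks += (sender, args) =>
            Console.WriteLine("nack tag={0} multiple={1}", args.DeliveryTag, args.Multiple);
    }
}
```

In a real publisher you would probably track outstanding delivery tags and re-publish anything that is nack’d, rather than just writing to the console.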

The difference between WaitForConfirms and WaitForConfirmsOrDie is not immediately obvious, but after digging through the Javadocs, it seems that WaitForConfirmsOrDie throws an IOException if a message is nack’d, whereas WaitForConfirms simply returns false.

You’ll get an InvalidOperationException (the .NET counterpart of the IllegalStateException described in the Javadocs) if you call either variation of WaitForConfirms without first putting the channel into confirm mode with ConfirmSelect.

Here’s the complete code for getting an acknowledgement from the RabbitMQ broker, only after the broker has persisted the message to disk:

using (var connection = FACTORY.CreateConnection())
{
    var channel = connection.CreateModel();
    channel.ExchangeDeclare(QUEUE_NAME, ExchangeType.Fanout, true);
    channel.QueueDeclare(QUEUE_NAME, true, false, false, null);
    channel.QueueBind(QUEUE_NAME, QUEUE_NAME, String.Empty, new Dictionary<string, object>());
    channel.BasicAcks += channel_BasicAcks;
    channel.ConfirmSelect();
 
    for (var i = 1; i <= numberOfMessages; i++)
    {
        var messageProperties = channel.CreateBasicProperties();
        messageProperties.SetPersistent(true);
 
        var message = String.Format("{0}\thello world", i);
        var payload = Encoding.Unicode.GetBytes(message);
        Console.WriteLine("Sending message: " + message);
        channel.BasicPublish(QUEUE_NAME, QUEUE_NAME, messageProperties, payload);
        channel.WaitForConfirmsOrDie();
    }
}

MediaWiki “vendor branch” for Mercurial users

I created a “vendor branch” of the MediaWiki stable releases for Mercurial users. (Git is, after all, pretty terrible by comparison.) Commit history goes from 1.21.2 back to 1.19.7.

You’ll do something like this to update:

$ cd your/personal/mediawiki/branch
$ hg pull -u
$ hg pull https://bitbucket.org/rianjs/mediawiki-vendor
$ hg merge tip
$ hg commit
$ hg push

Then deploy however you deploy. Check Special:Version to see that it’s updated.

How to install pip on Windows

This is a distillation of the instructions at The Hitchhiker’s Guide to Python, mostly for my own future benefit when I inevitably forget how to do it:

  1. Install Python, if you haven’t already
  2. Install distribute by running the distribute_setup.py script:
    1. wget http://python-distribute.org/distribute_setup.py
    2. python distribute_setup.py
  3. Use easy_install to install pip. pip is actively maintained, and supports package removal (unlike easy_install).
    1. easy_install pip

This took a grand total of about 60 seconds to complete.

Gmail: Find unlabeled mail, and filter by attachment size

If you’ve wanted to filter by attachment size, or find unlabeled emails… you’re now in luck. Gmail has added some search operators recently.

My favorite is the ability to filter by size. (Gmail measures the size of the whole message, but for a big email that is almost entirely the attachments.)

  • size:2m searches for messages larger than 2MB
  • larger:3m searches for messages larger than 3MB
  • smaller:5m searches for messages smaller than 5MB

You can combine these searches with the other Gmail search operators:

  • larger:3m older_than:2y
  • larger:5m from:email@example.com

Need to find unlabeled messages?

has:nouserlabels will show you stuff you haven’t labeled.

3 minute tip: Configure a Linux server to send email

It’s useful to be able to send email from a Linux webserver. I do it to get MediaWiki page change notifications and other automated status updates. I wanted something that supported two-factor authentication, and this does.

This guide is for you, if:

  • You don’t want to run a mail server
  • You want to send email, and you don’t care about receiving it
  • You want people to receive the emails that your server sends

I’ve used this method with Linode, and it works perfectly.

Install mailutils

~ sudo apt-get install mailutils

When the setup wizard launches, choose the unconfigured option. You don’t need to do any special configuration to get this to work.

Install and configure ssmtp

  1. ~ sudo apt-get install ssmtp
  2. ~ sudo vim /etc/ssmtp/ssmtp.conf
  3. Hit “i” to enter Insert mode.
  4. Uncomment FromLineOverride=YES by deleting the #
  5. Add the following to the file:

    AuthUser=<user>@gmail.com
    AuthPass=Your-Gmail-Password
    mailhub=smtp.gmail.com:587
    UseSTARTTLS=YES

  6. Save and close the file:
    1. Hit Escape
    2. Type :wq
    3. Hit Enter

If you’re using two-factor authentication
Create a new application-specific password to use in the config file above. (If you’re using Gmail, you can manage those passwords here.)

Test it out
~ echo "This is a test" | mail -s "Test" <user>@<email>.com

Using a webmail service other than Gmail
You can follow the same pattern that I used above. You’ll need to:

  1. Substitute the SMTP address and port for your email service (e.g. Yahoo!) where it says smtp.gmail.com:587. (587 is the port number.)
  2. Set up an application-specific password if your webmail provider allows it, and paste that into the password line, the way I did with Gmail. (Yahoo! appears to have something similar.)