Category Archives: Programming

Introducing Haystack – a grab bag of extension methods for .NET

Like many developers, I have collected a bunch of useful methods over time. Most of these methods have neither unit tests nor performance tests. Many of them originated on StackOverflow (which licenses contributed content under CC BY-SA), and many of them didn’t.

I started collecting them formally about two years ago. Recently I decided to actually turn them into something I could consume via nuget, because I was getting fed up with copying and pasting code everywhere.

Compatibility

Haystack targets .NET Standard 1.3, which means it works with:

  • .NET 4.6+
  • .NET Core 1.0+
  • Mono 4.6+
  • UWP 10+

Tradeoffs

  • Performance vs maintainability: If I have to choose between maintainability and raw speed, I’ll choose maintainability. When there was more than one maintainable approach, I chose the faster one, using BenchmarkDotNet to determine the winner. In some cases, slower is actually better: constant-time string comparisons are purposely slow so they don’t leak information, while less security-critical areas use the faster implementation. (See the sketch after this list.)
  • Correctness: For the most part, each method has unit tests associated with it.
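
As a rough illustration of that benchmark-driven approach, a harness like the following picks the winner between two candidates. This is a hypothetical example, not Haystack’s actual benchmark code:

using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// Hypothetical harness: two candidate implementations measured head-to-head,
// and the faster maintainable one wins.
public class CandidateBenchmarks
{
    private readonly int[] _values = Enumerable.Range(0, 10000).ToArray();

    [Benchmark(Baseline = true)]
    public int SumWithLoop()
    {
        var total = 0;
        foreach (var value in _values)
        {
            total += value;
        }
        return total;
    }

    [Benchmark]
    public int SumWithLinq() => _values.Sum();
}

public class Program
{
    public static void Main(string[] args) => BenchmarkRunner.Run<CandidateBenchmarks>();
}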

Examples

string.ConstantTimeEquals

Constant-time string comparisons matter in cryptography for various reasons. Fast, short-circuiting comparisons can leak timing information, so we want to exhaustively check all the bytes in the strings, even if we know early on that they aren’t equal.

const string here = "Here";
const string there = "There";
 
var areSame = here.ConstantTimeEquals(there);   // false
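
Haystack’s actual implementation may differ, but a minimal sketch of the classic approach looks like this: XOR each character pair and OR the differences together, so the loop always runs to the end no matter where the strings diverge.

public static class StringExtensions
{
    // A minimal sketch of a constant-time comparison (not necessarily Haystack's exact code):
    // accumulate XOR'd differences so the loop never exits early.
    public static bool ConstantTimeEquals(this string left, string right)
    {
        if (left == null || right == null)
        {
            return ReferenceEquals(left, right);
        }

        // Checking lengths up front leaks only the length, which is usually acceptable
        if (left.Length != right.Length)
        {
            return false;
        }

        var differences = 0;
        for (var i = 0; i < left.Length; i++)
        {
            differences |= left[i] ^ right[i];
        }
        return differences == 0;
    }
}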

string.TrimStart and string.TrimEnd

It’s useful to be able to remove substrings from the beginning and/or end of a string, with or without a StringComparer overload.

const string trim = "Hello world";
const string hello = "Hello worldThis is a hello worldHello world";
 
var trimFront = hello.TrimStart(trim);   // This is a hello worldHello world
var trimEnd = hello.TrimEnd(trim);       // Hello worldThis is a hello world
var trimBoth = hello.Trim(trim);         // This is a hello world
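
Here’s a sketch of what the TrimStart overload might look like. The real Haystack implementation may differ:

using System;

public static class TrimExtensions
{
    // A sketch of a substring TrimStart (not necessarily Haystack's exact code):
    // strip the prefix from the front while the string still starts with it.
    public static string TrimStart(this string source, string prefix,
        StringComparison comparison = StringComparison.Ordinal)
    {
        if (string.IsNullOrEmpty(source) || string.IsNullOrEmpty(prefix))
        {
            return source;
        }

        while (source.StartsWith(prefix, comparison))
        {
            source = source.Substring(prefix.Length);
        }
        return source;
    }
}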

The library is growing bit-by-bit, and contributions are welcome!

9 observations on 6 months of running a moderately-successful open source project

I’ve run ical.net, an RFC 5545 (iCalendar) library for .NET, since ~May 2016. It’s basically the only game in town if you need to do anything with iCalendar-formatted data. (Those .ics files you get as email attachments are iCalendar data.)

A lot of these fall into the “pretty obvious” category of observations.

1) Release notes matter

If nothing else, they serve as a historical reference for your own benefit. They also help your users understand whether it’s worth upgrading. And when your coworkers ask if a version jump is important weeks after you’ve published it, you can point them to the release notes for that version, and they’ll never ask you again.

2) Automation is important

One of the best things I did when I first figured out how to make a nuget package was push as much into my nuspec file as I could. Everything I learned about various do’s and don’ts was pushed into the code the moment I learned it.

Not everything in ical.net is automated, and I think I’m OK with that for now. For example, a merge doesn’t trigger a new nuget package version. I think that’s probably a feature rather than a bug.

I suspect I’ll reach a second tipping point where the remaining manual steps become painful enough to be worth automating, too.

3) Document in public

Scott Hanselman has the right of this:

Keep your emails to 3-4 sentences, Hanselman says. Anything longer should be on a blog or wiki or on your product’s documentation, FAQ or knowledge base. “Anywhere in the world except email because email is where your keystrokes go to die,” he says.

That means I reply to a lot of emails with a variation of “Please ask this on StackOverflow so I can answer it in public.” And many of those answers are tailored to the question, and then I include a link to a wiki page that answers a more general form of the question. Public redundancy is okay.

Accrete your public documentation.

4) Broken unit tests should be fixed or (possibly) deleted

When I took over dday.ical, there were about 70 (out of maybe 250) unit tests that were failing. There was so much noise that it was impossible to know anything about the state of the code. My primary aim was to improve performance for some production issues that we were having, but I couldn’t safely do that without resolving the crazy number of broken unit tests.

The first thing I did was evaluate each and every broken test, and decide what to do. Having a real, safe baseline was imperative, because you never want to introduce a regression that could have been caught.

The corollary to this is that sometimes your unit tests assert the wrong things. So a bugfix in one place may expose a bad assertion in a unit test elsewhere. That happened quite a lot, especially early on.

5) Making code smaller is always the right thing to do

(So long as your unit tests are passing.)

Pinning down what “smaller” means is difficult. Lines of code may be a rough proxy, but I think I mean smaller in the sense of “high semantic density” + “low cognitive load”.

  • Reducing cognitive load can be achieved by simple things like reducing the number of superfluous types; eliminating unnecessary layers of indirection; having descriptive variable and method names; and having a preference for short, pure methods.
  • Semantic density can be increased by moving to a more declarative style of programming. Loops take up a lot of space and aren’t terribly powerful compared to their functional analogs: map, filter, fold, etc. (I personally find that I write more bugs when writing imperative code. YMMV.) You won’t find many loops in ical.net, but you will find a lot of LINQ; see the sketch below.
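
For example, here’s the same query in both styles. This is illustrative only, not code from ical.net:

using System;
using System.Collections.Generic;
using System.Linq;

public static class DeclarativeExample
{
    // Imperative: the machinery (list, loop, conditional, sort) crowds out the intent
    public static List<DateTime> UpcomingImperative(IEnumerable<DateTime> occurrences, DateTime after)
    {
        var results = new List<DateTime>();
        foreach (var occurrence in occurrences)
        {
            if (occurrence > after)
            {
                results.Add(occurrence);
            }
        }
        results.Sort();
        return results;
    }

    // Declarative: filter, then order; the intent is the code
    public static List<DateTime> UpcomingDeclarative(IEnumerable<DateTime> occurrences, DateTime after)
        => occurrences.Where(o => o > after).OrderBy(o => o).ToList();
}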

I think a preference for semantic density is a taste that develops over time.

6) Semantic versioning is the bee’s knees

In a nutshell:

Given a version number MAJOR.MINOR.PATCH, increment the:

  1. MAJOR version when you make incompatible API changes,
  2. MINOR version when you add functionality in a backwards-compatible manner, and
  3. PATCH version when you make backwards-compatible bug fixes.

This seems like common sense advice, but by imposing some modest constraints, it frees you from thinking about certain classes of problems:

  • It’s concrete guidance to contributors as to why their pull requests are or are not acceptable, namely: breaking changes are a no-no
  • Maintaining a stable API is a good way to inspire confidence in consumers of your library

And by holding my own feet to the fire, and following my own rules, I’m a better developer.

7) People will want bleeding-edge features, but delivering them might not be the highest-impact thing you can do

.NET Core is an exciting development. I would LOVE for ical.net to have a .NET Core version, and I’ve made some strides in that direction. But the .NET Core tooling is still in beta, the progress in VS 2017 RC notwithstanding. I spent some time trying to get a version working (and I did), but I couldn’t see any easy way to automate the compilation of a .NET Core nuget package alongside the normal framework versions without hating my life.

So I abandoned it.

When the tooling is out of beta, I expect maintaining a Core version will be easier and Core adoption will be higher, both of which improve the ROI with respect to development effort.

8) It’s all cumulative

Automation, comprehensive unit test coverage with a mandatory-100% pass rate, lower cognitive load, higher semantic density, etc. All these things help you go faster with a high degree of confidence later on.

9) People are bad at asking questions and opening tickets

And if you’re not okay with that, then being a maintainer might not be a good fit for you.

  • No, I really can’t make sense of your 17,000-line Google Calendar attachment, sorry.
  • No, I won’t debug your application for you, just because it uses ical.net on 2 lines of your 100+ line method, sorry.
  • No, I’m not going to drop everything to help you, no matter how many emails you send me in a 10 minute time interval, sorry.

All of these things are common when you run an open source project that has traction. Ask anyone.

A self-contained, roll-forward schema updater

I use Dapper for most of my database interactions. I like it because it’s simple, and does exactly one thing: runs SQL queries, and returns the typed results.

I also like to deploy my schema changes as part of my application itself instead of doing it as a separate data deployment. On application startup, the scripts are loaded and executed in lexical order one by one, where each schema change is idempotent in isolation.

The problem you run into is making destructive changes to schema, which is a reasonable thing to want to do. If script 003 creates a UNIQUEIDENTIFIER column, and you want to convert that column to NVARCHAR in script 008, you have to go back and reconcile the column types. Adding indexes into the mix makes it even hairier. Scripts that are idempotent in isolation are easy to write. Maintaining a series of scripts that can be safely applied in order from beginning to end every time an application starts up is not.

Unless you keep track of which schema alterations have already been applied, and only apply the changes that the application hasn’t seen before. Here’s a short, self-contained implementation:

using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.IO;
using System.Linq;
using System.Reflection;
using System.Text.RegularExpressions;
using Dapper;
using log4net;   // assumed logging library; ILog and LogManager match log4net's API

public class SchemaUpdater
{
  private readonly string _connectionString;
  private readonly ILog _logger;
  private readonly string _environment;
 
  public SchemaUpdater(string connectionString, string environment)
    : this(connectionString, environment, LogManager.GetLogger(typeof(SchemaUpdater))) { }
 
  internal SchemaUpdater(string connectionString, string environment, ILog logger)
  {
    _connectionString = connectionString;
    _environment = environment;
    _logger = logger;
  }
 
  public void UpdateSchema()
  {
    MaybeCreateAuditTable();
    var previousUpdates = GetPreviousSchemaUpdates();
 
    var assemblyPath = Uri.UnescapeDataString(new UriBuilder(typeof(SchemaUpdater).GetTypeInfo().Assembly.CodeBase).Path);
    var schemaDirectory = Path.Combine(Path.GetDirectoryName(assemblyPath), "schema-updates");
 
    var schemaUpdates = Directory.EnumerateFiles(schemaDirectory, "*.sql", SearchOption.TopDirectoryOnly)
      .Select(fn => new { FullPath = fn, Filename = Path.GetFileName(fn) })
      .Where(file => !previousUpdates.Contains(file.Filename))
      .OrderBy(file => file.Filename)
      .Select(file => new { file.Filename, Query = File.ReadAllText(file.FullPath) })
      .ToList();
 
    foreach (var update in schemaUpdates)
    {
      using (var connection = new SqlConnection(_connectionString))
      {
        try
        {
          var splitOnGo = SplitOnGo(update.Query);
          foreach (var statement in splitOnGo)
          {
            connection.Execute(statement);
          }
 
          connection.Execute("INSERT INTO SchemaRevision (Filename, FileContents) VALUES (@filename, @fileContent)",
            new { filename = update.Filename, fileContent = update.Query });
        }
        catch (Exception e)
        {
          _logger.Fatal(new { Message = "Unable to apply schema change", update.Filename, update.Query, Environment = _environment }, e);
          throw;
        }
      }
    }
  }
 
  public static ICollection<string> SplitOnGo(string sqlScript)
  {
    // Split on GO batch separators; GO may be followed by a count ("GO 5") or a trailing -- comment
    var statements = Regex.Split(
      sqlScript,
      @"^[\t\r\n]*GO[\t\r\n]*\d*[\t\r\n]*(?:--.*)?$",
      RegexOptions.Multiline |
      RegexOptions.IgnorePatternWhitespace |
      RegexOptions.IgnoreCase);
 
    // Remove empties, trim, and return
    var materialized = statements
      .Where(x => !string.IsNullOrWhiteSpace(x))
      .Select(x => x.Trim(' ', '\r', '\n'))
      .ToList();
 
    return materialized;
  }
 
  internal void MaybeCreateAuditTable()
  {
    const string createAuditTable =
@"IF NOT EXISTS(SELECT 1 FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME = 'SchemaRevision')
BEGIN
CREATE TABLE [dbo].[SchemaRevision]
(
[SchemaRevisionNbr] BIGINT IDENTITY(1,1),
[Filename] VARCHAR(256),
[FileContents] VARCHAR(MAX),
CONSTRAINT PK_SchemaRevision PRIMARY KEY (SchemaRevisionNbr)
)
END";
 
    using (var connection = new SqlConnection(_connectionString))
    {
      connection.Execute(createAuditTable);
    }
  }
 
  internal HashSet<string> GetPreviousSchemaUpdates()
  {
    using (var connection = new SqlConnection(_connectionString))
    {
      var results = connection.Query<string>(@"SELECT Filename FROM SchemaRevision");
      return new HashSet<string>(results, StringComparer.OrdinalIgnoreCase);
    }
  }
}
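
Wiring it up at application startup is a one-liner. The connection string and environment name below are placeholders:

// Apply any schema scripts this database hasn't seen before serving traffic
var updater = new SchemaUpdater(
    "Server=localhost;Database=MyApp;Integrated Security=SSPI;",   // placeholder connection string
    "development");                                                // placeholder environment name
updater.UpdateSchema();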

Update 2017-09-05: I added the SplitOnGo() method to support the GO delimiter, since I’ve had occasion to need it recently. It’s adapted from Matt Johnson’s answer on StackOverflow.

Proposed functionality and API changes for ical.net v3

Update 2018-04-11: Almost all of these changes have been published in ical.net versions 3 and 4. See the release notes for more details.

Downloading remote resources

When I ported ical.net to .NET Core, I removed the ability to download remote payloads from a URI. I did this for many reasons:

  • There are myriad ways of accessing an HTTP resource. There are myriad ways of doing authentication. Consumers of ical.net are in a position to know the details of their environment, including security concerns, so responsibility for these concerns should lie with the developers using the library.
  • Choosing to support HttpClient leaves .NET 4.0 users out in the cold. Choosing to support WebClient brings those people into the fold, but leaves .NET Core and WinRT users out. It also prevents developers working with newer versions of .NET from benefiting from HttpClient.
  • Non-blocking IO leaves developers working with WinForms and framework versions < 4.5 out in the cold. Bringing those developers back into the fold means we can’t make use of async Tasks. Given the popularity of microservices and ical.net’s origins on the server side, this is a non-starter.

We can’t satisfy all use cases if we try to do everything, so instead I’ve decided that we’ll leave over-the-wire tasks to the developers using ical.net.
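
For example, a consumer on a modern framework might fetch the payload themselves with HttpClient and hand the resulting string to ical.net. This is a sketch; the type and method names here are placeholders:

using System.Net.Http;
using System.Threading.Tasks;

public static class CalendarFetcher
{
    // Fetching is the consumer's responsibility; use whatever HTTP stack fits your environment
    public static async Task<string> FetchIcalTextAsync(string uri)
    {
        using (var client = new HttpClient())
        {
            return await client.GetStringAsync(uri);
        }
    }
}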

The primacy of strings

To that end… strings will be the primary way to work with ical.net. A developer should be able to instantiate everything from a huge collection of calendars down to a single calendar component (a VEVENT, for example) by passing it a string that represents that thing. In modern C#, working directly with strings is more natural than passing Streams around, which is emblematic of old-school Java. Streams are also more error prone: I fixed several memory leaks during the .NET Core port due to undisposed Streams.

  • The constructor will be the deserializer. It is reasonable for the constructor to deserialize the textual representation into the typed representation.
  • ToString() will be the serializer. It is reasonable for ToString() to serialize the typed representation into the textual representation.
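
A hypothetical round trip under this proposal might look like the following. The exact constructor signature is TBD:

// Hypothetical shape of the proposed API; names and signatures may change
var icalText = File.ReadAllText("invite.ics");   // any source of iCalendar text will do

var calendar = new Calendar(icalText);    // the constructor is the deserializer
var roundTripped = calendar.ToString();   // ToString() is the serializer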

Constructors as deserializers buys us…

Immutable types and (maybe) a fluid API

One of the challenges I faced when refactoring for performance was reasoning about mutable properties during serialization and deserialization. Today, deserialization makes extensive use of public, mutable properties. In fact, the documentation reflects this mutability:

var now = DateTime.Now;
var later = now.AddHours(1);
 
var rrule = new RecurrencePattern(FrequencyType.Daily, 1)
{
    Count = 5
};
 
var e = new Event
{
    DtStart = new CalDateTime(now),
    DtEnd = new CalDateTime(later),
    Duration = TimeSpan.FromHours(1),
    RecurrenceRules = new List<IRecurrencePattern> {rrule},
};
 
var calendar = new Calendar();
calendar.Events.Add(e);

To be completely honest, this state of affairs makes it quite difficult to make internal changes without breaking stuff. Many properties would naturally be getter-only, because they can be derived from simple internals, like Duration above. Yet they’re explicitly set during deserialization. This is an incredible vector for bugs and breaking changes. (Ask me how I know…)

If we close these doors and windows, it will increase our internal maneuverability.

Fluid API

Look at the code above. Couldn’t it be more elegant? Shouldn’t it be? I don’t yet have a fully-formed idea of what a more fluid API might look like. Suggestions welcome.

Component names

IICalendarTypeNames

The .NET framework guidelines recommend prefixing interface names with “I”. The calendar spec is called “iCalendar”, as in “internet calendar”, which is an unfortunate coincidence. Naming conventions like IICalendarCollection offend my sense of aesthetics, so I renamed some objects when I forked ical.net from dday. I’ve come around to valuing consistency over aesthetics, so I may go back to the double-I where it makes sense to do so.

CalDateTime

The object that represents “a DateTime with a time zone” is called a CalDateTime. I’m not wild about this; we already have the .NET DateTime struct which has its own shortcomings that’ve been exhaustively documented elsewhere. A reasonable replacement for CalDateTime might be a DateTimeOffset with a string representation of an IANA, BCL, or Serialization time zone, with the time zone conversions delegated to NodaTime for computing recurrences. (In fact, NodaTime is already doing the heavy lifting behind the scenes for performance reasons, but the implementation isn’t pretty because of CalDateTime‘s mutability. Were it immutable, it would have been a straightforward engine replacement.)
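
A sketch of that idea with NodaTime, illustrative only and not current ical.net code:

using System;
using NodaTime;

public static class ZonedConversionExample
{
    // Pair a DateTimeOffset with an IANA zone id and let NodaTime own the conversion rules
    public static ZonedDateTime ToZone(DateTimeOffset value, string ianaZoneId)
    {
        var zone = DateTimeZoneProviders.Tzdb[ianaZoneId];   // e.g., "America/New_York"
        return Instant.FromDateTimeOffset(value).InZone(zone);
    }
}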

CalDateTime is the lynchpin for most of the ical.net library. Most of its public properties should be simple expression bodies. Saner serialization and deserialization will have to come first as outlined above.

Divergence from spec completeness and adherence

VTIMEZONE

The iCalendar spec has ways of representing time change rules with VTIMEZONE. In the old days, dday.ical used this information to figure out Standard Time/Summer Time transitions. But as the spec itself notes:

Note: The specification of a global time zone registry is not addressed by this document and is left for future study. However, implementers may find the Olson time zone database [TZ] a useful reference. It is an informal, public-domain collection of time zone information, which is currently being maintained by volunteer Internet participants, and is used in several operating systems. This database contains current and historical time zone information for a wide variety of locations around the globe; it provides a time zone identifier for every unique time zone rule set in actual use since 1970, with historical data going back to the introduction of standard time.

At this point in time, the IANA (née Olson) tz database is the best source of truth. Relying on clients to specify reasonable time zone and time change behavior is unrealistic. I hope the spec authors revisit the VTIMEZONE element, and instead have it specify a standard time zone string, preferably IANA.

To that end… ical.net will continue to preserve VTIMEZONE fields, but it will not use them for recurrence computations or understanding Summer/Winter time changes. It will continue to rely on NodaTime for that.

URL and ATTACH

As mentioned above, ical.net will no longer include functionality to download resources from URIs. It will continue to preserve these fields so clients can do what they wish with the information they contain. This isn’t a divergence from the spec, per se, which doesn’t state that clients should provide facilities to download resources.

dday.ical is now ical.net and available under the MIT license with many performance enhancements

A few months ago, I needed to do some calendar programming for work, and I came across the dday.ical library, like many developers before me. And like many developers, I discovered that dday.ical doesn’t have the best performance, particularly under heavy server loads.

I dug in, and started making changes to the source code, and that’s when I discovered that the licensing was ambiguous, and that it had been abandoned. I was concerned that I might be exposing my company to risk due to unclear copyright, and a non-standard license.

With some effort, I was able to track down Doug Day (dday), and he gave me permission to fork, rename (ical.net), and relicense his library (MIT), which I have done. So I’m happy to report…

dday.ical is now ical.net

mdavid, who saw to it that the library wasn’t lost to the dustbin of Internet history, has graciously redirected dday users to ical.net. Khalid Abuhakmeh, who published the dday nuget package that you might be using (you should switch ASAP) has also agreed to archive and redirect users to ical.net.

So… why should you use the new package?

Unambiguous licensing

Doug has relinquished his copyright, and given unrestricted permission to give dday.ical new life as ical.net. That means ical.net is unencumbered by legal ambiguities.

Many performance enhancements

My changes to ical.net have been mostly performance-focused. I was lucky in that dday.ical has always included a robust test suite with about 170 unit tests that exercise all the features of the library. Some were broken, or referenced non-existent ics files, so I nuked those right away, and concentrated on the set of tests that were working as a baseline for making safe changes.

The numbers:

  • Old dday.ical test suite: ~17 seconds
  • Latest ical.net nuget package: 3.5 seconds

There are no games here. ical.net really is that much faster.

Profiling showed a few hotspots, which I attacked first, but those only bought me maybe 3-4 seconds of improvement. There was no single thing that resulted in huge performance gains. Rather, it was many, many small changes that contributed, quite often by reducing garbage collection pauses, many of which were 5ms+, an eternity in computing time.

Here are a few themes that stand out in my memory:

  • Routing all time zone conversions through NodaTime, which actually exposed some bugs in what the unit tests were asserting
  • Converting .NET 1.1 collections (Hashtable, ArrayList) to modern, generic equivalents
  • Converting List<T> to HashSet<T> for many collections, including creating stable, minimal GetHashCode() methods, though more attention is still needed in this area. A nice side effect was that a lot of lookups and collection operations then became set operations (ExceptWith(), UnionWith(), etc.)
  • Converting several O(n^2) methods to O(n) or better by restructuring them based on information that was available in context
  • Converting a lot of loops to LINQ. (Yes, really!)
  • Specifying initial collection sizes when using array-backed collections like List<T> and Dictionary<TKey, TValue>
  • Moving variables closer to their usage, which sometimes meant that certain expensive calls don’t occur at all, because the method exits before reaching them. This also had the effect of pushing some variables into gen 0 garbage collection. (Anecdotally, I have noticed GC pauses are fewer and further between, though I don’t have any hard data showing the improvement is significant.)
  • Moving expensive calls outside of tight loops; see the sketch after this list. Unfortunately the library makes extensive use of the service-provider antipattern. A common case was an expensive call (get me a deserializer for Foo!) inside a tight loop that’s only ever deserializing Foos, so you can make the call once and just reuse the deserializer.
  • Implementing a lazy caching layer, as suggested in one of the TODOs in the comments.
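
Here’s a distilled before-and-after of that hoisting pattern. All the types are stand-ins for illustration, not ical.net’s actual service-provider plumbing:

using System.Collections.Generic;

// Stand-in types for illustration only
public class Foo { }

public class FooDeserializer
{
    public Foo Deserialize(string token) => new Foo();
}

public static class HoistingExample
{
    // Pretend this lookup is expensive: reflection, dictionary probes, allocations...
    private static FooDeserializer GetDeserializer() => new FooDeserializer();

    // Before: one expensive lookup per iteration
    public static List<Foo> Before(IEnumerable<string> tokens)
    {
        var results = new List<Foo>();
        foreach (var token in tokens)
        {
            results.Add(GetDeserializer().Deserialize(token));
        }
        return results;
    }

    // After: one lookup, hoisted out of the loop and reused
    public static List<Foo> After(IEnumerable<string> tokens)
    {
        var deserializer = GetDeserializer();
        var results = new List<Foo>();
        foreach (var token in tokens)
        {
            results.Add(deserializer.Deserialize(token));
        }
        return results;
    }
}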

Along the way, I converted a lot of code to modern, idiomatic C#, which actually helped performance as much as any of the discrete things I did above. As I work towards a .NET Core port, I have the runtime down to about 2.8 seconds just through clarifying and restructuring existing code, and idiomatic simplifications.

What’s next?

  • A .NET Core port is nearly complete.
  • ical.net has virtually no documentation. I hope to improve the readme with some simple examples this morning/afternoon.
  • I have been bug collecting on Stack Overflow, and have a few maybe-bugs to investigate and/or write test cases for.
  • Maybe some API changes for v3, still TBD. I’ll discuss these in a future blog post.

Line wrapping at word boundaries for console applications in C#

I didn’t like any of the solutions floating around the web for displaying blocks of text wrapped at word boundaries, so I wrote one.

This:

This is a really long line of text that shoul
dn't be wrapped mid-word, but is

Becomes this:

This is a really long line of text that
shouldn't be wrapped mid-word, but is

Here it is:

public static string GetWordWrappedParagraph(string paragraph)
{
    if (string.IsNullOrWhiteSpace(paragraph))
    {
        return string.Empty;
    }
 
    var approxLineCount = paragraph.Length / Console.WindowWidth;
    var lines = new StringBuilder(paragraph.Length + (approxLineCount * 4));
 
    for (var i = 0; i < paragraph.Length;)
    {
        var grabLimit = Math.Min(Console.WindowWidth, paragraph.Length - i);
        var line = paragraph.Substring(i, grabLimit);
 
        var isLastChunk = grabLimit + i == paragraph.Length;
 
        if (isLastChunk)
        {
            i = i + grabLimit;
            lines.Append(line);
        }
        else
        {
            var lastSpace = line.LastIndexOf(" ", StringComparison.Ordinal);
            if (lastSpace <= 0)
            {
                // No space to break on (a single word wider than the console), so hard-break the line
                lines.AppendLine(line);
                i = i + grabLimit;
            }
            else
            {
                lines.AppendLine(line.Substring(0, lastSpace));

                // Trailing spaces needn't be displayed as the first character on the new line
                i = i + lastSpace + 1;
            }
        }
    }
    return lines.ToString();
}
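
Usage is straightforward:

var message = "This is a really long line of text that shouldn't be wrapped mid-word, but is";
Console.WriteLine(GetWordWrappedParagraph(message));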

Doing things well takes time

I’m not the world’s fastest programmer, nor am I the slowest, and much like any person who does creative work, there are times when things come easily, and times where it’s a slog. But regardless of the day-to-day ups and downs, I’ve come to appreciate a simple fact in recent weeks: doing things well takes time.

A little while ago, I started working on a computational biology library that I’m calling BCompute. The idea came as I was working my way pretty quickly through the challenges at Rosalind.info. If you look at the challenges there, you’ll see that most of them are pretty easy to solve in a script-y way that works for a narrow set of cases. (I.e. the one you’re working on.) String manipulations, analysis, and so forth are pretty straightforward in most high-level languages.

Somewhere around problem 5 or 6, I got to thinking that it would be fun to create a compbio library, exploring the domain and becoming a better developer along the way. I didn’t want to just build a collection of scripts; I wanted to build a real, performant library with concepts modeled at the proper level of abstraction, with a type-safe, unit tested, composable domain model that could be used for more than just toy problems. So I started reworking those scripts into something real.

Now, about a month later, I have a library that is getting closer to being able to Do Stuff, and will mostly keep you from doing the wrong thing. Along the way, I’ve learned a lot, and had quite a bit of fun. But man does it take time to do things The Right Way–even when you understand the domain pretty well, which is an advantage that I certainly don’t have. (Though I’m getting there.)

  • Writing tests takes time
  • Squashing bugs takes time
  • Refactoring objects and interfaces takes time
  • Building composable objects takes time
  • Learning and then accounting for the biochemistry edge cases takes time
  • Applying a growing body of domain knowledge to your object model–which appears deficient in new and interesting ways the more you learn–takes time

After 100+ commits, 130+ unit tests, and more refactoring than I can even remember, the most salient thing I’ve learned is that it all takes time. (And there’s still a long way to go…)

Understanding the word “semantics” in the context of programming

tl;dr: When the discussion is about programming, it’s usually safe to substitute the phrase behaviors and guarantees wherever you see the word “semantics”.

Longer version: New programmers often come across the word semantics, and wonder what it means. Pretty much every explanation they will read points out the distinction between syntax (form) and semantics (meaning). This is easy to grasp, but not useful for understanding the word in the context of a sentence like: The stylistic choices should typically be driven by a desire to clearly communicate the semantics of the program fragment.

Go ahead and substitute the word “meaning” there. It isn’t much help unless you’re already an experienced developer.

So to that end, new programmers… if ever you come across this word, it’s generally safe to substitute the phrase behaviors and guarantees in its place. This may help you understand the semantic intent (ha!) of the writer a little more.