It's not just the company, but also the Haven

One of the better and more concise arguments against tax havens, appealing to something other than anger at corporations:

Tax havens should be illegal in international law. They are patently unfair. […] They bear few of the costs of the modern nation state, while sucking those states dry of the revenue needed to sustain them.

Simon Jenkins, in the Guardian.

Anti-Patterns in Security #1: Disabling Paste for Passwords

A recent security anti-pattern I've found is websites containing code to disable pasting into password fields. As far as I can make out, this is one of the most brutally effective ways of encouraging users to create insecure passwords.

With many backfiring policies I can at least see some intended benefit; this one is rare in that I can see none at all.

You'll have some jQuery that looks like this:

$("#newPassword").bind('cut copy paste', function(event) {

So you generate your long, secure password using 1Password or whatever, go to paste it in and -- bam! -- nothing happens. Instead, you're forced to transcribe it one character at a time. It's enough to make you resort to pAssword1.

As far as I can see, this policy:

  • Encourages word-based passwords which are easy to type.
  • Encourages short passwords which are easy to remember.
  • Discourages random passwords.

I can't see a way in which this doesn't encourage the short, easy to remember, easy to type and trivially breakable MyPuppy12. Using a secure password like pKfwXwZDX4PGtbsQrZefu7ZtBVjFMV is incredibly cumbersome in comparison. But ho-hum, whatever, it's only my power company/pension/bank/insurance firm so puppies and lost account data it is, I guess.

It's easy to disable the events with some quick JavaScript work in the browser's console. However, this only got me one step further: I could paste my passwords and submit the form, but they failed to pass muster even though they met the stated rules.
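
If you fancy trying this yourself, something along these lines in the browser console is usually enough; it assumes the site uses jQuery and that the field has the id from the snippet above:

// Remove the cut/copy/paste handlers so the field accepts pasting again.
$("#newPassword").unbind('cut copy paste');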

Perhaps 30 characters is just too long for the bank's puppy-password brigade.

(It wouldn't be so annoying if Anti-Pattern #2 wasn't involved: the expiring password of doom.)

Update: Turns out people haven't worked out how to stop you dragging and dropping text into forms in browsers. And 30 characters was too long; thanks for not mentioning it, Mr Website.

CouchDB 2.0's read and write behaviour in a cluster

CouchDB 2.0 (in preview) has clustering code contributed by Cloudant, which was inspired by Amazon's Dynamo paper.

When using CouchDB in a cluster, databases are sharded and replicated. This means that a single database is split into, say, 24 shards and each shard is stored on more than one node (replica). A shard contains a specific portion of the documents in the database. There are almost always three replicas; this provides a good balance of reliability vs. storage overhead.
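
As a concrete sketch, the shard and replica counts can be chosen when the database is created. This example assumes a CouchDB 2.0 node reachable at localhost:5984, made-up admin credentials, and that the cluster accepts the q and n query parameters:

// Sketch: create a database split into 24 shards (q), each kept on 3 replicas (n).
// Runs on Node.js 18+, where fetch is available globally.
const auth = 'Basic ' + Buffer.from('admin:password').toString('base64');

fetch('http://localhost:5984/orders?q=24&n=3', {
  method: 'PUT',
  headers: { Authorization: auth }
}).then(resp => console.log(resp.status)); // 201 when the database is created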

Read and write behaviour differs in the clustered database because data is stored on more than one database node (server). I'm going to try to explain what the behaviour is and why it is the way it is.

To simplify the explanation, I'm going to assume a database with a single shard, though still with three replicas. This just makes the diagrams simpler; it doesn't change the behaviour. In this case we have a three-node cluster, so the database has a replica on every node in the cluster.

Behaviour in a healthy cluster

To start with, in a healthy cluster, a read or write proceeds like this:

  1. Client request arrives at the cluster's load balancer.
  2. The load balancer selects a node in the database cluster to forward the request to. This can be any node. The selected node is called the coordinator for this request.
  3. Each node maintains a shard map which tells the node where to find the shard replicas for a given database, and what documents are found in each shard. The coordinator uses its shard map to work out which nodes hold the documents for the request.
  4. The coordinator makes a request to each node responsible for holding the data required by the request. This might be a read request, write request, view request etc. Note: if the coordinator holds a required shard replica, one of these requests is to itself.
  5. The coordinator waits for a given number of responses to arrive from the nodes it contacted. The number of responses it waits for differs based on the type of request. Reads and writes wait for two responses. A view read only waits for one. Let's call the number needed R (required responses). A few requests allow this to be customised using parameters on the request, but the defaults are almost always most appropriate.
  6. The coordinator processes the responses from the nodes it contacted. This can be a varying amount of work. For reads and writes, it just has to compare the responses to see if they are the same before returning a result to the client. For a view request, it might have to perform a final reduce step on the results returned from the nodes.
  7. Finally, the coordinator passes the response back to the load balancer which passes the response back to the client.

In this article, we're interested in what happens when R responses are not received in step (5). CouchDB indicates this via a response's HTTP status code for some requests. For other requests, it's not currently possible to tell whether R responses were received.

Behaviour in a partitioned cluster

The behaviour of the cluster is similar whether it's partitioned or a node has been taken offline. A partitioned cluster just makes life easier for me because I can illustrate more scenarios with a single diagram.

Our scenario is:

  • A cluster with three nodes which is split into two partitions, A and B.
  • Partition A contains two nodes, and therefore two replicas of the data.
  • Partition B contains one node, and therefore one replica of the data.
  • For whatever reason, the cluster's load balancer can talk to both partitions, but the nodes in each partition are isolated from each other.
  • We assume the nodes know that the partition has occurred; what happens when they don't is covered below.
  • We further assume that the nodes on each side of the partition operate correctly. In particular, that they're able to reply promptly to the coordinator's requests. The reason for this will become clear later.

From the above description of the read and write path, it should be clear that in this scenario some reads and writes will be allocated to coordinator nodes in Partition A and some to the node in Partition B.

Reads and writes to Partition A

As noted above, the default value for R is 2 for reads and writes to the database. For searches, view reads and Cloudant Query requests R is implicitly 1. It should be clear that reads and writes to Partition A will always be able to receive at least R responses. The responses to the client will be as follows:

  • A read or query will return a 200 HTTP code along with the data.
  • A write will return a 201 Created HTTP code to indicate it wrote the data to R (or more) replicas.

Reads and writes to Partition B

Partition B is more interesting. It's clear we can still meet the implicit R for searches, view reads and Cloudant Query requests. However, the nodes in Partition B cannot meet the two responses required for R for document read and write operations. So what do we do? In essence, the available node will do its best to read and write the data. Specifically, it will read from and write to its own replica of the data. Its returned HTTP status codes and body data are:

  • A read will still return 200 along with the latest version of the document or view/search/query data the node has (read from the single replica). Currently there is no way to tell that fewer than R replies were received from nodes holding replicas of shards.
  • A write will return a 202 Accepted to indicate that the coordinator received a reply from at least one replica (i.e., the one in Partition B, itself) but fewer than R replies.
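
A client can use that distinction to spot degraded writes. Here's a minimal sketch, assuming a cluster (or its load balancer) at localhost:5984, the made-up database named orders from above, and Node.js 18+ with authentication omitted for brevity:

// Write a document and report whether the write reached a quorum of replicas.
async function writeDoc(doc) {
  const resp = await fetch('http://localhost:5984/orders', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(doc)
  });

  if (resp.status === 201) {
    console.log('write acknowledged by at least R replicas');
  } else if (resp.status === 202) {
    console.log('write accepted by fewer than R replicas; replication should catch up later');
  }
  return resp.json(); // { ok: true, id: ..., rev: ... } for both 201 and 202
}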

A note on responses

In the above, it's easy to overlook the fact that R is all about replies to the coordinator's requests. For writes in particular, this has some ramifications that it's important to take note of.

  1. A 202 could be received in a non-partitioned cluster.

    A replica might receive the write request from the coordinator node and write the data, but for some reason not respond in a timely manner. Responses to a client therefore guarantee only a lower bound on the number of writes which occurred. While there may have been R (or more) successful writes, the coordinator can only know about those for which it received a response, and so it must return 202.

  2. Interestingly, this also means writes may happen where the coordinator receives no responses at all. The client receives a 500 response with a reason: timeout JSON body, but the cluster has still written the data. Or perhaps it didn't -- neither the coordinator nor, therefore, the client can know.

  3. For reads, the behaviour differs slightly when the coordinator knows that it cannot possibly receive R responses. In that case, it returns when it has received as many as it is able to -- in Partition A this is two, in Partition B this is one. If, instead, nodes are just slow to respond and the coordinator doesn't receive R responses before its timeout, it will return a 500. See this mailing list thread.

This all illustrates that it's essential to remember that R is all about responses received by the coordinator, and not about, for example, whether data was written to a given number of replicas. This is why it was called out in the description of the partitioned cluster that nodes reply promptly.
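
In practice, then, a client should treat a 500 with reason: timeout on a write as "outcome unknown" rather than "failed". One hedged way of coping is to read the document back and check whether the change is visible before retrying; a sketch, reusing the made-up database from above:

// After a timed-out write, check whether our change is visible before retrying.
// Comparing a field we tried to set avoids guessing at the revision the cluster
// would have assigned had the write succeeded.
async function didWriteLand(docId, field, expectedValue) {
  const resp = await fetch(`http://localhost:5984/orders/${docId}`);
  if (!resp.ok) return false;           // can't read it back, so assume it needs retrying
  const doc = await resp.json();
  return doc[field] === expectedValue;  // true: the write made it to at least one replica
}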

Concurrent modifications

It's clear from the above that, for as long as the partition lasts, CouchDB 2.0 will accept writes to the same document on both sides of it. This means the data on each side of the partition will diverge.

For writes to different documents, all will be well. When the partition heals, nodes will work out amongst themselves which updates they are missing and update themselves accordingly. Soon all nodes in the cluster will be consistent with each other, and fully up-to-date.

For changes made to the same document on both sides of the partition, the nodes will reconcile divergent copies of a document to create a conflicted document, in exactly the same manner as happens when remote CouchDB instances replicate with each other. Again, all nodes will become consistent; it's just some documents are conflicted in that consistent state. It is the user's responsibility to detect and fix these conflicts.

There is a brief discussion of resolving conflicted documents in the CouchDB docs. In essence, the cluster will continue to operate, returning one of the versions of the document that was written during the partition. The user must provide the logic for merging the changes made separately during the partition.
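
Detection is the easy part: ask for the _conflicts field when reading a document. Resolution usually means writing a merged winner and then deleting the losing revisions; the merge itself is the application-specific bit. A rough sketch, again against the made-up database used earlier:

// Read a document together with any conflicting revisions the cluster holds.
async function getWithConflicts(docId) {
  const resp = await fetch(`http://localhost:5984/orders/${docId}?conflicts=true`);
  return resp.json(); // the winning revision, plus a _conflicts array of losing revs
}

// Resolve by keeping the current winner and deleting each losing revision.
async function deleteLosingRevisions(docId, conflictRevs) {
  for (const rev of conflictRevs) {
    await fetch(`http://localhost:5984/orders/${docId}?rev=${rev}`, { method: 'DELETE' });
  }
}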


This short discussion shows that CouchDB 2.0's clustering behaviour favours availability of data for both reading and writing. It also describes the coordinating node's behaviour in the face of partitions and how it indicates this to the client application, in particular how and why it is related purely to the number of responses received during operations.

It's important to understand this, and what happens as data diverges on the different sides of a network partition -- and how to fix the results of this divergence -- in order to create well-behaved applications. However, this behaviour does allow the building of extremely robust programs with the careful application of knowledge and conflict handling code.

More numbers on adverts vs. content

The New York Times has more information on the often awful balance between editorial content and adverts in The Cost of Mobile Ads on 50 News Websites.

The difference was easy to spot: many websites loaded faster and felt easier to use. Data is also expensive. We estimated that on an average American cell data plan, each megabyte downloaded over a cell network costs about a penny. Visiting the home page of […] every day for a month would cost the equivalent of about $9.50 in data usage just for the ads.

Content blocking

About three weeks ago I gave in: I turned on Firefox's Tracking Protection feature. Last week I installed 1Blocker on my iPhone. Until now, I'd avoided ad- and tracker-blocking software. I felt uncomfortable hiding that which provided sites' revenues. Looking under the hood at sites I regularly visit, however, I realise now that I've been a fool to hold out for so long.

I have two aims with both Tracking Protection and 1Blocker:

  1. To avoid being tracked online. Almost always this involves blocking adverts; it also covers Facebook/Twitter/Google Like-type buttons.
  2. To simply make the web faster.

The main argument put forward against using blocking software, and the reason I've held out for so long, is that as a by-product of improving your web experience you remove revenue from the sites you visit when you block the trackers and the adverts which, purportedly, fund them.

The story goes: by reading a webpage, you've agreed to the means the publisher has decided to use to derive money from that page. I've come to see that this is specious: the publisher's advertising and tracking software has executed long before I've read the article or even considered whether it's worth providing my data in order to enjoy it. If I have no choice, there is no agreement. Implied consent by visiting a URL isn't valid: it's not legal to make me agree to a contract whose content I have not been shown.

The first reason leads directly and obviously to the second. The images, tracking pixels, JavaScript files, movies and other cruft employed to advertise and track have swelled to epic proportions.

The savings speak for themselves. Here are the number of requests and amount of data downloaded for two popular news sites, with Tracking Protection turned on and off:

Site        HTML size   TP On                  TP Off                  Requests saved   Data saved
Economist   84kB        150 requests; 6MB      580 requests; 13MB      74%              53%
Guardian    81kB        29 requests; 1.6MB     191 requests; 6MB       85%              73%

In addition, with Tracking Protection turned off, both sites send ten or so requests per minute to various tracking services, presumably to gauge "engagement with content".

On a wifi connection on a laptop, the extra work doesn't matter too much. On a phone, however, the extra data and energy used are significant. It's worth fighting back to gain longer battery life and to reclaim some of my tiny 500MB data plan.

Many pundits have suggested that Apple has some ulterior motive to "kill Google" or other such rubbish. To me it's clear that content blocking is aimed at improving battery life and increasing user satisfaction of iPhones and iPads. I suspect Apple would have shipped blocking by default but for the need to avoid being pilloried by publishers and advertisers.

Long ago, publishers made a Faustian bargain with advertisers. Unlike Faust's, the rewards ended up being fairly scant. Now both advertisers and publishers are reaping their rewards: this has gone too far, and we readers are fighting back the only way we can.

I used to avoid ad-blocking software. I have changed my mind. I now realise the implied agreement readers made with publishers was torn up by publishers and advertisers long ago, if it ever existed at all. For readers, there are few cons to blocking tracking. Let the publishers and their advertisers find their way out of this tarpit of their own making.

I liken this to the publishers' notion of implied agreement above: if, in their view, I consent to advertising merely by visiting a page, then I will take their sending data to me as implied consent for me to block their tracking, advertising and other ne'er-do-wells.

I suggest you do too.