23 Nov
A somewhat non-technical explanation behind the creation of RoboHash


With hundreds of millions of variations, Robohash.org is among the leading robot-based hashing tools on the web.

About

A few months ago, I released a new website that turns anything you type into Robots.
I realized I never mentioned it here, so I figured now was as good a time as any to explain how it worked.
First off, the code is available, so you’re free to take a look yourself.

Genesis

I’ve been working on a new type of forum that’s designed to make it easier for people around the world to communicate, share ideas, and talk to one another, regardless of the potential hostility of their regime. I was inspired by the crackdown in Libya, the rise of the Arab Spring, and the later Occupy movement to try to make communication more reliable, and harder to suppress.

In the process of making my new forum, I realized that I needed a way to identify people. Regular usernames wouldn’t work, since I needed them to be unique, but work without any central infrastructure. “Ah-Ha!”, I thought to myself. “This isn’t a hard problem! Use Public Keys!”
Cool. It works, and everyone is unique everywhere… The only problem is that if everyone is using a public key, rather than a username, how would you recognize them?

I added code to let people add in a username as well, but we’re back to the uniqueness problem. How do you know that “E1ven” in Message1 is the same as “E1ven” in Message2?

I added various tricks, such as displaying the public key on mouseover, and displaying the first part of it after the username “E1ven?AAASDASDAS…” but I knew it wasn’t going to work. It’s just too hard to keep track of public keys.


I recalled reading about some research a few years back that might help..

Faces

It turns out that people are really crappy at remembering Secret Phrases.
Outside of spies during the Cold War, the only time people really have to memorize Secret Phrases is to log into their computer systems or ATMs.
We can DO this, but we’re not very good at it, the research implied.

Instead, they developed a system that optimized for what people ARE good at – Recognition. In Particular, Recognition of faces.
This is something we do every day- When we see a co-worker, we instantly recall who they are, in a billionth of a second.
Even people we don’t see all that often are easy to recognize on sight, even if we couldn’t picture them from memory.
If you ran into your Hairdresser at the Supermarket, you’d recognize it was her, even if you couldn’t think of her name right that second.

One system I thought was kinda cool actually took this idea and made it into a Password system-
You’d pick the faces you recognized out of a picture of a crowd of people, and it would know you’re you.
(There was an example site using this style of login.)

I realized I could use this same technique to help people recognize a poster to the forum!
Rather than making people memorize a long annoying public key, they could see a picture, say “That’s the same guy”, or see one that didn’t match, and know it was an imposter.
It didn’t need to be perfect, just a gut-check “Hey, I don’t think this is the same guy.”
They could always still compare the public keys, or go on to more advanced techniques, after they suspected a problem..

The problem is, if I randomly assign people’s Public key to a face, say, out of some sort of hypothetical face-picture-database, people will associate the poster with the model for the picture.
An African dissident might feel awkward, or be disinclined to use the service, if his avatar rendered to a girl with pink hair and a ponytail.

Design

I needed something recognizable, but still generic. Something like Robots or Monsters.

I realized that if the design were simple enough, I could assemble the pieces, like you would if you were making a cartoon.
Take mouth A, put onto body B.

I could re-use the various pieces, and combine them in a way that was quick to assemble, but still recognizable.

I talked with a number of cartoonists. Some of them had a hard time with the concept, thinking the creatures would all be separate, and that I’d just have the forum choose from one of 27 pictures.
Others were totally in line with what I was looking for, but cost much more than I could afford for a minor feature on a free forum.

Eventually, I went to 99designs.

After spelling everything out, and getting a number of high-quality submissions, I was able to narrow it down to three finalists- Zikri Kader, Hrvoje Novakovic, and Julian Peter Arias.
Each of them had Brilliant designs, full of life, with characterization and expression.
I worked with each of them to narrow things down, tweak the designs, and build expandable characters.
They were great.
At the end of the day, I had to choose one to win the contest.. I hated doing so, though, since I had been working with all three over email for days, asking them to tweak things to work for the site.
It’s something I hate about 99designs in general. As well as it works for me as a client, it’s rough on the artists.

I wanted to do better. I talked to the artists outside of the site, and was able to negotiate a deal where I’d pay them a secondary prize, in exchange for them giving me a license to use the images.



This gave me a Wealth of Robots.. I certainly didn’t need that many, but now I had to do SOMETHING with them.

Assembly

I spent a day lining up the components- The nose should always be at these XY coords, and the mouth should always be at these..
I exported each component out into a separate PNG file lined up for assembly.

It was a pain, and rather manual, but I only had to do it once. Per set.

After I had each component exported, I opened up Tornado.. Tornado was the Python web framework I had used for my Lonava project, so I knew the basics of making it dance.

I decided to make a quick web service that would take in a string, such as the public key I needed, and spit out a robot.

I’d make it so it gave you the image directly through the HTTP request; that way I didn’t have to do any fancy coding at the forum level, just include a regular <img> tag.
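
Here’s a minimal sketch of the idea in Tornado (this is not the actual RoboHash code, and assemble_robot() is a hypothetical stand-in for the assembly pipeline described below):

```python
# A sketch of serving an image straight from an HTTP request.
import tornado.ioloop
import tornado.web

def assemble_robot(text):
    """Hypothetical stand-in: return PNG bytes for `text`."""
    raise NotImplementedError

class RobotHandler(tornado.web.RequestHandler):
    def get(self, text):
        # Respond with raw PNG bytes, so a plain <img> tag just works.
        self.set_header("Content-Type", "image/png")
        self.write(assemble_robot(text))

if __name__ == "__main__":
    app = tornado.web.Application([(r"/(.+)", RobotHandler)])
    app.listen(8888)
    tornado.ioloop.IOLoop.current().start()
```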

The next step was to figure out what Roboparts to use.
I wanted each Robotic part to be seemingly random, but to give the SAME seemingly random value each time.

That meant, if I made a robot for the word “ANDROID”, it would always make the SAME robot for that word, no matter how many times I ran it.. That was essential, to ensure that I could use them to identify the Public Keys.

So to do this, I turn the word into numbers, and then use those numbers to pick the robot pieces.
Imagine that we assigned a number to each letter, where A=1, B=2, C=3, etc..
The first letter from the word “ANDROID” (an A) would turn into a 1, the second (N) would be 14, the third (D) would turn into a 4, etc.

Then, we use this number to choose which Robot Part to use.
Each letter basically chooses one part.

So if I have 6 Robot eyes to choose from, and our first number is 1 (because it starts with an A), we choose the first set of eyes in the set.


For the body, we might use the second letter.
The mouth would be the third, and so on.
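
As a toy sketch of that letter scheme (the part counts here are made up for illustration):

```python
# The toy scheme from the text: A=1, B=2, etc., and each letter's
# number picks one part from its category.
word = "ANDROID"
numbers = [ord(c) - ord("A") + 1 for c in word]  # [1, 14, 4, 18, 15, 9, 4]

eyes  = (numbers[0] - 1) % 6    # first letter picks from 6 sets of eyes
body  = (numbers[1] - 1) % 10   # second letter picks the body
mouth = (numbers[2] - 1) % 10   # third letter picks the mouth
print(eyes, body, mouth)        # always the same output for "ANDROID"
```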

Now, for the code on the site, I don’t use the letters directly like that, because then everything that started with the same letters would look the same.

Instead, I pull bits out of its SHA-512 hash. This ensures that similar words don’t necessarily look anything like each other.
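
A hedged sketch of that selection step (the category names and part counts here are illustrative, not the site’s real sets):

```python
# Deterministic part selection from a SHA-512 digest.
import hashlib

PART_COUNTS = [("color", 10), ("body", 10), ("face", 10),
               ("eyes", 10), ("mouth", 10), ("accessory", 10)]

def pick_parts(text):
    digest = hashlib.sha512(text.encode("utf-8")).digest()
    # Each category consumes one byte of the digest; the same text
    # always yields the same digest, hence the same robot.
    return {name: digest[i] % count
            for i, (name, count) in enumerate(PART_COUNTS)}

print(pick_parts("ANDROID"))  # identical on every run, on every machine
```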

I had Python squish all the images together, and spit out a single unified image with all the components.

Squishing them together, essentially emulating the “merge layers” command in Photoshop, was easy, since I had spent the time to make each picture a single component, rendered against a transparent background.
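
In Pillow terms, the squishing step looks roughly like this (the filenames are hypothetical):

```python
# A sketch of the "merge layers" step: each part is a transparent PNG
# of identical dimensions, pasted in order onto the one below it.
from PIL import Image

def merge_layers(paths):
    base = Image.open(paths[0]).convert("RGBA")
    for path in paths[1:]:
        layer = Image.open(path).convert("RGBA")
        base = Image.alpha_composite(base, layer)  # order matters: body first
    return base

robot = merge_layers(["body_10.png", "face_07.png", "eyes_09.png",
                      "mouth_03.png", "accessory_02.png"])
robot.save("robot.png")
```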

Let’s walk through the assembly-
We start with a color, which is determined the same way we determined what eyepiece to use above..

In this case, the color Blue.
This will determine my Robot’s main color.

The next step is to choose a background, if one is requested.

Our seemingly-random value for the background is 5, so we get Background 5.


Then, I choose a Body. This time, I have Body 10.
The body has to go before all the other body parts, since other things go on top of it, covering up the top of the neck, for instance.
Next up is the face. Again, this has to go earlier than the other pieces, since the eyes, mouth, and accessory piece go on top of it.

Next up, a random pair of eyes.

Now, a mouth

Finally, an accessory piece, to accent the robot. In this case, it happens to be a nose. But it could be an antenna, a hat, or anything else.


If I then paste these together, I get a fully assembled Robot.


RoboHash.org

It worked!! I could generate any number of “random” robots, based on whatever text I wanted!

I added some more tweaks, such as being able to pass in a specific size, or request different sets of Robots. I was thrilled. Finally!

At the same time, it seemed like a waste to keep my new little hack to myself.
It had been fun to write, and more, the Robots were cool. I wanted to share them!

I put together a quick page, demonstrating some of the features of the generator.
After posting it online and talking to some people, I realized that most people weren’t going to type in a URL, as it advised. I quickly added a JS loader, which would swap out the main image.
It worked.. mainly. I’m not a good coder, but it worked well enough for something thrown together in a hurry.

The whole idea was silly. Robots? Really?
I decided to embrace that, and just run with it. The site was just to share my hack, so I didn’t care if it looked ‘unprofessional’.

Flavor
I threw in a bunch of bits of silliness, such as random quotes from famous robot celebrities..
Each demo robot would have a Random quote (Many asking for help!) if you hovered over them.
The source to the page would have hidden Robots in it.

I had so much fun making it, I asked one of the artists to make a robotic “scroll” arrow, encouraging people to keep going down to see the rest of the silliness.

Aftermath


After the unveiling, I found I had a problem.
When I wrote the script to pull in all the components and assemble it, I didn’t think about the order it was doing it in.
It was always the same, so I assumed it was alphabetical. It wasn’t.

It was importing the images in the order of their inodes in their directory. Crap!

This worked fine for the one server it ran on, but when people started running the code on their own machines, or when I set up a second machine to handle load, it was generating different Robots!

I had included a unique number at the end of each filename, which I had believed would be enough.. But since the imports happened in inode order, rather than filename order, this number was just random cruft..

I couldn’t just remove it, though, since the hashes were partially based on the filenames!
And simply fixing the code to sort alphabetically would have broken all the existing hashes. No good at all. A hash is supposed to always stay the same!
Eventually I found that in unix, I could do an “ls -lU” to list the files in native-filesystem order. I renamed the files so that the filesystem order matched the alphabetical order. It was ugly, but it worked.
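
The moral, in one line of Python: never trust the directory’s native ordering. Sorting explicitly makes the list identical on every machine.

```python
import os

def list_parts(directory):
    # Alphabetical everywhere, regardless of inode order.
    return sorted(os.listdir(directory))
```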


Gravatar

A few weeks after release, James Devlin wrote about RoboHash, and devised a method to allow a site to use RoboHashes for generic users, but override it with a Gravatar if someone had set one.

It was really clever, but I thought I might be able to do a bit better, if I baked it in.

I ripped open the code, and added support for accepting a RoboHash request, with a Gravatar hash in it, such as

http://robohash.org/620050a4db5104bae758cd75171d64ca?gravatar=hashed

This would make a request to Gravatar– If it found something for that hash, it would use it. If not, it’d RoboHash the hash, and display that.

By doing this directly on RoboHash.org, I could parse and include all the standard commands, such as setting the size. I could also cache it, and serve it out of the same CDN as everything else.
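
The fallback logic, sketched (this leans on Gravatar’s documented “d=404” convention, where the avatar URL returns HTTP 404 if no image is set; assemble_robot() is the hypothetical pipeline from earlier):

```python
import urllib.error
import urllib.request

def assemble_robot(text):
    """Hypothetical stand-in for the robot pipeline sketched earlier."""
    raise NotImplementedError

def avatar_or_robot(gravatar_hash, size=300):
    url = ("https://www.gravatar.com/avatar/%s?s=%d&d=404"
           % (gravatar_hash, size))
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.read()                    # a Gravatar exists; serve it
    except urllib.error.HTTPError:
        return assemble_robot(gravatar_hash)      # none set; RoboHash it
```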


Conclusion

I’ve been really happy with the way the project turned out.
A number of different blogs are using it for all their commenters, someone made an iPhone app, and I’ve seen them show up in various online applications.

It’s great to see people using them, and the site is cheap enough to run that it’s not a burden to do so.
Anyone is welcome to use the robots, all I ask is they link back to Robohash.org someplace in their About screen.

Robohash is a cool little project. I’m glad to have made it, and I’m excited to see people using it.

Now, back to coding the real side projects 😉

7 Nov

My Experiences with MongoDB in production over the last year


First off, a disclaimer. I’m writing this article in my off hours, on my personal machine.
I don’t speak for my employer, although my use on the job as well as several projects at home do flavor my impressions.

I’ve been using MongoDB in production for various projects for the last year and a half, and found that there are some things it does incredibly well, and other areas where it tends to fall flat on its face.

I’ve seen a lot of people writing up their Mongo experiences lately, and I’d had this document sitting in Drafts for a while, so I thought I’d share my own thoughts on the DB.

Mongo is Fast

First off, Mongo is Fast. Like, seriously, screamingly, How-is-that-possible fast.
When you have a workflow that maps well to what Mongo prefers, Mongo is substantially faster than Couch, Riak, or Postgres.

The key, though, is ensuring your workflow maps to what MongoDB is good at.

Lots of people have written about Mongo’s Write Lock, but I think one point has become somewhat lost in the noise– When MongoDB is writing, the entire DB is locked. No writes can be performed, and no reads can be performed either.
This is normally fine, since writes are very, very fast. In fact, they’ve gotten much faster in Mongo 2.0, since the DB now yields the lock when it sends the command to write to disk, rather than waiting for the disk to return.

The fact remains, however, that when you have a lot of writes, your DB will halt, and reads will block.
Again, to be very, very clear- when MongoDB is writing something, ALL reads to ALL collections on ALL slaves will be blocked.

In the future, 10gen hopes to make this collection-level, which is a start, but other databases such as Couch have versioned objects, which means that even individual objects aren’t locked!

Replica Sets are straightforward

One way to try to solve this is to use Read Slaves.
Replication in Mongo is dead-simple to set up; a competent admin can have a working 3-node replica set in an hour or two.

The intended deployment for MongoDB is that in production, you will always run multiple servers, and thus, always have multiple copies of your data.
This is why, until 1.8, any single server might become corrupt if shut down uncleanly.
This wasn’t considered a problem– If you’re only running on one server, it was argued, you were doing it wrong!

Essentially Mongo was designed to have Cluster-level safety, rather than server-level safety.

If a machine goes down, no worries, one of the other cluster members will take over for it.

Having multiple read slaves, and using SlaveOK queries means that you can redirect your queries onto these read-only slave replicas, even while the master is blocked by a write.
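
In PyMongo, routing reads that way looks roughly like this (modern read-preference spelling; the 2011-era driver used a slave_okay flag instead, and the host names are illustrative):

```python
from pymongo import MongoClient, ReadPreference

client = MongoClient(
    "mongodb://mongo1.example.com,mongo2.example.com,mongo3.example.com",
    replicaSet="rs0",
)

# Reads from this handle may hit a secondary, so they can proceed even
# while the primary is blocked by a write. Don't do this for anything
# that can't tolerate replication lag (session data, for instance).
posts = client.forum.get_collection(
    "posts", read_preference=ReadPreference.SECONDARY_PREFERRED
)
recent = posts.find().sort("posted_at", -1).limit(10)
```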

This generally works well enough, but there are some caveats you should keep in mind:

1) Not all drivers route SlaveOK automatically.
Because Mongo puts a certain degree of logic into the client drivers, you need to be careful to ensure that the driver for your language-of-choice supports the mongo features you want to use.
In some cases you might find that while the core mongo server supports a feature, such as SlaveOK queries, your particular language-driver might not. YMMV.

2) If replication backs up, SlaveOK queries might have bad/old data.
This ought to be obvious, but it’s important to keep in mind. If you store Session data in mongo, make sure you don’t use SlaveOK queries to retrieve it, or logins will mysteriously fail when replication falls behind.

3) Replicas get blocked on writes also!
This was not immediately obvious to me, but it’s important to note.
Writes to MongoDB block reads. This is due to the aforementioned Global Lock.
Less intuitively, reads on the Slaves are also blocked during writes.
This is because during each write, mongo will replicate the write to each slave.
While this write is being written to the slave, as part of normal replication, the slave will be blocked.

There aren’t any good workarounds here. 10Gen is working to make the Global Lock suck less, but for now, it’s pain on a stick.

Sharding is expensive

If you ask 10Gen about these problems, they’ll patiently explain that they don’t intend for ReplicaSets to be a speed boost, they’re intended to provide data safety.

At the end of the day, if you want to get past the blocking writes, you need to shard your data. Note, however, that this is not a panacea– If you have 12 Shards, and you send in a blocking write, 1/12th of your DB will still be blocked!

So here’s my rant about Mongo Sharding. I’m sure if 10Gen reads this, they will say it functions as designed, and I recognize that the way they did it is a valid tradeoff; I just don’t personally agree. Take that for what it is.

Sharding properly requires too many servers.

If you have a database that is getting to be more write-heavy than Mongo can easily keep up with, and you want to divide it up into Shards, you need a lot of machines. Probably 3X as many machines as you think you need.

The reason for this is that each shard needs its own replica set; they don’t share, and you can’t re-use.

That means if you want to divide your DB into 10 partitions, you will need at least 30 servers.

  • Replicaset1
    • mongo1
    • mongo2
    • mongo3
  • Replicaset2
    • mongo1
    • mongo2
    • mongo3
  • Replicaset3
    • mongo1
    • mongo2
    • mongo3
  • …
  • Replicaset10
    • mongo1
    • mongo2
    • mongo3

You’re also supposed to have 3 config servers for the cluster, although these are lightweight enough that you can probably share machines.

None of this is unbearable, but it’s something to keep in mind in your design configurations.
If you’re writing a lot, even VERY SMALL writes, you’re going to need a lot of servers.

To be fair, it’s also possible to try to offset this by running multiple Mongo Instances on the same HW.
Beyond adding mental and config overhead (which is bearable), since you’re generally sharding in situations where you’re write-limited, this won’t help as much as you might hope.
Essentially, you’re sharing I/O, including replication to all the different servers, across fewer spindles.

What I’d really like to see is a more RAID-6-style deployment, wherein each node added to a MongoDB cluster held 1/nth parity data, as well as its own data.
This allows you to distribute requests to each node in the deployment, and still survive if a server fails.

This is conceptually more similar to how Riak handles adding nodes. Each node you add to a Riak cluster adds both redundancy AND speed.

Development is easy

Finally, Developing against MongoDB is about as straightforward as I could possibly ask for.
Getting it running on my Macbook was trivial, and development is fast, with little learning curve.

In Mongo, I can insert an arbitrary JSON document, such as inserting all email received from the outside world. After this, I can query on any element of the JSON.

This makes it VERY simple to pull all the emails sent from Thunderbird, or any message received in a date range..
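
Concretely, the pattern looks like this (field names here are hypothetical, and the calls use modern PyMongo spelling):

```python
import datetime
from pymongo import MongoClient

emails = MongoClient().maildb.emails

# Insert the raw document as-is; no schema needed for the odd headers.
emails.insert_one({
    "sender": "alice@example.com",
    "headers": {"User-Agent": "Thunderbird"},   # arbitrary, unplanned keys
    "received": datetime.datetime(2011, 10, 15),
})

# All mail sent from Thunderbird (dotted paths reach nested fields):
from_thunderbird = emails.find({"headers.User-Agent": "Thunderbird"})

# Any message received in a date range:
in_october = emails.find({"received": {
    "$gte": datetime.datetime(2011, 10, 1),
    "$lt": datetime.datetime(2011, 11, 1),
}})
```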

I could do these same things with SQL, but I’d generally need to fit everything into a defined schema; for things like email, or HTML pages, which might have any number of arbitrary headers, a schema-less design is a godsend.

Another thing I really enjoyed is that I can query in familiar styles. For example, to retrieve all messages from a certain sender, I could use a built-in JSON query format, which, if you squint hard enough, isn’t THAT different from SQL.
Contrast this with my impression of Couch/Riak, where I believe I need to use MapReduce for everything. (This was my impression from going through basic tutorials and developing mini toy apps; I accept I could be way off.)

I like MapReduce for a certain class of problem, but sometimes it’s nice to be able to reach for a simple query, and have it just work. According to 10Gen, these queries are far and away the most popular form of requesting data from MongoDB, and they are going to be adding new forms of queries (such as Group By) in upcoming versions.

Support is available

I have to say that 10Gen has always been very helpful in providing support.
In addition, their new MMS tool is very helpful in seeing where things are having trouble.

Whether through paid (Tickets) or free (IRC/Mailing List) means, they’ve almost always come through with solid answers to almost any problem that comes up.
I have a lot of respect for the company for bringing in senior level people on a regular basis to help answer problems, rather than throwing us to people working off a script.

The only negative I have to say on the support is that too often, the answer is “It’s fixed in the next version”.

This seems to be the default answer to any problem that comes up..

  • “MongoS instances are not routing to all mongoD servers”
    Fixed in the next version
  • “Our MongoD server failed with this error overnight”
    Fixed in the next version
  • “Our replica set isn’t sure who the master should be”
    Fixed in the next version

Don’t get me wrong, it’s great that things are being fixed!
Just know going in that if you’re using MongoDB as your primary DB, you’ll be on a constant upgrade treadmill.

10Gen is still deciding how much to backport, and when, but generally it seems like problems with known workarounds only get fixed in Trunk, and problems without usable workarounds will get a backport to one version back. Don’t expect to use the same version for a year; If you want to avoid major bugs, it’s just not going to happen.

Summary

In summary, I like MongoDB quite a bit, but like any software, it’s important to know its warts.
In a lot of ways, Mongo reminds me of MySQL on MyISAM a decade ago-
It’s Fast, has easy replication, and it’s used by everyone, so you can get help when you need it.
But if you’re not careful, it will blow up in unexpected ways.

If you understand the tradeoffs going in, and you have a relatively read-heavy DB with infrequent writes, MongoDB can be a godsend. It’s dramatically faster than the alternatives, and we’ve never had a problem resulting in data loss. But like any package, you do need to design with its limitations in mind.