Archive

Archive for January, 2010

MySQL and Binary(16) – The Reasons/Benefits/Drawbacks (#mysql)

January 31st, 2010 Wade No comments

I recently posted an article about using BINARY(16) for storing MD5’s as unique identifiers instead of simple integer ID’s (usually auto increment); in that article I touched on one of the benefits, reducing JOIN’s, but there are other reasons for doing it too, so I thought I’d post an article discussing purely the reasons behind using BINARY(16).

As I discussed in my previous article, an MD5 string is actually a hexadecimal number capable of storing values as large as 340,282,366,920,938,463,463,374,607,431,768,211,456. MySQL doesn’t have any efficient integer field for storing numbers this big so you have two choices for storage, a CHAR(32) or a BINARY(16). If you convert a hexadecimal MD5 into a unhexed character string, it will become 16 bytes rather than 32. MySQL handily has a feature built in for this called UNHEX.

So, why use binary(16) as a unique field for data storage? Databases like MySQL have superb functionality such as JOIN, allowing you to query one table and “join” the results of that query to another table. However, when you get to 10’s, 100’s or even 1000’s of millions of rows of data, JOIN’s become expensive, especially when the join only exists because you need an ID field from one table to query against on another. From tests at work, replacing a JOIN by using a binary(16) unique identifier has seen noticeable improvements to speed, noticeable here being human noticeable, not iterate it a million times and you’ll see 1.5 as opposed to 1.9 seconds noticeable.

The main benefits include:

  • Fast queries against any table where you know the formula that was used to create the MD5 binary(16) using human-readable English and no integers.
  • Complete disassociation of relational data values
  • Ability to use INSERT IGNORE to avoid duplicate data without having to use overly large indexes
  • More unique values than even a BIGINT.

The main drawbacks include:

  • 12 bytes more storage for the ID (INT is 4 bytes)
  • No auto-incrementation
  • Completely unreadable to humans when the data is in BINARY(16) form.

One thing I just mentioned was disassociation of relational data values. What does this mean exactly? Well it means exactly the same as what people do now with MySQL and unique integer ID’s to be honest! The difference here is you can query against it without those pesky JOIN’s a lot of the time. For example, say you are storing every town in the UK in a database and how they link together (i.e. if there is a direct route from one to another.) You’d have a table named towns probably, with a unique ID and the town name. You’d then have a separate table with 2 columns, both columns would store a town ID which would basically mean “this town has a direct route to this town.” If you were to use integers as the town’s unique ID, every time you wanted to get the town’s linked to said town, you’d have to query against the towns table first to get the town ID you want to get links to, then again to get the names of the towns that link to it.
If you were to use a binary(16) representation of the town you could scrap the first join, instead you could query by saying “get me any towns that link to UNHEX(MD5(‘Town Name’))”. You’d still have to do the second join to get the town names, but you’ve instantly dropped a JOIN and simplified the whole experience as you can now query more naturally.
Basically, all you’re doing is replacing any place in your database that is a string that is usually more than 16 characters in length with a binary(16) of it, then storing the strings elsewhere for when you actually need to read the output. This effectively gives you a look-up table that can contain any string whatsoever and a database that stores relationships of strings without requiring special tables and integers for every string.

As a note, a table with 100 million rows of data with two columns – BINARY(16), TEXT – to look-up the textual value of a binary(16) string takes 0.0019 seconds for us and having that table of text has meant we’ve severely de-duped our database as the data we store often is identical, even when the source is completely different. Even if we do a WHERE BINARY(16) IN (list,of,values), the time sticks at 0.0019 up to the maximum test I’ve done so far which is 100 MD5’s.

Categories: Programming Tags:

MySQL – Binary(16) and scalability

January 29th, 2010 Wade 1 comment

Over the past few months at work, we’ve seen our database grown from silly big to really silly big, it’s still a way to go to get to the size of the big boys such as Facebook etc. but it’s still a database stored in MySQL that most day-to-day PHP programmers would avoid like a midget cannibal.

One of the great things about using something like MySQL (and any other “real” database) is the ability to cross-query data, i.e. to grab data from one data-set (table) and join it to another data-set (table) to get a single set of results, either as a combination of the data or the result of an exclusion due to the join. *

However, as tables grow, the time taken to perform queries, particularly in the realm of joins, grows rather quickly. So for example take this query:

SELECT *
FROM table2
LEFT JOIN table1
    ON table1.columnB = table2.columnA
WHERE table1.columnC = 'John.Doe';

Let’s say table1 is a list of all employees in a small business and table2 is a list of their days off, so it’s a one-to-many relationship. Running the above query to get the days off for person 5 would be pretty quick and most developers would be happy with that, even if the columns weren’t indexed, the performance of that query (as it’s a small business – therefore small dataset) would be more than suitable for any real-world application.

Now imagine a table where rather than a couple of hundred rows, you have millions or (such as ours) billions of rows of data; as for why we have that much data, that’s for another topic. That join could could result in a rather painful execution time. The problem you’ve got is, you have to first query table1 to get the ID of user ‘John.Doe’ and then use that ID for table2 to get the actual data.

So how can you optimise this? Well you’ve got three choices, the first would be two queries, one to grab the users ID from table1, then the next to grabs the users data from table2; but that’s 2 queries now. In a lot of places that wouldn’t matter, but we want speed here and reduction of hits to MySQL. The second is have the users name in table2 for each day off – that’s duplicating data though and because (in this case) you’d have a string, it’s not the fastest lookup and creates rather large indexes when people’s usernames are quite long.

The third option? A unique hash associated with that user. In this case, MD5 the username and store it as binary(16). MD5 is, after all, a 128-bit number basically. Most people are used to seeing it as a 32 character string, e.g. 7ecb9bba8130abe56cfd9a8430ca969c. That is just a hexadecimal number though, albeit a very very big one – capable of storing the value 340,282,366,920,938,463,463,374,607,431,768,211,456, for those in the UK that’s 340 sextillion. MySQL Doesn’t really have a suitable INT type for storing a number that big so it’s best to either store it as a 32-byte string (hexadecimal MD5) or better yet, as a binary string of 16 characters.

So how does that change our query now?

SELECT *
FROM table2
WHERE table2.columnA = UNHEX(MD5('John.Doe'));

No more join and only one select. It means you can look up days off for any user simply by knowing the username. MySQL has UNHEX(MD5()) to md5 a string and convert to its binary equivalent. In PHP you’d use md5(’string’, true) or pack(‘H*’, md5(’string’));

In all honesty, this isn’t the best use of binary(16), but it’s a relatively simple example to follow. For us though, moving away from auto-incrementing ID’s towards binary hashes has allowed use to do blind inserts (insert ignore) and lightning fast selects where they used to take minutes or even hours. INSERT IGNORE has to be one of the biggest benefits we’ve seen. By setting the primary key to the BINARY(16) column, you can easily guarantee unique data without wasted extra index space and you only need to query that table when you actually need to data associated with that unique hash, the rest of the time, you can query other tables that relate to that hash without having to do a join.

* I would like to point out I am fully aware of people who store data without a dedicated database and use Map-Reduce due to the sheer size of it, however databases like MySQL allow a quick line of text to get the results you want, there’s no further effort involved.

Categories: Programming Tags:

Avatar is a must see film – at IMAX in 3D

January 13th, 2010 Wade No comments

On Tuesday night we went to see Avatar at the IMAX in Bradford’s National Museum of Film and Photography. I had semi high expectations for the film but had heard some bad press so how good it would be was left wide open. someone suggested we go to see it at the IMAX as it is vastly better, so that’s what we did.

AVATAR_TEASER_05-thumb-460x196The film was visually incredible, the attention to detail in the 3D work is undoubtedly the best I’ve seen. Pandora (the moon the Navi live on) is a world I wish I could experience first hand, from the vast luscious green trees to the plants that illuminate when they’re touched. I find it incredible just how much detail is in the scenery and how difficult it is to realise it is all CG. This is the kind of place I hope awaits us when we finally find a hospitable planet outside of our own.

avatar-8-thumb-460x253The Navi themselves are also a work of art, the method used to animate them means they move naturally, down to the slightest movement in the face.  Cameron’s team sought to go far beyond prior efforts, to ensure the complete reality of the characters. To do this, they developed a new “image-based facial performance capture” system, which used a head-rig camera to accurately record the smallest changes in the actors’ faces. Instead of using the motion capture technique of placing reflective markers on the actors’ faces to capture their expressions, the actors wore special headgear, to which a tiny camera was attached. The helmet/camera faced towards the actors’ faces and the camera recorded facial expression and muscle movements to a degree never before possible. Most importantly, the camera recorded eye movement, which had not been the case with prior systems.

Everyone needs to see this film at the cinema, and if you have an IMAX nearby, be sure to see it there.

Categories: Films Tags:

Federal Court to determine constitutional legality of gay marriage

January 12th, 2010 Wade No comments

It seems like it’s taken a long time for someone in the States to get up and do it, but someone seems to finally be ready to stand up in court and challenge the legality of banning same-sex marriage on a nationwide scale. The result of this trial should finally end the confusion the USA has had in it’s divided states. If it is voted unconstitutional to ban the marriage of two people of the same sex, any state that currently denies it will no longer be able to, effectively ending one of the biggest remaining prejudices in the States; this is, of course, if it is then argued and taken to the Supreme Court for one final showdown. If it’s voted constitutional, then my partner will be glad he’s in the UK now and doesn’t have to deal with a country that thinks it’s OK to treat people that way.

Proceedings opened on Monday with testimony from two plaintiffs in the case, Kristin Perry and Sandra Stier, who wed in California 2004 only to have their union later declared invalid.

Ms Stier said that being allowed to wed her partner would

provide me with a sense of inclusion in the social fabric of the society I live in.

I want our children to feel proud of us,” she told the court. “I don’t want them to worry about us.

Kristen Perry said:

I want it to happen to me. The state isn’t letting me feel happy.

Paul Katami and his partner Jeffrey Zarrillo described slights in gay life that ranged from being pelted with stones and eggs in college to the awkwardness of checking into a hotel and not being able to clarify the relationship.

Being able to call him my husband is so definitive,” Mr Katami said. “There is no subtlety to it. It is absolute.

Categories: News Tags:

Transparent Laptop? Samsung’s new 14 inch OLED display

January 10th, 2010 Wade No comments

We’ve all seen the movies where interactive displays are done on transparent glass, well Samsung, using OLED (organic light emitting diodes) have finally created one for us all to gawp over. Details are scarce, but it shows that yet another sci-fi idea is becoming mainstream reality. For now, here’s a video of the display, courtesy of Engadget.

Categories: Hardware, News Tags:

What a line up – and now Ivana Trump! Celebrity Big Brother

January 9th, 2010 Wade No comments

I’ve not watched much Celebrity Big Brother this time around, as usual it’s become extremely over the top and feels almost scripted (as in the housemates, obviously BB has a script!) One thing they have done this time around though is actually get a couple of people who can actually be classed as celebrities. I know, I know, they’re not switch on the TV and they’ll be on a TV show or film right now, but they’re people who’s names you recognise, I think most notably (until this weekend) was Basshunter, even people who find his music terrible know the name.

ivanaThen of course there is Sisqo, I’m sure a lot of people know him (Thong Song.)

Now, they have Ivana Trump in there. A name I’m sure most people have heard, even if they’re not sure where. For those of you wondering, think about Donald Trump, that’s a name that should help you trigger a memory.

For those of you who hate Big Brother, continue to avoid it. For those of you who used to like it but have wained, I’d consider checking a few episodes out of CBB this year, it’s nowhere near as good as those first 2 years, but it’s probably their best one since (although it could still go all down hill.)

Categories: Events, Random Stuff Tags:

Snowy Britain – NASA Terra image

January 7th, 2010 Wade No comments

NASA uploaded an image today of Great Britain from space as of 11:50 UTC today, it’s a truly remarkable picture showing just how much it has snowed here. For anyone not from the UK, snow like this has not hit Britain for over 30 years, hence all the hoo-haa about. It looks beautiful though, although weirdly sci-fi in nature. I must admit, as cold as I am on my way to work on the morning, I still love the fact everything is white, and it’s that perfect powder snow too, if I lived near mountains, it would be great for snowboarding!

GreatBritain.A2010007.1150.1kmYou can view the original page here.

Categories: News, Random Stuff Tags:

New Bank Holiday coming up (hopefully permanent)

January 6th, 2010 Wade No comments

Yes, that’s right, we’re getting an extra bank holiday in 2012 (I know, a little bit of a wait still) as a result of the Queen’s Diamond Jubilee. The bank holiday will be on the 5th of June and the Whitsun bank Holiday will be moved to the 4th of June to give a nice long 4 day weekend; I’m sure the parents will be pleased!

Lord Mandelson said it would be a “truly historic occasion” which would allow the British people to show their “pride and affection” for The Queen. I’ll be honest, I won’t be showing my pride and affection for the Queen, I’ll be enjoying the day off work, quite possibly in a pub. I’ll join in any cheer for the Queen should it so happen.

As always though, we can’t just get an extra holiday without a bit of controversy: Republic, which supports an elected head of state in the UK, said 2012 should see a “meaningful debate” on the future of the monarchy and proposals for “serious reform”.

If we want an extra public holiday, aren’t there better dates to choose from?

Armistice Day would be a good start, recognising those ordinary men and women who make real sacrifices for our country.

– spokesman Graham Smith

Categories: Holidays, News Tags:

US Toughens up on incoming planes from “terror-prone” countries

January 4th, 2010 Wade No comments

Well, it was good while it was getting better, now we’re back to square one again, I’m talking about airport security of course. After the attempted plane bombing attempt on Christmas Day, security has now been upped at international US airports again for any incoming flights deemed to be from “terror-prone” countries. The new rules came into effect today as Barack Obama has been under pressure to visibly improve security.

The Transportation Security Administration said in a statement the new rules apply to flyers coming from or via countries on the US state department’s state sponsors of terrorism list. This includes Cuba, Iran, Sudan, Syria, Afghanistan, Algeria, Iraq, Lebanon, Libya, Nigeria, Pakistan, Saudi Arabia, Somalia and Yemen. However, any passenger from any country could now be checked at random as well, so it’s entirely possible to be checked when flying directly from the UK.

I’m so tired with all of this now, the ridiculous security that never seems to work well, the constant wars that it seems we must be involved in (thanks America), the endless violence…it just seems to be getting worse with every passing month. What with the admittance made by MP’s here now that they new, along with the states, that the whole Iraq war was really about regime change, it just shows how ignorant and self righteous the West has become.

Categories: News Tags: