Archive

Author Archive

Regex-fu #PHPUK2010

February 26th, 2010 Wade No comments

Good start: don’t use it unless you need to, there’s plenty of alternatives, e.g. DOMXML, str_replace, etc. Also PHP5+ has lots of filters for email validation and URL validation etc, function calls you can make rather than complex regular expressions. Regular expressions can slow down quickly due to back tracking, pattern complexity and long strings.

Then the talk has become abstract, each point is prefixed with an odd statement such as “Only elephants remember everything” and “Not all matches are made in heaven” – people are getting it, but everything needs explaining before they get it!

One very good point I have seen ignored many times is “try not to be greedy.” For example /<(.+)>/ in the string <a href=”">fdsfsd</a> will match the entire thing. To make it ungreedy, either use /<(.+?)>/ or /<([^>]+)/ . Greedy matches can be 20+ times slower.

Categories: Programming Tags:

#PHPUK2010 Part 2 (MySQL stuff)

February 26th, 2010 Wade No comments

Just picked up a nice tid-bit on creating a unique index on a two column table where the values in each column may be either way around but you only ever want one instance of the value in that row. So what this means is, inserting 2,1 and 1,2 for example would result in only the first of the two inserts succeeding.

CREATE UNIQUE INDEX ON tablename (LEAST(col1,col2), GREATEST(col1,col2));

Also, WITH, I’ll be honest, never thought about using it to create temporary views. This is a bad example but shows the structure rather well:

WITH tempView (a,b) AS (
SELECT table1.col1, table2.col2
FROM table1
LEFT JOIN table2
ON table1.id=table2.id
)
SELECT a,b FROM tempView;

Better yet is changing this to WITH RECURSIVE tempView and then adding in a select inside the WITH that recalls tempView. The great example he gave is for getting flights from A to B with a varying  amount of stops, it would be possible to get all routes from A to B with one MySQL query, as long as the data stored all connecting routes.

Incidentally, while there is some great stuff coming out of this RDBMS talk, I think the queries are really hurting a lot of people’s heads. Good stuff though.

Categories: Programming Tags:

#PHPUK2010 Part 1

February 26th, 2010 Wade No comments

Josh began by using the dictionary definition of simplicity (as given by Wikipedia) pointing out that the word is often used as a derogatory statement. He then went onto “clarity of expression” and that striving for it while programming is something a lot of people do but never quite seem to achieve.

He spoke of an example where a user comes to a programmer asking for a report, and the usual first reaction is “ah, you need a reporting system.” He also said that’s not always the case, at the end of the day, the user just wanted a report, at this point I heard quite a few people take a breath in through their teeth (particularly the guy sitting to the right of me, he knows who he is.) That is a hard problem, particularly at Stickyeyes where we really do get a lot of people saying “I want a report” and often we have to build a system, simply because of the sheer amount of similar repetitious reports.

He made a very good point about developers having a tendency to go for the newest, shinyest tools (such as HipHop for PHP.) The reason for this is to point out that these tools exist because they solve a particular type of problem, so unless the tool actually helps you, do you really need to use it?

Categories: Programming Tags:

Photo of Ant Holding 500 times its own Bodyweight

February 21st, 2010 Wade No comments

_47340444_008791133-1This amazing picture of an ant holding its own bodyweight while upside down was taken by zoology specialist Dr Thomas Endlein of Cambridge University while researching creatures sticky feet.

This photo snatched the guy £700 in photographic vouchers from the Biotechnology and Biological Sciences Research Council.

His hope is studying the way ants feet self clean and change their size to support varying weights will help develop new adhesives.

“The pads on ants’ feet are self-cleaning and can stick to almost any type of surface,” he said.

“No man-made glue or adhesive system can match this. Understanding how animals can control their adhesive systems should help us come up with clever adhesives in the future.”

Categories: News, Science Tags:

Find Music by Humming – it really works! #midomi

February 1st, 2010 Wade No comments

We are sitting on the sofa at home at the moment discussing our holiday plans for the year and I got a song in my head that I started humming. Remembering the adverts on TV where you hold your phone up to a speaker and it tells you the song, I thought there must be one that you can hum to and it’ll find the song. So I did a quick google and found Midomi, I hummed the song (and I can’t really sing or even hum very well) and sure enough…it found the exact song, 2 results came back and it was the second! We’re both amazed by how well it worked.

So if you get a song stuck in your head and you can hum it, even if you’re quite bad at humming, give this website a go, chances are you’ll be amazed.

Incidentally, the song it found (which was even more impressive) was Kaoma – Lambada, which I’ve included below.

Categories: Random Stuff Tags:

MySQL and Binary(16) – The Reasons/Benefits/Drawbacks (#mysql)

January 31st, 2010 Wade No comments

I recently posted an article about using BINARY(16) for storing MD5’s as unique identifiers instead of simple integer ID’s (usually auto increment); in that article I touched on one of the benefits, reducing JOIN’s, but there are other reasons for doing it too, so I thought I’d post an article discussing purely the reasons behind using BINARY(16).

As I discussed in my previous article, an MD5 string is actually a hexadecimal number capable of storing values as large as 340,282,366,920,938,463,463,374,607,431,768,211,456. MySQL doesn’t have any efficient integer field for storing numbers this big so you have two choices for storage, a CHAR(32) or a BINARY(16). If you convert a hexadecimal MD5 into a unhexed character string, it will become 16 bytes rather than 32. MySQL handily has a feature built in for this called UNHEX.

So, why use binary(16) as a unique field for data storage? Databases like MySQL have superb functionality such as JOIN, allowing you to query one table and “join” the results of that query to another table. However, when you get to 10’s, 100’s or even 1000’s of millions of rows of data, JOIN’s become expensive, especially when the join only exists because you need an ID field from one table to query against on another. From tests at work, replacing a JOIN by using a binary(16) unique identifier has seen noticeable improvements to speed, noticeable here being human noticeable, not iterate it a million times and you’ll see 1.5 as opposed to 1.9 seconds noticeable.

The main benefits include:

  • Fast queries against any table where you know the formula that was used to create the MD5 binary(16) using human-readable English and no integers.
  • Complete disassociation of relational data values
  • Ability to use INSERT IGNORE to avoid duplicate data without having to use overly large indexes
  • More unique values than even a BIGINT.

The main drawbacks include:

  • 12 bytes more storage for the ID (INT is 4 bytes)
  • No auto-incrementation
  • Completely unreadable to humans when the data is in BINARY(16) form.

One thing I just mentioned was disassociation of relational data values. What does this mean exactly? Well it means exactly the same as what people do now with MySQL and unique integer ID’s to be honest! The difference here is you can query against it without those pesky JOIN’s a lot of the time. For example, say you are storing every town in the UK in a database and how they link together (i.e. if there is a direct route from one to another.) You’d have a table named towns probably, with a unique ID and the town name. You’d then have a separate table with 2 columns, both columns would store a town ID which would basically mean “this town has a direct route to this town.” If you were to use integers as the town’s unique ID, every time you wanted to get the town’s linked to said town, you’d have to query against the towns table first to get the town ID you want to get links to, then again to get the names of the towns that link to it.
If you were to use a binary(16) representation of the town you could scrap the first join, instead you could query by saying “get me any towns that link to UNHEX(MD5(’Town Name’))”. You’d still have to do the second join to get the town names, but you’ve instantly dropped a JOIN and simplified the whole experience as you can now query more naturally.
Basically, all you’re doing is replacing any place in your database that is a string that is usually more than 16 characters in length with a binary(16) of it, then storing the strings elsewhere for when you actually need to read the output. This effectively gives you a look-up table that can contain any string whatsoever and a database that stores relationships of strings without requiring special tables and integers for every string.

As a note, a table with 100 million rows of data with two columns – BINARY(16), TEXT – to look-up the textual value of a binary(16) string takes 0.0019 seconds for us and having that table of text has meant we’ve severely de-duped our database as the data we store often is identical, even when the source is completely different. Even if we do a WHERE BINARY(16) IN (list,of,values), the time sticks at 0.0019 up to the maximum test I’ve done so far which is 100 MD5’s.

Categories: Programming Tags:

MySQL – Binary(16) and scalability

January 29th, 2010 Wade No comments

Over the past few months at work, we’ve seen our database grown from silly big to really silly big, it’s still a way to go to get to the size of the big boys such as Facebook etc. but it’s still a database stored in MySQL that most day-to-day PHP programmers would avoid like a midget cannibal.

One of the great things about using something like MySQL (and any other “real” database) is the ability to cross-query data, i.e. to grab data from one data-set (table) and join it to another data-set (table) to get a single set of results, either as a combination of the data or the result of an exclusion due to the join. *

However, as tables grow, the time taken to perform queries, particularly in the realm of joins, grows rather quickly. So for example take this query:

SELECT *
FROM table2
LEFT JOIN table1
    ON table1.columnB = table2.columnA
WHERE table1.columnC = 'John.Doe';

Let’s say table1 is a list of all employees in a small business and table2 is a list of their days off, so it’s a one-to-many relationship. Running the above query to get the days off for person 5 would be pretty quick and most developers would be happy with that, even if the columns weren’t indexed, the performance of that query (as it’s a small business – therefore small dataset) would be more than suitable for any real-world application.

Now imagine a table where rather than a couple of hundred rows, you have millions or (such as ours) billions of rows of data; as for why we have that much data, that’s for another topic. That join could could result in a rather painful execution time. The problem you’ve got is, you have to first query table1 to get the ID of user ‘John.Doe’ and then use that ID for table2 to get the actual data.

So how can you optimise this? Well you’ve got three choices, the first would be two queries, one to grab the users ID from table1, then the next to grabs the users data from table2; but that’s 2 queries now. In a lot of places that wouldn’t matter, but we want speed here and reduction of hits to MySQL. The second is have the users name in table2 for each day off – that’s duplicating data though and because (in this case) you’d have a string, it’s not the fastest lookup and creates rather large indexes when people’s usernames are quite long.

The third option? A unique hash associated with that user. In this case, MD5 the username and store it as binary(16). MD5 is, after all, a 128-bit number basically. Most people are used to seeing it as a 32 character string, e.g. 7ecb9bba8130abe56cfd9a8430ca969c. That is just a hexadecimal number though, albeit a very very big one – capable of storing the value 340,282,366,920,938,463,463,374,607,431,768,211,456, for those in the UK that’s 340 sextillion. MySQL Doesn’t really have a suitable INT type for storing a number that big so it’s best to either store it as a 32-byte string (hexadecimal MD5) or better yet, as a binary string of 16 characters.

So how does that change our query now?

SELECT *
FROM table2
WHERE table2.columnA = UNHEX(MD5('John.Doe'));

No more join and only one select. It means you can look up days off for any user simply by knowing the username. MySQL has UNHEX(MD5()) to md5 a string and convert to its binary equivalent. In PHP you’d use md5(’string’, true) or pack(’H*’, md5(’string’));

In all honesty, this isn’t the best use of binary(16), but it’s a relatively simple example to follow. For us though, moving away from auto-incrementing ID’s towards binary hashes has allowed use to do blind inserts (insert ignore) and lightning fast selects where they used to take minutes or even hours. INSERT IGNORE has to be one of the biggest benefits we’ve seen. By setting the primary key to the BINARY(16) column, you can easily guarantee unique data without wasted extra index space and you only need to query that table when you actually need to data associated with that unique hash, the rest of the time, you can query other tables that relate to that hash without having to do a join.

* I would like to point out I am fully aware of people who store data without a dedicated database and use Map-Reduce due to the sheer size of it, however databases like MySQL allow a quick line of text to get the results you want, there’s no further effort involved.

Categories: Programming Tags:

Avatar is a must see film – at IMAX in 3D

January 13th, 2010 Wade No comments

On Tuesday night we went to see Avatar at the IMAX in Bradford’s National Museum of Film and Photography. I had semi high expectations for the film but had heard some bad press so how good it would be was left wide open. someone suggested we go to see it at the IMAX as it is vastly better, so that’s what we did.

AVATAR_TEASER_05-thumb-460x196The film was visually incredible, the attention to detail in the 3D work is undoubtedly the best I’ve seen. Pandora (the moon the Navi live on) is a world I wish I could experience first hand, from the vast luscious green trees to the plants that illuminate when they’re touched. I find it incredible just how much detail is in the scenery and how difficult it is to realise it is all CG. This is the kind of place I hope awaits us when we finally find a hospitable planet outside of our own.

avatar-8-thumb-460x253The Navi themselves are also a work of art, the method used to animate them means they move naturally, down to the slightest movement in the face.  Cameron’s team sought to go far beyond prior efforts, to ensure the complete reality of the characters. To do this, they developed a new “image-based facial performance capture” system, which used a head-rig camera to accurately record the smallest changes in the actors’ faces. Instead of using the motion capture technique of placing reflective markers on the actors’ faces to capture their expressions, the actors wore special headgear, to which a tiny camera was attached. The helmet/camera faced towards the actors’ faces and the camera recorded facial expression and muscle movements to a degree never before possible. Most importantly, the camera recorded eye movement, which had not been the case with prior systems.

Everyone needs to see this film at the cinema, and if you have an IMAX nearby, be sure to see it there.

Categories: Films Tags:

Federal Court to determine constitutional legality of gay marriage

January 12th, 2010 Wade No comments

It seems like it’s taken a long time for someone in the States to get up and do it, but someone seems to finally be ready to stand up in court and challenge the legality of banning same-sex marriage on a nationwide scale. The result of this trial should finally end the confusion the USA has had in it’s divided states. If it is voted unconstitutional to ban the marriage of two people of the same sex, any state that currently denies it will no longer be able to, effectively ending one of the biggest remaining prejudices in the States; this is, of course, if it is then argued and taken to the Supreme Court for one final showdown. If it’s voted constitutional, then my partner will be glad he’s in the UK now and doesn’t have to deal with a country that thinks it’s OK to treat people that way.

Proceedings opened on Monday with testimony from two plaintiffs in the case, Kristin Perry and Sandra Stier, who wed in California 2004 only to have their union later declared invalid.

Ms Stier said that being allowed to wed her partner would

provide me with a sense of inclusion in the social fabric of the society I live in.

I want our children to feel proud of us,” she told the court. “I don’t want them to worry about us.

Kristen Perry said:

I want it to happen to me. The state isn’t letting me feel happy.

Paul Katami and his partner Jeffrey Zarrillo described slights in gay life that ranged from being pelted with stones and eggs in college to the awkwardness of checking into a hotel and not being able to clarify the relationship.

Being able to call him my husband is so definitive,” Mr Katami said. “There is no subtlety to it. It is absolute.

Categories: News Tags:

Transparent Laptop? Samsung’s new 14 inch OLED display

January 10th, 2010 Wade No comments

We’ve all seen the movies where interactive displays are done on transparent glass, well Samsung, using OLED (organic light emitting diodes) have finally created one for us all to gawp over. Details are scarce, but it shows that yet another sci-fi idea is becoming mainstream reality. For now, here’s a video of the display, courtesy of Engadget.

Categories: Hardware, News Tags:
Easy AdSense by Unreal