Posts Tagged ‘PHP Conference 2010’

Regex-fu #PHPUK2010

February 26th, 2010 Wade No comments

Good start: don’t use regular expressions unless you need to. There are plenty of alternatives, e.g. DOMXML, str_replace, etc. PHP5+ also has lots of filters for email validation, URL validation and so on: function calls you can make rather than writing complex regular expressions. Regular expressions can slow down quickly due to backtracking, pattern complexity and long strings.

Then the talk became abstract: each point was prefixed with an odd statement such as “Only elephants remember everything” and “Not all matches are made in heaven”. People are getting it, but each statement needs explaining before they do!

One very good point that I have seen ignored many times is “try not to be greedy.” For example, /<(.+)>/ run against the string <a href="">fdsfsd</a> will match the entire thing, because .+ grabs as much as it can and only backtracks as far as the last >. To make it ungreedy, either use /<(.+?)>/ or /<([^>]+)>/. Greedy matches can be 20+ times slower.
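Here is a minimal sketch of the greedy versus ungreedy behaviour. It uses Python's re module rather than PHP's preg functions, purely as an assumption to make it easy to run; the three patterns from the paragraph above are unchanged.

```python
import re

html = '<a href="">fdsfsd</a>'

# Greedy: .+ runs to the end of the string, then backtracks to the last '>'.
greedy = re.search(r'<(.+)>', html).group(0)

# Lazy quantifier: .+? stops at the first '>' it can reach.
lazy = re.search(r'<(.+?)>', html).group(0)

# Negated character class: can never cross a '>', so it also stops at the first one.
negated = re.search(r'<([^>]+)>', html).group(0)

print(greedy)   # <a href="">fdsfsd</a>
print(lazy)     # <a href="">
print(negated)  # <a href="">
```

The negated class is often the safer habit of the two, since it says exactly what may appear inside the tag instead of relying on the quantifier mode.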

Categories: Programming Tags:

#PHPUK2010 Part 2 (MySQL stuff)

February 26th, 2010 Wade No comments

Just picked up a nice tidbit on creating a unique index on a two-column table where a pair of values may be stored either way around but you only ever want one instance of that pair. What this means is that inserting 2,1 and then 1,2, for example, would result in only the first of the two inserts succeeding.

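-- idx_pair is a placeholder index name; MySQL (8.0.13+) wants each expression in its own parentheses
CREATE UNIQUE INDEX idx_pair ON tablename ((LEAST(col1,col2)), (GREATEST(col1,col2)));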
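To see the idea in action without a MySQL server, here is a rough sketch using Python's built-in sqlite3 module. This is a stand-in, not the setup from the talk: SQLite's two-argument min() and max() replace LEAST() and GREATEST(), the table name pairs and index name idx_pair are made up, and it assumes an SQLite build new enough to index expressions (3.9 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pairs (col1 INTEGER, col2 INTEGER)")
# SQLite's two-argument min()/max() play the role of LEAST()/GREATEST().
conn.execute(
    "CREATE UNIQUE INDEX idx_pair ON pairs (min(col1, col2), max(col1, col2))"
)

def try_insert(a, b):
    """Return True if the row was stored, False if the unique index rejected it."""
    try:
        conn.execute("INSERT INTO pairs (col1, col2) VALUES (?, ?)", (a, b))
        return True
    except sqlite3.IntegrityError:
        return False

first = try_insert(2, 1)   # stored
second = try_insert(1, 2)  # rejected: same pair, other way around
third = try_insert(1, 3)   # stored: a different pair
row_count = conn.execute("SELECT COUNT(*) FROM pairs").fetchone()[0]
print(first, second, third, row_count)  # True False True 2
```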

Also, WITH. I’ll be honest, I’d never thought about using it to create temporary views. This is a bad example but it shows the structure rather well:

WITH tempView (a,b) AS (
SELECT table1.col1, table2.col2
FROM table1
LEFT JOIN table2
ON table1.id=table2.id
)
SELECT a,b FROM tempView;

Better yet is changing this to WITH RECURSIVE tempView and then adding a second SELECT inside the WITH that refers back to tempView. The great example he gave was flights from A to B with a varying number of stops: it would be possible to get all routes from A to B with one MySQL query, as long as the data stored all connecting routes.
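The flights idea can be sketched roughly like this. It is my own reconstruction, not the query from the talk, and it runs against SQLite through Python's sqlite3 module rather than MySQL. The flights table, its columns, the single-letter airport codes and the max_stops cap are all assumptions for illustration. The instr() check that stops a route revisiting an airport only works here because the codes are single characters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (origin TEXT, destination TEXT)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?)",
    [("A", "B"), ("A", "C"), ("C", "B"), ("C", "D"), ("D", "B"), ("B", "A")],
)

ROUTES_SQL = """
WITH RECURSIVE route (airport, path, stops) AS (
    -- Anchor: every direct flight leaving the starting airport.
    SELECT destination, origin || '>' || destination, 0
    FROM flights
    WHERE origin = :src
    UNION ALL
    -- Recursive step: extend each unfinished route by one more flight.
    SELECT f.destination, r.path || '>' || f.destination, r.stops + 1
    FROM route AS r
    JOIN flights AS f ON f.origin = r.airport
    WHERE r.airport <> :dst                  -- stop once we have arrived
      AND r.stops < :max_stops               -- cap the number of stops
      AND instr(r.path, f.destination) = 0   -- never revisit an airport
)
SELECT path, stops FROM route WHERE airport = :dst ORDER BY stops, path
"""

def find_routes(src, dst, max_stops=3):
    """Return (path, stops) for every route from src to dst within max_stops."""
    params = {"src": src, "dst": dst, "max_stops": max_stops}
    return conn.execute(ROUTES_SQL, params).fetchall()

print(find_routes("A", "B"))
# [('A>B', 0), ('A>C>B', 1), ('A>C>D>B', 2)]
```

The first SELECT seeds the result and the second keeps joining back onto it until no new rows appear, which is what makes a single query enough.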

Incidentally, while there is some great stuff coming out of this RDBMS talk, I think the queries are really hurting a lot of people’s heads. Good stuff though.

Categories: Programming Tags:

#PHPUK2010 Part 1

February 26th, 2010 Wade No comments

Josh began by using the dictionary definition of simplicity (as given by Wikipedia), pointing out that the word is often used as a derogatory statement. He then went on to “clarity of expression”, saying that striving for it while programming is something a lot of people do but never quite seem to achieve.

He gave an example where a user comes to a programmer asking for a report, and the usual first reaction is “ah, you need a reporting system.” He said that’s not always the case: at the end of the day, the user just wanted a report. At this point I heard quite a few people take a breath in through their teeth (particularly the guy sitting to the right of me, he knows who he is). That is a hard problem, particularly at Stickyeyes, where we really do get a lot of people saying “I want a report” and often we have to build a system, simply because of the sheer number of similar, repetitious reports.

He made a very good point about developers having a tendency to go for the newest, shiniest tools (such as HipHop for PHP). His point was that these tools exist because they solve a particular type of problem, so unless the tool actually helps you, do you really need to use it?

Categories: Programming Tags:

MySQL and Binary(16) – The Reasons/Benefits/Drawbacks (#mysql)

January 31st, 2010 Wade No comments

I recently posted an article about using BINARY(16) for storing MD5s as unique identifiers instead of simple integer IDs (usually auto-increment). In that article I touched on one of the benefits, reducing JOINs, but there are other reasons for doing it too, so I thought I’d post an article purely about the reasons behind using BINARY(16).

As I discussed in my previous article, an MD5 string is actually a 128-bit hexadecimal number, so it can take any of 340,282,366,920,938,463,463,374,607,431,768,211,456 (2^128) different values. MySQL doesn’t have an efficient integer field for storing numbers this big, so you have two choices for storage: a CHAR(32) or a BINARY(16). If you convert a hexadecimal MD5 into an unhexed byte string, it becomes 16 bytes rather than 32. MySQL handily has a built-in function for this called UNHEX.
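The size difference is easy to check for yourself. This small sketch uses Python's hashlib instead of MySQL, as an assumption so that it runs anywhere, and the input string is just an example.

```python
import hashlib

name = "John.Doe"  # illustrative input only

hex_md5 = hashlib.md5(name.encode("utf-8")).hexdigest()  # what CHAR(32) would hold
raw_md5 = hashlib.md5(name.encode("utf-8")).digest()     # what BINARY(16) would hold

# Unhexing the 32-character string gives back exactly the 16 raw bytes.
unhexed = bytes.fromhex(hex_md5)

print(len(hex_md5))        # 32
print(len(raw_md5))        # 16
print(unhexed == raw_md5)  # True
print(2 ** 128)            # 340282366920938463463374607431768211456
```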

So, why use binary(16) as a unique field for data storage? Databases like MySQL have superb functionality such as JOIN, allowing you to query one table and “join” the results of that query to another table. However, when you get to tens, hundreds or even thousands of millions of rows of data, JOINs become expensive, especially when the join only exists because you need an ID field from one table to query against another. In tests at work, replacing a JOIN with a binary(16) unique identifier gave noticeable speed improvements. Noticeable here means noticeable to a human, not “iterate it a million times and you’ll see 1.5 as opposed to 1.9 seconds” noticeable.

The main benefits include:

  • Fast queries against any table where you know the formula that was used to create the MD5 binary(16): you can query using the human-readable string rather than an integer ID.
  • Complete disassociation of relational data values
  • Ability to use INSERT IGNORE to avoid duplicate data without having to use overly large indexes
  • More unique values than even a BIGINT.

The main drawbacks include:

  • 12 bytes more storage for the ID (INT is 4 bytes)
  • No auto-incrementation
  • Completely unreadable to humans when the data is in BINARY(16) form.

One thing I just mentioned was disassociation of relational data values. What does this mean exactly? To be honest, it means exactly the same as what people do now with MySQL and unique integer IDs! The difference is that a lot of the time you can query against it without those pesky JOINs. For example, say you are storing every town in the UK in a database, along with how they link together (i.e. whether there is a direct route from one to another). You’d probably have a table named towns, with a unique ID and the town name. You’d then have a separate table with two columns, each storing a town ID, where a row basically means “this town has a direct route to this town.” If you were to use integers as the town’s unique ID, every time you wanted to get the towns linked to a given town, you’d have to join against the towns table first to get the ID of the town you want links for, then again to get the names of the towns that link to it.

If you were to use a binary(16) representation of the town name, you could scrap the first join; instead you could query by saying “get me any towns that link to UNHEX(MD5('Town Name'))”. You’d still have to do the second join to get the town names, but you’ve instantly dropped a JOIN and simplified the whole experience, as you can now query more naturally.

Basically, all you’re doing is replacing any string in your database that is usually more than 16 characters long with a binary(16) hash of it, then storing the strings elsewhere for when you actually need to read the output. This effectively gives you a look-up table that can contain any string whatsoever, and a database that stores relationships between strings without requiring special tables and integers for every string.
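The towns example could look something like the sketch below. It is an illustration only, not our production schema: it uses Python's sqlite3 and hashlib rather than MySQL, and the table layout, column names and town names are all made up. The helper town_id() stands in for UNHEX(MD5('Town Name')).

```python
import hashlib
import sqlite3

def town_id(name):
    """16-byte MD5 of the town name, the equivalent of UNHEX(MD5(name))."""
    return hashlib.md5(name.encode("utf-8")).digest()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE towns (id BLOB PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE links (town_a BLOB, town_b BLOB)")

for name in ("Leeds", "York", "Harrogate"):
    conn.execute("INSERT INTO towns VALUES (?, ?)", (town_id(name), name))
for a, b in (("Leeds", "York"), ("Leeds", "Harrogate")):
    conn.execute("INSERT INTO links VALUES (?, ?)", (town_id(a), town_id(b)))

# Only one join: the starting town is addressed by its hash, not looked up by name.
linked = [row[0] for row in conn.execute(
    "SELECT towns.name FROM links "
    "JOIN towns ON towns.id = links.town_b "
    "WHERE links.town_a = ? ORDER BY towns.name",
    (town_id("Leeds"),),
)]
print(linked)  # ['Harrogate', 'York']
```

With integer IDs the WHERE clause would need a second join on towns just to turn 'Leeds' into its ID. Here the ID can be computed from the name.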

As a note, we have a table with 100 million rows and two columns, BINARY(16) and TEXT, for looking up the textual value of a binary(16) hash. A look-up takes 0.0019 seconds for us, and having that table of text has meant we’ve severely de-duped our database, as the data we store is often identical even when the source is completely different. Even if we do a WHERE BINARY(16) IN (list,of,values), the time sticks at 0.0019 seconds up to the largest test I’ve done so far, which is 100 MD5s.

Categories: Programming Tags:

MySQL – Binary(16) and scalability

January 29th, 2010 Wade 1 comment

Over the past few months at work, we’ve seen our database grow from silly big to really silly big. It still has a way to go to get to the size of the big boys such as Facebook, but it’s still a database stored in MySQL that most day-to-day PHP programmers would avoid like a midget cannibal.

One of the great things about using something like MySQL (and any other “real” database) is the ability to cross-query data, i.e. to grab data from one data-set (table) and join it to another data-set (table) to get a single set of results, either as a combination of the data or the result of an exclusion due to the join. *

However, as tables grow, the time taken to perform queries, particularly in the realm of joins, grows rather quickly. So for example take this query:

SELECT *
FROM table2
LEFT JOIN table1
    ON table1.columnB = table2.columnA
WHERE table1.columnC = 'John.Doe';

Let’s say table1 is a list of all employees in a small business and table2 is a list of their days off, so it’s a one-to-many relationship. Running the above query to get the days off for John.Doe would be pretty quick and most developers would be happy with that. Even if the columns weren’t indexed, the performance of that query (as it’s a small business, and therefore a small dataset) would be more than suitable for any real-world application.

Now imagine a table where rather than a couple of hundred rows, you have millions or (such as ours) billions of rows of data; as for why we have that much data, that’s for another topic. That join could result in a rather painful execution time. The problem is that you first have to query table1 to get the ID of user ‘John.Doe’ and then use that ID against table2 to get the actual data.

So how can you optimise this? Well, you’ve got three choices. The first would be two queries: one to grab the user’s ID from table1, then another to grab the user’s data from table2. But that’s 2 queries now. In a lot of places that wouldn’t matter, but we want speed here and fewer hits to MySQL. The second is to have the user’s name in table2 for each day off. That’s duplicating data though, and because (in this case) you’d have a string, it’s not the fastest lookup and it creates rather large indexes when people’s usernames are quite long.

The third option? A unique hash associated with that user. In this case, MD5 the username and store it as binary(16). MD5 is, after all, basically a 128-bit number. Most people are used to seeing it as a 32-character string, e.g. 7ecb9bba8130abe56cfd9a8430ca969c. That is just a hexadecimal number though, albeit a very, very big one, with 340,282,366,920,938,463,463,374,607,431,768,211,456 possible values; for those in the UK that’s 340 sextillion. MySQL doesn’t really have a suitable INT type for storing a number that big, so it’s best to either store it as a 32-byte string (hexadecimal MD5) or, better yet, as a binary string of 16 bytes.

So how does that change our query now?

SELECT *
FROM table2
WHERE table2.columnA = UNHEX(MD5('John.Doe'));

No more join and only one select. It means you can look up days off for any user simply by knowing the username. In MySQL, UNHEX(MD5()) hashes a string with MD5 and converts the result to its binary equivalent. In PHP you’d use md5('string', true) or pack('H*', md5('string'));

In all honesty, this isn’t the best use of binary(16), but it’s a relatively simple example to follow. For us though, moving away from auto-incrementing IDs towards binary hashes has allowed us to do blind inserts (INSERT IGNORE) and lightning-fast selects where they used to take minutes or even hours. INSERT IGNORE has to be one of the biggest benefits we’ve seen. By setting the primary key to the BINARY(16) column, you can easily guarantee unique data without wasting extra index space, and you only need to query that table when you actually need the data associated with that unique hash; the rest of the time, you can query other tables that relate to that hash without having to do a join.
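A blind insert can be sketched like this. It is a stand-in rather than our real setup: it runs on SQLite through Python's sqlite3 module, where the equivalent of MySQL's INSERT IGNORE is spelled INSERT OR IGNORE, and the strings table and its columns are hypothetical.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE strings (hash BLOB PRIMARY KEY, value TEXT)")

def blind_insert(value):
    """Insert without checking first; the primary key silently drops duplicates."""
    key = hashlib.md5(value.encode("utf-8")).digest()
    conn.execute(
        "INSERT OR IGNORE INTO strings (hash, value) VALUES (?, ?)", (key, value)
    )
    return key

for value in ("John.Doe", "Jane.Doe", "John.Doe"):
    blind_insert(value)

row_count = conn.execute("SELECT COUNT(*) FROM strings").fetchone()[0]
print(row_count)  # 2
```

The caller never has to ask “does this row exist yet?” first, which is where the saving comes from when the same strings arrive from many sources.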

* I would like to point out I am fully aware of people who store data without a dedicated database and use Map-Reduce due to the sheer size of it, however databases like MySQL allow a quick line of text to get the results you want, there’s no further effort involved.

Categories: Programming Tags:

Superb VPS (Virtual Private Server) Provider – VPS.net Review

November 18th, 2009 Wade No comments

A few months ago at work we realised the need for lots of “nodes” (servers), initially in the UK and the US. We have a lot of data processing to do and we worked out it would be faster and cheaper to distribute the work over lots of servers rather than a few beefy ones. Dan started looking around and found VPS.net. They looked good so we thought we’d give them a go. Of course we didn’t want to “put all our eggs in one basket”, so we took out some VPS servers with different companies too. A few months later, the only servers we’ve been continuously happy with are the VPS.net servers.

Their pricing is one of their best features. You can get a “single node” VPS with basic specs of a 400MHz processor, 256MB RAM, 10GB storage and 250GB bandwidth (what this site is currently running on) for £15 a month. £15 may seem a little steep to some people at first, but when you realise this is for your own root-access virtual server (i.e. you can pretty much do anything you want with it, as if it were at home) it isn’t that bad at all! What makes it better is that as you buy more nodes, the price of each additional node comes down, so the first node may be £15, but the second is only £14, and so on all the way down to £9 per node.

VPS.net’s idea of nodes is also very cool. You can buy up to 16 nodes that work as one VPS, and you can change this at any time. Let’s say you were me, with one node, and suddenly you got a surge of visitors and it just couldn’t handle the load any more. Not a problem: simply buy an extra node, attach it to the VPS and you’ll immediately get the benefit of it, with no need to re-install your VPS or anything!

They also have automated full node backups for £4 a month. They take daily, weekly and monthly backups that you can restore from at any time, so if you totally screw up your server, just hit yesterday’s backup and voila! Back to normal again. Not only that, if the actual server your VPS is on decides it’s time to die, they’ll boot your VPS up on another server within minutes; that costs nothing extra and is part of the service.

If your demands are for a webserver, they have free DNS management tools as well so you don’t have to use an external service, just point your domain at their nameservers.

So in summary, if you’re looking for a reliable, fast, cheap place to host a website or host nodes to perform data analysis, I’d certainly give VPS.net a go.

Review ends :)