The World of 5Σ in 2013

Pure Perl String Matching

June 12th, 2013 - As you can see from a couple of entries ago, I was trying to use String::Approx 'amatch' to do fuzzy string matching. This had some downsides. I needed to install the module on my app servers, and the code was returning bizarre results. Attempts to hand-tune the numbers were all failing.

Like any good engineer, I re-invented the wheel. To be fair, this was a useful endeavour: I strongly prefer a pure-Perl implementation of fuzzy matching, and not needing any supporting libraries was a huge bonus.

My algorithm is probably suboptimal, as I didn't follow any existing algorithms. The basic idea is I pull a single character from the first string, then 3 characters "nearby" in the second string, and see if the first char is in the second substring. I do this from front and back, on the theory that people typing strings tend to get the starts and ends correct more often than the middles.

If there's no match, a "miss" is recorded. At the end, a ratio is built between the number of misses and the size of the input strings.
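The idea described above can be sketched roughly as follows. This is not the code linked from this blog, only a minimal illustration under a few assumptions of mine: "nearby" means the positions i-1 to i+1, the back-to-front pass is done by reversing both strings, the longer string is the one walked so that surplus characters count as misses, and the ratio is normalised by twice the longer length so it lands between 0 and 1. The names window_misses and miss_ratio are hypothetical.

```perl
use strict;
use warnings;

# Count the characters of $first that are missing from a window of up to
# three characters (positions i-1 .. i+1) around the same position in $second.
sub window_misses {
    my ($first, $second) = @_;
    my $last   = length($second) - 1;
    my $misses = 0;
    for my $i (0 .. length($first) - 1) {
        my $start  = $i > 0 ? $i - 1 : 0;
        my $end    = $i + 1 < $last ? $i + 1 : $last;
        my $window = $start <= $end
                   ? substr($second, $start, $end - $start + 1)
                   : '';
        $misses++ if index($window, substr($first, $i, 1)) < 0;
    }
    return $misses;
}

# Ratio of misses (front-to-back plus back-to-front) to the number of checks.
# 0 means every character was found nearby; 1 means none were.
sub miss_ratio {
    my ($x, $y) = @_;
    # Assumption: walk the longer string so surplus characters count as misses.
    my ($long, $short) = length($x) >= length($y) ? ($x, $y) : ($y, $x);
    return 0 if length($long) == 0;
    my $misses = window_misses($long, $short)
               + window_misses(scalar reverse($long), scalar reverse($short));
    return $misses / (2 * length($long));
}

printf "%-10s %.2f\n", $_, miss_ratio('biplane', $_)
    for qw(biplane biplnae plant);
```

A caller would then pick a cut-off for the ratio that suits their data; any particular threshold would be a tuning choice, not something this sketch prescribes.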

Find the code here on this blog.

5Σ _____

First Look at VPI Virtual Servers

May 15th, 2013 - I am testing out the servers at VirtualPrivateInternet. It's a service similar to Amazon, but meant for developers and prototype-builders rather than production installs. The dashboard is lean and simple, so I allocated a couple servers and began using them.

The OS install was a vanilla Ubuntu, with a root login enabled. After some quick setup, I was off!

I installed MySQL, Perlbrew, and Mojolicious, and within a couple of hours had my prototype game site, okAlike, up and running. Feel free to check it out. Note that the main images on okAlike are served from the Dreamhost site faemalia.com, where I have pictures hosted, whereas the HTML, CSS, etc. are all served from okAlike.com.

5Σ _____

Levenshtein Distances in Perl

April 16th, 2013 - I needed to do some fuzzy word matching in Perl. Extensive web searching reveals that the only accepted solution for this is to use String::Approx 'amatch'. It took me a little while to understand some of the subtleties, however:

The first argument to the amatch() is a pattern, not a full word, so if the $_ passed to amatch() is a superstring of the pattern, it will match. I wanted actual string matching.
The documentation states, in a somewhat confusing way, that the match percentage is a number of characters in 10, rounded up. Basically, if you pass 15%, it will change that to a "2".

In the end, I wrote a wrapper around amatch() that matches two strings in a fuzzy sense. It compares the lengths of the two inputs and only does the fuzzy match if they are within 1 character of each other in length.

use strict;
use warnings;
use String::Approx 'amatch';

# Fuzzy-compare two whole strings; amatch() is only tried when the
# lengths differ by at most one character.
sub fuzzyMatch {
    my $match = shift;
    local $_ = shift;    # amatch() matches against $_ when given no inputs
    if (abs(length($match) - length($_)) < 2) {
        return amatch($match, [
            #"i",   # match case-insensitively
            "10%",  # tolerate ceil(N/10) characters wrong in 10
            "D1",   # num deletions
            "I1"    # num insertions
        ]);
    } else {
        return 0;
    }
}

my $string1 = "biplane";
my @inputs = qw(artwork countertop lights plant sailboat sculpture ship vase);
foreach my $in (@inputs) {
    print "$string1 <~=> $in :: ";
    print "[" . fuzzyMatch($in, $string1) . "]\n";
}
print "\n";
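Since this entry is titled after Levenshtein distances, here is what a plain edit-distance comparison might look like without String::Approx. It is a sketch of the textbook two-row dynamic-programming approach, not what amatch() does internally, and the function names levenshtein and within_edits are hypothetical.

```perl
use strict;
use warnings;

# Textbook Levenshtein distance, keeping only two rows of the DP table.
sub levenshtein {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));    # distance from '' to each prefix of $t
    for my $i (1 .. length($s)) {
        my @curr = ($i);             # distance from prefix of $s to ''
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            my $best = $prev[$j - 1] + $cost;                        # substitute
            $best = $prev[$j] + 1     if $prev[$j] + 1     < $best;  # delete
            $best = $curr[$j - 1] + 1 if $curr[$j - 1] + 1 < $best;  # insert
            push @curr, $best;
        }
        @prev = @curr;
    }
    return $prev[-1];
}

# True when two whole strings are within $max edits of each other.
sub within_edits {
    my ($s, $t, $max) = @_;
    return levenshtein($s, $t) <= $max ? 1 : 0;
}

print "biplane vs $_: ", levenshtein('biplane', $_), "\n" for qw(biplane plant vase);
```

Because this compares whole strings rather than searching for a pattern inside a longer input, it avoids the superstring surprise described above, at the cost of doing the work in pure Perl.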

I hope the code can save you a little time. Good luck out there!

5Σ _____

HBase/Hadoop log4j Gotcha: root logger

March 16th, 2013 - We recently decided we wanted to do syslog-based logging for our Hadoop and HBase daemons. This requires changing the log4j.properties files in the /etc/conf/[hbase|hadoop] directories. Your files probably have lines like this at the top:

hbase.root.logger=INFO,console
hbase.log.dir=.
hbase.log.file=hbase.log

Then, a bit later, you might have something like this:

log4j.rootLogger=${hbase.root.logger}

You might expect, therefore, that if you change the first line to hbase.root.logger=INFO,console,SYSLOG, then add some SYSLOG entries like this:

log4j.appender.SYSLOG=org.apache.log4j.net.SyslogAppender
log4j.appender.SYSLOG.syslogHost=${mySyslogHost}:${mySyslogPort}
log4j.appender.SYSLOG.layout=org.apache.log4j.PatternLayout
log4j.appender.SYSLOG.layout.ConversionPattern=hbase: %d{ISO8601} %p %c: %m%n
log4j.appender.SYSLOG.facility=LOCAL7

you would then have syslog messages going to mySyslogHost. You would be disappointed to find out that no, in fact, this does not work. Why? Somehow the "hbase.root.logger" variable, no matter how you define it in this file, never takes effect. If you instead do something like "log4j.rootLogger=${hbase.root.logger},SYSLOG", that will work. If you put SYSLOG in another variable, say "log4j.rootLogger=${hbase.root.logger},${my.syslog.variable}", that will also work.
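Put together, the variant that worked for us looks roughly like this. It is a minimal fragment, assuming the SYSLOG appender entries shown earlier are also present in the same log4j.properties file:

```properties
# Leave this line alone; editing it in this file had no effect for us
hbase.root.logger=INFO,console

# Name the SYSLOG appender directly on the rootLogger line instead
log4j.rootLogger=${hbase.root.logger},SYSLOG
```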

It turns out that the environment variable HBASE_ROOT_LOGGER (defined in hbase-env.sh) is responsible for the value of ${hbase.root.logger}, despite its being defined at the top of the log4j.properties file! And even if you don't define that environment variable, it will gain some sort of default (probably "INFO,console") and then overwrite anything you try to set that variable to.

The Hadoop equivalents, ${hadoop.root.logger} and HADOOP_ROOT_LOGGER, display the same behaviour.

We submitted a bug report to the HBase project on this, so hopefully you won't have to feel the pain we felt trying to deal with this tricky log4j gotcha.

5Σ _____

What Is SQL in Big Data?

March 3rd, 2013 - Recently I was chatting with some of my Big Data friends about SQL's role in the world. It seems all the Big Data databases of the world were fast and efficient, but lacked a simple query language to help explore the data. One thing that keeps getting reinvented is SQL. For example: Cassandra's CQL or Hadoop's Hive (HQL is SQL-like).

SQL is the secret missing ingredient from NoSQL! Who could have predicted it?

5Σ _____

Matt Reid's Instant InnoDB

February 2013 - I'm reading Matt Reid's book "Instant InnoDB: Short|Fast|Focused" now. As someone who's worked with InnoDB for about a decade, it's all review, but it's nice to see someone writing everything up in a single comprehensive top-down read. It's one of those "spend two hours reading, save 20 hours of headache" sorts of things.

The book is broken up into sections: getting started, basic configuration parameters, advanced configuration parameters, load testing, maintenance and monitoring, and troubleshooting.

The initial section, Getting Started, talks about how InnoDB achieves ACID using MVCC, and how the latter is implemented. It talks a little about downloading, installation, and some simple starting cases. This is a really good section for someone new to InnoDB.

The Basic Configuration section goes over some of the most-used parameters and summarises the industry knowledge of each. This section is again quite good for beginners or "amateur DBAs": the developer who is saddled with DBA duties for a temporary period.

When we get to Advanced Configuration, that's when the beginning, dedicated professional DBA will sit up and take notice. Here Matt covers some basic load generation tools and how to begin tuning the "important" InnoDB variables. He sets you up with a simple load/test/tune iteration loop to help you begin pushing your test InnoDB installation to higher usage and throughput. He gives some specific variables to deal with in the "tune" portion of that loop, such as file-per-table, the new buffer-pool-instances, read/write threads, and doublewrite buffer tunings.

In the Load Testing section, he gives a basic overview of load testing methodology and of several load-generating tools, including the venerable Bonnie++ and mysqlslap.

In the Maintenance and Monitoring section, Matt lays out his methodology for keeping an ongoing and growing InnoDB installation running well. He gives several formulae for determining when to resize InnoDB logfiles, add tablespaces, etc. This is run-of-the-mill administration, but again, if you're new to database administration, and want to be serious about it, this is a great foundation.

Finally, Matt covers some of the scary unknown realm of database administration: what to do when everything goes wrong, when the server crashes or corrupts your data. In the Troubleshooting section, he tells us about the error logs, explains how to bring up InnoDB tablespaces that are corrupt, and goes over some of the statistics that are useful when performance is very suboptimal.

Especially useful in this last section is the walkthrough of common InnoDB error messages and their usual causes. This is something I wish I'd had when I was first working with MySQL, as error messages from the server and client are very often misleading.

In summary, Matt's book is a solid InnoDB starter's manual. It is geared toward the dedicated starting MySQL DBA who needs to administer an OLTP installation of MySQL (this is what InnoDB was meant for). It will save you hours of searching on the web and could pay for itself easily in a variety of common cases. Pick it up. Check it out.

5Σ _____

The Datanormous Project: building and administering large database clusters

January 1st, 2013 - You want to create a giant HBase or MySQL cluster right now. Simply go to the easy UI, specify the cluster size and one of a few typical layouts, and press GO. Your cluster will be built within seconds, and you can start using it for dev purposes. To keep the cluster for more than a few hours, simply pay reasonable cloud computing rates, and you're golden.

That is the power of Datanormous.

Simple UI

The UI is a Bootstrap-based administration console. You define your cluster layout and size, and Datanormous does the rest.

Cloud Offering

You don't have to provision hardware if you do not want to. You simply ask for the hardware, and we provision it in our cloud-based compute cluster.

Long-term Administration

If you have ever administered a large distributed database, you know the administration is a headache. Our intelligent administration interface will tell you about problems before they affect your business. See servers going bad before they die. See software bugs beginning to manifest themselves before they get to a critical stage.

5Σ _____

See Fifth Sigma's Contributions to the Year 2012.