Advanced “medium-sized” data skills

Although I’ve been doing an absolute shit job updating and have lost countless amazing insights by ignoring my little blog, I’m going to immediately jump back into a rant and not bother summarizing the last 5 months. Okay maybe at the end.

Though I will say this: I do believe I am over the learning curve at this point. I’ve learned how to decode Stack Overflow answers and am mostly Google-sufficient. This is a very comfortable place to be in. Feels like I’m *actually* learning relevant things and moving in an *actual* direction instead of stumbling blindly from shitty half-baked advice to overly-granular explanations of irrelevant nuances.

Today I want to cautiously mention big data. Very cautiously. I will approach this subject with the same judgement-free complaining that I give all subjects here.

So, about 3 months ago, my brain started becoming uncomfortable with the inconsistencies in the way people were talking about databases. So I started a document of every word I didn’t think had a clear definition and took it to one of our analytics guys and said “explain these things.”

For fun, here’s the list:

Airpal vs. Sql Pro – Are these interfaces? Why different ones?

Presto: def= “a cutting edge technology that facebook has been using at scale for the last year” <– wtf does that mean?

Production vs. Slave- which dbs have slaves and which don’t? How do slaves actually work?

MySQL – what is this actually? A language? A database? A server?


Things I’ve heard people say about Hive:

“The data lives in Hive”

“Hive is an interface”

“Hive is a platform”

“Hive is a database warehouse facility”

“Hive sits on top of Hadoop”

“Hive *acts* like a database”

map-reduce- a framework that hive and hadoop are based off?


Hadoop = manages the data

data itself sits on the servers disks as files

Hive – acts like a database


So you get the point. The problem was, with everyone I talked to, engineers and analysts alike, the list got longer and longer as they used more buzzwords to explain already confusing things. One of my major pet peeves with both engineering and accounting is how the same innocuous words are used to describe vastly different and actually very simple things. I’ve made it a rule that in my meetings no one is allowed to use the words “reconcile,” “system,” “account,” “sits” or “data element.”

The other challenge is you can’t always default to “explain it to me like I’m a 5 year old” because then it’s so high level, it’s not useful. The real trick, and the thing domain experts often fail at, is knowing how to explain something in someone else’s terms. For me, what finally clicked was thinking about database things in terms of Excel, which I understand backwards and forwards.

Depending on your level this may or may not be helpful, but without further ado, here’s today’s aha moment:

There are 2 kinds of databases worth talking about: relational and document. Relational databases have been around since the 70s; they’re basically set up to hold a bunch of Excel spreadsheets. Document databases are the latest trendy thing. They’re basically set up to hold a bunch of Word documents. They’re less structured but fit a specific need.

Think about when you’re planning a trip and you open a Google spreadsheet and then realize quickly that you don’t really know what all the rows and columns should be. Should you divvy it out by city? By activity? Do you need times or just dates? So you switch to opening a Google doc and just brain dump everything and call it “Thailand 2014.” Then later, when you’re more organized, you move all the pieces back into a Google spreadsheet since you have a better idea of how to organize it.

That’s honestly a really accurate metaphor for document versus relational databases.
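If you want to see the spreadsheet-versus-doc difference in code, here’s a rough sketch. Everything in it is made up for illustration: the `trip` table, its columns and the two notes documents are mine, and I’m using Python’s built-in sqlite3 and json modules as stand-ins, not any particular production database.

```python
import json
import sqlite3

# Relational: you commit to the columns up front, like spreadsheet headers.
# (The "trip" table and its columns are hypothetical.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trip (city TEXT, activity TEXT)")
conn.execute("INSERT INTO trip VALUES (?, ?)", ("Bangkok", "street food tour"))
rows = conn.execute("SELECT city, activity FROM trip").fetchall()

# Document: brain dump first, organize later.
# No two documents need to have the same fields.
docs = [
    {"title": "Thailand 2014", "notes": "Bangkok first? Beach after?"},
    {"title": "Packing", "items": ["sunscreen", "passport"], "urgent": True},
]
stored = [json.dumps(doc) for doc in docs]  # each doc saved as one blob of text
```

The table refuses anything that doesn’t fit its columns, while the list of documents happily takes whatever shape you throw at it. That’s the whole trade-off.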

An obvious question is, how the eff would you query a structureless database?

And that’s where map reduce comes in. Hadoop, btw, is just a way to DO map reduce, which is a technique for crunching piles of loosely structured data into something structured.

And here is the insanely simple way that it works:

Going back to my Thailand example, let’s say I want to know how many times I mentioned each city so I can start prioritizing what to hit. I would write a map reduce job, which means writing actual code in a regular programming language (Java, Python, that sort of thing). It works in 2 parts.

1. Mapping- I’m going to tell it to make me a list of every word in every document I have (pretend the only words are city names) and give each one a count of 1. So I’ll have something like this:

Bangkok 1

Ko Pha Ngan 1

Bangkok 1

Bangkok 1

2. Reduce- Then I really just do a pivot table, adding up the 1s for each city, so it looks like this:

Bangkok 3

Ko Pha Ngan 1
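For the code-curious, the two steps above can be sketched in plain Python. This is a toy, not how Hadoop actually runs things: the two trip documents, the city list and the function names `map_phase` and `reduce_phase` are all invented for the example.

```python
from collections import defaultdict

# Hypothetical list of cities to look for (made up for this example).
CITIES = ["Bangkok", "Ko Pha Ngan"]

def map_phase(document):
    """Emit a (city, 1) pair for every mention of a city in one document."""
    pairs = []
    for city in CITIES:
        pairs.extend((city, 1) for _ in range(document.count(city)))
    return pairs

def reduce_phase(pairs):
    """The pivot table: add up the 1s for each city."""
    totals = defaultdict(int)
    for city, count in pairs:
        totals[city] += count
    return dict(totals)

# Two made-up trip notes standing in for "every document I have".
docs = [
    "Fly into Bangkok. Ferry to Ko Pha Ngan.",
    "Bangkok night markets. Back to Bangkok for the flight home.",
]

pairs = [pair for doc in docs for pair in map_phase(doc)]
totals = reduce_phase(pairs)
print(totals)
```

With these two documents, `pairs` comes out as the four-line list from the mapping step and `totals` is the pivot table from the reduce step.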

So why is this so annoyingly trendy right now? Well, the real work of map reduce is the mapping. And if I have my documents on 2 different servers, it can run the mapping on both at the same time, making it roughly twice as fast. It basically scales in speed with the number of servers. Compare that to MySQL, which has all your Excel documents in one place and has to do everything in that same place.

Another analogy I like is, instead of sending your boyfriend to Target to pick up 20 things, you send 20 friends to 20 different Targets to pick up 1 thing each. Much faster.
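Here’s a hedged sketch of the 20-friends idea, shrunk down to 2. I’m faking the two servers with two worker threads via Python’s standard `concurrent.futures`, so this shows the shape of the thing (map on each pile separately, combine at the end) and not real cluster speedups. The documents and the `count_cities` helper are invented for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_cities(document):
    """Map step for one 'server': count city mentions in its own document."""
    cities = ["Bangkok", "Ko Pha Ngan"]  # hypothetical city list
    return Counter({city: document.count(city) for city in cities})

# Pretend each string lives on a different server.
server_a = "Fly into Bangkok. Ferry to Ko Pha Ngan."
server_b = "Bangkok night markets. Back to Bangkok for the flight home."

# Each worker handles its own document at the same time as the other.
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_counts = list(pool.map(count_cities, [server_a, server_b]))

# Reduce step: merge the partial counts into one pivot table.
totals = sum(partial_counts, Counter())
print(totals)
```

The key point is that `count_cities` never needs to see the other server’s document, which is why you can keep adding servers.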

And that, I swear to god, is it.

Well, except that map-reduce is hard for non-programmers to write, so there are a zillion programs that translate SQL-type languages into map-reduce so that analysts can leverage it. This is where Hive comes in.
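To give a feel for why analysts prefer the SQL route, here’s the same city count written as a query. Big caveat: I’m running it against Python’s built-in sqlite3 as a stand-in, not against Hive, and the `mentions` table is hypothetical. The point is only that a GROUP BY like this is the kind of thing a tool like Hive takes from you and turns into map-reduce work behind the scenes.

```python
import sqlite3

# Stand-in database; in real life this data would be files on a cluster.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mentions (city TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO mentions VALUES (?)",
    [("Bangkok",), ("Ko Pha Ngan",), ("Bangkok",), ("Bangkok",)],
)

# One line of SQL instead of a hand-written map step and reduce step.
query = "SELECT city, COUNT(*) AS n FROM mentions GROUP BY city ORDER BY n DESC"
result = conn.execute(query).fetchall()
print(result)
```

The GROUP BY is the pivot table, and the COUNT is adding up the 1s.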

So there ya go. I won’t even touch all the elitism, snark or the phrase “data-driven journalism” for now. But there ya go: big data as explained by Google Docs and pivot tables.