Big Data, you say?

The term “Big Data” seems to be all the rage these days. It reminds me of how the phrase “Big Iron” was being flung around with impunity circa 2001. Now don’t get me wrong – I’m a big fan of data mining and in fact my previous company specialized in reverse-engineering retail competitive-pricing models by aggregating massive amounts of data and employing sneaky algorithms to comb through said data.

It’s just that, as with “Cloud Computing”, the term is being over-used and under-examined. The crux: it is not how much data you have, but what you do with it. Just as the general populace is dealing with information overload from the Internet, and there is a real need for improvement in the current categorization and filtering mechanisms (i.e. search engines and social filtering), so too are businesses dealing with a deluge of data and not knowing what to do with it. And so SAS, IBM, and the like are licking their chops at the impending storm.

The thing is – we’re still developing more ways to generate data than ways to make sense of it. Take the Geo industry (of which MapDash is a part) – data collection is now spilling out of networks and computer systems and into the real world, with a vengeance. Adding three more dimensions to consumer data increases both the frequency and the quantity of data collection (the three dimensions being lat, lng, and time – elevation ain’t too big yet). Multiply this by the burgeoning “Internet of Things”, and companies had better get pretty good (pretty fast) at filtering, processing, and mining this data.

This is not a doom-and-gloom piece, so I now offer up a suggestion: Data Pipelines. At my previous company we designed a custom system that mined massive amounts of sales data to reverse-engineer retail pricing algorithms and uncover opportunities for pricing arbitrage. We soon found the best architecture for this was a Chained Pipeline – a series of interchangeable processes that each took in XML and spewed out XML, so that processing nodes could easily be swapped and the overall flow-through quickly reconfigured to give different results (a rough sketch follows below).
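To make that concrete, here is a minimal sketch of the idea in Python (not our old system itself). The stage names, XML shapes, and price filter are all made up for illustration; the point is simply that every stage takes XML in and hands XML to the next one, so stages can be swapped or reordered freely.

```python
import xml.etree.ElementTree as ET

# Each stage is just a callable: XML Element in, XML Element out.
# That uniform contract is what makes the stages interchangeable.

def parse_prices(root):
    """Hypothetical stage: normalize raw <sale> records into <price> elements."""
    out = ET.Element("prices")
    for sale in root.iter("sale"):
        price = ET.SubElement(out, "price", sku=sale.get("sku"))
        price.text = sale.get("amount")
    return out

def filter_outliers(root):
    """Hypothetical stage: drop records outside a plausible price range."""
    out = ET.Element("prices")
    for price in root.iter("price"):
        if 0 < float(price.text) < 10000:
            out.append(price)
    return out

def run_pipeline(stages, xml_text):
    """Run an XML document through a configurable chain of stages."""
    doc = ET.fromstring(xml_text)
    for stage in stages:
        doc = stage(doc)
    return ET.tostring(doc, encoding="unicode")

if __name__ == "__main__":
    raw = '<sales><sale sku="A1" amount="9.99"/><sale sku="A2" amount="99999"/></sales>'
    # Swap stages in or out here to get a different flow-through.
    print(run_pipeline([parse_prices, filter_outliers], raw))
```

The nice part of this shape is that a new report is usually just a new ordering of stages, not a new system.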

Now Amazon has gone and made this available to everyone, and made it incredibly easy. Four days ago they launched AWS Data Pipeline – and although we’re not part of the private beta, from what we can tell this looks to be an incredible boon to anyone wanting to start processing and mining their data in a flexible way. It includes a drag-and-drop interface for creating and configuring these pipelines, for gosh sakes! (A tad easier than our old custom system.) And it can save big dollars, because it automatically spins up and terminates EC2 instances (including Spot instances) for the processing steps in the pipeline. From experience – this is big, as I can’t count how many hours of compute time we’ve let our data-processing servers sit idle between jobs.
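For reference, here is roughly the housekeeping that feature takes off your plate, sketched with boto (the Python AWS library) rather than the Data Pipeline API itself. The AMI id, region, instance type, and the process_job step are placeholders, not anything from our setup or from the new service.

```python
import time
import boto.ec2

def process_job(host):
    """Placeholder for the actual processing step (push data over, kick off a script, etc.)."""
    print("would run the processing step on", host)

def run_step_on_fresh_instance(ami_id, instance_type="m1.large"):
    """Spin up a worker, run one processing step, and tear the box down again."""
    conn = boto.ec2.connect_to_region("us-east-1")
    reservation = conn.run_instances(ami_id, instance_type=instance_type)
    instance = reservation.instances[0]

    # Wait for the instance to come up before handing it any work.
    while instance.update() != "running":
        time.sleep(10)

    try:
        process_job(instance.public_dns_name)
    finally:
        # Terminate no matter what, so the box never sits idle between jobs.
        conn.terminate_instances(instance_ids=[instance.id])

if __name__ == "__main__":
    run_step_on_fresh_instance("ami-xxxxxxxx")  # placeholder AMI id
```

Doing this by hand for every step (and every failure mode) gets old fast, which is exactly why having the pipeline service manage instance lifecycles for you is appealing.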

We use AWS in a number of ways here at MapDash, and I must tip my hat to them – they are creating a whole ecosystem of incredibly powerful building blocks to solve some of the biggest tech problems companies are facing, and Big Data seems to be squarely in their sights. If you are interested in this space, check out their hosted Hadoop (AWS Elastic MapReduce) and AWS Simple Workflow, along with recent additions Glacier, Redshift, and now Data Pipeline.

If you are in on the Data Pipeline private beta, we’d love to hear about your experience. Enjoy your data crunching!
