As I thought about how I would get started digging into data science this past week, my idea was to lay some groundwork for some playful programming that would let me gain data analysis experience using a skill (programming) that I am already comfortable with. Then I received a Reflector Newsletter from the great folks at Red Gate Software, and one of the prominent links was to the free eBook: Introducing Microsoft Azure HDInsight. It's a short book, about 130 pages, that is intended to give a practical introduction to how a business might put a "Big Data" technology to use, and it uses some tools I'm familiar with: PowerShell and Azure.
It's important to remember why there is so much attention on "Big Data" these days. It's not just that there is so much more data to be consumed, filtered, summarized, etc. Another part of the equation is the physical infrastructure on which we perform all of the operations that might be called data analysis. On a single computer, as the volume of data, the complexity of operations (i.e., calculations) or the sheer number of operations increased, we used to think first of scaling up to handle the increased activity or data load. Before Moore's Law stopped being relevant, we upgraded to newer processors, denser RAM and/or greater disk storage.
With the volumes of data at our disposal rising exponentially, however, and with processor speeds no longer doubling every couple of years, we have had to scale out rather than up, adding more machines and cores, often each with its own storage. This led to another interesting problem to solve: with much larger volumes of data to be filtered and analyzed, how could we parallelize computation across networked systems, thus minimizing the wait time for complex analyses to complete? And that's where MapReduce was born.
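To make the idea a little more concrete, here is a minimal sketch of the MapReduce pattern applied to the classic word-count problem. It is only an illustration in plain Python, not how Hadoop or HDInsight implements it: the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are my own, and a small thread pool stands in for the separate machines that would each process their own chunk of data.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_words(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle step: group every emitted value under its key (the word)."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped


def reduce_counts(grouped):
    """Reduce step: collapse each word's list of counts into one total."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(chunks, workers=2):
    """Run the map step over all chunks concurrently, then shuffle and reduce.

    The thread pool is an assumption made for illustration; in a real
    cluster each chunk would live on, and be mapped by, a different node.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_words, chunks))
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    chunks = ["big data is big", "data science uses big data"]
    print(word_count(chunks))
    # {'big': 3, 'data': 3, 'is': 1, 'science': 1, 'uses': 1}
```

The point of the split is that the map step needs no knowledge of any other chunk, so it can run anywhere the data happens to live; only the grouped, much smaller intermediate results have to travel across the network before being reduced.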
I've now made it through about half the book, but unfortunately no programming has been required to follow along so far. Instead I've been setting up HDInsight in Azure and downloading tools: the HDInsight Emulator and the Azure PowerShell command-line interface.
So, what is HDInsight? As the introduction states, "... HDInsight is Microsoft's 100 percent compliant distribution of Apache Hadoop on Microsoft Azure." Having checked out the Apache Hadoop web site, I learned that there's a newer version of Hadoop (2.3) than the one currently supported by HDInsight (2.2). I don't know if I'll continue using HDInsight, but if I do, I hope that Microsoft keeps up with the Hadoop improvements.