Hacker News
Show HN: Hadoop in Excel (datanitro.com)
82 points by karamazov on Nov 13, 2013 | hide | past | favorite | 22 comments


Hi, I'm one of the developers. I'd be happy to answer any questions you have on this.

If you're in New York, I'd love to meet you in person at our Big Data in Excel meetup this Monday: http://www.meetup.com/DataNitro/events/149402612/

And, as the page says, we're looking for beta users! If you're interested in this, know someone who might be, or just have an opinion, I'd love to talk to you. You can comment here or reach me at ben at datanitro.com.


Looks like a killer product: there are a lot of business people who already know how to write what is essentially functional code in Excel, but who for whatever reason cannot or will not write even simple map and reduce functions in a conventional programming language to extract information from a large Hadoop data set.

Are formulas or spreadsheet browsing limited in any way?


Formulas should be used as columnar operations: you can apply one formula to every element of a column (map) or aggregate an entire column (reduce). (This isn't much of a restriction: you can use IF statements to build complex expressions here.)
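The columnar model described above can be sketched in plain Python; the column, the formula, and the variable names here are purely illustrative, not DataNitro's actual API:

```python
# Sketch of the columnar formula model: one formula applied per row (map),
# one aggregation over the whole column (reduce). All values are made up.
column = [12.0, 7.5, 30.0, 4.25]

# "Map": apply a formula to every element, like =IF(A1>10, A1*0.9, A1)
discounted = [x * 0.9 if x > 10 else x for x in column]

# "Reduce": aggregate the entire column, like =SUM(B:B)
total = sum(discounted)
```

The point is that any formula of this shape translates mechanically into a map step (per-row) and a reduce step (per-column), which is what makes it runnable on Hadoop.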

Spreadsheet browsing is limited to a sample of the data (head + tail); you can set the size of the sample before pulling.
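A head + tail sample like the one described can be taken in a single pass over the data; this is a generic sketch (the function name and sample size are made up for illustration):

```python
from collections import deque
from itertools import islice

def head_tail_sample(rows, n):
    """Return the first n and last n rows of an iterable in one pass."""
    it = iter(rows)
    head = list(islice(it, n))       # first n rows
    tail = deque(it, maxlen=n)       # keeps only the last n remaining rows
    return head, list(tail)

head, tail = head_tail_sample(range(100), 3)
# head == [0, 1, 2], tail == [97, 98, 99]
```

Using a bounded deque means the full data set never has to fit in memory, which matters when the source is a large Hadoop file.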


I don't think "conventional programming languages" (and in particular Java) are well suited to the task of analyzing and querying data.

But Excel is also an odd beast. It's used for laying out reports, building pivot tables, plotting graphs, working with databases, running calculations, and writing VBA macros. Not all of those features play well together.


Did you write your mappers and reducers in Java using the Hadoop API, or does this translate into HiveQL or some other higher-level language? Great job, by the way: this looks super helpful for letting business types get useful reports on their own rather than interrupting the workflow of someone with more formal training (typically a huge issue).


Thanks! We're working directly in Java right now, but might explore alternatives later. We're also planning to add support for Impala/Presto/etc.


What are you using on the back-end to perform the queries? Are you using MapReduce? What is the average latency expectations when using the application?


We are using MapReduce. Latency will depend on your cluster and the query; it's just a regular MapReduce operation from Hadoop's point of view.
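For readers unfamiliar with the model, here is a toy illustration of what a MapReduce job does. This is a generic Python sketch of the paradigm, not DataNitro's Java implementation; the record layout and function names are invented:

```python
from collections import defaultdict

# A formula like "sum amounts per region" compiles to a map and a reduce step.
def map_phase(record):
    region, amount = record
    yield (region, amount)         # emit key/value pairs

def reduce_phase(key, values):
    return (key, sum(values))      # aggregate all values for one key

records = [("east", 10), ("west", 5), ("east", 7)]

# The framework's shuffle stage groups emitted values by key between phases.
groups = defaultdict(list)
for record in records:
    for k, v in map_phase(record):
        groups[k].append(v)

results = dict(reduce_phase(k, vs) for k, vs in groups.items())
# {"east": 17, "west": 5}
```

On a real cluster, Hadoop distributes the map and reduce phases across machines and handles the shuffle itself, which is why latency depends on the cluster rather than on the tool that submitted the job.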


Funny as this sounds, it may in fact be perfect for a large subset of Hadoop use cases, if it works well.


Being pretty naive to the space, I'm assuming the killer differentiator from Microsoft's own Power Query (which looks like it can pull from Hadoop) is that this pulls a subset of data as an initial workspace, while Power Query pulls all of the data? Any other key differences?

Really cool tool! Wish I had some large real-world Hadoop cluster to try it out on...


The major difference is the ability to run queries on Hadoop, in addition to being able to pull data.


I think this would really benefit from a dead-simple tool that lets users import CSV files into a local Hadoop instance without having to do anything besides install Hadoop. This seems like something that could really democratize data analysis on large data sets, considering the number of people who are pretty good with Excel.


I've seen demos of a tool called Datameer which seems to offer very similar functionality (an Excel-like interface for configuring a job on a small set of data, followed by submission of that job to a Hadoop cluster as a MapReduce job). How does DataNitro compare to that?


Ummmm...doesn't Excel have a row limit of somewhere around 1 million?


Yes, it does. This doesn't involve pulling all of your data into a spreadsheet.


Can Excel open a 1-billion-row data file?


No, it can't - the limit is just over one million rows. This doesn't involve pulling anywhere near that many rows into your spreadsheet.


Then why MapReduce?


We let people with Hadoop clusters pull a small sample of data into Excel, analyze it with Excel formulas, and then run the analysis on the full data set. The last part happens outside of Excel.
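That sample-then-full-run workflow can be sketched as follows; the data set and the "formula" here are stand-ins, and in the real product the final step would run on the cluster as a MapReduce job rather than locally:

```python
# Illustrative workflow: the same analysis is previewed on a small sample,
# then applied to the full data. All data here is synthetic.
full_data = list(range(1, 1001))          # stands in for a huge data set
sample = full_data[:5] + full_data[-5:]   # head + tail preview in Excel

def analysis(rows):
    # The "formula": average of all values over 500
    big = [r for r in rows if r > 500]
    return sum(big) / len(big)

preview = analysis(sample)      # quick sanity check on the sample
result = analysis(full_data)    # in practice, shipped to Hadoop
# preview == 998.0, result == 750.5
```

Note that the preview and the full result can differ (as they do here), since the sample isn't representative; the preview is for checking that the formula is wired up correctly, not for estimating the answer.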


While impressive in terms of a technical achievement, Excel is a pretty appalling analysis tool generally. I fear for what it will turn into when you throw this much at it. Big Data doesn't let you power through being wrong.


This is aimed at people doing simple analyses on massive sets of data, which can work extremely well. [1] We're not advocating that people without a data science background start doing ML or something.

[1] See "The Unreasonable Effectiveness of Data", by Peter Norvig, Alon Halevy, and Fernando Pereira at Google.


Which is why I'm not besmirching your technical achievement as much as...Excel is widely abused by the ignorant, Big Data is widely abused by the ignorant...Hadoop in Excel...



