Hacker News
Show HN: Hadoop in Excel (datanitro.com)
82 points by karamazov on Nov 13, 2013 | hide | past | favorite | 22 comments


Hi, I'm one of the developers. I'd be happy to answer any questions you have on this.

If you're in New York, I'd love to meet you in person at our Big Data in Excel meetup this Monday: http://www.meetup.com/DataNitro/events/149402612/

And, as the page says, we're looking for beta users! If you're interested in this, know someone who might be, or just have an opinion, I'd love to talk to you. You can comment here or reach me at ben at datanitro.com.


Looks like a killer product: there are a lot of business people who already know how to write what is essentially functional code in Excel, but who for whatever reason cannot or will not write even simple map and reduce functions in a conventional programming language to extract information from a large Hadoop data set.

Are formulas or spreadsheet browsing limited in any way?


Formulas should be used as columnar operations: you can apply one formula to every element of a column (map) or aggregate an entire column (reduce). (This isn't much of a restriction: you can use IF statements to build complex expressions here.)
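The columnar model described above can be sketched in plain Python; the column, the formula, and the variable names here are purely illustrative, not DataNitro's actual API:

```python
# Sketch of the columnar formula model: one formula applied per row (map),
# one aggregation over the whole column (reduce). All values are made up.
column = [12.0, 7.5, 30.0, 4.25]

# "Map": apply a formula to every element, like =IF(A1>10, A1*0.9, A1)
discounted = [x * 0.9 if x > 10 else x for x in column]

# "Reduce": aggregate the entire column, like =SUM(B:B)
total = sum(discounted)
```

The point is that any formula of this shape translates mechanically into a map step (per-row) and a reduce step (per-column), which is what makes it runnable on Hadoop.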

Spreadsheet browsing is limited to a sample of the data (head + tail); you can set the size of the sample before pulling.
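A head + tail sample like the one described can be taken in a single pass over the data; this is a generic sketch (the function name and sample size are made up for illustration):

```python
from collections import deque
from itertools import islice

def head_tail_sample(rows, n):
    """Return the first n and last n rows of an iterable in one pass."""
    it = iter(rows)
    head = list(islice(it, n))       # first n rows
    tail = deque(it, maxlen=n)       # keeps only the last n remaining rows
    return head, list(tail)

head, tail = head_tail_sample(range(100), 3)
# head == [0, 1, 2], tail == [97, 98, 99]
```

Using a bounded deque means the full data set never has to fit in memory, which matters when the source is a large Hadoop file.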


I don't think "conventional programming languages" (and in particular Java) are well suited to the task of analyzing and querying data.

But Excel is also an odd beast. It's used for laying out reports, building pivot tables, plotting graphs, working with databases, running calculations, and writing VBA macros. Not all of those features play well together.


Did you write your mappers and reducers in Java using the Hadoop API, or does this translate into HiveQL or some other higher-level language? Great job, by the way: this looks super helpful for letting business types get useful reports on their own rather than interrupting the workflow of someone with more formal training (typically a huge issue).


Thanks! We're working directly in Java right now, but might explore alternatives later. We're also planning to add support for Impala/Presto/etc.


What are you using on the back-end to perform the queries? Are you using MapReduce? What is the average latency expectations when using the application?


We are using MapReduce. Latency will depend on your cluster and the query; it's just a regular MapReduce operation from Hadoop's point of view.
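For readers unfamiliar with the model, here is a toy illustration of what a MapReduce job does. This is a generic Python sketch of the paradigm, not DataNitro's Java implementation; the record layout and function names are invented:

```python
from collections import defaultdict

# A formula like "sum amounts per region" compiles to a map and a reduce step.
def map_phase(record):
    region, amount = record
    yield (region, amount)         # emit key/value pairs

def reduce_phase(key, values):
    return (key, sum(values))      # aggregate all values for one key

records = [("east", 10), ("west", 5), ("east", 7)]

# The framework's shuffle stage groups emitted values by key between phases.
groups = defaultdict(list)
for record in records:
    for k, v in map_phase(record):
        groups[k].append(v)

results = dict(reduce_phase(k, vs) for k, vs in groups.items())
# {"east": 17, "west": 5}
```

On a real cluster, Hadoop distributes the map and reduce phases across machines and handles the shuffle itself, which is why latency depends on the cluster rather than on the tool that submitted the job.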


Funny as this sounds, it may in fact be perfect for a large subset of Hadoop use cases, if it works well.


Being pretty naive to the space, I'm assuming the killer differentiator from Microsoft's own Power Query (which looks like it can pull from Hadoop) is that this pulls a subset of data as an initial workspace, while Power Query pulls all of the data? Any other key differences?

Really cool tool! Wish I had some large real-world Hadoop cluster to try it out on...


The major difference is the ability to run queries on Hadoop, in addition to being able to pull data.


I think this would really benefit from a dead-simple tool that lets users import CSV files into a local Hadoop instance without having to do anything besides install Hadoop. This seems like something that could really democratize data analysis on large data sets, considering the number of people who are pretty good with Excel.


I've seen demos of a tool called Datameer which seems to offer very similar functionality (an Excel-like interface for configuring a job on a small set of data, followed by submission of that job to a Hadoop cluster as a MapReduce job). How does DataNitro compare to that?


Ummmm...doesn't Excel have a row limit of somewhere around 1 million?


Yes, it does. This doesn't involve pulling all of your data into a spreadsheet.


Can Excel open a 1-billion-row data file?


No, it can't - the limit is just over one million rows. This doesn't involve pulling anywhere near that many rows into your spreadsheet.


Then why MapReduce?


We let people with Hadoop clusters pull a small sample of data into Excel, analyze it with Excel formulas, and then run the analysis on the full data set. The last part happens outside of Excel.
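That sample-then-full-run workflow can be sketched as follows; the data set and the "formula" here are stand-ins, and in the real product the final step would run on the cluster as a MapReduce job rather than locally:

```python
# Illustrative workflow: the same analysis is previewed on a small sample,
# then applied to the full data. All data here is synthetic.
full_data = list(range(1, 1001))          # stands in for a huge data set
sample = full_data[:5] + full_data[-5:]   # head + tail preview in Excel

def analysis(rows):
    # The "formula": average of all values over 500
    big = [r for r in rows if r > 500]
    return sum(big) / len(big)

preview = analysis(sample)      # quick sanity check on the sample
result = analysis(full_data)    # in practice, shipped to Hadoop
# preview == 998.0, result == 750.5
```

Note that the preview and the full result can differ (as they do here), since the sample isn't representative; the preview is for checking that the formula is wired up correctly, not for estimating the answer.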


While impressive in terms of a technical achievement, Excel is a pretty appalling analysis tool generally. I fear for what it will turn into when you throw this much at it. Big Data doesn't let you power through being wrong.


This is aimed at people doing simple analyses on massive sets of data, which can work extremely well. [1] We're not advocating that people without a data science background start doing ML or something.

[1] See "The Unreasonable Effectiveness of Data", by Peter Norvig, Alon Halevy, and Fernando Pereira at Google.


Which is why I'm not besmirching your technical achievement as much as...Excel is widely abused by the ignorant, Big Data is widely abused by the ignorant...Hadoop in Excel...



