
No. CSV plays nicely in big data. JSON doesn't.

Tabular data allows you to easily optimize performance and costs. For example, Google decided it was a good idea to export some of its billing data columns as JSON. As a consequence, filtering by a low-cardinality value means parsing a large amount of data, because the value sits inside a JSON column. That same filter would otherwise be cheap with classic columnar compression, where a value is stored once alongside the count of subsequent rows that share it. BigQuery bills according to the amount of data scanned.
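To illustrate the difference (a toy sketch, not BigQuery's actual storage engine): a run-length-encoded column stores each repeated value once with a count, while a JSON column has to be parsed row by row to answer the same filter.

```python
import json

# Toy run-length encoding of a low-cardinality column: each repeated
# value is stored once alongside the count of subsequent rows sharing it.
def rle(values):
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

# 2,000 rows, but the 'sku' column compresses to just two runs...
rows = [{"sku": "A", "meta": '{"region": "us"}'}] * 1000 \
     + [{"sku": "B", "meta": '{"region": "eu"}'}] * 1000

print(rle(r["sku"] for r in rows))  # [['A', 1000], ['B', 1000]]

# ...whereas filtering on the JSON column means parsing all 2,000 blobs:
eu_rows = [r for r in rows if json.loads(r["meta"])["region"] == "eu"]
print(len(eu_rows))  # 1000
```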

The above anecdote fortunately doesn't cost much, because the billing data isn't large. But because of data pipeline development costs like these, I don't use JSON in the context of big data.



It’s not as cut and dried as you claim.

CSV only plays nicely with big data when your CSV encoder and decoder both agree on how to read and write CSV data. The problem is that CSV is not a formal standard. Sure, there are documents describing how CSV should work, but they’ve never been enshrined as a standard. This means CSV parsers often differ (much like how some JSON parsers break spec by supporting comments, except that where JSON parsers differ it’s non-destructive to data integrity, whereas where CSV parsers differ it’s hugely destructive).

Common issues that break CSV:

* headings or no headings

* multi line comments not being handled the same

* delimiters differing

* differing support for quotation marks

* how to escape characters, particularly control characters such as quotation marks, new lines and delimiters

* parsing of numbers (a lot of popular CSV editors actually mangle numbers)

* parsing of non-numeric data that superficially appears as numbers (like credit card data, dates, basically anything where zero padding needs to be preserved).
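A small illustration of the delimiter and quoting points above, using Python's stdlib csv module: the same bytes parse into different rows depending on which delimiter the reader assumes, and neither reading raises an error.

```python
import csv
import io

# One record using ';' as delimiter, with RFC 4180-style doubled quotes.
data = 'id;name;note\n1;"Smith; John";"said ""hi"""\n'

right = list(csv.reader(io.StringIO(data), delimiter=';'))
wrong = list(csv.reader(io.StringIO(data)))  # assumes ',' -- no error raised

print(right[1])  # ['1', 'Smith; John', 'said "hi"'] -- three clean fields
print(wrong[1])  # one single garbled field, literal quotes and all
```

Both calls succeed, which is exactly the silent-garbling failure mode: nothing tells you the second reading is wrong.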

But the worst offence of most CSV parsers is that they’ll often fail silently (or silently do the wrong thing), garbling your data in ways that can be almost impossible to spot on really large data sets.

You can see a practical example of this problem just by saving a CSV file in Excel :)

JSON might have its warts, but it is a better format where you care about preserving the integrity of the data between two independent systems. However, if you’re reliant on tabulated data then I might recommend jsonlines https://jsonlines.org/examples/ (ndjson is a very similar spec too). While they have not been formally standardised, they do at least extend JSON in a complementary way, and they also solve your complaints about JSON without creating the same problems as CSV (well, aside from the headings problem).
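For instance, a minimal jsonlines round trip needs nothing beyond the stdlib json module: embedded newlines, quotes, and leading zeros all survive, because every line is ordinary JSON.

```python
import json

records = [
    {"card": "0012345678", "note": 'said "hi"\nbye', "amount": 19.99},
    {"card": "0099999999", "note": None, "amount": 5},
]

# Serialise: one JSON object per line. json.dumps escapes the embedded
# newline as \n, so each record really is a single physical line.
payload = "\n".join(json.dumps(r) for r in records)

# Deserialise line by line -- lossless, with types and padding intact.
decoded = [json.loads(line) for line in payload.splitlines()]
print(decoded == records)  # True
```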

I appreciate that if you’re working with large enterprise databases then your hands are likely tied as to which file format you can use. But if you’re writing your own routines then jsonlines is definitely a better format for data integrity than CSV.


> how to escape characters, particularly control characters such as quotation marks, new lines and delimiters

I didn't know CSV injection was a thing until it got flagged in a pen test. Then you look around and realize it's a widespread problem and most serializers don't even have an option to escape them.


That’s because technically you can’t escape them:

* CSV only supports one data type: string. Thus formulas are just strings processed as code by some applications based on the content of that string

* CSV doesn’t support character escaping. Everything is supposed to be read unescaped; even new lines are literal new lines. There is no support for C-style escaping. If you need control characters in a field then you wrap the field in quotation marks (and the fact that quotation marks are optional leads to another class of bugs). If you need quotation marks inside your quotation marks then you double them up, i.e. to print " inside a quoted string, your field would look like

  "Bob said ""hello"""
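That doubling rule is what stdlib CSV writers emit too; a quick check with Python's csv module:

```python
import csv
import io

# Writing a field containing quotes doubles them per the de facto rules.
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_ALL).writerow(['Bob said "hello"'])
print(buf.getvalue())  # "Bob said ""hello""" (plus the \r\n terminator)

# Reading it back undoes the doubling.
field = next(csv.reader(io.StringIO(buf.getvalue())))[0]
print(field)  # Bob said "hello"
```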


While this might be technically true, I think it's worth pointing out the typical solution: append a character (usually either \t or ') to anything that looks like a formula (i.e. anything that begins with @, +, - or =).

The character will be rendered/visible, but that's better than letting Excel execute some arbitrary code.
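A sketch of that mitigation (the function name and prefix choice here are my own; some libraries use \t instead, as noted above):

```python
def defuse_formula(cell: str) -> str:
    # Hypothetical helper: prefix values that spreadsheet apps might
    # interpret as formulas with a single quote so they render as text.
    # The prefix remains visible in the cell, which is the trade-off.
    if cell and cell[0] in ("=", "+", "-", "@", "\t", "\r"):
        return "'" + cell
    return cell

print(defuse_formula("=1+2|cmd"))  # '=1+2|cmd
print(defuse_formula("hello"))     # hello
```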



