
Tough. You will never ever ever convince the world to conform to a rigid standard of two spaces between sentences, so your program is going to have to handle that kind of input.


Granted, any program that deals with other writers' text has to handle both possibilities, but two spaces improve accuracy, both for typesetting and for semantic recognition of sentences for other purposes. Even in a document that mixes one and two spaces between sentences, two spaces let a program settle a case that would otherwise be ambiguous. The more simplistic (and inaccurate) your single-space sentence detection algorithm is, the more valuable it is to be able to assume, or at least heavily weight, that a two-space separation marks a new sentence.
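To make the weighting idea concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration: the abbreviation list, the integer weights, and the names `boundary_score` and `split_sentences` are hypothetical, not taken from any real tokenizer.

```python
import re

# Hypothetical abbreviation list; a real system would need a far larger one.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "st", "vs", "etc"}

# Candidate boundary: terminal punctuation, one or more spaces, then a non-space.
CANDIDATE = re.compile(r"([.!?])( +)(?=\S)")

def boundary_score(text, match):
    """Score a candidate boundary; the weights are illustrative guesses."""
    score = 0
    word = re.search(r"(\w+)$", text[:match.start()])
    prev_word = word.group(1).lower() if word else ""
    if match.group(1) == "." and prev_word in ABBREVIATIONS:
        score -= 4  # a period after a known abbreviation is weak evidence
    if text[match.end()].isupper():
        score += 1  # a capital letter after the gap is mild evidence
    else:
        score -= 3
    if len(match.group(2)) >= 2:
        score += 4  # two or more spaces: the high weighting discussed above
    return score

def split_sentences(text, threshold=0):
    """Split wherever a candidate boundary scores above the threshold."""
    sentences, start = [], 0
    for match in CANDIDATE.finditer(text):
        if boundary_score(text, match) > threshold:
            sentences.append(text[start:match.start() + 1])
            start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

one = split_sentences("Bring apples, pears, etc. Bananas are optional.")
two = split_sentences("Bring apples, pears, etc.  Bananas are optional.")
print(len(one), len(two))
```

With a single space, the "etc." case stays unsplit because the abbreviation penalty wins. With two spaces, the extra weight tips it into a boundary. The same weight would also wrongly split "Dr.  Smith" if someone typed two spaces there, which is the risk of leaning on it too hard.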

I know the world will never universally adopt two spaces, but that doesn't mean people should get away with making spurious typesetting-related arguments and undoing the work of those who use two spaces. If a few more people decide that semantic richness is enough reason to start using two spaces, so much the better.


The problem is that the argument for using 2 spaces was never originally about parsing text with a computer. The article's justification for 2 spaces isn't about that either: it's an attempt to emulate a typesetting practice (using a single space that was 2 or 3 times as wide as the space between words), and even that practice is one you would not have expected someone to apply blindly to all text. (You might have someone do it to all of the text just so a more experienced typesetter could come along behind them and make the adjustments that required a good eye for the work.) Eventually publishers just found it easier to turn authors into apprentice typesetters, and probably found that most of the work they produce isn't worth paying an experienced typesetter to finish with that level of detail (an attitude that also shows in the editing of mass-market material). Of course, most mass-market publications now use spaces of the same size after the end of a sentence as between words, and probably don't do much manual adjustment of the alignment of the text, either.

Additionally, if you assume that "period space space" indicates the end of a sentence, eventually you're going to run into some case where that's wrong, too. It's bound to happen in a world where we routinely teach people to blindly hit "period space space." You may be better off incrementally improving your lexer/parser (or using and contributing to an open source project) and throwing out the excess spaces.
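A small Python sketch of both halves of that point, using only the standard `re` module. The function names `naive_two_space_split` and `collapse_spaces` are made up for illustration, and the sample text is invented.

```python
import re

def naive_two_space_split(text):
    """Treat every "period space space" as a sentence end (the risky assumption)."""
    return re.split(r"(?<=\.) {2,}", text)

def collapse_spaces(text):
    """Throw out the excess spaces so a later parser sees uniform input."""
    return re.sub(r" {2,}", " ", text)

# Someone who blindly types two spaces after every period, abbreviations included.
sample = "Dr.  Smith arrived.  He sat."

print(naive_two_space_split(sample))  # "Dr." ends up as its own "sentence"
print(collapse_spaces(sample))
```

The naive split breaks after "Dr." because the typist's habit, not the sentence structure, produced the double space. Collapsing the runs first leaves the real work to whatever sentence detection comes afterwards.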



