MarkLogic is an XML document store, with the usual shift in access patterns that entails. I could easily see much of the problem simply being an impedance mismatch, with devs and other systems expecting SQL-like behavior.
I haven't worked with it directly, but at several places it has a reputation for high memory requirements and for scaling up on single massive servers rather than out across a cluster. I'm not sure if that's still true, or whether it's an issue for healthcare.gov, but it could definitely make it harder to scale in a hurry.
MarkLogic is an XML document store like Oracle is a CSV file store.
We store compressed trees. We require memory - like all databases - because we actually have rich indexes.
Why? Scaling by requiring massive servers is Oracle's story, not MarkLogic's. We scale out on commodity servers in a cluster just fine. However, we do not magically create more capacity when additional VMs are added on top of over-subscribed network and storage systems.
We don't require proprietary shared storage systems (local, NFS, HDFS - just give us bandwidth) or proprietary networks (TCP/IP please, 10G preferred, no FC needed).
The cheapest way to scale is a cluster of decent-sized machines, each with a decent amount of I/O capacity.
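The memory point above - rich indexes trade RAM for query speed - can be sketched with a toy inverted index. This is plain Python dicts as a generic illustration, not MarkLogic's actual internals:

```python
# Toy illustration of why index-heavy databases want lots of RAM.
# A generic inverted index over tiny "documents" - not any real product's API.

docs = {
    1: "premium subsidy calculation",
    2: "enrollment premium record",
    3: "provider network listing",
}

# Build the index up front: word -> set of doc ids. This costs memory...
index = {}
for doc_id, text in docs.items():
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

# Without an index, every query walks every document.
def find_scan(word):
    return {i for i, text in docs.items() if word in text.split()}

# With the index, a query is a single hash lookup.
def find_indexed(word):
    return index.get(word, set())

assert find_scan("premium") == find_indexed("premium") == {1, 2}
```

The index is paid for once, in memory, so every later lookup avoids touching the documents at all - which is also why databases with rich indexes need RAM proportional to what they index.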
{Q: What do you call a database without indexes, transactions or security?
A: A file system!}
Ouch. As in, per this NYT article: http://www.nytimes.com/2013/11/23/us/politics/tension-and-wo... the contractors weren't familiar with its paradigm. They got started so late (not in earnest until February, from what we've heard) that it sounds like they just didn't have the time to learn it ... in a project (micro-)managed by political and bureaucratic types in the government.
Not learning it well enough would likely go along with not provisioning the necessary resources.
Usual lesson of "don't try to do too many new things if you're on a tight time and/or money budget".
Unfortunately, given that this is almost entirely a political exercise, I don't see how you're going to avoid getting some scapegoating.
"...the root causes for these site flaws to be hundreds of software bugs, insufficient hardware and infrastructure."
The infrastructure flaws cannot be chalked up to "not learning it well enough"....There is no database - not Oracle, not PostgreSQL, not MongoDB....and not MarkLogic - that would have handled the traffic on this hardware and infrastructure.
"There is no database ... that would have handled the traffic on this hardware and infrastructure"
That's what I meant by "not provisioning the necessary resources". I'm a programmer who dabbles in small-scale systems building (from scratch, as in mounting the CPU on the motherboard, etc.), so I'm perhaps not using the word "provisioning" as domain experts do.
But to the extent MarkLogic was the major database used (I'm getting that impression the more I look into this), unfamiliarity plus all the bad management could have contributed to not procuring beefy enough infrastructure. Or it could have fed the widespread magical thinking: an experienced Oracle DBA could say with authority "this won't work" but wouldn't carry as much weight when talking about MarkLogic. And I'm sure you had at least one field engineer helping them ... or I should say trying to help them. Ditto, I'm sure, for the people or contractors in/for CMS who were already using MarkLogic, assuming the ones who really knew their stuff were even consulted.
Engineers not being listened to or respected, in favor of magical thinking, is obviously one of the biggest problems with this project. What can you say when integration testing is delayed until the last 2 weeks before launch, proves the week before that the system can't work, and it launches anyway?
Not that much controversy over that. HHS said most of the same this morning ... I (and many folks) disagree with the premise that MarkLogic had anything to do with it. If the exact same processes and analysis had been applied to a LAMP stack or an Oracle Exa-stack, the results would likely have been the same. I think the public news about the firewall sizing (4G instead of 50G) supports my claim.
So far, two contractors have been changed out - QSSI took over from CGI to lead the rebuild and, recently, Terramark is to be replaced by HP. Maybe this has absolutely nothing to do with not being familiar with MarkLogic.
There was a healthcare exchange built 100% on Oracle in Oregon (Oracle team, Oracle packaged software including Siebel, PeopleSoft, and IDM, Oracle integration software + people, Oracle infrastructure, Oracle hosting). It's not going particularly well.
I don't think familiarity with the technology has anything to do with it.
I do think that MarkLogic's ability to be agile - programming (EasyApp), infrastructure (speed of transition) and performance - has a great deal to do with how quickly the team has been able to deal with the software bugs above MarkLogic and the weak infrastructure around us.
(For those joining this conversation in progress, I'm a product manager at MarkLogic - I'm in charge of infrastructure like storage, performance monitoring and cloud platforms.)
Small corrections: QSSI took over from HHS's Centers for Medicare and Medicaid Services (CMS) as "general contractor" (as it's being put lately) and integrator, CGI is still as far as I know in the same position. And next quarter or so they'll be moving the site from Terramark to HP.
The latter could be related to MarkLogic; I don't know about Terramark, but I hope HP isn't in this business except to provide, at minimum, full hardware solutions, not just a co-lo's power, cooling, connectivity and so on - i.e., better able to provide today's "big iron". Even though the current database instance has been beefed up with 12 dedicated servers, I've seen complaints that the storage system is not up to snuff, and I know that's an HP focus (in fact, I'm doing a monthly full backup of my home systems to an HP LTO-4 tape drive right now...).
Or maybe Terramark isn't in a position to provide a backup site, something neglected to date....
My Google Fu isn't up to finding the technical firewall issue, do you have a pointer handy? It did find mention of the Administration freaking out when insurer giant United Healthcare bought QSSI before the election....
Your former CEO says the usual sensible things you'd expect from anyone with a clue; I just hope the meme doesn't build up that MarkLogic was at fault. Especially since it was forced on reluctant contractors - that's enough to get a lot of the learned-one-thing-and-not-about-to-learn-another types to bad-mouth it, apparently from a position of technical authority. Then again, how many of these technical people are in reporters' Rolodexes?
"I do think that MarkLogic's ability to be agile - programming (EasyApp), infrastructure (speed of transition) and performance - have a great deal to do with the speed of the team being able to deal with the software bugs above MarkLogic and the weak infrastructure around us."
Hmmm, I'll be looking for evidence of that in the serious postmortems that'll be coming out over the next decade or so, assuming it continues to be a nightmare, e.g. if the backend connections to insurers fail to deliver quality data, as various of these articles are suggesting. Not to mention, as previously noted, that they haven't even started on the software to pay the insurers, a very complicated thing....
A separate reply on the HHS document (thanks for the link!).
400 bug fixes ~= 200 new bugs inserted? (Well, at least initially.)
Page 5, hardware upgrades: "Deployed 12 large, dedicated servers; upgraded storage unit", with a resulting ">3x Database Throughput"
Don't like reading that, although maybe it's no longer the critical-path slowdown. "Upgraded storage unit" sounds like they're using a SAN, NAS, or what have you, not the ideal recommended cluster architecture you described. I wonder if they've got enough IOPS....
Hmmm, maybe not a total disaster now. And the fix-it czar said his highest priority was (correctly) to stop sending garbage to insurance companies. We'll see ... especially since they evidently haven't started working on the really hard problem of paying the companies....
> MarkLogic is an XML document store like Oracle is a CSV file store.
Sorry, I wasn't trying to say that was a negative - just different, in ways many developers don't expect. With similar systems (e.g. ZODB, MongoDB) I've seen developers use ORM-like patterns where they store properties in separate records, or do reports by walking millions of records, and then complain about performance when really they were using the system wrong.
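That anti-pattern can be sketched in a few lines - plain Python dicts standing in for a document store, with illustrative names rather than any real product's API:

```python
# Toy contrast of an ORM-style layout vs. a document layout.
# Plain dicts here; the point is the access pattern, not any real database.

# Anti-pattern: each property stored as its own record, so reading one
# entity means scanning (or many lookups of) the whole record set.
rows = {
    ("user:1", "name"): "Ada",
    ("user:1", "email"): "ada@example.com",
    ("user:1", "plan"): "silver",
}

def load_user_rowwise(uid):
    # Reassemble the entity by walking every property record.
    return {field: val for (key, field), val in rows.items() if key == uid}

# Document-store-friendly: the whole entity is one record, one fetch.
documents = {
    "user:1": {"name": "Ada", "email": "ada@example.com", "plan": "silver"},
}

def load_user_doc(uid):
    return documents[uid]

assert load_user_rowwise("user:1") == load_user_doc("user:1")
```

Both return the same data, but the row-wise version does work proportional to the whole store on every read, which is exactly the kind of usage that gets blamed on the database rather than on the layout.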
As for memory, that was a while back, so it might have been an old version or poor usage. I just heard about maxing out server RAM being treated as a key requirement.