When biologist Rahul Sinha started his first independent research project at Stanford last January, he had a single-minded goal. He had just completed his post-doc in the lab of Irv Weissman, the Stanford biologist who helped launch the field of stem cell research. The lab studies the stem cells that form blood---the bone marrow-derived cells that help cancer patients recover after chemotherapy destroys their immune systems. Sinha wanted to find a true blood stem cell: one that hadn’t already started turning into a red blood cell, or a platelet, or an immune cell. A universal blood stem cell could reveal the path to all its progeny, helping scientists custom-make any blood cell a patient needed.
For decades, researchers had been using molecular techniques to narrow their search, but that approach had stagnated. To find his unicorn, Sinha would have to dig deeper, into the proteins that would eventually define the cells. That would require him to sequence the RNA of thousands of seemingly identical stem cells from a collection Weissman had built. And like most geneticists working today, he turned to a machine from Illumina, the San Diego-based company whose products sequence 90 percent of all genetic data.
But instead of a true stem cell, Sinha stumbled onto something very different. Inconsistent results led him to identify an issue with the underlying operations of Illumina’s newer sequencers---an issue that could have contaminated the results of similar high-sensitivity data produced on the machines in the last two years.
Sinha’s research used Illumina’s HiSeq 4000, a fast system that cuts costs by sequencing hundreds of samples at a time. It also uses a proprietary technology, called ExAmp, that makes genetic signals clearer, even very faint ones. That makes it possible to sequence very small amounts of genetic material---like, say, a single cell’s worth. For those reasons, the HiSeq 4000 is a workhorse for geneticists who sequence in bulk. Scientists who manage the sequencing core facilities for the University of California system estimate that the machine, introduced in January 2015, handles 90 percent of their sequencing requests.
Sinha and other academic researchers aren’t the only ones who need that kind of needle-in-a-haystack sensitivity. Precision medicine---like spotting a piece of tumor DNA in a drop of blood or finding a rare variant among the 3 billion base pairs in the human genome¹---also requires high-resolution sequencing. Clinical researchers and biotech start-ups that need that kind of resolving power are increasingly using Illumina’s ExAmp chemistry and the machines that employ it, including its newest line, the NovaSeq.
Illumina itself is investing heavily in its sequencers’ medical applications. In the last few years, the biotech behemoth has acquired, invested in, partnered with, and spun out companies that can use its aggressively patented sequencing tech to address disease. At an unveiling in January 2017, Illumina’s CEO Francis deSouza said that Grail, the company’s liquid cancer biopsy spinout, would soon be one of Illumina’s biggest customers. Grail and others are using the sensitive machines to search for shreds of tumor DNA in blood samples---a screening tool that could lead to earlier detection and better patient outcomes. At the time of the announcement, Illumina had 49 NovaSeq orders, and since then machines have been installed in medical centers and precision medicine biotech companies around the world. Getting these sequences right is more than just a matter of academic integrity: Money and medical progress are at stake.
Sinha started his search with a library. Not the kind full of paper books---this one is built on a small glass plate with depressions, called wells, that separate genetic material from different cells. After converting his cells’ RNA into DNA and chopping it into small pieces, Sinha tagged each cell’s DNA fragments with a row identifier and a column identifier, coordinates that would trace each fragment back to the well (and therefore the cell) it came from. Once all the fragments were barcoded, he dumped them into a single test tube, washed away the extra barcode-containing molecules, and sequenced them. Just as a librarian would use a Dewey Decimal number to return books to their shelves, Sinha would use the barcodes to match each piece of sequenced DNA to the cell it belonged to.
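The bookkeeping behind that matching step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline Sinha's team ran: the barcode sequences, the three-by-three plate, and the function name `assign_wells` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical barcodes: each row and each column of the plate gets its own tag.
ROW_BARCODES = {"AAC": "A", "CGT": "B", "GTA": "C"}
COL_BARCODES = {"TTG": 1, "GCA": 2, "ACC": 3}


def assign_wells(reads):
    """Group sequenced fragments by the well their barcode pair points to.

    `reads` is an iterable of (row_barcode, col_barcode, fragment) tuples.
    Reads with a barcode that is not on the plate are collected under None.
    """
    wells = defaultdict(list)
    for row_bc, col_bc, fragment in reads:
        row = ROW_BARCODES.get(row_bc)
        col = COL_BARCODES.get(col_bc)
        well = (row, col) if row is not None and col is not None else None
        wells[well].append(fragment)
    return dict(wells)


# Four made-up reads: two from well A1, one from B3, one with an unknown tag.
reads = [
    ("AAC", "TTG", "GATTACA"),
    ("AAC", "TTG", "CCGGTTA"),
    ("CGT", "ACC", "TTTTGGG"),
    ("NNN", "TTG", "ACGTACG"),
]
wells = assign_wells(reads)
```

The lookup is only as trustworthy as the barcodes themselves. A fragment that ends up carrying the wrong row or column tag is filed under the wrong cell, and nothing in the lookup flags it.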
Sinha got his results in August, and they looked amazing. Gene expression had revealed 41 distinct subpopulations of blood-forming stem cells, including a group of cells that seemed capable of transitioning into any of the other ones---his true stem cell. “It fit with every hypothesis we had ever generated in the past 10 years,” says Sinha. “It was really exciting stuff.” That fall, the group started preparing their work for publication.
But meanwhile, Stanford grad students using the same Illumina machines to do similar work were beginning to warn each other to prepare their libraries more carefully. It seemed there was an uptick in tales of cross-contamination, genetic material from one sample jumping into another.
The whispers reached the ears of Geoff Stanley, a biophysicist who had helped Sinha run his computational analysis in August. There was something about the stem cell data that had bugged Stanley at the time---and now he was worried it was due to cross-contamination.
When he reexamined the data, Stanley found a curious pattern: Cells that had looked like genetic neighbors---the ones that belonged to the same stem cell subgroup---turned out to be geographic neighbors too. All the cells in a subgroup always shared a barcode coordinate for the same row or the same column, making a cross-shaped pattern. “The chances of that happening randomly are infinitesimal,” says Stanley. He texted Sinha and two days later showed him the analysis. “That was the first hint we knew something was wrong,” says Sinha.
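Stanley's point about the odds can be illustrated with a rough simulation. The sketch below is not his analysis. It assumes a 16-by-24 plate (a standard 384-well layout; the article does not state the plate size) and a subgroup of 10 cells, and the helper names `fits_cross` and `chance_of_cross` are invented for illustration. It draws random sets of wells and counts how often every well in a set falls on a single row-plus-column cross.

```python
import random


def fits_cross(wells):
    """Return True if some row r and column c exist such that every
    (row, col) well lies in row r or in column c."""
    wells = list(wells)
    for row in {r for r, _ in wells}:
        # Wells outside the candidate row must all share one column.
        off_row_cols = {c for r, c in wells if r != row}
        if len(off_row_cols) <= 1:
            return True
    return not wells  # an empty group fits trivially


def chance_of_cross(n_rows, n_cols, group_size, trials=10_000, seed=0):
    """Estimate how often `group_size` randomly placed cells form a cross."""
    rng = random.Random(seed)
    plate = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    hits = sum(fits_cross(rng.sample(plate, group_size)) for _ in range(trials))
    return hits / trials


# Assumed numbers: 16 rows, 24 columns, subgroups of 10 cells.
estimate = chance_of_cross(n_rows=16, n_cols=24, group_size=10)
```

Under those assumptions a single cross covers only 39 of 384 wells, so a random group of 10 almost never fits one. Seeing the pattern in every subgroup points to something systematic rather than chance.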
That was late December. They spent the next few weeks retracing their steps, looking for places they could have made a mistake. And when they resequenced their samples on a different machine---an older Illumina model called the NextSeq 500---the cross-patterns disappeared, and the blood stem cell subtypes along with them. “Immediately we knew that all the 41 populations were fake,” says Sinha. “It was devastating.”
The pair brought in John Coller, who runs the functional genomics facility on campus, to design some additional tests. In one, they sequenced empty wells---but the sequencer’s results showed they weren’t empty at all. The machine was assigning sequenced fragments to wells that had no cellular DNA to start with.
What the wells did have in them were free-floating barcodes---which the scientists thought could be going rogue. So they took leftover material from the libraries Sinha had already sequenced and added two brand new barcodes into the mix. This time, when they sequenced the sample, they found about 7 million fragments with the new barcodes. The free barcodes were interacting with Illumina’s ExAmp reagents to form new fragments, which the machine sequenced along with the real cellular DNA.
Finally, Sinha, Stanley, and Coller had nailed down the source of their cross-contamination.
Their free-floating barcodes, some of which always escape the library wash process, had never caused problems on the old machines. But, they believed, in machines that use the ExAmp chemistry, those molecules were randomly sticking around. That could make gene expression that belonged to one cell look like it belonged to another one entirely, with no way of knowing where it actually came from.
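A toy simulation shows why this kind of swap would leave a cross-shaped footprint on the plate. It is a sketch under loudly stated assumptions, not Illumina's chemistry or the Stanford analysis: it assumes that a misassigned fragment swaps exactly one of its two barcodes, that the replacement is chosen uniformly at random, and that the plate is 16 by 24 wells. The 5 percent hop rate and the function names are made up for illustration.

```python
import random
from collections import Counter


def hop_one_index(well, n_rows, n_cols, rng):
    """Swap either the row or the column barcode for a different one."""
    row, col = well
    if rng.random() < 0.5:
        new_row = rng.choice([r for r in range(n_rows) if r != row])
        return (new_row, col)
    new_col = rng.choice([c for c in range(n_cols) if c != col])
    return (row, new_col)


def simulate(source_well, n_reads, hop_rate, n_rows=16, n_cols=24, seed=0):
    """Count where reads from one cell end up after barcode lookup."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_reads):
        well = source_well
        if rng.random() < hop_rate:  # assumed, illustrative hop rate
            well = hop_one_index(well, n_rows, n_cols, rng)
        counts[well] += 1
    return counts


counts = simulate(source_well=(3, 7), n_reads=10_000, hop_rate=0.05)
# Every misassigned read still shares a row or a column with the source well.
on_cross = all(r == 3 or c == 7 for r, c in counts)
```

Because only one coordinate changes, the stray reads land in the source cell's row or column, so wells along that cross pick up a faint copy of its gene expression.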
Sinha wasn’t the first person to notice something funny in the HiSeq 4000’s results. Rumors have been swirling in corners of the internet ever since Illumina introduced the ExAmp technology. A genomics core manager at Cambridge University blogged about the problem, as did a Swedish bioinformatician in Stockholm. They used Illumina’s patents to hypothesize some mechanisms for the issue, but never published any formal data to back them up. Now Sinha had that kind of data, and he wanted to clue in the scientific community. But first, he and his colleagues decided to tell Illumina.
Near the end of January, Coller sent the company the results of their tests. Illumina responded, suggesting the problem looked very minimal, and could in fact be an error on Stanford’s end. The university’s Dean of Research Ann Arvin fired back with a letter to Illumina’s top management, outlining the school’s concerns. The company replied that it would look into the issue and get back to them.
That was where they left things until April 9, 2017, when Sinha dropped his team’s findings onto a biology pre-print server hosted by Cold Spring Harbor, bioRxiv. Science Twitter blew up with anxious researchers desperate to know if their sequencing data had been jeopardized. On April 10, the company responded in a set of tweets.
A few days later, just after midnight on Tuesday April 17, Illumina added a whitepaper entitled “Effects of Index Misassignment on Multiplexing and Downstream Analysis” to its website. (The company began work on the report in February, following Stanford’s complaint.) Illumina refers to the problem as “barcode hopping,” and writes that it was a known issue---describing its mechanism, how the company measures the effect, and ways to minimize it. Other than the April 10 tweets, it was the company’s first public recognition of the problem. While Sinha has taken some heat for going to pre-print, as opposed to waiting months or years to publish a peer-reviewed paper, he feels validated by how quickly things now seem to be moving.
The company says it has known about barcode hopping for 10 years, well before ExAmp, but that it occurred at such low rates (1 percent and below) that it was considered a small, acceptable level of background noise. But after Stanford came to the company with its complaint, Illumina realized that under certain circumstances, the effect could be more dramatic. “By far, that was the most extreme case of index swapping we’ve seen,” said Omead Ostadan, Illumina’s executive vice president of strategy, products and operations. “We realized we had to move quickly to characterize the problem.”
Lutz Froenicke, who runs the sequencing center at UC Davis, said that he’s not aware of anything in the literature or in the training Illumina gives scientists that specifically would have warned researchers about these free barcodes. But he also agrees that Sinha’s data was an extreme case, because he was sequencing so many cells with so little genetic material to work with. A typical mammalian cell contains only 200-600 femtograms (10⁻¹⁵ grams) of usable RNA that actually codes for proteins. It has 10 times as much DNA. And the average vial of spit that a company like 23andMe might use to sequence your genes contains thousands of cells. “There’s no reason to panic just yet,” says Froenicke. “Ninety-nine percent of experiments will be just fine.”
That’s the stance Illumina is taking too. But after reviewing the Stanford data and conducting its own investigation, the company does now admit that the ExAmp chemistry is more sensitive to the presence of free barcodes than its previous platform. Illumina disagrees, however, with Sinha and his co-authors, who propose that the switch away from the older chemistry---specifically its multiple wash steps---is likely to blame. The company maintains that the problem can be exacerbated by changes in library preparation, like letting samples sit at room temperature. “What we found is that various unusual factors all combined to create a result that is highly uncommon,” said Gary Schroth, a vice president for product development.
To anyone criticizing the quality of his barcodes, his wash, his libraries, Sinha says he has only one question: “Why don’t all of those things cause a devastating switching effect on a NextSeq 500?” On that question, Illumina still doesn’t have an answer.
And until it does, it’s impossible to know the extent of the issue---how much data has been compromised, how many papers might have to be retracted, how many experiments thrown out.
For Sinha and his colleagues the situation is more stark. Weissman’s lab says it lost nearly $1 million to the problem, including salaries and supplies for studies that piggybacked off the faulty sequencing data. And Weissman is not attempting hyperbole when he says he wishes someone would declare a state of emergency. “If you have a flood in California that suddenly has a general effect on businesses, you can go to the state or federal government for emergency aid,” he says. “We don’t have that.” He pauses. “This is a disaster for us.”
Sinha lost a year’s worth of data. So now he’s not taking any chances. He’s re-running his experiments on one of the older machines and furiously applying for new grants to fund them. He knows now that there aren’t 41 neat, tidy blood-forming stem cell types waiting to be dug out of the genetic data mine. But he hasn’t lost hope that his unicorn is still in there, waiting to be found.
¹ UPDATE 7:40 pm Eastern 04/20/17: This story has been updated to correct the number of base pairs in the human genome. A previous version stated there were 3 million.