Providing Expertise at a Major 'Big Data' Summit

Scott Gibson

Edmon Begoli of the Joint Institute for Computational Sciences and PYA Analytics

Edmon Begoli, chief data officer at the Joint Institute for Computational Sciences and chief technology officer for PYA Analytics, offered answers for linking disparate datasets during a talk at the Chief Data Officer Summit, Dec. 2–3, 2014, in New York City’s Financial District, Manhattan.

The annual summit, organized and produced by Innovation Enterprise, an independent business-to-business multichannel media brand, dissects the role of the chief data officer and covers the latest innovations for advancing an organization's data strategy and management.

Summit attendees come from some of the most prominent companies, organizations, and municipalities in the world, including Equifax, JP Morgan Chase, the New York Times, Cigna, Twitter, MapQuest, the U.S. Department of Defense, the British Army, the City of Los Angeles, the City of New York, and many others.

“The speakers and audience were very motivated to hear about the topics, because the role of a chief data officer is relatively new,” Begoli said. “It’s a very meaningful role—both chief data officer and chief data scientist—with the emergence of large data problems and the deluge of data.”

While variety, velocity, and volume are the concepts most often attached to the 'Big Data' buzzword, summit attendees are aware of a more significant overarching challenge: linking disparate datasets.

“In an old world, we would have one relational database that eventually would be linked to another one, and we would derive results; and there were some obvious ways to link the data,” Begoli said.

However, the deluge of data emanating from social networks, climate research, demographics, and other data sources has created a new scenario. “We live in what I call a post-relational world, and that requires some probabilistic techniques that are beyond what is currently being put in place,” Begoli said.

According to Begoli, the solutions involve probabilistic integration of structured, semi-structured, and unstructured data. Such integration means translating unstructured data into a relatively structured, workable form and then probabilistically matching the names, places, and other entities it mentions against records found in more structured sources, such as financial statements or electronic health records. “This closely relates to the research we’re doing right now at JICS, where PYA Analytics and JICS are collaborating with others in this space,” Begoli said.
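The matching step Begoli describes can be sketched in miniature. The records, names, similarity measure, and threshold below are illustrative assumptions for this article, not the techniques JICS or PYA Analytics actually use:

```python
# Minimal sketch of probabilistic entity matching: link a free-text name
# mention to the best-matching structured record. All data is hypothetical,
# and the similarity score is a simple stand-in (difflib's string ratio).
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_entity(mention: str, records: list, threshold: float = 0.8):
    """Link a name mention to the closest structured record, or return
    None when no candidate clears the similarity threshold."""
    best = max(records, key=lambda r: similarity(mention, r["name"]))
    score = similarity(mention, best["name"])
    return (best if score >= threshold else None), score


# Hypothetical structured records, e.g. drawn from claims data.
records = [
    {"id": 1, "name": "Jonathan Q. Smith"},
    {"id": 2, "name": "Maria Gonzales"},
]

# A name as it might appear in an unstructured clinical note.
match, score = link_entity("Jonathan Smith", records)
```

A production linker would weigh multiple fields (dates of birth, addresses, identifiers) and combine their agreement probabilistically rather than relying on a single string score, but the shape of the problem is the same.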

He drew on extensive experience from both the public and private sectors to present stories from health care and forensic accounting. In his talk, he demonstrated an example of linking data coming from electronic health records, financial claims data, imaging data, and pathology and radiology reports—all of which arrived in a variety of formats.

“In the financial data accounting space, I spoke about techniques for detecting fraud amongst the data that is linked between the accounts payable, the accounts receivable, invoices, contracts, and other legal documents in a very disparate form,” he said. “So that was a very challenging example, and I believe both of these resonated very well with the audience.”
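A toy version of one such cross-document check can illustrate the idea: flagging invoices with no corresponding payment. The invoice and payment records, field names, and tolerance below are invented for illustration and are not drawn from the talk:

```python
# Hypothetical sketch of one fraud-screening check across linked records:
# flag invoices in accounts payable that have no payment from the same
# vendor within a small amount tolerance (to allow rounding differences).
INVOICES = [
    {"invoice_id": "INV-001", "vendor": "Acme Corp", "amount": 1200.00},
    {"invoice_id": "INV-002", "vendor": "Acme Corp", "amount": 560.50},
    {"invoice_id": "INV-003", "vendor": "Globex", "amount": 9800.00},
]
PAYMENTS = [
    {"vendor": "Acme Corp", "amount": 1200.00},
    {"vendor": "Acme Corp", "amount": 560.49},  # rounding difference
]


def unmatched_invoices(invoices, payments, tolerance=0.05):
    """Return IDs of invoices with no same-vendor payment whose amount
    falls within `tolerance` of the invoiced amount."""
    flagged = []
    for inv in invoices:
        paid = any(
            p["vendor"] == inv["vendor"]
            and abs(p["amount"] - inv["amount"]) <= tolerance
            for p in payments
        )
        if not paid:
            flagged.append(inv["invoice_id"])
    return flagged


flags = unmatched_invoices(INVOICES, PAYMENTS)  # only INV-003 is unpaid
```

In practice the hard part is the linking itself: vendor names, invoice numbers, and contract references rarely agree exactly across accounts payable, accounts receivable, and legal documents, which is where the probabilistic matching described above comes in.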

Among the systems JICS is developing in its set of solutions for ‘Big Data’ is Photon, part of an ongoing project with a President’s Fellow in the U.S. Department of Health and Human Services aimed at linking physician provider datasets from Medicare with those from Medicaid.

Photon combines compact, high-density hardware with a Hadoop file system and a Spark in-memory compute engine. Photon uses Spark’s machine learning libraries along with Spark’s processing engine.

“We’re trying to see how well Spark will perform,” Begoli explained. “Photon is a great resource and is going to be available to academia and research. Right now, we are in the trial process, just currently putting it into place. And we plan to build a whole system set around this architecture if it proves itself. We are using this very practical problem as a way to demonstrate its utility.”

JICS is positioned to solve the toughest problems not only in ‘Big Data’ but also in science and engineering, with HPC administrators, hardware experts, computational scientists, and leading-edge machines. JICS collaborates on academic proposals and industry innovations.

“This is the place where industry can come and try things out for the first time,” Begoli said. “They can get easy access to some great experts and use it for production scenarios or for instrumentation purposes.”

Article posting date: 18 February 2015

About JICS: The Joint Institute for Computational Sciences was established by the University of Tennessee and Oak Ridge National Laboratory (ORNL) to advance scientific discovery and leading-edge engineering, and to further knowledge of computational modeling and simulation. JICS realizes its vision by taking full advantage of petascale-and-beyond computers housed at ORNL and by educating a new generation of scientists and engineers well-versed in the application of computational modeling and simulation for solving the most challenging scientific and engineering problems. JICS operates the National Institute for Computational Sciences (NICS), which had the distinction of deploying and managing the Kraken supercomputer. NICS is a leading academic supercomputing center and a major partner in the National Science Foundation's eXtreme Science and Engineering Discovery Environment (XSEDE). In November 2012, JICS sited the Beacon system, which set a record for power efficiency and captured the number one position on the Green500 list of the most energy-efficient computers.