This sponsored post is produced in association with StubHub.

We all know the importance of creating personalized experiences online, and you can't do it without knowing your customers' behavior and needs. That begins with bringing together data from a variety of sources and making it readily available to downstream teams and processes, i.e., the data scientists and programmers who rely on it to analyze and develop the behind-the-scenes programs that make everything tick, so they can deliver the holy grail of personalization.

There’s no question about the power of Hadoop when it comes to storing and processing huge quantities of data on clusters of commodity hardware. Hadoop acts like a vast and scalable data staging area for accommodating any data type, no matter how unstructured — which is exactly what you need when it comes to dealing with customer behavioral data.

But there are issues in getting that data into Hadoop. Most companies deal with these by building customized add-on solutions. Let’s talk about the four main hurdles, or gaps, in getting data into Hadoop and making it available for processing.

Hurdle #1: Automatic ingestion of data produced in both stream and batch mode

Most companies deal with a huge spectrum of customer data. Take, for example, StubHub, the online event ticket marketplace owned by eBay. To create an end-to-end fan experience and recommend events based on customer preferences and interests, StubHub needs to analyze data coming in from a vast number of sources.

Those sources include clickstream data (the real-time trail of actions a user leaves behind while visiting a website, captured in weblogs), social media data, responses to emails and promotions, historical transactions, segmentation data, and more.

The data comes in different formats (name-value pairs, JSON, relational, and so on) and gets produced at different frequencies. Yet all of this data is essential to creating a unique experience for the customer.

“Hadoop is well suited for both real-time as well as batch processing,” says Sastry Malladi, chief architect at StubHub. “But right now there’s no single common framework or platform for automatically ingesting the data from all these different sources in a way that makes the data immediately consumable.”

As he points out, most companies rely on one-off solutions (like Sqoop to bring in SQL transaction data and Flume for log data) to address one type of source or another. But as yet, there is no comprehensive, one-size-fits-all solution they can turn to.
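To make that gap concrete, here is a minimal, hypothetical sketch in Python of what a single common ingestion framework might look like: one registry of source handlers instead of a separate one-off tool per source. The handler names and interfaces are illustrative assumptions, not part of any real product.

```python
# A hypothetical sketch of a pluggable ingestion layer: one registry of source
# handlers instead of a separate one-off tool per source. All names here
# (SourceHandler, ingest, etc.) are illustrative, not from any real framework.
from abc import ABC, abstractmethod
from typing import Dict, Iterable


class SourceHandler(ABC):
    """One handler per source type (weblogs, SQL exports, JSON feeds, ...)."""

    @abstractmethod
    def read(self) -> Iterable[dict]:
        """Yield records from the source, normalized to dicts."""


class WeblogHandler(SourceHandler):
    def __init__(self, path: str):
        self.path = path

    def read(self) -> Iterable[dict]:
        with open(self.path) as f:
            for line in f:
                # Toy parse: real clickstream parsing would be far richer.
                yield {"raw": line.rstrip("\n"), "source": "weblog"}


HANDLERS: Dict[str, SourceHandler] = {}


def register(name: str, handler: SourceHandler) -> None:
    HANDLERS[name] = handler


def ingest(name: str, sink) -> int:
    """Pull every record from the named source and hand it to a sink callable."""
    count = 0
    for record in HANDLERS[name].read():
        sink(record)  # e.g., append to an HDFS staging area
        count += 1
    return count


if __name__ == "__main__":
    # Write a tiny sample log so the sketch runs end to end.
    with open("access.log", "w") as f:
        f.write("GET /event/12345 HTTP/1.1\n")
    register("weblog", WeblogHandler("access.log"))
    staged = []
    print(ingest("weblog", staged.append), "record(s) staged")
```

The point of the sketch is the shape, not the details: adding a new source means registering a new handler, rather than bolting on another standalone tool.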

Hurdle #2: Validating and monitoring data integrity

As you pipe your data into Hadoop, you need to make sure what you end up with is exactly what you started with. This involves continually monitoring the ingestion process, validating the data, and receiving alerts when something isn’t working.

The problem is that different data types require different kinds of validation. For instance, if you are receiving structured SQL data, you want to make sure the rows, columns, and values match. Likewise, if you are receiving XML data, the tree structure needs to match. Right now, what's missing is a common framework for plugging in different validation mechanisms based on the type of data, as the need arises.
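As a rough illustration, here is what plugging in validators by data type could look like, using only standard Python libraries. The checks, names, and alert hook are assumptions made for the example, not any particular framework's API.

```python
# A hedged sketch of pluggable validation: pick a validator by data type and
# report whether what landed matches what was sent. The checks and names are
# illustrative only.
import json
import xml.etree.ElementTree as ET


def validate_relational(expected_rows: int, expected_cols: int, rows: list) -> bool:
    """For SQL-style exports: row and column counts must match the source."""
    return len(rows) == expected_rows and all(len(r) == expected_cols for r in rows)


def validate_json(payload: str) -> bool:
    """For JSON feeds: the payload must at least parse."""
    try:
        json.loads(payload)
        return True
    except ValueError:
        return False


def validate_xml(payload: str) -> bool:
    """For XML feeds: the tree must be well formed."""
    try:
        ET.fromstring(payload)
        return True
    except ET.ParseError:
        return False


VALIDATORS = {
    "relational": lambda args: validate_relational(*args),
    "json": validate_json,
    "xml": validate_xml,
}


def check(data_type: str, payload, alert=print) -> bool:
    ok = VALIDATORS[data_type](payload)
    if not ok:
        alert(f"ALERT: ingestion check failed for {data_type} data")  # monitoring hook
    return ok


print(check("json", '{"event": "click"}'))    # True
print(check("xml", "<events><e/></events>"))  # True
```

The alert hook at the end is where a monitoring system would plug in, which leads to the next point.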

Along with that, most companies need a monitoring solution that lets them know the state of ingestion at all times and generates alerts if something goes wrong, so problems get fixed as they occur.

Hurdle #3: Making the data immediately consumable

You’ve brought the data into Hadoop. Now your data is just sitting there in its original raw format. But without metadata and schema management for each of the data sources, you have no way to query the data through a SQL interface such as Hive. In other words, you’re just dealing with a big data lake, and nobody can efficiently find what they are looking for.

What’s needed is a method for creating these metadata schemas on the fly. Data ingestion processes have no control over the data production side. On top of that, the order and type of data elements can change drastically, depending on the data source. So the ingestion process needs to ensure raw data automatically gets turned into actionable data sets as it lands in Hadoop, so downstream applications can make sense of it. This can be tricky, especially when dealing with non-SQL types of data.
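To illustrate what "schema on the fly" can mean in practice, here is a hedged sketch that infers Hive column types from a sample JSON record and emits a CREATE EXTERNAL TABLE statement, so the raw data becomes queryable through Hive as soon as it lands. The table name, HDFS location, and SerDe choice are assumptions for illustration; which SerDe you use depends on your Hadoop distribution.

```python
# A minimal sketch of inferring a Hive schema from a sample JSON record.
# Table name, location, and SerDe are illustrative assumptions.
import json


def hive_type(value) -> str:
    if isinstance(value, bool):   # check bool before int (bool is a subclass of int)
        return "BOOLEAN"
    if isinstance(value, int):
        return "BIGINT"
    if isinstance(value, float):
        return "DOUBLE"
    return "STRING"               # fall back to STRING for anything else


def ddl_from_sample(table: str, location: str, sample: str) -> str:
    record = json.loads(sample)
    cols = ",\n  ".join(f"`{k}` {hive_type(v)}" for k, v in record.items())
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        "ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'\n"
        f"LOCATION '{location}';"
    )


sample_event = '{"user_id": 42, "event": "ticket_view", "ts": 1.7e9}'
print(ddl_from_sample("clickstream_raw", "/data/raw/clickstream", sample_event))
```

A real framework would also have to handle schema drift, nested structures, and conflicting samples, which is where the "tricky" part comes in.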

Hurdle #4: Enabling job scheduling based on data availability

You’ve defined your schemas. You’ve developed your jobs that process the data. The next step is scheduling what jobs you want to run, when.

In the big data world, most businesses schedule jobs with tools such as Oozie and cron. A common limitation of these tools is that they are time-based schedulers. Imagine a situation, and this happens all the time, where you have a job that kicks off at 9 p.m., but the data the job depends on has not yet arrived. A better way is to schedule jobs based on data availability, but within some type of window.

“You want the flexibility to say, here are the data sources I am looking for, schedule my jobs when this data arrives. But if the data doesn’t arrive within a specified time period, send an alert so someone can look into and resolve this,” explains Malladi.
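Here is a simple sketch of what that could look like: wait for a set of expected inputs, kick off the job once they have all arrived, and raise an alert if the window closes first. The paths, polling interval, and alert hook are placeholders, not part of any particular scheduler.

```python
# A hedged sketch of scheduling on data availability rather than on the clock.
import os
import time
from datetime import datetime, timedelta


def run_when_available(expected_paths, job, window=timedelta(hours=2),
                       poll_seconds=60, alert=print):
    deadline = datetime.now() + window
    missing = list(expected_paths)
    while datetime.now() < deadline:
        missing = [p for p in expected_paths if not os.path.exists(p)]
        if not missing:
            job()          # all inputs have landed: run the job now
            return True
        time.sleep(poll_seconds)
    alert(f"ALERT: inputs still missing at deadline: {missing}")
    return False


if __name__ == "__main__":
    # Placeholder inputs (e.g., _SUCCESS markers a batch job writes when done).
    inputs = ["/data/raw/clickstream/_SUCCESS", "/data/raw/transactions/_SUCCESS"]
    # With nonexistent paths, this demo waits out the short window and then alerts.
    run_when_available(inputs, job=lambda: print("launching processing job"),
                       window=timedelta(seconds=10), poll_seconds=2)
```

Production schedulers would layer retries, dependency graphs, and notification channels on top, but the core idea is the same: the trigger is the data, with a deadline as the safety net.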

Those are the four main challenges in bringing data into Hadoop today. Generally, companies build their own custom solutions to fill these gaps, but doing so can divert focus from the core competency of the business.

A better solution may be on the horizon. Based on the work it has done on its own big data architecture, StubHub has created an open source framework it calls BigDime (an acronym that stands for big data ingestion made easy), which it plans to make publicly available in the near future, according to Malladi.

Built on Flume and Kafka, BigDime is an extensible framework for making the ingestion and processing of vast quantities of customer data within Hadoop faster and easier.

Moving forward, personalization will be key to any consumer-oriented website or app. Working with Hadoop in a hybrid fashion, and having a solution like BigDime that extends Hadoop's capabilities, will be key to making that happen.


Sponsored posts are content that has been produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. The content of news stories produced by our editorial team is never influenced by advertisers or sponsors in any way. For more information, contact sales@venturebeat.com.