In my previous article I discussed Big Data's impact on the corporate data center at length. Today I will discuss the workflows that come with Big Data. To summarize, Big Data workflows are about collecting data from its sources and bringing it to end users to be queried for answers. In real-life situations, however, it is not as easy as you might expect.
One of the most common scenarios in corporations is that Big Data already exists in some form, but it is confined to individual departments. When Big Data is implemented across the corporation, this already-present data has to be consolidated and centralized. In most cases the centralization takes place in the corporate data center. Therefore, the IT department has to think not only about the storage and management of Big Data, but also about its integration with the other data center operations.
To summarize, Big Data workflows consist of collection, preparation, and analysis phases. A problem I faced in a recent consulting job is a good real-life example of all three.
In that consulting job, the Human Resources department asked me to consolidate the employee folders (dossiers) and prepare for the company's "next-generation HR management." I asked the HR employees about the structure of the employee folders and, as I expected, was presented with various formats of unstructured data: printed-out pages, photographs of each employee, Microsoft Excel files, Microsoft Word files, and Microsoft SQL Server 2000 and IBM DB2 databases. And that was not the worst part. Since I live in Turkey, my alphabet contains non-English characters, sometimes in addition to and sometimes instead of the English ones. For example, the capital of i is not I but the dotted İ, and I is the capital of the dotless ı. Imagine the complexity this introduces when a lazy typist enters the ASCII i or I in place of the Turkish letters: she creates an incompatibility that cannot be solved by a simple find-and-replace, because some occurrences of the same character are legitimate. That, of course, was there too, and it added one more layer to the already incompatible and unstructured data: inconsistency.
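A minimal sketch of why this cannot be fixed mechanically: Unicode's default case mapping, which Python's `str.lower()` uses, is locale-independent, so it cannot know that in Turkish the lowercase of "I" is the dotless "ı" and the uppercase of "i" is the dotted "İ". (The string values here are just illustrative examples.)

```python
# Default Unicode case mapping is correct for English but wrong for Turkish:
assert "I".lower() == "i"        # a Turkish reader expects dotless "ı" here
assert "İ".lower() == "i\u0307"  # "İ" lowercases to "i" plus a combining dot

# So two spellings of the same word no longer match after normalization,
# and a blind find-and-replace would also corrupt the legitimate cases:
print("İSTANBUL".lower() == "istanbul")  # prints False
```

Cleaning such data requires locale-aware (or rule-based) handling rather than a global substitution, because the same code point can be either a typo or the intended letter depending on the word.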
Up to this point, the issue is all about creating a consistent and structured database, which is, in my opinion, a prerequisite for Big Data. However, my client wanted to go one step further and bring in data from the social networks. This raised the question of finding, bringing in, and storing data that originates outside the company and outside its control. Is there a disk quota for tweets?
Consider that the company has 5,000 employees, of whom 1,000 have active Twitter and Facebook accounts, each uploading an average of 1 Kilobyte of data every day. The amount of data to be stored each day is 1 Kilobyte x 1,000 employees = 1,000 Kilobytes, roughly 1 Megabyte per day. If you bring in additional networks, say Tumblr, Instagram and others, the amount of data multiplies quickly. At this point the IT department should step in with its questions: Is there really a need for this amount of data? Will the data be filtered before it is imported into the HR database (will you import an employee's "I luv my cat" tweet)? What will be the retention time for employees who leave? For those who are promoted? On whose budget will you purchase the additional storage? The additional computing power?
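The sizing above can be sketched as a back-of-the-envelope calculation. All the numbers are the article's assumptions, not measurements; the yearly figure is simply the daily figure extrapolated.

```python
# Back-of-the-envelope storage sizing for one social network.
employees_active = 1000       # employees with active social accounts (assumed)
bytes_per_day_each = 1024     # ~1 KB of social data per employee per day (assumed)

daily = employees_active * bytes_per_day_each  # ~1 MB per day
yearly = daily * 365                           # per-network yearly total

print(f"{daily / 1024:.0f} KB/day, {yearly / 1024**2:.0f} MB/year")
# prints: 1000 KB/day, 356 MB/year
```

Each additional network adds another term of this size, so growth is roughly linear in the number of networks and employees; the operational burden (filtering, retention, budgeting) grows with it.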
The second workflow is the preparation of data: preparing it to integrate with the other systems. Up to this point we had, conceptually, built a consistent database of employee information that extends beyond the boundaries of the enterprise. The next step is to integrate the employee data with the company's retail business. For example, the company wants to analyze the buying behavior of the employees who have the largest number of followers on Twitter. To make this analysis possible, the HR department's employee database has to be integrated with the retail business's database. Here the IT department should step in to define the integration, decide on the temporary or permanent data warehouses the queries will run against, and schedule the updating and synchronization of the repositories and transaction systems.
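A hypothetical sketch of that integration step, using an in-memory SQLite database: the HR system's employee records are joined with the retail system's purchase records to rank buying behavior by Twitter following. All table and column names are invented for illustration; in practice the two sources would live in separate systems and be consolidated into a warehouse first.

```python
import sqlite3

# Toy stand-ins for the consolidated HR and retail tables.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE hr_employees (emp_id INTEGER, name TEXT, twitter_followers INTEGER);
    CREATE TABLE retail_purchases (emp_id INTEGER, item TEXT, amount REAL);
    INSERT INTO hr_employees VALUES (1, 'Ayşe', 12000), (2, 'Mehmet', 300);
    INSERT INTO retail_purchases VALUES (1, 'laptop', 950.0), (2, 'book', 12.5), (1, 'phone', 400.0);
""")

# Total spend per employee, most-followed first -- the analysis HR asked for.
rows = db.execute("""
    SELECT e.name, e.twitter_followers, SUM(p.amount) AS total_spend
    FROM hr_employees e
    JOIN retail_purchases p ON p.emp_id = e.emp_id
    GROUP BY e.emp_id
    ORDER BY e.twitter_followers DESC
""").fetchall()
print(rows)  # [('Ayşe', 12000, 1350.0), ('Mehmet', 300, 12.5)]
```

The join key (`emp_id` here) is exactly what the earlier cleanup work has to guarantee: if employee identities are inconsistent across the two systems, this query silently drops or misattributes records.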
The third workflow is preparing the systems for analysis and querying. In almost all cases the queries themselves are outside the IT department's control: they will be both ad hoc, executed by the users on demand, and occasionally scheduled. The IT department, however, has to tune its systems for performance and to meet the agreed SLA levels.
The bottom line is that Big Data is not a simple database solution that is implemented and then queried. Besides the traditional IT tasks, the workflows that come with it will put additional load on the infrastructure. Last but not least, the service level expectations will also change with time: the scheduled reports that run in off-work hours today will be required to run in real time in the near future. And IT has to think about all of this now.
Featured Image: forbes.com