An Outline Of The Data Pipeline


By T. McDonald | 2/07/19
The data pipeline is one way of handling data. This involves acquiring data from a source, or sources, preparing it for use, analysing it, and presenting what was discovered during the analysing to an appropriate audience. Subsequently, there are four stages to the pipeline, which I will outline in this blog:
  • Acquisition
  • Preparation
  • Analyse
  • Presentation

Acquisition

Before you can do anything, you will need to find some data and determine if it is suitable for the task. This involves legal issues surrounding the data such as its licensing: are you allowed to use it and if so, what are you allowed to do? There may be limitations on the use of the dataset for example. Furthermore, files come in different formats such as CSV or JSON for example.
exampleFile.csv
Or
exampleFile.json

 
Meaning of the extensions:

CSV = Coma Separated Values
JSON = JavaScript Object Notation


The above are just two examples and there will be other types.

Two ways of representing data:

  • Tabular
  • Document
A common type of file for sharing datasets is CSV files and is presented in tabular form and is an example of text-delimited representation. The files have datum, a single unit of data, separated by a comma. This allows tools and software to present the dataset as a table with columns and rows. It is notable that tables may exist within websites; however, they are not CSV files, but maybe in tabular form.

Example of tabular form.



Documents are seen by data scientists as any file or representation of a file that includes or contains something as a consistent part of a data record. A document can be text or images documenting some facts or feelings.

JSON files can represent data in tabular and in hierarchical form and is an example of value attribute representation.

JSON key points:
  • Data is in name/value pairs
  • Data is separated by commas
  • Curly braces hold objects
  • Square brackets hold arrays
A JSON file may hold more than just a single piece of data in a cell; it can hold an object, key value pairs or an array.

The point I am making is that data is out there on a website in one form or another about all sorts of things such as numerical data or text. Data can be from a business or data on an experiment. In a nutshell, find the dataset, check the licencing and reference it if you need to. Moreover, once data is captured it will need to be represented somehow and packaged up for transfer.

Preparation

In data science, if a dataset does not seem right in some way, it is said to smell. Before analysing the dataset, the dataset must be cleaned of all smells. Sometimes there is a mismatch between data types in a column or row, which can lead to problems with calculations. Additionally, the correct handling of missing data is essential and may depend upon the database management system (DBMS) you intend to use. Handling missing by replacing it with a string or numerical value is one way of solving the problem; however, if a column or row has lots of missing data it is possible to drop the whole row or column. Other potential problems are dealing with unwanted whitespace, case of characters and rounding of numbers.

Encoding is the mapping of digital representation of a character and the character itself. As a result, the file will need to be read with the correct character encoding, or it may not be displayed correctly or not displayed at all.

Before anything else, it is a good idea to examine the datasets using the command line. This will allow the analyst to briefly investigate the dataset such as the number of rows, columns and location of the files. If analysis requires more than one dataset, the columns with common data may have slightly different names, which means they will have to be corrected. Once this is done, it is possible to combine the datasets making a new dataset with a combination of columns from the different datasets.

In brief,

  • Investigate the data for smells and any useful information.
  • Clean the data by handling things like missing data, incorrect data types, upper or lower case and unwanted whitespace.
  • Data integration: combining and shaping datasets.

Analyse

The purpose of this stage is to look for a story in the data. Analysing a dataset will need tools and an adequate programming language. Which tools needed will depend upon the size of the data and there are many database management systems (DBMS); however, I will just mention two:
  • SQL
  • MongoDB
A useful programming language to use for analysing datasets is Python because it comes with a comprehensive set of modules. One such module is pandas, which is an open source tool-kit under a BSD-licensed library. These tools are used to perform action on a dataset or multiple datasets.

Pandas

It is possible to use pandas dataframes for both CSV and JSON files although the possible nested data in JSON may make it difficult to do. For this reason, it would be easier to use standard python modules for deeply nested structures that will not map easily to tabular form. In other words, if the JSON file will not easily squeeze into a table, use other modules and handle the rows individually or group common rows together. If the dataset is small enough, it is possible to analyse and present the data using only python or SQL.

SQL

SQL uses a fixed schema. It can perform queries and do the things that can be done in pandas. If your data is in a database SQL is like the tool to use. SQL stands for Structured Query Language.  

Some advantages to SQL over a collection of documents:
  • A single query can access all the data.
  • A single query can use data from several tables in the database at the same time.
  • It is possible to define constraints on the data.
Some disadvantages to SQL:
  • Does not scale out well (spread out over many servers).
  • Has a fixed schema (a logical view of a database, which makes it difficult to reshape data).
You might be thinking that most if not all of this could be done using a spreadsheet program. However, although Excel can handle a million rows that does not account for multiple tabs and functions. The truth is that SQL is much faster, and it is possible to send a small text file with instructions for the analysis instead of a huge spreadsheet file.

It is worth noting that a fixed schema makes it difficult to reshape data. SQL has a fixed schema because it was intended to store and retrieve data rather than analyse it: SQL cannot easily produce crosstabs and pivot tables.

Document databases such as MongoDB

MongoDB is a DBMS that is designed to deal with huge amounts of data and be flexible enough for analysists.
  • Some advantages to MongoDB over SQL:
  • Scales out well (can be spread across many servers).
  • Has no schema (no schema means it can be reshaped easily)
MongoDB is a document-oriented database. While relational databases have tables with rows, a document-orientated database replaces the concept of a row with a document, which makes it more flexible. As a result, hierarchical relationships in a document database can exist within a single record/row.

An analysist may not need all the data in the datasets. Picking out certain rows is called selection and picking out certain columns is called projection. It may also be necessary to sort the columns or perform other functions on a dataset. Python will allow the analysist to do all these things, but if the data needs to be stored in a database or if it is so large that it needs to be stored a cross several servers, something more is needed like SQL or MongoDB.

In a nutshell, if it is small enough to fit in memory use pandas. SQL makes a reasonably good management system if the database can be stored on one server, but not that good at analysis because it is difficult to reshape data due to its fixed schema. If the data is stored across several servers, use MongoDB, which has no schema; thus, can be reshaped easily. In any of these cases the must be explored to revel anything of interest that is descriptive and even bring out new ideas from it such as confirming or refuting a hypothesis.

Presentation

The last stage of the data pipeline is to present the story that the datasets tell us. Bar charts, scatter graphs and maps are all ways to display data for a human to make sense of; the data exploration must be explained.

First find a story through exploratory investigations using closed conversation for checking facts, looking for evidence and identifying correlations or journalistic interview, which is freer flowing and more likely to uncover a story. Yes, it is like interviewing a person. Next the is the report that will contain the explanations of the findings using tables and graphs.

Comments