Best Practices For Designing Hive Schemas




Within the Hadoop ecosystem, Hive can be considered a data warehouse. The name "hive" also suggests a storage facility, but there is more to it. With the tool set available, Hive can be used for data storage, but its built-in BI (business intelligence) capabilities are limited. To give BI a boost, Simba Technologies has produced an ODBC connector that lets BI tools such as Tableau, SAP BusinessObjects and Excel connect to Hive. So you can also look at Hive as a back end for business intelligence tools.

For anyone considering data science certification, Hive schema creation and analysis are two areas worth specializing in. But how does one create a Hive schema that not only performs the task at hand, but is also easy to maintain and manipulate?

Denormalize

When building a Hive data warehouse, the star schema offers an effective way to store and access data. In this design, you start by creating a fact table that contains the metrics, along with dimension tables that store the descriptions of those metrics. Since the data has to be queried, it is good practice to denormalize the tables: with fewer joins at query time, query response times go down.

As an example, let’s suppose we are analyzing cricket players’ data. The Hive design will have a fact table named fct_players_analysis, which stores the denormalized metrics. These metrics come from the following normalized tables: AllRounders, Hall of fame, Bowling, Batting, Fielding, Accolades, Salaries, AwardShare.
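A minimal sketch of what such a fact table could look like is given here. Only the table name fct_players_analysis comes from the text above; the database name cricket_dw, the script path and every column name are assumptions made for illustration. Further metrics (fielding, salaries and so on) would be added as extra columns in the same way.

```shell
#!/bin/sh
# Sketch only: cricket_dw and all column names below are hypothetical.
HQL_FILE=/tmp/create_fct_players_analysis.hql

# Quoted heredoc delimiter: the shell expands nothing inside the HiveQL.
cat > "$HQL_FILE" <<'EOF'
CREATE DATABASE IF NOT EXISTS cricket_dw;

-- One denormalized row per player, franchise and year.
CREATE TABLE IF NOT EXISTS cricket_dw.fct_players_analysis (
  player_id    STRING COMMENT 'key into dim_player_bio',
  franchise_id STRING COMMENT 'key into dim_franchise',
  year_id      INT    COMMENT 'key into dim_year',
  runs         INT    COMMENT 'metric taken from Batting',
  wickets      INT    COMMENT 'metric taken from Bowling'
);
EOF

# Execute the script only where the Hive CLI is installed.
if command -v hive >/dev/null 2>&1; then
  hive -f "$HQL_FILE"
fi
echo "Wrote $HQL_FILE"
```

Keeping the metrics side by side in one wide table is what lets typical queries read a single table instead of joining Batting, Bowling and the rest at query time.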

Sorting Data

Four dimension tables describe the metrics stored in the fact table. The first, dim_player_bio, contains each player's name, date of birth and other personal information. This is an important step for those looking to pursue big data certification. Similarly, dim_franchise holds the data from the TeamFranchise table and dim_coaching holds the coaching data. For better analysis of the data over time, a fourth dimension table, dim_year, holds both the month and the year.
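The four dimension tables might be declared roughly as follows. The table names come from the text; the database name cricket_dw, the script path and all column names are hypothetical and would need to match the real CSV data.

```shell
#!/bin/sh
# Sketch only: cricket_dw and all column names below are hypothetical.
HQL_FILE=/tmp/create_dimensions.hql

cat > "$HQL_FILE" <<'EOF'
CREATE DATABASE IF NOT EXISTS cricket_dw;
USE cricket_dw;

-- Name, date of birth and other personal information.
CREATE TABLE IF NOT EXISTS dim_player_bio (
  player_id     STRING,
  player_name   STRING,
  date_of_birth STRING COMMENT 'kept as text in this sketch'
);

-- Descriptive data taken from the TeamFranchise table.
CREATE TABLE IF NOT EXISTS dim_franchise (
  franchise_id   STRING,
  franchise_name STRING
);

-- Coaching data.
CREATE TABLE IF NOT EXISTS dim_coaching (
  coach_id   STRING,
  coach_name STRING
);

-- Holds both the month and the year for time-based analysis.
CREATE TABLE IF NOT EXISTS dim_year (
  year_id        INT,
  calendar_year  INT,
  calendar_month INT
);
EOF

# Execute the script only where the Hive CLI is installed.
if command -v hive >/dev/null 2>&1; then
  hive -f "$HQL_FILE"
fi
echo "Wrote $HQL_FILE"
```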

Importing Data

CSV files cannot be imported directly into the star schema. Instead, the data first has to be loaded into intermediate Hive tables, from which only the required columns can then be selected. This staging step on the way to the final tables also makes the process easier to support in the long term. The commands below create a directory on HDFS and copy the CSV files into it.

hadoop fs -mkdir -p /usr/local/cricket

hadoop fs -put ~/Downloads/lahman2012-csv /usr/local/cricket

Invoking And CSV Management In A Hive Shell

The Hive shell is invoked by running hive at the terminal. The shell can be used to create a staging database, define tables and their columns, and load data. First, a Hive database is created to hold the CSV data for this process. Then, for each CSV file, a table is created to hold its data and the CSV file is loaded into it. This has to be repeated for every table identified earlier as contributing data to the warehouse.
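The staging step for one CSV file could be sketched as follows. The database name cricket_staging, the table name stg_batting, its columns and the file name Batting.csv are all assumptions; a real staging table has to list every column of its CSV file in file order, because a delimited table maps fields by position.

```shell
#!/bin/sh
# Sketch only: cricket_staging, stg_batting, its columns and Batting.csv are hypothetical.
HQL_FILE=/tmp/load_stg_batting.hql

cat > "$HQL_FILE" <<'EOF'
CREATE DATABASE IF NOT EXISTS cricket_staging;
USE cricket_staging;

-- Plain-text, comma-delimited table that mirrors the CSV layout.
CREATE TABLE IF NOT EXISTS stg_batting (
  player_id STRING,
  year_id   INT,
  team_id   STRING,
  runs      INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES ('skip.header.line.count'='1');

-- LOAD DATA INPATH moves the file from its HDFS location into the table.
LOAD DATA INPATH '/usr/local/cricket/lahman2012-csv/Batting.csv'
INTO TABLE stg_batting;
EOF

# Execute the script only where the Hive CLI is installed.
if command -v hive >/dev/null 2>&1; then
  hive -f "$HQL_FILE"
fi
echo "Wrote $HQL_FILE"
```

The skip.header.line.count property assumes the CSV file starts with a header row; drop it if the file contains data only.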

Once all the contributing CSV files are loaded into staging tables, the process of creating the data warehouse tables can begin. The staging tables are queried for just the columns that are relevant to the warehouse, and the results are written to the warehouse tables. The dimension tables and the fact table are populated one after another with this iterative approach.
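As a rough illustration of this step, the queries below fill one dimension table and the fact table from staging tables. They assume that hypothetical staging tables stg_player_bio, stg_batting and stg_bowling exist in a database called cricket_staging, and that the warehouse tables in a database called cricket_dw have exactly the columns named in the comments. None of these names or columns come from the original data set.

```shell
#!/bin/sh
# Sketch only: both database names, the stg_* tables and all columns are hypothetical.
HQL_FILE=/tmp/populate_warehouse.hql

cat > "$HQL_FILE" <<'EOF'
-- Assumed target columns: player_id, player_name, date_of_birth.
INSERT OVERWRITE TABLE cricket_dw.dim_player_bio
SELECT DISTINCT player_id, player_name, date_of_birth
FROM cricket_staging.stg_player_bio;

-- Assumed target columns: player_id, franchise_id, year_id, runs, wickets.
-- The join happens once here, so later queries on the fact table avoid it.
INSERT OVERWRITE TABLE cricket_dw.fct_players_analysis
SELECT b.player_id, b.team_id, b.year_id, b.runs, w.wickets
FROM cricket_staging.stg_batting b
LEFT OUTER JOIN cricket_staging.stg_bowling w
  ON (b.player_id = w.player_id AND b.year_id = w.year_id);
EOF

# Execute the script only where the Hive CLI is installed.
if command -v hive >/dev/null 2>&1; then
  hive -f "$HQL_FILE"
fi
echo "Wrote $HQL_FILE"
```

INSERT OVERWRITE replaces the current contents of the target table, which makes the load repeatable: rerunning the script after a corrected CSV import rebuilds the warehouse tables from scratch.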

This process is one of the best ways to create a data warehouse on Hive. The star schema consists of fact tables and dimension tables, populated from CSV files that have been staged and converted to suit Hive's needs. Using the cricket data as a template, you can create similar data warehouses for other kinds of data that need to be compiled and analyzed in one place. For those considering a data science certification, note that Hive schemas are something you will keep working on and improving as you go.

Data science certification courses are readily available at affordable prices for those who are interested in learning more about data analysis and Hive schema development.

About The Author
James
Data Scientist (Growth) at QuickStart

James Maningo

James is a stochastic tinkerer with over 8 years of experience in digital analytics. His passion lies in providing meaningful impact through data, utilizing growth hacking techniques for business and "quantified self" for personal life. His weapons of choice are linux, python, tmux+vim and good old common sense.