Hue Setup Instructions

Goals

  • Start an EMR cluster
  • Use the Hadoop User Experience (Hue) platform
  • Import data into Hive using Hue
  • Run Hive SQL queries in Hue

Hue Operations

Hue S3 Support

Click on the S3 icon in the left sidebar to open the S3 dashboard. Hue has an IAM role that allows it to connect to all your S3 buckets, so you can create new files and folders, upload files to S3, and more. You will see all of your S3 buckets, similar to the figure below.

For example, you can click on one of your buckets and view the files there. There are also buttons to Upload files and create New files or folders. You can also select files with the checkboxes and run actions like renaming, moving, copying, or deleting within your bucket. Add this to the list of ways you can interact with S3!

You can also click on the address bar and type a path to navigate directly to a bucket, including buckets that are not listed. For example, you can go to the bucket you made and browse its files in Hue.

TO-DO: Move the NCBirths2004.csv file into a new folder in your S3 bucket called ncbirths2004table. The reason we need to make a folder for the data is that Hive treats each table as a folder of one or many files. We need to point to a folder so Hive can set up the schema.

Use the S3 navigation window in Hue to accomplish this task (a command-line alternative is sketched after the steps below).

  1. Open your bucket like you would open a folder.

  2. Click New in the upper right and choose Directory.

  3. Enter the folder name ncbirths2004table, then click the blue Create button.

  4. You can either:

    1. Drag and drop the NCBirths2004.csv file into the new folder, or
    2. Click the checkbox next to the file, click Actions, and choose Move; then select the folder name and click the blue Move button.
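
If you prefer the terminal and have the AWS CLI configured, the same move can be done with a single command; a sketch, assuming the [NET_ID]-labdata bucket naming used in this lab:

# Sketch: move the file into the new folder (bucket name follows this lab's convention)
aws s3 mv s3://[NET_ID]-labdata/NCBirths2004.csv s3://[NET_ID]-labdata/ncbirths2004table/NCBirths2004.csv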

Loading S3 Data into your Cluster

NOTE: Loading data onto HDFS from the web interface does not work in AWS Academy because the HDFS web service is currently broken, so we will use the command line instead.

S3 to Driver Node

We will use the command line to copy the data from your new bucket onto the master (driver) node, and then push the files into the Hadoop Distributed File System (HDFS) on the cluster.

  1. Use the AWS command line interface (CLI) to list the files in your S3 bucket. In your terminal, run an aws s3 ls command to view your bucket:
[hadoop@ip-172-31-65-203 ~]$ aws s3 ls s3://[NET_ID]-labdata/
2021-10-27 15:56:20          0
2021-10-27 15:56:25      44641 NCBirths2004.csv
[hadoop@ip-172-31-65-203 ~]$
  2. Run a command like the following to copy the data file NCBirths2004.csv to the master node. We are using the copy command cp to copy the file from S3 to the master node.
[hadoop@ip-172-31-65-203 ~]$ aws s3 cp s3://[NET_ID]-labdata/NCBirths2004.csv ./
download: s3://[NET_ID]-labdata/NCBirths2004.csv to ./NCBirths2004.csv
[hadoop@ip-172-31-65-203 ~]$
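
Note that if you already moved NCBirths2004.csv into the ncbirths2004table folder in the earlier TO-DO, include that folder in the path; a sketch:

[hadoop@ip-172-31-65-203 ~]$ aws s3 cp s3://[NET_ID]-labdata/ncbirths2004table/NCBirths2004.csv ./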

Driver Node to HDFS

The reference guide for all the HDFS Shell commands is here: https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/FileSystemShell.html

  1. List files in your newly created cluster’s HDFS (will be empty): hdfs dfs -ls /user/hadoop

  2. Now that the data file NCBirths2004.csv is on your master node, you need to push it into HDFS. We will use the HDFS command line to put the file into the cluster. To do this, run the following command for each file:

[hadoop@ip-172-31-65-203 ~]$ hdfs dfs -put NCBirths2004.csv /user/hadoop
[hadoop@ip-172-31-65-203 ~]$

In this command, we are putting NCBirths2004.csv into the HDFS location /user/hadoop.

  3. Once that command executes, use the ls command with HDFS to confirm the file is where we want it:
[hadoop@ip-172-31-65-203 ~]$ hdfs dfs -ls /user/hadoop
Found 1 items
-rw-r--r--   1 hadoop hdfsadmingroup      44641 2021-11-03 02:28 /user/hadoop/NCBirths2004.csv
[hadoop@ip-172-31-65-203 ~]$
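
If you want to peek at the file contents directly in HDFS, you can cat the file and pipe it through head to limit the output; for example:

[hadoop@ip-172-31-65-203 ~]$ hdfs dfs -cat /user/hadoop/NCBirths2004.csv | head -n 5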

Hue Data Importer

Hue will get us set up with our table schema without having to code at all! The Hue Data Importer can process the data that you loaded onto S3. At the end of these steps, you will be able to leverage your SQL knowledge from this course to answer data science questions about the NCBirths dataset.

  1. On the left sidebar, click on the stack of disks, an icon commonly used to represent a database, as indicated by the yellow arrow in the figure below.

  2. Next, pick the source of data you want to work with. Hue lets you interact with a variety of different database types. For our lab, click Hive, as shown by the yellow arrow in the figure below.

  3. Now you are shown the list of databases available within the Hive data source. If you were working in an enterprise environment, you might see many different databases to choose from. In this case, we made this Hadoop cluster from scratch, so there is only the default database. Click on the default database, as indicated by the yellow arrow in the figure below.

  4. Now you are shown the list of tables within the default database. Note that you can navigate back and forth using the back arrow to the left of the “default” database title. This is helpful if you have multiple databases and tables. We want to make a new table! Click on the + icon to start the Hue Importer. The yellow arrow in the figure below is pointing to the + icon.

  5. In this step, you have to select the location of the file to load into Hue. Click on the button indicated by the yellow arrow in the figure below to launch a file selection box.

  6. Select the file NCBirths2004.csv in S3 by navigating to it in the popup. This file is located in your S3 bucket [NET_ID]-labdata in the ncbirths2004table folder.

  7. Hue has now parsed the first few lines of the CSV file and gives you options for formatting it. You can keep all of the defaults since this is a simple CSV. You can experiment with the Field Separator, Record Separator, Quote Character, and Has Header options and see the effect in the Preview section. Leave all the defaults as shown in the figure below, then click the blue Next button.

  8. In this step you define the settings for your new table. Click the settings button in the Extras field to show additional options. Below are descriptions of the settings.

    • Destination: You can set the table name to anything. You should leave the default of default.ncbirths2004.
    • Properties: Leave Format as Text.
    • Fields: You can change the data type for each of your variables as well as rename the variables. Note that the automatic data type selection is based only on the first few rows of data; if the head of the data is not representative of the entire variable, you might have to change the data type manually, either here or later with an ALTER TABLE statement like the sketch below. See the Hive documentation on data types (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types) if you want to read more.
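
If you notice a wrong type only after the table is created, Hive can change a column's type in place; a minimal sketch, assuming the table and column names used later in this lab:

-- Sketch: change the gestation column's type after import
-- (syntax: ALTER TABLE tbl CHANGE old_name new_name new_type)
ALTER TABLE default.ncbirths2004 CHANGE gestation gestation DOUBLE;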

  9. Now that you have expanded the extra options, make the following changes:

    • Uncheck Transactional table. Transactional tables are useful if you plan on loading new data into the table many times; we only want to load data into the table once.
    • In the External location text box, remove the file name from the end of the path so that the path ends with the folder name ncbirths2004table. The result should look similar to the image below. Hive will look in the folder and import all files from the specified path, which means that if you split your data across multiple files it will read them all.
    • Keep the other options the same. You could add a description if you want. We will not be adding any partitions, but you could partition a table if it holds big data.
    • Once you have unchecked the box for Transactional table and your settings look like the figure below, click on the blue Submit button!
  10. Clicking the Submit button launches a Hive command that loads your data as a new table in the default database. You should see a pop-up in the top right corner like the figure below; this could take a minute or two. You may see a generic error message. If so, that is OK; as long as you see the green bar in the window, you are fine. Close the pop-up window. In the next step we will confirm that the table loaded correctly.
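
For reference, the Submit button generates Hive DDL roughly like the sketch below. The column names and types are assumptions based on the variables used later in this lab, and the exact statement Hue produces may differ:

-- Sketch of the generated DDL (column names/types are assumptions)
CREATE EXTERNAL TABLE default.ncbirths2004 (
    mothersage STRING,
    smoker STRING,
    alcohol STRING,
    gestation DOUBLE,
    weight DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://[NET_ID]-labdata/ncbirths2004table/'
TBLPROPERTIES ('skip.header.line.count'='1');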

  11. The Hue Importer sends you to the new table page. It looks like the figure below. If you do not see the table listed in the left sidebar, you need to click the refresh icon next to the + icon as shown by the yellow arrow.

  12. Once the table shows up, you have a few places to click around and explore:

    • Click on the table name (ncbirths2004) to see the variables in the table and their data types.
    • Click on the i symbol to show a pop-up with the variables, data types, and a small sample of the data. Everything looks good!
    • Click on the Table Browser text in the pop-up to go to the table page.

  13. Once on the Table Browser page, click on Sample (100) to show the top 100 rows of the data. The result looks like the figure below. Note that the generic error message appeared again in the upper right; you can close it, as it does not affect your work in Hive. Click on the Query button as shown by the yellow arrow. This will send us to the query engine so we can start doing SQL queries!

  14. You are now at the query engine, and a sample query has executed for you automatically. The command was SELECT * FROM default.ncbirths2004 LIMIT 100;.

  15. Let’s zoom into the resulting dataset. We have a few views of the data. To the left of the table there are a few symbols, and each of them is useful.

    • The first one, which is just a few dots, is the raw data section.
    • The second one, which looks like a bar chart, is the charting section.
    • The third one, which looks like two windows, is for deselecting certain variables in the results.
    • The last one, which looks like a download button, is for downloading the results of your query.
    • Click on the bar chart icon so we can see our results in graph form.

  16. Here are some results in graph form. You can select various options to change the type of chart.

    • Clicking on the blue down arrow next to the blue chart lets you pick your chart type: Bars, Pie, Scatter, Marker Map, or Gradient Map.
    • In the bar section, you can decide on the type of chart, the x-axis variable, the y-axis variable, the group (color), and the limit of the number of rows.
    • You can adjust sorting too. The first option leaves the order alone, which is useful if you used an ORDER BY clause in your SQL. If you want to sort by size, you can sort in ascending or descending order.

Hive Querying in Hue

This section has several to-dos to complete for the lab submission.

Hive commands:

  • Execute a command: CTRL + ENTER keyboard combo or the blue play button
  • Old queries: Check the Query History tab

Answer the following questions using Hive on the NCBirths dataset. Save the SQL commands you make and any associated plots into your lab submission document.
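
As a starting point, a generic Hive query pattern is sketched below. This is only a template, not an answer to any specific question; adapt the columns and aggregates as needed:

-- Template: count rows by one grouping column
SELECT smoker, COUNT(*) AS n
FROM default.ncbirths2004
GROUP BY smoker
ORDER BY n DESC;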

  1. Get the count of rows grouped by smoker, alcohol, and mothersage variables. Sort the data using the interactive query results. Is the data balanced?

  2. Plot a bar graph showing the distribution of counts by mothersage.

  3. What are the average and standard deviation of gestation for children, partitioned by whether or not the mother is a smoker? Based on the means and standard deviations, do these two groups have statistically similar means? (The relevant aggregate functions are sketched after this list.)

  4. Plot a scatter plot with average gestation on the x-axis and average weight on the y-axis, grouped by mothersage.

  5. Make a query of average weight grouped by mothersage and smoker. Plot a bar plot of the results with mothersage on the x-axis, smoker as the group, and average weight on the y-axis.
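
For the averages and standard deviations, note that Hive has built-in avg(), stddev_samp(), and stddev_pop() aggregates. A minimal syntax sketch, assuming the column names used above; adapt it to each question:

-- Syntax sketch for Hive aggregate functions
SELECT smoker,
       AVG(gestation)         AS avg_gestation,
       STDDEV_SAMP(gestation) AS sd_gestation
FROM default.ncbirths2004
GROUP BY smoker;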

Saving Query to S3

  1. Click on the download button as shown in the figure below.

  2. You could just download the data to a file on your local machine. We want to save to S3 instead, so click Export.

  3. Select the dots as shown by the yellow arrow in the figure below. This will launch a file selection tab.

  4. Select the S3 bucket and folder where you want to save your results.

  5. Decide if you want to save just the top 100,000 rows or all the rows. This is an important distinction if you are working with big data! Click the blue Export button to save the results to S3.

  6. The file browser section opens so you can see your new file in S3. The file is in S3 now!

Take a screenshot showing your results file in your S3 bucket.

Use the AWS CLI in your terminal to copy the results file into your git repo, then add, commit, and push it! A sketch of these steps follows.
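
A sketch of those commands, assuming a hypothetical results file named query-results.csv and a repo cloned at ~/lab-repo; adjust both names to match yours:

# File and repo names below are hypothetical; adjust to match yours
aws s3 cp s3://[NET_ID]-labdata/query-results.csv ~/lab-repo/
cd ~/lab-repo
git add query-results.csv
git commit -m "Add Hive query results"
git push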