Azure Data Lake Catalog (U-SQL)
4 mins read

Azure Data Lake Catalog (U-SQL)

The Azure Data Lake Catalog (U-SQL) is one of the ADLA component by which U-SQL organizes data and code for sharing and re-use. The Catalog stores databases, tables, views, stored procedures,  table-valued functions (TVFs), schemas, assemblies, external data sources and all other code-related items. Basically catalog are useful when there is a requirement of code sharing and performance enhance in Azure Data Lake Analytics.

Followings are the few key points:-

  • Every Data Lake Analytics account has a catalog associated with it.
  • Catalog can not be deleted.
  • One master database will be available by default, which can not be deleted.
  • Every catalog can have one or more databases .
  • Catalog stores the U-SQL code in the form of views, stored proc and table valued functions(TVFs).
  • Catalog database data gets stored in the Azure data lake stores in the catalog folder.
  • Catalog data file extension is .ss .
  • Catalog Structure

To Start with database creation and other entities follow the next steps:

  1. You need one ADLA account which will have catalog by default.
  2. To create any entity in catalog you will have to write U-SQL code and run it in ADLA jobs.
    U-Sql Job
  3. Create database

CREATE DATABASE TestDB;
// with IF NOT EXISTS
CREATE DATABASE IF NOT EXISTS TestDB;

4. Create Schema

USE DATABASE TestDB;
 CREATE SCHEMA NewSchema;

5. Create Table

It can be created in two ways,  you can directly define the schema and structure and then load the data later, like below.

DROP TABLE IF EXISTS TestDB.NewSchema.NewTable;

CREATE TABLE TestDB.NewSchema.NewTable
( 
DimID int,
DimName string, 
RoleDescription string,
isActive bool,
INDEX idx 
CLUSTERED(DimID ASC)
DISTRIBUTED BY HASH(DimID)
);

Other way to create the table is, using the RowSet extracted from the file, but the table must have the index defined. As below:

@input = 
SELECT * FROM 
(VALUES
("Val1", 1 ),
("Val2", 2 ),
("Val3", 3 )
) AS T( Col1, Id );

DROP TABLE IF EXISTS [TestDB].[NewSchema].[NewTable];

CREATE TABLE [TestDB].[NewSchema].[NewTable]
( 
INDEX idx 
CLUSTERED(col1 ASC)
DISTRIBUTED BY HASH(col1) 
) AS SELECT * FROM @input;

6. Create View

DROP VIEW IF EXISTS [TestDB].[NewSchema].[VW_TestView];

CREATE VIEW [TestDB].[NewSchema].[VW_TestView]
AS
SELECT 3000000 + ROW_NUMBER() OVER ( ORDER BY R.DimID ASC) AS [NewId] 
, "Source" AS [DataSource] 
, Convert.ToString((Convert.ToString(R.RID)) ?? "N/A") AS [R_Id] 
, R.DimName AS [OrgName] 
, R.ZipCode AS [PostalCode] 
, G.ISOCCode AS [C_Code] 
,B.Name ?? "UNKNOWN" AS [BCode] 

FROM [TestDB].[NewSchema].Dim_R AS R
INNER JOIN 
(SELECT DISTINCT ISOCCode, Name
FROM [TestDB].[NewSchema].[DimC]) AS G
ON G.Name == R.Code
LEFT JOIN [TestDB].[NewSchema].[Dim_B] AS B
ON R.ID == B.ID;

7. Create Stored Procedure

DROP PROCEDURE IF EXISTS [TestDB].[NewSchema].[CatalogFirstStoredProc];
CREATE PROCEDURE [TestDB].[NewSchema].[CatalogFirstStoredProc]
(
@Param1 string = "Test",
@Param2 int = 0
) AS
BEGIN

DECLARE @Var1 int= 8000000;

INSERT INTO [TestDB].[NewSchema].[NewTable]
(
Col1
,Col2
,Col3
,Col4
)
SELECT Convert.ToInt32(@Var1 + ROW_NUMBER() OVER(ORDER BY R.Column1 ASC)) AS Col1
,B.col2
,B.col3
,R.col4
FROM [TestDB].[NewSchema].Dim_R AS R
INNER JOIN 
(SELECT DISTINCT ISOCCode, Name FROM [TestDB].[NewSchema].[DimC]) AS G
ON G.Name == R.Code
LEFT JOIN [TestDB].[NewSchema].[Dim_B] AS B
ON R.ID == B.ID;

@result = 
SELECT * 
FROM [TestDB].[NewSchema].[NewTable]; 

OUTPUT @result 
TO "/Output/Procedure/CatalogFirstStoredProc.csv" 
USING Outputters.Csv(); 

END;

8. Call Stored Procedure

// This example will call the procedure CatalogFirstStoredProc: 
[TestDB].[NewSchema].[CatalogFirstStoredProc](); 

// This example will call the procedure CatalogFirstStoredProc With Parameters: 
[TestDB].[NewSchema].[CatalogFirstStoredProc]
 ( 
400,
DateTime.Parse("12/24/2018")
 ); 

Some of the content has been taken from the MSDN sites, i have consolidated and put it together to understand it better. I hope, you now have a better idea about Azure Data Lake Catalog. For any clarification please comment.

Feedback will be appreciated. Cheers!

Leave a Reply

Your email address will not be published. Required fields are marked *