
Azure Data Lake Catalog (U-SQL)
The Azure Data Lake Catalog (U-SQL) is the ADLA component through which U-SQL organizes data and code for sharing and re-use. The catalog stores databases, tables, views, stored procedures, table-valued functions (TVFs), schemas, assemblies, external data sources, and other code-related items. In short, the catalog is useful whenever you need to share code or improve performance in Azure Data Lake Analytics.
Here are a few key points:
- Every Data Lake Analytics account has a catalog associated with it.
- The catalog cannot be deleted.
- A master database is available by default; it cannot be deleted either.
- Every catalog can have one or more databases.
- The catalog stores U-SQL code in the form of views, stored procedures, and table-valued functions (TVFs).
- Catalog database data is stored in the Azure Data Lake Store under the catalog folder.
- Catalog data files use the .ss extension.
To start with database creation and the other entities, follow these steps:
- You need an ADLA account, which comes with a catalog by default.
- To create any entity in the catalog, you write U-SQL code and run it as an ADLA job.
3. Create Database
CREATE DATABASE TestDB;
// Or, to avoid an error when the database already exists:
CREATE DATABASE IF NOT EXISTS TestDB;
4. Create Schema
USE DATABASE TestDB;
CREATE SCHEMA NewSchema;
5. Create Table
A table can be created in two ways. You can directly define the schema and structure and then load the data later, like below.
DROP TABLE IF EXISTS TestDB.NewSchema.NewTable;
CREATE TABLE TestDB.NewSchema.NewTable
(
    DimID int,
    DimName string,
    RoleDescription string,
    isActive bool,
    INDEX idx CLUSTERED(DimID ASC) DISTRIBUTED BY HASH(DimID)
);
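Once the table exists, it can be populated later with INSERT INTO from any rowset. A minimal sketch, assuming the table schema above; the sample values are illustrative:

// Build a small in-memory rowset whose column names match NewTable's schema.
@rows =
    SELECT * FROM
        (VALUES
            (1, "Dim1", "Sample role", true)
        ) AS T(DimID, DimName, RoleDescription, isActive);

// Append the rowset into the managed table created above.
INSERT INTO TestDB.NewSchema.NewTable
SELECT DimID, DimName, RoleDescription, isActive FROM @rows;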
The other way to create a table is from a rowset, for example one extracted from a file; in this case the table must still have an index defined, as below:
@input =
    SELECT * FROM
        (VALUES
            ("Val1", 1),
            ("Val2", 2),
            ("Val3", 3)
        ) AS T(Col1, Id);

DROP TABLE IF EXISTS [TestDB].[NewSchema].[NewTable];
CREATE TABLE [TestDB].[NewSchema].[NewTable]
(
    // Note: identifiers are case-sensitive, so the index must use Col1, not col1.
    INDEX idx CLUSTERED(Col1 ASC) DISTRIBUTED BY HASH(Col1)
) AS SELECT * FROM @input;
6. Create View
DROP VIEW IF EXISTS [TestDB].[NewSchema].[VW_TestView];
CREATE VIEW [TestDB].[NewSchema].[VW_TestView] AS
SELECT 3000000 + ROW_NUMBER() OVER (ORDER BY R.DimID ASC) AS [NewId],
       "Source" AS [DataSource],
       Convert.ToString(R.RID) ?? "N/A" AS [R_Id],
       R.DimName AS [OrgName],
       R.ZipCode AS [PostalCode],
       G.ISOCCode AS [C_Code],
       B.Name ?? "UNKNOWN" AS [BCode]
FROM [TestDB].[NewSchema].Dim_R AS R
INNER JOIN (SELECT DISTINCT ISOCCode, Name FROM [TestDB].[NewSchema].[DimC]) AS G
    ON G.Name == R.Code
LEFT JOIN [TestDB].[NewSchema].[Dim_B] AS B
    ON R.ID == B.ID;
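A view created this way can then be queried like any table from a U-SQL script and its results written out. A minimal sketch; the output path is illustrative:

// Query the view as if it were a table.
@result =
    SELECT *
    FROM [TestDB].[NewSchema].[VW_TestView];

// Write the rowset to a CSV file in the default Data Lake Store.
OUTPUT @result
TO "/Output/Views/VW_TestView.csv"
USING Outputters.Csv();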
7. Create Stored Procedure
DROP PROCEDURE IF EXISTS [TestDB].[NewSchema].[CatalogFirstStoredProc];
CREATE PROCEDURE [TestDB].[NewSchema].[CatalogFirstStoredProc]
(
@Param1 string = "Test",
@Param2 int = 0
) AS
BEGIN
DECLARE @Var1 int = 8000000;
INSERT INTO [TestDB].[NewSchema].[NewTable]
(
Col1
,Col2
,Col3
,Col4
)
SELECT Convert.ToInt32(@Var1 + ROW_NUMBER() OVER(ORDER BY R.Column1 ASC)) AS Col1
,B.col2
,B.col3
,R.col4
FROM [TestDB].[NewSchema].Dim_R AS R
INNER JOIN
(SELECT DISTINCT ISOCCode, Name FROM [TestDB].[NewSchema].[DimC]) AS G
ON G.Name == R.Code
LEFT JOIN [TestDB].[NewSchema].[Dim_B] AS B
ON R.ID == B.ID;
@result =
SELECT *
FROM [TestDB].[NewSchema].[NewTable];
OUTPUT @result
TO "/Output/Procedure/CatalogFirstStoredProc.csv"
USING Outputters.Csv();
END;
8. Call Stored Procedure
// This example calls the procedure CatalogFirstStoredProc with its default parameter values:
[TestDB].[NewSchema].[CatalogFirstStoredProc]();
// This example calls the procedure with explicit arguments, which must match the
// declared parameter types (@Param1 string, @Param2 int):
[TestDB].[NewSchema].[CatalogFirstStoredProc]
(
    "Test",
    400
);
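The catalog also stores table-valued functions (TVFs), mentioned in the key points above but not yet shown. A minimal sketch, assuming the NewTable schema from step 5; the function name and filter are illustrative:

DROP FUNCTION IF EXISTS [TestDB].[NewSchema].[GetActiveRows];
// A TVF returns a table-typed rowset that callers can query like any rowset.
CREATE FUNCTION [TestDB].[NewSchema].[GetActiveRows]()
RETURNS @result TABLE(DimID int, DimName string)
AS
BEGIN
    @result =
        SELECT DimID, DimName
        FROM [TestDB].[NewSchema].[NewTable]
        WHERE isActive == true;
END;

// Calling the TVF assigns its result to a rowset variable:
// @rows = [TestDB].[NewSchema].[GetActiveRows]();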
Some of the content has been taken from the MSDN sites; I have consolidated it here to make it easier to understand. I hope you now have a better idea of the Azure Data Lake Catalog. For any clarification, please comment.
Feedback will be appreciated. Cheers!