Welcome to Aginity Enterprise Documentation!¶
Aginity Enterprise enables businesses to create, catalog and manage all analytics as assets which then become reusable across the enterprise.
This documentation provides you with conceptual, procedural, and reference information to help you install and use Aginity Enterprise and find answers to questions you may have about it.
Getting Started with Aginity Enterprise¶
This section provides a video overview of Aginity Enterprise and contains videos that show you how to accomplish certain basic tasks with Enterprise.
Signing Into Aginity Enterprise¶
When you run Aginity Enterprise for the first time, you need to sign in. You can sign into Enterprise by providing the user name and password from the Aginity Enterprise account that your administrator has created for you.
Note
If you don’t know your user name and password in Aginity Enterprise, contact your administrator.
If you are an administrator, you can use one of the following default users set up in Aginity Enterprise to sign in:
- Admin user - Has all permissions in Enterprise.
- User - Has a limited set of permissions and cannot see or change the application settings (clusters, connections, etc.).
For instructions on how an administrator can add new users to Aginity Enterprise, see Users.
To sign into Aginity Enterprise:
On the sign-in page that appears when you run Aginity Enterprise in your browser, type your user name and password.
If you are signing in using one of the default users, type admin or user in the User Name box and the password for the user provided by Aginity.
You can change your password in your user account.
Do one of the following:
To sign in with your Aginity account credentials, click STANDARD and click Sign In.
To sign in using your single sign-on credentials, click OPEN ID.
The single sign-on OPEN ID option means that you can sign into Aginity Enterprise using the same user name and password that you use for signing into other applications and systems in your organization. After you enter your user name on the Enterprise sign-in page and click OPEN ID, you will be redirected to your single sign-on provider’s sign-in page, where you need to enter your password to be authenticated. Ask your Enterprise administrator if single sign-on is configured in your organization and if the single sign-on option is available for accessing Enterprise.
If your user credentials are authenticated, the Home page of Aginity Enterprise opens. From this page you can:
- Create new notebooks, sources, views, and publications.
- Browse your catalog and run notebooks and publications.
- View status of your jobs.
- Change the application settings (administrators only).
Your user name and password are remembered, so you don’t have to re-enter them the next time you run Aginity Enterprise.
Part I: Aginity Enterprise Overview¶
Aginity Enterprise enables businesses to create, catalog and manage data and analytics as reusable assets. This overview will introduce you to the different screens and components you can use to build your analytic repository. Detailed instructions for these tools will be provided in subsequent topics.
If you prefer text, you can follow along by using the steps below the video.
Step 1. Login and the Home Page¶
The following options are available from the navigation pane on the Home page:
- New - Create assets such as sources, notebooks, and publications.
- Browse - Review and modify assets.
- Jobs - View a history of jobs or cancel running processes.
- Settings - Manage clusters, connections, users, and more.
Hint
The Home option on the navigation pane lets you return to the Home page.
Step 2: Asset Browser¶
In Aginity Enterprise, assets, such as sources, notebooks, and views, are organized into workspaces, which contain multiple projects. A project is a logical collection of assets.
On the navigation pane, click Browse to see existing workspaces.
Select the User Testing workspace to view its associated projects.
Select the Lineup Project Assets project to see the list of associated sources, such as Roster, and notebooks, such as query LINEUP1.
To see an example of a source, under Sources, click Roster.
On the Schema tab, you can view the existing columns. You can customize the labels, data types, and descriptions, as well as indicate whether a column belongs to the primary key. A source points to a specific table (in this case, “roster2017” in the Enterprise Demo database), database, and connection into a Hive cluster.
To view sample data from the source, click Test.
On the Relationships tab, define how sources are related. Map the primary-key fields between the related sources.
To see an example of a notebook, under Analytics, click query LINEUP1.
The query LINEUP1 notebook consists of input schemas and parameters on the left, the SQL or Scala code in the middle, and output schemas on the right. You can add output schemas to enable access by external applications.
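For example, a minimal SQL paragraph in a notebook like this might simply query one of its input frames (a sketch for illustration only; the frame name Roster is taken from this walkthrough and must match the name shown in the notebook’s input schema):
%sql SELECT * FROM `Roster`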
In the Lineup Project Assets project’s assets, under Publications, click Roster Publication.
This is a published view showing a subset of columns, with a filter applied. Note that the input schema (on the left) and the output schema (the Schema tab) are defined and are published to a table named “Baseball Publication.”
Go back to the assets list in the Lineup Project Assets project and under Views, select the Baseball View asset.
You can select the input schema for the view (the Input Schema tab); define columns (the Result Set tab); add filters to the view (the Filters tab), and review the results (the Review tab).
Step 3: New Assets¶
We are not creating new assets in this demo. In the navigation pane, click New to display the following options for creating assets:
- Notebook - Create notebooks to analyze data.
- Source - Create sources to define source data.
- View - Create views (custom combinations of existing assets). Build on existing assets by joining sources and views, managing the result sets, filtering, and publishing.
- Publication - Create publications to publish assets to files or tables. This enables outside applications to view assets through the files and tables.
Step 4: Use Jobs to View Job Details¶
- In the navigation pane, click Jobs to view or cancel jobs that have been running on the Jobs screen.
- Click the name of any job to view its details.
- Click the name of any job with “Failed” status and then click the AMP LOG tab to view log information.
Step 5: Use Settings to Manage System Settings and Users¶
- In the navigation pane, click Settings to view Spark clusters, Connections, Drivers, Users, and Groups.
- Click Users to view information on all users, such as name, user ID, and role.
- For example, to add a new user, do the following:
- Click Add new user.
- Enter “test@aginity.com” as the user’s e-mail address.
- In the Role list, select User and then click Create.
Step 6: Verify the Workspace Setup¶
- In the navigation pane, click Home to return to the Home page.
- In the upper-left corner, click in the Current Workspace list.
- Select User Testing to ensure that the workspace is set up accordingly.
Part II: Creating a View That Aggregates Data Across Two Sources¶
Using Aginity Enterprise, you can create a view that aggregates data across two sources. The following video shows how to do so.
If you prefer text, you can follow along by using the steps below the video.
Prerequisites¶
We assume that the following are in place in your catalog in Aginity Enterprise:
- Sources are created in Enterprise for two Hive tables:
- Roster source - Contains the names and details of all baseball players.
- Pitch Location source - Contains details of every pitch thrown.
- The Roster table has a primary key (PK) on the Player column. The Pitch Location table does not need a PK.
We want to aggregate information from both sources into a new view, “Batter Outcomes.”
Step 1: Create a Relationship from Pitch Location to Roster¶
In this step, we will create a relationship from the Pitch Location source to the Roster source. The sources will be joined by a player identifier, which is common to both sources.
- In the navigation pane, click Browse and open the Pitch Location source by selecting the User Testing workspace and then the Lineup Project Assets project.
- On the Source Editor screen, click the Relationships tab and then New relationship.
- Name the relationship “Pitch Location Batter.”
- Click Reference and browse to the Roster source by selecting the User Testing workspace and then the Lineup Project Assets project.
- Create the relationship between the two sources. Map the Batter Team column to the Team column and the Batter column to the Player column.
- Click Save.
Step 2: Create a View that Aggregates Across Roster and Pitch Location¶
In this step, we will create a view that aggregates data from the Roster and Pitch Location sources. (A Spark SQL sketch of what the finished view computes follows the steps below.)
- In the navigation pane, click New and then View to create a new view for the Lineup Project Assets project.
- Enter a name “Batter Outcomes” for the view.
- Select the Roster source as a starting point for the view:
- On the right of Select starting point, click the ellipsis button.
- Select the User Testing workspace.
- Select the Lineup Project Assets project.
- Click Roster.
- Click Create to start editing the input schema of the view.
- In the Input Schema pane on the left, click Incoming to show incoming relationships.
- Drag the Pitch Location Batter relationship onto the input schema.
- Click the Result Set tab.
- In the Input Schema pane, expand the Roster source and drag over the ID column.
- In the Input Schema pane, expand the Pitch Location source and drag over the UID column.
- Under Column Name, rename the UID column to “Number of Plate Appearances.”
- Under Function, select an aggregate function of Count for the Number of Plate Appearances column.
- Click the Filters tab. In the Input Schema pane on the left, expand the Pitch Location source and drag the end_ab_flag column onto the filter (to the right of the AND operator at the top).
- Select an operator of Equals, click Edit at the far right, and set the value to 1. Click OK at the upper right.
- Click Save to save the view.
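For reference, the finished view is conceptually equivalent to the following Spark SQL (a hedged sketch only; the frame, column, and relationship names come from the steps above, and the SQL that Enterprise generates internally may differ):
%sql SELECT r.`ID`, COUNT(p.`UID`) AS `Number of Plate Appearances`
FROM `Roster` r
JOIN `Pitch Location` p ON p.`Batter Team` = r.`Team` AND p.`Batter` = r.`Player`
WHERE p.`end_ab_flag` = 1
GROUP BY r.`ID`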
Step 3: Review the Results of the View¶
In this final step, we will review the results of the view in a notebook.
- In the navigation pane on the left, click New and then Notebook to create a new notebook and name it “Batter Outcomes Analysis.” Click Create.
- Drag the Batter Outcomes view from the navigation pane (under Views) into the input schema of the notebook.
The Input Schema pane appears on the right side of the Notebook Editor screen.
Create an SQL snippet with the following code:
%sql SELECT * FROM `batter outcomes`
In the upper-right of the Notebook Editor screen, click the run button to execute the view and display the results in the notebook.
Part III: Reusing an Aggregate Data Set to Create a New Dataset¶
Aginity Enterprise allows you to reuse an aggregate data set to create an entirely new dataset. The following video shows how to do so.
If you prefer text, you can follow along by using the steps below the video.
Step 1: Create a Relationship Between the Batter Outcomes View and the Roster Source¶
In this step, we will create a relationship between the Batter Outcomes aggregate view, which we created in Creating a View That Aggregates Data Across Two Sources, and the Roster source to show the number of plate appearances for each batter.
- In the navigation pane on the left, click Browse and open the Batter Outcomes aggregate view by selecting the User Testing workspace, the Lineup Project Assets project, and then Batter Outcomes under Assets.
- In the upper-right of the View screen, click the Review tab.
- Click the Relationships tab. Click New Relationship and then click the three dots (…) on the right end of Reference. In the catalog browser, under Lineup Project Assets, select Roster (the source that we want to connect to).
- Under Mapping, map the batter_team column to the team column and the batter column to the player column:
- Under From, select batter_team and under To next to it, select team.
- On the next row, under From, select batter and under To, select player.
- In the Name box, rename the relationship to “Roster.”
- At the upper-right of the View screen, click Save to save the relationship in the view.
Step 2: Create a New View That Uses Columns from Batter Outcomes and Roster¶
In this step, we will create a new view, called “Batter Metrics.” This view will use columns from both the Batter Outcomes view and the Roster source.
- In the navigation pane, click New and then View.
- In the New view dialog box, name the view “Batter Metrics” and on the right end of the Select starting point box, click the three dots and select Roster to set the Roster source as the starting point for the view. Click Create.
- Click Input Schema. In the left pane, under Roster, click Incoming to remove the strikethrough. This will show the incoming relationships to the Roster source. Drag the “Roster” relationship into the input schema (the large area on the right).
- Click Result Set. From the left pane, under Roster, drag the ID, first_name, and last_name columns into the result set. Under Batter Outcomes, drag the Number of Plate Appearances column over.
- Click Save to save the view.
Step 3: Review the Results of the View¶
We will create and execute a new notebook to review the results of the newly created aggregate view.
In the navigation pane, click New and then Notebook.
In the New notebook dialog box, name the notebook “Batter Metrics Analysis” and click Create.
Open the Batter Metrics Analysis notebook. From the Lineup Project Assets pane on the left, under Views, drag the Batter Metrics view into the input schema of the notebook.
Create the following SQL snippet:
%sql SELECT * FROM `batter metrics`
In the upper-right of the Notebook screen, click the run button to execute the notebook and display the results.
Part IV: Finding, Adding and Viewing Data¶
You can locate data or analytics to incorporate into the Aginity Enterprise catalog as sources. Data sources are created from existing Hive tables, and analytic sources are created from the output schemas of existing Aginity Enterprise notebooks. In this video, we will show how to find a data source, using a 2017 baseball player table in Hive called “roster2017.”
If you prefer text, you can follow along by using the steps below the video.
Step 1: Attach the roster2017 Table as a Source¶
- In the navigation pane, click New and then Source.
- In the New source dialog box, enter “roster2017” as the name for the new source, select Table in the Type list, and then click Create.
- On the Source Editor screen, find the data:
- In the Connection list, select Amp Demo.
- In the Database list, select default. The list displays the available databases, each containing its own group of tables to choose from.
- In the Table list, select the database table (in this case, roster2017).
- In the Description box, provide an optional description of the table. Enter “This table holds the names and positions of each player for 2017.”
- In the upper right of the Source Editor screen, click Save to save the new source.
Step 2: Verify That the Data Is Working Properly¶
- On the Source Editor screen, click the Preview tab and then click Test to see a preview of the data and ensure that it is working properly.
A source can be reviewed in a notebook or a view.
Step 3: Review the Source in a Notebook¶
In the navigation pane, click New and then Notebook.
In the New notebook dialog box, enter the name “roster2017 analytic” and click Create.
In the Notebook Editor screen, click the plus sign next to Input Schema and select Insert Frame.
Browse to the new source—”roster2017.” You can now see it in the input schema of the notebook.
In the notebook, enter the following code:
%sql select * from roster2017
Click the run button to confirm that the roster2017 source contains data.
Reference Guide for Aginity Enterprise¶
Aginity Enterprise consists of a number of screens. Find a detailed description of a screen by clicking the corresponding link below:
Source Editor Screen¶
You can use the Source Editor screen to create and edit sources. A source is a table-like entity that is represented as a logical table in the catalog. In Aginity Enterprise, sources are used to define an input schema for analytics, views, and publications.
You can create a source from a database table, a file, or a notebook. When you select a table, a file, or notebook for the source, its input schema is generated based on the selected entity. The input schema defines an initial structure of the source, that is, a collection of columns, column data types, and, in the case of a source created for a notebook, any parameters inherited from the notebook.
To create a source from a table, file, or notebook:
- In the navigation pane, click New and then Source.
- Type a name for the source and select a source type (table, file, or notebook).
- Click Create.
- On the Source Editor screen, optionally provide a description of the source to help identify it.
- Depending on the source type you selected, do one of the following:
- For a notebook source, in the Select Analytic box, click the three dots and select the notebook to create the source from.
- For a table source, set the Database table properties by selecting a database connection, database, and table.
- For a file source, set the file properties by selecting a file system connection, a field delimiter character, and the path to the file.
- Make other appropriate changes on the Source Editor screen.
- To save the source, click Save in the upper-right corner.
When you finish making changes on the Source Editor screen (such as changes in column names and primary-key columns), an output schema is generated for the completed source. The output schema presents a final structure of the source, including your changes.
The Source Editor screen consists of:
Source Properties¶
The Source properties section of the Source Editor screen displays the following information.
Property | Description |
---|---|
Name | The name of the source appears at the top of the screen. You can change the name if needed. |
Description (optional; table and file sources only) | The description of the source (if one was entered). |
Select Analytic (a notebook source only) | Displays the path to the notebook that the source was created from. You can change the source’s notebook. |
Database table properties (a table source only) | For a source created from a database table, these display the name of the database connection in use, the database name, and the name of the table. You can change any of these properties if needed. For example, you can select a different table in the database to use for the source. Attention We recommend that you change the database table properties with caution. Changes to these properties will cause the input schema in entities that use the source (analytics, publications, views) to be out of sync. You will need to synchronize with the source from each of those entities. |
File properties (file sources only) | These display the file system connection in use, the field delimiter (the character that is used to separate fields of data in the file), and the path to the file in the file system. You can also specify whether Enterprise should include the first row (header row) when reading data from the file by dragging the slider under Include header row. Note If you want to use a tab character (tab stop) as the field delimiter, type t in the Field delimiter box. You can change any of these properties if needed. For example, you can select a different file for the source in Select Path. Attention We recommend that you change the file properties with caution. Changes to these properties will cause the input schema in entities that use the source (analytics, publications, views) to be out of sync. You will need to synchronize with the source from each of those entities. |
Advanced Properties¶
The Advanced properties section appears only for file sources. You can use this section to provide advanced properties for a file source. To add a property, click Add and then enter the property name under Key and the property value under Value. To remove a property, click Remove at the end of the property row.
For Aginity Enterprise to read data from file sources with a File Type of Parquet that use Snappy compression and for you to be able to preview such sources, add the following advanced property:
- Key - compression.
- Value - snappy.
Frame Parameters Section¶
The screen includes the Frame Parameters section only for a source that was created from a notebook. If any parameters are defined for the notebook, this section displays a row for each parameter.
If the Prompt for Input check box is selected, a value will need to be specified for the parameter when an entity that is created from the source is run. For example, if a notebook is created from the source with a parameter with Prompt for Input selected, you will need to provide a value for the parameter when the notebook is computed. If you want to specify a value for the parameter now, clear the check box and type the value under Values.
Attention
Aginity Enterprise passes parameter values “as is” (that is, without any transformation). Therefore, you need to make appropriate changes in target code to make sure that the parameter values are processed correctly. For example, if you want to include a string parameter value from a source in a notebook, add the parameter name to the notebook’s code and then enclose it in single quotation marks to make sure that the parameter value is treated as a string.
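For example, if a notebook receives a string parameter named country from a source, the notebook code should quote the parameter reference itself (a sketch; the frame and parameter names are illustrative):
%sql SELECT * FROM `My Notebook Source` WHERE `Sales Country` = '${country}'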
Note
If changes are made to a notebook (for example, a column was renamed or a column’s data type was changed) after a source is created from it, the Source Editor screen displays a “Source is out of sync with underlying analytic output” error. You must bring the source back in sync with the notebook by clicking Synchronize next to the error message.
Schema Tab¶
The Schema tab displays the source’s input schema. The following information is provided.
Input Schema Detail | Description |
---|---|
Column Name | The physical name of each column in the input schema. |
Label | The logical name of the column. By default, the Label is the same as the Column Name. You can provide a different, more friendly name for the column. |
Data Type | The column’s data type. Attention Currently, Enterprise’s source support does not include columns with complex data types (ARRAY, MAP, or STRUCT). If a column uses a complex data type, convert the column to STRING in the underlying database table or file (if possible) and then recreate the source from the table or file. Alternatively, change the column’s data type to STRING under Data Type here. |
Description | The description of the column if one was provided. |
Primary Key | If the column is part of the primary key in the input schema data set, this check box is selected. |
You can change the input schema information if necessary. For example, you can select the Primary Key check box for a column to indicate that the column is part of the primary key.
Relationships Tab¶
This tab shows any outgoing relationships that exist between the source and other entities in your data model. You can also create new relationships.
Relationships that you define for a source will make the source reachable through related entities in the catalog when you need to use the source in defining the input schema for views.
To create a relationship between a source and another entity:
Click New relationship.
Under Name, type a name for the relationship.
Click the three dots at the right end of the Reference box.
In the catalog browser, select a workspace, a project, and then the entity with which you want to create a relationship from the source.
The Reference box displays the path to the entity. The Mapping section (under To) displays the primary-key columns from the selected entity. You must map the primary-key columns to appropriate columns in the source.
Tip
If the Mapping section is not visible, click Show in the upper-right corner of the tab.
Under From, select a column that you want to map to a primary-key column under To. Repeat this step until all the primary-key columns are mapped.
To define additional relationships, repeat steps 1–5.
To delete a relationship:
- Click Delete next to the name of the relationship.
Preview Tab¶
On the Preview tab, you can generate a preview of the first five rows from the source’s resultant output schema. This is helpful to see if the output schema contains desired results. Click Test to generate a preview.
Note
The Test button is unavailable for a source that is out of sync with its underlying notebook (see the note in the “Frame Parameters Section” section for more information) or if the source contains at least one parameter that does not have a value with Prompt for Input cleared. Synchronize the source with the notebook (click Synchronize at the top of the screen) or provide missing parameter values.
Output Schema Pane¶
The Output Schema pane on the right side of the screen provides information about the source’s output schema. The output schema results from the source’s input schema and any changes you have made to the source, such as:
- Changed column names and changes in the primary-key columns.
- Specified parameter values.
- New relationships defined for the source.
The Output Schema pane displays the following information:
- The source name.
- The name of the underlying entity (a database table, file, or notebook) that the source was created from.
- The names of the columns that make up the output schema. An icon next to a column name identifies the type of column. For example, a key icon next to a column indicates a primary-key column.
Tip
If the Output Schema pane is not visible on the right side, click the four-corners button.
To save your changes to the source:
In the upper-right corner of the screen, click Save.
To delete the source:
In the upper-right corner of the screen, click More and then Delete.
Notebook Editor Screen¶
A notebook refers to an interactive web-based environment for data analytics that uses Spark as the computation engine. With a notebook, you can ingest, explore, and visualize data and export results to share and collaborate on them with others.
Creating a notebook in Aginity Enterprise involves the following steps:
- Step 1: Define and materialize the input schema of the notebook. The input schema defines the frames and parameter (variable) values that you can use as input to your notebook code. To be able to run queries on the input schema frames from the notebook, you must first materialize the input schema. Materializing the input schema creates necessary temporary tables for the frames for the duration of a current Spark session.
- Step 2: Write and execute the notebook code. You can write code for your notebooks in Scala and Spark SQL. You can run individual snippets of the code to test results or all the code at once. The code will be executed on the Spark cluster selected as the default cluster in your Enterprise user preferences or for your instance of Enterprise.
- Step 3: Add the output of the notebook to its output schema. To use the analytic results produced by the notebook as input to views or other notebooks or to publish these results to a table or file, you need to add them to the output schema of the notebook. This will save the results as output schema frames and thus make them available for selection from the notebook within the catalog.
Note
Currently, to use an output schema frame from a notebook as input to a view or another notebook, you need to create a source from that frame and then add the source to the input schema of the view or notebook. Similarly, to publish an output schema frame, you need to create a source from the frame and add the source to the input schema of a publication.
You can use the Notebook Editor screen to create and edit notebooks in Aginity Enterprise. The Notebook Editor screen consists of:
Top Bar¶
The top bar of the Notebook Editor screen displays the following.
Property | Description |
---|---|
Name | The name of the notebook appears at the top of the screen. You can change the name if needed. |
Expand / Collapse button | Click this button to expand or collapse the Input Schema and Output Schema panes. |
Run button | Run the entire notebook code (all paragraphs). |
More | Access additional commands for the notebook. |
Save | Save all changes in the notebook. |
Input Schema Pane¶
You use the Input Schema pane on the left to define the input schema of the notebook. To use variable values (such as different sales years) in the notebook code, you can add parameters to the input schema, specifying a default value for each parameter, and then reference the parameters in the code.
To add a frame to the input schema:
- Click the plus sign next to Input Schema and click Insert Frame.
- In the catalog browser, select a workspace, a project and then the frame that you want to add.
- To add additional frames, repeat steps 1–2.
To add a parameter to the input schema:
- Click the plus sign next to Input Schema and click New Parameter.
- Provide a name, a data type, and a default value for the parameter. Optionally, enter a description of the parameter.
The name of each added frame and the path to it are displayed. To see the columns in the frame, click the arrow to the right of its name. If any parameters are defined for a frame, they are displayed under Frame Parameters.
Note
If changes are made to a frame that has been added to the input schema (for example, a column was renamed or a column’s data type was changed), an exclamation mark is displayed to the left of the frame name, and the affected columns are highlighted. Click Synchronize to update the input schema with the changes in the frame.

The parameters added to the input schema are displayed under Parameters.
To delete a frame or parameter from the input schema:
- At the top of the Input Schema pane, click the minus sign.
- Select the check box next to the name of each frame and parameter that you want to delete.
- Click Delete.
Notebook Code Area¶
You write the notebook code in the Notebook Code area (the middle area of the screen). Currently, only code in SQL (Spark SQL) and Scala is supported.
The Notebook Code area is divided into individual code sections called paragraphs. Each paragraph is designed to contain a snippet of code.
Note these points when writing notebook code:
An SQL code snippet must begin with the SQL interpreter, as shown below. (An interpreter is a plug-in that enables code in a specific language.)
%sql code
You do not need to begin Scala code snippets with the Scala interpreter because Scala is used by default.
When you include the name of a frame from the input schema or the name of a column from an input schema frame in the code, put grave accents (`) around the name, for example:
SELECT * from `My Notebook Source` WHERE `column1` IS NOT NULL
Tip
On a US QWERTY keyboard, the grave accent is the key in the upper-left corner to the left of the 1 key.
To reference a parameter in your code, use the following format:
${parameter_name}
Note
Aginity Enterprise automatically adds the dollar sign ($) in front of the parameter name, so you don’t need to type it.
Enclose the dollar sign ($) and the parameter name in single quotation marks for parameters of STRING data type and other types that require single quotation marks according to Spark SQL syntax:
'${parameter_name}'
Examples:
SELECT * from `My Notebook Source` WHERE `Sales Year` = ${year}
SELECT * from `My Notebook Source` WHERE `Sales Country` = '${country}' /* referencing a parameter of STRING data type */
Attention
Aginity Enterprise passes parameter values “as is” (that is, without any transformation). Therefore, you need to make appropriate changes in target code to make sure that the parameter values are processed correctly. For example, if you want to include a string parameter value in a notebook’s code, add the parameter name to the code and then enclose it in single quotation marks to make sure that the value is treated as a string (see the second example above).
Do not put a semicolon (;) at the end of SQL statements.
Tip
If a paragraph is disabled for entry of code, the editor is hidden for that paragraph. Click the Show the editor button to display it.

The upper-right of each paragraph displays:
- The status of the paragraph code (for example, READY or FINISHED).
- The Run this paragraph button.
- The Show / Hide editor button, the Show / Hide output button, and the gear button, which displays the paragraph options.
Before you can run queries against the input schema frames from the notebook code, you must first materialize these frames for the current Spark session.
To materialize the input schema frames:
In the upper-right corner of the Notebook Editor screen, click the run button.
The code snippets in the paragraphs are executed sequentially in top-down order. As the code in a paragraph is executed, the status in the right corner of the paragraph progresses from READY through PENDING and RUNNING to FINISHED. A status of FINISHED indicates that the code has been executed. If a frame is materialized, the name of the frame is highlighted in green in the Input Schema pane.
The output is displayed below the code in each paragraph. You can visualize the output by selecting from several representation options above the output section.
Tip
If the output is not visible, click the Show output button in the upper right of the paragraph.
You can export the resulting data set to a file by clicking the arrow on the right of the download button and selecting one of the supported file formats (for example, Comma Separated Value (CSV)).
Once the input frames have been materialized, you can run code snippets in individual paragraphs for your data science experiments by clicking the Run this paragraph button in the upper right of the paragraph you want.
To be able to use the output of the notebook within your catalog as input to views, publications, and other notebooks, you need to place the output data set in a temporary view using an appropriate SELECT query and then add that view as a frame to the output schema of the notebook. Below is an example of the code for creating a temporary view in Spark SQL:
%sql CREATE TEMPORARY VIEW My_Notebook_Output AS SELECT * FROM `My Notebook Source`
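After the temporary view has been created, you can optionally verify its contents in a separate paragraph before adding it to the output schema, for example:
%sql SELECT * FROM My_Notebook_Output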
Output Schema Pane¶
The Output Schema pane on the right side of the screen provides information about the notebook’s output schema. The output schema consists of individual output frames that you added from previously created temporary views.
To add a frame to the output schema:
To the right of Output Schema, click the plus sign.
In the Add Frame dialog box, click to select the name of the view that you created.
Click Add.
The frame is added to the Output Schema pane.
The Output Schema pane displays the following information for each output frame:
- The frame name.
- The names of the columns in the frame. (Click the arrow to the right of the frame name to display the columns.) An icon next to a column name identifies the type of column. For example, a key icon next to a column indicates a primary-key column.
Tip
If the Output Schema pane is not visible on the right side, click the four-corners icon.
To delete a frame from the output schema:
- To the right of Output Schema, click the minus sign.
- Select the check box next to the name of the frame to delete.
- Click Delete (under the Save button).
How to Access the Notebook Editor Screen¶
- In the navigation pane on the left, click Browse.
- In the Workspaces pane, select a workspace.
- In the Projects pane, select a project.
- In the Assets pane, select a notebook.
- On the Asset Details screen for the notebook, click Edit.
View Builder Screen¶
You can use the View Builder screen to create and edit views. A view is a frame that is built from a combination of existing sources. You can combine selected sources into a view by using joins, SELECT queries, and filters. You create a view to define a custom dataset to publish or to use as input for a notebook or another view.
The View Builder screen contains several tabs that represent the sequence of steps involved in building a view. The View Builder screen consists of:
Top Bar¶
The top bar gives access to the tabs that each represent a logical step of building a view, and also displays the following.
Property | Description |
---|---|
View name | The name of the view appears at the top of the screen. You can change the name if needed. |
More | Access additional commands for the view, such as Delete. |
Save | Save the changes in the view. |
Input Schema Tab¶
You can use the Input Schema tab to define the view’s input schema. The input schema defines a set of frames from the catalog that provide the input data for the view, or input frames, and any parameters that you define to refine the data held in the view. You can select sources, notebooks, and other views as input frames for the view.
The relationship tree on the left side of the tab displays the starting-point frame (that is, the basic input frame) selected for the view. The name of the frame and the path to the frame are displayed on the right. You can change the frame name.
You can select other input frames for the view from the frames that are connected to the starting-point frame through a path of existing incoming or outgoing relationships in the catalog. If a frame has an incoming relationship with the starting-point frame, a column in that frame references the primary key in the starting-point frame. If the starting-point frame has an outgoing relationship with a frame, a column in the starting-point frame references the primary key in that other frame.
The names of the outgoing relationships from the starting-point frame are displayed under the frame name. By default, incoming relationships are hidden; to display them, click Incoming in strikethrough formatting under the starting-point frame. A green icon appears to the left of an outgoing relationship, and a red icon appears to the left of an incoming relationship. To view the relationships nested in a parent relationship, click that parent relationship in the tree.
Tip
When you point to a relationship in the tree, the name of the relationship, the type of relationship (outgoing or incoming), and the name of the frame associated with the relationship are displayed in a tooltip.
To view the columns in the starting-point frame and the related frames, click Columns in strikethrough formatting under the starting-point frame.
To view the parameters (if any) defined in the starting-point frame and the related frames, click Parameters in strikethrough formatting under the starting-point frame.

The name of the starting-point frame appears at the top of the relationship tree.
Click Incoming (in strikethrough) to display incoming relationships.
Click Columns (in strikethrough) to display the columns in the starting-point frame and the related frames.
Click Parameters (in strikethrough) to display the parameters (if any) that are defined in the starting-point frame and the related frames.
A nested outgoing relationship.
A nested incoming relationship.
The names of the columns in the starting-point frame.
The names of the columns in a related frame.
To add a frame to the input schema of the view:
Drag its relationship from the relationship tree into the Input Schema tab on the right.
The name of the selected frame and the path to it relative to the starting-point frame are displayed on the right. You can change the frame name. To view the columns in the frame, click the arrow at the far right of its name. If the frame contains any parameters that require a value (the Prompt for Input option is selected for the parameters in the frame), these parameters are displayed under Frame Parameters. Click a parameter name and then enter a value for the parameter in the top box and click Done.
Note
In addition to the parameters inherited into the view from its input frames, you can also define parameters for the view. For more information, see Parameters Section below.
Parameters Section¶
You can use the Parameters section at the top of the Input Schema tab to define parameters for the view.
Warning
Aginity Enterprise passes parameter values “as is” (that is, without any transformation). Therefore, you need to make appropriate changes in target code to make sure that the parameter values are processed correctly. For example, if you want to include a string parameter value from a view in a notebook, add the parameter name to the notebook’s code and then enclose it in single quotation marks to make sure that the parameter value is treated as a string.
To add a parameter to the view:
At the far right of Parameters, click Add.
Enter a name for the parameter, select a data type, and enter a default value. Make sure that you use the following format for the parameter:
${parameter_name}
Note
Aginity Enterprise automatically adds the dollar sign ($) in front of the parameter name, so you don’t need to type it.
Examples:
${year}
Click Add.
To delete a parameter from the view:
- Point to the arrow at the far right of the parameter row to display the Remove button.
- Click Remove.
After you have defined the view’s input schema, you can now select the columns from each input frame to include in the view on the Columns tab.
Columns Tab¶
You can use the Columns tab to select columns from the frames in the input schema to add to the view.
To add a column to the view:
In the pane on the left side of the Columns tab, under the name of the frame that contains the column, drag the column into the right side of the tab.
Tip
If the list of columns in a frame is not visible, click the arrow next to the frame name.
On the right side, under Column Name, change the column name if needed.
If you want to apply an aggregate function to the data in the column, click in Function and select the aggregate function.
If needed, under Type, change the column’s data type.
If you want to define a custom expression on the column, do the following:
Point to the right of Expression and click Edit.
In the Expression builder box on the right, enter the expression to apply to the column (an example expression is shown after these steps).
Note
- You can use the expression on the column in combination with the aggregate function you selected for the column. If needed, you can select a different aggregate function for the column in the list on the left.
- You can also reference a parameter defined in the input schema of the view in the expression. On the left side, under Parameters, select a parameter and drag it into the appropriate place in the expression. A dollar sign ($) is automatically added before the parameter name in the expression.
- You can optionally provide a description for the expression by clicking Edit on the left and entering the description.
Click OK in the upper-right corner.
The column is added under Result Set on the left.
To add more columns to the view, repeat steps 1–5.
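For illustration, a custom expression might combine two columns from the input schema into a single column (a sketch only; the column names follow the Roster example used earlier in this guide, and the expression must use Spark SQL syntax):
concat(`first_name`, ' ', `last_name`)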
To delete a column from the view:
- Point to the right end of the column row and click Delete.
Filters Tab¶
You can use the Filters tab to define filters on the view. Filters are useful for restricting the data available for the view to a specific subset that meets the conditions defined in the filter. For example, if you created a Sales Metrics view and wanted to limit the data in the view to only customers with a conversion year of 2017 and later, you would define the following filter:
Sales Metrics.conversion_date GREATER THAN OR EQUALS '2017-01-01'
A simple filter defines a condition that the data in the view must meet by combining one or more column references, operators, and a filter expression. The filter expression can include references to parameters and constants.
You can also use a filter to define more than one condition to evaluate the data against. Such a filter is referred to as a complex filter and is said to contain a condition set. The individual conditions in the condition set are joined with the AND or OR logical operator. Data will be included in the view only if it meets both (AND) or either of the conditions (OR). We could add an additional condition to the filter on our example view “Sales Metrics” to only include customers from the Midwest region, joining both conditions with the AND operator:
Sales Metrics.conversion_date GREATER THAN OR EQUALS '2017-01-01'
AND
Sales Metrics.region EQUALS 'Midwest'
You can further refine the subset of data in the view by adding nested condition sets to the filter. Your filter can have only one top-level (parent) condition set and any number and levels of nested (child) condition sets, with the top-level condition set joining to the nested sets with AND or OR operators.
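For example, extending the Sales Metrics filter above, a nested condition set could accept either of two regions while still requiring the conversion-date condition (an illustration in the same notation as the examples above; the 'Northeast' value is hypothetical):
Sales Metrics.conversion_date GREATER THAN OR EQUALS '2017-01-01'
AND
(
Sales Metrics.region EQUALS 'Midwest'
OR
Sales Metrics.region EQUALS 'Northeast'
)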
To define a filter on the view:
At the top of the tab, click Add.
To define a top-level condition, do the following:
In the Input Schema pane on the left side, select a column to include in the filter and drag it onto the Add button.
In our example, we would drag the conversion_date column onto the Add button.
In the list on the right of the column name, select a comparison operator for the filter expression (for example, CONTAINS, EQUALS or GREATER THAN).
In our example, we would select an operator of GREATER THAN OR EQUALS.
To define an expression for the condition, at the right end of the condition row, click Edit and write the expression in the Expression editor area. To reference a column or parameter in the expression, drag it from the Input Schema pane into the Expression editor area on the right side. A dollar sign ($) is automatically added before the parameter name. For parameters of STRING data type and other types that require single quotation marks according to Spark SQL syntax, enclose the dollar sign and the parameter name in single quotation marks, for example:
'$country_code'
The expression must follow the syntax of Spark SQL, which supports most of the HiveQL syntax. For more information, refer to the “Compatibility with Apache Hive” section in the Spark SQL documentation.
Warning
Aginity Enterprise processes the expression as is (that is, without transformation). You therefore need to write the expression in such a way that it suits your specific needs. For example, to include a string in the expression, make sure that you enclose it in single quotation marks.
In the upper-right corner, above the Expression editor area, click Done.
To define additional filters, repeat steps 1–2.
Output Schema Pane¶
The Output Schema pane on the right side of the screen provides information about the view’s output schema. The output schema results from the view’s input schema and any changes you have made to the view, such as:
- Changed column names and changes in the primary-key columns.
- Specified parameter values.
- New relationships defined for the view.
The Output Schema pane displays the following information:
- The view name.
- The name of the underlying entity (a database table, file, or analytic) that the view was created from.
- The names of the columns that make up the output schema. An icon next to a column name identifies the type of column. For example, a key icon next to a column indicates a primary-key column.
Tip
If the Output Schema pane is not visible on the right side, click the four-corners button.
To save your changes to the view:
In the upper-right corner of the screen, click Save.
To delete the view:
In the upper-right corner of the screen, click More and then Delete.
How to Access the View Builder Screen¶
- In the navigation pane on the left, click Browse.
- In the Workspaces pane, select a workspace.
- In the Projects pane, select a project.
- In the Assets pane, select a view.
- On the Asset Details screen for the view, click Edit.
Administering Aginity Enterprise¶
Installing Aginity Enterprise: Prerequisites and High-Level Installation Steps¶
This topic lists the prerequisites for installation of Aginity Enterprise and describes the high-level steps needed to install Aginity Enterprise in your environment.
Spark Cluster Configurations¶
When a user runs a publication or notebook, Aginity Enterprise submits a Spark application to a configured Spark cluster for execution of the publication or notebook. To execute publications and notebooks on Spark, you must add a configuration for each cluster that you want Aginity Enterprise to connect to.
Setting a default Spark cluster for your Aginity Enterprise instance is another requirement.
If you have an Amazon EMR cluster, you need to configure certain settings on the cluster to enable users to materialize notebooks in Aginity Enterprise.
This topic assumes knowledge of Apache Hive and Apache Hadoop concepts, such as “HDFS,” “YARN,” and “Livy server.” For information about these concepts, refer to the Hive and Hadoop documentation.
Add a Spark Cluster Configuration¶
You can configure a connection to a Spark cluster in Aginity Enterprise by describing its configuration. After adding the configuration, you can test a connection to the cluster.
You need to perform the following steps for each Spark cluster that you want Aginity Enterprise to connect to.
To add a Spark cluster configuration:
In the navigation pane, click Settings and then click Spark clusters.
In the Spark clusters pane, click Add.
In the New cluster name dialog box, type a friendly name for the cluster to help identify it in Aginity Enterprise, select YARN in the Cluster list, and then click Create.
Note
Currently, only clusters of YARN type are supported.
In the main content area, in the Spark Version list, select the version of Spark (a combination of a Spark version and a version of the Scala programming language) that is used by the cluster.
Note
The read-only Master Address currently displays the type of cluster (Yarn) specified for the cluster.
Specify whether this is the default cluster for your instance of Aginity Enterprise. To indicate that this is the default cluster, under Make default in the upper right, drag the slider to the right.
Aginity Enterprise uses the default Spark cluster to execute queries, notebooks or publications and generate previews (for example, for sources).
Note
- You can also specify a default Spark cluster for each user on the Users screen. The user can change his or her default cluster in the account preferences. The user’s default cluster takes precedence over the default cluster specified for the instance and will be used to execute notebooks and publications and generate previews for that user.
- On the list of Spark clusters, “default” appears next to the name of the default cluster specified for the instance of Amp.
To describe the configuration of the cluster, add the configuration properties from the following tables. Do the following to add each property and specify its value:
Under Advanced properties, click Add.
Under Key, type the name of the property as it appears in the table.
Under Value, type a value for the property. Use the recommended value from the table where given.
General Cluster Configuration Properties
All the following properties are required.
Key | Recommended Value | Description |
---|---|---|
fs.defaultFS or fs.default.name | | The location of the HDFS that the Spark cluster works with, in the format hdfs://host_name:port_number. |
spark.yarn.archive | hdfs://host_name:port_number/user/spark/spark-jars.tgz | The path to the spark-jars.tgz file on HDFS. Make sure that you use the full path, as shown in Recommended Value. The spark-jars.tgz file contains the required Spark JAR libraries. You need to upload the spark-jars.tgz file to your Spark cluster as part of provisioning the cluster. |
yarn.nodemanager.vmem-check-enabled | false | Specifies whether virtual memory limits will be enforced for containers. |
yarn.resourcemanager.address | Use the value from yarn-site.xml of the Spark cluster configuration. | The address of the applications manager interface in the Resource Manager. |
yarn.resourcemanager.scheduler.address | Use the value from yarn-site.xml of the Spark cluster configuration. | The address of the scheduler interface. |
env.HADOOP_USER_NAME | | The user under which Aginity Enterprise submits Spark applications to a non-Kerberos Spark cluster. The default user is root. |
Notebook-Related Configuration Properties
The following configuration properties are required for notebooks.
Key | Recommended Value | Description |
---|---|---|
amp.livy.url | | The location of the Livy server. |
livy.spark.jars | | If you want to make additional Java libraries available for notebooks, provide a comma-separated list of the paths to the library JAR files. |
Attention
Unlike the path in spark.yarn.archive, the paths in livy.spark.jars must be relative to the HDFS root (for example, /user/livy/lib/some-lib.jar).
Make sure that the livy.spark.jars property contains the paths to the following files:
amp-hive-udf-<version>.jar - This file contains the Hive user-defined functions (UDFs) that Aginity Enterprise requires to execute external analytics. You can download the file from the Aginity Enterprise Downloads page (click the Download Hadoop Extensions Toolkit (Amp .jar file) link). Place the file in any directory on your Spark cluster. An example of the path to the file:
/user/user_name/amp-hive-udf-<version>.jar
Note
- Use the version of the amp-hive-udf-<version>.jar file that was released with your version of Amp Cache Catalog.
- You must also specify the path to the amp-hive-udf-<version>.jar file in the spark.jars property.
spark-<Spark_version>-deps-<version>.jar - This JAR file contains the redistributable connectors and associated dependencies that are required by Aginity Enterprise. You need to upload this file to your Spark cluster as part of provisioning the cluster. An example of the path to the file:
/user/lib/spark-<Spark_version>-deps-<version>.jar
Note
You must also specify the path to the spark-<Spark_version>-deps-<version>.jar file in the spark.jars property.
spark-pushdown-<Spark_version>-<version>.jar - This JAR file contains the libraries that are required to enable the pushdown functionality in Aginity Enterprise. You need to upload this file to your Spark cluster as part of provisioning the cluster.
Note
You must also specify the path to the spark-pushdown-<Spark_version>-<version>.jar file in the spark.jars property.
The path on HDFS to the JAR file for each third-party JDBC driver that you use in Aginity Enterprise. The Spark cluster will use this path to access the driver JAR file to connect to the source database when executing notebooks.
The specified HDFS location must be accessible by the user under which Aginity Enterprise submits Spark applications to the Spark cluster. For a Kerberos-secured cluster, this is the user specified by the amp.livy.kerberos.principal configuration property (see the Kerberos-secured cluster configuration properties section). For a non-Kerberos cluster, this is the user specified by the env.HADOOP_USER_NAME configuration property (see the General cluster configuration properties section).
Key | Recommended Value | Description |
---|---|---|
livy.spark.driver.cores | | The number of Livy driver cores. This property is automatically added and set to the default value of 1. You can specify a different value that meets your specific needs. |
livy.spark.driver.memory | | Livy driver memory. This property is automatically added and set to the default value of 512m. You can specify a different value that meets your specific needs. |
livy.spark.executor.cores | | The number of cores per executor. This property is automatically added and set to the default value of 1. You can specify a different value that meets your specific needs. |
livy.spark.executor.instances | | The number of executor instances. This property is automatically added and set to the default value of 1. You can specify a different value that meets your specific needs. |
livy.spark.executor.memory | | Executor memory per worker instance. This property is automatically added and set to the default value of 512m. You can specify a different value that meets your specific needs. |
livy.spark.hadoop.hive.metastore.uris | thrift://metastore_host:port | A comma-separated list of URIs. The Livy Server connects to one of these URIs to make metadata requests to a remote metastore. |
spark.jars | | Make sure that the spark.jars property provides the paths to the following files. |
Note
Make sure to provide full paths, including the HDFS root.
amp-hive-udf-<version>.jar - This file contains the Hive user-defined functions (UDFs) that Aginity Enterprise requires to execute external analytics. You can download the file from the Amp Downloads page (click the Download Hadoop Extensions Toolkit (Amp .jar file) link). Place the file in any directory on your Spark cluster.
An example of the path to the file:
hdfs://server:port/user/user_name/amp-hive-udf-<version>.jar
Note
- Use the version of the amp-hive-udf-<version>.jar file that was released with your version of Amp Cache Catalog.
- You must also specify the path to the amp-hive-udf-<version>.jar file in the livy.spark.jars property.
spark-<Spark_version>-deps-<version>.jar - This JAR file contains the redistributable connectors and associated dependencies that are required by Aginity Enterprise. You need to upload this file to your Spark cluster as part of provisioning the cluster. An example of the path to the file:
hdfs://host:port/user/lib/spark-<Spark_version>-deps-<version>.jar
spark-pushdown-<Spark_version>-<version>.jar - This JAR file contains the libraries that are required to enable the pushdown functionality in Aginity Enterprise. You need to upload this file to your Spark cluster as part of provisioning the cluster.
The path on HDFS to the JAR file for each third-party JDBC driver that you use in Aginity Enterprise. The Spark cluster will use this path to access the driver JAR file to connect to the source and target databases when executing publications.
The specified HDFS location must be accessible by the user under which Aginity Enterprise submits Spark applications to the Spark cluster. For a Kerberos-secured cluster, this is the user specified by the amp.kerberos.principal configuration property (see the Kerberos-secured cluster configuration properties section). For a non-Kerberos cluster, this is the user specified by the env.HADOOP_USER_NAME configuration property (see the General cluster configuration properties section).
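For illustration only, the corresponding spark.jars value would use full paths that include the HDFS root (the host, port, directories, and driver file name are placeholders):

spark.jars = hdfs://namenode_host:8020/user/livy/lib/amp-hive-udf-<version>.jar,hdfs://namenode_host:8020/user/livy/lib/spark-<Spark_version>-deps-<version>.jar,hdfs://namenode_host:8020/user/livy/lib/spark-pushdown-<Spark_version>-<version>.jar,hdfs://namenode_host:8020/user/lib/RedshiftJDBC42-<version>.jar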
Attention
In addition to adding the configuration properties above, to enable users to run notebooks, set the livy.repl.enableHiveContext property to true in the configuration file of the Livy server (livy.conf). This is required for Livy to see the Hive metastore correctly.
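For example, the relevant line in livy.conf would look similar to the following (note that some Livy releases use the dashed spelling livy.repl.enable-hive-context, as shown in the EMR example later in this document):

livy.repl.enableHiveContext = true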
Hive Configuration Properties
These configuration properties are required.
Key | Recommended Value | Description
---|---|---
hive.mapred.supports.subdirectories | true | Specifies whether the version of Hadoop used by the cluster supports subdirectories for tables and partitions.
hive.metastore.uris | Use the value from hive-site.xml of the Spark cluster configuration. | A comma-separated list of URIs. Hive connects to one of these URIs to make metadata requests to a remote metastore.

HDP-Specific Configuration Properties
Add the following properties only if the cluster uses the Hortonworks Data Platform (HDP).
Key | Recommended Value | Description
---|---|---
spark.driver.extraJavaOptions | -Dhdp.version=${version_of_the_HDP}, for example, -Dhdp.version=2.6.1.0-129 | Used to pass a string of extra Java virtual machine (JVM) options to the Spark driver. Use this property to pass the version of HDP. See Recommended Value.
spark.yarn.am.extraJavaOptions | -Dhdp.version=${version_of_the_HDP}, for example, -Dhdp.version=2.6.1.0-129 | Used to pass a string of extra JVM options to the YARN Application Master in client mode. Use this property to pass the version of HDP. See Recommended Value.

Kerberos-Secured Cluster Configuration Properties
The following configuration properties are required for a Kerberos-secured Spark cluster. You need to add all or some of these properties depending on your needs (see the notes under the table).
Key | Recommended Value | Description
---|---|---
amp.kerberos.kdc | | Fully qualified domain name (FQDN) of the Kerberos Key Distribution Center (KDC) server on your network.
amp.kerberos.principal | | Specifies the Kerberos principal (a unique identity to which Kerberos can assign tickets to access Kerberos-secured Hadoop services) under which Aginity Enterprise will authenticate with KDC to submit publication jobs to the Spark cluster.
amp.kerberos.password | | The password of the principal.
amp.livy.kerberos.principal | | Specifies the Kerberos principal under which Aginity Enterprise will authenticate with KDC to connect to the Livy server when working with notebooks. This can be the same principal as the one specified in amp.kerberos.principal or a different principal.
amp.livy.kerberos.password | | The password of the principal. If you use the same principal as the one specified in amp.kerberos.principal to connect to the Livy server, set this property to the same password that you specified in amp.kerberos.password.

Note
- To enable Aginity Enterprise only to submit publication jobs to a Kerberos-secured Spark cluster, you need to add only the amp.kerberos.kdc, amp.kerberos.principal, and amp.kerberos.password properties.
- To enable Aginity Enterprise only to connect to the Livy server in a Kerberos-secured Spark cluster when working with notebooks, you need to add only the amp.kerberos.kdc, amp.livy.kerberos.principal, and amp.livy.kerberos.password properties.
Kerberos Hive Configuration Properties
Add the following properties if your Spark cluster connects to a Kerberos-secured Hive server to execute a publication that uses a Hive database table as a source.
Key | Recommended Value | Description
---|---|---
hive.metastore.sasl.enabled | true | Specifies whether client connections use MapR-SASL when authenticating with the Hive Metastore.
hive.server2.authentication | Kerberos | Specifies the authentication mode for connections to HiveServer2.
hive.metastore.kerberos.principal | | The Kerberos principal for Hive Metastore.
hive.server2.authentication.kerberos.principal | | The Kerberos principal for HiveServer2.

Amazon EMR Configuration Properties
Add the following configuration properties if you have an Amazon EMR cluster and need to read from and write to Hive tables.
Key | Recommended Value | Description
---|---|---
spark.driver.extraClassPath | The default (vendor-provided) value. | This configuration property contains the paths of the libraries required by a Spark driver process. You can look up the default value of spark.driver.extraClassPath in SPARK_HOME/conf/spark-defaults.conf on the master node of your cluster.
hive.metastore.client.factory.class | com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory | Add this configuration property if the AWS Glue Data Catalog is used as the Hive metastore on your EMR cluster.
Make sure that you also add the spark.driver.extraClassPath configuration property to the Spark cluster configuration. This property includes the path of the AWS SDK for Java, which contains the class specified in hive.metastore.client.factory.class.
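For example, you could look up the default value on the master node with a command along these lines (if SPARK_HOME is not set in your shell, on EMR it is typically /usr/lib/spark):

grep spark.driver.extraClassPath $SPARK_HOME/conf/spark-defaults.conf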
To save the cluster configuration, click Save in the upper-right corner of the main content area.
Test a Connection to the Cluster¶
In the upper-right corner of the screen, click Test.
Aginity Enterprise creates a test job and sends it to the cluster. If the test job is successfully completed on the cluster, a “Success” message appears in Aginity Enterprise.
Delete a Spark Cluster Configuration from Amp¶
- In the Spark clusters pane, select the cluster configuration to delete.
- In the upper-right corner of the main content area, click Delete.
Enable Notebook Materialization on an EMR Cluster¶
If you have an Amazon EMR cluster and users create notebooks that use Netezza or Hive source tables as input, you need to add the following settings to the Livy configuration file (livy.conf) on the cluster to enable materialization of notebooks in Aginity Enterprise:
{
  "Classification": "livy-conf",
  "Properties": {
    "livy.repl.enable-hive-context": "true",
    "livy.spark.deploy-mode": "cluster"
  }
}
Tip
The livy.conf file is typically located in /etc/livy/conf/ on the cluster.
Minimum Installation Prerequisites¶
Make sure that your environment meets the following minimum requirements:
The following system requirements for Aginity Enterprise are met:
Application Requirements
Hardware
- Eight-core CPU
- 32 GB RAM
- 10 GB reserved disk space
- A minimum screen resolution of 1366 x 768

Software
- Linux (CentOS, RedHat, Ubuntu)
- Docker (CE/EE) 17+
- Docker Compose 1.16+

Other requirements
- The host computer must have a resolvable fully qualified domain name (FQDN).
- Aginity Enterprise requires port 443. The ports defined in your organization for communication with Spark must also be open.
Also, make sure that the ports that are used by the following services are open for Aginity Enterprise to access these services.
Port | Used by
---|---
10000 | Hive Server.
9083 | Hive Metastore.
8999 | Livy Server. Aginity Enterprise requires access to Livy Server to run notebooks. Note: on a Spark cluster that is on Amazon EMR, Livy uses port 8998.
8020 | HDFS. This port must be open to enable Aginity Enterprise to perform file-system operations on metadata.
8030 | YARN Resource Manager Scheduler.
8050 | YARN Resource Manager. Note: on an Amazon EMR Spark cluster, YARN Resource Manager uses port 8032.
50010 | HDFS. This port must be open to enable Aginity Enterprise to transfer publication code packages to HDFS for execution of publications on Spark.
Compute Engine Requirements
- Apache Spark 2.2 or 2.3
Attention
Starting with version 3.0.12, Aginity Enterprise will no longer support Spark 2.1.
- Apache Livy 0.3, 0.5
Supported Hadoop Platforms
- HDP 2.6.3 and later
- EMR 5.9.0 and 5.17
- HDInsight 3.6
Earlier versions of these Hadoop platforms are not supported.
Note
For users to browse and query Aginity Enterprise catalog assets from an external tool through AQE, your Spark cluster must use EMR version 5.17 or HDP version 2.6.5.
During installation, the Aginity Enterprise CLI connects to the Docker Hub to pull the application images from the Aginity Enterprise private repository. Therefore, make sure that all the nodes of your Swarm cluster and the location that you run the Aginity Enterprise CLI from have an Internet connection and can connect to the following Docker servers:
- registry-1.docker.io
- auth.docker.io
- index.docker.io
You must have access to the Aginity Enterprise private Docker Hub repository, which stores the Aginity Enterprise application images. Contact Aginity to make sure that you have access.
You must have a Docker Hub account. For information about creating a Docker Hub account, refer to the Docker documentation.
Docker is installed on the host computer where Aginity Enterprise will be running. Docker must be configured to run in swarm mode.
High-Level Installation Steps¶
The installation of Aginity Enterprise in your environment consists of these high-level steps:
Install Aginity Enterprise. Use the Aginity Enterprise Command-Line Interface (CLI) to install Aginity Enterprise in your environment.
Make sure that you install the Aginity Enterprise CLI first.
Provision your Spark cluster. To provision Spark for execution of jobs from Aginity Enterprise, you need to upload required JAR libraries to Hadoop Distributed File System (HDFS).
Configure a connection to your Spark cluster. You need to configure a connection to your Spark cluster by setting the different configuration properties.
Provisioning a Spark Cluster¶
Whenever a user runs a publication or notebook in Aginity Enterprise, a job is created and is submitted to a configured YARN Spark cluster for execution. To enable Aginity Enterprise to work with sources for notebooks and publications and submit jobs to Spark, you need to provision your cluster with the following:
- Spark libraries - These JAR libraries are required by Spark. On an EMR or HDP cluster, the Spark libraries are contained in the jars directory inside the Spark installation directory.
- Aginity Enterprise dependencies library - Contains the third-party libraries and connectors that are required to connect to all supported data sources from Aginity Enterprise. In particular, this library is required to enable optimized Redshift connections from Aginity Enterprise.
- Pushdown libraries JAR file - Contains the required libraries that enable pushdown functionality in Aginity Enterprise.
This topic assumes knowledge of Apache Hive and Apache Hadoop concepts, such as “HDFS,” “YARN,” and “Livy server.” For information about these concepts, refer to the Hive and Hadoop documentation.
You need to perform the following steps for each of your clusters after:
- You installed Aginity Enterprise in your environment.
- The Hadoop distribution running on your cluster was upgraded (for example, after an upgrade to a newer version of Hadoop Data Platform (HDP)).
Provision the Cluster with the Spark Libraries¶
Create a secure shell (SSH) connection to the node of your cluster that contains the Spark installation directory (the value of the SPARK_HOME environment variable). Typically, this directory is on the master node.
Create the following directory on HDFS to hold the Spark libraries:
hdfs dfs -mkdir /apps/Aginity
Depending on your Hadoop distribution (EMR or HDP), run the following command to add the Spark libraries to an archive (.tgz) file:
- EMR:
tar -C /usr/lib/spark/jars/ -cvzf spark-jars.tgz .
- HDP:
tar -C /usr/hdp/current/spark2-client/jars/ -cvzf spark-jars.tgz .
Note
Make sure that you put a space and a period (.) at the end of the command, as shown.
Copy the spark-jars.tgz archive to the created directory on HDFS:
hdfs dfs -put spark-jars.tgz /apps/Aginity/
Run the following command to change the permissions of the archive to full access:
hdfs dfs -chmod 777 /apps/Aginity/spark-jars.tgz
In Aginity Enterprise, specify the full path to the file in the spark.yarn.archive configuration property (see the General cluster configuration properties table in the Spark Cluster Configurations topic for more information).
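For example, if the NameNode of your cluster is namenode_host listening on port 8020 (placeholders), the property value would be:

spark.yarn.archive = hdfs://namenode_host:8020/apps/Aginity/spark-jars.tgz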
Provision the Cluster with the Amp Dependencies Library¶
Download the spark-Spark_version-deps-version.jar file for your version of Aginity Enterprise from the Aginity Enterprise Downloads page.
To enable users in your organization to browse and query assets in your Amp catalog from an external tool, you need to download the Amp dependencies library for EMR (spark-EMR-EMR_version-deps-version.jar) or HDP (spark-HDP-HDP_version-deps-version.jar), depending on the Hadoop distribution that is running on your Spark cluster.
Upload the file to a directory on the HDFS host that is accessible from your Spark cluster.
Note
To minimize the number of network hops, choose an HDFS directory that is as close to the Spark cluster as possible; ideally, the HDFS directory and the Spark cluster are co-located. Because the path to the spark-Spark_version-deps-version.jar file must be specified in two different configuration properties in Aginity Enterprise (see step 3 below), we recommend that you upload this JAR file to the livy folder on HDFS: hdfs://host:port/user/livy/lib/spark-<Spark_version>-deps-<version>.jar. This saves you from having to enter two different paths for the same file.
In Aginity Enterprise, specify the path to the file in the spark.jars configuration property (required for publications) and the livy.spark.jars configuration property (required for notebooks) in your Spark cluster configuration.
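For illustration, assuming the recommended livy directory from the tip above, the upload might look like this (run on a node with HDFS client access; the file name placeholders match the downloaded JAR):

hdfs dfs -mkdir -p /user/livy/lib
hdfs dfs -put spark-<Spark_version>-deps-<version>.jar /user/livy/lib/

You would then enter /user/livy/lib/spark-<Spark_version>-deps-<version>.jar in livy.spark.jars and hdfs://host:port/user/livy/lib/spark-<Spark_version>-deps-<version>.jar in spark.jars.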
Provision the Cluster with the Pushdown Libraries¶
- Download the latest spark-pushdown-Spark_version-version.jar file for your version of Aginity Enterprise from the Aginity Enterprise Downloads page.
- Upload the file to a directory on the HDFS host that is accessible from your Spark cluster.
- In Aginity Enterprise, specify the path to the file in the spark.jars configuration property (required for publications) and the livy.spark.jars configuration property (required for notebooks) in your Spark Cluster configuration.
Connections¶
Aginity Amp can make the following types of connection to read source data for notebooks and publications. Click the links below for instructions on how to create each type of connection.
Creating a Database Connection¶
JDBC Drivers¶
Database connections from Enterprise require appropriate JDBC drivers. Enterprise supports JDBC connections to the following database types:
- Hive
- Postgres
- Redshift
- Netezza
- Oracle
- MS SQL Server
- Snowflake
For each database type that users will connect to from Enterprise, you need to have a driver. For Hive or Postgres database connections, you don’t need to worry about drivers. The driver software for these database types is packaged along with the Enterprise product software. For other database connections (for example, Redshift or Oracle), you need to complete the following steps to add the drivers to Enterprise.
Note
If users in your organization need to query source data in a database (for example, in a Postgres database) from an external tool or application through AQE, you need to download and install the appropriate JDBC driver, add it to HDFS on your cluster, and specify the path to the driver in the spark.jars property in your Spark cluster configuration. For more information, see Configuring a Spark Cluster for AQE.
To execute a publication or a notebook, Spark needs to connect to the database specified in the Spark application for that publication or notebook. Spark uses the appropriate JDBC driver to connect to the database. To enable Spark to access the driver, you need to place the driver JAR file on HDFS and specify the path to it in the Spark cluster configuration, as part of adding the driver to Enterprise.
We assume that you are familiar with the JDBC-driver concepts, such as “class name” and “connection template,” that are mentioned here.
Add a JDBC Driver to Enterprise¶
To add a driver to Amp:
Place the driver JAR file on any HDFS location.
The selected location must be accessible by the user under which Enterprise submits Spark applications to the Spark cluster.
Note
Enterprise submits Spark applications to a Spark cluster secured with Kerberos under the user specified by the amp.kerberos.principal and amp.livy.kerberos.principal properties in the cluster configuration. Enterprise submits Spark applications to a non-Kerberos Spark cluster under the user specified by the env.HADOOP_USER_NAME configuration property.
In the Spark cluster configuration in Enterprise, enter the path to the driver JAR file into the following configuration properties:
- livy.spark.jars - Spark will use the path in this property to connect to the source database when executing notebooks.
- spark.jars - Spark will use the path in this property to connect to the source and target databases when executing publications.
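As an illustration, for a hypothetical Redshift driver JAR the upload and the resulting property entries might look like this (the directory, file name, host, and port are placeholders):

hdfs dfs -put RedshiftJDBC42-<version>.jar /user/lib/

You would then add /user/lib/RedshiftJDBC42-<version>.jar to livy.spark.jars and hdfs://namenode_host:8020/user/lib/RedshiftJDBC42-<version>.jar to spark.jars.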
In the navigation pane, click Settings and then click Drivers.
Note
The Hive and Postgres JDBC drivers appear under Packaged on the list of drivers. The properties of these drivers cannot be edited.
In the main content area displaying information about existing drivers, click New.
In the New Drivers dialog box, type a friendly name for the connection.
On the Driver screen, provide the following information:
Class name - The driver class name. For example, for the SQL Server JDBC driver, you would enter com.microsoft.sqlserver.jdbc.SQLServerDriver.
You can add only one driver with a specific class name to Enterprise.
JAR File - Click in the JAR File box, select the driver JAR file on your file system, and click Open.
Connection template - Enter the connection string template according to the driver. For example, for the SQL Server JDBC driver, you would enter jdbc:sqlserver://{host}:{port};{database}.
SQL dialect - Select the SQL dialect used by the database.
For the Oracle JDBC driver, you must select the SQL dialect of ANSI.
Note
The Snowflake JDBC driver information
Class name - Enter net.snowflake.client.jdbc.SnowflakeDriver.
JAR File - Select the downloaded driver JAR file on your file system and click Open.
You can download the Snowflake JDBC driver from the Snowflake JDBC repository.
Connection template - Enter jdbc:snowflake://{host}/?user={user}&password={password}.
SQL dialect - Select PGSQL.
Click Save to save the driver in Enterprise.
You now need to specify the driver each time you create a connection for that database type.
Delete a JDBC Driver from Enterprise¶
Note
You cannot delete the packaged Hive and Postgres JDBC drivers.
To delete a driver:
- On the drivers list, under Uploaded, select the name of the driver that you want to delete.
- In the Driver screen, in the upper-right corner, click More and then click Delete.
Limitations and Known Issues¶
Consider the following limitations and known issues of Spark before using JDBC drivers to connect to databases on different data platforms. Workarounds are provided where available.
Redshift

Limitation or known issue | Workaround
---|---
Spark supports only these DATE and TIMESTAMP formats when reading data from a Redshift table: | N/A
When creating a table in a Redshift database, Spark tries to save a BOOLEAN column as BIT, which is an unsupported data type. As a result, creating the table fails with an error. | N/A

MySQL

Limitation or known issue | Workaround
---|---
Because of a known issue, data cannot be written to a MySQL database if the data includes a TIMESTAMP column. | N/A

Postgres

Limitation or known issue | Workaround
---|---
Spark does not support the MONEY data type in Postgres and incorrectly recognizes it as the Double type. As a result, reading a MONEY value from a table fails with an error. This issue does not occur when values less than 1000 are read. | N/A

MS SQL Server

Limitation or known issue | Workaround
---|---
Spark cannot generate values of the hierarchyid data type. | N/A
Spark cannot generate values of the image data type. | N/A
Spark maps the geography and geometry data types to the BinaryType Spark data type. | N/A
The datetimeoffset data type is not supported. | Convert datetimeoffset values to the TIMESTAMP WITH TIMEZONE data type.
Writing to a binary column fails with an error. Spark tries to save the column as the BLOB data type. | N/A

Oracle

Limitation or known issue | Workaround
---|---
Spark does not support the following data types when reading data from an Oracle database: | N/A
When reading data, Spark converts the FLOAT data type to DECIMAL (127,-127) and the NUMBER data type (without size and precision) to DECIMAL (10,-127). | Specify the appropriate size and precision for such columns manually.
When writing data, Spark converts the FLOAT and LONG data types to NUMBER (19). If the column size is larger than 19, the data write operation fails with an error. | Use the DECIMAL data type with the appropriate column size.
When writing data, Spark converts the STRING data type to VARCHAR2 (255). If the column size is larger, the data write operation fails with an error. | N/A

Teradata

Limitation or known issue | Workaround
---|---
Spark does not support the JSON and ARRAY data types, so reading data from JSON and ARRAY columns fails with an error. | N/A
Data cannot be written to Teradata. Generating a CREATE TABLE statement fails with an error because of a known issue. | N/A
Connecting to a Hive SSL-Enabled Database¶
You need to perform the following steps in Enterprise and outside Enterprise to connect to a Hive database that requires Secure Sockets Layer (SSL) connections.
Add a copy of the SSL certificate file, hive.jks, to the Docker container of the Enterprise Datasource Browser service on the host where Enterprise is installed:
Copy the certificate file from the head node of the cluster to a directory on the host.
The file is typically located in /etc/hive/conf on the head node.
Run the following Docker command:
docker cp <local directory on host>/hive.jks <datasource-browser-service container id>:/etc/ssl
Tip
You can see the container ID of the Datasource Browser service by running the following Docker command on the host:
docker ps -a
The container ID is shown under CONTAINER ID next to “aginityamp/amp-datasource-browser-service:version” under IMAGE.
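If you prefer to print only the container ID, a filtered command along these lines should also work (the image tag is a placeholder):

docker ps --filter "ancestor=aginityamp/amp-datasource-browser-service:<version>" --format "{{.ID}}"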
In Enterprise, create a database connection to the Hive database and in the Advanced properties section of the Database Connection screen, add the following properties.
Property Key | Property Value | Description
---|---|---
ssl | true | Indicates that HiveServer2 uses SSL encryption.
sslTrustStore | /etc/ssl/hive.jks | Path to the SSL certificate file in the container.
trustStorePassword | password | Password for the certificate.
transportMode | http | HiveServer2 transport mode. This is an optional property. Add it only if hive.server2.transport.mode on the cluster is set to http.
httpPath | cliservice | The path component of the URL endpoint when HiveServer2 is in HTTP mode. This is an optional property. Add it only if hive.server2.thrift.http.path on the cluster is set to cliservice.

In the upper-right corner of the screen, click Save to save the changes in the database connection.
Setting Up a Redshift Database Connection¶
Following are the high-level steps that you need to perform to set up a connection to a Redshift database from Enterprise:
- Download the latest spark-Spark_version-deps-version.jar file and place it in any HDFS directory on your cluster.
- Specify the path of the spark-Spark_version-deps-version.jar file on HDFS in the livy.spark.jars and spark.jars properties in the Spark cluster configuration in Enterprise.
- Add the Redshift JDBC driver to Amp.
- Create a Redshift database connection. Make sure that you provide values for the advanced Redshift connection properties.
To enable users to use database tables as sources, you need to create appropriate database connections in Enterprise. As a prerequisite for creating a connection, make sure that you have added the appropriate JDBC driver to Amp.
Note
For a Hive or Postgres database connection, you don’t need to add a driver. The drivers for these database types are packaged along with the Enterprise product software.
Supported Databases¶
- Redshift
- Netezza
- Hive
- Postgres
- MS SQL Server
- Oracle
- Snowflake
Create a Database Connection¶
To create a database connection:
In the navigation pane, click Settings and click Connections.
At the top of the connections pane, click New.
In the New Connection dialog box, do the following:
- Type a friendly name for the connection.
- In the Type list, select Database Connection.
- In the Driver list, if you are creating a connection for a database other than Hive or Postgres (for example, a Netezza database), select the name of the driver that you have previously uploaded. If you are creating a connection for a Hive or Postgres database, select Hive or Postgres.
- Click Save.
In the main content area, provide the connection information.
Connection Property | Description
---|---
Host | The name or IP address of the host server. For a Snowflake database connection, specify the host in the format "{account_name}.{region_id}.snowflakecomputing.com", where account_name is the name of your Snowflake account and region_id is the ID of the Snowflake region (for example, "us-east-1").
Port | The port number for the server.
User name | The user name to use to connect to the host server.
Password | The user’s password.
Database | If the correct connection information is provided, this contains a list of the databases on the server. Select the database.

Warning

If you are creating a connection to a Hive, Netezza, or Redshift database, the user under which a connection will be made to the database must have read/write permissions for the objects in the database.

In the Advanced properties area, specify the appropriate advanced connection properties.
Advanced Kerberos Hive connection properties
If you are creating a connection to a Hive database that uses Kerberos authentication, you must add the following advanced properties.
Key | Value
---|---
amp.kerberos.kdc | Fully qualified domain name (FQDN) of the Kerberos Key Distribution Center (KDC) server that is used by the Hive server.
principal | The Kerberos principal (a unique identity to which Kerberos can assign tickets to access Kerberos-secured Hadoop services) under which Enterprise will authenticate with KDC before connecting to the database.

Warning
For Enterprise to be able to connect to a Hive Kerberos-secured database, you must also provide values for the Kerberos Hive properties in the Spark cluster configuration.
Advanced Redshift connection properties
If you are creating a connection to a Redshift database, you must add the following advanced properties.
Key | Value
---|---
fs.s3a.access.key | The access key to the S3 bucket you want to use.
fs.s3a.secret.key | The secret to the S3 bucket (automatically encrypted by Enterprise).
tempdir | The full path of the directory on S3 for storing intermediate data.
forward_spark_s3_credentials | Leave the value of true for this property.
driver.defaultDatabase | A Redshift database name. This can be any database. Provide a value for this property if the connection template does not contain a specific database name.

Enterprise reads and writes data to Redshift by using Amazon S3 for temporary storage. Both the Spark cluster and the Redshift server must have access to the specified S3 bucket.

Note
Aginity Enterprise automatically adds the fs.s3a.access.key, fs.s3a.secret.key, and tempdir properties without values to the UI and sets the forward_spark_s3_credentials property to true.
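For illustration, a hypothetical set of values might look like the following (the bucket name and temporary directory are placeholders, and the tempdir URI scheme may differ depending on your cluster):

fs.s3a.access.key = <your_AWS_access_key_ID>
fs.s3a.secret.key = <your_AWS_secret_access_key>
tempdir = s3a://your-bucket/amp-redshift-temp/
forward_spark_s3_credentials = true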
Advanced Snowflake connection properties
If you are creating a connection to a Snowflake database, you must add the following advanced properties.
Key | Value
---|---
account_name | The Snowflake account name. This is the same value that you included in the host string.
region_id | The Snowflake region ID. This is the same value that you included in the host string.
warehouse | The name of an existing Snowflake warehouse. Required for browsing and defining Snowflake sources using the JDBC driver.
sfwarehouse | The name of an existing Snowflake warehouse. Required for publications to Snowflake.
sfdatabase | The name of an existing Snowflake database. Required for publications to Snowflake.

To test the connection to the database, click Test in the upper-right corner.
The message “Connection test of database_name was successful” appears. If “Connection test failed” appears, check that the correct connection information is provided and make needed changes.
To save the connection, click Save.
Users can now use selected tables from the database as sources.
You need to perform certain steps to enable connections to a Hive SSL-secured database. You also need to perform additional steps to set up Redshift database connections.
Creating a File-System Connection¶
To enable users to use specific files as sources, you need to create connections to the file systems that contain the files. When creating a source from a file that resides in a specific file system, a user needs to select a file-system connection. Enterprise will use the parameters in the connection to connect to the file system and read the data from the file.
Note the specific file formats and file systems that Amp supports.
Supported File Systems and File Types¶
- Netezza
- IBM Cloud Object Storage (COS) - Comma-separated values (CSV) and Apache Parquet file formats.
- Amazon Simple Storage Service (S3)
- Hadoop Distributed File System (HDFS) - Comma-separated values (CSV) and Apache Parquet file formats.
- Oracle 12
Create a File-System Connection¶
This topic provides instructions on how to create a file-system connection.
To create a file-system connection:
In the navigation pane, click Settings and click Connections.
At the top of the connections pane, click New.
In the New Connection dialog box, do the following:
- Type a friendly name for the connection.
- In the Type list, select File System.
- In the File system list, select a file system (for example, HDFS or S3).
- Click Create.
In the main content area, provide the required connection properties.
The Connection Property column specifies the file system for which each property is required.
Connection Property | Description
---|---
User name (HDFS) | The user name to use to connect to the HDFS.
Host (HDFS) | The name or IP address of the HDFS NameNode.
Access key (S3) | The access key ID of your Amazon Web Services (AWS) account.
Secret key (S3) | The secret access key of your AWS account.
Region (S3) | The region in which the S3 bucket resides. The selected region will be used to access the bucket.
Bucket (S3 and COS) | The bucket to connect to on S3 or in the Object Storage instance.
Endpoint (COS) | The endpoint to use for accessing the specified instance of Object Storage.
iam serviceid crn (COS) | The unique identifier of the Service ID associated with the service credential that is used to connect to the specified instance of Object Storage.
api key (COS) | The API key created for the Service ID associated with the service credential that is used to connect to the specified Object Storage instance.
Resource instance id (COS) | The unique identifier of the instance of Object Storage that the service credential will access.

Note
The Advanced properties area is designed for entry of advanced connection properties and is reserved for future use.
To test the connection to the file system, click Test in the upper-right corner.
The message “Connection test of file_system_name was successful” appears. If “Connection test failed” appears, check that the correct connection information is provided and make needed changes.
To save the connection, click Save.
Users can now create sources from files stored in the file system.
Creating an Enterprise Catalog Connection¶
To synchronize your Amp Cache Catalog with Amp, you must provide the connection information for the catalog by creating a catalog connection in Enterprise. This topic provides instructions on how to create an Enterprise catalog connection.
To create an Enterprise catalog connection:
In the navigation pane, click Settings and then click Connections.
At the top of the pane displaying a list of existing connections, click New.
In the New Connection dialog box, type a friendly name for the connection and select External Catalog in the Type list.
In the main content area, provide the catalog connection information.
Connection Property | Description
---|---
Host | The name or IP address of the host server.
Port | The port number for the server.
SSL mode | Select a Secure Socket Layer (SSL) connection mode for the database that hosts the catalog:
- Disable - Select this mode if the server accepts only unsecured connections. If the server requires a secured connection, the connection attempt will be rejected.
- Allow - The connection will be unsecured unless the server requires a secured connection.
- Prefer - The connection will be secured unless the server is configured to only use unsecured connections.
- Require - The connection to the server will be secured. The connection will fail if the server only accepts unsecured connections.
Connection timeout | The amount of time in seconds after which the connection times out.
User name | The user name to use to connect to the host server.
Attention
If you are creating a connection to an Enterprise Cache Catalog that is hosted by an IBM Netezza database, make sure that the specified user has the following permissions in Netezza:
- The CREATE EXTERNAL TABLE permission for the database
- The CREATE TABLE permission for the database
- Read permissions on the _v_table_storage_stat, _v_table, and _v_dslice system views
These permissions are required to enable users to publish data from the sources created in Enterprise during synchronization with the Enterprise Cache Catalog.
Password | The user’s password.
Database | If a connection is successfully established to the server, this contains a list of the databases on the server. Select the database that hosts the catalog.

The Base connection area contains the read-only connection information for the Enterprise Cache (the database that stores dimensional and fact data). This area is automatically populated when you complete a synchronization with the catalog.
The Advanced properties area enables entry of advanced connection properties.
If the catalog resides on a Kerberos-secured Spark cluster that uses the Cloudera Distributed Hadoop (CDH), you must provide the following properties in the Advanced properties area. These properties, along with the Base connection properties, are required to enable users to preview sources that will be automatically created as a result of synchronization with the catalog.
Key | Value
---|---
base.amp.kerberos.kdc | The fully qualified domain name (FQDN) of the Kerberos Key Distribution Center (KDC) server.
base.principal | The Kerberos principal (a unique identity to which Kerberos can assign tickets to access Kerberos-secured Hadoop services) under which Enterprise will authenticate with KDC to submit requests to the cluster.

The Synchronization options area provides options for scheduling a synchronization with the catalog.
To save the connection information, click Save.
You are now ready to synchronize with the Amp Cache Catalog.
Synchronizing with the Enterprise Cache Catalog¶
To enable users to use dimensional and fact data from an Enterprise Cache as sources, you can synchronize your Enterprise Cache Catalog with Enterprise.
Synchronization reads the metadata (the dimensions, fact tables, and external analytics) from the Enterprise Cache and creates a source for each of these metadata entities in your catalog. After the synchronization, a workspace having the same name as the catalog connection name you entered is created in Enterprise, and the dimension and fact-table sources are grouped into projects by subject area within the workspace. The synced external analytics are added to a separate project called “External Analytics.” The synchronization also recreates the relationships (references) that exist between tables. Users cannot make any changes in the new sources (for example, add or delete columns). All necessary changes need to be made in the Enterprise Cache Catalog, and you need to re-synchronize with the catalog to bring over the changes.
Since the underlying tables in the Enterprise Cache Catalog can easily change (for example, tables or columns can be added or deleted), it’s a good idea to keep the sources updated by regularly re-synchronizing with the catalog. You can re-synchronize manually or schedule a synchronization to run automatically at a specified interval.
Before you can begin a synchronization, make sure that you created a connection for the catalog.
To synchronize with the Enterprise Cache Catalog for the first time:
In the navigation pane, click Settings and then click Connections.
In the connections pane, select the connection for the catalog.
In the main content area, in the upper-right corner, click Synchronize.
Note
The Synchronize option is unavailable if some of the required connection information is missing. Make sure that all the connection properties are provided. For information about the connection properties, see Creating an Enterprise Catalog Connection.
“Synchronize” in the upper-right corner changes to “Synchronizing” to indicate that synchronization is in progress.
Note
If an error occurred during synchronization (for example, a connection to the catalog database was lost, or some of the required connection information was inadvertently deleted), “Resync” appears in red. Correct the error or wait for the connection to be restored and then re-synchronize by clicking Resync.
When the synchronization is complete, “Synchronize” appears in green.
The synchronization also reads the connection information for the Enterprise Cache from the catalog and automatically populates the Base connection area, with the exception of the User name and Password boxes.
Type a user name and password in these boxes to connect to the Enterprise Cache.
These credentials are required to enable connection to the dimensions and fact tables in the cache whenever users work with the corresponding sources in Enterprise.
To view the created dimension and fact-table sources, do the following:
Under Synchronization options, next to the workspace name, click the ellipsis button to open a catalog browser window.
In the catalog browser window, in the first list, select the workspace created for the catalog.
Tip
The workspace has the same name as the catalog connection.
In the second list, select a project.
The third list displays the names of the sources grouped under that project.
Repeat the previous step to view the sources in the other projects.
Note
The Advanced options area is designed for entry of advanced connection properties and is reserved for future use.
Users can access the sources by selecting the new workspace and then the appropriate project.
To select a synchronization option:
Under Synchronization options, select one of the following options.
Option | Description
---|---
Manual | Default option. Each time you need to run a synchronization, you will need to do it manually by clicking Synchronize.
Daily at | A synchronization will run automatically daily at a specified time.
Hourly | A synchronization will run automatically every hour.
Every … minutes | A synchronization will run automatically each time a specified number of minutes passes. The default is every 10 minutes.

To save your selection, click Save.
The message “Connection was successfully saved” indicates that the selection was saved.
To re-synchronize with the catalog:
If the Manual synchronization option is selected, click Synchronize in the upper-right corner.
If the synchronization was successful, “Synchronize” appears in green.
If the Daily at, Hourly, or Every … minutes option is selected, you don’t need to initiate a synchronization manually; the next synchronization will run automatically according to the option’s schedule.
Attention
You need to specify the connections that each user has access to. When creating sources or publications, a user will be able to select only from the connections that are accessible to him or her. Currently, user access to connections can only be configured through the Amp API. For more information, refer to the Connection section in the Amp API documentation.
Users¶
User Roles and Permissions¶
Aginity Enterprise users require certain permissions to successfully create, edit, and delete assets in the catalog as well as run jobs, based on user roles. A user role is associated with a specific set of permissions and defines what users assigned that role are allowed to do in Enterprise.
You can assign new users one of the two default user roles: Admin Role or User Role. The following table shows the system-level permissions granted to each of the default user roles. “Yes” means that the role can perform an operation; “No” means that an operation is unavailable for the role.
Warning
To be able to create, edit, and delete assets in a specific workspace, users with the Create assets, Update assets, and Delete assets permissions must also have editor permissions for that workspace. For information about granting users editor permissions for a workspace, see Workspaces.
Category | Permission | Description | Admin Role | User Role |
---|---|---|---|---|
All | All permissions | Can perform all operations in Enterprise. | Yes | No |
User Management | Create users | Can create and register users. | Yes | No |
Update users | Can update user accounts. | Yes | No | |
View users | Can view user accounts. | Yes | No | |
Assign user to role | Can assign roles to users. | Yes | No | |
Unassign user from role | Can remove roles from users. | Yes | No | |
View list of permissions | Can view a list of permissions. | Yes | No | |
Create user directory | Can create user directories. At least one user directory must be created in Aginity Enterprise to enable LDAP authentication of users. | Yes | No | |
View user directory | Can view user directories. | Yes | No | |
Update user directory | Can update user directories. | Yes | No | |
Delete user directory | Can delete user directories. | Yes | No | |
Catalog | Create assets | Can create catalog assets. | Yes | Yes |
Delete assets | Can delete catalog assets. | Yes | Yes | |
Update assets | Can update catalog assets. | Yes | Yes | |
View assets | Can view catalog assets. | Yes | Yes | |
Create projects | Can create projects in a catalog. | Yes | Yes | |
Update projects | Can update projects in a catalog. | Yes | Yes | |
Delete projects | Can delete projects in a catalog. | Yes | Yes | |
Create workspaces | Can create workspaces in a catalog. | Yes | No | |
Update workspaces | Can update workspaces in a catalog. | Yes | No | |
Delete workspaces | Can delete workspaces in a catalog. | Yes | No | |
Create connections | Can create database, file-system and catalog connections. | Yes | No | |
Update connections | Can update database, file-system and catalog connections. | Yes | No | |
Delete connections | Can delete database, file-system and catalog connections. | Yes | No | |
View connections | Can view a list of database, file-system and catalog connections. | Yes | Yes | |
View connection credentials | Can view the credentials (user name and password) for a selected connection. | Yes | No | |
Create Spark cluster configurations | Can add Spark cluster configurations to Aginity Enterprise. | Yes | No | |
Update Spark cluster configurations | Can update Spark cluster configurations. | Yes | No | |
Delete Spark cluster configurations | Can delete Spark cluster configurations from Aginity Enterprise. | Yes | No | |
View Spark cluster configurations | Can view a list of Spark cluster configurations. | Yes | Yes | |
Job Execution | Execute job | Can execute jobs (such as publication jobs). | Yes | Yes |
Cancel own job | A user can cancel jobs he or she initiated, but not other users’ jobs. | Yes | Yes | |
View own job details | A user can view details of jobs he or she initiated, but not other users’ jobs. | Yes | Yes |
As part of administering Amp, you need to create user accounts and specify a role for each user. A role determines what a user can do in Amp. Currently, only the following user roles are available:
- Administrator - Can perform all operations, access all areas of Amp, and change the application settings.
- User - Has a limited set of permissions and cannot see and change the application settings (clusters, connections, etc.).
Each of these roles is granted a specific set of permissions. User Roles and Permissions shows what permissions are available for each role.
To add users:
In the navigation pane, click Settings and then click Users.
On the Users tab, click Add new user.
In the Add new users dialog box, do one of the following:
To add a single user, type the user’s e-mail address and select a role for the user.
To add multiple users, click Add multiple users, type a list of the users’ e-mail addresses separated by a comma, and select a role for the users.
Note
The selected role will be assigned to all the users.
Click Add.
On the Users screen, under Default Spark Cluster, select the default Spark cluster for each user.
Amp uses the user’s default cluster to execute queries, notebooks, and publications and generate previews (for example, for sources) for that user. The default cluster specified for your instance of Amp is automatically selected as the default cluster for new users.
Note
- A user’s default cluster takes precedence over the default cluster set for the instance of Amp.
- A user can change his or her default cluster in the account preferences.
The user accounts are added to Amp.
Instruct the users to sign in with the default password and then change the default password in their Amp user accounts by performing the following steps.
Note
The default password is defaultPass123.
To change the default password:
In the upper-right corner of the Amp window, on the right of the user name, click the arrow.
Click My account.
On the My Account page, click the Password tab.
In the Current Password box, enter the default password.
In the boxes below, enter and confirm a new password.
Click Change password.
The users can now sign into Amp with the new passwords.
Note
An administrator can also change a user’s password by using the Amp API. For more information, refer to “Update user” in the “User” section of the API Documentation.
You can change users’ information (for example, the user role or default cluster) on the Users screen.
Workspaces¶
You can perform the following operations to manage workspaces in Amp:
- Create, edit, and delete workspaces.
- Grant specific users editor permissions for workspaces. (Editor permissions allow users to create, edit, and delete assets in a workspace.)
Creating Workspaces¶
As an administrator, you can create any number of personal and shared workspaces in Amp to meet the needs of your organization. A personal workspace is one where a specific single user is permitted to create and execute assets. A shared workspace is one that is accessible by several users who collaborate in creating, executing, and sharing assets.
Note
A workspace is created automatically to hold sources created from dimensional and fact data as a result of synchronization with an Amp Cache Catalog.
You can delete a workspace that is no longer in use.
To create a workspace:
- In the navigation pane, click Settings and then click Workspaces.
- To the right of Workspaces, click New.
- In the New Workspace dialog box, type a name for the workspace to help distinguish it from other workspaces in Amp (for example, “Marketing Dept. Analytics” or “John Smith’s Personal Workspace”).
- In the Description box, provide an optional description or details about the workspace.
- Click Create.
- To designate the workspace as a personal or shared workspace, grant the appropriate user or users editor permissions for it.
- In the upper right of the Workspace screen, click Save.
To delete a workspace:
- In the upper right of the Workspace screen, click More and then Delete.
Granting Users Editor Permissions for a Workspace¶
By default, users can only view, reference, and execute assets (such as publications and notebooks) in all workspaces. You can choose to grant specific users editor permissions. If a user has editor permissions for a workspace, he or she can create, edit, and delete all assets in that workspace.
You can choose to revoke the editor permissions for a workspace from a user.
To grant users editor permissions for a workspace:
In the navigation pane, click Settings and then click Workspaces.
Under Workspaces, select the workspace.
Click the Users tab.
Click More and then Add user(s).
In the Add Users dialog box, do one of the following:
- To designate the workspace as a specific user’s personal workspace, select the check box next to the user name.
- To designate this as a shared workspace, select the check box next to the name of each user that you want to collaborate in this workspace.
Note
- You can use the search box at the top to search for a specific user. Start typing the user’s name to display all matching users.
- To select all displayed users, click Select all. To cancel the selection of all selected users, click Select none.
- The User Permissions box displays a read-only value of Editor.
Click Done.
The selected users are added to the Users tab.
To revoke editor permissions for a workspace from a user:
- In the navigation pane, click Settings and then click Workspaces.
- Under Workspaces, select the workspace.
- Click the Users tab.
- Next to the user’s name, click the remove icon.
Enabling External Authentication¶
If your organization uses Lightweight Directory Access Protocol (LDAP) to store user identities, you can configure LDAP authentication of users in Amp. When a user attempts to sign into Amp, the application will first query your LDAP server with the user’s credentials. If the user authenticates successfully with the LDAP server, he or she is signed in. If the user fails to authenticate against your LDAP server, Amp falls back to trying to authenticate the user as a non-LDAP (Amp) account.
Attention
For LDAP authentication of users to work, make sure that you first add the user accounts to Amp.
To configure LDAP authentication, you first need to register an LDAP user directory in Amp by using the Amp API. For information about registering an LDAP user directory, refer to the “User Directory” section in the Amp API documentation.
Although you can register more than one LDAP user directory in Amp, users will be authenticated only against the first one. You can change the order of user directories by using an API endpoint. For information about changing the order of user directories, refer to the “Changing the order of user directories” subsection in the API documentation.
You can select from the following methods of authentication with the LDAP server:
- NO_AUTH - Allows anonymous access to the server.
- SIMPLE_AUTH - A user name and a password are required for authentication.
You can select from the following methods for encryption of connections to the LDAP server:
- PLAIN - Amp uses unencrypted connections.
- START_TLS - Amp uses StartTLS connections.
- SIMPLE_TLS - Amp uses LDAPS (LDAP over SSL) connections.
Working with Enterprise Assets¶
You can perform the following tasks to manage assets in your catalog in Amp:
Searching for Assets¶
You can search for assets (projects, sources, notebooks, views, and publications) in your catalog by name, description, or by both name and description.
Some things to keep in mind before you start a search:
Search returns matching assets from a project or workspace if you have Viewer or Editor permissions for that project or workspace.
Tip
If you don’t know what permissions you have for a specific workspace or project, ask your administrator.
Your user role must have the permission to view a list of permissions for assets.
Tip
Ask your administrator if your user role has this permission.
Currently, searching by workspace name or description or by project name or description is not available.
To search for an exact match, enclose your search text in quotation marks (for example, “sum of sales”).
Partial matches are supported. For example, if you search for “New York,” an asset with a name of “New Yorker” will be returned.
The minimum length of a search term is 3 characters, and the maximum length is 122 characters.
Special characters (such as _, ?, {}, (), and []) are ignored within search text.
Currently, Boolean search and wildcards are not supported.
To run a search:
In the upper left of the screen, in Search by, select a search scope.
All means to search by both name and description.
In the search box, enter the word or phrase to find and click Search.
All matching assets will be displayed. The name, the path (workspace/project), and the asset type are shown for each asset.
Note
- The search results are sorted by relevance (exact matches appearing higher in the list than partial matches). You can also sort the search results by name, creation date, or last modification date. Select the appropriate sort option in the Sort by list in the upper-right corner.
- If search returns more than one page of results, you can specify the maximum number of results to display per page by selecting the number in the Results per page list in the upper-right corner.
- You can delete an asset that is listed in the search results from your catalog. Click the ellipsis (…) at the end of the asset row and select Delete.
Click an asset’s name or path in the search results list to go to the asset overview screen for that asset.
The asset overview screen displays an overview of the asset (asset name, description, creation date, and last modification date) and information about the columns that the asset contains. If you have Editor permissions for the asset, the Edit button appears on the asset overview screen, and you can open the asset in edit mode.
Browsing and Querying Enterprise Assets in External Tools¶
You can browse and query assets in your Enterprise catalog from the tool or application of your choice. The Enterprise Query Engine service, or AQE, provides you with access to an Enterprise catalog from any external tool or application that can make a JDBC or ODBC connection to a database, such as database administration tools, SQL clients, analytic tools, data visualization tools, and notebooks (for example, Jupyter notebooks).
AQE exposes the Enterprise catalog to your tool or application as a database. Each project in a workspace within the catalog is shown as a schema in the database. Assets contained in each project are shown as tables, views, and functions in the corresponding schema.
Using AQE, you can run queries on the following kinds of asset:
- All supported sources (including dimensions and facts from an external catalog and HDFS Parquet files)
- Parameterized and nonparameterized views
- Parameterized and nonparameterized notebooks
- External analytics
Note
Only registered Enterprise users can browse and query an Enterprise catalog from an external tool or application.
How Queries Are Processed with AQE¶
Here’s a brief overview of how queries are processed when you submit them to AQE:
- You write and submit an SQL query in your application.
- The query is sent to AQE over a JDBC or ODBC connection.
- AQE validates, parses, and optimizes the query.
- AQE submits the query to your Spark cluster for execution.
- Spark executes the code in the query, retrieving the appropriate data from the source database or file system, and transfers the results to AQE.
- AQE creates a result set for your query and transfers it to your tool or application.
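To make this flow concrete from the client side, here is a minimal sketch, assuming a Python tool that uses the jaydebeapi package and the standalone Aginity JDBC driver (both are described later in this section); the host, credentials, driver path, and asset name are placeholders:
# Minimal client-side sketch: submit one query to AQE over JDBC and fetch the result set.
# Host, credentials, driver path, and asset name are placeholders.
import jaydebeapi

server_with_path = "my.amp.instance.com/query-engine-jdbc"
conn = jaydebeapi.connect(
    "org.apache.calcite.avatica.remote.AmpDriver",        # driver classname
    "jdbc:amp:" + server_with_path,                        # JDBC URL
    {"user_name": "admin", "password": "admin",
     "Amp_catalog_URL": "https://" + server_with_path},    # driver properties
    "path_to_driver/amp-jdbc-standalone-version.jar",      # standalone JDBC driver
)
cur = conn.cursor()
cur.execute('SELECT * FROM "My Workspace/My Project"."My source"')  # query is sent to AQE
rows = cur.fetchall()   # AQE returns the result set after Spark finishes executing
cur.close()
conn.close()
A complete, runnable version of this pattern for a Jupyter notebook is shown in Configure a JDBC Connection in a Notebook later in this topic.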
Before You Begin¶
Before you begin running queries against the Enterprise catalog through AQE, consider the following important points:
We recommend that each user in your organization who will be running queries externally through AQE has his or her own Enterprise user account to get the best query performance from AQE. Since AQE creates a Spark application for each Enterprise user who submits a query through AQE, using the same user account for multiple users will cause queries from these users to be executed by the same application on the Spark cluster. As a result, the first user’s query will be executed first whereas the other users’ queries will be queued up for processing. If the first user submits a long-running query, AQE may seem unresponsive to the other users while in fact their queries are waiting in the queue.
Make sure that the physical configuration of your Spark cluster has enough resources to support Enterprise users who submit queries through AQE.
Follow these general steps to determine the number of Enterprise users who can submit queries through AQE and have them executed on your Spark cluster:
Find out the system resources (memory and CPU) that are allocated to your cluster in YARN.
Tip
The memory and the number of CPU cores available for each node are specified in the yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores settings in YARN. Multiply the value of each setting by the number of nodes in your cluster to get the total memory and the total number of cores:
Total memory = nodes * YARN memory per node
Total CPU cores = nodes * YARN cores per node
Determine the percentage of total resources that is available for application masters on your cluster by looking at the value of the yarn.scheduler.capacity.maximum-am-resource-percent YARN parameter.
Tip
You can find this parameter in etc/hadoop/conf/capacity-scheduler.xml.
For example, if this parameter is set to 0.5, 50% of the resources of the cluster (memory and cores) are available for application masters. For a cluster that has a total of 60 GB and 12 cores, this means that 30 GB and 6 cores are available for application masters.
Determine the number of users in your organization who need to run queries through AQE.
In the Spark cluster configuration for each user, provide appropriate values for the following advanced properties:
spark.driver.memory - The amount of memory to use for the driver process. The default is 1g (1 GB).
Note
The memory that is allocated to the YARN container for each user’s application master is calculated as follows:
spark.driver.memory + max(spark.driver.memory * 0.10, 384 MB)
Where 0.10 (10%) is the overhead memory reserved for the YARN container, and a minimum of 384 MB is used as the overhead. For example, if spark.driver.memory is 1g, the YARN container for each user's application master will be 1408 MB (1024 MB + 384 MB).
spark.driver.cores - The number of cores to use for the driver process. The default is 1.
Calculate the number of users who will have queries executed on the Spark cluster as follows:
No. of application masters = total memory available for application masters / size of YARN container for each user's application
Or:
No. of application masters = total CPU cores available for application masters / no. of cores for driver
Use the total memory or the total number of cores available for application masters in the formula above, whichever value is the limiting factor. In our example of a total of 30 GB and 6 cores available for application masters, we are limited by the number of cores, so we will divide the total cores available (6) by the value of spark.driver.cores (1):
6 / 1 = 6
Thus, in our example, six users can have their queries executed on the Spark cluster.
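To make the arithmetic concrete, the following sketch reproduces the worked example above in Python (60 GB and 12 cores in total, 50% available for application masters, and the default spark.driver.memory of 1g and spark.driver.cores of 1):
# Capacity sketch using the example numbers from this section.
total_memory_mb = 60 * 1024          # total YARN memory across the cluster (60 GB)
total_cores = 12                     # total YARN vcores across the cluster
am_fraction = 0.5                    # yarn.scheduler.capacity.maximum-am-resource-percent

am_memory_mb = total_memory_mb * am_fraction   # 30 GB available for application masters
am_cores = total_cores * am_fraction           # 6 cores available for application masters

driver_memory_mb = 1024                        # spark.driver.memory = 1g
overhead_mb = max(driver_memory_mb * 0.10, 384)
container_mb = driver_memory_mb + overhead_mb  # 1408 MB per application master
driver_cores = 1                               # spark.driver.cores

by_memory = am_memory_mb // container_mb       # application masters that fit by memory
by_cores = am_cores // driver_cores            # application masters that fit by cores

print(int(min(by_memory, by_cores)))           # 6 -> cores are the limiting factor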
Set Up AQE¶
Before you can begin browsing and querying assets in your catalog through AQE, complete the following steps:
Depending on the kind of connection that you want to use, download the Aginity JDBC driver or the Aginity ODBC driver for Windows from the Amp 3.0 Downloads page.
JDBC driver
The amp-jdbc-standalone-version.jar file is the standalone JDBC driver. This file contains all the third-party libraries that the driver relies on. We recommend using the standalone driver when you use third-party database tools and applications (such as DBeaver or Jupyter Notebook) to query the catalog.
The amp-jdbc-version.jar file is the lightweight JDBC driver. We recommend using the lightweight driver if you are developing a Java application to query the catalog.
ODBC driver
You must use the 32-bit version of the driver (amp-odbc-version-x86.exe) if you have 32-bit Windows or want to connect to Enterprise from a 32-bit application (on 32-bit or 64-bit Windows).
You must use the 64-bit version of the driver (amp-odbc-version-x64.exe) if you want to connect to Enterprise from a 64-bit application.
Attention
Currently, you cannot install both versions of the driver on the same computer.
Configure a JDBC connection or an ODBC connection to the catalog.
Download the Amp dependencies library for your platform (spark-EMR-EMR_version-deps-version.jar or spark-HDP-HDP_version-deps-version.jar) and provision your Spark cluster with this library.
Set the required cluster configuration properties.
You need to provide values for certain properties in your Spark cluster configuration to enable browsing and querying assets through AQE. For more information, see Configuring a Spark Cluster for AQE.
Query Syntax¶
JDBC requires that queries be written using a specific syntax. Note the following when writing SELECT queries:
- Separate the name of the workspace and the name of the project that contain the asset to query with a slash (/) and enclose them in quotation marks. See the examples below.
- Enclose column names in double quotation marks.
- Separate the asset name from the workspace and project names with a period. If the asset name is lowercase, contains spaces, or contains special characters (such as ~, #, %, &, *, or {}), enclose it in quotation marks.
- Enclose string values in single quotation marks. See the third example below.
- If the asset is a parameterized view, notebook, or external analytic, use the TABLE operator, enclose each parameter value in single quotation marks, and separate the values with a comma. See the second example below.
Example of a SELECT query on a nonparameterized asset (source)
SELECT * FROM "My Workspace/My Project"."My source"
Example of a SELECT query on a parameterized asset (notebook)
SELECT * FROM table("My Workspace/My Project"."My Parameterized Notebook"('parameter_value','parameter_value'))
Example of a SELECT query with a WHERE clause using a string value
SELECT "Column A", "Column B" FROM "My Workspace/My Project”.“My view”
WHERE "Column C" LIKE '%string value%'
Limitations¶
You need to be aware of the limitations of and known issues with AQE. For a description of these limitations and issues, see the section “Limitations of AQE” in the Amp 3.0.17 release notes.
Troubleshooting Failed Queries¶
If your query against the catalog fails for some reason, JDBC returns an error code to your application or tool. The following general steps can help determine the cause of the failure:
Review the log of the AQE service. To open the log, do the following:
- In Enterprise, in the navigation pane on the left, click Settings > Service Logs.
- In the Service Logs pane, click Query Engine Service to display the log on the right side.
Note
Only administrators (users with the Admin user role) can access the service logs. If you are not an administrator, ask your administrator to review the log of the Query Engine service for you.
If you cannot determine the cause of the failure with the AQE service log, try reviewing the YARN log for the application that was used on the Spark cluster to execute your query. You can access the application log through Resource Manager.
Note
If you don’t have access to Resource Manager, ask your Spark administrator to review the YARN application log for you.
Configuring JDBC Driver to Enterprise Catalog¶
To browse and query Enterprise assets in your application or tool through AQE, you need to configure a JDBC connection to your Enterprise catalog using the Enterprise JDBC driver.
Note
You can download the Enterprise JDBC driver (amp-jdbc-standalone-1.0.1.jar or query-engine-jdbc-driver-1.0.0.jar) from the Aginity Enterprise™ 3.0 Downloads page.
Depending on the application or tool in use, you can configure a JDBC connection in one of the following ways:
- Using the JDBC driver interface in your tool or application
- Using a JDBC driver connection string
- In a notebook (for example, a Jupyter notebook)
Use a JDBC Driver Interface¶
If your application or tool provides a JDBC driver interface, you can use it to configure a JDBC connection to the Enterprise catalog with the Enterprise JDBC driver. The following instructions use the DBeaver database administration tool and SQL Workbench/J as examples.
Configuring a JDBC Connection in DBeaver¶
- Go to File and then select Connection Manager > New.
- Provide the following information:
- Name - Amp Query Engine.
- Driver Type - Generic.
- Class Name - org.apache.calcite.avatica.remote.AmpDriver.
- URL Template - jdbc:amp:.
- Default Port - 80 (for HTTP) or 443 (for HTTPS).
- Library - The path to the Enterprise JDBC driver (.JAR) file on your computer.
- Specify the connection settings:
- Go to Database and select New Connection.
- In the Connection settings section, provide the following information:
- User name - The user name of your Enterprise user account.
- Password - The password of your Enterprise user account.
- In the Driver Properties section, next to URL, enter Amp_instance_URL/query-engine-jdbc.
You are now ready to run queries from DBeaver over the configured JDBC connection.
Note
For the correct way to write queries, see Query Syntax.
Configuring a JDBC Connection in SQL Workbench/J¶
In SQL Workbench/J, click the Manage Drivers button.
In the Manage drivers window, in the drivers pane on the left, select Aginity AQE and enter the driver classname (org.apache.calcite.avatica.remote.AmpDriver).
Note
The automatically discovered classname may be incorrect.
Create a connection profile:
- Driver - Select Aginity AQE (org.apache.calcite.avatica.remote.AmpDriver).
- JDBC - Enter jdbc:amp:Amp_instance_URL/query-engine-jdbc.
- Username - The user name of your Enterprise user account.
- Password - The password of your Enterprise user account.
Click Extended Properties, add a “URL” property, and enter the “https” version of the connection as the value: https://Amp_instance_URL/query-engine-jdbc.
To test the connection, click Test.
On the Tools menu, select Show Database Explorer.
A catalog called “AMP” with a drop-down list showing workspace/project “databases” appears.
Note
You will get an error if you try to show a project that contains a notebook with no input schema and output schema. Either fix that notebook, delete it, or select a different project.
You are now ready to run queries from SQL Workbench/J over the configured JDBC connection.
Note
For the correct way to write queries, see Query Syntax.
Use a JDBC Driver Connection String¶
If your application or tool does not provide a JDBC driver interface, you can configure a JDBC connection to the Enterprise catalog using a driver connection string. The following instructions use QlikView as an example.
Note
Before you begin configuring a JDBC connection in QlikView, you need to install a JDBC connector, such as TIQ Java Service Connector.
To configure a JDBC connection using a JDBC driver connection string:
In the QlikView JDBC Connector Config dialog box, on the JDBC Driver tab, click Add Library and browse to the Enterprise JDBC driver (.JAR) file on your computer.
On the Java VM Options tab, specify the path to the jvm.dll file for Java Runtime Environment (JRE).
On the Advanced Settings tab, in the JDBC Driver Classname box, provide the Enterprise JDBC driver classname: org.apache.calcite.avatica.remote.AmpDriver
To configure the connection settings, do the following:
Create a new document.
On the File menu, select Edit Script.
On the Data tab, click Configuration.
Enter the following connection string for the Enterprise JDBC driver (a filled-in example appears after this procedure):
jdbc:amp:url={Amp_instance_URL}/query-engine-jdbc;user={Amp_user_name};password={Amp_user_password}
Click Test and then OK.
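For reference, here is what the connection string might look like with the placeholders filled in. The host name, user name, and password below are hypothetical; substitute the values for your Enterprise instance and user account:
jdbc:amp:url=https://my.amp.instance.com/query-engine-jdbc;user=jdoe;password=my_password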
Configure a JDBC Connection in a Notebook¶
You can configure a JDBC connection to the Enterprise catalog in a notebook. The following instructions use the Jupyter notebook with Python code as an example.
Note
Before you begin configuring a JDBC connection in a Jupyter notebook, do the following:
Install the requests package via terminal:
pip install requests
Install the JayDeBeApi package:
pip install jaydebeapi
Download the latest standalone Enterprise JDBC driver (amp-jdbc-standalone-version.jar) from the Enterprise 3.0 Downloads page.
Add the following code at the beginning of the first paragraph of the notebook.
import jaydebeapi

amp_user_name = 'admin'
password = 'admin'
server_with_path = "my.server.local/query-engine-jdbc"
jdbcUrl = "jdbc:amp:" + server_with_path
URL = "https://" + server_with_path
driverPath = "path_AMP_JDBC_driver/amp-jdbc-standalone-version.jar"
driverName = "org.apache.calcite.avatica.remote.AmpDriver"
jdbcProps = {'user_name': amp_user_name, 'password': password, 'Amp_catalog_URL': URL}

conn = jaydebeapi.connect(driverName, jdbcUrl, jdbcProps, driverPath)
cur = conn.cursor()

# Now that we have a cursor, we can run a query.
cur.execute('select count(*) from "My Workspace/My Project".dh_textfile')
print(cur.fetchall())  # should print [(6257,)]

# Print the DB-API description (https://www.python.org/dev/peps/pep-0249)
# and the data for the first row.
cur.execute('select * from "My Workspace/My Project".dh_textfile')
print(cur.description)
print("\n\n ----------------------------- \n\n")
a = cur.fetchall()
print(a[0])
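When you have finished querying, it is good practice to release the connection resources; for example:
# Release the cursor and the connection when you are done.
cur.close()
conn.close()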
Configuring ODBC Driver to Enterprise Catalog¶
In addition to browsing and querying your Amp catalog through AQE over a JDBC connection, you can connect to the catalog over ODBC. To make an ODBC connection to your Enterprise catalog, you must use the Aginity Enterprise ODBC driver for Windows.
This topic explains how to configure an ODBC connection to Enterprise using the Aginity Enterprise ODBC driver. The high-level steps are covered in the sections below.
Prerequisites¶
Java Runtime Environment (JRE) for Java Platform version 8 or later is installed on the computer from which you want to query the Enterprise catalog through AQE.
The 32-bit version of the Aginity Enterprise ODBC driver requires JRE for the 32-bit (x86) version of Java Platform. The 64-bit version of the driver requires JRE for the 64-bit version of Java Platform.
Make sure that you complete the steps required to set up AQE in your environment.
Install the Aginity Enterprise ODBC Driver¶
Download the 32-bit (x86) version (amp-odbc-version-x86.exe) or the 64-bit version (amp-odbc-version-x64.exe) of the Aginity Enterprise ODBC driver for Windows from the Aginity Amp 3.0 Downloads page.
You must use the 32-bit version of the driver if you have 32-bit Windows or want to connect to Enterprise from a 32-bit application (on 32-bit or 64-bit Windows).
You must use the 64-bit version of the driver if you want to connect to Enterprise from a 64-bit application.
Attention
Currently, you cannot install both versions of the driver on the same computer.
Run the installation executable file from the location to which you downloaded it.
The Aginity Amp ODBC Driver dialog box appears, showing the temporary folder to which the driver installation files will be extracted. We recommend that you do not change the default path and name of the temporary folder. After installation this folder will be automatically deleted.
Click Unzip to extract the files to the temporary folder.
When the files are extracted, the Aginity Enterprise ODBC Driver Setup Wizard opens.
Note
You can now close the Aginity Enterprise ODBC Driver dialog box, which is still displayed behind the wizard window.
On the License Agreement page, read the license agreement and select I accept the terms of the license agreement.
On the Choose Destination Location page, leave the default folder for the driver or select a different one.
On the Service Configuration page, leave the default service name or enter a different service name.
Note
If you choose to use a service name other than the default one (“AmpODBCService”), make sure that the name does not contain any spaces. The spaces in the service name will cause the installation to fail.
On the next page, browse to the folder on your computer that contains the server Java Virtual Machine (jvm.dll), for example, C:\Program Files\Java\jre1.8.0_201\bin\server (the 64-bit version of the driver) or C:\Program Files (x86)\Java\jre1.8.0_201\bin\client (the 32-bit version of the driver).
Attention
We strongly recommend that you specify the location of the server jvm.dll when installing the 64-bit version of the driver.
Note
The error message “No Valid JVM location has been configured yet” means that you have selected an invalid location of jvm.dll. Click Yes and browse to the correct location.
On the Select Program Folder page, leave the default program folder name or enter a different name.
Review the selected installation settings and click Next.
When the installation is complete, the message “The InstallShield Wizard has successfully installed Aginity Enterprise ODBC Driver” appears.
Click Finish.
Create an ODBC Data Source¶
After you have installed the Aginity Amp ODBC driver, you need to create an ODBC data source based on the driver. A data source is used to store the information required to connect to a specific instance of Enterprise (a specific Enterprise catalog), in particular, the URL of the Enterprise instance to connect to. If you want to connect to multiple Enterprise instances from your application or tool, we recommend that you create an ODBC data source for each instance.
To create an ODBC data source for the 32-bit version of the driver, you must use ODBC Data Source Administrator (32-bit) in Windows. To create an ODBC data source for the 64-bit version of the driver, you must use ODBC Data Source Administrator (64-bit). For information on how to access ODBC Data Source Administrator, see the Microsoft Windows documentation.
To create an ODBC data source:
In ODBC Data Source Administrator, do one of the following:
- If you want to create a data source that will be available only to your Windows user, go to the User DSN tab.
- If you want to create a data source that will be visible to all users on this computer, go to the System DSN tab.
Click Add.
In the Create New Data Source dialog box, select Aginity Amp (x64) or Aginity Amp (x86) and click Finish.
In the Aginity Amp ODBC Driver Local ODBC Driver Setup dialog box, provide the following information:
Data Source Name - Enter a friendly name for the data source.
Description - Enter an optional description.
Service Name - Click the ellipsis button (…) and select the service name specified during installation (for example, AmpODBCService).
Service Data Source - Click the ellipsis button (…) and select Default.
Custom Properties - Enter the URL of the Enterprise instance that houses the Enterprise catalog you want to query. The URL must use the following format:
host=https://my.amp.instance.com
Note
When copying the Enterprise URL from your browser’s address bar, you might copy the entire URL, including the “frontend” portion at the end. Remove the “frontend” portion from the URL in the Custom Properties box.
To test the connection to the Enterprise instance, do the following:
Click Test Connect.
Provide the following information:
Data Source User Name - The name of your Enterprise user account under which you want to connect to Enterprise.
Data Source Password - The user’s password.
Note
Custom Properties is automatically filled in with the URL of the Enterprise instance that you entered for the data source.
If you get an error such as “Failed to initialize the Service component”, “OpenAccess SDK SQL Engine”, “UnknownHostException:<URL>”, or “Failed to execute HTTP Request, got HTTP/401” when testing the connection, see Troubleshoot ODBC Connection Issues below.
If the test was successful, the message “Connection test was successful” appears.
Click OK to save the new data source.
Make an ODBC Connection to Enterprise in Your Tool or Application¶
After you create the ODBC data source, in your tool or application, you need to establish a connection to the Enterprise instance specified in the data source by selecting the data source name and providing an Enterprise user name and the user’s password.
Note
We have successfully tested ODBC connections to Enterprise in Microsoft Excel and Tableau. ODBC connections to Enterprise that you create based on the Aginity Enterprise ODBC driver should work in other tools and applications that support ODBC.
As an example, here’s how to make an ODBC connection to Enterprise in Microsoft Excel.
To make an ODBC connection to Amp in Excel:
Click the Data tab.
Select Get Data > From Other Sources > From ODBC.
In the From ODBC dialog box, select the data source name.
In the ODBC driver dialog box, enter the name of the Enterprise user account under which you want to connect and the user’s password, and click Connect.
If you get the error “Failed to initialize the Service component”, see Troubleshoot ODBC Connection Issues below.
Once the connection is established, the Enterprise catalog is exposed as an “Enterprise” database in the Navigator window in Excel. Each workspace-project combination in the catalog is shown as a schema in the database. The assets contained in each project are shown as tables, views, and functions in the corresponding schema. You can now run queries on desired assets.
Upgrade the Aginity Enterprise ODBC Driver¶
To upgrade the Aginity Enterprise ODBC driver to a new version, you need to uninstall the old version and then install the new version by following the instructions in the section Install the Aginity Enterprise ODBC Driver.
Follow the standard Windows procedure for uninstalling the driver. During the uninstallation, you will get the message “The following ODBC DSN(s) were detected to be using the ODBC driver that is being uninstalled.” Select No to keep the data sources that were created from the previous version of the driver, so you can use them with the new version. If you select Yes, you will need to re-create the data sources.
Troubleshoot ODBC Connection Issues¶
The following table describes solutions for errors that you might encounter when making an ODBC connection to Enterprise.
Issue | Cause | Resolution
---|---|---
Error “Failed to initialize the Service component” | An invalid location of Java Virtual Machine (jvm.dll) was specified during installation of the Aginity Enterprise ODBC driver. | Reinstall the driver, specifying the correct location of jvm.dll. If you are installing the 64-bit version of the driver, make sure that you specify the location of the server jvm.dll (for example, C:\Program Files\Java\jre1.8.0_201\bin\server).
Error “OpenAccess SDK SQL Engine” | The URL of Enterprise is specified in the ODBC data source using an invalid format. | Edit the data source and enter the URL in the required format (for example, host=https://my.amp.instance.com).
Error “UnknownHostException:<URL>” | An invalid URL of Enterprise is specified in the ODBC data source that is used for the connection to Enterprise. | Edit the data source and enter the correct URL of the Enterprise instance.
Error “Failed to execute HTTP Request, got HTTP/401” | An invalid Enterprise user name or password is provided. This error appears when you are testing a connection to an Enterprise instance with an ODBC data source or when you attempt to connect to Enterprise in your tool or application. | Provide a valid user name or password.
Error “The specified DSN contains an architecture mismatch between the Driver (64-bit) and Application (32-bit)” | You are trying to connect to Enterprise from a 32-bit application using the 64-bit version of the Aginity Enterprise ODBC driver. | Install the 32-bit version of the driver and create a data source based on it to connect from a 32-bit application.
Error “Servicename attribute not found in the configuration file” | A service name other than the one specified during the installation of the Aginity Enterprise ODBC driver is entered into the currently open ODBC data source in ODBC Data Source Administrator. | The Connect to Service dialog box automatically appears. Enter the correct service name into the Service Name box.
Configuring a Spark Cluster for AQE¶
Glossary¶
- Aginity Docker Registry
- Aginity’s public registry on Docker Hub that stores the Amp application images. Also see “registry” and “image.”
- Amazon EMR
- Amazon Elastic MapReduce (EMR) web service, part of Amazon Web Services (AWS). EMR provides a Hadoop and Spark platform for processing large amounts of data. For more information about Amazon EMR, refer to the AWS documentation.
- Amp
- Aginity Amp enables businesses to create, catalog and manage all analytics as assets which then become reusable across the enterprise.
- Amp CLI
- Amp Command Line Interface, the component that provides a command-line method for managing Amp installations.
- Analytic
- A unit of computation that takes frames and parameters as input and produces frames as output.
- AQE
- Aginity Query Engine. The service in Aginity Enterprise that provides access to an Aginity Enterprise catalog from an external tool or application over a JDBC connection. You can browse and query assets in your Enterprise catalog through AQE.
- Asset
- Source data and the analytics that are built from it are viewed as a catalog of reusable corporate assets in Amp. Examples of assets are sources, notebooks, and publications. Also see “source,” “notebook,” and “publication.”
- Catalog
- A specific collection of metadata assets and analytics that is managed in an instance of Aginity Enterprise. Also see “asset” and “analytic.”
- Container
- A runtime instance of a Docker image. A Docker container provides a lightweight, self-contained operating-system environment. The services (pieces) of a distributed application, such as Amp, are packaged and run inside Docker containers. For more information about containers, refer to the Docker documentation at https://docs.docker.com.
- Current Deployment Context
- The context that is used by the Amp CLI on the computer or virtual machine where you run it for Amp installations and upgrades. The current context describes the deployment environment, including the version of Amp that will be installed and the name or IP address of the host where this version will be installed. You can change the current context. Also see “context” and “Amp CLI.”
- Dimension
- A dimension is a metadata entity that supplies context for business events and measures contained in a referenced fact table. A dimension table contains textual attribute columns that describe the rows in the dimension table and a single primary key. Dimension tables are contained in an Amp Cache Catalog. When you synchronize your Amp catalog with an Amp Cache Catalog, the dimension tables are extracted from there and become available in your Amp catalog as sources. Also see “fact table” and “source.”
- Docker
- An open-source platform for building, deploying, and running applications inside containers. For more information about Docker, refer to the Docker documentation at https://docs.docker.com. Also see “container.”
- Docker Compose File
- The Docker Compose file defines the services that make up a distributed application, such as Amp. The Compose file is used to configure and run the application services. For more information about the Compose file, refer to the Docker documentation at https://docs.docker.com. Also see “service,” “Docker,” and “container.”
- External Analytic
- A metadata table that represents a subset of data prepared for publication. External Analytics exist in an Amp Cache Catalog. When you synchronize your Amp catalog with an Amp Cache Catalog, the external analytics are extracted from there, along with dimensions and fact tables, and become available in your Amp catalog as sources. Also see “publication,” “source,” “dimension,” and “fact table.”
- Fact Table
- A fact table is a metadata table that describes measurements associated with a business process. Fact tables are contained in an Amp Cache Catalog. When you synchronize your Amp catalog with an Amp Cache Catalog, the fact tables are extracted from there and become available in your Amp catalog as sources. Also see “dimension” and “source.”
- Image
- An executable package that provides everything needed to run an application: the code, a runtime, libraries, environment variables, and configuration files. When executed, an image becomes a container. For more information about images, refer to the Docker documentation at https://docs.docker.com. Also see “Docker” and “container.”
- Input Schema
- Describes the structure of input data for sources, analytics, publications, and views in Aginity Enterprise. The input schema defines the columns in a source, analytic, publication, or view; the column data types; relationships to other entities; and any required parameters. Also see “output schema.”
- Navigation Pane
- A rectangular area that is displayed along the left side of the Aginity Enterprise window. The navigation pane provides links to help you access main screens and features of the application: Home (open the Home page); Browse (browse assets in your catalog); New (create assets in your catalog); Jobs (open the Jobs screen to view job details); and Settings (manage application settings, connections, and users).
- Notebook
- Provides an interactive web-based environment for data analytics, using Spark as the computation engine. Using a notebook, you can ingest, explore, and visualize data and export results to share and collaborate on them with others.
- Private Docker Registry
- A Docker registry to which only authorized users (not the public) have access. Using a private registry helps protect proprietary images from being used by unauthorized users. Also see “registry,” “public registry,” and “image.”
- Project
- A subgrouping of entities (sources, analytics, and publications) by subject area within a workspace. Also see “workspace.”
- Publication
- The materialization of a single catalog frame to a persistent (not temporary) table or file.
- Service
- A distinct piece (component) of a distributed application. Amp is built as a set of microservices; each microservice is run inside a Docker container. For more information about services, see the Docker documentation at https://docs.docker.com. Also see “Docker” and “container.”
- Source
- A table-like entity that is represented as a logical table in the catalog. Source data can come from a table, a file, or the output frame of an analytic.
- Spark Application
- In Spark, a self-contained unit of computation (program) that runs supplied source code. To execute a publication, Amp packages the source Java code for the publication in a Java archive (.jar) and submits it as a Spark application to a Spark cluster for execution. Spark executes the code and places the resulting data into the target table or file. To execute queries from a notebook, Amp sends a request to Livy, which creates and submits a long-running Spark application to a Spark cluster. Spark then executes the queries against the source database.
- Stack File
- Defines a collection of the Amp application services. During installation of a specific version of Amp, the services are deployed in an environment using their definitions in the stack file for that version. A stack file is a YAML (.yml) file. For more information about stack files, refer to the Docker documentation.
- View
- A frame that is created from a combination of existing sources. A view provides a custom data set that is assembled from selected columns.
- Workspace
- A top-level grouping of entities within an instance of Amp. A workspace contains one or more projects. Also see “project.”