Cluster Administration Storage Tools¶
CAST stands for Cluster Administration Storage Tools.
CAST comprises several open source components:
CSM - Cluster System Management
A C API for managing a large cluster. Offers a suite of tools for maintaining the cluster:
- Discovery and management of system resources
- Database integration (PostgreSQL)
- Job launch support (workload management APIs)
- Node diagnostics (diag APIs and scripts)
- RAS events and actions
- Infrastructure Health checks
- Python Bindings for C APIs
Burst Buffer
A cost-effective mechanism that can improve I/O performance for a large class of high-performance computing applications without requiring intermediary hardware. Burst Buffer provides:
- A fast storage tier between compute nodes and the traditional parallel file system
- Overlapping job stage-in and stage-out of data for checkpoint and restart
- Scratch volumes
- Extended memory I/O workloads
Function Shipping
A file I/O forwarding layer for Linux that aims to provide low-jitter access to a remote parallel file system while retaining common POSIX semantics.
CSM API¶
CSM is an API for managing a large cluster.
CSM Configuration¶
CSM Pam Daemon Module¶
The libcsmpam.so module is installed by the rpm to /usr/lib64/security/libcsmpam.so.
To enable this module for sshd, perform the following steps:
- Uncomment the following lines in /etc/pam.d/sshd:
#account required libcsmpam.so
#session required libcsmpam.so
Note
The libcsmpam.so module is deliberately configured to be the last session entry in this file. If your configuration changes this, make sure libcsmpam.so is loaded after the default session modules. If the admin is adding additional session modules, it is recommended that libcsmpam.so remain immediately after the default postlogin line in the sshd configuration.
- Run systemctl restart sshd.service to restart the sshd daemon with the new config.
After the daemon has been restarted, the modified pam sshd configuration will be used.
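For reference, a minimal sketch of how the relevant portions of /etc/pam.d/sshd might look once the lines are uncommented; the surrounding default entries are illustrative and vary by distribution:

account    required     pam_nologin.so
account    required     libcsmpam.so
...
session    include      password-auth
session    include      postlogin
session    required     libcsmpam.so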
Module Behavior¶
This module is designed for session authentication and cgroup assignment in the pam sshd utility. The following checks are performed to verify that the user is allowed to access the system:
- The user is root.
  - Allow entry.
  - Place the user in the default cgroup (session only).
  - Exit the module with success.
- The user is defined in /etc/pam.d/csm/activelist.
  - Allow entry.
  - Place the session in the cgroup that the user is associated with in the activelist (session only).
  - Note: The activelist is modified by CSM; admins should not modify it.
  - Exit the module with success.
- The user is defined in /etc/pam.d/csm/whitelist.
  - Allow entry.
  - Place the user in the default cgroup (session only).
  - Note: The whitelist is modified by the admin.
  - Exit the module with success.
- The user was not found.
  - Exit the module, rejecting the user.
Module Configuration¶
Configuration may occur in either a pam configuration file (e.g. /etc/pam.d/sshd) or the csm pam whitelist.
libcsmpam.so¶
File location: | /usr/lib64/security/libcsmpam.so |
---|---|
Configurable: | Through pam configuration file. |
The libcsmpam.so is a session pam module. For details on configuring this module and other pam modules, please consult the Linux man page: man pam.conf.
When ibm-csm-core is uninstalled, this library is always removed.
Warning
It is recommended that the libcsmpam.so module be the last session line in the default pam configuration file. The module requires the session to be established before it can move the session to the correct cgroup; if the module is invoked too early in the configuration, users will not be placed in the correct cgroup. This advice is configuration dependent.
whitelist¶
File location: | /etc/pam.d/csm/whitelist |
---|---|
Configurable: | Yes |
The whitelist is a newline delimited list of user names. If a user is specified, they will always be allowed to log in to the node.
If the user has an active allocation on the node, an attempt will be made to place them in the correct allocation cgroup. Otherwise, the user will be placed in the default cgroup.
When ibm-csm-core is uninstalled, if this file has been modified it will NOT be deleted.
Sample Configuration File
jdunham
pmix
csm_admin
The preceding configuration will add three users who will always be allowed to start a session. If the user has an active allocation they will be placed into the appropriate cgroup as described above.
activelist¶
File location: | /etc/pam.d/csm/activelist |
---|---|
Configurable: | No |
The activelist file should not be modified by the admin or user. CSM will modify this file when an allocation is created or deleted.
The file contains a newline delimited list of entries with the following format: [user_name];[allocation_id]. This format is parsed by libcsmpam.so to determine whether or not a user can begin the session (username) and which cgroup it belongs to (allocation_id).
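For illustration, entries follow the [user_name];[allocation_id] pattern described above; the user names and allocation ids below are hypothetical:

jdunham;24
pmix;25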
When ibm-csm-core is uninstalled, this file is always removed.
Module Compilation¶
Note
Ignore this section if the csm pam module is being installed by rpm.
The pam-devel package is required to compile this module.
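A minimal sketch of a manual build, assuming the module source is a single file named csmpam.c (the actual source layout in the CSM repository may differ):

# install the PAM development headers
yum install pam-devel
# build the module as a shared object and install it with the other PAM modules
gcc -fPIC -shared -o libcsmpam.so csmpam.c -lpam
cp libcsmpam.so /usr/lib64/security/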
Troubleshooting¶
If users are having problems with core isolation, such as being unable to log on to the node or not being placed into the correct cgroup, first perform the following steps.
- Manually create an allocation on a node that has the PAM module configured. This should be executed from the launch node as a non-root user.
$ csm_allocation_create -j 1 -n <node_name> --cgroup_type 2
---
allocation_id: <allocation_id>
num_nodes: 1
- compute_nodes: <node_name>
user_name: root
user_id: 0
state: running
type: user managed
job_submit_time: 2018-01-04 09:01:17
...
POSSIBLE FAILURES
- If the allocation create fails, ensure the node is ready:
$ csm_node_attributes_update -r y -n <node_name>
- After the allocation has been created with core isolation ssh to the node <node_name> as the user who created the allocation:
$ ssh <node_name>
POSSIBLE FAILURES
User Rejected: <user_name>; Not Authorized
Indicates the /etc/pam.d/csm/activelist was not populated with <user_name>.
Verify the allocation is currently active; if it is not, attempt to recreate the allocation.
$ csm_allocation_query_active_all | grep "allocation_id.* <allocation_id>$"
Login to <node_name> as root and check to see if the user is on the activelist:
$ ssh <node_name> -l root "grep <user_name> /etc/pam.d/csm/activelist"
If the user is not present and allocation create is functioning, this may be a CSM bug; open a defect with the CSM team.
- Check the cgroup of the user’s ssh session.
$ cat /proc/self/cgroup
11:blkio:/
10:memory:/allocation_<allocation_id>
9:hugetlb:/
8:devices:/allocation_<allocation_id>
7:freezer:/
6:cpuset:/allocation_<allocation_id>
5:net_prio,net_cls:/
4:perf_event:/
3:cpuacct,cpu:/
2:pids:/
1:name=systemd:/user.slice/user-9999137.slice/session-3957.scope
Above is an example of a properly configured cgroup. The user should be in an allocation cgroup for the memory, devices and cpuset groups.
POSSIBLE FAILURES
- The user is only in the cpuset:/csm_system cgroup. This generally indicates that the libcsmpam.so module was not added in the correct location or is disabled. Refer to the quick start at the top of this document for more details.
- The user is in the cpuset:/ cgroup. This indicates that core isolation was not performed; verify that core isolation was enabled in the allocation create step.
- Any further issues are beyond the scope of this troubleshooting document; contacting the CSM team or opening a new issue is the recommended course of action.
CSM Database¶
CSM database (CSM DB) holds information about the system's hardware configuration, hardware inventory, RAS, diagnostics, job steps, job allocations, and CSM configuration. This information is essential for the CORAL system to run properly and for resource accounting.
CSM DB uses PostgreSQL.
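For orientation, the tables described in the appendix below can be listed directly with psql; this assumes the default csmdb database name and the postgres user referenced elsewhere in this document:

psql -d csmdb -U postgres -c "\dt csm_*"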
CSM Database Appendix¶
Naming conventions¶
CSM Database Overview

Name | Convention |
---|---|
Table | Table names start with the "csm" prefix, for example csm_node. History table names add the "_history" suffix, for example csm_node_history. |
Primary Key | Primary key names are automatically generated within PostgreSQL, starting with the table name followed by "pkey": ${table name}_pkey, for example csm_node_pkey. |
Unique Key | Unique key names start with "uk", followed by the table name and a letter indicating the sequence (a, b, c, etc.): uk_${table name}_b, for example uk_csm_allocation_b. |
Foreign Key | Foreign key names are automatically generated within PostgreSQL, starting with the table name, followed by a list of field name(s), followed by "fkey": ${table}_${column names}_fkey, for example csm_allocation_node_allocation_id_fkey. |
Index | Index names start with the prefix "ix", followed by the table name and a letter indicating the sequence (a, b, c, etc.): ix_${table name}_a, for example ix_csm_node_history_a. |
Functions | Function names start with the prefix "fn", followed by a name usually related to the table and its purpose or arguments, if any: fn_function_name_purpose, for example fn_csm_allocation_history_dump. |
Triggers | Trigger names start with the prefix "tr", followed by a name usually related to the table and its purpose: tr_trigger_name_purpose. |
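Because these conventions are applied consistently, related objects can be found with simple pattern matching in psql; for example (assuming the csmdb database and postgres user):

psql -d csmdb -U postgres -c "\di ix_csm_*"
psql -d csmdb -U postgres -c "\dt csm_*_history"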
History Tables¶
CSM DB keeps track of data as it changes over time. History tables store these records, and a history time stamp is generated to indicate that the transaction has completed. The information remains in these tables until further action is taken.
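For example, the most recent records of a history table can be inspected by ordering on its history time stamp; csm_node_history is shown, and the same pattern applies to the other history tables:

psql -d csmdb -U postgres -c "SELECT * FROM csm_node_history ORDER BY history_time DESC LIMIT 5;"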
Usage and Size¶
The usage and size of each table will vary depending on system size and system activity. This document tries to estimate the usage and size of the tables. Usage is defined as how often a table is accessed and is recorded as Low, Medium, or High. Size indicates how many rows are within the database tables and is recorded as a total number of rows.
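The row counts referred to as size can be checked directly; for example, for the node table:

psql -d csmdb -U postgres -c "SELECT count(*) FROM csm_node;"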
Table Categories¶
The CSM database tables are grouped, and color coordinated in the schema diagram, to show which category they belong to within the schema. The categories correspond to the table groups described in the sections below.
Tables¶
Node attributes tables¶
csm_node¶
Description
This table contains the attributes of all the nodes in the CORAL system including: management node, service node, login node, work load manager, launch node, and compute node.
Table | Overview | Action On: |
---|---|---|
Usage | High (CSM APIs access this table regularly) | |
Size | 1-5000 rows (total nodes in a CORAL System) | |
Key(s) | PK: node_name | |
Index | csm_node_pkey on (node_name); ix_csm_node_a on (node_name, ready) | |
Functions | fn_csm_node_ready; fn_csm_node_update; fn_csm_node_delete | |
Triggers | tr_csm_node_ready on (csm_node); tr_csm_node_update | update/delete; update/delete |
Referenced by table | Constraint | Fields | Key |
---|---|---|---|
csm_allocation_node | csm_allocation_node_node_name_fkey | node_name | (FK) |
csm_dimm | csm_dimm_node_name_fkey | node_name | (FK) |
csm_gpu | csm_gpu_node_name_fkey | node_name | (FK) |
csm_hca | csm_hca_node_name_fkey | node_name | (FK) |
csm_processor | csm_processor_node_name_fkey | node_name | (FK) |
csm_ssd | csm_ssd_node_name_fkey | node_name | (FK) |
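For example, the (node_name, ready) index above supports quick checks of node readiness; a minimal sketch:

psql -d csmdb -U postgres -c "SELECT node_name, ready FROM csm_node ORDER BY node_name;"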
csm_node_history¶
- Description
- This table contains the historical information related to node attributes.
Table | Overview | Action On: |
---|---|---|
Usage | Low (When hardware changes and to query historical information) | |
Size | 5000+ rows (Based on hardware changes) | |
Index | ix_csm_node_history_a on (history_time); ix_csm_node_history_b on (node_name) | |
csm_node_ready_history¶
- Description
- This table contains historical information related to the node ready status. This table will be updated each time the node ready status changes.
Table | Overview | Action On: |
---|---|---|
Usage | Med-High | |
Size | (Based on how often a node ready status changes) | |
Index | ix_csm_node_ready_history_a on (history_time); ix_csm_node_ready_history_b on (node_name, ready) | |
csm_processor¶
- Description
- This table contains information on the processors of a node.
Table | Overview | Action On: |
---|---|---|
Usage | Low | |
Size | 25,000+ rows (Witherspoon will consist of 256 processors per node, based on 5000 nodes) | |
Key(s) | PK: serial_number; FK: csm_node (node_name) | |
Index | csm_processor_pkey on (serial_number); ix_csm_processor_a on (serial_number, node_name) | |
Functions | fn_csm_processor_history_dump | |
Triggers | tr_csm_processor_history_dump | update/delete |
csm_processor_history¶
- Description
- This table contains historical information associated with individual processors.
Table | Overview | Action On: |
---|---|---|
Usage | Low | |
Size | 25,000+ rows (Based on how often a processor is changed or its failure rate) | |
Index | ix_csm_processor_history_a on (history_time); ix_csm_processor_history_b on (serial_number, node_name) | |
csm_gpu¶
- Description
- This table contains information on the GPUs on the node.
Table | Overview | Action On: |
---|---|---|
Usage | Low | |
Size | 30,000+ rows (maximum of 6 per node; with 5000 nodes that is 30,000 on Witherspoons) | |
Key(s) | PK: node_name, gpu_id; FK: csm_node (node_name) | |
Index | csm_gpu_pkey on (node_name, gpu_id) | |
Functions | fn_csm_gpu_history_dump | |
Triggers | tr_csm_gpu_history_dump | update/delete |
csm_gpu_history¶
- Description
- This table contains historical information associated with individual GPUs. Each change is recorded and time stamped.
Table | Overview | Action On: |
---|---|---|
Usage | Low | |
Size | (Based on how often changed) | |
Index | ix_csm_gpu_history_a on (history_time); ix_csm_gpu_history_b on (serial_number) | |
csm_ssd¶
- Description
- This table contains information on the SSDs on the system. This table contains the current status of the SSD along with its capacity and wear.
Table | Overview | Action On: |
---|---|---|
Usage | Medium | |
Size | 1-5000 rows (one per node) | |
Key(s) | PK: serial_number; FK: csm_node (node_name) | |
Index | csm_ssd_pkey on (serial_number); ix_csm_ssd_a on (serial_number, node_name) | |
Functions | fn_csm_ssd_history_dump | |
Triggers | tr_csm_ssd_history_dump | update/delete |
Referenced by table | Constraint | Fields | Key |
---|---|---|---|
csm_vg_ssd | csm_vg_ssd_serial_number_fkey | serial_number, node_name | (FK) |
csm_ssd_history¶
- Description
- This table contains historical information associated with individual SSDs.
Table | Overview | Action On: |
---|---|---|
Usage | Low | |
Size | 5000+ rows | |
Index | ix_csm_ssd_history_a on (history_time); ix_csm_ssd_history_b on (serial_number, node_name) | |
csm_hca¶
- Description
- This table contains information about the HCAs (Host Channel Adapters). Each HCA has a unique identifier (serial number). The table records a status indicator, the board ID (for the IB adapter), and the InfiniBand globally unique identifier (GUID).
Table | Overview | Action On: |
---|---|---|
Usage | Low | |
Size | 1-10,000 rows (1 or 2 per node) | |
Key(s) | PK: serial_number; FK: csm_node (node_name) | |
Index | csm_hca_pkey on (serial_number) | |
Functions | fn_csm_hca_history_dump | |
Triggers | tr_csm_hca_history_dump | update/delete |
csm_hca_history¶
- Description
- This table contains historical information associated with the HCA (Host Channel Adapters).
Table | Overview | Action On: |
---|---|---|
Usage | Low | |
Size | (Based on how many are changed out) | |
Index | ix_csm_hca_history_a on (history_time) | |
csm_dimm¶
- Description
- This table contains information related to the DIMM (Dual In-Line Memory Module) attributes.
Table | Overview | Action On: |
---|---|---|
Usage | Low | |
Size | 1-80,000+ rows (16 DIMMs per node) | |
Key(s) | PK: serial_number; FK: csm_node (node_name) | |
Index | csm_dimm_pkey on (serial_number) | |
Functions | fn_csm_dimm_history_dump | |
Triggers | tr_csm_dimm_history_dump | update/delete |
csm_dimm_history¶
- Description
- This table contains historical information related to the DIMM (Dual In-Line Memory Module) attributes.
Table | Overview | Action On: |
---|---|---|
Usage | Low | |
Size | (Based on how many are changed out) | |
Index | ix_csm_dimm_history_a on (history_time) | |
Allocation tables¶
csm_allocation¶
- Description
- This table contains the information about the system’s current allocations. Specific attributes include: primary job ID, secondary job ID, user and system flags, number of nodes, state, username, start time stamp, power cap, power shifting ratio, authorization token, account, comments, eligible, job name, reservation, Wall clock time reservation, job_submit_time, queue, time_limit, WC Key, type.
Table | Overview | Action On: |
---|---|---|
Usage | High (every allocation creation and allocation query) | |
Size | 1-5000 rows (1 allocation per node, 5000 max) | |
Key(s) | PK: allocation_id | |
Index | csm_allocation_pkey on (allocation_id) | |
Functions | fn_csm_allocation_history_dump; fn_csm_allocation_state_history_state_change; fn_csm_allocation_update | insert/update/delete (API call) |
Triggers | tr_csm_allocation_state_change; tr_csm_allocation_update | delete; update |
Referenced by table | Constraint | Fields | Key |
---|---|---|---|
csm_allocation_node | csm_allocation_node_allocation_id_fkey | allocation_id | (FK) |
csm_step | csm_step_allocation_id_fkey | allocation_id | (FK) |
csm_allocation_history¶
- Description
- This table contains information about allocations that are no longer active on the system; essentially, this is the historical information about allocations. It grows based on how many allocations are run over the life cycle of the machine/system. It can also be used to determine the total energy consumed per allocation (filled in when the allocation is freed).
Table | Overview | Action On: |
---|---|---|
Usage | High | |
Size | 100,000+ rows (depending on customer workload) | |
Index | ix_csm_allocation_history_a on (history_time) | |
Step tables¶
csm_step¶
- Description
- This table contains information on active steps within the CSM database. Featured attributes include: step id, allocation id, begin time, state, executable, working directory, arguments, environment variables, sequence ID, number of nodes, number of processes (that can run on each compute node), number of GPUs, amount of memory, number of tasks, user flags, system flags, and launch node name.
Table | Overview | Action On: |
---|---|---|
Usage | High | |
Size | 5000+ rows (depending on the steps) | |
Key(s) | PK: step_id, allocation_id; FK: csm_allocation (allocation_id) | |
Index | csm_step_pkey on (step_id, allocation_id); uk_csm_step_a on (step_id, allocation_id) | |
Functions | fn_csm_step_history_dump | insert/update/delete (API call) |
Referenced by table | Constraint | Fields | Key |
---|---|---|---|
csm_step_node | csm_step_node_step_id_fkey | step_id | (FK) |
csm_step_history¶
- Description
- This table contains the information for steps that have terminated. Some additional information from the initial step has been added to the history table. These attributes include: end time, compute nodes, level gpu usage, exit status, error text, network bandwidth, cpu stats, total U time, total S time, total number of threads, gpu stats, memory stats, max memory, max swap, and io stats.
Table | Overview | Action On: |
---|---|---|
Usage | High | |
Size | Millions of rows (depending on the customer's workload) | |
Index | ix_csm_step_history_a on (history_time); ix_csm_step_history_b on (begin_time, end_time); ix_csm_step_history_c on (allocation_id, end_time); ix_csm_step_history_d on (end_time); ix_csm_step_history_e on (step_id) | |
Allocation node, allocation state history, step node tables¶
csm_allocation_node¶
- Description
- This table maps current allocations to the compute nodes that make up the allocation. This information is later used when populating the csm_allocation_history table.
Table | Overview | Action On: |
---|---|---|
Usage | High | |
Size | 1-5000 rows | |
Key(s) | FK: csm_node (node_name); FK: csm_allocation (allocation_id) | |
Index | ix_csm_allocation_node_a on (allocation_id); uk_csm_allocation_node_b on (allocation_id, node_name) | insert (API call) |
Functions | fn_csm_allocation_node_sharing_status; fn_csm_allocation_node_change | |
Triggers | tr_csm_allocation_node_change | update |
Referenced by table | Constraint | Fields | Key |
---|---|---|---|
csm_lv | csm_lv_allocation_id_fkey | allocation_id, node_name | (FK) |
csm_step_node | csm_step_node_allocation_id_fkey | allocation_id, node_name | (FK) |
csm_allocation_node_history¶
- Description
- This table maps history allocations to the compute nodes that make up the allocation.
Table | Overview | Action On: |
---|---|---|
Usage | High | |
Size | 1-5000 rows | |
Index | ix_csm_allocation_node_history_a on (history_time) | |
csm_allocation_state_history¶
- Description
- This table contains the state history of active allocations: a timestamp of when the information enters the table, along with a state indicator.
Table | Overview | Action On: |
---|---|---|
Usage | High | |
Size | 1-5000 rows (one per allocation) | |
Index | ix_csm_allocation_state_history_a on (history_time) | |
csm_step_node¶
- Description
- This table maps active allocations to job steps and nodes.
Table | Overview | Action On: |
---|---|---|
Usage | High | |
Size | 5000+ rows (based on steps) | |
Key(s) | FK: csm_step (step_id, allocation_id); FK: csm_allocation_node (allocation_id, node_name) | |
Index | uk_csm_step_node_a on (step_id, allocation_id, node_name) | |
Functions | fn_csm_step_node_history_dump | |
Triggers | tr_csm_step_node_history_dump | delete |
csm_step_node_history¶
- Description
- This table maps historical allocations to job steps and nodes.
Table | Overview | Action On: |
---|---|---|
Usage | High | |
Size | 5000+ rows (based on steps) | |
Index | ix_csm_step_node_history_a on (history_time) | |
RAS tables¶
csm_ras_type¶
- Description
- This table contains the description and details for each of the possible RAS event types. Specific attributes in this table include: msg_id, severity, message, description, control_action, threshold_count, threshold_period, enabled, set_not_ready, set_ready, visible_to_users.
Table | Overview | Action On: |
---|---|---|
Usage | Low | |
Size | 1000+ rows (depending on the different RAS types) | |
Key(s) | PK: msg_id | |
Index | csm_ras_type_pkey on (msg_id) | |
Functions | fn_csm_ras_type_update | |
Triggers | tr_csm_ras_type_update | insert/update/delete |
csm_ras_type_audit¶
- Description
- This table contains historical descriptions and details for each of the possible RAS event types. Specific attributes in this table include: msg_id_seq, operation, change_time, msg_id, severity, message, description, control_action, threshold_count, threshold_period, enabled, set_not_ready, set_ready, visible_to_users.
Table | Overview | Action On: |
---|---|---|
Usage | Low | |
Size | 1000+ rows (depending on the different RAS types) | |
Key(s) | PK: msg_id_seq | |
Index | csm_ras_type_audit_pkey on (msg_id_seq) | |
Referenced by table | Constraint | Fields | Key |
---|---|---|---|
csm_ras_event_action | csm_ras_event_action_msg_id_seq_fkey | msg_id_seq | (FK) |
csm_ras_event_action¶
- Description
- This table contains all RAS events. Key attributes in this table include: rec_id, msg_id, msg_id_seq, time_stamp, count, message, and raw data. This table accumulates an enormous number of records due to the continuous event cycle, so a solution needs to be in place to accommodate the volume of data produced.
Table | Overview | Action On: |
---|---|---|
Usage | High | |
Size | Millions of rows | |
Key(s) | PK: rec_id; FK: csm_ras_type_audit (msg_id_seq) | |
Index | csm_ras_event_action_pkey on (rec_id); ix_csm_ras_event_action_a on (msg_id); ix_csm_ras_event_action_b on (time_stamp); ix_csm_ras_event_action_c on (location_name); ix_csm_ras_event_action_d on (time_stamp, msg_id); ix_csm_ras_event_action_e on (time_stamp, location_name) | |
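The time_stamp index above supports the most common query pattern, listing recent RAS events; a minimal sketch:

psql -d csmdb -U postgres -c "SELECT rec_id, msg_id, time_stamp, message FROM csm_ras_event_action ORDER BY time_stamp DESC LIMIT 10;"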
CSM diagnostic tables¶
csm_diag_run¶
- Description
- This table contains information about each of the diagnostic runs. Specific attributes include: run id, allocation_id, begin time, status, inserted RAS, log directory, and command line.
Table | Overview | Action On: |
---|---|---|
Usage | Low | |
Size | 1000+ rows | |
Key(s) | PK: run_id | |
Index | csm_diag_run_pkey on (run_id) | |
Functions | fn_csm_diag_run_history_dump | insert/update/delete (API call) |
Referenced by table | Constraint | Fields | Key |
---|---|---|---|
csm_diag_result | csm_diag_result_run_id_fkey | run_id | (FK) |
csm_diag_run_history¶
- Description
- This table contains historical information about each of the diagnostic runs. Specific attributes include: run id, allocation_id, begin time, end time, status, inserted RAS, log directory, and command line.
Table | Overview | Action On: |
---|---|---|
Usage | Low | |
Size | 1000+ rows | |
Index | ix_csm_diag_run_history_a on (history_time) | |
csm_diag_result¶
- Description
- This table contains the results of a specific instance of a diagnostic.
Table | Overview | Action On: |
---|---|---|
Usage | Low | |
Size | 1000+ rows | |
Key(s) | FK: csm_diag_run (run_id) | |
Index | ix_csm_diag_result_a on (run_id, test_case, node_name) | |
Functions | fn_csm_diag_result_history_dump | |
Triggers | tr_csm_diag_result_history_dump | delete |
csm_diag_result_history¶
- Description
- This table contains historical results of a specific instance of a diagnostic.
Table | Overview | Action On: |
---|---|---|
Usage | Low | |
Size | 1000+ rows | |
Index | ix_csm_diag_result_history_a on (history_time) | |
SSD partition and SSD logical volume tables¶
csm_lv¶
- Description
- This table contains information about the logical volumes that are created within the compute nodes.
Table | Overview | Action On: |
---|---|---|
Usage | Medium | |
Size | 5000+ rows (depending on SSD usage) | |
Key(s) | PK: logical_volume_name, node_name; FK: csm_allocation (allocation_id); FK: csm_vg (node_name, vg_name) | |
Index | csm_lv_pkey on (logical_volume_name, node_name); ix_csm_lv_a on (logical_volume_name) | |
Functions | fn_csm_lv_history_dump; fn_csm_lv_modified_history_dump; fn_csm_lv_update_history_dump | insert/update/delete (API call) |
Triggers | tr_csm_lv_modified_history_dump; tr_csm_lv_update_history_dump | update; update |
csm_lv_history¶
- Description
- This table contains historical information associated with previously active logical volumes.
Table | Overview | Action On: |
---|---|---|
Usage | Medium | |
Size | 5000+ rows (depending on step usage) | |
Index | ix_csm_lv_history_a on (history_time); ix_csm_lv_history_b on (logical_volume_name) | |
csm_lv_update_history¶
- Description
- This table contains historical information associated with lv updates.
Table | Overview | Action On: |
---|---|---|
Usage | Medium | |
Size | 5000+ rows (depending on step usage) | |
Index | ix_csm_lv_update_history_a on (history_time); ix_csm_lv_update_history_b on (logical_volume_name) | |
csm_vg_ssd¶
- Description
- This table contains information that references both the SSD and logical volume tables.
Table | Overview | Action On: |
---|---|---|
Usage | Medium | |
Size | 5000+ rows (depending on SSD usage) | |
Key(s) | FK: csm_ssd (serial_number, node_name) | |
Index | csm_vg_ssd_pkey on (vg_name, node_name, serial_number); ix_csm_vg_ssd_a on (vg_name, node_name, serial_number); uk_csm_vg_ssd_a on (vg_name, node_name) | |
Functions | fn_csm_vg_ssd_history_dump | |
Triggers | tr_csm_vg_ssd_history_dump | update/delete |
csm_vg_ssd_history¶
- Description
- This table contains historical information associated with SSD and logical volume tables.
Table | Overview | Action On: |
---|---|---|
Usage | Medium | |
Size | 5000+ rows (depending on step usage) | |
Index | ix_csm_vg_ssd_history_a on (history_time) | |
csm_vg¶
- Description
- This table contains information that references both the SSD and logical volume tables.
Table | Overview | Action On: |
---|---|---|
Usage | Medium | |
Size | 5000+ rows (depending on step usage) | |
Key(s) | PK: vg_name, node_name; FK: csm_node (node_name) | |
Index | csm_vg_pkey on (vg_name, node_name) | |
Functions | fn_csm_vg_history_dump | |
Triggers | tr_csm_vg_history_dump | update/delete |
Referenced by table | Constraint | Fields | Key |
---|---|---|---|
csm_lv | csm_lv_node_name_fkey | node_name, vg_name | (FK) |
csm_vg_history¶
- Description
- This table contains historical information associated with SSD and logical volume tables.
Table | Overview | Action On: |
---|---|---|
Usage | Medium | |
Size | 5000+ rows (depending on step usage) | |
Index | ix_csm_vg_history_a on (history_time) | |
Switch & ib cable tables¶
csm_switch¶
- Description
- This table contains information about the switch and its attributes, including: switch_name, discovery_time, collection_time, comment, description, fw_version, gu_id, has_ufm_agent, ip, model, num_modules, num_ports, physical_frame_location, physical_u_location, ps_id, role, server_operation_mode, sm_version, system_guid, system_name, total_alarms, type, and vendor.
Table | Overview | Action On: |
---|---|---|
Usage | Low | |
Size | 500 rows (Switches on a CORAL system) | |
Key(s) | PK: switch_name | |
Index | csm_switch_pkey on (switch_name) | |
Functions | fn_csm_switch_history_dump | |
Triggers | tr_csm_switch_history_dump | update/delete |
Referenced by table | Constraint | Fields | Key |
---|---|---|---|
csm_switch_inventory | csm_switch_inventory_host_system_guid_fkey | host_system_guid | (FK) |
csm_switch_history¶
- Description
- This table contains historical information associated with individual switches.
Table | Overview | Action On: |
---|---|---|
Usage | Low | |
Size | (Based on failure rate or how often changed out) | |
Index | ix_csm_switch_history_a on (history_time); ix_csm_switch_history_b on (switch_name, history_time) | |
csm_ib_cable¶
- Description
- This table contains information about the InfiniBand cables, including: serial_number, discovery_time, collection_time, comment, guid_s1, guid_s2, identifier, length, name, part_number, port_s1, port_s2, revision, severity, type, and width.
Table | Overview | Action On: |
---|---|---|
Usage | Low | |
Size | 25,000+ rows (Based on switch topology and/or configuration) | |
Key(s) | PK: serial_number | |
Index | csm_ib_cable_pkey on (serial_number) | |
Functions | fn_csm_ib_cable_history_dump | |
Triggers | tr_csm_ib_cable_history_dump | update/delete |
csm_ib_cable_history¶
- Description
- This table contains historical information about the InfiniBand cables.
Table | Overview | Action On: |
---|---|---|
Usage | Low | |
Size | 25,000+ rows (Based on switch topology and/or configuration) | |
Index | ix_csm_ib_cable_history_a on (history_time) | |
csm_switch_inventory¶
- Description
- This table contains information about the switch inventory, including: name, host_system_guid, discovery_time, collection_time, comment, description, device_name, device_type, max_ib_ports, module_index, number_of_chips, path, serial_number, severity, and status.
Table | Overview | Action On: |
---|---|---|
Usage | Low | |
Size | 25,000+ rows (Based on switch topology and/or configuration) | |
Key(s) | PK: name; FK: csm_switch (switch_name) | |
Index | csm_switch_inventory_pkey on (name) | |
Functions | fn_csm_switch_inventory_history_dump | |
Triggers | tr_csm_switch_inventory_history_dump | update/delete |
csm_switch_inventory_history¶
- Description
- This table contains historical information about the switch inventory.
Table | Overview | Action On: |
---|---|---|
Usage | Low | |
Size | 25,000+ rows (Based on switch topology and/or configuration) | |
Index | ix_csm_switch_inventory_history_a on (history_time) | |
csm_switch_ports¶
- Description
- This table contains information about the switch ports, including: name, parent, discovery_time, collection_time, active_speed, comment, description, enabled_speed, external_number, guid, lid, max_supported_speed, logical_state, mirror, mirror_traffic, module, mtu, number, physical_state, peer, severity, supported_speed, system_guid, tier, width_active, width_enabled, and width_supported.
Table | Overview | Action On: |
---|---|---|
Usage | Low | |
Size | 25,000+ rows (Based on switch topology and/or configuration) | |
Key(s) | PK: name; FK: csm_switch (switch_name) | |
Index | csm_switch_ports_pkey on (name) | |
Functions | fn_csm_switch_ports_history_dump | |
Triggers | tr_csm_switch_ports_history_dump | update/delete |
csm_switch_ports_history¶
- Description
- This table contains historical information about the switch ports.
Table | Overview | Action On: |
---|---|---|
Usage | Low | |
Size | 25,000+ rows (Based on switch topology and/or configuration) | |
Index | ix_csm_switch_ports_history_a on (history_time) | |
CSM configuration tables¶
csm_config¶
- Description
- This table contains information about the CSM configuration.
Table | Overview | Action On: |
---|---|---|
Usage | Medium | |
Size | 1 row (Based on configuration changes) | |
Key(s) | PK: csm_config_id | |
Index | csm_config_pkey on (csm_config_id) | |
Functions | fn_csm_config_history_dump | |
Triggers | tr_csm_config_history_dump | update/delete |
csm_config_history¶
- Description
- This table contains historical information about the CSM configuration.
Table | Overview | Action On: |
---|---|---|
Usage | Medium | |
Size | 1-100 rows | |
Index | ix_csm_config_history_a on (history_time) | |
csm_config_bucket¶
- Description
- This table is the list of items that will be placed in the bucket. Some of the attributes include: bucket id, item list, execution interval, and time stamp.
Table | Overview | Action On: |
---|---|---|
Usage | Medium | |
Size | 1-400 rows (Based on configuration changes) | |
Index | ix_csm_config_bucket_a on (bucket_id, item_list, time_stamp) | |
CSM DB schema version tables¶
csm_db_schema_version¶
- Description
- This is the current database schema version when loaded.
Table | Overview | Action On: |
---|---|---|
Usage | Low | |
Size | 1-100 rows (Based on CSM DB changes) | |
Key(s) | PK: version | |
Index | csm_db_schema_version_pkey on (version); ix_csm_db_schema_version_a on (version, create_time) | |
Functions | fn_csm_db_schema_version_history_dump | |
Triggers | tr_csm_db_schema_version_history_dump | update/delete |
csm_db_schema_version_history¶
- Description
- This is the historical database schema version (if changes have been made).
Table | Overview | Action On: |
---|---|---|
Usage | Low | |
Size | 1-100 rows (Based on CSM DB changes/updates) | |
Index | ix_csm_db_schema_version_history_a on (history_time) | |
PK, FK, UK keys and Index Charts¶
Primary Keys (default Indexes)¶
Name | Table | Index on | Description |
---|---|---|---|
csm_allocation_pkey | csm_allocation | pkey index on | allocation_id |
csm_config_pkey | csm_config | pkey index on | csm_config_id |
csm_db_schema_version_pkey | csm_db_schema_version | pkey index on | version |
csm_diag_run_pkey | csm_diag_run | pkey index on | run_id |
csm_dimm_pkey | csm_dimm | pkey index on | serial_number |
csm_gpu_pkey | csm_gpu | pkey index on | node_name, gpu_id |
csm_hca_pkey | csm_hca | pkey index on | node_name, serial_number |
csm_ib_cable_pkey | csm_ib_cable | pkey index on | serial_number |
csm_lv_pkey | csm_lv | pkey index on | logical_volume_name, node_name |
csm_node_pkey | csm_node | pkey index on | node_name |
csm_processor_pkey | csm_processor | pkey index on | serial_number |
csm_ras_event_action_pkey | csm_ras_event_action | pkey index on | rec_id |
csm_ras_type_audit_pkey | csm_ras_type_audit | pkey index on | msg_id_seq |
csm_ras_type_pkey | csm_ras_type | pkey index on | msg_id |
csm_ssd_pkey | csm_ssd | pkey index on | serial_number |
csm_step_pkey | csm_step | pkey index on | step_id, allocation_id |
csm_switch_inventory_pkey | csm_switch_inventory | pkey index on | name |
csm_switch_pkey | csm_switch | pkey index on | switch_name |
csm_switch_ports_pkey | csm_switch_ports | pkey index on | name |
csm_vg_ssd_pkey | csm_vg_ssd | pkey index on | vg_name, node_name, serial_number |
Foreign Keys¶
Name | From Table | From Cols | To Table | To Cols |
---|---|---|---|---|
csm_allocation_node_allocation_id_fkey | csm_allocation_node | allocation_id | csm_allocation | allocation_id |
csm_allocation_node_node_name_fkey | csm_allocation_node | node_name | csm_node | node_name |
csm_diag_result_run_id_fkey | csm_diag_result | run_id | csm_diag_run | run_id |
csm_dimm_node_name_fkey | csm_dimm | node_name | csm_node | node_name |
csm_gpu_node_name_fkey | csm_gpu | node_name | csm_node | node_name |
csm_hca_node_name_fkey | csm_hca | node_name | csm_node | node_name |
csm_lv_allocation_id_fkey | csm_lv | allocation_id, node_name | csm_allocation_node | allocation_id, node_name |
csm_lv_node_name_fkey | csm_lv | node_name, vg_name | csm_vg | node_name, vg_name |
csm_processor_node_name_fkey | csm_processor | node_name | csm_node | node_name |
csm_ras_event_action_msg_id_seq_fkey | csm_ras_event_action | msg_id_seq | csm_ras_type_audit | msg_id_seq |
csm_ssd_node_name_fkey | csm_ssd | node_name | csm_node | node_name |
csm_step_allocation_id_fkey | csm_step | allocation_id | csm_allocation | allocation_id |
csm_step_node_allocation_id_fkey | csm_step_node | allocation_id, node_name | csm_allocation_node | allocation_id, node_name |
csm_step_node_step_id_fkey | csm_step_node | step_id, allocation_id | csm_step | step_id, allocation_id |
csm_switch_inventory_host_system_guid_fkey | csm_switch_inventory | host_system_guid | csm_switch | switch_name |
csm_switch_ports_parent_fkey | csm_switch_ports | parent | csm_switch | switch_name |
csm_vg_ssd_serial_number_fkey | csm_vg_ssd | serial_number, node_name | csm_ssd | serial_number, node_name |
csm_vg_vg_name_fkey | csm_vg | vg_name, node_name | csm_vg_ssd | vg_name, node_name |
Indexes¶
Name | Table | Index on | Description field |
---|---|---|---|
ix_csm_allocation_history_a | csm_allocation_history | index on | history_time |
ix_csm_allocation_state_history_a | csm_allocation_state_history | index on | history_time |
ix_csm_config_bucket_a | csm_config_bucket | index on | bucket_id, item_list, time_stamp |
ix_csm_config_history_a | csm_config_history | index on | history_time |
ix_csm_db_schema_version_a | csm_db_schema_version | index on | version, create_time |
ix_csm_db_schema_version_history_a | csm_db_schema_version_history | index on | history_time |
ix_csm_diag_result_a | csm_diag_result | index on | run_id, test_name, node_name |
ix_csm_diag_result_history_a | csm_diag_result_history | index on | history_time |
ix_csm_diag_run_history_a | csm_diag_run_history | index on | history_time |
ix_csm_dimm_history_a | csm_dimm_history | index on | history_time |
ix_csm_gpu_history_a | csm_gpu_history | index on | history_time |
ix_csm_gpu_history_b | csm_gpu_history | index on | serial_number |
ix_csm_hca_history_a | csm_hca_history | index on | history_time |
ix_csm_ib_cable_history_a | csm_ib_cable_history | index on | history_time |
ix_csm_lv_a | csm_lv | index on | logical_volume_name |
ix_csm_lv_history_a | csm_lv_history | index on | history_time |
ix_csm_lv_history_b | csm_lv_history | index on | logical_volume_name |
ix_csm_lv_update_history_a | csm_lv_update_history | index on | history_time |
ix_csm_lv_update_history_b | csm_lv_update_history | index on | logical_volume_name |
ix_csm_node_a | csm_node | index on | node_name, ready |
ix_csm_node_history_a | csm_node_history | index on | history_time |
ix_csm_node_history_b | csm_node_history | index on | node_name |
ix_csm_node_ready_history_a | csm_node_ready_history | index on | history_time |
ix_csm_node_ready_history_b | csm_node_ready_history | index on | node_name, ready |
ix_csm_processor_a | csm_processor | index on | serial_number, node_name |
ix_csm_processor_history_a | csm_processor_history | index on | history_time |
ix_csm_processor_history_b | csm_processor_history | index on | serial_number, node_name |
ix_csm_ras_event_action_a | csm_ras_event_action | index on | msg_id |
ix_csm_ras_event_action_b | csm_ras_event_action | index on | time_stamp |
ix_csm_ras_event_action_c | csm_ras_event_action | index on | location_name |
ix_csm_ras_event_action_d | csm_ras_event_action | index on | time_stamp, msg_id |
ix_csm_ras_event_action_e | csm_ras_event_action | index on | time_stamp, location_name |
ix_csm_ssd_a | csm_ssd | index on | serial_number, node_name |
ix_csm_ssd_history_a | csm_ssd_history | index on | history_time |
ix_csm_ssd_history_b | csm_ssd_history | index on | serial_number, node_name |
ix_csm_step_history_a | csm_step_history | index on | history_time |
ix_csm_step_history_b | csm_step_history | index on | begin_time, end_time |
ix_csm_step_history_c | csm_step_history | index on | allocation_id, end_time |
ix_csm_step_history_d | csm_step_history | index on | end_time |
ix_csm_step_history_e | csm_step_history | index on | step_id |
ix_csm_step_node_history_a | csm_step_node_history | index on | history_time |
ix_csm_switch_history_a | csm_switch_history | index on | history_time |
ix_csm_switch_history_b | csm_switch_history | index on | switch_name, history_time |
ix_csm_switch_inventory_history_a | csm_switch_inventory_history | index on | history_time |
ix_csm_switch_ports_history_a | csm_switch_ports_history | index on | history_time |
ix_csm_vg_history_a | csm_vg_history | index on | history_time |
ix_csm_vg_ssd_a | csm_vg_ssd | index on | vg_name, node_name, serial_number |
ix_csm_vg_ssd_history_a | csm_vg_ssd_history | index on | history_time |
Unique Indexes¶
Name | Table | Index on | Description field |
---|---|---|---|
uk_csm_allocation_node_b | csm_allocation_node | uniqueness on | allocation_id, node_name |
uk_csm_ssd_a | csm_ssd | uniqueness on | serial_number, node_name |
uk_csm_step_a | csm_step | uniqueness on | step_id, allocation_id |
uk_csm_step_node_a | csm_step_node | uniqueness on | step_id, allocation_id, node_name |
uk_csm_vg_a | csm_vg | uniqueness on | vg_name, node_name |
uk_csm_vg_ssd_a | csm_vg_ssd | uniqueness on | vg_name, node_name |
Functions and Triggers¶
Function Name | Trigger Name | Table On | Tr Type | Result Data Type | Action On | Argument Data Type | Description |
---|---|---|---|---|---|---|---|
fn_csm_allocation_create_data_aggregator | (Stored Procedure) | csm_allocation_node | void | i_allocation_id bigint, i_node_names text[], i_ib_rx_list bigint[], i_ib_tx_list bigint[], i_gpfs_read_list bigint[], i_gpfs_write_list bigint[], i_energy bigint[], i_power_cap integer[], i_ps_ratio integer[] | csm_allocation_node function to populate the data aggregator fields in csm_allocation_node. | ||
fn_csm_allocation_finish_data_stats | (Stored Procedure) | csm_allocation_node | void | allocationid bigint, node_names text[], ib_rx_list bigint[], ib_tx_list bigint[], gpfs_read_list bigint[], gpfs_write_list bigint[], energy_list bigint[] | csm_allocation function to finalize the data aggregator fields. | ||
fn_csm_allocation_history_dump | (Stored Procedure) | csm_allocation | void | allocationid bigint, endtime timestamp without time zone, exitstatus integer, i_state text, node_names text[], ib_rx_list bigint[], ib_tx_list bigint[], gpfs_read_list bigint[], gpfs_write_list bigint[], energy_list bigint[] | csm_allocation function to amend summarized column(s) on DELETE. (csm_allocation_history_dump) | ||
fn_csm_allocation_node_change | tr_csm_allocation_node_change | csm_allocation_node | BEFORE | trigger | DELETE | csm_allocation_node trigger to amend summarized column(s) on UPDATE and DELETE. | |
fn_csm_allocation_node_sharing_status | (Stored Procedure) | csm_allocation_node | void | i_allocation_id bigint, i_type text, i_state text, i_shared boolean, i_nodenames text[] | csm_allocation_sharing_status function to handle exclusive usage of shared nodes on INSERT. | ||
fn_csm_allocation_state_history_state_change | tr_csm_allocation_state_change | csm_allocation | BEFORE | trigger | UPDATE | csm_allocation trigger to amend summarized column(s) on UPDATE. | |
fn_csm_allocation_update | tr_csm_allocation_update | csm_allocation | BEFORE | trigger | UPDATE | csm_allocation_update trigger to amend summarized column(s) on UPDATE. | |
fn_csm_allocation_update_state | (Stored Procedure) | csm_allocation,csm_allocation_node | record | i_allocationid bigint, i_state text, OUT o_primary_job_id bigint, OUT o_secondary_job_id integer, OUT o_user_flags text, OUT o_system_flags text, OUT o_num_nodes integer, OUT o_nodes text, OUT o_isolated_cores integer, OUT o_user_name text | csm_allocation_update_state function that ensures the allocation can be legally updated to the supplied state | ||
fn_csm_config_history_dump | tr_csm_config_history_dump | csm_config | BEFORE | trigger | UPDATE, DELETE | csm_config trigger to amend summarized column(s) on UPDATE and DELETE. | |
fn_csm_db_schema_version_history_dump | tr_csm_db_schema_version_history_dump | csm_db_schema_version | BEFORE | trigger | UPDATE, DELETE | csm_db_schema_version trigger to amend summarized column(s) on UPDATE and DELETE. | |
fn_csm_diag_result_history_dump | tr_csm_diag_result_history_dump | csm_diag_result | BEFORE | trigger | DELETE | csm_diag_result trigger to amend summarized column(s) on DELETE. | |
fn_csm_diag_run_history_dump | (Stored Procedure) | csm_diag_run | void | _run_id bigint, _end_time timestamp with time zone, _status text, _inserted_ras boolean | csm_diag_run function to amend summarized column(s) on UPDATE and DELETE. (csm_diag_run_history_dump) | ||
fn_csm_dimm_history_dump | tr_csm_dimm_history_dump | csm_dimm | BEFORE | trigger | UPDATE, DELETE | csm_dimm trigger to amend summarized column(s) on UPDATE and DELETE. | |
fn_csm_gpu_history_dump | tr_csm_gpu_history_dump | csm_gpu | BEFORE | trigger | UPDATE, DELETE | csm_gpu trigger to amend summarized column(s) on UPDATE and DELETE. | |
fn_csm_hca_history_dump | tr_csm_hca_history_dump | csm_hca | BEFORE | trigger | UPDATE, DELETE | csm_hca trigger to amend summarized column(s) on UPDATE and DELETE. | |
fn_csm_ib_cable_history_dump | tr_csm_ib_cable_history_dump | csm_ib_cable | BEFORE | trigger | UPDATE, DELETE | csm_ib_cable trigger to amend summarized column(s) on UPDATE and DELETE. | |
fn_csm_ib_cable_inventory_collection | (Stored Procedure) | csm_ib_cable | record | i_record_count integer, i_serial_number text[], i_comment text[], i_guid_s1 text[], i_guid_s2 text[], i_identifier text[], i_length text[], i_name text[], i_part_number text[], i_port_s1 text[], i_port_s2 text[], i_revision text[], i_severity text[], i_type text[], i_width text[], OUT o_insert_count integer, OUT o_update_count integer | function to INSERT and UPDATE ib cable inventory. | ||
fn_csm_lv_history_dump | (Stored Procedure) | csm_lv | void | _logicalvolumename text, _node_name text, _allocationid bigint, _state character, _currentsize bigint, _updatedtime timestamp without time zone, _endtime timestamp without time zone, _numbytesread bigint, _numbyteswritten bigint | csm_lv function to amend summarized column(s) on DELETE. (csm_lv_history_dump) | ||
fn_csm_lv_modified_history_dump | tr_csm_lv_modified_history_dump | csm_lv | BEFORE | trigger | UPDATE | csm_lv_modified_history_dump trigger to amend summarized column(s) on UPDATE. | |
fn_csm_lv_update_history_dump | tr_csm_lv_update_history_dump | csm_lv | BEFORE | trigger | UPDATE | csm_lv_update_history_dump trigger to amend summarized column(s) on UPDATE. | |
fn_csm_lv_upsert | (Stored Procedure) | csm_lv | void | l_logical_volume_name text, l_node_name text, l_allocation_id bigint, l_vg_name text, l_state character, l_current_size bigint, l_max_size bigint, l_begin_time timestamp without time zone, l_updated_time timestamp without time zone, l_file_system_mount text, l_file_system_type text | csm_lv_upsert function to amend summarized column(s) on INSERT. (csm_lv table) | ||
fn_csm_node_attributes_query_details | (Stored Procedure) | csm_node,csm_dimm,csm_gpu,csm_hca,csm_processor,csm_ssd | node_details | i_node_name text | csm_node_attributes_query_details function to HELP CSM API. | ||
fn_csm_node_delete | (Stored Procedure) | csm_node,csm_dimm,csm_gpu,csm_hca,csm_processor,csm_ssd | record | i_node_names text[], OUT o_not_deleted_node_names_count integer, OUT o_not_deleted_node_names text | Function to delete a node, and remove records in the csm_node, csm_ssd, csm_processor, csm_gpu, csm_hca, csm_dimm tables. | ||
fn_csm_node_ready | tr_csm_node_ready | csm_node | BEFORE | trigger | UPDATE | csm_node_ready trigger to amend summarized column(s) on UPDATE. | |
fn_csm_node_update | tr_csm_node_update | csm_node | BEFORE | trigger | UPDATE, DELETE | csm_node_update trigger to amend summarized column(s) on UPDATE and DELETE. | |
fn_csm_processor_history_dump | tr_csm_processor_history_dump | csm_processor | BEFORE | trigger | UPDATE, DELETE | csm_processor trigger to amend summarized column(s) on UPDATE and DELETE. | |
fn_csm_ras_type_update | tr_csm_ras_type_update | csm_ras_type | AFTER | trigger | INSERT, UPDATE,DELETE | csm_ras_type trigger to add rows to csm_ras_type_audit on INSERT and UPDATE and DELETE. (csm_ras_type_update) | |
fn_csm_ssd_dead_records | (Stored Procedure) | csm_vg_ssd, csm_vg, csm_lv | void | i_sn text | Delete any vg and lv on an ssd that is being deleted. | ||
fn_csm_ssd_history_dump | tr_csm_ssd_history_dump | csm_ssd | BEFORE | trigger | UPDATE, DELETE | csm_ssd trigger to amend summarized column(s) on UPDATE and DELETE. | |
fn_csm_step_begin | (Stored Procedure) | csm_step | void | i_step_id bigint, i_allocation_id bigint, i_status text, i_executable text, i_working_directory text, i_argument text, i_environment_variable text, i_num_nodes integer, i_num_processors integer, i_num_gpus integer, i_projected_memory integer, i_num_tasks integer, i_user_flags text, i_node_names text[] | csm_step_begin function to begin a step, adds the step to csm_step and csm_step_node | ||
fn_csm_step_end | (Stored Procedure) | csm_step_node,csm_step | record | i_stepid bigint, i_allocationid bigint, i_exitstatus integer, i_errormessage text, i_cpustats text, i_totalutime double precision, i_totalstime double precision, i_ompthreadlimit text, i_gpustats text, i_memorystats text, i_maxmemory bigint, i_iostats text, OUT o_user_flags text, OUT o_num_nodes integer, OUT o_nodes text | csm_step_end function to delete the step from the nodes table (fn_csm_step_end) | ||
fn_csm_step_history_dump | (Stored Procedure) | csm_step | void | i_stepid bigint, i_allocationid bigint, i_endtime timestamp with time zone, i_exitstatus integer, i_errormessage text, i_cpustats text, i_totalutime double precision, i_totalstime double precision, i_ompthreadlimit text, i_gpustats text, i_memorystats text, i_maxmemory bigint, i_iostats text | csm_step function to amend summarized column(s) on DELETE. (csm_step_history_dump) | ||
fn_csm_step_node_history_dump | tr_csm_step_node_history_dump | csm_step_node | BEFORE | trigger | DELETE | i_switch_name text | csm_step_node trigger to amend summarized column(s) on DELETE. (csm_step_node_history_dump) |
fn_csm_switch_attributes_query_details | (Stored Procedure) | csm_switch,csm_switch_inventory,csm_switch_ports | switch_details | i_record_count integer, i_name text[], i_host_system_guid text[], i_comment text[], i_description text[], i_device_name text[], i_device_type text[], i_max_ib_ports text[], i_module_index text[], i_number_of_chips text[], i_path text[], i_serial_number text[], i_severity text[], i_status text[] | csm_switch_attributes_query_details function to HELP CSM API. | ||
fn_csm_switch_children_inventory_collection | (Stored Procedure) | csm_switch_inventory | void | i_record_count integer, i_switch_name text[], i_comment text[], i_description text[], i_fw_version text[], i_gu_id text[], i_has_ufm_agent text[], i_ip text[], i_model text[], i_num_modules text[], i_num_ports text[], i_physical_frame_location text[], i_physical_u_location text[], i_ps_id text[], i_role text[], i_server_operation_mode text[], i_sm_mode text[], i_state text[], i_sw_version text[], i_system_guid text[], i_system_name text[], i_total_alarms text[], i_type text[], i_vendor text[] | function to INSERT and UPDATE switch children inventory. | ||
fn_csm_switch_history_dump | tr_csm_switch_history_dump | csm_switch | BEFORE | trigger | UPDATE, DELETE | csm_switch trigger to amend summarized column(s) on UPDATE and DELETE. | |
fn_csm_switch_inventory_collection | (Stored Procedure) | csm_switch | void | function to INSERT and UPDATE switch inventory. | |||
fn_csm_switch_inventory_history_dump | tr_csm_switch_inventory_history_dump | csm_switch_inventory | BEFORE | trigger | UPDATE, DELETE | csm_switch_inventory trigger to amend summarized column(s) on UPDATE and DELETE. |
fn_csm_switch_ports_history_dump | tr_csm_switch_ports_history_dump | csm_switch_ports | BEFORE | trigger | UPDATE, DELETE | csm_switch_ports trigger to amend summarized column(s) on UPDATE and DELETE. | |
fn_csm_vg_create | (Stored Procedure) | csm_vg_ssd,csm_vg,csm_ssd | void | i_available_size bigint, i_node_name text, i_ssd_count integer, i_ssd_serial_numbers text[], i_ssd_allocations bigint[], i_total_size bigint, i_vg_name text, i_is_scheduler boolean | Function to create a vg, adds the vg to csm_vg_ssd and csm_vg | |
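Most of these routines are invoked by CSM APIs, but the stored procedures can also be called directly from psql for inspection; a minimal sketch using fn_csm_node_attributes_query_details (the node name shown is hypothetical):

psql -d csmdb -U postgres -c "SELECT * FROM fn_csm_node_attributes_query_details('node01');"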
CSM DB Schema (pdf)¶
(CSM DB schema version 16.1): – Coming soon –
Using csm_db_archive_script.sh¶
This section describes the archiving process associated with the CSM DB history tables. If run alone, the script archives all history tables in the CSM database, including the csm_ras_event_action table.
Usage Overview¶
-bash-4.2$ ./csm_db_history_archive.sh -h
----------------------------------------------------------------------------------------
CSM Database History Archive Usage -
=======================================================================================
-h Display this message.
-t <dir> Target directory to write the archive to.
Default: "/var/log/ibm/csm/archive"
-n <count> Number of records to archive in the run.
Default: 100
-d <db> Database to archive tables from.
Default: "csmdb"
======================================================================================
[Example ] ./csm_db_history_delete.sh -d [dbname] -n [time_interval] -t [/data_dir/]
--------------------------------------------------------------------------------------
Note
This is a general overview of the CSM DB history archive process using the csm_history_wrapper_archive_script_template.sh script.
Script overview¶
The script may largely be broken into the following steps (a SQL sketch of this pattern follows the list):
- Create a temporary table to archive history data based on a condition.
  - Connect to the database with the postgres user.
  - Drop and create the temp table used in the archival process.
  - The first query selects all the fields in the table.
  - The second and third queries are nested queries that define a particular row count, which the user can pass in or which can be set as a default value. The data is filtered using the history_time field.
  - The WHERE clause selects rows whose archive_history_time field is NULL.
  - The user has the option to pass in a row count value (for example, 10,000 records).
  - The data is ordered by history_time ASC.
- Copy all satisfied history data to a json file.
  - Copy all the results from the temp table and append them to a json file.
- Update the archive_history_time field (the archived records can later be deleted during the purging process).
  - Update the csm_[table_name]_history table.
  - Set archive_history_time = current timestamp.
  - FROM clause on the temp table.
  - WHERE clause compares history_time from the history table to the temp table AND history.archive_history_time IS NULL.
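A minimal sketch of the archival pattern described above, shown for csm_node_history with a 100-row batch; the temp table name and output path are illustrative, not the script's actual internals:

psql -d csmdb -U postgres <<'SQL'
DROP TABLE IF EXISTS temp_node_history;
-- select the oldest unarchived rows into a temp table
CREATE TEMP TABLE temp_node_history AS
    SELECT * FROM csm_node_history
    WHERE archive_history_time IS NULL
    ORDER BY history_time ASC
    LIMIT 100;
-- write the batch out as json
\copy (SELECT row_to_json(t) FROM temp_node_history t) TO '/var/log/ibm/csm/archive/csm_node_history.archive.json'
-- mark the archived rows with the current timestamp
UPDATE csm_node_history h
    SET archive_history_time = now()
    FROM temp_node_history t
    WHERE h.history_time = t.history_time
      AND h.archive_history_time IS NULL;
SQL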
Wrapper script¶
The wrapper script takes the following arguments:
- Database name (Default: "csmdb")
- Archive counter (how many records to archive; Default: 100)
- Specified directory to be written to (Default: /var/log/ibm/csm/archive)
[./csm_history_wrapper_archive_script_template.sh] [dbname] [archive_counter] [history_table_name] [/data_dir/]
Attention
If the script below is run manually, it will display the results to the screen. This script only handles per-table archiving.
Script out results¶
-bash-4.2$ ./csm_history_wrapper_archive_script_template.sh csmdb 10000 csm_node_history /tmp/
------------------------------------------------------------------------------
Table | Time | Archive Count
-------------------------------|-------------|--------------------------------
csm_node_history | 0.157 | 10000
------------------------------------------------------------------------------
Date/Time: | 2018-04-05.09.26.36.411615684
DB Name: | csmdb
DB User: | postgres
archive_counter: | 10000
Total time: | 0.157
Average time: | 0.157
------------------------------------------------------------------------------
Attention
While the archive script is running, the user can monitor the results from another session using the csm_db_stats script:
./csm_db_stats.sh -t <db_name>
Note
Directory: Currently the scripts are set up to archive the results in a specified directory.
The history table data will be archived in a json file along with the log file:
csm_db_archive_script.log
csm_node_history.archive.2018-07-30.json
Using csm_db_backup_script_v1.sh¶
To manually perform a cold backup of a CSM database on the system, the following script may be run.
/opt/ibm/csm/db/csm_db_backup_script_v1.sh
Note
This script should be run as the root or postgres user.
Attention
There are a few steps that should be taken before backing up a CSM or related DB on the system.
Backup script actions¶
The following steps are recommended when using the backup script:
- Stop all CSM daemons.
- Run the backup script.
Invocation: | /opt/ibm/csm/db/csm_db_backup_script_v1.sh [DBNAME] [/DIR/] |
---|---|
Default Directory: | /var/lib/pgsql/backups/ |
The script checks for active DB connections; if there are none, the backup process begins. If there are any active connections to the DB, an error message is displayed and the program exits.
To terminate active connections: csm_db_connections_script.sh
- Once the DB has been successfully backed up, the admin can restart the daemons.
Running the csm_db_backup_script_v1.sh¶
Example (-h, --help)¶
./csm_db_backup_script_v1.sh -h, --help
===============================================================================================================
[Info ] csm_db_backup_script_v1.sh : csmdb /tmp/csmdb_backup/
[Info ] csm_db_backup_script_v1.sh : csmdb
[Usage] csm_db_backup_script_v1.sh : [OPTION]... [/DIR/]
---------------------------------------------------------------------------------------------------------------
[Options]
----------------|----------------------------------------------------------------------------------------------
Argument | Description
----------------|----------------------------------------------------------------------------------------------
-h, --help | help menu
----------------|----------------------------------------------------------------------------------------------
[Examples]
---------------------------------------------------------------------------------------------------------------
csm_db_backup_script_v1.sh [DBNAME] | (default) will backup database to/var/lib/pgpsql/backups/ (directory)
csm_db_backup_script_v1.sh [DBNAME] [/DIRECTORY/ | will backup database to specified directory
| if the directory doesnt exist then it will be mode and
| written.
==============================================================================================================
Attention
Common errors
If the user tries to run the script as a local user without PostgreSQL installed and does not provide a database name:
- An info message is displayed ([Info ] Database name is required)
- The usage help menu is also displayed
Example (no options, usage)¶
bash-4.1$ ./csm_db_backup_script_v1.sh
[Info ] Database name is required
===============================================================================================================
[Info ] csm_db_backup_script_v1.sh : csmdb /tmp/csmdb_backup/
[Info ] csm_db_backup_script_v1.sh : csmdb
[Usage] csm_db_backup_script_v1.sh : [OPTION]... [/DIR/]
---------------------------------------------------------------------------------------------------------------
[Options]
----------------|----------------------------------------------------------------------------------------------
Argument | Description
----------------|----------------------------------------------------------------------------------------------
-h, --help | help menu
----------------|----------------------------------------------------------------------------------------------
[Examples]
---------------------------------------------------------------------------------------------------------------
csm_db_backup_script_v1.sh [DBNAME] | (default) will backup database to/var/lib/pgpsql/backups/ (directory)
csm_db_backup_script_v1.sh [DBNAME] [/DIRECTORY/ | will backup database to specified directory
| if the directory doesnt exist then it will be mode and
| written.
Note
If the user tries to run the script as a local user (non-root, with PostgreSQL not installed), the following error is displayed:
Example (postgreSQL not installed)¶
bash-4.1$ ./csm_db_backup_script_v1.sh csmdb /tmp/
-----------------------------------------------------------------------------------------
[Error ] PostgreSQL may not be installed. Please check configuration settings
-----------------------------------------------------------------------------------------
Note
If the user tries to run the script as a local user (non-root, with PostgreSQL not installed) and doesn't specify a directory (default directory: /var/lib/pgsql/backups), the following error is displayed:
Example (no directory specified)¶
bash-4.1$ ./csm_db_backup_script_v1.sh csmdb
-----------------------------------------------------------------------------------------
[Error ] make directory failed for: /var/lib/pgsql/backups/
[Info ] User: csmcarl does not have permission to write to this directory
[Info ] Please specify a valid directory
[Info ] Or log in as the appropriate user
-----------------------------------------------------------------------------------------
Using csm_db_connections_script.sh¶
This script is designed to list and/or kill all active connections to a PostgreSQL database. Logging for this script is placed in /var/log/ibm/csm/csm_db_connections_script.log
Usage Overview¶
/opt/ibm/csm/db/csm_db_connections_script.sh -h, --help.
The help command (-h, --help) will list each of the available options.
Options | Description | Result |
---|---|---|
running the script with no options | ./csm_db_connections_script.sh | Try ‘csm_db_connections_script.sh --help’ for more information. |
running the script with -l, --list | ./csm_db_connections_script.sh -l, --list | list database sessions. |
running the script with -k, --kill | ./csm_db_connections_script.sh -k, --kill | kill/terminate database sessions. |
running the script with -f, --force | ./csm_db_connections_script.sh -f, --force | force kill (do not ask for confirmation; use in conjunction with the -k option). |
running the script with -u, --user | ./csm_db_connections_script.sh -u, --user | specify database user name. |
running the script with -p, --pid | ./csm_db_connections_script.sh -p, --pid | specify database user process id (pid). |
running the script with -h, --help | ./csm_db_connections_script.sh -h, --help | see details below |
Example (usage)¶
-bash-4.2$ ./csm_db_connections_script.sh --help
[Info ] PostgreSQL is installed
=================================================================================================================
[Info ] csm_db_connections_script.sh : List/Kill database user sessions
[Usage] csm_db_connections_script.sh : [OPTION]... [USER]
-----------------------------------------------------------------------------------------------------------------
[Options]
----------------|------------------------------------------------------------------------------------------------
Argument | Description
----------------|------------------------------------------------------------------------------------------------
-l, --list | list database sessions
-k, --kill | kill/terminate database sessions
-f, --force | force kill (do not ask for confirmation,
| use in conjunction with -k option)
-u, --user | specify database user name
-p, --pid | specify database user process id (pid)
-h, --help | help menu
----------------|------------------------------------------------------------------------------------------------
[Examples]
-----------------------------------------------------------------------------------------------------------------
csm_db_connections_script.sh -l, --list | list all session(s)
csm_db_connections_script.sh -l, --list -u, --user [USERNAME] | list user session(s)
csm_db_connections_script.sh -k, --kill | kill all session(s)
csm_db_connections_script.sh -k, --kill -f, --force | force kill all session(s)
csm_db_connections_script.sh -k, --kill -u, --user [USERNAME] | kill user session(s)
csm_db_connections_script.sh -k, --kill -p, --pid [PIDNUMBER]| kill user session with a specific pid
=================================================================================================================
Listing all DB connections¶
To display all current DB connections:
/opt/ibm/csm/db/csm_db_connections_script.sh --list
Example (-l, --list)¶
-bash-4.2$ ./csm_db_connections_script.sh -l
-----------------------------------------------------------------------------------------------------------
[Start] Welcome to CSM datatbase connections script.
[Info ] PostgreSQL is installed
===========================================================================================================
[Info ] Database Session | (all_users): 13
-----------------------------------------------------------------------------------------------------------
pid | database | user | connection_duration
-------+----------+----------+---------------------
61427 | xcatdb | xcatadm | 02:07:26.587854
61428 | xcatdb | xcatadm | 02:07:26.586227
73977 | postgres | postgres | 00:00:00.000885
72657 | csmdb | csmdb | 00:06:17.650398
72658 | csmdb | csmdb | 00:06:17.649185
72659 | csmdb | csmdb | 00:06:17.648012
72660 | csmdb | csmdb | 00:06:17.646846
72661 | csmdb | csmdb | 00:06:17.645662
72662 | csmdb | csmdb | 00:06:17.644473
72663 | csmdb | csmdb | 00:06:17.643285
72664 | csmdb | csmdb | 00:06:17.642105
72665 | csmdb | csmdb | 00:06:17.640927
72666 | csmdb | csmdb | 00:06:17.639771
(13 rows)
===========================================================================================================
To display specified user(s) currently connected to the DB:
run /opt/ibm/csm/db/csm_db_connections_script.sh (-l, --list -u, --user <username>).
Note
The script will display the total number of sessions for all users, along with the session count for the specified user.
Example (-l, --list -u, --user)¶
-bash-4.2$ ./csm_db_connections_script.sh -l -u postgres
------------------------------------------------------------------------------------------------------
[Start] Welcome to CSM datatbase connections script.
[Info ] DB user: postgres is connected
[Info ] PostgreSQL is installed
==============================================================================================================
[Info ] Database Session | (all_users): 13
[Info ] Session List | (postgres): 1
------------------------------------------------------------------------------------------------------
pid | database | user | connection_duration
-------+----------+----------+---------------------
74094 | postgres | postgres | 00:00:00.000876
(1 row)
==============================================================================================================
Example (not specifying user or invalid user in the system)¶
-bash-4.2$ ./csm_db_connections_script.sh -k -u
[Error] Please specify user name
------------------------------------------------------------------------------------------------------
-bash-4.2$ ./csm_db_connections_script.sh -k -u csmdbsadsd
[Error] DB user: csmdbsadsd is not connected or is invalid
------------------------------------------------------------------------------------------------------
Kill all DB connections¶
The user has the ability to kill all DB connections by using the -k, --kill option:
run /opt/ibm/csm/db/csm_db_connections_script.sh (-k, --kill).
Note
If this option is chosen by itself, the script will prompt a yes/no confirmation for each session. The user can choose to kill or keep each session individually. All responses are logged to:
/var/log/ibm/csm/csm_db_connections_script.log
Example (-k, --kill)¶
-bash-4.2$ ./csm_db_connections_script.sh -k
------------------------------------------------------------------------------------------------------
[Start] Welcome to CSM datatbase connections script.
[Info ] PostgreSQL is installed
[Info ] Kill database session (PID:61427) [y/n] ?:
======================================================================================================
-bash-4.2$ ./csm_db_connections_script.sh -k
------------------------------------------------------------------------------------------------------
[Start] Welcome to CSM datatbase connections script.
[Info ] PostgreSQL is installed
[Info ] Kill database session (PID:61427) [y/n] ?:
[Info ] User response: n
[Info ] Kill database session (PID:61428) [y/n] ?:
[Info ] User response: n
[Info ] Kill database session (PID:74295) [y/n] ?:
[Info ] User response: n
[Info ] Kill database session (PID:72657) [y/n] ?:
[Info ] User response: n
[Info ] Kill database session (PID:72658) [y/n] ?:
[Info ] User response: n
[Info ] Kill database session (PID:72659) [y/n] ?:
[Info ] User response: n
[Info ] Kill database session (PID:72660) [y/n] ?:
[Info ] User response: n
[Info ] Kill database session (PID:72661) [y/n] ?:
[Info ] User response: n
[Info ] Kill database session (PID:72662) [y/n] ?:
[Info ] User response: n
[Info ] Kill database session (PID:72663) [y/n] ?:
[Info ] User response: n
[Info ] Kill database session (PID:72664) [y/n] ?:
[Info ] User response: n
[Info ] Kill database session (PID:72665) [y/n] ?:
[Info ] User response: n
[Info ] Kill database session (PID:72666) [y/n] ?:
[Info ] User response: n
============================================================================================================
Force kill all DB connections¶
The user has the ability to force kill all DB connections by using the -k, --kill -f, --force option:
run /opt/ibm/csm/db/csm_db_connections_script.sh (-k, --kill -f, --force).
Warning
If this option is chosen, the script will kill all open sessions without prompting for confirmation.
All responses are logged to:
/var/log/ibm/csm/csm_db_connections_script.log
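The log file example later in this section shows that the script terminates sessions at the operating-system level with kill -TERM. As a point of reference only (this is not what the CSM script does), a roughly equivalent effect can be achieved from inside PostgreSQL with pg_terminate_backend():
-- Alternative shown for reference only: terminate a single backend by pid,
-- or every backend connected to csmdb, from a psql session.
SELECT pg_terminate_backend(61427);

SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'csmdb'
  AND pid <> pg_backend_pid();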
Example (-k, --kill -f, --force)¶
-bash-4.2$ ./csm_db_connections_script.sh -k -f
------------------------------------------------------------------------------------------------------
[Start] Welcome to CSM datatbase connections script.
[Info ] PostgreSQL is installed
[Info ] Killing session (PID:61427)
[Info ] Killing session (PID:61428)
[Info ] Killing session (PID:74295)
[Info ] Killing session (PID:72657)
[Info ] Killing session (PID:72658)
[Info ] Killing session (PID:72659)
[Info ] Killing session (PID:72660)
[Info ] Killing session (PID:72661)
[Info ] Killing session (PID:72662)
[Info ] Killing session (PID:72663)
[Info ] Killing session (PID:72664)
[Info ] Killing session (PID:72665)
./csm_db_connections_script.sh: line 360: kill: (72665) – No such process
=============================================================================================================
Example (Log file output)¶
2017-11-01 15:54:27 (postgres) [Start] Welcome to CSM datatbase automation stats script.
2017-11-01 15:54:27 (postgres) [Info ] DB Names: template1 | template0 | postgres |
2017-11-01 15:54:27 (postgres) [Info ] DB Names: xcatdb | csmdb
2017-11-01 15:54:27 (postgres) [Info ] PostgreSQL is installed
2017-11-01 15:54:27 (postgres) [Info ] Script execution: csm_db_connections_script.sh -k, --kill
2017-11-01 15:54:29 (postgres) [Info ] Killing user session (PID:61427) kill –TERM 61427
2017-11-01 15:54:29 (postgres) [Info ] Killing user session (PID:61428) kill –TERM 61428
2017-11-01 15:54:29 (postgres) [Info ] Killing user session (PID:74295) kill –TERM 74295
2017-11-01 15:54:29 (postgres) [Info ] Killing user session (PID:72657) kill –TERM 72657
2017-11-01 15:54:29 (postgres) [Info ] Killing user session (PID:72658) kill –TERM 72658
2017-11-01 15:54:30 (postgres) [Info ] Killing user session (PID:72659) kill –TERM 72659
2017-11-01 15:54:30 (postgres) [Info ] Killing user session (PID:72660) kill –TERM 72660
2017-11-01 15:54:30 (postgres) [Info ] Killing user session (PID:72661) kill –TERM 72661
2017-11-01 15:54:30 (postgres) [Info ] Killing user session (PID:72662) kill –TERM 72662
2017-11-01 15:54:31 (postgres) [Info ] Killing user session (PID:72663) kill –TERM 72663
2017-11-01 15:54:31 (postgres) [Info ] Killing user session (PID:72664) kill –TERM 72664
2017-11-01 15:54:31 (postgres) [Info ] Killing user session (PID:72665) kill –TERM 72665
2017-11-01 15:54:31 (postgres) [Info ] Killing user session (PID:72666) kill –TERM 72666
2017-11-01 15:54:31 (postgres) [End ] Postgres DB kill query executed
-----------------------------------------------------------------------------------------------------------
Kill user connection(s)¶
The user has the ability to kill specific user DB connections by using the -k, --kill option along with -u, --user:
run /opt/ibm/csm/db/csm_db_connections_script.sh (-k, --kill -u, --user <username>).
Note
If this option is chosen, the script will prompt a yes/no confirmation for each of the specified user's sessions. The user can choose to kill or keep each session individually.
All responses are logged to:
/var/log/ibm/csm/csm_db_kill_script.log
Example (-k, --kill -u, --user <username>)¶
-bash-4.2$ ./csm_db_connections_script.sh -k -u csmdb
------------------------------------------------------------------------------------------------------
[Start] Welcome to CSM datatbase connections script.
[Info ] DB user: csmdb is connected
[Info ] PostgreSQL is installed
[Info ] Kill database session (PID:61427) [y/n] ?:
------------------------------------------------------------------------------------------------------
Example (Single session user kill)¶
-bash-4.2$ ./csm_db_connections_script.sh -k -u csmdb
------------------------------------------------------------------------------------------------------
[Start] Welcome to CSM datatbase connections script.
[Info ] DB user: csmdb is connected
[Info ] PostgreSQL is installed
[Info ] Kill database session (PID:61427) [y/n] ?:y
[Info ] Killing session (PID:61427)
------------------------------------------------------------------------------------------------------
Example (Multiple session user kill)¶
-bash-4.2$ ./csm_db_connections_script.sh -k -u csmdb
------------------------------------------------------------------------------------------------------
[Start] Welcome to CSM datatbase connections script.
[Info ] DB user: csmdb is connected
[Info ] PostgreSQL is installed
[Info ] Kill database session (PID:61427) [y/n] ?:y
[Info ] Killing session (PID:61427)
[Info ] Kill database session (PID: 61428) [y/n] ?:y
[Info ] Killing session (PID:61428)
------------------------------------------------------------------------------------------------------
Kill PID connection(s)¶
The user has the ability to kill a specific DB connection by using the -k, --kill option along with -p, --pid:
run /opt/ibm/csm/db/csm_db_connections_script.sh (-k, --kill -p, --pid <pidnumber>).
Note
If this option is chosen, the script will prompt a yes/no confirmation for the matching session.
The response is logged to:
/var/log/ibm/csm/csm_db_connections_script.log
Example (-k, --kill -p, --pid <pidnumber>)¶
-bash-4.2$ ./csm_db_connections_script.sh -k -p 61427
---------------------------------------------------------------------------------------------------------
[Start] Welcome to CSM datatbase connections script.
[Info ] DB PID: 61427 is connected
[Info ] PostgreSQL is installed
[Info ] Kill database session (PID:61427) [y/n] ?:
---------------------------------------------------------------------------------------------------------
-bash-4.2$ ./csm_db_connections_script.sh -k -p 61427
---------------------------------------------------------------------------------------------------------
[Start] Welcome to CSM datatbase connections script.
[Info ] DB PID: 61427 is connected
[Info ] PostgreSQL is installed
[Info ] Kill database session (PID:61427) [y/n] ?:y
[Info ] Killing session (PID:61427)
---------------------------------------------------------------------------------------------------------
Using csm_db_history_delete.sh¶
This section describes the deletion process associated with the CSM Database history tables. When run on its own, the script deletes records from all history tables, including the csm_ras_event_action table, that contain a non-null archive history timestamp.
Usage Overview¶
-bash-4.2$ ./csm_db_history_delete.sh -h
----------------------------------------------------------------------------------------
CSM Database History Delete Usage
===================================================================================
-h Display this message.
-t <dir> Target directory to write the delete log to.
Default: "/var/log/ibm/csm/delete"
-n <time-count mins.> The time (in mins.) of oldest records which to delete.
Attention: requires users input value
-d <db> Database to delete tables from.
Attention: requires users input value
===================================================================================
[Example ] ./csm_db_history_delete.sh -d [dbname] -n [time_interval] -t [/data_dir/]
----------------------------------------------------------------------------------------
Note
This is a general overview of the CSM DB deletion process using the csm_history_wrapper_delete_script_template.sh script.
The csm_history_wrapper_delete_script_template.sh script (when called manually) deletes history records that have been archived with an archive_history_time timestamp. Records in the history table that do not have an archive_history_time will remain in the system until they have been archived. A minimal SQL sketch of this delete step follows the list below.
- The csm_history_wrapper_delete_script_template.sh script accepts the following flags:
- Database name
- Interval time (in minutes)
- History table name
- Specified directory to be written to
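As a rough illustration of the delete criteria described above (csm_node_history and a 1-minute interval are used purely as examples; the wrapper script performs the actual deletion, timing, and logging):
-- Illustrative sketch only: remove already-archived rows older than the
-- requested interval from one history table.
DELETE FROM csm_node_history
WHERE archive_history_time IS NOT NULL
  AND history_time < now() - interval '1 minute';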
Example (wrapper script)¶
[./csm_history_wrapper_delete_script_template.sh] [dbname] [time_mins] [history_table_name] [data_dir]
Attention
The deletion wrapper script handles one history table per invocation.
Example (Script out results)¶
-bash-4.2$ ./csm_history_wrapper_delete_script_template.sh csmdb 1 csm_node_history /tmp/csm_node_history_delete/
------------------------------------------------------------------------------
Table | Time | Delete Count (DB Actual)
-------------------------------|-------------|--------------------------------
csm_node_history | 0.005 | 0
------------------------------------------------------------------------------
Date/Time: | 2018-04-05.09.57.42
DB Name: | csmdb
DB User: | postgres
Interval time (cmd-line): | 1 Min(s).
Total time (Cleanup): | 0.005
------------------------------------------------------------------------------
Note
Directory: Currently the scripts are set up to write the results to a specified directory.
Example output file:
The history delete timing results will be logged in:
csm_db_delete_script.log
Using csm_db_schema_version_upgrade_16_1.sh¶
Important
Steps to take before migrating to the newest DB schema version:
- Stop all CSM daemons
- Run a cold backup of the csmdb or specified DB (csm_db_backup_script_v1.sh)
- Install the newest RPMs
- Run the csm_db_schema_version_upgrade_16_1.sh
- Start CSM daemons
Attention
To migrate the CSM database from 15.0, 15.1, or 16.0 to the newest schema version:
run /opt/ibm/csm/db/csm_db_schema_version_upgrade_16_1.sh <my_db_name>
Note
The csm_db_schema_version_upgrade_16_1.sh script creates a log file: /var/log/ibm/csm/csm_db_schema_upgrade_script.log (the target schema version is 16.1).
Note
For a quick overview of the script functionality:
run /opt/ibm/csm/db/csm_db_schema_version_upgrade_16_1.sh -h, --help
If the script is run without any options, then the usage function is displayed.
Usage Overview¶
-bash-4.2$ ./csm_db_schema_version_upgrade_16_1.sh -h
-------------------------------------------------------------------------------------------------
[Start ] Welcome to CSM database schema version upgrade schema script.
[Error ] Please specify DB name
=================================================================================================
[Info ] csm_db_schema_version_upgrade.sh : Load CSM DB upgrade schema file
[Usage ] csm_db_schema_version_upgrade.sh : csm_db_schema_version_upgrade.sh [DBNAME]
-------------------------------------------------------------------------------------------------
Argument | DB Name | Description
-----------------|-----------|-------------------------------------------------------------------
script_name | [db_name] | Imports sql upgrades to csm db table(s) (appends)
| | fields, indexes, functions, triggers, etc
-----------------|-----------|-------------------------------------------------------------------
=================================================================================================
Upgrading CSM DB (manual process)¶
Note
To upgrade the CSM or specified DB:
run /opt/ibm/csm/db/csm_db_schema_version_upgrade_16_1.sh <my_db_name> (where my_db_name is the name of your DB).
Note
The script will check to see if the given DB name exists. If the database name does not exist, then it will exit with an error message.
Example (non DB existence):¶
-bash-4.2$ ./csm_db_schema_version_upgrade_16_1.sh csmdb
-------------------------------------------------------------------------------------
[Start ] Welcome to CSM database schema version upgrate script.
[Info ] PostgreSQL is installed
[Error ] Cannot perform action because the csmdb database does not exist. Exiting.
-------------------------------------------------------------------------------------
Note
- The script will check for the existence of these files:
csm_db_schema_version_data.csv
csm_create_tables.sql
csm_create_triggers.sql
When an upgrade process happens, the new RPM will consist of a new schema version CSV, a DB create tables file, and/or a create triggers/functions file to be loaded into a (completely new) DB.
Example (non csv_file_name existence):¶
-bash-4.2$ ./csm_db_schema_version_upgrade_16_1.sh csmdb
-------------------------------------------------------------------------------------
[Start ] Welcome to CSM database schema version upgrate script.
[Error ] File csm_db_schema_version_data.csv can not be located or doesnt exist
-------------------------------------------------------------------------------------
Note
The second check makes sure the file exists and compares the actual SQL upgrade version to the hardcoded version number. If the criteria are met, then the script will proceed. If the check fails, then an error message is displayed.
Example (non compatible migration):¶
-bash-4.2$ ./csm_db_schema_version_upgrade_16_1.sh csmdb
-------------------------------------------------------------------------------------
[Start ] Welcome to CSM database schema version upgrate script.
[Error ] Cannot perform action because not compatible.
[Info ] Required DB schema version 15.0, 15.1, 16.0, or appropriate files in directory
[Info ] csmdb current_schema_version is running: 15.1
[Info ] csm_create_tables.sql file currently in the directory is: 15.1 (required version) 16.1
[Info ] csm_create_triggers.sql file currently in the directory is: 16.1 (required version) 16.1
[Info ] csm_db_schema_version_data.csv file currently in the directory is: 16.1 (required version) 16.1
[Info ] Please make sure you have the latest RPMs installed and latest DB files.
-------------------------------------------------------------------------------------
Note
If the user selects the "n/no" option when prompted to migrate to the newest DB schema upgrade, then the program will exit with the message below.
Example (user prompt execution with “n/no” option):¶
-bash-4.2$ ./csm_db_schema_version_upgrade_16_1.sh csmdb
-------------------------------------------------------------------------------------
[Start ] Welcome to CSM database schema version upgrate script.
[Info ] PostgreSQL is installed
[Info ] csmdb current_schema_version 15.1
[Info ] csmdb schema_version_upgrade: 16.1
[Warning ] This will migrate csmdb database to schema version 16.1. Do you want to continue [y/n]?:
[Info ] User response: n
[Error ] Migration session for DB: csmdb User response: ****(NO)**** not updated
---------------------------------------------------------------------------------------------------------------
Note
If the user selects the "y/yes" option when prompted to migrate to the newest DB schema upgrade, then the program will begin execution.
Example (user prompt execution with “y/yes” option):¶
-bash-4.2$ ./csm_db_schema_version_upgrade.sh csmdb
-------------------------------------------------------------------------------------
[Start ] Welcome to CSM database schema version upgrade script.
[Info ] PostgreSQL is installed
[Info ] csmdb current_schema_version 15.1
[Info ] csmdb schema_version_upgrade: 16.1
[Warning ] This will migrate csmdb database to schema version 16.1. Do you want to continue [y/n]?:
[Info ] User response: y
[Info ] csmdb migration process begin.
[Info ] There are no connections to csmdb
[Complete] csmdb database schema update 16.1.
---------------------------------------------------------------------------------------------------------------
Note
If the migration script has already run, or a new database has been created with the latest schema version of 16.1, then this message is displayed to the user.
Running the script with existing newer version¶
-bash-4.2$ ./csm_db_schema_version_upgrade_16_1.sh csmdb
-------------------------------------------------------------------------------------------------
[Start ] Welcome to CSM database schema version upgrade script.
[Info ] PostgreSQL is installed
[Info ] csmdb is currently running db schema version: 16.1
-------------------------------------------------------------------------------------------------
Warning
If there are existing DB connections, then the migration script will display a message and the admin will have to terminate those connections before proceeding.
Hint
The csm_db_connections_script.sh script can be used with the -l option to quickly list the current connections (see the usage section above, or run it with -h). The script can terminate user sessions based on pids or users, and the -f force option will kill all connections if necessary. Once the connections are terminated, the csm_db_schema_version_upgrade_16_1.sh script can be executed. The log message will display the connected user, database name, connection count, and duration.
Example (user prompt execution with “y/yes” option and existing DB connection(s)):¶
-bash-4.2$ ./csm_db_schema_version_upgrade_16_1.sh csmdb
---------------------------------------------------------------------------------------------------
[Start ] Welcome to CSM database schema version upgrate script.
[Info ] PostgreSQL is installed
[Info ] csmdb current_schema_version 15.1
[Info ] csmdb schema_version_upgrade: 16.1
[Warning ] This will migrate csmdb database to schema version 16.1. Do you want to continue [y/n]?:
[Info ] User response: y
[Info ] csmdb migration process begin.
[Error ] csmdb has existing connection(s) to the database.
[Error ] User: csmdb has 1 connection(s)
[Info ] See log file for connection details
---------------------------------------------------------------------------------------------------
Using csm_db_script.sh¶
Note
For a quick overview of the script functionality:
run /opt/ibm/csm/db/csm_db_script.sh -h, --help.
This help command <-h, --help> lists each of the available options.
Usage Overview¶
A new DB set up <default db> | Command | Result |
---|---|---|
running the script with no options | ./csm_db_script.sh | This will create a default db with tables and populated data <specified by user or db admin> |
running the script with -x, --nodata | ./csm_db_script.sh -x ./csm_db_script.sh --nodata | This will create a default db with tables and no populated data |
A new DB set up <new user db> | Command | Result |
---|---|---|
running the script with -n, --newdb | ./csm_db_script.sh -n <my_db_name> ./csm_db_script.sh --newdb <my_db_name> | This will create a new db with tables and populated data. |
running the script with -n, --newdb, -x, --nodata | ./csm_db_script.sh -n <my_db_name> -x ./csm_db_script.sh --newdb <my_db_name> --nodata | This will create a new db with tables and no populated data. |
If a DB already exists | Command | Result |
---|---|---|
Drop DB totally | ./csm_db_script.sh -d <my_db_name> ./csm_db_script.sh --delete <my_db_name> | This will totally remove the DB from the system |
Drop only the existing CSM DB tables | ./csm_db_script.sh -e <my_db_name> ./csm_db_script.sh --eliminatetables <my_db_name> | This will only drop the specified CSM DB tables. <useful if integrated within another DB, e.g. “XCATDB”> |
Force overwrite of existing DB. | ./csm_db_script.sh -f <my_db_name> ./csm_db_script.sh --force <my_db_name> | This will totally drop the existing tables in the DB and recreate them with populated table data. |
Force overwrite of existing DB. | ./csm_db_script.sh -f <my_db_name> -x ./csm_db_script.sh --force <my_db_name> --nodata | This will totally drop the existing tables in the DB and recreate them without table data. |
Remove just the data from all the tables in the DB | ./csm_db_script.sh -r <my_db_name> ./csm_db_script.sh --removetabledata <my_db_name> | This will totally remove all data from all the tables within the DB. |
Example (usage)¶
bash 4.2$ ./csm_db_script.sh -h
===============================================================================================================
[Info ] csm_db_script.sh : CSM database creation script with additional features
[Usage] csm_db_script.sh : [OPTION]... [DBNAME]... [OPTION]
---------------------------------------------------------------------------------------------------------------
[Options]
-----------------------|-----------|---------------------------------------------------------------------------
Argument | DB Name | Description
-----------------------|-----------|---------------------------------------------------------------------------
-x, --nodata | [DEFAULT] | creates database with tables and does not pre populate table data
| [db_name] | this can also be used with the -f --force, -n --newdb option when
| | recreating a DB. This should follow the specified DB name
-d, --delete | [db_name] | totally removes the database from the system
-e, --eliminatetables| [db_name] | drops CSM tables from the database
-f, --force | [db_name] | drops the existing tables in the DB, recreates and populates with table data
-n, --newdb | [db_name] | creates a new database with tables and populated data
-r, --removetabledata| [db_name] | removes data from all database tables
-h, --help | | help
-----------------------|-----------|-----------------------------------------------------------------------------
[Examples]
-----------------------------------------------------------------------------------------------------------------
[DEFAULT] csm_db_script.sh | |
[DEFAULT] csm_db_script.sh -x, --nodata | |
csm_db_script.sh -d, --delete | [DBNAME] |
csm_db_script.sh -e, --eliminatetables | [DBNAME] |
csm_db_script.sh -f, --force | [DBNAME] |
csm_db_script.sh -f, --force | [DBNAME] | -x, --nodata
csm_db_script.sh -n, --newdb | [DBNAME] |
csm_db_script.sh -n, --newdb | [DBNAME] | -x, --nodata
csm_db_script.sh -r, --removetabledata | [DBNAME] |
csm_db_script.sh -h, --help | |
===============================================================================================================
Note
Setting up or creating a new DB <manually>
To create your own DB¶
run /opt/ibm/csm/db/csm_db_script.sh -n, --newdb <my_db_name>.
By default, if no DB name is specified, then the script will create a DB called csmdb.
Example (successful DB creation):¶
$ /opt/ibm/csm/db/csm_db_script.sh
------------------------------------------------------------------------------------------------------
[Start ] Welcome to CSM database automation script.
[Info ] PostgreSQL is installed
[Info ] csmdb database user: csmdb already exists
[Complete] csmdb database created.
[Complete] csmdb database tables created.
[Complete] csmdb database functions and triggers created.
[Complete] csmdb table data loaded successfully into csm_db_schema_version
[Complete] csmdb table data loaded successfully into csm_ras_type
[Info ] csmdb DB schema version <16.1>
------------------------------------------------------------------------------------------------------
Note
The script checks to see if the given name exists. If the database does not exist, then it will be created. If the database already exists, then the script displays an error message indicating that a database with this name already exists and exits the program.
Example (DB already exists)¶
$ /opt/ibm/csm/db/csm_db_script.sh
------------------------------------------------------------------------------------------------------
[Info ] PostgreSQL is installed
[Error ] Cannot perform action because the csmdb database already exists. Exiting.
------------------------------------------------------------------------------------------------------
- The script automatically populates data in specified tables using CSV files; for example, RAS message type data is loaded into the csm_ras_type table. If a user does not want to populate these tables, they should pass the -x, --nodata flag on the command line during the initial setup process:
/opt/ibm/csm/db/csm_db_script.sh -x, --nodata
Example (Default DB creation without loaded data option)¶
$ /opt/ibm/csm/db/csm_db_script.sh -x
------------------------------------------------------------------------------------------------------
[Info ] PostgreSQL is installed
[Info ] csmdb database user: csmdb already exists
[Complete] csmdb database created.
[Complete] csmdb database tables created.
[Complete] csmdb database functions and triggers created.
[Info ] csmdb skipping data load process. <----------[when running the -x, --nodata option]
[Complete] csmdb initialized csm_db_schema_version data
[Info ] csmdb DB schema version <16.1>
------------------------------------------------------------------------------------------------------
Existing DB Options¶
Note
There are some other features in this script that will assist users in a “clean-up” process. If the database already exists, then these actions can be performed on it.
1. Delete the database: /opt/ibm/csm/db/csm_db_script.sh -d, --delete followed by the <my_db_name>
Example (Delete existing DB)¶
$ /opt/ibm/csm/db/csm_db_script.sh -d csmdb
------------------------------------------------------------------------------------------------------
[Info ] PostgreSQL is installed
[Info ] This will drop csmdb database including all tables and data. Do you want to continue [y/n]?y
[Complete] csmdb database deleted
------------------------------------------------------------------------------------------------------
2. Remove just the data from all the tables: /opt/ibm/csm/db/csm_db_script.sh -r, --removetabledata followed by the <my_db_name>
Example (Remove data from DB tables)¶
$ /opt/ibm/csm/db/csm_db_script.sh -r csmdb
------------------------------------------------------------------------------------------------------
[Info ] PostgreSQL is installed
[Complete] csmdb database data deleted from all tables excluding csm_schema_version and
csm_db_schema_version_history tables
------------------------------------------------------------------------------------------------------
3. Force a total overwrite of the database <drops tables and recreates them>: /opt/ibm/csm/db/csm_db_script.sh -f, --force followed by the <my_db_name>. This option auto-populates table data.
Example (Force DB recreation)¶
$ /opt/ibm/csm/db/csm_db_script.sh -f csmdb
------------------------------------------------------------------------------------------------------
[Start ] Welcome to CSM database automation script.
[Info ] PostgreSQL is installed
[Info ] csmdb database user: csmdb already exists
[Complete] csmdb database tables and triggers dropped
[Complete] csmdb database functions dropped
[Complete] csmdb database tables recreated.
[Complete] csmdb database functions and triggers recreated.
[Complete] csmdb table data loaded successfully into csm_db_schema_version
[Complete] csmdb table data loaded successfully into csm_ras_type
[Info ] csmdb DB schema version <16.1>
------------------------------------------------------------------------------------------------------
4. Force a total overwrite of the database <drops tables and recreates them without prepopulated data>: /opt/ibm/csm/db/csm_db_script.sh -f, --force followed by the <my_db_name>, followed by -x, --nodata. This option does not populate table data.
Example (Force DB recreation without preloaded table data)¶
$ /opt/ibm/csm/db/csm_db_script.sh -f csmdb -x
------------------------------------------------------------------------------------------------------
[Start ] Welcome to CSM database automation script.
[Info ] PostgreSQL is installed
[Info ] csmdb database user: csmdb already exists
[Complete] csmdb database tables and triggers dropped
[Complete] csmdb database functions dropped
[Complete] csmdb database tables recreated.
[Complete] csmdb database functions and triggers recreated.
[Complete] csmdb skipping data load process.
[Complete] csmdb table data loaded successfully into csm_db_schema_version
[Info ] csmdb DB schema version <16.1>
------------------------------------------------------------------------------------------------------
CSMDB user info.¶
5. The "csmdb" user will remain in the system unless an admin manually deletes it.
If the user has to be deleted for any reason, the admin can run this command inside a psql postgres DB connection: DROP USER csmdb.
If any databases are currently being accessed by this user, then the admin will get a response similar to the example below:
ERROR: database "csmdb" is being accessed by other users
DETAIL: There is 1 other session using the database.
Warning
It is not recommended to delete the csmdb user.
su - postgres
psql -t -q -U postgres -d postgres -c "DROP USER csmdb;"
psql -t -q -U postgres -d postgres -c "CREATE USER csmdb;"
Note
The command below can be executed if specific privileges are needed.
psql -t -q -U postgres -d postgres -c "GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO csmdb"
Note
If the admin wants to change the ownership of the DB (for example, to postgres or back to csmdb), then use one of the commands below.
ALTER DATABASE csmdb OWNER TO postgres
ALTER DATABASE csmdb OWNER TO csmdb
Please see the log file for details:
/var/log/ibm/csm/csm_db_script.log
Using csm_db_stats.sh script¶
This script will gather statistical information related to the CSM DB, including: table data activity, index related information, table lock monitoring, CSM DB schema version, DB connection stats, DB user stats, and the PostgreSQL version installed.
Note
For a quick overview of the script functionality,
run /opt/ibm/csm/db/csm_db_stats.sh -h, --help.
This help command <-h, --help> will list each of the available options.
The csm_db_stats.sh script creates a log file for each query executed (please see the log file for details): /var/log/ibm/csm/csm_db_stats.log
Usage Overview¶
Options | Command | Result |
---|---|---|
Table data activity | ./csm_db_stats.sh -t <my_db_name> ./csm_db_stats.sh --tableinfo <my_db_name> | see details below |
Index related information | ./csm_db_stats.sh -i <my_db_name> ./csm_db_stats.sh --indexinfo <my_db_name> | see details below |
Index analysis information | ./csm_db_stats.sh -x <my_db_name> ./csm_db_stats.sh --indexanalysis <my_db_name> | see details below |
Table Locking Monitoring | ./csm_db_stats.sh -l <my_db_name> ./csm_db_stats.sh --lockinfo <my_db_name> | see details below |
Schema Version Query | ./csm_db_stats.sh -s <my_db_name> ./csm_db_stats.sh --schemaversion <my_db_name> | see details below |
DB connections stats Query | ./csm_db_stats.sh -c <my_db_name> ./csm_db_stats.sh --connectionsdb <my_db_name> | see details below |
DB user stats query | ./csm_db_stats.sh -u <my_db_name> ./csm_db_stats.sh --usernamedb <my_db_name> | see details below |
PostgreSQL Version Installed | ./csm_db_stats.sh -v csmdb ./csm_db_stats.sh --postgresqlversion csmdb | see details below |
DB Archiving Stats | ./csm_db_stats.sh -a csmdb ./csm_db_stats.sh --archivecount csmdb | see details below |
-h, --help | ./csm_db_stats.sh -h, --help | see details below |
Example (usage)¶
-bash-4.2$ ./csm_db_stats.sh --help
=================================================================================================
[Info ] csm_db_stats.sh : List/Kill database user sessions
[Usage] csm_db_stats.sh : [OPTION]... [DBNAME]
-------------------------------------------------------------------------------------------------
Argument | DB Name | Description
-------------------------|-----------|-----------------------------------------------------------
-t, --tableinfo | [db_name] | Populates Database Table Stats:
| | Live Row Count, Inserts, Updates, Deletes, and Table Size
-i, --indexinfo | [db_name] | Populates Database Index Stats:
| | tablename, indexname, num_rows, tbl_size, ix_size, uk,
| | num_scans, tpls_read, tpls_fetched
-x, --indexanalysis | [db_name] | Displays the index usage analysis
-l, --lockinfo | [db_name] | Displays any locks that might be happening within the DB
-s, --schemaversion | [db_name] | Displays the current CSM DB version
-c, --connectionsdb | [db_name] | Displays the current DB connections
-u, --usernamedb | [db_name] | Displays the current DB user names and privileges
-v, --postgresqlversion | [db_name] | Displays the current version of PostgreSQL installed
| | along with environment details
-a, --archivecount | [db_name] | Displays the archived and non archive record counts
-h, --help | | help
-------------------------|-----------|-----------------------------------------------------------
[Examples]
-------------------------------------------------------------------------------------------------
csm_db_stats.sh -t, --tableinfo [dbname] | Database table stats
csm_db_stats.sh -i, --indexinfo [dbname] | Database index stats
csm_db_stats.sh -x, --indexanalysisinfo [dbname] | Database index usage analysis stats
csm_db_stats.sh -l, --lockinfo [dbname] | Database lock stats
csm_db_stats.sh -s, --schemaversion [dbname] | Database schema version (CSM_DB only)
csm_db_stats.sh -c, --connectionsdb [dbname] | Database connections stats
csm_db_stats.sh -u, --usernamedb [dbname] | Database user stats
csm_db_stats.sh -v, --postgresqlversion [dbname] | Database (PostgreSQL) version
csm_db_stats.sh -a, --archivecount [dbname] | Database archive stats
csm_db_stats.sh -h, --help [dbname] | Help menu
=================================================================================================
1. Table data activity¶
Run: /opt/ibm/csm/db/csm_db_stats.sh -t, --tableinfo [my_db_name]
Example (Query details)¶
Column_Name | Description |
---|---|
tablename | table name |
live_row_count | current row count in the CSM_DB |
insert_count | number of rows inserted into each of the tables |
update_count | number of rows updated in each of the tables |
delete_count | number of rows deleted in each of the tables |
table_size | table size |
Note
This query will display information related to the CSM DB tables (or another specified DB). The query displays results only for tables whose insert, update, or delete counts are greater than zero. If there is no data activity in a particular table, it is omitted from the results.
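The columns above map naturally onto PostgreSQL's pg_stat_user_tables statistics view. A query along the following lines, shown only to illustrate where the numbers come from (it is not necessarily the exact statement the script runs), produces equivalent output:
-- Illustrative query: per-table activity from the statistics collector.
SELECT relname,
       n_live_tup                              AS live_row_count,
       n_tup_ins                               AS insert_count,
       n_tup_upd                               AS update_count,
       n_tup_del                               AS delete_count,
       pg_size_pretty(pg_relation_size(relid)) AS table_size
FROM pg_stat_user_tables
WHERE n_tup_ins > 0 OR n_tup_upd > 0 OR n_tup_del > 0
ORDER BY relname;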
Example (DB Table info.)¶
-bash-4.2$ ./csm_db_stats.sh -t csmdb
--------------------------------------------------------------------------------------------------
relname | live_row_count | insert_count | update_count | delete_count | table_size
-----------------------+----------------+--------------+--------------+--------------+------------
csm_db_schema_version | 1 | 1 | 0 | 0 | 8192 bytes
csm_gpu | 4 | 4 | 0 | 0 | 8192 bytes
csm_hca | 2 | 2 | 0 | 0 | 8192 bytes
csm_node | 2 | 2 | 0 | 0 | 8192 bytes
csm_ras_type | 4 | 4 | 0 | 0 | 8192 bytes
csm_ras_type_audit | 4 | 4 | 0 | 0 | 8192 bytes
(6 rows)
--------------------------------------------------------------------------------------------------
3. Index Analysis Usage Information¶
Run: /opt/ibm/csm/db/csm_db_stats.sh -x, --indexanalysis <my_db_name>
Example (Query details)¶
Column_Name | Description |
---|---|
relname | table name |
too_much_seq | case when seq_scan - idx_scan > 0 |
case | If Missing Index or is OK |
rel_size | on-disk size of the table in bytes (derived from the table's OID) |
seq_scan | Number of sequential scans initiated on this table. |
idx_scan | Number of index scans initiated on this table's indexes |
Note
This query checks whether more sequential scans are being performed than index scans. The results display the relname, too_much_seq, case, rel_size, seq_scan, and idx_scan columns. This query helps analyze whether any tables may be missing an index.
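A query of roughly this shape, built on the pg_stat_all_tables statistics view, yields the columns described above; it is a sketch of the general technique rather than the exact statement the script executes:
-- Illustrative query: flag tables whose sequential scans outnumber index scans.
SELECT relname,
       seq_scan - idx_scan     AS too_much_seq,
       CASE WHEN seq_scan - idx_scan > 0
            THEN 'Missing Index?'
            ELSE 'OK'
       END                     AS "case",      -- quoted because CASE is a reserved word
       pg_relation_size(relid) AS rel_size,
       seq_scan,
       idx_scan
FROM pg_stat_all_tables
WHERE schemaname = 'public'
ORDER BY too_much_seq DESC;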
Example (Indexes Usage)¶
-bash-4.2$ ./csm_db_stats.sh -x csmdb
--------------------------------------------------------------------------------------------------
relname | too_much_seq | case | rel_size | seq_scan | idx_scan
------------------------------+--------------+----------------+-------------+----------+----------
csm_step_node | 16280094 | Missing Index? | 245760 | 17438931 | 1158837
csm_allocation_history | 3061025 | Missing Index? | 57475072 | 3061787 | 762
csm_allocation_state_history | 3276 | Missing Index? | 35962880 | 54096 | 50820
csm_vg_history | 1751 | Missing Index? | 933888 | 1755 | 4
csm_vg_ssd_history | 1751 | Missing Index? | 819200 | 1755 | 4
csm_ssd_history | 1749 | Missing Index? | 1613824 | 1755 | 6
csm_dimm_history | 1652 | Missing Index? | 13983744 | 1758 | 106
csm_gpu_history | 1645 | Missing Index? | 24076288 | 1756 | 111
csm_hca_history | 1643 | Missing Index? | 8167424 | 1754 | 111
csm_ras_event_action | 1549 | Missing Index? | 263143424 | 1854 | 305
csm_node_state_history | 401 | Missing Index? | 78413824 | 821 | 420
csm_node_history | -31382 | OK | 336330752 | 879 | 32261
csm_ras_type_audit | -97091 | OK | 98304 | 793419 | 890510
csm_step_history | -227520 | OK | 342327296 | 880 | 228400
csm_vg_ssd | -356574 | OK | 704512 | 125588 | 482162
csm_vg | -403370 | OK | 729088 | 86577 | 489947
csm_hca | -547463 | OK | 1122304 | 1 | 547464
csm_ras_type | -942966 | OK | 81920 | 23 | 942989
csm_ssd | -1242433 | OK | 1040384 | 85068 | 1327501
csm_step_node_history | -1280913 | OK | 2865987584 | 49335 | 1330248
csm_allocation_node_history | -1664023 | OK | 21430599680 | 887 | 1664910
csm_gpu | -2152044 | OK | 5996544 | 1 | 2152045
csm_dimm | -2239777 | OK | 7200768 | 118280 | 2358057
csm_allocation_node | -52187077 | OK | 319488 | 1727675 | 53914752
csm_node | -78859700 | OK | 2768896 | 127214 | 78986914
(25 rows)
--------------------------------------------------------------------------------------------------
4. Table Lock Monitoring¶
Run: /opt/ibm/csm/db/csm_db_stats.sh -l, --lockinfo <my_db_name>
Example (Query details)¶
Column_Name | Description |
---|---|
blocked_pid | Process ID of the server process that is awaiting the lock. |
blocked_user | The user that is being blocked. |
current_or_recent_statement_in_blocking_process | The query statement being run by the blocking process. |
state_of_blocking_process | Current overall state of the blocking backend. |
blocking_duration | Elapsed time since the blocking query began (current time minus query start time). |
blocking_pid | Process ID of the blocking backend. |
blocking_user | The user that is blocking other transactions. |
blocked_statement | The query statement that is being blocked. |
blocked_duration | Elapsed time since the blocked query began (current time minus query start time). |
Example (Lock Monitoring)¶
-bash-4.2$ ./csm_db_stats.sh -l csmdb
-[ RECORD 1 ]-----------------------------------+--------------------------------------------------------------
blocked_pid | 38351
blocked_user | postgres
current_or_recent_statement_in_blocking_process | update csm_processor set status='N' where serial_number=3;
state_of_blocking_process | active
blocking_duration | 01:01:11.653697
blocking_pid | 34389
blocking_user | postgres
blocked_statement | update csm_processor set status='N' where serial_number=3;
blocked_duration | 00:01:09.048478
---------------------------------------------------------------------------------------------------------------
Note
This query displays relevant information related to lock monitoring. It will display the current blocked and blocking rows affected along with each duration. A systems administrator can run the query to evaluate what is causing a “hung” procedure and determine the possible issue.
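Reports of this kind are typically produced by joining pg_stat_activity to itself through pg_locks. The sketch below, which pairs waiting transaction-id locks with the sessions holding them, is an approximation for reference and not necessarily the exact query shipped with the script:
-- Illustrative lock-monitoring query: match blocked sessions with blockers.
SELECT blocked.pid                  AS blocked_pid,
       blocked.usename              AS blocked_user,
       blocking.query               AS current_or_recent_statement_in_blocking_process,
       blocking.state               AS state_of_blocking_process,
       now() - blocking.query_start AS blocking_duration,
       blocking.pid                 AS blocking_pid,
       blocking.usename             AS blocking_user,
       blocked.query                AS blocked_statement,
       now() - blocked.query_start  AS blocked_duration
FROM pg_catalog.pg_locks bl
JOIN pg_catalog.pg_stat_activity blocked ON blocked.pid = bl.pid
JOIN pg_catalog.pg_locks kl
  ON kl.transactionid = bl.transactionid
 AND kl.granted
 AND kl.pid <> bl.pid
JOIN pg_catalog.pg_stat_activity blocking ON blocking.pid = kl.pid
WHERE NOT bl.granted;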
5. DB schema Version Query¶
Run: /opt/ibm/csm/db/csm_db_stats.sh -s, --schemaversion <my_db_name>
Example (Query details)¶
Column_Name | Description |
---|---|
version | This provides the CSM DB version that is currently being used. |
create_time | This column indicates when the database was created. |
comment | This column indicates the “current version” as a comment. |
Example (DB Schema Version)¶
-bash-4.2$ ./csm_db_stats.sh -s csmdb
-------------------------------------------------------------------------------------
version | create_time | comment
---------+----------------------------+-----------------
16.1 | 2018-04-04 09:41:57.784378 | current_version
(1 row)
-------------------------------------------------------------------------------------
Note
This query provides the current database version the system is running along with its creation time.
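Because this information is stored in the csm_db_schema_version table, the same result can be read directly from a psql session; the column names below are inferred from the example output above:
-- Read the CSM DB schema version directly (column names inferred from the output above).
SELECT version, create_time, comment
FROM csm_db_schema_version;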
6. DB Connections with details¶
Run: /opt/ibm/csm/db/csm_db_stats.sh -c, --connectionsdb <my_db_name>
Example (Query details)¶
Column_Name | Description |
---|---|
pid | Process ID of this backend. |
dbname | Name of the database this backend is connected to. |
username | Name of the user logged into this backend. |
backend_start | Time when this process was started, i.e., when the client connected to the server. |
query_start | Time when the currently active query was started, or if state is not active, when the last query was started. |
state_change | Time when the state was last changed. |
wait | True if this backend is currently waiting on a lock. |
query | Text of this backend's most recent query. If state is active this field shows the currently executing query. In all other states, it shows the last query that was executed. |
Example (database connections)¶
-bash-4.2$ ./csm_db_stats.sh -c csmdb
-----------------------------------------------------------------------------------------------------------------------------------------------------------
pid | dbname | usename | backend_start | query_start | state_change | wait | query
-------+--------+----------+----------------------------+----------------------------+----------------------------+------+---------------------------------
61427 | xcatdb | xcatadm | 2017-11-01 13:42:53.931094 | 2017-11-02 10:15:04.617097 | 2017-11-02 10:15:04.617112 | f | DEALLOCATE
| | | | | | | dbdpg_p17050_384531
61428 | xcatdb | xcatadm | 2017-11-01 13:42:53.932721 | 2017-11-02 10:15:04.616291 | 2017-11-02 10:15:04.616313 | f | SELECT 'DBD::Pg ping test'
55753 | csmdb | postgres | 2017-11-02 10:15:06.619898 | 2017-11-02 10:15:06.620889 | 2017-11-02 10:15:06.620891 | f |
| | | | | | | SELECT pid,datname AS dbname,
| | | | | | | usename,backend_start, q.
| | | | | | |.uery_start, state_change,
| | | | | | | waiting AS wait,query FROM pg.
| | | | | | |._stat_activity;
(3 rows)
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Note
This query will display information about the database connections that are in use on the system. The pid (Process ID), database name, user name, backend start time, query start time, state change, waiting status, and query will display statistics about the current database activity.
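The wrapped text in the query column of the example above is the statement the script itself issues against pg_stat_activity; reformatted for readability it reads approximately as follows (the waiting column applies to the PostgreSQL 9.2 level shown in these examples):
-- The connections query visible in the example output, reformatted.
SELECT pid,
       datname AS dbname,
       usename,
       backend_start,
       query_start,
       state_change,
       waiting AS wait,
       query
FROM pg_stat_activity;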
7. PostgreSQL users with details¶
Run: /opt/ibm/csm/db/csm_db_stats.sh -u, --usernamedb <my_db_name>
Example (Query details)¶
Column_Name | Description |
---|---|
rolname | Role name. |
rolsuper | Role has superuser privileges (t/f). |
rolinherit | Role automatically inherits privileges of roles it is a member of (t/f). |
rolcreaterole | Role can create more roles (t/f). |
rolcreatedb | Role can create databases (t/f). |
rolcatupdate | Role can update system catalogs directly. (Even a superuser cannot do this unless this column is true) (t/f). |
rolcanlogin | Role can log in. That is, this role can be given as the initial session authorization identifier (t/f). |
rolreplication | Role is a replication role. That is, this role can initiate streaming replication and set/unset the system backup mode using pg_start_backup and pg_stop_backup (t/f). |
rolconnlimit | For roles that can log in, this sets the maximum number of concurrent connections this role can make. -1 means no limit. |
rolpassword | Not the password (always reads as ****). |
rolvaliduntil | Password expiry time (only used for password authentication); null if no expiration. |
rolconfig | Role-specific defaults for run-time configuration variables. |
oid | ID of role. |
Example (DB users with details)¶
-bash-4.2$ ./csm_db_stats.sh -u postgres
-----------------------------------------------------------------------------------------------------------------------------------
rolname | rolsuper | rolinherit | rolcreaterole | rolcreatedb | rolcatupdate | rolcanlogin | rolreplication | rolconnlimit | rolpassword | rolvaliduntil | rolconfig | oid
----------+----------+------------+---------------+-------------+--------------+-------------+----------------+--------------+-------------+---------------+-----------+--------
postgres | t | t | t | t | t | t | t | -1 | ******** | | | 10
xcatadm | f | t | f | f | f | t | f | -1 | ******** | | | 16385
root | f | t | f | f | f | t | f | -1 | ******** | | | 16386
csmdb | f | t | f | f | f | t | f | -1 | ******** | | | 704142
(4 rows)
-----------------------------------------------------------------------------------------------------------------------------------
Note
This query will display specific information related to the users that are currently in the postgres database. These fields will appear in the query: rolname, rolsuper, rolinherit, rolcreaterole, rolcreatedb, rolcatupdate, rolcanlogin, rolreplication, rolconnlimit, rolpassword, rolvaliduntil, rolconfig, and oid. See the field descriptions above for details.
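The fields above correspond to the columns of the pg_roles system view; a minimal sketch of an equivalent query (not necessarily the exact statement used by the script, and assuming a PostgreSQL 9.2 era catalog where rolcatupdate still exists) is:
-- Approximate query behind the -u option; run through psql as the postgres user.
SELECT rolname, rolsuper, rolinherit, rolcreaterole, rolcreatedb,
       rolcatupdate, rolcanlogin, rolreplication, rolconnlimit,
       rolpassword, rolvaliduntil, rolconfig, oid
FROM pg_roles;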
8. PostgreSQL Version Installed¶
Run: /opt/ibm/csm/db/./csm_db_stats.sh -v, --postgresqlversion <my_db_name>
Column_Name | Description |
version |
This provides the current PostgreSQL version installed on the system along with other environment details. |
Example (PostgreSQL Version)¶
-bash-4.2$ ./csm_db_stats.sh -v csmdb
-------------------------------------------------------------------------------------------------
version
-------------------------------------------------------------------------------------------------
PostgreSQL 9.2.18 on powerpc64le-redhat-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-9), 64-bit
(1 row)
-------------------------------------------------------------------------------------------------
Note
This query provides the current version of PostgreSQL installed on the system along with environment details.
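The version string shown above is what PostgreSQL returns for the standard version() function; a minimal sketch of an equivalent manual check (the database name csmdb is assumed) is:
-bash-4.2$ psql -d csmdb -c "SELECT version();"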
9. DB Archiving Stats¶
Run: /opt/ibm/csm/db/./csm_db_stats.sh -a, --indexanalysis <my_db_name>
Example (Query details)¶
Column_Name | Description |
table_name |
Table name. |
total_rows |
Total rows currently in the table. |
not_archived |
Total rows in the table that have not been archived. |
archived |
Total rows in the table that have been archived. |
last_archive_time |
Time of the most recent archive process for the table. |
Example (DB archive count with details)¶
-bash-4.2$ ./csm_db_stats.sh -a csmdb
---------------------------------------------------------------------------------------------------
table_name | total_rows | not_archived | archived | last_archive_time
-------------------------------+------------+--------------+----------+----------------------------
csm_allocation_history | 94022 | 0 | 94022 | 2018-10-09 16:00:01.912545
csm_allocation_node_history | 73044162 | 0 | 73044162 | 2018-10-09 16:00:02.06098
csm_allocation_state_history | 281711 | 0 | 281711 | 2018-10-09 16:01:03.685959
csm_config_history | 0 | 0 | 0 |
csm_db_schema_version_history | 2 | 0 | 2 | 2018-10-03 10:38:45.294172
csm_diag_result_history | 12 | 0 | 12 | 2018-10-03 10:38:45.379335
csm_diag_run_history | 8 | 0 | 8 | 2018-10-03 10:38:45.464976
csm_dimm_history | 76074 | 0 | 76074 | 2018-10-03 10:38:45.550827
csm_gpu_history | 58773 | 0 | 58773 | 2018-10-03 10:38:47.486974
csm_hca_history | 23415 | 0 | 23415 | 2018-10-03 10:38:50.574223
csm_ib_cable_history | 0 | 0 | 0 |
csm_lv_history | 0 | 0 | 0 |
csm_lv_update_history | 0 | 0 | 0 |
csm_node_history | 536195 | 0 | 536195 | 2018-10-09 14:10:40.423458
csm_node_state_history | 966991 | 0 | 966991 | 2018-10-09 15:30:40.886846
csm_processor_socket_history | 0 | 0 | 0 |
csm_ras_event_action | 1115253 | 0 | 1115253 | 2018-10-09 15:30:50.514246
csm_ssd_history | 4723 | 0 | 4723 | 2018-10-03 10:39:47.963564
csm_ssd_wear_history | 0 | 0 | 0 |
csm_step_history | 456080 | 0 | 456080 | 2018-10-09 16:01:05.797751
csm_step_node_history | 25536362 | 0 | 25536362 | 2018-10-09 16:01:06.216121
csm_switch_history | 0 | 0 | 0 |
csm_switch_inventory_history | 0 | 0 | 0 |
csm_vg_history | 4608 | 0 | 4608 | 2018-10-03 10:44:25.837201
csm_vg_ssd_history | 4608 | 0 | 4608 | 2018-10-03 10:44:26.047599
(25 rows)
---------------------------------------------------------------------------------------------------
Note
This query provides statistical information related to the DB archiving count and processing time.
Using csm_db_ras_type_script.sh¶
This script is for importing or removing records in the csm_ras_type table.
The csm_db_ras_type_script.sh creates a log file:
/var/log/ibm/csm/csm_db_ras_type_script.log
Note
- The csm_ras_type table is pre-populated and contains the description and details for each of the possible RAS event types. This may change over time and new message types can be imported into the table. When the script is run, a temp table is created and the csv file data is appended to the current records in the csm_ras_type table. If any duplicate (key) values exist in the process, they are dismissed and the rest of the records are imported. A total record count is displayed and logged, along with the resulting live row counts for the csm_ras_type and csm_ras_type_audit tables.
- A complete cleanse of the csm_ras_type table may also need to take place. If this step is necessary, the script can be run with the -r option. A "y/n" prompt will be displayed to the admin to confirm the removal. If the "n" option is selected, the process is aborted and the results are logged accordingly.
Usage Overview¶
Run /opt/ibm/csm/db/csm_db_ras_type_script.sh -h, --help.
Note
This help command (-h, --help) will specify each of the options available to use.
Example (Usage)¶
-bash-4.2$ ./csm_db_ras_type_script.sh -h
-------------------------------------------------------------------------------------
[Start ] Welcome to CSM datatbase ras type automation script.
=================================================================================================
[Info ] csm_db_ras_type_script.sh : Load/Remove data from csm_ras_type table
[Usage] csm_db_ras_type_script.sh : [OPTION]... [DBNAME]... [CSV_FILE]
-------------------------------------------------------------------------------------------------
Argument | DB Name | Description
-------------------------|-----------|-----------------------------------------------------------
-l, --loaddata | [db_name] | Imports CSV data to csm_ras_type table (appends)
| | Live Row Count, Inserts, Updates, Deletes, and Table Size
-r, --removedata | [db_name] | Removes all records from the csm_ras_type table
-h, --help | | help
-------------------------|-----------|-----------------------------------------------------------
[Examples]
-------------------------------------------------------------------------------------------------
csm_db_ras_type_script.sh -l, --loaddata [dbname] | [csv_file_name]
csm_db_ras_type_script.sh -r, --removedata [dbname] |
csm_db_ras_type_script.sh -h, --help [dbname] |
=================================================================================================
Importing records into csm_ras_type table (manually)¶
- To import data to the csm_ras_type table:
Run /opt/ibm/csm/db/csm_db_ras_type_script.sh (-l, --loaddata) my_db_name csv_file_name (where my_db_name is the name of your DB and csv_file_name is the csv file to import).
Note
The script will check to see if the given name is available and if the database does not exist then it will exit with an error message.
Example (non DB existence):¶
-bash-4.2$ ./csm_db_ras_type_script.sh -l csmdb csm_ras_type_data.csv
-------------------------------------------------------------------------------------
[Start ] Welcome to CSM datatbase ras type automation script.
[Info ] csm_ras_type_data.csv file exists
[Info ] PostgreSQL is installed
[Error ] Cannot perform action because the csmdb database does not exist. Exiting.
-------------------------------------------------------------------------------------
Note
Make sure PostgreSQL is installed on the system [todo: Link to install postgres]
Example (non csv_file_name existence):¶
-bash-4.2$ ./csm_db_ras_type_script.sh -l csmdb csm_ras_type_data_file.csv
-------------------------------------------------------------------------------------
[Start ] Welcome to CSM datatbase ras type automation script.
[Error ] File csm_ras_type_data_file.csv can not be located or doesnt exist
[Info ] Please choose another file or check path
-------------------------------------------------------------------------------------
Note
Make sure the latest csv file exists in the appropriate working directory
Example (successful execution):¶
-bash-4.2$ ./csm_db_ras_type_script.sh -l csmdb csm_ras_type_data_2.csv
-------------------------------------------------------------------------------------
[Start ] Welcome to CSM datatbase ras type automation script.
[Info ] csm_ras_type_data_2.csv file exists
[Info ] PostgreSQL is installed
[Info ] Record import count: 4
[Info ] csm_ras_type live row count: 4
[Info ] csm_ras_type_audit live row count: 46
[Info ] Database csv upload process complete for csm_ras_type table.
---------------------------------------------------------------------------------------------------------------
Removing records from csm_ras_type table (manually)¶
- To remove records from the csm_ras_type table, run the script with the (-r, --removedata) option. A prompt message will appear and the admin has the ability to choose "y/n". Each of the logging messages will be logged accordingly.
Example (successful execution):¶
-bash-4.2$ ./csm_db_ras_type_script.sh -r csmdb
-------------------------------------------------------------------------------------
[Start ] Welcome to CSM datatbase ras type automation script.
[Info ] PostgreSQL is installed
[Warning ] This will drop csm_ras_type table data from csmdb database. Do you want to continue [y/n]?
[Info ] User response: y
[Info ] Record delete count from the csm_ras_type table: 4
[Info ] csm_ras_type live row count: 0
[Info ] csm_ras_type_audit live row count: 50
[Info ] Data from the csm_ras_type table has been successfully removed
-------------------------------------------------------------------------------------
Example (unsuccessful execution):¶
-bash-4.2$ ./csm_db_ras_type_script.sh -r csmdb
-------------------------------------------------------------------------------------
[Start ] Welcome to CSM datatbase ras type automation script.
[Info ] PostgreSQL is installed
[Warning ] This will drop csm_ras_type table data from csmdb database. Do you want to continue [y/n]?
[Info ] User response: n
[Info ] Data removal from the csm_ras_type table has been aborted
-------------------------------------------------------------------------------------
Big Data Store¶
CAST supports the integration of the ELK stack as a Big Data solution. Support for this solution is bundled in the csm-big-data rpm in the form of suggested configurations and support scripts.
Configuration order is not strictly enforced for the ELK stack, however, this resource generally assumes the components of the stack are installed in the following order:
- Elasticsearch
- Kibana
- Logstash
This installation order minimizes the likelihood of improperly ingested data being stored in elasticsearch.
Warning
If the index mappings are not created properly timestamp data may be improperly stored. If this occurs the user will need to reindex the data to fix the problem. Please read the elasticsearch section carefully before ingesting data.
Elasticsearch¶
Elasticsearch is a distributed analytics and search engine and the core component of the ELK stack. Elasticsearch ingests structured data (typically JSON or key value pairs) and stores the data in distributed index shards.
In the CAST design, the more Elasticsearch nodes the better. Generally speaking, nodes with attached storage or large numbers of drives are preferred.
Configuration¶
Note
This guide has been tested using Elasticsearch 6.3.2, the latest RPM may be downloaded from the Elastic Site.
The following is a brief introduction to the installation and configuration of the elasticsearch service. It is generally assumed that elasticsearch is to be installed on multiple Big Data Nodes to take advantage of the distributed nature of the service. Additionally, in the CAST configuration data drives are assumed to be JBOD.
CAST provides a set of sample configuration files in the repository at csm_big_data/elasticsearch/. If the ibm-csm-bds-*.noarch.rpm rpm has been installed the sample configurations may be found in /opt/ibm/csm/bigdata/elasticsearch/.
- Install the elasticsearch rpm and java 1.8.1+ (command run from directory with elasticsearch rpm):
yum install -y elasticsearch-*.rpm java-1.8.*-openjdk
- Copy the Elasticsearch configuration files to the /etc/elasticsearch directory. It is recommended that the system administrator review these configurations at this phase:
jvm.options: JVM options for the Elasticsearch service.
elasticsearch.yml: Configuration of the service specific attributes, please see elasticsearch.yml for details.
- Make an ext4 filesystem on each hard drive designated to be in the Elasticsearch JBOD. The mounted names for these file systems should match the names specified in path.data. Additionally, these mounted file systems should be owned by the elasticsearch user and in the elasticsearch group (see the example sketch at the end of this walkthrough).
- Start Elasticsearch:
systemctl enable elasticsearch
systemctl start elasticsearch
- Run the index template creator script:
/opt/ibm/csm/bigdata/elasticsearch/createIndices.sh
Note
This step is technically optional; however, without it the data will have limited use. This script configures Elasticsearch to properly parse timestamps.
Elasticsearch should now be operational. If Logstash was properly configured there should already be data being written to your index.
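As a point of reference for the filesystem step above, preparing a single JBOD drive might look like the following sketch; the device name (/dev/sdb) and mount point (/data/elastic1) are assumptions and must match an entry in path.data:
# Hypothetical example: format one JBOD drive, mount it, and give ownership to elasticsearch.
mkfs.ext4 /dev/sdb
mkdir -p /data/elastic1
echo '/dev/sdb /data/elastic1 ext4 defaults 0 0' >> /etc/fstab
mount /data/elastic1
chown -R elasticsearch:elasticsearch /data/elastic1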
Tuning Elasticsearch¶
The process of tuning and configuring Elasticsearch is highly dependent on the volume and type of data ingested into the Big Data Store. Due to the nuance of this process it is STRONGLY recommended that the system administrator familiarize themselves with Configuring Elasticsearch.
The following document outlines the defaults and recommendations of CAST in the configuration of the Big Data Store.
elasticsearch.yml¶
Note
The following section outlines CAST's recommendations for the Elasticsearch configuration. It is STRONGLY recommended that the system administrator familiarize themselves with Configuring Elasticsearch.
The Elasticsearch configuration sample shipped by CAST marks fields that need to be set by a system administrator. A brief rundown of the fields to modify is as follows:
cluster.name: | The name of the cluster. Nodes may only join clusters with the name in this field. Generally it's a good idea to give this a descriptive name. |
---|---|
node.name: | The name of the node in the elasticsearch cluster. CAST defaults to ${HOSTNAME}. |
path.logs: | The logging directory, needs elasticsearch read write access. |
path.data: | A comma separated listing of data directories, needs elasticsearch read write access. CAST recommends a JBOD model where each disk has a file system. |
network.host: | The address to bind the Elasticsearch model to. CAST defaults to _site_. |
http.port: | The port to bind Elasticsearch to. CAST defaults to 9200. |
discovery.zen.ping.unicast.hosts: | A list of nodes likely to be active, comma delimited array. CAST defaults to cast.elasticsearch.nodes. |
discovery.zen.minimum_master_nodes: | Number of nodes with the node.master setting set to true that must be connected to before starting. Elasticsearch recommends (master_eligible_nodes/2)+1. |
gateway.recover_after_nodes: | Number of nodes to wait for before beginning recovery after a cluster-wide restart. |
xpack.ml.enabled: | Enables/disables the Machine Learning utility in xpack; this should be disabled on ppc64le installations. |
xpack.security.enabled: | Enables/disables security in elasticsearch. |
xpack.license.self_generated.type: | Sets the license of xpack for the cluster; if the user has no license it should be set to basic. |
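To make these fields concrete, a minimal elasticsearch.yml sketch is shown below; the cluster name, host names, and data paths are illustrative assumptions, not CAST-shipped values:
# Hypothetical elasticsearch.yml values; replace names, hosts, and paths for your cluster.
cluster.name: cast-bds
node.name: ${HOSTNAME}
path.logs: /var/log/elasticsearch
path.data: /data/elastic1,/data/elastic2
network.host: _site_
http.port: 9200
discovery.zen.ping.unicast.hosts: ["bds01", "bds02", "bds03"]
discovery.zen.minimum_master_nodes: 2
gateway.recover_after_nodes: 2
xpack.ml.enabled: false
xpack.security.enabled: false
xpack.license.self_generated.type: basic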
jvm.options¶
The configuration file for the Elasticsearch JVM. The supplied settings are CAST's recommendation, however, the efficacy of these settings entirely depends on your elasticsearch node.
Generally speaking the only field to be changed is the heap size:
-Xms[HEAP MIN]
-Xmx[HEAP MAX]
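For example, on a node dedicating roughly 31 GB of RAM to the Elasticsearch heap (an assumed figure, kept below the compressed object pointer limit), the two lines would read:
-Xms31g
-Xmx31g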
Indices¶
Elasticsearch Templates: | |
---|---|
/opt/ibm/csm/bigdata/elasticsearch/templates/cast-*.json |
CAST has specified a suite of data mappings for use in separate indices. Each of these indices is documented below, with a JSON mapping file provided in the repository and rpm.
CAST uses the cast-<class>-<description>-<date> naming scheme for indices to leverage templates when creating the indices in Elasticsearch. The class is one of the three primary classifications determined by CAST: log, counters, environmental. The description is typically a one to two word description of the type of data: syslog, node, mellanox-event, etc. For example, syslog records ingested on a given day land in an index of the form cast-log-syslog-<date>.
A collection of templates is provided in the CAST big data store RPM which set up aliases and data type mappings. These templates do not set sharding or replication factors, as these settings should be tuned to the user's data retention and index sizing needs.
The specified templates match indices generated in the data aggregators documentation. As different data sources produce different volumes of data in different environments, this document will make no recommendation on sharding or replication.
Note
These templates may be found on the git repo in csm_big_data/elasticsearch/mappings/templates.
Note
CAST has elected to use lowercase and '-' characters to separate words. This is not mandatory for your index naming and creation.
scripts¶
Elasticsearch Index Scripts: | |
---|---|
/opt/ibm/csm/bigdata/elasticsearch/ |
CAST provides a set of scripts which allow the user to easily manipulate the elasticsearch indices from the command line.
createIndices.sh¶
A script for initializing the templates defined by CAST. When executed it will attempt to target the elasticsearch server running on "${HOSTNAME}:9200". If the user supplies either a hostname or ip address this will be targeted in lieu of "${HOSTNAME}". This script need only be run once on a node in the elasticsearch cluster.
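For example, targeting a specific elasticsearch node (hypothetical address) rather than the local host:
/opt/ibm/csm/bigdata/elasticsearch/createIndices.sh 10.7.4.13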
removeIndices.sh¶
A script for removing all elasticsearch templates created by createIndices.sh. When executed it will attempt to target the elasticsearch server running on "${HOSTNAME}:9200". If the user supplies either a hostname or ip address this will be targeted in lieu of "${HOSTNAME}". This script need only be run once on a node in the elasticsearch cluster.
reindexIndices.py¶
A tool for performing in place reindexing of an elasticsearch index.
Warning
This script should only be used to reindex a handful of indices at a time as it is slow and can result in partial reindexing.
usage: reindexIndices.py [-h] [-t hostname:port]
[-i [index-pattern [index-pattern ...]]]
A tool for reindexing a list of elasticsearch indices, all indices will be
reindexed in place.
optional arguments:
-h, --help show this help message and exit
-t hostname:port, --target hostname:port
An Elasticsearch server to reindex indices on. This
defaults to the contents of environment variable
"CAST_ELASTIC".
-i [index-pattern [index-pattern ...]], --indices [index-pattern [index-pattern ...]]
A list of indices to reindex, this should use the
index pattern format.
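For example, reindexing a single day of syslog data against a hypothetical elasticsearch node and index name:
/opt/ibm/csm/bigdata/elasticsearch/reindexIndices.py -t 10.7.4.13:9200 -i cast-log-syslog-2018.10.09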
cast-log¶
Elasticsearch Templates: | |
---|---|
/opt/ibm/csm/bigdata/elasticsearch/templates/cast-log*.json |
The cast-log- indices represent a set of logging indices produced by CAST supported data sources.
cast-log-syslog¶
alias: | cast-log-syslog |
---|
The syslog index is designed to capture generic syslog messages. The contents of the syslog index are considered by CAST to be the most useful data points for syslog analysis. CAST supplies both an rsyslog template and a Logstash pattern; for details on these configurations please consult the data aggregators documentation.
The mapping for the index contains the following fields:
Field | Type | Description |
---|---|---|
@timestamp | date | The timestamp of the message, generated by the syslog utility. |
host | text | The hostname of the relay host. |
hostname | text | The hostname of the syslog origination. |
program_name | text | The name of the program which generated the log. |
process_id | long | The process id of the program which generated the log. |
severity | text | The severity level of the log. |
message | text | The body of the message. |
tags | text | Tags containing additional metadata about the message. |
Note
Currently mmfs and CAST logs will be stored in the syslog index (due to similarity of the data mapping).
cast-log-mellanox-event¶
alias: | cast-log-mellanox-event |
---|
The mellanox event log is a superset of the cast-log-syslog index, an artifact of the event log being transmitted through syslog. In the CAST Big Data Pipeline this log will be ingested and parsed by the Logstash service then transmitted to the Elasticsearch index.
Field | Type | Description |
---|---|---|
@timestamp | date | When the message was written to the event log. |
hostname | text | The hostname of the ufm aggregating the events. |
program_name | text | The name of the generating program, should be event_log |
process_id | long | The process id of the program which generated the log. |
severity | text | The severity level of the log, pulled from message. |
message | text | The body of the message (unstructured). |
log_counter | long | A counter tracking the log number. |
event_id | long | The unique identifier for the event in the mellanox event log. |
event_type | text | The type of event (e.g. HARDWARE) in the event log. |
category | text | The categorization of the error in the event log typing |
tags | text | Tags containing additional metadata about the message. |
cast-log-console¶
alias: | cast-log-console |
---|
CAST recommends the usage of the goconserver bundled in the xCAT dependencies, documented in xCat-GoConserver. Configuration of the goconserver should be performed on the xCAT service nodes in the cluster. CAST has created a limited configuration guide (see the Console data aggregation section below); please consult it for a basic rundown on the utility.
The mapping for the console index is provided below:
Field | Type | Description |
---|---|---|
@timestamp | date | When the console event occurred. |
type | text | The type of the event (typically console). |
message | text | The console event data, typically a console line. |
hostname | text | The hostname generating the console. |
tags | text | Tags containing additional metadata about the console log. |
cast-csm¶
Elasticsearch Templates: | |
---|---|
/opt/ibm/csm/bigdata/elasticsearch/templates/cast-csm*.json |
The cast-csm- indices represent a set of metric indices produced by CSM. Indices matching this pattern will be created unilaterally by the CSM Daemon. Typically records in this type of index are generated by the Aggregator Daemon.
cast-csm-dimm-env¶
alias: | cast-csm-dimm-env |
---|
The mapping for the cast-csm-dimm-env index is provided below:
Field | Type | Description |
---|---|---|
@timestamp | date | Ingestion time of the dimm environment counters. |
timestamp | date | When environment counters were gathered. |
type | text | The type of the event (csm-dimm-env). |
source | text | The source of the counters. |
data.dimm_id | long | The id of dimm being aggregated. |
data.dimm_temp | long | The temperature of the dimm. |
data.dimm_temp_max | long | The max temperature of the dimm over the collection period. |
data.dimm_temp_min | long | The min temperature of the dimm over the collection period. |
cast-csm-gpu-env¶
alias: | cast-csm-gpu-env |
---|
The mapping for the cast-csm-gpu-env index is provided below:
Field | Type | Description |
---|---|---|
@timestamp | date | Ingestion time of the gpu environment counters. |
timestamp | date | When environment counters were gathered. |
type | text | The type of the event (csm-gpu-env). |
source | text | The source of the counters. |
data.gpu_id | long | The id of the GPU record being aggregated. |
data.gpu_mem_temp | long | The memory temperature of the GPU. |
data.gpu_mem_temp_max | long | The max memory temperature of the GPU over the collection period. |
data.gpu_mem_temp_min | long | The min memory temperature of the GPU over the collection period. |
data.gpu_temp | long | The temperature of the GPU. |
data.gpu_temp_max | long | The max temperature of the GPU over the collection period. |
data.gpu_temp_min | long | The min temperature of the GPU over the collection period. |
cast-csm-node-env¶
alias: | cast-csm-node-env |
---|
The mapping for the cast-csm-node-env index is provided below:
Field | Type | Description |
---|---|---|
@timestamp | date | Ingestion time of the node environment counters. |
timestamp | date | When environment counters were gathered. |
type | text | The type of the event (csm-node-env). |
source | text | The source of the counters. |
data.system_energy | long | The energy of the system at ingestion time. |
cast-csm-gpu-counters¶
alias: | cast-csm-gpu-counters |
---|
A listing of DCGM counters.
Field | Type | Description |
---|---|---|
@timestamp | date | Ingestion time of the gpu environment counters. |
Note
The data fields have been separated for compactness.
Data Field | Type | Description |
---|---|---|
nvlink_recovery_error_count_l1 | long | Total number of NVLink recovery errors. |
sync_boost_violation | long | Throttling duration due to sync-boost constraints (in us) |
gpu_temp | long | GPU temperature (in C). |
nvlink_bandwidth_l2 | long | Total number of NVLink bandwidth counters. |
dec_utilization | long | Decoder utilization. |
nvlink_recovery_error_count_l2 | long | Total number of NVLink recovery errors. |
nvlink_bandwidth_l1 | long | Total number of NVLink bandwidth counters. |
mem_copy_utilization | long | Memory utilization. |
gpu_util_samples | double | GPU utilization sample count. |
nvlink_replay_error_count_l1 | long | Total number of NVLink retries. |
nvlink_data_crc_error_count_l1 | long | Total number of NVLink data CRC errors. |
nvlink_replay_error_count_l0 | long | Total number of NVLink retries. |
nvlink_bandwidth_l0 | long | Total number of NVLink bandwidth counters. |
nvlink_data_crc_error_count_l3 | long | Total number of NVLink data CRC errors. |
nvlink_flit_crc_error_count_l3 | long | Total number of NVLink flow-control CRC errors. |
nvlink_bandwidth_l3 | long | Total number of NVLink bandwidth counters. |
nvlink_replay_error_count_l2 | long | Total number of NVLink retries. |
nvlink_replay_error_count_l3 | long | Total number of NVLink retries. |
nvlink_data_crc_error_count_l0 | long | Total number of NVLink data CRC errors. |
nvlink_recovery_error_count_l0 | long | Total number of NVLink recovery errors. |
enc_utilization | long | Encoder utilization. |
power_usage | double | Power draw (in W). |
nvlink_recovery_error_count_l3 | long | Total number of NVLink recovery errors. |
nvlink_data_crc_error_count_l2 | long | Total number of NVLink data CRC errors. |
nvlink_flit_crc_error_count_l2 | long | Total number of NVLink flow-control CRC errors. |
serial_number | text | The serial number of the GPU. |
power_violation | long | Throttling duration due to power constraints (in us). |
xid_errors | long | Value of the last XID error encountered. |
gpu_utilization | long | GPU utilization. |
nvlink_flit_crc_error_count_l0 | long | Total number of NVLink flow-control CRC errors. |
nvlink_flit_crc_error_count_l1 | long | Total number of NVLink flow-control CRC errors. |
mem_util_samples | double | The sample rate of the memory utilization. |
thermal_violation | long | Throttling duration due to thermal constraints (in us). |
cast-counters¶
Elasticsearch Templates: | |
---|---|
/opt/ibm/csm/bigdata/elasticsearch/templates/cast-counters*.json |
A class of index representing counter aggregation from non CSM data flows. Generally indices following this naming pattern contain data from standalone data aggregation utilities.
cast-counters-gpfs¶
alias: | cast-counters-gpfs |
---|
A collection of counter data from gpfs. The script outlined in the data aggregators documentation leverages zimon to perform the collection. The following is the index generated by the default script bundled in the CAST rpm.
Field | Type | Description |
---|---|---|
@timestamp | date | Ingestion time of the gpu environment counters. |
Note
The data fields have been separated for compactness.
Data Field | Type | Description |
---|---|---|
cpu_system | long | The system space usage of the CPU. |
cpu_user | long | The user space usage of the CPU. |
mem_active | long | Active memory usage. |
gpfs_ns_bytes_read | long | Networked bytes read. |
gpfs_ns_bytes_written | long | Networked bytes written. |
gpfs_ns_tot_queue_wait_rd | long | Total time spent waiting in the network queue for read operations. |
gpfs_ns_tot_queue_wait_wr | long | Total time spent waiting in the network queue for write operations. |
cast-counters-ufm¶
alias: | cast-counters-ufm |
---|
Due to the wide variety of counters that may be gathered checking the data aggregation script is strongly recommended.
The mapping for the cast-counters-ufm index is provided below:
Field | Type | Description |
---|---|---|
@timestamp | date | Ingestion time of the ufm environment counters. |
timestamp | date | When environment counters were gathered. |
type | text | The type of the event (cast-counters-ufm). |
source | text | The source of the counters. |
cast-db¶
CSM history tables are archived in Elasticsearch as separate indices. CAST provides a document on configuring CSM database data archival (see the Database Archiving section below).
The mapping shared between the indices is as follows:
Field | Type | Description |
---|---|---|
@timestamp | date | When the archival event occurred. |
tags | text | Tags about the archived data. |
type | text | The originating table, drives index assignment. |
data | doc | The mapping of table columns, contents differ for each table. |
Attention
These indices will match CSM database history tables; contents are not replicated for brevity.
cast-ibm-crassd-bmc-alerts¶
While not managed by CAST, the crassd daemon will ship BMC alerts to the big data store.
Kibana¶
Kibana is an open-sourced data visualization tool used in the ELK stack.
CAST provides a utility plugin for multistep searches of CSM jobs in Kibana dashboards.
Configuration¶
Note
This guide has been tested using Kibana 6.3.2, the latest RPM may be downloaded from the Elastic Site.
The following is a brief introduction to the installation and configuration of the Kibana service.
At the current time CAST does not provide a configuration file in its RPM.
- Install the Kibana rpm:
yum install -y kibana-*.rpm
- Configure the Kibana YAML file (/etc/kibana/kibana.yml)
CAST recommends the following four values be set before starting Kibana:
Setting | Description | Sample Value |
---|---|---|
server.host | The address the kibana server will bind on, needed for external access. | “10.7.4.30” |
elasticsearch.url | The URL of an elasticsearch service, this should include the port number (9200 by default). | “http://10.7.4.13:9200” |
xpack.security.enabled | The xpack security setting, set to false if not being used. | false |
xpack.ml.enabled | Sets the status of xpack Machine Learning. Please note this must be set to false on ppc64le installations. | false |
- Install the CAST Search rpm:
rpm -ivh ibm-csm-bds-kibana-*.noarch.rpm
- Start Kibana:
systemctl enable kibana.service
systemctl start kibana.service
Kibana should now be running and fully featured. Searches may now be performed on the Discover tab.
CAST Search¶
CAST Search is a React plugin designed for interfacing with Elasticsearch and building filters for Kibana Dashboards. To maximize the value of the plugin the cast-allocation index pattern should be specified.
Logstash¶
Logstash is an open-source data processing pipeline used in the ELK stack. The core function of this service is to process unstructured data, typically syslogs, and then pass the newly structured text to the elasticsearch service.
Typically, in the CAST design, the Logstash service is run on the service nodes in the xCAT infrastructure. This design is to reduce the number of servers communicating with each instance of Logstash, distributing the workload. xCAT service nodes have failover capabilities removing the need for HAProxies to reduce the risk of data loss. Finally, in using the service node the total cost of the Big Data Cluster is reduced as the need for a dedicated node for data processing is removed.
CAST provides an event correlator for Logstash to assist in the generation of RAS events for specific messages.
Configuration¶
Note
This guide has been tested using Logstash 6.3.2, the latest RPM may be downloaded from the Elastic Site.
The following is a brief introduction to the installation and configuration of the logstash service. CAST provides a set of sample configuration files in the repository at csm_big_data/logstash/. If the ibm-csm-bds-*.noarch.rpm rpm has been installed the sample configurations may be found in /opt/ibm/csm/bigdata/logstash/.
- Install the logstash rpm and java 1.8.1+ (command run from directory with logstash rpm):
yum install -y logstash-*.rpm java-1.8.*-openjdk
Copy the Logstash pipeline configuration files to the appropriate directories.
This step is ultimately optional; however, it is recommended that these files be reviewed and modified by the system administrator at this phase (a copy sketch is provided below):
Target file | Repo Dir | RPM Dir |
---|---|---|
logstash.yml (see note) | config/ | config/ |
jvm.options | config/ | config/ |
conf.d/logstash.conf | config/ | config/ |
patterns/ibm_grok.conf | patterns/ | patterns/ |
patterns/mellanox_grok.conf | patterns/ | patterns/ |
patterns/events.yml | patterns/ | patterns/ |
Note
Target files are relative to /etc/logstash. Repo Directories are relative to csm_big_data/logstash. RPM Directories are relative to /opt/ibm/csm/bigdata/logstash/.
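A sketch of the copy step, assuming the RPM install location and the target directories listed above (logstash.yml is intentionally omitted, see the note below):
# Hypothetical copy of the sample Logstash configuration into /etc/logstash.
mkdir -p /etc/logstash/conf.d /etc/logstash/patterns
cp /opt/ibm/csm/bigdata/logstash/config/jvm.options /etc/logstash/jvm.options
cp /opt/ibm/csm/bigdata/logstash/config/logstash.conf /etc/logstash/conf.d/logstash.conf
cp /opt/ibm/csm/bigdata/logstash/patterns/* /etc/logstash/patterns/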
Note
The conf.d/logstash.conf file requires the ELASTIC-INSTANCE field be replaced with your cluster’s elastic search nodes.
Note
logstash.yml is not shipped with this version of the RPM; please use the following config for logstash.
# logstash.yml
---
path.data: /var/lib/logstash
path.config: /etc/logstash/conf.d/*conf
path.logs: /var/log/logstash
pipeline.workers: 2
pipeline.batch.size: 2000 # This is the MAXIMUM, to prevent exceedingly long waits a delay is supplied.
pipeline.batch.delay: 50 # Maximum time to wait to execute an underfilled queue in milliseconds.
queue.type: persisted
...
- Install the CSM Event Correlator
rpm -ivh ibm-csm-bds-logstash*.noarch.rpm
Note
This change is effective in the 1.3.0 release of the CAST rpms.
Please refer to Installing CEC for more details.
Note
The bin directory is relative to your logstash install location.
- Start Logstash:
systemctl enable logstash
systemctl start logstash
Logstash should now be operational. At this point data aggregators should be configured to point to your Logstash node as appropriate.
Tuning Logstash¶
Tuning logstash is highly dependent on your use case and environment. What follows is a set of recommendations based on the research and experimentation of the CAST Big Data team.
Useful resources for learning more about profiling and tuning Logstash may be found in the official Logstash performance tuning documentation.
logstash.yml¶
This configuration file specifies details about the Logstash service:
- Path locations (as a rule of thumb these files should be owned by the logstash user).
- Pipeline details (e.g. workers, threads, etc.)
- Logging levels.
For more details please refer to the Logstash settings file documentation.
jvm.options¶
The configuration file for the Logstash JVM. The supplied settings are CAST’s recommendation, however, the efficacy of these settings entirely depends on your Logstash node.
logstash.conf¶
The logstash.conf is the core configuration file for determining the behavior of the Logstash pipeline in the default CAST configuration. This configuration file is split into three components: input, filter and output.
input¶
The input section defines how the pipeline may ingest data. In the CAST sample only the tcp input plugin is used. CAST currently uses different ports to assign tagging to facilitate simpler filter configuration. For a more in depth description of this section please refer to the configuration file structure in the official Logstash documentation.
The default ports and data tagging are as follows:
Default Port Values | |
---|---|
Tag | Port Number |
syslog | 10515 |
json_data | 10522 |
transactions | 10523 |
filter¶
The filter section defines the data enrichment step of the pipeline. In the CAST sample the following operations are performed:
- Unstructured events are parsed with the grok utility.
- Timestamps are reformatted (as needed).
- Events with JSON formatting are parsed.
- CSM Event Correlator is invoked on properly ingested logs.
Generally speaking, care must be taken in this section to leverage branch prediction (conditionals that skip filters which do not apply to an event). Additionally, a poorly written grok pattern can noticeably slow down pipeline performance. Please consult the configuration file structure in the official Logstash documentation for more details. A minimal conditional sketch is shown below.
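The sketch assumes events arriving on the syslog port are tagged "syslog" and that the CAST pattern files (including the RSYSLOGDSV pattern shown later in this document) are installed in /etc/logstash/patterns; both of these are assumptions about your deployment:
filter {
  # Only run the syslog grok against events tagged as syslog; other events skip this filter entirely.
  if "syslog" in [tags] {
    grok {
      patterns_dir => ["/etc/logstash/patterns"]
      match => { "message" => "%{RSYSLOGDSV}" }
    }
  }
}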
output¶
The output section defines the target for the data processed through the pipeline. In the CAST sample the elasticsearch plugin is used, for more details please refer to the linked documentation.
The user must replace _ELASTIC_IP_PORT_LIST_ with a comma delimited list of hostname:port string pairs referring to the nodes in the elasticsearch cluster. Generally, if using the default configuration, the port should be 9200. An example of this configuration is as follows:
hosts => [ "10.7.4.14:9200", "10.7.4.15:9200", "10.7.4.19:9200" ]
grok¶
Logstash provides a grok utility to perform regular expression pattern recognition and extraction. When writing grok patterns several rules of thumb are recommended by the CAST team:
- Profile your patterns, Do you grok Grok? discusses a mechanism for profiling.
- Grok failure can be expensive, use anchors (^ and $) to make string matches precise to reduce failure costs.
- _groktimeout tagging can set an upper bound time limit for grok operations.
- Avoid DATA and GREEDYDATA if possible.
CSM Event Correlator¶
CSM Event Correlator (CEC) is the CAST solution for event correlation in the logstash pipeline. CEC is written in ruby to leverage the existing Logstash plugin system. At its core CEC is a pattern matching engine using grok to handle pattern matching.
A sample configuration of CEC is provided as the events.yml file described in the Configuration section of the document.
There’s an extensive asciidoc for usage of the CSM Event Correlator plugin. The following documentation is an abridged version.
Installing CEC¶
CEC should be bundled in the ibm-csm-bds-*.noarch.rpm rpm. Installation at the current time requires an external connection to the internet or an exported copy of the plugin (this process is described in offline-cec-install).
/usr/share/logstash/bin/logstash-plugin install \
/opt/ibm/csm/bigdata/logstash/plugins/logstash-filter-csm-event-correlator-*.gem
After the plugin has been installed it may then be configured with the steps described in CSM Event Correlator Filter Plugin.
Data Aggregation¶
Data Aggregation in CAST utilizes the logstash pipeline to process events and pass it along to Elasticsearch.
Note
In the following documentation, examples requiring replacement will be annotated with the bash style ${variable_name} and followed by an explanation of the variable.
Logs¶
The default configuration of the CAST Big Data Store has support for a number of logging types, most of which are processed through the syslog utility and then enriched by Logstash and the CAST Event Correlator.
Syslog¶
Logstash Port: | 10515 |
---|
Syslog is generally aggregated through the use of the rsyslog daemon.
Most devices are capable of producing syslogs, and it is suggested that syslogs should be sent to Logstash via a redirection hierarchy outlined in the table below:
Type of Node | Syslog Destination |
---|---|
Service Node | Logstash |
Compute Node | Service Node |
Utility Node | Service Node |
UFM Server | Service Node |
IB/Ethernet | Logstash Node |
PDUs | Logstash Node |
Syslog Redirection¶
Warning
This step should not be performed on compute nodes in xCAT clusters!
To redirect a syslog so it is accepted by Logstash the following must be added to the /etc/rsyslog.conf file:
$template logFormat, "%TIMESTAMP:::date-rfc3339% %HOSTNAME% %APP-NAME% %PROCID% %syslogseverity-text% %msg%\n"
*.*;cron.none @@${logstash_node}:${syslog_port};logFormat
The rsyslog utility must then be restarted for the changes to take effect:
/bin/systemctl restart rsyslog.service
Field Description
logstash_node: | Replace with the hostname or IP address of the Logstash Server, on service nodes this is typically localhost. |
---|---|
syslog_port: | Replace with the port set in the Logstash Configuration File [ default: 10515 ]. |
Format
The format of the syslog is parsed in the CAST model by Logstash. CAST provides a grok for this syslog format in the pattern list provided by the CAST repository and rpm. The grok pattern is reproduced below with the types matching directly to the types in the syslog elastic documentation.
RSYSLOGDSV ^(?m)%{TIMESTAMP_ISO8601:timestamp} %{HOSTNAME:hostname} %{DATA:program_name} %{INT:process_id} %{DATA:severity} %{GREEDYDATA:message}$
Note
This pattern has a 1:1 relationship with the template given above and a 1:many relationship with the index data mapping. Logstash appends some additional fields for metadata analysis.
GPFS¶
To redirect the GPFS logging data to the syslog please do the following on the Management node for GPFS:
/usr/lpp/mmfs/bin/mmchconfig systemLogLevel=notice
After completing this process the gpfs log should now be forwarded to the syslog for the configured node.
Note
Refer to Syslog Redirection for gpfs log forwarding, the default syslog port is recommended (10515).
Note
The systemLogLevel attribute will forward logs of the specified level and higher to the syslog. It supports the following options: alert, critical, error, warning, notice, configuration, informational, detail, and debug.
Note
This data type will inhabit the same index as the syslog documents due to data similarity.
UFM¶
Note
This document assumes that the UFM daemon is up and running on the UFM Server.
The Unified Fabric Manager (UFM) has several distinct data logs to aggregate for the big data store.
System Event Log¶
Logstash Port: | 10515 |
---|
The System Event Log will report various fabric events that occur in the UFM’s network:
- A link coming up.
- A link going down.
- UFM module problems.
- …
A sample output showing a downed link can be seen below:
Oct 17 15:56:33 c931hsm04 eventlog[30300]: WARNING - 2016-10-17 15:56:33.245 [5744] [112]
WARNING [Hardware] IBPort [default(34) / Switch: c931ibsw-leaf01 / NA / 16]
[dev_id: 248a0703006d40f0]: Link-Downed counter delta threshold exceeded.
Threshold is 0, calculated delta is 1. Peer info: Computer: c931f03p08 HCA-1 / 1.
Note
The above example is in the Syslog format.
To send this log to the Logstash data aggregation the /opt/ufm/files/conf/gv.cfg file must be modified and /etc/rsyslog.conf should be modified as described in Syslog Redirection.
CAST recommends setting the following attributes in /opt/ufm/files/conf/gv.cfg:
[Logging]
level = INFO
syslog = true
event_syslog = true
[CSV]
write_interval = 30
ext_ports_only = yes
max_files = 10
[MonitoringHistory]
history_configured = true
Note
write_interval and max_files were set as defaults; change these fields as needed.
After configuring /opt/ufm/files/conf/gv.cfg restart the ufm daemon.
/etc/init.d/ufmd restart
Format
CAST recommends using the same syslog format as shown in Syslog Redirection, however, the message in the case of the mellanox event log has a consistent structure which may be parsed by Logstash. The pattern and substitutions are used below. Please note that the timestamp, severity and message fields are all overwritten from the default syslog pattern.
Please consult the event log table in the elasticsearch documentation (the cast-log-mellanox-event index above) for details on the message fields.
MELLANOXMSG %{MELLANOXTIME:timestamp} \[%{NUMBER:log_counter}\] \[%{NUMBER:event_id}\] %{WORD:severity} \[%{WORD:event_type}\] %{WORD:category} %{GREEDYDATA:message}
Console¶
Note
This document is designed to configure the xCAT service nodes to ship goconserver output to logstash (written using xCAT 2.13.11).
Logstash Port: | 10522 |
---|---|
Relevant Directories: | |
/etc/goconserver
|
CSM recommends using the goconserver bundled in the xCAT dependencies and documented in xCat-GoConserver. A limited configuration guide is provided below, but for gaps or more details please refer to the xCAT read the docs.
- Install the goconserver and start it:
yum install goconserver
systemctl stop conserver.service
makegocons
- Configure /etc/goconserver to send messages to the Logstash server associated with the service node (generally localhost):
# For options above this line refer to the xCAT read-the-docs
logger:
tcp:
- name: Logstash
host: <Logstash-Server>
port: 10522 # This is the port in the sample configuration.
timeout: 3 # Default timeout time.
- Restart the goconserver:
service goconserver restart
Format
The goconserver will now start sending data to the Logstash server in the form of JSON messages:
{
    "type" : "console",
    "message" : "c650f04p23 login: jdunham",
    "node" : "c650f04p23",
    "date" : "2018-05-08T09:49:36.530886-04"
}
The CAST logstash filter then mutates this data to properly store it in the elasticsearch backing store:
Field | New Field |
---|---|
node | hostname |
date | @timestamp |
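A minimal sketch of the kind of Logstash filter that performs this renaming (field names taken from the JSON above; the exact filter shipped by CAST may differ) is:
filter {
  # Rename goconserver fields to line up with the cast-log-console mapping.
  mutate {
    rename => { "node" => "hostname" }
  }
  # Parse the goconserver date field into @timestamp, then drop the original field.
  date {
    match => ["date", "ISO8601"]
    target => "@timestamp"
    remove_field => ["date"]
  }
}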
Cumulus Switch¶
Attention
The CAST documentation was written using Cumulus Linux 3.5.2, please ensure the switch is at this level or higher.
Cumulus switch logging is performed through the usage of the rsyslog service. CAST recommends placing Cumulus logging in the syslog-log indices at this time.
Configuration of the logging on the switch can be achieved through the net command:
net add syslog host ipv4 ${logstash_node} port tcp ${syslog_port}
net commit
This command will populate the /etc/rsyslog.d/11-remotesyslog.conf file with a rule to export the syslog to the supplied hostname and port. If using the default CAST syslog configuration this file will need to be modified to have the CAST syslog template:
vi /etc/rsyslog.d/11-remotesyslog.conf
$template logFormat, "%TIMESTAMP:::date-rfc3339% %HOSTNAME% %APP-NAME% %PROCID% %syslogseverity-text% %msg%\n"
*.*;cron.none @@${logstash_node}:${syslog_port};logFormat
sudo service rsyslog restart
Note
For more configuration details please refer to the official Cumulus Linux User Guide.
Counters¶
The default configuration of the CAST Big Data Store has support for a number of counter types, most of which are processed through Logstash and the CAST Event Correlator.
GPFS¶
In order to collect counters from the GPFS file system CAST leverages the zimon utility. A python script interacting with this utility is provided in the ibm-csm-bds-*.noarch.rpm.
The following document assumes that the cluster’s service nodes be running the pmcollector service and any nodes requiring metrics be running pmsensors.
Collector¶
RPMs: | |
---|---|
Config: | /opt/IBM/zimon/ZIMonCollector.cfg |
In the CAST architecture a pmcollector should be run on each of the service nodes in federated mode. To configure federated mode on the collector, add all of the nodes configured as collectors to /opt/IBM/zimon/ZIMonCollector.cfg; this configuration should then be propagated to all of the collector nodes in the cluster.
peers = {
host = "collector1"
port = "9085"
},
{
host = "collector2"
port = "9085"
},
{
host = "collector3"
port = "9085"
}
After configuring the collector start and enable the pmcollectors.
systemctl start pmcollector
systemctl enable pmcollector
Sensors¶
RPMs: | gpfs.gss.pmsensors.ppc64le (Version 5.0 or greater) |
---|---|
Config: | /opt/IBM/zimon/ZIMonSensors.cfg |
It is recommended to use the GPFS managed configuration file through use of the mmperfmon command. Before setting the node to do performance monitoring it's recommended that at least the following commands be run:
/usr/lpp/mmfs/bin/mmperfmon config generate --collectors ${collectors}
/usr/lpp/mmfs/bin/mmperfmon config update GPFSNode.period=0
It's recommended to specify at least two collectors, as defined in the Collector section of this document. The pmsensors service will attempt to distribute the load and account for failover in the event of a downed collector.
After generating the sensor configuration the nodes must then be set to perfmon:
$ /usr/lpp/mmfs/bin/mmchnode --perfmon -N ${nodes}
Assuming /opt/IBM/zimon/ZIMonSensors.cfg has been properly distributed the sensors may then be started on the nodes.
$ systemctl start pmsensors
$ systemctl enable pmsensors
Attention
To detect failures of the power hardware the following must be prepared on the management node of the GPFS cluster.
$ vi /var/mmfs/mmsysmon/mmsysmonitor.conf
[general]
powerhw_enabled=True
$ mmsysmoncontrol restart
Python Script¶
CAST RPM: | ibm-csm-bds-*.noarch.rpm |
---|---|
Script Location: | |
/opt/ibm/csm/bigdata/data-aggregators/zimonCollector.py | |
Dependencies: | gpfs.base.ppc64le (Version 5.0 or greater) |
CAST provides a script for easily querying zimon, then sending the results to Big Data Store. The zimonCollector.py python script leverages the python interface to zimon bundled in the gpfs.base rpm. The help output for this script is duplicated below:
A tool for extracting zimon sensor data from a gpfs collector node and shipping it in a json
format to logstash. Intended to be run from a cron job.
Options:
Flag | Description < default >
==================================|============================================================
-h, --help | Displays this message.
--collector <host> | The hostname of the gpfs collector. <127.0.0.1>
--collector-port <port> | The collector port for gpfs collector. <9084>
--logstash <host> | The logstash instance to send the JSON to. <127.0.0.1>
--logstash-port <port> | The logstash port to send the JSON to. <10522>
--bucket-size <int> | The size of the bucket accumulation in seconds. <60>
--num-buckets <int> | The number of buckets to retrieve in the query. <10>
--metrics <Metric1[,Metric2,...]> | A comma separated list of zimon sensors to get metrics from.
| <cpu_system,cpu_user,mem_active,gpfs_ns_bytes_read,
| gpfs_ns_bytes_written,gpfs_ns_tot_queue_wait_rd,
| gpfs_ns_tot_queue_wait_wr>
CAST expects this script to be run from a service node configured for both logstash and zimon collection. In this release this script need only be executed on one service node in the cluster to gather sensor data.
The recommended cron configuration for this script is as follows:
*/10 * * * * /opt/ibm/csm/bigdata/data-aggregators/zimonCollector.py
The output of this script is a newline delimited list of JSON designed for easy ingestion by the logstash pipeline. A sample from the default script configuration is as follows:
{
"type": "zimon",
"source": "c650f99p06",
"data": {
"gpfs_ns_bytes_written": 0,
"mem_active": 1769963,
"cpu_system": 0.015,
"cpu_user": 0.004833,
"gpfs_ns_tot_queue_wait_rd": 0,
"gpfs_ns_bytes_read": 0,
"gpfs_ns_tot_queue_wait_wr": 0
},
"timestamp": 1529960640
}
In the default configuration of this script records will be shipped as JSONDataSources.
UFM¶
CAST RPM: | ibm-csm-bds-*.noarch.rpm |
---|---|
Script Location: | |
/opt/ibm/csm/bigdata/data-aggregators/ufmCollector.py |
CAST provides a python script to gather UFM counter data. The script is intended to be run from either a service node running logstash or the UFM node as a cron job. A description of the script from the help functionality is reproduced below:
Purpose: Simple script that is packaged with BDS. Can be run individually and
independantly when ever called upon.
Usage:
- Run the program.
- pass in parameters.
- REQUIRED [--ufm] : This tells program where UFM is (an IP address)
- REQUIRED [--logstash] : This tells program where logstash is (an IP address)
- OPTIONAL [--logstash-port] : This specifies the port for logstash
- OPTIONAL [--ufm_restAPI_args-attributes] : attributes for ufm restAPI
- CSV
Example:
- Value1
- Value1,Value2
- OPTIONAL [--ufm_restAPI_args-functions] : functions for ufm restAPI
- CSV
- OPTIONAL [--ufm_restAPI_args-scope_object] : scope_object for ufm restAPI
- single string
- OPTIONAL [--ufm_restAPI_args-interval] : interval for ufm restAPI
- int
- OPTIONAL [--ufm_restAPI_args-monitor_object] : monitor_object for ufm restAPI
- single string
- OPTIONAL [--ufm_restAPI_args-objects] : objects for ufm restAPI
- CSV
FOR ALL ufm_restAPI related arguments:
- see ufm restAPI for documentation
- json format
- program provides default value if no user provides
The recommended cron configuration for this script is as follows:
*/10 * * * * /opt/ibm/csm/bigdata/data-aggregators/ufmCollector.py
The output of this script is a newline delimited list of JSON designed for easy ingestion by the logstash pipeline. A sample from the default script configuration is as follows:
{
"type": "counters-ufm",
"source": "port2",
"statistics": {
...
},
"timestamp": 1529960640
}
In the default configuration of this script records will be shipped as JSONDataSources.
JSON Data Sources¶
Logstash Port: | 10522 |
---|---|
Required Field: | type |
Recommended Fields: | |
timestamp |
Attention
This section is currently a work in progress.
CAST recommends JSON data sources be shipped to Logstash to leverage the batching and data enrichment tool. The default logstash configuration shipped with CAST will designate port 10522. JSON shipped to this port should have the type field specified. This type field will be used in defining the name of the index.
Data Aggregators shipping to this port will generate indices with the following name format: cast-%{type}-%{+YYYY.MM.dd}
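For example, a hypothetical aggregator shipping the following document to port 10522 would have it stored in an index named cast-mydata-<ingestion date>; the type value and data fields here are purely illustrative:
{
    "type": "mydata",
    "source": "c650f99p06",
    "timestamp": 1529960640,
    "data": { "my_counter": 42 }
}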
crassd bmc alerts¶
While not bundled with CAST, the crassd daemon is used to monitor BMC events and counters. The following document is written assuming you have access to an ibm-crassd-*.ppc64le rpm.
- Install the rpm:
yum install -y ibm-crassd-*.ppc64le.rpm
- Edit the configuration file located at /opt/ibm/ras/etc/ibm-crassd.config:
This file needs the [logstash] configuration section configured and logstash=True set in the [notify] section.
- Start crassd:
systemctl start ibm-crassd
Attention
The above section is a limited rundown of crassd configuration, for greater detail consult the official documentation for crassd.
CAST Data Sources¶
csmd syslog¶
Logstash Port: | 10515 |
---|
CAST has enabled the boost syslog utility through use of the csmd configuration file.
"csm" : {
...
"log" : {
...
"sysLog" : true,
"server" : "127.0.0.1",
"port" : "514"
}
...
}
By default enabling syslog will write to the localhost syslog port using UDP. The target may be changed by the server and port options.
The syslog will follow the RFC 3164 syslog protocol. After being filtered through the Syslog Redirection template the log will look something like this:
2018-05-17T11:17:32-04:00 c650f03p37-mgt CAST - debug csmapi TIMING: 1525910812,17,2,1526570252507364568,1526570252508039085,674517
2018-05-17T11:17:32-04:00 c650f03p37-mgt CAST - info csmapi [1525910812]; csm_allocation_query_active_all end
2018-05-17T11:17:32-04:00 c650f03p37-mgt CAST - info csmapi CSM_CMD_allocation_query_active_all[1525910812]; Client Recv; PID: 14921; UID:0; GID:0
These logs will then stored in the cast-log-syslog index using the default CAST configuration.
CSM Buckets¶
Logstash Port: | 10522 |
---|
CSM provides a mechanism for running buckets to aggregate environmental and counter data from a variety of sources in the cluster. This data will be aggregated and shipped by the CSM aggregator to a logstash server (typically the local logstash server).
Each run of a bucket will be encapsulated in a JSON document with the following pattern:
{
"type": "type-of-record",
"source": "source-of-record",
"timestamp": "timestamp-of-record",
"data": {
...
}
}
type: | The type of the bucket, used to determine the appropriate index. |
---|---|
source: | The source of the bucket run (typically a hostname, but can depend on the bucket). |
timestamp: | The timestamp of the collection |
data: | The actual data from the bucket run. |
Note
Each JSON document is newline delimited.
CSM Configuration¶
In the aggregator configuration file the following must be configured to enable this feature:
"bds" : {
"host" : "__LOGSTASH_IP__"
"port" : 10522
}
host: | The hostname the logstash server is configured on. |
---|---|
port: | A tcp port capable of receiving a JSON encoded message. 10522 is the default port in CAST logstash configuration files. |
This will ship the environmental data to the specified IP and port. CAST officially suggests the use of Logstash for this feature, targeting the local Logstash instance running on the service node.
Attention
For users not employing Logstash in their solution, the output of this feature is a newline-delimited list of JSON documents formatted as seen above.
Logstash Configuration¶
CAST uses a generic port (10522) for processing data matching the JSONDataSources pattern. The default logstash configuration file specifies the following in the input section of the configuration file:
tcp {
port => 10522
codec => "json"
}
Default Buckets¶
CSM supplies several default buckets for environmental collection:
Bucket Type | Source | Description |
---|---|---|
csm-env-gpu | Hostname | Environmental counters about the node’s GPUs. |
Database Archiving¶
Logstash Port: | 10523 |
---|---|
Script Location: | /opt/ibm/csm/db/csm_db_history_archive.sh |
Script RPM: | csm-csmdb-*.rpm |
CAST supplies a command line utility for archiving the contents of the CSM database history tables. When run, the utility (csm_db_history_archive.sh) will append to a daily JSON dump file (<table>.archive.<YYYY>-<MM>-<DD>.json) the contents of all history tables and the RAS event action table. The content appended is the next n records without an archive time, where n is provided to the command line utility. Any records archived in this manner are then marked with an archive time for their eventual removal from the database. The utility should be executed on the node running the CSM Postgres database.
Each row archived in this way will be converted to a JSON document with the following pattern:
{
"type": "db-<table-name>",
"data": { "<table-row-contents>" }
}
type: | The table in the database, converted to index in default configuration. |
---|---|
data: | Encapsulates the row data. |
CAST recommends the use of a cron job to run this archival. The following sample runs every five minutes, gathers up to 100 unarchived records from the csmdb tables, then appends the JSON formatted records to the daily dump file in the /var/log/ibm/csm/archive directory.
$ crontab -e
*/5 * * * * /opt/ibm/csm/db/csm_db_history_archive.sh -d csmdb -n 100 -t /var/log/ibm/csm/archive
CAST recommends ingesting this data through the filebeats utility. A sample log configuration is given below:
filebeat.prospectors:
- type: log
enabled: true
paths:
- "/var/log/ibm/csm/archive/*.json"
# CAST recommends tagging all filebeats input sources.
tags: ["archive"]
Note
For the sake of brevity further filebeats configuration documentation will be omitted. Please refer to the filebeats documentation for more details.
To configure Logstash to ingest the archives the beats input plugin must be used; CAST recommends port 10523 for ingesting beats records, as shown below:
input
{
beats {
port => 10523
codec=>"json"
}
}
filter
{
mutate {
remove_field => [ "beat", "host", "source", "offset", "prospector"]
}
}
output
{
elasticsearch {
hosts => [<elastic-server>:<port>]
index => "cast-%{type}-%{+YYYY.MM.dd}"
http_compression =>true
document_type => "_doc"
}
}
In this sample configuration the archived history will be stored in the cast-db-<table_name> indices.
Transaction Log¶
Logstash Port: | 10523 |
---|
Note
CAST only ships the transaction log to a local file; a utility such as Filebeats or a local Logstash service is needed to ship the log to a Big Data Store.
CAST offers a transaction log for select CSM API events. Today the following events are tracked:
- Allocation create/delete/update
- Allocation step begin/end
This transaction log represents a set of events that may be assembled to create the current state of an event in a Big Data Store.
In the CSM design these transactions are intended to be stored in a single Elasticsearch index, with each transaction identified by a uid in the index.
CSM Configuration¶
To enable the transaction logging mechanism the following configuration settings must be specified in the CSM master configuration file:
"log" :
{
"transaction" : true,
"transaction_file" : "/var/log/ibm/csm/csm_transaction.log",
"transaction_rotation_size" : 1000000000
}
transaction: | Enables the transaction log mechanism. |
---|---|
transaction_file: | Specifies the location the transaction log will be saved to. |
transaction_rotation_size: | The size of the file (in bytes) at which to rotate the log. |
Each transaction record will follow the following pattern:
{
"type": "<transaction-type>",
"data": { <table-row-contents>},
"traceid":<traceid-api>,
"uid": <unique-id>
}
type: | The type of the transaction, converted to index in default configuration. |
---|---|
data: | Encapsulates the transactional data. |
traceid: | The API’s trace id as used in the CSM API trace functionality. |
uid: | A unique identifier for the record in the elasticsearch index. |
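As an illustration, an allocation transaction record might look like the following. The fields inside data are abbreviated and assumed here; the real record carries the full allocation contents described in Supported Transactions.
{
    "type": "allocation",
    "data": { "allocation_id": 42, "state": "running" },
    "traceid": 1525910812,
    "uid": 42
}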
Filebeats Configuration¶
CAST recommends ingesting this data through the filebeats utility. A sample log configuration is given below:
filebeat.prospectors:
- type: log
enabled: true
paths:
- /var/log/ibm/csm/csm_transaction.log
tags: ["transaction"]
Note
For the sake of brevity further filebeats configuration documentation will be omitted. Please refer to the filebeats documentation for more details.
Warning
Filebeats has some difficulty with rollover events.
Logstash Configuration¶
To configure Logstash to ingest the archives the beats input plugin must be used; CAST recommends port 10523 for ingesting beats records. Please note that this configuration only creates one index per transaction log type; this is to prevent transactions that span days from duplicating logs.
input
{
beats {
port => 10523
codec=>"json"
}
}
filter
{
mutate {
remove_field => [ "beat", "host", "source", "offset", "prospector"]
}
}
output
{
elasticsearch {
hosts => [<elastic-server>:<port>]
action => "update"
index => "cast-%{type}"
http_compression =>true
doc_as_upsert => true
document_id => "%{uid}"
document_type => "_doc"
}
}
The resulting indices for this configuration will be one per transaction type with each document corresponding to the current state of a set of transactions.
Supported Transactions¶
The transactions currently tracked by CSM are as follows:
type | uid | data |
---|---|---|
allocation | <allocation_id> | Superset of csmi_allocation_t. Adds running-start-timestamp and running-end-timestamp. Failed allocation creates have special state: reverted. |
allocation-step | <allocation_id>-<step_id> | Direct copy of csmi_allocation_step_t. |
Beats¶
Official Documentation: | Beats Reference |
---|---|
Beats are a collection of open source data shippers. CAST employs a subset of these beats to facilitate data aggregation.
Filebeats¶
Official Documentation: | Filebeats Reference |
---|---|
Filebeats is used to ship the CSM transactional log to the big data store. It was selected for its high reliability in data transmission and its existing integration in the elastic stack.
Installation¶
The following installation guide deals with configuring filebeats for the CSM transaction log; for a more generalized installation guide please consult the official Filebeats Reference.
- Install the filebeats rpm on the node:
rpm -ivh filebeat-*.rpm
- Configure the /etc/filebeat/filebeat.yml file:
CAST ships a sample configuration file in the ibm-csm-bds-*.noarch rpm at /opt/ibm/csm/bigdata/beats/config/filebeat.yml. This file is preconfigured to point at the CSM database archive files and the csm transaction logs. Users will need to replace two keywords before using this configuration:
Keyword | Description | Sample Value |
---|---|---|
_KIBANA_HOST_PORT_ | A string containing the “hostname:port” pairing of the Kibana server. | “10.7.4.30:5601” |
_LOGSTASH_IP_PORT_LIST_ | A list of “hostname:port” pairs pointing to Logstash servers to ingest the data (current CAST recommendation is a single instance of Logstash). | [“10.7.4.41:10523”] |
- Start the filebeats service.
systemctl start filebeat.service
Filebeats should now be sending ingested data to the Logstash instances specified in the configuration file.
Python Guide¶
Elasticsearch API¶
CAST leverages the Elasticsearch API python library to interact with Elasticsearch. If running on a node with internet access, the following process may be used to install this library:
pip install elasticsearch
If the node doesn’t have access to the internet please refer to the official python documentation for the installation of wheels: Installing Packages.
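A minimal sketch of connecting to the cluster with this library is shown below. The server address is a placeholder, and reading the CAST_ELASTIC environment variable mirrors the convention used by the use case scripts that follow; this is not a verbatim excerpt from those scripts.
import os

from elasticsearch import Elasticsearch

# Placeholder Elasticsearch server; the use case scripts read this from
# the CAST_ELASTIC environment variable.
server = os.environ.get("CAST_ELASTIC", "10.7.4.14:9200")
es = Elasticsearch([server])

# Verify the connection by printing basic cluster information.
print(es.info())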
Big Data Use Cases¶
CAST offers a collection of use case scripts designed to interact with the Big Data Store through the elasticsearch interface.
findJobTimeRange.py¶
This use case may be considered a building block for the remaining ones. This use case demonstrates the use of the cast-allocation transactional index to get the time range of a job.
The usage of this use case is described by the --help option.
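A rough sketch of the kind of query this use case performs is shown below; this is not the script itself, and the data.allocation_id field name is an assumption based on the transaction record format described earlier.
from elasticsearch import Elasticsearch

es = Elasticsearch(["10.7.4.14:9200"])  # placeholder server

allocation_id = 42  # hypothetical job identifier

# Search the transactional allocation index for the matching record.
query = {"query": {"match": {"data.allocation_id": allocation_id}}}
result = es.search(index="cast-allocation", body=query)

for hit in result["hits"]["hits"]:
    data = hit["_source"]["data"]
    # The running start/end timestamps are part of the allocation superset.
    print(data.get("running-start-timestamp"), data.get("running-end-timestamp"))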
findJobKeys.py¶
This use case represents two commingled use cases. First, when supplied a job identifier (allocation id or job id) and a keyword (a case-insensitive regular expression), the script will generate a listing of keywords and their occurrence rates in records associated with the supplied job. Association is filtered by the time range of the job and the hostnames that participated in the job.
A secondary use case is presented in the verbose flag, allowing the user to see a list of all entries matching the keyword.
usage: findJobKeys.py [-h] [-a int] [-j int] [-s int] [-t hostname:port]
[-k [key [key ...]]] [-v] [--size size]
[-H [host [host ...]]]
A tool for finding keywords in the "message" field during the run time of a job.
optional arguments:
-h, --help show this help message and exit
-a int, --allocationid int
The allocation ID of the job.
-j int, --jobid int The job ID of the job.
-s int, --jobidsecondary int
The secondary job ID of the job (default : 0).
-t hostname:port, --target hostname:port
An Elasticsearch server to be queried. This defaults
to the contents of environment variable
"CAST_ELASTIC".
-k [key [key ...]], --keywords [key [key ...]]
A list of keywords to search for in the Big Data
Store. Case insensitive regular expressions (default :
.*). If your keyword is a phrase (e.g. "xid 13")
regular expressions are not supported at this time.
-v, --verbose Displays any logs that matched the keyword search.
--size size The number of results to be returned. (default=30)
-H [host [host ...]], --hostnames [host [host ...]]
A list of hostnames to filter the results to (filters on the "hostname" field, job independent).
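A hypothetical invocation, searching allocation 42 for error and XID messages against a placeholder Elasticsearch server:
python findJobKeys.py -a 42 -t 10.7.4.14:9200 -k "error" "xid 13" -v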
findJobsRunning.py¶
A use case for finding all jobs running at the supplied timestamp. This use case will display a list of jobs for which the start time is less than the supplied time and which have either no end time or an end time greater than the supplied time.
usage: findJobsRunning.py [-h] [-t hostname:port] [-T YYYY-MM-DDTHH:MM:SS]
[-s size] [-H [host [host ...]]]
A tool for finding jobs running at the specified time.
optional arguments:
-h, --help show this help message and exit
-t hostname:port, --target hostname:port
An Elasticsearch server to be queried. This defaults
to the contents of environment variable
"CAST_ELASTIC".
-T YYYY-MM-DDTHH:MM:SS, --time YYYY-MM-DDTHH:MM:SS
A timestamp representing a point in time to search for
all running CSM Jobs. HH, MM, SS are optional, if not
set they will be initialized to 0. (default=now)
-s size, --size size The number of results to be returned. (default=1000)
-H [host [host ...]], --hostnames [host [host ...]]
A list of hostnames to filter the results to.
findJobMetrics.py¶
Leverages the built in Elasticsearch statistics functionality. Takes a list of fields and a job identifier then computes the min, max, average, and standard deviation of those fields. The calculations are computed against all records for the field during the running time of the job on the nodes that participated.
This use case also has the ability to generate correlations between the fields specified.
usage: findJobMetrics.py [-h] [-a int] [-j int] [-s int] [-t hostname:port]
[-H [host [host ...]]] [-f [field [field ...]]]
[-i index] [--correlation]
A tool for finding metrics about the nodes participating in the supplied job
id.
optional arguments:
-h, --help show this help message and exit
-a int, --allocationid int
The allocation ID of the job.
-j int, --jobid int The job ID of the job.
-s int, --jobidsecondary int
The secondary job ID of the job (default : 0).
-t hostname:port, --target hostname:port
An Elasticsearch server to be queried. This defaults
to the contents of environment variable
"CAST_ELASTIC".
-H [host [host ...]], --hostnames [host [host ...]]
A list of hostnames to filter the results to.
-f [field [field ...]], --fields [field [field ...]]
A list of fields to retrieve metrics for (REQUIRED).
-i index, --index index
The index to query for metrics records.
--correlation Displays the correlation between the supplied fields
over the job run.
findUserJobs.py¶
Retrieves a list of all jobs that the supplied user owned. This list can be filtered to a time range or on the state of the allocation. If the --commonnodes argument is supplied, a list of nodes will be displayed where the node participated in more jobs than the supplied threshold. The colliding nodes will be sorted by the number of jobs they participated in.
usage: findUserJobs.py [-h] [-u username] [-U userid] [--size size]
[--state state] [--starttime YYYY-MM-DDTHH:MM:SS]
[--endtime YYYY-MM-DDTHH:MM:SS]
[--commonnodes threshold] [-v] [-t hostname:port]
A tool for finding a list of the supplied user's jobs.
optional arguments:
-h, --help show this help message and exit
-u username, --user username
The user name to perform the query on, either this or
-U must be set.
-U userid, --userid userid
The user id to perform the query on, either this or -u
must be set.
--size size The number of results to be returned. (default=1000)
--state state Searches for jobs matching the supplied state.
--starttime YYYY-MM-DDTHH:MM:SS
A timestamp representing the beginning of the absolute
range to look for failed jobs, if not set no lower
bound will be imposed on the search.
--endtime YYYY-MM-DDTHH:MM:SS
A timestamp representing the ending of the absolute
range to look for failed jobs, if not set no upper
bound will be imposed on the search.
--commonnodes threshold
Displays a list of nodes that the user jobs had in
common if set. Only nodes with collisions exceeding
the threshold are shown. (Default: -1)
-v, --verbose Displays all retrieved fields from the `cast-
allocation` index.
-t hostname:port, --target hostname:port
An Elasticsearch server to be queried. This defaults
to the contents of environment variable
"CAST_ELASTIC".
findWeightedErrors.py¶
An extension of the findJobKeys.py use case. This use case will query elasticsearch for a job then run a predefined collection of mappings to assist in debugging a problem with the job.
usage: findWeightedErrors.py [-h] [-a int] [-j int] [-s int]
[-t hostname:port] [-k [key [key ...]]] [-v]
[--size size] [-H [host [host ...]]]
[--errormap file]
A tool which takes a weighted listing of keyword searches and presents
aggregations of this data to the user.
optional arguments:
-h, --help show this help message and exit
-a int, --allocationid int
The allocation ID of the job.
-j int, --jobid int The job ID of the job.
-s int, --jobidsecondary int
The secondary job ID of the job (default : 0).
-t hostname:port, --target hostname:port
An Elasticsearch server to be queried. This defaults
to the contents of environment variable
"CAST_ELASTIC".
-v, --verbose Displays the top --size logs matching the --errormap mappings.
--size size The number of results to be returned. (default=10)
-H [host [host ...]], --hostnames [host [host ...]]
A list of hostnames to filter the results to.
--errormap file A map of errors to scan the user jobs for, including
weights.
JSON Mapping Format¶
This use case utilizes a JSON mapping to define a collection of keywords and values to query the elasticsearch cluster for. These values can leverage the native elasticsearch boost feature to apply weights to the mappings allowing a user to quickly determine high priority items using scoring.
The format is defined as follows:
[
{
"category" : "A category, used for tagging the search in output. (Required)",
"index" : "Matches an index on the elasticsearch cluster, uses elasticsearch syntax. (Required)",
"source" : "The hostname source in the index.",
"mapping" : [
{
"field" : "The field in the index to check against(Required)",
"value" : "A value to query for; can be a phrase, regex or number. (Required)",
"boost" : "The elasticsearch boost factor, may be thought of as a weight. (Required)",
"threshold" : "A range comparison operator: 'gte', 'gt', 'lte', 'lt'. (Optional)"
}
]
}
]
When applied to a real configuration a mapping file will look something like this:
[
{
"index" : "*syslog*",
"source" : "hostname",
"category": "Syslog Errors" ,
"mapping" : [
{
"field" : "message",
"value" : "error",
"boost" : 50
},
{
"field" : "message",
"value" : "kdump",
"boost" : 60
},
{
"field" : "message",
"value" : "kernel",
"boost" : 10
}
]
},
{
"index" : "cast-zimon*",
"source" : "source",
"category" : "Zimon Counters",
"mapping" : [
{
"field" : "data.mem_active",
"value" : 12000000,
"boost" : 100,
"threshold" : "gte"
},
{
"field" : "data.cpu_system",
"value" : 10,
"boost" : 200,
"threshold" : "gte"
}
]
}
]
Note
The above configuration was designed for demonstrative purposes; it is recommended that users create their own mappings based on this example.
UFM Collector¶
A tool interacting with the UFM collector is provided in ibm-csm-bds-*.noarch.rpm. This script performs 3 key operations (a minimal sketch follows this list):
- Connects to the UFM monitoring snapshot RESTful interface.
- This connection specifies a collection of attributes and functions to execute against the interface.
- Processes and enriches the output of the REST connection.
- Adds a type, timestamp and source field to the root of the JSON document.
- Opens a socket to a target logstash instance and writes the payload.
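A minimal sketch of this flow is shown below; the UFM endpoint, credentials, attribute names and type value are placeholders, not the actual values used by the shipped script.
import json
import socket
import time

import requests

# Placeholder UFM monitoring snapshot endpoint and logstash target.
UFM_URL = "https://ufm-server/ufmRest/monitoring/snapshot_query"
LOGSTASH = ("logstash-server", 10522)

# 1. Connect to the UFM RESTful interface, specifying a collection of
#    attributes and functions to execute (placeholders here).
payload = {"attributes": ["Infiniband_MBOut"], "functions": ["RAW"]}
response = requests.post(UFM_URL, json=payload, auth=("admin", "password"), verify=False)
record = response.json()

# 2. Enrich the output: add type, timestamp and source fields to the root
#    of the JSON document.
record["type"] = "counters-ufm"          # placeholder type
record["timestamp"] = int(time.time())
record["source"] = "ufm-server"

# 3. Open a socket to the target logstash instance and write the payload,
#    newline delimited.
with socket.create_connection(LOGSTASH) as sock:
    sock.sendall((json.dumps(record) + "\n").encode("utf-8"))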
CSM Event Correlator Filter Plugin¶
Attention
The following document is a work in progress! The CSM Event Correlator is currently under development and the interface is subject to change.
Parses arbitrary text and structures the results of the parse into actionable events.
The CSM Event Correlator is a utility by which a system administrator may specify a collection of patterns (grok style), grouping by context (e.g. syslog, event log, etc.), which trigger actions (ruby scripts).
Installation¶
The CSM Event Correlator comes bundled in the ibm-csm-bds-logstash*.noarch.rpm rpm. When installing the rpm, any old versions of the plugin will be removed and the bundled version will be installed.
CSM Event Correlator Pipeline Configuration Options¶
This plugin supports the following configuration options:
Setting | Input type | Required |
---|---|---|
events_dir | string | No |
patterns_dir | array | No |
named_captures_only | boolean | No |
Please refer to common-options for options supported in all Logstash filter plugins.
This plugin is intended to be used in the filter block of the logstash configuration file. A sample configuration is reproduced below:
filter {
csm_event_correlator {
events_dir => "/etc/logstash/patterns/events.yml"
patterns_dir => "/etc/logstash/patterns/*.conf"
}
}
events_dir¶
Value type: | string |
---|---|
Default value: | /etc/logstash/conf.d/events.yml |
The configuration file for the event correlator; see CSM Event Correlator Event Configuration File for details on the contents of this file.
This file is loaded on pipeline creation.
Attention
This field will use an array in future iterations to specify multiple configuration files. This change should not impact existing configurations.
patterns_dir¶
Value type: | array |
---|---|
Default value: | [] |
A directory, file or filepath with a glob. The listing of files will be parsed for grok patterns which may be used in writing patterns for event correlation. If no glob is specified in the path, * is used.
Configuration with a file glob:
patterns_dir => "/etc/logstash/patterns/*.conf" # Retrieves all .conf files in the directory.
Configuration with multiple files:
patterns_dir => ["/etc/logstash/patterns/mellanox_grok.conf", "/etc/logstash/patterns/ibm_grok.conf"]
CSM Event Correlator will load the default Logstash patterns regardless of the contents of this field.
Pattern files are plain text with the following format:
NAME PATTERN
For example:
GUID [0-9a-f]{16}
The patterns are loaded on pipeline creation.
named_captures_only¶
Value type: | boolean |
---|---|
Default value: | true |
If true, only named grok captures are stored. Anonymous captures are considered named.
CSM Event Correlator Event Configuration File¶
CSM Event Correlator uses a YAML file for configuration. The YAML configuration is hierarchical with 3 major groupings: metadata, data sources and categories.
This is a sample configuration of this file:
---
# Metadata
ras_create_url: "/csmi/V1.0/ras/event/create"
csm_target: "localhost"
csm_port: 4213
data_sources:
# Data Sources
syslog:
ras_location: "syslogHostname"
ras_timestamp: "timestamp"
event_data: "message"
category_key: "programName"
categories:
# Categories
NVRM:
- tag: "XID_GENERIC"
pattern: "Xid(%{DATA:pciLocation}): %{NUMBER:xid:int},"
ras_msg_id: "gpu.xid.%{xid}"
action: 'unless %{xid}.between?(1, 81); ras_msg_id="gpu.xid.unknown" end; .send_ras;'
mlx5_core:
- tag: "IB_CABLE_PLUG"
pattern: "mlx5_core %{MLX5_PCI}.*module %{NUMBER:module}, Cable (?<cableEvent>(un)?plugged)"
ras_msg_id: "ib.connection.%{cableEvent}"
action: ".send_ras;"
mmsysmon:
- tag: "MMSYSMON_CLEAN_MOUNT"
pattern: "filesystem %{NOTSPACE:filesystem} was (?<mountEvent>(un)?mounted)"
ras_msg_id: "spectrumscale.fs.%{mountEvent}"
action: ".send_ras;"
- tag: "MMSYSMON_UNMOUNT_FORCED"
pattern: "filesystem %{NOTSPACE:filesystem} was.*forced.*unmount"
ras_msg_id: "spectrumscale.fs.unmount_forced"
action: ".send_ras;"
...
Metadata¶
The metadata section may be thought of as global configuration options that will apply to all events in the event correlator.
Field | Input type | Required |
---|---|---|
ras_create_url | string | Yes <Initial Release> |
csm_target | string | Yes <Initial Release> |
csm_port | integer | Yes <Initial Release> |
data_sources | map | Yes |
ras_create_url¶
Value type: | string |
---|---|
Sample value: | /csmi/V1.0/ras/event/create |
Specifies the REST create resource on the node running the CSM REST Daemon. This path will be used by the .send_ras; utility.
Attention
In a future release /csmi/V1.0/ras/event/create will be the default value.
csm_target¶
Value type: | string |
---|---|
Sample value: | 127.0.0.1 |
A server running the CSM REST daemon. This server will be used to generate ras events with the .send_ras; utility.
Attention
In a future release 127.0.0.1 will be the default value.
csm_port¶
Value type: | integer |
---|---|
Sample value: | 4213 |
The port on the server running the CSM REST daemon. This port will be used to connect by the .send_ras; utility.
Attention
In a future release 4213 will be the default value.
data_sources¶
Value type: | map |
---|
A mapping of data sources to event correlation rules. The key of the data_sources field matches the type field of the logstash event processed by the filter plugin. The type field may be set in the input section of the logstash configuration file.
Below is an example of setting the type of all incoming communication on the 10515 tcp port to have the syslog type:
input {
tcp {
port => 10515
type => "syslog"
}
}
The YAML configuration file for the syslog data source would then look something like this:
syslog:
# Event Data Sources configuration settings.
# More data sources.
The YAML configuration uses this structure to reduce the pattern space for event matching. If the user doesn’t configure a type in this data_sources map, CSM will discard events of that type when considering event correlation.
Data Sources¶
Event data sources are entries in the data_sources map. Each data source has a set of configuration options which allow the event correlator to parse the structured data of the logstash event being checked for event correlation/action generation.
This section has the following configuration fields:
Field | Input type | Required |
---|---|---|
ras_location | string | Yes <Initial release> |
ras_timestamp | string | Yes <Initial release> |
event_data | string | Yes |
category_key | string | Yes |
categories | map | Yes |
ras_location¶
Value type: | string |
---|---|
Sample value: | syslogHostname |
Specifies a field in the logstash event received by the filter. The contents of this field are then used to generate the ras event spawned with the .send_ras; utility.
The referenced data is used in the location_name of the REST payload sent by .send_ras;.
For example, assume an event is being processed by the filter. This event has the field syslogHostname populated at some point in the pipeline’s execution to have the value of cn1. It is determined that this event was worth responding to and a RAS event is created. Since ras_location was set to syslogHostname the value of cn1 is POSTed to the CSM REST daemon when creating the RAS event.
ras_timestamp¶
Value type: | string |
---|---|
Sample value: | timestamp |
Specifies a field in the logstash event received by the filter. The contents of this field are then used to generate the ras event spawned with the .send_ras; utility.
The referenced data is used in the time_stamp of the REST payload sent by .send_ras;.
For example, assume an event is being processed by the filter. This event has the field timestamp populated at some point in the pipeline’s execution to have the value of Wed Feb 28 13:51:19 EST 2018. It is determined that this event was worth responding to and a RAS event is created. Since ras_timestamp was set to timestamp the value of Wed Feb 28 13:51:19 EST 2018 is POSTed to the CSM REST daemon when creating the RAS event.
event_data¶
Value type: | string |
---|---|
Sample value: | message |
Specifies a field in the logstash event received by the filter. The contents of this field are matched against the specified patterns.
Attention
This is the data checked for event correlation once the event list has been selected; make sure the correct event field is specified.
category_key¶
Value type: | string |
---|---|
Sample value: | programName |
Specifies a field in the logstash event received by the filter. The contents of this field are used to select the category in the categories map.
categories¶
Value type: | map |
---|
A mapping of data source categories to event correlation rules. The key of the categories field matches the field specified by category_key. In the included example this is the program name of a syslog event.
This mapping exists to reduce the number of pattern matches performed per event. Events that don’t have a match in the categories map are ignored when performing further pattern matches.
Each entry in this map is an array of event correlation rules with the schema described in Event Categories. Please consult the sample for formatting examples for this section of the configuration.
Event Categories¶
Event categories are entries in the categories map. Each category has a list of tagged configuration options which specify an event correlation rule.
This section has the following configuration fields:
Field | Input type | Required |
---|---|---|
tag | string | No |
pattern | string | Yes <Initial Release> |
action | string | Yes <Initial Release> |
extract | boolean | No |
ras_msg_id | string | No <Needed for RAS> |
tag¶
Value type: | string |
---|---|
Sample value: | XID_GENERIC |
A tag to identify the event correlation rule in the plugin. If not specified, an internal identifier will be assigned by the plugin. Tags starting with . will be rejected at the load phase, as this is a reserved pattern for internal tag generation.
Note
In the current release this mechanism is not fully implemented.
pattern¶
Value type: | string |
---|---|
Sample value: | mlx5_core %{MLX5_PCI}.*module %{NUMBER:module}, Cable (?<cableEvent>(un)?plugged) |
A grok based pattern, follows the rules specified in Grok Primer. This pattern will save any pattern match extractions to the event travelling through the pipeline. Additionally, any extractions will be accessible to the action to drive behavior.
action¶
Value type: | string |
---|---|
Sample value: | unless %{xid}.between?(1, 81); ras_msg_id="gpu.xid.unknown" end; .send_ras; |
A ruby script describing an action to take in response to an event. The action is taken when an event is matched. The plugin will compile these scripts at load time, cancelling the startup if invalid scripts are specified.
This script follows the rules specified in CSM Event Correlator Action Programming.
extract¶
Value type: | boolean |
---|---|
Default value: | false |
By default the Event Correlator doesn’t save the pattern match extractions from pattern to the final event shipped to elasticsearch or your big data platform of choice. To save the pattern extractions this field must be set to true.
Note
This field does not impact the writing of action scripts.
ras_msg_id¶
Value type: | string |
---|---|
Sample value: | gpu.xid.%{xid} |
A string representing the ras message id in event creation. This string may specify fields in the event object through use of the %{FIELD_NAME} pattern. The plugin will attempt to populate the string using this formatting before passing to the action processor.
For example, if the event has a field xid with value 42 the pattern gpu.xid.%{xid} will resolve to gpu.xid.42.
Grok Primer¶
CSM Event Correlator uses grok to drive pattern matching.
Grok is a regular expression pattern checking utility. A typical grok pattern has the following syntax: %{PATTERN_NAME:EXTRACTED_NAME}
PATTERN_NAME is the name of a grok pattern specified in a pattern file or in the default Logstash pattern space. Samples include NUMBER, IP and WORD.
EXTRACTED_NAME is the identifier to be assigned to the text in the event context. The EXTRACTED_NAME will be accessible in the action through use of the %{EXTRACTED_NAME} pattern as described later. EXTRACTED_NAME identifiers are added to the big data record in elasticsearch. The EXTRACTED_NAME section is optional, patterns without the EXTRACTED_NAME are matched, but not extracted.
For specifying custom patterns refer to custom patterns.
A grok pattern may also use raw regular expressions to perform non-extracting pattern matches. Anonymous extraction patterns may be specified with the following syntax: (?<EXTRACTED_NAME>REGEX)
EXTRACTED_NAME in the anonymous extraction pattern is identical to the named pattern. REGEX is a standard regular expression.
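For example, the following illustrative pattern (not one shipped with CAST) combines a named default pattern with an anonymous extraction:
%{IP:sourceIp} link (?<linkState>(up|down))
Matched against the text 10.7.4.12 link down, this extracts sourceIp = 10.7.4.12 and linkState = down, both of which become available to the action and, if extraction is enabled, to the stored event.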
CSM Event Correlator Action Programming¶
Programming actions is a central part of the CSM Event Correlator. This plugin supports action scripting using ruby. The action script supplied to the pipeline is converted to an anonymous function which is invoked when the event is processed.
Default Variables¶
The action script has a number of variables which are accessible to action writers:
Variable | Type | Description |
---|---|---|
event | LogStash::Event | The event the action is generated for, getters provided. |
ras_msg_id | string | The ras message id, formatted. |
ras_location | string | The location the RAS event originated from, parsed from event. |
ras_timestamp | string | The timestamp to assign to the RAS event. |
raw_data | string | The raw data which generated the action. |
The user may directly influence any of these fields in their action script; however, it is recommended that the user take caution when manipulating the event, as the contents of this field are ultimately written to any Logstash targets. The event members may be accessed using the %{field} syntax.
The ras_msg_id, ras_location, ras_timestamp, and raw_data fields are used with the .send_ras; action keyword.
Accessing Event Fields¶
Event fields are commonly used to drive event actions. These fields may be specified by the event correlation rule or other Logstash plugins. Due to the importance of this pattern, the CSM Event Correlator provides special syntactic sugar for field access: %{FIELD_NAME}.
This syntax is interpreted as event.get(FIELD_NAME) where the field name is a field in the event. If the field is not present it will be interpreted as nil.
Action Keywords¶
Several action keywords are provided to abstract or reduce the code written in the actions. Action keywords always start with a . and end with a ;.
.send_ras;¶
Creates a ras event with msg_id == ras_msg_id, location_name == ras_location, time_stamp == ras_timestamp, and raw_data == raw_data.
Currently only issues RESTful create requests. Planned improvements add local calls.
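Putting these pieces together, the request issued by .send_ras; conceptually looks like the following, using values from the examples in this document; the exact payload schema of the CSM REST daemon may differ.
POST http://127.0.0.1:4213/csmi/V1.0/ras/event/create
{
    "msg_id": "gpu.xid.42",
    "location_name": "cn1",
    "time_stamp": "Wed Feb 28 13:51:19 EST 2018",
    "raw_data": "<raw event data that generated the action>"
}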
Attention
A clarification for this section will be provided in the near future. (5/18/2018 jdunham@us.ibm.com)
Sample Action¶
Using the above tools an action may be written that:
- Processes a field in the event, checking to see that it’s in a valid range:
unless %{xid}.between?(1, 81);
- Sets the message id to a default value if the field is not within range:
ras_msg_id="gpu.xid.unknown" end;
- Generates a ras message with the new id:
.send_ras;
All together it becomes:
unless %{xid}.between?(1, 81); ras_msg_id="gpu.xid.unknown" end; .send_ras;
This action script is then compiled and stored by the plugin at load time, then executed when actions are triggered by events.
Debugging Issues¶
Perform the following checks in order, when a matching condition is found, exit the debug process and handle that condition. Numbered sequences assume that the user performs each step in order.
RAS Event Not Firing¶
If RAS events haven’t been firing for conditions matching .send_ras, perform the following diagnostic steps:
Check the `/var/log/logstash/logstash-plain.log`
Search for the phrase “Unable send RAS event” :
This indicates that the correlator was unable to connect to the CSM REST Daemon. Verify that the Daemon is running on the specified hostname and port.
Search for the phrase “Posting ras message” :
This indicates that the correlator connected to the CSM REST Daemon, but the RAS events were misconfigured. Verify that the message id sent has an analog in the list of RAS events registered in CSM.
The RAS message id may be checked using the following utility:
csm_ras_msg_type_query -m "MESSAGE_ID"
Neither of these strings were found: