4. Core Module¶
The core module combines data import via different connectors with bitemporal, two-dimensional historisation and calendar logic, resulting in an object and application status that can be used to trigger applications.
4.1. The Concept of Business Domains¶
Business Domains (BDomains) are used to handle multiple TSA loader streams into the same target PSA-table independently. The corresponding TSA tables may differ, e.g. some sources may have more or fewer columns. However, the PSA table is always a superset of all possible columns. The BDomain is the finest granularity which is tracked by the RED framework. BDomains may have different load cycles; the PSA object status is evaluated separately for each BDomain.
4.2. Historisation¶
A strict bitemporal, two-dimensional historisation is used for all data in order to be able to reconstruct all data changes. A given record has a period of functional validity and a period of technical validity. The functional validity results from the business requirements, for example between which two dates a given interest rate applies; this dimension is referred to as the business date or booking date. The technical validity reflects the time of storage and the time at which a further record is stored to replace it (load timestamp); this dimension is referred to as the loading date or technical date. The details of the historisation process are described below.
Each table in the PSA layer has two sets of columns to describe the functional and technical validity of a record. The columns BDATE_FROM and BDATE_TO describe the functional validity and the columns TDATE_FROM and TDATE_TO describe the technical validity. The DR views in the PSV layer include both types of validity for each record. The business date is stored with calendar-day accuracy, the technical date as a timestamp with millisecond precision.
Two types of tables exist: Type-2 and Type-3. Type-2 corresponds to the definition of an SCD (Slowly Changing Dimension) Type-2. Type-3 tables are intended for event tables, where an event occurs on a particular date (BDATE instead of BDATE_FROM & BDATE_TO).
Type-2 is mostly used for all kinds of master data. The functional validity is given by the BDATE_FROM and BDATE_TO values, while the technical validity of a record is given by the TDATE_% columns. When a change is made to a record which applies to a given functional period, this change is noted in the TDATE_% columns.
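As an illustration, the currently valid state of a Type-2 table is typically obtained by restricting both dimensions. This is only a sketch; the DR view name psv_db1_dr.customers is hypothetical.

```sql
-- Sketch: rows of a hypothetical Type-2 DR view that are technically open
-- (TDATE_TO = infinity) and functionally valid today.
SELECT *
FROM psv_db1_dr.customers
WHERE tdate_to = TIMESTAMP '9999-12-31 00:00:00'
  AND CURRENT_DATE BETWEEN bdate_from AND bdate_to;
```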
Type-3 tables contain data related to a particular day, such as sales. Type-3 tables can also be called “event tables” or “transaction tables”. These tables only have a single BDATE column, which describes the date on which the event occurred. Historisation therefore happens in the technical dimension.
Note
The term “SCD type 3” for event/transaction tables does not comply with the usage of this term in the common literature. However, the RED framework uses this term as a synonym for such event tables.
Note
SCD Type-1 tables, which have neither a bitemporal historisation nor a single point in time, do not exist in the framework, as they contradict the idea of reproducibility and traceability.
Internally there are optional %_HT tables in the PSA layer that are managed as Type-0.
4.2.1. Key Columns¶
4.2.1.1. The Business Surrogate Key (BSK)¶
The Business Surrogate Key (BSK) serves as the basis and absolute prerequisite for the historisation logic, as each data record can be uniquely identified via the BSK. This means that the data record and all historical variants of it that existed before have the same BSK. A query via the BSK therefore returns all records from which the currently valid record has developed, showing the complete temporal history. Any changes to a data record, for example made by SQL updates in the source system, are thus traceable via the BSK. This also allows a record to be reconstructed as it was at any point in the past. Furthermore, the complete historical development of individual records can be analysed.
Definition of the BSK: The BSK is composed of the business domain (BDomain) and a logical business key (BK). In typical OLTP systems, the primary key of a table is usually a good first guess for the BK. However, do not use the primary key for the BK if the system is already historised and the PK contains a time dimension. In any case, the BK must uniquely identify the entity within one BDomain. The BK can be, for example, a customer number or a bank account number (IBAN), and it can consist of multiple columns.
The BSK is finally calculated as a hash value (MD5) using the columns of the BK and the BDomain.
Technically, the composition of the BSK can be seen in the definition of the %_DV view of the respective table. In the %_DV views the hash value is calculated, stored in the PSA tables and finally made available in the PSV views in the BSK column.
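Conceptually, the generated expression might look like the following sketch. The concrete concatenation and delimiter are generated by the framework; the BDomain value 'SHOP1' as well as the table and column names are assumptions for illustration.

```sql
-- Sketch: a BSK as MD5 hash over the BDomain and the BK column(s).
-- HASH_MD5 is the Exasol MD5 hash function; all names are illustrative only.
SELECT HASH_MD5('SHOP1' || '|' || customer_no) AS bsk
FROM tsa_shop1.customers;
```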
In addition to historisation, the BSK may serve as a join attribute. For technical or even business reasons, the content of the BSK fields may change in the future; therefore, it is not recommended to use the BSK as a filter criterion. The columns which are part of the BSK are available in the PSV views anyway.
Important
The definition of a BSK is mandatory. If the source is a database and the source table has a primary key (PK) this PK is implicitly taken as BSK but may be changed later.
4.2.1.2. The Identity Code (IDC)¶
The Identity Code (IDC) is used to uniquely identify a row in the source database. The IDC is always identical to the primary key of the source. Furthermore, in many cases IDC and BSK are the same, as long as the source PK does not contain a time dimension.
Important
The definition of an IDC is mandatory. If the source is a database and the source table has a primary key (PK) this PK is implicitly taken as IDC but may be changed later.
4.2.1.3. The (alternate) Business Key (BK)¶
Source tables may provide two different sets of columns which may serve as BK. An alternate BK (different from the definition used for the BSK) may be used for HT (lookup) tables. In this case the columns of the first set may be used for the BK while the BSK is made up of the second set.
HT tables record IDC, BSK and BK together. New combinations of these three are inserted as new rows; there are no updates or deletes. Therefore, HT tables serve as high-performance lookup tables from IDC or BSK to the BK columns. These lookup joins may be used during a PSA load for resolving and persisting key information (denormalisation of key tables).
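A lookup join of this kind could conceptually look like the following sketch; all table and column names are assumptions and do not refer to real framework objects.

```sql
-- Sketch: resolve the BK column(s) for incoming rows that only carry an IDC,
-- using a hypothetical %_HT lookup table.
SELECT src.*,
       ht.customer_no            -- BK column resolved via the HT table
FROM tsa_db1.orders src
JOIN psa_db1.customers_ht ht
  ON ht.idc = src.customer_idc;
```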
Important
The definition of an alternate BK is only mandatory if HT tables are used.
4.2.2. Control-BDate vs. Effective-BDate¶
When a dataset is loaded into the TSA layer the following rules apply for filling the column RED_BDATE:
1. In the case of a DBI loader, the external BDate expression (see EXT_BDATE_SQL_EXPR in DBIADM.DBI_LOADER_BD) is evaluated. In the case of a CFI loader, the BDate is taken from the filename if available and defined (see BDATE_PATTERN and BDATE_FMT in CFIADM.CFI_LOADER_BD).
2. If the BDate from (1) is NULL, the current date (sysdate) is used instead. If the “first full daily load” parameter is true (see BDATE_FFDL_PREV_DAY in the loader definitions), sysdate-1 is used for the first load of the day (after midnight); a sketch of this logic is shown below.
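The following pseudo-SQL sketches this fallback logic; ext_bdate, bdate_ffdl_prev_day, first_load_of_day and tsa_stage are illustrative placeholders, not actual framework columns or tables.

```sql
-- Sketch of the Control-BDate derivation for one incoming dataset.
SELECT COALESCE(
         ext_bdate,                                   -- (1) external BDate expression / filename
         CASE WHEN bdate_ffdl_prev_day AND first_load_of_day
              THEN CURRENT_DATE - 1                   -- first full daily load: use yesterday
              ELSE CURRENT_DATE                       -- otherwise: today
         END
       ) AS red_bdate
FROM tsa_stage;                                       -- hypothetical staging source
```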
The RED_BDATE evaluated in this way in the TSA table is called the “Control-BDate”. In the case of Type-3 tables a dummy entry with zero rows is created with this Control-BDate in REDADM.RED_PSA_CONTROL in order to document a load for this BDate even if no rows are propagated into the target PSA table.
In the course of historisation the Effective-BDate is evaluated by the DV-view on a row basis. Each row in the dataset may belong to a different BDate and will be historised accordingly. How the Effective-BDate is evaluated can be found in the column BDATE_EXPRESSION in REPPSA.REP_PSA_OBJECTS. Also inspect the corresponding DV-view to check the finally generated BDate expression.
For each Effective-BDate a separate control entry is created in REDADM.RED_PSA_CONTROL leading to a change in the related object status (REDADM.RED_PSA_OBJECT_STATUS and STA_OBJECT_STATUS, respectively).
4.3. The Process of Historisation for Type-2 Tables¶
Important
The RED framework uses closed intervals for all kinds of time frames.
Records are updated by comparing the content of each record from the source system with the content of the already existing record. If there is no change, the record is not stored again.
If a business attribute of a record changes, the old record is closed with BDATE_TO = “yesterday” and a new record is created with BDATE_FROM = “today”. Both records are technically valid, which means that TDATE_TO is infinite (i.e. 31.12.9999), while TDATE_FROM shows the technical load time of the record.
In addition, the previously valid record is copied and stored with an unchanged BDATE_TO but a terminated TDATE_TO. TDATE_TO is set to the load time of the new record minus 1 millisecond (closed interval), which also records the time of the change.
There are then two records with the same functional validity period according to BDATE, but only the newer record is also valid in regard to TDATE, i.e. not terminated.
Note
“Not terminated” means the date is set to “9999-12-31”, i.e. infinite from a business perspective.
Thus, changing a record in the source system creates two new records.
If the same data record (identical BSK) with the same BDATE_FROM is delivered again during a reload of the data, but there has been a change to an attribute, then several business-valid records exist at one point in time. In this case, the history of the change of the data can be traced with the help of the Technical Date (“%TDATE%” columns). The old record is closed with TDATE_TO = “Load time minus 1 millisecond” and the new record is inserted with TDATE_FROM = “Load time”.
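For example, the state of the data as it was known at a past load time can be reconstructed with a technical “as of” filter. This is only a sketch; the view and BSK value are taken from the wine example in section 4.4.1, and the timestamps are arbitrary.

```sql
-- Sketch: the record as it was known on 2020-01-10 at 12:00 for the
-- business date 2020-01-05 (technical and functional point-in-time filter).
SELECT *
FROM psv_db1_dr.wines
WHERE bsk = '1c63ce7fbfc874f72ea480d4682344a1'
  AND TIMESTAMP '2020-01-10 12:00:00' BETWEEN tdate_from AND tdate_to
  AND DATE '2020-01-05' BETWEEN bdate_from AND bdate_to;
```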
Historisation prevents overlapping time slices (time periods) in data records. If overlapping time slices occur, they are automatically corrected. In addition, as is generally the case in data warehouses (DWHs), a record is never physically deleted.
Important
The historisation process in the RED framework handles changes on the whole time scale transparently. Validity periods in the past are split if necessary. Old data may be reloaded at any time without extra effort.
If a record is physically deleted or marked as deleted in the source system (e.g. a “deleted flag” is set), then the BDATE_TO column is set to this day and thus the business validity of the record is terminated. Optionally, a new record with infinite validity is created with the ACTIVE column set to ‘false’. This gives the option to display deleted records if this is explicitly required.
As an alternative, the historisation of a table may be configured in such a way that a deleted row in the source system results in a technical termination (i.e. the BDATE values stay the same but TDATE_TO is terminated).
4.4. The Process of Historisation for Type-3 Tables¶
Since Type-3 tables have only a BDATE (no from-to period), an attribute change for the same BDATE triggers a technical historisation, i.e. the new record is inserted with the current TDATE_FROM timestamp and an infinite TDATE_TO, while the old record is terminated with TDATE_TO = “load time minus 1 millisecond”.
Example: A sale was booked on 01.12.2020. Subsequently, the amount is changed because an incorrect value was inserted. The business date (BDATE) does not change as a result, so the historisation is achieved via the TDATE.
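A hedged sketch of a corresponding query: the technically current version of all events of one business date, assuming a hypothetical Type-3 DR view psv_db1_dr.sales.

```sql
-- Sketch: technically valid event rows for one BDATE; corrected old bookings
-- are excluded because their TDATE_TO has been terminated.
SELECT *
FROM psv_db1_dr.sales
WHERE bdate = DATE '2020-12-01'
  AND tdate_to = TIMESTAMP '9999-12-31 00:00:00';
```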
4.4.1. Changes to a Record¶
The following section shows an example of the evolution of an actual record (same BSK throughout) as it changes in the source system. We assume a database of a wine retail store with a table named WINES which contains the name of the wines and the grape variety.
The “%DATE%” columns of the view, the NAME and the business attribute GRAPE_VARIETY are queried for this BSK after each change using the following SQL statement:
SELECT
  bdate_from,
  bdate_to,
  tdate_from,
  tdate_to,
  name,
  grape_variety
FROM psv_db1_dr.wines
WHERE bsk = '1c63ce7fbfc874f72ea480d4682344a1'
ORDER BY bdate_from, tdate_from;
First a new record is loaded on the 30th of December 2019 (TDATE_FROM). This record is given infinite validity from both a business and technical perspective, resulting in the %DATE_TO columns being set to 31.12.9999.
| BDATE_FROM | BDATE_TO | TDATE_FROM | TDATE_TO | NAME | GRAPE_VARIETY |
|---|---|---|---|---|---|
| 2019-12-01 | 9999-12-31 | 2019-12-30 17:50:15.305 | 9999-12-31 00:00:00.000 | The RED Vinyard | Sangiovese |
The business attribute GRAPE_VARIETY is being changed from “Sangiovese” to “Sangiovese Grosso” on January 3rd, 2020. This creates two new data records.
| BDATE_FROM | BDATE_TO | TDATE_FROM | TDATE_TO | NAME | GRAPE_VARIETY |
|---|---|---|---|---|---|
| 2019-12-01 | 9999-12-31 | 2019-12-30 17:50:15.305 | 2020-01-04 04:48:13.895 | The RED Vinyard | Sangiovese |
| 2019-12-01 | 2020-01-02 | 2020-01-04 04:48:13.896 | 9999-12-31 00:00:00.000 | The RED Vinyard | Sangiovese |
| 2020-01-03 | 9999-12-31 | 2020-01-04 04:48:13.896 | 9999-12-31 00:00:00.000 | The RED Vinyard | Sangiovese Grosso |
In order to be able to trace when this change took place, the old data record is copied and saved with unchanged BDATEs but terminated TDATE_TO (first line in Table 4.2). TDATE_TO is set to 1 millisecond before the loading time of the new record.
The functional validity of the copied record is terminated, meaning BDATE_TO is set to the day before the change, namely January 2nd, 2020. As the record is only loaded a day later and its technical validity starts with the load time, TDATE_FROM is set to the 4th of January 2020; since the technical validity is infinite, TDATE_TO is set to 31.12.9999 (second line in Table 4.2).
At the same time a new record with the new value for GRAPE_VARIETY is added. The record is functionally valid from the change until infinity: BDATE_FROM is set to the date of the change, namely the 3rd of January 2020, and BDATE_TO is set to 31.12.9999. As the record is only loaded a day later, TDATE_FROM is set to the 4th of January 2020, and since the technical validity is also infinite, TDATE_TO is set to 31.12.9999 as well. This entry can be seen in line 3 of Table 4.2.
Changing the business attribute GRAPE_VARIETY again adds two more records:
| BDATE_FROM | BDATE_TO | TDATE_FROM | TDATE_TO | NAME | GRAPE_VARIETY |
|---|---|---|---|---|---|
| 2019-12-01 | 9999-12-31 | 2019-12-30 17:50:15.305 | 2020-01-04 04:48:13.895 | The RED Vinyard | Sangiovese |
| 2019-12-01 | 2020-01-02 | 2020-01-04 04:48:13.896 | 9999-12-31 00:00:00.000 | The RED Vinyard | Sangiovese |
| 2020-01-03 | 9999-12-31 | 2020-01-04 04:48:13.896 | 2020-01-17 06:30:34.677 | The RED Vinyard | Sangiovese Grosso |
| 2020-01-03 | 2020-01-15 | 2020-01-17 06:30:34.678 | 9999-12-31 00:00:00.000 | The RED Vinyard | Sangiovese Grosso |
| 2020-01-16 | 9999-12-31 | 2020-01-17 06:30:34.678 | 9999-12-31 00:00:00.000 | The RED Vinyard | Sangiovese Piccolo |
In order to be able to trace when this change took place, the old data record is saved again with unchanged BDATEs but terminated TDATE_TO (third line in Table 4.3).
TDATE_TO is set to 1 millisecond before the loading time of the new record. The formerly newest entry (third line in Table 4.3) is copied and functionally terminated by setting BDATE_TO to the day before the change, in this case the 15th of January 2020. The technical validity again begins with the loading time, in this case a day after the change, namely the 17th of January 2020, so TDATE_FROM is set to that date. As the entry is technically valid indefinitely, TDATE_TO is set to 31.12.9999 (fourth line in Table 4.3).
A new data record with the new value of GRAPE_VARIETY is added, which is functionally valid from the day of the change to infinity: BDATE_FROM is set to the day of the change, in this case the 16th of January 2020, and BDATE_TO is set to 31.12.9999. The record is loaded a day after the change and also has infinite technical validity, so TDATE_FROM is set to the 17th of January and TDATE_TO to 31.12.9999 as well (fifth line in Table 4.3).
Therefore, three data records exist, containing the original value of GRAPE_VARIETY and each change to this value, which are still technically valid on the current date; only the newest record is also functionally valid today. These entries, as well as the entries necessary to trace the changes, can be seen in Table 4.3.
4.4.2. Several Changes on the Same Business Date¶
If a record is changed for the same business date (BDATE), this is called a change to an “old booking”. Such a change is therefore only represented via the technical dates.
The original record is technically closed by setting TDATE_TO to 1 millisecond before the loading time of the new record, as can be seen in Table 4.4. Of the two entries for this business date, only the newer one is technically valid, but the historical development of the data can still be traced.
| BDATE_FROM | BDATE_TO | TDATE_FROM | TDATE_TO | NAME | GRAPE_VARIETY |
|---|---|---|---|---|---|
| 2019-12-01 | 9999-12-31 | 2019-12-30 17:50:15.305 | 2020-01-04 04:48:13.895 | The RED Vinyard | Sangiovese |
| 2019-12-01 | 2020-01-02 | 2020-01-04 04:48:13.896 | 9999-12-31 00:00:00.000 | The RED Vinyard | Sangiovese |
| 2020-01-03 | 9999-12-31 | 2020-01-04 04:48:13.896 | 2020-01-17 06:30:34.677 | The RED Vinyard | Sangiovese Grosso |
| 2020-01-03 | 2020-01-15 | 2020-01-17 06:30:34.678 | 9999-12-31 00:00:00.000 | The RED Vinyard | Sangiovese Grosso |
| 2020-01-16 | 9999-12-31 | 2020-01-17 06:30:34.678 | 2020-03-15 11:46:30.339 | The RED Vinyard | Sangiovese Piccolo |
| 2020-01-16 | 9999-12-31 | 2020-03-15 11:46:30.340 | 9999-12-31 00:00:00.000 | The RED Vinyard | Sangiovese Picc. |
4.4.3. Historisation of Deleted Data¶
It is possible to also historise the deletion of records in the source system. This can either be done by propagating a delete flag from the source or by performing a full load and comparing the PSA with the source data in the TSA.
Depending on the business needs, the termination of the validity can be by BDATE or by TDATE.
Furthermore, deleted records can be retained by introducing an ACTIVE flag. This feature can be helpful, for example, to obtain deletion statistics.
4.5. Connectors¶
Connectors manage extraction and load processes from different sources. With minimal impact on the sources, data is transferred as quickly as possible from the source into the TSA layer. The RED framework takes care of the optimal timing, parallelism and logging. No external scheduler is needed.
The RED framework provides two standard connectors:
The Database Interface connector (DBI) for all types of databases with a JDBC connector
The Common File Interface connector (CFI) for files like CSV, FBV, JSON and Parquet.
Custom connectors may be developed individually and plugged in.
4.5.1. Database Interface Connector (DBI)¶
The generic DBI-Connector is used to access any database via JDBC or using custom drivers provided by Exasol.
Special features are supported for the following databases:
Exasol
Oracle
Microsoft SQL-Server
PostgreSQL
In addition, the DBI connector is used to fetch data from REST APIs. Since REST APIs have very individual specifications, the interface must first be wrapped in a Python UDF (User Defined Function) in order to present the data via views in the DBI area. These views are then used as DBI sources.
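A very rough sketch of such a wrapper is shown below. The schema, script and view names, the URL and the JSON structure are invented for illustration, and the exact UDF syntax may differ depending on the Exasol version in use.

```sql
-- Sketch: expose a REST endpoint as a view via an Exasol Python UDF.
CREATE OR REPLACE PYTHON3 SCALAR SCRIPT dbi_src.fetch_orders()
EMITS (order_id VARCHAR(50), amount VARCHAR(50), order_ts VARCHAR(30)) AS
import json
import urllib.request

def run(ctx):
    # call the (hypothetical) REST API and emit one row per JSON record
    with urllib.request.urlopen("https://api.example.invalid/orders") as resp:
        for rec in json.loads(resp.read()):
            ctx.emit(str(rec["id"]), str(rec["amount"]), str(rec["ts"]))
/

-- The view on top of the UDF is then registered as an ordinary DBI source.
CREATE OR REPLACE VIEW dbi_src.v_orders AS
SELECT dbi_src.fetch_orders() FROM dual;
```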
The DBI load process is based on an extended change data capture concept. As soon as a change is detected in the source, the load processes are triggered. Alternatively, load processes may be triggered by a timer, e.g. once a day, every 4 minutes, etc. Through these mechanisms the RED framework allows for near-real-time extractions and loads.
Depending on the type of source native CDC features are supported for incremental loads, e.g. Change Tracking in SQL-Server or Flashback Query in Oracle.
In some cases it might be useful to trigger a single loader or a whole group of loaders manually. The RED framework supports such manual triggers via API calls or even via a mail with a specific subject.
Important
In any case, the RED framework relies on pull and not on push propagation.
For sources which want to stream data into the DWH a Kafka connector can be provided on request. In this case RED acts as a subscriber on a topic pulling data out of Kafka into the TSA layer in sub-minute intervals.
4.5.1.1. CDC Process¶
Each DBI loader is associated with exactly one watch group (CDC group).
The CDC group defines an SQL query per BDomain which yields a so-called watch-id.
The watch-id may be some column value of a source object which indicates a change for a bunch of tables.
Or the SQL expression could simply be something like select sysdate, which results in one change per day and therefore triggers a load of all dependent loaders once a day (at midnight); a hypothetical watch query is sketched below.
Once the CDC group has detected a change all dependent loaders pop up in a queue (DBIADM.DBI_CDC_PENDING) and will be scheduled and executed by the RED-PSS. Load processes are recorded in DBIADM.DBI_CDC_LOG and DBIADM.DBI_CDC_LOG_DETAILS, respectively.
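For illustration, a watch query of a CDC group could look like the following sketch; the table and column names are assumptions.

```sql
-- Sketch: a watch-id that changes whenever any row of the watched source
-- table is modified; a change in the result queues all dependent loaders.
SELECT TO_CHAR(MAX(last_modified_ts), 'YYYY-MM-DD HH24:MI:SS') AS watch_id
FROM src_schema.orders;
```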
The CDC type (Change Data Capture) for each table is stored as a setting in the DBI_LOADER table, and can take one of the following values:
FULL: The entire contents of the source table are read at each load, with the RED framework filtering any changes in the data for further processing.
INCR: Only changed data records are transferred, not the entire table. The method used to identify the changed records is configured in the DBI_LOADER table via an SQL expression which returns some integer, date or timestamp value.
CT: Even though the name “Change Tracking” is borrowed from MS SQL Server, this mode supports any type of near-real-time tracking of data changes in any of the supported databases. The Change Tracking method captures any changes in records and provides the primary keys of those changed records and the change operation used (insert, update, delete).
Generic statements for checking the status of a table, and detecting any changes in source tables, are defined in the table DBI_CT_GROUPS. This could be Change Tracking within MS SQL Server, a trigger framework, or some other method of tracking changes. In table DBI_LOADER_BD you define the CT_GROUP, and therefore the CT method, to be used for a given table. As soon as changes are detected, the changed records are loaded into the TSA layer. If for some reason it is not possible to use the CT method, for example if Change Tracking has temporarily been disabled on the source table, a fall-back mechanism guarantees either a full or incremental load, as defined in DBI_LOADER.
4.5.1.2. Configuration¶
Metadata stored in the DBIADM schema is used to configure the DBI Connector. Various APIs are used to manage this data. The following is an overview of the configuration steps:
Table ADM_PARAMS contains the settings for parallelisation and logging.
Table ADM_JOBS contains the control data for regularly running jobs.
Tables DBI_CDCG_WATCH and DBI_CDCG_WATCH_BD define when a load process should start for a group of source tables, and where the tables are to be found.
Table DBI_CDCG_WATCH defines where the trigger for starting the load is to be found and what it looks like. The status of this trigger is regularly queried, and the result is stored in table DBI_CDCG_WATCH_STATUS. This “watcher” process is started as a job from the batch server. The definition of the job is stored in table ADM_JOBS. If the next status query discovers changes to be loaded, a new load process for the group of tables is started through an entry in table DBIADM.DBI_CDC_LOG.
Table DBI_CDCG_WATCH_BD is used for the optional division of the data into different business domains (BDOMAINs).
The three DBI_LOADER% tables are used to store information about the source table:
Table DBI_LOADER contains a list of tables that are to be loaded together, i.e. for which there is a common trigger for loading. These tables are assigned to a common CDC_GROUP.
Table DBI_LOADER_BD defines for each source table for which BDOMAINs data should be loaded. This table also defines where the source table is to be found and into which TSA table it is to be loaded.
Table DBI_LOADER_COLUMNS specifies those columns in the source system with a time data type with a precision higher than 1 millisecond. Exasol can only store time data with a maximum precision of 1 millisecond, so the data in these columns must be converted to VARCHAR before being read from the source system.
4.5.2. Common File Interface Connector (CFI)¶
The CFI-Connector is used to load and historise CSV, FBV, JSON and Parquet files. The mode of operation is very similar to the DBI-Connector, but currently only a FULL load is supported. A load is triggered by the arrival of a file in the input directory. For PSA tables of Type-2 (master data tables) the file contains the entire contents of the table; for PSA tables of Type-3 (event tables) it contains the changes of the current day or of several days.
JSON files will be prepared and made visible in the database via a view. It’s possible to decide whether only certain key value pairs will be displayed, or the entire buffer. The data in this view will then be loaded via the DBI connector.
The files can contain data for one or more business days, whereby the classification is made by the following logic (with descending priority):
BDATE is in the rows of the data records
BDATE is in the file name
Today’s date is used; for the initial load, yesterday’s date is used.
4.5.2.1. Process¶
The following directories are located on the batch server of the respective load environment (DEV, TST, PROD):
data/input (delivery)
data/work (processing)
data/error (not successfully loaded)
data/archive (successfully loaded)
CSV or FBV files are loaded into the input directory and provided with the temporary extension “.tra” or “.TRA”. Upon successful delivery, the delivering process must remove this extension. From this point on, the delivery is considered finished and successful. Files with a “.tra” extension (upper or lower case) are not processed. This prevents the import of incomplete files in the event of a process termination.
A watcher view looks at the input directory and displays the files contained there (except “.tra”). If new files are detected, a prepare process moves them to the work directory. RED-PSS takes care of queueing and executing the loads in parallel. If the loading is successful, the file is moved to the archive directory, otherwise to the error directory.
Behaviour in case of error:
Error due to network termination or system issues: the completely normal restart logic applies, as with the DBI watcher.
Load is completed, but some rows could not be loaded due to format errors. This is called “rejected rows”. There is a different procedure depending on the file type.
FBV: The number of rejected rows must be 0 for a successful import, otherwise the load will be set to Error and the file will be moved to the error directory.
CSV: The number of rejected rows can be configured via a parameter.
If the number is above the limit, a reload is required. This means that the file must be corrected manually and placed in the input directory again.
If the number is below the limit but above 0, this is displayed in a warning, but the file is considered to have been loaded successfully and is moved to the archive directory.
4.5.3. Custom Specific Connector¶
Connector types are not limited to CFI and DBI connectors. Custom connectors of any type can be designed and developed if needed.
4.6. PSA Metadata Repository¶
The repository resides in the REPPSA schema and allows for the automatic generation of all PSA structures.
As soon as the loaders in the CFI and DBI connectors are defined and TSA tables are created, metadata for the PSA layer can be created in the REPPSA schema (see the APIs in API_REPPSA).
The metadata can be found in the following main tables:
REP_TSA_OBJECTS: metadata for TSA objects. It contains, amongst other things, the schema and table name in the TSA layer, business release version, business domain, and an expression that describes how the business date is to be assigned when loading data.
REP_TSA_COLUMNS: metadata for TSA columns. It contains, amongst other things, the column name and data type, whether the column contains the business date, the position of the column in the identity code (IDC), the position of the column in the business key (BK) – logical key, the position of the column in the business surrogate key (BSK), information whether the column is a reference column to another table, and the column visibility in the PSB layer. IDC and BK are relevant only for so-called %_HT tables.
REP_PSA_OBJECTS: metadata for PSA objects. It contains, amongst other things, the schema and table name in the PSA layer, the business release version, table type (SCD_TYPE), historisation type, and information on how deleted records are to be treated.
REP_PSA_COLUMNS: metadata for PSA columns. It contains, amongst other things, the column name and whether the data in the column is used to mark the end of the functional validity of a record (BDATE_TO).
REP_PSA_TSA_OBJECTS: metadata to link PSA and TSA objects
REP_PSA_TSA_COLUMNS: metadata to link PSA and TSA columns
The API call API_REPPSA.COMPILE automatically creates all necessary PSA tables and views, based on the above-mentioned metadata. In the case of changes to the table structure, an ALTER TABLE statement is generated and executed automatically.
4.7. Calendar Logic¶
A complete calendar solution to steer the loading of data is a core part of the RED framework.
For each source table it is key to describe at which point the source system is ready to provide the data and therefore when the data should be loaded. With the calendar logic, it is possible to define an individual load cycle for each object, considering business days, weekends, holidays, month and year ends etc. It is also possible to introduce a delivery offset if the source provides data with a known delay. It is, of course, also possible to define dependencies between loads, so that certain data is only loaded once other data is available.
The view REDADM.RED_TSA_LOAD_CYCLE_STATUS shows load cycle mismatches, e.g. cycle overlaps. Later, when your system has been running for a while, it also shows real loads without a matching cycle.
These options are necessary because some data is loaded every day, seven days a week, while other data is only available on business days, which may vary depending on national holidays. Other data is only available on the last day of the month, if that day is a business day; if the last day of the month is not a business day, the data will be loaded on the first business day after the last day of the month. Some data is not ready at midnight, when the load would automatically begin; in this case the load can only begin at a time when the data is available.
4.8. Object and Application Status¶
The framework tracks the entire load process, which results in load status information on object level for every object and every BDATE. This status can be seen in the STA_OBJECT_STATUS view, which contains the status of each PSA table per BDATE and BDOMAIN. An overview of the most important statuses is shown below:
FINISHED: The object has been loaded and the day is over. All loads for this BDATE are completed.
LOADED: The object has been loaded and the day is not yet over. The last load was successful, but more loads for this BDATE may arrive.
PSA_WAITING: The object is waiting for another table because of dependencies
PSA_MISSING: The TSA load is already completed but the PSA-table is still not loaded
TSA_WAITING: The delivery from the source system hasn’t yet arrived, but a delay is expected
TSA_MISSING: The delivery from the source system hasn’t yet arrived, a delay is NOT expected
ERROR/FAILED: There has been an issue during the load
LAST_PSA_FINISHED shows the timestamp of the last completed load for this BDATE and BDOMAIN. If the status values TSA_MISSING or PSA_MISSING are reached, an alert value greater than zero is raised.
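A monitoring query against this view could look like the following sketch; the column names used for filtering are assumptions about STA_OBJECT_STATUS.

```sql
-- Sketch: objects that are overdue or failed for today's business date.
SELECT *
FROM sta_object_status
WHERE bdate = CURRENT_DATE
  AND object_status IN ('TSA_MISSING', 'PSA_MISSING', 'ERROR');
```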
Those object statuses can be combined into an application status of multiple objects.
4.8.1. Applications¶
Applications are logical combinations of database objects. An application has a name and assigned database objects. Objects are added to an application via search rules.
An application serves as a unit, whose state can be queried. This state is the result of the states of each object belonging to an application. If, for example, the data of all objects of an application for a specific BDATE was loaded and historised successfully, the application state for this BDATE and application is ‘FINISHED’.
Furthermore, actions can be associated with applications, which will be executed if a specific state is reached.
There are three different state columns in the view REDADM.RED_APP_STATUS, which each can have the status ‘FINISHED’ or ‘UNFINISHED’:
LOADED_STATUS: the state switches to ‘FINISHED’ as soon as all tables belonging to the application have reached either the status ‘LOADED’, ‘FINISHED’ or ‘TSA_WAITING’ for the BDATE. This means that the status becomes ‘FINISHED’ as soon as all components have reached the status ‘LOADED’, except for those which are not yet expected to be delivered according to the underlying load cycle. This column switches to ‘FINISHED’ before the other two columns.
PRELIM_STATUS: this column switches to ‘FINISHED’ when all tables on a calendar day have been successfully loaded. For the BDATE loaded on that day, data of a few defined tables can still be missing and may be delivered a few days later. A typical example is month-end (ultimo) data, where some tables are only delivered a few days afterwards. To detect tables of a BDATE that are not yet expected to be delivered, calendar information such as the load cycles of a BDATE is used. If, based on that information, no more data is expected on a calendar day, the column PRELIM_STATUS is set to ‘FINISHED’.
FINAL_STATUS: if the loading of all tables of a BDATE has been successfully completed, this status switches to ‘FINISHED’. This also includes data for tables which were not delivered on an earlier BDATE, as explained for the PRELIM_STATUS column above.
This column switches to ‘FINISHED’ as the last one.
The FINAL_STATUS (as well as the PRELIM_STATUS) switches back to ‘UNFINISHED’ if at least one component of this very application is reloaded. After successful load it again reaches the ‘FINISHED’ status (presuming that all other components are ‘FINISHED’ as well at this point in time). The same is true for the LOADED_STATUS apart from the fact that the individual object status needs only to be ‘LOADED’.
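Downstream jobs can poll this view before they start. The following sketch assumes hypothetical column names app_name and bdate in REDADM.RED_APP_STATUS; the application name is invented.

```sql
-- Sketch: start downstream processing only once the application is final.
SELECT final_status
FROM redadm.red_app_status
WHERE app_name = 'FINANCE_MART'
  AND bdate = DATE '2020-12-01';
```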
4.8.2. Application Actions¶
There are different kinds of actions:
email: Notification per mail, for example after a successful load of the data of a BDATE.
export_mail: Notification per mail, but with an attachment, for example a csv file containing the views of the tables, which are part of the application.
export_sftp: Transfer of data per SFTP.
jexec: Execute an external Job, such as Control-M.
materialize: Materialize tables and views (due to performance reasons). Materialisations can only be configured in central application marts (APBs).
4.9. Partial Restore on Table Level¶
Another advantage of the bitemporal and two-dimensional historisation is the ability to perform a partial restore on table level without having to restore the whole database. Just by picking a single table or a set of tables and defining a point in time to which the tables should be reset, the original state will be restored. Data will be deleted, and metadata flagged as deleted. Even a new import of data can be triggered in this process.
4.10. Delete Logic¶
Over time the amount of data stored within a data warehouse will inevitably grow. However, most likely not all the data will need to be retained indefinitely. Some data might only need to be retained for a few months, other data might need, for regulatory reasons, to be retained for a number of years.
Within the framework it’s possible to define erase groups for deleting PSA data. The data in the TSA layer is only retained for a matter of days. Each group can be defined at the table, schema or BDOMAIN level, or a combination of all three. Within each group an include and exclude regular expression can be created at table, schema or BDOMAIN level to define the objects handled by, or ignored by, this group.
Each group has a retention period, which defines for how long the data in this group should be retained. Some data within a table, for example data created at month end, may need to be held for longer. For this data a special calendar can be created to specify these dates. For the data linked to such a calendar a special retention period can be defined.
If a PSA table is not covered by an erase group, then the data within that table will not be deleted.
A protocol of the delete process is written to the database, including how many records from which tables were deleted, and with which load run the deleted data were loaded into the DWH.
4.11. Data Lineage¶
The view REPPSA.PSA_OBJ_LINEAGE allows for tracking the table lineage from source to PSA. With the upcoming module, the functionality is extended to also support column lineage. This enables companies to track in even more detail how individual data fields in tables are created and transformed. This is crucial for data accuracy and quality.
Additionally, an external governance tool can be integrated into the system to optimize data management and tracking, ensuring that all compliance requirements are met. This enables more efficient and transparent data management within businesses.
4.12. Testing¶
A testing tool can be provided for free as a part of the framework. The tool allows for the creation of test cases and workflows to evaluate any given functionality. These tests can then be automatically executed against any RDBMS from within a Docker container.
Test workflows are easily configured as JSON files.
Test datasets – random or predefined - can easily be created at the beginning of a test workflow and then dropped at the end.
4.13. Data Patching (RED Correction Framework)¶
The Patch framework within RED is used to correct data which has already been persisted in the PSA layer. This is especially important if the source cannot be changed anymore or is no longer comprehensible.
Important
A patch never just updates existing data. Changed (patched) data undergoes exactly the same historisation process as data coming in through the TSA layer, without exception. Therefore, the original data is preserved by technical historisation.
REDPAT contains the patch management API and consists of 4 metadata tables:
PAT_PATCHES: this table contains the definitions of the patches
PAT_USERS: this table shows which user is allowed to edit which patches based on the patch-id
PAT_OBJECTS: this table contains information about which objects can be edited by which patch
PAT_AUDIT_LOGS: each SQL statement issued in a patch session is documented in this table. This allows for 100% traceability, as it shows which statements have been issued in which session, so that statements can be associated with patches. Furthermore, this table contains all steps of a patch up to the current moment, and this step list is used when the patch is applied.
4.13.1. The Patch Process¶
Start the patch - a PatchID is assigned (patch status: STARTED).
Add PSA-objects which should be subject to patching.
Assign users who may execute DMLs on the patch set.
Prepare the patch: The patch area is filled with the selected data by asynchronous processes. The patch status changes from STARTED to PREP_SCHEDULED over PREPARING to finally PREPARED.
Apply changes: Assigned users may now execute any DML on the patch set.
Apply the patch: The patched data are processed like any other TSA load and appear later in the PSA as actual records. The patch status changes from PREPARED to APPLY_SCHEDULED over APPLYING to finally APPLIED.
The status of patch objects changes from ADDED to PREP_SCHEDULED, PREPARING, PREPARED and finally APPLIED.
A patch may be aborted as long as it is in status STARTED or PREPARED.
Note
A patch object may be any PSA-table as well as any ERR-table.
4.14. The RED Registry¶
All metadata schemas have a view named RED_REGISTRY which represents a configuration snapshot or, in other words, a logical backup. This view effectively reverse-engineers all ADD operations which are needed to reproduce the exact same metadata state for all tables within a schema if this schema had to be recreated in an empty framework. This means that by issuing all of the commands in the RED_REGISTRY view, in the order described in the Important note below, the exact same state can easily be recreated.
Important
As some ADD-functions rely on other ADD-functions being executed beforehand, the list of ADD-functions must be selected from the RED_REGISTRY view using order by order_col to fulfil those dependencies.
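A minimal sketch of using the registry, assuming a hypothetical statement column add_statement; order_col is the ordering column mentioned in the note above.

```sql
-- Sketch: extract the ADD-calls needed to rebuild the DBIADM configuration
-- in an empty framework, in dependency order.
SELECT add_statement
FROM dbiadm.red_registry
ORDER BY order_col;
```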