2. Import Configuration Files

General process of creating *.xtable files

Each *.xtable file corresponds to an object within the Xplain environment, i.e. within the XObjectExplorer user interface. The source of each *.xtable file is a table or view within a data source. Usually, your data source will be a relational database - nevertheless it can be any data source for which a JDBC connection can be established, e.g. any relational database, an MS Excel file or a *.csv (comma-separated values) file.

XTables and their associated attributes (i.e., *.xtable and *.xattribute files) are usually created by a Java program using an *.xgenscript file as generation configuration.

This configuration file name will be passed to the Java program xplain.jar as parameter. You can find this xplain.jar file in xplain docker image h2524926.stratoserver.net/xplain_jar.

To copy the xplain.jar from docker image to your local directory

> id=$(docker create h2524926.stratoserver.net:5000/xplain_jar:<branch>)
> docker cp $id:/xplain.jar - > <your local directory>
> docker rm -v $id

A typical XTables and XAttributes generation via command line can be carried out in following command

java -jar xplain.jar -g <*.xgenscript>

An *.xgenscript file contains an entry "tasks" holding an array of named task sets each of which consists of several parallel tasks. The JSON part of each task set contains the following entries:

"name": <String> The identifier of the task set.
"parallelTasks": <Array of Maps> The set of parallel tasks. Each such map contains
- "task": <String> The type of the job:
  - "buildXTable",
  - "fuseXTables",
  - "buildAttribute",
  - "mountAttributes",
  - "removeFiles"
- depending on the choice for the job type in "task" a subset of the following entries:
  - "configurationFile",
  - "dbTableName",
  - "tables",
  - "parentTable",
  - "saveToFile",
  - "recordsPerPackage",
  - "samplingRate",
  - "attributeConfiguration",
  - "xTableFile",
  - "mountAttributes",
  - "files",
  - "autoCompleteConfiguration",
  - "stopUponFail",
  - "xTableConfiguration",
  - "excludedStartKey",
  - "includedStopKey",
  - "whereClause"
  - "queryDbUnsorted"
  - "sortRecordsThresholdForSplitting"
  - "sortNrSplitParts"

Example:

{
    "tasks" : [
        {
            "name" : "Generate Patients XTable",
            "parallelTasks" : [
                {
                    "task" : "buildXTable",
                    "configurationFile" : "patients.xtableconfig",
                    "saveToFile" : "patients.xtable",
                    "recordsPerPackage" : 100000,
                    "samplingRate" : 0
                },
                {
                    "task" : "buildXTable",
                    "configurationFile" : "prescriptions.xtableconfig",
                    "saveToFile" : "prescriptions.xtable",
                    "recordsPerPackage" : 100000,
                    "samplingRate" : 0
                }
            ]
        }
    ]
}

Creating *.xtable files

The task type "buildXTable" can be used to import data from a database into *.xtable files. The following parameters are used to configure this process:

"task": "buildXTable"
"configurationFile": <String> The name of the *.xtableconfig file holding the configuration of the XTable
"dbTableName": <String> The name of the table in the database. This overrides the "dbTableName" setting coming from the *.xtableconfig file specified in "configurationFile".
"saveToFile": <String> The name of the *.xtable / *.xattribute file to be created
"recordsPerPackage": <Number> The package size (number of records per package). NOTE: This parameter is deprecated. Use the expectedTableSize parameter instead.
"samplingRate": <Integer> The sampling rate. A sampling rate of n means that roughly 1/n of all entries will be sampled. At the moment, the sampling is based on the hash value (not on random numbers). This is not guaranteed to be kept the same forever. Sampling based on hash values has the advantage of reproducible results, but it relies on a good computation of the hash values. (And in some situations, it might be bad to rely on a deterministic process instead of a stochastic one …)
"whereClause": <String> A condition to be used for filtering data during import from the database.
"queryDbUnsorted": <Boolean> Decides whether sorting the data instances w.r.t. the keys should be done by the source database system (false) or by Xplain Data’s backend (true). If this entry is not null, it will override the setting in the XTable configuration (*.xtableconfig file).
"sortRecordsThresholdForSplitting": <Long> Specifies the threshold size of the data array for splitting it into k parts during the external k-way merge sort. If the threshold is not exceeded, then the array is sorted in main memory. If this entry is not null, it will override the setting in the XTable configuration (*.xtableconfig file).
"sortNrSplitParts": <Integer> Specifies the number k of parts of the data array for splitting it during the external k-way merge sort if the threshold sortRecordsThresholdForSplitting is exceeded. If this entry is not null, it will override the setting in the XTable configuration (*.xtableconfig file).

Example:

{
    "task" : "buildXTable",
    "configurationFile" : "patients.xtableconfig",
    "saveToFile" : "patients.xtable",
    "recordsPerPackage" : 100000,
    "samplingRate" : 0
}

Configuring the creation of XTables via *.xtableconfig file

An *.xtableconfig file holds the configuration for the creation process of an XTable (i.e., for the creation of an *.xtable file).

Parameters:

"databaseConnectionConfiguration": <Map> see below
"dbTableName": <String> Name of the table within the database
"objectName": <String> Name of the result table, i.e., the object name within the Xplain Data System
"dbKey": <String> (Optional) The database column name that corresponds to a key for the table.
"dbForeignKeys": <Array of Strings> The names of the table columns in the database defining the keys in the hierarchy.
"dimensionConfigurations": <Array of Maps> The array of dimension configurations (see below)
"expectedTableSize": <Integer> The expected table size, the records per package of the XTable will be automatically calculated based on this value. If "recordsPerPackage" is specified in task, this expectedTableSize value will be overwritten (see below):
"omitTableNameDelimiters": <Boolean> Don’t use table name delimiters (usually double quotes, backticks, or square brackets) in SQL queries. If this flag is null, the global default (set via command line parameter or environment variable) might be used instead.
"omitColumnNameDelimiters": <Boolean> Don’t use column name delimiters (usually double quotes, backticks, or square brackets) in SQL queries. If this flag is null, the global default (set via command line parameter or environment variable) might be used instead.
"queryDbUnsorted": <Boolean> Decides whether sorting the data instances w.r.t. the keys should be done by the source database system (false) or by Xplain Data’s backend (true).
"sortRecordsThresholdForSplitting": <Long> Specifies the threshold size of the data array for splitting it into k parts during the external k-way merge sort. If the threshold is not exceeded, then the array is sorted in main memory.
"sortNrSplitParts": <Integer> Specifies the number k of parts of the data array for splitting it during the external k-way merge sort if the threshold sortRecordsThresholdForSplitting is exceeded.

Example:

{
    "databaseConnectionConfiguration" : {
        "databaseType" : "ORACLE",
        "url" : "MediServer",
        "portNumber" : 1234,
        "databaseName" : "medicationdb"
    },
    "dbTableName" : "medi.patbase",
    "objectName" : "PatientObject",
    "dbKey" : "PATIENTID",
    "dbForeignKeys" : [ ],
    "expectedTableSize": 10000,
    "dimensionConfigurations" : [
        {
            "dbColumnName" : "PATIENTID",
            "dimensionName" : "Patient",
            "xplainDataType" : "LONG"
        }, {
            "dbColumnName" : "SEX",
            "dimensionName" : "Sex",
            "xplainDataType" : "CATEGORIAL"
        }, {
            "dbColumnName" : "DOB",
            "dimensionName" : "Date of birth",
            "xplainDataType" : "LONG"
        }
    ]
}

Keys

All objects and sub-objects of the object hierarchy must have a primary (key) dimension, except for the leafs of the object hierarchy. If an object or sub-object has a primary dimension then its corresponding column in the database must be specified using the entry "dbKey". The entries of the array "dbForeignKeys" shall be the database column names that make up the key sequence from (the column for) the root of the object hierarchy to (the column of) the parent of the current object to be configured. That means, for each step downwards the object hierarchy, an additional key must be added.

Dimension Configurations

The entry "dimensionConfigurations" must contain an array. Each entry of this array must be a map with the following entries:

"dbColumnName": <String> The name of the corresponding database column.
"dimensionName": <String> The name of the dimension to be created for the object.
"xplainDataType": <String> / <XplainDataType> The data type of the dimension to be created. Admissible entries are the numerical types "LONG"/"INT"/"SHORT"/"BYTE"/"DOUBLE"/"FLOAT", the categorial types "CATEGORIAL"/"CATEGORIALSHORT"/"CATEGORIALBYTE", as well as the type "STRING".

The numerical integer types represent integral numbers with the following ranges:
- "BYTE": 8 bits (1 byte) signed integer (-128 to 127)
- "SHORT": 16 bits (2 bytes) signed integer (-32,768 to 32,767)
- "INT": 32 bits (4 bytes) signed integer (-2,147,483,648 to 2,147,483,647)
- "LONG": 64 bits (8 bytes) signed integer (-9,223,372,036,854,775,808 to 9,223,372,036,854,775,807).
(Please note: Since the smallest value of each type is treated as a null-representative, the actual usable range is -127 to 127 and so on.)

The numerical floating point types are: - "FLOAT": single-precision 32-bit IEEE 754 floating point values, - "DOUBLE": double-precision 64-bit IEEE 754 floating point values For details, see Java Language Specification / Floating-Point Types and Values

The categorial types represent categories (names/strings) with the following restrictions:
- "CATEGORIALBYTE": uses 1 byte, max 128 categories,
- "CATEGORIALSHORT": uses 2 bytes, max 32,768 categories,
- "CATEGORIAL": uses 4 bytes, max 2,147,483,648 categories.
The type "STRING" is not suitable for analytic purposes, but it can be used to store arbitrary strings as context information.

"attributeConfigurations": <Array of Maps> The configuration of attributes (might be left out at creation time), see below.:

{
        "dimensionConfigurations" : [
                {
                        "dbColumnName" : "PATIENTID",
                        "dimensionName" : "Patient",
                        "xplainDataType" : "LONG"
                },
                {
                        "dbColumnName" : "SEX",
                        "dimensionName" : "Sex",
                        "xplainDataType" : "CATEGORIAL"
                },
                {
                        "dbColumnName" : "DOB",
                        "dimensionName" : "Date of birth",
                        "xplainDataType" : "LONG"
                }
        ]
}

Attribute Configurations

If the entry "attributeConfigurations" is present, it must contain an array. Each entry of this array must be a map with the following entries:

"databaseConnectionConfiguration": <Map> see below
"dbTableName": <String> The name of the table which is defining the attribute / hierarchy
"attributeName": <String> Understandable name of the attribute
"dbKey": <String> The key field linking into the fact table, simultaneously the leaf level of the hierarchy
"dbHierarchyLevelColumnNames": <Array of Strings> The columns which are used in defining the hierarchy
"hierarchyLevelNames": <Array of Strings> Optionally: names for the different levels of hierarchy, e.g. “state”, “region”, …
"dimensionName": <String> Optionally: in case where the attribute after building is not saved and mounted but immediately added to the dimension
"objectName": <String> Optionally: see "dimensionName"

Fusing *.xtable files

The task type "fuseXTables" can be used to merge several *.xtable files into one file. The following parameters are used to configure this process:

"task": "fuseXTables"
"tables": <Array of Strings> An array of *.xtable filenames to be fused together.
"parentTable": <String> The name of an *.xtable file that shall be used as the parent object to filter out all entries that do not belong to an instance of this parent object.
"saveToFile": <String> The name of the *.xtable file to be created
"recordsPerPackage": <Number> The package size (number of records per package)
"samplingRate": <Number> The sampling rate

Example:

{
    "task" : "fuseXTables",
    "tables" : ["prescriptions_2015.xtable", "prescriptions_2016.xtable", "prescriptions_2017.xtable"],
    "parentTable" : "patients_male.xtable",
    "recordsPerPackage" : 100000,
    "saveToFile" : "prescriptions_2015-2017_male.xtable"
}

Creating *.xattribute files

The task type "buildAttribute" can be used to create an *.xattribute file by importing all existing states from database columns (or Excel file columns) of a specified hierarchy.

"task": "buildAttribute"
"attributeConfiguration": <Map>

The settings for the generation of an attribute:
- "databaseConnectionConfiguration": <Map> see below
- "dbTableName": <String> The name of the table in the database. If the attribute is imported from an Excel file, this entry must contain the name of the Excel sheet.
- "attributeName": <String> The name/identifier of the attribute (in Xplain Data)
- "dbKey": <String> The name of the key field
- "dbHierarchyLevelColumnNames": <Array of Strings> The database column names defining the state hierarchy. If the attribute is imported from an Excel file, this entry must contain the column names (in the first row) of the Excel sheet.
- "hierarchyLevelNames": <Array of Strings> The level names of the state hierarchy in the system of Xplain Data

"saveToFile": <String>

The name of the *.xattribute file to be created:

{
    "task" : "buildAttribute",
    "attributeConfiguration" : {
        "databaseConnectionConfiguration" : {
            "databaseType" : "ORACLE",
            "url" : "MediServer",
            "portNumber" : 1234,
            "databaseName" : "medicationdb"
        },
        "dbTableName" : "productdb",
        "attributeName" : "ATC Hierarchy",
        "dbKey" : "Code",
        "dbHierarchyLevelColumnNames" : [ "ATC1", "ATC2", "ATC3", "ATC4", "Brand", "Code" ],
        "hierarchyLevelNames" : [ "ATC Level 1", "ATC Level 2", "ATC Level 3", "ATC Level 4", "Brand", "Code" ]
    },
    "saveToFile" : "ATC.xattribute"
}

Mounting attributes to *.xtable files

The task type "mountAttributes" can be used to mount attributes to an *.xtable file.

"task": "mountAttributes"
"xTableFile": <String> The source file where attributes shall be mounted to
"mountAttributes": <Map of Maps> A map of dimensions to attribute configurations (for mounting), see below
"saveToFile": <String> The name of the *.xtable / *.xattribute file to be created

Removing files

The task type "removeFiles" can be used to remove (temporary) files.

"task": "removeFiles"
"files": <Array of Strings> An array of filenames for the files to be removed

Database connection

The configuration of the database access for data import is defined by the entry "databaseConnectionConfiguration" (in the *.xtableconfig file for a "buildXTable" task or in the entry "attributeConfiguration" for a "buildAttribute" task).

Associated with this keyword is a map with the following entries:

"databaseType": <String> The type of the "database" (or import file), defined by one of the keywords "CSV", "DB2", "EXCEL", "IMPALA", "MSACCESS", "MSSQLSERVER", "MYSQL", "MARIADB", "ORACLE", "POSTGRESQL", "SAPHANA"
"url": <String> The location of the server
"portNumber": <Number> The port number for connecting the database server
"databaseName": <String> The name of the database,
"driverProperties": <Map> An optional map of properties (that depend on the JDBC driver, i.e., the database type). An example might be the entry "currentSchema" for setting the database schema that is used by some drivers.
- MS SQLServer:
  English: Properties
  
  Deutsch: Verbindungseigenschaften
- IBM Db2:
  Common IBM Data Server Driver for JDBC and SQLJ properties for all supported database products
  
  IBM Data Server Driver for JDBC and SQLJ configuration properties
- MariaDB / MySQL: Parameters
- Oracle: Supported Connection Properties
- Postgres: Connection Parameters
- SAP:
  The SAP JDBC driver is not included in the jar file delivered by Xplain Data, but you can use it in the following way:
  
  Download the JDBC driver ngdbc-2.25.9.jar and put it into a separate directory for external libraries which we refer to by <libdir>
  
  Run Java using the parameter loader.path, e.g. using the following scheme (replace <libdir> by the actual directory):
  
  java -jar xoe.jar -Dloader.path=<libdir>
  
  Connection Properties
- Databricks:
  Due to a problem with the Databricks JDBC driver, you need to specify the parameter --add-opens=java.base/java.nio=ALL-UNNAMED with the Java command, e.g.:
  
  java --add-opens=java.base/java.nio=ALL-UNNAMED -jar xoe.jar
  
  JDBC Drivers
- CSV: Driver Properties for CSV driver

Example:

{
        "databaseConnectionConfiguration" : {
                "databaseType" : "ORACLE",
                "url" : "MediServer",
                "portNumber" : 1234,
                "databaseName" : "medicationdb"
        }
}

CSV Import Configuration

The import of data from CSV files can be configured using the parameters described in the documentation of the CSV driver properties. (Note that the entry "csvConfiguration" of "databaseConnectionConfiguration" is deprecated. Please use the corresponding entries in "driverProperties".)

The most important entries are:

"charset": <String> The charset of the CSV file, e.g. "UTF-8". If not set, the default of the Java Runtime Environment is used and a list of available charsets is provided in the log (at info level).
"columnTypes": <String> A comma-separated list of SQL data types for table columns. The default is all String columns.
"commentChar": <String> Lines before the header starting with commentChar are ignored. After reading the header, everything is interpreted as data. Default is null.
"fileExtension": <String> The file extension of the CSV files. If the extension ".dbf" is used, then files are read as dBase format files. Default is ".csv".
"quotechar": <String> The quote character (a string consisting of at most one character). Entries surrounded with the quote character are parsed without the quote characters (which makes sense when entries contain the separator or line breaks). Default is the string consisting of the single " character. If empty, quoting is disabled.
"separator": <String> The column separator. Default is ",".
"timestampFormat": <String> The format for parsing columns of type Timestamp. Default is "yyyy-MM-dd HH:mm:ss".
"timeFormat": <String> The format for parsing columns of type Time. Default is "HH:mm:ss".
"dateFormat": <String> The format for parsing columns of type Date. Default is "yyyy-MM-dd".
"timeZoneName": <String> The time zone for columns of type Timestamp. Default is "UTC".