Extractors

Serverboards’ Universal Service Queries (USQ) can extract information from any external source, whether the data is dynamic or static. The SQL engine then manages that data and asks each extractor only for what a query really needs, so that proper intelligence can be obtained from it.

The general idea of USQ extractors comes from Postgres Foreign Data Wrappers, but vastly simplified for the Serverboards use case.

To implement a USQ extractor, a component has to be defined and two functions implemented in a Command.

- id: csv.extractor
  icon: disk outline
  type: extractor
  service: serverboards.core.ssh/ssh
  name: Remote CSV File
  description: Access to a CSV file stored in an SSH accessible server
  command: serverboards.core.ssh/extractor.daemon
  extractor: csv_extractor
  schema: csv_schema
  params:
    - name: path
      label: Path to CSV files
      type: text

Here we define an extractor that requires a specific service type (serverboards.core.ssh/ssh). It will call csv_schema to extract the schema and csv_extractor to extract the data itself. The command serverboards.core.ssh/extractor.daemon is the active component that performs the extraction.

The remaining fields are UI data used by the GUI to ease configuration and use of the extractor.
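
For reference, the command side could look roughly like the sketch below. The rpc_method decorator and loop() entry point are assumed from the Serverboards Python plugin bindings and may differ in the actual API; the method names must match the schema and extractor fields of the manifest.

#!/usr/bin/env python3
# Sketch of the extractor.daemon command for the manifest above.
# The serverboards.rpc_method / serverboards.loop API is an assumption;
# check the plugin development documentation for the exact calls.
import serverboards

@serverboards.rpc_method
def csv_schema(config, table=None):
    ...  # described in "Data schema" below

@serverboards.rpc_method
def csv_extractor(config, table, quals, columns):
    ...  # described in "Data extraction" below

serverboards.loop()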

Active components: schema and extractor

All functions receive as first parameter the configuration of the extractor, with:

  • service – The UUID of a related service if required
  • config – The configuration dictionary, as defined by the params in the manifest.
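
For the CSV extractor defined above, that first argument looks like this (using the same placeholder service UUID as the examples below):

{
  "service": "XXX-YYY",
  "config": {"path": "/var/csv/"}
}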

Data schema

The schema function can be called with two different signatures:

schema(config)

Returns the list of tables at this extractor.

In the example above, the CSV reader may check for all files that end in .csv and return that list.

>>> schema({"service": "XXX-YYY", "config": {"path": "/var/csv/"}})
<<< ["purchases", "clients", "products"]

schema(config, table)

Returns a dictionary with a columns key. The value is a list of:

  • column names
  • dictionaries with a name key.

This allows future enhancements such as adding types and other helpers. The easiest, and currently recommended, way is to just return the list of column names.

Future enhancements may include types and cost analysis to help the SQL engine prepare more efficient execution plans. For example, right now types are loosely defined, so on each comparison the engine has to ensure both sides are of the same type (two numbers or two strings, for example). A possible performance improvement is to skip that cast when both sides are known to have the same type. This is a future enhancement, but the system is prepared for it.

>>> schema({"service": "XXX-YYY", "config": {"path": "/var/csv/"}}, "products")
<<< {"columns": ["id", "name", "price", {"name" : "category", "type": "int"}]}

Data extraction

extractor(config, table, quals, columns)

An extractor function is called to extract the data.

  • config is the configuration as seen before
  • table is the table to extract data from
  • quals are the restrictions over the data; they are explained in detail below.
  • columns is the list of columns we are interested in

Both quals and columns can be ignored by the extractor; both will be checked again by the SQL engine. This is to ease development. If they are used, performance gains can be achieved, but this is highly query dependent.

It returns an object with columns and rows. columns is a list of column names, ideally the same as the columns parameter, and rows is a list of rows, each one with its data in the same order as columns:

>>> extractor(
  {"service": "XXX-YYY", "config": {"path": "/var/csv/"}},
  "products",
  [["id", "IN", [1, 2, 3, 4, 5]]],
  ["id", "name", "price"]
  )
<<< {
  "columns": ["name", "id", "price"],
  "rows": [
    ["Coca Cola", 1, 1],
    ["Pepsi", 2, 0.8],
    ["Sprite", 3, 0.8],
    ["Bitter Kas", 10, 1.0]
  ]
}

In this query we return the requested columns, but in a different order, plus an extra row that does not match the quals. That is OK: the ExoSQL engine takes care of reordering and filtering.
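
Continuing the CSV sketch, a minimal csv_extractor can simply return every row with every column and let ExoSQL do the filtering: quals is ignored, and since the full header always contains the requested columns, columns needs no special handling (local-filesystem sketch again):

import csv
import os

def csv_extractor(config, table, quals, columns):
    path = os.path.join(config["config"]["path"], table + ".csv")
    with open(path, newline="") as fd:
        reader = csv.reader(fd)
        header = next(reader)   # column names from the CSV header
        rows = list(reader)     # every row, every column, as strings
    # quals and columns are ignored: ExoSQL re-checks the restrictions and
    # reorders or discards columns as needed.
    return {"columns": header, "rows": rows}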

quals

quals is a list of triplets of the form [column, op, value]; each one is a restriction on the data to be returned. For example, a typical quals list is [["datetime", ">=", "2018-01-01"], ["datetime", "<=", "2018-01-01"]].

This hint can be used both for dynamic data and to limit the queries on the remote side. For example, the Google Analytics extractor needs a datetime range to be able to do the final query, and the http_request extractor needs a specific URL to do the request.

op can be any of the following list, but if the extractor finds an op it does not recognize, the safest option is to just ignore it, as the ExoSQL engine will later ensure all restrictions are met:

  • =
  • >=
  • <=
  • >
  • <
  • IN

For most extractors quals is just an optimization that can be safely ignored, although some, like the examples above, need it to be able to query at all.

Since ExoSQL has no strict types, the types of qual values and data might mismatch. To prevent it, the extractor developer can proactively cast to the desired type, or the SQL analyst can do the proper casting in the query as needed.
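
If an extractor does want to honor quals, a small helper like the one below can pre-filter rows. Unknown ops are treated as a match, i.e. ignored, as recommended above (sketch only; values read from a CSV are strings, so casting may still be needed):

def qual_matches(value, op, expected):
    # True when a single [column, op, value] restriction is satisfied.
    if op == "=":
        return value == expected
    if op == ">=":
        return value >= expected
    if op == "<=":
        return value <= expected
    if op == ">":
        return value > expected
    if op == "<":
        return value < expected
    if op == "IN":
        return value in expected
    return True  # unknown op: ignore it, ExoSQL will re-check anyway

def apply_quals(columns, rows, quals):
    # Keep only the rows that satisfy every known qual.
    index = {name: i for i, name in enumerate(columns)}
    return [row for row in rows
            if all(qual_matches(row[index[col]], op, expected)
                   for col, op, expected in quals
                   if col in index)]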

columns

The list of columns is, like quals, an optional hint. It helps to request only the data that is really needed, and it can be ignored. The engine will never ask for a column that is not in the schema. An extractor can return all the columns it wants, but the result MUST contain those in the columns list.

If the columns are returned in exactly the requested order, they will not have to be reordered by the ExoSQL engine, which is a small benefit.
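
A simple way to get that ordering is to reorder the result just before returning it (sketch; it assumes every requested column is present in the result):

def order_columns(result, requested):
    # Reorder a {"columns": ..., "rows": ...} result to match the requested order.
    idx = [result["columns"].index(name) for name in requested]
    return {
        "columns": list(requested),
        "rows": [[row[i] for i in idx] for row in result["rows"]],
    }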

The column list can be empty in some corner cases; the extractor is then free to return a list of empty lists, one per row. Normally this is used to count the rows. In the future this may be solved with another qual operation.
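
In that corner case only the number of rows matters, so returning one empty list per matching row is enough (sketch, for a hypothetical rows list like the one read in the extractor above):

def empty_rows(rows):
    # No columns requested: report only how many rows there are.
    return {"columns": [], "rows": [[] for _ in rows]}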