DBIngestor

Off-the-shelf systems often fail to provide useful tools that enable the ingestion of data into the database system. These tools are usually only capable of handling ASCII files in the comma separated values format. However scientific data is as diverse as there are scientists.

When handling large datasets it is inefficient to transform the data first into a format that can then be ingested into the database with the provided tools. Not only is it inefficient, but ASCII file representation of binary files are in our experience around 8 times larger than binary files. This can be a huge increase in data if one deals with TBs of data.

I have therefore developed a database ingestion library that can be coupled to any reading routine provided for binary or ASCII scientific data, independent of the database system. The DBIngestor library provides support for the SQLite3, MySQL, and ODBC database interface APIs. It is further capable of performing defined consistency checks on the data and can perform elementary data transformations. A buffering mechanism is implemented to send multiple rows at once to the database through the API, to obtain best performance (unfortunately I am not yet using bulk operations, but solely rely on prepared statements).

Download DBIngestor from GitHub!

When integrating a given reading routine with DBIngestor, one only needs to define the data schema and how it maps from the file to the database tabel, together will all the required assertion checks and transformations. Then, by inheriting from a virtual base class, the reading routine needs to be coupled to DBIngestor (getNextRow for reading a new row into memory and getItemInRow for extracting information from the row). Anything else is taken over by DBIngestor. DBIngestor thus enables bespoke solutions, for bespoke data files.

AsciiIngest

Using the DBIngestor library, I have implemented a row based ASCII file ingestion tool, similar to the one provided with MS SQL Server. Our AsciiIngest tool can work with either arbitrarily delimited formats or fixed formats as they are usually produced by FORTRAN code.

Download AsciiIngest from GitHub!

The definition of the data file are provided by the user in format files, that are similar in structure as the ones needed by MS SQL Server. However in the AsciiIngest format files, it is possible to define data assertion and data transformation tasks. It is also possible to produce more elaborate transformation results by using a more or less archaic way of defining procedures (I did not want to provide a full flexed data transformation language but if you know what you are doing, complex results can be achieved).

A description/example input file is given below:

numCols 
numHeaderRows
#numCol DType PrefixLen      DataLen Separator   SchemaColName  AssertAndConvertList
D1      CHAR  0              10      ','         haloName       ASRT_ISNOTNULL,CONV_CAPITALIZE,CONV_TRIM
D2      INT4  1              40      ','         numParticles   ASRT_ISNOTNULL,ASRT_ISNOTNEGATIVE,CONV_MULTIPLY(D2)
D3      INT4  1              40      ','         numParticles2  ASRT_ISNOTNULL,ASRT_ISNOTNEGATIVE,CONV_MULTIPLY(D2; 12.0)
#These are examples of header items
#numCol DType LineNum:offset DataLen Separator   SchemaColName  AssertandConvertList
H4      INT4  0              0       'Redshift:' redshift       ASRT_ISNOTNULL,CONV_FLOOR
H5      REAL4 10:5           1       ''          expFact        ASRT_ISNOTNULL,CONV_FLOOR,CONV_MULTIPLY(3.1415)
#These are example of constant items
C6      REAL4 0              0       '3.14156'   pi             ASRT_ISNOTNULL
C7      INT4  0              0       '2'         someMagicNumber