4/6/19

Building a Modern Data Warehouse From Different Sources

A data warehouse (DW) is a core component of business intelligence and a central repository of data from different sources. This poses many challenges, as the schema definitions for those sources may be completely different from one another. In this presentation, we discuss how to create a data warehouse model that can support data from disparate sources. We look at how to build a dimensional model that enables us to import data with different shapes into the warehouse. We then create processes to transform and load the data using Apache Spark. Finally, we use Power BI to visualize the data from our data warehouse.


Modern Data Warehouse


Learning Objectives:

Challenges when using disparate data sources (different models)

  • The number of fields on the payload can be different. Some systems can send more data readings/facts than others.
  • The fields can have different names even though the data represents the same reading (see the example after this list).
  • Each model can change independently of the other systems.
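
As a hypothetical illustration (these payloads are not from the sample project), two sources may report the same reading with different field names and a different number of measurements:

# Hypothetical payloads from two source systems reporting the same device reading.
# Field names differ, and source B sends an extra measurement.
source_a = {
    "deviceId": "dev-100",
    "readingDate": "2019-04-06",
    "temp": 72.5,
    "humidity": 45.0
}

source_b = {
    "serial_number": "dev-100",
    "date": "2019-04-06",
    "temperature": 72.5,
    "humidity_pct": 45.0,
    "sound_db": 80.0
}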

Define a common model to support the different model shapes

  • Identify the common fields across all the data models that can uniquely identify an entry, for example the date, a serial number or device id, and a regional location.
  • Identify the dimensions from the actual entries across these models. The dimensions are used to build separate tables that provide structured information about the readings.
  • Identify the measurements or readings and support their different names.
  • For performance, we would like to be able to read the entries without the overhead of processing all the measurements.
  • A star schema design can be used to associate the dimension tables with the fact tables (see the schema sketch after this list).
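
As a rough sketch of this design (table and field names here are illustrative and may differ from the sample project), the star schema could be captured with PySpark schema definitions:

from pyspark.sql.types import (StructType, StructField, StringType,
                               LongType, DoubleType, DateType)

# Dimension table: one row per device (illustrative fields).
dim_device = StructType([
    StructField("device_key", LongType(), False),      # hash-based surrogate key
    StructField("serial_number", StringType(), False),
    StructField("region", StringType(), True)
])

# Fact entry table: one row per reading event, linked to the dimensions.
fact_entry = StructType([
    StructField("entry_key", LongType(), False),       # hash id for the entry
    StructField("device_key", LongType(), False),      # foreign key to dim_device
    StructField("entry_date", DateType(), False)
])

# Fact measurement table: one row per measurement, transposed from columns.
fact_measurement = StructType([
    StructField("entry_key", LongType(), False),       # foreign key to fact_entry
    StructField("field_key", LongType(), False),       # foreign key to the field name dimension
    StructField("value", DoubleType(), True)
])

With this shape, the fact entries can be read without touching the measurements, and a new measurement name only adds rows to the fact measurement table.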

Strategy for processing the different models into the common model with Apache Spark

  • Load the data sources into data frames and map their fields into a data set with well-defined data types.
  • Parse the dimension fields into a separate data set. Create a unique hash numeric key that can be used as a foreign key on the relational data model.
  • Map all the measurements from their data source names to the corresponding fields on the common model.
  • Create a dimension for the field names to catalog the measurement fields, thus avoiding having to load all the data to get the unique names.
  • Transpose all the measurements from columns to rows. This enables us to support a different number of columns as well as field changes on the data source (a processing sketch follows this list).
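
A minimal PySpark sketch of these steps, assuming hypothetical source and field names, could look like the following:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dw-transform").getOrCreate()

# Load one source into a data frame (hypothetical file and field names).
raw = spark.read.json("telemetry_source_b.json")

# Map the source fields to the common model and derive the hash-based keys.
entries = (raw
    .withColumnRenamed("serial_number", "device_id")
    .withColumnRenamed("humidity_pct", "humidity")
    .withColumnRenamed("sound_db", "sound")
    .withColumn("device_key", F.hash("device_id"))          # dimension surrogate key
    .withColumn("entry_key", F.hash("device_id", "date")))  # unique key for the fact entry

# Transpose the measurement columns into field name/value rows.
measurement_cols = ["temperature", "humidity", "sound"]
measurements = (entries
    .select("entry_key", F.explode(F.array(*[
        F.struct(F.lit(c).alias("field_name"), F.col(c).cast("double").alias("value"))
        for c in measurement_cols])).alias("m"))
    .select("entry_key", "m.field_name", "m.value"))

# The distinct field names can seed the field name dimension.
dim_field = (measurements.select("field_name").distinct()
    .withColumn("field_key", F.hash("field_name")))

Because the measurements are stored as rows rather than columns, a source that adds or renames a field only changes the mapping step, not the warehouse schema.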

Data Pipeline for Loading the Data into the Data Warehouse with Apache Spark

  • Load the new unique dimension records from the Spark data sets into the dimension tables in the data warehouse first.
  • Load the fact entries with all their dimension relationships, creating a unique hash id for each entry.
  • Load the fact measurements with the entry hash id relationship. This maps each reading to a unique entry (a loading sketch follows this list).
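
A loading sketch using Spark's JDBC writer, with hypothetical connection settings and table names, could look like this:

# Hypothetical JDBC settings; replace with the actual data warehouse connection.
jdbc_url = "jdbc:sqlserver://dw-server:1433;databaseName=telemetry_dw"
jdbc_props = {
    "user": "dw_user",
    "password": "***",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

# Load the dimensions first so the fact foreign keys can resolve.
# In practice, only the new unique dimension records would be appended,
# so a deduplication step against the existing dimension rows is assumed here.
dim_device_df.write.jdbc(jdbc_url, "dim_device", mode="append", properties=jdbc_props)
dim_field.write.jdbc(jdbc_url, "dim_field", mode="append", properties=jdbc_props)

# Load the fact entries, then the fact measurements keyed by the entry hash id.
entries.select("entry_key", "device_key", "date") \
    .write.jdbc(jdbc_url, "fact_entry", mode="append", properties=jdbc_props)
measurements.write.jdbc(jdbc_url, "fact_measurement", mode="append", properties=jdbc_props)
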
Sample Project:

The sample project with the Spark and SQL code as well as demo data can be found at this location:

https://github.com/ozkary/modern-data-warehouse-spark/

By following this high-level plan, we should be able to create a data warehouse that supports disparate data sources with different models using modern data processing tooling.



Originally published by ozkary.com

3/30/19

Magic Leap One: No Device Is Connected

When developing apps for the Magic Leap One, we can use the Magic Leap Remote app to start the device connection and deploy our apps directly from the Unity Editor. Even when the ML1 device has been configured for development purposes by enabling creator mode, installing the certificate and allowing MLDB access, there may be a “No Device is Connected” error. In this article, we take a look at how to use the MLDB CLI to address device connectivity problems on a Windows computer.


What is MLDB?

The Magic Leap Device Bridge (MLDB) CLI is used to provide direct connectivity to the device via USB and Wi-Fi. We can also use this CLI to deploy apps directly onto the device. It is during the deployment process that we may notice connectivity problems with our computer.

Is the device connected?

Assuming the ML1 device has been configured properly for MLDB connectivity and we have verified the USB connection, we start troubleshooting the connection error by opening the MLDB CLI and typing the following command at the > prompt:


>mldb devices

--output
List of devices attached




If we get no device serial number in the output, then we know that we have a problem with the device connecting to our PC via USB. To double-check, let’s take a look at our Device Manager panel and check the Universal Serial Bus Devices configuration. If we find our ML1 listed there (see image below), we know that the physical connection via USB is working.







Let’s Repair our Registry Settings

We know that the physical connection is working, but we still do not see the device via the MLDB CLI. We need to verify that the device’s USB registration on our PC is valid by using the CLI to check our registry settings with the following command at the > prompt:


>mldb usbreg

Found connected ML device with serial number: ###
GOOD: Device has a container node.
GOOD: Device has an interface GUID for the non-composite device.
BAD: Device is missing an interface GUID for the first composite interface
BAD: Device is missing an interface GUID for the second composite interface
Created registry file that should fix the issue: \mlsdk\v0.20.0\tools\mldb\mldb-fix.reg
Double click the file in Windows Explorer to update your registry, then unplug
and replug your ML device. MLDB should now recognize the device.



We should now see messages indicating whether there are any problems. If a problem is found, a new registry settings file is created, and we need to import it into our system. But before we import the file, we can take a look at our registry settings and compare them to the values in the newly created file. This can help us understand what is going on.

We can open the new registry file to learn the registry path. We can then type the regedit command on the CLI to open the registry database on our computer. By comparing the file with the values in our registry, we can see that the DeviceInterfaceGUID in our registry no longer matches what our device is reporting. This makes sense, so we need to update our GUID value with the one provided in the output of the mldb usbreg command.

Conclusion

After updating our registry settings and reconnecting our device via the USB port, we should be able to run the mldb devices command again and see our device listed in the output. If we restart the Magic Leap Remote app and press the Start Device button, we should see that our device connects and is ready for us to deploy our apps via the Unity Editor or the MLDB CLI.

I hope this provides some insight on how to address the No Device is Connected error when using the Magic Leap Remote app.


Thanks for reading

Originally published by ozkary.com

3/9/19

Omit Data Members with JavaScript and C#

When calling RESTful APIs with a well-defined JSON contract, we get errors on the request when we submit a JSON payload that does not meet the contract specification. In many cases, we use a model that has properties which we need to omit from the payload before submitting to the API. In this article, we take a look at how a data member can be omitted to maintain the JSON contract definition, using JavaScript on the device side and C# for our backend services.



Defining Our Contract

For the context of this article, we are working on a telemetry API with the following JSON contract, which only captures the device id and telemetry information like temperature, humidity and sound levels.


Telemetry {
    deviceId (integer, optional),
    temperature (number, optional),
    humidity (number, optional),
    sound (number, optional)
}

Omit a Data Member in JavaScript

From the device side, we are capturing the information with JavaScript. There are many other data points that we capture on the device, but we only want to send the properties listed in the JSON contract. There are several approaches to rebuild the model and capture only the properties that we need. In this article, we take a look at the approach of cloning the contract and mapping properties to a new object. Let’s take a look at our solution using JavaScript.


var device = {
    "deviceId": 2026,
    "humidity": 65,
    "location": "stock-23450",
    "sn": "br52552endn",
    "sound": 120,
    "bu": "mfr2939",
    "ver": "3.4.0",
    "robot": "rb5625",
    "temperature": 35,
};

var contract = {
    "deviceId": 0,
    "temperature": 0,
    "humidity": 0,
    "sound": 0,
}


function cloneAndMap(device, contract) {
    var data = null;

    if (device) {
        // shallow copy the contract so the original can be reused
        data = Object.assign({}, contract);

        // map only the contract keys from the device object
        for (var key in data) {
            data[key] = device[key];
        }
    }

    return data;
}

var data = cloneAndMap(device, contract);


As shown in the code, the device object has several more properties that are not relevant to our API. We want to get only the properties that are defined on our contract object. A way to do this is to clone the contract object using the native Object.assign API to make a shallow copy. We then read every key from the contract object and use those keys to get the values from the device object. The main logic is handled by the cloneAndMap function.

We should note that objects in JavaScript are dynamic data types that are essentially hash tables. This gives us constant time access (time complexity O(1), one operation to access the data) to the values by using the key for each field. This works out well for us because the device data can have hundreds of fields, but we access each field in constant time. We are only bound by the contract object size, which is much smaller and well defined at n=4 fields, so the time to traverse all the contract keys is O(4), effectively constant.

We should also note that we are cloning the contract object because we want to continue to reuse it for other API calls. We decided not to clone the device object because its running time T(n) would depend on the size (n) of all its properties, which is much larger than the contract. We would then need to delete the unwanted properties, and deleting properties from a JavaScript object is a slow operation in most engines because it forces the object's internal shape to change. This is a performance concern.

Ignore a Data Member with Attributes in C#

Now that we understand how to ignore a data attribute with JavaScript, let’s take a look at how that can be done using C# by first looking at our model definition.


using System.Runtime.Serialization;

/// <summary>
/// device telemetry readings
/// </summary>
public class Telemetry
{
    /// <summary>
    /// device id
    /// </summary>
    [DataMember]
    public int deviceId;

    /// <summary>
    /// internal id
    /// </summary>        
    [IgnoreDataMember]
    public long Id;

    /// <summary>
    /// temperature reading
    /// </summary>
    [DataMember]
    public float Temperature;

    /// <summary>
    /// humidity reading
    /// </summary>
    [DataMember]
    public float Humidity;

    /// <summary>
    /// noise/sound reading
    /// </summary>
    [DataMember]
    public float Sound;     
}

We define our telemetry class with the same attributes as our JSON model. There is, however, an internal Id data member which we may use for internal purposes but do not want to serialize to JSON for the client application. When we want to ignore a member in C#, we can use the IgnoreDataMember attribute, which is in the System.Runtime.Serialization namespace, and declaratively ignore the member.

This IgnoreDataMember attribute enables us to change the default behavior of the DataContractSerializer, which by default serializes all public members. In our case, we still want to keep the Id member public, so it is available to other backend integrations that are not in the same assembly.

We should note that setting the Id member to internal or private also prevents its serialization to JSON. This, however, impacts the accessibility of the data member. We need to look at our system requirements and determine the best approach, as those access modifiers prevent other assemblies and classes from accessing the member.

Conclusion

In this article, we saw how to omit additional properties from the application model so that we can maintain our JSON contract, using both JavaScript for the client app and C# for the backend API. We also looked at how, depending on the implementation approach, we can optimize the solution for time complexity and accessibility concerns.


Thanks for reading.

Originally published by ozkary.com

2/16/19

Cache-Aside Pattern with Node.js, SQL Server and Redis


The cache-aside pattern is a common approach in which the application performs a cache-first lookup to get some data. When the data is found, the term cache-hit is used. When there is no data, it is called a cache-miss, and the application performs a lookup from the database. In this article, we take a look at the benefits of using this pattern and provide a basic solution using Node.js, SQL Server and Redis.



What is In-Memory Cache?

An in-memory cache is a database in which the data is stored in memory instead of on disk. This provides performance benefits, as a data read can return with a sub-millisecond average speed. The trade-off is that the data is not persisted to disk and is lost on a system restart. This is acceptable because this kind of database is used for short-lived storage and not as a system of record. These systems are also commonly known as a distributed cache because they usually reside in different clusters than the disk-based database.

When to Use a Cache

We tend to use a cache for performance purposes. When building distributed applications with heavy reads and writes on the database, we often start to encounter performance issues as the load on the system grows. Compared to the speed of a cache, which averages sub-millisecond reads, disk-based storage averages double-digit milliseconds per read. This is simply a physical limitation of the hardware. It does not include other performance concerns like lack of index tuning, partitioning and other factors that can lead to blocks and deadlocks on the database.

Cache Aside Pattern

When adding a distributed cache to our system architecture, we need to start thinking about strategies that let us efficiently develop a solution that improves overall system performance without an overall system redesign.

For these kinds of solutions, we leverage the cache-aside pattern, which in its simplest form involves doing an asynchronous lookup into the cache and failing over to the disk-based database in the event of a cache-miss.

For the cases where there is a cache-miss, we also want to make sure that the cache is populated with the missing information. This helps us keep the reads on the disk-based database to a minimum.

Defining the Provider Interface

The first step for our solution is to define an interface contract that all the database providers should implement. The goal is that as we inject the providers into our components, the implementation stays the same. This keeps our code agnostic of the actual technology that we are using. For our simple approach, we define our contract with a get function export. This function provides our components with database read support. The internal function can have a different name; as long as the exported name is kept the same, our components work properly.


module.exports.get = function getTelemetry() { …


Implementing the SQL Server Library

Now that we have our provider contract defined, we can implement the SQL Server library. This library just performs a SQL query to read information from the telemetry table in our SQL database for a particular time range. For this integration, we use the typeorm library, which works very similarly to the .NET Entity Framework library.



function init(){
    var context = orm.createConnection({
        "type": "mssql",
        "host": "servername",
        "entities": [
            new orm.EntitySchema(require("../models/entity/telemetry"))
        ]
    });
    return context;
}

function getTelemetry() {
   
    return q.Promise(async function(resolve, reject, notify){
        let context = init();
        context.then( async function(conn){
            let repository = conn.getRepository("telemetry"); 
            let ts = new Date();
            let timeRange = new Date(ts.setHours(ts.getHours()-1));
            let dateFilter = timeRange.toLocaleDateString() + ' '
                            + timeRange.toLocaleTimeString();

            let data = await repository.createQueryBuilder()
                            .where("telemetry.processed >= :dt",{ dt: dateFilter})
                            .getMany();
                
            resolve(data);
            await conn.close();
        })
        .catch(function(err){
            //…;               
        });
                   
    });      
};




The init() function returns the database context, which enables us to make calls to the database and query the telemetry table. The data is returned as JSON, and it is mapped using our entity definition, which is registered with the context via the entities property. Our model is shown below:


module.exports = {
    "name": "telemetry",
    "columns": {
        "telemetryId": {
            "primary": true,
            "type": "int",
            "generated": true
        },
        "deviceId": {
            "type": "varchar"
        },
        "temperature": {
            "type": "numeric"
        },
        "humidity": {
            "type": "numeric"
        },  
        "sound": {
            "type": "numeric"
        },   
        "processed": {
            "type": "datetime"
        },
        "created": {
            "type": "datetime"
        }        
    }
}


Implementing the Redis Library                                    

We are done with the SQL library, so we now move on to implementing our Redis library. Similar to our SQL Server implementation, we want to maintain the same contract. The big difference here is that the Redis API looks very different from a SQL query. This is expected, as Redis is a key/value pair system in which we find the data based on a particular key.

For our purposes, we define the key to be the same as the table name in our database. For date range queries on Redis, we use the concept of a score for our data entries, basically a JavaScript timestamp. This enables us to use the ZRangeByScore API call to query Redis for a time range, similar to how we do it with a SQL query.


//client context
const client = redis.createClient($config.REDIS.port, $config.REDIS.host);

function getTelemetry(){

    //async operation - return a promise
    return q.Promise(function(resolve, reject, notify){

        let ts = new Date();
        let from = ts.setHours(ts.getHours()-1);    //basic date range query (one hour back)
        let to = ts.setHours(ts.getHours()+1);      //back to the current time
        let args = [tableName, from, to];

        client.zrangebyscore(args, function(err, data){

            if (err){
                reject(err);        //send back the promise with the error
                return;
            }

            //parse each JSON entry before returning the result
            let result = data.map(function(item){
                return JSON.parse(item);
            });

            resolve(result);        //send back the promise with data
        });

    });
}



Similar to the SQL context, we need to get the Redis client context to be able to use the APIs. With the context instantiated, we make a call to get the items within the time range. Since we stored the data in JSON format, which is what we want to return to the client, we do not need to map anything. Our code returns a JSON array of telemetry data.

Message Broker

Now that we have both our service libraries ready, let’s take a look at implementing the cache-aside pattern by looking at our message broker implementation.

Similar to our other libraries, we want our service broker to implement the same contract. The main reason is that we just want to inject the service into our components and not have to change their implementation.

Our service broker also needs to support both the cache and SQL providers. This is what enables us to manage the cache-aside pattern logic. As the service goes through the workflow, it uses the same implementation and passes only the provider reference to handle the database integration. This is the advantage of maintaining the same interface for all our providers: the logic looks the same because the APIs have the same contract definition.


function getFromProvider(provider, callback){
    let result= null;
    let error = null;

    provider.get().done(function(data){
        result = data;
        callback(result,error);
    },
    function(err){
        error = err;   
        callback(result,error);                   
    });

}

function getData (){
    console.log("broker get");
           
    return q.Promise(function(resolve, reject){
           
        getFromProvider(cache,function(data,err){
           
            if (!data || data.length === 0){                  
               
                getFromProvider(storage,function(data, err){
                    if (err){
                        reject(err);
                    }else{
                        resolve(data);
                        //add to the cache
                        addToProvider(cache,data);
                    }                       
                });

            }else{
                resolve(data);
            }               
        });
                   
    });
}

At its simplest implementation, our service broker just queries the cache. If data is found, it returns the data. If no data is found, it fails over to the SQL query. It then populates the cache with the data from SQL, so the next time a request is made, the data is available from the cache.

How to Use the Libraries

We are now ready to use our libraries. Let’s recall that we have two database providers (redis and sql) as well as one service broker. We want our API server library to use one provider to handle all the data access concerns, and let the service broker integrate with the other database providers. Let’s see how to do that:



//message broker
const $repository = require('./modules/message-broker');   
const $storage = require('./data_modules/sqlRepository');    //sql
const $cache = require('./data_modules/redisRepository');    //redis

//initialize the broker with the data providers
$repository.init($cache,$storage);

//api service
var $api = require('./modules/telemetry-api.js');   
$api.init(app, $repository);     



When we initialize the repository (service broker), we pass the references to the cache and sql providers. We then initialize the API with our repository reference. We should notice that we could pass a reference of any of these libraries to our API because all of them implement the same interface. This enables us to use sql, the cache or both for storage.

API Library

Let’s take a look at how the API library uses the provider.


//add the route for the telemetry data
        app.get('/api/telemetry', getData)   
                
        //getData
        function getData (req, res){
            console.log("GET Telemetry");
           
            provider.get().done(function(data){
                res.json(data); 
            },
            function(err){
                console.log(err);
                res.status(400).send({ error: err });               
            });
        } 



Our API library expects a provider instance to handle the data operations. The API defines the routes and the handlers for those routes. When a get request is received, the API library uses the same contract definition from the provider to get the data.

The API library is agnostic of the underlying database system. Since all the providers maintain the same contract, the API can use any of them: the cache, sql or the service broker. This enables us to minimize the changes on other components.

Conclusion

We were able to show that by maintaining our interfaces, we can inject a service broker that manages different data storage systems to implement our cache-aside pattern. Note that the cache should always be populated the moment the database has new information. In our pattern, we populate the cache on a cache-miss event as an exception, not the norm. There are other patterns to handle this database-cache synchronization; this is often done via pipelines that are not part of the application.


Thanks for reading

Originally published by ozkary.com