Data Science, Big Data & Microsoft Machine Learning


Preface

Data increasingly shapes us and everything around us. Big data is more prevalent in our daily lives than we think. Business Intelligence has become an integral part of effective decision making. Data Science, Artificial Intelligence (AI) and Big Data Analysis have been playing a pivotal role in digital transformation. The current industry trend is to assess data behaviour and build knowledge libraries.

Industries are dealing with billions of micro devices that are connected to a control hub and emit time series data continuously. These devices need to be connected to the central data pipeline so that each device's health, availability and possible failure points can be predicted or prescribed. To connect billions of devices to a corporate data pipeline, we need Big Data repositories. This is where the IoT concept kicks in. Dr. Williams, Director of MIT, described a world where "things" (devices or sensors) are connected and able to share data. Data coming from these devices and sensors provides business insights that were previously out of reach. The invaluable insights enabled by harnessing and analyzing the data from these connected devices are what the Internet of Things (IoT) is all about.

DCS, APCs, RTUs, SCADA, PI Systems, DVR and MOM fall under the category of OT, while Infrastructure, Product Lifecycle, ERP and Logistics fall under the IT umbrella. Looking at the following diagram, we can see why we need a full stack platform to take the industry towards Stream Analytics.



Figure 1: IT/OT 5 Layer Architecture

Microsoft established an Azure-based IoT cloud model, which is regarded as one of its key products for digital transformation. The Azure IoT platform can connect to your devices and gather insights from them. It can then turn those insights into action with powerful applications built on the industry-leading platform for IoT development. From manufacturing to transportation to retail, you can start fueling new revenue and business opportunities with IoT solutions designed for industry needs.

Recently, Microsoft released the Microsoft Machine Learning platform to provide Big Data analysis capabilities. Microsoft Machine Learning Server incorporates three well-known platforms for advanced and statistical analytics: R, Python and Anaconda. Each of them has its own distinct advantages and disadvantages. The main focus of this document is R, in the form previously known as Revolution R.

Julia White, CVP of Microsoft Azure, mentioned that Microsoft will invest $5 billion in the Internet of Things (IoT) over the next four years. So it is worth looking at the Microsoft Machine Learning platform, as you can integrate it with other products from the Microsoft stack, such as SQL Server for database, reporting and analysis services, Power BI for business intelligence and analysis services, and BizTalk for data integration services.

Data Science

Data Science is a multidisciplinary field that uses scientific methods, algorithms and platforms to extract valuable insight from data from different perspectives. The process monitors the behavioural pattern of data over a period of time, trains the identified data models, and then predicts or prescribes the future flow of the data. To solve complex problems, it emphasises data inference, data exploration and insights, and algorithm/model development. Overall, Data Science encompasses data wrangling, specificity and machine learning through developing and training specific data models.

Big Data Analysis

Big Data Analysis is done through specialized tools, platforms and processes that extract insights from large volumes of data. It can correlate different types of similar data streams, which was impossible with ordinary tools and platforms. Hadoop, SAS, R and Python are some of the powerful big data analytics platforms that go beyond normalized databases, that is, databases that only hold data represented by rows and columns. They can derive insights from structured or unstructured databases or large data files.

According to the book "Making the World Work Better" by Kevin Maney, Steve Hamm and Jeffrey O'Brien, 1200 exabytes (equal to 1200 billion gigabytes) of digital information were created in 2010, and the volume has been rising exponentially since then. Because of the high demand for disk space, hard drive manufacturers such as Seagate, Western Digital, SanDisk and Samsung are focusing on producing cheaper drives with maximum storage capacity.

Machine Learning

Machine learning is a broad class of methods that emphasises prediction and pattern discovery. According to SAS, machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention. The processes involved in machine learning are similar to those of data mining and predictive modeling. Both require searching through data to look for patterns and adjusting program actions accordingly.

These are some of the algorithms widely used for Machine Learning:

-      Data Correlations
-      Decision Trees
-      K-Means Clustering
-      Neural Networks
-      Reinforcement Learning/Deep Learning
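
To make one of the listed algorithms concrete, here is a minimal sketch of K-Means clustering in plain Python (standard library only). It is an illustration of the general technique (Lloyd's algorithm), not the implementation used by any Microsoft library. The function name `kmeans`, its parameters and the sample points are all hypothetical and chosen only for this example.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Cluster 2-D points into k groups with Lloyd's algorithm."""
    rng = random.Random(seed)
    # Start from k distinct points chosen at random as the initial centroids
    centroids = rng.sample(points, k)
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        for i, (x, y) in enumerate(points):
            assignments[i] = min(
                range(k),
                key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2,
            )
        # Update step: move each centroid to the mean of its points
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, assignments

# Hypothetical sample data: two well-separated groups of points
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.1), (7.9, 8.3), (8.2, 7.9)]
centroids, labels = kmeans(points, k=2)
print(labels)     # cluster index assigned to each point
print(centroids)  # final centre of each cluster
```

The first three points should end up sharing one label and the last three the other. Which group gets label 0 depends on the random starting centroids, so the sketch does not rely on a particular numbering.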




Microsoft Machine Learning Platform

Microsoft released a Machine Learning platform that was previously known as R Server and originated with Revolution Analytics. The platform is delivered with the needs of data analysts in mind. It can be used for both offline/online and structured/unstructured data. Microsoft's machine learning platform is described briefly in the following sections:

1.    Microsoft Machine Learning Server (Standalone)

Microsoft Machine Learning Server is an in-data advanced analytics offering that enables us to work with flexible tooling across the existing open source and enterprise IT investments.
This offering works within Microsoft's end-to-end data science solution to empower organizations to solve business problems and seize new opportunities by enabling them to analyze data where it lives, innovate using artificial intelligence, and build mission-critical applications faster. It is a flexible enterprise platform for analyzing data at scale, building intelligent apps, and discovering valuable insights across your business with full support for Python and R.

Figure 4: Microsoft Machine Learning Server & R Client Communication

| Feature category | Description |
|---|---|
| R-enabled | R packages for solutions written in R, with an open-source distribution of R and run-time infrastructure for script execution. Everything you had in R Server and more. |
| Python-enabled | Python modules for solutions written in Python, with an open-source distribution of Python and run-time infrastructure for script execution. |
| Pre-trained models | For visual analysis and text sentiment analysis, ready to score data you provide. |
| Deploy and consume | Operationalize your server and deploy solutions as a web service. |
| Remote execution | Start remote sessions on a Machine Learning Server on your network from your client workstation. |
| Scale out on premises | Clustered topologies for Spark on Hadoop, and Windows or Linux using the operationalization capability built into Machine Learning Server. |

Table 1: Key features of Microsoft Machine Learning Server

2.    Microsoft Machine Learning Services (In-Database)

SQL Server 2017 Machine Learning Services is an add-on to a database engine instance, used for executing R and Python code on SQL Server. Code runs in an extensibility framework, isolated from core engine processes, but fully available to relational data as stored procedures, as T-SQL script containing R or Python statements, or as R or Python code containing T-SQL.

SQL Server 2017 Machine Learning Services is the next generation of R support, with updated versions of base R, RevoScaleR, MicrosoftML, and other libraries introduced in 2016.



Figure 3: SQL Server Machine Learning (In-Database) Services Execution


Microsoft Machine Learning Services (In-Database) Components:

| Component | Description |
|---|---|
| SQL Server Launchpad service | A service that manages communications between the external R and Python runtimes and the database engine instance. |
| R packages | RevoScaleR is the primary library for scalable R. Functions in this library are among the most widely used. Data transformations and manipulation, statistical summarization, visualization, and many forms of modeling and analyses are found in these libraries. Additionally, functions in these libraries automatically distribute workloads across available cores for parallel processing, with the ability to work on chunks of data that are coordinated and managed by the calculation engine. MicrosoftML (R) adds machine learning algorithms to create custom models for text analysis, image analysis, and sentiment analysis. sqlrutils provides helper functions for putting R scripts into a T-SQL stored procedure, registering a stored procedure with a database, and running the stored procedure from an R development environment. olapR is for building or executing an MDX query in R script. |
| Microsoft R Open (MRO) | MRO is Microsoft's open-source distribution of R. The package and interpreter are included. Always use the version of MRO installed by Setup. |
| R tools | R console windows and command prompts are standard tools in an R distribution. |
| R samples and scripts | Open-source R and RevoScaleR packages include built-in data sets so that you can create and run script using pre-installed data. |
| Python packages | revoscalepy is the primary library for scalable Python with functions for data manipulation, transformation, visualization, and analysis. microsoftml (Python) adds machine learning algorithms to create custom models for text analysis, image analysis, and sentiment analysis. |
| Python tools | The built-in Python command line tool is useful for ad hoc testing and tasks. |
| Anaconda | Anaconda is an open-source distribution of Python and essential packages. |
| Python samples and scripts | As with R, Python includes built-in data sets and scripts. |
| Pre-trained models in R and Python | Pre-trained models are created for specific use cases and maintained by the data science engineering team at Microsoft. You can use the pre-trained models as-is to score positive-negative sentiment in text, or detect features in images, using new data inputs that you provide. The models run in Machine Learning Services, but cannot be installed through SQL Server Setup. |

3.    Microsoft R Client

Microsoft R Client is a free, community-supported data science tool for high performance analytics. R Client is built on top of Microsoft R Open, so we can use any open-source R package to build our analytics. Additionally, R Client includes the powerful RevoScaleR technology and its proprietary functions to benefit from parallelization and remote computing.

R Client allows us to work with production data locally using the full set of RevoScaleR functions, but there are some constraints. Data must fit in local memory, and processing is limited to two threads for RevoScaleR functions. To work with larger data sets or offload heavy processing, we can access a remote production instance of Machine Learning Server from the command line or push the compute context to the remote server.

Machine Learning Server and Microsoft R Client offer virtually identical R packages, but each one targets different scenarios. R Client is intended for data scientists who create solutions that run locally. Machine Learning Server is commercial software that runs on a range of platforms, at much greater scale, with infrastructure for handling major workloads, on client-server topologies that support remote access over authenticated connections.

-      From R Client, we can shift data-centric RevoScaleR operations to a remote Machine Learning Server by creating a remote compute context. Remote compute context is supported for SQL Server Machine Learning Services or a Spark cluster. Typically, we shift the compute context to bring computations to where the data resides, thus avoiding data transfer over the network.

-      From R Client, we can run arbitrary R code on a remote production instance of Machine Learning Server. This is a general-purpose capability: from a command line, we can switch between local and remote sessions interactively, which is useful for testing, administration, or using the additional processing power of a production server.


Figure 5: Microsoft Machine Learning Server & Client Sessions


Like Microsoft R Client, Microsoft R Open is a free product. Microsoft R Open is the enhanced distribution of R from Microsoft Corporation.

Microsoft R Open includes:

-          The open source R language, the most widely used statistics software in the world
-          Many pre-installed packages, including all base and recommended R packages plus a set of specialized packages released by Microsoft Corporation to further enhance your Microsoft R Open experience
-          Support for Windows and Linux-based platforms

Plus these key enhancements:

-          Multi-threaded math libraries that bring multi-threaded computations to R.
-          A high-performance default CRAN repository that provides a consistent and static set of packages to all Microsoft R Open users.
-          The checkpoint package, which makes it easy to share R code and replicate results using specific R package versions.

Samples

NYC Taxi demo data

Scope: Drawing a Histogram plot on Tipping Ratio for NYC Taxi demo

Code:
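    library(RODBC)       # ODBC access to SQL Server
    library(RevoScaleR)  # provides rxHistogram
    # Connect with Windows authentication, then pull the columns into a data frame
    sqlConn <- odbcDriverConnect('driver={SQL Server};server=DHA00730-ESCD02;database=NYCTaxi_Sample;trusted_connection=true')
    sqlData <- sqlQuery(sqlConn, "SELECT tipped, fare_amount FROM nyctaxi_sample;")
    odbcClose(sqlConn)   # the data is now local, so release the connection
    # Plot the number of tipped (1) versus non-tipped (0) trips
    rxHistogram(~tipped, data = sqlData, col = 'lightblue', title = 'Tip Histogram', xlab = 'Tipped or not', ylab = 'Counts')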

Result:



Scope: Trained data model for NYC Taxi demo

Code:
    library(RODBC)       # ODBC access to SQL Server
    library(RevoScaleR)  # provides rxLogit
    sqlConn <- odbcDriverConnect('driver={SQL Server};server=DHA00730-ESCD02;database=NYCTaxi_Sample;trusted_connection=true')
    # Sample about 70 percent of the rows (repeatable seed) and compute the direct distance of each trip
    sqlData <- sqlQuery(sqlConn, "select tipped, fare_amount, passenger_count, trip_time_in_secs, trip_distance, pickup_datetime, dropoff_datetime, dbo.fnCalculateDistance(pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude) as direct_distance from nyctaxi_sample tablesample (70 percent) repeatable (98052);")
    odbcClose(sqlConn)   # the data is now local, so release the connection
    logitObj <- rxLogit(tipped ~ passenger_count + trip_distance + trip_time_in_secs + direct_distance, data = sqlData)  # logistic regression on the tipped flag
    summary(logitObj)

Result:

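The model trained above is a logistic regression: it estimates the probability that a trip is tipped from a weighted sum of the predictors. For readers who want to see the idea without a SQL Server instance, here is a minimal sketch in plain Python (standard library only). It fits the weights by simple batch gradient descent, which is an assumption made for brevity and is not claimed to be how rxLogit works internally. The function names, the learning rate and the six toy rows are hypothetical and are not taken from the NYC Taxi data.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, x):
    """Probability of the positive class; weights[0] is the intercept."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))

def fit_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit the weights by batch gradient descent on the log-loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            err = predict_proba(weights, x) - y  # prediction error for this row
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        # Step against the average gradient
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights

# Hypothetical toy rows: (passenger_count, trip_distance) -> tipped (1) or not (0)
rows = [(1, 0.5), (2, 0.8), (1, 1.0), (1, 3.5), (3, 4.0), (2, 5.2)]
tipped = [0, 0, 0, 1, 1, 1]

weights = fit_logistic(rows, tipped)
low = predict_proba(weights, (1, 0.7))   # short trip
high = predict_proba(weights, (1, 4.5))  # long trip
print(low, high)
```

In this toy data only the longer trips are tipped, so the fitted model should give the short trip a probability below 0.5 and the long trip a probability above 0.5. The coefficients reported by summary(logitObj) play the same role as `weights` here.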


