Annotated Bibliography

Advancements in the Machine Learning Infrastructure

Barros, R.S.M., & Santos, S.G.T.C. (2018). A large-scale comparison of concept drift detectors. Information Sciences, 451, 348-370. doi: 10.1016/j.ins.2018.04.014

This paper explores concept drift detectors, lightweight programs that estimate the position of changes occurring in a data distribution. It compares 14 different concept drift detector configurations that mine fully labeled data streams containing concept drift, using artificial datasets and the Naive Bayes and Hoeffding Tree classifiers.

This is helpful to the machine learning infrastructure since concept drift is a major concern. It can occur in many ways and can negatively affect the results of machine learning models.

While they measured the quality of the concept drift detectors only on streaming data in the ingestion process, the results are still instructive. The research led to the ability to identify which concept drift detectors were best for specific use cases as well as for general application. Their findings can be readily applied to the machine learning infrastructure.
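
The kind of detector the paper compares can be sketched in a few lines. The following is a simplified DDM-style detector (one of the classic algorithms such comparisons include), not the authors' code; the warning/drift factors and the example stream are illustrative assumptions. It tracks the classifier's running error rate and signals drift when that rate rises well above the best level observed so far.

```python
class SimpleDriftDetector:
    """DDM-style detector: signals drift when the running error rate
    exceeds the best (lowest) observed rate by a margin of its std dev."""

    def __init__(self, warn_factor=2.0, drift_factor=3.0, min_samples=30):
        self.n = 0
        self.errors = 0
        self.best_p = float("inf")   # lowest error rate seen so far
        self.best_s = float("inf")   # std dev recorded at that point
        self.warn_factor = warn_factor
        self.drift_factor = drift_factor
        self.min_samples = min_samples

    def update(self, error):
        """Feed 1 if the classifier misclassified the instance, else 0.
        Returns 'drift', 'warning', or 'stable'."""
        self.n += 1
        self.errors += error
        p = self.errors / self.n                  # current error rate
        s = (p * (1 - p) / self.n) ** 0.5         # binomial std dev
        if self.n < self.min_samples:
            return "stable"
        if p + s < self.best_p + self.best_s:
            self.best_p, self.best_s = p, s       # new best operating point
        if p + s >= self.best_p + self.drift_factor * self.best_s:
            return "drift"
        if p + s >= self.best_p + self.warn_factor * self.best_s:
            return "warning"
        return "stable"

detector = SimpleDriftDetector()
# ~10% error rate for 100 instances, then the stream's concept changes
stream = [1 if i % 10 == 0 else 0 for i in range(100)] + [1] * 40
states = [detector.update(e) for e in stream]
```

After the change point the error rate climbs, so the detector moves from stable through warning to drift within a few dozen instances.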

Cheng, H. (2017). TensorFlow estimators: Managing simplicity vs. flexibility in high-level machine learning frameworks. KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1763-1771). New York, NY: ACM. doi: 10.1145/3097983.3098171

This paper covers a framework for the selection, training, evaluation, and deployment of machine learning models, with the goal of bringing models into production much faster. However, the authors provide a disclaimer that this is by no means an exhaustive treatment of all types of machine learning models.

This is essentially the bread and butter of my research paper as it lays out a foundation for what to look for in building an efficient machine learning infrastructure. They provide a unifying Estimator interface for writing downstream in the infrastructure for tasks like hyperparameter optimization and distributed training.

This framework has been adopted within Google to reduce the time to launch a working model. This is all thanks to TensorFlow Estimators, which automate the construction of model layers, the addition of evaluation metrics, and the promotion of the system into production on distributed training clusters. The only thing it does not handle is debugging. Still, we can see how invaluable this is to the machine learning infrastructure.
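
The unifying interface idea can be illustrated with a plain-Python sketch (this is not the actual TensorFlow API, and the class and method names here are illustrative): every model exposes the same train/evaluate/predict surface, so downstream tooling such as hyperparameter search or distributed training can treat models interchangeably.

```python
class Estimator:
    """Minimal sketch of a unifying estimator interface: any model that
    implements these three methods can be swapped into downstream tooling."""

    def train(self, data): raise NotImplementedError
    def evaluate(self, data): raise NotImplementedError
    def predict(self, x): raise NotImplementedError

class MeanEstimator(Estimator):
    """Toy model: predicts the mean of its training targets."""
    def train(self, data):
        ys = [y for _, y in data]
        self.mean = sum(ys) / len(ys)
    def evaluate(self, data):
        # mean squared error against held-out targets
        return sum((self.predict(x) - y) ** 2 for x, y in data) / len(data)
    def predict(self, x):
        return self.mean

def run_experiment(estimator, train_data, eval_data):
    """Downstream tooling only ever sees the Estimator interface."""
    estimator.train(train_data)
    return estimator.evaluate(eval_data)

mse = run_experiment(MeanEstimator(), [(0, 1.0), (0, 3.0)], [(0, 2.0)])
```

The design choice is that `run_experiment` (and anything like it) never needs to know which model it is running, which is what lets a framework automate the surrounding pipeline.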

Gouw, S., Mauro, J., & Zavattaro, G. (2019). On the modeling of optimal and automatized cloud application deployment. Journal of Logical and Algebraic Methods in Programming, 107, 108-135. doi: 10.1016/j.jlamp.2019.06.001

This paper explores the optimal automatic deployment of cloud applications: the computing resources needed by software components beyond those already provided, the description of the deployment protocol, and the minimization of total costs through constraint-solving techniques. This is achieved with the Abstract Behavioral Specification (ABS) language, Amazon EC2, and Fredhopper Cloud Services. By formalizing the tools necessary for optimizing the deployment of applications, they were able to satisfy the constraints.

While there is some useful information in this article, after reading other papers about cloud optimization I am not sure it is really helpful for my research problem. If you are specifically using the cloud for e-commerce, I can see the value.

In other papers they discussed tools that provide solutions to cloud deployment latency. This paper just talks about the load balancer and querying without making specific reference to technology that would be helpful to this dilemma. This is why I don't think this paper will be of much use for my research topic.

Hartmann, T., Fouquet, F., Moawad, A., Rouvoy, R., & Traon, Y. (2019). GreyCat: Efficient what-if analytics for data in motion at scale. Information Systems, 83, 101-11. doi: 10.1016/

This paper takes into account the inability of existing analytics to consider what-if decisions and presents GreyCat, the authors' proposed open-source Many-Worlds graph model, which combines time series and graphs to create multi-dimensional models. Hosted on millions of nodes, it can update thousands of parallel worlds and explore a large number of independent actions.

This was born of the shift from descriptive to predictive analytics, and it is necessary to consider these advancements in conversations about machine learning. They predict GreyCat will surpass machine learning in usefulness.

Through experiments they were able to show that GreyCat outperforms Neo4j in both mass and single inserts, and is faster than InfluxDB in single-node deployments. Since it can handle hundreds of millions of timepoints and nodes, and hundreds of thousands of independent worlds, one proposed application is in smart cities.
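
The Many-Worlds idea of forking cheap parallel hypothetical states can be sketched with a copy-on-write mapping. This is a deliberate simplification, not GreyCat's actual graph model, and the key names are hypothetical: a forked world shares its parent's data and stores only its own divergent writes, so exploring an alternative action costs almost nothing.

```python
class World:
    """Copy-on-write world: reads fall through the parent chain,
    writes stay local, so forking is O(1) regardless of data size."""

    def __init__(self, parent=None):
        self.parent = parent
        self.local = {}

    def get(self, key):
        world = self
        while world is not None:
            if key in world.local:
                return world.local[key]
            world = world.parent
        raise KeyError(key)

    def set(self, key, value):
        self.local[key] = value

    def fork(self):
        return World(parent=self)

base = World()
base.set("traffic_light_7", "green")   # hypothetical smart-city state
what_if = base.fork()                  # explore an alternative action
what_if.set("traffic_light_7", "red")  # the base world is untouched
```

Many independent forks can coexist because each holds only its own deltas, which is the property that lets a system explore a large number of independent actions in parallel.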

Hill, J. (2019). Study machine learning / deep learning [White paper]. Retrieved July 15, 2019, from Siemens:


Siemens AG is a German multinational conglomerate and the largest industrial manufacturing company in Europe. This paper discusses obstacles faced in Germany with regard to artificial intelligence, and machine learning specifically. Cloud computing and security are the top two topics of conversation in IT departments, with machine learning coming in third. While some apprehension is rooted in personal bias, the majority has to do with lack of understanding. This presents itself in failure to pick a model that addresses specific needs, insufficient data quality, and conflicts created by hiring consultants to work alongside staff teams.

The information gathered in this white paper serves as a microcosm of the field of machine learning globally. It provides a picture of the struggles organizations face when choosing to apply machine learning to solve their business needs.

There is a lot of rich information contained in this paper on general sentiments and approaches to integrating machine learning into business in Germany. However, this can be applied globally as more companies seek to expand. It uncovers many of the obstacles faced and shows problem areas where a tighter feedback loop needs to be created so that companies can get the value they seek from their machine learning infrastructure.

Hutter, F., Kotthoff, L., & Vanschoren, J. (2019). Automated machine learning: Methods, systems, challenges. Cham, Switzerland: Springer.

This book examines the commercial interest in automated machine learning (AutoML), which is becoming increasingly popular due to the sensitivity of machine learning methods to design choices. Since human engineers are responsible for training procedures, hyperparameters, and the choice of regularization methods, non-automated machine learning is less fail-proof. AutoML would instead be responsible for selecting the best approach to the given data. This is also a way to make machine learning more user friendly to scientists who do not have the know-how to operate the technologies currently used in the machine learning infrastructure.

It is proposed as a way to improve performance as well as save time and money; however, the book suggests that developments in AutoML should be open source.

Having the model selection process fine-tuned would be advantageous for those interested in working with machine learning models but who do not have the resources readily available to fully benefit.

In large companies it would also free up time for those doing the analysis and presentation of the data, as well as those responsible for maintaining the machine learning infrastructure. AutoML would not only formulate a process but act as a selector of the best fit for the data, meaning more decision making could be focused on enhancements to the surrounding structure and on understanding the data.
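
At its core, the selection step AutoML automates is a search over candidate model configurations scored on held-out data. A minimal, library-free sketch follows; the candidate family, the scoring function, and the toy data are all illustrative assumptions, not any real AutoML system.

```python
def select_model(candidates, train, validate, score):
    """Fit every candidate on the training split, score each on the
    validation split, and return the best (lowest-error) one."""
    best, best_err = None, float("inf")
    for make_model in candidates:
        model = make_model()
        model.fit(train)
        err = score(model, validate)
        if err < best_err:
            best, best_err = model, err
    return best, best_err

class ConstantModel:
    """Toy model family: always predicts a fixed value."""
    def __init__(self, value):
        self.value = value
    def fit(self, data):
        pass  # nothing to learn for a constant predictor
    def predict(self, x):
        return self.value

def mse(model, data):
    return sum((model.predict(x) - y) ** 2 for x, y in data) / len(data)

candidates = [lambda v=v: ConstantModel(v) for v in (0.0, 1.0, 2.0)]
train = [(0, 1.0)]
validate = [(0, 1.0), (1, 1.2)]
best, err = select_model(candidates, train, validate, mse)
```

Real AutoML systems replace the brute-force loop with smarter search (Bayesian optimization, bandits) over far larger configuration spaces, but the contract is the same: candidates in, best validated model out.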

Kovács, J., Kacsuk, P., & Emődi, M. (2018). Deploying Docker Swarm cluster on hybrid clouds using Occopus. Advances in Engineering Software, 125, 136-145. doi:


The paper presents a method for deploying complex infrastructures in hybrid cloud environments using the cloud orchestrator Occopus, whose plugins make deploying infrastructures on both private and public clouds more user friendly. It supports all major cloud technologies, such as Amazon EC2, OpenStack, and CloudSigma. As a bonus, the authors also provide a how-to overview for deploying a Docker Swarm cluster on top of a hybrid cloud built with the previously mentioned cloud technologies. And like any paper that discusses hybrid clouds, it touches on the inherent security issues.

Hybrid clouds are becoming increasingly common as different technologies provide varying amenities and some companies choose to integrate in-house built systems with cloud technologies. Occopus presents a way to make this more user friendly by providing plugin capabilities.

Occopus seems like it would be a huge resource to large companies that have the need and capabilities to use the support of multiple cloud technologies. These companies are similarly the ones investing in machine learning to solve business problems, and while cloud technologies are an integral component of the machine learning infrastructure, introducing multiple types can cause problems. It would be interesting to see how deploying a hybrid cloud would look with additional container orchestrators added alongside the Docker Swarm cluster.

Kumar, M.K., Abdel-Majeed, M.R., & Annavaram, M. (2019). Efficient automatic parallelization of a single GPU program for a multiple GPU system. Integration, 66, 35-43. doi:


This paper provides an overview of the use of multiple GPUs in a system and the setbacks that occur. The authors use two GPUs connected through an off-chip interconnect to turn remote accesses into more localized ones, enhancing performance by 1.55 times and showing that more than one GPU is better for handling the demand for high throughput. This is done by parallelizing a single GPU program onto multiple GPUs, saving money as an alternative to operating on cross-card data. With the addition of a data-location-aware scheduler, they propose to decrease the programmer's workload when it comes to partitioning.

GPUs are an important component of parallelization and of speeding up machine learning models. Adding more GPUs will speed the process up even further, but may prove more costly if operating on cross-card data and can create additional programming work.

The solutions presented in the paper can prove helpful for cost savings. Multiple GPUs are now the standard for the machine learning infrastructure as parallelization increases in popularity. However, programming each GPU is a task in and of itself, and replicating the same code on each GPU seems more effective. In addition, they propose a data-location-aware scheduler that handles tasks closest to the GPU first.
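
The data-location-aware idea can be sketched as a simple placement rule (a simplification of the paper's scheduler; the task and GPU structures here are hypothetical): assign each task to the GPU that already holds the largest share of its input data, so remote cross-card accesses are minimized.

```python
def place_tasks(tasks, gpu_data):
    """tasks: {task: set of data blocks it reads}
    gpu_data: {gpu: set of data blocks resident on that GPU}
    Returns {task: gpu}, choosing per task the GPU that holds the
    most of its inputs (i.e. the fewest remote accesses)."""
    placement = {}
    for task, blocks in tasks.items():
        placement[task] = max(
            gpu_data, key=lambda g: len(blocks & gpu_data[g]))
    return placement

# Hypothetical residency map and task inputs
gpu_data = {"gpu0": {"a", "b"}, "gpu1": {"c", "d", "e"}}
tasks = {"t1": {"a", "b", "c"}, "t2": {"d", "e"}}
placement = place_tasks(tasks, gpu_data)
```

Here t1 lands on gpu0 (two of its three blocks are local) and t2 on gpu1, which is the locality-first behavior the proposed scheduler generalizes.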

Mohamed, M., Engel, R., Warke, A., Berman, S., & Ludwig, H. (2019). Extensible persistence as a service for containers. Future Generation Computer Systems, 97, 10-20. doi: 10.1016/j.future.2018.12.015

This paper discusses the Ubiquity framework for managing data and making it available to container platforms such as Kubernetes, Docker, and OpenShift when deploying workloads in an agile manner. Microservices have become the new way of speeding up software application development, since they consist of individual components intended for a specific singular purpose, allowing for smaller chunks of code that are easier to manage. Due to the rise of microservices, an organization may run multiple deployment platforms spanning a range of container types. Ubiquity is presented as a way to provide storage backends for these workloads and is efficient when onboarding stateful services in heterogeneous container environments.

The Ubiquity framework makes it easy to integrate different container orchestrators, each with its own specialties, into the deployment of software applications. This provides storage as well as the ability to easily modify the surrounding software applications used in the machine learning infrastructure.

If software applications need to use multiple container orchestrators, Ubiquity provides a solution to API changes while supporting storage through plug-in features. There have been other attempts to remedy the storage capabilities necessary for running both stateless and stateful applications, but none have developed a solution as complete as Ubiquity's. It is an open source project, which may be part of why it has been able to address this concern.
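
The pluggability that makes this work can be illustrated with a simple backend registry. This is an illustrative pattern, not Ubiquity's real API, and all class and method names are assumptions: orchestrators make one uniform volume call, and a named plugin supplies the storage-specific behavior.

```python
class StorageBackend:
    """Interface every storage plugin implements."""
    def create_volume(self, name, size_gb):
        raise NotImplementedError

class InMemoryBackend(StorageBackend):
    """Toy plugin standing in for a real backend (e.g. NFS or block storage)."""
    def __init__(self):
        self.volumes = {}
    def create_volume(self, name, size_gb):
        self.volumes[name] = size_gb
        return f"{name} ({size_gb} GiB)"

class VolumeService:
    """Single entry point the container orchestrators talk to;
    the backend name selects the plugin."""
    def __init__(self):
        self.backends = {}
    def register(self, backend_name, backend):
        self.backends[backend_name] = backend
    def create(self, backend_name, volume, size_gb):
        return self.backends[backend_name].create_volume(volume, size_gb)

service = VolumeService()
service.register("memory", InMemoryBackend())
vol = service.create("memory", "training-data", 10)
```

Adding support for a new storage system means registering one more plugin; nothing on the orchestrator side changes, which is the extensibility the paper's title refers to.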

Raj, P., & Raman, A. (2018). Software-defined cloud centers: Operational and management technologies and tools. Cham, Switzerland: Springer.

This book covers advancements in cloud computing and how its growth affects the organizational structures of IT departments. It also covers the benefits of software-defined cloud environments, storage virtualization, and network virtualization, as well as multi-cloud management and security. Most excitingly, it explores the known and unknown challenges of hybrid cloud integration.

Hybrid cloud integration is key to the machine learning infrastructure because you can benefit from the accessibility and scalability of the cloud environment while simultaneously keeping an on-premise database. However, this comes with its own setbacks inherent to managing two platforms, plus the general security issues seen by both.

Different cloud environments are covered in depth here, which is especially important since machine learning infrastructures have to blend seamlessly into their environment. Knowing the tools and techniques used in the cloud-operated element of the environment can aid in choosing the appropriate tooling. The machine learning infrastructure utilizes the cloud for a host of operations that are difficult to run without the efficiency provided by cloud computing and supercomputing.

Reuther, A., Byun, C., Arcand, W., Bestor, D., Bergeron, B., Hubbell, M., Jones, M., Michaleas, P., Prout, A., Rosa, A., & Kepner, J. (2018). Scalable system scheduling for HPC and big data. Journal of Parallel and Distributed Computing, 111, 76-92. doi: 10.1016/j.jpdc.2017.06.009

This paper is a survey of 15 big data and supercomputing schedulers. The authors developed a theoretical model for measuring scheduler latency and conducted a comparison of the top four schedulers: Hadoop YARN, Slurm, Son of Grid Engine, and Mesos. From this they surmise that a nonlinear exponent and the marginal latency of the scheduler are the two key parameters, and they uncovered a host of useful information about the four schedulers.

Understanding the performance of different schedulers can be helpful when deciding which to use in a machine learning infrastructure. This is an area that needs to be fine-tuned when creating a tighter feedback loop in the data supply chain.

This paper served as an insightful overview and analysis of different schedulers and how their features can be fine-tuned to enhance task execution. Since schedulers control the workload of a system, any latency experienced here will affect the overall efficiency of a scalable computing infrastructure. They found that the core features are common to all four schedulers, so the insight this paper provides comes from the latency issues. I found their documentation of this process excellent and useful for processing data before it is ingested into the machine learning infrastructure.
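
The two key parameters suggest a simple latency model: total launch time grows with the number of submitted jobs as a power law governed by the scheduler's marginal latency and a nonlinear exponent. A hedged sketch of that model follows; the parameter values are illustrative, not the paper's measurements.

```python
def launch_latency(n_jobs, marginal_latency, exponent):
    """Total scheduler launch latency for n_jobs, modeled as a power law:
    marginal_latency * n_jobs ** exponent. An exponent of 1 means latency
    scales linearly with job count; above 1 it degrades super-linearly."""
    return marginal_latency * n_jobs ** exponent

# Compare a scheduler that scales linearly to one that degrades
# super-linearly at the same per-job cost (illustrative numbers).
linear = launch_latency(1000, marginal_latency=0.5, exponent=1.0)
superlinear = launch_latency(1000, marginal_latency=0.5, exponent=1.2)
```

The point of fitting such a model to measured data is that the exponent, not the per-job cost, dominates at scale: two schedulers indistinguishable at a hundred jobs can differ by an order of magnitude at a hundred thousand.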

Ronkainen, J., & Iivari, A. (2015). Designing a data management pipeline for pervasive sensor communication systems. Procedia Computer Science, 56, 183-188. doi: 10.1016/j.procs.2015.07.193

This article discusses how to maximize the benefit of embedded sensor data through analysis and preprocessing, and how distributed processing aids in this. Specifically, the authors discuss requirements for designing the data pipeline. The focus is on live preprocessing, data input, and cloud storage, motivated by the rapid growth of IoT, with more devices connected to the Internet than there are people in the world. It also explores the two types of processing (stream and batch), both of which are required for sensor data.

Oftentimes the data used in machine learning comes from sensors, as they provide the real-time data used to make decisions in just about every field, from automotive to medical to chemical. The pipeline is integral to this and is how machine learning infrastructures ingest data.

This paper is a survey of what a high-performance data pipeline for sensor data should look like, based on realistic sensor communications. It effectively provides an overview of the different components, but includes a disclaimer that the model is not flexible and that integrating other data processing frameworks will take added effort. For general comparison this paper is a useful resource on data collection, but it only relates to the machine learning infrastructure at the juncture where the paper concludes.
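
The stream/batch split the article describes can be sketched in miniature: the same sensor readings pass through a live per-reading stage and are also accumulated for periodic batch analysis. This is an illustrative toy, not the authors' pipeline; the threshold and window size are assumed values.

```python
def stream_stage(readings, alert_threshold):
    """Live preprocessing: flag each reading the moment it arrives."""
    for value in readings:
        yield value, value > alert_threshold

def batch_stage(readings, window):
    """Batch analysis: aggregate stored readings over fixed-size windows."""
    averages = []
    for i in range(0, len(readings) - window + 1, window):
        chunk = readings[i:i + window]
        averages.append(sum(chunk) / window)
    return averages

# Hypothetical temperature readings with one anomalous spike
readings = [20.1, 20.3, 35.0, 20.2, 20.4, 20.0]
alerts = [v for v, hot in stream_stage(readings, alert_threshold=30.0) if hot]
window_means = batch_stage(readings, window=3)
```

The stream stage catches the spike immediately, while the batch stage produces the aggregates that later analytics (or a machine learning model) consume; a real pipeline needs both paths, which is the article's central requirement.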

Scionti, A., et al. (2019, June 21). HPC, cloud and big-data convergent architectures: The LEXIS approach. CISIS 2019: 13th International Conference on Complex, Intelligent, and Software Intensive Systems, Australia. doi: 10.1007/978-3-030-22354-0

This paper discusses the design project LEXIS, an effort funded by the European Commission to present an innovative architecture combining high-performance computing (HPC) with big data technologies in the fourth paradigm of scientific discovery. This new paradigm relies heavily on large data sets and on heterogeneous infrastructures that resolve latency concerns. Burst Buffers (BB), SSDs on a PCI-Express bus, provide temporary storage for cached data and faster read/write throughput. Field Programmable Gate Arrays (FPGAs) allow a balanced tradeoff between flexibility and compute power with customizable circuit features. BBs and FPGAs integrated with cloud computing technologies will allow LEXIS to build an HPC-as-a-service model that is more readily accessible to a broad range of users while maintaining security standards.

This project presents the LEXIS platform architecture, computing infrastructure, and data layer, all of which the paper covers well in terms of necessary tooling. These can be adopted into the machine learning infrastructure, if they have not been already, in the effort to create a tight feedback loop with greater user accessibility and low latency.

This paper is highly valuable to my research project, as it not only outlines a potential computing infrastructure to consider in congruence with the machine learning infrastructure, but discusses platform architecture as well. It is an effort to address concerns currently faced when making the best use of supercomputing for the analysis of large datasets as an alternative method for uncovering knowledge about our world. Broadly, this is what artificial intelligence will come to rely on, and more specifically, machine learning will become more computationally efficient as better processes, such as this, are designed.

Wang, B., Song, Y., Cao, J., Cui, X., & Zhang, L. (2018). Improving task scheduling with parallelism awareness in heterogeneous computational environments. Future Generation Computer Systems, 94, 419-429. doi: 10.1016/j.future.2018.11.012

This article treats task scheduling and the benefits of parallel computing as an optimization problem. It utilizes parallelism awareness to maximize the use of the assigned server's available cores, meeting deadlines by executing tasks due first in heterogeneous environments. In their experiments the authors showed improved finish times, energy efficiency, task violations, and resource utilization for task execution in heterogeneous computer systems with scheduling parallelism awareness (SPA).

With the addition of new technologies to the machine learning infrastructure and its ever-increasing heterogeneity, it is beneficial to use SPA. This article covers the custom-built algorithms used and the differing scheduling methods, which can be compared against current infrastructures.

They mention necessary future research, which will inevitably end in processing tasks with hybrid schemes once that additional related work is conducted. This too will be helpful for my research topic. They depict their research well in this paper, and in theory it could be reproduced or applied in an ad-hoc manner to evaluate the optimization of an infrastructure.
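
The core idea of SPA can be sketched as earliest-deadline-first placement that respects each server's free cores. This is a heavy simplification of the authors' algorithms (it has no time dimension, and the task and server structures are hypothetical), but it shows the shape of the decision: urgent tasks get the parallelism they need first, and tasks that cannot fit anywhere are recorded as violations.

```python
def schedule(tasks, servers):
    """tasks: list of (name, cores_needed, deadline).
    servers: {name: free_cores}; this dict is mutated as cores are claimed.
    Earliest-deadline-first: place each task on a server with enough free
    cores to run it in parallel; return placements and deadline violations."""
    placements, violations = {}, []
    for name, cores, deadline in sorted(tasks, key=lambda t: t[2]):
        # pick any server that can host the task's full parallelism
        fit = next((s for s, free in servers.items() if free >= cores), None)
        if fit is None:
            violations.append(name)
            continue
        servers[fit] -= cores
        placements[name] = fit
    return placements, violations

servers = {"s1": 4, "s2": 2}
tasks = [("t_late", 2, 50), ("t_urgent", 4, 10), ("t_big", 4, 30)]
placements, violations = schedule(tasks, servers)
```

The urgent four-core task claims the big server first, so the later four-core task cannot be placed and is flagged; the real SPA algorithms add task release times, execution models, and energy terms on top of this skeleton.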

Witt, C., Bux, M., Gusew, W., & Leser, U. (2019). Predictive performance modeling for distributed batch processing using black box monitoring and machine learning. Information Systems, 82, 33-52. doi: 10.1016/

This paper looks into the monitoring of batch processing with non-intrusive methods that allow uniform application across all workloads, with each workload viewed as a "black box" by the systems responsible for managing the infrastructure. The authors consider the limitations, predicted performance metrics, performance variation, and use cases of predictive performance modeling (PPM) as a means to support distributed computing. PPM serves as a cost-effective way to solve planning problems as an alternative to analytical modeling or simulation.

This is helpful for resource allocation when considering economies of scale. Machine learning is a way to deploy PPM methods and could help create a tighter feedback loop in the machine learning infrastructure.

Considering the predictive performance model not only as deployable with the help of machine learning, but as something that could help create the tight feedback loop needed in current machine learning infrastructures, is worthy of exploration. The batch processing aspect of workloads is a bottleneck and a problem many researchers are developing solutions to. This theory, combined with the necessary technology, could prove beneficial to existing infrastructures.
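
The black-box approach can be sketched with the simplest possible predictor: fit runtime to input size from monitoring records alone, with no knowledge of the workload's internals. This is an ordinary least-squares fit done by hand, with made-up monitoring data for illustration; the paper's actual models are richer, but the contract is the same.

```python
def fit_runtime_model(observations):
    """observations: (input_size, runtime) pairs gathered by black-box
    monitoring. Ordinary least-squares fit of runtime = slope*size + intercept."""
    n = len(observations)
    sx = sum(x for x, _ in observations)
    sy = sum(y for _, y in observations)
    sxx = sum(x * x for x, _ in observations)
    sxy = sum(x * y for x, y in observations)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return lambda size: slope * size + intercept

# Hypothetical monitoring records: (gigabytes processed, seconds of runtime)
history = [(1, 12.0), (2, 22.0), (4, 42.0), (8, 82.0)]
predict = fit_runtime_model(history)
estimate = predict(16)  # planning estimate for a larger, unseen input
```

Because the model is learned purely from observed (size, runtime) pairs, it applies uniformly to any batch workload, which is precisely what makes black-box PPM cheap enough to use for cluster planning.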