Analyzing machine learning based on distributed file system

Machine learning has been one of the fastest-growing areas of computer science over the last fifteen years, and it continues to produce innovation as the training process becomes ever more streamlined. With the rapid development of deep learning tools, the demand for training data will only grow. With a centralized file system, a storage bottleneck is inevitable. Moving from centralized storage to a distributed file system improves performance by reducing I/O overhead and distributing the work more evenly, and it provides a better return on investment the further the system is scaled up. The goal of this thesis is to design and implement a distributed file system in a cloud environment that can effectively serve as the central storage solution for the distributed deep learning framework Horovod[1]. In addition, all configuration steps will be executed using automation tools, which will make both the initial and subsequent deployments more streamlined. After a general description of deep learning frameworks, multiple distributed file systems will be compared with the aim of finding the most suitable one based on its compatibility, reliability and performance. A short description of Terraform and Ansible is also included, since I am going to use these tools to build my system. There are no existing, well-known implementations of Horovod based on a distributed file system. However, Spark, a distributed big-data solution built on distributed datasets that store data across multiple machines, much as Horovod distributes worker nodes across multiple physical machines, has been implemented on top of HDFS. A good example of this is the web-based platform Databricks, which was developed by the creators of Spark and provides automated cluster management and IPython-style notebooks.