In this blog post, we discuss what you need to consider when running machine learning at home. This article is a continuation of our previous blog post, Machine learning in the home office (Part 1/3): Motivation. Part 3 tackles the practical how-tos and best practices.
Here is my user story: “As a machine learning engineer, I want the perfect setup at home”. But I have another one: “As a manager, I want to limit the costs of daily developments”.
This is nothing new: you will always need a local copy of the data being processed. If a network transfer is involved anywhere in the data-processing pipeline, it will dramatically slow down your experiments. The same holds in the cloud: you want your data to reside in the same geographical region as your compute in order to avoid cross-region transfer fees and delays. So the goal at home is not that different: keep the data close to the compute.

To do this, you need two other ingredients: network and storage.
Even with data locality, some network transfers will happen nevertheless. For that reason, I encourage engineers to invest in their network equipment at home. WiFi, for instance, has very variable transfer rates, as it is sensitive to many external factors (e.g., walls, interference from your and your neighbors' devices). On the other hand, you probably do not want long network cables running through your house. Transfers will likely happen between your workstation and the internet rather than between computers in your home, so the transfer rates will be dominated by the bandwidth your internet provider gives you.
A good solution is the so-called Powerline technology: a cheap, consumer-oriented technology that uses your electric wiring as network cables. Your network becomes less sensitive to interference and other external factors than WiFi, you get satisfactory transfer rates, installation is done with the push of a button (since you usually have an electric infrastructure already), and you avoid long Ethernet cables.
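To give a feel for why bandwidth dominates, here is a small back-of-the-envelope calculation; the link speeds below are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope transfer times for a 300 GB dataset (roughly ImageNet)
# over a few link speeds. The speeds below are illustrative assumptions.
def transfer_hours(size_gb: float, mbit_per_s: float) -> float:
    """Hours needed to move `size_gb` over an ideal link of `mbit_per_s`."""
    size_mbit = size_gb * 8 * 1000  # GB -> megabits (decimal units)
    return size_mbit / mbit_per_s / 3600

for label, speed in [("50 Mbit/s internet uplink", 50),
                     ("500 Mbit/s Powerline/LAN", 500),
                     ("SATA SSD, ~4 Gbit/s", 4000)]:
    print(f"{label:28s} {transfer_hours(300, speed):7.2f} h")
```

At 50 Mbit/s, moving an ImageNet-sized dataset takes over 13 hours; a fast local link brings that down to minutes, which is exactly why you want the data next to the compute.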
You will need to store the original dataset but also any intermediate transformations of it. Storage doesn't cost much these days, whether in the cloud or locally. An external hard drive with 24TB costs around 600€ (Feb. 2021), and if you are concerned about speed or safety, you can even combine several drives into a RAID-like structure. As reference points, a dataset such as ImageNet is around 300GB, and the textual content of Wikipedia with its full history is around 200GB. Unless you work in a company that is actively collecting data, a high-end consumer external hard drive should cover many use cases at very limited cost.
External hard drives have two drawbacks: the noise and vibrations, as they are usually spinning disks, and the speed of access. For frequently accessed data, such as the data read during the training loop, you may prefer internal SSD drives instead, at around 85€/TB (with a hard limit on the number of such drives your machine can host). If you managed to minimize the data needed for the training loop, then after the data-transformation pipeline you do not need the external hard drive anymore.
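Plugging in the price points quoted above (Feb. 2021 figures), the cost gap between external spinning drives and internal SSDs is easy to quantify:

```python
# Cost comparison using the figures from the text (Feb. 2021 price points):
# a 24 TB external drive at ~600€ versus internal SSDs at ~85€/TB.
HDD_TB, HDD_PRICE_EUR = 24, 600
SSD_EUR_PER_TB = 85

hdd_eur_per_tb = HDD_PRICE_EUR / HDD_TB        # 25 €/TB for spinning disks
ssd_premium = SSD_EUR_PER_TB / hdd_eur_per_tb  # how much more SSDs cost per TB

print(f"HDD: {hdd_eur_per_tb:.0f} €/TB, SSD: {SSD_EUR_PER_TB} €/TB "
      f"({ssd_premium:.1f}x premium for speed and silence)")
```

A roughly 3.4x premium per terabyte, which is why it pays to keep only the hot training data on SSD and everything else on the external drive.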
Backups are crucial and are difficult to cater for at home. Even dedicated hardware at home will be lost in case of robbery or fire. Depending on how much processing time has been devoted to transforming the raw data in your pipeline, you may or may not be willing to back up the generated data. As the processing pipeline gets larger and/or closer to the training data, the added value of that data increases. That is even more true if you want to share this intermediate data with your colleagues. Here it will be hard to beat the added value and services offered by the cloud providers: you get data safety (server-side encryption, redundancy) at a price and in an offering you can adapt to your needs. On AWS, the offering revolves around the availability of the stored data: data with immediate access -- and suited for sharing -- costs a higher storage fee than data targeted for archival, which comes with the overhead of scheduling the access and rather slow availability.
If sharing is the only purpose, and you want to avoid the cloud at all costs (and you feel like a geek), you can always use the Torrent protocol privately and among your peers. Yes, I feel like a geek!
As an example, the 24TB mentioned above will cost around $550/month on the most expensive tier, and only $25/month on, e.g., AWS Glacier Deep Archive (meant for archival purposes). There is a hard trade-off between recreating the data from scratch and storing it for reuse: recreating it means spending CPU cycles and time whenever it is accessed again, while storing it amortizes the initial processing costs among peers and provides disaster recovery.
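That trade-off can be sketched as a simple break-even calculation. The monthly fees come from the figures above; the one-off recreation cost is a hypothetical placeholder, not a number from the text:

```python
# Store-vs-recreate trade-off for the 24 TB example above
# (~$550/month standard tier, ~$25/month deep archive, Feb. 2021).
# The one-off recreation cost below is a hypothetical placeholder.
STANDARD_MONTHLY = 550.0
ARCHIVE_MONTHLY = 25.0
RECREATE_COST = 300.0  # assumed compute cost to rebuild the data once

def break_even_months(monthly_fee: float, recreate_cost: float) -> float:
    """Months after which cumulative storage fees exceed one recreation."""
    return recreate_cost / monthly_fee

print(f"Archive tier: cheaper than recreating for "
      f"{break_even_months(ARCHIVE_MONTHLY, RECREATE_COST):.0f} months")
print(f"Standard tier: cheaper than recreating for "
      f"{break_even_months(STANDARD_MONTHLY, RECREATE_COST):.1f} months")
```

Under these assumptions, archival storage stays cheaper than recreation for a full year, while the standard tier pays for itself only if the data is actively shared.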
Why a workstation and not, e.g., a laptop in the first place? After all, the MacBook Pros are wonderful and beautiful machines.
An effective machine-learning setup has to meet a few requirements:
1. Processing power,
2. Memory,
3. Storage,
4. A limited level of nuisances (e.g., noise), and, lastly,
5. Flexibility to tweak points 1.-4.
Workstations let you add hardware depending on your needs, whether it is memory, a hard drive, a shiny new GPU, or anything else. In addition, you can adjust and refine the configuration iteratively, so that your setup evolves together with the new constraints your work or home environment imposes. This is impossible with laptops.
Note that a refurbished workstation costs less than a third of the price of a new one, and you won't lose much performance: CPUs from a few years back are still very good, and in my experience, those workstations are rock solid.
Just go for Linux/Ubuntu. It is free and provides you with thousands of useful packages, which can be installed with a single command. Moreover, it will be closer to the deployment environment of your pipelines, and the installation can be easily described (for instance through Ansible plays) and replicated on your colleagues' machines.
There are some downsides to Linux:
- First, you have to get “hibernation”, or at least “deep sleep”, working (it took me a while to get that feature stable; welcome to 2021). Why? Because you may want to interrupt the training overnight to get some good sleep, and resume a multi-day ML training run on the next working day.
- Second, Linux is not perfect when it comes to multimedia. Staying in touch with your colleagues in times of home office is important; luckily, all standard team-communication tools are available (Slack, Skype, Zoom, Hangouts, etc.), but the audio or webcam might not be well supported, or might become unstable after a few hibernation/wake-up cycles.
- Finally, if you need encryption for your data or code, the configuration might not be as simple as on other operating systems. Still, it is very well supported and does not involve a complex setup.
Summing it all up
Read more on the practical how-to’s and best practices when running machine learning at home in Part 3 of this article.
📬 If you found this blog interesting, subscribe to our monthly newsletter digest, where you'll receive further content about machine learning in work tools and coding best-practices.