This article is a continuation of Machine learning in the home office (Part 1/3): Motivation and Machine learning in the home office (Part 2/3): Considerations.
Tooling and technology are a necessity, but, by themselves, they don’t tell you about how to use them efficiently. Machine learners not only develop code, but also mathematical models and data processing pipelines. I have summarized the best practices that you should keep in mind when doing machine learning at home:
Avoid developing your code in “presentation mode”
You will be sitting at home, far from your teammates, for some time. While Jupyter notebooks are the tool of choice when presenting a piece of code to an audience (and tools/platforms exist to share those), most of your development time will not be spent with the eyes of others scrutinizing every step of your work.
Your data and training pipeline will at some point become production code. The meaning of production is not exactly the same as in consumer products, but the program that you develop will produce valuable artefacts (models, data) for your company. Little time should be devoted to making things that will not go to production code, and Jupyter notebooks fall into this category.
Do not underestimate the value of unit testing
I have a strong preference for robust development through contractual proof of correctness. As the code grows in size, it becomes increasingly hard to keep track of the correctness of the various parts of your pipeline without automated unit testing. Besides, machine learners want their pipeline to be based on a solid (and correct) foundation: Would Pandas, PyTorch or TensorFlow be as popular as they are today if they were full of bugs?
Any deterministic part of the program can be easily covered by unit tests. Automated tests for learning/training code may also fall into this category, but the procedures are less well established. Not only do programs benefit from unit testing, but it makes it easier to grow teams around a common code base and provides cues for disseminating knowledge about code. In times of remote work, those properties are especially valuable within teams. Unit testing is a best practice in software engineering and provides great value in machine learning too.
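As an illustration of how little is needed to cover a deterministic step, here is a minimal sketch using Python's standard `unittest` module. The `normalize` function is a hypothetical preprocessing step invented for this example; it is not taken from any particular pipeline.

```python
import unittest


def normalize(values):
    """Scale a list of numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant input: avoid a division by zero and map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


class NormalizeTest(unittest.TestCase):
    def test_range(self):
        self.assertEqual(normalize([2, 4, 6]), [0.0, 0.5, 1.0])

    def test_constant_input(self):
        self.assertEqual(normalize([3, 3]), [0.0, 0.0])
```

Saved in a file, such tests can be run with `python -m unittest`, locally or in a continuous integration job shared by the team.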
Start small, discuss synthetic models early
A machine-learning engineer has to deal with code and maths. Collaboration tools for code are everywhere today, and collaborating on code follows a well-understood, industry-standard process. Discussing maths is another story though.
While we have wonderful collaboration tools (thank you, Miro) available to us today, the throughput of information among people is very different in a home-office setup compared to when everybody is sitting in the same office and discussing with a strawberry smoothie over a whiteboard. Developers and researchers are usually familiar with a remote setup, but they also value live discussions and interactions, and what they bring to fostering ideas and creating synergies.
To communicate about machine learning, we can of course read and discuss the same articles, write and share mathy documents in LaTeX, and share virtual boards.
Fruitful collaboration will tackle questions like “How did you design the experiments?”, “How did you create the data?”, “What does the training curve look like?”, “Does it look right?”, “What can be improved?”, “Can we check the distribution of X during training?”, “Can we have the gradient of that?”
Those insights take communication to a completely different dimension. The expressiveness of the communication, the vocabulary, and the manipulated notions are targeted towards mathematical models. You’ll speak the language of your peers: Model, loss curve, test dataset, metrics. If something looks suspicious, then you switch to looking at the code. This makes the communication more precise and the iterations over the models much faster.
Choose your technology stack carefully
There is a chance that the code that you are developing at home will be the code that will be deployed on the cloud, whether this code will be directly customer facing (e.g., running an ML service) or not (training a pipeline).
A particularity of the cloud compared to e.g., edge computing is that the technology on which the code is based is completely hidden from the customer. This gives a lot of freedom to developers, and the temptation to try the latest new trend, or to add another technology on a very specific part of the code, is high.
It is however important to bear in mind that any additional engineering effort for adapting your code to another running environment should be kept as low as possible. Having many technologies to cope with will increase the skill requirements for maintaining and extending the code.
Prefer frameworks that can be adapted with some configuration (SQLAlchemy) or that have an equivalent on the cloud (postgres). Follow the “choose boring tech” principle: Well-established technologies have stood the test of time and will provide you with better support. In turn, they will give you more time to dedicate to what is important, which is your data pipelines and your ML models.
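One way to read "adapted with some configuration" is to keep the storage backend behind a single URL. The sketch below uses only the standard library; the `DATABASE_URL` variable name and the default SQLite file name are assumptions made for this example.

```python
import os

# Hypothetical default: a local SQLite file for work at home.
DEFAULT_DB_URL = "sqlite:///local_experiments.db"


def database_url(env=os.environ):
    """Return the configured database URL, falling back to the local default."""
    return env.get("DATABASE_URL", DEFAULT_DB_URL)


# At home nothing is set, so the local file is used.
# On the cloud, the deployment could set for instance
#   DATABASE_URL=postgresql://user:password@host:5432/experiments
# and the same code would hand that URL to sqlalchemy.create_engine(...).
```

The rest of the code then never needs to know which database it is talking to, which keeps the adaptation effort low when moving from home to the cloud.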
Methodological adjustments to consider when running machine learning at home
Optimize for data locality
Working on a limited number of processors and memory places you in a different mindset. Preprocess your data as much as you can.
You certainly do not want to waste any CPU/GPU cycles, especially during the many epochs and batches of iterations your pipeline will go through during the training. Any transformation that eats up processing power from your training is better cached to disk, assuming that loading the data is faster than recomputing it.
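A minimal sketch of caching a transformation to disk, using only the standard library. The helper `cached`, the key `"squares-v1"` and the function `expensive_features` are hypothetical names chosen for this example; a real pipeline would likely use a format better suited to its data than `pickle`.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def cached(cache_dir, key, compute):
    """Return compute() for `key`, loading it from disk when already cached."""
    name = hashlib.sha256(key.encode("utf-8")).hexdigest() + ".pkl"
    path = Path(cache_dir) / name
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    result = compute()
    with path.open("wb") as f:
        pickle.dump(result, f, protocol=pickle.HIGHEST_PROTOCOL)
    return result


calls = []  # records how often the expensive function really runs


def expensive_features():
    # Stand-in for a costly transformation.
    calls.append(1)
    return [x * x for x in range(5)]


with tempfile.TemporaryDirectory() as tmp:
    first = cached(tmp, "squares-v1", expensive_features)   # computed
    second = cached(tmp, "squares-v1", expensive_features)  # read from disk
```

Putting a version in the key (here `v1`) is one simple way to invalidate the cache when the transformation changes.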
For this assumption to hold, I/O must not become the limiting factor. Various methods may help:
- 1 Be parsimonious with the data: (Preprocess and) read only what you need for your training. Any other information that would be read from disk should be discarded/avoided. For images, avoid resizing at training time and drop color information if it is not needed. If your textual data has to go through a lookup table, then save the result of that lookup instead.
- 2 Make the data as small as possible on disk by using compression. Even if uncompressing the data will take CPU cycles, decompression can be considered as part of the I/O. There are various compression/decompression algorithms with different efficiency in terms of speed. As an example, the “feather” format in Pandas offers excellent read speed and relatively good compression, even if it doesn’t result in the smallest files.
- 3 Use as few files as possible: Every access to your disk should be amortized because of the access itself and the overhead of reading and decompression.
- 4 Use internal SSD drives for faster reading. “Internal” is important here: Even if the bandwidth over USB is increasing, it is still nothing compared to dedicated I/O slots inside your computer. If you are short on space (SSDs are expensive), move all unnecessary data to your external storage. If you are still short on disk space, buy another SSD. Cherry on top: SSD drives do not produce any noise or vibrations, so you could be tempted to leave the workstation running overnight.
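The first three points can be sketched together with the standard library: resolve the lookup table once, then store the result in a single compressed file. The vocabulary and sentences below are made up for the example, and `gzip` with JSON is only one possible choice of compression and format.

```python
import gzip
import json
import os
import tempfile

# Hypothetical vocabulary and corpus, for illustration only.
vocabulary = {"<unk>": 0, "machine": 1, "learning": 2, "at": 3, "home": 4}
sentences = [["machine", "learning", "at", "home"], ["learning", "fast"]]

# Resolve the lookup once, offline, instead of at every epoch.
unknown = vocabulary["<unk>"]
encoded = [[vocabulary.get(tok, unknown) for tok in s] for s in sentences]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dataset.json.gz")
    # A single compressed file holds the whole (small) dataset.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(encoded, f)
    # Training-time read: decompression happens as part of the I/O.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        loaded = json.load(f)
```

Only the integer identifiers reach the disk, so the training loop reads less data and skips the lookup entirely.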
Upload/backup data regularly to the cloud
This will not cost you a lot of CPU cycles and you can do this regularly during the training.
Share data and parameters with precision
In a remote setup, it is particularly difficult to reproduce a training experiment across team members. Various elements that are cumbersome to communicate influence the experiment itself: The subset of the dataset, the parameters of the various blocks (training and data preparation), the hyper-parameters, etc.
Agree on how to identify precisely the set of files that your dataset contains, and on how to verify the integrity of the data you share.
Repeatable experiments matter all the more because your feedback loop is limited.
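One possible way to make "the exact set of files and parameters" something you can quote in a chat message is a manifest of checksums. This is a sketch under assumptions: the function names, the file `train.csv` and the parameter values are all hypothetical.

```python
import hashlib
import json
import os
import tempfile


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so that large files do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir, params):
    """Map every file under data_dir to its checksum, next to the parameters."""
    files = {}
    for root, _dirs, names in os.walk(data_dir):
        for name in names:
            full = os.path.join(root, name)
            files[os.path.relpath(full, data_dir)] = file_sha256(full)
    return {"files": files, "params": params}


def manifest_id(manifest):
    """Short identifier that does not depend on the order of the entries."""
    blob = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]


with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "train.csv"), "wb") as f:
        f.write(b"a,b\n1,2\n")
    params = {"learning_rate": 0.01, "batch_size": 32, "seed": 0}
    manifest = build_manifest(tmp, params)
    experiment_id = manifest_id(manifest)
```

Two people who obtain the same identifier are training on byte-identical files with the same parameters; a different identifier tells them to compare manifests before comparing loss curves.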
Checkpoint your training
Your computer is not failure-proof: Algorithms going crazy over your RAM, power outages, a cat eating the cables, etc. Your home is a very different environment from a datacenter (although machines in a datacenter may also fail you for other reasons).
If everything stays in RAM, then it will not survive any little perturbation in your environment, and you will lose all the CPU/GPU effort you put into the training. A way to mitigate this is to regularly checkpoint your models to disk during the training. Before getting serious with the training, make sure you can resume it from a checkpoint saved on disk.
There is one side benefit: You may inject an intermediate version of your model into the pipeline that consumes your models while the training is still running.
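The checkpoint-and-resume cycle can be sketched with the standard library alone. The "model" here is a single number and the update is a stand-in, both hypothetical; with a real framework you would save its model and optimizer state instead. The `stop_after` parameter only exists to simulate an interruption.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write to a temporary file first, then rename it over the old checkpoint,
    so that an interruption during the write leaves the previous one intact."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh one when no checkpoint exists."""
    if not os.path.exists(path):
        return {"epoch": 0, "weight": 0.0}
    with open(path, "rb") as f:
        return pickle.load(f)


def train(path, total_epochs, stop_after=None):
    state = load_checkpoint(path)
    for epoch in range(state["epoch"], total_epochs):
        if stop_after is not None and epoch >= stop_after:
            break  # simulated power cut
        state["weight"] += 0.5  # stand-in for a real training epoch
        state["epoch"] = epoch + 1
        save_checkpoint(path, state)
    return state


with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "model.ckpt")
    interrupted = train(ckpt, total_epochs=10, stop_after=4)
    resumed = train(ckpt, total_epochs=10)  # continues from epoch 4
```

Running this kind of interrupted-then-resumed test on a tiny configuration is a cheap way to check the resume path before a long training run.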
Costs of Home ML vs Cloud ML
According to some magical calculations, an E5-2630 v3 bi-processor has 256 vCPUs. Let’s do 2 calculations:
- How many vCPUs in a week can be dedicated to my training?
- How much does it cost?
For this comparison to be “fair”, I am taking a rather boosted AWS c5n.18xlarge instance with a price of $1.1659/h (spot price) and $3.888/h (on demand), see AWS Spot pricing and AWS On-demand pricing.
Machines on AWS work 24/7. At home, there are constraints that data centers do not have (sleep/nighttime, week-ends), so let’s say the available CPU time is 12 hours a day over 5 days. Since I am also using my workstation for my daily work, let’s say 10% (magical number, on average) of the CPU time goes to my daily developments. After 7 days, I then have:
- Home ML: 0.9 × 256 × 12 × 5 = 13824 vCPU·h
- AWS c5n.18xlarge: 72 × 24 × 7 = 12096 vCPU·h
About the costs:
- Home ML: One-time purchase + various equipment = 4000€ ~ $4800
- AWS c5n.18xlarge (spot price): $1.1659 × 24 × 7 = $195.9
- AWS c5n.18xlarge (on-demand): $3.888 × 24 × 7 = $653
And this is only for a single week: We have more CPU cycles, and we are still amortizing various costs like the ones of a regular home office.
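The figures above can be reproduced with a few lines of Python. All inputs come from the comparison itself; the break-even estimate at the end is a simplification added here, which ignores electricity, maintenance and the difference in delivered vCPU hours.

```python
# Inputs taken from the comparison above.
HOME_VCPUS = 256
HOME_HOURS_PER_WEEK = 12 * 5      # 12 hours a day over 5 days
HOME_TRAINING_SHARE = 0.9         # 10% goes to daily development work
HOME_PRICE_USD = 4800             # one-time purchase

AWS_VCPUS = 72                    # c5n.18xlarge
AWS_HOURS_PER_WEEK = 24 * 7
AWS_SPOT_USD_PER_HOUR = 1.1659
AWS_ON_DEMAND_USD_PER_HOUR = 3.888

home_vcpu_hours = HOME_TRAINING_SHARE * HOME_VCPUS * HOME_HOURS_PER_WEEK
aws_vcpu_hours = AWS_VCPUS * AWS_HOURS_PER_WEEK

spot_per_week = AWS_SPOT_USD_PER_HOUR * AWS_HOURS_PER_WEEK
on_demand_per_week = AWS_ON_DEMAND_USD_PER_HOUR * AWS_HOURS_PER_WEEK

# Simplified break-even: weeks of cloud rental that equal the purchase price.
breakeven_spot_weeks = HOME_PRICE_USD / spot_per_week            # roughly 24.5
breakeven_on_demand_weeks = HOME_PRICE_USD / on_demand_per_week  # roughly 7.3

print(f"home: {home_vcpu_hours:.0f} vCPU*h, AWS: {aws_vcpu_hours} vCPU*h")
print(f"spot: ${spot_per_week:.1f}/week, on-demand: ${on_demand_per_week:.1f}/week")
print(f"break-even: {breakeven_spot_weeks:.1f} weeks (spot), "
      f"{breakeven_on_demand_weeks:.1f} weeks (on-demand)")
```

Under these assumptions, the purchase pays for itself in about two months against on-demand pricing and in about half a year against spot pricing.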
- “Virtualization” at best: Using your workstation for your work and for training
- Running machine-learning from home is very cost-effective
- Amortization of up-front costs happens fast
What was not considered
GPUs were a game-changer for many ML tasks, and yet we did not look at them. The spirit would be the same, though. You will certainly have access to extremely powerful GPUs on the cloud, which would enable shorter training times or the processing of bigger datasets, with a cost per CPU/GPU and an entry price.
Additional infrastructure costs
We briefly mentioned robbery: This can happen at home or at the office, and it would be unlikely in a datacenter. However, data theft may happen and go completely unnoticed. You may need additional insurance depending on the field: Insuring the hardware if you benefit from an expensive workstation and/or equipment, additional warranties from manufacturers for various hardware failures, and/or additional data safety through encryption.
You will need electricity for running all this: When idle, a workstation + monitor setup draws around 200 W; when processing, this may increase to 1000 W. Taking into account the same processing time (12h x 5d = 60h), training might easily reach 60 kWh per week. This cost is amortized in winter by the fact that you can reuse the heat, but this is no longer true when the temperatures get higher.
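The energy estimate is power (in kW) multiplied by time (in hours). The sketch below redoes that arithmetic; the electricity price is a placeholder assumption, not a figure from this article, so replace it with your own tariff.

```python
IDLE_POWER_W = 200            # workstation + monitor, idle
LOAD_POWER_W = 1000           # workstation + monitor, while training
TRAINING_HOURS_PER_WEEK = 12 * 5

# Assumption for illustration only: price of one kWh in your currency.
PRICE_PER_KWH = 0.30

# Total energy drawn during the training hours of one week.
energy_kwh = LOAD_POWER_W / 1000 * TRAINING_HOURS_PER_WEEK

# Part attributable to training, compared with leaving the machine idle.
extra_kwh = (LOAD_POWER_W - IDLE_POWER_W) / 1000 * TRAINING_HOURS_PER_WEEK

weekly_cost = energy_kwh * PRICE_PER_KWH

print(f"{energy_kwh:.0f} kWh per week, of which {extra_kwh:.0f} kWh due to training")
print(f"about {weekly_cost:.0f} per week at {PRICE_PER_KWH} per kWh")
```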
- Get the development and the processing in the same room
- Invest in the right equipment for being able to train locally and at home
- Training power at home can indeed be high
- Remove any context switch
- Improve precision and communication