Economists have predicted that a leisurely 15-hour workweek awaits us in the future, with robots taking over the menial tasks so that we’ll be free to explore the more cognitively stimulating aspects of our jobs.
Sounds like science fiction, right? Well, it is, really – this concept has formed the plot of some very good sci-fi novels…
But the other reason that we find ourselves in disbelief at the idea of a leisurely (or stimulating) workweek is that the reality of our work is in stark contrast with these futuristic predictions.
More than 40 percent of employees claim to spend at least a quarter of their week on manual, repetitive tasks.
It seems that robots haven’t freed us (yet!).
But there is a silver lining. The process of data automation is a tried-and-tested solution to the anguishes of the modern data scientist. It might not solve all of our future work problems, but it does unlock the path to a more productive, streamlined, and profitable data operation.
What is data automation?
Data automation is the use of intelligent processes, choice of infrastructure, application of artificial intelligence, and software to collect, transform, store, and analyze data without human intervention.
The use of data automation replaces manual labor in the data ecosystem with machines and processes that do the work for us.
Human intelligence is irreplaceable. No machine or process can predict everything that could go wrong and respond as flexibly as humans can. But a reduction in the time wasted on manual processes gives domain experts the opportunity to shine in what they do best – putting their expertise towards solving problems and generating more revenue.
This is just one of the many benefits of data automation.
The benefits of data automation
From the e-commerce business to the Internet of Things (IoT) system engineering venture, companies with any volume of data enjoy benefits from data automation:
- Reduces operational costs. Resolving bugs and manually running queries leads to an opportunity cost for everyone involved. Instead of working on interesting and revenue-generating projects, team members spend time maintaining a system to keep it afloat. Manual execution comes at a high cost: spending upwards of 40% of your workday on things that could be automated is not cost-effective.
- Less error-prone. Data automation results in fewer errors in the long run. First of all, it causes us to rethink every process and specify pen-to-paper (or keystrokes-to-code) the necessary steps for automating a process. This makes us more aware of any biased perspectives in system design, blind spots, or missing steps that could go wrong. Secondly, things are bound to break at some point. But fixing the problem within a physical system (instead of with a mental note to remind the engineer the next time they do it manually) increases the operational efficiency of the basic framework by embedding the solution within the automation code itself. The system is self-improving with each iteration.
However, understanding the importance of automation processes and actually implementing them are not equivalent. To build an intelligent enterprise, you need to implement the right data automation strategy.
Start data automation with the right data automation strategy
The process of data automation is similar to business process automation – you start with a clear strategy to guide you.
The strategy for automating your data operations has three crucial components:
- Identification of problems. Identify the major areas where your business could benefit from automation. Simply look at where data automation can be of relevant use. Ask yourself: where do your data operatives spend the majority of their time doing manual work? Which aspects of your data operations keep breaking down? Create a list of all the potential processes for improvement.
- Prioritization. Use the amount of time wasted as a proxy for how important a process is. The more time consumed by manual labor, the greater the impact of automation will be on the bottom-line. Make sure that you also consider how long it would take to automate a process. Quick wins are where it’s at, as they keep everyone’s morale up while the business owners see the value of automation.
- Execution. This is technically the hardest part: execution of data strategies. We will explore how to execute three different processes: improving reporting, improving the engineering pipeline, and improving machine-learning processes.
How to automate reporting
Data analysts spend a lot of time on reporting rather than on the actual analysis part of their work – report generation represents a massive chunk of data analytics.
This is because data analytics provides business-critical information, which needs to be delivered in timely updates to guarantee maximum impact for decision-makers.
In practice, this often means that the data analyst has to re-run the same SQL queries to generate the dashboards and visualizations, export them, and email them to critical collaborators.
One of the easiest ways to automate reporting is to rely on data automation tools such as Looker, which automate report generation and even send emails with finished reports at pre-specified times. However, data still needs to be unified and imported into BI tools for them to generate reliable, high-quality results.
Of course, you could also automate your reporting with Python. You could write scripts which access the database, clean and aggregate the data, and generate Excel reports which can be emailed automatically just in time for that Monday meeting. But this solution is suboptimal (more prone to errors, less flexible to specification changes, etc.).
We highly recommend the SaaS model of report automation instead.
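To make the Python route concrete, here is a minimal sketch of a scripted report. The `orders` table, its columns, and the region-level aggregation are all hypothetical stand-ins; a production version would point at your real database, write an Excel file (e.g. with openpyxl), and send it via smtplib on a schedule.

```python
# Minimal sketch of an automated report script.
# The "orders" table and its columns are hypothetical examples;
# in practice you'd connect to your production database, write an
# Excel file, and email it on a schedule (cron, Airflow, etc.).
import csv
import io
import sqlite3

def generate_report(conn: sqlite3.Connection) -> str:
    """Aggregate revenue per region and return the report as CSV text."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
    ).fetchall()
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["region", "total_revenue"])
    writer.writerows(rows)
    return buf.getvalue()

# Demo with an in-memory database standing in for the real source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("EU", 100.0), ("EU", 50.0), ("US", 200.0)],
)
report = generate_report(conn)
```

Every line of this would otherwise be a manual Monday-morning routine – which is exactly why the script, not the analyst, should run it.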
How to automate the data engineering ETL pipeline
The ETL pipeline is the common engineering pipeline used to process data from its raw form to one that’s usable. ETL stands for Extract-Transform-Load – extract the data from its sources, transform the data by cleaning and aggregating it, and load the data into a relational database, data warehouse, or data lake, where it’s ready for your analysts and machine-learning experts.
Depending on the underlying infrastructure and architecture, automating your ETL pipeline can be easy-peasy or require a high level of expertise, so we’ll provide no-code, low-code, and full-code examples of automating your ETL pipeline.
If you’re considering big data automation, automating the ETL pipeline will have the highest impact on the scalability of big data technologies that you use.
1. Automate data extraction
The extract phase is the part of the ETL pipeline in which we collect data from its raw sources.
The sources can include other databases, enterprise resource planning (ERP) software, customer relationship management (CRM) software, third-party apps accessed via APIs (one of the most common data sources nowadays), and a myriad of other sources.
There are several best practices when automating data collection:
- Use REST APIs or WebSockets. The architecture of your data collection should use the REST API design or webhooks and WebSockets. Either schedule REST API calls at regular intervals to keep your data up to date, or implement a webhook/WebSocket, which updates data in your ETL pipeline whenever new data is generated in the source. The latter is faster and keeps your data fresh, while the former is less expensive because it processes data in batches. You need to decide which of the two options works best for your business demands. Either way, do not run your collection scripts manually.
- Test for the improbable (but possible) worst-case scenario. Data collection scripts break, and they do so often. The payloads and endpoints of APIs change between updates, data types change, data formats are altered, and service reliability issues are a frequent pain. All of these can cause your system to accidentally import incomplete or corrupted data. You need to be prepared for the improbable (but possible) worst-case scenario and write tests which prevent data imports for extractions that do not adhere to specifications.
- Rely on software to make your life easier. Instead of wasting time and money on writing long scripts, maintaining them, and monitoring them for reliability and validity, deploy software that collects data for you. Vendors who offer data platforms provide software to integrate with your sources and take on the responsibility of the software, leaving you with more time for other tasks.
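The validation advice above can be sketched in a few lines. The field names and expected types here are hypothetical; you would substitute your actual API client and payload schema, and schedule the whole thing via cron or an orchestrator rather than running it by hand.

```python
# Sketch of an extraction step that validates incoming payloads against a
# spec before import. Field names/types are hypothetical assumptions.
from typing import Any

REQUIRED_FIELDS = {"id": int, "created_at": str, "amount": float}

def validate_record(record: dict[str, Any]) -> bool:
    """Reject records missing required fields or carrying wrong types."""
    return all(
        field in record and isinstance(record[field], expected)
        for field, expected in REQUIRED_FIELDS.items()
    )

def extract(payload: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep only records that adhere to spec; corrupted rows are dropped."""
    valid = [r for r in payload if validate_record(r)]
    if len(valid) < len(payload):
        # In production, log and alert here instead of printing.
        print(f"Rejected {len(payload) - len(valid)} malformed records")
    return valid

# Simulated API response: one good record, one with broken types.
payload = [
    {"id": 1, "created_at": "2024-01-01", "amount": 9.99},
    {"id": "2", "created_at": "2024-01-02", "amount": "broken"},
]
clean = extract(payload)
```

The key design choice is that a malformed payload never reaches your warehouse – the worst case degrades into an alert, not corrupted downstream data.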
2. Automate data transformation
Transformation is the second step of the ETL pipeline. It’s at this point that we clean the incoming data and aggregate it in the desired form for loading into the database.
When automating data transformations, keep in mind these best practices:
- Write reusable scripts. Modularize your cleaning code so that the scripts can be used on a variety of datasets. For example, when removing outliers, you could write a helper function to remove them, which is also accessible to other modules over other datasets.
- Clean preemptively. Even if your data is shaped in the right manner and doesn’t suffer from certain corruption issues, you should apply cleaning scripts to it nonetheless. For example, run the remove_missing_values.py script on tables which don’t have missing values. Why? Because if missing values do appear, you won’t have to re-clean the table manually or change scripts – the system will do it automatically.
- Write diagnostic scripts to save computing power. Cleaning preemptively (the advice above) has the shortcoming of wasting (compute) time and slowing down your system. Instead, you could create a set of diagnostic scripts, which check for missing values and run the scripts only when they find them.
- Schedule or trigger your transformation processes. Every piece of advice given so far could still be executed manually. What makes the transform stage fully automated is automated execution. You can either schedule the transformation scripts at regular intervals (batch) or trigger the scripts whenever new data comes in. The trade-off is between longer delays in having clean data vs. higher costs of cleaning data constantly.
- Move aggregations to the extract phase. This is a pro tip. APIs and webhooks allow you to specify the structure of the returned payload in advance. You can use this flexibility to specify the aggregation that you need when collecting data ahead of time. Instead of collecting rows of raw data and aggregating manually, simply collect pre-aggregated data.
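Here is a compact sketch combining two of the tips above: a reusable cleaning helper and a diagnostic check that only runs the cleaning when it is actually needed. The column names are illustrative assumptions, not part of any real schema.

```python
# Sketch of a reusable, diagnostic-driven transform step.
# Column names ("user_id", "amount") are hypothetical examples.

def has_missing_values(rows, columns):
    """Diagnostic: cheap check that decides whether cleaning must run."""
    return any(row.get(col) is None for row in rows for col in columns)

def drop_missing_values(rows, columns):
    """Reusable helper: keep only rows with every required column present."""
    return [row for row in rows
            if all(row.get(col) is not None for col in columns)]

def transform(rows, columns=("user_id", "amount")):
    """Run the cleaning step only when the diagnostic finds a problem."""
    if has_missing_values(rows, columns):
        rows = drop_missing_values(rows, columns)
    return rows

raw = [
    {"user_id": 1, "amount": 10.0},
    {"user_id": 2, "amount": None},  # corrupted row
]
cleaned = transform(raw)
```

Because the helpers are standalone functions, other modules can reuse them over other datasets; scheduling or triggering `transform` is then just a matter of hooking it into your orchestrator.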
3. Automate data loading
Data-loading automations guarantee that your representation of data is compliant with the data model specifications (aka schema). Use SQL triggers and constraints to make sure that everything is as specified. SQL triggers and constraints operate on a ‘set and forget’ principle. For example, you could set a couple of “on insert” and “on update” triggers, which re-check the data for any errors.
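As a small illustration of the ‘set and forget’ principle, the sketch below uses SQLite (via Python’s stdlib) with a CHECK constraint and an on-insert trigger; the `sales` table, its columns, and the plausibility threshold are all hypothetical, and the trigger syntax varies between database engines.

```python
# Sketch of 'set and forget' integrity checks, using SQLite for
# illustration. Table/column names and thresholds are hypothetical;
# your warehouse's trigger syntax may differ.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (
    id INTEGER PRIMARY KEY,
    amount REAL NOT NULL CHECK (amount >= 0)  -- constraint: no negative sales
);
-- Trigger: re-check incoming data and abort on implausible values.
CREATE TRIGGER sales_on_insert BEFORE INSERT ON sales
WHEN NEW.amount > 1000000
BEGIN
    SELECT RAISE(ABORT, 'amount exceeds plausible range');
END;
""")

conn.execute("INSERT INTO sales (amount) VALUES (?)", (49.99,))  # accepted
try:
    conn.execute("INSERT INTO sales (amount) VALUES (?)", (-5,))  # rejected
except sqlite3.IntegrityError as err:
    rejected = str(err)
```

Once defined, the constraint and trigger enforce the schema on every insert and update with no further human involvement – exactly the ‘set and forget’ behavior described above.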
Automate machine-learning pipelines
Data science and machine learning have been revolutionizing businesses in recent years. From automated speech recognition to computer vision and countless novel solutions, the era of big data is reshaping the business landscape.
Automation is also a disruptive technology that allows us to tap into machine learning’s potential for growth. Unfortunately, not much has been written about automating machine-learning pipelines… until now.
Here are some ideas on how to make your machine-learning pipelines run smoother with automation:
- Automate labeling. Automated labeling of categorical data into a set of numerical identifiers is simple. Just use scikit-learn’s LabelEncoder to turn your target categories into integer labels (or OneHotEncoder if you need one-hot features). If you move this process into the Transform layer of your ETL pipeline, you can automate data preprocessing for classification tasks.
- Automate feature engineering. Feature engineering refers to the process of creating new data features to be used as input for machine-learning algorithms. Feature creation can be automated in the same way that we automate the ETL transformation step. Just create an aggregate or arithmetic composite feature out of your raw features during preprocessing and make that data available to your machine-learning notebooks. What’s more, you could also automate the feature selection aspect of feature engineering. This can be done by using established libraries like scikit-learn’s feature selection module.
- Automate model selection. We evaluate ML models against a set of objective metrics, such as accuracy, ROC AUC, F1 score, etc. Those same metrics can be used to automate model selection. We simply prepare a couple of potential models (e.g. Logistic Regression, Support Vector Machine, Naive Bayes Classifier, Random Forest Classifier, etc.), train and evaluate each model against an evaluation score (e.g. accuracy), and then pick the model that performs best. That model is shipped to production. With little work, we can automate the entire data pipeline for picking the best ML model.
- Auto-tune hyperparameters. Machine-learning models are only as good as their configurations. Finding the right set of hyperparameters can boost your model’s accuracy by orders of magnitude. You can automate hyperparameter tuning quite easily. Use GridSearchCV as an exhaustive search strategy to train the same model with different hyperparameters, then pick the best set of hyperparameters and ship them to production. This automation step follows the same logic as the previous one.
- Automate predictive analytics. If you followed the advice above, you have all of the necessary means to automate predictive analytics. Simply schedule or trigger a call to model.predict() on your fresh data with your newly built model and save the results to a new table. Combine this automated pipeline with reporting, and you’ll have automated everything from ETL to ML to BI.
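The model-selection and tuning steps above can be sketched with scikit-learn. The candidate models, the hyperparameter grids, and the use of the iris dataset are illustrative choices, not a prescription; the point is that the whole loop runs without a human comparing scores by hand.

```python
# Sketch of automated model selection + hyperparameter tuning with
# scikit-learn. Candidate models, grids, and the toy dataset are
# illustrative assumptions, not the only reasonable choices.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_iris(return_X_y=True)

# Model selection: evaluate every candidate on the same metric.
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(random_state=0),
}
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}
best_name = max(scores, key=scores.get)

# Hyperparameter tuning: exhaustive grid search over the winner.
grids = {
    "logreg": {"C": [0.1, 1.0, 10.0]},
    "forest": {"n_estimators": [50, 100]},
}
search = GridSearchCV(candidates[best_name], grids[best_name], cv=5)
search.fit(X, y)
best_model = search.best_estimator_  # the candidate you'd ship to production
```

From here, `best_model.predict()` on fresh data is the only remaining step, and it can be scheduled or triggered just like any other stage of the pipeline.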