The wave of the fourth industrial revolution (4IR) is sweeping through every terrain of life, work, and interpersonal relationships. Industries worldwide are poised to leverage the benefits of the technological advances the revolution brings. Utility industries, including petroleum, logistics, and medical, are generating unprecedented amounts of data. Almost every stage of their operational life cycles is fitted with sensors that monitor events, collect petabytes of data in the process, and transfer it to waiting data “lakes” and “oceans” before it is moved to “warehouses” for permanent archiving. These data are waiting for robust algorithms to consume them and use the patterns hidden in them to predict future events, the same events that would conventionally be obtained through laborious, time-consuming, and expensive measurement procedures or intensive experimentation.
To get the most value out of its data, the utilities industry has witnessed a wave of hiring of data scientists, many of whom know little or nothing about the scientific or engineering background of the problems they are employed to solve with their machine learning algorithms. While there have been some successful feats, this has also created problems in some cases.
The Emerging Problem
Many utility industries, including the petroleum, logistics, and medical industries, have benefited immensely from machine learning applications. A number of applications have been successfully deployed, many more are being developed, and still more are being conceived. The potential for machine learning applications at the industrial scale is limitless. In a bid to reduce or even eliminate the need for domain experts to learn the new field of data science, the petroleum industry has embarked on a massive recruitment of Data Scientists. This has resulted in a quagmire, and the need for synergy is greater than ever. The industry now faces the challenge of having domain experts who require enormous resources to be trained to build sophisticated machine learning algorithms, and a new wave of “green” Data Scientists who know close to nothing about the domain of the problems they are employed to solve.
There has long been a bifurcation in the application of machine learning in the industry. Domain experts are restricted to simple off-the-shelf algorithms that may not possess sufficient “energy” to handle the complex data we have today, while Data Scientists consume the data with their algorithms without sufficient knowledge of the story the data has to tell. The domain experts have the data and understand the physical basis of the problem. Their fundamental solutions, in the form of equations, are grounded in a proper understanding of the physics of the problems but laden with simplifying assumptions. However, they lack the skills to build the sophisticated algorithms that complex and voluminous data require. The Data Scientists think that all they need is the data, and that their sophisticated algorithms will do the required justice (some call it “magic”) to the problem. However, they lack knowledge of the fundamental principles behind the problems.
Before the wave of employment of Data Scientists in specialized industries, academic institutions were known for the research into and development of sophisticated machine learning algorithms. However, these algorithms often end up in conference proceedings and journals because they are developed without field and operational realities in focus. Industry subject matter experts will only trust a model developed in close collaboration with, and with sufficient contribution from, them.
I started my machine learning modeling career in an academic institution. With a pure Computer Science and Software Engineering background, coupled with near-zero knowledge of petroleum reservoir characterization, I began applying machine learning models to the prediction of petroleum reservoir properties. My first project was to predict reservoir permeability from wireline logs, using a public dataset. It would sound funny to tell anyone that I included the Caliper log among the input variables, not knowing that this log has no predictive value of its own and is used only to QC the other logs. In reality, the Caliper log has no relevance to the prediction of any reservoir property. I was educated on this reality during a conference presentation of the “successful” outcomes of the project. One can imagine how “successful” it was to have predicted reservoir permeability using a log that had nothing to do with the problem.
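The remedy for this first goof is simple once a domain expert points it out: keep the Caliper curve for QC, but drop it from the model's feature set before training. A minimal sketch, assuming a pandas workflow; the column names and values are hypothetical, not from the original project:

```python
import pandas as pd

# Hypothetical wireline-log dataset; column mnemonics are illustrative only.
logs = pd.DataFrame({
    "GR":   [45.0, 80.2, 60.1],    # gamma ray
    "RHOB": [2.45, 2.30, 2.55],    # bulk density
    "NPHI": [0.18, 0.25, 0.12],    # neutron porosity
    "CALI": [8.5, 12.1, 8.6],      # caliper: QC-only, no predictive value
    "PERM": [120.0, 340.0, 15.0],  # target: permeability
})

# Exclude the caliper log (and the target) from the input features.
features = logs.drop(columns=["CALI", "PERM"])
target = logs["PERM"]
```

The caliper column is still available in `logs` for flagging washed-out hole intervals; it simply never reaches the learning algorithm.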
In a related project and a similar situation, I had used “-999” values as input data points for reservoir porosity prediction. This value is actually an indicator of an unavailable measurement. There is no need to recount how I was publicly educated on the need to remove those spurious entries before applying my algorithms. Because of my limited knowledge of the problem domain, I had asked for QC’ed data, and the datasets I was given were claimed to have been QC’ed, but the QC was evidently incomplete.
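Guarding against such sentinel values takes one line of preprocessing: treat them as missing rather than as measurements. A minimal sketch, assuming pandas and the -999 convention described above; the column names and numbers are invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical porosity dataset; -999 marks an unavailable measurement.
raw = pd.DataFrame({
    "DEPTH": [1000.0, 1000.5, 1001.0, 1001.5],
    "NPHI":  [0.21, -999.0, 0.18, 0.25],
    "RHOB":  [2.40, 2.35, -999.0, 2.30],
})

# Replace the sentinel with NaN, then drop incomplete rows before modeling.
clean = raw.replace(-999.0, np.nan).dropna()
```

Depending on how sparse the gaps are, imputation may be preferable to dropping rows, but either way the sentinel must never be fed to the model as a real value.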
The major cause of these goofs was that I worked exclusively with the data, in complete isolation from any subject matter expert, while developing the solutions. I know many Data Scientists who have committed the same or similarly grave but preventable errors.
The Way Forward: Synergizing to Close the Gap
For machine learning to be applied successfully in any field, and especially in the petroleum industry, the best approach is for Data Scientists to work in very close collaboration with subject matter experts. Data Scientists do not have the luxury of time to study every field of application. This is all the more true in the petroleum industry, where a Data Scientist may be required to apply machine learning algorithms to diverse petroleum engineering problems. They are not expected to be experts in every subject they work on. Working closely with SMEs, they will have the opportunity to learn “in the process” and the resources to do things right. Machine learning models developed with such synergy will be more accurate and reliable, and the final outcome will be more acceptable to the domain experts. In essence, the data science effort will be complemented with the necessary dose of the fundamental physics of the problem.
To meet the objectives of their digital transformation agendas, the utility industries, especially petroleum (where I have my application experience), are leveraging the benefits offered by the various advanced technologies that make up the 4IR. Data is now massively generated by the sensors installed to continuously monitor digital oilfields, and this has given rise to the increased employment of Data Scientists in the petroleum industry. Petroleum domain experts, lacking the requisite skills to develop sophisticated algorithms for such complex data, are either stuck with their derived empirical correlations or restricted to simple off-the-shelf machine learning models. Data Scientists, on the other hand, lacking expertise in the domain of application, build their sophisticated models in complete isolation from domain experts. This has bred distrust among the domain experts. To derive maximum benefit from each other and deliver more reliable, sophisticated machine learning models, data scientists and domain experts need to work in close synergy.