Data Engineers are the data professionals who prepare the “big data” infrastructure to be analyzed by Data Scientists. They are software engineers who design, build, and manage big data systems and integrate data from various sources. They then write complex queries against that data and make sure it is easily accessible and works smoothly, with the goal of optimizing the performance of their company’s big data ecosystem.
They might also run ETL (Extract, Transform, and Load) jobs on top of big datasets and create big data warehouses that data scientists can use for reporting or analysis. Beyond that, because Data Engineers focus more on design and architecture, they are typically not expected to be experts in machine learning or big data analytics.
Here are some of the key skills expected of data engineers.
1. Tools and Components of Data Architecture
Since data engineers are much more concerned with analytics infrastructure, most of their required skills are, predictably, architecture-centric.
2. In-Depth Knowledge of SQL and Other Database Solutions
Data Engineers need to understand database management, so in-depth knowledge of SQL is hugely valuable. Likewise, other database solutions, such as Cassandra or Bigtable, are great to know if you plan on doing freelance or for-hire engineering, as not every database is built on the familiar SQL standard.
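As a rough illustration of the kind of query this skill covers, here is a minimal sketch using Python's built-in sqlite3 module. The `customers` and `orders` tables, their columns, and the `top_customers` helper are hypothetical names invented for this example; a production warehouse would use its own schema and database driver.

```python
import sqlite3


def top_customers(conn, limit=3):
    """Return (customer name, total spend) rows, highest spend first."""
    query = """
        SELECT c.name, SUM(o.amount) AS total_spend
        FROM customers AS c
        JOIN orders AS o ON o.customer_id = c.id
        GROUP BY c.id, c.name
        ORDER BY total_spend DESC
        LIMIT ?
    """
    return conn.execute(query, (limit,)).fetchall()


# Hypothetical sample data held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount REAL NOT NULL
    );
    INSERT INTO customers (id, name) VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
    INSERT INTO orders (customer_id, amount)
        VALUES (1, 120.0), (1, 30.0), (2, 200.0), (3, 15.5);
""")

rows = top_customers(conn, limit=2)
print(rows)
```

The join, grouping, and ordering clauses shown here are standard SQL, so the same query shape should carry over to most relational databases with little change.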
3. Data Warehousing and ETL Tools
Data warehousing and ETL experience is essential to this position. Experience with data warehousing solutions like Redshift or Panoply, as well as familiarity with ETL tools such as StitchData or Segment, is hugely valuable. Experience with data storage and retrieval is equally vital, as the amount of data being dealt with is enormous.
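To make the three ETL stages concrete, here is a small sketch using only the Python standard library. It does not use Redshift, Panoply, StitchData, or Segment; the `RAW_CSV` sample, the `users` table, and the `extract`, `transform`, and `load` functions are all assumptions made up for illustration, with SQLite standing in for a warehouse.

```python
import csv
import io
import sqlite3
from datetime import date

# Hypothetical raw export; a real pipeline would read from a file, API, or queue.
RAW_CSV = """user_id,signup_date,country
1,2021-03-01,us
2,2021-03-02,DE
,2021-03-03,fr
3,not-a-date,US
"""


def extract(text):
    """Extract: parse CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a bad id or date, normalise country codes."""
    clean = []
    for row in rows:
        try:
            user_id = int(row["user_id"])
            signup = date.fromisoformat(row["signup_date"])
        except ValueError:
            continue  # skip records that fail validation
        clean.append((user_id, signup.isoformat(), row["country"].strip().upper()))
    return clean


def load(conn, records):
    """Load: write the cleaned records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users "
        "(user_id INTEGER PRIMARY KEY, signup_date TEXT, country TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO users VALUES (?, ?, ?)", records)
    conn.commit()


warehouse = sqlite3.connect(":memory:")  # stand-in for a real warehouse
records = transform(extract(RAW_CSV))
load(warehouse, records)
print(warehouse.execute("SELECT * FROM users ORDER BY user_id").fetchall())
```

Keeping each stage as a separate function is one possible design choice: it lets the validation rules in `transform` be tested without touching a database.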
4. Hadoop-Based Analytics (HBase, Hive, MapReduce, etc.)
A strong understanding of Apache Hadoop-based analytics is a very common requirement in this space, and knowledge of HBase, Hive, and MapReduce is often expected.
5. Coding
Speaking of solutions, knowledge of coding is a definite plus here (and possibly a requirement for many positions). Familiarity with, if not outright expertise in, Python, C/C++, Java, Perl, Golang, or other such languages is very valuable.
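For readers new to MapReduce, the idea can be sketched in plain Python as a word count. This is not Hadoop's API and it runs in a single process; the function names `map_phase`, `shuffle_phase`, `reduce_phase`, and `word_count` are hypothetical and only mirror the three conceptual steps that a Hadoop cluster would distribute across machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over all documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


docs = ["big data big pipelines", "data warehouses store big data"]
counts = word_count(docs)
print(counts)
```

In a real cluster the map and reduce steps would run in parallel on different nodes, and the framework would handle the shuffle; the sketch only shows how the data flows between the stages.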
6. Machine Learning
While this is mainly the focus of the data scientist, some understanding of how to act upon the data is also invaluable for Data Engineers. For this reason, some knowledge of statistical analysis and the basics of data modelling is hugely valuable.
While machine learning is technically the domain of the Data Scientist, knowledge in this area helps you construct solutions that your colleagues can use. This knowledge has the added benefit of making you extremely marketable in this space, as being able to “put on both hats” makes you a formidable asset.
7. Various Operating Systems
Finally, intimate knowledge of UNIX, Linux, and Solaris is very helpful, as many math tools are based on these systems because they need root access to hardware and operating system functionality beyond what Microsoft’s Windows or macOS typically offers.