You think you’ve got the hang of big data analytics? There’s no time to be smug. To deliver real value, you’d better keep your stack up to date.
We’ve been on this big data adventure for a while. Not everything is still shiny and new anymore. In fact, some technologies may be holding you back. Remember, this is the fastest-moving area of enterprise tech — so much so that some software acts as a placeholder until better bits arrive.
Those upgrades, or outright replacements, can make the difference between a successful big data initiative and one you’ll be living down for the next few years. Here are some elements of the stack you should start to think about replacing:
1. MapReduce. MapReduce is slow. It’s rarely the best way to go about a problem. There are other execution models to choose from; the most common is the directed acyclic graph (DAG), of which MapReduce can be considered a special case. If you’ve written a bunch of custom MapReduce jobs, the performance gain from moving them to Spark is worth the cost and trouble of switching.
2. Storm. I’m not saying Spark will eat the streaming world, although it might, but if you need lower latency than Spark offers, technologies like Apex and Flink are better alternatives than Storm. Besides, you should probably evaluate your latency tolerance and whether the bugs in your lower-level, more complicated code are worth the few milliseconds you save. Storm doesn’t have the support that it could: Hortonworks is its only real backer, and with Hortonworks facing increasing market pressure, Storm is unlikely to get more attention.
3. Pig. Pig kind of blows. You can do anything it does with Spark or other technologies. At first Pig seems like a nice “PL/SQL for big data,” but you quickly find out it’s a little bizarre.
4. Java. No, not the JVM, but the language. The syntax is clunky for big data jobs. Plus, newer constructs like lambdas have been bolted onto the side in a somewhat awkward manner. The big data world has largely moved to Scala and Python (the latter when you can afford the performance hit and need Python libraries or are infested with Python developers). Of course, you can use R for stats, until you rewrite it in Python because R doesn’t have all the fun scale features.
5. Tez. This is another Hortonworks pet project. It’s a DAG implementation, but unlike Spark, Tez is described by one of its developers as like writing in “assembly language.” At the moment, with a Hortonworks distribution, you’ll end up using Tez behind Hive and other tools — but you can already use Spark as the engine in other distributions. Tez has always been kind of buggy anyhow. Again, this is one vendor’s project and doesn’t have the industry or community support of other technologies. It doesn’t have any runaway advantages over other solutions. This is an engine I’d look to consolidate out.
6. Oozie. I’ve long hated on Oozie. It isn’t much of a workflow engine or much of a scheduler — yet it’s both and neither at the same time! It is, however, a collection of bugs for a piece of software that shouldn’t be that hard to write. Between StreamSets, DAG implementations, and all, you should have ways to do most of what Oozie does.
7. Flume. Between StreamSets and Kafka and other solutions, you probably have an alternative to Flume. That May 20, 2015, release is looking a bit rusty. You can track the year-on-year activity level. Hearts and minds have left. It’s probably time to move on.
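To make the MapReduce-as-a-subset-of-DAG point from item 1 concrete, here is a minimal sketch in plain Python. Nothing in it is a Hadoop or Spark API; the names map_stage, shuffle_stage, reduce_stage, run_stages, word_count, and frequent_words are made up for illustration, and the "DAG" is only a linear chain of stages, which is the simplest case. The idea: classic MapReduce is a fixed map, shuffle, reduce chain, while a DAG engine lets you keep appending stages to the same job instead of launching a second one.

```python
from collections import defaultdict


def map_stage(records, mapper):
    """Apply mapper to every record; mapper yields (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle_stage(pairs):
    """Group values by key, as the shuffle between map and reduce would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()


def reduce_stage(groups, reducer):
    """Collapse each key's list of values into a single value."""
    for key, values in groups:
        yield key, reducer(values)


def run_stages(source, stages):
    """Run a linear chain of stages, feeding each output into the next."""
    data = source
    for stage in stages:
        data = stage(data)
    return data


def tokenize(line):
    """Hypothetical mapper: emit (word, 1) for each word in a line."""
    for word in line.split():
        yield word.lower(), 1


def word_count(lines):
    # Classic MapReduce shape: exactly three stages, then the job ends.
    stages = [
        lambda data: map_stage(data, tokenize),
        shuffle_stage,
        lambda groups: reduce_stage(groups, sum),
    ]
    return dict(run_stages(lines, stages))


def frequent_words(lines, min_count):
    # Same three stages, plus a filter and a sort chained onto the same job.
    # With plain MapReduce this would typically be a second job reading the
    # first job's output back from storage.
    stages = [
        lambda data: map_stage(data, tokenize),
        shuffle_stage,
        lambda groups: reduce_stage(groups, sum),
        lambda counts: (word for word, n in counts if n >= min_count),
        sorted,
    ]
    return run_stages(lines, stages)


if __name__ == "__main__":
    sample = ["big data big", "data stack"]
    print(word_count(sample))
    print(frequent_words(sample, 2))  # ['big', 'data']
```

A real engine adds branching, joins, and in-memory handoff between stages, which is where most of the speedup comes from; this toy only shows why the three-stage shape is the restrictive one.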
Maybe by 2018 …
What’s left? Some technology is showing its age, but complete viable alternatives have not arrived yet. Think ahead about replacing these:
1. Hive. This is overly snarky, but Hive is like the least performant distributed database on the planet. If we hadn’t as an industry decided RDBMSes were the greatest thing since sliced bread for like 40 years, then would we really have created this monster?
2. HDFS. Writing a system-level service in Java is not the greatest of ideas. Java’s memory management also makes pushing massive amounts of bytes around a bit slow. The way the HDFS NameNode works is not ideal for anything and constitutes a bottleneck. Various vendors have workarounds to make this better, but honestly, nicer things are available. There are other distributed filesystems. MapR-FS is a pretty well-designed one. There’s also Gluster and a slew of others.
Your gripes here
With an eye to the future, it’s time to cull the herd of technologies that looked promising but have grown either obsolete or rusty. This is my list. What else should I add?