A Brief History of Big Data Technology Development

Foreword

Before writing this article, I had intermittently written up the history of individual big data components and some commentary on them, but I was never satisfied. History should be continuous and follow its own internal logic, so I wanted to write one article that summarizes the history of big data technology, sorts out its context, tries to find that internal logic, and shares it with everyone.

In the domestic big data community, most developers focus on how to use a given technology; going a little deeper means analyzing its architecture. Almost no one cares how the technology came to be or what problems it was meant to solve. In a sense this is understandable: most engineers at most companies are there to get a particular job done, or more precisely to do it better, so learning a technology means learning how to use it and how it is structured. But I think that is far from enough, because a technology is only born when developers run into problems that push them to research and build it.

Moreover, a technology does not necessarily pick the right direction at the start; it earns its place in history only after rounds of market competition and elimination. After all, only a few technologies leave their names behind, and most are buried in the dust of history.

Enough digression; let us go back to where big data technology was born, which is also the "pastoral era" at the dawn of the computer.

Before Big Data

Looking back, the early days of computing must have been a wonderful era. Programmers were not yet called "code farmers"; they were still "scientists" standing at the frontier of technology. Just as many ancient Chinese philosophers longed for a pastoral golden age, programmers sometimes fantasize about going back to that time and hand-rolling an operating system or a database, or writing their own implementation of the TCP/IP protocol.

For an individual programmer, writing your own database, operating system, or programming language is naturally a wonderful thing, but for the development of technology as a whole it is a painful one.

Think about it: if every programmer is busy building his own database, operating system, or programming language, the resulting systems and components are basically identical in function. The core function of a database, for example, is to store and query data; however fancy it gets, a database is still there to store and query data.

Therefore, from the standpoint of overall technological progress, it is enough to build systems and components such as databases, operating systems, and programming languages once, and then let everyone do secondary development on top of these existing wheels.

But looking back now, without that group of people, the personal computer could hardly have developed at all.

 

As "A Brief History of Computers" said: the development of personal computers is "the achievement of geeks--young amateur technical enthusiasts, with their persistent pursuit and technical ingenuity, have accomplished what the so-called experts think is impossible."

 

We should not overlook the amazing talent of those programmers just because today's programmers have become "code farmers".

"Pastoral" Era

Starting the story of big data from Turing or von Neumann would reach back a bit too far; the title might as well be changed to "A Brief History of Computer Development". So we will trace the history of big data from the development of the personal computer.

When the personal computer was first born, it was nothing but a bare machine; if you wanted it to do anything, you had to program it yourself. Take the first personal computer, the Altair 8800.

At that time, nine out of ten people who bought a personal computer for home were technology enthusiasts. After all, there was no programming language, no database, no operating system, only switches for entering binary code.

Every programmer of that era had to start from scratch, face the hardware at its lowest level, and use the most primitive means to realize his little hobby projects.

Bill Gates spent six weeks writing a BASIC interpreter, Eric Schmidt tinkered with Lex, Linus Torvalds put together Linux, a small operating system, in his sophomore year, and the little database dBase likewise began as a single programmer's project.

For a programmer, writing your own programming language, operating system, or database brings a great sense of accomplishment. But for the personal computer to grow into something every ordinary consumer can use, there must be widely accepted components that turn low-level, complex, abstract binary coding into something simple to use.

Of course, driven by commercial interests and shaped by market competition, an ecosystem of programming languages, operating systems, and databases was eventually built, and programmers no longer needed to write everything from scratch.

In fact, this is normal. One trend in the development of computing is abstraction layered upon abstraction, handing specialized work to specialized tools. The compiler presents a programming language to the user, so the user no longer cares about assembly or binary encoding; the operating system presents system interfaces, so the user no longer faces low-level drivers and a complex hardware environment; the database presents the SQL language, so the user no longer cares about the concrete logic of data processing.

The birth and popularization of programming languages, operating systems, and databases marked the end of the "pastoral" era.

Coming back to this article: for the development of big data technology, all three are indispensable, but the most important is the database. After all, big data technology was born out of the database, grew up in distributed systems, and finally returned to the database.

Database Era

To talk about the real origin of big data, we must mention the database.

Whether it is the mobile Internet, the PC Internet, or the computer itself, behind each of them are programs written by groups of programmers, and in the final analysis every program is about processing data. If data processing were a kingdom, its king would be the database.

So what is a database?

In the simplest terms, a user stores data in the database and, when it is needed, tells the database which data he wants; the database carries out the actual data processing and returns the result. The database shields programmers from the complex underlying processing, so they only need to think about storing and querying data.

To give a simple example: before databases existed, a programmer who wanted to process data first had to face the operating system's file system directly. Different operating systems expose different file system APIs, so even setting aside the file system's own complexity, the programmer had to write code targeted at each one. On top of that came data structures for storage, index structures, file formats, and even concurrency, transactions, and failure recovery, as sketched below.
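What that hand-rolled approach looked like is easiest to see in a tiny sketch. The following Python fragment is purely illustrative (the file name, record layout, and lookup key are all invented for the example); it stands in for the kind of code an application programmer had to write and maintain himself before databases existed:

    # A minimal sketch of pre-database data handling: one hand-rolled lookup
    # over a comma-separated file. The format is self-imposed, there is no
    # index, no concurrency control, and no crash recovery -- all of that is
    # left to the application programmer.
    def find_employee(path, employee_id):
        with open(path, encoding="utf-8") as f:
            for line in f:                 # no index: every lookup is a full scan
                fields = line.rstrip("\n").split(",")
                if fields and fields[0] == employee_id:
                    return fields
        return None

    # Hypothetical usage: find_employee("employees.csv", "1042")

And this is only the easy part; indexes, file formats, concurrent access, and recovery after a crash all had to be handled by hand as well.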

This is very difficult. Thus, the database was born.

Databases went the way of most commercial software: the pioneering "front wave", the first to eat the crab, eventually died on the beach. At the time, the relational model was considered "unreliable"; the most popular models were the hierarchical model and the network model (interested readers can look them up online).

After fierce market competition and engineering practice, everyone eventually found that the relational model suits most environments, and relational databases established a near-monopoly on the market.

Once a piece of foundational software establishes a monopoly and builds a huge ecosystem around itself, even a latecomer with a comparable product will be crushed by that ecosystem, unless the latecomer opens up a different track, or the incumbent runs into a class of problems it simply cannot solve.

The same is true for databases: Oracle, DB2, and a few others still hold most of the market in traditional industries such as finance and telecommunications, which illustrates the point.

Back to the database itself: the first real relational database was arguably System R from IBM Research. Things that come out of research labs tend to be "high-end and niche", with little regard for the experience of ordinary users. The query language used by the early System R prototype, for example, reads more like a mathematician's plaything than an engineer's handiwork.

At any rate, I cannot make sense of it, and readers looking at an early System R query would probably be just as confused: what is this all about?

Still, it is undeniable that this was indeed the first relational database, and a great improvement over what came before: as long as you learned the query language's syntax, you could process data. Without a database there were far more complicated situations to handle, which was much harder than any mathematical-looking query language.

System R is like a "front wave", helping everyone to lay the road flat. Proves that the relational model is superior to other database models and is commercially superior. So everyone rushed forward, and Oracle, DB2, etc. appeared on the stage, each leading the way.

Around this time a great language was born: SQL, short for Structured Query Language. To some extent, SQL is the "only" unified language in the data field. Just as in China, where every region has its own dialect but Mandarin is the common tongue everyone understands, so it is with SQL.

Here, we would like to thank the two inventors of SQL, Donald Chamberlin and Raymond Boyce.

The theoretical foundation of SQL is relational algebra. This is not a technical primer, so interested readers can explore the theory of relational algebra on their own.

A very important feature that distinguishes SQL from other languages is that it is a "declarative" language.

Put simply, "declarative" means the user only tells the computer what result he wants, and the computer works out the rest. Its counterpart is the imperative language, in which the user spells out, step by step, how the computer should achieve that result, and the computer follows the steps.

Most programming languages, such as Java and Python, are imperative. Compared with declarative languages, imperative languages have a steeper learning curve, which makes it harder to grow the user base, so people will presumably prefer declarative languages (the sketch after the quotation below makes the contrast concrete).

As the argument went at the time: "much of the success of the computer industry depends on developing a class of users other than trained computer specialists."
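To make the contrast concrete, here is a small sketch in Python (the "orders" table and its columns are invented for the example, and the built-in sqlite3 module simply stands in for "a database"): the imperative version spells out how to compute a total step by step, while the declarative version only states what is wanted and leaves the how to the engine.

    # Imperative vs. declarative: a minimal, illustrative sketch.
    import sqlite3

    rows = [("Beijing", 10.0), ("Shanghai", 25.5), ("Beijing", 7.5)]

    # Imperative: tell the computer HOW, step by step.
    imperative_total = 0.0
    for city, amount in rows:
        if city == "Beijing":
            imperative_total += amount

    # Declarative: tell the database WHAT you want; it decides how.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (city TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    declarative_total = conn.execute(
        "SELECT SUM(amount) FROM orders WHERE city = 'Beijing'").fetchone()[0]

    assert imperative_total == declarative_total == 17.5

Neither style is "right" in the abstract, but it is clear which of the two an accountant or an analyst could pick up faster.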

In fact, behind declarative and imperative languages there is, to some extent, a contest between two philosophies: the Unix philosophy and the database philosophy.

The Unix philosophy is committed to giving users simple tools and leaving the rest of the work to them. The database philosophy is different: the user only needs to say what he wants, and the database takes care of how it gets done.

The Unix philosophy trusts the user's ability; the database philosophy does not. Which is better is left to the reader's judgment.

Having proven itself, the database, armed with SQL, ushered in its golden era as foundational software.

In those days, no matter what kind of company you were, as long as you used information technology to provide business services you could not do without a database, because at bottom every application processes data.

At the height of the database's glory, a group of young people in California built the Internet. No one imagined then that the seemingly indestructible database would be shaken by this little-known newcomer.

A new era is coming!

Hadoop Era

Unveiling The Era Of Big Data

In fact, the term "big data" appeared very early. Alvin Toffler mentioned it in The Third Wave, published in 1980, describing big data as the cadenza of the third wave.

Of course, big data at that time was like a baby in its infancy, and there was still a long way to go before it could change the world.

Big data did not really become what we now call "big data" until technology gained the ability to process data at scale.

Let us return to the history. The previous section ended with the arrival of the Internet era.

Whether it is the PC Internet, the mobile Internet, or the Internet of Things to come, the Internet's biggest challenge to the database is the sheer volume of data.

Data kept growing until a single machine could no longer store and process it; however powerful that single machine, the flood of data would always swallow whatever capacity it had left.

To some extent, before big data technology was born, the ever-growing data was just garbage. Yes, data you cannot process is a pile of garbage, because high-end servers are not cheap.

The first company to solve this problem was Google. Unfortunately, Google did not open-source its technology at the time; it only published three technical papers, later dubbed the "troika" of big data: the Google File System (GFS), MapReduce, and BigTable.

  • GFS solved the problem of storing data at massive scale, letting data grow almost without limit;
  • MapReduce solved the problem of computing over massive data, making large-scale processing feasible;
  • BigTable solved the problem of online, real-time queries, so that data could be looked up quickly even at enormous volumes.

Of the three papers, the most fundamental, and arguably the most valuable, is GFS: big data can only be computed on and queried once it is stored reliably. GFS's open-source implementation, HDFS, remains the de facto storage standard of the big data field to this day.

While GFS may be the most important, MapReduce was the loudest and most controversial of the three. David J. DeWitt and Michael Stonebraker, two heavyweights of the database field, wrote the famous essay "MapReduce: A major step backwards", accusing MapReduce of throwing away the theoretical foundations the database field had accumulated through hard work in favor of a simple, brute-force way of processing data. There is plenty of analysis and commentary about that essay online, so this article will not go into detail; readers can judge for themselves.

At this point everything was still just papers; no product had landed, until Hadoop appeared.

The Birth of Hadoop

As mentioned earlier, Google only published the papers and released no open-source implementation. Looking back, one can only say that Google raised the curtain on the big data era without enjoying the huge dividends that era brought.

It's a pity.

The first company outside Google to implement the "troika", the first to eat the crab, was Yahoo.

Yahoo was following Google into the search business at the time and faced the same problem: the data on the Internet was enormous, and how to handle it had become a real headache.

So Yahoo built Hadoop after the blueprint of Google's papers, then tested and hardened it on real production workloads. You have to admit Yahoo was courageous: if Hadoop broke, Yahoo lost real money.

Most admirable of all, Yahoo then played the "living Lei Feng", the selfless do-gooder, and open-sourced Hadoop.

When a company open-sources its software, especially when a large company open-sources an internal core product, it is in a sense using its technical lead to seize market share and thereby win the right to set standards. Only by holding that standard-setting power can a company extract the maximum benefit from the product and its market.

Yahoo open-sourced Hadoop, however, without sustained maintenance and control, so Hadoop ended up belonging to the Apache Foundation rather than to Yahoo itself.

Before Hadoop, many companies envied Google's ability to process the data of the entire Internet and reap huge profits from it. They understood the business model well enough, but lacking the dragon-slaying blade, they could only look on helplessly.

Once Yahoo open-sourced Hadoop, major companies rushed to deploy it internally as if they had found treasure.

As mentioned in the previous section, when Hadoop was first open-sourced it drew plenty of criticism from database experts. In all fairness, the Hadoop of that time was nowhere near as easy to install, deploy, and use as today's versions.

The newly open-sourced Hadoop was not only troublesome to install and deploy but also hard to use: even if a user just wanted to count the lines in a file, he had to write a pile of Map and Reduce functions and boilerplate to do it, as the sketch below shows.
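Just how much ceremony that involved is easiest to show with a sketch. Below is a minimal line-counting job in the style of Hadoop Streaming, where the mapper and reducer are plain scripts reading standard input; the file names are invented for the example, and a real job also needed packaging, configuration, and a cluster to run on.

    # mapper.py -- illustrative sketch: emit one ("lines", 1) pair per input line.
    import sys

    for _ in sys.stdin:
        print("lines\t1")

    # reducer.py -- illustrative sketch: receive the pairs grouped by key and sum them.
    import sys

    total = 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        total += int(value)
    print("lines\t%d" % total)

Two scripts, a job submission, and a batch wait, all to answer a question that a database settles with a single SELECT COUNT(*).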

But none of that mattered much. What mattered was that Hadoop solved a problem so important that user experience became a minor concern.

That problem was how to perform stable, reliable computation on a pile of cheap machines.

With Hadoop, enterprises no longer depended on expensive high-end hardware; cheap commodity machines could do what high-end machines had done before, and in some respects do it better. The data an enterprise accumulated was no longer "garbage" but a gold mine, from which value could be dug out continuously.

Even though Hadoop was criticized as inelegant, there was no competing product on the market. And while few companies had Google's engineering prowess, as the saying goes, "three cobblers together beat a Zhuge Liang".

The Hadoop ecosystem was off and running.

The Success of Hadoop

In a way, Hadoop became the de facto standard of the big data field. Even Google, when building Google Cloud, had to hold its nose and stay compatible with Hadoop interfaces.

Looking back now, why did Hadoop succeed in the end?

First, the timing was right. When Hadoop appeared, enterprises were sitting on huge amounts of data they could not handle, and Hadoop solved that in a simple, even crude, way.

Second, open source. Because Hadoop is open source, companies using it need not worry about being held hostage by another vendor; and with the help of the open-source community, Hadoop grew from a rough toy into a commercially usable product.

 
