This is the second article in the five-part series on the History of Artificial Intelligence. The first part can be accessed here.
Every decade seems to have its technological buzzwords: we had personal computers in the 1980s; the Internet and the World Wide Web in the 1990s; smartphones and social media in the 2000s; and Artificial Intelligence (AI) and Machine Learning in this decade. However, the field of AI is 67 years old, and this is the second in a series of five articles wherein:
- The first article discusses the genesis of AI and the first hype cycle between 1950 and 1982
- This article discusses a resurgence of AI and its achievements during 1983-2010
- The third article discusses the domains in which AI systems are rivaling humans
- The fourth article discusses the current hype cycle in Artificial Intelligence
- The fifth article discusses what 2018-2035 may portend for brains, minds, and machines
The resurgence of Artificial Intelligence
The 1950-82 era saw the new field of Artificial Intelligence (AI) being born, a lot of pioneering research being done, massive hype being created, and AI going into hibernation when this hype did not materialize and the research funding dried up. Between 1983 and 2010, research funding ebbed and flowed, and research in AI continued to gather steam, although “some computer scientists and software engineers would avoid the term artificial intelligence for fear of being viewed as wild-eyed dreamers”.
During the 1980s and 90s, researchers realized that many AI solutions could be improved by using techniques from mathematics and economics such as game theory, stochastic modeling, classical numerical methods, operations research and optimization. Better mathematical descriptions were developed for deep neural networks as well as evolutionary and genetic algorithms, which matured during this period. All of this led to new sub-domains and commercial products in AI being created.
In this article, we first briefly discuss supervised learning, unsupervised learning and reinforcement learning, as well as shallow and deep neural networks, which became quite popular during this period. Next, we discuss the following reasons that helped AI research and development gain steam: hardware and network connectivity became cheaper and faster; parallel and distributed computing became practical; and lots of data (“Big Data”) became available for training AI systems. Finally, we discuss a few AI applications that were commercialized during this era.
Machine Learning Techniques Improve Substantially
Supervised Machine Learning: These techniques need to be trained by humans using labeled data. Suppose we are given several thousand pictures of the faces of dogs and cats and we would like to partition them into two groups, one containing dogs and the other cats. Rather than doing it manually, a machine learning expert writes a computer program that includes the attributes that differentiate dog-faces from cat-faces (e.g., length of whiskers, droopy ears, angular faces, round eyes). After enough attributes have been included and the program has been checked for accuracy, the first picture is given to this “black box” program. If its output is not the same as that provided by a “human trainer” (who may be training in person or may have provided a pre-labeled picture), the program modifies some of its internal parameters so that its answer becomes the same as that of the trainer (or the pre-labeled picture). After going through several thousand such pictures and modifying itself accordingly, this black box learns to differentiate the faces of dogs from those of cats. By 2010, researchers had developed many algorithms that could be used inside the black box, most of which are mentioned in the Appendix. Today, applications that commonly use these techniques include object recognition, speaker recognition, and speech-to-text conversion.
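The training loop described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a simple perceptron-style update rule as the "black box"; the two attributes (whisker length and ear droopiness), the toy numbers, and the function names are all made up for this example and do not come from any real dataset or library.

```python
def predict(weights, bias, features):
    """Return 1 ("dog") if the weighted sum of attributes is positive, else 0 ("cat")."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0


def train(examples, epochs=20, learning_rate=0.1):
    """Adjust the weights whenever the program disagrees with the trainer's label."""
    n_attributes = len(examples[0][0])
    weights, bias = [0.0] * n_attributes, 0.0
    for _ in range(epochs):
        for features, label in examples:
            error = label - predict(weights, bias, features)
            if error != 0:  # the answer differs from the label, so modify the parameters
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, features)]
                bias += learning_rate * error
    return weights, bias


# Hypothetical labeled data: (whisker length in cm, ear droopiness from 0 to 1) -> label
examples = [
    ((2.0, 0.9), 1), ((2.5, 0.8), 1), ((3.0, 0.7), 1),   # labeled "dog"
    ((7.0, 0.1), 0), ((6.5, 0.2), 0), ((8.0, 0.0), 0),   # labeled "cat"
]
weights, bias = train(examples)
```

After training, calling predict(weights, bias, features) on a new attribute pair returns the learned label. Real systems use far more attributes and far more pictures, but the idea of correcting the program only when it disagrees with the label is the same.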
Unsupervised learning algorithms: These techniques do not require any pre-labeled data; they try to determine hidden structure from “unlabeled” data. One important use case of unsupervised learning is computing the hidden probability distribution with respect to the key attributes and explaining them, e.g., understanding the data by using its attributes and then clustering and partitioning it into “similar” groups. There are several techniques in unsupervised learning, most of which are mentioned in the Appendix. Since the data points given to these algorithms are unlabeled, their accuracy is usually hard to define. Applications that use unsupervised learning include recommender systems (e.g., if a person bought x, will that person buy y), creating cohorts for marketing purposes (e.g., clustering by gender, spending habits, education, zip code), and creating cohorts of patients for improving disease management. Since k-means is one of the most common techniques, it is briefly described below:
Suppose we are given a lot of data points, each having n attributes (which can be labelled as n coordinates), and we want to partition them into k groups. Since each point has n coordinates, we can imagine these data points as being in an n-dimensional space. To begin with, the algorithm partitions these data points arbitrarily into k groups. Now, for each group the algorithm computes its centroid, which is an imaginary point with each of its coordinates being the average of the same coordinates of all the points in that group, i.e., this imaginary point’s first coordinate is the average of all first coordinates of the points in this group, the second coordinate is the average of all second coordinates, and so on. Next, for each data point, it finds the centroid that is closest to that point and thereby achieves a new partition of these data points into k new groups. The algorithm again finds the centroids of these groups and repeats these steps until it either converges or has gone through a specified number of iterations. An example in a two-dimensional space with k=2 is shown in the picture below:
Another technique, hierarchical clustering, creates hierarchical groups, which at the top level would have “supergroups”, each containing sub-groups, which in turn would contain sub-subgroups, and so on. K-means clustering is often used for creating hierarchical groups as well.
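The k-means procedure described above (arbitrary initial partition, centroid computation, reassignment to the nearest centroid, repeat) can be sketched as follows. This is a minimal pure-Python sketch, not a production implementation; the function names, the round-robin way of forming the initial partition, and the rule of keeping the old centroid when a group becomes empty are assumptions made for this illustration.

```python
import random


def centroid(points):
    """Average each coordinate over all points in the group."""
    n = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(n))


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Put every point into the group of its closest centroid."""
    groups = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda j: squared_distance(p, centroids[j]))
        groups[nearest].append(p)
    return groups


def k_means(points, k, max_iterations=100, seed=0):
    """Assumes len(points) >= k so that every initial group is non-empty."""
    shuffled = points[:]
    random.Random(seed).shuffle(shuffled)
    groups = [shuffled[i::k] for i in range(k)]      # arbitrary initial partition
    centroids = [centroid(g) for g in groups]
    for _ in range(max_iterations):
        new_groups = assign(points, centroids)
        if new_groups == groups:                     # converged: partition unchanged
            break
        groups = new_groups
        # An empty group keeps its previous centroid (an assumption of this sketch).
        centroids = [centroid(g) if g else c for g, c in zip(groups, centroids)]
    return centroids, groups


# Six made-up points in a two-dimensional space, partitioned with k=2.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, groups = k_means(points, k=2)
```

Because no labels are involved, the only stopping criteria are the ones named in the text: the partition stops changing, or the iteration limit is reached.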
Reinforcement Learning: Reinforcement Learning (RL) algorithms learn from the consequences of their actions, rather than from being taught by humans or by using pre-labeled data; this is analogous to Pavlov’s conditioning, in which Pavlov noticed that his dogs would begin to salivate whenever he entered the room, even when he was not bringing them food. The rules that such algorithms should obey are given upfront, and they select their actions on the basis of their past experiences and by considering new choices. Hence, they learn by trial and error in a simulated environment. At the end of each “learning session,” the RL algorithm gives itself a “score” that characterizes its level of success or failure, and over time, the algorithm tries to perform those actions that maximize this score. Although IBM’s Deep Blue, which won the chess match against Kasparov, did not use Reinforcement Learning, we describe, as an example, a potential RL algorithm for playing chess:
As input, the RL algorithm is given the rules of chess, e.g., the 8×8 board, the initial location of the pieces, what each chess piece can do in one step, a score of zero if the player’s king is checkmated, a score of one if the opponent’s king is checkmated, and a score of 0.5 if only two kings are left on the board. In this embodiment, the RL algorithm creates two identical solutions, A and B, which start playing chess against each other. After each game is over, the RL algorithm assigns the appropriate scores to A and B and also keeps a complete history of the moves and countermoves made by A and B, which can be used to train A and B (individually) to play better. After playing several thousand such games in the first round, the RL algorithm uses this “self-generated” labelled data (the outcome of 0, 0.5, or 1 for each game together with all the moves played in that game) and, by using learning techniques, determines the patterns of moves that led A (and similarly B) to a poor score. For the next round, it refines the solutions for A and for B so that each of them avoids such “poor moves,” thereby improving them for the second round, and then for the third round, and so on, until the improvements from one round to the next become minuscule, at which point A and B end up being reasonably well-trained solutions.
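The round-based self-play scheme above can be illustrated on a much smaller game than chess. The sketch below is an assumption-laden toy, not the algorithm of any real chess engine: it swaps chess for a simple stone-taking game (players alternately remove 1 to 3 stones, and whoever takes the last stone wins, so there are no draws and no 0.5 score), and it uses a plain table of average scores per (position, move) as the "learning technique". All names and parameter values are hypothetical.

```python
import random


def legal_moves(stones):
    return [m for m in (1, 2, 3) if m <= stones]


def choose_move(values, stones, explore, rng):
    """Mostly pick the move with the best average score so far; sometimes try a new choice."""
    moves = legal_moves(stones)
    if rng.random() < explore:
        return rng.choice(moves)
    return max(moves, key=lambda m: values.get((stones, m), 0.5))


def play_one_game(values, start, explore, rng):
    """Players A (0) and B (1) share the same value table and play against each other."""
    history = {0: [], 1: []}          # every (position, move) made by each player
    stones, player = start, 0
    while stones > 0:
        move = choose_move(values, stones, explore, rng)
        history[player].append((stones, move))
        stones -= move
        player = 1 - player
    winner = 1 - player               # the player who made the last move took the last stone
    return history, winner


def train(start=10, rounds=30, games_per_round=500, explore=0.2, seed=0):
    rng = random.Random(seed)
    values, totals, counts = {}, {}, {}
    for _ in range(rounds):
        round_data = []               # self-generated labelled data for this round
        for _ in range(games_per_round):
            history, winner = play_one_game(values, start, explore, rng)
            for player, moves in history.items():
                score = 1.0 if player == winner else 0.0
                round_data.extend((key, score) for key in moves)
        for key, score in round_data:  # refine the solution only after the round ends
            totals[key] = totals.get(key, 0.0) + score
            counts[key] = counts.get(key, 0) + 1
        values = {key: totals[key] / counts[key] for key in totals}
    return values


def best_move(values, stones):
    return max(legal_moves(stones), key=lambda m: values.get((stones, m), 0.5))


values = train()
```

After training, best_move(values, stones) returns the move with the highest average score for a given position. Moves that led to poor scores in earlier rounds end up with low averages and are chosen less often in later rounds, which is the refinement step the text describes.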
In 1951, Minsky and Edmonds built the first neural network machine, SNARC (Stochastic Neural Analogy Reinforcement Computer); it successfully modeled the behavior of a rat in a maze searching for food, and as it made its way through the maze, the strength of some synaptic connections would increase, thereby reinforcing the underlying behavior, which seemed to mimic the functioning of living neurons. In general, Reinforcement Learning algorithms perform well when solving optimization problems, in game-theoretic situations (e.g., in playing Backgammon or Go), and in problems where the business rules are well defined (e.g., autonomous car driving), since they can self-learn by playing against humans or against each other.
Mixed learning: Mixed learning techniques use a combination of two or more of supervised, unsupervised and reinforcement learning techniques. Semi-supervised learning, one such combination, is particularly useful in cases where it is expensive or time-consuming to label a large dataset; for example, when differentiating dog-faces from cat-faces, the database may contain some images that are labeled while most of them are not. Broad uses of mixed learning include classification, pattern recognition, anomaly detection, and clustering/grouping.
The resurgence of Neural Networks – Both Shallow and Deep
As discussed in the previous article, a one-layer perceptron network consists of an input layer connected to one hidden layer of perceptrons, which is in turn connected to an output layer of perceptrons. A signal coming via a connection is recalibrated by the “weight” of that connection, and this weight is assigned to the connection during the “learning process”. Like a human neuron, a perceptron “fires” if all the incoming signals together exceed a specified potential, but unlike in humans, in most such networks signals only move from one layer to the layer in front of it. The term Artificial Neural Networks (ANNs) is used for networks of perceptrons and other “neurons” of the same ilk; Igor Aizenberg and colleagues brought the term Deep Learning to such networks in 2000, in the context of Boolean threshold neurons. An example of the one-layer network is given below:
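In code, the way signals flow through such a network (weighted by each connection, then compared against a firing threshold, layer by layer) can be sketched as follows. This is only an illustrative sketch: the weights and thresholds are hand-picked assumptions rather than learned values, chosen so that a network with two inputs, two hidden perceptrons, and one output perceptron computes the exclusive-or of its inputs.

```python
def fires(inputs, weights, threshold):
    """A perceptron fires (outputs 1) if its weighted incoming signals exceed the threshold."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else 0


def layer(inputs, weight_rows, thresholds):
    """Compute the outputs of all perceptrons in one layer from the previous layer's signals."""
    return [fires(inputs, weights, t) for weights, t in zip(weight_rows, thresholds)]


def network(inputs):
    # Hidden layer (hand-set weights): the first unit acts as OR, the second as AND.
    hidden = layer(inputs, weight_rows=[[1, 1], [1, 1]], thresholds=[0.5, 1.5])
    # Output layer: fires when OR is on and AND is off, i.e. exclusive-or.
    return layer(hidden, weight_rows=[[1, -1]], thresholds=[0.5])[0]


outputs = {pair: network(list(pair)) for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]}
```

Signals only move forward here, from the input layer to the hidden layer to the output layer. In a trained network the weights would be set by the learning process rather than by hand.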
Although multi-layer perceptrons were invented in 1965 and an algorithm for training an 8-layer network was provided in 1971 [18, 19, 20], the term Deep Learning was introduced by Rina Dechter in 1986. For our purposes, a deep learning network has more than one hidden layer. The example given below shows a deep neural network that has ten layers. Given below are important deep learning networks that were developed between 1975 and 2006 and are frequently used today; however, their description is beyond the scope of this article:
- In 1979, Fukushima provided the first “convolutional neural network” (CNN) when he developed the Neocognitron, in which he used a hierarchical, multilayered design. CNNs are widely used for image processing, speech-to-text conversion, document processing, and bioactivity prediction in structure-based drug discovery.
- In 1982, Hopfield popularized Recurrent Neural Networks (RNNs), which were originally introduced by Little in 1974 [51,52]. RNNs are analogous to Rosenblatt’s perceptron networks except that they are not feed-forward: they allow connections to go towards both the input and output layers, which allows RNNs to exhibit temporal behavior. Unlike feedforward neural networks, RNNs use their internal memory to process arbitrary sequences of incoming data. RNNs have since been used for speech-to-text conversion, natural language processing, and early detection of heart failure onset.
- In 1997, Hochreiter and Schmidhuber developed a specific kind of deep learning recurrent neural network, called LSTM (long short-term memory). LSTMs mitigate some problems that occur while training RNNs, and they are well suited for predictions related to time series. Applications of such networks include those in robotics, time series prediction, speech recognition, grammar learning, handwriting recognition, protein homology detection, and prediction in medical care pathways.
- In 2006, Hinton, Osindero and Teh invented Deep Belief Networks and showed that in many situations, multi-layer feedforward neural networks could be pre-trained one layer at a time by treating each layer as an unsupervised machine and then fine-tuning it using supervised backpropagation. Applications of such networks include those in image recognition, handwriting recognition, and identifying the onset of diseases such as liver cancer and schizophrenia [100, 109].
Parallel and Distributed Computing Improve AI Capabilities
Between 1982 and 2010, hardware became much cheaper and more than 500,000 times faster; however, for many problems, one computer was still not enough to execute many machine learning algorithms in a reasonable amount of time. At a theoretical level, computer science research during 1950-2000 had shown that such problems could be solved much faster by using many computers simultaneously and in a distributed manner. However, the following fundamental problems related to distributed computing remained unresolved until 2003: (a) how to parallelize computation, (b) how to distribute data “equitably” among computers and do automatic load balancing, and (c) how to handle computer failures and interrupt computers if they go into infinite loops. In 2003, Google published its Google File System paper and followed it up in 2004 by publishing MapReduce, a framework and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. Since MapReduce was proprietary to Google, in 2006 Cutting and Cafarella (from the University of Washington but working at Yahoo) created a free, open-source version of this framework called Hadoop. In 2012, Spark and its resilient distributed datasets were introduced, which reduced the latency of many applications when compared to MapReduce and Hadoop implementations. Today a Hadoop-Spark based infrastructure can handle 100,000 or more computers and several hundred million Gigabytes of storage.
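The map-and-reduce programming model mentioned above can be illustrated with the classic word-count example. The sketch below runs in a single Python process and only mimics the three conceptual phases (map, shuffle, reduce); it does not use Google's MapReduce, Hadoop, or Spark, and the function names are hypothetical. In a real cluster, each call to map_phase and reduce_phase could run on a different machine, with the framework handling data distribution, load balancing, and failures.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    mapped = []
    for document in documents:          # independent per document, hence parallelizable
        mapped.extend(map_phase(document))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data helps AI", "AI needs data"])
```

The point of the model is that the map calls are independent of one another, and so are the reduce calls, which is what makes it practical to spread the work over many computers.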
Big Data begins to help AI systems
In 1998, John Mashey (at Silicon Graphics) seemingly first coined the term “Big Data,” which referred to the large volume, variety and velocity at which data is being generated and communicated. Since most learning techniques require lots of data (especially labelled data), the data stored in organizations’ repositories and on the World Wide Web became vital for AI. During the 2000s, social media websites such as Facebook, Twitter, Pinterest, Yelp, and YouTube, as well as weblogs and a plethora of electronic devices, started generating Big Data, which set the stage for creating several “open databases” with labeled and unlabeled data (for researchers to experiment with) [72,73]. By 2010, humans had already created almost a trillion Gigabytes (i.e., one zettabyte) of data, most of which was either structured (e.g., spreadsheets, relational databases) or unstructured (e.g., text, images, audio and video files).
Progress in Various Fields of Artificial Intelligence and their Commercial Applications
Reinforcement Learning Algorithms play Backgammon: In 1992, IBM’s Gerald Tesauro built TD-Gammon, which was a reinforcement learning program to play backgammon; its level was slightly below that of the top human backgammon players at that time.
Machines beat humans in Chess: Alan Turing was the first to design a computer chess program in 1953, although he “ran the program by flipping through the pages of the algorithm and carrying out its instructions on a chessboard”. In 1989, the chess-playing programs HiTech and Deep Thought, developed at Carnegie Mellon University, defeated a few chess masters. In 1997, IBM’s Deep Blue became the first computer chess-playing system to beat the reigning world champion, Garry Kasparov. Deep Blue’s success was essentially due to considerably better engineering and to processing 200 million positions per second.
Robotics: In 1994, Adler and his colleagues at Stanford University invented Cyberknife, a stereotactic radiosurgery-performing robot that could treat tumors; it is almost as accurate as human doctors, and during the last 20 years, it has treated over 100,000 patients. In 1997, NASA built Sojourner, a small robot that could perform semi-autonomous operations on the surface of Mars.
Better Chat-bots: In 1995, Wallace created A.L.I.C.E., which was based on pattern matching but had no reasoning capabilities. Thereafter, Jabberwacky (renamed Cleverbot in 2008) was created, which had web-searching and game-playing abilities but was still limited in nature. Both chatbots used improved NLP algorithms for communicating with humans.
Improved Natural Language Processing (NLP): Until the 1980s, most NLP systems were based on complex sets of hand-written rules. In the late 1980s, researchers started using machine learning algorithms for language processing. This was due to the faster and cheaper hardware as well as the reduced dominance of Chomsky-based theories of linguistics. Instead, researchers created statistical models that made probabilistic decisions based on assigning weights to appropriate input features, and they also started using supervised and semi-supervised learning techniques and partially labeled data [82,83].
Speech and Speaker Recognition: During the late 1990s, SRI researchers used deep neural networks for speaker recognition and they achieved significant success. In 2009, Hinton and Deng collaborated with several colleagues from University of Toronto, Microsoft, Google and IBM, and showed substantial progress in speech recognition using LSTM-based deep networks [85,86].
Recommender Systems: By 2010, several companies (e.g., TiVo, Netflix, Facebook, Pandora) built recommendation engines using AI and started using them for marketing and sales purposes, thereby improving their revenue and profit margins.
Recognizing hand-written digits: In 1989, LeCun and colleagues provided the first practical demonstration of backpropagation; they combined convolutional neural networks (CNNs) with backpropagation and applied them to reading “handwritten” digits. This system was eventually used to read the numbers on handwritten checks, and by 1998 and the early 2000s, such networks processed an estimated 10% to 20% of all the checks written in the United States.
The year 2000 had come and gone, but Alan Turing’s prediction of humans creating an AI computer remained unfulfilled [1,2], even though the Loebner Prize had been initiated in 1990 with the aim of developing such a computer. Nevertheless, substantial progress was made in AI, especially with respect to deep neural networks, which were invented in 1965, with the first algorithm for training them given in 1971 [18,19,20]. Between 1983 and 2010, exemplary research done by Hinton, Schmidhuber, Bengio, LeCun, Hochreiter, and others ensured rapid progress in deep learning techniques [90,91,92], and some of these networks began to be used in commercial applications. Because of these techniques and the availability of inexpensive hardware and data, which made them practical, the pace of research and development picked up substantially between 2005 and 2010. This in turn led to substantial growth in AI solutions that started rivaling humans between 2011 and 2017; we discuss such solutions in the next article, “Domains in Which AI Systems are Rivaling Humans”.
References: The bibliography for the article can be found at www.scryanalytics.com/articles.