Beijing Jiaotong University Forum - Zhixing Information Exchange Platform


Primer on Natural Language Processing (NLP)

Posted on 2016-5-28 19:14

Author: Susanne Lomatch
Natural language processing (NLP) is an outgrowth of the field of computational linguistics, the statistical and/or rule-based modeling of natural language from a computational (or algorithmic) perspective. Natural language is any language that arises as an innate facility for language possessed by the human intellect; it may be spoken, signed or written. Machine learning (ML) algorithms are used in conjunction with language models to recognize text in NLP systems, which may also employ speech models and hardware/software specialized to process and recognize speech or even signed (gesture-based) language.
Natural speech processing (NSP) may be considered a separate discipline, since it involves speech models (derived from phonetics as opposed to text linguistics) and speech signal processing methods (acoustics). However, natural language started as spoken language, evolving into written language. Evolution has rendered both a “synchronous” human skill, and it makes sense to integrate language and speech models and processing for systems that handle both. In general I follow the industry convention and use NLP to include NSP, though NSP is not strictly an adjunct. Many in the field refer to NSP practically as automatic speech recognition (ASR), which deals with the analysis of the linguistic content of speech to automatically turn spoken words into text, though this is really a subset of NSP. The opposite, natural speech synthesis (NSS), deals with text-to-speech, is also a subset of NSP, and is a difficult problem in its own right.
Natural language understanding (NLU) is the comprehensive use of NLP to give machines or AI systems the ability to “comprehend” machine-recognized text and/or speech. It is a grand challenge of AI, simply because “understanding” and “comprehension” differ from recognition/translation, even if that recognition/translation is experienced and proficient.
There are plenty of AI applications of NLP that do not require NLU, though NLU will revolutionize AI systems when and if it is achievable to an acceptable level (i.e., the system that employs it can reliably pass a suitable Turing Test, which may be an upper limit of NLU, granted). It is an active area of debate as to what precisely defines NLU, as the question of “what is thought” or “what is consciousness” enters into the equation.
I think three key aspects are required for NLU: the ability to draw inferences from recognized text/speech, the ability to effortlessly disambiguate words or phrases within context (objective semantics), and the ability to handle abstraction (cross-modal or spatial) which ties to understanding metaphors (subjective semantics). All three of these involve knowledge-based reasoning. If we are to believe Searle [1], something even more than these three aspects is required, which is why NLU is a grand challenge, much like cognitive thought and consciousness.
Major applications of NLP/NSP include:
•      Machine translation: automatic translation from one natural language to another
•      Information retrieval and extraction: extraction is the additional recognition and tagging of semantic information, or of particular information into a structured representation (e.g., relationships, sentiment); both IR and IE are used for data mining, and as enablers for question-answering, expert systems, VPAs, etc.; may also include multilingual or cross-language information retrieval/extraction; IE will be a crucial component of the Semantic Web
•      Sentiment analysis/opinion mining: application of IE and NLP to analyze and extract subjective content in text/speech information, specifically overall contextual polarity or writer/speaker attitudes (as is done with Twitter and other social media feeds)
•      Summarization: automatic summarization of large quantities of text into a concise, abbreviated version
•      Question-answering: provide a relevant answer to a user query
•      User interfaces: natural language capable interfaces to specialized user systems
•      Expert systems: e.g. Watson
•      Virtual personal assistants: e.g. Siri, Evi, Google Majel
•      Intelligent gaming
•      Intelligent databases: e.g. the Semantic Web
•      Dialogue systems: a natural language capable chatterbot with a good handle on human dialogue
In a separate review (AI Review, Part 2) I included a short description on how Watson and Siri use NLP/ASR to accomplish tasks.
Key components of NLP/NSP systems are:
•      Natural language and speech models
•      Modular software and hardware engines for text and/or speech processing and recognition; may also include optical character recognition (OCR) capability
•      ML algorithms that are applied to train modules using training data (e.g., text corpora and speech corpora) to recognize text/speech “off-line”, and that are integral in processing text/speech “on-line”
Classification of NLP systems in terms of functional capability is useful, and is effectively accomplished through identifying the representation of “knowledge levels” that humans use to extract meaning from text or spoken language, adapted from [2]:
NLP/NSP “knowledge levels” representation:
•      Acoustic/phonetic level: handles the physical properties of speech sounds (phones), and how to form phonemes, a set of phones that are cognitively equivalent, and the smallest segmental unit of sound employed to form meaningful contrasts between utterances [acoustic/audio signal processing, phonetic parsing, phonetic segmentation, phonetic transcription, acoustic/phonetic speech recognition]

•      Phonological level: handles the abstract characterization, meaning and interpretation of speech sounds (phonemes) within and across words, including phonotactic, alternant and prosodic content and context, syllables, the application of phonological rules, and how to form morphemes, the smallest semantically meaningful unit in a language [phonological parsing, speech processing and speech recognition]

•      Morphological level: handles the analysis and interpretation of the smallest parts of words that carry a meaning (morphemes), including suffixes and prefixes, and how to form words [morphological parsing]

•      Lexical level: handles the lexical meaning of words and parts of speech analyses, and how to derive units of meaning [lexical parsing, part-of-speech tagging]

•      Syntactic level: handles the structural rules and roles of words and sentences (grammar), and how to analyze, interpret and form sentences [grammar parsing]

•      Semantic level: handles how to analyze and interpret possible meanings of a sentence by focusing on the interactions among word-level meanings in the sentence, including the semantic disambiguation of words with multiple senses (word-sense disambiguation), and how to express meaning in a semantic representation [semantic parsing, semantic role labeling/tagging, semantic corpora]

•      Discourse level: handles the structural rules and roles of different kinds of text using document structures, and how to analyze, interpret and form dialogues [discourse analysis]

•      Pragmatic level: handles knowledge and meaning assigned to text or dialogue as a result of outside world knowledge, i.e., from outside the contents of the document, and how to use outside knowledge, skill and reasoning/inferencing (among other methods and tools) to analyze, interpret, express or apply contextual meaning [pragmatic analysis, AI expert system, knowledge base]

•      Cognitive level: handles the understanding and usage of natural language and dialogue in some defined and testable capacity and fluency, e.g. via a Turing Test [NLU system, Artificial Cognitive System]
I have added key machine processing functions covering the text/speech analysis, interpretation and generation at each level in brackets. This processing integrally includes ML techniques and algorithms in modern NLP systems. The graphic in Fig. 1 was taken from [2], and is a simple but good example of how an NLP system might be structured according to these levels to process a voice command that gets turned into an executable UNIX command.
A few notes about the above language knowledge levels.
Semantics contrasts with syntax, the study of the combinatorics of units of a language (without reference to their meaning), and pragmatics, the study of the relationships between the symbols of a language, their meaning, and the users of the language. In some representations, meaning at the semantic level is context-independent or context-free, with context-dependence or context-sensitive analyses pushed to the pragmatic level.
However, word-sense disambiguation is context-sensitive, and is commonly dealt with at the semantic level (though not very well by some NLP/AI systems – how about these rather simple examples: “can you go get the file I need to sharpen this tool, as this file recommends” and “I turned up the bass on the radio while cleaning the bass I caught for dinner”). In the English language alone, the most commonly occurring verbs each have eleven meanings (or senses) and the most frequently used nouns have nine senses, but humans are able to unambiguously understand and select the one sense or meaning that is intended by the author or speaker. Disambiguation in an NLP system may rely on local context and a corpus containing the frequency with which each sense occurs at the semantic level through semantic parsing and tagging, or pragmatic knowledge outside of a document or speech, such as a common sense ontology or knowledge base.
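The local-context-plus-sense-frequency strategy described above can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the sense inventory `SENSES`, its frequency counts and its "signature" words are hypothetical toy values, not drawn from any real corpus, and the scoring (context overlap first, sense frequency as tie-breaker) is a simplified Lesk-style heuristic rather than the method of any particular NLP system.

```python
# Hypothetical toy sense inventory: each sense has an assumed corpus
# frequency and a set of "signature" words that tend to occur near it.
SENSES = {
    "bass": {
        "bass/music": {"freq": 30, "signature": {"radio", "sound", "guitar", "turned", "volume"}},
        "bass/fish": {"freq": 70, "signature": {"caught", "fishing", "dinner", "lake", "cleaning"}},
    },
}

def disambiguate(tokens, index, window=3):
    """Pick a sense for tokens[index] from the words within +/- window positions.

    Senses are ranked by overlap with the local context; ties (including
    the no-context case) fall back to the more frequent sense.
    """
    word = tokens[index]
    lo, hi = max(0, index - window), index + window + 1
    context = set(tokens[lo:index] + tokens[index + 1:hi])

    def score(item):
        _sense, info = item
        return (len(context & info["signature"]), info["freq"])

    return max(SENSES[word].items(), key=score)[0]

sentence = "i turned up the bass on the radio while cleaning the bass i caught for dinner"
tokens = sentence.split()
for i, tok in enumerate(tokens):
    if tok in SENSES:
        print(i, tok, "->", disambiguate(tokens, i))
```

With this toy inventory the first "bass" is resolved by "turned" and "radio", the second by "cleaning" and "caught". A real system would need a much larger inventory and, as noted above, often pragmatic knowledge from outside the document.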



When I talked about natural language understanding (NLU), I mentioned objective vs. subjective semantics. Word-sense disambiguation is an objective process, as word sense is usually a static quantity in a language, even if common sense is applied. I contrast this with metaphor or other figures of speech, which are subjective in natural language. From a neuroscience point of view, metaphor is a language abstraction that is manifested as the cognitive linkage of seemingly unrelated concepts in the brain. These linkages are driven by genetics and learning, and may be the result of greater hyperconnectivity between language centers (most notably the angular gyrus) and other modes of processing (such as visual or auditory) in the brain [3]. This cross-modal abstraction generates some simple common metaphors (“loud shirt” “sharp cheese” “bright sound”), as do spatial abstractions (“life is a climb on a tall mountain”). It is clear (at least to me) that such abstractions are the hallmark of the human cognitive process, and a result of the highest levels of language and cognitive processing in the brain, relying on the super-nested, hyperconnected neural architecture to make it all happen. This is at the root of why NLU and cognitive architectures are grand challenges in AI. As such, I have added a “cognitive level,” which describes NLU systems that are able to understand, use and apply natural language to a level of competency and fluency that might include metaphor and other nontrivial figures of speech.
Not only are the levels of the above language knowledge representation interdependent, they can interact in a dynamic sense, in a variety of orders (i.e., nonsequentially and synchronously). These levels are structured hierarchically, though there exist bidirectional interdependencies (see Fig. 2) – information gained at a higher level of processing can be used to assist in a lower level of analysis, e.g. pragmatic knowledge may be used to learn, classify and contextually use or speak a new word, disambiguate word or phrase meanings, or to draw inferences from a body of text. Cognitive level processing would potentially make this synchronous understanding and expression of speech and language seem effortless, intelligent and fluent. Machine learning (ML) is a key part of processing at every level. Ultimately, an NLU system might resemble a neural map of how the brain processes speech and language in a unified sense.
Key architectural and algorithmic approaches to NLP systems are driven by linguistic models and theories, and are typically classified as symbolic, statistical or stochastic, connectionist or hybrid [2]. I have expanded upon each of these approaches below, including basic concepts and Wiki links to concepts that deserve more depth, so that readers can dig deeper to understand those concepts; I have also provided links to recent research work. As a complement, I recommend the other two primers I have prepared, on machine learning (ML) and knowledge representations & acquisition.
In recent years, NLP research and systems have been based primarily on statistical and connectionist approaches employing ML, extending older symbolic paradigms targeting the tasks of part-of-speech tagging, chunking and parsing to statistical or connectionist language modeling and knowledge representations. A comprehensive survey of NLP research work incorporating more recent developments in deep learning, and other worthy approaches such as recursive distributed networks, self-organizing maps and Bayesian belief networks, is not readily available, and so the embedded references below are meant to serve as the basis of an ongoing survey that I intend to build upon. The challenge is to use newer methods to effectively address language structure beyond local sequences, such as long-term dependencies, nested/recursive structure, and hetero-associative and/or cross-modal phenomena in language.
Recent NSP/ASR research and systems have followed a similar progression, having embraced successful statistical approaches early. A decent 2009 review by industry and academic leads [4] highlighted the status of the field and what the grand challenges are going forward. Some of the points made on speech cognition follow what I discussed above regarding NLU. As the review also indicates, state-of-the-art ASR systems are designed to transcribe spoken utterances, and do not tackle more complex gradations of speech comprehension. As an intermediate solution short of a cognitive systems approach, they recommend a comprehension “mimicry” system that is trained on acquiring speech and language knowledge in much the same way that a human child progresses through the process. Such an approach indicates the importance of the learning process; to make such a solution efficient and practical it seems to me that the cognitive systems approach is still required in some form, otherwise we are either looking at just another variant of Watson, which fills a large air conditioned room, or a cheaper system that employs many tricks and kluges to feign simpler comprehension (the equivalent of a chatbot).
Key architectural approaches to NLP/NSP:
•      Symbolic
o   Based on explicit representations of facts about language through well-understood knowledge representation schemes and associated algorithms
o   Dominated by the Chomskyan theories of linguistics (Chomsky hierarchy) and automata theory, which seek to unify natural languages and machine languages
o   Enlivened by newer approaches that focus on functional knowledge engineering, including ontologies and embedded reasoning
o   Examples:
§  Formal rule-based systems
•      Focused on formal grammar, a set of formation rules for strings in a formal language: the rules describe how to form strings from the language's alphabet that are valid according to the language's syntax (compare natural language syntax to programming language syntax)
•      A grammar does not describe the meaning of the strings or what can be done with them in whatever context (semantics, pragmatics), only their form
•      Chomsky hierarchy and its mapping to (automata theory equivalents):
o   Regular grammars (finite state machines or finite state automata)
•      Cross-classes of grammars exist within the Chomsky hierarchy:
o   Fillmore’s case grammar
§  Logic-based systems
•      Focused on formal semantics, the understanding of linguistic meaning by constructing precise mathematical models of the principles that speakers use to define relations between expressions in a natural language and the world which supports meaningful discourse
•      Models use first-order logic as a semantic representation, or categorical logic in some cases; formal logic is generally applied
•      Gödel’s completeness theorem ties formal semantics to formal syntax/grammar in first-order logic
•      Examples:
§  Functional-based systems
•      Models that consider languages to have evolved under the pressure of the particular functions that the language system has to serve
•      Driven by semantics, discourse and pragmatics (and NLU and human cognition) with a focus on context in a dynamic sense
•      Examples:
o   Natural language knowledge levels (shown above in this primer)
o   Advantages:
§  Well understood in terms of formal descriptive/generative power and practical applications
§  Can be used for modeling phenomena at various linguistic knowledge levels (multiple dimensions of patterning)
§  Computationally efficient algorithms for analysis and generation
§  Work well when the linguistic domain is small and well-defined
o   Disadvantages:
§  Tend to be fragile, leading to parsing failures – cannot easily handle minor, yet non-essential deviations of the input from the modeled linguistic knowledge, unless robust parsing techniques are added
§  Don’t scale very well
§  Require use of experts such as linguists, phonologists, and domain experts, since such models cannot be instructed to generalize (learn from example)
o   Algorithms:
§  Natural language parsers (speech/text) can be designed to target phonological, morphological, lexical, syntactic, semantic information, but in general focus on the syntactic through the definition of a grammar
§  Examples:
•      Phonological and morphological parsers: finite state transducer
•      Syntactic parsers: a list focusing on context-free grammars is HERE
•      Semantic parsers: shallow parsers, deep parsers, contextual parsers
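As a concrete illustration of a symbolic syntactic parser, here is a minimal sketch of a CYK recognizer for a context-free grammar in Chomsky normal form. The toy grammar (`BINARY`, `LEXICAL`) and the function name `cyk_recognize` are assumptions made up for this example; it only decides whether a sentence is grammatical and does not build a parse tree.

```python
from itertools import product

# Hypothetical toy grammar in Chomsky normal form.
# Binary rules map a pair of right-hand-side symbols to possible left-hand sides.
BINARY = {
    ("NP", "VP"): {"S"},
    ("V", "NP"): {"VP"},
    ("Det", "N"): {"NP"},
}
# Lexical rules map a word to the categories that can produce it.
LEXICAL = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "chased": {"V"}}

def cyk_recognize(words, start="S"):
    """Return True if the grammar derives the word sequence from `start`."""
    n = len(words)
    if n == 0:
        return False
    # table[i][j] holds the nonterminals that derive words[i..j] inclusive.
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        table[i][i] = set(LEXICAL.get(word, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):  # split point between left and right parts
                for left, right in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= BINARY.get((left, right), set())
    return start in table[0][n - 1]

print(cyk_recognize("the dog chased the cat".split()))
print(cyk_recognize("the dog chased the cat quickly".split()))
```

The second call shows the fragility listed under the disadvantages: one word outside the modeled lexicon makes the whole parse fail unless robust parsing techniques are added.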
•      Statistical/Stochastic
o   Based on various probabilistic techniques to develop approximate generalized models of linguistic phenomena, derived from actual examples/samples of these phenomena (e.g., linguistic corpora)
o   Extends formal automata algorithms (or other symbolic algorithms) to include probabilistic states
o   Employs machine learning (ML) to train models (estimate probabilistic model parameters) against sample data; models are then used in turn to recognize language patterns or sequences to some level of accuracy, efficiency, etc.
o   Examples:
§  Hidden Markov models (HMM)
•      Extends finite-state machines (regular grammars) to include states that each have two sets of probabilities associated with them: one determines which symbol to emit from the state (emission or output probabilities); the second determines which state to visit next (transition probabilities)
•      Parameter learning/estimation algorithms (e.g. Baum-Welch or forward-backward) are applied to find a set of state transition and output probabilities via some defined criterion, given an output sequence or set of sequences (e.g. training data)
•      Search optimization algorithms (e.g. Viterbi) are applied to find the hidden state sequence that is most likely to have generated the observed output sequence, using the trained HMM
•      Also classified as sequence learners, dynamic Bayesian networks, or a type of Markov network (see below)
•      Especially useful in recognizing temporal-based sequences in ASR (a good tutorial on HMMs in ASR is located HERE [Jua04] and HERE[Rab04])
•      For a review of their use in part-of-speech tagging see [Jur08]
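The decoding step for an HMM can be sketched with the Viterbi algorithm. The two-state tagger below is a toy: the state names, the start, transition and emission probabilities, and the function name `viterbi` are all assumptions chosen by hand for illustration, where a real system would estimate them from training data (e.g. with Baum-Welch).

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        column = {}
        for s in states:
            prob, prev = max((best[t - 1][p][0] * trans_p[p][s], p) for p in states)
            column[s] = (prob * emit_p[s][obs[t]], prev)
        best.append(column)
    # Trace back from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for t in range(len(obs) - 1, 0, -1):
        state = best[t][state][1]
        path.append(state)
    return path[::-1]

# Hypothetical hand-set parameters for a two-tag part-of-speech model.
STATES = ["Noun", "Verb"]
START = {"Noun": 0.6, "Verb": 0.4}
TRANS = {"Noun": {"Noun": 0.3, "Verb": 0.7}, "Verb": {"Noun": 0.8, "Verb": 0.2}}
EMIT = {
    "Noun": {"dogs": 0.4, "fish": 0.5, "sleep": 0.1},
    "Verb": {"dogs": 0.1, "fish": 0.4, "sleep": 0.5},
}

print(viterbi(["dogs", "sleep"], STATES, START, TRANS, EMIT))
```

For longer sequences the products underflow, so practical implementations usually work with log probabilities instead.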
§  N-gram models
•      Probabilistic language models that predict the next item in a contiguous sequence of n items from a given sequence of text or speech, in the form of an (n-1)th order Markov model
•      Items can be phonemes, syllables, letters, words or base pairs according to the application
•      N-grams are collected from a text or speech corpus
•      Effective at modeling language data; not meant to model more complex, long-range dependencies in language
•      Hybrid models incorporate Bayesian inference (maximum likelihood and maximum a posteriori estimates)
•      For a survey of these models see [Jur08]: as they state, recent research applying n-gram models focuses on very large n-grams, e.g., in 2006, Google publicly released a very large set of n-grams that is a useful research resource, consisting of all the five-word sequences that appear at least 40 times from 1,024,908,267,229 words of running text; there are 1,176,470,663 five-word sequences using over 13 million unique word types; large language models generally need to be pruned to be practical, using techniques found HERE [Chu07] and elsewhere
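A minimal sketch of the simplest case, a bigram model (n = 2, a first-order Markov model) estimated by maximum likelihood from counts. The three-sentence corpus, the boundary markers `<s>` and `</s>`, and the function names are assumptions for illustration; no smoothing or pruning is applied.

```python
from collections import Counter

def train_bigram(sentences):
    """Collect history counts and bigram counts from whitespace-tokenized sentences."""
    histories, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        histories.update(tokens[:-1])            # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    return histories, bigrams

def bigram_prob(histories, bigrams, prev, word):
    """Maximum likelihood estimate P(word | prev) = count(prev, word) / count(prev)."""
    if histories[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / histories[prev]

corpus = ["the dog barks", "the cat sleeps", "the dog sleeps"]
histories, bigrams = train_bigram(corpus)
print(bigram_prob(histories, bigrams, "the", "dog"))
print(bigram_prob(histories, bigrams, "cat", "barks"))
```

The zero probability for the unseen pair ("cat", "barks") is the reason practical models add smoothing or back off to shorter n-grams.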
§  Probabilistic context-free grammars (PCFG)
•      Generally extends context-free grammars to include probabilistic states
•      Probabilities can be assigned based on rule-use as exemplified by training data - the probability of each rule’s “significance” can be determined based on the frequency of the rule’s contribution to successful parses of training sentences
•      As with the HMM, parameter learning algorithms (e.g. inside-outside) and parsing optimization algorithms (e.g. CYK-WCFG) apply
•      Recent implementation examples: HERE [Pet06]
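The frequency-based assignment of rule probabilities can be sketched as relative-frequency estimation over a set of parsed training sentences. This assumes the supervised case where parse trees are available (the inside-outside algorithm handles the case where they are not); the nested-tuple tree encoding, the two-tree "treebank" and the function names are hypothetical.

```python
from collections import Counter

def count_rules(tree, counts):
    """Count one rule per internal node of a tree written as (label, child, ...)."""
    label, children = tree[0], tree[1:]
    # A child is either a terminal string or a subtree whose label is its first item.
    rhs = tuple(child if isinstance(child, str) else child[0] for child in children)
    counts[(label, rhs)] += 1
    for child in children:
        if not isinstance(child, str):
            count_rules(child, counts)

def estimate_pcfg(trees):
    """Relative-frequency estimate: P(A -> rhs) = count(A -> rhs) / count(A)."""
    counts = Counter()
    for tree in trees:
        count_rules(tree, counts)
    lhs_totals = Counter()
    for (lhs, _rhs), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}

# Hypothetical two-sentence treebank.
treebank = [
    ("S", ("NP", "dogs"), ("VP", ("V", "chase"), ("NP", "cats"))),
    ("S", ("NP", "cats"), ("VP", ("V", "sleep"))),
]
for rule, prob in sorted(estimate_pcfg(treebank).items()):
    print(rule, round(prob, 3))
```

By construction the probabilities of all rules sharing a left-hand side sum to one, which is what lets a probabilistic CYK-style parser rank competing parses.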
§  Statistical semantic parsing
•      Extends semantically-annotated grammars to include probabilistic states
•      Various ML techniques and statistical parsers are applied for parameter learning and parsing optimization, with the approaches dependent on the level of semantic information (e.g. shallow vs. deep)
•      This area has received much research focus in the last decade, given the motivation to accurately recognize semantic content in speech and text, following initial work by Gildea and Jurafsky on semantic role labeling
§  Hybrid or semantic HMMs
•      Each HMM hidden state can generate a sequence of words or correspond to multiple observations; in the case of semantics, hidden states are semantic slot labels, while the observed words are the fillers of the slots - usage is defining how a sequence of hidden states, corresponding to slot names, could be decoded from (or could generate) a sequence of observed words
•      Generative models with two components: the P(C) component represents the choice of what meaning to express; it assigns a prior over sequences of semantic slots, computed by a concept n-gram; P(W|C) represents the choice of what words to use to express that meaning; the likelihood of a particular string of words being generated from a given slot - it is computed by a word n-gram conditioned on the semantic slot
•      Applied to semantic “understanding” processing in dialogue systems; see [Jur08] for more detail
•      These models are very similar to the HMM models for named entity recognition
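The two-component generative model for semantic HMMs can be written out as follows. This is a sketch under assumptions: W is the observed word sequence and C the sequence of semantic slot labels as in the text, while the indices, the n-gram orders m and n, and the simplification that c_i is the slot label aligned with word w_i are notation introduced here for illustration.

```latex
\hat{C} = \arg\max_{C} P(C \mid W) = \arg\max_{C} P(W \mid C)\, P(C)

P(C) \approx \prod_{i} P(c_i \mid c_{i-1}, \ldots, c_{i-m+1}),
\qquad
P(W \mid C) \approx \prod_{i} P(w_i \mid w_{i-1}, \ldots, w_{i-n+1}, c_i)
```

The first line is Bayes' rule with the constant P(W) dropped; the second line is the concept n-gram prior and the slot-conditioned word n-gram likelihood.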
o   Advantages:
§  Effective in modeling language performance through training based on most frequent language use
§  Especially useful in modeling linguistic phenomena that are not well understood from a competence perspective, e.g. speech; many modern ASR systems are based on HMMs applied to phoneme recognition (a good tutorial on HMMs in ASR is located HERE [Jua04] and HERE [Rab04])
§  Effectiveness is highly dependent on the volume of training data available; generally more training data translates to better performance
§  Statistical models (esp. HMM) are more robust at handling variations and noise, and can be used to model nuances and imprecise language concepts
o   Disadvantages:
§  Run-time performance is generally linearly proportional to the number of distinct classes (symbols) modeled, and thus can degrade considerably as classes increase; this holds for both training and pattern classification
§  Effectiveness is tightly bound to extensive, representative, error-free text corpora and speech corpora, the production of which may be a time-consuming and error-prone process, depending on the application; to answer this deficiency, recent research focuses on systems trained on unannotated data (unsupervised learning)
§  For the task of predicting the probabilities of sentences for a given language using n-gram models, n-gram counts become unreliable with large n, as the number of possible n-grams grows with the number of distinct words to the power of n
o   Algorithms: See above examples
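To make the n-gram sparsity disadvantage concrete, here is a rough calculation using the Google figures quoted earlier in this section. The vocabulary size is an assumed round value (the source says "over 13 million" word types), and the observed count covers only sequences seen at least 40 times, so the result is an order-of-magnitude illustration only.

```python
def possible_ngrams(vocab_size, n):
    """Number of distinct n-grams that could be formed from a vocabulary."""
    return vocab_size ** n

VOCAB_SIZE = 13_000_000          # assumed round figure for "over 13 million" word types
OBSERVED_5GRAMS = 1_176_470_663  # five-word sequences that appear at least 40 times

possible = possible_ngrams(VOCAB_SIZE, 5)
fraction_observed = OBSERVED_5GRAMS / possible

print(f"possible 5-grams:  {possible:.3e}")
print(f"fraction observed: {fraction_observed:.3e}")
```

The number of possible five-word sequences is roughly 3.7 × 10^35, so even a trillion-word corpus attests only a vanishing fraction of them (on the order of 10^-27), which is why counts for large n are unreliable without smoothing or back-off.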
•      Connectionist
o   Based on complex, massively interconnected sets of simple, nonlinear components that operate in parallel
o   Trained networks are generally applied to recognize language and speech patterns (sequences and higher level structures)
o   Recent NLP research has focused on these approaches (especially multilayer networks) to address linguistic deep structure (in the semantic, pragmatic and cognitive meaning senses), where deep learning and cortical learning are applied to learn such deep structure features, avoiding time consuming parse trees
o   Recent NSP/ASR research has also focused on these architectures, especially stochastic networks
o   Examples:
§  Artificial neural networks (ANN)
•      Network nodes are artificial neurons
•      Network connections are (synaptic) weights encoding the strength of the connection
•      Each network node is associated with an activation or transfer function F(w(x)) that takes as input a set of activation weights for the node's parent variables (dendrites) and outputs a threshold value (axon) that propagates to the input of the next layer through a (synaptic) weight function
•      Acquired knowledge through training using ML techniques is stored in the pattern of interconnection weights among components – weights are updated/changed through learning
•      Performance is gauged by number and type of inputs/outputs, number of nodes and layers, connectivity, choice of activation threshold/function, choice of training process and update function, among other metrics
•      Stochastic artificial neural networks (SANN) possess stochastic neuron transfer functions and/or stochastic weights to allow for random fluctuations, rendering them generally more robust
•      Classes of ANNs applied to NLP/ASR:
o   Nonrandom ANNs (NANNs) or stochastic ANNs (SANNs); note applications can include either, in various geometries described below (taken from [Hend10] and a variety of more recent sources; some of the geometries listed may not have been specifically applied to NLP/ASR, but may have potential use)
§  Multilayer perceptron (MLP), a feedforward network
•      Information flows in one direction, with no intralayer connections between hidden states (acyclically directed)
•      Used for function approximation, categorization or classification, and sequence modeling
•      Computes a function from a fixed-length vector of input values to a fixed-length vector of output values
•      Usually trained via backpropagation, an iterative gradient descent process with potentially slow, nonoptimal convergence
•      Generally insufficient for NLP tasks where inputs/outputs are arbitrarily long sequences (wordy sentences)
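A minimal sketch of the forward pass of such a feedforward network, mapping a fixed-length input vector to a fixed-length output vector through one hidden layer. The weights here are set by hand so that the network computes XOR with a step activation; they are assumptions for illustration and were not learned by backpropagation, which would require a differentiable activation.

```python
def step(x):
    """Threshold activation: the neuron fires (1) when its net input is positive."""
    return 1 if x > 0 else 0

def layer(inputs, weights, biases, activation):
    """One fully connected layer: weighted sum per neuron, then activation."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

# Hand-set (not learned) weights. Hidden neuron 1 acts as OR, neuron 2 as AND.
HIDDEN_W = [[1.0, 1.0], [1.0, 1.0]]
HIDDEN_B = [-0.5, -1.5]
# The output neuron fires when OR is on and AND is off, i.e. XOR.
OUT_W = [[1.0, -1.0]]
OUT_B = [-0.5]

def mlp_xor(x1, x2):
    hidden = layer([x1, x2], HIDDEN_W, HIDDEN_B, step)
    return layer(hidden, OUT_W, OUT_B, step)[0]

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", mlp_xor(a, b))
```

The fixed input size is visible in the code: `mlp_xor` takes exactly two inputs, which is the limitation noted above for arbitrarily long sentences.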
§  Auto-associative MLP or autoencoder
•      Target output pattern/layer is identical to the input pattern/layer, with the number of hidden layer nodes considerably less than the input/output layer nodes
•      Hidden layers are used as encoders, enabling the learning of compressed representations while reducing the dimension of the feature space
•      Generally more tractable than standard MLPs in determining optimal parameter values during weight training: optimal weight values can be derived using linear techniques, such as SVD [Bou88]
•      For many hidden layers, as the problem becomes increasingly intractable (computationally intensive nonlinear optimization), special pre-training schemes have been developed HERE [Hin06(1)] by treating each bi-layer as a restricted Boltzmann machine (RBM, see below): learned feature activations of one RBM are used as the ‘data’ for training the next RBM in the stack; after the pre-training, the RBMs are ‘unrolled’ to create a deep autoencoder, which is then fine-tuned using backpropagation of error derivatives
•      Successfully applied to ASR and image processing, and to deep structure NLP using deep networks/learning (see below)
o   Multilayer recurrent neural networks (RNN)
§  Information can generally flow in arbitrary directions (cyclically directed or undirected), with intralayer connections between hidden layers, forming internal addressable memory states and allowing for dynamic feedback
§  Recurrent connections between hidden layers allow the network to compute a compressed representation that includes information from previous compressed representations; by performing this compression repeatedly, at each step adding new input features, a recurrent network can compress an unbounded sequence into a finite vector of hidden features
§  RNNs can be divided into two useful classes: autonomous RNNs with fixed temporal inputs (Hopfield nets, Boltzmann machines, RAAMs, BAMs), and non-autonomous RNNs with time-varying inputs (recurrent MLPs, recursive ANNs, LSTMs, MRNNs, BRNNs, FRNNs)
§  Recurrent MLP
•      An MLP (feedforward) whose hidden layers include internal links that loop back towards the input (forming internal addressable memory states), allowing for arbitrary sequence length inputs
•      Also known as a simple recurrent network (SRN) or Elman network for the special case of three layers
•      Generally insufficient for NLP tasks where there are nonlocal sequence correlations (structure more complex than sequences), as the pattern of interconnections between hidden layers imposes an inductive bias in learning
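The way a simple recurrent (Elman-style) network compresses a sequence of arbitrary length into a fixed-size hidden vector can be sketched as follows. The weight matrices are small hand-picked values and the names `srn_step` and `encode` are hypothetical; a trained network would learn these weights.

```python
import math

def srn_step(x, h_prev, W_in, W_rec, b):
    """One time step: new hidden state from the current input and previous hidden state."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(W_in[j], x))
            + sum(w * hp for w, hp in zip(W_rec[j], h_prev))
            + b[j]
        )
        for j in range(len(b))
    ]

def encode(sequence, W_in, W_rec, b):
    """Fold a sequence of input vectors into one fixed-length hidden vector."""
    h = [0.0] * len(b)
    for x in sequence:
        h = srn_step(x, h, W_in, W_rec, b)
    return h

# Hypothetical hand-set weights: 2 input features, 2 hidden units.
W_IN = [[0.5, -0.3], [0.1, 0.8]]
W_REC = [[0.2, 0.0], [-0.4, 0.3]]
BIAS = [0.0, 0.0]

print(encode([[1, 0], [0, 1]], W_IN, W_REC, BIAS))
print(encode([[0, 1], [1, 0]], W_IN, W_REC, BIAS))
print(encode([[1, 0]] * 7, W_IN, W_REC, BIAS))
```

All three calls return a vector of length two regardless of sequence length, and the first two show that the encoding is order-sensitive.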
§  Recursive ANNs (RvNN)
•      A recurrent MLP applied to input structures more complex than sequences, such as trees, graphs or functional logic (semantic structures such as predicate-argument): a copy of the network is made for each node of the tree or graph, and recurrent connections are placed between any two copies which have an edge in the tree or graph [Fra98]
•      Pattern of interconnection between hidden layers better reflects locality in the structure being modeled
•      RvNNs have been successfully used to learn distributed representations of structured objects such as logical terms, see RAAMs below
§  Recursive auto-associative memory (RAAM)
•      A recurrent auto-associative network/encoder applied to input structures more complex than sequences, such as trees, graphs or functional logic [Pol90, Gol96]
•      Labeling RAAMs (LRAAMs) have found the most use in NLP [Gol96]
•      Also classified as a type of RvNN
§  Simple Synchrony ANNs (SSNN)
•      Similar to an RvNN, but hidden layer connectivity is tailored to minimize inductive bias [Hend10]
§  Multilayer Hopfield network
•      A multilayer recurrent ANN with symmetric (undirected) connections between nodes, including hidden layers, allowing for undirected information flow and for internal addressable memory states; each neural node has a binary activation function
•      Also known as an auto-associative memory network
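The auto-associative memory behavior can be illustrated with a minimal sketch of a classic single-layer Hopfield network (a simplification of the multilayer case described above). It assumes bipolar units (+1/-1), Hebbian outer-product learning and asynchronous updates; the stored pattern and function names are made up for the example.

```python
def train_hopfield(patterns):
    """Hebbian learning: symmetric weights, no self-connections."""
    n = len(patterns[0])
    W = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    W[i][j] += p[i] * p[j]
    return W

def recall(W, state, max_sweeps=10):
    """Update units one at a time until no unit changes (a stable memory state)."""
    state = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(state)):
            field = sum(W[i][j] * state[j] for j in range(len(state)))
            new = 1 if field >= 0 else -1  # binary threshold activation
            if new != state[i]:
                state[i], changed = new, True
        if not changed:
            break
    return state

stored = [1, -1, 1, -1, 1, -1]
W = train_hopfield([stored])
noisy = [1, -1, 1, -1, 1, 1]  # last unit flipped
print(recall(W, noisy))
```

Presenting the corrupted pattern drives the network back to the stored one, which is the sense in which the memory is content-addressable.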
§  Boltzmann machine (BM)
•      Stochastic version of a multilayer Hopfield network, or more generally, a recurrent SANN with stochastic transfer functions (threshold values are activation probabilities)
•      Also a type of Markov random field (MRF) or Markov network
•      Become generally intractable as learning is applied to sufficiently complex multilayer networks: e.g., exact maximum likelihood learning is intractable, as exact computation of both the data-dependent expectations and the model’s expectations takes a time that is exponential in the number of hidden units
•      Good review HERE
§  Restricted Boltzmann machine (RBM)
•      BMs that do not allow intralayer connections between hidden units, and are tractable as learning is applied to complex multilayer networks (therefore these are not technically recurrent networks, but the stochastic binary equivalent of an autoencoder)
§  Long short-term memory networks (LSTM)
•      A multilayer RNN in which the hidden layers are replaced by a “memory block” containing one or more memory cells and a pair of adaptive, multiplicative gating units which gate input and output to all cells in the block
•      Each memory cell state is associated with a recurrent self-connected linear unit that allows for the “regulation” of local error back flow, enforcing non-decaying error flow “back into time”
•      LSTMs solve tasks that general RNNs cannot, due to their failure to learn in the presence of long time lags between relevant input and target events [Schm12]
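A scalar sketch of a single memory cell with an input gate and an output gate, as described above. The self-connection of the cell state has a fixed weight of 1.0, so the state (and the error flowing back through it) does not decay while the input gate is closed. The parameter layout is an assumption for illustration, and the sketch omits the forget gate found in later LSTM variants.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def unit_input(x, h_prev, weights):
    """Net input of one unit from the current input and the previous cell output."""
    w_x, w_h, bias = weights
    return w_x * x + w_h * h_prev + bias

def memory_cell_step(x, h_prev, state, params):
    in_gate = sigmoid(unit_input(x, h_prev, params["in_gate"]))
    out_gate = sigmoid(unit_input(x, h_prev, params["out_gate"]))
    candidate = math.tanh(unit_input(x, h_prev, params["cell_input"]))
    state = state + in_gate * candidate      # linear self-connection, weight 1.0
    h = out_gate * math.tanh(state)          # gated cell output
    return h, state

# Hypothetical parameters: input gate held closed, output gate held open.
closed = {"in_gate": (0.0, 0.0, -20.0),
          "out_gate": (0.0, 0.0, 20.0),
          "cell_input": (1.0, 0.0, 0.0)}

h, state = 0.0, 0.7
for _ in range(50):
    h, state = memory_cell_step(1.0, h, state, closed)
print(state)  # stays very close to the initial 0.7 across 50 steps
```

With the input gate closed the stored value survives a long stretch of irrelevant input, which is the behavior a plain RNN struggles to learn over long time lags.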
§  Multiplicative RNNs (MRNN)
•      An RNN variant that uses multiplicative (or “gated”) connections which allow the current input character to determine the transition matrix from one hidden state vector to the next
•      Trained using Hessian-free optimization techniques, MRNNs overcome the difficulties associated with training traditional RNNs, making it possible to apply them successfully to challenging sequence problems; see detail and NLP applications HERE [Sut11]
•      LSTM (above) makes it possible to handle datasets which require long-term memorization and recall, but even on these datasets it is outperformed by a standard RNN trained with the HF optimizer; see [Sut11]
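The "input character determines the transition matrix" idea can be sketched with a factored transition of the form W_hf · diag(W_fx x) · W_fh, along the lines of the factorization described in [Sut11]. The helper names and the tiny matrices below are illustrative assumptions; with a one-hot character vector x, the product W_fx x simply selects one column of W_fx.

```python
def matmul(A, B):
    """Plain-Python matrix product for small nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transition_matrix(char_index, W_hf, W_fx, W_fh):
    """Hidden-to-hidden matrix selected by the current (one-hot) input character."""
    gains = [row[char_index] for row in W_fx]                        # W_fx @ one_hot
    scaled = [[g * v for v in row] for g, row in zip(gains, W_fh)]   # diag(gains) @ W_fh
    return matmul(W_hf, scaled)

# Toy sizes: 2 hidden units, 2 factors, 2 characters.
W_hf = [[1.0, 0.5], [0.0, 1.0]]
W_fx = [[1.0, 0.0], [0.0, 2.0]]
W_fh = [[1.0, 0.0], [0.0, 1.0]]

T_a = transition_matrix(0, W_hf, W_fx, W_fh)
T_b = transition_matrix(1, W_hf, W_fx, W_fh)
print(T_a)  # [[1.0, 0.0], [0.0, 0.0]]
print(T_b)  # [[0.0, 1.0], [0.0, 2.0]]
```

Each character gets its own effective transition matrix, while the factorization keeps the parameter count far below one full matrix per character.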
§  Bidirectional RNNs (BRNN)
•      A multilayer recurrent ANN with bidirectional connections between nodes, including hidden layers, allowing for information flow in both directions (feedback, feedforward), and for internal addressable memory states
•      Successfully applied to ASR [Schu97]
•      A bidirectional associative memory network (BAM) is a type of BRNN (bi-layer feedback network), and the hetero-associative counterpart to an auto-associative Hopfield network; also classified as a hetero-associative memory network or a “resonance network”; see [Kos88]
§  Fully recurrent neural networks (FRNN)
•      Each node in the network has a directed or undirected connection to every other node (input, output or hidden), and a time-varying activation/transfer function, allowing for internal memory states and dynamic temporal behavior (feedback and feedforward)
•      Practical use is limited due to complexity and intractability of training
§  Variants of MLPs incorporating biologically-inspired features
§  Introduces layers that apply convolutions on their input which take into account locality information in the data, i.e. they learn features from image patches or windows within a sequence
§  Exploit spatially local correlation by enforcing a local connectivity pattern between neurons of adjacent layers; the input hidden units in the n-th layer are connected to a local subset of units in the (n-1)-th layer, which have spatially contiguous receptive fields
§  Architecture confines the learnt “filters” (corresponding to the input producing the strongest response) to be a spatially local pattern (since each unit is unresponsive to variations outside of its receptive field with respect to the retina, in the case of vision); stacking many such layers leads to “filters” (no longer linear) which become increasingly “global” (i.e., spanning a larger region of pixel space)
§  Operationally, this architectural approach specifies a feature extractor (that “sparsely” filters raw input) which outputs a feature vector to a trainable classifier (an MLP or other ANN, such as a SANN or DNN), and these are jointly trained to optimize the class scores or probability estimates
§  CNNs are easier to train than other networks, including DNNs and DBNs that employ layer-by-layer deep learning techniques (see below and [Ben09]); multitask learning and direct optimization of a joint objective function can be accomplished with good results
§  Applied to NLP: HERE [Ben03], HERE [Mor05], HERE [Col08], HERE [Wes08], HERE [Mni07] and HERE [Mni09]
•      For the NLP case, the CNN results in word feature vectors which are trained to reflect exactly the word similarities which are needed by the probability estimation model, and they work better than finding similarities based on some independent criteria (such as latent semantic indexing), or trying to specify them by hand
•      Implementations outperform competitive n-gram models; as the size of the word window n was increased, the CNN continued to improve with larger n while the n-gram models had stopped improving, indicating that the representation of word similarity feature vectors succeeded in overcoming the unreliable statistics of large n-grams
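A minimal sketch of a convolution over word windows: each filter is applied to every window of consecutive word feature vectors, and the strongest response per filter is kept. The max-over-positions pooling step, the toy word vectors and the filter values are assumptions for illustration; in a real system the word vectors and filters are learned jointly with the classifier.

```python
def conv1d_max_pool(word_vectors, filters, window=2):
    """Slide each filter over windows of consecutive words, then max-pool over positions."""
    n_positions = len(word_vectors) - window + 1
    pooled = []
    for f in filters:
        responses = []
        for start in range(n_positions):
            # Flatten the window of word vectors into one local patch.
            patch = [v for vec in word_vectors[start:start + window] for v in vec]
            responses.append(sum(w * p for w, p in zip(f, patch)))
        pooled.append(max(responses))
    return pooled

# Hypothetical 2-dimensional word feature vectors.
embeddings = {"the": [1.0, 0.0], "cat": [0.0, 1.0], "sat": [1.0, 1.0]}
filters = [[1.0, 0.0, 0.0, 1.0],
           [0.0, 1.0, 1.0, 0.0]]

short = conv1d_max_pool([embeddings[w] for w in ["the", "cat", "sat"]], filters)
longer = conv1d_max_pool([embeddings[w] for w in ["the", "cat", "sat", "the", "cat"]], filters)
print(short, longer)
```

Each filter only ever sees a local window, and the pooled output has one value per filter regardless of sentence length, so it can feed a fixed-size classifier.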
§  Generally multilayer ANNs (feedforward or recurrent) whose hidden layers are successively trained layer-by-layer (“deep learning” – e.g., unsupervised learning/training starts with the base MLP, RNN, autoencoder or RBM, and its output is used as training data input for the next higher layer structure, with the process repeated until the entire network is initialized, and can seed a final step of supervised training or fine-tuning), rendering the overall DNN an improved generative model, especially on deep language or speech structure; this process differs from the usual single-pass training; see [Ben09] HERE and refs therein; see also a good tutorial HERE
§  Approach is to exploit layer-local unsupervised criteria, i.e., the idea that injecting an unsupervised training signal at each layer may help to guide the parameters of that layer towards better regions in parameter space
§  CNNs, including the “Neocognitron,” are considered types of DNNs (class of machines/models that can learn a hierarchy of features by building high-level features from low-level ones, thereby automating the process of feature construction)
§  RNNs can also be viewed as DNNs (an RNN can be “unfolded in time” by considering the output of each neuron at different time steps as different variables, making the unfolded network over a long input sequence a very deep architecture)
§  Alternative deep learning approaches involving the constraint of feature vectors at each layer in a deep (convolutional) net to be sparse and overcomplete for unsupervised pre-training can be found HERE [Lec07], and refs therein
§  Alternative deep learning approaches employing Hessian-free optimization techniques may be useful for ML in difficult to optimize deep architectures, such as RNNs, see HERE [Mar09]
o   Topological maps
•      Biologically-inspired model; maps resemble topographically organized maps found in cortices of mammalian brains
•      Defines a topology-preserving mapping between an often highly dimensional input space and a low dimensional, most typically 2-D, space; self-organization is introduced by having the notion of neighboring units, whose weights are adjusted by an amount that decreases with their distance from the winning unit
•      Self-organization process can discover semantic relationships in sentences; SOMs have also been used in practical speech recognition [Koh90]
•      See a review HERE
•      NLP advantages: unsupervised learning, self-organization, emergent structure from representations, plasticity modeling, Hebbian learning
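A sketch of one self-organizing map update on a 1-D map: find the winning unit for an input, then pull every unit toward the input by an amount that shrinks with its map distance from the winner. The Gaussian neighborhood, learning rate and toy weights are assumptions for illustration.

```python
import math

def best_matching_unit(weights, x):
    """Index of the unit whose weight vector is closest to the input."""
    dists = [sum((w - xi) ** 2 for w, xi in zip(unit, x)) for unit in weights]
    return dists.index(min(dists))

def som_update(weights, x, lr=0.5, sigma=1.0):
    """Move the winner and its map neighbors toward x (in place); return the winner index."""
    winner = best_matching_unit(weights, x)
    for k, unit in enumerate(weights):
        influence = math.exp(-((k - winner) ** 2) / (2 * sigma ** 2))
        for d in range(len(unit)):
            unit[d] += lr * influence * (x[d] - unit[d])
    return winner

weights = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]   # three units along a 1-D map
x = [0.9, 0.8]

winner = som_update(weights, x)
print(winner, weights)
```

No labels are involved: repeated over many inputs, neighboring units end up representing similar inputs, which is what gives the map its topology-preserving character.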
•      Choice of ANN/training algorithm(s) depends on the application and the type of ML to be applied: supervised, reinforcement, unsupervised, deep
o   Supervised learning:
•      Learns by classifying patterns
•      Recent NLP research examples: HERE [Col07]
o   Specific task: semantic role labeling via shallow semantic parsing
•      Applied to ASR: HERE [Hos99]
o   Reinforcement learning:
•      Learns by optimizing using observations and feedback
•      For SANNs, stochastic states are Markov decision processes, which might be a useful approach for dialogue architectures and conversational agents (i.e., applied to NLP using SANNs; such an approach might be based on [Lev00] and might even include hybrid deep belief networks, CNNs, RNNs)
o   Unsupervised and semi-supervised learning:
•      Learns by unsupervised induction – weights are updated through a stochastic HMM process (HMM acoustic models are used to train the ANN recognizer; semi-supervised learning and joint optimization can be performed as well)
•      Applied specifically to ASR: HERE [Bea01] and HERE [Scha00]
•      Learns by building a map using input examples (unsupervised clustering)
•      Recent NLP research examples: HERE [Bur11], HERE [Pov06], HERE [Li02] and HERE [Hon97]
o  Deep learning:
§  Deep ANN (DNN) / semi-supervised learning, multitask learning
•      Learns features relevant to the tasks at hand given very limited prior knowledge
•      Tasks are integrated into a single system, which is trained jointly; all the tasks except the language model are supervised tasks with labeled training data; the language model is trained in an unsupervised fashion, jointly with the other tasks
•      Recent NLP research examples: HERE [Col08] and HERE [Wes08]; see also a good review HERE
•      Applied to context-dependent, large vocabulary ASR specifically, but can be generalized to NLP
•      Learns phones/phonemes using semi-supervised learning, through pre-training of a DNN
•      Outperform conventional context-dependent HMMs by 5-10%
•      Recent research examples: HERE [Dah12] and HERE [Jai12]
§  Recursive autoencoder (RAAM) / semi-supervised learning
•      General tool for predicting parse tree structures for NLP
•      Captures recursive features of natural language
•      Learned feature representations capture syntactic and compositional-semantic information
•      Outperform PCFGs on functionality and accuracy
•      Recent research examples: HERE [Soc11]
§  Markov random fields (MRFs)
•      Network nodes are random variables having a Markov property, arranged in an undirected graph (a.k.a. Markov network)
•      Network connections or nodal edges represent a probabilistic dependency between variables
•      Markov networks can represent certain dependencies that a Bayesian network cannot (such as cyclic dependencies); they cannot represent certain dependencies that a Bayesian network can (such as induced dependencies)
•      Markov models are (generally) noncausal, though hybrid approaches can incorporate some causality
•      Classes of MRFs applied to NLP/ASR:
o  Hidden recurrent/recursive networks
•      As explained above, generally intractable unless a deep learning technique using pre-training of an RBM is applied, or other deep learning strategies; see [Ben09], [Sal09] (HERE) and [Sal10] (HERE) and refs therein; see also [Myl99] for approximate variational techniques
•      Unlike DBNs (see below), the approximate inference procedure, in addition to an initial bottom-up pass, can incorporate top-down feedback, allowing deep BMs to better propagate uncertainty about, and hence deal more robustly with, ambiguous inputs [Sal09]
•      Recent research applications for ASR focus on using the pre-trained RBM to seed a DNN; see [Dah12] and [Jai12]
•      Recent research applications for NLP focus on using a pre-trained RBM to seed a BM; see [Sal09] and [Sal10]
o  Hybrid random fields
•      As described above, extensively applied in ASR; for a review of recent approaches, see refs in [Dah12] HERE
•      Used for structured sequence prediction, where context is important
•      A supervised (discriminant) ML technique
•      CRF models define a conditional probability p(Y|x) over label sequences Y given a particular observation sequence x, rather than a joint distribution over both label Y and observation x sequences; as such, these models support tractable inference and represent the data without making unwarranted independence assumptions
•      CRFs outperform both HMMs and MEMMs on a number of real-world sequence labeling tasks; see refs in HERE [Wal04]
•      Recent ASR research examples: HERE [Hif09] and HERE [Yu10]
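To make the conditional probability p(Y|x) concrete, here is a sketch of a linear-chain CRF: the score of a label sequence is a sum of per-position scores and label-transition scores, and the normalizer over all label sequences is computed with the forward algorithm rather than by enumeration. The score tables are made-up numbers; in a real CRF the per-position scores would be computed from features of the observation sequence x.

```python
import math
from itertools import product

def sequence_score(labels, emissions, transitions):
    """Unnormalized log-score of one label sequence."""
    score = emissions[0][labels[0]]
    for t in range(1, len(labels)):
        score += transitions[labels[t - 1]][labels[t]] + emissions[t][labels[t]]
    return score

def log_partition(emissions, transitions):
    """log Z(x) via the forward algorithm, linear in sequence length."""
    n = len(transitions)
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [math.log(sum(math.exp(alpha[i] + transitions[i][j]) for i in range(n)))
                 + emissions[t][j]
                 for j in range(n)]
    return math.log(sum(math.exp(a) for a in alpha))

def conditional_probability(labels, emissions, transitions):
    """p(Y | x) for one label sequence Y."""
    return math.exp(sequence_score(labels, emissions, transitions)
                    - log_partition(emissions, transitions))

# Hypothetical scores: 3 positions, 2 labels.
emissions = [[1.0, 0.2], [0.3, 0.8], [0.5, 0.5]]
transitions = [[0.6, -0.4], [-0.2, 0.9]]

total = sum(conditional_probability(list(y), emissions, transitions)
            for y in product(range(2), repeat=3))
print(total)  # probabilities over all label sequences sum to 1 (up to rounding)
```

Only the labels are normalized over, so nothing has to be assumed about how the observations themselves are distributed.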
•      Also known as sigmoid belief networks (SBNs): Hybrid generative models where the model construction module uses Bayesian network techniques, while the probabilistic reasoning module is implemented as a massively parallel Boltzmann machine; for good early reviews, see HERE [Nea90], HERE [Hin95], HERE [Sau96] and HERE [Myl99]
•      A recent DBN architecture example often cited for its efficient deep learning algorithm is found HERE [Hin06(2)] and a short review HERE; this work introduced the concept of using pre-training of RBMs for tractable, efficient overall training (fine-tuning) of a DBN or DNN, whereby the first two layers of the architecture form an undirected associative memory and the remaining layers form a directed acyclic graph that converts the representation in the associative memory into observable variables such as the pixels of an image or the probability of words in a sequence
•      Hybrid CNN-DBN architectures have also been formulated that show improved performance for vision tasks; see HERE [Lee09]
•      Recent ASR research examples: HERE [Moh12] and HERE [Sar11]
•      Recent NLP research examples: HERE [Tit07], HERE [Hen08], HERE [Des09] and HERE [Zho10]
§  Bayesian networks (BNs)
•      Network nodes are Bayesian random variables, arranged in a directed acyclic graph
•      Network connections represent probabilistic dependence between variables, with conditional probabilities encoding the strength of the dependencies
•      Each network node is associated with a conditional probability distribution function P(x|Parents(x)) that takes as input a particular set of values for the node's parent variables and gives the probability of the variable represented by the node
•      Acquired knowledge is stored in the pattern of conditional probabilities, set by a priori knowledge (nodal evidence) and changed through learning or inferencing
•      Networks calculate posterior probabilities of an event as output, given a priori nodal evidence – Bayesian nets generate a probabilistic output of event occurrences
•      Bayesian networks/models are causal, and are often referred to as belief networks that employ evidentiary reasoning and inferencing
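A toy example of the points above: each node carries a conditional probability table P(x|Parents(x)), the joint distribution is the product of those tables, and a posterior is obtained by summing the joint over the unobserved variables. The three-node rain/sprinkler/wet-grass network and its numbers are illustrative assumptions, not taken from the references.

```python
from itertools import product

# Conditional probability tables for a hypothetical three-node network:
# rain -> sprinkler, and (sprinkler, rain) -> wet grass.
P_RAIN = {True: 0.2, False: 0.8}
P_SPRINKLER = {True: {True: 0.01, False: 0.99},    # P(sprinkler | rain=True)
               False: {True: 0.4, False: 0.6}}     # P(sprinkler | rain=False)
P_WET = {(True, True): 0.99, (True, False): 0.9,   # P(wet=True | sprinkler, rain)
         (False, True): 0.8, (False, False): 0.0}

def joint(rain, sprinkler, wet):
    """Joint probability as the product of each node's P(x | Parents(x))."""
    p_wet = P_WET[(sprinkler, rain)]
    return P_RAIN[rain] * P_SPRINKLER[rain][sprinkler] * (p_wet if wet else 1.0 - p_wet)

def posterior_rain_given_wet():
    """P(rain=True | wet=True) by enumerating the unobserved sprinkler variable."""
    numerator = sum(joint(True, s, True) for s in (True, False))
    evidence = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
    return numerator / evidence

posterior = posterior_rain_given_wet()
print(posterior)  # higher than the prior P(rain) = 0.2
```

Enumeration like this is exponential in the number of unobserved variables, which is why practical systems rely on belief propagation or approximate inference instead.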
•      Classes of BNs applied to NLP/ASR:
o  Hybrid random fields
§  These include dynamic BNs which are a hybrid MRF (HMMs, hybrid HMMs), CRFs, SBNs and DeepBNs, see above
§  Notable is the burgeoning field of “statistical relational learning,” which features hybrids of MRF/BN models, and ML and probabilistic reasoning techniques to deal with them (specifically probabilistic inductive logic programming), as applied to designing database systems; see HERE [Get07], HERE, HERE and HERE
§  Hierarchical temporal memory (HTM), a hybrid BN-MRF model inspired by the visual cortex and applied to model artificial vision among other applications, may be useful for NLP/ASR modeling as well; the power of this model is its predictive capability, while it is limited in its application to time-dependent sequences over more complex data structures (this limitation may be overcome with some theoretical effort)
•      HTM assumes a hierarchy of nodes where each node learns spatial coincidences and then learns a mixture of Markov models over the set of coincidences; the hierarchy of the model corresponds to the hierarchy of cortical regions in the brain; the nodes in the model correspond to small regions of cortex
•      HTM networks use Bayesian belief propagation for inference
•      See review HERE [Geo09]
•      HTM has been applied to ASR: HERE [Dor06]
o  Fully Bayesian networks
§  Not a hybrid approach: these networks fully implement the directed acyclic and causal features of a full BN; see HERE [Pea87] and [Pea88]; for a decent review with comparisons to SBNs and BMs, see [Nea90]
§  These networks found early application in expert systems, specifically medical diagnosis systems
§  Advantages of these networks include that they can be used as a compact representation for many naturally occurring distributions, where the dependencies between variables arise from a relatively sparse network of connections, resulting in relatively small conditional probability tables, where representation of the problem domain probability distribution can be constructed efficiently and reliably, assuming that appropriate high-level expert domain knowledge is available; such BNs offer a framework for constructing algorithms for different probabilistic reasoning tasks
§  As discussed in the early literature, BNs suffer from the difficulty of finding appropriate and tractable learning procedures; approaches like gradient ascent must be constrained to avoid invalid solutions or getting stuck at local maxima; likewise, for a general BN structure, probabilistic reasoning tasks, such as the combinatorial optimization problem of finding the maximum a posteriori probability (MAP), are NP-hard (see [Myl99] and refs therein, and also HERE [Abd98])
§  Useful references:
•      Approaches for constructing BNs from sample data, combining domain expert knowledge with ML: HERE [Hec95]
•      Survey of algos for real-time BN inference: HERE [Guo02]
•      Finding MAPs using modified BNs (high-order RNNs): HERE [And09] and HERE [And12]
§  Recent NLP research examples: HERE [Wei06], HERE [Mey05], and undoubtedly many others, as I have not done an exhaustive search
o   Advantages:
§  Architectures are self-organizing, in that they can be made to generalize from training data even though they have not been explicitly “instructed” on what to learn; this can be very useful when dealing with linguistic phenomena that are not well-understood – when it is not clear what needs to be learned by a system in order for it to effectively handle such a phenomenon (unsupervised learning and deep learning)
§  Successful in discovering useful features of words and joint models of multiple tasks; exploit similarities between words by training feature-based representations of them
§  Architectures are fault tolerant, due to the distributed nature of knowledge representation/(memory) storage; as increasing numbers of their components become inoperable, their performance degrades gradually
§  Weights or probabilities can be adapted in real-time to improve performance
§  Effective in modeling nonlinear transformations between inputs and outputs, due to the nonlinearity within each computational element
§  Can improve computational efficiency and accuracy over traditional tagging and parsing techniques; specifically, successful at improving accuracy over n-gram models by exploiting similarities between words, and thereby estimating reliable statistics even for large n-grams
§  Convolutional nets (CNNs), recursive nets (RvNNs) and self-organizing maps (SOMs) are increasingly showing improved utility for NLP tasks and modeling over other approaches
§  Bayesian nets/hybrid random fields/stochastic nets vs. deterministic neural nets:
•      How do you deal with missing data in a neural network?
•      How do you find out how sure a neural network is of its answer?
•      How did the neural network derive its answer, what was its logic process?
o   Disadvantages:
§  Possible for a system to be over-trained and thus diminish its capability to generalize – only the training data can be recognized
§  Due to their massive parallelism, and their usual implementation on non-parallel architectures (as modular components), such systems may be ineffective from a runtime complexity perspective for many real-time tasks in human-computer interaction
§  General concern is for either intractable training of sufficiently complex networks that might represent language tasks and modeling beyond simple sequences, or the relatively long training times that may exist even in tractable approaches such as deep learning and DBNs; there are also the issues of inductive bias and of intractable or unacceptably slow probabilistic reasoning/inferencing
§  Specific criticism of deep belief networks (DBNs) and associated layer-by-layer deep learning techniques: as classifiers, DBNs can underperform other learning algos/classifiers; one reason cited [Mca08] is that DBNs iteratively learn “features-of-features” in each level’s RBM; if the network has an appropriate implementation for the task at hand, this will potentially lead to a very high accuracy classifier; however, if the network and data do not work perfectly well together, this feature-of-feature learning can potentially lead to recursively learning features that do not appropriately model the training data (however this property may be useful for language data, which has inherent recursive structure); one solution cited is to use appropriate continuous-valued neuron representations
o   Algorithms: See above examples
•      Hybrid
o   Based on different variations of compound architectures and linguistic models attempting to use the best approach (symbolic, statistical or connectionist) for a given modeling subproblem in an application
o   Recent research focuses on hybrid approaches, which are outlined above under statistical and connectionist
An example list of open or commercial toolkits, standards and research groups (by no means complete, and will be revised periodically):
NLP software toolkits: GATE, OpenNLP, NLTK, CMUSLM, SENNA; see also Torch5, a collection of ML algos
ASR software toolkits: HTK (HMM), Sphinx, CSLU, ATT-GRM, SRILM, RWTH, Shout
ASR standards: VoiceXML (for a dated review, see HERE), NIST (benchmarks HERE, tools HERE, history HERE)
(Disclaimer: This primer is meant to inform. I encourage readers who find factual errors or deficits to contact me (contact link below). I also welcome constructive and friendly comments, suggestions and dialogue.)
References and Endnotes:
[2] “Natural Language Processing: A Human–Computer Interaction Perspective,” B. Manaris, Advances in Computers (Marvin V. Zelkowitz, ed.), vol. 47, pp. 1-66, Academic Press, New York, 1998.
[3] “A Brief Tour of Human Consciousness,” V.S. Ramachandran, Pi Press, 2004.
[4] “Research Developments and Directions in Speech Recognition and Understanding,” J.M. Baker et al., IEEE Signal Processing Magazine, vol. 26 (3), p.75, May 2009. A link to this review can be found HERE.
[Pet06] “Learning Accurate, Compact, and Interpretable Tree Annotation,” S. Petrov, Proceedings of the 21st International Conference on Computational Linguistics, 2006.
[Jua04] “Automatic Speech Recognition – A Brief History of the Technology Development,” B.H. Juang and L.R. Rabiner, 2004.
[Rab04] “Speech Recognition: Statistical Models,” L.R. Rabiner and B.H. Juang, 2004.
[Jur08] “Speech and Language Processing,” D. Jurafsky and J.H. Martin, Prentice-Hall, 2008.
[Chu07] “Compressing Trigram Language Models With Golomb Coding,” K. Church et al., Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2007.
[Hend10] “Artificial Neural Networks,” J.B. Henderson, in “The Handbook of Computational Linguistics and Natural Language Processing,” ed. A. Clark et al., 2010.
[Bou88] “Auto-Association by Multilayer Perceptrons and Singular Value Decomposition,” H. Bourlard and Y. Kamp, Biological Cybernetics vol. 59, p.291 (1988).
[Hin06(1)] “Reducing the Dimensionality of Data with Neural Networks,” G. E. Hinton and R. R. Salakhutdinov, Science, vol. 313 (5786), p.504, Jul. 2006.
[Fra98] “A general framework for adaptive processing of data structures,” P. Frasconi et al., IEEE Trans. Neural Networks, vol. 9 (5), p.768, Sep. 1998.
[Pol90] “Recursive distributed representations,” J.B. Pollack, Artificial Intelligence, vol. 46 (1), p.77, Nov. 1990.
[Gol96] “Learning task-dependent distributed representations by backpropagation through structure,” C. Goller and A. Kuchler, IEEE Conf. on Neural Networks, Jun. 1996.
[Schm12] See J. Schmidhuber’s excellent website on recurrent neural nets, with an emphasis on LSTMs and their application to NLP/ASR, handwriting recognition, etc., linked HERE.
[Sut11] “Generating Text with Recurrent Neural Networks,” I. Sutskever et al., Proceedings of the 28th International Conference on Machine Learning, 2011.
[Schu97] “Bi-directional Recurrent Neural Networks [for Speech Recognition],” M. Schuster and K.K. Paliwal, IEEE Trans. Signal Processing, vol. 45 (11), p.2673, Nov. 1997.
[Kos88] “Bidirectional Associative Memory,” B. Kosko, IEEE Trans. Systems, Man, and Cybernetics, vol. 18 (1), Jan. 1988.
[Ben03] “A Neural Probabilistic Language Model,” Y. Bengio et al., Journal of Machine Learning Research, vol.3, p.1137, 2003.
[Mor05] “Hierarchical Probabilistic Neural Network Language Model,” F. Morin and Y. Bengio, AISTATS, 2005.
[Col08] “A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning,” R. Collobert and J. Weston, 2008.
[Mni07] “Three New Graphical Models for Statistical Language Modeling,” A. Mnih and G.E. Hinton, Proceedings of the 24th International Conference on Machine Learning, 2007.
[Mni09] “A Scalable Hierarchical Distributed Language Model,” A. Mnih and G.E. Hinton, Advances in Neural Information Processing Systems 21, 2009.
[Ben09] “Learning Deep Architectures for AI,” Y. Bengio, Foundations and Trends in Machine Learning, vol. 2 (1), 2009.
[Lec07] “Energy-Based Models in Document Recognition and Computer Vision,” Y. LeCun et al., Ninth Intl. Conf. on Document Analysis and Recognition, 2007.
[Mar09] “Deep learning via Hessian-free optimization,” J. Martens, Proceedings of the 27th International Conference on Machine Learning, 2010.
[Koh90] “The Self-Organizing Map,” T. Kohonen, Proc. IEEE, vol. 78 (9), Sept. 1990.
[Col07] “Fast Semantic Extraction Using a Novel Neural Network Architecture,” R. Collobert and J. Weston, 2007.
[Hos99] “Speech Recognition Using Neural Networks,” J.P. Hosom et al., Center for Spoken Language Understanding, OGI, 1999.
[Lev00] “A Stochastic Model of Human-Machine Interaction for Learning Dialog Strategies,” E. Levin et al., IEEE Transactions on Speech and Audio Processing, vol.8, p.11, Jan. 2000.
[Bea01] “Neural Networks in Automatic Speech Recognition,” F. Beaufays et al., Published in Handbook of Brain Theory and Neural Networks, 2000.
[Scha00] “CSLU-HMM: The CSLU Hidden Markov Modeling Environment,” J. Schalkwyk et al., Center for Spoken Language Understanding, OGI, 2000.
[Bur11] “Self organizing maps in NLP: exploration of coreference feature space,” A. Burkovski et al., Proceedings of the 8th international conference on Advances in self-organizing maps, 2011.
[Pov06] “Neural Network Models for Language Acquisition: A Brief Survey,” J. Poveda and A. Vellido, Lecture Notes in Computer Science, vol. 4224, p. 1346, 2006.
[Li02] “A Self-Organizing Connectionist Model of Bilingual Processing,” P. Li and I. Farkas, in Bilingual Sentence Processing, vol.59, p.85, 2002.
[Hon97] “Self-Organizing Maps in Natural Language Processing,” T. Honkela, Ph.D. Thesis, 1997.
[Wes08] “Deep Learning via Semi-Supervised Embedding,” Weston et al., Proceedings of the 25th International Conference on Machine Learning, 2008.
[Dah12] “Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition,” G.E. Dahl et al., IEEE Trans. Audio, Speech and Lang. Processing, Jan. 2012.
[Jai12] “Application of Pretrained Deep Neural Networks to Large Vocabulary Conversational Speech Recognition,” N. Jaitly et al., U. Toronto, Mar. 2012.
[Soc11] “Parsing Natural Scenes and Natural Language with Recursive Neural Networks,” R. Socher et al., Proceedings of the 28th International Conference on Machine Learning, 2011.
[Sal09] “Deep Boltzmann Machines,” R. Salakhutdinov and G.E. Hinton, Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, 2009.
[Sal10] “Efficient Learning of Deep Boltzmann Machines,” R. Salakhutdinov and H. Larochelle, Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 2010.
[Wal04] “Conditional Random Fields: An Introduction,” H.M. Wallach, 2004.
[Hif09] “Speech Recognition Using Augmented Conditional Random Fields,” Y. Hifny and S. Renals, IEEE Trans. Audio, Speech and Lang. Processing, Feb. 2009.
[Yu10] “Deep-Structured Hidden Conditional Random Fields for Phonetic Recognition,” D. Yu and Li Deng, Proc. Interspeech, 2010.
[Nea90] “Learning Stochastic Feedforward Networks,” R.M. Neal, UofToronto, Nov. 1990.
[Hin95] “The wake-sleep algorithm for unsupervised neural networks,” G.E. Hinton et al., Science, vol. 268, p.1558, 1995.
[Sau96] “Mean Field Theory for Sigmoid Belief Networks,” L.K. Saul et al., Journal of Artificial Intelligence Research, vol. 4, p. 61, 1996.
[Myl99] “Massively Parallel Probabilistic Reasoning with Boltzmann Machines,” P. Myllymaki, Applied Intelligence, vol. 11, p.31, 1999.
[Hin06(2)] “A fast learning algorithm for deep belief nets,” G.E. Hinton et al., Neural Computation, vol. 18, p.1527, 2006.
[Lee09] “Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations,” H. Lee et al., Proceedings of the Twenty-sixth International Conference on Machine Learning, 2009.
[Moh12] “Acoustic Modeling using Deep Belief Networks,” A. Mohamed et al., submitted to IEEE Trans. Audio, Speech and Lang. Processing, 2012.
[Sar11] “Deep Belief Nets for Natural Language Call-Routing,” R. Sarikaya et al., ICASSP, 2011.
[Tit07] “Constituent Parsing with Incremental Sigmoid Belief Networks,” I. Titov and J. Henderson, Proc. 45th Meeting of Association for Computational Linguistics, 2007.
[Hen08] “A Latent Variable Model of Synchronous Parsing for Syntactic and Semantic Dependencies,” J. Henderson et al., Proceedings of the CoNLL-2008 Shared Task, 2008.
[Des09] “A Deep Learning Approach to Machine Transliteration,” T. Deselaers et al., Proceedings of the Fourth Workshop on Statistical Machine Translation, 2009.
[Zho10] “Active Deep Networks for Semi-Supervised Sentiment Classification,” S. Zhou et al., Harbin Inst. of Tech., 2010.
[Mca08] “Document Classification using Deep Belief Nets,” L. McAfee, Stanford CS, 2008.
[Get07] “Introduction to statistical relational learning,” L. Getoor and B. Taskar, MIT Press, 2007.
[Geo09] “Towards a Mathematical Theory of Cortical Micro-circuits,” D. George and J. Hawkins, PLoS Computational Biology, vol. 5, 2009.
[Dor06] “Hierarchical Temporal Memory Networks for Spoken Digit Recognition,” J. van Doremalen, Ph.D. thesis, Radboud University, 2006.
[Pea87] “Evidential Reasoning Using Stochastic Simulation of Causal Models,” J. Pearl, Artificial Intelligence, vol. 32, p.245, 1987.
[Pea88] “Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference,” J. Pearl, Morgan Kaufmann Publishers, San Mateo, CA, 1988.
[Abd98] “Approximating MAPs for belief networks is NP-hard and other theorems,” A.M. Abdelbar and S.M. Hedetniemi, Artificial Intelligence vol.102, p.21, 1998.
[Hec95] “Learning Bayesian Networks: The Combination of Knowledge and Statistical Data,” D. Heckerman et al., Machine Learning, vol. 20(3), p.197, Sept. 1995.
[Guo02] “A Survey of Algorithms for Real-Time Bayesian Network Inference,” H. Guo and W. Hsu, American Association for Artificial Intelligence Technical Report, 2002.
[And09] “Finding MAPs Using High Order Recurrent Networks,” E.A.M. Andrews and A.J. Bonner, Proceedings of the 16th international conference on neural information processing: Part I, 2009.
[And12] “Finding MAPs using strongly equivalent high order recurrent symmetric connectionist networks,” E.A.M. Andrews and A.J. Bonner, Cognitive Systems Research vol.14, p.50, 2012.
[Mey05] “Comparing Natural Language Processing Tools to Extract Medical Problems from Narrative Text,” S.M. Meystre and P.J. Haug, AMIA Symposium Proceedings, 2005.
[Wei06] “Bayesian Network, a model for NLP?” D. Weissenbacher, Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics, 2006.

