Oleg Zabluda's blog
Monday, September 19, 2016
 
Rest in Peas: The Unrecognized Death of Speech Recognition (2010)
"""
The accuracy of computer speech recognition flat-lined in 2001, before reaching human levels. The funding plug was pulled, but no funeral, no text-to-speech eulogy followed.

After a long gestation period in academia, speech recognition bore twins in 1982: the suggestively-named Kurzweil Applied Intelligence and sibling rival Dragon Systems. Kurzweil’s software, by age three, could understand all of a thousand words—but only when spoken one painstakingly-articulated word at a time. Two years later, in 1987, the computer’s lexicon reached 20,000 words, entering the realm of human vocabularies which range from 10,000 to 150,000 words. But recognition accuracy was horrific: 90% wrong in 1993. Another two years, however, and the error rate pushed below 50%. More importantly, Dragon Systems unveiled its Naturally Speaking software in 1997 which recognized normal human speech. Years of talking to the computer like a speech therapist seemingly paid off.
[...]
In 2001 recognition accuracy topped out at 80%, far short of HAL-like levels of comprehension. Adding data or computing power made no difference. Researchers at Carnegie Mellon University checked again in 2006 and found the situation unchanged.
"""
https://web.archive.org/web/20130120011331/http://robertfortner.posterous.com/the-unrecognized-death-of-speech-recognition
https://geektimes.ru/post/92771/



 
A Large Contextual Dataset for Classification, Detection and Counting of Cars with Deep Learning (2016) T. Nathan Mundhenk et al (LLNL)
""
We have created a large diverse set of cars from overhead images, which are useful for training a deep learner to binary classify, detect and count them. [...] The set contains contextual matter to aid in identification of difficult targets. We demonstrate classification and detection on this dataset using a neural network we call ResCeption. This network combines residual learning with Inception-style layers and is used to count cars in one look. This is a new way to count objects rather than by localization or density estimation.
[...]
It has recently been demonstrated that one-look methods can excel at both speed and accuracy [19] for recognition and localization. The idea of using a one-look network counter to learn to count has recently been demonstrated on synthetic data patches [20] and by regression on subsampled crowd patches [21]. Here we utilize a more robust network, and demonstrate that a large strided scan can be used to quickly count a very large scene with reasonable accuracy.
[...]
The cost of running GoogLeNet is 30k ops per pixel at 224x224 [...] Table 7. Performance results taken for our models running on Caffe on a single Nvidia GeForce Titan X based GPU [...] The AlexNet version will count cars at a rate of 1 km^2 per second. A company such as Digital Globe which produced satellite data at the rate of 680,000 km^2 per day in 2014 would theoretically be able to count the cars in all that data online with 8 GPUs. [Orbital Insight] claimed that they can count cars in 4 trillion pixels worth of images in 48 hours [...] our AlexNet based solution would be able to count that many pixels in 23 hours using one single GPU.
""
https://arxiv.org/abs/1609.04453



 
Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition (2015) Kaiming He et al
"""
we add an SPP layer on top of the last convolutional layer. The SPP layer pools the features and generates fixed length outputs, which are then fed into the fully-connected layers (or other classifiers). [...] to avoid the need for cropping or warping at the beginning. [...] Spatial pyramid pooling [14], [15] [...] an extension of the Bag-of-Words (BoW) model [16], is one of the most successful methods in computer vision. It partitions the image into divisions from finer to coarser levels, and aggregates local features in them. SPP has long been a key component in the leading and competition-winning systems for classification (e.g., [17], [18], [19]) and detection (e.g., [20]) before the recent prevalence of CNNs. Nevertheless, SPP has not been considered in the context of CNNs. We note that SPP has several remarkable properties for deep CNNs: 1) SPP is able to generate a fixed length output regardless of the input size, while the sliding window pooling used in the previous deep networks [3] cannot; 2) SPP uses multi-level spatial bins, while the sliding window pooling uses only a single window size. Multi-level pooling has been shown to be robust to object deformations [15]; 3) SPP can pool features extracted at variable scales thanks to the flexibility of input scales. Through experiments we show that all these factors elevate the recognition accuracy of deep networks.

SPP-net not only makes it possible to generate representations from arbitrarily sized images/windows for testing, but also allows us to feed images with varying sizes or scales during training. Training with variable-size images increases scale-invariance and reduces over-fitting. We develop a simple multi-size training method. For a single network to accept variable input sizes, we approximate it by multiple networks that share all parameters, while each of these networks is trained using a fixed input size. In each epoch we train the network with a given input size, and switch to another input size for the next epoch. Experiments show that this multi-size training converges just as the traditional single-size training, and leads to better testing accuracy

The advantages of SPP are orthogonal to the specific CNN designs. In a series of controlled experiments on the ImageNet 2012 dataset, we demonstrate that SPP improves four different CNN architectures in existing publications [3], [4], [5] (or their modifications), over the no-SPP counterparts
[...]
SPP-net also shows great strength in object detection. In the leading object detection method R-CNN [7], the features from candidate windows are extracted via deep convolutional networks. [...] But the feature computation in RCNN is time-consuming, because it repeatedly applies the deep convolutional networks to the raw pixels of thousands of warped regions per image. In this paper, we show that we can run the convolutional layers only once on the entire image (regardless of the number of windows), and then extract features by SPP-net on the feature maps. This method yields a speedup of over one hundred times over R-CNN. Note that training/running a detector on the feature maps (rather than image regions) is actually a more popular idea [23], [24], [20], [5]. But SPP-net inherits the power of the deep CNN feature maps [...] computes features 24-102× faster than R-CNN, while has better or comparable accuracy.
[...]
fixed-length vectors [...] can be generated by the Bag-of-Words (BoW) approach [16] that pools the features together. Spatial pyramid pooling [14], [15] improves BoW in that it can maintain spatial information by pooling in local spatial bins. These spatial bins have sizes proportional to the image size, so the number of bins is fixed regardless of the image size.
[...]
To adopt the deep network for images of arbitrary sizes, we replace the last pooling layer (e.g., pool5, after the last convolutional layer) with a spatial pyramid pooling layer. Figure 3 illustrates our method. In each spatial bin, we pool the responses of each filter (throughout this paper we use max pooling). The outputs of the spatial pyramid pooling are kM-dimensional vectors with the number of bins denoted as M (k is the number of filters in the last convolutional layer). The fixed-dimensional vectors are the input to the fully-connected layer.
[...]
Interestingly, the coarsest pyramid level has a single bin that covers the entire image. This is in fact a “global pooling” operation, which is also investigated in several concurrent works. In [31], [32] a global average pooling is used to reduce the model size and also reduce overfitting; in [33], a global average pooling is used on the testing stage after all fc layers to improve accuracy; in [34], a global max pooling is used for weakly supervised object recognition. The global pooling operation corresponds to the traditional Bag-of-Words method.
[...]
[31] M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv:1312.4400, 2013.

[32] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” arXiv:1409.4842, 2014.

[33] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556, 2014.

[34] M. Oquab, L. Bottou, I. Laptev, J. Sivic et al., “Learning and transferring mid-level image representations using convolutional neural networks,” in CVPR, 2014.
"""
https://arxiv.org/abs/1406.4729



 
Descriptor Matching with Convolutional Neural Networks: a Comparison to SIFT (2014) Philipp Fischer, Alexey Dosovitskiy, Thomas Brox
"""
descriptors like SIFT are not only used in recognition but also for many correspondence problems that rely on descriptor matching. [...] descriptors extracted from convolutional [...] neural networks perform consistently better than SIFT also in the low-level task of descriptor matching [...] unsupervised network slightly outperforms the supervised one.
[...]
Supervised [...] pre-trained model [...] follows Krizhevsky
[...]
We performed unsupervised training of a CNN as described in [5].
[...]
We extracted features from various layers of neural networks and measured their performance. While the unsupervised CNN clearly prefers features from higher network layers, the ImageNet CNN does not have a preference if the optimal patch size is used.
[...]
both neural nets perform much better than SIFT on all transformations except blur. The unsupervised network is superior to SIFT also on blur, but not by a large margin. [The unsupervised training explicitly considers blur, which is why the unsupervised CNN suffers less from it. ] Interestingly, the difference in performance between the networks and SIFT is typically as large as between SIFT and raw RGB [...] patches [...] as a weak ’naive’ baseline.
"""
http://arxiv.org/abs/1405.5769



 
A Critical Review of Recurrent Neural Networks for Sequence Learning (2015) Zachary C. Lipton, John Berkowitz, Charles Elkan
"""
Countless learning tasks require dealing with sequential data. Image captioning, speech synthesis, and music generation all require that a model produce outputs that are sequences. In other domains, such as time series prediction, video analysis, and musical information retrieval, a model must learn from inputs that are sequences. Interactive tasks, such as translating natural language, engaging in dialogue, and controlling a robot, often demand both capabilities. Recurrent neural networks (RNNs) are connectionist models that capture the dynamics of sequences via cycles in the network of nodes. Unlike standard feedforward neural networks, recurrent networks retain a state that can represent information from an arbitrarily long context window. [...] When appropriate, we reconcile conflicting notation and nomenclature. Our goal is to provide a self-contained explication of the state of the art together with a historical perspective and references to primary research.
"""
http://arxiv.org/abs/1506.00019



 
BLIS: A Framework for Rapidly Instantiating BLAS Functionality (2015) FG Van Zee et al
"""
fundamental innovation is that virtually all computation within level-2 (matrix-vector) and level-3 (matrix-matrix) BLAS operations can be expressed and optimized in terms of very simple kernels. [...] the simplest set that still supports the high performance [...] Higher-level framework code is generalized and implemented in ISO C99 [...] competitive with two mature open source libraries (OpenBLAS and ATLAS) as well as an established commercial product (Intel MKL).
[...]
BLIS is part of a larger effort to overhaul the dense linear algebra software stack as part of the FLAME project
"""
http://www.cs.utexas.edu/users/flame/pubs/BLISTOMSrev2.pdf

contains BLAS critique, etc



 
"""
"""
centre of the human retina [...] bipolar cell often receives its central input from just one photoreceptor, and its surrounding input from a couple more. In contrast, a bipolar cell in the more peripheral regions of the retina might collect signals from 25 photoreceptors before joining signals with 5000 other bipolar cells when communicating with the retinal ganglion cell (which, in turn, transmits information to the brain). This means that a single retinal ganglion cell in the periphery of our retina collects information from up to 75,000 photoreceptors! This allows it to keep watch over a much larger area of the visual world than cells that pool information from a much smaller number of receptors (hence, cells in the periphery are said to have large receptive fields). A further perk of such great convergence is that these cells are exquisitely sensitive to light, since they can add up weak signals from thousands of receptors. This is primarily why astronomers have historically preferred to look for faint stars by directing the telescope just slightly off-centre of their retinas – a technique known as averted vision.
[...]
larger receptive fields, they become less capable of resolving small details. Thus, the brain is unlikely to be informed of the presence of small objects in the visual periphery – something that is responsible for the fact that the dots you see in the extinction illusion are never far away from the centre of your gaze.
[...]
if another dot were to occur in the vicinity of the first one (below), there is a high probability that it would also stimulate the same receptive field’s centre. [...] In the case of these two dots, the visual brain is receiving the same signal from the same retinal ganglion cell. This means that the information it has access to does not allow it to reasonably infer where something might be happening in a particular area of the visual world. Thus, as receptive fields of neurons in the retina grow larger, their signals give the brain less and less certainty as to what exactly is happening and where. When such certainty about visual events is lacking, there is no reason for them to arise in our conscious experience. Instead, the brain appears to provide us with an experiential ‘filler’ (you don’t exactly go around seeing ‘uncertainty’) and some level of ignorance regarding just how poor our spatial vision is in the eye’s periphery.
"""
https://theneurosphere.com/2016/09/17/the-basic-neurobiology-behind-the-visual-illusion-that-is-here-to-break-the-internet/



 
"""
"""
at least 40 to 50 times a year, an airliner somewhere in the world will encounter a rapid decompression [...] when slow depressurizations are figured in, the rate increases even more. And because not all events require that regulators be notified, the problem, said Stabile, is “grossly underreported.”
[...]
On an American Trans Air flight in 1996, a mind-boggling sequence of events brought a Boeing 727 a hairbreadth from catastrophe.
[...]
Nine years after American Trans Air 406, a Boeing 737 took off from Cyprus on August 14, 2005, on a flight to Athens. It never arrived.
"""
http://www.airspacemag.com/flight-today/crash-detectives-excerpt-180960071/



