Papers to Read
General Introduction
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444.
[This is a general introduction by three towering figures of the field]
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
[This is the original LeNet paper by Yann LeCun]
Graves, A., Mohamed, A. R., & Hinton, G. (2013, May). Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6645-6649). IEEE.
[This work boosted Microsoft's speech technology]
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems (pp. 3104-3112).
[This led to Google's better speech understanding, Gmail answers, etc.]
Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10) (pp. 807-814).
[ReLU handles the vanishing gradient problem better than sigmoid]
Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of machine learning research, 15(1), 1929-1958.
[Dropout ("brain damage") yields a more robust network]
Ioffe, S., & Szegedy, C. (2015, June). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (pp. 448-456).
[This technique makes training faster]
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... & Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information processing systems (pp. 2672-2680).
[Many have called GAN the most important paper of the past few years]
ImageNet Challenge Winners
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105). [AlexNet]
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. [VGGNet]
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015, June). Going deeper with convolutions. CVPR 2015. [GoogLeNet]
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778). [ResNet]
DianNao Family
Chen, T., Du, Z., Sun, N., Wang, J., Wu, C., Chen, Y., & Temam, O. (2014, February). DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In ACM Sigplan Notices (Vol. 49, No. 4, pp. 269-284). ACM.
Chen, Y., Luo, T., Liu, S., Zhang, S., He, L., Wang, J., ... & Temam, O. (2014, December). DaDianNao: A machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (pp. 609-622). IEEE Computer Society.
Liu, D., Chen, T., Liu, S., Zhou, J., Zhou, S., Temam, O., ... & Chen, Y. (2015, March). PuDianNao: A polyvalent machine learning accelerator. In ACM SIGARCH Computer Architecture News (Vol. 43, No. 1, pp. 369-381). ACM.
Du, Z., Fasthuber, R., Chen, T., Ienne, P., Li, L., Luo, T., ... & Temam, O. (2015, June). ShiDianNao: Shifting vision processing closer to the sensor. In ACM SIGARCH Computer Architecture News (Vol. 43, No. 3, pp. 92-104). ACM.
Chen, Y., Chen, T., Xu, Z., Sun, N., & Temam, O. (2016). DianNao family: energy-efficient hardware accelerators for machine learning. Communications of the ACM, 59(11), 105-112.
Liu, S., Du, Z., Tao, J., Han, D., Luo, T., Xie, Y., ... & Chen, T. (2016, June). Cambricon: An instruction set architecture for neural networks. In Proceedings of the 43rd International Symposium on Computer Architecture (pp. 393-405). IEEE Press.
Zhang, S., Du, Z., Zhang, L., Lan, H., Liu, S., Li, L., ... & Chen, Y. (2016, October). Cambricon-X: An accelerator for sparse neural networks. In Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on (pp. 1-12). IEEE.
Lu, W., Yan, G., Li, J., Gong, S., Han, Y., & Li, X. (2017, February). FlexFlow: A Flexible Dataflow Accelerator Architecture for Convolutional Neural Networks. In High Performance Computer Architecture (HPCA), 2017 IEEE International Symposium on (pp. 553-564). IEEE.
Vivienne Sze (MIT)
Sze, V., Chen, Y. H., Yang, T. J., & Emer, J. (2017). Efficient processing of deep neural networks: A tutorial and survey. arXiv preprint arXiv:1703.09039.
Chen, Y. H., Krishna, T., Emer, J. S., & Sze, V. (2017). Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits, 52(1), 127-138.
Chen, Y. H., Emer, J., & Sze, V. (2017). Using Dataflow to Optimize Energy Efficiency of Deep Neural Network Accelerators. IEEE Micro, 37(3), 12-21.
Song Han and Bill Dally (Stanford)
Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J., & Keutzer, K. (2016). SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360.
Han, S., Mao, H., & Dally, W. J. (2015). Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.
Han, S., Pool, J., Tran, J., & Dally, W. (2015). Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems (pp. 1135-1143).
Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M. A., & Dally, W. J. (2016, June). EIE: efficient inference engine on compressed deep neural network. In Proceedings of the 43rd International Symposium on Computer Architecture (pp. 243-254). IEEE Press.
Han, S., Kang, J., Mao, H., Hu, Y., Li, X., Li, Y., ... & Yang, H. (2017, February). ESE: Efficient speech recognition engine with sparse LSTM on FPGA. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (pp. 75-84). ACM.
More Compression Approaches
Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W. Keckler, and William J. Dally. 2017. SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 27-40. DOI: https://doi.org/10.1145/3079856.3080254
Albericio, J., Judd, P., Hetherington, T., Aamodt, T., Jerger, N. E., & Moshovos, A. (2016, June). Cnvlutin: ineffectual-neuron-free deep neural network computing. In Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on (pp. 1-13). IEEE.
Judd, P., Delmas, A., Sharify, S., & Moshovos, A. (2017). Cnvlutin2: Ineffectual-Activation-and-Weight-Free Deep Neural Network Computing. arXiv preprint arXiv:1705.00125.
Google TPU
Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., ... & Boyle, R. (2017). In-datacenter performance analysis of a tensor processing unit. arXiv preprint arXiv:1704.04760.
Chilimbi, T. M., Suzue, Y., Apacible, J., & Kalyanaraman, K. (2014, October). Project Adam: Building an Efficient and Scalable Deep Learning Training System. In OSDI (Vol. 14, pp. 571-582).
Ovtcharov, K., Ruwase, O., Kim, J. Y., Fowers, J., Strauss, K., & Chung, E. S. (2015). Accelerating deep convolutional neural networks using specialized hardware. Microsoft Research Whitepaper, 2(11).
Putnam, A., Caulfield, A. M., Chung, E. S., Chiou, D., Constantinides, K., Demme, J., ... & Haselman, M. (2014, June). A reconfigurable fabric for accelerating large-scale datacenter services. In Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on (pp. 13-24). IEEE.
Zhang, C., Li, P., Sun, G., Guan, Y., Xiao, B., & Cong, J. (2015, February). Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (pp. 161-170). ACM.
Sharma, H., Park, J., Mahajan, D., Amaro, E., Kim, J. K., Shao, C., ... & Esmaeilzadeh, H. (2016, October). From high-level deep neural models to FPGAs. In Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on (pp. 1-12). IEEE.
Suda, N., Chandra, V., Dasika, G., Mohanty, A., Ma, Y., Vrudhula, S., ... & Cao, Y. (2016, February). Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (pp. 16-25). ACM.
Wang, C., Gong, L., Yu, Q., Li, X., Xie, Y., & Zhou, X. (2017). DLAU: A scalable deep learning accelerator unit on FPGA. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 36(3), 513-517.
Peemen, M., Setio, A. A., Mesman, B., & Corporaal, H. (2013, October). Memory-centric accelerator design for convolutional neural networks. In Computer Design (ICCD), 2013 IEEE 31st International Conference on (pp. 13-19). IEEE.
Various Acceleration Approaches
Gupta, S., Agrawal, A., Gopalakrishnan, K., & Narayanan, P. (2015). Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15) (pp. 1737-1746).
Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., & Bengio, Y. (2016). Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830.
Vanhoucke, V., Senior, A., & Mao, M. Z. (2011, December). Improving the speed of neural networks on CPUs. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop (Vol. 1, p. 4).
Swagath Venkataramani, Ashish Ranjan, Subarno Banerjee, Dipankar Das, Sasikanth Avancha, Ashok Jagannathan, Ajaya Durg, Dheemanth Nagaraj, Bharat Kaul, Pradeep Dubey, and Anand Raghunathan. 2017. ScaleDeep: A Scalable Compute Architecture for Learning and Evaluating Deep Networks. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 13-26. DOI: https://doi.org/10.1145/3079856.3080244
Jiecao Yu, Andrew Lukefahr, David Palframan, Ganesh Dasika, Reetuparna Das, and Scott Mahlke. 2017. Scalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 548-560. DOI: https://doi.org/10.1145/3079856.3080215
Judd, P., Albericio, J., Hetherington, T., Aamodt, T. M., & Moshovos, A. (2016, October). Stripes: Bit-serial deep neural network computing. In Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on (pp. 1-12). IEEE.
Kim, Y. D., Park, E., Yoo, S., Choi, T., Yang, L., & Shin, D. (2015). Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530.
Additional References
Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural networks, 61, 85-117.
LeCun, Y., Boser, B. E., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W. E., & Jackel, L. D. (1990). Handwritten digit recognition with a back-propagation network. In Advances in neural information processing systems (pp. 396-404).
Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
Goldberg, Y., & Levy, O. (2014). word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.
Lin, H. W., Tegmark, M., & Rolnick, D. (2016). Why does deep and cheap learning work so well?. Journal of Statistical Physics, 1-25.
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., & Manzagol, P. A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(Dec), 3371-3408.
Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8), 1798-1828.
Ota, K., Dao, M. S., Mezaris, V., & De Natale, F. G. (2017). Deep Learning for Mobile Multimedia: A Survey. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 13(3s), 34.
Tutorials by Bill Dally and Song Han (NVIDIA/Stanford)
Tutorial: High-Performance Hardware for Machine Learning https://youtu.be/J-GOkwiwg4c
Efficient Methods and Hardware for Deep Learning https://youtu.be/eZdOkDtYMoo
Tutorials by Vivienne Sze of MIT
Efficient Processing for Deep Learning: Challenges and Opportunities https://www.youtube.com/watch?v=kYEUkpHpOKA&t=61s
Tutorials by Professor Sungjoo Yoo of SNU
4-1 Deep Learning Algorithms, Optimization Methods, and Hardware Accelerators (Prof. Sungjoo Yoo) https://youtu.be/ebqVpK4c3cw
4-2 Example of Object Detection Result (Prof. Sungjoo Yoo) https://youtu.be/MEgwTaUdmqw
4-3 Convolution with Matrix Multiplication (Prof. Sungjoo Yoo) https://youtu.be/2ExjsudgDU4
4-4 High Performance Accelerator Architecture (Prof. Sungjoo Yoo) https://youtu.be/hAZ2t0a7rdU