Post archive "January 2019" - 匠人先生

Big Data Basics: Impala (2) Implementation Details

Summary: I. Architecture: Impala is a massively-parallel query execution engine, which runs on hundreds of machines in existing Hadoop clusters. It is decoupled from the u Read more

posted @ 2019-01-30 17:38 匠人先生 Views (2075) Comments (0) Recommended (1)

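Java Basics: Common JVM Tools

Summary: List all current Java processes: # jps; show the heap memory usage of a process: # jmap -heap $pid; show the distribution of objects in a process's heap: # jmap -histo $pid; dump a process's heap to a file: # jmap -dump:format=b,file=test.dump $pid; analyze heap memory Read more

posted @ 2019-01-30 11:41 匠人先生 Views (325) Comments (0) Recommended (1)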

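Linux Basics: Common Commands

Summary: 1 Disk, CPU, and memory. Show all device information: # lspci; show overall disk space usage: # df -h; show disk partitions and file systems: # df -T; show overall disk inode usage: # df -i; show file details: # ls -l $path; show file inode information: # ls -i $path # stat Read more

posted @ 2019-01-30 11:39 匠人先生 Views (358) Comments (0) Recommended (1)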

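Big Data Basics: Ambari (5) Deploying Hue with Ambari

Summary: Installing hue4.2 on ambari2.7.3 (hdp3.1). Ambari's HDP does not natively support installing Hue; the following describes how to make Ambari support Hue installation by adding a service. Official site: http://gethue.com/ Hue is an open source Workbench for deve Read more

posted @ 2019-01-29 12:01 匠人先生 Views (3172) Comments (2) Recommended (0)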

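Linux Basics: SSH Tunneling / Port Forwarding

Summary: Format: ssh -L <local port>:<remote host>:<remote port> <SSH servername> Example: # ssh -L 3307:$server2:3306 user@$server1 The effect is that, after logging in to $server1 as user, local port 3307 is connected to the remote Read more

posted @ 2019-01-28 18:36 匠人先生 Views (1781) Comments (0) Recommended (0)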

Ops Basics: keepalived

Summary: keepalived 2.0.12 Official site: http://www.keepalived.org/ I. Introduction: Keepalived is a routing software written in C. The main goal of this project is to provide simple Read more

posted @ 2019-01-28 18:16 匠人先生 Views (4434) Comments (0) Recommended (0)

Database Basics: MySQL (2) Master-Slave Configuration

Summary: I. Installation: # wget -i -c http://dev.mysql.com/get/mysql57-community-release-el7-10.noarch.rpm # yum -y install mysql57-community-release-el7-10.noarch.rpm # yu Read more

posted @ 2019-01-28 15:56 匠人先生 Views (293) Comments (0) Recommended (0)

Linux Basics: iptables

Summary: iptables 1.4.21 Official site: https://www.netfilter.org/projects/iptables/index.html iptables is the userspace command line program used to configure the Linux 2 Read more

posted @ 2019-01-27 13:16 匠人先生 Views (935) Comments (0) Recommended (1)

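Uncle's Experience Sharing (28) Analyzing nginx Logs with ELK

Summary: Install ELK (elasticsearch, logstash, kibana) beforehand. I. Start logstash: $LOGSTASH_HOME is located by default at /usr/share/logstash or /opt/logstash 1 nginx logs use the default format: log_format main '$remote_addr Read more

posted @ 2019-01-26 19:34 匠人先生 Views (509) Comments (0) Recommended (1)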

Algorithm Basics: Anaconda (1) Introduction, Installation, Usage

Summary: Anaconda 2 Official site: https://www.anaconda.com/ I. Introduction: The Most Popular Python Data Science Platform. Anaconda® is a package manager, an environment manager, a P Read more

posted @ 2019-01-26 18:23 匠人先生 Views (633) Comments (0) Recommended (0)

Big Data Basics: Airflow (1) Introduction, Installation, Usage

Summary: airflow 1.10.0 Official site: http://airflow.apache.org/ I. Introduction: Airflow is a platform to programmatically author, schedule and monitor workflows. Use airflow to aut Read more

posted @ 2019-01-26 17:26 匠人先生 Views (4543) Comments (0) Recommended (0)

Uncle's Troubleshooting Notes (27) rdd.cache in Spark

Summary: spark 2.1.1 Some tasks in a Spark application were very slow, running for 10 hours; one task's log reads: 2019-01-24 21:38:56,024 [dispatcher-event-loop-22] INFO org.apache.spark.executor.CoarseGrainedExe Read more

posted @ 2019-01-25 18:33 匠人先生 Views (1732) Comments (0) Recommended (0)

Linux Basics: Upgrading glibc from 2.12 to 2.14 on redhat6

Summary: redhat6 ships with glibc-2.12; the process of upgrading to glibc-2.14: # strings /lib64/libc.so.6 |grep GLIBC_ GLIBC_2.2.5 GLIBC_2.2.6 GLIBC_2.3 GLIBC_2.3.2 GLIBC_2.3.3 GLIBC_2.3.4 GLIBC_2.4 GL Read more

posted @ 2019-01-24 17:15 匠人先生 Views (4391) Comments (0) Recommended (2)

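Uncle's Experience Sharing (27) Recovering a Linux Server After a Failed glibc Upgrade

Summary: redhat6 installs glibc-2.12 by default, but some software depends on glibc-2.14, in which case glibc must be upgraded. Download and install http://ftp.gnu.org/gnu/glibc/glibc-2.14.tar.gz # ./configure --prefix=/usr --disable-prof Read more

posted @ 2019-01-24 17:11 匠人先生 Views (6070) Comments (1) Recommended (3)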

Big Data Basics: Alluxio (1) Introduction, Installation, Usage

Summary: Alluxio 1.8.1 Official site: http://www.alluxio.org/ I. Introduction: Open Source Memory Speed Virtual Distributed Storage. Alluxio, formerly Tachyon, enables any application t Read more

posted @ 2019-01-23 15:08 匠人先生 Views (4704) Comments (0) Recommended (0)

Big Data Basics: Flink (1) Introduction, Installation, Usage

Summary: Flink 1.7 Official site: https://flink.apache.org/ I. Introduction: Apache Flink is an open source platform for distributed stream and batch data processing. Flink’s core is Read more

posted @ 2019-01-22 22:25 匠人先生 Views (1520) Comments (0) Recommended (1)

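Linux Basics: Users and Groups

Summary: 1 Add and delete users: # useradd $user # userdel $user 2 Set a user's password: # passwd $user /etc/passwd 3 Show user and group information for $user: # id $user 4 Add user $user to group $group: # usermod -G $group $use Read more

posted @ 2019-01-22 14:15 匠人先生 Views (480) Comments (0) Recommended (1)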

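Uncle's Experience Sharing (26) Reading and Writing elasticsearch Data from hive via External Tables

Summary: Reading and writing elasticsearch data from hive through external tables is much like reading and writing hbase data; the difference is that you need to download elasticsearch-hadoop-hive-6.6.2.jar and then use its EsStorageHandler. Connect the massive data storage and deep Read more

posted @ 2019-01-21 20:54 匠人先生 Views (3724) Comments (0) Recommended (1)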

Uncle's Experience Sharing (25) Reading and Writing hbase Data from hive via External Tables

Summary: Create an external table in hive: CREATE EXTERNAL TABLE hive_hbase_table(key string, name string,desc string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' Read more

posted @ 2019-01-21 20:38 匠人先生 Views (2403) Comments (0) Recommended (1)

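Uncle's Experience Sharing (24) Deployment Modes of the hive metastore

Summary: hive and other components (such as spark and impala) all depend on the hive metastore; the configuration they rely on is in hive-site.xml. Key hive metastore settings: hive.metastore.warehouse.dir, which in hive2 and earlier versions defaults to /user/hive/warehouse/, Read more

posted @ 2019-01-21 18:07 匠人先生 Views (1283) Comments (0) Recommended (1)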

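Big Data Basics: ElasticSearch (1) Introduction, Installation, Usage

Summary: ElasticSearch 6.6.0 Official site: https://www.elastic.co/ I. Introduction: Put simply, ElasticSearch is a distributed wrapper around lucene that adds the concepts of shard (each shard is a sub-index and also a lucene index) and replica; so in ElasticSear Read more

posted @ 2019-01-21 15:44 匠人先生 Views (642) Comments (0) Recommended (1)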

Big Data Basics: Impala (1) Introduction, Installation, Usage

Summary: impala2.12 Official site: http://impala.apache.org/ I. Introduction: Apache Impala is the open source, native analytic database for Apache Hadoop. Impala is shipped by Cloude Read more

posted @ 2019-01-21 13:38 匠人先生 Views (3319) Comments (0) Recommended (1)

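Big Data Basics: Kudu (1) Introduction, Installation, Usage

Summary: kudu 1.7 Official site: https://kudu.apache.org/ I. Introduction: kudu involves many concepts: a distributed file system (HDFS), a consensus algorithm (Zookeeper), Table (Hive Table), Tablet (Hive Table Partition), columnar storage (Parquet), Read more

posted @ 2019-01-21 12:45 匠人先生 Views (3520) Comments (1) Recommended (2)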

Big Data Basics: ElasticSearch (2) Common APIs

Summary: Fortunately, Elasticsearch provides a very comprehensive and powerful REST API that you can use to interact with your cluster. Among the few things th Read more

posted @ 2019-01-20 22:16 匠人先生 Views (2718) Comments (0) Recommended (1)

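Big Data Basics: Ambari (4) Deploying Impala with Ambari

Summary: Installing impala2.12 (the latest version is installed automatically) on ambari2.7.3 (hdp3.1). Ambari's HDP does not natively support installing Impala; the following describes how to make Ambari support Impala installation via an mpack. I. Install the service: 1 Download: # wget https://github.com/cas-bi Read more

posted @ 2019-01-19 23:46 匠人先生 Views (5627) Comments (4) Recommended (2)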

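Big Data Basics: Ambari (3) Deploying Airflow with Ambari

Summary: Installing airflow1.10 on ambari2.7.3 (hdp3.1). Ambari's HDP does not natively support installing Airflow; the following describes how to make Ambari support Airflow installation via an mpack. 1 Download: # wget https://github.com/miho120/ambari-airflow- Read more

posted @ 2019-01-17 21:54 匠人先生 Views (2060) Comments (1) Recommended (2)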

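Uncle's Troubleshooting Notes (25) Embedded Standalone hbase in ambari metrics collector Fails to Start

Summary: The hbase embedded in ambari metrics collector lives in /usr/lib/ams-hbase, with its configuration in /etc/ams-hbase/conf. It is started through ruby: /usr/lib/ams-hbase/bin/hirb.rb The actual start command is /usr/lib/ams-hbase/bi Read more

posted @ 2019-01-17 21:21 匠人先生 Views (5032) Comments (0) Recommended (2)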

Uncle's Troubleshooting Notes (24) Error Starting hbase in Standalone Mode

Summary: hbase 2.0.2 Starting hbase in standalone mode reports an error: 2019-01-17 15:49:08,730 ERROR [Thread-24] master.HMaster: Failed to become active master java.lang.IllegalStateExc Read more

posted @ 2019-01-17 16:43 匠人先生 Views (3203) Comments (0) Recommended (3)

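Big Data Basics: Ambari (2) Deploying ElasticSearch (ELK) with Ambari

Summary: Installing elasticsearch6.3.2 on ambari2.7.3 (hdp3.1). Ambari's HDP does not natively support installing elasticsearch; the following describes how to make Ambari support elasticsearch installation via an mpack. I. Install the service: 1 Download: Mpack include vers Read more

posted @ 2019-01-17 13:04 匠人先生 Views (6590) Comments (1) Recommended (1)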

Linux Basics: curl

Summary: An HTTP request proceeds as follows: # curl -v http://www.baidu.com % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 0 0 0 0 Read more

posted @ 2019-01-16 21:34 匠人先生 Views (547) Comments (0) Recommended (1)

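Uncle's Troubleshooting Notes (23) Ambari Install Wizard Hangs After Clicking Next

Summary: The first step of the ambari installation is entering the cluster name; after clicking next the page hangs, as shown in the figure. Note that one API request returns an abnormal result: http://ambari.server:8080/api/v1/version_definitions Reproduced as follows: curl -u admin:admin "http://ambari. Read more

posted @ 2019-01-15 18:57 匠人先生 Views (2659) Comments (1) Recommended (2)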

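Linux Basics: sudo

Summary: sudo lets a user run commands as another user (such as root), for example switching users, running commands, or reading and writing files. Configuration: sudo is configured in /etc/sudoers ## Sudoers allows particular users to run various commands as ## the root user, Read more

posted @ 2019-01-15 16:00 匠人先生 Views (461) Comments (0) Recommended (1)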

Linux Basics: SSH Key Login

Summary: Official site: https://www.ssh.com/ssh/ The SSH protocol uses encryption to secure the connection between a client and a server. All user authentication, commands Read more

posted @ 2019-01-15 14:23 匠人先生 Views (1258) Comments (0) Recommended (1)

Big Data Basics: Ambari (1) Introduction, Building and Installing, Usage

Summary: Official site: http://ambari.apache.org/ The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, Read more

posted @ 2019-01-15 12:26 匠人先生 Views (5962) Comments (2) Recommended (1)

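Database Basics: MySQL (1) Common Commands

Summary: 1 Create a user: CREATE USER 'username'@'host' IDENTIFIED BY 'password'; for example: create user 'test_user'@'%' identified by 'test'; PS: to allow login only from the local machine, use host=localhost; if allowed from any Read more

posted @ 2019-01-15 11:40 匠人先生 Views (336) Comments (0) Recommended (1)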

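Uncle's Experience Sharing (22) securecrt Connections Drop Automatically

Summary: securecrt automatically drops a connection after a period of inactivity (xshell does not have this problem), with the message "The semaphore timeout period has expired". The fix is: Options -- Session Options -- Terminal -- Send protocol NO-OP Read more

posted @ 2019-01-14 17:16 匠人先生 Views (636) Comments (0) Recommended (1)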

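Linux Basics: Running in the Background

Summary: Linux servers are usually accessed remotely. If the connection drops while a command or script is running (because it runs for a long time or the network is unstable), the process is lost too, and you have to reconnect and run it again. In that case you can run it in the background: 1 nohup command: nohup $command $args & This creates a nohup.out file containing the command's console out Read more

posted @ 2019-01-14 12:56 匠人先生 Views (343) Comments (0) Recommended (1)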

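Linux Basics: Uploading and Downloading

Summary: 1 rz sz Install: yum install -y lrzsz Upload: rz, then use the dialog. Download: sz $filename Note: rz cannot upload files larger than 4g; use scp or sftp instead, where the sftp url is: sftp://user:passwd@host:22 2 scp Install: yum ins Read more

posted @ 2019-01-14 10:54 匠人先生 Views (218) Comments (0) Recommended (1)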

Ops Basics: Nginx (1) Introduction, Installation, Usage

Summary: Official site: http://nginx.org nginx [engine x] is an HTTP and reverse proxy server, a mail proxy server, and a generic TCP/UDP proxy server, originally written Read more

posted @ 2019-01-13 19:58 匠人先生 Views (424) Comments (0) Recommended (1)

Ops Basics: Ansible (1) Introduction, Installation, Usage

Summary: Official site: https://www.ansible.com/ I. Introduction: Ansible is a radically simple IT automation engine that automates cloud provisioning, configuration management, appli Read more

posted @ 2019-01-13 18:12 匠人先生 Views (453) Comments (0) Recommended (1)

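Big Data Basics: Hadoop (1) How HA Works

Summary: Some work can only be done on a single server, such as the master. HA (High Availability) therefore requires, first, deploying multiple servers and, second, having those servers automatically elect one active server while the others stay in standby; only the active server may perform certain operations. When the ac Read more

posted @ 2019-01-11 15:25 匠人先生 Views (1770) Comments (0) Recommended (1)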

Linux Basics: Checking the Linux Distribution and Kernel Version

Summary: Check the distribution on redhat: # cat /etc/redhat-release CentOS Linux release 7.2.1511 (Core) Check the kernel version: # uname -a Linux $host 3.10.0-327.28.3.el7.x86_64 #1 SMP Thu Aug 18 1 Read more

posted @ 2019-01-10 17:25 匠人先生 Views (530) Comments (0) Recommended (0)

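Uncle's Experience Sharing (21) Viewing Each Application's Real-Time Memory and CPU Usage in yarn

Summary: The application detail page in yarn, http://resourcemanager/cluster/app/$applicationId, or the application command, yarn application -status $applicationId, only shows the resource*time stat that the application has consumed since it started Read more

posted @ 2019-01-10 16:54 匠人先生 Views (14786) Comments (1) Recommended (0)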

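Uncle's Experience Sharing (19) Progress Always Shows 10% After Submitting a spark on yarn Job

Summary: spark 2.1.1 We wanted to monitor the progress of spark on yarn jobs, but found that after submission the progress always stays at 10% until the job succeeds or fails, at which point it jumps to 100%. To see why, look at how a spark on yarn job is submitted: when submitting a job, spark on yarn changes mainClass to Cl Read more

posted @ 2019-01-10 16:18 匠人先生 Views (2399) Comments (0) Recommended (0)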

Big Data Basics: Spark (8) How Join Is Implemented in Spark

Summary: spark has two kinds of join, the RDD join and the sql join; looking at each in turn: 1 RDD join org.apache.spark.rdd.PairRDDFunctions /** * Return an RDD containing all pairs of elements wit Read more

posted @ 2019-01-09 17:42 匠人先生 Views (3470) Comments (0) Recommended (2)

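Uncle's Experience Sharing (23) A Study of File Counts When Inserting into a Table with spark sql

Summary: When spark sql runs insert overwrite table, the number of files written to the new table or partition may be 200 or may be any number. Why the difference? First look at how spark sql executes insert overwrite table: 1 Create a temporary directory, e.g. .hive-staging_hiv Read more

posted @ 2019-01-09 15:05 匠人先生 Views (2439) Comments (0) Recommended (1)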

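Uncle's Algorithm Notes (7) Least Squares

Summary: Ordinary Least Squares. Any discussion of least squares starts with fitting. Fitting is one of the basic tools of Numerical Analysis; the simplest kind is fitting a function of one variable, which (in the two-dimensional plane) divides into straight-line fitting (one-variable, first Read more

posted @ 2019-01-03 23:35 匠人先生 Views (1213) Comments (0) Recommended (1)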
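
The summary above mentions straight-line fitting of a function of one variable. As a minimal sketch of that idea (not the post's own code, which is not visible in this archive), the closed-form ordinary least squares fit of y = slope * x + intercept can be written as follows; the function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares.

    Hypothetical helper for illustration; it uses the standard closed form
    slope = Sxy / Sxx, intercept = mean(y) - slope * mean(x).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Points taken from the line y = 2x + 1.
    print(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))
```

The line returned minimises the sum of squared vertical distances between the data points and the line.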

Uncle's Algorithm Notes (6) Machine Learning Overview

Summary: Machine Learning. Categories of machine learning: Classification, Regression, Clustering, Dimensionality reduction. Supervised Learning: existing sample data (TrainingSet Read more

posted @ 2019-01-03 22:25 匠人先生 Views (247) Comments (0) Recommended (1)

Common Stock Indicator Calculations

Summary: A general method for computing MA and EMA: def getAverageArray(datas : Array[Double], period : Int, maType : MaType = MaType.Ma, weight : Double = 2.0) : ArrayBuffer[Double] = { va Read more

posted @ 2019-01-02 15:41 匠人先生 Views (1084) Comments (0) Recommended (1)
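
The body of the post's Scala `getAverageArray` is cut off in this archive, so the following is only a rough Python sketch of the two averages it names, under stated assumptions: the simple MA is reported only for full windows, the EMA is seeded with the first value, and the smoothing factor is taken as weight / (period + 1), which is the common convention when weight is 2.0. Both function names are hypothetical.

```python
def simple_moving_average(values, period):
    """Mean of each full window of `period` consecutive values."""
    if period <= 0:
        raise ValueError("period must be positive")
    return [
        sum(values[i - period + 1:i + 1]) / period
        for i in range(period - 1, len(values))
    ]


def exponential_moving_average(values, period, weight=2.0):
    """EMA seeded with the first value (an assumption, not the post's code)."""
    if period <= 0:
        raise ValueError("period must be positive")
    if not values:
        return []
    alpha = weight / (period + 1)  # assumed smoothing factor
    result = [float(values[0])]
    for value in values[1:]:
        # New EMA = alpha * current value + (1 - alpha) * previous EMA.
        result.append(alpha * value + (1 - alpha) * result[-1])
    return result


if __name__ == "__main__":
    print(simple_moving_average([1, 2, 3, 4, 5], 3))
    print(exponential_moving_average([1, 2, 3], 3))
```

Other seeding choices, such as starting the EMA from the first simple average, are equally common and give different early values.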

Thinking in BigData
