E-commerce Data Analytics System (CentOS Distributed Cluster)
0 Project Introduction
0.1 Project Background
Online shopping platforms such as Taobao, JD.com, and Amazon generate massive amounts of publicly available logistics orders and shopping data worldwide. These data contain valuable information such as consumer purchasing preferences, time of purchase, geographic location, and more. Analyzing this data can help consumers better understand market trends and make more informed purchasing decisions. Additionally, these data can help platforms and sellers better understand market demand and trends, and adjust sales strategies in a timely manner to increase sales and profitability.
0.2 Project Motivation
Therefore, we use this data as the raw data for this project, aiming to practice various typical operations involved in the entire data processing process, such as data preprocessing, storage, querying, and visual analysis. This will help us become familiar with and master various data processing tools and techniques, such as the installation and usage methods of systems and software like Linux, MySQL, Hadoop, Hive, Sqoop, Eclipse, ECharts, and Spark. In this process, we will learn how to clean, deduplicate, format, and transform large-scale data for subsequent data processing and analysis. We will also learn how to choose an appropriate database or data warehouse to store and manage data for quick querying and analysis. Finally, we will use various visualization tools and techniques to present data in chart form, making data analysis results more intuitive and providing better services and experiences for consumers and platforms.
0.3 Project Significance
New-generation technologies based on the internet encompass various aspects such as new hardware and data centers at the infrastructure level, distributed computing and mass data storage and processing technologies. There are also more ways to communicate between people through social networks, diversification of devices such as mobiles, and data acquisition through the Internet of Things. Massive data will play a central role in all of these. The behavior data of all end-users of internet enterprises can be easily stored on servers and through mining, analysis, and graphical presentation of the massive data, we can clearly reveal behavior patterns of users. This helps to deepen our understanding of user needs and harness the collective wisdom of users to provide a basis for product research and development personnel to make informed decisions. These insights can constantly improve the intelligence level of the system and enhance the product user experience. Massive data contains unprecedented potential, and the processing of such data will have a far-reaching impact. It is essential to employ sophisticated data processing techniques to ensure that we extract the valuable insights hidden within this data. By doing so, we can optimize business strategies, improve products and services, and ultimately provide more value to our customers. Given the rapid growth of the internet, the importance of data processing and analysis will only continue to increase in the coming years.
1 System Design
1.1 Technical Architecture Design
Our technical architecture diagram is as follows:
- The data acquisition layer: The project's dataset uses open datasets available on the internet. Shopping logs are simulated by a program that reads the shopping log data and sends it to Kafka at regular intervals; the real-time log stream then flows from Kafka into the stream computing engine. For a future implementation, we plan to adopt dedicated data acquisition technology, such as scheduled extraction or real-time synchronization through tools like Sqoop or Canal, and to collect various event-tracking logs in real time through Flume.
- Data storage and analysis: For offline processing, we store the data in HDFS and use Hive for querying and analysis. The statistics produced by Hive are stored in a MySQL database for display on web pages. Furthermore, we use Spark MLlib for machine-learning prediction and analysis of some of the data. For the online stream from Kafka, we use Structured Streaming to compute the data in real time and send the results to a Kafka producer, which publishes the processed messages to a message queue; a consumer then reads from that queue. We use ZooKeeper to monitor each node's state and schedule resources.
- API: Offline data is pushed from MySQL to the frontend through the PyMySQL API. Online data is read through Flask-SocketIO and then pushed to the frontend.
- Data application: The frontend browser uses the SocketIO JS library to receive data in real time and the JS visualization library Highlights.js for dynamic display. We use ECharts to present our statistical data, and a real-time query interface is provided on the frontend. (A minimal sketch of the server-side push path is given right after this list.)
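To make the online push path above concrete, the following is a minimal sketch of the server-side piece, assuming Flask-SocketIO and kafka-python are installed and that the stream-processing results arrive on a Kafka topic named 'result'; the topic name and the 'update' event name are illustrative assumptions, not the project's actual identifiers.

from flask import Flask
from flask_socketio import SocketIO
from kafka import KafkaConsumer

app = Flask(__name__)
socketio = SocketIO(app)

def push_results():
    # Read processed results from Kafka and push each message to connected browsers.
    consumer = KafkaConsumer('result', bootstrap_servers='localhost:9092')
    for msg in consumer:
        socketio.emit('update', msg.value.decode('utf8'))

if __name__ == '__main__':
    # Run the Kafka reader as a background task and serve the SocketIO app.
    socketio.start_background_task(push_results)
    socketio.run(app, host='127.0.0.1', port=5000)

The browser side then subscribes to the 'update' event with the SocketIO JS client and feeds the payload into the charts.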
1.2 Functionalities
Our system has the following functions:
- Statistical analysis of offline batch data: We use statistical methods to analyze batch offline data. Hive is used to perform statistical analysis on HDFS data, and the results are outputted to the MySQL database for storage.
- Machine learning analysis of offline batch data: We utilize Spark MLlib to analyze consumer behavior and predict whether customers will buy the same product again; the architecture can also support a variety of other machine-learning computations (an illustrative sketch is given after this list).
- Real-time streaming data analysis: We process real-time streaming data using Kafka and Spark Streaming to count the number of male and female buyers in purchase logs over time.
- Dynamic scalability: Our framework has good dynamic scalability, which allows it to handle increasing amounts of data over time.
- Support for various data types: Our framework supports the import of various data types, enabling it to handle a wide range of data formats.
- Support for changing data records: Our framework supports changing the number of columns and rows of data records, which makes it highly adaptable to different data sources and analysis requirements.
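The sketch referenced in the machine-learning item above is shown here. It is only a hedged illustration of how such a prediction could be wired up with Spark MLlib: the HDFS path, the feature columns, and the choice of a linear SVM (LinearSVC) are assumptions made for this example, not the project's exact pipeline.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LinearSVC

spark = SparkSession.builder.appName("RepeatBuyerSVM").getOrCreate()

# Load the repeat-customer training set (the path is an assumption; adjust it to
# wherever train.csv was uploaded in HDFS) and drop records outside the prediction range.
train = (spark.read.csv("hdfs://master:9000/dbtaobao/dataset/train.csv",
                        header=True, inferSchema=True)
         .na.drop()
         .filter("label >= 0"))

# Assemble a simple feature vector from the fields available in train.csv (see Section 3.1.1).
assembler = VectorAssembler(inputCols=["age_range", "gender", "merchant_id"],
                            outputCol="features")
svm = LinearSVC(featuresCol="features", labelCol="label", maxIter=20)
model = svm.fit(assembler.transform(train))

# Predict on the same data just to illustrate the pipeline.
model.transform(assembler.transform(train)).select("user_id", "prediction").show(5)
spark.stop()

Any other MLlib classifier could be swapped in behind the same VectorAssembler step, which is what makes the architecture reusable for other machine-learning tasks.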
2 Platform and Tools
2.0 Environment Version
Linux: CentOS-7-x86_64
Hadoop: hadoop-2.7.5
Java: jdk1.8.0_162
MySQL: 5.7
Hive: 1.2.2
Sqoop: 1.4.6
Eclipse: jee-2022-06-R-linux
ECharts: 3.4.0
Spark: 3.2.0 (scala 2.12)
Pyspark: 2.4.6
Kafka: 2.8.0 (scala 2.12)
2.1 Install CentOS Virtual Machine
Click Create Virtual Machine
Click Customize>Next>Install the operating system later>Next>Set the virtual machine storage location>Set the virtual machine name>Number of processors:2>Allocate memory>Select NAT for network type>Finish>
Edit the virtual machine settings>Use ISO image file>Select the image file you have downloaded
Run the virtual machine>Install CentOS 7>Select language: Chinese>Continue>
Date and Time>Settings in the upper right corner>Uncheck all previously ticked servers>Add new NTP servers from the list below (three are enough)>Tick them and click OK
ntp1.aliyun.com
ntp2.aliyun.com
ntp3.aliyun.com
ntp4.aliyun.com
ntp5.aliyun.com
ntp6.aliyun.com
ntp7.aliyun.com
Software selection > select "Infrastructure Server" in "Basic Environment" > select "Debugging Tools" in "Additional Options for Selected Environment" > Finish
Installation Location>Select "Auto-configure Partitions">Finish
Network and Hostname>Open Network>Finish
Start installation>ROOT password>Set ROOT password>Finish
Reboot after installation is complete
Similarly, create two other slave virtual machines.
2.2 Deploy Hadoop
2.2.1 Configuring a Static Network
Enter root>Enter password
ping www.baidu.com
Check if the network can ping through, Ctrl+C to stop.
vi /etc/sysconfig/network-scripts/ifcfg-ens33
Press i to enter edit mode, change 2 main things.
1. BOOTPROTO="dhcp" to BOOTPROTO="static"
2. ONBOOT="no" to ONBOOT="yes", if it is already yes, don't touch it.
Go to VMware interface > Edit > Virtual Network Editor > VMnet8 > NAT Settings > View Subnet IP, Subnet Mask and Gateway
Then append the following lines at the end of the file. IPADDR should be in the same network segment as the subnet IP you just viewed, and GATEWAY should match the gateway you viewed:
IPADDR="192.168.19.11"
NETMASK="255.255.255.0"
GATEWAY="192.168.19.2"
When you are done, press Esc to exit edit mode, then type :wq to save and exit.
After saving and exiting we need to restart the service:
service network restart
Then ping again; if it succeeds, the static network configuration is complete. If you have problems, see the Difficulties and Solutions section.
2.2.2 Turn off the Firewall
systemctl stop firewalld
systemctl disable firewalld
If you are not sure, you can check it by entering the following command:
systemctl status firewalld
Seeing "inactive (dead)" in the output indicates the firewall has been stopped successfully.
2.2.3 Configure the Host Name
Then change the hostname on the master and the two slaves in turn.
master:
hostnamectl set-hostname master
slave0:
hostnamectl set-hostname slave0
slave1:
hostnamectl set-hostname slave1
Then reboot and you will see that the hostname has changed. Next, edit the hosts file (/etc/hosts) and add the following entries:
192.168.19.110 master
192.168.19.111 slave0
192.168.19.112 slave1
Type :wq to save and exit. You can then ping slave0 or slave1; if the ping succeeds, step 2.2.3 is complete.
2.2.4 Transferring Files Using SSH
Use Xshell and Xftp to transfer the files hadoop-2.7.5.tar.gz, jdk-8u162-linux-x64.tar.gz to the root directory of the virtual machine.
New file transfer > transfer jdk installation package and hadoop installation package to virtual machine
2.2.5 Configure SSH Password-free Login
First generate the key:
ssh-keygen -b 1024 -t rsa
Run the following three commands on master, slave0, and slave1 respectively (nine commands in total).
ssh-copy-id master
ssh-copy-id slave0
ssh-copy-id slave1
Run the following command once on each of master, slave0, and slave1 (three commands in total):
chmod 600 authorized_keys
Afterwards, test whether the machines can log on to each other password-free. Enter the command ssh slave0
and the following result will indicate a successful configuration:
[root@master ~]# ssh slave0
Last login: Thu May 22 11:00:34 2023 from 192.168.19.110
[root@slave0 ~]#
2.2.6 Configuring Time Synchronisation
Subsequently configure the time synchronisation. Enter the command crontab -e
to access a new page and type
0 1 * * * /usr/sbin/ntpdate cn.pool.ntp.org
Time synchronization is now configured.
Enter clock to view the current time.
2.2.7 Unpack the JDK Package and the Hadoop Package
tar -xzvf hadoop-2.7.5.tar.gz
tar -xzvf jdk-8u162-linux-x64.tar.gz
Once you have finished unpacking, you can view it by typing the command: ls
2.2.8 Configuring the JDK and Hadoop
To install the vim editor:
yum install vim -y
Then go to the .bash_profile file and add the following:
export JAVA_HOME=/root/jdk1.8.0_162
export PATH=$JAVA_HOME/bin:$PATH
Save and exit. After using the command source .bash_profile
to make it effective, enter the command: java -version
If a version of java appears, the installation is successful.
If the java command does not work, see the Difficulties and Solutions section.
After saving and exiting, enter the following two commands on the master to copy the JDK to the slaves:
scp -r jdk1.8.0_162 root@slave0:~/
scp -r jdk1.8.0_162 root@slave1:~/
Next, configure Hadoop. Go to the directory containing the configuration files:
cd hadoop-2.7.5/etc/hadoop
Type ls
to list all configuration files,
The following files need to be configured: core-site.xml, hadoop-env.sh, hdfs-site.xml, yarn-site.xml, yarn-env.sh, mapred-site.xml.template (later renamed mapred-site.xml), and slaves. Among them, core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml are Hadoop cluster-specific configuration files in which users can override the settings of the corresponding Hadoop components. hadoop-env.sh, yarn-env.sh, and mapred-env.sh are script files responsible for setting the Java home directory, the locations of the various log files, and the JVM properties of the daemons. The contents of the configuration files are as follows:
core-site.xml. This is the general Hadoop property configuration file; the configuration items in this file override the same items in core-default.xml.
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/root/hadoopdata</value>
</property>
hadoop-env.sh. This script sets environment variables for the Hadoop daemons, most importantly JAVA_HOME.
export JAVA_HOME=/root/jdk1.8.0_162
hdfs-site.xml. The configuration items in this file override the same items in hdfs-default.xml.
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
yarn-site.xml. The configuration items in this file override the same items in yarn-default.xml.
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
yarn-env.sh. This script sets environment variables for the YARN daemons.
JAVA_HOME=/root/jdk1.8.0_162
mapred-site.xml.template. The configuration items in this file override the same items in mapred-default.xml.
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
slaves. This file lists the hostnames of the worker nodes:
slave0
slave1
In this case, after configuring the mapred-site.xml.template file, you also need to rename the file by entering the following command:
mv mapred-site.xml.template mapred-site.xml
After completing the above configuration, copy them to the slave in the same way:
scp -r hadoop-2.7.5 root@slave0:~/
scp -r hadoop-2.7.5 root@slave1:~/
2.2.9 Configuring System Environment Variables for Hadoop
Next, configure the system environment variables for Hadoop. Simply append the Hadoop environment variables after the JDK entries in .bash_profile:
export HADOOP_HOME=/root/hadoop-2.7.5
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
Then make them take effect:
source .bash_profile
Then create a new directory on master, slave0, and slave1, that is, execute the following command on each virtual machine:
mkdir /root/hadoopdata
2.2.10 Hadoop Format and Start Hadoop
Let's take the final step to format and start hadoop:
hdfs namenode -format
start-all.sh
After the startup succeeds, enter jps on each node to check the running daemons. If the expected processes appear on master, and slave0 and slave1 have joined the cluster, the configuration is successful.
The command to close is:
stop-all.sh
2.3 Deploy MySQL
Update yum
yum update -y
- Install the wget tool
sudo yum install -y wget
- Use wget to download the mysql yum source:
wget https://dev.mysql.com/get/mysql80-community-release-el7-3.noarch.rpm
- Add mysql yum source.
sudo yum localinstall mysql80-community-release-el7-3.noarch.rpm -y
- Install the yum tool yum-utils .
sudo yum install -y yum-utils
The following operations (checking the available versions and switching between MySQL versions) may be needed depending on your setup:
- Check the available mysql .
yum repolist enabled | grep "mysql.*-community.*"
- View all mysql versions
yum repolist all | grep mysql
The result is similar to the figure below:
- Using a specific version of MySQL
To use MySQL 5.7, the MySQL 8.0 repository must be disabled first.
Disable the MySQL 8.0 repository
sudo yum-config-manager --disable mysql80-community
Enable the MySQL 5.7 repository
sudo yum-config-manager --enable mysql57-community
- Check the currently enabled version of MySQL
yum repolist enabled | grep mysql
- Installing MySQL
sudo yum install -y mysql-community-server
- Start MySQL
sudo service mysqld start
- Checking MySQL Service Status
sudo service mysqld status
- Initializing MySQL
To view the initialization password:
sudo grep 'temporary password' /var/log/mysqld.log
To log in using the initial password:
mysql -u root -p
Change the initial password:
ALTER USER 'root'@'localhost' IDENTIFIED BY 'daasan7ujm^YHN';
- Setting the MySQL Password Policy
Querying the initial MySQL password policy
SHOW VARIABLES LIKE 'validate_password%';
Change password authentication strength
set global validate_password_policy=LOW;
Change password length
set global validate_password_length=6;
At this point the password can already be set as a simple password.
ALTER USER 'root'@'localhost' IDENTIFIED BY '123456';
The password here is the one you set for yourself
- Refresh permission
FLUSH PRIVILEGES;
- Enable MySQL to start on boot
This is done at the Linux command line, not inside the MySQL shell.
systemctl enable mysqld
(For reference only) The MySQL configuration file directory:
/etc/my.cnf
- Configuring the firewall
Set port 3306 open
firewall-cmd --zone=public --add-port=3306/tcp --permanent
Seeing success means the addition was successful.
Restart the firewall
firewall-cmd --reload
Seeing success means the reload was successful.
Verify that 3306 is open successfully
firewall-cmd --zone=public --query-port=3306/tcp
At this point, mysql is already configured in centos.
If you have problems with the above steps, see the Difficulties and Solutions section.
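As an optional sanity check (not part of the installation steps above), connectivity can also be verified from Python with the pymysql library that is installed later in Section 2.9; the credentials below assume the simple root password set earlier.

import pymysql

# Connect with the root account and the password configured above.
conn = pymysql.connect(host='localhost', user='root', password='123456')
with conn.cursor() as cur:
    cur.execute("SELECT VERSION();")
    print(cur.fetchone())   # e.g. ('5.7.xx',)
conn.close()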
2.4 Deploy Hive
Next, let's install and configure hive.
First download Hive from the internet and extract it on the virtual machine (you can rename the extracted folder to hive).
The second step is to configure the environment variables and add hive to the environment variables as follows:
vim ~/.bashrc
Add the following to the top line of the document:
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
export HADOOP_HOME=/root/hadoop-2.7.5
Save the changes, then run source ~/.bashrc to make the configuration take effect immediately.
Next, use vim to create a new configuration file hive-site.xml in the hive conf directory and add the following configuration information to it:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
<description>username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive</value>
<description>password to use against metastore database</description>
</property>
</configuration>
Exit and save, and hive is configured.
As we later use a mysql database to hold hive's metadata, configure mysql to allow hive access.
The first step is to download the mysql-jdbc package, extract the downloaded archive on your system, and copy the driver JAR into the hive/lib directory. Then start MySQL and execute the create statement to create a new hive database:
# This hive database corresponds to the hive at localhost:3306/hive in hive-site.xml and is used to hold hive metadata
mysql> create database hive;
# grant all permissions on all tables of all databases to the hive user; the password must match the ConnectionPassword configured in hive-site.xml (here, hive)
mysql> grant all on *.* to hive@localhost identified by 'hive';
After performing the above actions the configuration of hive has been successfully completed. Next, you can start hive and check that everything is configured correctly
If the following screen appears it means that you have successfully entered the hive interactive execution environment and this part of the configuration is finished.
hive>
2.5 Deploy Sqoop
In this step we configure Sqoop, an open-source tool for transferring data between Hadoop (Hive) and traditional databases (MySQL, PostgreSQL, ...). It can import data from a relational database (e.g., MySQL, Oracle, or Postgres) into Hadoop's HDFS, and export data from HDFS back into a relational database.
First, download the Sqoop 1.4.6 package from the web and unzip it on the CentOS system; for convenience, rename the folder to sqoop.
The second step is the configuration file: modify sqoop-env.sh and add the following information:
export HADOOP_COMMON_HOME=/root/hadoop-2.7.5
export HADOOP_MAPRED_HOME=/root/hadoop-2.7.5
export HIVE_HOME=/usr/local/hive
Step 3 Configure the environment variables
Go to the .bashrc file and add the following:
export HADOOP_COMMON_HOME=/root/hadoop-2.7.5
export HADOOP_MAPRED_HOME=/root/hadoop-2.7.5
export HIVE_HOME=/usr/local/hive
Remember to save and run with source after you have configured the environment variables to make them take effect.
Step 4 Copy the mysql driver package to the SQOOP_HOME/lib directory.
Step 5: Test the connection between Sqoop and MySQL
Be sure MySQL is running before executing:
sqoop list-databases --connect jdbc:mysql://127.0.0.1:3306/ --username root -P
If the databases are listed, Sqoop has connected to MySQL successfully.
2.6 Deploy Eclipse
As our subsequent experiments require the use of the software Eclipse to write Java programs, we have to download eclipse and make the relevant configuration in this step.
The first step is to download eclipse from the official website, here we download the version 2022-06 and unzip it on the linux system after downloading.
In order to start eclipse from the desktop, we need to create a launcher in the terminal:
vi /usr/share/applications/eclipse.desktop
and add the following to this new file:
[Desktop Entry]
Encoding=UTF-8
Type=Application
Name=eclipse
# /opt/eclipse is the directory where eclipse was unpacked; adjust the path if yours differs
Exec=/opt/eclipse/eclipse
GenericName=eclipse
Comment=Java development tools
Icon=/opt/eclipse/icon.xpm
Categories=Application;Development;
Terminal=false
Once saved, you can find eclipse in the folder /usr/share/applications and move it to the desktop. To use eclipse afterwards you don't need to run it through the terminal, you can just double click on the program on the desktop.
The second step is to start creating the project in eclipse and add the JAR packages that you need. These JAR packages are located in the Hadoop installation directory on the Linux system. In order to write a Java application that can interact with HDFS, you generally need to add the following JAR packages to the Java project:
(1) hadoop-common-2.7.5.jar and hadoop-nfs-2.7.5.jar in the "/root/hadoop-2.7.5/share/hadoop/common" directory;
(2) all JAR packages in the directory "/root/hadoop-2.7.5/share/hadoop/common/lib";
(3) hadoop-hdfs-2.7.5.jar and hadoop-hdfs-nfs-2.7.5.jar in the "/root/hadoop-2.7.5/share/hadoop/hdfs" directory;
(4) All JAR packages in the directory "/root/hadoop-2.7.5/share/hadoop/hdfs/lib".
Two selected jar packages are added to the current java project as follows:
As you can see from this screen, hadoop-common-2.7.5.jar and hadoop-nfs-2.7.5.jar have been added to the current Java project. Then, in a similar way, you can click the "Add External JARs..." button again to add all the remaining JARs. Once all the packages have been added, you can click the "Finish" button in the bottom right corner of the interface to complete the creation of the Java project.
To write a java application, find the name of the project you have just created, then right click on the project name and select the "New->Class" menu in the pop-up menu. After "Name", enter the name of the new Java class file, all other default settings can be used, then, click the "Finish" button at the bottom right corner of the interface, a source code file will appear, you can enter the code in the file.
2.7 Deploy Spark
Go to the directory where Spark's compressed package is located and then decompress Spark:
tar -zxf spark-3.2.0-bin-without-hadoop.tgz -C /usr/local/
cd /usr/local
mv ./spark-3.2.0-bin-without-hadoop/ ./spark
chown -R hadoop:hadoop ./spark
Then modify Spark's configuration file spark-env.sh:
cd /usr/local/spark
cp ./conf/spark-env.sh.template ./conf/spark-env.sh
Edit the spark-env.sh file (vim ./conf/spark-env.sh) and add the following configuration information:
export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath)
(Note: the path should point to the hadoop executable in your Hadoop installation's bin directory; in this project that is /root/hadoop-2.7.5/bin/hadoop.)
Use spark shell to attempt to start spark:
bin/spark-shell
If the spark logo appears, it indicates that the installation was successful and can be used.
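If PySpark is also installed (see the environment list in Section 2.0), a quick smoke test from Python is another optional way to confirm that Spark jobs run; it is not part of the official installation steps.

from pyspark import SparkContext

# Run a trivial local job to confirm that Spark works.
sc = SparkContext("local[*]", "SmokeTest")
print(sc.parallelize(range(1, 101)).sum())  # expected output: 5050
sc.stop()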
2.8 Deploy Kafka
Visit https://kafka.apache.org/downloads and download Kafka version 2.8.0. This installation package already includes ZooKeeper, so there is no need to install ZooKeeper separately. Follow the steps in order:
sudo tar -zxf kafka_2.12-2.8.0.tgz -C /usr/local
cd /usr/local
sudo mv kafka_2.12-2.8.0/ ./kafka
Enter Kafka's configuration file directory and modify the configuration file:
cd ./kafka/config
vim server.properties
# server.properties
# Specify the brokerId of the node; the brokerId must be unique within the same cluster
broker.id=0
# Specify the listener address and port; this configuration uses the intranet IP
listeners=PLAINTEXT://192.168.19.110:9092
# Specify the directory where the Kafka log files are stored
log.dirs=/usr/local/kafka/kafka-logs
After completing the modification of the configuration file, in order to facilitate the use of Kafka's command script, we can configure Kafka's bin directory to the environment variable:
vim /etc/profile
# add
export KAFKA_HOME=/usr/local/kafka
export PATH=$PATH:$KAFKA_HOME/bin
source /etc/profile
The installation steps for the other two nodes are also the same, just modify the brokerId and listening IP in the configuration file. So directly copy the Kafka directory from this node to the other two machines:
rsync -av /usr/local/kafka 192.168.19.111:/usr/local/kafka
rsync -av /usr/local/kafka 192.168.19.112:/usr/local/kafka
Kafka distinguishes the different nodes of a cluster by brokerId, so set broker.id to 2 on slave0 and 3 on slave1 (and update the listener IP accordingly).
Test Kafka
First start ZooKeeper, which will run with its default configuration file; do not close this terminal.
# enter the directory where Kafka is installed
cd /usr/local/kafka
bin/zookeeper-server-start.sh config/zookeeper.properties
Start a new terminal and start the Kafka server
bin/kafka-server-start.sh config/server.properties
Start a new terminal and create a topic called BIT with a single-node configuration (one replica, one partition).
cd /usr/local/kafka
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic BIT
We can list all created topics to check that the topic created above exists.
bin/kafka-topics.sh --list --zookeeper localhost:2181
Use the console producer to produce data on the BIT topic:
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic BIT
Input the following messages:
hello hadoop hello bit hadoop world
Use Consumer to receive data
cd /usr/local/kafka
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic BIT --from-beginning
Close Kafka
cd /usr/local/kafka
./bin/kafka-server-stop.sh
./bin/zookeeper-server-stop.sh
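As a further optional check, the deployment can be verified from Python with the kafka-python library used later in Section 3.4. This sketch assumes the BIT topic created above and a broker reachable at localhost:9092; adjust the address to the listener configured in server.properties.

from kafka import KafkaProducer, KafkaConsumer

# Send one test message to the BIT topic ...
producer = KafkaProducer(bootstrap_servers='localhost:9092')
producer.send('BIT', b'hello kafka')
producer.flush()

# ... and read it back from the beginning of the topic (stop after 5 s of silence).
consumer = KafkaConsumer('BIT', bootstrap_servers='localhost:9092',
                         auto_offset_reset='earliest',
                         consumer_timeout_ms=5000)
for msg in consumer:
    print(msg.value.decode('utf8'))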
2.9 Deploy Flask
Flask is a lightweight and flexible web framework for building web applications in Python. It provides a set of tools and libraries for handling common web development tasks, and supports a wide range of third-party extensions and plugins. Flask is designed to be simple and easy to use, making it a great choice for smaller and simpler web applications, but it can also be used to build larger and more complex applications if needed.
Here I will show how to deploy it step by step.
Step 0:
We need to configure a Python environment on CentOS. Here we deploy Miniconda.
First we download the Miniconda installer for Linux from the official website:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
This will download the latest version of Miniconda for Linux in the current directory.
Then we run the installer script using the following command:
bash Miniconda3-latest-Linux-x86_64.sh
This will start the installation process. Follow the prompts to specify the installation location, accept the license agreement, and choose whether or not to add Miniconda to your PATH environment variable.
After that we activate the installation by running the following command:
source ~/.bashrc
This will update your PATH environment variable to include the Miniconda installation.
Finally, we can confirm that Miniconda was deployed successfully:
conda -h
Step 1: Install Flask and pymysql libraries using pip command in your command prompt or terminal:
pip install flask
pip install pymysql
Step 2: Import the necessary libraries in your Python code:
from flask import Flask, render_template, request
import pymysql
Step 3: Create an instance of Flask class:
app = Flask(__name__)
Step 4: Create a connection to your MySQL database using pymysql:
connection = pymysql.connect(
host='localhost',
user='your_username',
password='your_password',
db='your_database_name',
charset='utf8mb4',
cursorclass=pymysql.cursors.DictCursor
)
Note: Replace "your_username", "your_password" and "your_database_name" with your actual MySQL server credentials and database name respectively.
Step 5: Define a route and a function to handle the incoming request. Here's an example:
@app.route('/')
def index():
    # Open a cursor
    with connection.cursor() as cursor:
        # Execute a query
        sql = "SELECT * FROM my_table"
        cursor.execute(sql)
        # Fetch the results
        result = cursor.fetchall()
        # Render the template
        return render_template('index.html', result=result)
Note: Replace "my_table" with your actual table name.
Step 6: Run the Flask app:
if __name__ == '__main__':
    app.run(debug=True)
Here's the complete code with all the above steps:
from flask import Flask, render_template, request
import pymysql

app = Flask(__name__)

# MySQL database configuration
connection = pymysql.connect(
    host='localhost',
    user='your_username',
    password='your_password',
    db='your_database_name',
    charset='utf8mb4',
    cursorclass=pymysql.cursors.DictCursor
)

@app.route('/')
def index():
    # Open a cursor
    with connection.cursor() as cursor:
        # Execute a query
        sql = "SELECT * FROM my_table"
        cursor.execute(sql)
        # Fetch the results
        result = cursor.fetchall()
        # Render the template
        return render_template('index.html', result=result)

if __name__ == '__main__':
    app.run(debug=True)
Note: Replace "index.html" with your actual HTML template file name.
In the Visualisation Analysis part, we will show more details about the Front-End.
3 Functional Implementation and Details
3.1 Data Acquisition and Preprocessing
3.1.1 Data Acquisition
Compared to traditional data collection, big data collection is significantly different in terms of data type and source diversity. As individuals, we can usually collect relatively large datasets through the following means:
1) Directly obtaining web data from internet users through web crawlers.
2) Downloading data from certain government departments, companies or institutions that provide open data resources or data laboratories.
Here, we use an open dataset provided by the Xiamen University Database Laboratory, which contains transaction data from the Taobao website. The dataset consists of three files, user behavior log file: user_log.csv, repeat customer training set: train.csv, and repeat customer test set: test.csv. The data format for these three files is defined as follows:
a.User behavior log file user_log.csv, the field definitions in the log are as follows:
user_id: buyer ID
item_id: product ID
cat_id: product category ID
merchant_id: seller ID
brand_id: brand ID
month: transaction time - month
day: transaction time - day
action: behavior, with values 0, 1, 2, 3, where 0 represents click, 1 represents adding to cart, 2 represents purchase, and 3 represents product attention.
age_range: buyer age segment, where 1 represents age <18, 2 represents age [18,24], 3 represents age [25,29], 4 represents age [30,34], 5 represents age [35,39], 6 represents age [40,49], 7 and 8 represent age >=50, 0 and NULL represent unknown.
gender: gender, where 0 represents female, 1 represents male, 2 and NULL represent unknown.
province: province of the shipping address.
b.The repeat customer training set train.csv and the repeat customer test set test.csv have the same fields, which are defined as follows:
user_id: buyer ID
age_range: buyer age segment, where 1 represents age <18, 2 represents age [18,24], 3 represents age [25,29], 4 represents age [30,34], 5 represents age [35,39], 6 represents age [40,49], 7 and 8 represent age >=50, 0 and NULL represent unknown.
gender: gender, where 0 represents female, 1 represents male, 2 and NULL represent unknown.
merchant_id: seller ID
label: whether or not the customer is a repeat customer, where 0 represents not a repeat customer, 1 represents a repeat customer, and -1 represents that the user has exceeded the prediction range we need to consider. NULL values only exist in the test set and represent values that need to be predicted.
3.1.2 Data Preprocessing
1) Delete the first line of the file
The first line of user_log.csv contains the field names. When the data is imported into the Hive data warehouse, this header row is not needed, so it is deleted during preprocessing.
2) Extract the first 10,000 records of the dataset
Since the transaction data in the full dataset is too large, we take the first 10,000 records and use them as a small dataset, small_user_log.csv. (A small Python sketch of steps 1) and 2) is given at the end of this subsection.)
3) Import the data into the database
Import the data in small_user_log.csv into the Hive data warehouse. To do that, we first upload the file to the distributed file system HDFS and then create external tables in Hive to import the resulting data. First, we start HDFS; since HDFS is a core component of Hadoop, we only need to start Hadoop. Second, we upload the file from the Linux local file system to HDFS: create a new directory dbtaobao under the HDFS root directory, create a subdirectory dataset under it, and save the file to the /dbtaobao/dataset directory in HDFS.
To create a database in Hive, we use a MySQL database to store the metadata. Before starting Hive, we need to start the MySQL database. Since Hive is a data warehouse built on Hadoop, all query statements written in HiveQL language are automatically parsed into MapReduce tasks executed by Hadoop. Hence, we need to start Hadoop first. As we have already started Hadoop earlier, there is no need to start it again. In the next step, we can open a new terminal and execute the following command to enter Hive:
cd /usr/local/hive
./bin/hive
After successful startup, you are at the "hive>" command prompt, where HiveQL statements (similar to SQL) can be entered. We then create a database dbtaobao in Hive by running the following commands:
hive> create database dbtaobao;
hive> use dbtaobao;
When Hive creates an internal table, the data is moved to the path pointed to by the data warehouse. When Hive creates an external table, only the data path is recorded and the data location is not changed. When Hive deletes a table, the metadata of an internal table is deleted together with the data. An external table is therefore safer and its data organization more flexible, making it easier to share source data, so we use an external table to store the data. We create an external table user_log in the database dbtaobao; it contains the fields (user_id, item_id, cat_id, merchant_id, brand_id, month, day, action, age_range, gender, province).
hive> CREATE EXTERNAL TABLE dbtaobao.user_log(user_id INT,item_id INT,cat_id INT,merchant_id INT,brand_id INT,month STRING,day STRING,action INT,age_range INT,gender INT,province STRING) COMMENT 'Welcome to xmu dblab,Now create dbtaobao.user_log!' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/dbtaobao/dataset/user_log';
The small_user_log.csv data in the HDFS directory "/dbtaobao/dataset/user_log" has now been successfully loaded into the Hive data warehouse. You can run the following command to query the data:
hive> select * from user_log limit 10;
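The following is a small Python sketch of preprocessing steps 1) and 2) above (dropping the header line and truncating the file). It assumes the raw user_log.csv is in the current directory; in practice the same result can be obtained with shell tools.

# Drop the header row of user_log.csv and keep the first 10,000 records
# as the small dataset small_user_log.csv.
with open('user_log.csv', 'r', encoding='utf-8') as src, \
     open('small_user_log.csv', 'w', encoding='utf-8') as dst:
    next(src)                      # skip the field-name header line
    for i, line in enumerate(src):
        if i >= 10000:             # keep only the first 10,000 records
            break
        dst.write(line)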
3.2 Hive Data Analysis
Hive is a data warehouse analysis system built on Hadoop. It provides rich SQL-style query capabilities for analyzing data stored in the Hadoop distributed file system: structured data files can be mapped to database tables, and queries written in its SQL dialect (Hive SQL, or HiveQL, for short) are converted into MapReduce jobs to run. This enables users who are unfamiliar with MapReduce to query, summarize, and analyze data using the SQL language.
1) View tables in Hive
After MySQL, Hadoop, and Hive have started successfully, the hive> command prompt is displayed. Run the following commands:
hive> use dbtaobao;
hive> show tables;
hive> show create table user_log;
2) Simple query analysis
We can check the brand of goods in the top 10 transaction logs, query the time when goods were purchased and the type of goods in the top 20 transaction logs, or query in the table by using nested statements.
hive> select brand_id from user_log limit 10;
hive> select month,day,cat_id from user_log limit 20;
hive> select ul.at, ul.ci from (select action as at, cat_id as ci from user_log) as ul limit 20;
3) Statistical analysis
Beyond simple queries, we can add more conditions to the select statement and use aggregate functions to compute what we want.
- Use the aggregate function count() to calculate the number of rows in the table.
hive> select count(*) from user_log;
- Add distinct to the count to find the number of distinct user IDs.
hive> select count(distinct user_id) from user_log;
- Query how many records are not duplicated (to rule out order brushing, i.e., fake repeated transactions).
hive> select count(*) from (select user_id,item_id,cat_id,merchant_id,brand_id,month,day,action from user_log group by user_id,item_id,cat_id,merchant_id,brand_id,month,day,action having count(*)=1)a;
As you can see, after excluding duplicates, there are only 4754 records.
4) Keyword query and analysis
- Query with a keyword condition, for example how many people purchased goods on Nov. 11 (action = 2 indicates a purchase).
hive> select count(distinct user_id) from user_log where action='2';
- Condition the keyword on a given value to analyze other data. For example, for a given day and a given brand, count the number of items of that brand purchased that day.
hive> select count(*) from user_log where action='2' and brand_id=2661;
5) Analysis by user behavior
- Query the purchase or browsing proportion of products on a certain day: count how many users bought products and how many users visited the site on that day.
hive> select count(distinct user_id) from user_log where action='2';
hive> select count(distinct user_id) from user_log;
Dividing the first result (the number of buyers) by the second (the number of all users) gives the purchase rate for that day.
- Check the proportion of male and female buyers buying goods on a specific day.
hive> select count(*) from user_log where gender=0;
hive> select count(*) from user_log where gender=1;
Divide the results of the above two statements to get the ratio you want.
- Given a purchase-quantity threshold, query the IDs of users who purchased more than that number of items on this website on a certain day (here, more than five).
hive> select user_id from user_log where action='2' group by user_id having count(action='2')>5;
6) Real-time user query analysis
- Count the number of purchases for each brand and store the result in a new table scan.
hive> create table scan(brand_id INT,scan INT) COMMENT 'This is the search of bigdatataobao' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
hive> insert overwrite table scan select brand_id,count(action) from user_log where action='2' group by brand_id;
hive> select * from scan;
3.3 MySQL Data Storage
In the previous step, we have stored the data in Hive and performed basic data analysis using the tools provided by Hive. Next, we will store the data in MySQL for visualization purposes.
1) Starting Hive
If Hive is not running, first start the MySQL service, which stores Hive's metadata. Hive is a data warehouse built on Hadoop: query statements written in HiveQL are parsed by Hive into MapReduce jobs and executed by Hadoop, so you need to start Hadoop and then Hive.
Next, at the "hive>" command prompt, execute the following commands:
- Create a temporary table inner_user_log:
hive> create table dbtaobao.inner_user_log(user_id INT,item_id INT,cat_id INT,merchant_id INT,brand_id INT,month STRING,day STRING,action INT,age_range INT,gender INT,province STRING) COMMENT 'Welcome to XMU dblab! Now create inner table inner_user_log ' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
After the command executes, Hive automatically creates the corresponding data directory in HDFS: "/user/hive/warehouse/dbtaobao.db/inner_user_log".
- Insert the data from the user_log table into inner_user_log:
In the step of uploading data to Hive, we have created an external table user_log in the dbtaobao database of Hive. Insert the dbtaobao.user_log data into the dbtaobao.inner_user_log table with the following command:
hive> INSERT OVERWRITE TABLE dbtaobao.inner_user_log select * from dbtaobao.user_log;
- Use Sqoop to import data from Hive to MySQL
First, we need to import the generated temporary table data from Hive to MySQL, which involves the following four steps.
a. Log in to the MySQL database. Run the following command:
mysql -u root -p
b. Create a database
show databases;
create database dbtaobao;
use dbtaobao;
c. Create a table
The following creates a new table user_log in the MySQL database dbtaobao and sets its encoding to utf8 so that Chinese content can be stored correctly:
CREATE TABLE `dbtaobao`.`user_log` (`user_id` varchar(20),`item_id` varchar(20),`cat_id` varchar(20),`merchant_id` varchar(20),`brand_id` varchar(20),`month` varchar(6),`day` varchar(6),`action` varchar(6),`age_range` varchar(6),`gender` varchar(6),`province` varchar(10)) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Since sqoop will cast data to String in the process of fetching data, the type of each column needs to be set to varchar in the process of building a table in MySQL. After the MySQL database is successfully created, run exit to exit the mysql database.
d. Import data
After returning to the shell command format, we need to carry out the data import operation:
cd /usr/local/sqoop
bin/sqoop export --connect jdbc:mysql://localhost:3306/dbtaobao --username root --password 123456789 --table user_log --export-dir '/user/hive/warehouse/dbtaobao.db/inner_user_log' --fields-terminated-by ',';
In the preceding command, export indicates copying data from Hive to MySQL; --connect jdbc:mysql://localhost:3306/dbtaobao specifies the connection address and target database; --username root and --password give the MySQL username and password; --table user_log indicates the table into which data will be imported; --export-dir '/user/hive/warehouse/dbtaobao.db/inner_user_log' specifies the Hive data directory to export from; and --fields-terminated-by ',' gives the field delimiter of the files exported from Hive.
- Start MySQL to check data
First, log into MySQL and enter the corresponding password to reach the mysql> prompt. Then run the following commands to query the first ten rows of the user_log table:
use dbtaobao;
select * from user_log limit 10;
Results like the one below show success:
3.4 Structured Streaming Real-Time Processing
3.4.1 Task: Count Number by Sex
Our task is to compute, in real time, the number of male and female buyers shopping per second.
The dataset contains three files: user_log.csv, the repeat-customer training set train.csv, and test.csv. In this case only user_log.csv is used. Each shopping log records one transaction, but we only need the gender field, which is extracted and sent to Kafka. Structured Streaming then receives the gender values and performs a word count, and finally a Kafka producer publishes the results, which a consumer receives.
3.4.2 Send Preprocessed Data to Kafka
Instantiate a Kafka producer used to deliver messages to Kafka.
from kafka import KafkaProducer
import csv
import time

producer = KafkaProducer(bootstrap_servers='localhost:9092')
Read the user log file one line at a time and send one line to Kafka every 0.1 seconds, so that 10 shopping logs are sent per second. The Kafka topic is named 'sex'.
with open('user_log.csv', 'r') as f:   # the shopping log file described above
    reader = csv.reader(f)
    for line in reader:
        gender = line[9]               # gender is the element at index 9 of each log line
        if gender == 'gender':         # skip the header row if present
            continue
        time.sleep(0.1)                # send one line of data every 0.1 seconds
        producer.send('sex', line[9].encode('utf8'))
Start Kafka, KafkaProducer, and KafkaConsumer on the terminal respectively.
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
python3 producer.py
Write a KafkaConsumer to test whether the data was successfully delivered.
from kafka import KafkaConsumer
consumer = KafkaConsumer('sex')
for msg in consumer:
    print((msg.value).decode('utf8'))
Start the consumer. Seeing the gender values printed in the terminal running consumer.py means the data was successfully delivered.
python3 consumer.py
3.4.3 WordCount
Structured Streaming reads the data on the Kafka topic "sex" (a stream such as 1,1,0,2,..., where 0 represents female and 1 represents male; 2 and null values are not considered). This is a typical word-count problem computed on a Spark stream: counting the occurrences of 0 gives the number of female buyers, and counting 1 gives the number of male buyers.
Therefore, using the groupBy interface of Structured Streaming, we set the window size to 1 second and the sliding step to 1 second; the resulting counts of 0 and 1 are the numbers of female and male buyers per second.
windowedCounts = df \
.withWatermark("timestamp", "1 seconds") \
.groupBy(
window(col("timestamp"), "1 seconds" ,"1 seconds"),
col("value")) \
.count()
wind = windowedCounts.selectExpr( "CAST(value AS STRING)","CAST(count AS STRING)")
The implementation steps are as follows: create a Spark Structured Streaming query that reads the stream from the Kafka topic 'sex'. Once the query is started, whenever a new message arrives on the topic, Spark automatically reads it and converts it into a DataFrame for further stream processing and output. The numbers of male and female buyers appearing in each 1-second window are counted, encapsulated as JSON in the format [{gender: count}], and sent back to Kafka using a KafkaProducer. Detailed code can be found in the project files.
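For reference, the following is a hedged, self-contained sketch of such a query, incorporating the windowedCounts definition shown above. It writes the results back to Kafka with Structured Streaming's built-in Kafka sink rather than a separate KafkaProducer, so it is an illustration rather than the project's exact code; the output topic name 'result' and the checkpoint directory are assumptions, and the script must be submitted with the spark-sql-kafka package as in startup.sh below.

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col, to_json, struct

spark = SparkSession.builder.appName("SexCount").getOrCreate()

# Read the gender stream from the Kafka topic 'sex'.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "sex")
      .load()
      .selectExpr("CAST(value AS STRING)", "timestamp"))

# 1-second tumbling-window count per gender value (as in windowedCounts above).
windowedCounts = (df.withWatermark("timestamp", "1 seconds")
                  .groupBy(window(col("timestamp"), "1 seconds", "1 seconds"),
                           col("value"))
                  .count())

# Publish the {gender: count} pairs back to Kafka as JSON.
query = (windowedCounts
         .select(to_json(struct(col("value"), col("count"))).alias("value"))
         .writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("topic", "result")                         # output topic name is an assumption
         .option("checkpointLocation", "/tmp/sex-count-checkpoint")
         .outputMode("update")
         .start())

query.awaitTermination()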
3.4.4 Run and Result
Write a run script: create a new startup.sh file with the following content:
# The four input parameters represent:
# 1. zkQuorum: the ZooKeeper address
# 2. group: the consumer group the consumer belongs to
# 3. topics: the topics consumed by the consumer
# 4. numThreads: the number of threads used to consume the topics
/usr/local/spark/bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0 /usr/local/spark/mycode/kafka/kafka_test.py 127.0.0.1:2181 1 sex 1
Run the script to execute the Structured Streaming program:
sh startup.sh
Once it is running, start the KafkaProducer to post messages, then change the topic subscribed to by the KafkaConsumer to verify that messages on the specified topic can be received. On success, output like the following appears in the terminal running the KafkaConsumer:
4 Results and Demonstration
4.1 Visualisation Analysis & Building Dynamic Web Applications
In this part we have implemented a web application based on Flask and PyMySQL, with the main function of connecting to MySQL databases and querying data. Specifically, its functions are as follows:
- Users can enter the name of a province on the webpage and submit it. The program will query the database and display the data of that province in a table format;
- Users can browse all data on the webpage and present it in tabular form;
- Users can view the display of all data on the timeline and map on the webpage.
The following is the functional organization of each file:
| File name | Function |
| --- | --- |
| main.py | Imports the Flask and PyMySQL modules |
| sql_connect.py | Defines the Mysql class for connecting to and operating on the MySQL database |
| sql_flask.py | Defines three routing functions (/query, /query1, and /trend) and uses the Mysql class to interact with the database |
| templates/__init__.py | Initialization file of the templates Python module; a tool for creating, loading, and rendering templates |
(a)Visualization\main.py
This is a Python file that uses Flask and PyMySQL to create web applications and connect to MySQL databases through the PyMySQL library. This file is mainly responsible for introducing the required dependency modules.
import flask
import pymysql
(b)Visualization\sql_connect.py
This program defines a Python class named Mysql, used to connect to and operate on the MySQL database. The constructor of the class attempts to connect to the MySQL database and create a cursor object; if the connection succeeds it prints "Connection successful!", and if it fails it prints "Connection failed!". The getdata method executes an SQL statement to query data in the database and returns the results. Finally, the destructor of the class closes the database connection.
import pymysql
class Mysql(object):
    def __init__(self):
        try:
            self.db = pymysql.connect(host="localhost", user="root", password="123456", database="dbtaobao")
            # cursor object
            self.cursor = self.db.cursor()
            print("Connection successful!")
        except:
            print("Connection failed!")

    def getdata(self, province='甘肃'):
        sql = """select * from user_log where province= '{}';""".format(province)
        # execute the SQL statement
        print(sql)
        self.cursor.execute(sql)
        # fetch all records
        results = self.cursor.fetchall()
        print(results)
        return results

    # close the connection
    def __del__(self):
        self.db.close()
(c)Visualization\sql_flask.py
This program is a web application built on the Flask framework (file name sql_flask.py). It imports Flask, render_template, request, pymysql, and the Mysql class, and defines three routing functions: /query, /query1, and /trend. The /query route responds to GET and POST requests: it reads the form input named province, queries the database through the getdata method of the Mysql class, and renders the query results into the sql_select.html template returned to the client. The /query1 and /trend routes respond to GET requests by returning the sql_select.html and trend.html templates, respectively. When the program is run directly, it starts the Flask server in debug mode, listening on 127.0.0.1:5000.
from flask import Flask, render_template, request
import pymysql
# import the database helper class
from sql_connect import Mysql

app = Flask(__name__)

@app.route("/query", methods=['GET', 'POST'])
def info():
    # query the database for the requested province
    db = Mysql()
    province = request.form['province']
    results = db.getdata(province=province)
    return render_template("./sql_select.html", results=results)

@app.route("/query1")
def move_to_query():
    return render_template("./sql_select.html")

@app.route("/trend")
def move_to_trend():
    return render_template("./trend.html")

if __name__ == "__main__":
    app.run(debug=True, port=5000, host='127.0.0.1')
(d) Visualization\templates\__init__.py
This is the initialization file of the templates Python module, with the path Visualization\templates\__init__.py. It is a tool for creating, loading, and rendering templates. The file itself does not contain a concrete implementation; it performs operations such as importing sub-modules or setting module-level variables.
Moreover, the directory Visualization\templates contains the HTML template files required by the Flask application to present query results and visualize data. The template files and their functions are organized as follows:
| File name | Function |
| --- | --- |
| base.html | The base template of the website, defining the page title, style, navigation bar, etc. |
| sql_select.html | Template for presenting database query results in a table |
| trend.html | Template for visualizing data on the timeline |
| heatmap.html | Template for visualizing data on maps |
| static/css/style.css | CSS style file for the website |
| static/img/map.png | Map image used by the website |
4.2 Presentation of Results
Below is the front page of the front end, showing the different data in graphical form.
The following is the data for each of these modules:
(It is noteworthy that the right image above shows the real-time buyer gender data after kafka processing)
The following is the interface for database queries:
Here are the results of the search
5 Difficulties and Solutions
- Configuring a static network
When configuring the static network, you may see the error "Name or service not known" when pinging.
It should be a DNS configuration problem:
vi /etc/resolv.conf
nameserver 8.8.8.8
nameserver 8.8.4.4
After that you should be able to ping again without any problems
- The java and hadoop commands do not work after installation. Sync and reboot the system:
sync
reboot
- Changing the mysql default character set under centos
Enter the MySQL console
mysql -u root -p
Enter the password
To view the current mysql operational status
mysql>status
By default, the last four character-set entries in the output are latin1.
Modify mysql configuration file
Default location: /etc/my.cnf
Go to the /etc directory and run vim my.cnf
If the file has no [client] section, add one first, then add default-character-set=utf8 under [client]. (Other settings mentioned on the web belong to earlier MySQL versions and are no longer used.)
Add the following to the [mysqld] section:
character-set-server=utf8
collation-server=utf8_general_ci
:wq! # save exit
systemctl restart mysqld.service # Restart MySQL
Check the current MySQL running status again:
mysql>status
All encodings should be UTF-8 at this point
- Error: Unable to find a match: mysql-community-server
Execute the following commands:
yum module disable mysql
sudo yum install -y mysql-community-server
- The public key for mysql-community-client-5.7.39-1.el7.x86_64.rpm is not yet installed.
Execute the following command:
rpm --import https://repo.mysql.com/RPM-GPG-KEY-mysql-2022
Cause:
The MySQL GPG key may have expired; importing the new key fixes it.
6 Conclusion
6.1 Conclusion
This report mainly introduces the design, development, and deployment process of an e-commerce data analysis system.
Firstly, the background and functions of the system were introduced, followed by a detailed walkthrough of how to install and deploy the various components required for the e-commerce data analysis system on CentOS. Then the process of data collection, preprocessing, storage, and analysis was described through concrete implementation details, including the use of Hadoop and Hive for data analysis. Finally, we described how to store the results in MySQL and build a dynamic web application with Flask, with examples of the Python modules and HTML template files involved.
The structure of our project is shown below:
6.2 Limitations
1) Since the dataset we downloaded is well-organized data with high integrity, little preprocessing was required. In reality, raw data obtained through a crawler may need cleaning, feature selection, and other preprocessing before it can be used.
2) The classical support vector machine algorithm only provides two-class classification, while practical data-mining applications generally need to solve multi-class problems, so the applicability of this approach is not broad enough.
3) Although big data processing techniques are used, they may be limited by hardware resources when dealing with large-scale data. The memory and storage space initially allocated to the virtual machines was so small that opening Eclipse and PyCharm later in the project often led to lagging and even crashes.
4) When using Kafka stream processing, we overcame issues including, but not limited to, mismatches between the Spark, Kafka, and Scala versions, missing dependency jar packages, and failures to recognise or find the "sbt-chain" resolver.
However, when streaming using the Kafka data source (KafkaV2) and attempting to write the results to an external system or target, there are often problems that cause the write job to abort. This makes it difficult to get the results of the stream processing to appear, and we have chosen to solve this by saving the results of the stream processing as they appear and then processing them manually before delivering them to Flask.
5) Due to time constraints, the front end still has a few minor imperfections and there is plenty of room to polish it.
6.3 Future Work
1) Obtain more raw data through static or dynamic web crawlers, improving the independence of the project and reducing dependence on existing datasets.
2) The multi-class classification problem can be solved by combining multiple two-class support vector machines, for example with one-vs-rest, one-vs-one, or SVM decision-tree schemes, or by constructing an ensemble of classifiers. The main idea is to overcome the inherent shortcomings of SVM by combining it with the strengths of other algorithms to improve accuracy on multi-class problems, for example combining it with rough set theory to form a classifier with complementary advantages.
3) In larger projects, allocate more memory and storage to the virtual machines from the start, or monitor memory usage carefully if not running in a virtual machine.
4) In the outlook for future work, we need to address the issues that can arise when streaming from Kafka data sources and writing the results to external systems. To do this, we can take a number of steps to improve the reliability and stability of stream processing. Firstly, introduce a failure recovery mechanism, such as Spark's checkpointing mechanism, so that jobs can be restored to their previous state in the event of failure. Secondly, a monitoring and alerting system can be set up to track job status and performance metrics in real time and to identify and resolve potential problems promptly. Properly set watermarks can control processing delays and ensure the accuracy of results, and early-warning mechanisms provide timely alerts for early action. We can also use high-availability and scalability features to scale and adjust resources seamlessly, implement data replay and backtracking to reprocess past data and fix errors, and conduct regular fault-tolerance and stress tests to verify system performance under high load, exceptions, and failures. These measures will help improve the reliability, performance, and stability of the stream-processing system, ensuring smooth processing of data streams and accurate delivery of results.
7 Acknowledgement
Thanks to my friends. GAC, HZJ, LQB, LZY, WQN and YXY. They all made important contributions to this project.
8 Debug Log
Spark Cluster Construction
https://www.cnblogs.com/zhangyongli2011/p/10572152.html
May 13:
Reference post: https://dblab.xmu.edu.cn/blog/1714/
When configuring environment variables, when editing spark-env.sh,
export SPARK_HOME=.......
export PATH=......
The paths of the two files are not preceded by /root
And then it worked.
Log: PyCharm Deploy PySpark
Reference post:
0512:
The first attempt was to install pyspark(300+MB) directly, but it failed.
https://blog.csdn.net/sinat_26599509/article/details/51895999
Following plan B from the above post, modified .bashrc with vim and then ran source, but it did not help.
0513:
http://t.csdn.cn/FLeMH
According to the above post, .bashrc was modified as follows:
Then ran source ./.bashrc twice, still no effect. Restarting PyCharm helped somewhat.
There is no kafka folder under spark/python/pyspark/ and no corresponding library in Spark 3. Downloaded pyspark 2.4.6 for this purpose, compressed the pyspark folder from that version, and used it to replace /usr/local/spark/python/lib/pyspark.zip. Then modified pyspark.zip/streaming/__init__.py under that path:
from pyspark.streaming.kafka import KafkaUtils
__all__ = ['StreamingContext', 'DStream', 'StreamingListener', 'KafkaUtils']
This still did not solve it. Created a new file mykafka.py in the kafka folder, pasted the KafkaUtils class from the pyspark/streaming/kafka file into it, and changed the import in the test file to from mykafka import KafkaUtils; the error no longer appeared. Followed the tutorial for the next steps.
http://t.csdn.cn/n9GwL
Following the idea of this post, unzipped the 'py4j' and 'pyspark' folders from the python directory of the Spark installation, but could not find the path from which PyCharm currently imports packages. That path needs to be found so that the extracted py4j and pyspark folders (already on the desktop) can be placed there, after which they should be importable.
Also, pip install would inexplicably break; tried commands such as
pip install --upgrade pip -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install --upgrade pip (These two commands are attempts to update the pip)
pip install pyspark
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pyspark (also failed)
0514:
pip3 install py4j==0.10.9
Put a jar package in the path /usr/local/spark/jars:
spark-streaming-kafka-0-10_2.12-3.2.0.jar
In startup.sh, write the following command
/usr/local/spark/bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0 --driver-class-path /usr/local/spark/jars/*:/usr/local/spark/jars/kafka/* --class "org.apache.spark.examples.streaming.KafkaWordCount" /usr/local/spark/mycode/kafka/target/scala-2.12/simple-project_2.12-1.0.jar 127.0.0.1:2181 1 sex 1
9 Reference
Spark课程综合实验案例:淘宝双11数据分析与预测_厦门大学数据库实验室 (xmu.edu.cn)
Spark课程实验案例:Spark+Kafka构建实时分析Dashboard(免费共享)_厦门大学数据库实验室 (xmu.edu.cn)
大数据架构设计_csdn 大数据 设计_wandy0211的博客-CSDN博客
基于Spark2.x新闻网大数据实时分析可视化系统项目_xl.zhang的博客-CSDN博客
如何用XShell连接另一台电脑上的LINUX虚拟机 - 简书 (jianshu.com)
使用Xshell连接到另一台电脑的虚拟机(NAT网络)_Myzrsweety的博客-CSDN博客
如何用Xshell连接另一台电脑上的虚拟机_xshell连接另一台电脑虚拟机_Larry酷睿的博客-CSDN博客
基于Python语言的Spark数据处理分析案例集锦(PySpark)_厦大数据库实验室博客 (xmu.edu.cn)
No module named 'pyspark.streaming.kafka' - 木叶流云 - 博客园 (cnblogs.com)
Spark+Kafka构建实时分析Dashboard案例(2022年9月V2.0版)——步骤三:Structured Streaming实时处理数据(scala版本)_厦大数据库实验室博客 (xmu.edu.cn)
前端必看的数据可视化入门指南 - 知乎 (zhihu.com)
使用百度地图官方WEB API,提示 “ APP 服务被禁用“ 问题的解决方法_.猫的树的博客-CSDN博客
控制台 | 百度地图开放平台 (baidu.com)
Hadoop 和 BI 如何结合?搭建一个基于 Hadoop+Hive 的数据仓库,它的前端展现如何实现?如何实现 BI? - 知乎 (zhihu.com)