这篇文章是从saphana.com上转载的,文档对HANA数据载入,数据模型,性能优化等方面进行了一些分析,是一篇值得学习的文章。
I've been involved in a few SAP HANA projects where performance has been initially disappointing, and I have been asked to come and show the customer HANA speed. As I've said before, you can take the best technology in the world, create a bad design, and it will work badly. Yes, even SAP HANA can be slow.
With that in mind, here are my best practices for HANA data loading.
1) Build the right architecture
Every technology has limits, and HANA is constrained by the same laws of physics as any other computer system. The major differences with HANA are:
a) Data is kept entirely in-memory and therefore accessing all data has the same latency
b) Data is compressed in-memory and only uncompressed in the CPU itself
c) Data may be loaded into all cores (up to 80) simultaneously and a single instruction executed against all at the same time (massively parallel)
If you think about all this, then by far the major challenge with HANA is getting optimized data in fast enough. Networks bottom out at 300MB/sec, disks are often even slower and if you need to load HANA data fast, then you need to think carefully about your design.
2) Optimizing the data model
HANA doesn't work like Oracle, so you need to adjust your thinking. Here are the golden rules for HANA tables:
- Keep tables as narrow as possible. HANA has a cost for columnar inserts, so you get a reduction in data load performance with an increase in the number of columns. Don't build out fields you don't need. Interestingly, very wide tables have lower throughput in MB/sec than narrower tables. Once you get over 80-100 columns, you will see up to a 25% degradation in throughput.
- Consider your partitioning strategy. On a single node appliance 3-5 partitions is normally a good number for optimal insert performance. On a scale-out appliance, you can multiply this by the number of active nodes over which you distribute the tables.
- Avoid using keys if you don't need them. Actually, avoid using the HANA database for referential integrity if you can, it is expensive.
- Avoid using VARCHARs if you don't need them. Fixed width CHAR is better for smaller fields, and INTEGER fields are even better. VARCHARs have a cost, and don't compress as well.
Good | Bad |
---|---|
CREATE COLUMN TABLE "SCHEMA"."TABLE" ( "TIME" TIME, "FIELD1" CHAR(1), "FIELD2" CHAR(16), "FIELD3" CHAR(4), "MEASURE1" INT, "MEASURE2" DECIMAL(7,4)) PARTITION BY ROUNDROBIN PARTITIONS 12; | CREATE COLUMN TABLE "SCHEMA"."TABLE" ( "TIME" TIMESTAMP, "FIELD1" VARCHAR(255), "FIELD2" VARCHAR(255), "FIELD3" VARCHAR(255), "MEASURE1" BIGINT, "MEASURE2" DECIMAL(15,7) "UNUSED_FIELD" VARCHAR(2000)); |
Note that in my test example here, the table on the left loads 6x faster. Note that I use 12 partitions because I have a 4-node environment with one master node, and the table partitions distributed over the remaining nodes - 4 partitions in each of 3 nodes. This gives maximum table distribution and therefore throughput.
Note that merge deltas operate better with multiple partitions. But if you really want to learn about table partitioning, read "Table Partitioning in the SAP HANA database" section in the "SAP HANA Administration guide", here: SAP HANA Appliance Software – SAP Help Portal Page
3) Get the upper bound by optimizing a control file load
I always test control file loads with HANA first, using CSV files. If your files are in the right format then control files are always the fastest way to load into HANA. They were written by the bulk loader team and you can get amazing load performance. This is how I do it:
a) Turn off database logging
On a single-node appliance, or a scale-out appliance like IBM or Hitachi that uses Fusion-IO for logs, you will get the same performance with and without logging enabled. The only difference is you will fill up the log disk and crash the database. Turn them off until you do your final testing.
You do this by double clicking on your database node on the left hand side of HANA Studio, click Configuration -> global.ini -> persistence. Set enable_auto_log_backup to no and log_mode to overwrite.
b) Get the file in CSV format with the minimum number of columns
I usually use bash script to do this, with a combination of grep, awk, sed, cut and even perl, if it gets ugly. You can convert fixed format, or any format, to nice clean CSV files. Perl even has a CSV library that you can use to do this. Here's a neat example that sucks in your CSV and gives you the number of rows, columns and the maximum length of a column.
#!/bin/bash
typeset -a a END
let a=1
COLUMNS=`awk 'BEGIN {FS=","} ; END{print NF}' test.csv`
ROWS=`wc -l < test.csv`
echo $ROWS ' rows'
echo $COLUMNS ' columns'
while ((a
echo -n $a
cut -f $a -d\, < test.csv | awk '{ if (length($0) > max) {max = length($0); maxline = $0} } END { print maxline }'| wc -c
let a++
done
Quite often, I create demo data in my lab based on a customer's schema. I use one of a few tricks to do this:
i) Excel
Excel is great for this and you can create a million rows at a time using tricky functions like this (put your values, weighted, in Sheet 2 Column A):
=INDEX(Sheet2!$A:$A,RANDBETWEEN(1,COUNTA(Sheet2!$A:$A)),1)
Then I copy the CSV file to the HANA appliance and run a bunch of commands like:
for a in {1..9} do cat test.csv |sed s/2013-02-04/2013-02-0$a/; done >> testbig.csv
for a in {10..28} do cat test.csv |sed s/2013/02-04/2013-02-$a/; done >> testbig.csv
This allows me to explode a 1m row file to as many rows as I like.
ii) Data Generator
For more complex files I use a tool called DTM Data Generator. This allows me to generate complex files at about 10,000 rows/sec to a CSV file (don't use it to generate data directly into HANA, performance sucks). Overnight I can generate about 360m rows, which is enough for testing. It costs $149 for the Standard Edition, which allows 10m rows per job. So I create the job I want and duplicate it 50 times. If I wasn't so cheap, I'd buy the Enterprise Edition at $349, which allows 2bn rows per job and multithreads.
c) Time the load of the file to /dev/null
# du -m table.csv
30358 table.csv
# time cat table.csv > /dev/null
real 0m9.667s
So I've put this 346m row, 30GB file on a Fusion IO disk and I can read it at 3140MB/sec (or 35.8m rows/sec). I know now that this is the absolute maximum possible to ingest into HANA. Bear in mind that in the real world, it's really hard to move data around at more than 100MB/sec - networks are slow.
d) Build a control file
Now I create a control file, like this:
cat > /sapmnt/log/test.ctl
import data
into table SCHEMA.TABLE
from '/sapmnt/log/test.csv'
record delimited by '\n'
field delimited by ','
optionally enclosed by '"'
error log /sapmnt/log/test.bad
And what's great is we can iterate over a) the number of partitions b) the number of threads c) the batch size
e) Make sure the load runs well
Make sure you aren't generating lines in the BAD file, because this slows downloading substantially. The data needs to be clean. Fix it with the UNIX tools above (usually sed, cut, awk, grep) if necessary.
f) Iterate, iterate, iterate
I make sure my CSV file has enough data to make HANA think (20-30GB is a good amount) and I iterate. For my example, my 346m row file is 30358MB (30GB). With the default options (1 partition, 1 thread, default batch size), the load happens in 1 hour, 35 minutes 30 seconds. This is not HANA speed!
So I guess a rough set of parameters:
- 5 partitions per active table node (15 partitions in my example)
- The same number of threads I have CPUs* active nodes (120 threads in my example)
- Batch size depending on the width of the column. For a simple table like this, I run with 200000. For much wider tables with 50-100 columns, I start with 20000.
With the SQL below, I get improved performance down from 5730s to 88s. That's more like it, and now we start tweaking to get the optimal settings for those 3 parameters via iteration. I expect this to improve performance a further 30% in most cases. You may not think it's worth it, but that extra 30% matters to me.
DROP TABLE "SCHEMA"."TABLE";
CREATE COLUMN TABLE "SCHEMA"."TABLE" (
"TIME" TIME,
"FIELD1" CHAR(1),
"FIELD2" CHAR(16),
"FIELD3" CHAR(4),
"MEASURE1" INT,
"MEASURE2" DECIMAL(7,4))
PARTITION BY ROUNDROBIN PARTITIONS 15;
ALTER TABLE "SCHEMA"."TABLE" DISABLE AUTOMERGE;
IMPORT FROM '/sapmnt/log/test.ctl' WITH THREADS 120 BATCH 200000;
MERGE DELTA OF "SCHEMA"."TABLE";
ALTER TABLE "SCHEMA"."TABLE ENABLE AUTOMERGE;
And I run it lots of times. I iterate on threads, then partition, then batch, and sometimes again. I build a table with load times and tune it until I feel like I've got HANA speed. The key is to copy and paste this statement 50 times into a notepad file and then change the parameters for each load. Execute it as one long SQL statement and leave it running overnight, loading, dropping and loading 50 times or more over.
What's HANA speed? In my example down from 5730 seconds to 60 seconds - 95x performance increase. Nearly 6m fully committed rows/second and 500MB/sec throughput. On a single-node Medium appliance with 40 cores, you should expect around 275MB/sec of throughput into HANA. In any case we are around the throughput of 10GBe Ethernet.
Anyhow in most cases, control file loads aren't a good way to load into HANA long term, because they lack flexibility and the ability to process bad data. But now you have an upper bound for how fast you can ingest. The Data Services team tells me they look to get performance that approaches the HANA bulk loader, for example.
4) Design the right Architecture
Now you know enough about your data to design the right loading architecture for HANA. Note one really important thing: I like to think of HANA as a hungry caterpillar (reminds me of The Cure song Lullaby, but anyhow). How fast you get data into HANA is almost always limited, in a well designed environment, by how fast you can supply.
This means that you need fast storage where you are residing the data to be loaded, fast applications server threads to process it, and fast plumbing - always 10GB Ethernet - to move data around. If you don't have all of this, your HANA in-memory appliance will go to waste. Don't think you can run Data Services on a VMWare farm, for example, and expect to move data around at more than 80MB/sec.
From there, the choice of tool, or tools, is critical. Use this table as a rule of thumb.
Use Case | Control File | Data Services | LT | SRS | ESP |
---|---|---|---|---|---|
Benchmarking and PoCs | X | O | O | O | O |
Batch Loads | O | X | - | - | O |
Transformations | - | X | O | O | X |
Fixing Data Quality | - | X | - | - | O |
Streaming Data | - | - | - | - | X |
Real-Time Replication | - | O | X | O | O |
Replication From SAP | - | O | X | O | O |
Tell me what else you want to see here! | | | | | |
X - Core purpose
O - offers the functionality
"-" - does not offer the functionality.
a) Control Files
As I've said before, Control Files are great for benchmarking and PoCs. And for beating up other ETL teams to get near their performance. After that, you aren't likely to use them much in a productive system.
b) Data Services
Data Services is a great general purpose batch tool. It offers real-time web services but there are better options if you want real-time. It also allows great data transformation, matching and enriching, but every additional transform you add to your batch job will slow it down, so be careful.
In addition if you are loading files from weird formats, like fixed format files, then Data Services is great. There are lots of tips and tricks for SAP HANA to be found at http://wiki.sdn.sap.com/wiki/display/EIM/Data+Services but there are 3 golden rules I apply:
i) Do major transforms on the way into SAP HANA, not in SAP HANA. This advice will change because they are integrating the Data Services ETL engine into SAP HANA, but for now, get your data formats right (e.g. combine date and time into timestamp) on the way in, rather than using materialized views in SAP HANA, which is expensive.
ii) Dial the 3 main parameters (Parallel Process threads in the file loader, bulk loading on the target table, commit size in the HANA table loader and loaders in the HANA table loader) in for your scenario. I start with 40 file loader threads, 10,000 commit size and 4 HANA loaders and tweak it from there. Data Services takes care of optimized merge deltas, which is very cool.
iii) Make sure you have plenty of power on the Data Services box - you can easily consume 16-20 cores on x86 CPU, make sure the threads are fast, and make sure you have 10GBe Ethernet from HANA -> Data Services.
c) SAP Landscape Transformation (LT)
LT is a very neat solution for replicating tables from SAP systems. It is trigger-based so there is a small overhead, and it is therefore suited to low-mid volume transactional tables like sales orders. It can be used to replicate tables from non-SAP databases like Oracle and this makes sense in a SAP shop. If you're not a SAP shop then you wouldn't use LT because it requires a SAP NetWeaver platform (either standalone for LT, or built into another SAP product like the Business Suite).
In these scenarios, LT is easy to set up to replicate tables and this is especially cool with SAP's Rapid Deployment Solutions (RDS), which provide pre-packaged HANA content that consume database tables from LT.
The main things to note with LT are the need for a Unicode LT system (can be used against non-Unicode source ERP) and the need to patch your ERP system to a recent kernel, which may trigger a regression test for your production environment. Also note that ERP, LT and HANA must be on a dedicated LT 10GBe Ethernet network for acceptable latency.
d) Sybase Replication Server (SRS)
SRS was originally the replication technology of choice for SAP ERP scenarios, before the advent of LT. It has several benefits over LT, because it scrapes database logs, which has lower latency and overhead than the trigger-based LT. Stock Exchanges in Capital Markets use SRS to replicate trading data between continents, it's that fast.
You do however need to have a full Enterprise license for your source DB, and many SAP customers buy a restricted runtime license via the Software Application Value (SAV) program, which allows them to rent their Oracle or IBM database for 8-15% of their Application Value. If you fall into this category, then you cannot use SRS legally.
In addition, because it scrapes database logs, SRS has some technical dependencies like Unicode, certain versions and certain databases.
But if what you want is a high-performance sidecar of an existing transactional database into SAP HANA, then SRS is a very interesting product, if niche.
e) Sybase Event Stream Processing (ESP)
ESP is the dark horse of the ETL family because it can do everything, and it can do some things very well. Its core purpose is stream processing, and you can look at the number of different streams it supports here: SyBooks Online - including Tibco Rendezvous, Capital Markets streams like Wombat and FIX, and files.
Its flexibility is excellent, and you can read fixed-format, CSV, FIX, XML or any other kind of data and output it to a stream, or to a database or file adapter. What's more, ESP is being integrated into the HANA database so you will be able to do all of this in-memory. ESP is low-latency and can shift a very large amount of messages a second (I hear millions), depending on the hardware and networks you have at your disposal. What's more you can store a buffer of data in memory for doing time stream processing.
As with any stream processor, when you hit saturation, you will start to page to RAM for incoming or outbound messages, then when RAM is filled up, to disk, and then bad things happen. As with Data Services, you can adapt the number of inbound and outbound feeds to get the performance you need, and adjust the output batch accordingly. Note that ESP is very well integrated with the SAP HANA ODBC interface and you can expect to get excellent performance.
5) Benchmark and iterate
Now you have chosen a product for your loading, you need to benchmark and iterate to try and get as close to the control file performance as possible.
In my tests, I found that the bulk loader performed about 2x as well as anything else, but this is sharing CPU capacity with SAP HANA. I expect that by putting Data Services or ESP onto a separate system with low-latency, high-bandwidth 10GBe network, you can get close to 70-80% of bulk loading performance.
Conclusions
A number of people I spoke to suggested that a set of best practices for data loads wasn't possible, because every customer is different. I'd counter that by saying that a set of best practices is required, to understand which tool should be used and when, depending on a potentially complex set of criteria. My first conclusion is I believe it is simple to select the right tool with some thought.
The second learning I have is that each data set performs slightly differently with HANA because of the complexities of massive parallel processing, columnar compression and column widths and complexities, plus the need to transform data. Once you have selected the correct tool, it is necessary to iterate to get optimal settings - and iterating is well worth it.
The third learning is to make sure you do not compare HANA bulk loading to Oracle bulk loading. HANA bulk loading is fully report-ready data. Oracle bulk loading literally dumps data into a database row, with no indexes, aggregates or save point. It breaks the ACID properties of the database until a backup is made. There is no point of making this comparison.
And a big shout out to:
The IBM X5 appliance that I use for testing. It is a 4x512GB = 2TB, 160 core appliance in a 22U rack.