ARM architecture

http://en.wikipedia.org/wiki/ARM_architecture

ARM architectures
Designer	ARM Holdings
Bits	32-bit or 64-bit
Introduced	1985
Design	RISC
Type	Register-Register
Branching	Condition code
Open	Proprietary

64/32-bit architecture
Registers
Introduced	2011
Version	ARMv8-A
Encoding	AArch64/A64 and AArch32/A32 use 32-bit instructions, T32 (Thumb2) uses mixed 16- and 32-bit instructions. ARMv7 user-space compatibility^[1]
Endianness	Bi (Little as default)
Extensions	All mandatory: Thumb-2, NEON, Jazelle, VFPv4-D16, VFPv4
General purpose	31x 64-bit integer registers^[1] plus PC and SP, ELR, SPSR for exception levels
Floating point	32× 128-bit registers,^[1] scalar 32- and 64-bit FP, SIMD 64- and 128-bit FP and integer

32-bit architectures (Cortex)
Registers
Version	ARMv8-R, ARMv7-A, ARMv7-R, ARMv7E-M, ARMv7-M, ARMv6-M
Encoding	32-bit except Thumb2 extensions use mixed 16- and 32-bit instructions.
Endianness	Bi (Little as default)
Extensions	Thumb-2 (mandatory since ARMv7), NEON, Jazelle, FPv4-SP
General purpose	16x 32-bit integer registers including PC and SP
Floating point	Up to 32× 64-bit registers,^[2] SIMD/floating-point (optional)

32-bit architectures (legacy)
Registers
Version	ARMv6, ARMv5, ARMv4T, ARMv3, ARMv2
Encoding	32-bit except Thumb extension uses mixed 16- and 32-bit instructions.
Endianness	Bi (Little as default) in ARMv3 and above
Extensions	Thumb, Jazelle
General purpose	16x 32-bit integer registers including PC (26-bit addressing in older) and SP

ARM is a family of instruction set architectures for computer processors based on a reduced instruction set computing (RISC) architecture developed by British company ARM Holdings.

A RISC-based computer design approach means ARM processors require significantly fewer transistors than typical processors in average computers. This approach reduces costs, heat and power use. These are desirable traits for light, portable, battery-powered devices—including smartphones, laptops, tablet and notepad computers, and other embedded systems. A simpler design facilitates more efficient multi-core CPUs and higher core counts at lower cost, providing higher processing power and improved energy efficiency for servers and supercomputers.^[3]^[4]^[5]

ARM Holdings develops the instruction set and architecture for ARM-based products, but does not manufacture products. The company periodically releases updates to its cores. Current cores from ARM Holdings support a 32-bit address space and 32-bit arithmetic; the recently introduced ARMv8-A architecture adds support for a 64-bit address space and 64-bit arithmetic. ARM Holdings' cores use fixed-length, 32-bit-wide instructions, but later versions of the architecture also support a variable-length instruction set that provides both 32-bit and 16-bit-wide instructions for improved code density. Some cores can also execute Java bytecodes in hardware.

ARM Holdings licenses the chip designs and the ARM instruction set architectures to third parties, who design their own products that implement one of those architectures, including systems-on-chips (SoC) that incorporate memory, interfaces, radios, etc. Currently, the widely used Cortex cores, older "classic" cores, and specialized SecurCore cores are available, each in variants that include or exclude optional capabilities. Companies that produce ARM products include Apple, Nvidia, Qualcomm, Rockchip, Samsung Electronics, and Texas Instruments. Apple first implemented the ARMv8-A architecture in the Apple A7 chip in the iPhone 5S.

In 2005, about 98% of all mobile phones sold used at least one ARM processor.^[6] The low power consumption of ARM processors has made them very popular: 37 billion ARM processors have been produced as of 2013, up from 10 billion in 2008.^[7] The ARM architecture (32-bit) is the most widely used architecture in mobile devices, and the most popular 32-bit architecture in embedded systems.^[8]

According to ARM Holdings, in 2010 alone, producers of chips based on ARM architectures reported shipments of 6.1 billion ARM-based processors, representing 95% of smartphones, 35% of digital televisions and set-top boxes and 10% of mobile computers. It is the most widely used 32-bit instruction set architecture in terms of quantity produced.^[9]^[10]

History

Microprocessor-based system on a chip

The ARM1 second processor for the BBC Micro

The British computer manufacturer Acorn Computers first developed ARM in the 1980s to use in its personal computers. Its first ARM-based products were coprocessor modules for the BBC Micro series of computers. After the successful BBC Micro computer, Acorn Computers considered how to move on from the relatively simple MOS Technology 6502 processor to address business markets like the one that was soon dominated by the IBM PC, launched in 1981. The Acorn Business Computer (ABC) plan required that a number of second processors be made to work with the BBC Micro platform, but processors such as the Motorola 68000 and National Semiconductor 32016 were considered unsuitable, and the 6502 was not powerful enough for a graphics based user interface.^[11]

After testing all available processors and finding them lacking, Acorn decided it needed a new architecture. Inspired by white papers on the Berkeley RISC project, Acorn considered designing its own processor.^[12] A visit to the Western Design Center in Phoenix, where the 6502 was being updated by what was effectively a single-person company, showed Acorn engineers Steve Furber and Sophie Wilson they did not need massive resources and state-of-the-art research and development facilities.^[13]

Wilson developed the instruction set, writing a simulation of the processor in BBC Basic that ran on a BBC Micro with a second 6502 processor. This convinced Acorn engineers they were on the right track. Wilson approached Acorn's CEO, Hermann Hauser, and requested more resources. Hauser gave his approval and assembled a small team to implement Wilson's model in hardware.

Acorn RISC Machine: ARM2

The official Acorn RISC Machine project started in October 1983. Acorn chose VLSI Technology as the silicon partner, as it was already a source of ROMs and custom chips for Acorn. Wilson and Furber led the design. They implemented it with an efficiency ethos similar to that of the 6502.^[14] A key design goal was achieving low-latency input/output (interrupt) handling like the 6502. The 6502's memory access architecture had let developers produce fast machines without costly direct memory access hardware.

VLSI produced the first ARM silicon on 26 April 1985. It worked the first time and was known as ARM1.^[3] The first production systems, named ARM2, were available the following year.

The first practical ARM application was as a second processor for the BBC Micro, where it helped develop simulation software to finish development of the support chips (VIDC, IOC, MEMC), and sped up the CAD software used in ARM2 development. Wilson subsequently rewrote BBC Basic in ARM assembly language. The in-depth knowledge gained from designing the instruction set enabled the code to be very dense, making ARM BBC Basic an extremely good test for any ARM emulator. The original aim of a principally ARM-based computer was achieved in 1987 with the release of the Acorn Archimedes.^[15] In 1992, Acorn once more won the Queen's Award for Technology for the ARM.

The ARM2 featured a 32-bit data bus, 26-bit address space and 27 32-bit registers. 8 bits from the program counter register were available for other purposes; the top 6 bits (available because of the 26-bit address space), served as status flags, and the bottom 2 bits (available because the program counter was always word-aligned), were used for setting modes. The address bus was extended to 32 bits in the ARM6, but program code still had to lie within the first 64 MB of memory in 26-bit compatibility mode, due to the reserved bits for the status flags.^[16] The ARM2 had a transistor count of just 30,000, compared to Motorola's six-year-older 68000 model with 68,000.^[17] Much of this simplicity came from the lack of microcode (which represents about one-quarter to one-third of the 68000) and from (like most CPUs of the day) not including any cache. This simplicity enabled low power consumption, yet better performance than the Intel 80286. A successor, ARM3, was produced with a 4 KB cache, which further improved performance.^[18]
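
The way the ARM2 shares its program counter register between status flags, address and mode bits can be sketched in C. This is only an illustration: the field positions follow the description above (six flag bits at the top, two mode bits at the bottom), and the names r15_fields and decode_r15_26bit are hypothetical, not taken from any ARM toolchain.

```c
#include <stdint.h>

/* Fields packed into R15 on a 26-bit ARM processor such as the ARM2.
 * The struct and function names are illustrative assumptions. */
typedef struct {
    uint32_t address; /* word-aligned instruction address, bits 25..2 */
    unsigned flags;   /* six status flag bits, taken from bits 31..26 */
    unsigned mode;    /* two processor mode bits, taken from bits 1..0 */
} r15_fields;

static r15_fields decode_r15_26bit(uint32_t r15)
{
    r15_fields f;
    f.address = r15 & 0x03FFFFFCu;   /* keep bits 25..2 only */
    f.flags   = (r15 >> 26) & 0x3Fu; /* top six bits */
    f.mode    = r15 & 0x3u;          /* bottom two bits */
    return f;
}
```

Because only 26 bits remain for the address, code can reach at most 2^26 bytes, which is the 64 MB limit mentioned above.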

Apple, DEC, Intel, Marvell: ARM6, StrongARM, XScale

In the late 1980s Apple Computer and VLSI Technology started working with Acorn on newer versions of the ARM core. In 1990, Acorn spun off the design team into a new company named Acorn RISC Machines Ltd., which became ARM Ltd when its parent company, ARM Holdings plc, floated on the London Stock Exchange and NASDAQ in 1998.^[19]

The new Apple-ARM work would eventually evolve into the ARM6, first released in early 1992. Apple used the ARM6-based ARM610 as the basis for their Apple Newton PDA. In 1994, Acorn used the ARM610 as the main central processing unit (CPU) in their RiscPC computers. DEC licensed the ARM6 architecture and produced the StrongARM. At 233 MHz, this CPU drew only one watt (newer versions draw far less). This work was later passed to Intel as a part of a lawsuit settlement, and Intel took the opportunity to supplement their i960 line with the StrongARM. Intel later developed its own high performance implementation named XScale, which it has since sold to Marvell. The transistor count of the ARM core remained essentially the same throughout these changes; ARM2 had 30,000 transistors, while ARM6 grew only to 35,000.^{[citation needed]}

Licensing

Core license

ARM Holdings' primary business is selling IP cores, which licensees use to create microcontrollers (MCUs) and CPUs based on those cores. The original design manufacturer combines the ARM core with other parts to produce a complete CPU, typically one that can be built in existing semiconductor fabs at low cost and still deliver substantial performance. The most successful implementation has been the ARM7TDMI, with hundreds of millions sold. Atmel was an early design center for ARM7TDMI-based embedded systems.

The ARM architectures used in smartphones, PDAs and other mobile devices range from ARMv5, used in low-end devices, through ARMv6, to ARMv7 in current high-end devices. ARMv7 includes a hardware floating-point unit (FPU), with improved speed compared to software-based floating-point.

In 2009, some manufacturers introduced netbooks based on ARM architecture CPUs, in direct competition with netbooks based on Intel Atom.^[20] According to analyst firm IHS iSuppli, by 2015, ARM ICs may be in 23% of all laptops.^[21]

ARM Holdings offers a variety of licensing terms, varying in cost and deliverables. ARM Holdings provides to all licensees an integratable hardware description of the ARM core as well as a complete software development toolset (compiler, debugger, software development kit) and the right to sell manufactured silicon containing the ARM CPU.

SoC packages integrating ARM's core designs include Nvidia Tegra's first three generations, CSR plc's Quatro family, ST-Ericsson's Nova and NovaThor, Silicon Labs's Precision32 MCU, Texas Instruments's OMAP products, Samsung's Hummingbird and Exynos products, Apple's A4, A5, and A5X, and Freescale's i.MX.

Fabless licensees, who wish to integrate an ARM core into their own chip design, are usually only interested in acquiring a ready-to-manufacture verified IP core. For these customers, ARM Holdings delivers a gate netlist description of the chosen ARM core, along with an abstracted simulation model and test programs to aid design integration and verification. More ambitious customers, including integrated device manufacturers (IDM) and foundry operators, choose to acquire the processor IP in synthesizable RTL (Verilog) form. With the synthesizable RTL, the customer has the ability to perform architectural level optimisations and extensions. This allows the designer to achieve exotic design goals not otherwise possible with an unmodified netlist (high clock speed, very low power consumption, instruction set extensions, etc.). While ARM Holdings does not grant the licensee the right to resell the ARM architecture itself, licensees may freely sell manufactured products such as chip devices, evaluation boards, and complete systems. Merchant foundries can be a special case; not only are they allowed to sell finished silicon containing ARM cores, they generally hold the right to re-manufacture ARM cores for other customers.

ARM Holdings prices its IP based on perceived value. Lower performing ARM cores typically have lower licence costs than higher performing cores. In implementation terms, a synthesizable core costs more than a hard macro (blackbox) core. Complicating price matters, a merchant foundry that holds an ARM licence, such as Samsung or Fujitsu, can offer fab customers reduced licensing costs. In exchange for acquiring the ARM core through the foundry's in-house design services, the customer can reduce or eliminate payment of ARM's upfront licence fee.

Compared to dedicated semiconductor foundries (such as TSMC and UMC) without in-house design services, Fujitsu and Samsung charge two to three times more per manufactured wafer.^{[citation needed]} For low to mid volume applications, a design service foundry offers lower overall pricing (through subsidisation of the licence fee). For high volume mass-produced parts, the long term cost reduction achievable through lower wafer pricing reduces the impact of ARM's NRE (Non-Recurring Engineering) costs, making the dedicated foundry a better choice.

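ARM itself does not manufacture or sell CPUs based on its own designs; instead, it licenses the processor architecture to interested vendors. ARM offers a variety of licensing terms, varying in price and deliverables. To licensees, ARM provides an integratable hardware description of the ARM core, together with a complete software development toolset (compiler, debugger, SDK) and the right to sell silicon containing the ARM CPU. Fabless licensees, who wish to integrate an ARM core into their own chip designs, are usually interested only in acquiring a production-ready, verified IP core. For these customers, ARM releases a gate-level netlist of the chosen ARM core, along with an abstracted simulation model and test programs to aid design integration and verification. More demanding customers, including integrated device manufacturers (IDM) and foundries, choose to acquire the processor IP in synthesizable RTL (register-transfer level, such as Verilog) form. With synthesizable RTL, the customer can perform architectural optimisations and enhancements. This lets the designer meet additional design goals (such as high clock speed, low power consumption, or instruction set extensions) without being limited by a netlist that cannot be modified. Although ARM does not grant licensees the right to resell the ARM architecture itself, licensees may freely sell manufactured products (such as chip devices, evaluation boards, and complete systems). Merchant foundries are a special case: not only are they allowed to sell finished silicon containing ARM cores, they generally also hold the right to re-manufacture ARM cores for other customers.

Like most IP vendors, ARM prices its IP according to its value in use. In architectural terms, lower-performing ARM cores carry lower licence fees than higher-performing cores. In silicon implementation terms, a synthesizable core costs more than a hard macro (black box) core. To complicate pricing further, merchant foundries that hold an ARM licence (such as Samsung of South Korea and Fujitsu of Japan) can offer their fab customers lower licensing prices. By going through the foundry's in-house design services, the customer can obtain the ARM core with a reduced or waived upfront ARM licence fee. Compared with dedicated semiconductor foundries without in-house design services (such as TSMC and UMC), Fujitsu and Samsung charge two to three times more per wafer. For low- to mid-volume applications, a foundry with a design department offers a lower overall price (through subsidisation of the licence fee). For mass production, the long-term cost reduction achievable through lower wafer pricing reduces the impact of ARM's NRE costs, making the dedicated foundry the better choice.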

Architectural licence

Companies can also obtain an ARM architectural licence for designing their own CPU cores using the ARM instruction sets. These cores must comply fully with the ARM architecture.

Cores

Main article: List of ARM cores

Architecture	Bit width	Cores designed by ARM Holdings	Cores designed by 3rd parties	Cortex profile	References
ARMv1	32/26	ARM1
ARMv2	32/26	ARM2, ARM3	Amber
ARMv3	32	ARM6, ARM7
ARMv4	32	ARM8	StrongARM, FA526
ARMv4T	32	ARM7TDMI, ARM9TDMI
ARMv5	32	ARM7EJ, ARM9E, ARM10E	XScale, FA626TE, Feroceon, PJ1/Mohawk
ARMv6	32	ARM11
ARMv6-M	32	ARM Cortex-M0, ARM Cortex-M0+, ARM Cortex-M1		Microcontroller
ARMv7-M	32	ARM Cortex-M3		Microcontroller
ARMv7E-M	32	ARM Cortex-M4		Microcontroller
ARMv7-R	32	ARM Cortex-R4, ARM Cortex-R5, ARM Cortex-R7		Real-time
ARMv7-A	32	ARM Cortex-A5, ARM Cortex-A7, ARM Cortex-A8, ARM Cortex-A9, ARM Cortex-A12, ARM Cortex-A15	Krait, Scorpion, PJ4/Sheeva, Apple A6/A6X (Swift)	Application
ARMv8-A	64/32	ARM Cortex-A53, ARM Cortex-A57^[22]	X-Gene, Denver, Apple A7 (Cyclone)	Application	^[23]^[24]
ARMv8-R	32	No announcements yet		Real-time	^[25]^[26]

A list of vendors who implement ARM cores in their designs (application-specific standard products (ASSP), microprocessors and microcontrollers) is provided by ARM Holdings.^[27]

Example applications of ARM cores

Main article: List of applications of ARM cores

Tronsmart MK908, a Rockchip-based quad-core Android "mini PC", with a microSD card next to it for a size comparison.

ARM cores are used in a number of products, particularly PDAs and smartphones. Some computing examples are the Microsoft Surface, Apple's iPad and ASUS Eee Pad Transformer. Others include Apple's iPhone smartphone and iPod portable media player, Canon PowerShot A470 digital camera, Nintendo DS handheld game console and TomTom turn-by-turn navigation system.

In 2005, ARM Holdings took part in the development of Manchester University's computer, SpiNNaker, which used ARM cores to simulate the human brain.^[28]

ARM chips are also used in Raspberry Pi, BeagleBoard, BeagleBone, PandaBoard and other single-board computers, because they are very small, inexpensive and consume very little power.

32-bit architecture

The 32-bit ARM architecture, such as ARMv7-A, is the most widely used architecture in mobile devices.^[8]

Since 1995, the ARM Architecture Reference Manual has been the primary source of documentation on the ARM processor architecture and instruction set, distinguishing interfaces that all ARM processors are required to support (such as instruction semantics) from implementation details that may vary. The architecture has evolved over time. Version 7 of the architecture, ARMv7, which is the architecture of the first Cortex series cores, defines three architecture "profiles":

A-profile, the "Application" profile: Cortex-A series
R-profile, the "Real-time" profile: Cortex-R series
M-profile, the "Microcontroller" profile: Cortex-M series

Although the architecture profiles were first defined for ARMv7, ARM subsequently defined the ARMv6-M architecture (used by the Cortex M0/M0+/M1) as a subset of the ARMv7-M profile with fewer instructions.

CPU modes

Except in the M-profile, the 32-bit ARM architecture specifies several CPU modes, depending on the implemented architecture features. At any moment in time, the CPU can be in only one mode, but it can switch modes due to external events (interrupts) or programmatically.^[29]

User mode: The only non-privileged mode.
FIQ mode: A privileged mode that is entered whenever the processor accepts an FIQ interrupt.
IRQ mode: A privileged mode that is entered whenever the processor accepts an IRQ interrupt.
Supervisor (svc) mode: A privileged mode entered whenever the CPU is reset or when an SVC instruction is executed.
Abort mode: A privileged mode that is entered whenever a prefetch abort or data abort exception occurs.
Undefined mode: A privileged mode that is entered whenever an undefined instruction exception occurs.
System mode (ARMv4 and above): The only privileged mode that is not entered by an exception. It can only be entered by executing an instruction that explicitly writes to the mode bits of the CPSR.
Monitor mode (ARMv6 and ARMv7 Security Extensions, ARMv8 EL3): A monitor mode is introduced to support TrustZone extension in ARM cores.
Hyp mode (ARMv7 Virtualization Extensions, ARMv8 EL2): A hypervisor mode that supports virtualization of the non-secure operation of the CPU.^[30]

Instruction set

The original ARM implementation was hardwired without microcode, like the much simpler 8-bit 6502 processor used in prior Acorn microcomputers.

The 32-bit ARM architecture (and the 64-bit architecture for the most part) includes the following RISC features:

Load/store architecture.
No support for unaligned memory accesses in the original version of the architecture. ARMv6 and later, except some microcontroller versions, support unaligned accesses for half-word and single-word load/store instructions with some limitations, such as no guaranteed atomicity.^[31]^[32]
Uniform 16× 32-bit register file (including the Program Counter, Stack Pointer and the Link Register).
Fixed instruction width of 32 bits to ease decoding and pipelining, at the cost of decreased code density. Later, the Thumb instruction set added 16-bit instructions and increased code density.
Mostly single clock-cycle execution.
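
Because unaligned access support differs between architecture versions, portable C code usually avoids dereferencing a misaligned pointer directly. A minimal sketch of the common memcpy idiom follows; the helper names are hypothetical, and whether the compiler turns the copy into a single unaligned load or a byte-by-byte sequence depends on the target and compiler options.

```c
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value from an address that may not be 4-byte aligned.
 * memcpy leaves the choice of load sequence to the compiler, so the
 * same source works on cores with and without unaligned access. */
static uint32_t load_u32_unaligned(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Matching store helper for a possibly misaligned destination. */
static void store_u32_unaligned(void *p, uint32_t v)
{
    memcpy(p, &v, sizeof v);
}
```

The value is read in the byte order of the machine, so this sketch does not address endianness.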

To compensate for the simpler design, compared with processors like the Intel 80286 and Motorola 68020, some additional design features were used:

Conditional execution of most instructions reduces branch overhead and compensates for the lack of a branch predictor.
Arithmetic instructions alter condition codes only when desired.
32-bit barrel shifter can be used without performance penalty with most arithmetic instructions and address calculations.
Powerful indexed addressing modes.
A link register supports fast leaf function calls.
A simple, but fast, 2-priority-level interrupt subsystem has switched register banks.

Arithmetic instructions

The ARM supports add, subtract, and multiply instructions. The integer divide instructions are only implemented by ARM cores based on the following ARM architectures:

ARMv7-M and ARMv7E-M architectures always include divide instructions.^[33]
ARMv7-R architecture always includes divide instructions in the Thumb instruction set, but optionally in its 32-bit instruction set.^[34]
ARMv7-A architecture optionally includes the divide instructions. The instructions might not be implemented, or implemented only in the Thumb instruction set, or implemented in both the Thumb and ARM instruction sets, or implemented if the Virtualization Extensions are included.^[34]
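
On cores without divide instructions, division has to be performed in software. The following is a minimal sketch of one classic approach, shift-and-subtract (restoring) division, written in C. The names soft_udiv32 and udiv_result are hypothetical, and real runtime libraries typically use more heavily optimised routines.

```c
#include <stdint.h>

/* Quotient and remainder of an unsigned 32-bit division. */
typedef struct {
    uint32_t quot;
    uint32_t rem;
} udiv_result;

/* Shift-and-subtract division, one quotient bit per iteration.
 * The caller must ensure d != 0. */
static udiv_result soft_udiv32(uint32_t n, uint32_t d)
{
    uint64_t rem = 0;   /* 64-bit so the left shift cannot overflow */
    uint32_t quot = 0;

    for (int bit = 31; bit >= 0; bit--) {
        /* Bring down the next bit of the dividend. */
        rem = (rem << 1) | ((n >> bit) & 1u);
        if (rem >= d) {
            rem -= d;
            quot |= (uint32_t)1 << bit;
        }
    }

    udiv_result r = { quot, (uint32_t)rem };
    return r;
}
```

For example, soft_udiv32(100, 7) yields a quotient of 14 and a remainder of 2.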

Registers

Registers R0 through R7 are the same across all CPU modes; they are never banked. R8 through R12 are shared by all modes except FIQ mode, which has its own copies.

R13 and R14 are banked across all privileged CPU modes except system mode. That is, each mode that can be entered because of an exception has its own R13 and R14. These registers generally contain the stack pointer and the return address from function calls, respectively.

Registers across CPU modes
usr	svc	abt	und	irq	fiq
R0
R1
R2
R3
R4
R5
R6
R7
R8					R8_fiq
R9					R9_fiq
R10					R10_fiq
R11					R11_fiq
R12					R12_fiq
R13	R13_svc	R13_abt	R13_und	R13_irq	R13_fiq
R14	R14_svc	R14_abt	R14_und	R14_irq	R14_fiq
R15
CPSR
	SPSR_svc	SPSR_abt	SPSR_und	SPSR_irq	SPSR_fiq
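
The banking rules in the table can be expressed as a small C predicate. This is a sketch covering only the six modes shown in the table; the enum and function names are hypothetical.

```c
#include <stdbool.h>

/* The six modes that appear as columns in the table above. */
typedef enum {
    MODE_USR, MODE_SVC, MODE_ABT, MODE_UND, MODE_IRQ, MODE_FIQ
} cpu_mode;

/* True if register Rn (n in 0..15) has a private copy in the given mode. */
static bool is_banked(cpu_mode mode, int n)
{
    if (mode == MODE_USR)
        return false;                 /* user mode sees the base set */
    if (n == 13 || n == 14)
        return true;                  /* SP and LR: every exception mode */
    if (mode == MODE_FIQ && n >= 8 && n <= 12)
        return true;                  /* R8..R12: FIQ mode only */
    return false;
}

/* Count the distinct general-purpose registers implied by the table. */
static int count_physical_registers(void)
{
    int count = 16;                   /* R0..R15 as seen from user mode */
    for (int m = MODE_SVC; m <= MODE_FIQ; m++)
        for (int n = 0; n < 16; n++)
            if (is_banked((cpu_mode)m, n))
                count++;
    return count;
}
```

Counting this way gives 31 general-purpose registers: 16 in the base set, two banked copies in each of the five exception modes, and five more for FIQ mode.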

Aliases:

R13 is also referred to as SP, the Stack Pointer.
R14 is also referred to as LR, the Link Register.
R15 is also referred to as PC, the Program Counter.

The CPSR is a 32-bit register with the following fields.^[35]

M (bits 0–4) are the processor mode bits.
T (bit 5) is the Thumb state bit.
F (bit 6) is the FIQ disable bit.
I (bit 7) is the IRQ disable bit.
A (bit 8) is the imprecise data abort disable bit.
E (bit 9) is the data endianness bit.
IT (bits 10–15 and 25–26) are the if-then state bits.
GE (bits 16–19) are the greater-than-or-equal-to bits.
DNM (bits 20–23) are the do-not-modify bits.
J (bit 24) is the Java state bit.
Q (bit 27) is the sticky overflow bit.
V (bit 28) is the overflow bit.
C (bit 29) is the carry/borrow/extend bit.
Z (bit 30) is the zero bit.
N (bit 31) is the negative/less than bit.
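
The field layout above maps directly onto shifts and masks. The sketch below decodes a CPSR value in C; the struct and function names are hypothetical. The numeric mode encodings in mode_name are not given in this article and are stated here as an assumption based on the ARM Architecture Reference Manual.

```c
#include <stdint.h>

/* Decoded CPSR fields, named after the list above. */
typedef struct {
    unsigned n, z, c, v, q;   /* condition and sticky overflow flags */
    unsigned j, t;            /* Java and Thumb state bits */
    unsigned ge;              /* GE[3:0] */
    unsigned it;              /* IT[7:0], reassembled from two fields */
    unsigned e, a, i, f;      /* endianness and disable bits */
    unsigned mode;            /* M[4:0] */
} cpsr_fields;

static cpsr_fields decode_cpsr(uint32_t cpsr)
{
    cpsr_fields r;
    r.n    = (cpsr >> 31) & 1u;
    r.z    = (cpsr >> 30) & 1u;
    r.c    = (cpsr >> 29) & 1u;
    r.v    = (cpsr >> 28) & 1u;
    r.q    = (cpsr >> 27) & 1u;
    r.j    = (cpsr >> 24) & 1u;
    r.ge   = (cpsr >> 16) & 0xFu;
    /* Bits 15..10 hold IT[7:2]; bits 26..25 hold IT[1:0]. */
    r.it   = (((cpsr >> 10) & 0x3Fu) << 2) | ((cpsr >> 25) & 0x3u);
    r.e    = (cpsr >> 9) & 1u;
    r.a    = (cpsr >> 8) & 1u;
    r.i    = (cpsr >> 7) & 1u;
    r.f    = (cpsr >> 6) & 1u;
    r.t    = (cpsr >> 5) & 1u;
    r.mode = cpsr & 0x1Fu;
    return r;
}

/* Assumed mode encodings (from the ARM Architecture Reference Manual). */
static const char *mode_name(unsigned mode)
{
    switch (mode) {
    case 0x10: return "usr";
    case 0x11: return "fiq";
    case 0x12: return "irq";
    case 0x13: return "svc";
    case 0x16: return "mon";
    case 0x17: return "abt";
    case 0x1A: return "hyp";
    case 0x1B: return "und";
    case 0x1F: return "sys";
    default:   return "reserved";
    }
}
```

For instance, the value 0x600001D3 decodes as Z and C set, the A, I and F disable bits set, ARM state (T clear), and supervisor mode.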

Conditional execution

Almost every ARM instruction has a conditional execution feature called predication, which is implemented with a 4-bit condition code selector (the predicate). To allow for unconditional execution, one of the four-bit codes causes the instruction to be always executed. Most other CPU architectures only have condition codes on branch instructions.

Though the predicate takes up 4 of the 32 bits in an instruction code, and thus cuts down significantly on the encoding bits available for displacements in memory access instructions, it avoids branch instructions when generating code for small if statements. Apart from eliminating the branch instructions themselves, this preserves the fetch/decode/execute pipeline at the cost of only one cycle per skipped instruction.

The standard example of conditional execution is the subtraction-based Euclidean algorithm:

In the C programming language, the loop is:

    while (i != j)
    {
       if (i > j)
       {
           i -= j;
       }
       else  /* i < j (since i != j in while condition) */
       {
           j -= i;
       }
    }

In ARM assembly, the loop is:

loop:   CMP  Ri, Rj         ; set condition "NE" if (i != j),
                            ;               "GT" if (i > j),
                            ;            or "LT" if (i < j)
        SUBGT  Ri, Ri, Rj   ; if "GT" (Greater Than), i = i-j;
        SUBLT  Rj, Rj, Ri   ; if "LT" (Less Than), j = j-i;
        BNE  loop           ; if "NE" (Not Equal), then loop

which avoids the branches around the then and else clauses. If Ri and Rj are equal then neither of the SUB instructions is executed, so no separate conditional branch is needed to implement the while check at the top of the loop. Such a branch would have been needed if, for example, SUBLE (less than or equal) had been used.
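
To make the role of the condition flags concrete, the following C sketch models how CMP sets the N, Z, C and V flags and how the GT, LT and NE conditions test them, then replays the four-instruction loop above one instruction at a time. The function names are hypothetical, the flag tests follow the standard ARM condition definitions, and the inputs are assumed to be positive.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool n, z, c, v; } flags;

/* Flags produced by CMP a, b: compute a - b and keep only the flags. */
static flags cmp_flags(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    uint32_t diff = ua - ub;
    flags f;
    f.n = (diff >> 31) != 0;                          /* result negative */
    f.z = (diff == 0);                                /* result zero */
    f.c = (ua >= ub);                                 /* no borrow */
    f.v = ((((ua ^ ub) & (ua ^ diff)) >> 31) != 0);   /* signed overflow */
    return f;
}

static bool cond_ne(flags f) { return !f.z; }
static bool cond_gt(flags f) { return !f.z && f.n == f.v; }
static bool cond_lt(flags f) { return f.n != f.v; }

/* Replay of the assembly loop; both SUBs test the flags set by the CMP,
 * because a SUB without the S suffix leaves the flags unchanged. */
static int32_t gcd_predicated(int32_t ri, int32_t rj)
{
    flags f;
    do {
        f = cmp_flags(ri, rj);        /* CMP   Ri, Rj     */
        if (cond_gt(f)) ri -= rj;     /* SUBGT Ri, Ri, Rj */
        if (cond_lt(f)) rj -= ri;     /* SUBLT Rj, Rj, Ri */
    } while (cond_ne(f));             /* BNE   loop       */
    return ri;
}
```

Calling gcd_predicated(48, 18) returns 6. Only the BNE at the bottom of the loop is a branch; the two subtractions are skipped or executed according to the flags.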

One of the ways that Thumb code provides a more dense encoding is to remove the four bit selector from non-branch instructions.

Other features

Another feature of the instruction set is the ability to fold shifts and rotates into the "data processing" (arithmetic, logical, and register-register move) instructions, so that, for example, the C statement

a += (j << 2);

could be rendered as a single-word, single-cycle instruction:^[36]

ADD  Ra, Ra, Rj, LSL #2

This results in the typical ARM program being denser than expected with fewer memory accesses; thus the pipeline is used more efficiently.
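
How the shift and the 4-bit condition selector fit into one 32-bit instruction word can be sketched with a small encoder in C. The bit positions below follow the ARM data-processing encoding with a register operand shifted by an immediate; the function and constant names are hypothetical, and the S bit is left clear.

```c
#include <stdint.h>

enum { COND_EQ = 0x0, COND_AL = 0xE };  /* condition selectors */
enum { OPCODE_ADD = 0x4 };              /* data-processing opcode for ADD */
enum { SHIFT_LSL = 0x0 };               /* shift type: logical shift left */

/* Encode "<op><cond> Rd, Rn, Rm, <shift> #shift_imm" with S = 0. */
static uint32_t encode_dp_shifted_reg(unsigned cond, unsigned opcode,
                                      unsigned rd, unsigned rn, unsigned rm,
                                      unsigned shift_type, unsigned shift_imm)
{
    return ((uint32_t)(cond & 0xFu) << 28)        /* bits 31..28: condition */
         | ((uint32_t)(opcode & 0xFu) << 21)      /* bits 24..21: opcode    */
         | ((uint32_t)(rn & 0xFu) << 16)          /* bits 19..16: Rn        */
         | ((uint32_t)(rd & 0xFu) << 12)          /* bits 15..12: Rd        */
         | ((uint32_t)(shift_imm & 0x1Fu) << 7)   /* bits 11..7: shift amt  */
         | ((uint32_t)(shift_type & 0x3u) << 5)   /* bits 6..5: shift type  */
         | (uint32_t)(rm & 0xFu);                 /* bits 3..0: Rm          */
}
```

With Ra taken as r0 and Rj as r1, encode_dp_shifted_reg(COND_AL, OPCODE_ADD, 0, 0, 1, SHIFT_LSL, 2) gives 0xE0800101, the encoding of ADD r0, r0, r1, LSL #2. Changing only the condition argument to COND_EQ produces the conditional form ADDEQ.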

The ARM processor also has features rarely seen in other RISC architectures, such as PC-relative addressing (indeed, on the 32-bit^[1] ARM the PC is one of its 16 registers) and pre- and post-increment addressing modes.

The ARM instruction set has increased over time. Some early ARM processors (before ARM7TDMI), for example, have no instruction to store a two-byte quantity.

Pipelines and other implementation issues

The ARM7 and earlier implementations have a three-stage pipeline; the stages being fetch, decode and execute. Higher-performance designs, such as the ARM9, have deeper pipelines: Cortex-A8 has thirteen stages. Additional implementation changes for higher performance include a faster adder and more extensive branch prediction logic. The difference between the ARM7DI and ARM7DMI cores, for example, was an improved multiplier; hence the added "M".

Coprocessors

The ARM architecture provides a non-intrusive way of extending the instruction set using "coprocessors" that can be addressed using MCR, MRC, MRRC, MCRR, and similar instructions. The coprocessor space is divided logically into 16 coprocessors with numbers from 0 to 15, coprocessor 15 (cp15) being reserved for some typical control functions like managing the caches and MMU operation on processors that have one.
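
As an illustration of how the coprocessor number is carried in the instruction itself, the sketch below builds the 32-bit ARM encodings of MRC (coprocessor to ARM register) and MCR (ARM register to coprocessor) in C. The field positions are taken from the ARM instruction encoding; the function name is hypothetical, and the condition field is fixed to "always" for simplicity.

```c
#include <stdint.h>

/* Encode MRC (to_arm != 0) or MCR (to_arm == 0):
 *   MRC/MCR p<cp>, <opc1>, R<rt>, c<crn>, c<crm>, <opc2>
 * The condition field is fixed to 0xE ("always"). */
static uint32_t encode_coproc_transfer(int to_arm, unsigned cp, unsigned opc1,
                                       unsigned rt, unsigned crn, unsigned crm,
                                       unsigned opc2)
{
    return 0xEE000010u                            /* cond, class, bit 4 */
         | ((uint32_t)(opc1 & 0x7u) << 21)        /* bits 23..21 */
         | ((uint32_t)(to_arm ? 1u : 0u) << 20)   /* bit 20: direction */
         | ((uint32_t)(crn & 0xFu) << 16)         /* bits 19..16 */
         | ((uint32_t)(rt & 0xFu) << 12)          /* bits 15..12 */
         | ((uint32_t)(cp & 0xFu) << 8)           /* bits 11..8: coprocessor */
         | ((uint32_t)(opc2 & 0x7u) << 5)         /* bits 7..5 */
         | (uint32_t)(crm & 0xFu);                /* bits 3..0 */
}
```

For example, encode_coproc_transfer(1, 15, 0, 0, 0, 0, 0) gives 0xEE100F10, the encoding of MRC p15, 0, r0, c0, c0, 0. The four-bit cp field is what limits the space to coprocessors 0 to 15.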

In ARM-based machines, peripheral devices are usually attached to the processor by mapping their physical registers into ARM memory space, into the coprocessor space, or by connecting to another device (a bus) that in turn attaches to the processor. Coprocessor accesses have lower latency, so some peripherals—for example an XScale interrupt controller—are accessible in both ways: through memory and through coprocessors.

In other cases, chip designers only integrate hardware using the coprocessor mechanism. For example, an image processing engine might be a small ARM7TDMI core combined with a coprocessor that has specialised operations to support a specific set of HDTV transcoding primitives.

Debugging

All modern ARM processors include hardware debugging facilities, allowing software debuggers to perform operations such as halting, stepping, and breakpointing of code starting from reset. These facilities are built using JTAG support, though some newer cores optionally support ARM's own two-wire "SWD" protocol. In ARM7TDMI cores, the "D" represented JTAG debug support, and the "I" represented presence of an "EmbeddedICE" debug module. For ARM7 and ARM9 core generations, EmbeddedICE over JTAG was a de facto debug standard, though not architecturally guaranteed.

The ARMv7 architecture defines basic debug facilities at an architectural level. These include breakpoints, watchpoints and instruction execution in a "Debug Mode"; similar facilities were also available with EmbeddedICE. Both "halt mode" and "monitor" mode debugging are supported. The actual transport mechanism used to access the debug facilities is not architecturally specified, but implementations generally include JTAG support.

There is a separate ARM "CoreSight" debug architecture, which is not architecturally required by ARMv7 processors.

Tools

The ARM architecture is supported by a set of development tools such as Emprog ThunderBench for ARM. Such tools allow development engineers to program the ARM architecture device using a high level language like C.^[37]

DSP enhancement instructions

To improve the ARM architecture for digital signal processing and multimedia applications, DSP instructions were added to the set.^[38] These are signified by an "E" in the name of the ARMv5TE and ARMv5TEJ architectures. E-variants also imply T, D, M and I.

The new instructions are common in digital signal processor architectures. They include variations on signed multiply–accumulate, saturated add and subtract, and count leading zeros.

SIMD extensions for multimedia

SIMD extensions for multimedia were introduced in the ARMv6 architecture.^[39]

Jazelle

Main article: Jazelle

Jazelle DBX (Direct Bytecode eXecution) is a technique that allows Java Bytecode to be executed directly in the ARM architecture as a third execution state (and instruction set) alongside the existing ARM and Thumb-mode. Support for this state is signified by the "J" in the ARMv5TEJ architecture, and in ARM9EJ-S and ARM7EJ-S core names. Support for this state is required starting in ARMv6 (except for the ARMv7-M profile), though newer cores only include a trivial implementation that provides no hardware acceleration.

Thumb

To improve compiled code density, processors since the ARM7TDMI (released in 1994^[40]) have featured the Thumb instruction set, which has its own state. (The "T" in "TDMI" indicates the Thumb feature.) When in this state, the processor executes the Thumb instruction set, a compact 16-bit encoding for a subset of the ARM instruction set.^[41] Most of the Thumb instructions are directly mapped to normal ARM instructions. The space saving comes from making some of the instruction operands implicit and limiting the number of possibilities compared to the ARM instructions executed in the ARM instruction set state.

In Thumb, the 16-bit opcodes have less functionality. For example, only branches can be conditional, and many opcodes are restricted to accessing only half of all of the CPU's general-purpose registers. The shorter opcodes give improved code density overall, even though some operations require extra instructions. In situations where the memory port or bus width is constrained to less than 32 bits, the shorter Thumb opcodes allow increased performance compared with 32-bit ARM code, as less program code may need to be loaded into the processor over the constrained memory bandwidth.

Embedded hardware, such as the Game Boy Advance, typically has a small amount of RAM accessible with a full 32-bit datapath; the majority is accessed via a 16-bit or narrower secondary datapath. In this situation, it usually makes sense to compile Thumb code and hand-optimise a few of the most CPU-intensive sections using full 32-bit ARM instructions, placing these wider instructions into the memory accessible over the 32-bit bus.

The first processor with a Thumb instruction decoder was the ARM7TDMI. All ARM9 and later families, including XScale, have included a Thumb instruction decoder.

Thumb-2

Thumb-2 technology was introduced in the ARM1156 core, announced in 2003. Thumb-2 extends the limited 16-bit instruction set of Thumb with additional 32-bit instructions to give the instruction set more breadth, thus producing a variable-length instruction set. A stated aim for Thumb-2 was to achieve code density similar to Thumb with performance similar to the ARM instruction set on 32-bit memory. In ARMv7 this goal can be said to have been met.^{[citation needed]}

Thumb-2 extends the Thumb instruction set with bit-field manipulation, table branches and conditional execution. At the same time, the ARM instruction set was extended to maintain equivalent functionality in both instruction sets. A new "Unified Assembly Language" (UAL) supports generation of either Thumb or ARM instructions from the same source code; versions of Thumb seen on ARMv7 processors are essentially as capable as ARM code (including the ability to write interrupt handlers). Writing such code requires some care, and use of a new "IT" (if-then) instruction, which permits up to four successive instructions to execute based on a tested condition, or on its inverse. When assembling into ARM code the IT instruction is ignored, but when assembling into Thumb it generates an actual instruction. For example:

; if (r0 == r1)
CMP r0, r1
ITE EQ        ; ARM: no code ... Thumb: IT instruction
; then r0 = r2;
MOVEQ r0, r2  ; ARM: conditional; Thumb: condition via ITE 'T' (then)
; else r0 = r3;
MOVNE r0, r3  ; ARM: conditional; Thumb: condition via ITE 'E' (else)
; recall that the Thumb MOV instruction has no bits to encode "EQ" or "NE"
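
The way an IT mnemonic assigns a condition to each following instruction can be sketched in a few lines. This is an illustrative model only: `expand_it` and `INVERSE` are hypothetical names, and the sketch deliberately rejects the AL (always) condition because it has no inverse.

```python
# Each condition code paired with its logical inverse.
_PAIRS = [("EQ", "NE"), ("CS", "CC"), ("MI", "PL"), ("VS", "VC"),
          ("HI", "LS"), ("GE", "LT"), ("GT", "LE")]
INVERSE = {}
for first, second in _PAIRS:
    INVERSE[first] = second
    INVERSE[second] = first

def expand_it(mnemonic, condition):
    """Return the condition each instruction in an IT block must carry.

    mnemonic is e.g. "IT", "ITE" or "ITTE": the first instruction is
    always a 'then', and up to three further T/E letters may follow.
    """
    mnemonic = mnemonic.upper()
    condition = condition.upper()
    if not mnemonic.startswith("IT") or len(mnemonic) > 5:
        raise ValueError("expected IT followed by at most three T/E letters")
    if condition not in INVERSE:
        raise ValueError("unknown or non-invertible condition: " + condition)
    slots = "T" + mnemonic[2:]  # implicit leading 'then'
    result = []
    for slot in slots:
        if slot == "T":
            result.append(condition)
        elif slot == "E":
            result.append(INVERSE[condition])
        else:
            raise ValueError("unexpected letter in IT mnemonic: " + slot)
    return result
```

For the example above, `expand_it("ITE", "EQ")` yields the conditions for MOVEQ and MOVNE in order.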

All ARMv7 chips support the Thumb instruction set. All chips in the Cortex-A series, Cortex-R series, and ARM11 series support both "ARM instruction set state" and "Thumb instruction set state", while chips in the Cortex-M series support only the Thumb instruction set.^[42]^[43]^[44]

Thumb Execution Environment (ThumbEE)

ThumbEE (erroneously called Thumb-2EE in some ARM documentation), marketed as Jazelle RCT (Runtime Compilation Target), was announced in 2005, first appearing in the Cortex-A8 processor. ThumbEE is a fourth instruction set state, making small changes to the Thumb-2 extended Thumb instruction set. These changes make the instruction set particularly suited to code generated at runtime (e.g. by JIT compilation) in managed execution environments. ThumbEE is a target for languages such as Java, C#, Perl, and Python, and allows JIT compilers to output smaller compiled code without impacting performance.

New features provided by ThumbEE include automatic null pointer checks on every load and store instruction, an instruction to perform an array bounds check, and special instructions that call a handler. In addition, because it utilises Thumb-2 technology, ThumbEE provides access to registers r8-r15 (where the Jazelle/DBX Java VM state is held).^[45] Handlers are small sections of frequently called code, commonly used to implement high-level language features, such as allocating memory for a new object. These changes come from repurposing a handful of opcodes, and from knowing the core is in the new ThumbEE instruction set state.

On 23 November 2011, ARM Holdings deprecated any use of the ThumbEE instruction set,^[46] and ARMv8 removes support for ThumbEE.

Floating-point (VFP)

VFP (Vector Floating Point) technology is an FPU coprocessor extension to the ARM architecture. It provides low-cost single-precision and double-precision floating-point computation fully compliant with the ANSI/IEEE Std 754-1985 Standard for Binary Floating-Point Arithmetic. VFP provides floating-point computation suitable for a wide spectrum of applications such as PDAs, smartphones, voice compression and decompression, three-dimensional graphics and digital audio, printers, set-top boxes, and automotive applications. The VFP architecture was intended to support execution of short "vector mode" instructions but these operated on each vector element sequentially and thus did not offer the performance of true single instruction, multiple data (SIMD) vector parallelism. This vector mode was therefore removed shortly after its introduction,^[47] to be replaced with the much more powerful NEON Advanced SIMD unit.

Some devices, such as the ARM Cortex-A8, have a cut-down VFPLite module instead of a full VFP module, and require roughly ten times more clock cycles per float operation.^[48] Other floating-point and/or SIMD coprocessors found in ARM-based processors include FPA, FPE and iwMMXt. They provide some of the same functionality as VFP but are not opcode-compatible with it.

VFPv1: Obsolete.
VFPv2: An optional extension to the ARM instruction set in the ARMv5TE, ARMv5TEJ and ARMv6 architectures.
VFPv3 or VFPv3-D32: Implemented on earlier ARMv7 processors (Cortex-A8 and A9) and backwards compatible with VFPv2, except that it cannot trap floating-point exceptions. VFPv3 has 32x 64-bit FPU registers as standard, adds VCVT instructions to convert between scalar, float and double, and adds an immediate mode to VMOV so that constants can be loaded into FPU registers.
VFPv3-D16: As above, but it has only 16x 64-bit FPU registers.
VFPv3-F16: Uncommon; it supports IEEE754-2008 half-precision (16-bit) floating point.
VFPv4 or VFPv4-D32: Implemented on later ARMv7 processors (Cortex-A12 and A15). VFPv4 has 32x 64-bit FPU registers as standard and adds both half-precision extensions and fused multiply-accumulate instructions to the features of VFPv3.
VFPv4-D16: As above, but it has only 16x 64-bit FPU registers. Implemented on Cortex-A5 and A7 processors.

In Debian Linux and derivatives, armhf (ARM hard float) refers to the ARMv7 architecture including the additional VFPv3-D16 floating-point hardware extension (and Thumb-2) described above. Software packages and cross-compiler tools use the armhf vs. arm/armel suffixes to differentiate.^[49]

Advanced SIMD (NEON)

The Advanced SIMD extension (aka NEON or "MPE" Media Processing Engine) is a combined 64- and 128-bit SIMD instruction set that provides standardized acceleration for media and signal processing applications. NEON is included in all Cortex-A8 devices but is optional in Cortex-A9 devices.^[50] NEON can execute MP3 audio decoding on CPUs running at 10 MHz and can run the GSM adaptive multi-rate (AMR) speech codec at no more than 13 MHz. It features a comprehensive instruction set, separate register files and independent execution hardware.^[51] NEON supports 8-, 16-, 32- and 64-bit integer and single-precision (32-bit) floating-point data and SIMD operations for handling audio and video processing as well as graphics and gaming processing. In NEON, the SIMD supports up to 16 operations at the same time. The NEON hardware shares the same floating-point registers as used in VFP. Devices such as the ARM Cortex-A8 and Cortex-A9 support 128-bit vectors but will execute with 64 bits at a time,^[48] whereas newer Cortex-A15 devices can execute 128 bits at a time.
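
The figure of 16 simultaneous operations follows from dividing the register width by the element width. The sketch below works through that arithmetic; `simd_lanes` and `datapath_passes` are hypothetical helper names, and the model ignores pipelining and issue restrictions.

```python
import math

def simd_lanes(register_bits, element_bits):
    """Number of elements of a given width that fit in one SIMD register."""
    if register_bits % element_bits != 0:
        raise ValueError("element width must divide the register width")
    return register_bits // element_bits

def datapath_passes(vector_bits, datapath_bits):
    """Passes needed to process a vector on a narrower execution datapath."""
    return math.ceil(vector_bits / datapath_bits)

lanes_8bit = simd_lanes(128, 8)      # 8-bit integers in a 128-bit register
lanes_float = simd_lanes(128, 32)    # single-precision floats in 128 bits
passes_64 = datapath_passes(128, 64)    # 64 bits at a time
passes_128 = datapath_passes(128, 128)  # full-width datapath
```

With 8-bit elements a 128-bit register holds 16 lanes, which matches the stated maximum; a core that executes 64 bits at a time needs two passes per 128-bit vector in this simplified model.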

Project Ne10 is ARM's first open source project (from its inception). The Ne10 library is a set of common, useful functions written in both NEON and C (for compatibility). The library was created to allow developers to use NEON optimizations without learning NEON, but it also serves as a set of highly optimized NEON intrinsic and assembly code examples for common DSP, arithmetic and image processing routines. The code is available on GitHub.

Security extensions (TrustZone)

The Security Extensions, marketed as TrustZone Technology, are available in ARMv6KZ and later application profile architectures. They provide a low-cost alternative to adding a dedicated security core to an SoC, by providing two virtual processors backed by hardware-based access control. This lets the application core switch between two states, referred to as worlds (to reduce confusion with other names for capability domains), in order to prevent information from leaking from the more trusted world to the less trusted world. This world switch is generally orthogonal to all other capabilities of the processor, so each world can operate independently of the other while using the same core. Memory and peripherals are then made aware of the operating world of the core and may use this to provide access control to secrets and code on the device.
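
The access-control idea can be pictured with a toy model. This is only a conceptual sketch under stated assumptions: the names `Region`, `SECURE`, `NON_SECURE` and `access_allowed` are hypothetical, and real hardware enforces the policy in bus and memory controllers rather than in software like this.

```python
from dataclasses import dataclass

SECURE = "secure"
NON_SECURE = "non-secure"

@dataclass(frozen=True)
class Region:
    """A memory or peripheral region tagged with its protection level."""
    name: str
    secure_only: bool

def access_allowed(world, region):
    """Toy policy: secure-only regions reject accesses from the
    less trusted world; the more trusted world may access everything."""
    if world not in (SECURE, NON_SECURE):
        raise ValueError("unknown world: " + str(world))
    return world == SECURE or not region.secure_only

key_store = Region("key_store", secure_only=True)    # hypothetical region
frame_buffer = Region("frame_buffer", secure_only=False)
```

In this model a rich operating system running in the less trusted world can use `frame_buffer` but is refused access to `key_store`.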

Typical applications of TrustZone Technology run a rich operating system in the less trusted world and smaller security-specialized code in the more trusted world (named TrustZone Software, a TrustZone-optimised version of the Trusted Foundations Software developed by Trusted Logic Mobility). This allows much tighter digital rights management for controlling the use of media on ARM-based devices,^[52] and prevents any unapproved use of the device. Trusted Foundations Software was acquired by Gemalto. Giesecke & Devrient developed a rival implementation named Mobicore. In April 2012 ARM, Gemalto and Giesecke & Devrient combined their TrustZone portfolios into a joint venture, Trustonic.^[53]^[54] Open Virtualization is an open source implementation of the trusted world architecture for TrustZone.^[55]

In practice, since the specific implementation details of TrustZone are proprietary and have not been publicly disclosed for review, it is unclear what level of assurance is provided for a given threat model.^{[citation needed]}

No-execute page protection

As of ARMv6, the ARM architecture supports no-execute page protection, which is referred to as XN, for eXecute Never.^[56]

ARMv8-R

The ARMv8-R sub-architecture, announced after ARMv8-A, shares some of its features but is not 64-bit.

64/32-bit architecture

ARMv8-A

Announced in October 2011,^[57] ARMv8-A (often called simply ARMv8, although not all ARMv8 variants are 64-bit; ARMv8-R, for example, is not) represents a fundamental change to the ARM architecture. It adds a 64-bit architecture, named "AArch64", and a new "A64" instruction set. AArch64 provides user-space compatibility with the ARMv7-A ISA; the 32-bit architecture is referred to as "AArch32" and the old 32-bit instruction set is now named "A32". The Thumb instruction sets are referred to as "T32" and have no 64-bit counterpart. ARMv8-A allows 32-bit applications to be executed in a 64-bit OS, and a 32-bit OS to be under the control of a 64-bit hypervisor.^[1] ARM announced their Cortex-A53 and Cortex-A57 cores on 30 October 2012.^[22]

To both AArch32 and AArch64, ARMv8-A makes VFPv3/v4 and advanced SIMD (NEON) standard. It also adds cryptography instructions supporting AES and SHA-1/SHA-256.

AArch64 features

New instruction set, A64
- Has 31 general-purpose 64-bit registers.
- Has separate dedicated SP and PC.
- Instructions are still 32 bits long and mostly the same as A32 (with LDM/STM instructions and most conditional execution dropped).
  - Has paired loads/stores (in place of LDM/STM).
- Most instructions can take 32-bit or 64-bit arguments.
- Addresses assumed to be 64-bit.
Advanced SIMD (NEON) enhanced
- Has 32× 128-bit registers (up from 16), also accessible via VFPv4.
- Supports double-precision floating point.
- Fully IEEE 754 compliant
- AES encrypt/decrypt and SHA-1/SHA-2 hashing instructions also use these registers.
A new exception system
- Fewer banked registers and modes.
Memory translation from 48-bit virtual addresses based on the existing LPAE, which was designed to be easily extended to 64-bit.
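
To put the 48-bit virtual address figure in perspective, the size of an address space is two raised to the number of address bits. The sketch below works this out for byte-addressed memory; `address_space_bytes`, `GIB` and `TIB` are hypothetical names introduced for illustration.

```python
GIB = 1 << 30  # bytes in one gibibyte
TIB = 1 << 40  # bytes in one tebibyte

def address_space_bytes(address_bits):
    """Bytes addressable with the given number of address bits."""
    return 1 << address_bits

space_32 = address_space_bytes(32)  # a 32-bit address space
space_48 = address_space_bytes(48)  # a 48-bit virtual address space
```

A 32-bit address reaches 4 GiB (`space_32 // GIB`), while a 48-bit virtual address reaches 256 TiB (`space_48 // TIB`).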

Operating system support

32-bit operating systems

Historical operating systems: The first ARM-based personal computer, the Acorn Archimedes, ran an interim operating system called Arthur, which evolved into RISC OS, used on later ARM-based systems from Acorn and other vendors. Some Acorn machines also had a Unix port called RISC iX.

Embedded operating systems: The ARM architecture is supported by a large number of embedded and real-time operating systems, including Linux, Windows CE, Symbian, ChibiOS/RT, FreeRTOS, eCos, Integrity, Nucleus PLUS, MicroC/OS-II, PikeOS,^[58] QNX, RTEMS, RTXC Quadros, ThreadX, VxWorks, DRYOS, MQX, T-Kernel, OSE, SCIOPTA and RISC OS.

Mobile device operating systems: The ARM architecture is the primary hardware environment for most mobile device operating systems, such as iOS, Android, Windows Phone, Windows RT, Bada, BlackBerry OS/BlackBerry 10, MeeGo, Firefox OS, Tizen, Ubuntu Touch, Sailfish and Igelle OS.

Desktop operating systems: The ARM architecture is supported by RISC OS and multiple Unix-like operating systems including BSD and various Linux distributions such as Ubuntu and Chrome OS.

64-bit operating systems

Mobile device operating systems: iOS 7 on the 64-bit Apple A7 SoC has ARMv8-A application support.

Desktop operating systems: Patches to the Linux kernel adding ARMv8-A support were posted for review by Catalin Marinas of ARM Ltd. The patches were included in Linux kernel version 3.7 in late 2012.^[59] ARMv8-A is supported by some Linux distributions.

Comparison of current ARM cores

From Wikipedia, the free encyclopedia

This list provides an overview of the properties of ARM architecture microprocessor cores.

	ARM11	ARM Cortex-A5	ARM Cortex-A7	ARM Cortex-A8	ARM Cortex-A9	Qualcomm Scorpion	Qualcomm Krait^[1]	ARM Cortex-A15 MPCore
Architecture	ARMv6	ARMv7	ARMv7	ARMv7	ARMv7	ARMv7	ARMv7	ARMv7
Decode	single-issue	single-issue	2-wide	2-wide	2-wide	2-wide	3-wide	3-wide
Pipeline depth	8 stages		8 stages	13 stages	8 stages	10 stages	11 stages	15/17-25 stages
Out-of-order execution	No	No	No	No	Yes	non-speculative^[2]	Yes	Yes
FPU	VFPv2	VFPv4 (optional)	VFPv4	VFPv3	VFPv3 (optional)	VFPv3	VFPv4^[3]	VFPv4
Pipelined VFP	Yes		Yes	No	Yes	Yes	Yes	Yes
FPU registers	32× 32-bit	16 × 64-bit	16 × 64-bit	32 × 64-bit	(16 or 32) × 64-bit			32 × 64-bit
NEON (SIMD)	No	64-bit wide (optional)	64-bit wide	64-bit wide	64-bit wide (optional)	128-bit wide	128-bit wide	128-bit wide
Execution ports						3	7
Process technology	90/65/45 nm		40/28 nm	65/55/45 nm	65/45/40/32/28 nm	65/45 nm	28 nm	32/28 nm
L0 cache							4 KB + 4 KB direct mapped
L1 cache	Varying, typically 16 KB + 16 KB	4-64 KB / core	8-64 KB / core	32 KB + 32 KB	32 KB + 32 KB	32 KB + 32 KB	16 KB + 16 KB 4-way set associative	32 KB + 32 KB per core
L2 cache	Varying, typically none		up to 1 MB (optional)	256 or 512 (typical) KB	1 MB	256 KB (Single-core)/512 KB (Dual-core)	1 MB 8-way set associative (Dual-core)/2 MB (Quad-core)	up to 4 MB per cluster, up to 8 MB per chip
Core configurations	1	1, 2, 4	1, 2, 4, 8	1	1, 2, 4	1, 2	2, 4	2, 4, 8 (4×2)
Speed per core (DMIPS/MHz)	1.25	1.57	1.9	2.0	2.5	2.1	3.3 (Krait) / 3.1 (Krait 200) / 3.4 (Krait 300)^[4] / 3.6 (Krait 400)	3.5
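
The DMIPS/MHz figures in the last row can be turned into an approximate throughput by multiplying by clock speed and core count. The sketch below assumes ideal linear scaling, which real systems do not achieve exactly; `dmips` is a hypothetical helper and the clock speeds are chosen only for illustration.

```python
def dmips(dmips_per_mhz, clock_mhz, cores=1):
    """Estimated Dhrystone MIPS, assuming ideal linear scaling
    with clock speed and core count (an optimistic simplification)."""
    return dmips_per_mhz * clock_mhz * cores

# Per-core ratings taken from the table; clock speeds are illustrative.
arm11_estimate = dmips(1.25, 800)             # single ARM11 core
cortex_a9_estimate = dmips(2.5, 2000, cores=2)  # dual-core Cortex-A9
```

Under these assumptions a dual-core Cortex-A9 at 2000 MHz comes out at 10,000 DMIPS, and a single ARM11 core at 800 MHz at 1,000 DMIPS.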

ARM-based chips

Application Processors

Cortex-A5	Actions ATM702x Atmel SAMA5D3 Qualcomm Snapdragon S4 Play, 200 InfoTMIC iMAPx820, iMAPx15 Telechips TCC892x

Cortex-A7	Allwinner A2x, A3x HiSilicon K3V3 Leadcore LC1813 MediaTek MT65xx Qualcomm Snapdragon 200, 400 Samsung Exynos 5410

Cortex-A8	Allwinner A1x Apple A4 Freescale i.MX5x Rockchip RK291x Samsung Exynos 3110, S5PC110, S5PV210 Texas Instruments OMAP 3 ZiiLABS ZMS-08

Cortex-A9	Altera FPGAs Amlogic AML8726 Apple A5, A5X Freescale i.MX6x HiSilicon K3V2 MediaTek MT657x Nvidia Tegra 2, 3, 4i Nufront NuSmart 2816M, NS115, NS115M Renesas EMMA EV2 Rockchip RK292x, RK30xx, RK31xx Samsung Exynos 4 ST-Ericsson NovaThor Telechips TCC8803 Texas Instruments OMAP 4 VIA WonderMedia WM88x0, 89x0 Xilinx FPGAs ZiiLABS ZMS-20, ZMS-40

Cortex-A12	Rockchip RK3288

Cortex-A15	HiSilicon K3V3 MediaTek MT6599 Nvidia Tegra 4, K1 Samsung Exynos 5 Texas Instruments OMAP 5 Allwinner A80

ARMv7-A compatible	Apple A6, A6X (Swift) Broadcom Brahma-B15 Marvell P4J Qualcomm Snapdragon S1, S2, S3, S4 Plus, S4 Pro, 600, 800 (Scorpion, Krait)

Cortex-A53	Altera FPGAs Qualcomm Snapdragon 410

Cortex-A57	AMD Hierofalcon

ARMv8 compatible	Apple A7 (Cyclone) Nvidia Tegra K1 (Project Denver) Applied Micro Circuits Corporation X-Gene

Embedded Microcontrollers

Cortex-M0	Cypress Semiconductor PSoC 4 NXP LPC11xx, LPC12xx STMicroelectronics STM32 F0

Cortex-M0+	Atmel SAMD20 Energy Micro EFM32 Zero Freescale Kinetis E, L, M NXP LPC8xx

Cortex-M1	Actel FPGAs Altera FPGAs Xilinx FPGAs

Cortex-M3	Actel SmartFusion, SmartFusion 2 Analog Devices ADuCM360 Atmel SAM3A, SAM3N, SAM3S, SAM3U, SAM3X Cypress Semiconductor PSoC 5 Energy Micro EFM32 Tiny, Gecko, Leopard, Giant Fujitsu FM3 Holtek HT32F125x NXP LPC13xx, LPC17xx, LPC18xx ON Semiconductor Q32M210 Silicon Labs Precision32 STMicroelectronics STM32 F1, F2, L1, W Texas Instruments F28, LM3, TMS470, OMAP 4 Toshiba TX03

Cortex-M4	Atmel SAM4L, SAM4N, SAM4S Freescale Kinetis K

Cortex-M4F	Atmel SAM4E, SAM4C (dual core), SAMG Energy Micro EFM32 Wonder Freescale Kinetis K Infineon XMC4000 NXP LPC40xx, LPC43xx STMicroelectronics STM32 F3, F4 Texas Instruments LM4F

Real-Time Microcontrollers

Cortex-R4F	Texas Instruments RM4, TMS570

Cortex-R5F	Scaleo OLEA

Classic Processors

ARM7	Atmel SAM7L, SAM7S, SAM7SE, SAM7X, SAM7XC, AT91CAP7, AT91M, AT91R NXP LPC21xx, LPC22xx, LPC23xx, LPC24xx, LH7 STMicroelectronics STR7

ARM9	Atmel SAM9G, SAM9M, SAM9N, SAM9R, SAM9X, SAM9XE, SAM926x, AT91CAP9 Freescale i.MX1x, i.MX2x Rockchip RK27xx, RK28xx NXP LPC29xx, LPC3xxx, LH7A ST-Ericsson Nomadik STn881x STMicroelectronics STR9 Texas Instruments OMAP 1, AM1x VIA WonderMedia WM8505/8650 ZiiLABS ZMS-05

ARM11	Broadcom BCM2835 (Raspberry Pi) Freescale i.MX3x Infotmic IMAPX210/220 Mindspeed Comcerto 1000 Nvidia Tegra APX, 6xx Qualcomm MSM7000, Snapdragon S1 ST-Ericsson Nomadik STn882x Telechips TCC8902 Texas Instruments OMAP 2 VIA WonderMedia WM87x0

ARMv4 compatible	Digital Equipment Corporation StrongARM

ARMv5 compatible	Intel/Marvell XScale Marvell Sheeva, Feroceon, Jolteon, Mohawk

ARMv6 compatible	Mindspeed Comcerto 1000

http://en.wikipedia.org/wiki/List_of_ARM_cores

List of ARM cores

"ARM8" redirects here. For the ARMv8 instruction set architecture, see ARM architecture#ARMv8.

This is a sub-article to ARM architecture.

This is a list of ARM architecture-based microprocessor cores by ARM Holdings and third parties, sorted by generation release and name. ARM provides a summary of the numerous vendors who implement ARM cores in their designs.^[1] Keil also provides a somewhat newer summary of vendors of ARM-based processors.^[2] ARM further provides a chart^[3] giving an overview of the ARM processor lineup, comparing performance and functionality against capabilities for the more recent ARM core families.

ARM cores

Designed by ARM

ARM Family	ARM Architecture	ARM Core	Feature	Cache (I / D), MMU	Typical MIPS @ MHz
ARM1	ARMv1	ARM1	First implementation	None
ARM2	ARMv2	ARM2	ARMv2 added the MUL (multiply) instruction	None	4 MIPS @ 8 MHz 0.33 DMIPS/MHz
ARM2	ARMv2a	ARM250	Integrated MEMC (MMU), Graphics and IO processor. ARMv2a added the SWP and SWPB (swap) instructions.	None, MEMC1a	7 MIPS @ 12 MHz
ARM3	ARMv2a	ARM3	First integrated memory cache.	4 KB unified	12 MIPS @ 25 MHz 0.50 DMIPS/MHz
ARM6	ARMv3	ARM60	ARMv3 first to support 32-bit memory address space (previously 26-bit)	None	10 MIPS @ 12 MHz
		ARM600	As ARM60, cache and coprocessor bus (for FPA10 floating-point unit).	4 KB unified	28 MIPS @ 33 MHz
		ARM610	As ARM60, cache, no coprocessor bus.	4 KB unified	17 MIPS @ 20 MHz 0.65 DMIPS/MHz
ARM7	ARMv3	ARM700		8 KB unified	40 MHz
		ARM710	As ARM700, no coprocessor bus.	8 KB unified	40 MHz
		ARM710a	As ARM710	8 KB unified	40 MHz 0.68 DMIPS/MHz
ARM7TDMI	ARMv4T	ARM7TDMI(-S)	3-stage pipeline, Thumb, ARMv4 first to drop legacy ARM 26-bit addressing	none	15 MIPS @ 16.8 MHz 63 DMIPS @ 70 MHz
		ARM710T	As ARM7TDMI, cache	8 KB unified, MMU	36 MIPS @ 40 MHz
		ARM720T	As ARM7TDMI, cache	8 KB unified, MMU with Fast Context Switch Extension	60 MIPS @ 59.8 MHz
		ARM740T	As ARM7TDMI, cache	MPU
ARM7EJ	ARMv5TEJ	ARM7EJ-S	5-stage pipeline, Thumb, Jazelle DBX, Enhanced DSP instructions	none
ARM8	ARMv4	ARM810^[4]^[5]	5-stage pipeline, static branch prediction, double-bandwidth memory	8 KB unified, MMU	84 MIPS @ 72 MHz 1.16 DMIPS/MHz
ARM9TDMI	ARMv4T	ARM9TDMI	5-stage pipeline, Thumb	none
		ARM920T	As ARM9TDMI, cache	16 KB / 16 KB, MMU with FCSE (Fast Context Switch Extension)^[6]	200 MIPS @ 180 MHz
		ARM922T	As ARM9TDMI, caches	8 KB / 8 KB, MMU
		ARM940T	As ARM9TDMI, caches	4 KB / 4 KB, MPU
ARM9E	ARMv5TE	ARM946E-S	Thumb, Enhanced DSP instructions, caches	variable, tightly coupled memories, MPU
		ARM966E-S	Thumb, Enhanced DSP instructions	no cache, TCMs
		ARM968E-S	As ARM966E-S	no cache, TCMs
	ARMv5TEJ	ARM926EJ-S	Thumb, Jazelle DBX, Enhanced DSP instructions	variable, TCMs, MMU	220 MIPS @ 200 MHz
	ARMv5TE	ARM996HS	Clockless processor, as ARM966E-S	no caches, TCMs, MPU
ARM10E	ARMv5TE	ARM1020E	6-stage pipeline, Thumb, Enhanced DSP instructions, (VFP)	32 KB / 32 KB, MMU
	ARMv5TE	ARM1022E	As ARM1020E	16 KB / 16 KB, MMU
	ARMv5TEJ	ARM1026EJ-S	Thumb, Jazelle DBX, Enhanced DSP instructions, (VFP)	variable, MMU or MPU
ARM11	ARMv6	ARM1136J(F)-S^[7]	8-stage pipeline, SIMD, Thumb, Jazelle DBX, (VFP), Enhanced DSP instructions	variable, MMU	740 @ 532–665 MHz (i.MX31 SoC), 400–528 MHz
	ARMv6T2	ARM1156T2(F)-S	8-stage pipeline, SIMD, Thumb-2, (VFP), Enhanced DSP instructions	variable, MPU
	ARMv6Z	ARM1176JZ(F)-S	As ARM1136EJ(F)-S	variable, MMU + TrustZone	965 DMIPS @ 772 MHz, up to 2 600 DMIPS with four processors^[8]
	ARMv6K	ARM11 MPCore	As ARM1136EJ(F)-S, 1–4 core SMP	variable, MMU
SecurCore	ARMv6-M	SC000			0.9 DMIPS/MHz
	ARMv4T	SC100
	ARMv7-M	SC300			1.25 DMIPS/MHz
Cortex-M	ARMv6-M	Cortex-M0 ^[9]	Microcontroller profile, Thumb + Thumb-2 subset (BL, MRS, MSR, ISB, DSB, DMB),^[10] hardware multiply instruction (optional small), optional system timer, optional bit-banding memory	No cache, No TCM, No MPU	0.84 DMIPS/MHz
		Cortex-M0+^[11]	Microcontroller profile, Thumb + Thumb-2 subset (BL, MRS, MSR, ISB, DSB, DMB),^[10] hardware multiply instruction (optional small), optional system timer, optional bit-banding memory	No cache, No TCM, optional MPU with 8 regions	0.93 DMIPS/MHz
		Cortex-M1^[12]	Microcontroller profile, Thumb + Thumb-2 subset (BL, MRS, MSR, ISB, DSB, DMB),^[10] hardware multiply instruction (optional small), OS option adds SVC / banked stack pointer, optional system timer, no bit-banding memory	No cache, 0-1024 KB I-TCM, 0-1024 KB D-TCM, No MPU	136 DMIPS @ 170 MHz,^[13] (0.8 DMIPS/MHz FPGA-dependent)^[14]
	ARMv7-M	Cortex-M3^[15]	Microcontroller profile, Thumb / Thumb-2, hardware multiply and divide instructions, optional bit-banding memory	No cache, No TCM, optional MPU with 8 regions	1.25 DMIPS/MHz
	ARMv7E-M	Cortex-M4^[16]	Microcontroller profile, Thumb / Thumb-2 / DSP / optional FPv4 single-precision FPU, hardware multiply and divide instructions, optional bit-banding memory	No cache, No TCM, optional MPU with 8 regions	1.25 DMIPS/MHz
Cortex-R	ARMv7-R	Cortex-R4^[17]	Real-time profile, Thumb / Thumb-2 / DSP / optional VFPv3 FPU, hardware multiply and optional divide instructions, optional parity & ECC for internal buses / cache / TCM, 8-stage pipeline dual-core running lockstep with fault logic	0-64 KB / 0-64 KB, 0-2 of 0-8 MB TCM, opt MPU with 8/12 regions
		Cortex-R5 (MPCore) ^[18]	Real-time profile, Thumb / Thumb-2 / DSP / optional VFPv3 FPU and precision, hardware multiply and optional divide instructions, optional parity & ECC for internal buses / cache / TCM, 8-stage pipeline dual-core running lock-step with fault logic / optional as 2 independent cores, low-latency peripheral port (LLPP), accelerator coherency port (ACP) ^[19]	0-64 KB / 0-64 KB, 0-2 of 0-8 MB TCM, opt MPU with 12/16 regions
		Cortex-R7 (MPCore) ^[20]	Real-time profile, Thumb / Thumb-2 / DSP / optional VFPv3 FPU and precision, hardware multiply and optional divide instructions, optional parity & ECC for internal buses / cache / TCM, 11-stage pipeline dual-core running lock-step with fault logic / out-of-order execution / dynamic register renaming / optional as 2 independent cores, low-latency peripheral port (LLPP), ACP ^[19]	0-64 KB / 0-64 KB, ? of 0-128 KB TCM, opt MPU with 16 regions
Cortex-A	ARMv7-A	Cortex-A5^[21]	Application profile, ARM / Thumb / Thumb-2 / DSP / SIMD / Optional VFPv4-D16 FPU / Optional NEON / Jazelle RCT and DBX, 1–4 cores / optional MPCore, snoop control unit (SCU), generic interrupt controller (GIC), accelerator coherence port (ACP)	4-64 KB / 4-64 KB L1, MMU + TrustZone	1.57 DMIPS / MHz per core
		Cortex-A7 MPCore ^[22]	Application profile, ARM / Thumb / Thumb-2 / DSP / VFPv4-D16 FPU / NEON / Jazelle RCT and DBX / Hardware virtualization, in-order execution, superscalar, 1–4 SMP cores, Large Physical Address Extensions (LPAE), snoop control unit (SCU), generic interrupt controller (GIC), ACP, architecture and feature set are identical to A15, 8-10 stage pipeline, low-power design^[23]	32 KB / 32 KB L1, 0-4 MB L2, MMU + TrustZone	1.9 DMIPS / MHz per core
		Cortex-A8^[24]	Application profile, ARM / Thumb / Thumb-2 / VFPv3 FPU / NEON / Jazelle RCT and DAC, 13-stage superscalar pipeline	16-32 KB / 16-32 KB L1, 0-1 MB L2 opt ECC, MMU + TrustZone	up to 2000 (2.0 DMIPS/MHz in speed from 600 MHz to greater than 1 GHz)
		Cortex-A9 MPCore ^[25]	Application profile, ARM / Thumb / Thumb-2 / DSP / Optional VFPv3 FPU / Optional NEON / Jazelle RCT and DBX, out-of-order speculative issue superscalar, 1–4 SMP cores, snoop control unit (SCU), generic interrupt controller (GIC), accelerator coherence port (ACP)	16-64 KB / 16-64 KB L1, 0-8 MB L2 opt parity, MMU + TrustZone	2.5 DMIPS/MHz per core, 10,000 DMIPS @ 2 GHz on Performance Optimized TSMC 40G(dual core)
		ARM Cortex-A12 ^[26]	Application profile, ARM / Thumb-2 / DSP / VFPv4 FPU / NEON / Hardware virtualization, out-of-order speculative issue superscalar, 1–4 SMP cores, Large Physical Address Extensions (LPAE), snoop control unit (SCU), generic interrupt controller (GIC), accelerator coherence port (ACP)	32-64 KB / 32 KB L1, 256 KB-8 MB L2	3.0 DMIPS / MHz per core
		Cortex-A15 MPCore ^[27]	Application profile, ARM / Thumb / Thumb-2 / DSP / VFPv4 FPU / NEON / Integer divide / Fused MAC / Jazelle RCT / Hardware virtualization, out-of-order speculative issue superscalar, 1–4 SMP cores, Large Physical Address Extensions (LPAE), snoop control unit (SCU), generic interrupt controller (GIC), ACP, 15-24 stage pipeline^[23]	32 KB I$ w/parity / 32 KB D$ w/ECC L1, 0-4 MB L2, L2 has ECC, MMU + TrustZone	At least 3.5 DMIPS/MHz per core (Up to 4.01 DMIPS/MHz depending on implementation).^[28]
Cortex-A50	ARMv8-A	Cortex-A53^[29]	Application profile, AArch32 and AArch64, 1-4 SMP cores, Trustzone, NEON advanced SIMD, VFPv4, hardware virtualization, dual issue, in-order pipeline	8-64 KB w/parity / 8-64 KB w/ECC L1 per core, 128 KB-2 MB L2 shared, 40-bit physical addresses	2.3 DMIPS/MHz
Cortex-A50	ARMv8-A	Cortex-A57^[30]	Application profile, AArch32 and AArch64, 1-4 SMP cores, Trustzone, NEON advanced SIMD, VFPv4, hardware virtualization, multi-issue, deeply out-of-order pipeline	48 KB w/DED parity / 32 KB w/ECC L1 per core, 512 KB-2 MB L2 shared, 44-bit physical addresses	At least 4.1 DMIPS/MHz per core (Up to 4.76 DMIPS/MHz depending on implementation).

Designed by third parties

These cores implement the ARM instruction set, and were developed independently by companies with an architectural license from ARM.

Family	ARM Architecture	Core	Feature	Cache (I / D), MMU	Typical MIPS @ MHz
StrongARM	ARMv4	SA-110	5-stage pipeline	16 KB / 16 KB, MMU	100-206 MHz 1.0 DMIPS/MHz
StrongARM	ARMv4	SA-1100	derivative of the SA-110	16 KB / 8 KB, MMU
Faraday^[31]	ARMv4	FA510	6-stage pipeline	up to 32 KB / 32 KB Cache, MPU	1.26 DMIPS/MHz 100-200 MHz
		FA526	6-stage pipeline	up to 32 KB / 32 KB Cache, MMU	1.26 MIPS/MHz 166-300 MHz
		FA626	8-stage pipeline	32 KB / 32 KB Cache, MMU	1.35 DMIPS/MHz 500 MHz
	ARMv5TE	FA606TE	5-stage pipeline	no cache, no MMU	1.22 DMIPS/MHz 200 MHz
		FA626TE	8-stage pipeline	32 KB / 32 KB Cache, MMU	1.43 MIPS/MHz 800 MHz
		FMP626TE	8-stage pipeline, SMP		1.43 MIPS/MHz 500 MHz
		FA726TE	13 stage pipeline, dual issue		2.4 DMIPS/MHz 1000 MHz
XScale	ARMv5TE	XScale	7-stage pipeline, Thumb, Enhanced DSP instructions	32 KB / 32 KB, MMU	133–400 MHz
		Bulverde	Wireless MMX, Wireless SpeedStep added	32 KB / 32 KB, MMU	312–624 MHz
		Monahans^[32]	Wireless MMX2 added	32 KB / 32 KB (L1), optional L2 cache up to 512 KB, MMU	up to 1.25 GHz
Sheeva	ARMv5	Feroceon	5-8 stage pipeline, single-issue	16 KB / 16 KB, MMU	600-2000 MHz
		Jolteon	5-8 stage pipeline, dual-issue	32 KB / 32 KB, MMU	600-2000 MHz
		PJ1 (Mohawk)	5-8 stage pipeline, single-issue, Wireless MMX2	32 KB / 32 KB, MMU	1.46 DMIPS/MHz 1.06 GHz
	ARMv6 / ARMv7-A	PJ4	6-9 stage pipeline, dual-issue, Wireless MMX2, SMP	32 KB / 32 KB, MMU	2.41 DMIPS/MHz 1.6 GHz
Snapdragon	ARMv7-A	Scorpion ^[33]	1 or 2 cores. ARM / Thumb / Thumb-2 / DSP / SIMD / VFPv3 FPU / NEON (128-bit wide)	256 KB L2 per core	2.1 DMIPS / MHz per core
Snapdragon	ARMv7-A	Krait ^[33]	1, 2, or 4 cores. ARM / Thumb / Thumb-2 / DSP / SIMD / VFPv4 FPU / NEON (128-bit wide)	4 KB / 4 KB L0, 16 KB / 16 KB L1, 512 KB L2 per core	3.3 DMIPS / MHz per core
Apple A6, Apple A6X	ARMv7-A	Apple Swift^[34]	2 cores. ARM / Thumb / Thumb-2 / DSP / SIMD / VFPv4 FPU / NEON	L1: 32 KB / 32 KB, L2: 1 MB	3.5 DMIPS / MHz Per Core
Apple A7	ARMv8-A	Apple Cyclone	2 cores. ARM / Thumb / Thumb-2 / DSP / SIMD / VFPv4 FPU / NEON / TrustZone / AArch64	L1: 64 KB / 64 KB, L2: 1 MB
X-Gene	ARMv8-A	X-Gene	64 bit, quad issue, SMP	Cache, MMU, Virtualization	3 GHz
Denver	ARMv8-A	Parker	64 bit

http://en.wikipedia.org/wiki/ARM_Cortex-A8

ARM Cortex-A8

ARM Cortex-A8
Designed by	ARM Holdings
Common manufacturer(s)	TSMC
Instruction set	ARMv7
Cores	1
L1 cache	32 KiB/32 KiB
L2 cache	512 KiB

The ARM Cortex-A8 is a processor core designed by ARM Holdings implementing the ARMv7 (32-bit) instruction set architecture.

Compared to the ARM11 core, the Cortex-A8 is a dual-issue superscalar design, achieving roughly twice the instructions executed per clock cycle. The Cortex-A8 was the first Cortex design to be adopted on a large scale for use in consumer devices.^[1]

Key features of the Cortex-A8 core are:

Frequency from 600 MHz to 1 GHz and above
Superscalar dual-issue microarchitecture
NEON SIMD instruction set extension ^[2]
VFPv3 Floating Point Unit
Thumb-2 instruction set encoding
Jazelle RCT (Also known as ThumbEE instruction set)
Advanced branch prediction unit with >95% accuracy
Integrated level 2 Cache (0–4 MiB)
2.0 DMIPS/MHz

Chips

Several system-on-chips (SoC) have implemented the Cortex-A8 core, including:

http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore

ARM Cortex-A9 MPCore

ARM Cortex-A9 MPCore
Designed by	ARM Holdings
Common manufacturer(s)	TSMC Samsung Electronics
Max. CPU clock rate	0.8 GHz to 2 GHz
Instruction set	ARMv7
Cores	1–4
L1 cache	32 KB I, 32 KB D
L2 cache	128 KB–8 MB (configurable with L2 cache controller)

The ARM Cortex-A9 MPCore is a 32-bit multicore processor providing up to 4 cache-coherent Cortex-A9 cores, each implementing the ARMv7 instruction set architecture.^[1]

Overview

Key features of the Cortex-A9 core are:^[2]

Out-of-order speculative issue superscalar execution pipeline giving 2.50 DMIPS/MHz/core.
NEON SIMD instruction set extension performing up to 16 operations per instruction (optional).
High performance VFPv3 floating point unit doubling the performance of previous ARM FPUs (optional).
Thumb-2 instruction set encoding reduces the size of programs with little impact on performance.
TrustZone security extensions.
Jazelle DBX support for Java execution.
Jazelle RCT for JIT compilation.
Program Trace Macrocell and CoreSight Design Kit for unobtrusive tracing of instruction execution.
L2 cache controller (0-4 MB).
Multi-core processing.

ARM states that the TSMC 40G hard macro implementation typically operates at 2 GHz; a single core (excluding caches) occupies less than 1.5 mm² when designed in a TSMC 65 nanometer (nm) generic process^[3] and can be clocked at speeds over 1 GHz, consuming less than 250 mW per core.^[4]

Chips

Several system on a chip (SoC) devices implement the Cortex-A9 core, including:

Altera SoC FPGA^[5]
AMLogic AML8726-M^[6]
Apple A5, A5X
Broadcom BCM11311 (Persona ICE)^[7]
Calxeda EnergyCore ECX-1000^[8]
Entropic EN7588^[9]
Freescale Semiconductor i.MX6^[10]
HiSilicon^[11] K3V2 (Hi3620)^[12]
Marvell Avastar 88W8787, used in the Sony PlayStation Vita^[13]^[14]
MediaTek MT6575^[15] (single core), MT6577^[16] (dual core)
Nufront NuSmart 2816, 2816M, 115^[17]
Nvidia Tegra 2 (without NEON extensions), Tegra 3 and Tegra 4i
Trident Microsystems 847x/8x/9x SoC family^[18]
Renesas Electronics EMMA Mobile/EV2
Samsung Exynos 4210,^[19] 4212, 4412
Rockchip RK3066,^[20] RK292x, RK31xx
STMicroelectronics SPEAr1310,^[21] SPEAr1340^[22]
ST-Ericsson Nova A9500, NovaThor U8500,^[23] NovaThor U9500^[24]
Texas Instruments OMAP4 processors
WonderMedia WM8850, WM8950 and WM8980^[25]
Xilinx Extensible Processing Platform^[26]
ZiiLABS ZMS-20^[27]

Systems on a chip

Developer	Name	Cores	Process	NEON SIMD	Vector floating point unit	GPU
Altera	SoC FPGA	1-2	28 nm	Yes	VFPv3	optionally implemented in FPGA; TES Electronic Solutions D/AVE HD
AMLogic	AML8726-M	1	65 nm	Yes	VFPv3	ARM Mali-400
AMLogic	AML8726-MX	2	40 nm	Yes	VFPv3	ARM Mali-400 MP2
AMLogic	AML8726-M8	4	28 nm	Yes	VFPv3	ARM Mali-450 MP6
Apple Inc.	A5	2	32 nm, 45 nm	Yes	VFPv3	PowerVR SGX543MP2
Apple Inc.	A5X	2	45 nm	Yes	VFPv3	PowerVR SGX543MP4
Broadcom	BCM11311 (Persona ICE)	2	40 nm	?	?	Broadcom Videocore IV
Broadcom	BCM21654G	1	40 nm	Yes	VFPv3	Broadcom Videocore IV
Broadcom	BCM21664T	2	40 nm	Yes	VFPv3	Broadcom Videocore IV
Calxeda	EnergyCore ECX-1000^[8]	4	40 nm	Yes	VFPv3	-
Freescale Semiconductor	i.MX6^[28]	1-4	40 nm	Yes	VFPv3-D16	Vivante Corporation GPU IP cores^[29]
HiSilicon	K3V2 (Hi3620)	4	40 nm	Yes	VFPv3	Vivante GC4000
LG Corp	LG L9	2	?	?	?	ARM Mali-400 MP4
Marvell	PXA986	2	45 nm	Yes	VFPv3	PowerVR SGX540 / Vivante GC1000 (Galaxy Tab 3 7-inch)
Marvell	PXA988	2	45 nm	Yes	VFPv3	?
MediaTek	MT6575	1	40 nm	Yes	VFPv3	PowerVR SGX531^[15]
MediaTek	MT6577	2	40 nm	Yes	VFPv3	PowerVR SGX531^[16]
Nufront	NuSmart™ 2816 (NS2816)	2	?	Yes	VFPv3	ARM Mali-400^[30]
Nufront	NuSmart™ 2816M (NS2816M)	2	?	Yes	VFPv3	ARM Mali-400
Nufront	NuSmart™ 115 (NS115)	2	?	Yes	VFPv3	ARM Mali-400
Nvidia	Tegra 2 series	2	40 nm	No	VFPv3-D16	GeForce ULP
Nvidia	Tegra 3 (Kal-El) series	4	40 nm	Yes	VFPv3	GeForce ULP
Renesas Electronics	EMMA Mobile/EV2^[31]	?	?	Yes	?	PowerVR SGX530
Rockchip	RK2928	1	40 nm	?	?	ARM Mali-400
Rockchip	RK3066^[20]	2	40 nm	Yes	VFPv3	ARM Mali-400 MP4
Rockchip	RK3128	2	?	Yes	VFPv3	ARM Mali-400 MP4
Rockchip	RK3188^[32]	4	28 nm	Yes	VFPv3	ARM Mali-400 MP4
Samsung	Exynos 4 Dual	2	45 nm	Yes	VFPv3	ARM Mali-400 MP4
Samsung	Exynos 4 Dual	2	32 nm	Yes	VFPv3	ARM Mali-400 MP4
Samsung	Exynos 4 Quad	4	32 nm	Yes	VFPv3	ARM Mali-400 MP4
STMicroelectronics	SPEAr1310	?	?	No	VFPv3	–
STMicroelectronics	SPEAr1340	?	?	No	VFPv3	ARM Mali-200^[33]
ST-Ericsson	Nova A9500	2	45 nm	Yes	VFPv3	ARM Mali-400
ST-Ericsson	NovaThor U8500	2	45 nm	Yes	VFPv3	ARM Mali-400
ST-Ericsson	NovaThor U9500	2	45 nm	Yes	VFPv3	ARM Mali-400
Sony	PlayStation Vita	4	40 nm	Yes	VFPv3	PowerVR SGX543MP4+
Texas Instruments	OMAP4430, OMAP4460	2	45 nm	Yes	VFPv3	PowerVR SGX540
Texas Instruments	OMAP4470	2	45 nm	Yes	VFPv3	PowerVR SGX544
Trident Microsystems	PNX8473^[34]	1	?	?	?	PowerVR SGX531
Trident Microsystems	PNX8483^[35]	1	?	?	?	PowerVR SGX531
Trident Microsystems	PNX8491^[36]	1	?	?	?	PowerVR SGX531
WonderMedia	WM8850	1	?	Yes	?	ARM Mali-400
WonderMedia	WM8880	2	40 nm	?	?	ARM Mali-400 MP2
WonderMedia	WM8950	1	?	?	?	ARM Mali-400^[25]
WonderMedia	WM8980	2	40 nm	?	?	ARM Mali-400 MP2
Xilinx	Zynq-7000^[37]	?	28 nm	Yes	VFPv3	–
ZiiLABS	ZMS-20	?	?	Yes	VFPv3	ZiiLABS flexible Stemcell media processing

Development platforms

Developer	Name	SoC	RAM	ROM	SD	SATA	USB	Ethernet	Wi-Fi	Bluetooth	GPS	Accelerometer	Magnetometer	Gyroscope	Barometer
Origenboard	Origenboard^[38]	Samsung Exynos 4210	1 GiB DDR3	-	2 Port SD/MMC Card Slot	-	embedded	-	SWB-A31	SWB-A31	-	-	-	-	-
Odroid	Odroid-X^[39]	Samsung Exynos 4412	1 GiB LP-DDR2 800	-	SDHC Card Slot + eMMC module socket	-	6*USB2.0 host + µUSB2.0 device	10/100 Mb	-	-	-	-	-	-	-
PandaBoard	PandaBoard	TI OMAP4430^[40]	1 GiB LP-DDR2	-	Full size SD/MMC card	-	LAN9514-JZX	LAN9514-JZX	LS240-WI-01-A20	LS240-WI-01-A20	-	-	-	-	-
Calao systems	Snowball^[41]	ST-Ericsson Nova A9500^[42]	1 GiB LP-DDR2	4 / 8GB e-MMC	microSD	-	FT232R	LAN9221	AW-NH580	AW-NH580	AW-NH580	LSM303DLH	LSM303DLH	L3G4200D	LPS001WP
Trim-Slice	Trim-Slice^[43]	Tegra 2 series	1 GiB DDR2-667	-	Full size SD slot (SDHC) + microSD slot (SDHC)	GL830	embedded	RTL8111DL	RT3070	-	-	-	-	-	-
Radxa	Radxa Rock^[44]	Rockchip RK3188	2 GiB DDR3 800	8GB Nand Flash	microSD (SDXC)	-	2*USB2.0 host + µUSB2.0 device	10/100 Mb	150Mbps 802.11b/g/n	Bluetooth 4.0	-	-	-	-	-

http://en.wikipedia.org/wiki/ARM_Cortex-A5

ARM Cortex-A5

ARM Cortex-A5
Designed by	ARM Holdings
Common manufacturer(s)	TSMC
Instruction set	ARMv7
Cores	1-4
L1 cache	4-64 KB/4-64 KB

The ARM Cortex-A5 is a processor core designed by ARM Holdings implementing the ARMv7 instruction set architecture.

Overview

It is intended to replace the ARM9 and ARM11 cores for use in low-end devices.^[1] Compared to those older cores, the Cortex-A5 offers the advanced features of the ARMv7 architecture over the v4/v5 (ARM9) and v6 (ARM11) architectures, e.g. VFPv4 and NEON advanced SIMD. It also allows devices to run current software, which increasingly targets ARMv7 and is dropping support for earlier architectures.

Key features of the Cortex-A5 core are:

Single-issue, in-order microarchitecture with an 8-stage pipeline^[1]
NEON SIMD instruction set extension (optional)
VFPv4 floating-point unit (optional)
Thumb-2 instruction set encoding
Jazelle RCT
1.57 DMIPS / MHz

Chips

Several systems on a chip (SoCs) have implemented the Cortex-A5 core, including:

Atmel SAMA5D3
Freescale Vybrid Series
Snapdragon S4 Play
Spreadtrum SC8810 (single-core A5 at 1 GHz + Mali-400 GPU)
Actions Semiconductor ATM7029 (gs702a), a quad-core Cortex-A5 configuration
2013 AMD Fusion APUs will include a Cortex-A5 as a security co-processor^[2]

http://en.wikipedia.org/wiki/ARM_Cortex-A7_MPCore

ARM Cortex-A7 MPCore

ARM Cortex-A7 MPCore
Designed by	ARM Holdings
Instruction set	ARMv7
Cores	1-4
L1 cache	8-64 KB/8-64 KB
L2 cache	Optional, up to 1 MB

The ARM Cortex-A7 MPCore is a processor core designed by ARM Holdings implementing the ARMv7 instruction set architecture.

Overview

It has two target applications: first, as a smaller, faster, and more power-efficient successor to the Cortex-A8; second, in the big.LITTLE architecture, combining one or more A7 cores with one or more Cortex-A15 cores into a heterogeneous system.^[1] To make this possible, it is fully feature-compatible with the A15.

Key features of the Cortex-A7 core are:

Partial dual-issue, in-order microarchitecture with an 8-stage pipeline^[2]
NEON SIMD instruction set extension
VFPv4 Floating Point Unit
Thumb-2 instruction set encoding
Jazelle RCT
Hardware virtualization
Large Physical Address Extensions (LPAE)
Integrated level 2 Cache (0-1 MB)
1.9 DMIPS / MHz^[2]

Chips

Several systems on a chip (SoCs) have implemented the Cortex-A7 core, including:

Allwinner A20 (dual core A7 + Mali-400 MP2 GPU)^[3]
Allwinner A31 (quad core A7 + PowerVR SGX544MP2 GPU)^[4]
Broadcom BCM23550 (quad-core HSPA+ multimedia processor)^[5]
HiSilicon K3V3, big.LITTLE architecture with dual-core Cortex-A7 and dual-core Cortex-A15. Uses an ARM Mali-T658 GPU.
Marvell PXA1088 (quad core A7)^[6]
MediaTek MT6589 (quad core A7 + Imagination Technologies PowerVR SGX544 GPU)
Qualcomm Snapdragon 200 and Snapdragon 400 MSM8212 and MSM8612, MSM8226 and MSM8626 (quad core A7 + Adreno 305 GPU)
Samsung Exynos 5 Octa (5410), big.LITTLE architecture with quad-core Cortex-A7 and quad-core Cortex-A15. Uses an Imagination Technologies PowerVR SGX544MP3 GPU.
Samsung Exynos 5 Octa (5420), big.LITTLE architecture with quad-core Cortex-A7 and quad-core Cortex-A15. Uses an ARM Mali-T628MP6 GPU.

http://en.wikipedia.org/wiki/ARM_Cortex-A12

ARM Cortex-A12

ARM Cortex-A12
Designed by	ARM Holdings
Instruction set	ARMv7
Cores	1–4
L1 cache	32-64 KiB I, 32 KiB D
L2 cache	256 KiB–8 MiB (configurable with L2 cache controller)

The ARM Cortex-A12 is a 32-bit multicore processor designed as the successor to the Cortex-A9. It provides up to 4 cache-coherent cores, each implementing the ARMv7 instruction set architecture.^[1]

Overview

ARM claims that the Cortex-A12 core is 40 percent more powerful than the Cortex-A9 core.^[2] New features not found in the Cortex-A9 include hardware virtualization and 40-bit Large Physical Address Extensions (LPAE) addressing. The CPU can also be used in a big.LITTLE solution together with the Cortex-A7 processor.^[3]

Key features of the Cortex-A12 core are:^[4]

Out-of-order speculative issue superscalar execution pipeline giving 3.00 DMIPS/MHz/core.
NEON SIMD instruction set extension.
High performance VFPv4 floating point unit.
Thumb-2 instruction set encoding reduces the size of programs with little impact on performance.
TrustZone security extensions.
L2 cache controller (0-8 MB).
Multi-core processing.
40-bit Large Physical Address Extensions (LPAE) addressing up to 1 TB of RAM.
Hardware virtualization support.

http://en.wikipedia.org/wiki/ARM_Cortex-A15_MPCore

ARM Cortex-A15

ARM Cortex-A15 MPCore

Produced	In production late 2011,^[1] to market late 2012^[2]
Designed by	ARM
Max. CPU clock rate	1.0 GHz to 2.5 GHz
Min. feature size	32 nm/28 nm initially,^[3] 22 nm on the roadmap^[3]
Instruction set	ARMv7
Cores	1–4 per cluster, 1–2 clusters per physical chip^[4]
L1 cache	64 KB (32 KB I-cache, 32 KB D-cache) per core
L2 cache	Up to 4 MB^[5] per cluster
L3 cache	none

The ARM Cortex-A15 MPCore is a multicore ARM architecture processor with an out-of-order superscalar pipeline, implementing the ARMv7 instruction set and running at up to 2.5 GHz.^[6]

Overview

ARM has claimed that the Cortex-A15 core is 40 percent more powerful than the Cortex-A9 core with the same number of cores at the same speed.^[7] The first A15 designs came out in the autumn of 2011, but products based on the chip did not reach the market until 2012.^[1]

Key features of the Cortex-A15 core are:

40-bit Large Physical Address Extensions (LPAE) addressing up to 1 TB of RAM.^[8]^[9] As with the x86 Physical Address Extension, each process still sees only a 32-bit address space.^[10]
15-stage integer/17–25-stage floating point pipeline, with out-of-order speculative issue 3-way superscalar execution pipeline^[11]
4 cores per cluster, up to 2 clusters per chip with CoreLink 400 (an AMBA-4 coherent interconnect). ARM provides specifications but the licensees individually design ARM chips, and AMBA-4 scales beyond 2 clusters.
DSP and NEON SIMD extensions onboard (per core)
VFPv4 Floating Point Unit onboard (per core)
Hardware virtualization support
Thumb-2 instruction set encoding reduces the size of programs with little impact on performance.
TrustZone security extensions
Jazelle RCT for JIT compilation
Program Trace Macrocell and CoreSight Design Kit for unobtrusive tracing of instruction execution
32 KB data + 32 KB instruction L1 cache per core
Integrated low-latency level-2 cache controller, up to 4 MB per cluster
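
The LPAE figures in the list above follow directly from the address widths. The short sketch below works them out, assuming byte-addressable memory and treating "1 TB" as 2^40 bytes; the helper name `addressable_bytes` is illustrative.

```python
GIB = 1 << 30  # bytes in one gibibyte
TIB = 1 << 40  # bytes in one tebibyte


def addressable_bytes(address_bits: int) -> int:
    """Bytes reachable with the given address width (byte-addressable memory assumed)."""
    return 1 << address_bits


if __name__ == "__main__":
    physical = addressable_bytes(40)     # LPAE physical address space
    per_process = addressable_bytes(32)  # virtual address space seen by one process
    print(physical // TIB, "TiB physical")
    print(per_process // GIB, "GiB per process")
    print(physical // per_process, "times larger")
```

The physical space is 256 times larger than what one process can map at once, which is why LPAE mainly benefits systems running many processes or virtual machines.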

Chips

The first implementation came from Samsung in 2012 with the Exynos 5 Dual, which shipped in October 2012 with the Samsung Chromebook Series 3 (ARM version), followed in November by the Google Nexus 10.

Implementations from other manufacturers are expected to reach the market in 2013.

Press announcements of forthcoming implementations:

Broadcom SoC^[12]
HiSilicon K3V3^[13]
Nvidia Tegra 4 (Wayne)^[14]
Samsung Exynos 5 Dual^[15]
ST-Ericsson Nova A9600 (dual-core @ 2.5 GHz over 20k DMIPS)^[16]^[17]
Texas Instruments OMAP 5 SoCs^[18]

Other licensees, such as LG,^[19]^[20] are expected to produce an A15 based design at some point.

Systems on a chip

Model Number	Semiconductor technology	CPU	GPU	Memory interface	Wireless radio technologies	Availability	Utilizing devices
HiSilicon K3V3	28 nm HPL	big.LITTLE architecture using 1.8 GHz dual-core ARM Cortex-A15 + dual-core ARM Cortex-A7	Mali-T658			H2 2013
Nvidia Tegra 4 T40	28 nm HPL	1.9 GHz quad-core ARM Cortex-A15^[21] + 1 low power core	Nvidia GeForce, 72 cores @ 672 MHz, 96.8 GFLOPS (= (48 PS + 24 VU) × 0.672 × 2)^[22] (supports DirectX 11+, OpenGL 4.X, and PhysX)	32-bit dual-channel DDR3L or LPDDR3 up to 933 MHz (1866 MHz data rate)^[21]	Category 3 (100 Mbit/s) LTE	Q2 2013	Nvidia Shield, Tegra Note 7
Nvidia Tegra 4 AP40	28 nm HPL	1.2-1.8 GHz quad-core + low power core	Nvidia GPU, 60 cores^[21] (supports DirectX 11+, OpenGL 4.X, and PhysX)	32-bit dual-channel 800 MHz LPDDR3	Category 3 (100 Mbit/s) LTE	Q3 2013
Samsung Exynos 5 Dual	32 nm HKMG	1.7 GHz dual-core	ARM Mali-T604 (quad-core)	32-bit dual-channel 800 MHz LPDDR3/DDR3 or 533 MHz LPDDR2		Q3 2012	Arndale Board, Chromebook, Nexus 10, Armbrix Board
Samsung Exynos 5 Octa^[23]^[24]^[25] (internally Exynos 5410)	28 nm HKMG	1.6–1.8 GHz quad-core ARM Cortex-A15 and 1.2 GHz quad-core ARM Cortex-A7	PowerVR SGX544MP3 @ 533 MHz	32-bit dual-channel 800 MHz LPDDR3		Q2 2013	Samsung Galaxy S4, ODROID-XU Board
Samsung Exynos 5 Octa^[26] (internally Exynos 5420)	28 nm HKMG	1.8-1.9 GHz quad-core ARM Cortex-A15 and 1.3 GHz quad-core ARM Cortex-A7 (ARM big.LITTLE)	ARM Mali-T628 MP6 @ 600 MHz; 115.2 GFLOPS = 16 FP × 2 Vec4 × 6 × 0.600 (?)	32-bit dual-channel 933 MHz LPDDR3e (14.9 GB/s)		Q3 2013	Samsung Galaxy Note 3
Texas Instruments OMAP5430	28 nm	2.0 GHz dual-core	PowerVR SGX544MP2 @ 532 MHz + dedicated 2D graphics accelerator	32-bit dual-channel 532 MHz LPDDR2		Q2 2013
Texas Instruments OMAP5432	28 nm	2.0 GHz dual-core	PowerVR SGX544MP2 @ 532 MHz + dedicated 2D graphics accelerator	32-bit dual-channel 532 MHz DDR3		Q2 2013
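
The bandwidth and GFLOPS figures quoted in the table can be checked with a few multiplications. The sketch below assumes double-data-rate memory (two transfers per clock, consistent with the "933 MHz (1866 MHz data rate)" entry) and reads the trailing "× 2" in the Tegra 4 formula as two floating-point operations per unit per cycle; that reading, and the function names, are assumptions rather than facts from the source.

```python
def peak_bandwidth_gb_s(bus_bits: int, channels: int, clock_mhz: float,
                        transfers_per_clock: int = 2) -> float:
    """Theoretical peak memory bandwidth in GB/s (10^9 bytes per second)."""
    bytes_per_transfer = bus_bits // 8 * channels
    transfers_per_second = clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9


def peak_gflops(units: int, clock_ghz: float, flops_per_unit_per_cycle: int = 2) -> float:
    """Theoretical peak GFLOPS for `units` shader units at `clock_ghz`."""
    return units * clock_ghz * flops_per_unit_per_cycle


if __name__ == "__main__":
    # 32-bit dual-channel LPDDR3e at 933 MHz (Exynos 5420 row)
    print(round(peak_bandwidth_gb_s(32, 2, 933), 1))
    # 48 pixel shaders + 24 vertex units at 0.672 GHz (Tegra 4 T40 row)
    print(round(peak_gflops(48 + 24, 0.672), 1))
```

These are theoretical ceilings; sustained throughput on real hardware is lower.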

http://en.wikipedia.org/wiki/ARM_big.LITTLE

ARM big.LITTLE

ARM big.LITTLE is a heterogeneous computing architecture developed by ARM Holdings that couples (relatively) slower, low-power processor cores with (relatively) more powerful and power-hungry ones. The intention is to create a multi-core processor that can adjust better to dynamic computing needs and use less power than clock scaling alone. In October 2011, big.LITTLE was announced along with the Cortex-A7, which was designed to be architecturally compatible with the Cortex-A15.^[1] In October 2012 ARM announced the Cortex-A53 and Cortex-A57 (ARMv8-A) cores, which are also compatible with each other to allow their use in a big.LITTLE chip.^[2]

Cluster migration

There are three ways^[3] for the different processor cores to be arranged in a big.LITTLE design, depending on the scheduler implemented in the Linux kernel.^[4] The clustered model is the first and simplest implementation. With this approach, the operating system scheduler can see only one of the two processor clusters at a time; when the load on one cluster hits a certain point, the system transitions to the other cluster. All relevant data is passed through the common L2 cache, the first core cluster is powered off, and the other one is activated. A Cache Coherent Interconnect (CCI) is used. This model has been implemented in the Samsung Exynos 5 Octa (5410).^[5]
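
A toy model can make the cluster switch concrete. The sketch below is not the kernel's implementation: the class name, the two thresholds, and the use of hysteresis (separate up and down thresholds to avoid flapping) are all assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class ClusterSwitcher:
    """Toy model: exactly one cluster is active and visible to the scheduler."""
    up_threshold: float = 0.8    # hypothetical load above which 'big' takes over
    down_threshold: float = 0.3  # hypothetical load below which 'LITTLE' takes over
    active: str = "LITTLE"

    def update(self, load: float) -> str:
        """Feed in the current load (0.0 to 1.0) and return the active cluster."""
        if self.active == "LITTLE" and load > self.up_threshold:
            self.active = "big"      # state migrates, LITTLE cluster powers off
        elif self.active == "big" and load < self.down_threshold:
            self.active = "LITTLE"   # state migrates, big cluster powers off
        return self.active


if __name__ == "__main__":
    switcher = ClusterSwitcher()
    for load in (0.2, 0.9, 0.5, 0.1):
        print(load, switcher.update(load))
```

Because the whole cluster switches at once, a single demanding thread moves every core to the 'big' side, which is the main inefficiency the later models address.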

In-kernel switcher (CPU migration)

big.LITTLE IKS

CPU migration via the in-kernel switcher (IKS) involves pairing up a 'big' core with a 'LITTLE' core, with possibly many identical pairs in one chip. Each pair operates as one virtual core, and only one real core is (fully) powered up and running at a time. The 'big' core is used when demand is high, the 'LITTLE' core when demand is low. When demand on the virtual core changes (between high and low), the incoming core is powered up, running state is transferred, the outgoing core is shut down, and processing continues on the new core. Switching is done via the cpufreq framework. A complete big.LITTLE IKS implementation is expected in Linux 3.11 or 3.12. big.LITTLE IKS is an improvement on cluster migration; the main difference is that each pair is visible to the scheduler.

The more complex arrangement involves a non-symmetric grouping of 'big' and 'LITTLE' cores. A single chip could have one or two 'big' cores and many more 'LITTLE' cores, or vice versa. Nvidia created something similar to this with the low-power 'companion core' in their Tegra 3 SoC.

Heterogeneous multi-processing (global task scheduling)

big.LITTLE MP

The most powerful use model of big.LITTLE is heterogeneous multi-processing (MP), which enables the use of all physical cores at the same time. Threads with high priority or computational intensity can in this case be allocated to the 'big' cores, while threads with lower priority or less computational intensity, such as background tasks, can be performed by the 'LITTLE' cores.^[6] Upstream big.LITTLE GTS patches are expected to be fully incorporated into the mainline Linux kernel in a few quarters. This model has been implemented in the Samsung Exynos 5 Octa (5420).^[7]
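
The placement idea can be sketched as a simple greedy assignment. This is only an illustration, not the Linux scheduler's algorithm: the `place_tasks` function, the 0.5 intensity threshold, the least-loaded-core rule, and the 2 big + 4 LITTLE layout are hypothetical choices, and the function assumes at least one core of each kind.

```python
from typing import Dict


def place_tasks(tasks: Dict[str, float], n_big: int = 2, n_little: int = 4,
                threshold: float = 0.5) -> Dict[str, str]:
    """Assign each task (name -> intensity in 0..1) to a core id such as 'big0'."""
    core_load = {f"big{i}": 0.0 for i in range(n_big)}
    core_load.update({f"LITTLE{i}": 0.0 for i in range(n_little)})
    placement = {}
    # Place the heaviest tasks first so they get the emptiest cores.
    for name, intensity in sorted(tasks.items(), key=lambda kv: -kv[1]):
        kind = "big" if intensity >= threshold else "LITTLE"
        candidates = [core for core in core_load if core.startswith(kind)]
        core = min(candidates, key=lambda c: core_load[c])  # least-loaded core
        core_load[core] += intensity
        placement[name] = core
    return placement


if __name__ == "__main__":
    demo = {"render": 0.9, "decode": 0.7, "sync": 0.1, "log": 0.05}
    for task, core in place_tasks(demo).items():
        print(task, "->", core)
```

Unlike the paired models, every core can be busy at once here, and the numbers of big and LITTLE cores need not match.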

Scheduling

The paired arrangement allows for switching to be done transparently to the operating system using the existing dynamic voltage and frequency scaling (DVFS) facility. The existing DVFS support in the kernel (e.g. cpufreq in Linux) will simply see a list of frequencies/voltages and will switch between them as it sees fit, just like it does on existing hardware. However, the low-end slots will activate the 'LITTLE' core and the high-end slots will activate the 'big' core.
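
The mapping from frequency slots to physical cores can be sketched as a lookup table. Everything concrete below is hypothetical: the `OperatingPoint` type, the frequencies, and the selection rule (slowest slot that satisfies the request) are illustrative assumptions, not the cpufreq API.

```python
from typing import List, NamedTuple


class OperatingPoint(NamedTuple):
    freq_mhz: int  # frequency slot as seen by the DVFS governor
    core: str      # physical core that actually backs this slot


# Hypothetical table for one big/LITTLE pair; the numbers are illustrative only.
VIRTUAL_CORE_TABLE: List[OperatingPoint] = [
    OperatingPoint(250, "LITTLE"),
    OperatingPoint(500, "LITTLE"),
    OperatingPoint(1000, "LITTLE"),
    OperatingPoint(1400, "big"),
    OperatingPoint(1800, "big"),
]


def select_point(table: List[OperatingPoint], target_mhz: int) -> OperatingPoint:
    """Return the slowest slot that meets target_mhz, or the fastest slot if none does."""
    for point in sorted(table):
        if point.freq_mhz >= target_mhz:
            return point
    return max(table)


if __name__ == "__main__":
    for target in (300, 1100, 5000):
        point = select_point(VIRTUAL_CORE_TABLE, target)
        print(target, "->", point.freq_mhz, "MHz on", point.core)
```

The governor only ever asks for a frequency; crossing from the 1000 MHz slot to the 1400 MHz slot is what triggers the core switch underneath it.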

Alternatively, all cores may be exposed to the kernel scheduler, which will decide where each process/thread is executed. This will be required for the non-paired arrangement but could possibly also be used on paired cores. It poses unique problems for the kernel scheduler, which, at least with modern commodity hardware, has been able to assume all cores in an SMP system are equal.

Advantages of global task scheduling

Finer-grained control of workloads that are migrated between cores. Because the scheduler is directly migrating tasks between cores, kernel overhead is reduced and power savings can be correspondingly increased.
Implementation in the scheduler also makes switching decisions faster than in the cpufreq framework implemented in IKS.
The ability to easily support non-symmetrical SoCs (e.g. with 2 Cortex-A15 cores and 4 Cortex-A7 cores).
The ability to use all cores simultaneously to provide improved peak performance throughput of the SoC compared to IKS.

Implementations

SoC	Semiconductor technology	big cores	LITTLE cores	GPU	Memory interface	Wireless radio technologies	Availability	Devices
HiSilicon K3V3	28 nm	1.8 GHz dual-core Cortex-A15	1.2 GHz dual-core Cortex-A7	Mali-T658			H2 2013
Samsung Exynos 5 Octa (5410 model)^[8]^[9]	28 nm	1.6-1.8 GHz quad-core Cortex-A15	1.2 GHz quad-core Cortex-A7	PowerVR SGX544MP3	32-bit dual-channel 800 MHz LPDDR3 (12.8 GB/s)		Q2 2013	Exynos 5-based Samsung Galaxy S4
Samsung Exynos 5 Octa (5420 model)^[10]	28 nm	1.8-2.0 GHz quad-core Cortex-A15	1.3 GHz quad-core Cortex-A7	Mali-T628MP6	32-bit dual-channel 933 MHz LPDDR3e (14.9 GB/s)		Q4 2013	Exynos 5-based Samsung Galaxy Note 3
Renesas Mobile MP6530^[11]	28 nm	2 GHz dual-core Cortex-A15	1 GHz dual-core Cortex-A7	PowerVR SGX544	Dual-channel LPDDR3	LTE CAT4

http://www.cnbeta.com/articles/271955.htm

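ARM updates its mid-range product line, led by the Cortex-A17 processor

2014-02-11 14:55:14, source: cnBeta.COM

The world of ARM processors already includes a great many models, and today the company announced the somewhat more powerful Cortex-A17 processor core, along with a new video processor and a new display controller. ARM offers three different Cortex-A series CPU microarchitectures, all aimed at consumer electronics devices, so to show where the Cortex-A17 fits, we start with a brief overview.

First is the smallest, the Cortex-A7. This core has very low power requirements, and ARM has even released an A7-based variant, the Cortex-A53, which is compatible with the 64-bit ARMv8 instruction set architecture.

Next is the Cortex-A12, which sits in the middle of the range. As the successor to the popular Cortex-A9, the 32-bit core that powers a vast number of smartphones and tablets, the A12 has arrived somewhat late, and it still has no 64-bit, ARMv8-compatible equivalent.

When we first heard that ARM had announced a mid-range update, we assumed it would be a 64-bit successor to the Cortex-A12; barring surprises, it would probably have been named Cortex-A55.

Unfortunately, what we got in the end is merely a Cortex-A12 successor named Cortex-A17.

The A17 is based on the same microarchitecture as the A12, but upgrades the external interconnect to AMBA4. This gives the A17 faster memory controller performance and improves power efficiency.

ARM says the Cortex-A17 is 60% faster than the older Cortex-A9 (a debatable claim). Thanks to the new bus interface, the A17 also supports full coherency for multi-core SoC operation. This means it can be used in ARM's big.LITTLE power-saving scheme.

Initially, big.LITTLE could only be implemented symmetrically, for example with 4 big and 4 LITTLE cores. ARM's vision, however, is to eventually support asymmetric configurations. Imagine a future SoC with four low-power Cortex-A7 cores plus two faster Cortex-A17 cores.

Licensing of the Cortex-A17 will open to partners at the end of this quarter, and consumers are expected to see final devices in 2015.

Besides the Cortex-A17, ARM also announced a new video processor (Mali-V500) and a new display controller (Mali-DP500). These blocks mainly target mid-range SoCs that use the Mali-T720 GPU. The display controller supports ARM's AFBC frame buffer compression technology, which suggests a degree of cross-IP integration.

[Translated from: TechReport]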

posted @ 2014-01-15 21:49 baihuahua
