Today we are going to have some fun with PCI Express, but before proceeding I want to share some background with readers and FPGA enthusiasts. A few questions come to mind first: what is PCI Express, and why would we use it? Is there anything really useful to us in PCI Express? Keep reading to the end of the post to get the answers to these questions.
PCI Express connectors commonly come in two sizes: one lane, used in common data transfer applications, and sixteen lanes, used in graphics cards.
The one-lane connector has 36 contacts arranged in two rows of 18. Of these 36 contacts, only 6 are used for data communication; the others carry power and auxiliary signals. The six functional pins are connected as three pairs (REFCLK, PET and PER), which are called “differential pairs” because the two wires of a pair carry inverted copies of the same signal.
To allow for more speed, multiple lanes can be used, as in the sixteen-lane connector. The REFCLK pair doesn’t need to be duplicated, so, for example, PCI Express with 2 lanes uses 5 pairs (1 REFCLK + 2 PET + 2 PER).
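The pair count generalizes to any number of lanes: one shared REFCLK pair plus one PET and one PER pair per lane. Here is a minimal Python sketch of that arithmetic (the function name is only illustrative):

```python
def differential_pairs(lanes: int) -> int:
    """Return the number of differential pairs used by a link with `lanes` lanes."""
    if lanes < 1:
        raise ValueError("a link needs at least one lane")
    refclk_pairs = 1      # the reference clock is shared, not duplicated per lane
    pet_pairs = lanes     # one PET pair per lane
    per_pairs = lanes     # one PER pair per lane
    return refclk_pairs + pet_pairs + per_pairs


for lanes in (1, 2, 16):
    print(lanes, "lane(s):", differential_pairs(lanes), "pairs")
```

For one lane this gives the 3 pairs (6 pins) mentioned above, and for two lanes the 5 pairs of the example.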
The speeds of one-lane and sixteen-lane PCI Express links are as follows:
|Version||One lane||16 lanes|
|v1.x||250 MB/s (2.5 GT/s)||4 GB/s (40 GT/s)|
|v2.x||500 MB/s (5 GT/s)||8 GB/s (80 GT/s)|
|v3.0||985 MB/s (8 GT/s)||15.75 GB/s (128 GT/s)|
|v4.0||1969 MB/s (16 GT/s)||31.50 GB/s (256 GT/s)|
PCI Express v1.x signals at 2.5 Gbps per lane, a bit rate roughly 75 times the 33 MHz clock of the old PCI bus. In addition, PCI was a shared bus, whereas PCI Express is a point-to-point link.
Running at 2.5 GHz is the biggest design challenge: even over a point-to-point link, the data moves so quickly that sampling it reliably becomes a problem. To cope with this speed, PCI Express uses a technique called “clock recovery”.
Basically, for each signal pair, the pair receiver looks at the signal transitions (a bit 0 followed by a bit 1, or vice versa), from which it can infer the position of the surrounding bits. One problem is that if many successive bits are transmitted with the same value (like lots of 0’s), no signal transition is seen. So extra bits are transmitted to ensure that signal transitions are not too far apart (which “re-synchronizes” the clock recovery mechanism).
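To see why long runs of identical bits are a problem, here is a small Python sketch that measures the longest stretch without a transition in a bit stream. It only illustrates the problem; it does not implement clock recovery, and the function name is made up for this example.

```python
def longest_run(bits: str) -> int:
    """Length of the longest run of identical consecutive bits in a '0'/'1' string."""
    longest = 0
    current = 0
    previous = None
    for bit in bits:
        # Extend the current run if the bit repeats, otherwise start a new run.
        current = current + 1 if bit == previous else 1
        longest = max(longest, current)
        previous = bit
    return longest


all_zero_bytes = "00000000" * 4   # four zero bytes sent as-is: no transition at all
alternating = "01" * 16           # a transition on every bit

print(longest_run(all_zero_bytes))  # 32
print(longest_run(alternating))     # 1
```

The receiver has nothing to lock onto during the 32-bit run, while the alternating pattern gives it an edge on every bit.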
The extra bits are sent using a scheme called 8b/10b encoding, so that for each 8 bits of useful data, 10 bits are actually transmitted (a 20% overhead) in a specific way that guarantees enough signal transitions. But that also means that at 2.5 GHz, we only get 250 MB/s of useful bandwidth per lane in each direction (instead of the 312 MB/s we would get without the encoding overhead).
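The bandwidth figures can be checked with a few lines of Python. This is a sketch of the arithmetic only; the function and parameter names are illustrative. The 128b/130b ratio used for v3.0 is not discussed in this post; it is the encoding defined for that version in the PCI Express specification.

```python
def useful_mb_per_s(gt_per_s: float, data_bits: int = 8,
                    line_bits: int = 10, lanes: int = 1) -> float:
    """Useful bandwidth in MB/s (1 MB = 10**6 bytes), one direction."""
    line_bits_per_s = gt_per_s * 1e9 * lanes
    # Only data_bits out of every line_bits transmitted carry useful data.
    data_bits_per_s = line_bits_per_s * data_bits / line_bits
    return data_bits_per_s / 8 / 1e6


print(useful_mb_per_s(2.5))                            # 250.0 (8b/10b)
print(useful_mb_per_s(2.5, data_bits=1, line_bits=1))  # 312.5 (no encoding)
print(round(useful_mb_per_s(8.0, 128, 130), 1))        # v3.0 with 128b/130b
```

The first two results match the 250 MB/s and 312 MB/s quoted above; the third is close to the 985 MB/s listed for v3.0 in the table.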
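Why do we use differential pairs? Their main advantages are high speed and high reliability: noise picked up equally by both wires cancels out at the receiver, which looks only at the difference between them.
Differential pairs have one disadvantage: it takes one extra wire to transmit a signal. But never mind, that is the trade-off the industry accepts for high speed and high reliability.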
The Packetized transactions
All operations are packetized. Let’s assume the CPU wants to send some data to a device. It forwards the order to the PCI Express bridge, which then creates a packet. The packet contains the address and the data to be written and is forwarded serially to the targeted device, which de-packetizes the write order and executes it. If the CPU wants to read data from a device, the bridge sends a read packet to the target device, which creates a response (return) packet and sends it back to the bridge. Both parties have to acknowledge the transfers.
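The write and read flows described above can be mimicked with a toy model in Python. Everything here (the `Packet`, `Device` and `Bridge` names, the dictionary used as device memory) is hypothetical and only meant to show who creates which packet; serialization and acknowledgements are left out.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Packet:
    kind: str                    # "write", "read" or "response"
    address: int
    data: Optional[int] = None   # present for writes and responses only


class Device:
    """Hypothetical target device backed by an address -> value store."""

    def __init__(self) -> None:
        self.memory: Dict[int, int] = {}

    def handle(self, packet: Packet) -> Optional[Packet]:
        if packet.kind == "write":
            self.memory[packet.address] = packet.data
            return None                                   # no return packet
        if packet.kind == "read":
            value = self.memory.get(packet.address, 0)    # unwritten reads as 0
            return Packet("response", packet.address, value)
        raise ValueError(f"unexpected packet kind: {packet.kind}")


class Bridge:
    """Hypothetical bridge that turns CPU orders into packets."""

    def __init__(self, device: Device) -> None:
        self.device = device

    def cpu_write(self, address: int, data: int) -> None:
        self.device.handle(Packet("write", address, data))

    def cpu_read(self, address: int) -> int:
        response = self.device.handle(Packet("read", address))
        return response.data


bridge = Bridge(Device())
bridge.cpu_write(0xDEAD, 0x1234)
print(hex(bridge.cpu_read(0xDEAD)))   # 0x1234
```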
PCI Express Stack:
Getting packets flowing reliably along the wires takes some magic. As packets are transmitted serially at very high speed, they have to be de-serialized/assembled, decoded at the destination (to remove the 8b/10b encoding), de-interleaved (if multiple lanes are used), and checked against line corruption (CRC checks).
Sounds complicated? It probably is. The thing is, we don’t really care because most of the complexity is handled in the “PCI Express stack”, composed of three layers: the physical layer, the data link layer and the transaction layer.
The first two layers are the ones implemented for us in the PCI Express FPGA core (usually a combination of hard and soft core), and they handle all the complexity. As users, we work only in the transaction layer, where life is easy and the sky is blue.
In more detail, the layer we need to discuss is the transaction layer.
In the transaction layer, we receive “packets”. There is a 32-bit bus and the packets arrive on the bus (packet lengths are always multiples of 32 bits). Maybe one packet will say “write data 0x0000 at address 0xDEAD”, and another will say “read from address 0xBEEF (and return a response packet)”.
There are many types of packets: memory read, memory write, I/O read, I/O write, message, completion, etc. Our job in the transaction layer is to accept packets and issue packets. The packets are presented to us in a specific format called “transaction layer packets” (TLPs), and each 32-bit word arriving on the bus is called a “double word” (or DW for short). So a transaction layer packet is a bunch of DWs.
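To make “a bunch of DWs” concrete, here is a minimal Python sketch that splits a raw packet into 32-bit words. The function name is illustrative, and the big-endian byte order within each DW is an assumption chosen to match the way headers are usually drawn (byte 0 on the left); an actual FPGA core may present the bytes differently.

```python
import struct
from typing import List


def bytes_to_dwords(raw: bytes) -> List[int]:
    """Split a raw packet into 32-bit double words (DWs)."""
    if len(raw) % 4 != 0:
        raise ValueError("packet length must be a multiple of 32 bits")
    count = len(raw) // 4
    # ">" = big-endian (assumed), "I" = unsigned 32-bit integer
    return list(struct.unpack(f">{count}I", raw))


packet = bytes.fromhex("40000001 0000000f 00001000 00000000")
for index, dw in enumerate(bytes_to_dwords(packet)):
    print(f"DW{index}: 0x{dw:08X}")
```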
Transaction Layer Packet (TLP):
TLPs are simple to interpret. The diagram shows the structure of a TLP.
The header contains 3 or 4 DWs, but the most important fields are part of the first DW.
|Fmt||Meaning|
|00||3 DW header, no data|
|01||4 DW header, no data|
|10||3 DW header, with data|
|11||4 DW header, with data|
The “Fmt” field tells us the size of the header and whether a payload is present. The “Type” field describes the TLP operation. The remainder of the TLP header content depends on the TLP operation.
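Here is a minimal Python sketch that decodes these fields from the first DW of a header. The bit positions (Fmt in bits 30:29, Type in bits 28:24, Length in bits 9:0) are taken from the PCI Express specification for the 2-bit Fmt described here, not from the diagram in this post; the function and dictionary names are illustrative.

```python
from typing import Dict, Tuple, Union

# Fmt value -> (header size in DWs, payload present?)
FMT_MEANING: Dict[int, Tuple[int, bool]] = {
    0b00: (3, False),
    0b01: (4, False),
    0b10: (3, True),
    0b11: (4, True),
}


def decode_first_dw(dw0: int) -> Dict[str, Union[int, bool]]:
    """Extract Fmt, Type and Length from the first DW of a TLP header."""
    fmt = (dw0 >> 29) & 0b11          # bits 30:29
    tlp_type = (dw0 >> 24) & 0b11111  # bits 28:24
    length = dw0 & 0x3FF              # bits 9:0, payload size in DWs
    header_dws, has_data = FMT_MEANING[fmt]
    return {
        "fmt": fmt,
        "type": tlp_type,
        "length": length,
        "header_dws": header_dws,
        "has_data": has_data,
    }


print(decode_first_dw(0x40000001))  # 3 DW header, with data, Length = 1
print(decode_first_dw(0x00000001))  # 3 DW header, no data, Length = 1
```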
For example, here’s a 32-bit memory write TLP header, where you can see that the write address is at the end of the header (the write data is not shown, as it is in the payload after the header).
In the TLP above, the Fmt value is “10”, which means a 3 DW header with data. Since this request carries data, it must be a write operation. The Length field tells us the size of the payload in DWs (1 to 1024, where 1024 is encoded as 0). Most of the time it is 1, so there are four DWs to transfer: 3 DWs of header and 1 DW of payload.
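As a rough sketch, such a memory write TLP can be assembled in Python as a list of DWs. The field layout follows the PCI Express specification for a 3 DW memory request (DW1 holds the Requester ID, the Tag and the byte enables; DW2 holds the DW-aligned address). The function name is made up, all bytes are assumed to be enabled, and the fields not discussed in this post (TC, TD, EP, Attr) are left at 0.

```python
from typing import List, Sequence


def build_mem_write_32(address: int, data: Sequence[int],
                       requester_id: int = 0, tag: int = 0) -> List[int]:
    """Build a 32-bit-address memory write TLP as [3 header DWs] + payload DWs."""
    if not 0 <= address < 2**32 or address % 4 != 0:
        raise ValueError("address must be a DW-aligned 32-bit value")
    if not 1 <= len(data) <= 1024:
        raise ValueError("payload must be 1 to 1024 DWs")

    fmt = 0b10                  # 3 DW header, with data
    tlp_type = 0b00000          # memory request
    length = len(data) & 0x3FF  # a Length of 1024 DWs is encoded as 0
    first_be = 0xF              # assumption: all 4 bytes of the first DW enabled
    last_be = 0xF if len(data) > 1 else 0x0  # must be 0 for a 1 DW request

    dw0 = (fmt << 29) | (tlp_type << 24) | length
    dw1 = ((requester_id & 0xFFFF) << 16) | ((tag & 0xFF) << 8) \
        | (last_be << 4) | first_be
    dw2 = address               # bits 1:0 are zero because of the alignment check
    return [dw0, dw1, dw2] + [d & 0xFFFFFFFF for d in data]


tlp = build_mem_write_32(0x1000, [0x0000])
print(" ".join(f"{dw:08X}" for dw in tlp))  # 40000001 0000000F 00001000 00000000
```

The result is four DWs, 3 of header and 1 of payload, as described above.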
If the TLP above were a memory read instead of a write, we would have to execute the read and then respond. For a read, the Fmt value is “00”, which means “no data”. That makes sense: we don’t need data to do a read, just an address. But we now have to respond with data. There is a special TLP for that response, called CplD (completion with data), whose payload contains the data that we want to return. Just as important, the response needs to be routed back to whoever asked for the read.
We received a read request. Did it come from the CPU? Or from the interrupt controller? Or from a graphics card? After all, many devices are capable of issuing such a request. The answer is given in the “Requester ID”, which shows who requested the read. So when we create the CplD TLP, we have to copy the “Requester ID” into it. This way, it will be routed where it belongs by the PCI Express bridge(s). We also have to copy the “Tag” (which is useful when multiple reads are pending) and complete all the transactions as per the arbitration mechanisms.
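The copying of the “Requester ID” and “Tag” can be sketched in Python as follows. The field positions follow the completion header layout of the PCI Express specification (Type 01010 with Fmt 10 for CplD). The function name and the completer ID are hypothetical, and the sketch assumes a read with all bytes enabled that is answered by a single successful completion.

```python
from typing import List, Sequence


def build_cpld(read_header: Sequence[int], data: Sequence[int],
               completer_id: int) -> List[int]:
    """Build a CplD TLP (list of DWs) answering a 3 DW memory read request."""
    dw0_req, dw1_req, dw2_req = read_header
    fmt = (dw0_req >> 29) & 0b11
    tlp_type = (dw0_req >> 24) & 0b11111
    if (fmt, tlp_type) != (0b00, 0b00000):
        raise ValueError("not a 3 DW memory read request")

    requested_dws = (dw0_req & 0x3FF) or 1024  # a Length of 0 encodes 1024 DWs
    if len(data) != requested_dws:
        raise ValueError("payload size does not match the requested Length")

    requester_id = (dw1_req >> 16) & 0xFFFF    # copied so the reply can be routed
    tag = (dw1_req >> 8) & 0xFF                # copied to match the pending read
    lower_address = dw2_req & 0x7C             # low address bits, DW-aligned
    byte_count = (4 * requested_dws) & 0xFFF   # 4096 bytes is encoded as 0
    status = 0b000                             # successful completion

    dw0 = (0b10 << 29) | (0b01010 << 24) | (requested_dws & 0x3FF)
    dw1 = ((completer_id & 0xFFFF) << 16) | (status << 13) | byte_count
    dw2 = (requester_id << 16) | (tag << 8) | lower_address
    return [dw0, dw1, dw2] + [d & 0xFFFFFFFF for d in data]


# Read of 1 DW at 0xBEEC from requester 0x0100 with tag 0x05 (example values).
read_request = [0x00000001, 0x0100050F, 0x0000BEEC]
cpld = build_cpld(read_request, [0xCAFEF00D], completer_id=0x0200)
print(" ".join(f"{dw:08X}" for dw in cpld))
```

The third DW of the result carries the same Requester ID (0x0100) and Tag (0x05) as the request, which is what lets the bridges route the data back and lets the requester match it to its pending read.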
That is my attempt to explain PCI Express and answer your questions about it. Soon I will also show how to use PCI Express with the Xilinx wizard.