Wandering Adventure Party


In the early days of personal computing CPU bugs were so rare as to be newsworthy.

Gabriele Svelto #18

    When implementing a new core it is commonplace to implement new structures, and especially more aggressive performance features, in a way that makes it possible to disable them via microcode. This gives the design team the flexibility to ship a feature only if it's been proven to be reliable, or delay it for the next iteration. 18/31

Gabriele Svelto #19

    Microcode can also be used to work around conditions caused by data races, by injecting bubbles in the pipeline under certain conditions. If the execution of two back-to-back operations is known to cause a problem it might be possible to avoid it by delaying the execution of the second operation by one cycle, again trading performance for stability. 19/31
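A toy sketch of that idea, in Python: a scheduler that spots a known-bad pair of back-to-back operations and delays the second by one cycle. The hazard pair and the whole model are hypothetical; real microcode operates far below this level of abstraction.

```python
# Toy model of a microcode workaround: when a known-bad pair of
# back-to-back operations is detected, inject a bubble (an idle cycle)
# before the second one. The hazard pair here is purely hypothetical.
HAZARD_PAIRS = {("mul", "load")}

def schedule(ops):
    """Return the issue sequence, with None marking an injected bubble."""
    issued = []
    for op in ops:
        if issued and (issued[-1], op) in HAZARD_PAIRS:
            issued.append(None)  # one-cycle delay: performance traded for stability
        issued.append(op)
    return issued
```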

Gabriele Svelto #20

However not all bugs can be fixed this way. Bugs within logic that sits on a critical path can rarely be fixed. Additionally, some microcode fixes only work if the microcode is loaded at boot time, right when the CPU is initialized. If the updated microcode is loaded by the operating system it might be too late to reconfigure the core's operation; in that case you'll need updated UEFI firmware for the fix to work. 20/31

Gabriele Svelto #21

But those are just logic bugs, and unfortunately there's a lot more than that nowadays. If you've followed the controversy around Intel's first-generation Raptor Lake CPUs you'll know that they had issues that would cause seemingly random failures. These bugs were caused by too little voltage being provided to the core under certain conditions, which in turn would often cause a race condition within certain circuits, leading to the wrong results being delivered. 21/31

Gabriele Svelto #22

          To understand how this works keep this in mind: the maximum frequency at which a CPU can operate is dictated by the longest path through the circuits that make up a pipeline stage. Signals propagating via wires and turning transistors on and off take time, and because modern circuit design is strictly synchronous, all the signals must reach the end of the stage before the end of a clock cycle. 22/31
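As a back-of-the-envelope sketch in Python: the clock ceiling falls out of the slowest stage's longest path. The delay figures below are invented for illustration.

```python
# The slowest pipeline stage bounds the clock: every signal must settle
# before the cycle ends, so f_max = 1 / t_longest.
def max_frequency_ghz(stage_delays_ps):
    """Highest clock (GHz) at which every stage still settles in time,
    given each stage's longest combinational path delay in picoseconds."""
    t_longest_ps = max(stage_delays_ps)
    return 1000.0 / t_longest_ps  # a 1000 ps cycle is 1 GHz
```

With hypothetical stage delays of 180, 200 and 170 ps, the 200 ps stage caps the whole core at 5 GHz, no matter how fast the other stages are.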

Gabriele Svelto #23

When a clock cycle ends, all the signals resulting from a pipeline stage are stored in a pipeline register: a storage element, invisible to the user, that separates pipeline stages. So if a stage adds two numbers, for example, the pipeline register will hold the result of this addition. The next cycle this result will be fed to the circuits that make up the next pipeline stage. If the result of the addition is an address, for example, then it might be used to access the cache. 23/31
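A minimal Python model of that flow, with the pipeline register as a plain variable and a dict standing in for the cache (all names here are illustrative, not real hardware structures):

```python
# Two-stage toy pipeline: stage 1 adds two numbers (say, computing an
# address), the pipeline register latches the sum at the clock edge, and
# stage 2 uses it the following cycle to look up a stand-in "cache".
def run_pipeline(pairs, cache):
    pipeline_register = None  # invisible to the programmer
    results = []
    for a, b in pairs + [(None, None)]:  # one extra cycle to drain
        # Stage 2: consume whatever last cycle's stage 1 latched.
        if pipeline_register is not None:
            results.append(cache[pipeline_register])
        # Clock edge: stage 1's result enters the pipeline register.
        pipeline_register = a + b if a is not None else None
    return results
```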

Gabriele Svelto #24

The speed at which signals propagate in circuits is proportional to how much voltage is being applied. In older CPUs this voltage was fixed, but in modern ones it changes thousands of times per second to save power. Providing just as little voltage as needed for a certain clock frequency can dramatically reduce power consumption, but providing too little may cause a signal to arrive late, or the wrong signal to reach the pipeline register, causing in turn a cascade of failures. 24/31

Gabriele Svelto #25

In Raptor Lake's case a very common pattern that I and others have noticed is that sometimes the wrong 8-bit value is delivered. This happens when reading 8-bit registers such as AH or AL, which are just slices of larger integer registers and don't have dedicated physical storage. The operation that pulls out the higher or lower 8 bits of the last 16 bits of a regular register is usually done via a multiplexer, or MUX. 25/31
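In other words, AH and AL behave like views over the low 16 bits of the full-width register. A Python sketch of the extraction:

```python
# AH and AL have no storage of their own: they are byte slices of the
# low 16 bits of the full-width register, pulled out with shift and mask.
def read_al(rax):
    return rax & 0xFF         # bits 0..7

def read_ah(rax):
    return (rax >> 8) & 0xFF  # bits 8..15
```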

Gabriele Svelto #26

                  This is a circuit with two sets of 8 wires that go into it, plus one wire to select which inputs will go to the output, and a single set of 8 wires going out. Depending on the value of the select signal you'll get one or the other set of inputs. Guess what happens if the select signal arrives too late, for example right after the end of the clock cycle? You get the wrong set of bits in the output. 26/31
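A Python sketch of that failure mode. Modeling the stale select as 0 is an assumption made for illustration; real silicon could latch anything:

```python
# 8-bit 2:1 MUX: 'select' routes one of two inputs to the output.
def mux8(in0, in1, select):
    return in1 if select else in0

# If the select signal settles only after the clock edge, the pipeline
# register captures whatever the stale select was routing. The stale
# select value of 0 here is a modeling assumption.
def latched_output(in0, in1, select, select_on_time):
    effective_select = select if select_on_time else 0
    return mux8(in0, in1, effective_select)
```

With `in0 = AL` and `in1 = AH`, a late select means the consumer asks for AH but the register latches AL.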

Gabriele Svelto #27

I can't be sure that this is exactly what's happening on Raptor Lake CPUs; it's just a theory. But a modern CPU core has millions upon millions of these types of circuits, and a timing issue in any of them can lead to these kinds of problems. And that's without mentioning that voltage delivery across a core is an exquisitely analog problem, with voltage fluctuations that might be caused by all sorts of events: instructions being executed, temperature, etc. 27/31

Gabriele Svelto #28

You might also remember that Raptor Lake CPU problems get worse over time. That's because circuits degrade, and applying the wrong voltage can cause them to degrade faster. Circuit degradation is a research field of its own, but its effects are broadly the same: resistance in wires goes up, the capacitance of trench capacitors goes down, etc… and the combined effect of these changes is that circuits get slower and need more voltage to operate at the same frequency. 28/31

Gabriele Svelto #29

When CPUs ship, their most performance-critical circuits are supposed to come with a certain timing slack that will compensate for this effect. Over time this timing slack gets smaller. If a CPU is already operating near the edge, aging might cut this slack all the way down to zero, causing the core to fail consistently. 29/31
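The arithmetic is simple; a sketch with invented numbers:

```python
# Timing slack = cycle time minus critical-path delay. Aging slows the
# path; when slack hits zero the stage misses the clock edge. All the
# numbers here (delays, aging rate) are invented for illustration.
def slack_ps(cycle_ps, path_ps, age_years, slowdown_ps_per_year):
    aged_path_ps = path_ps + age_years * slowdown_ps_per_year
    return cycle_ps - aged_path_ps
```

A stage with 20 ps of slack that ages at 4 ps/year runs out of margin after five years: the core that tested clean at the factory starts failing consistently.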

Gabriele Svelto #30

And remember there are a lot of variables involved: timing broadly depends on transistor sizing and wire resistance. Higher voltages improve transistor performance but increase power dissipation and thus temperature. Temperature increases resistance, which decreases propagation speed in wires. It's a delicate dance to keep a dynamic equilibrium of optimal power consumption, adequate performance and reliability. 30/31

Gabriele Svelto #31

All in all, modern CPUs are beasts of tremendous complexity and bugs have become inevitable. I wish the industry would spend more resources addressing them, improving design and testing before CPUs ship to users, but alas most of the tech sector seems more keen on playing with unreliable statistical toys than on ensuring that the hardware users pay good money for works correctly. 31/31

Gabriele Svelto #32

                              Bonus end-of-thread post: when you encounter these bugs try to cut the hardware designers some slack. They work on increasingly complex stuff, with increasingly pressing deadlines and under upper management who rarely understands what they're doing. Put the blame for these bugs where it's due: on executives that haven't allocated enough time, people and resources to make a quality product.

Gabriele Svelto

                                In the early days of personal computing CPU bugs were so rare as to be newsworthy. The infamous Pentium FDIV bug is remembered by many, and even earlier CPUs had their own issues (the 6502 comes to mind). Nowadays they've become so common that I encounter them routinely while triaging crash reports sent from Firefox users. Given the nature of CPUs you might wonder how these bugs arise, how they manifest and what can and can't be done about them. 🧡 1/31

A. R. Younce #33

                                @gabrielesvelto This is one of those cases where I wish I had a Mastodon client that let me like the whole thread.

Grumble 🇺🇸 🇺🇦 🇬🇱 #34

@gabrielesvelto I went to a lecture in the early 1990s by Tim Leonard, the formal methods guy at DEC. His story was that DEC had as-built simulators for every CPU they designed, and they had correct-per-the-spec simulators for these CPUs.

At night, after the engineers went home, their workstations would fire up tools that generated random sequences of instructions, throw those sequences at both simulators, and compare the results. This took *lots* of machines, but, as Tim joked, Equipment was DEC's middle name.

                                  And they'd find bugs - typically with longer sequences, and with weird corner cases of exceptions and interrupts - but real bugs in real products they'd already shipped.

                                  But here was the banger: sure, they'd fix those bugs. But there were still more bugs to find, and it took longer and longer to find them.

                                  Leonard's empirical conclusion is that there is no "last bug" to be found and fixed in real hardware. There's always one more bug out there, and it'll take you longer and longer (and cost more and more) to find it.
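The DEC approach Leonard describes is differential testing. A toy Python version, with two tiny interpreters standing in for the as-built and per-the-spec simulators (the machine, the instruction set and the injected "bug" are all invented):

```python
import random

# "Correct-per-the-spec" model: a trivial accumulator machine.
def spec_model(seq):
    x = 0
    for op, n in seq:
        x = x + n if op == "add" else x - n
    return x

# "As-built" model with an invented silicon bug: from the third
# consecutive "sub" onward, the subtraction is silently dropped.
def as_built_model(seq):
    x, subs_in_a_row = 0, 0
    for op, n in seq:
        if op == "sub":
            subs_in_a_row += 1
            if subs_in_a_row >= 3:
                continue  # the bug: this operation is lost
            x -= n
        else:
            subs_in_a_row = 0
            x += n
    return x

# Throw random instruction sequences at both models, keep divergences.
def fuzz(iterations=1000, seed=0):
    rng = random.Random(seed)
    failures = []
    for _ in range(iterations):
        seq = [(rng.choice(["add", "sub"]), rng.randint(0, 9))
               for _ in range(rng.randint(1, 8))]
        if spec_model(seq) != as_built_model(seq):
            failures.append(seq)
    return failures
```

Note how the bug only shows up on longer sequences with a specific corner-case pattern, which matches Leonard's observation that the remaining bugs take longer and longer sequences to flush out.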

arclight #35

                                    @gabrielesvelto Thank you for this detailed and specific explanation. Chris Hobbs discusses the relative unreliability of popular modern CPUs in "Embedded Systems Development for Safety-Critical Systems" but not to this depth.

                                    I don't do embedded work but I do safety-related software QA. Our process has three types of test - acceptance tests which determine fitness-for-use, installation tests to ensure the system is in proper working order, and in-service tests which are sort of a mystery. There's no real guidance on what an in-service test is or how it differs from an installation test. Those are typically run when the operating system is updated or there are similar changes to support software. Given the issue of CPU degradation, I wonder if it makes sense to periodically run in-service tests or somehow detect CPU degradation (that's probably something that should be owned by the infrastructure people vs the application people).

I've mainly thought of CPU failures as design or manufacturing defects, not in terms of "wear", so this has me questioning the assumptions our testing is based on.

Josh Bowman-Matthews #36

                                      @gabrielesvelto Super interesting; thanks for writing this up!

pinkforest(she/her) 🦀 #37

                                        @gabrielesvelto great read ty!

Stu #38

@gabrielesvelto Fascinating thread, especially the degradation over time inherent to modern processors. That came up recently in an interesting viral video on a world where we forget how to make new CPUs.

Bit of an aside, but I assume this affects other architectures? The thread mentioned Intel and AMD, but I assume Arm and RISC-V are similarly prone to these sorts of problems?
