Wandering Adventure Party


In the early days of personal computing CPU bugs were so rare as to be newsworthy.

  • Gabriele Svelto

    In the early days of personal computing CPU bugs were so rare as to be newsworthy. The infamous Pentium FDIV bug is remembered by many, and even earlier CPUs had their own issues (the 6502 comes to mind). Nowadays they've become so common that I encounter them routinely while triaging crash reports sent from Firefox users. Given the nature of CPUs you might wonder how these bugs arise, how they manifest and what can and can't be done about them. 🧵 1/31

    arclight
    wrote last edited by
    #35

    @gabrielesvelto Thank you for this detailed and specific explanation. Chris Hobbs discusses the relative unreliability of popular modern CPUs in "Embedded Systems Development for Safety-Critical Systems" but not to this depth.

    I don't do embedded work but I do safety-related software QA. Our process has three types of test - acceptance tests which determine fitness-for-use, installation tests to ensure the system is in proper working order, and in-service tests which are sort of a mystery. There's no real guidance on what an in-service test is or how it differs from an installation test. Those are typically run when the operating system is updated or there are similar changes to support software. Given the issue of CPU degradation, I wonder if it makes sense to periodically run in-service tests or somehow detect CPU degradation (that's probably something that should be owned by the infrastructure people vs the application people).

    I've mainly thought of CPU failures as design or manufacturing defects, not in terms of "wear" so this has me questioning the assumptions our testing is based on.
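
One way to picture a periodic in-service check is a known-answer test: run a fixed, deterministic workload and compare the result against a value recorded on a machine believed to be healthy. The sketch below is only an illustration under that assumption; the function names, the seed and the round count are hypothetical, and a hash chain exercises only a narrow slice of the CPU, so it is not a substitute for real hardware diagnostics.

```python
import hashlib


def known_answer_digest(rounds: int = 10_000, seed: bytes = b"in-service-check") -> str:
    """Run a fixed, deterministic SHA-256 hash chain and return the final digest as hex."""
    digest = seed
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


def in_service_check(reference: str, repeats: int = 3) -> bool:
    """Return True only if every repeat reproduces the reference digest."""
    return all(known_answer_digest() == reference for _ in range(repeats))


if __name__ == "__main__":
    # Assumption: in practice the reference would be recorded once on a
    # known-good machine and stored, not recomputed on the machine under test.
    reference = known_answer_digest()
    print("pass" if in_service_check(reference) else "FAIL: result mismatch")
```

A mismatch here would only say that something between the CPU, memory and software produced a different answer; it cannot by itself attribute the fault to CPU degradation.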

    • Gabriele Svelto

      All in all, modern CPUs are beasts of tremendous complexity and bugs have become inevitable. I wish the industry would spend more resources addressing them, improving design and testing before CPUs ship to users, but alas most of the tech sector seems more keen on playing with unreliable statistical toys than on ensuring that the hardware users pay good money for works correctly. 31/31

      Josh Bowman-Matthews
      wrote last edited by
      #36

      @gabrielesvelto Super interesting; thanks for writing this up!

        pinkforest(she/her) 🦀
        wrote last edited by
        #37

        @gabrielesvelto great read ty!

          Stu
          wrote last edited by
          #38

          @gabrielesvelto Fascinating thread, especially the degradation over time inherent to modern processors. That came up recently in an interesting viral video on a world where we forget how to make new CPUs.

          Bit of an aside, but I assume this affects other architectures? The thread mentioned Intel and AMD, but I assume Arm and RISC-V are similarly prone to these sorts of problems?

          • Gabriele Svelto

            Bonus end-of-thread post: when you encounter these bugs try to cut the hardware designers some slack. They work on increasingly complex stuff, with increasingly pressing deadlines and under upper management who rarely understands what they're doing. Put the blame for these bugs where it's due: on executives that haven't allocated enough time, people and resources to make a quality product.

            Perpetuum Mobile
            wrote last edited by
            #39

            @gabrielesvelto that's the deep nerdy stuff I love about IT! Thanks a ton for sharing this!

            • Gabriele Svelto

              The speed at which signals propagate in circuits is proportional to how much voltage is being applied. In older CPUs this voltage was fixed, but in modern ones it changes thousands of times per second to save power. Providing only as much voltage as a given clock frequency needs can dramatically reduce power consumption, but providing too little voltage may cause a signal to arrive late, or the wrong signal to reach the pipeline register, causing in turn a cascade of failures. 24/31

              Graham Sutherland / Polynomial
              wrote last edited by
              #40

              @gabrielesvelto nitpick: the propagation velocity of a *signal* in a circuit is not affected by the voltage magnitude; that is a function of the (innate) dielectric constant of the material.

              however, a higher core voltage does mean that a rising edge tends to reach the gate threshold voltage of a transistor more quickly, which reduces the time it takes for each asynchronous logic element's output to reach a well-defined state after a change in input, thus propagating logic *state* more quickly.
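
A rough way to see how supply voltage changes logic delay is the alpha-power law, a common first-order textbook model in which gate delay scales as V / (V - Vth)^alpha. The sketch below is only an illustration of that model: the threshold voltage, the exponent, the gate count and the clock period are made-up values, not figures for any real CPU.

```python
def gate_delay(vdd: float, vth: float = 0.3, alpha: float = 1.3, k: float = 1.0) -> float:
    """Relative gate delay under the alpha-power law: k * vdd / (vdd - vth) ** alpha."""
    if vdd <= vth:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return k * vdd / (vdd - vth) ** alpha


def meets_timing(vdd: float, n_gates: int, clock_period: float) -> bool:
    """True if a chain of n_gates identical gates settles within one clock period."""
    return n_gates * gate_delay(vdd) <= clock_period


if __name__ == "__main__":
    # Hypothetical 10-gate path with a clock period of 17 delay units.
    for vdd in (1.0, 0.9, 0.8):
        print(f"Vdd={vdd:.1f}  path delay={10 * gate_delay(vdd):.2f}  "
              f"meets timing: {meets_timing(vdd, 10, 17.0)}")
```

In this toy model the same path that settles in time at the higher voltage misses the clock edge once the voltage drops, which is the late-arriving logic state described above.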

                Graham Sutherland / Polynomial
                wrote last edited by
                #41

                @gabrielesvelto (what you said is absolutely correct regarding "signals" in the HDL sense of the word, it just gets a bit muddled when we're simultaneously talking about the analogue behaviours of the actual electrical signals, hence the clarification ^^)

                  The Orange Theme
                  wrote last edited by
                  #42

                  @gabrielesvelto This was a phenomenal write-up, thank you!

                    Dubious Blur
                    wrote last edited by
                    #43

                    @gabrielesvelto fantastic thread thank you 😄

                      AndresFreundTec
                      wrote last edited by
                      #44

                      @gabrielesvelto Nice thread!

                      You seem to imply that bugs have become considerably more frequent, largely due to the increased complexity. Right?

                      To me it's not obvious that the larger number of known issues isn't to a large degree due to much better visibility (we didn't have anywhere close to today's automatic crash collection systems in the past) and due to the vastly increased number of CPUs... Do you have any gut feeling about that?

                        Gabriele Svelto
                        wrote last edited by
                        #45

                        @gsuberland thanks, I was playing a bit fast and loose with the terminology. As I was writing these toots I reminded myself that entire books have been written just to model transistor behavior and propagation delay, and my very crude wording would probably give their authors a heart attack.

                          Gabriele Svelto
                          wrote last edited by
                          #46

                          @AndresFreundTec I've been in charge of Firefox stability for ten years now and some of my early work to detect hardware issues dates back to then. In the pre-2020 years we would get 2-3 bugs per year, usually across different CPUs. Now we get dozens, it's really on another level.

                            Gabriele Svelto
                            wrote last edited by
                            #47

                            @AndresFreundTec admittedly we get a lot more after a new microarchitecture launches, and then they go down as microcode updates get rolled out. If Microsoft hadn't started shipping microcode updates with their OS updates we'd be swamped.

                              Kim Spence-Jones 🇬🇧😷
                              wrote last edited by
                              #48

                              @gabrielesvelto
                              There’s also metastability. If a value is snapshotted halfway through changing, the output may occasionally be neither one nor zero, but some ‘half’ value. Depending on the circuits using that result, it may be interpreted as either 1 or 0, and different parts of the circuit may use different interpretations. Such intermediate states are only metastable, and will flip to a firm 1 or 0 at some indeterminate time later, possibly propagating the problem.
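
How often such an unresolved sample survives long enough to cause trouble is usually estimated with the textbook synchronizer formula MTBF = exp(t_resolve / tau) / (T_w * f_clock * f_data). The sketch below just evaluates that formula; every number in the usage example is an illustrative assumption, not a measurement of any real flip-flop.

```python
import math


def synchronizer_mtbf(t_resolve: float, tau: float, t_window: float,
                      f_clock: float, f_data: float) -> float:
    """Mean time between metastability failures, in seconds.

    t_resolve: time allowed for the sampled value to settle (s)
    tau:       resolution time constant of the flip-flop (s)
    t_window:  width of the vulnerable sampling window (s)
    f_clock:   sampling clock frequency (Hz)
    f_data:    rate of asynchronous data transitions (Hz)
    """
    return math.exp(t_resolve / tau) / (t_window * f_clock * f_data)


if __name__ == "__main__":
    # Assumed values: 20 ps time constant, 10 ps window, 1 GHz clock, 100 MHz data.
    for t_resolve in (0.5e-9, 1.0e-9):
        mtbf = synchronizer_mtbf(t_resolve, 20e-12, 10e-12, 1e9, 100e6)
        print(f"settling time {t_resolve * 1e9:.1f} ns -> MTBF {mtbf:.3e} s")
```

The exponential term is the reason designers give a sampled signal extra settling time, for example by chaining two flip-flops: each additional tau of settling time multiplies the MTBF by e.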

                                Gabriele Svelto
                                wrote last edited by
                                #49

                                @KimSJ ah yes, very good point. It's been a while since my days in hardware land and I had forgotten about it.

                                  Gabriele Svelto
                                  wrote last edited by
                                  #50

                                  @dubiousblur glad you liked it!

                                    Gabriele Svelto
                                    wrote last edited by
                                    #51

                                    @tehstu yes, absolutely. I've encountered several bugs in AMD CPUs, not many on ARM just yet, but our ARM user-base is very small compared to x86, so it's just less likely for us to stumble upon them. Plus we have some machinery that can detect some hardware bugs automatically but it doesn't work on ARM just yet.

                                    • Gabriele Svelto

                                      However, not all bugs can be fixed this way. Bugs within logic that sits on a critical path can rarely be fixed. Additionally, some microcode fixes can only be made to work if the microcode is loaded at boot time, right when the CPU is initialized. If the updated microcode is loaded by the operating system it might be too late to reconfigure the core's operation; for some fixes to work you'll need an updated UEFI firmware. 20/31
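
As a practical aside, on x86 Linux the currently loaded microcode revision is reported per logical CPU in the `microcode` field of `/proc/cpuinfo`. The sketch below, with a hypothetical helper name and a made-up sample revision, collects those values. It shows which revision is active, not whether the firmware or the operating system loaded it.

```python
def microcode_revisions(cpuinfo_text: str) -> set[int]:
    """Collect the distinct microcode revisions listed in /proc/cpuinfo-style text."""
    revisions = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "microcode":
            # The field is printed in hex, e.g. "0x12b" (illustrative value).
            revisions.add(int(value.strip(), 16))
    return revisions


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo", encoding="utf-8") as f:
            found = microcode_revisions(f.read())
    except OSError:
        found = set()  # not Linux, or /proc is unavailable
    print(sorted(hex(r) for r in found) or "no microcode field found")
```

More than one distinct value in the result would mean that logical CPUs are running different revisions, which is worth a closer look.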

                                      Marcos Dione
                                      wrote last edited by
                                      #52

                                      @gabrielesvelto but UEFI is already quite complex: it has to find block devices, read their partition tables, read FAT file systems, read directories and files, load data in memory and transfer execution. Wouldn't a patch applied after all that be too late?

                                      • Gabriele Svelto

                                        I can't be sure that this is exactly what's happening on Raptor Lake CPUs, it's just a theory. But a modern CPU core has millions upon millions of these types of circuits, and a timing issue in any of them can lead to these kinds of problems. And that's without saying that voltage delivery across a core is an exquisitely analog problem, with voltage fluctuations that might be caused by all sorts of events: instructions being executed, temperature, etc... 27/31

                                        krzysdz
                                        wrote last edited by
                                        #53

                                        @gabrielesvelto Intel's officially stated reason is that (too) high voltage (and temperature) caused fast degradation of clock trees inside cores. This degradation resulted in a duty cycle shift (square wave no longer square?), which caused general instability. If they use both posedge and negedge as triggers, then a change in duty cycle will definitely violate timing.

                                        Intel Core 13th and 14th Gen Desktop Instability Root Cause Update

                                        Following extensive investigation of the Intel® Core™ 13th and 14th Gen desktop processor Vmin Shift Instability issue, Intel can now confirm the

                                        (community.intel.com)
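
To make the duty-cycle point concrete, here is a small arithmetic sketch. It assumes a hypothetical design with paths that launch on one clock edge and capture on the opposite edge, so each such path only gets one clock phase to settle. The 200 ps period and 90 ps path delays are invented numbers, not Intel's.

```python
def half_cycle_budgets(period_ps: float, duty_cycle: float) -> tuple[float, float]:
    """Return (high_phase_ps, low_phase_ps) for a clock with the given duty cycle."""
    high = period_ps * duty_cycle
    return high, period_ps - high


def half_cycle_slack(period_ps: float, duty_cycle: float,
                     rise_to_fall_delay_ps: float,
                     fall_to_rise_delay_ps: float) -> tuple[float, float]:
    """Slack of a posedge-to-negedge path and a negedge-to-posedge path.

    Negative slack means the path no longer settles before the capturing edge.
    """
    high, low = half_cycle_budgets(period_ps, duty_cycle)
    return high - rise_to_fall_delay_ps, low - fall_to_rise_delay_ps


if __name__ == "__main__":
    for duty in (0.50, 0.44):
        slack = half_cycle_slack(200.0, duty, 90.0, 90.0)
        print(f"duty cycle {duty:.2f}: slack = {slack[0]:.1f} ps, {slack[1]:.1f} ps")
```

With a perfectly square clock both paths have margin. Once the high phase shrinks, the posedge-to-negedge path runs out of time even though the clock frequency itself has not changed.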

                                          Gabriele Svelto
                                          wrote last edited by
                                          #54

                                          @arclight timing degradation should not be visible outside of the highest-spec desktop CPUs which are really pushing the envelope even when they're new. Embedded systems and even mid-range desktop CPUs will never fail because of it. What might become visible is increased power consumption over time though.
