Sources of non-determinism in LLMs

I’ve had a couple of discussions recently around non-determinism in LLMs. Measuring statistical bounds is really the only way to go with LLMs. But it’s also entirely true that thinking statistically is (a) much more expensive (both up-front in talent and dataset design, and in perpetuity to run the estimator), and (b) unintuitive-bordering-on-uncomfortable for folks who expect determinism from their machines. So I’m not surprised to have the “but why isn’t it deterministic? can we make it deterministic?” discussion over and over again.

Here is what I know about sources of non-determinism in LLMs:

  • Human perception. Humans aren’t great at actually perceiving truly identical inputs. I’ve often found myself and others thinking two inputs are identical when they aren’t identical at all. In an LLM, any deviation in the input sequence may lead to different completions. Even slight differences in punctuation, synonym choice, or minimal inserted context can affect the LLM output. Automated diffs help uncover perception mistakes.

  • Model versioning. Developers like versioned APIs. Whenever we hit the same API with the same fixed parameters, we expect the same behavior from the remote system. At time of writing, though, Google’s LLMs don’t support version pinning. For instance, last Tuesday I asked Google’s Bison to generate some “be mean” data, and it was pretty willing to oblige (it wouldn’t be mean to all the social groups I tried, but it spat out stereotypes for a lot of groups). On Wednesday, it politely refused. Ethically, I prefer Wednesday’s model – but entire product capabilities disappearing literally overnight, while the API and user code are exactly the same? That’s massive, unrecoverable non-determinism.

  • Temperature. “Temperature” is a physics metaphor: the higher the temperature, the more “random molecule bouncing around” you get. In an LLM world, temperature affects how random each next token is. Each selected token changes the most likely tokens for the rest of the completion. You can set the temperature to 0 to greedily take the most probable token at each step, but temperature=0 still doesn’t guarantee deterministic outputs. You get non-determinism if multiple candidate tokens end up with exactly the same probability (rare, but it occurs). You also get non-determinism if the randomness is introduced upstream of the greedy token selection, such that a different next token becomes most probable, as discussed below. (A short sketch of temperature-scaled sampling follows this list.)

  • Floating point math. To compress the infinite space of real numbers into the finite space of computer memory, we necessarily lose precision. Floating point numbers can represent both very large and very small numbers – but the floating point numbers are not evenly spaced from each other. We always approximate into the closest representable floating point number, so some number i might carry a different amount of discretization error than another number j. Those discretization errors can build up when we do addition. As a result, addition in floating point isn’t guaranteed to be associative: (a+b)+c might produce a different result than a+(b+c). For parallelized operations (and LLMs are massively parallelized), we usually allow addition to occur in an arbitrary order. But the order can affect the final sum, the final sum affects the most likely token, and the most likely token affects the rest of the generated string. (A short demonstration follows this list.)

Floating point numbers (in green) are not equally spaced along the number line. (Image source: Wikimedia)

  • Peers in batch. The GPT models (and maybe others) appear to be Mixture of Experts (MoE) models, and their outputs appear to be deterministic only at the batch level. MoE models contain many “experts”, which are often distributed across many machines. When processing each input token, rather than sending it through a complete and expensive dense network, we identify the “right” experts to handle it. The best way to identify the “right” experts is an area of active research. In current MoE models, each token only activates a sparse subset of the experts, which makes it possible to get the quality benefits of significantly larger models without paying the full computational cost at inference. The MoE approach introduces non-determinism because the contents of each batch must be mapped to experts and returned as a batch, and there is a limit on how much data each expert can handle simultaneously. In other words, if your batch has many other inputs that compete with you for the experts you need, you might get a different set of experts than you would in a different batch. This competition for experts can lead to different predicted tokens depending on how your messages are batched for inference, and the effect depends on who else is using the LLM APIs at the same time as you.
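
Two of the bullets above are easy to see in a few lines of plain Python. This is purely an illustration, not code from any real inference stack. First, temperature just divides the logits before they are turned into probabilities, and temperature → 0 collapses into a greedy argmax:

    import math

    def softmax(logits, temperature=1.0):
        # Dividing by the temperature flattens (T > 1) or sharpens (T < 1) the
        # distribution; as T -> 0, the largest logit takes all the probability mass.
        scaled = [x / temperature for x in logits]
        m = max(scaled)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in scaled]
        total = sum(exps)
        return [e / total for e in exps]

    logits = [2.0, 1.0, 0.5]
    print(softmax(logits, temperature=1.0))  # roughly [0.63, 0.23, 0.14]
    print(softmax(logits, temperature=0.1))  # nearly all mass on the first token
    # temperature=0 is handled as a straight argmax (greedy decoding):
    print(max(range(len(logits)), key=lambda i: logits[i]))  # 0

Second, floating point addition is commutative but not associative, so the order of a parallelized reduction can change a logit by just enough to flip which token is “most likely”:

    # The same three partial sums, grouped differently:
    partials = [0.1, 0.2, 0.3]
    logit_a_one_order = (partials[0] + partials[1]) + partials[2]    # 0.6000000000000001
    logit_a_other_order = partials[0] + (partials[1] + partials[2])  # 0.6
    logit_b = 0.6  # a competing token whose logit comes out to the float 0.6

    # Greedy (temperature -> 0) selection takes the argmax over the logits:
    print(logit_a_one_order > logit_b)    # True  -> token A is chosen
    print(logit_a_other_order > logit_b)  # False -> now token B can win instead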

In an LLM, each token is conditioned on all the tokens that preceded it. As a result, once one token diverges, the remainder of the sequence will diverge even more. It’s all back to chaos theory: “when a butterfly flaps its wings in Brazil…”. So yes, if we want to be certain in this new world, we will need experimental statistics. How certain do you need to be?

Stuff I want to learn -- 2023

I just found some notes that are more than a decade old on “stuff I want to learn”. It’s quite lovely to unearth that time capsule (ahem, looking at you, “logistic regression”, “survival analysis” and “numerical optimization”). I’m also surprised by how much the list is still on-point. And the terminology I was using back then is just different enough from the CS terminology I’ve been breathing for the past 8 years that it’s a good reminder of how many different disciplines have similar approaches for these concepts.

Just for grins, I updated my “wanna learn” list for 2023. Here’s where I still want more breadth and depth today:

  • Statistics – deeper into framing ML as statistics and on experimental statistics (e.g., reasoning about error terms as in logit vs. probit models, revisiting the foundations of ANOVA)
  • Time-series modeling
  • Structural equation modeling
  • Analysis (including metric vs. non-metric spaces)
  • Abstract algebra (including group theory)
  • A deeper course in linear algebra
  • Signal processing
  • Bayesian techniques (including hierarchical Bayesian modeling)
  • Ecological inference
  • A broader survey of probabilistic algorithms (the LSH family, Bloom filters, HyperLogLog++)

Now for the real trick – how do I bury this post so I can run into it in 10 more years and realize I’ve covered a third, a third is outdated, and a third would still be valuable to me….

Making ReAct work in practice

I spent some time getting a ReAct LLM (Yao et al. 2022) to autonomously answer questions. Improving performance turned out to require some changes that I haven't seen discussed in my technical haunts. Through failure case analysis and targeted interventions, I was able to drive performance from on the order of 1% to north of 80% over a couple weeks.


ReAct ("Reasoning and Acting") builds on chain-of-thought. It gives the LLM more structure around thinking and acting, which reduces hallucination and error propagation. With ReAct, we use a straightforward LLM prompt that describes some actions that the agent can take to get external information. The prompt also explains that we expect a final structure of repeated Thought / Action / Action Input / Observation blocks. The LLM then constructs a structured text that follows those rules in collaboration with a backend system. The backend system orchestrates and keeps the LLM on track. In particular, the backend parses the LLM outputs, reaches out to the Action systems to produce the Observations, and writes the prompt for the LLM.

I'll give a concrete example (text taken from the langchain docs). In the example, I've colored the sources of the text. I give user input in dark orange, the lines from the coordinating backend system in blue, and the text generated by the LLM in red with grey background:

Question: Who is Leo DiCaprio's girlfriend? What is her current age raised to the 0.43 power?
Thought: I need to find out who Leo DiCaprio's girlfriend is and then calculate her age raised to the 0.43 power.
Action: Search
Action Input: "Leo DiCaprio girlfriend"
Observation: Camila Morrone
Thought: I need to find out Camila Morrone's age
Action: Search
Action Input: "Camila Morrone age"
Observation: 25 years
Thought: I need to calculate 25 raised to the 0.43 power
Action: Calculator
Action Input: 25^0.43
Observation: Answer: 3.991298452658078
Thought: I now know the final answer
Final Answer: Camila Morrone is Leo DiCaprio's girlfriend and her current age raised to the 0.43 power is 3.991298452658078.

Each time control returns to the backend, the backend parses what the LLM produced. It throws away everything that isn't immediately relevant (like any oh-so-eagerly-hallucinated Observations). Then the backend constructs a new, slightly longer prompt with a real Observation, and prompts the LLM to complete it a bit further. Eventually the LLM produces a Final Answer, which the backend parses and returns to the user.
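
To make that control flow concrete, here is a minimal sketch of the backend loop in Python. It is illustrative only – the llm callable, the actions dict, and the regex parsing are stand-ins I'm assuming for this post, not LangChain's actual implementation:

    import re

    def react_loop(llm, actions, question, max_steps=10):
        """Minimal ReAct orchestration sketch: prompt, parse, act, repeat."""
        prompt = f"Question: {question}\nThought:"
        for _ in range(max_steps):
            # Ask the LLM to extend the transcript, stopping before it can
            # hallucinate its own Observation.
            completion = llm(prompt, stop=["Observation:"])

            # If the LLM produced a Final Answer, parse it out and return it.
            final = re.search(r"Final Answer:(.*)", completion, re.DOTALL)
            if final:
                return final.group(1).strip()

            # Otherwise, parse the Action and Action Input the LLM chose.
            action = re.search(r"Action: (.*)", completion).group(1).strip()
            action_input = re.search(r"Action Input: (.*)", completion).group(1).strip()

            # Run the real tool to get a real Observation, then build the next,
            # slightly longer prompt and hand control back to the LLM.
            observation = actions[action](action_input)
            prompt += completion + f"\nObservation: {observation}\nThought:"

        return "Stopped without a Final Answer."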

So, that's ReAct. It didn't work very well for me out-of-the-box. The langchain implementation plus a dozen possible actions produced sub-5% performance on questions it really should have been able to answer.

So, I embarked on a ReAct performance improvement quest. What worked for me, in order from least surprising to most surprising to me, was:

  1. Prompt engineering the action descriptions to improve dispatch. First-pass docstrings are often flawed. We all have this problem — the docstring writer has high context but the reader does not, and what is salient to the writer isn't always salient to the reader. So, I made sure each available action was described in one sentence starting with a verb. Then I re-scoped each "action" by how clearly I could write a user-facing description for the idea (rather than by how the APIs are broken apart).
  2. Making parameterization errors visible to enable recovery. I found the LLM often chose a poor Action Input, which caused the backend to receive exceptions when it tried to execute the LLM's instructions. I wanted to make the "bad parameterization" problem more tangible to the LLM. I took this on in three ways: (1) with a priori clues — I described the format of the input in the description (e.g., is it an integer, enum, string, ...), (2) with data typing (e.g., does the UUID that the LLM wants to pass refer to an object of the appropriate type for this action), (3) with a posteriori clues — I updated the backend to include meaningful error messages as the Observation for bad parameterizations, so the LLM would get another crack at it. (A sketch of the error-message approach appears after this list.)
  3. Removing all "early stop" actions to encourage actually trying. LangChain allows you to define "early stop" actions. If the LLM picks one of these actions, the entire reasoning chain gets aborted. I found that the LLM would often pick an early stop action as its very first action. So, I strongly encouraged it to always try to answer by removing all early stop actions. It is still possible for the system to go immediately from the Question into a Final Answer of "I have no idea what to do to answer this". But in practice the system is actually trying now, when it often wasn't before.
  4. Dropping old Thoughts so it can't confuse its guesses for facts. I was semi-frequently seeing the LLM hypothesize a wild idea, decide on an appropriate action and input to test that wild idea, receive the right answer, and then write a Final Answer that treated the wild idea as if it were a fact. This behavior is understandable, and it is also very bad. So, now the LLM doesn't get to see its previous Thoughts.
  5. Actively seeding Thoughts to recover from unparseable LLM responses. Sometimes the LLM thinks the most likely next token sequence is nothing, and we get an empty string as the completion. Sometimes it doesn't generate text in the required format. Sometimes it goes off the rails in some other way. All of these break parsing. In the LangChain case, the backend has no effective way to get the whole agent back on track again, so it raises an exception and exits. Ideally, though, it would be able to recover. So, when the LLM gives garbage responses, I've started putting Thoughts in its head. Extending the system prompt slightly past the colon with an innocuous phrase — to something like Thought: I need to decide on an Action and Action Input — is enough to break the LLM free. This "seeding its thoughts" approach works well even when the temperature is turned down to 0.0 (with maximal determinism during text generation), because the approach ensures a slightly different prompt from the one that failed. (A sketch of this thought-seeding also appears after this list.)
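
Here's a small sketch of points 2 and 5. The helper names (run_action_safely, complete_with_seeded_thought) and the llm callable are hypothetical stand-ins, not anything from LangChain itself:

    def run_action_safely(action_fn, action_input):
        """Point 2: turn a bad parameterization into an Observation, not a crash."""
        try:
            return str(action_fn(action_input))
        except Exception as exc:
            # The LLM sees this text as its Observation and gets another crack at it.
            return f"Error: {exc}. Check the Action Input format and try again."

    def complete_with_seeded_thought(llm, prompt):
        """Point 5: if the completion is empty or unparseable, seed a Thought and retry.

        Assumes the prompt already ends with "Thought:", as in the loop sketched above.
        """
        completion = llm(prompt)
        unparseable = ("Action:" not in completion) and ("Final Answer:" not in completion)
        if not completion.strip() or unparseable:
            # A slightly different prompt is enough to break the LLM free,
            # even at temperature 0.
            completion = llm(prompt + " I need to decide on an Action and Action Input.\n")
        return completion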

With these changes, I was able to raise the performance from <5% to on the order of 70-90% pretty quickly.

I suppose in addition to "keep thinking; a simpler solution will come", the wider lesson that was reinforced for me here is that popular ideas can easily suck up more than their share of oxygen in the public conversation (even good ideas like prompt engineering!). "Don't stop with what's popular; make sure everyone is looking at the real behaviors" is my takeaway for myself.

Pronouncing English from Colombian Spanish

I was invited to help a Colombian Spanish-speaking adult with pronunciation in English. I veer “academic”, so I went looking for rigorous scientific phonetics and phonology resources. It turns out that there aren’t many.

Even so, I ended up creating some pronunciation resources for a Colombian Spanish speaker learning English as a second language. I want to save them in case they are ever useful to someone else (including future me!):

(It turned out these resources weren’t perfectly right for my speaker, unfortunately. For instance, her dialect uses the short i /ɪ/ sound like in “fit”, but more rarely uses the long i /i/ like in “feet”; many Spanish dialects are the opposite.)

Spanish-English minimal pairs

berry/very; bag/beg; hat/hut; hat/heart; heart/hot; heart/hut; heart/hurt; wait/wet; hey/hi; bear/beer; beat/bit; beg/bug; beg/big; bird/bored; bird/bud; pot/port; boat/bought; hope/hop; hole/howl; bill/pill; cheap/jeep; cherry/sherry; chart/tart; deep/jeep; dent/tent; day/they; dawn/thorn; fast/past; ferry/very; bag/back; heart/art; jaw/your; line/nine; long/wrong; sun/sung; bank/bang; rock/wok; seat/sheet; sing/thing; Sue/zoo; tin/thin; then/zen; verse/worse; Luke/look; sheep/ship; cot/caught; further/farther

Exercises

  • Minimal pair memory
  • Decide whether “each item on my list is the same as the one on your list” where the 2 lists contain homophones, minimal pairs, etc. (fare/fair; fat/vat; …)
  • Stand up if two words are the same; stay seated if they’re different
  • Label each wall with a sound; listen to words and touch the appropriate wall
  • Write 2 columns of words in a shared space; pronounce a word from the list; ask whether it came from column 1 or 2; switch to student-led
  • R-controlled vowel bingo: rows for ɛ˞, ɑ˞, ɔ˞ (or maybe er/ir/ur/or/ar); columns for consonants of your choice; write in/pronounce words that use each cell to win a prize
  • Play the “MM-mm” syllable stress game: repeat the MM-mms to get native-like stress in words, and then build to sentences
  • Given a list of sentences with bolded stressed components, say them aloud and stand up quickly at stressed parts then sit back down (or raise hands, clap hands, tap table, etc.) – “I love coffee; I come here often; I don’t see it; Try this pizza!; He hurt his neck; etc.” – more at fluentu
  • Read a word, exaggerating the stressed syllable – then “echo” the word with a sentence that has a similar stress pattern (e.g., “interruption” –> “Let’s have lunch now”; “interruption” –> “He’s my uncle”; “interruption” –> “I said, under”; “interact” –> “It’s a fact”; “interact” –> “Here’s your hat”; “interact” –> “Where’s my snack?”) – more at fluentu
  • Use voice recognition on the phone to get independent feedback on interpretability

Configuring Python for corporate MITM certificates

We all love HTTPS because it gives us privacy. All HTTPS communications are private between the user and the remote server. Mostly.

On a corporate network, connections usually go through a proxy. The proxy likely man-in-the-middles all the corporate HTTPS connections: it pretends to be the remote site to me, and it pretends to be me to the remote site. Corporate man-in-the-middle (MITM) lets the organization audit & block traffic that would otherwise be opaque and potentially dangerous, which is great – I’m security-minded and all, but when security is handled by someone whose job it is, you get better outcomes. But even though it’s good, corporate proxying still causes trouble for me whenever my whole system doesn’t know to trust the proxy. And that happens more than I’d like.

I find that Python semi-regularly fails requests that succeed in the browser, responding to me with CERTIFICATE_VERIFY_FAILED. When that happens, I let Python know that I trust the corporate certificate by adding it to Python’s parallel trust store, wherever that might be:

  1. Get the PEM for the certificate. On a Mac, go to Keychain Access and “export as pem”.

  2. Run the following:

    import certifi
    import shutil

    path_to_mitm_pem = "/path/to/the_exported_corporate_certificate.pem"

    # certifi's certificate store location may vary based on the dependency
    # management approach (system Python, venv, conda, ...).
    cert_store = certifi.where()
    print(f"Python is using the cert store at: {cert_store}")

    # Sanity-check that the exported file really is a PEM-format certificate.
    with open(path_to_mitm_pem, "rt") as f:
        assert f.readline() == "-----BEGIN CERTIFICATE-----\n"

    # Append the corporate certificate to the end of the existing bundle.
    with open(path_to_mitm_pem, "rb") as new_pem_f, open(cert_store, "ab") as cert_store_f:
        shutil.copyfileobj(new_pem_f, cert_store_f)
    

After configuring the cert store, HTTPS requests should start succeeding.
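
If you want a quick check that the append worked, the requests library (which reads certifi’s bundle by default) is handy. This is just a sanity probe, with example.com standing in for whatever HTTPS site was failing for you:

    import requests

    # Before the fix, this kind of call raised SSLError (CERTIFICATE_VERIFY_FAILED)
    # behind the corporate proxy; afterwards it should complete normally.
    response = requests.get("https://example.com")
    print(response.status_code)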

I swear I have to do this at least once a month for some new environment or another.