
Large language models in production: what Australian teams get wrong

Moving a large language model from demo to production is where most Australian AI projects quietly stall. Here is a clear look at the failure points and what experienced teams do differently.

By Tobias Strand · June 12, 2026


Large language models have moved from curiosity to core infrastructure faster than most Australian IT teams expected. Pilots run well. The demos are convincing. Then the model hits production and something quietly breaks: outputs drift, latency spikes, costs spiral, or the legal team asks a question nobody has an answer to. The gap between a working prototype and a reliable production system is wider than it looks from the outside, and the mistakes tend to cluster around a handful of predictable patterns.

Treating the model as the finished product

The most common mistake is shipping the model and calling it done. Large language models are components, not complete systems. A production deployment needs surrounding infrastructure: guardrails on input and output, retrieval layers to anchor responses in current information, monitoring for quality drift, fallback logic, and abuse controls. Teams that skip this work because the prototype "just worked" are setting themselves up for an incident they can't explain to the business.
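To make the "component, not system" point concrete, here is a minimal sketch of a wrapper that adds an input guardrail, an output guardrail, and fallback logic around a model call. Everything in it is illustrative: `guarded_call`, `call_model`, the character limit, and the blocked phrase are hypothetical stand-ins, not any provider's API.

```python
from typing import Callable

MAX_INPUT_CHARS = 4000  # assumed limit; tune per use case
BLOCKED_TERMS = ("ignore previous instructions",)  # illustrative only
FALLBACK = "Sorry, I can't help with that right now."


def guarded_call(user_text: str, call_model: Callable[[str], str]) -> str:
    """Wrap a model call with input checks, output checks, and a fallback."""
    # Input guardrail: reject oversized or obviously abusive input.
    lowered = user_text.lower()
    if len(user_text) > MAX_INPUT_CHARS:
        return FALLBACK
    if any(term in lowered for term in BLOCKED_TERMS):
        return FALLBACK

    try:
        reply = call_model(user_text)
    except Exception:
        # Fallback logic: a provider error should not reach the user raw.
        return FALLBACK

    # Output guardrail: never return an empty or whitespace-only reply.
    if not reply or not reply.strip():
        return FALLBACK
    return reply.strip()
```

A real deployment would replace the phrase list with proper policy checks, but the shape stays the same: the model call sits in the middle of the system, not at its edge.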

This is closely related to the adoption gap covered in "AI adoption in Australian enterprises: where things stand now". The pattern is consistent: pilots succeed in controlled conditions, and then the production context introduces variables the prototype never faced. Real users ask unexpected questions. Edge cases multiply. Context windows fill in ways the demo never did.

Underestimating the retrieval problem

Most enterprise LLM use cases depend on proprietary or time-sensitive information: internal policies, product catalogues, customer records, recent regulatory guidance. A base model knows none of this. Teams often respond by stuffing documents into the context window, which works up to a point and then degrades badly as the window fills. The better answer is a proper retrieval-augmented generation pipeline, where a search layer fetches the most relevant content before the model generates a response.

Getting retrieval right is genuinely hard. Chunking strategies matter. Embedding models need to be matched to the query domain. Re-ranking adds latency. Most teams underinvest here because retrieval infrastructure is less visible than the model itself, but it is usually the difference between a system users trust and one they abandon. For a deeper look at how this layer works and why enterprises need it, the breakdown of retrieval-augmented generation covers the architecture in detail.
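The retrieval step can be sketched in a few lines. The example below is a toy under stated assumptions: it chunks text into overlapping word windows and ranks chunks by bag-of-words cosine similarity, which stands in for the embedding model and vector store a real pipeline would use. The function names and default sizes are hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping windows of `size` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    chunks = []
    for start in range(0, len(words), size - overlap):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def vectorise(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = vectorise(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, vectorise(c)), reverse=True)
    return ranked[:k]
```

Only the chunks returned by `retrieve` would be placed in the prompt, which is what keeps the context window small. Word overlap misses paraphrases, which is one reason matching the embedding model to the query domain matters.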

No plan for evaluation

How do you know the model is performing well? Most teams cannot answer this question clearly at launch. They rely on informal human review or wait for users to complain. Neither is adequate at scale. Evaluation frameworks for LLMs are still maturing, but the basics are achievable: a set of golden test cases, automated checks for format and policy compliance, and a regular human review of sampled outputs. Without these, quality regressions go undetected until they become visible failures.
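As a rough illustration of those basics, the sketch below runs a set of golden test cases through a model and applies an automated format check. It assumes, purely for the example, that the system is expected to return a JSON object with an `answer` string; `GoldenCase`, `is_valid_format`, and `run_eval` are hypothetical names.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenCase:
    prompt: str
    must_contain: str  # substring the answer is expected to include


def is_valid_format(output: str) -> bool:
    """Format check: output must be a JSON object with an 'answer' string."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and isinstance(data.get("answer"), str)


def run_eval(cases: list[GoldenCase], call_model: Callable[[str], str]) -> float:
    """Return the fraction of golden cases that pass both checks."""
    passed = 0
    for case in cases:
        output = call_model(case.prompt)
        if not is_valid_format(output):
            continue
        answer = json.loads(output)["answer"]
        if case.must_contain.lower() in answer.lower():
            passed += 1
    return passed / len(cases) if cases else 0.0
```

Substring matching is a crude correctness check, so a pass rate from a harness like this complements the sampled human review rather than replacing it.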

Evaluation also needs to account for model updates. When a foundation model provider updates its hosted API, the model's behaviour can shift in subtle ways. A system that worked reliably one month may produce noticeably different outputs the next. Teams running on managed API endpoints need to treat model version changes as deployment events and test accordingly.
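One way to treat a version change as a deployment event is to replay a fixed prompt set against both versions and flag what changed. The sketch below assumes `baseline` and `candidate` are callables wrapping the old and new model versions; both names, and `changed_prompts` itself, are hypothetical.

```python
from typing import Callable


def changed_prompts(
    prompts: list[str],
    baseline: Callable[[str], str],
    candidate: Callable[[str], str],
) -> list[str]:
    """Return the prompts whose normalised output differs between versions."""

    def normalise(text: str) -> str:
        # Ignore case and whitespace so trivial differences are not flagged.
        return " ".join(text.lower().split())

    return [p for p in prompts if normalise(baseline(p)) != normalise(candidate(p))]
```

Exact comparison is deliberately strict and will over-report when outputs vary between runs, so the flagged prompts are best read as a review queue, not a list of failures.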

Ignoring cost until it becomes a crisis

Token costs are easy to overlook in a prototype, where usage is low and budgets are loose. In production, the arithmetic changes fast. A system processing thousands of requests per day with long context windows can generate bills that surprise even experienced cloud teams. The failure mode here is not malice or carelessness; it is that nobody modelled the cost curve before launch.
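Modelling the cost curve takes only a few lines of arithmetic. The sketch below is a back-of-envelope model; the prices and volumes in the usage example are hypothetical figures chosen for illustration, not any provider's actual pricing.

```python
def monthly_cost(
    requests_per_day: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
    days: int = 30,
) -> float:
    """Estimate monthly spend from per-request token counts and token prices."""
    # Price-weighted tokens for one request, in "dollars x 1,000,000".
    weighted = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    )
    return requests_per_day * days * weighted / 1_000_000


# Hypothetical prices: 3.00 per million input tokens, 15.00 per million output.
prototype = monthly_cost(50, 1_000, 200, 3.0, 15.0)
production = monthly_cost(5_000, 6_000, 500, 3.0, 15.0)
print(f"prototype: {prototype:,.2f}  production: {production:,.2f}")
```

Under these assumed numbers the prototype costs about 9 a month and the production system about 3,825, a roughly 425-fold increase driven by both request volume and longer context.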

Practical mitigations exist: caching common responses, using smaller models for tasks that do not require frontier capability, truncating context intelligently, and batching requests where latency permits. None of these are complicated, but they require deliberate engineering work. Teams that treat cost as an afterthought tend to scramble for fixes under pressure, which leads to rushed changes and new failure modes.
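Two of these mitigations, caching and routing to a smaller model, can be sketched together. The class below is a minimal illustration: `CachedRouter` is a hypothetical name, the two models are plain callables, and routing on prompt length is an assumed heuristic standing in for whatever task classification a real system would use.

```python
import hashlib
from typing import Callable


class CachedRouter:
    """Cache replies and send short prompts to a cheaper model."""

    def __init__(
        self,
        small_model: Callable[[str], str],
        large_model: Callable[[str], str],
        long_prompt_chars: int = 500,  # assumed routing threshold
    ) -> None:
        self.small_model = small_model
        self.large_model = large_model
        self.long_prompt_chars = long_prompt_chars
        self._cache: dict[str, str] = {}

    @staticmethod
    def _key(prompt: str) -> str:
        # Normalise case and whitespace so near-identical prompts share a key.
        normalised = " ".join(prompt.lower().split())
        return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

    def ask(self, prompt: str) -> str:
        key = self._key(prompt)
        if key in self._cache:
            return self._cache[key]  # cache hit: no model call, no token cost
        if len(prompt) > self.long_prompt_chars:
            reply = self.large_model(prompt)
        else:
            reply = self.small_model(prompt)
        self._cache[key] = reply
        return reply
```

An unbounded in-memory dictionary is fine for a sketch; a production cache would need expiry and size limits so that stale answers do not outlive the information they were based on.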

Privacy and data handling as an afterthought

Australian organisations using third-party LLM APIs need to think carefully about what data they are sending offshore. The Privacy Act, sector-specific regulations, and customer expectations all place constraints on what can flow into a commercial model endpoint. This is especially relevant for teams handling health records, financial data, or government information.

The failure is usually not deliberate. It is that privacy review happens after the architecture is locked in, when retrofitting controls is painful and expensive. The right time to ask "what data will this system process, and where will it go?" is at the design stage, not after the contract is signed with a US-based API provider. Sovereign and on-premises deployment options exist for high-sensitivity workloads, and they deserve evaluation early rather than as a last resort.
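One possible design-stage control is to strip obvious identifiers before text leaves the organisation. The sketch below is illustrative only: the two patterns cover email addresses and common Australian phone number shapes, they are far from exhaustive, and `redact` is a hypothetical helper, not a substitute for a privacy review.

```python
import re

# Illustrative patterns only; real PII detection needs much broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Assumed shape: +61 or a leading 0, an area or mobile digit, then 8 digits
# with optional spaces or hyphens between them.
PHONE = re.compile(r"(?:\+61[ -]?|0)[2-478](?:[ -]?\d){8}")


def redact(text: str) -> str:
    """Replace email addresses and phone numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)
```

Placing a step like this in front of the API client makes "what data will this system process, and where will it go?" a property of the architecture instead of a policy document.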

Prompt management without discipline

System prompts and few-shot examples are, effectively, code. They control the model's behaviour and they change over time as teams tune performance. Yet most organisations treat them as informal text files, without versioning, review processes, or change management. When something breaks, there is no audit trail. When a new team member edits a prompt to fix one problem, they can inadvertently introduce another.

Prompt management tooling has matured considerably, and the good practices from software engineering transfer directly: version control, code review for prompt changes, staged rollouts, and automated regression testing. Teams that apply engineering discipline to their prompts find that production systems are dramatically easier to maintain and debug.
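A minimal sketch of what that discipline can look like in code: each prompt gets a version identifier derived from its content, and a regression check runs before a changed prompt is rolled out. `PromptVersion`, `regression_failures`, and the two-argument `call_model` signature are assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PromptVersion:
    name: str
    text: str

    @property
    def version(self) -> str:
        """Content hash: any edit to the text yields a new version id."""
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()[:12]


def regression_failures(
    prompt: PromptVersion,
    call_model: Callable[[str, str], str],  # (system_prompt, user_input) -> reply
    cases: list[tuple[str, str]],  # (user_input, expected_substring)
) -> list[str]:
    """Return the user inputs whose reply lacks the expected substring."""
    failures = []
    for user_input, expected in cases:
        reply = call_model(prompt.text, user_input)
        if expected.lower() not in reply.lower():
            failures.append(user_input)
    return failures
```

Logging the `version` value alongside every request gives the audit trail that informal text files lack: when behaviour changes, the prompt in force at the time can be identified.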

What separates teams that ship well

The Australian teams doing this well share a few traits. They treat LLM deployment as a software engineering problem, not an AI research problem. They invest in observability from day one, logging inputs, outputs, latency, and cost in a way that supports both debugging and continuous improvement. They involve legal, privacy, and risk stakeholders before the system is live rather than after. And they plan for the model to be wrong: they design fallbacks, set user expectations appropriately, and build feedback loops that surface errors quickly.
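Day-one observability can start as a thin wrapper around the model call. The sketch below logs input, output, latency, and an estimated cost as one JSON record per request. The four-characters-per-token estimate and the default price are rough assumptions for illustration; a real system would use the token counts its provider reports. `logged_call` is a hypothetical name.

```python
import json
import logging
import time
from typing import Callable

logger = logging.getLogger("llm")


def logged_call(
    prompt: str,
    call_model: Callable[[str], str],
    price_per_1k_tokens: float = 0.01,  # hypothetical blended price
) -> tuple[str, dict]:
    """Call the model and emit a structured log record for the request."""
    start = time.perf_counter()
    reply = call_model(prompt)
    latency_ms = (time.perf_counter() - start) * 1000

    # Rough estimate only: about 4 characters per token for English text.
    est_tokens = (len(prompt) + len(reply)) // 4
    record = {
        "prompt": prompt,
        "reply": reply,
        "latency_ms": round(latency_ms, 2),
        "est_tokens": est_tokens,
        "est_cost": est_tokens / 1000 * price_per_1k_tokens,
    }
    logger.info(json.dumps(record))
    return reply, record
```

Logging full prompts and replies is useful for debugging but carries its own privacy implications, so what gets stored and for how long belongs in the same early conversation with legal and risk stakeholders.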

None of this is exotic. It is the same discipline that separates reliable software from fragile software. The difference is that LLMs introduce a new class of failure mode: outputs that are plausible-sounding but wrong, inconsistent across runs, or subtly misaligned with business intent. Catching and managing that class of failure requires instrumentation and process that most teams do not have by default. Building it is the actual work of putting a large language model into production.