My service was played in Enowars 9: Here is how it went

On Saturday, July 19, 2025, the Attack-Defense Capture The Flag event Enowars 9 took place. Serving as a qualifier for next year's German hacking championship, the Deutsche Hacking-Meisterschaft, the event drew approximately 110 international teams, including top-ranked names in the global CTF scene. Hosted by TU Berlin's Enoflag team, the competition featured eight custom services we developed over the past five months. One of these was my extended ChatGPT clone, memorAIs, which turned out to be the service exploited by the fewest teams.

Capture what?

As part of a class at Technische Universität Berlin, we were tasked with creating a service for the ninth event of the Enowars series. Such an event is a competition where teams both attack other teams' vulnerable services and defend their own. Each team runs the same set of services, in this case our self-written ones, and the goal is to find and exploit vulnerabilities in opponents’ systems while patching and securing their own. Behind every vulnerability, the attacking team finds a flag store into which a central instance places fresh flags every round. These flags act as proof of compromise and must be submitted within a short time window to be counted as valid. Points are awarded for successfully stealing flags (attacks) and for maintaining service availability and integrity (defense).

Your data, remembered. Your questions, answered.

Early in the development phase, my fellow classmates voted to give the CTF event a time travel theme (scandalously outvoting the vibe-coding suggestion), and so it came to pass that my ChatGPT clone not only answers but also remembers. This memory capability is powered by a Retrieval-Augmented Generation (RAG) pipeline, assembled with my own Word2Vec embedding model (more on that later) written in the AI-first language of my choice: C#. That is, the user can query the service with a question, the service retrieves relevant documents (called memories) from its database, and then generates an answer based on the retrieved information. Moreover, as an AI-advocate system, it's also a Model Context Protocol (MCP) server, effectively transforming it into a frontier AI application (editor's note: this paragraph may contain traces of irony). Users can have their personal AI agents (e.g., Claude Desktop, Cursor, etc.) use the trendy protocol to fetch the chat messages they exchanged with the service's language model: a quite small 135-million-parameter one from the SmolLM model family.
Anybody curious can tinker with the service here: stoffregen.io/memorais. The source code can be found on GitHub.


The implemented vulnerabilities

My service contained two vulnerabilities that teams needed to exploit in order to extract a flag. Given that the service is centered around AI capabilities, I aimed to design vulnerabilities that also revolved around AI concepts.
When discussing attack vectors in LLM-based applications, we often think of techniques like prompt injection, data and model poisoning, and system prompt leakage, which attempt to manipulate a model's behavior. However, such attacks aren’t directly applicable in the context of our CTF event, where the architecture requires that flags be retrievable for the attacking team from specific, exploitable locations within the service. Typically, a flag might be stored in a database. In other words, prompt injection is not a viable approach here, since it generally targets the consumers of the model output rather than the service itself.
The hardware constraints were another obstacle. We equipped each team with a virtual machine for the competition, which, due to financial restrictions, was limited to 2 vCPUs and 4 GB of RAM per service. That limited my service to a single 135-million-parameter model, far smaller than the models behind most modern AI applications, which have hundreds of billions of parameters. Such a small model easily drifts off-topic and produces nonsensical outputs when given complex prompts. Therefore, vulnerabilities that rely on the model understanding context, reproducing information correctly, and behaving at least somewhat deterministically were not feasible.

Embedding Inversion Vulnerability

Embeddings are a common way to represent text as a vector, allowing models to understand and process language. A Retrieval-Augmented Generation pipeline is assembled by first encoding documents into embeddings using an embedding model and storing them in a vector database for efficient retrieval. At query time, the user’s question is embedded, similar vectors are retrieved from the database, and those results are passed to an LLM to generate a context-aware response.

A sketch of a retrieval augmented generation pipeline

A basic RAG pipeline. Credits: Dr Julija (2024): How I built a Simple Retrieval-Augmented Generation (RAG) Pipeline [URL]
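
The retrieval half of such a pipeline can be sketched in a few lines of Python. This is a toy illustration, not code from memorAIs: the bag-of-words `embed` function, the `memories` list, and the prompt layout are all assumptions standing in for a real embedding model, a vector database, and the call to the LLM.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question: str, memories: list[str], k: int = 1) -> list[str]:
    """Return the k stored memories most similar to the question."""
    q = embed(question)
    ranked = sorted(memories, key=lambda m: cosine(q, embed(m)), reverse=True)
    return ranked[:k]

def build_prompt(question: str, memories: list[str], k: int = 1) -> str:
    """Put the retrieved memories in front of the question for the LLM."""
    context = "\n".join(retrieve(question, memories, k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Hypothetical memories, for illustration only.
memories = [
    "the meeting with anna is on friday",
    "my favourite pizza topping is mushrooms",
    "the wifi password is on the fridge",
]
print(build_prompt("when is the meeting with anna", memories))
```

A real pipeline would replace `embed` with a learned model and the sorted list with an approximate nearest-neighbour search, but the flow of embed, retrieve, and prompt stays the same.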

Now comes the twist: embeddings are often handled carelessly, without fully understanding what they can reveal about the original content. Recent research has shown that embeddings can sometimes be reversed with shockingly high accuracy, highlighting the importance of protecting them. This process of recovering parts of the original text is known as embedding inversion.
I considered implementing such a vulnerability but faced a few obstacles. First, even strong inversion models do not achieve sufficient accuracy for teams to reliably reverse the embeddings of flags. Second, embeddings are lossy: they compress information, preserving meaning rather than precise character sequences. Together, these factors make recovering a CTF flag from its embedding essentially impossible.
I still felt this could be an interesting challenge, inspired by the original idea of embedding inversion and suitable for the competition. To work around the obstacles, I implemented a custom Word2Vec embedding model with several tailored modifications, enabling a fully reversible embedding process.

-[----->+<]>--.++..++++. To explain part of my service, I need to start with a clever idea from our tutors: instead of using the standard Enowars flags, they suggested replacing them with Brainfuck code that, when executed, simply prints the original flags. Brainfuck is an esoteric programming language known for its extreme minimalism. It has only eight commands but is Turing complete, meaning it can compute anything that’s computable. While powerful, it’s notoriously difficult to read and write due to its limited character set.
Typically, an Enowars flag starts with the "ENO" prefix followed by 48 random Base64 characters.
For example, if you paste the following code into a Brainfuck interpreter,

++++++++[>+++++++++>+++++++++>+++++++++++++++>+++++++++++++>+++++++>+++++>++++++++>++++++++++++>++++++>++++++++++<<<<<<<<<<-]>---.>++++++.+.+++++++++.<.+++++++++.---------.>>----.<<.>>+.>---.>----.<++++++.<++.>>+.<<<-.<++++++.>>>>>+++.>>>>++.<<<<<+.>>+++.<.>++.<<<<+++.<<-.>>>++.>.+.>>.>+++.>+.>-.--.+++++.<<<.<<<++++.>>>>++.<<<<.>+.<<<++.>-----.>-.--.<<<++++.>>>>---.<+.-----.+++.>+++.>>>>+.<<<<<<+++++.

you will get the original flag ENOXENEtEue4kw5WK+R6C+EzJm67Ec1QOTEqeq8YupnN5ojm82z.
We informed the teams upfront to expect these Brainfuck-encoded flags. This added complexity was roughly equal for both less and more experienced teams, although we anticipated it would disrupt some of the sophisticated tooling used by advanced teams.
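
For readers without an interpreter at hand, a minimal one fits in a few lines of Python. This is a sketch under common conventions that are assumptions here (8-bit wrapping cells, a 30,000-cell tape, no support for the input command `,`); the function name `run_brainfuck` is made up for illustration. It is run on the short snippet that opens this section.

```python
def run_brainfuck(code: str, max_steps: int = 1_000_000) -> str:
    """Interpret Brainfuck without input; cells are 8-bit and wrap around."""
    # Pre-compute the position of each bracket's partner.
    jump, stack = {}, []
    for pos, ch in enumerate(code):
        if ch == "[":
            stack.append(pos)
        elif ch == "]":
            start = stack.pop()
            jump[start], jump[pos] = pos, start

    tape, ptr, pc, steps, out = [0] * 30_000, 0, 0, 0, []
    while pc < len(code) and steps < max_steps:
        ch = code[pc]
        if ch == ">":
            ptr += 1
        elif ch == "<":
            ptr -= 1
        elif ch == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif ch == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif ch == ".":
            out.append(chr(tape[ptr]))
        elif ch == "[" and tape[ptr] == 0:
            pc = jump[pc]  # skip the loop body
        elif ch == "]" and tape[ptr] != 0:
            pc = jump[pc]  # jump back to the loop start
        # Every other character counts as a comment and is skipped.
        pc += 1
        steps += 1
    return "".join(out)

print(run_brainfuck("-[----->+<]>--.++..++++."))  # prints 1337
```

Because unknown characters are skipped, surrounding the program with arbitrary non-Brainfuck characters does not change its output.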

The second hidden model: When selecting an embedding model, I opted for a simple approach, largely due to the hardware limitations we had set. Word2Vec, a shallow neural network, assigns each word in its training corpus a single vector representation: an embedding.
To address the challenges described earlier, I introduced a second, hidden Word2Vec model that operates differently. Instead of treating flags as ordinary words, this model generates an embedding for each individual character. These character embeddings are then concatenated to form a fixed 1500-dimensional vector representation of the flag.
This approach prevents Out-of-Vocabulary (OOV) issues and eliminates the need for averaging steps, which would otherwise make it impossible to reconstruct the original flag.

Out-Of-Vocabulary: Words, alien to the original text corpus, are called Out-of-Vocabulary words and are typically assigned a zero vector. This means that the model cannot handle them, and they are effectively ignored.
Obviously, putting all possible 51-character random flags into the training corpus would be computationally infeasible, but without doing so, they would always be OOV. By embedding each flag character by character, I only need all possible flag characters in the training corpus.

Averaging: With normal sentences, an averaging step is often used with Word2Vec to get a single vector for the whole sentence, which makes sentences comparable to each other. But averaging blends all the characters together, which erases the order and exact content of the flag. It’s like mixing different paint colors into one bucket: once they’re blended, you can’t separate them back out or tell which colors you started with.
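
The difference between concatenating and averaging can be made concrete with a toy example. The character vectors in `TABLE` below are invented for illustration (a trained model would supply them), and the nearest-neighbour `invert` function is one plausible way to reverse a concatenated embedding, not necessarily the approach teams took.

```python
import math

DIM = 3  # dimensions per character, as in the post

# Hypothetical character vectors; a trained model would supply these.
TABLE = {
    "+": (0.9, 0.1, 0.0),
    "-": (0.1, 0.9, 0.0),
    ">": (0.0, 0.2, 0.8),
    "<": (0.7, 0.0, 0.7),
    ".": (0.3, 0.3, 0.3),
}

def embed_concat(text: str) -> list[float]:
    """Concatenate per-character vectors: position and content survive."""
    return [x for ch in text for x in TABLE[ch]]

def embed_average(text: str) -> list[float]:
    """Average per-character vectors: order is lost."""
    return [sum(TABLE[ch][i] for ch in text) / len(text) for i in range(DIM)]

def invert(vector: list[float]) -> str:
    """Map each 3-dimensional chunk back to its nearest known character."""
    chars = []
    for i in range(0, len(vector), DIM):
        chunk = vector[i:i + DIM]
        chars.append(min(TABLE, key=lambda ch: math.dist(TABLE[ch], chunk)))
    return "".join(chars)

code = "+>-<."
print(invert(embed_concat(code)) == code)            # True: fully reversible
print(embed_average("+-") == embed_average("-+"))    # True: order is gone
```

The nearest-neighbour lookup also tolerates small numerical noise in the stored vector, as long as the character vectors are far enough apart.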

Qdrant's employed norm: Every embedding vector is saved to the vector database Qdrant, which enables efficient similarity searches. On upload, Qdrant normalizes the vectors to ensure that distance calculations are consistent. It does so by multiplying the vector by the inverse of its Euclidean norm. This computation preserves the vector's direction but entirely discards its magnitude. Given only the normalized vector, teams would be unable to invert the embedding, since infinitely many vectors lie along the same direction. Therefore, I had to save the norm along with the vector so that teams could reverse the normalization and continue with the inversion.
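
A short numeric sketch shows why the stored norm matters. The helper names (`l2_norm`, `normalize`, `denormalize`) and the example vectors are made up for illustration; only the arithmetic is the point.

```python
import math

def l2_norm(v: list[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))

def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length (direction kept, magnitude dropped)."""
    n = l2_norm(v)
    return [x / n for x in v]

def denormalize(unit: list[float], stored_norm: float) -> list[float]:
    """Undo the normalization, which requires the original norm."""
    return [x * stored_norm for x in unit]

a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length

# Both vectors collapse onto the same unit vector.
print(normalize(a), normalize(b))

# With the norm saved alongside, the original vector can be restored.
print(denormalize(normalize(a), l2_norm(a)))
```

Without the saved norm, `a` and `b` are indistinguishable after upload, which is exactly the ambiguity that would block the inversion.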

Variable Brainfuck flag length: The length of the Brainfuck code generated for each flag varies depending on the flag itself. However, since the Word2Vec model requires a fixed output vector size to be defined in advance, and the vector database also expects vectors of a consistent size, I needed to standardize the embedding dimensions for both normal sentences and Brainfuck-encoded flags.
To keep the vectors unobtrusive, I chose a size close to what popular embedding models, such as those from OpenAI, typically use (1536 dimensions). To simplify the calculation that follows, I rounded this down to 1500 dimensions.
One of my fellow classmates developed the Brainfuck encoder and tested it on a few million randomly generated flags to estimate the maximum possible length of an encoded flag. I then used this maximum length as the divisor for distributing the embedding dimensions. In our case, the longest encoded flag reached close to 500 characters. This meant that I set my hidden Word2Vec model to represent each flag character as a 3-dimensional vector (1500 / 500 = 3).
Since this required every flag to have exactly 500 characters, I padded shorter flags with non-Brainfuck characters, which interpreters luckily ignore.
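
The padding step and the dimension arithmetic can be sketched as follows. The pad character `~` and the helper names are assumptions; the post only states that some non-Brainfuck character was used.

```python
FLAG_LEN = 500                           # fixed encoded length from the post
DIMS_PER_CHAR = 3                        # 1500 / 500
VECTOR_DIM = FLAG_LEN * DIMS_PER_CHAR    # 1500 dimensions in total
PAD_CHAR = "~"                           # assumption: any non-Brainfuck character works
BF_COMMANDS = set("+-<>[].,")

def pad(code: str) -> str:
    """Right-pad encoded flags to the fixed length the model expects."""
    if len(code) > FLAG_LEN:
        raise ValueError("encoded flag exceeds the fixed length")
    return code.ljust(FLAG_LEN, PAD_CHAR)

def strip_padding(padded: str) -> str:
    """Drop everything an interpreter would ignore anyway."""
    return "".join(ch for ch in padded if ch in BF_COMMANDS)

padded = pad("-[----->+<]>--.++..++++.")
print(len(padded), VECTOR_DIM)  # 500 1500
```

Since the padding consists of characters outside the eight Brainfuck commands, the padded program still prints the same flag when executed.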

Crypto but stupid

The Model Context Protocol is one of the hot topics in the AI community, enabling seamless communication between a language model and various APIs.

A sketch depicting how the Model Context Protocol facilitates communication between an LLM and different APIs.


The Model Context Protocol. Credits: Descope (2025): What Is the Model Context Protocol (MCP) and How It Works [URL]

To have your personal AI agent access your chat history, it must first negotiate a special JSON Web Token (JWT) with my service, which is then used in the authorization header of all subsequent MCP requests. MCP usually works with OAuth 2.0, which would have been a charm to implement with a real self-hosted Identity Provider (IdP) like Keycloak and other fancy extensions like Proof Key for Code Exchange (PKCE). Unfortunately, time and hardware constraints kept me from implementing this.
Negotiating the JWT involves a challenge-response authentication using a user-generated access token as the shared secret and the rarely employed 8-bit cipher feedback (CFB) mode of the Advanced Encryption Standard (AES). However, because the initialization vector (IV) is set to all zeros, an attacker has a 1 in 256 chance per attempt of being successfully authenticated, making brute-force attacks feasible. This flaw is inspired by Zerologon, a critical vulnerability from 2020 in Microsoft's Netlogon protocol. For a more detailed explanation, I recommend reading the Wikipedia article, which I expanded from a stub into a full article last semester as part of a course at Universitat Politècnica de València.
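
Where the 1 in 256 comes from can be shown with a small experiment. CFB mode only ever uses the encryption direction of the block cipher, so the sketch below swaps AES for a keyed hash as a stand-in (`toy_block_encrypt` is NOT AES, and none of these names come from the service). With an all-zero IV and an all-zero plaintext, the ciphertext is all zeros whenever the first keystream byte happens to be zero, which a random key achieves with probability 1/256.

```python
import hashlib

BLOCK = 16  # block size in bytes, as for AES

def toy_block_encrypt(key: bytes, block: bytes) -> bytes:
    """Stand-in for AES: a keyed pseudorandom function (NOT real AES)."""
    return hashlib.sha256(key + block).digest()[:BLOCK]

def cfb8_encrypt(key: bytes, iv: bytes, plaintext: bytes) -> bytes:
    """8-bit cipher feedback: one keystream byte per plaintext byte."""
    register, out = iv, bytearray()
    for p in plaintext:
        keystream_byte = toy_block_encrypt(key, register)[0]
        c = p ^ keystream_byte
        out.append(c)
        register = register[1:] + bytes([c])  # shift the ciphertext byte in
    return bytes(out)

ZERO_IV = bytes(BLOCK)
ZEROS = bytes(8)

def is_weak(key: bytes) -> bool:
    """True if zero IV plus zero plaintext yields an all-zero ciphertext."""
    return cfb8_encrypt(key, ZERO_IV, ZEROS) == ZEROS

keys = [i.to_bytes(4, "big") for i in range(20_000)]
weak = [k for k in keys if is_weak(k)]
print(len(weak) / len(keys))  # expected to land near 1/256, about 0.0039
```

Once the first ciphertext byte is zero, the shift register stays all zeros, so every later byte repeats the same result. That is why a single lucky byte decides the whole exchange, and why roughly 256 attempts are needed on average.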

And so it began...

The rainbow scoreboard

A screenshot of the scoreboard mid-game. You see it colored in various colors as infrastructure problems led to unreachable services.

A screenshot of the scoreboard taken mid-game. Teams are listed on the left, and services are shown at the top, with their vulnerabilities listed below. A red blood emoji indicates if a vulnerability was exploited, along with the team responsible. Green services are reachable, red and orange indicate downtime or unreachability, and blue typically means the service is restarting.

The event was unfortunately clouded by infrastructure problems on our side. Our tutors and members of TU Berlin's Enoflag CTF team managed the infrastructure setup. Each team was allocated a virtual machine running the services, with traffic routed through two routers. Too few routers. Attack-defense CTFs can be extremely demanding on infrastructure, as teams often launch near-denial-of-service traffic during reconnaissance and exploitation. The limited capacity of only two routers became a bottleneck, causing services to become unavailable and unreachable by other teams. Scaling the infrastructure took time, and during that period, the competition experience suffered. One team reported their virtual machines being partially unreachable for a whole hour. Frustrating for everyone involved.

They bypassed my crypto vulnerability :(

Like each team, we had our own virtual machine, primarily for debugging purposes. This setup allowed us to monitor all incoming traffic. During a CTF event, attacking teams typically send traffic to all other teams, hoping to capture as many flags as possible. This gave us a realistic view of the kinds of attacks teams were carrying out against each other. I kept a close eye on the traffic to my service to spot any unusual patterns that might indicate a malfunction. If something went wrong, I would have had to provide a hotfix for the teams to apply.
Right after the first team exploited my crypto vulnerability, I checked the traffic and saw exactly what I expected: the team was sending many requests to brute-force the access token. But things quickly changed. The scoreboard showed that several more teams had cracked the vulnerability, yet the traffic didn’t spike as I anticipated. Instead, I observed behavior resembling a normal authentication flow. Teams appeared to simply create a regular user account to get the flag.
Then it hit me: in the very last request, they were sending a different user ID, the one belonging to the central instance that stored the flag in the database. My code was missing a crucial check in the second step of the handshake, allowing teams to impersonate the central instance and retrieve the flag without brute-forcing the access token. Damn. My service had an unintended bypass. Strangely, only a few teams actually exploited this vulnerability. On the bright side, this eased the heavy traffic load we were experiencing, since teams no longer needed an average of 256 attempts to obtain the JWT.
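
The post does not show the handshake code, so the following is only a guess at the class of bug, written as a hypothetical two-step handshake in Python. Every name here (`TOKENS`, `start`, `finish`, the `check_user` switch) is invented, and HMAC stands in for the AES-CFB8 response computation. The point is the binding: the user named in step two must be the user the challenge was issued to.

```python
import hashlib
import hmac
import secrets

# Hypothetical user store: user ID -> access token (the shared secret).
TOKENS = {
    "attacker": b"attacker-token",
    "checker": b"secret-checker-token",
}
pending = {}  # challenge ID -> (user the challenge was issued to, challenge)

def start(user_id: str):
    """Step 1: issue a challenge for the given user."""
    challenge_id = secrets.token_hex(8)
    challenge = secrets.token_bytes(16)
    pending[challenge_id] = (user_id, challenge)
    return challenge_id, challenge

def respond(token: bytes, challenge: bytes) -> bytes:
    """Client side: prove knowledge of the token (HMAC as a stand-in)."""
    return hmac.new(token, challenge, hashlib.sha256).digest()

def finish(challenge_id: str, claimed_user: str, response: bytes, check_user: bool = True):
    """Step 2: verify the response and issue a session for claimed_user."""
    issued_to, challenge = pending.pop(challenge_id)
    expected = respond(TOKENS[issued_to], challenge)
    if not hmac.compare_digest(expected, response):
        return None
    if check_user and claimed_user != issued_to:
        return None  # the binding check that closes the bypass
    return f"session-for-{claimed_user}"

# Without the check, a valid response for one's own account unlocks any user.
cid, ch = start("attacker")
print(finish(cid, "checker", respond(TOKENS["attacker"], ch), check_user=False))

# With the check, the same request is rejected.
cid, ch = start("attacker")
print(finish(cid, "checker", respond(TOKENS["attacker"], ch)))  # None
```

In this sketch the fix is a single comparison, which matches how small such an oversight can be even if the real code looked different.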

A player pointing out that brute-forcing the JWT would have put the servers under even higher load.

A player on our Discord server. Actually, a JWT remained valid for the entire game, so teams only needed to exploit others once.

How the Bushwhackers took over

One of my fellow classmates developed a neat service that cloned LeetCode, a platform where users solve coding challenges by uploading custom code, which is then executed and tested against predefined test cases. To enable this, the code runs inside a sandboxed virtualized environment for security. Midway through the competition, one of the participating teams contacted us after discovering suspicious SSH keys on their server, asking whether they belonged to us. Around the same time, we realized that the Russian team, known as the Bushwhackers, had managed to escape the sandbox. They began stealing flags directly from the databases of all other services, bypassing the need to exploit the intended vulnerabilities.

Emil Lerner, a member of the Russian team, explaining how to escape the virtualization.

Conclusion

Having never participated in a CTF before, I was surprised by how much fun it was to develop a service and then watch teams try to break into it. We stayed at the university from 1 p.m. until well past midnight, chatting about the teams and services, helping our tutors with the infrastructure, and placing bets on which vulnerability would be exploited next. My service received positive feedback later on. The embedding inversion vulnerability was the third-to-last to be exploited, and overall, my service was exploited by the fewest teams. I’ll definitely carry the CTF spirit forward and plan to join one as a player next time.

Niclas Stoffregen, 22.08.2025
