The part I didn’t think about
Using an external API felt like the right move early on. I didn’t have to deal with models, which meant no infra, no optimization, and no real headaches. Just call the API and move on.
But I ignored a few things. I didn’t control the system. I had no fallback. I couldn’t tweak how it behaved.
It worked because everything was stable. The moment it wasn’t, I had a problem.
When it broke
Moderation sat right in the middle of my flow. Every post and comment went through it.
When the API stopped being usable, I couldn’t safely accept content. I had no backup system, and the pipeline was effectively broken.
This wasn’t a bug. It was a missing system.
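In hindsight, even a thin fail-closed wrapper would have kept the pipeline defined when the API went away. A minimal sketch of that missing layer — the function names and the fail-closed policy are my illustration, not code from the original service:

```python
from typing import Callable, Dict

Scorer = Callable[[str], Dict[str, float]]


def moderate_with_fallback(text: str, primary: Scorer, fallback: Scorer) -> Dict[str, float]:
    """Try the external API first; fall back to a local scorer if it fails.

    Each scorer maps text to a dict of label -> score. If every scorer
    fails, fail closed: report maximum toxicity so the content is held
    rather than silently published.
    """
    for scorer in (primary, fallback):
        try:
            return scorer(text)
        except Exception:
            continue  # try the next scorer in line
    return {"toxicity": 1.0}  # fail closed: hold the content for review
```

The point isn't the specific policy — it's that the decision of what happens when the scorer is unavailable exists at all.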
Looking for alternatives
My first instinct was simple. Replace the API.
I checked a few options. Some were expensive. Some were slow. Some didn’t give enough control. Nothing felt like a clean fit.
At that point, I had two choices. Keep depending on another service, or take ownership.
I went with the second.
Moving to a self-hosted model
I switched to a local model and picked Detoxify, a BERT-based model from Unitary trained for toxicity detection.
This solved one problem. I was no longer dependent on an API.
But it created another one. Now I had to serve the model.
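For reference, Detoxify exposes a simple predict interface that returns a dict of scores per label. Wrapping it behind one function kept the rest of the service ignorant of the backend — a sketch, with the lazy import and helper names being my own choices rather than the original code:

```python
def load_scorer():
    """Load the Detoxify model once at startup.

    Imported lazily here so the pure helper below can be read and
    tested without the package (or its model weights) installed.
    """
    from detoxify import Detoxify  # pip install detoxify

    model = Detoxify("original")  # BERT-based toxicity model
    # For a single string, predict returns e.g.
    # {"toxicity": 0.97, "insult": 0.91, ...}
    return model.predict


def is_toxic(scores: dict, threshold: float = 0.5) -> bool:
    """Flag content if any label crosses the threshold."""
    return any(value >= threshold for value in scores.values())
```

Loading the model once and passing the bound `predict` around matters on small machines: model load is the expensive step, inference is the cheap one.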
The real problem started here
Running a model on your machine is easy. Running it in a backend is different.
I had to deal with model size, inference time, CPU limits, and memory limits. I wasn’t using any high-end infra either. This had to run on a free tier.
At this point, moderation was no longer just a feature. It became a system I had to design.
Making it fast enough
The raw PyTorch model was too heavy. Inference was slow, and memory usage was high.
So I moved to ONNX.
I didn’t use a prebuilt export. I created my own using ONNX and onnx-script. This gave me more control over how the model runs.
After that, inference got faster, memory usage dropped, and performance became predictable.
Now it was usable.
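Serving the exported graph then comes down to an onnxruntime session. Roughly — the input names, the single-output assumption, and the sigmoid post-processing reflect my export, not a universal recipe:

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw logit to a probability; the exported graph emits logits."""
    return 1.0 / (1.0 + math.exp(-x))


def make_session(path: str = "detoxify.onnx"):
    """Create a CPU-only inference session.

    Imported lazily so the pure helper above stays testable without
    onnxruntime installed.
    """
    import onnxruntime as ort  # pip install onnxruntime

    return ort.InferenceSession(path, providers=["CPUExecutionProvider"])


def score(session, input_ids, attention_mask):
    """Run one forward pass; input names must match the exported graph."""
    (logits,) = session.run(
        None, {"input_ids": input_ids, "attention_mask": attention_mask}
    )
    return [sigmoid(x) for x in logits[0]]
```

Pinning the provider to `CPUExecutionProvider` makes the no-GPU constraint explicit instead of something discovered at runtime.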
Deploying it on cheap infra
I deployed the service on the Oracle Ampere free tier. No GPU, limited CPU, and tight memory.
So everything had to be efficient.
The ONNX version made it possible to run inference within those limits. It wasn’t perfect, but it was stable, and that mattered more.
Current flow
This is what it looks like now.
A user submits content. It goes to the moderation service. The model checks toxicity, and the system decides whether to allow or block it.
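That flow reduces to a small function with an injectable scorer — the names and the max-over-labels policy are my sketch of it:

```python
from typing import Callable, Dict


def handle_submission(
    text: str,
    scorer: Callable[[str], Dict[str, float]],
    threshold: float = 0.5,
) -> str:
    """Run content through the moderation model and decide its fate.

    The scorer returns per-label scores; the decision keys off the
    worst one.
    """
    scores = scorer(text)
    worst = max(scores.values())
    return "block" if worst >= threshold else "allow"
```

Taking the scorer as a parameter is what made the API-to-local-model swap a one-line change instead of a rewrite.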
The flow is simple. The work behind it is not.
This is just one layer of my moderation system, not the entire thing.
What changed for me
Before this, I treated APIs as building blocks.
Now I think differently. If something is core to the product, I want control over it.
External services are still useful, but not for everything.
What I would improve next
There are still gaps.
I’d move moderation to async queues, batch requests instead of handling them one by one, add proper metrics and monitoring, and explore smaller models to reduce cost.
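Batching, the first item on that list, is mostly a matter of draining a queue into fixed-size chunks so one forward pass scores several submissions. A minimal sketch, with the batch size purely illustrative:

```python
from typing import Iterable, Iterator, List


def batched(items: Iterable[str], batch_size: int = 8) -> Iterator[List[str]]:
    """Group pending texts into fixed-size batches.

    Each yielded batch can go through the model in a single inference
    call instead of one call per item.
    """
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch
```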
This system works. It can be better.
Closing
I started with one API call.
Now I have a moderation service.
I didn’t plan for it. I was forced into it.
That’s how most real systems get built.