Coding AI Using AI: My Journey of Building Code Police During a Hackathon
At Bizom, our fast-growing company, we faced a mounting challenge as the team expanded: maintaining code quality and consistent error handling across our Bitbucket repositories. With more developers contributing code, it became increasingly difficult to ensure consistency and high standards during code reviews.
I explored existing solutions like Coderabbit.ai and WhatTheDiff.ai, but most of these tools were designed primarily for GitHub. Since we rely heavily on Bitbucket for version control and collaboration, they didn’t align with our workflow. That’s when I realized we needed something custom: a solution tailored specifically to our needs.
The perfect opportunity to bring this idea to life came during an internal engineering hackathon called Mobithon, sponsored by Bizom. This event was the ideal platform to experiment with innovative ideas and solve real-world challenges we were facing.
Pitching the Vision: Introducing an AI-Powered PR Review Bot
During the hackathon, I pitched the concept of an AI-powered PR review bot — a tool that would automatically analyze pull requests (PRs) in Bitbucket and provide actionable feedback. Here’s how it would work:
When a developer creates a PR in Bitbucket, the bot is triggered via a webhook. It reviews the code and identifies critical issues, quality concerns, and areas for improvement, then provides clear, actionable recommendations, such as suggested fixes or optimizations.

The idea was well-received, and after a productive discussion with Rohit Agarwal, our Chief AI Officer, who was also leading the hackathon, it was approved for development.
The Journey of Building Code Police: From Idea to AI-Powered Reality
Building Code Police wasn’t something I did alone. I put together a team of five people, including myself, and we worked together to make this idea a reality. Each of us had different skills, and we all played a part in making the project successful.
Step 1: Picking the Right Tools
We started by figuring out which technologies to use. Since most AI tools work best with Python, we decided to use Python for the project. It’s easy to learn, has lots of helpful libraries, and is great for building AI models quickly.
For the backend, we chose FastAPI instead of Flask. FastAPI is faster, supports asynchronous tasks (which is important when dealing with big codebases), and automatically creates documentation for APIs. This made our job easier and let us focus on the main features of Code Police .
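To give a feel for why FastAPI clicked for us, here’s a minimal sketch of an async webhook endpoint; the route path and response shape are illustrative, not our exact implementation.

```python
# Minimal FastAPI sketch: async endpoint plus automatic docs at /docs.
# The route path and response shape are illustrative, not our exact code.
from fastapi import FastAPI, Request

app = FastAPI(title="Code Police")

@app.post("/webhooks/bitbucket")
async def handle_pr_event(request: Request):
    """Receive a Bitbucket webhook and acknowledge it without blocking other requests."""
    payload = await request.json()  # async, so large payloads don't stall the server
    # ...queue the PR for analysis here...
    return {"status": "accepted"}
```

Running it with uvicorn gives you interactive API docs at /docs out of the box, which is exactly the auto-generated documentation we found so handy.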
Step 2: Dividing the Work
Once we picked our tools, we split the project into smaller tasks so everyone could focus on their part:
Webhooks : We set up Bitbucket webhooks to trigger Code Police whenever someone created or updated a pull request (PR); a payload-handling sketch follows this list.
Code Analysis : We wrote the logic to read and analyze code for errors, quality issues, and opportunities for improvement.
AI Training : We collected code examples, labeled them as “good” or “bad,” and trained the AI model to understand what to look for.
Feedback Delivery : We designed a way to send clear and useful feedback to developers, either through API responses or a simple dashboard.

Each task had its own challenges, but breaking the project down made the work feel less overwhelming.
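To make the webhook piece concrete, here’s a hedged sketch of pulling out the fields we needed from a Bitbucket pull-request event; the payload keys are assumptions based on Bitbucket Cloud’s webhook format, so treat them as illustrative.

```python
# Sketch: pull the identifiers Code Police needs out of a Bitbucket
# "pullrequest:created" webhook payload. Field names are assumptions based on
# Bitbucket Cloud's payload shape, not a verified reference.
from typing import Any, Dict

def extract_pr_info(payload: Dict[str, Any]) -> Dict[str, Any]:
    pr = payload["pullrequest"]
    repo = payload["repository"]
    return {
        "pr_id": pr["id"],                            # numeric pull request id
        "title": pr["title"],
        "source_branch": pr["source"]["branch"]["name"],
        "repo_full_name": repo["full_name"],          # "workspace/repo_slug"
    }
```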
Step 3: Discovering Cursor — A Lifeline for Our Team
At the start of the project, we faced a big challenge: most of us weren’t Python experts. While we understood the basics, diving into advanced packages and frameworks felt overwhelming. How could we build an AI-powered tool like Code Police if we were still learning the language ourselves?
That’s when we stumbled upon Cursor , an IDE (Integrated Development Environment) designed specifically to help developers write code faster and smarter. Cursor turned out to be a lifesaver for our team. It wasn’t just an editor — it was like having a coding assistant by our side, guiding us through the process.
How Cursor Helped Us Build Code Police
Simplifying Python Development
Cursor made writing Python code much easier, even for beginners. Its intelligent autocomplete feature suggested functions, libraries, and syntax as we typed, which helped us avoid common mistakes. For example:
When we needed to use FastAPI for the backend, Cursor automatically imported the necessary modules and showed examples of how to structure API endpoints. It also helped us navigate Python’s vast ecosystem of packages, recommending the best ones for tasks like parsing code or training AI models.
Breaking Down Complex Problems
One of the hardest parts of building Code Police was figuring out how to analyze code effectively. Cursor’s composer feature allowed us to break down large chunks of code into smaller, manageable pieces. This made it easier to:
Identify patterns in the code that indicated potential issues (e.g., unused variables or inefficient loops).
Test individual components of the AI model without getting lost in the complexity of the entire system.
Accelerating AI Development
Cursor didn’t just assist with Python — it also streamlined AI-specific tasks. For instance:
It suggested efficient ways to preprocess data for training the AI model.
It provided templates for working with machine learning libraries, saving us hours of trial and error.
Learning on the Job
Beyond speeding up development, Cursor doubled as a teacher. Because most of us were still new to Python, seeing its suggestions and explanations in context helped us pick up the language, its libraries, and common patterns while we built Code Police, instead of pausing the project to study them separately.
Step 4: Choosing the Perfect AI Model for Code Analysis
Once we had the workflow in place — fetching the code diff from Bitbucket via its API — the next challenge was selecting the right AI model to analyze the code and provide meaningful feedback. This step was crucial because the quality of our tool depended heavily on the AI’s ability to understand code and generate actionable suggestions.
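For reference, that diff-fetching step might look like the sketch below, assuming Bitbucket Cloud’s v2 REST API with app-password authentication; the repository name and credentials are placeholders.

```python
# Sketch: fetch the raw unified diff for a pull request from Bitbucket Cloud.
# Endpoint and auth scheme are assumptions based on the v2 API; adapt as needed.
import os
import requests

def fetch_pr_diff(repo_full_name: str, pr_id: int) -> str:
    url = (
        "https://api.bitbucket.org/2.0/repositories/"
        f"{repo_full_name}/pullrequests/{pr_id}/diff"
    )
    resp = requests.get(
        url,
        auth=(os.environ["BITBUCKET_USER"], os.environ["BITBUCKET_APP_PASSWORD"]),
        timeout=30,
    )
    resp.raise_for_status()
    return resp.text  # plain-text diff, ready to include in a prompt

# Example with placeholder identifiers:
# diff_text = fetch_pr_diff("bizom/backend-service", 123)
```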
Exploring Open-Source Models
Initially, we considered using CodeBERT , an open-source model developed by Microsoft specifically for code analysis. CodeBERT is widely regarded as a powerful tool for tasks like code review, bug detection, and optimization. We also explored Hugging Face , a platform that hosts thousands of open-source models, hoping to find something suitable for our needs.
However, we quickly ran into a roadblock: we didn’t have access to GPUs . Training or fine-tuning large models like CodeBERT requires significant computational power, which we simply couldn’t afford at the time. This limitation forced us to rethink our approach.
Evaluating API-Based Models
We then turned our attention to API-based models, which are pre-trained and don’t require extensive computational resources. After some research, we narrowed down our options to two promising candidates:
DeepSeek : Known for its strong performance in code-related tasks, DeepSeek received glowing reviews from developers who had used it for similar projects.
Claude Sonnet 3.5 : A powerful language model developed by Anthropic, Claude was praised for its speed, accuracy, and ability to handle complex prompts.

To make an informed decision, we tested both models extensively.
Challenges with DeepSeek
While DeepSeek initially showed promise, we encountered significant issues during testing:
Slow API Responses : The time it took to receive feedback was inconsistent, often leading to delays in processing PRs.
Server Busy Errors : On multiple occasions, DeepSeek’s API returned “server busy” messages, which disrupted our workflow and made it unreliable for real-time use.

These challenges made us reconsider DeepSeek as a viable option for Code Police.
Why We Chose Claude Sonnet 3.5
In contrast, Claude Sonnet 3.5 stood out for several reasons:
Fast and Reliable API : Claude's responses were consistently quick, ensuring minimal delays in delivering feedback to developers.
High-Quality Results : The suggestions provided by Claude were accurate, relevant, and aligned with our expectations for code review.
Workbench Feature : One of Claude’s standout features was its workbench , which helped us craft better prompts. Writing effective prompts is crucial when working with AI models, and the workbench made this process much easier.
Cost vs. Value : While Claude was slightly more expensive than DeepSeek, its reliability and performance justified the additional cost.
Ultimately, we decided to go with Claude Sonnet 3.5 , confident that it would meet our needs for speed, accuracy, and scalability.
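To show what the integration looks like in code, here’s a hedged sketch of the review call using Anthropic’s Python SDK; the model ID and prompt wording are illustrative, and our real prompts were iterated in the workbench.

```python
# Sketch: ask Claude Sonnet 3.5 to review a unified diff via Anthropic's SDK.
# The system prompt is illustrative; the model ID is one of the Sonnet 3.5 releases.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def review_diff(diff_text: str) -> str:
    message = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1024,
        system=(
            "You are a senior code reviewer. Flag critical issues first, "
            "then quality concerns, then optional improvements. Be concise."
        ),
        messages=[{"role": "user", "content": f"Review this diff:\n\n{diff_text}"}],
    )
    return message.content[0].text
```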
Step 5: Finalizing the Integration
With Claude Sonnet 3.5 selected, we integrated it into Code Police seamlessly. Here’s how the final workflow looked:
Fetch the Diff : Use Bitbucket’s diff API to retrieve the code changes from a PR.
Send to Claude : Pass the diff to Claude’s API for analysis.
Generate Feedback : Receive actionable suggestions from Claude and deliver them to developers via PR comments.
This streamlined process ensured that Code Police could provide timely, accurate, and helpful feedback — transforming the way our team handled code reviews.
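As an illustration of the “Generate Feedback” step, posting the review back to the PR could look like this sketch, assuming Bitbucket Cloud’s v2 comments endpoint; identifiers and credentials are placeholders.

```python
# Sketch: attach the model's feedback to the pull request as a top-level comment.
# The endpoint and body shape follow Bitbucket Cloud's v2 API as we understood it.
import os
import requests

def post_pr_comment(repo_full_name: str, pr_id: int, feedback: str) -> None:
    url = (
        "https://api.bitbucket.org/2.0/repositories/"
        f"{repo_full_name}/pullrequests/{pr_id}/comments"
    )
    resp = requests.post(
        url,
        json={"content": {"raw": feedback}},  # Bitbucket expects the text under content.raw
        auth=(os.environ["BITBUCKET_USER"], os.environ["BITBUCKET_APP_PASSWORD"]),
        timeout=30,
    )
    resp.raise_for_status()
```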
Problems with AI-Generated Code Reviews
While building Code Police , we quickly realized that AI-powered tools, though incredibly powerful, aren’t perfect. In fact, one of our biggest challenges came from the very feature we were trying to implement: AI-generated code reviews . Here’s what happened during testing and how we addressed these issues.
The Over-Powered AI Problem
Initially, we planned for Code Police to leave comments on pull requests (PRs), which developers would then address before merging their code. However, during testing, we discovered an unexpected issue: the AI wouldn’t stop suggesting changes!
For example:
We tested Code Police on a simple PR with just three lines of code. The AI flagged several potential improvements, which seemed reasonable at first.
But as we addressed each suggestion, the AI kept generating new ones — often for things that didn’t really matter or were overly complex.
By the time we reached 48 commits , the AI still wasn’t satisfied. It felt like we were stuck in an endless loop of minor tweaks and over-engineering.
This was frustrating — not just for us but also for developers who would eventually use the tool. No one wants to spend hours addressing endless suggestions, especially when many of them don’t add real value.
The Context Problem
Another challenge we faced was related to the limited context provided by small diffs. Since we were sending only the changed portion of the code (via Bitbucket’s diff API), the AI lacked visibility into the broader codebase. This led to two main issues:
Over-Engineering Suggestions : Without understanding the full context, the AI sometimes suggested overly complex solutions that weren’t necessary for the task at hand. For instance, it might recommend refactoring a simple function into multiple classes, even though the original implementation was perfectly fine.
Misaligned Feedback : The AI occasionally misunderstood the intent behind certain changes, leading to irrelevant or counterproductive suggestions.
These problems highlighted a key limitation of AI: while it excels at analyzing patterns, it struggles with nuance and context — especially when working with fragmented data.
The Solution: Introducing a PR Score
To overcome the challenges posed by the AI’s overzealous suggestions and lack of context, we introduced a new feature: a PR Score . This score, ranging from 1 to 10, provides developers with a quick and intuitive measure of the overall quality of their pull request. Here’s how it works and why it made such a difference.
Why a PR Score?
During testing, we realized that developers didn’t need — or want — an exhaustive list of every possible improvement. Instead, they needed a way to gauge whether their code met the team’s standards and was ready for review. The PR Score addressed this need by:
- Summarizing the AI's analysis into a single, easy-to-understand number
- Reducing cognitive overload by focusing on high-level quality rather than granular details
- Encouraging iterative improvements without overwhelming developers
For example:
- A score of 8/10 might indicate that the PR is mostly good but has a few minor issues to address.
- A score of 4/10 would signal significant problems that need attention before merging.
This approach shifted the focus from perfection to progress, making the tool more practical and user-friendly.
How We Calculated the PR Score
To calculate the PR Score, we developed a weighted scoring system based on several key factors:
- Code Quality : Syntax errors, unused variables, and formatting issues reduced the score significantly. Clean, readable code contributed positively.
- Performance : Inefficient loops or redundant operations lowered the score. Optimized code boosted the score.
- Best Practices : Adherence to coding standards (e.g., PSR for PHP) increased the score. Violations of best practices decreased it.
Each factor was assigned a weight based on its importance, and the final score was normalized to a scale of 1 to 10. For instance:
Critical issues like syntax errors had a higher weight, while minor formatting issues had a lower impact.
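A simplified sketch of that idea is below; the factor names, weights, and example values are illustrative rather than the exact numbers we used.

```python
# Simplified sketch of the weighted PR Score. Weights and factor names are
# illustrative; the real values were tuned as we tested the tool.
WEIGHTS = {"code_quality": 0.5, "performance": 0.3, "best_practices": 0.2}

def pr_score(factor_scores: dict) -> int:
    """Combine per-factor scores (each 0.0-1.0) into a 1-10 PR Score."""
    weighted = sum(WEIGHTS[name] * factor_scores.get(name, 0.0) for name in WEIGHTS)
    return max(1, round(weighted * 10))  # normalise to the 1-10 scale

# Example: strong code quality, decent performance, some convention violations.
# pr_score({"code_quality": 0.9, "performance": 0.7, "best_practices": 0.5}) -> 8
```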
How the PR Score Improved the Workflow
Introducing the PR Score transformed the way developers interacted with Code Police . Here’s how:
- Clear Expectations : Developers now knew exactly what was expected of them — a score above a certain threshold (e.g., 7/10) meant the PR was ready for human review.
- Prioritization : Instead of addressing every suggestion, developers could focus on the most impactful changes to improve their score.
- Reduced Over-Engineering : By setting a target score, we discouraged unnecessary tweaks and kept the process efficient.
- Encouraged Iteration : Developers could submit multiple iterations of their PR, watching their score improve as they addressed key issues.
For example, during testing, one developer started with a score of 3/10 due to several critical issues. After addressing those, their score improved to 7/10 , at which point the PR was approved for review. This gamified approach made the process engaging and motivating.
Lessons Learned
The PR Score taught us an important lesson: simplicity is key . By distilling complex AI-generated feedback into a single metric, we made the tool more accessible and actionable. It also reinforced the idea that AI should complement — not dominate — the development process.
While the AI still provided detailed suggestions for those who wanted them, the PR Score gave developers the flexibility to decide how much feedback they needed. This balance ensured that Code Police added value without creating friction.
The Role of Mentorship
Throughout this process, mentorship played a vital role in shaping Code Police into a tool that truly met our team’s needs. Both Rohit Agarwal, our Chief AI Officer, and Bhupendra, VP of Engineering at Bizom, provided invaluable guidance, helping us refine our approach and elevate the quality of the AI-generated reviews.
Rohit Sir’s Guidance
Rohit Sir’s insights were instrumental in guiding our decision-making at every step. For example:
He encouraged us to focus on real-world usability rather than just theoretical performance, ensuring that Code Police would be practical and actionable for developers.
He advised us to rigorously test each model under different scenarios, which helped us identify reliability issues with some of the APIs we initially considered.
He also shared tips on structuring our integration pipeline for maximum efficiency, ensuring that Code Police could scale as our team grew.

Thanks to his mentorship, we were able to make informed decisions, from choosing Claude Sonnet 3.5 to implementing features like the PR Score and weekly newsletters.
Bhupendra’s Feedback
In addition to Rohit Sir’s guidance, Bhupendra’s feedback was crucial in improving the quality of code reviews generated by Code Police . His contributions included:
- Refining Suggestions : Bhupendra reviewed the AI’s feedback and pointed out areas where the suggestions could be clearer or more specific. For instance, he highlighted cases where the AI flagged issues but didn’t provide enough context or examples to help developers understand the problem.
- Enhancing Relevance : He emphasized the importance of aligning suggestions with Bizom’s coding standards and workflows. This ensured that the feedback wasn’t just technically correct but also practical and applicable to our specific use cases.
- Testing Edge Cases : Bhupendra pushed us to test Code Police on edge cases, such as poorly written legacy code or highly complex functions, to ensure it could handle real-world challenges effectively.

His attention to detail and focus on quality helped us fine-tune Code Police, making it a more reliable and valuable tool for the entire team.
The Power of Collaboration
The combination of Rohit Sir’s strategic vision and Bhupendra’s hands-on feedback created a powerful synergy that drove the success of Code Police . Their mentorship reminded us that building great tools isn’t just about technology — it’s about understanding the people who will use them and continuously iterating based on their needs.
Thanks to their support, we were able to build a tool that not only addressed technical challenges but also fostered a culture of learning, growth, and collaboration within our team.
Hackathon Day: Customizing Code Police for Bizom
Under Rohit Sir’s and Bhupendra’s guidance, we spent the pre-hackathon phase laying the groundwork for Code Police . But on the day of the hackathon, we took things to the next level by customizing the tool to meet Bizom’s unique requirements. Here’s what we worked on:
Personalized Feedback and Weekly Newsletters
One of our key goals was to make Code Police more than just a code review tool; it needed to be a mentor for developers. To achieve this, we introduced two features:
Saving Suggestions Based on Users : Every suggestion generated by Code Police was saved and tagged to individual developers. This allowed us to track their progress over time and identify recurring areas for improvement.
Weekly Newsletters : At the end of each week, developers received a personalized newsletter summarizing their PR performance. The newsletter included:
Areas for Improvement : Suggestions from the week’s PRs, along with helpful links to resources (e.g., tutorials, documentation) to address those gaps.
Appreciation for Strengths : Positive reinforcement for areas where they excelled, such as writing clean code or optimizing performance.
For example, a developer might receive feedback like:
“You’ve been doing a great job following PHP’s object-oriented programming (OOP) principles, especially with your use of classes and inheritance! However, we noticed some inefficiencies in how you’re handling database queries this week. Using prepared statements can improve security and performance. Here’s a link to learn more about prepared statements in PHP .”
This personalized approach not only helped developers grow but also made them feel valued and supported.
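To sketch how the suggestion tracking and weekly summaries fit together, here’s a simplified, in-memory version; the data model and wording are hypothetical, and the real tool persisted this data.

```python
# Sketch: tag every suggestion to a developer and roll them up into a weekly note.
# The Suggestion fields and wording are hypothetical, for illustration only.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Suggestion:
    developer: str
    message: str
    positive: bool = False  # True when the note is appreciation rather than a fix

class SuggestionStore:
    def __init__(self) -> None:
        self.by_developer = defaultdict(list)

    def add(self, suggestion: Suggestion) -> None:
        self.by_developer[suggestion.developer].append(suggestion)

    def weekly_newsletter(self, developer: str) -> str:
        notes = self.by_developer.get(developer, [])
        wins = [n.message for n in notes if n.positive] or ["(none recorded)"]
        fixes = [n.message for n in notes if not n.positive] or ["(none recorded)"]
        lines = [f"Hi {developer}, here is your Code Police summary for the week:"]
        lines += ["Things you did well:"] + [f"- {item}" for item in wins]
        lines += ["Areas to improve:"] + [f"- {item}" for item in fixes]
        return "\n".join(lines)
```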
Building a Developer Dashboard
To provide visibility into team performance and individual contributions, we created a dashboard that displayed key metrics:
Top Performers : Developers with the highest average PR scores were highlighted, fostering healthy competition and recognition.
Total Quality Score : A cumulative score reflecting the overall quality of the team’s codebase.
Critical Issues : A breakdown of high-priority issues flagged across all PRs.
Average Developer Score : Each developer’s average PR score, helping managers identify who might need additional support or training.
Developer-Wise Suggestions : A document summarizing all suggestions for each developer, making it easy for them to track their progress.
The dashboard became a central hub for monitoring code quality and celebrating achievements. It also provided actionable insights for team leads, enabling them to make data-driven decisions.
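For illustration, the dashboard’s headline numbers could be computed along these lines, assuming each reviewed PR is stored as a (developer, score, critical_issue_count) record; the record shape and names are assumptions.

```python
# Sketch: headline dashboard numbers from stored PR results.
# The record shape (developer, score, critical_issue_count) is an assumption.
from statistics import mean

def dashboard_metrics(records: list) -> dict:
    per_dev = {}
    critical_total = 0
    for developer, score, critical in records:
        per_dev.setdefault(developer, []).append(score)
        critical_total += critical
    averages = {dev: round(mean(scores), 2) for dev, scores in per_dev.items()}
    return {
        "average_developer_score": averages,
        "top_performers": sorted(averages, key=averages.get, reverse=True)[:3],
        "total_quality_score": round(mean(s for _, s, _ in records), 2) if records else 0,
        "critical_issues": critical_total,
    }
```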
Customizing for PHP and Java
While Code Police initially focused on general, language-agnostic code analysis, we knew that Bizom’s codebase included languages like PHP and Java . During the hackathon, we worked on extending the tool’s capabilities to support these languages. This involved:
- Supplying Claude Sonnet 3.5 with PHP- and Java-specific examples, guidelines, and conventions through our prompts.
- Refining those prompts so the model recognized language-specific patterns, best practices, and common pitfalls.
- Testing the tool rigorously to ensure it provided accurate and relevant feedback for each language.
By expanding Code Police to cover multiple languages, we ensured that it could serve the entire development team, regardless of the programming language they used.
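As an example of that customization, a language-aware prompt template might look like the sketch below; the guideline text is illustrative, not Bizom’s full rule set.

```python
# Sketch: inject language-specific expectations into the review prompt.
# The guideline strings are examples only, not the team's actual standards.
LANGUAGE_GUIDELINES = {
    "php": (
        "Follow PSR-12 formatting, prefer prepared statements for database "
        "queries, and respect the existing class structure."
    ),
    "java": (
        "Follow standard Java naming conventions, avoid catching generic "
        "Exception, and keep methods small and focused."
    ),
}

def build_review_prompt(language: str, code: str) -> str:
    guidance = LANGUAGE_GUIDELINES.get(language.lower(), "Apply general best practices.")
    return (
        f"You are reviewing {language} code.\n"
        f"Team guidelines: {guidance}\n\n"
        f"Code to review:\n{code}"
    )
```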
Sending Entire Files for Context-Aware Suggestions
One of the most impactful changes we made during the hackathon was shifting from sending code diffs to sending entire files for analysis. Initially, we relied on Bitbucket’s diff API to send only the changed portions of the code. While this approach worked, it often led to issues:
Limited Context : The AI lacked visibility into the broader structure of the code, resulting in over-engineering or irrelevant suggestions.
Misaligned Feedback : Without understanding the full context, the AI sometimes misunderstood the intent behind certain changes.

To address these challenges, we experimented with sending entire files to Claude Sonnet 3.5. This small but significant change had a drastic impact:
Improved Accuracy : With access to the complete file, the AI could better understand the relationships between different parts of the code, leading to more precise and actionable suggestions.
Reduced Over-Engineering : The AI no longer suggested unnecessary refactors or overly complex solutions, as it could see the bigger picture.
Enhanced Context Awareness : The AI’s recommendations became more aligned with the actual needs of the codebase, improving developer trust in the tool.

For example, when analyzing a function within a class, the AI could now consider how that function interacted with other methods and attributes in the same file. This resulted in suggestions that were not only technically correct but also practical and relevant.
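A hedged sketch of the full-file approach is below, assuming Bitbucket Cloud’s src endpoint for raw file content; the commit hash and path would come from the PR’s metadata, and the identifiers are placeholders.

```python
# Sketch: fetch a whole file at a given commit so the model sees full context.
# Endpoint and auth follow Bitbucket Cloud's v2 API as we understood it.
import os
import requests

def fetch_file_at_commit(repo_full_name: str, commit: str, path: str) -> str:
    url = (
        "https://api.bitbucket.org/2.0/repositories/"
        f"{repo_full_name}/src/{commit}/{path}"
    )
    resp = requests.get(
        url,
        auth=(os.environ["BITBUCKET_USER"], os.environ["BITBUCKET_APP_PASSWORD"]),
        timeout=30,
    )
    resp.raise_for_status()
    return resp.text  # the complete file, included in the prompt alongside the diff
```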
Lessons Learned
This experience reinforced an important lesson: context matters . By providing the AI with more information, we enabled it to deliver higher-quality feedback. It also reminded us that even small adjustments — like switching from diffs to full files — can have a profound impact on the effectiveness of a tool.
The improvements we saw during the hackathon validated our decision to prioritize context-awareness. Developers reported that the suggestions were now more helpful and less intrusive, making Code Police a truly valuable addition to their workflow.
Conclusion: A Journey of Growth and Discovery
Even though we didn’t win the hackathon, the experience was far more rewarding than any prize could have been. Building Code Police taught us invaluable lessons about AI, problem-solving, and collaboration. Under the guidance of mentors like Rohit Sir and Bhupendra, we gained hands-on experience with cutting-edge tools and techniques like LLMs (Large Language Models), prompt engineering, and AI-powered development.
More importantly, the hackathon fundamentally changed the way we approach problems. Before, we might have seen challenges as roadblocks. Now, we view them as opportunities to innovate and grow. Whether it’s leveraging AI to build smarter tools or finding creative solutions to technical limitations, our problem-solving mindset has significantly evolved.
The journey of building Code Police reminded us that success isn’t just about the destination — it’s about the skills you gain, the people you collaborate with, and the perspectives you develop along the way. While our tool is still a work in progress, the lessons we’ve learned will stay with us for the rest of our careers.
As we continue refining Code Police and exploring new ways to use AI in software development, one thing is clear: this hackathon was just the beginning. It showed us the immense potential of AI to transform not only how we code but also how we think, learn, and solve problems.