Taste: The Unquantifiable Fitness Metric
What 75 Hours of AI Music Creation Revealed About Quality Assessment
I have a confession: over the Fourth of July long weekend, I spent 75 hours making music with AI in full OCD mode: staying up until 2-3 AM each night, waking up at 7-8 AM to immediately continue where I left off. By the end of 4.5 days, I had created 18 songs and developed what I can only describe as a temporary addiction to optimizing AI-generated music. During this binge, my process for making each song evolved from an initial careless 5-second prompt to a full 10-step process lasting up to 8 hours.
It started innocently enough. On Wednesday July 2, I noticed Suno had released version 4.5 about a month earlier, so I decided to try it out. I had played with Suno 3.5 back in April 2024 and found it interesting but not compelling enough to use seriously. Version 4.0 earlier this year hadn't felt like much of an improvement. But the first few songs 4.5 generated were qualitatively different: good enough that I thought, "Maybe I could actually make something I'd want to listen to repeatedly." So I started making songs. Every time I finished a song, I would think about how to make the next one better, and I started iteratively improving my own song-making process.
I didn't set out to develop a systematic approach; I was just trying to make songs I actually wanted to listen to myself. But looking back now, I can see that my music creation process unconsciously evolved through several distinct phases, each one emerging when I hit a wall with the then-current approach. The whole 75-hour marathon was a real-time case study in learning to work efficiently with a new AI tool. By the end of the long weekend, when I needed to extricate myself from my obsession to get back to real work, I had two songs I was genuinely proud of. Comparing the final songs with the original drafts created by Suno 4.5, I personally think there's a dramatic improvement, even though I can't mathematically describe the "goodness" metric I was optimizing for.
If I had to verbally describe my evaluation criteria, I would say I was asking myself (1) would I want to listen to this song over and over again in the next 3 months, (2) would I be able to remember this song and hum it in 6 months, and (3) would I still like this song in 2-5 years? The objective/fitness function I have is probably 80% correlated with other people's, because people's preferences for musical style and lyrical meaning vary, but there are some universal/common expectations about melody/harmony consistency. We broadly call this the "taste problem" of AI benchmarks: the challenge of assessing quality in domains where subjective judgment is difficult to mathematically formalize as a fitness or reward function. It's a problem that extends far beyond music and may represent one of the most important cognitive frontiers where some humans still maintain an edge over AI systems. But first, let's go through the evolution of my music creation process, as that may shed light on the moving goalposts of "taste" in music.
My Different Phases of Music Creation
Phase 1 was pure exploration: I would enter a different title and style prompt, and press the button labeled "Generate Lyrics". The total time to create a song was about 5 seconds. I would then spend another 3 minutes listening to the song, unless I hated it and stopped 20 seconds in. At this phase, I was really just trying to get a broad overview of the types of music that 4.5 could competently create. After the initial marvel at 4.5's improved capabilities wore off after an hour or so, I started thinking about how to improve the quality of the music. Basically, I decided that Suno 4.5's output was good enough to act as starting material for a song, but, for the vast majority of outputs, not good enough to be a "final product" song. I then spent an hour or two on controlled experiments comparing the same lyrics using different style prompts and settled on a few favorite style prompts.
Phase 2 was selection from multiple outputs: For each title and style prompt, I would generate 10 songs, and pick the best after listening to the top 2 or 3 songs fully. By definition, the songs coming out of my Phase 2 process were in at least the top 10% of songs produced by Suno 4.5, at least based on my own "taste".
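For readers who want the Phase 2 workflow spelled out, it boils down to best-of-N selection. The sketch below is purely illustrative: generation and listening were manual clicks and ears in Suno's web UI, so generate_candidates and my_taste_score are hypothetical placeholders, and the random scores stand in for a judgment that only exists in my head.

```python
import random

def generate_candidates(title, style_prompt, n=10):
    # Placeholder: in reality this was manual clicking in Suno's web UI,
    # not an API call. Returns n labelled "takes" of the same prompt.
    return [f"{title} ({style_prompt}, take {i + 1})" for i in range(n)]

def my_taste_score(song):
    # Placeholder for the subjective listening judgment. It is random here,
    # which is exactly the point: the real scoring function lives in my head.
    return random.random()

def best_of_n(title, style_prompt, n=10, listen_fully=3):
    candidates = generate_candidates(title, style_prompt, n)
    # Skim everything, then "listen fully" only to the top few before choosing.
    ranked = sorted(candidates, key=my_taste_score, reverse=True)
    shortlist = ranked[:listen_fully]
    return shortlist[0]

print(best_of_n("Midnight Train", "indie folk, female vocals", n=10))
```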
Phase 3 saw the use of a separate AI (Claude) for lyrics: Suno's built-in lyrics generator produced relatively simple lyrics and tended to produce songs that were seldom longer than 2 minutes. So I decided to give my current favorite large language model, Claude (4.0 Sonnet), the job of generating lyrics. Based on past experience collaborating with Claude, I needed to provide some instructions in addition to the music style and title; otherwise we would end up with language over-represented in AI writing. In addition to giving Claude general instructions to avoid clichéd language, I also gave Claude instructions to include between 4 and 8 specific phrases in the lyrics. By this point, I was getting a little impatient about screening through 10 songs each time, so I reduced the number of songs generated to 6. Maybe in terms of strict melody and harmony, the songs produced were not significantly better than in Phase 2, but at least for me, having good lyrics is an important property of a song. In this phase, my own time spent per song had crept up to about 20 minutes (negotiating lyrics with Claude and then picking among 6 Suno work products).
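To make the lyric brief concrete, here is a minimal sketch of the kind of prompt I mean. The wording and the build_lyrics_prompt helper are my own illustration rather than my exact prompt, and no actual Claude or Suno API call is shown.

```python
def build_lyrics_prompt(title, style, required_phrases):
    # The wording below is illustrative, not my exact prompt.
    if not 4 <= len(required_phrases) <= 8:
        raise ValueError("I typically specify between 4 and 8 phrases")
    phrase_list = "\n".join(f"- {p}" for p in required_phrases)
    return (
        f"Write lyrics for a song titled '{title}' in the style of {style}.\n"
        "Avoid cliched phrasing and language over-represented in AI writing.\n"
        "Work each of the following phrases into the lyrics naturally:\n"
        f"{phrase_list}"
    )

print(build_lyrics_prompt(
    "Paper Lanterns",
    "acoustic indie pop",
    ["paper lanterns", "a borrowed coat", "the last ferry home", "salt on the window"],
))
```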
Phase 4 was when I began using Suno's Song Editor and Remastering: The Song Editor allowed me to change different parts of the song in a random-access fashion, and it dramatically improved the overall quality of the end products. Additionally, it completely upended my workflow. Unlike in Phase 3, where I judged Suno's many song outputs based on overall "goodness", in Phase 4 I would pick an initial Suno output based only on the quality of the chorus and harmony. I could easily swap out the Verses and Bridges as many times as I needed, but swapping out the Chorus was annoying because the Chorus repeats imperfectly, so getting consistency across Choruses was way more work than I was willing to invest. Suddenly I was spending more than an hour to create each song, but the quality also jumped dramatically. In retrospect, I'm also realizing that editing capabilities matter more than base generation quality when you want high-quality end products.
As one might expect, my Frankensongs from Phase 4 had multiple sections that were good individually, but the interfaces between them were terribly jarring from inconsistent beats and keys. Suno conveniently had a Remaster functionality that was supposed to fix exactly this problem. Unfortunately, the Remastered versions would swap out voices, change the volumes of instrumental tracks/stems, and even change the tune in not-so-subtle ways. The degree of variation in the Remaster outputs makes me quite sure this was "intentional variation" to give the user more options to select from. But in contrast to the Song Editor, Remaster has no tuning knobs for even coarse control, which meant that I typically had to manually sift through 10 to 50 Remaster versions to find the single one that both (1) fixed the artifacts and (2) preserved my original song. Adding both the Song Editor section edits (which were fun!) and Remaster screening (which was agonizing!) pushed my typical time to make a song up to maybe 2 hours.
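Abstracted away from any real Suno feature, the Remaster screening step is just a filter over a batch of candidates. In the sketch below, fixes_artifacts and preserves_original are placeholders for ears-only judgments, and the data structures are invented; nothing here corresponds to an actual API.

```python
def fixes_artifacts(remaster):
    # Placeholder for the subjective check that section seams sound smooth.
    return remaster.get("seams_smooth", False)

def preserves_original(remaster):
    # Placeholder for the subjective check that voice, stems, and tune
    # still match the pre-Remaster version.
    return remaster.get("sounds_like_original", False)

def screen_remasters(remasters):
    # Return the first candidate passing both checks, else None
    # (meaning: generate another batch and keep listening).
    for candidate in remasters:
        if fixes_artifacts(candidate) and preserves_original(candidate):
            return candidate
    return None

# A batch of 10 hypothetical Remaster outputs; in the real workflow each entry
# would be ears-on listening notes, not data.
batch = [
    {"seams_smooth": i % 4 == 0, "sounds_like_original": i % 5 == 0}
    for i in range(10)
]
print(screen_remasters(batch))
```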
Phase 5 involved a second round of granular edits: Although the products of Phase 4 song-making were already quite good overall (and much better than the original Suno 4.5 outputs), they would almost always have one or two flaws that immediately signaled "amateur" regardless of how good everything else sounded. It could be a mispronounced word with multiple pronunciations ("lives" as a verb vs. "lives" as a noun), it could be an incorrect note, or it could be a weird stutter from imperfect healing/melding of two different sections. Although these flaws are typically less than 1 second in duration, Suno has a minimum Section edit length of 3 seconds, so these microedits unfortunately reintroduce the inconsistency problem that requires Remastering. But overall, adding this step didn't dramatically increase my time investment per song, and it makes the AI-generated nature of the end product much less detectable to OCD folks like me. (As an aside, I'm not formally diagnosed with obsessive-compulsive disorder, and I seem to have a bunch of attention deficit hyperactivity disorder tendencies as well, so I'm using the term "OCD" loosely here.)
Phase 6 added backtracking: The changes I made in Phases 4 and 5 were essentially attempts to locally improve the song with each step, but I was realizing that sometimes I got to local plateaus of "taste" that were difficult to improve further. In technical terms, what I was doing was essentially a greedy local search, akin to stochastic gradient ascent. Basically, imagine the universe of all possible songs as a vast wooded mountain range, and we are trying to get to the tallest mountain. Even if we find the path to the tallest mountain on the first try, there are innumerable ups and downs along the path. Occasionally, we may need to jump or swim across a river, similar to how we Replace a verse section. When the trail ends, we may need to wander a bit in the brush until we find a new trail, similar to how we Remaster the entire song. But sometimes we end up at a local hilltop that is the highest point within a square mile, yet far from the highest mountain. And when we get there, we need to make the emotionally difficult decision to significantly rewind and abandon time-consuming edits. This could sometimes mean throwing away 3-4 hours of detailed work. It was painful, but necessary to escape what optimization theorists call "local maxima": situations where continuing with your current approach prevents you from reaching better solutions. And to be honest, this is where I find myself adding the most value right now when working with AI systems, on any task.
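For readers who like to see the analogy in code, here is a toy sketch of greedy local search with checkpoints and backtracking. The taste function is a made-up bumpy curve, not my real judgment, and every parameter is invented; the only point is to show how small accepted edits get stuck on a local hilltop and why rewinding to an earlier draft helps.

```python
import math
import random

def taste(x):
    # Toy stand-in for subjective "goodness": a bumpy curve with several
    # local maxima on [0, 4], so greedy edits can get stuck on a hilltop.
    return math.sin(5 * x) + 0.5 * x

def clamp(x, lo=0.0, hi=4.0):
    return max(lo, min(hi, x))

def optimize(start=0.0, steps=2000, patience=100):
    current, best = start, start
    checkpoints = [start]      # earlier drafts we could rewind to
    stale = 0
    for _ in range(steps):
        candidate = clamp(current + random.gauss(0, 0.05))  # a small local edit
        if taste(candidate) > taste(current):
            current, stale = candidate, 0
            checkpoints.append(current)                     # keep this draft
        else:
            stale += 1
        if stale > patience:
            # Backtrack: abandon recent edits, rewind to an older checkpoint,
            # and take a big jump to escape the local hilltop.
            current = random.choice(checkpoints[: max(1, len(checkpoints) // 2)])
            current = clamp(current + random.uniform(-1.0, 1.0))
            stale = 0
        if taste(current) > taste(best):
            best = current
    return best, taste(best)

random.seed(0)
print(optimize())
```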
The whole progression to ever more complex AI music creation processes took five days. In parallel with creating and optimizing those processes, I was also refining my "music taste": figuring out how to balance the multiple dimensions of musical quality into an overall "goodness" score that decided whether I continued with a particular edit or not. Initially I was just listening for a pleasant overall experience; later I was obsessing over specific notes or pronunciations. With each phase of the music creation process, my music evaluation metric also evolved.
When AI Tools Cross Capability Thresholds
My decision to invest 75 hours wasn't random: I'd tested Suno 3.5 in April 2024 and 4.0 earlier this year, but neither crossed my personal threshold for serious time investment. Suno 4.5 was qualitatively different—the initial results were good enough, above a critical capability threshold, that I thought I might actually be able to create something I'd want to keep listening to for months or years. This threshold detection ability seems strategically important for anyone working with AI tools professionally. Most people either abandon promising technologies too early or waste substantial time on tools that aren't quite ready for their specific applications. Learning to recognize genuine capability inflection points versus mere incremental improvements could provide significant competitive advantages as AI tools proliferate across professional domains.
The breakthrough with Suno 4.5 wasn't just improved base generation quality (which grabbed my attention) but, more importantly, the enhanced Song Editor functionality. Most casual users probably focus on initial output quality and miss that granular editing control often determines whether AI output can achieve professional standards. This insight likely applies broadly across AI creative tools: the most powerful AI applications require not only initial results above a performance threshold to keep results-oriented users interested, but also powerful tools for iterative refinement toward specific quality targets.
Looking ahead, based on current limitations I've identified (pronunciation errors with certain words, inability to combine male and female vocals in certain genres, occasional rhythm inconsistencies during section transitions), I predict Suno 5.0 could reduce my optimization time from 8 hours to 2-3 hours per song to achieve the same overall quality. However, I expect to need another 75-hour time investment when 5.0 arrives, because tool improvements fundamentally shift where bottlenecks occur rather than simply accelerating current processes. Additionally, although I will then be able to make songs that meet my current bar in 2-3 hours, I will likely still want to spend 8 hours per song, because hedonic adaptation will raise my "taste" and my bar for what I consider good enough.
Each major update essentially resets the optimization landscape and the bar for an acceptable final product, requiring users to redevelop their approaches from first principles. The pronunciation fixes in 5.0 might eliminate my current granular editing phases entirely, but then I'll probably discover new quality dimensions that become the primary limiting factors.
The Unquantifiable Fitness Problem
Here's what's intellectually humbling: after 75 hours of optimization guided by thousands of quality assessment decisions, I cannot provide a mathematical description of my objective function. This isn't a personal intellectual limitation—it reflects the fundamental challenge of formalizing subjective judgment that operates across multiple timeframes, contexts, and evolving standards.
Consider the complexity involved in my multi-dimensional evaluation framework: immediate aesthetic appeal (does this sound good right now?), medium-term memorability (will I remember this melody weeks later?), long-term emotional resonance (will this satisfy me years from now?), and comparative assessment (how does this stack up against professional music I already love?). Each temporal dimension involves different cognitive processes and draws on different types of musical experience and cultural knowledge. The weighting between these factors shifts dynamically depending on mood, listening context, musical style, and even time of day when I'm evaluating.
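To make the structure of this framework concrete, here is a deliberately naive sketch of what a formalization attempt might look like. The dimension names come from the paragraph above; the weights, scores, and context labels are invented for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Judgment:
    # One song's scores on the dimensions above, each 0 to 1. The numbers
    # plugged in below are invented for illustration, not measurements.
    immediate_appeal: float      # does this sound good right now?
    memorability: float          # will I remember the melody weeks later?
    long_term_resonance: float   # will this satisfy me years from now?
    vs_professional: float       # how does it compare to music I already love?

def overall_goodness(j: Judgment, context: str = "focused") -> float:
    # The weights themselves shift with mood, context, style, even time of day,
    # which is a big part of why the real metric resists formalization.
    weights = {
        "focused":    (0.2, 0.3, 0.4, 0.1),
        "background": (0.5, 0.2, 0.2, 0.1),
    }[context]
    dims = (j.immediate_appeal, j.memorability,
            j.long_term_resonance, j.vs_professional)
    return sum(w * d for w, d in zip(weights, dims))

print(overall_goodness(Judgment(0.8, 0.6, 0.5, 0.4), context="focused"))
print(overall_goodness(Judgment(0.8, 0.6, 0.5, 0.4), context="background"))
```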
The evaluation framework changes so rapidly and depends on so many subjective variables that mathematical formalization would be practically impossible even if it were theoretically feasible. This connects directly to what AI researchers like Ilya Sutskever call the "taste" factor in scientific research—that ineffable judgment about what constitutes good work that somehow transcends quantifiable metrics while remaining surprisingly consistent across experienced practitioners.
Music provides an almost ideal domain for studying this problem because it involves relatively pure aesthetic judgment without the complications of practical utility or direct economic outcomes that characterize most business applications. When I evaluate whether a song meets my personal quality standards, I'm making judgments about harmonic relationships, rhythmic consistency, emotional expression, and artistic coherence that resist reduction to measurable parameters. Yet these judgments prove consistent enough to guide 75 hours of detailed optimization and produce results I'm genuinely comfortable sharing with friends.
The challenge of mathematically defining "taste" as an optimization metric extends far beyond music creation. Quality assessment in scientific research, business strategy, creative writing, product design, and interpersonal relationships all involve similar difficulties in formalizing subjective judgment that operates across multiple dimensions and timeframes. These domains may represent cognitive frontiers where human evaluation at least currently remains essential despite advances in AI technical capabilities.
Historical Patterns: Technology Amplifies Rather Than Replaces Judgment
The challenge of preserving human quality assessment in creative domains has instructive historical precedents that suggest this problem may be fundamental rather than a temporary technical limitation. Consider photography's evolution over the past century, which provides a useful analogy for understanding how powerful new creative tools tend to reshape rather than eliminate the need for aesthetic judgment.
When digital photography emerged in the 1990s, many industry observers predicted it would democratize image creation by eliminating technical barriers like film processing chemistry, darkroom timing, and expensive equipment. While digital tools did make certain technical aspects more accessible to amateur photographers, they simultaneously created entirely new layers of complexity around image editing software, color management systems, and post-processing workflows that could take years to master.
Professional photographers didn't become obsolete during this transition. Instead, their expertise shifted from mastering chemical processes and darkroom techniques to understanding sophisticated digital imaging workflows, while maintaining exactly the same core skills of visual composition, lighting understanding, and aesthetic judgment that had always distinguished professional from amateur work. The fundamental human contribution—recognizing compelling compositions, understanding how light interacts with subjects, making aesthetic judgments about what makes an image worth viewing—remained as important as ever.
Ansel Adams, who lived long enough to see the early development of digital photography, captured this continuity perfectly when he observed: "You don't take a photograph, you make it." The underlying technology had evolved dramatically from large-format film cameras to digital sensors, but the essential creative process of making aesthetic decisions remained fundamentally unchanged.
The music optimization experience suggests we're witnessing a similar technological transition with AI creative tools. Rather than replacing human creativity entirely, these systems appear most powerful when they amplify distinctly human capabilities like aesthetic judgment, quality assessment, and creative direction. Technical execution becomes increasingly automated and sophisticated, but the evaluation of whether output achieves intended aesthetic, emotional, or strategic effects remains fundamentally dependent on human judgment.
Collaborative Writing with AI: Another Protracted Process Evolution
As I was reflecting on the music optimization process for this article, I realized I've probably gone through a very similar unconscious evolution in learning to collaborate effectively with Claude on blog writing. Even though Claude doesn't retain long-term memory across conversations, the evidence of this co-evolution is right here in our current collaborative article. While Claude can do a reasonable job writing a 4000-word article about a topic with minimal prompting, these initial work products are obviously "AI generated" and are not interesting enough to capture my own attention.
Through months of experimentation, I discovered that the best way to work on writing an article together is NOT to tell Claude that we're writing an article together. Rather, I start by just having a deep chat with Claude about a topic of interest, and the conversation sometimes takes turns I didn't initially expect. Once we have enough content on a topic that it sometimes strains against Claude's context window limitations, I tell Claude that I'd like to write a blog post together, providing the detailed instructions below:
"Could we start with a Section-based outline, and then expand each section to a few paragraph summaries, and then expand each paragraph summary to a few sentences? This is the method that I taught my PhD students to write academic papers. Please make the section-based outline, paragraph summaries, and initial draft into three different artifacts, for easier change tracking as we go through revisions. Please target 3000 to 4000 words for the blog post."
Afterwards, just like with Suno and music, I will go through line by line and make some edits in phrasing to make it just the way I like. Also just like with music creation, I can't provide Claude with a comprehensive specification of my collaborative writing quality standards upfront. These preferences and evaluation criteria emerged organically through hitting quality plateaus and receiving iterative feedback about what worked and what needed improvement. The fact that I've discovered "Claude produces better results when starting from scratch rather than trying to iteratively modify existing content" suggests I've been unconsciously optimizing our collaborative process using exactly the same meta-learning approaches I applied to music creation.
Quick and Dirty vs. Obsessive Optimization
I spent a huge amount of time developing AI collaboration processes for music and for blog post writing, but I don't do this for every task. For example, I frequently use AI for legal contract review, but there I typically just pass documents to both Claude and ChatGPT for a quick "red flags" analysis, with the redundancy just to make sure nothing is overlooked. The stakes are much lower, reuse potential is minimal, and the quality threshold is "competent and legally sound" rather than "exceptional and persuasive." The time investment required for intensive optimization simply wouldn't be justified by the potential outcomes or long-term value creation.
The intensity and sophistication of my process optimization approach depends critically on three key factors: output stakes, reuse potential, and personal quality thresholds. For music creation, 8 hours of optimization time to create content I'll personally consume for 200+ hours over several years represents an exceptional return on investment. However, the calculation also incorporates less quantifiable factors like creative satisfaction, skill development in human-AI collaboration, and the intrinsic psychological value of producing something that meets high personal quality standards.
The common thread across all professional domains where I apply intensive optimization is that quality assessment continues to involve subjective human judgment that resists automation, even as the underlying technical execution becomes more computationally efficient and sophisticated. AI tools can increasingly handle complex data analysis, initial content drafting, formatting standardization, and other technical implementation components, but evaluating whether final output achieves intended strategic, aesthetic, or persuasive effects remains fundamentally dependent on human assessment capabilities.
From a business strategy perspective, understanding when to invest in intensive optimization versus accepting "good enough" output represents an increasingly crucial competitive capability. Organizations that develop systematic approaches for identifying high-impact optimization opportunities while avoiding over-investment in low-stakes applications may achieve significant advantages in both resource allocation efficiency and final output quality that directly affects business outcomes.
For high-stakes, high-reuse applications, I readily invest substantial optimization time that would seem excessive for casual use. Investor pitch decks represent a prime example: I regularly spend 50+ hours refining a single presentation because the output directly affects funding outcomes and company valuation at Biostate AI. The stakes are enormous—potentially millions of dollars in financing at different valuations—and the reuse potential is significant, as successful pitch decks get presented to dozens of investors over extended fundraising periods. The quality threshold is correspondingly high, since experienced investors see hundreds of presentations annually and can immediately identify amateur efforts that suggest poor attention to detail or strategic thinking.
Currently, AI tools for presentation design and visual communication aren't sufficiently capable to justify deep optimization investment for high-stakes applications. Existing systems struggle with sophisticated layout decisions, visual hierarchy principles, and the kind of aesthetic polish that distinguishes professional presentations from amateur efforts. However, I anticipate that when multimodal AI capabilities cross my personal quality threshold—likely within the next 1-2 years based on current development trajectories—I'll apply similarly intensive optimization approaches because the stakes absolutely justify the effort investment.
The Human Edge in Creative and Strategic Domains
As AI systems achieve increasingly sophisticated technical capabilities across creative and analytical domains, the current comparative advantage of humans over machines appears to lie in the meta-cognitive capabilities demonstrated throughout my music optimization experience: quality assessment, process innovation, and strategic judgment. In other words, we should focus on (1) recognizing when emerging tools are ready for serious professional investment, (2) developing systematic strategies for optimizing the use of new AI capabilities, and (3) maintaining nuanced quality standards ("taste") that resist mathematical formalization but guide effective human-AI collaborative workflows.
Cultivating sophisticated quality assessment capabilities becomes increasingly valuable as AI tools make basic technical execution more accessible to non-experts. For anyone working in creative or strategic professional roles: systematically develop your quality assessment capabilities, a.k.a. your "taste". Recognizing quality plateaus, innovating optimization approaches, and maintaining subjective judgment standards that guide rather than constrain creative exploration may represent the most distinctive and valuable human contributions in today's increasingly AI-enabled professional world.
By David Zhang and Claude 4.0 Sonnet
July 15, 2025
© 2025 David Yu Zhang. This article is licensed under Creative Commons CC-BY 4.0. Feel free to share and adapt with attribution.