The Anatomy of Clean Analytics Data
Cavaliero, D.
Naming and structuring events and data is harder than it may seem. The open-ended and flexible nature of tools like Segment, RudderStack, and even Google Analytics 4 (GA4) leaves convention and syntax up to the end user, which leaves a lot of room for human error.
A quick preface
This crap is hard. It's even harder when you are a single contributor or a small team trying to convince your own organization, or an external one (a client), to change its ways. I hope this article provides some sound reasoning and arguments for the data engineers out there trying to fight the good fight for clean data.
Event naming conventions
Having poor rigor around event naming doesn't hurt at first. It hurts six or more months later, when you're trying to build a report and realize button_click, Button Click, and btnClicked are all floating around in your behavioral event data.
The flexible nature of modern analytics tools is a double-edged sword — you can name things however you want, which means people will name things however they want.
A simple framework
I am particularly fond of Segment's "Object Action" framework. The idea here is that an event name starts with a string that identifies the object (noun) being interacted with, followed by a string for the action (verb) that describes the type of interaction with the object.
For example, if we wanted to push an event into GA4 when a visitor submits a form, a good event name would be form_submit.
The primary benefit of this convention is that, implemented correctly, your event names are semantic and easy to understand.
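To make the object-then-action order concrete, here is a minimal sketch of a helper that assembles event names. The function name buildEventName is hypothetical (it is not part of Segment's or GA4's API), and lowercasing plus replacing spaces with underscores is my own assumption about how you might normalize the inputs.

```javascript
// Hypothetical helper, not part of any analytics SDK.
// Joins an object (noun) and an action (verb) into one event name.
function buildEventName(object, action) {
  // Lowercase the value and turn any runs of spaces into single underscores.
  const clean = (value) =>
    String(value).trim().toLowerCase().split(' ').filter(Boolean).join('_');

  return [clean(object), clean(action)].join('_');
}

console.log(buildEventName('form', 'submit'));         // form_submit
console.log(buildEventName('Video', 'Start'));         // video_start
console.log(buildEventName('contact form', 'submit')); // contact_form_submit
```

Routing every event name through one function like this means the convention lives in a single place instead of in each developer's memory.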
Case considerations
Snake Case (e.g. form_submit)
Pros:
- Easy to break apart via a simple .split('_') and easy to re-assemble via .join('_').
Cons:
- Some may argue it is harder to read than camelCase. Personally I don't, but I've been a PHP developer for 15 years, where $snake_case is common for variables and function/method names (especially in WordPress).
Kebab Case (e.g. form-submit)
Pros:
- Equally easy to break apart via a simple .split('-') and easy to re-assemble via .join('-').
Cons:
- Using - in payload keys prevents accessing values in objects via dot notation (you'd need to use array/bracket notation instead).
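A small sketch of that access problem, using made-up payload keys (form_id and form-id are illustrative only):

```javascript
// Illustrative payloads; the key names are assumptions for this example.
const snakePayload = { form_id: 'my-form-id' };
const kebabPayload = { 'form-id': 'my-form-id' };

// snake_case keys work with dot notation.
console.log(snakePayload.form_id); // my-form-id

// kebab-case keys only work with bracket notation.
console.log(kebabPayload['form-id']); // my-form-id

// Dot notation on a kebab-case key is parsed as a subtraction:
// (kebabPayload.form) - id. Here that is undefined - 1, which is NaN.
const id = 1;
const accidental = kebabPayload.form - id;
console.log(Number.isNaN(accidental)); // true
```

The last case is the nasty one: depending on what is in scope, it can fail silently with NaN rather than throwing.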
Pascal Case (e.g. FormSubmit)
Pros:
- Very readable, and more common in analytics software than kebab-case, camelCase or Title Case.
Cons:
- Parsing requires the use of regular expressions which are less ergonomic than a simple character delimited string.
Camel Case (e.g. formSubmit)
Pros:
- The most common convention for object property names in JavaScript.
Cons:
- Parsing requires the use of regular expressions which are less ergonomic than a simple character delimited string.
- Less readable for analytics reports than PascalCase.
Title Case (e.g. Form Submit)
Pros:
- Most human readable, makes for clean reports
Cons:
- Most modern analytics software prevents the use of spaces in event and dimension names.
- The minor convenience of clean, readable names in reports can easily be achieved by parsing and transforming the value in a given report.
- Using spaces in payload keys prevents accessing values in objects via dot notation (you'd need to use array/bracket notation instead). This is the same issue as - in kebab-case.
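As a rough illustration of how cheap that report-side transformation is, here is a sketch that turns a snake_case event name into Title Case for display. The function name toTitleCase is hypothetical, and where you run it (BI tool, export script, dashboard layer) is up to your stack.

```javascript
// Hypothetical display helper: snake_case in, Title Case out.
function toTitleCase(eventName) {
  return eventName
    .split('_')
    .map((word) => word.charAt(0).toUpperCase() + word.slice(1))
    .join(' ');
}

console.log(toTitleCase('form_submit'));   // Form Submit
console.log(toTitleCase('email_capture')); // Email Capture
```

The stored value stays machine-friendly, and the human-friendly label is derived only where a human will read it.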
My thoughts
I strongly believe that snake_case is the best path forward for use in analytics data (barring any specific platform nuances).
Avoid case sensitivity
Case sensitivity introduces a higher likelihood of errors when writing queries or during event implementation. Stick to lower case.
Avoid regular expressions when possible
Sticking to character-delimited strings also makes parsing and transforming events and dimensions into different formats significantly easier. Case-delimited strings require the use of regular expressions, and I don't know a single person on the planet who actually likes writing regex patterns (or maintaining them).
Analytics payloads don't need to follow JavaScript standards
Most devs would argue that using snake_case for property names in a JavaScript object is bad practice. If this data were being used outside of an analytics capacity, I would agree with this sentiment. Instead, developers should think of properties in an analytics object as database columns, where the convention is by and large snake_case.
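A quick side-by-side sketch of the difference. The event names are illustrative, and the regex shown is just one common way to split on case boundaries, not the only one.

```javascript
// Character-delimited: a plain string split, no regular expression.
const snakeParts = 'form_submit'.split('_');
console.log(snakeParts); // [ 'form', 'submit' ]

// Case-delimited: needs a regex lookahead to find each uppercase boundary,
// plus a lowercase pass to normalize the pieces.
const camelParts = 'formSubmit'
  .split(/(?=[A-Z])/)
  .map((part) => part.toLowerCase());
console.log(camelParts); // [ 'form', 'submit' ]

// The same regex falls apart on acronyms, splitting every capital letter.
const acronymParts = 'HTMLFormSubmit'.split(/(?=[A-Z])/);
console.log(acronymParts); // [ 'H', 'T', 'M', 'L', 'Form', 'Submit' ]
```

Handling the acronym case properly means a more complicated pattern, which is exactly the maintenance burden a single delimiter character avoids.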
Other bits & bobs
Keep objects "simple"
You will have an urge (or be pushed by the powers that be) to name things using internal lingo/vernacular -- avoid it. Is the event measuring something related to a form? Good, the object is form - not funnel, not some weird abbreviation internal marketing came up with. Keeping events named based on what is actually happening on the site/application keeps them DURABLE over time.
Event actions should be "singular"
One thing that I commonly run into is marketing teams using plural event names. This is likely because they tend to look at data in aggregate rows, where it seems to make sense. However, when you look at a single user's behavior, using plural event names makes the data awkward.
Additionally, avoid using past tense for actions, such as viewed or clicked. Events are measured at the moment they occur, so adding past tense to your event names can make reports confusing.
| Action | Description |
|---|---|
| view | Something was viewed, regardless of "time". |
| impression | A "time" based view configured via a "time in viewport" threshold. |
| click | Typically measures link or other UI related interactions. |
| download | A file was downloaded, can be from a click or a form submit. |
| upload | A file was uploaded to a remote destination. |
| toggle | Measures interfaces that have a boolean state (accordions, checkboxes, etc.). |
| open | A more explicit way to measure events on objects with a boolean state. |
| close | (See above) |
| submit | Only applicable to forms. Should only fire on a valid submit. |
| capture | Used for singular PII capture events (e.g. phone_capture or email_capture). |
| scroll | Measures when a scroll depth threshold (a % of the page) enters the viewport. |
| start | When a multi-step or time-based process is started (e.g. form or video events). |
| progress | When a multi-step or time-based process moves forward. |
| regress | When a multi-step or time-based process moves backward. |
| resume | When a multi-step or time-based process is continued from a stalled state. |
| complete | When a multi-step or time-based process is finished (reaches its end). |
| create | Measures when an object is created. (More common in SaaS applications) |
| update | Measures when an object is updated. (More common in SaaS applications) |
| delete | Measures when an object is deleted. (More common in SaaS applications) |
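One way to keep this vocabulary from drifting is to check event names against it before they ship. The sketch below is an assumption-heavy illustration: checkEventName and ALLOWED_ACTIONS are hypothetical names, the action list is simply copied from the table above, and the rules (lowercase, underscore as the only delimiter, action as the last segment) reflect the conventions argued for in this article rather than any platform requirement.

```javascript
// Actions copied from the table above; extend or trim to fit your own spec.
const ALLOWED_ACTIONS = new Set([
  'view', 'impression', 'click', 'download', 'upload', 'toggle', 'open',
  'close', 'submit', 'capture', 'scroll', 'start', 'progress', 'regress',
  'resume', 'complete', 'create', 'update', 'delete',
]);

// Hypothetical validator: returns a list of problems (empty means valid).
function checkEventName(name) {
  const problems = [];

  if (name !== name.toLowerCase()) {
    problems.push('not lowercase');
  }
  if (name.includes(' ') || name.includes('-')) {
    problems.push('use "_" as the only delimiter');
  }

  // The action is treated as the last underscore-separated segment.
  const parts = name.split('_');
  const action = parts[parts.length - 1];
  if (parts.length < 2) {
    problems.push('expected object_action');
  } else if (!ALLOWED_ACTIONS.has(action)) {
    problems.push(`unknown action "${action}"`);
  }

  return problems;
}

console.log(checkEventName('form_submit'));    // no problems
console.log(checkEventName('button_clicked')); // flags the past-tense action
console.log(checkEventName('Button Click'));   // flags case, delimiter, and shape
```

Notice that no regular expression is needed, which is a side benefit of sticking to a single delimiter character.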
DataLayer design
The dataLayer is often overlooked by marketing and media teams, largely because it abstracts the data "away" from the tags and platforms they intend to send it to.
Nested vs. flat payload structures
I tend to find nested data structures more effective than flattened ones. The goal of your dataLayer isn't to put the payload in the "shape" of the data sent to the various analytics systems. The goal should be to keep it in a clear format that makes debugging easier and collisions less likely.
```javascript
// Nested
dataLayer.push({
  event: 'form_submit',
  form: {
    id: 'my-form-id',
    name: 'My Human Readable Form Name',
    fields: {
      first_name: 'Big',
      last_name: 'Bird',
      email: 'big.bird@gmail.com',
      phone: '(412) 555-1234',
    }
  }
})

// Flattened
dataLayer.push({
  event: 'form_submit',
  form_id: 'my-form-id',
  form_name: 'My Human Readable Form Name',
  form_fields: {
    first_name: 'Big',
    last_name: 'Bird',
    email: 'big.bird@gmail.com',
    phone: '(412) 555-1234',
  }
})
```
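If a destination does need flat keys, the nested payload can be reshaped at send time rather than in the dataLayer itself. Below is a minimal sketch of that idea. The flatten function is hypothetical, joining keys with an underscore is my assumption, and it flattens every level (so it goes one step further than the "Flattened" example above, which leaves the fields object nested).

```javascript
// Hypothetical helper: recursively flattens nested objects into
// underscore-joined keys. Arrays and primitives are copied as-is.
function flatten(payload, prefix = '') {
  const flat = {};
  for (const [key, value] of Object.entries(payload)) {
    const flatKey = prefix ? `${prefix}_${key}` : key;
    const isPlainObject =
      value !== null && typeof value === 'object' && !Array.isArray(value);

    if (isPlainObject) {
      Object.assign(flat, flatten(value, flatKey));
    } else {
      flat[flatKey] = value;
    }
  }
  return flat;
}

const nested = {
  event: 'form_submit',
  form: {
    id: 'my-form-id',
    name: 'My Human Readable Form Name',
    fields: { first_name: 'Big', last_name: 'Bird' },
  },
};

console.log(flatten(nested));
```

With the sample payload, the resulting keys are event, form_id, form_name, form_fields_first_name and form_fields_last_name. Going the other direction (flat to nested) is ambiguous once keys themselves contain underscores, which is one more reason to keep the nested form as the source of truth.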
Conclusion
At the end of the day, the best naming convention is the one your team actually follows. But if you're starting fresh or cleaning up a mess, snake_case with an object_action structure will serve you well.
Keep it lowercase. Keep it simple. Keep it durable. Your future self (and whoever inherits your implementation) will thank you.