
The Anatomy of Clean Analytics Data

Author
Cavaliero, D.
Filed Under
Data Collection

Naming and structuring events and data is harder than it may seem. The open-ended, flexible nature of tools like Segment, RudderStack, and even Google Analytics 4 (GA4) leaves convention and syntax up to the end user, which leaves a lot of room for human error.

A quick preface

This crap is hard. It's even harder when you are a single contributor or small team trying to convince an organization or an external organization (client) to change their ways. I hope this article provides some sound reasoning/arguments for the data engineers out there trying to fight the good fight for clean data.

Event naming conventions

Having poor rigor around event naming doesn't hurt at first. It hurts six+ months later when you're trying to build a report and realize button_click, Button Click, and btnClicked are all floating around in your behavioral event data.

The flexible nature of modern analytics tools is a double-edged sword — you can name things however you want, which means people will name things however they want.

A simple framework

I am particularly fond of Segment's "Object Action" framework. The idea is that each event name starts with a string identifying the object (noun) being interacted with, followed by a string for the action (verb) describing the type of interaction with that object.

For example, if we wanted to push an event into GA4 when a visitor submits a form, a good event name would be form_submit.

The primary benefit of this convention is that, implemented correctly, your event names are semantic and easy to understand.
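One way to keep names consistent is to build them from their two parts instead of typing the full string by hand. This is only a sketch: buildEventName is a hypothetical helper, not part of Segment, GA4, or any other SDK, and it assumes the lowercase, underscore-delimited format used in the form_submit example.

```javascript
// Hypothetical helper (not part of any analytics SDK): builds an
// "object_action" event name from its two parts.
function buildEventName(object, action) {
  // Trim, lowercase, and replace inner spaces with underscores.
  const clean = (part) =>
    String(part).trim().toLowerCase().split(' ').filter(Boolean).join('_');
  return [clean(object), clean(action)].join('_');
}

console.log(buildEventName('form', 'submit'));         // form_submit
console.log(buildEventName('Contact Form', 'Submit')); // contact_form_submit
```

Because the object always comes first and the action last, the name stays readable even when the object itself is more than one word.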

Case considerations

Snake Case (e.g. form_submit)

Pros:

  • Easy to break apart via simple .split('_') and easy to re-assemble via .join('_').

Cons:

  • Some may argue it is harder to read than camelCase. Personally I don't - but I've been a PHP developer for 15 years where $snake_case is common for variables and function/method names (especially in WordPress).

Kebab Case (e.g. form-submit)

Pros:

  • Just as easy to break apart via a simple .split('-') and to re-assemble via .join('-').

Cons:

  • Using - in payload keys prevents accessing values in objects via . notation (e.g. you'd need to use array/bracket notation).

Pascal Case (e.g. FormSubmit)

Pros:

  • Very readable - and more common in analytics software than kebab-case, camelCase or Title Case

Cons:

  • Parsing requires the use of regular expressions which are less ergonomic than a simple character delimited string.

Camel Case (e.g. formSubmit)

Pros:

  • The most common convention for object property names in JavaScript.

Cons:

  • Parsing requires the use of regular expressions which are less ergonomic than a simple character delimited string.
  • Less readable for analytics reports than PascalCase.

Title Case (e.g. Form Submit)

Pros:

  • Most human readable, makes for clean reports

Cons:

  • Many analytics platforms (GA4, for example) do not allow spaces in event and dimension names.
  • The minor convenience of clean, readable reports is easy to get anyway by parsing and transforming the value in a given report.
  • Using spaces in payload keys prevents accessing values in objects via . notation (e.g. you'd need to use array/bracket notation). (Same issue as - in kebab-case).
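To make the dot-notation point concrete for both kebab-case and Title Case keys, here is a small comparison. The payloads and key names are made up for illustration.

```javascript
// Illustrative payloads; the key names are made up for this example.
const snake = { form_name: 'Newsletter' };
const kebab = { 'form-name': 'Newsletter' };
const title = { 'Form Name': 'Newsletter' };

// snake_case keys are valid identifiers, so dot notation works.
console.log(snake.form_name); // Newsletter

// Keys containing "-" or a space need bracket notation.
console.log(kebab['form-name']); // Newsletter
console.log(title['Form Name']); // Newsletter

// kebab.form-name would be parsed as (kebab.form) - name,
// and title.Form Name is a syntax error.
```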

My thoughts

I strongly believe that snake_case is the best path forward for use in analytics data (barring any specific platform nuances).

Avoid case sensitivity
Case sensitivity introduces a higher likelihood of errors when writing queries or during event implementation. Stick to lower case.

Avoid regular expressions when possible
Sticking to character-delimited strings also makes parsing and transforming events and dimensions to different formats significantly easier. Case-delimited strings require the use of regular expressions, and I don't know a single person on the planet who actually likes writing regex patterns (or maintaining them).
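A quick side-by-side of the two parsing styles, plus the Title Case transform mentioned earlier for reports. The regex shown is one simple pattern I'd reach for, not the only option, and toTitleCase is a hypothetical helper.

```javascript
// snake_case: a plain delimiter is all that's needed.
const snakeParts = 'form_submit'.split('_'); // ['form', 'submit']

// PascalCase: the word boundary is a case change, so a regex is needed.
const pascalParts = 'FormSubmit'.match(/[A-Z][a-z0-9]*/g); // ['Form', 'Submit']

// Hypothetical report helper: snake_case -> Title Case, no regex involved.
function toTitleCase(eventName) {
  return eventName
    .split('_')
    .map((word) => word.charAt(0).toUpperCase() + word.slice(1))
    .join(' ');
}

console.log(toTitleCase('form_submit')); // Form Submit
```

Note that this simple pattern already falls over on acronyms: 'PDFDownload' comes back as ['P', 'D', 'F', 'Download'], which is exactly the kind of maintenance burden the delimiter approach avoids.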

Analytics payloads don't need to follow JavaScript standards
Most devs would argue that using snake_case for property names in a JavaScript object is bad practice. If this data were being used outside of an analytics capacity, I would agree with this sentiment. Developers should think of properties in an analytics object as database columns, where the convention is by and large snake_case.

Other bits & bobs

Keep objects "simple"

You will have an urge (or be pushed by the powers that be) to name things using internal lingo/vernacular -- avoid it. Is the event measuring something related to a form? Good, the object is form - not funnel, not some weird abbreviation internal marketing came up with. Keeping events named based on what is actually happening on the site/application keeps them DURABLE over time.

Event actions should be "singular"

One thing that I commonly run into is marketing teams using plural event names. This is likely because they tend to look at data in aggregate rows, where it seems to make sense. However, when you look at a single user's behavior, using plural event names makes the data awkward.

Additionally, avoid using past tense for actions, such as viewed or clicked. Events are measured at the moment they occur, so adding past tense to your event names can make reports confusing.

Action      Description
view        Something was viewed, regardless of "time".
impression  A "time" based view configured via a "time in viewport" threshold.
click       Typically measures link or other UI related interactions.
download    A file was downloaded, can be from a click or a form submit.
upload      A file was uploaded to a remote destination.
toggle      Measures interfaces that have a boolean state (accordions, checkboxes, etc.).
open        A more explicit way to measure events on objects with a boolean state.
close       (See above)
submit      Only applicable to forms. Should only fire on a valid submit.
capture     Used for singular PII capture events (e.g. phone_capture or email_capture).
scroll      Measures when a % threshold in the browser intersects the viewport.
start       When a multi-step or time-based process is started (e.g. form or video events).
progress    When a multi-step or time-based process moves forward.
regress     When a multi-step or time-based process moves backward.
resume      When a multi-step or time-based process is continued from a stalled state.
complete    When a multi-step or time-based process is finished (reaches its end).
create      Measures when an object is created. (More common in SaaS applications)
update      Measures when an object is updated. (More common in SaaS applications)
delete      Measures when an object is deleted. (More common in SaaS applications)
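If you want to enforce these rules rather than just document them, a small lint-style check is one option. This is a sketch under two assumptions: the allowed actions are exactly the ones in the list above, and the action is always the last underscore-delimited part of the name. checkEventName and ALLOWED_ACTIONS are hypothetical names, not part of any analytics library.

```javascript
// Assumption: the allowed actions are exactly the ones listed above.
const ALLOWED_ACTIONS = new Set([
  'view', 'impression', 'click', 'download', 'upload', 'toggle', 'open',
  'close', 'submit', 'capture', 'scroll', 'start', 'progress', 'regress',
  'resume', 'complete', 'create', 'update', 'delete',
]);

// Hypothetical lint check: returns a list of problems (empty means OK).
function checkEventName(name) {
  const problems = [];
  if (name !== name.toLowerCase()) problems.push('not lowercase');
  if (name.includes(' ') || name.includes('-')) {
    problems.push('use "_" as the delimiter');
  }

  // The action is assumed to be the last "_"-delimited part.
  const parts = name.toLowerCase().split('_');
  const action = parts[parts.length - 1];
  if (parts.length < 2) {
    problems.push('expected object_action');
  } else if (!ALLOWED_ACTIONS.has(action)) {
    problems.push(`unknown action "${action}"`);
  }
  return problems;
}

console.log(checkEventName('form_submit'));    // []
console.log(checkEventName('button_clicked')); // [ 'unknown action "clicked"' ]
console.log(checkEventName('form_submits'));   // [ 'unknown action "submits"' ]
```

Past-tense and plural actions get caught for free, because neither clicked nor submits is in the allowed set.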

DataLayer design

The dataLayer is often overlooked by marketing and media teams, largely because it abstracts the data "away" from the tags/platforms they intend to send the data into.

Nested vs. flat payload structures

I tend to find nested data structures more effective than flattened ones. The goal of your dataLayer isn't to match the "shape" of the data sent to the various analytics systems; the goal should be to keep it in a clear format that makes debugging easier and collisions less likely.

// Make sure the dataLayer array exists before pushing to it.
window.dataLayer = window.dataLayer || [];

// Nested
dataLayer.push({
  event: 'form_submit',
  form: {
    id: 'my-form-id',
    name: 'My Human Readable Form Name',
    fields: {
      first_name: 'Big',
      last_name: 'Bird',
      email: 'big.bird@gmail.com',
      phone: '(412) 555-1234',
    }
  }
})

// Flattened
dataLayer.push({
  event: 'form_submit',
  form_id: 'my-form-id',
  form_name: 'My Human Readable Form Name',
  form_fields: {
    first_name: 'Big',
    last_name: 'Bird',
    email: 'big.bird@gmail.com',
    phone: '(412) 555-1234',
  }
})
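Keeping the dataLayer nested doesn't stop a destination from receiving flat keys; the flattening can happen at the tag or transformation step. A minimal sketch of that idea follows. flattenPayload is a hypothetical helper, and it assumes values are primitives, arrays, or plain objects.

```javascript
// Hypothetical helper: flattens a nested payload into snake_case keys,
// e.g. { form: { id } } -> { form_id }.
function flattenPayload(payload, prefix = '') {
  const flat = {};
  for (const [key, value] of Object.entries(payload)) {
    const flatKey = prefix ? `${prefix}_${key}` : key;
    const isPlainObject =
      value !== null && typeof value === 'object' && !Array.isArray(value);
    if (isPlainObject) {
      // Recurse, carrying the parent key along as the prefix.
      Object.assign(flat, flattenPayload(value, flatKey));
    } else {
      flat[flatKey] = value;
    }
  }
  return flat;
}

const nested = {
  event: 'form_submit',
  form: { id: 'my-form-id', fields: { first_name: 'Big' } },
};

console.log(flattenPayload(nested));
// { event: 'form_submit', form_id: 'my-form-id', form_fields_first_name: 'Big' }
```

Going from nested to flat is mechanical like this; going the other way means guessing where one key ends and the next begins, which is part of why I'd keep the source of truth nested.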

Conclusion

At the end of the day, the best naming convention is the one your team actually follows. But if you're starting fresh or cleaning up a mess, snake_case with an object_action structure will serve you well.

Keep it lowercase. Keep it simple. Keep it durable. Your future self (and whoever inherits your implementation) will thank you.