Dividends from Data: Building a Lean Data Stack for a Series C Fintech

The Tech Blog

Feb 2024

Andy Turner

+ 4 others

It is often said that a journey of a thousand miles begins with a single step.

Ten years ago, building a data technology stack felt a lot more like a thousand miles than it does today; technology, automation, and business understanding of the value of data have all improved significantly. The problem today is instead knowing how to take the first step.

As anyone on LinkedIn with ‘data’ in their title will know, the market is saturated with suppliers of every kind. From the well-known cloud service providers to the remarkably niche, every stage in the data lifecycle is covered by dozens of technology companies, each of which claims to have the best, most cost-efficient solution.

PrimaryBid is a category-defining business, and as such, there are no established players to follow as we choose our data technologies. Our requirements had several layers:

PrimaryBid facilitates novel access for retail investors to short-dated fundraising deals in the public equity and debt markets. As such, we need a platform that can be elastic to market conditions, with a significant preference for variable over fixed cost.
PrimaryBid operates in a heavily regulated environment: we are subject to financial services regulations in each of our jurisdictions, as well as data protection regulations (including GDPR). It is therefore essential that our technology stack allows for best-in-class compliance with all applicable requirements.
PrimaryBid handles many types of sensitive data, including personally identifiable information (PII), market-sensitive/insider information, and proprietary intellectual property (IP). Information security is therefore a critical requirement.
A part of PrimaryBid’s IP is our ability to forecast retail investor demand for a transaction using our proprietary and unique data assets. As a result, our stack must include a scalable machine learning environment, with a diverse set of market, partner, and customer model features.
As a Series C business with international ambitions, the technologies we pick have to scale exponentially as we do, and be globally available as we set our sights on geographic expansion.
As a category creator, we need to be able to understand our market, our customers, and our partners in near-real time, keeping us competitive and user-focused.
And, perhaps the biggest cliché, we need all of the above for as low a cost as possible, with the ability to pivot easily as our strategy develops.

Over the last 12 months, we have built a lean, secure, low-cost solution to the challenges above. We’ve picked suppliers and partners that are a great fit for us, and we have been hugely impressed by the quality of tech available to data teams now, compared with only a few years ago.

The 30,000 foot view

The 30,000 foot view will not surprise any data professional. We gather data from various sources, structure it into something useful, surface it in a variety of ways, and combine it together into models that we expose to our products and services. Throughout this process, we monitor data quality, ensure data privacy, and send alerts to our team when things break.

Figure 2: High-level summary of our data stack

Data gathering and transformation

For getting raw data into our data platform, we wanted technology partners whose solutions were low-code, fast, scalable, and OpEx driven; as data provides a significant competitive advantage in a new category, we wanted to be able to extract information from a wide variety of sources.

The (now classic) combination of Fivetran and dbt was our pick to meet these needs.

Fivetran supports a huge range of pre-built data ‘connectors’, which allow data teams to land new feeds in a matter of minutes, addressing our needs around the variety of data sources we want to combine. The cost model we have adopted is based on monthly ‘active’ rows; i.e. you pay incrementally for rows that change, and not for any rows that stay the same, meaning that we only pay for what we use. The additional benefit of this approach is that you can pull data at the cadence that works for you; there is no difference in cost if you pull a feed daily vs. every 30 seconds.
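
To make the ‘active rows’ idea concrete, here is a minimal sketch of how such a count could work. It is a simplified model for illustration, not Fivetran's actual billing logic; the function name, the event format, and the sample data are all assumptions.

```python
from datetime import date


def monthly_active_rows(change_events, year, month):
    """Count distinct rows that changed at least once in the given month.

    change_events: iterable of (table, primary_key, change_date) tuples.
    A row that changes many times in one month is still counted once,
    which is why sync frequency does not move the total in this model.
    """
    active = {
        (table, primary_key)
        for table, primary_key, changed_on in change_events
        if changed_on.year == year and changed_on.month == month
    }
    return len(active)


# Hypothetical change log: order 1 changes three times, order 2 once.
events = [
    ("orders", 1, date(2024, 2, 1)),
    ("orders", 1, date(2024, 2, 1)),
    ("orders", 1, date(2024, 2, 15)),
    ("orders", 2, date(2024, 2, 20)),
    ("orders", 3, date(2024, 1, 31)),  # previous month, not counted
]

february_mar = monthly_active_rows(events, 2024, 2)
print(february_mar)  # 2
```

Under this model, syncing the same feed more often surfaces the same changed rows sooner, but does not add to the month's count.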

Fivetran also takes care of connector maintenance: schema updates, versioning changes, and so on are all handled automatically on their end. This frees up a large amount of data engineering time in our team by outsourcing the perpetual cycle of updating API integrations for our 35+ data sources.

Once the data is extracted by Fivetran, dbt turns raw data into a usable structure for downstream tools, a process known as ‘analytics engineering’. Our dbt Cloud plan is priced per seat, and so will scale well as the team grows. dbt and Fivetran work well together: many Fivetran connectors have dbt templates available off the shelf for data teams to use, which is another boost to productivity and time-to-insight. dbt is hugely popular with data engineers, and brings in many best practices from software development (e.g. testing, version control) that keep analytics transformations robust and transparent.
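
dbt ships generic tests such as `unique` and `not_null`, which are normally declared in YAML alongside the SQL models. As a rough illustration of what those two checks assert, here is a plain-Python analogue; the function names, table name, and rows are hypothetical and this is not how dbt itself is invoked.

```python
def failing_not_null(rows, column):
    """Return the rows where the column is missing (a not_null-style check)."""
    return [row for row in rows if row.get(column) is None]


def failing_unique(rows, column):
    """Return values that appear more than once (a unique-style check)."""
    seen, duplicates = set(), set()
    for row in rows:
        value = row.get(column)
        if value in seen:
            duplicates.add(value)
        seen.add(value)
    return sorted(duplicates)


# Hypothetical staging table with one null and one duplicated key.
stg_investors = [
    {"investor_id": 1, "country": "GB"},
    {"investor_id": 2, "country": None},
    {"investor_id": 2, "country": "FR"},
]

null_failures = failing_not_null(stg_investors, "country")
duplicate_ids = failing_unique(stg_investors, "investor_id")
print(null_failures, duplicate_ids)
```

A test passes when it returns no failing rows, so both checks above would flag this table before it reached downstream tools.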

Both platforms have their own orchestration tools for pipeline scheduling and monitoring, but we deploy Apache Airflow 2.0, managed via Google Cloud’s Cloud Composer, to give us finer-grained control over this process. The result is pipelines that deliver data to users with sub-2-minute latency, while maintaining minimal cost.
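
The core job of the orchestrator is to run tasks in dependency order. The sketch below shows that idea using only the Python standard library; it is not Airflow's API, and the task names are assumptions chosen to mirror the stack described here.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
# Task names are hypothetical.
dependencies = {
    "fivetran_sync": set(),
    "dbt_run": {"fivetran_sync"},
    "dbt_test": {"dbt_run"},
    "refresh_dashboards": {"dbt_test"},
    "retrain_models": {"dbt_test"},
}


def execution_order(deps):
    """Return one valid order in which every task follows its prerequisites."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

In a real deployment the same graph would be declared as an Airflow DAG, with the scheduler handling retries, alerting, and timing on top of the ordering.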

Data storage, governance, and privacy

At the risk of the rest of this post becoming homogeneous, this is the point in our data stack where Google Cloud starts to solve a whole variety of our needs. With personal experience of all three of the top cloud providers, we started with an open mind in our search; we had originally anticipated having one tool for storage and warehousing, and a totally separate provider for tracking lineage and providing governance. In the end, Google Cloud solved all these challenges in one platform.

Starting with Google Cloud’s BigQuery, our evaluation showed increased performance vs our existing provider, at a lower cost. BigQuery is highly scalable, serverless, and separates compute costs from storage costs, allowing us to pay only for what we need at any given time. A nod also to Google Cloud’s documentation, which our team found highly practical and easy to use compared to alternatives.

What sold us on Google Cloud’s ecosystem though was their integration of data privacy, governance and lineage throughout. Leveraging Google’s Dataplex, we set security policies in one place, on the raw data itself. As the data is then transformed, passed between Google Cloud services, and turned into predictive models, these same security policies are adhered to throughout, with zero further effort.

One example is PII, which is locked away from every employee bar a small number who need it for their day-to-day roles. We tag data once with a ‘has_PII’ flag, and it then doesn’t matter what tool you use to access the data (BigQuery, BI tools, Python notebooks, etc.): if you do not have permission to see PII in the raw data, you will never be able to see it anywhere.

These privacy rules are highly visible and auditable, support infrastructure-as-code, and allow us to apply table, column, and row-level masking where necessary.
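
The ‘tag once, enforce everywhere’ idea can be sketched in a few lines. This is a conceptual illustration only, not Dataplex's API: the `view_pii` permission name, the mask string, the column names, and the `read_row` helper are all assumptions.

```python
PII_TAG = "has_PII"

# Hypothetical column tags, set once on the raw table.
column_tags = {
    "investor_id": set(),
    "email": {PII_TAG},
    "order_value": set(),
}


def read_row(row, tags, user_permissions):
    """Return the row with PII-tagged columns masked unless the user may see PII."""
    can_see_pii = "view_pii" in user_permissions
    return {
        column: value
        if can_see_pii or PII_TAG not in tags.get(column, set())
        else "****"
        for column, value in row.items()
    }


row = {"investor_id": 7, "email": "a@example.com", "order_value": 250}

analyst_view = read_row(row, column_tags, set())
compliance_view = read_row(row, column_tags, {"view_pii"})
print(analyst_view)
print(compliance_view)
```

Because the check sits at the data layer rather than in each tool, every access path goes through the same rule.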

Figure 3: Dataplex (credit: Google Cloud)

Data analytics

Of all the saturated parts of the data ecosystem, analytics, dashboarding and self-service are particularly dense with offerings. This, probably, is also the area where the needs of the business and the skillset of the team played the greatest role in picking the right solution for us.

For PrimaryBid, we chose Looker, a business intelligence platform that Google agreed to acquire in 2019. Our team has experience with Power BI, Tableau, Data Studio (now Looker Studio), and many other tools, and against that competition Looker was a relatively expensive choice. For us, though, it delivered several benefits above and beyond the alternatives.

Firstly, the developer requirements: in our case, every member of our data team is very comfortable with SQL. This plays to the core of how Looker functions; instead of storing data itself, Looker writes SQL queries directly against your data warehouse. To ensure it writes the right query, engineers and analysts build Looker analytics models using ‘LookML’. LookML for the most part is low-code, but for complex transformations, SQL can be written directly into the model, which plays to our team’s strengths.
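
The pattern described here, a model that defines dimensions and measures once and generates SQL on demand, can be illustrated with a toy query builder. This is not LookML or Looker's query engine; the model structure, table name, and field names are hypothetical.

```python
def build_query(model, dimensions, measures):
    """Compose a SQL string from a small semantic model (hypothetical structure)."""
    select_parts = [f"{model['dimensions'][d]} AS {d}" for d in dimensions]
    select_parts += [f"{model['measures'][m]} AS {m}" for m in measures]
    sql = f"SELECT {', '.join(select_parts)} FROM {model['table']}"
    if dimensions and measures:
        # Group by the position of each selected dimension.
        group_by = ", ".join(str(i + 1) for i in range(len(dimensions)))
        sql += f" GROUP BY {group_by}"
    return sql


orders_model = {
    "table": "analytics.orders",
    "dimensions": {"deal_id": "deal_id", "order_day": "DATE(created_at)"},
    "measures": {"total_value": "SUM(order_value)", "order_count": "COUNT(*)"},
}

sql = build_query(orders_model, ["order_day"], ["total_value"])
print(sql)
# SELECT DATE(created_at) AS order_day, SUM(order_value) AS total_value FROM analytics.orders GROUP BY 1
```

The SQL expressions live in the model, so a complex transformation is written once and reused by every dashboard or API call that selects that field.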

By contrast, Power BI uses a VBA/Excel-esque language called DAX; when I used Power BI, I would only ever write DAX when building dashboards. As a result, I never got a real feel for the language, and development was a slow and Stack Exchange-heavy process. With our team’s SQL-based skills, LookML was much easier to pick up and remember.

Secondly, the extensibility of Looker into our platforms was a core decision factor. With the LookML models in place, transformed, clean data can be passed to any downstream service. Our first build-out was dashboards within Looker itself; with these in place, data can also be called by API via the Looker SDK, offering a lean and efficient way to embed analytics internally and externally. As a tech company, the ability to embed analytics with low-code techniques allows us to be agile in how we represent data and who can see it.

Finally, the interplay between Looker and Dataplex is particularly powerful. Behind the scenes, Looker is writing queries against BigQuery. As it does so, all rules around data security and privacy are preserved, tailored to each specific user. Keeping data in one place, with one set of rules, was a huge preference over copying data into another BI or Analytics platform, facilitating adherence to compliance and privacy requirements.

Data Science + Machine Learning (DS/ML)

The last step in our data pipelines is our DS/ML environment. We originally explored Databricks to meet our needs; several members of our team have had fantastic experiences with the platform, and I would definitely use it again in the future. In particular, its collaborative notebooks are best-in-class.

This time, we leaned even further into Google Cloud’s offerings, and decided to try Vertex AI for model development, deployment, and monitoring.

As with Looker, Vertex AI immediately picks up the governance and privacy benefits from Dataplex, and lineage is automatically surfaced all the way from raw data to model output; Vertex AI’s metadata store has been excellent for providing a transparent view of the models we have live, and the data that feeds them.

Data at PrimaryBid comes from a wide range of internal and external sources. Having model inputs standardised and universally available was a high priority to ensure we’re leveraging everything that we have to make predictions about the capital markets. The integrated Vertex AI Feature Store has been a huge benefit in our model pipelines, and can serve features with low latency to live models in production that are frequently updated as market conditions change.
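
As a rough picture of what online feature serving involves, the sketch below keeps only the latest value of each feature per entity and returns it on request. It is a toy in-memory stand-in, not the Vertex AI Feature Store API; the class, entity IDs, and feature names are assumptions.

```python
from datetime import datetime


class OnlineFeatureStore:
    """Toy in-memory store keeping only the latest value of each feature."""

    def __init__(self):
        self._latest = {}  # (entity_id, feature) -> (timestamp, value)

    def write(self, entity_id, feature, value, timestamp):
        """Store the value unless a newer one is already held."""
        key = (entity_id, feature)
        current = self._latest.get(key)
        if current is None or timestamp >= current[0]:
            self._latest[key] = (timestamp, value)

    def read(self, entity_id, features):
        """Return the latest value per feature, or None if never written."""
        return {
            name: self._latest.get((entity_id, name), (None, None))[1]
            for name in features
        }


store = OnlineFeatureStore()
store.write("deal-42", "avg_daily_volume", 1_200_000, datetime(2024, 2, 1))
store.write("deal-42", "avg_daily_volume", 1_500_000, datetime(2024, 2, 2))
# A late-arriving older value does not overwrite the newer one.
store.write("deal-42", "avg_daily_volume", 900_000, datetime(2024, 1, 15))

served = store.read("deal-42", ["avg_daily_volume", "retail_holder_count"])
print(served)
```

Every model reads the same named features from one place, which is what keeps inputs standardised across pipelines.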

To make model building as flexible as possible, we have used the open-source Kubeflow framework for pipeline orchestration; this framework decomposes each step of the model building process into components, each of which performs a fully self-contained task, and then passes metadata and model artefacts to the next component in the pipeline. The result is highly adaptable and visible ML pipelines, where individual elements can be upgraded or debugged independently without affecting the rest of the code base.
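
The component pattern can be shown without any framework. The sketch below is plain Python rather than the Kubeflow Pipelines SDK, and the component names, artefact keys, and toy data are assumptions; it only illustrates self-contained steps handing artefacts to the next step.

```python
def prepare_data(artefacts):
    """Split raw (x, y) pairs into features and labels."""
    raw = artefacts["raw"]
    return {"features": [[x] for x, _ in raw], "labels": [y for _, y in raw]}


def train_model(artefacts):
    """Fit y = slope * x through the origin by least squares."""
    xs = [row[0] for row in artefacts["features"]]
    ys = artefacts["labels"]
    slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return {"model": {"slope": slope}}


def evaluate_model(artefacts):
    """Compute the mean absolute error of the fitted model."""
    slope = artefacts["model"]["slope"]
    errors = [
        abs(slope * row[0] - y)
        for row, y in zip(artefacts["features"], artefacts["labels"])
    ]
    return {"metrics": {"mean_abs_error": sum(errors) / len(errors)}}


def run_pipeline(components, initial):
    """Run components in order, merging each one's outputs into the artefacts."""
    artefacts = dict(initial)
    for component in components:
        artefacts.update(component(artefacts))
    return artefacts


result = run_pipeline(
    [prepare_data, train_model, evaluate_model],
    {"raw": [(1, 2), (2, 4), (3, 6)]},
)
print(result["model"], result["metrics"])
```

Each function depends only on the artefacts it reads, so one step can be replaced or debugged without touching the others.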

To date, the models we have built have been custom developments, rather than off-the-shelf, but it’s worth noting that Vertex AI has a whole host of models pre-built that can be used as starting points for development. This may come in particularly useful for our team as we begin experimenting with Generative AI in the coming months.

Figure 4: Vertex AI pre-trained models (credit: Google Cloud)

So there you have it, a whistle-stop tour of a data stack. We’re thrilled with the choices we’ve made, and are getting great feedback from the business. We’re always looking to improve; if you have any thoughts or suggestions, please let me know!

The future

We’ve had the stack above running full speed for a little while now, and are happy with the performance, flexibility, and cost of the choices we’ve made. Being a growing business though, we cannot stand still; I imagine we’ll see the following extensions to our stack in the near future:

Reverse ETL: one technology not explicitly called out above is Segment, a platform we use for event tracking and product analytics. Later this year, we plan to use Segment’s reverse ETL abilities to write transformed data back to the source systems it came from.
Multi-region support: data privacy and security are built into the fabric of our data stack. Today, our setup is designed principally to cater for UK and EU legal and regulatory requirements; this will adapt and evolve as we work across more and more geographies.
Generative AI: we anticipate investigating where PrimaryBid could leverage gen AI internally and externally. As we approach this challenge, we are hyper-aware of our obligations to keep data and intellectual property private, especially given the high-profile mistakes that can be made while the technology is nascent. We plan to use the frozen/adapter model approach as outlined in the figure below, drawing on the power of large language models (LLMs) but in a private environment. More on this to come later in the year!
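
One common reading of a frozen/adapter setup is that the large pre-trained model's parameters are never updated, and only a small adapter is trained on private data. The toy example below illustrates that split with a single frozen weight and a single trainable scale; it is a conceptual sketch under that assumption, not our planned implementation, and every name and number in it is hypothetical.

```python
FROZEN_WEIGHT = 3.0  # stands in for pre-trained parameters; never updated


def frozen_base(x):
    """The pre-trained model: fixed, shared, and not trained on private data."""
    return FROZEN_WEIGHT * x


def train_adapter(data, steps=200, learning_rate=0.01):
    """Fit y ~ scale * frozen_base(x) by gradient descent on `scale` only."""
    scale = 1.0
    for _ in range(steps):
        gradient = sum(
            2 * (scale * frozen_base(x) - y) * frozen_base(x) for x, y in data
        ) / len(data)
        scale -= learning_rate * gradient
    return scale


# Hypothetical private data where the target is twice the base model's output.
private_data = [(1.0, 6.0), (2.0, 12.0)]
adapter_scale = train_adapter(private_data)
print(round(adapter_scale, 4))
```

Only `adapter_scale` changes during training, so the private data shapes a small set of parameters that can stay inside a controlled environment.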