My latest project - a data management tool

Hello all,

It’s been a couple of years since I’ve posted here. I’ve had a number of failed attempts at starting a software business, and eventually took a break from that idea after actually finding a well-paying software job that wasn’t horrible (on the bright side, my last project, though it failed utterly as a business, played a part in helping me get my current job).

Well, the lockdown happened, I got bored with all the extra time, and started on an open source project at least partially inspired by problems I’d faced at work over the last several years:


It’s a platform meant to serve as a foundation for a (small) organization’s data management needs.


I’ve been trying to push the thought that this could grow into a business (i.e., open-core, services, etc.) to the back of my mind, so I don’t get ahead of myself, until I can see whether it catches on. It’s fairly early in development and needs more documentation, but I’d appreciate any feedback and/or advice - especially on how to promote an open source project :slight_smile:




How is what you are doing different to Apache Airflow?

Thanks for the question! When I was making this project, I was thinking more along the lines of a (greatly) simplified Apache NiFi, rather than Airflow. As in, it’s focused on ingesting, processing, cataloging, and storing/shipping data around, rather than managing/scheduling workflows. Though I admit there is some overlap with Airflow as well - the data flows are indeed DAGs, whereas NiFi is much more… free-form (I believe this to be a detriment for a lot of use cases - you can end up with absolutely insane, unmaintainable NiFi flows fairly easily).
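To make the DAG point concrete, here is a minimal sketch in plain Python. It is not code from the project: the step names and the `run_order` helper are made up for illustration, and it only shows why a DAG keeps a flow easy to reason about (every step has a well-defined place in the order, and cycles are rejected up front).

```python
from graphlib import CycleError, TopologicalSorter  # standard library, Python 3.9+

# Hypothetical flow: each step maps to the set of steps it depends on.
flow = {
    "ingest": set(),
    "clean": {"ingest"},
    "catalog": {"clean"},
    "ship": {"clean"},
}


def run_order(dag):
    """Return one valid execution order for the steps in `dag`.

    Raises graphlib.CycleError if the steps depend on each other in a loop,
    which is exactly what a free-form flow graph would allow.
    """
    return list(TopologicalSorter(dag).static_order())


order = run_order(flow)

# A flow with a loop is not a DAG and is refused.
try:
    run_order({"a": {"b"}, "b": {"a"}})
    cycle_rejected = False
except CycleError:
    cycle_rejected = True
```

Here "ingest" always comes before "clean", and "clean" before both "catalog" and "ship"; the relative order of the last two is not fixed, since neither depends on the other.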

Generally speaking, Apache NiFi is a bit too much to ask your typical data scientist or analyst to work with. Most know how to write Python scripts (btw, I do plan to add R support to this as well), but are definitely not qualified to do what you’d call “real” programming. I’m aiming for something simpler - i.e., for a smaller shop that needs to move beyond Jupyter notebooks and data scattered around personal workstations, and needs to operationalize some of the ingest, data prep/cleaning, and processing parts of its data pipeline. Also, NiFi is great, but its age is starting to show - it’s clunky to scale horizontally and to use in a containerized environment.

I’m also trying to keep it open and flexible. I’ve seen workflows that need some human-in-the-loop steps (labeling, reviewing, etc.), and I aim to put that capability (integrating with external tools) in here as well.

It’s still early, and a lot of stuff is missing, but I hope it’s far enough along to give folks the general concept.

P.S. Obviously, if you’re a big organization, you probably have the time/inclination/resources to come up with a bespoke solution perfectly suited to your needs, in which case this probably isn’t for you :slight_smile:

Sounds like you are coming at a similar problem to my own product, from a different angle.

Things that I have discovered in this market:

  • It is a pretty crowded space.
  • Data scientists are pretty wedded to the tools they already use.
  • Non-data scientists are hard to shift from Excel!

It is a pretty crowded space.

Yeah, lots of solutions out there, including some really expensive commercial ones. A lot of them are focused on particular use cases or types of data too. I don’t think anyone has cracked the “general purpose” platform yet. I’m hoping going the open source route will help here.

Data scientists are pretty wedded to the tools they already use.

Definitely. Personally I’m not a fan of Python, but it has become the standard language for doing data science, so you have to support it with any solution like this. I added Clojure support for myself :slight_smile:

Non-data scientists are hard to shift from Excel!

No kidding. I’ve seen some really gnarly spreadsheets over the years.

For what it’s worth - don’t spend time building an open source product. This comes from someone who co-founded a company in this specific space and took it to 7-digits ARR in 4 years.

Well now I’m curious. Why would you recommend against the Open Source route?

I spent 6 years developing and marketing an open-source ETL framework and a low-cost desktop data integration tool, similar to yours. During those years I made exactly 30K, from a few custom data integration projects. We made 50K in our first year with Etlworks and have been growing roughly 2.5x annually since then.


I think you should niche down. You’re basically building a kind of programming tool for which the answer to “can it do X” is always “yes”. That might sound like a good thing, but in reality it means you have no idea how to spend your meagre financial and time budgets.

I used to work at a VC backed startup in a tangentially related space (we had a broader computing platform with a workflow style tool).


I agree. The problem is to try to work out which niche!
