June 23, 2025

Where People Get their Data, and Why It Matters

Farm Fresh? Organic? NON-GMO? Best unit price? Sourcing ingredients for a great meal can be a pain, much like extracting data.

Welcome back to our series, Towards Data Freedom! In the last piece, we discussed the big picture of ETL pipelines. Today, we're diving a little deeper into the E: Extract.

What Exactly Is 'Extract'?

In the simplest terms, extraction is the process of pulling data from a source so you can do some action or series of actions on it. Think of extraction as getting all your raw ingredients onto your kitchen counter—no cooking yet, just making sure everything is accessible.

Much like cooking, you need to do some work to get these ingredients: maybe you grow the veggies yourself, or buy them from the supermarket. You also need to make sure you have the EXACT right ingredients (just because mint and basil look alike DOES NOT mean they're interchangeable in a dish, for instance). In data, this might look like making sure all your dates are typed correctly, or that they're in fact birth dates and not some other date entirely.
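As a quick sketch of that date check: parsing strictly at extraction time catches values that merely look right. (The function name and ISO format here are illustrative assumptions, not a prescribed convention.)

```python
from datetime import datetime

def parse_birth_date(raw: str) -> datetime:
    # strptime raises ValueError on malformed input instead of silently guessing
    return datetime.strptime(raw, "%Y-%m-%d")

# A value in the wrong format is caught at extraction time, not downstream
try:
    parse_birth_date("06/23/2025")
except ValueError:
    print("rejected: not an ISO date")

print(parse_birth_date("1990-05-17").year)  # 1990
```

Failing loudly here is the point: a mistyped date that slips through extraction tends to surface much later, in a much more confusing place.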

Common Sources of Data for Extraction

Extraction can happen from a variety of sources:

  • APIs: Pulling structured data from services like Salesforce, Twitter, or HubSpot.
  • Databases: Querying SQL databases like MySQL, PostgreSQL, or NoSQL databases such as MongoDB.
  • CSV Files: Whether it’s manually collected customer data, or pulled from your own internal Excel workbooks.
  • Web Scraping: Extracting data directly from websites, useful when APIs aren’t available or are insufficient.
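To make two of these sources concrete, here's a minimal Python sketch that extracts the same records from an inline CSV (standing in for an exported file) and an in-memory SQLite database (standing in for MySQL or PostgreSQL). The table and column names are hypothetical:

```python
import csv
import io
import sqlite3

# --- CSV source (data inlined for the example; normally read from a file) ---
csv_text = "id,name\n1,Ada\n2,Grace\n"
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# --- Database source (in-memory SQLite stands in for a production database) ---
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(int(r["id"]), r["name"]) for r in csv_rows])
db_rows = conn.execute("SELECT name FROM customers ORDER BY id").fetchall()

print([name for (name,) in db_rows])  # ['Ada', 'Grace']
```

Either way, the output of the extract step is the same: raw records on the counter, ready for transformation.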

Common Problems Encountered in Extraction

Despite being conceptually straightforward, extraction isn't without headaches:

  • API Changes: Data schemas or endpoints often change without notice, causing disruptions and broken pipelines.
  • Opaque Rate Limiting: APIs or databases may restrict how frequently you can request data, and this might not always be apparent without documentation deep dives.
  • Inconsistent Data Formats: The same type of data can come in varied formats, complicating processing.
  • Authentication and Permissions: Secure data access frequently involves handling complex authentication flows.
  • Scraper Fragility: Web scrapers can break due to slight modifications in web page layouts (similar to inconsistent data formats, but with code to constantly maintain).
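Of these, rate limiting is the most mechanical to defend against. A common pattern is exponential backoff: retry the request, waiting twice as long after each failure. A minimal sketch follows; `RateLimitError` and `fetch_with_backoff` are hypothetical names for illustration, not a real library's API:

```python
import time

class RateLimitError(Exception):
    """Raised when a source says we're requesting too often (e.g. HTTP 429)."""

def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0):
    # Retry on rate limiting, doubling the wait each time: 1s, 2s, 4s, ...
    for attempt in range(max_retries):
        try:
            return fetch()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries; let the caller decide what to do
            time.sleep(base_delay * 2 ** attempt)
```

In practice you'd also cap the total wait and add random jitter, so many clients hitting the same limit don't all retry in lockstep.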

How Can Structify Help?

Structify does the meal prep for you, so you can focus on what matters: making a tasty (and healthy) meal! Our tooling lets users treat the internet as a dataset, without manually writing custom scrapers, by:

  • Automatically navigating complex sites.
  • Detecting and adjusting to website layout changes in real time.
  • Enabling rapid iteration, making extraction robust and flexible.

We also turn internal data sources into usable data by:

  • Turning PDFs into a clean, tabular format ready for analysis and visualization
  • Allowing spreadsheet uploads or database connections, so you can leverage your internal data easily and headache-free

You might be curious which other pieces of extraction Structify can help with: stay tuned for future pieces in this series to learn more!

Yours in data,

Alex Reichenbach
