Database Magic: Shrinking Giants – Mastering the Art of Generating Compact Dumps from Massive Databases to Work on Your Local Environment

Generating a database dump is a critical practice in the realm of database management, offering a safeguard against data loss, aiding in disaster recovery, and facilitating seamless data migration or replication. A database dump essentially involves creating a snapshot of the entire database, capturing its structure and content at a specific point in time. While this process is indispensable, it becomes particularly challenging when dealing with large databases. Let’s delve into the significance of generating a database dump and explore the unique challenges associated with large-scale data sets.

Importance of Generating a Database Dump:

  1. Data Preservation:
    • A database dump serves as a comprehensive backup mechanism, ensuring the preservation of critical business data. In the event of accidental deletion, corruption, or system failure, a recent database dump enables quick and reliable data recovery.
  2. Disaster Recovery:
    • In the face of catastrophic events such as server crashes, natural disasters, or cyberattacks, a database dump is a lifeline for disaster recovery. It enables organizations to restore their databases to a stable state and resume operations promptly.
  3. Data Migration and Upgrades:
    • When transitioning to a new database system or performing version upgrades, a database dump is instrumental. It facilitates a smooth migration of data, allowing organizations to maintain continuity without losing valuable information.
  4. Testing and Development:
    • Developers often require realistic data sets for testing and development purposes. A database dump provides a replica of the production database, enabling developers to work in an environment that closely mirrors the actual operational conditions.
  5. Compliance and Auditing:
    • In regulated industries, maintaining data integrity and accessibility is paramount. Regularly generating database dumps aids in compliance with data protection regulations and facilitates auditing processes.

Challenges Associated with Large Databases:

  1. Resource Intensiveness:
    • Large databases consume considerable system resources during the dump process, leading to increased CPU, memory, and storage usage. This can impact the performance of the production system during the dumping period.
  2. Extended Downtime:
    • Generating a dump of a large database often requires downtime to ensure data consistency. The longer the database dump takes, the more extended the system downtime, affecting operational efficiency.
  3. Storage Requirements:
    • Large databases result in voluminous dump files, requiring substantial storage space for backup retention. This can strain available storage resources and necessitate efficient archiving strategies.
  4. Network Bandwidth:
    • Transferring large dump files across a network can be time-consuming and bandwidth-intensive. This becomes a significant challenge, especially in scenarios where databases need to be replicated or migrated between different geographical locations.
  5. Concurrency Issues:
    • Large databases often handle high levels of concurrent transactions. Generating a dump while maintaining data consistency and transactional integrity poses a complex challenge, necessitating careful planning and execution.

Many times we, developers, need to create a new database dump from either the staging or the production environment, whether to work on tasks that are only reproducible in those environments or simply to update our local installation of the project with the most recent data.

One of the problems is the size of these databases: depending on the project, they can reach tens of gigabytes, which is hard to deal with both because of the time it takes to generate and import the dump and because of the limited disk space in our local environments.

To mitigate this kind of problem, some useful commands are listed below.

The idea is to generate a dump of the database without unnecessary data, like logs, cached data, transient information, etc. To do so, we first need to identify the biggest tables in the database we want to dump with the following query:

SELECT
	table_schema AS `Database`,
	table_name AS `Table`,
	round(((data_length + index_length) / 1024 / 1024), 2) AS `Size in MB`
FROM
	information_schema.TABLES
ORDER BY
	(data_length + index_length)
	DESC;

This query will give you an output like the following one:

+----------+------------------------------------------------+------------+
| Database | Table                                          | Size in MB |
+----------+------------------------------------------------+------------+
| db       | agilitycache_agilitycache                      |    7298.64 |
| db       | something_custom_pricing                       |    2473.94 |
| db       | agility_price_data                             |     714.67 |
| db       | url_rewrite                                    |     310.86 |
| db       | amasty_xsearch_users_search                    |     304.23 |
| db       | catalog_product_index_eav_replica              |     152.08 |
| db       | catalog_product_index_eav                      |     125.33 |
| db       | catalogsearch_fulltext_cl                      |     104.63 |
| db       | catalog_product_entity_int                     |      85.56 |
| db       | reporting_users                                |      72.59 |
| db       | inventory_source_item                          |      63.30 |
| db       | search_query                                   |      47.17 |
| db       | catalog_url_rewrite_product_category           |      29.42 |
| db       | catalog_product_entity_varchar                 |      28.06 |
| db       | cron_schedule                                  |      18.03 |
| db       | catalog_product_index_price                    |      13.06 |
| db       | catalog_product_index_price_replica            |      12.31 |
| db       | catalogrule_product_cl                         |      11.52 |
| db       | catalog_product_attribute_cl                   |      11.52 |
| db       | company_permissions                            |      10.03 |
| db       | klevu_product_sync                             |       8.88 |
| db       | klevu_product_sync_history                     |       7.23 |
| db       | customer_address_entity                        |       7.03 |
| db       | quote_address                                  |       6.67 |
| db       | catalog_product_entity                         |       6.47 |
| db       | customer_log                                   |       5.03 |
| db       | catalog_product_entity_text                    |       4.91 |
| db       | sales_order_item                               |       4.73 |
| db       | sales_order                                    |       4.53 |
| db       | customer_address_entity_varchar                |       3.83 |
| db       | customer_address_entity_int                    |       3.83 |
| db       | catalog_category_product_index_store38_replica |       3.77 |
| db       | catalog_category_product_index_store44_replica |       3.72 |
| db       | catalog_category_product_index_store46_replica |       3.72 |
| db       | cataloginventory_stock_item                    |       3.52 |
| db       | catalog_category_product_index_store38         |       3.48 |
| db       | catalog_category_product_index_store46         |       3.48 |
| db       | catalog_category_product_index_store44         |       3.48 |
| db       | quote_item                                     |       3.11 |
| db       | sales_order_grid                               |       2.98 |
| db       | quote                                          |       2.77 |
| db       | email_catalog                                  |       2.75 |
| db       | customer_entity                                |       2.63 |
| db       | inventory_low_stock_notification_configuration |       2.52 |
| db       | catalog_product_entity_media_gallery_value     |       2.30 |
| db       | sales_bestsellers_aggregated_daily             |       2.25 |
| db       | customer_grid_flat                             |       2.23 |
| db       | catalog_product_entity_decimal                 |       2.23 |
| db       | sales_bestsellers_aggregated_monthly           |       2.16 |
| db       | company_structure                              |       2.11 |
| db       | inventory_stock_2                              |       1.92 |
| db       | inventory_stock_5                              |       1.92 |
| db       | cataloginventory_stock_status_replica          |       1.91 |
| db       | persistent_session                             |       1.73 |
| db       | catalog_product_entity_media_gallery           |       1.69 |
| db       | quote_item_option                              |       1.63 |
| db       | sales_order_address                            |       1.63 |
| db       | ui_bookmark                                    |       1.58 |
| db       | negotiable_quote                               |       1.53 |
| db       | klevu_order_sync                               |       1.52 |
+----------+------------------------------------------------+------------+

Now we can easily identify the heaviest tables in our database and skip the data for those tables when dumping the database to a file.
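
If you prefer to run this check from the shell instead of an interactive MySQL session, a minimal variant using the standard mysql client could look like the sketch below. Unlike the query above, it also filters to a single schema and keeps only the 20 largest tables; [host], [user], and [database] are placeholders.

mysql -h [host] -u [user] -p -e "
SELECT table_name AS 'Table',
       round(((data_length + index_length) / 1024 / 1024), 2) AS 'Size in MB'
FROM information_schema.TABLES
WHERE table_schema = '[database]'
ORDER BY (data_length + index_length) DESC
LIMIT 20;"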

In the example above, the table agilitycache_agilitycache is totally unnecessary and it is more than 7 GB in size, so we can definitely ignore it. Other tables, like the following, can also be ignored because they hold irrelevant data:

  • something_custom_pricing
  • agility_price_data
  • amasty_xsearch_users_search
  • catalogsearch_fulltext_cl

After identifying the tables you need to ignore, it's time to run the proper commands to generate the dump file.

Even though we are going to ignore those tables, we can't simply skip them completely, because Magento (or any other application running on this database) needs them in order to work, so we still need to dump at least their structure.

The first step, then, is to dump the complete structure of the database into a file, skipping some information like triggers, tablespaces, and data.

mysqldump -h [host] -u [user] -p [database] --no-tablespaces --skip-triggers --no-data | sed -e 's/DEFINER[ ]*=[ ]*[^*]*\*/\*/' > database.dump.sql

The second step is to export only the data from the database. You can use --ignore-table=[table] for as many tables as you want. We still need to skip some information here, like tablespaces, triggers, and the table creation statements.

mysqldump -h [host] -u [user] -p [database] --no-tablespaces --skip-triggers --no-create-info --ignore-table=[database].agilitycache_agilitycache --ignore-table=[database].something_custom_pricing --ignore-table=[database].agility_price_data --ignore-table=[database].catalogsearch_fulltext_cl --ignore-table=[database].amasty_xsearch_users_search | sed -e 's/DEFINER[ ]*=[ ]*[^*]*\*/\*/' >> database.dump.sql
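
Since the list of ignored tables can get long, one way to keep it manageable is to build the --ignore-table flags from a plain list of table names. Below is a minimal bash sketch; DB_HOST, DB_USER, and DB_NAME are just illustrative placeholder variables, not part of the original commands.

# Placeholders: adjust to your environment.
DB_HOST="staging-db.example.com"
DB_USER="dbuser"
DB_NAME="db"

# Tables whose data we want to skip (their structure was already dumped in the first step).
IGNORED_TABLES="agilitycache_agilitycache something_custom_pricing agility_price_data catalogsearch_fulltext_cl amasty_xsearch_users_search"

# Build one --ignore-table flag per table.
IGNORE_FLAGS=""
for TABLE in $IGNORED_TABLES; do
  IGNORE_FLAGS="$IGNORE_FLAGS --ignore-table=$DB_NAME.$TABLE"
done

# $IGNORE_FLAGS is intentionally left unquoted so it expands into separate flags.
mysqldump -h "$DB_HOST" -u "$DB_USER" -p "$DB_NAME" --no-tablespaces --skip-triggers --no-create-info $IGNORE_FLAGS | sed -e 's/DEFINER[ ]*=[ ]*[^*]*\*/\*/' >> database.dump.sql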

Now, the third and last step is to export the stored routines. Here, no data or table structure is exported.

mysqldump -h [host] -u [user] -p [database] --no-tablespaces --routines --no-create-info --no-data --no-create-db --skip-opt | sed -e 's/DEFINER[ ]*=[ ]*[^*]*\*/\*/' >> database.dump.sql
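
If you generate this kind of dump regularly, the three commands can be wrapped in a small script so the connection details and the DEFINER cleanup are written only once. This sketch continues from the previous one (it reuses DB_HOST, DB_USER, DB_NAME, and IGNORE_FLAGS); each mysqldump call will still prompt for the password.

DUMP_FILE="database.dump.sql"
STRIP_DEFINER='s/DEFINER[ ]*=[ ]*[^*]*\*/\*/'

# 1. Complete structure, no data.
mysqldump -h "$DB_HOST" -u "$DB_USER" -p "$DB_NAME" --no-tablespaces --skip-triggers --no-data | sed -e "$STRIP_DEFINER" > "$DUMP_FILE"

# 2. Data only, skipping the heavy tables.
mysqldump -h "$DB_HOST" -u "$DB_USER" -p "$DB_NAME" --no-tablespaces --skip-triggers --no-create-info $IGNORE_FLAGS | sed -e "$STRIP_DEFINER" >> "$DUMP_FILE"

# 3. Stored routines only.
mysqldump -h "$DB_HOST" -u "$DB_USER" -p "$DB_NAME" --no-tablespaces --routines --no-create-info --no-data --no-create-db --skip-opt | sed -e "$STRIP_DEFINER" >> "$DUMP_FILE"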

In some cases this will not be as effective, because the heaviest tables are related to orders, products, and customers, which are necessary most of the time. However, anything related to quotes, logs, debugging, etc., is usually unnecessary and can be wiped out of the dump.

In the example above, the database is 10 GB in size and, after dumping it with the strategy above, the resulting file was down to 500 MB.
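
To load the resulting file into your local environment, a plain import with the mysql client is enough. In the sketch below, 127.0.0.1, root, and local_db are placeholders for your local connection details and database name:

mysql -h 127.0.0.1 -u root -p -e "CREATE DATABASE IF NOT EXISTS local_db"
mysql -h 127.0.0.1 -u root -p local_db < database.dump.sql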

In conclusion, the importance of generating a database dump cannot be overstated: it serves as a linchpin for data management, recovery, and operational continuity. However, the challenges associated with large databases demand meticulous planning, resource allocation, and strategic execution to keep the dump process efficient without compromising system performance.

I hope this blog post helps you.

– Tiago
