Slow slow slow deploy

I have been tasked with testing OttoFMS and OttoDeploy, with the idea that we could replace our existing migration system for our client. In theory it should work great, but in my early testing things do not look so good. Our test environment is in AWS, with two t3.2xlarge (8 vCPU / 32 GB RAM) Ubuntu servers, each with a 100 GB boot drive, a 400 GB data drive, and a 500 GB backup drive. This was all done to roughly replicate the production environment, which is on premises. Each server also has an 8 GB swap file, because Todd said you need one, matching what the actual production servers have.
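For anyone verifying the same setup, something like this works on Ubuntu to confirm the swap file is actually active; these are standard commands, nothing OttoFMS-specific:

  # List active swap devices/files (expecting an 8 GB /swapfile entry)
  swapon --show
  # Overall RAM and swap totals
  free -h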

Copies of the production files were transferred from S3 backups to our BigTest machine in the sky. This is about 53 files and 188 GB worth of stuff. The plan was to time a migration of a few files (30 GB, maybe?) from the BigStaging server in the sky, to compare to what we are doing in real life, back on premises.

My first failed test was to try to "deploy" all the files from BigTest to BigStage. This failed miserably. Through testing I discovered that it didn't like the 30 GB files at all. Then I tried smaller files, and those still failed.

Finally I thought I would try a sub-1 GB file, and that worked. It worked rather nicely. Next up was a 2 GB file, just a single 2 GB file. As of this writing, and based on checking htop, it looks like all the work on the first machine (BigTest in this case) is done, but the zip file is still growing in /backupfolder/OttoFMS/inbox/build_blahblahblah.
It has been over two hours and it's only at 1.6 GB of 2.3.
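If you want to keep an eye on a build the same way, something like this works from a shell; the build folder name below is the placeholder from above, not the real name:

  # Re-check the size of the growing build zip every 30 seconds
  watch -n 30 'ls -lh /backupfolder/OttoFMS/inbox/build_blahblahblah'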

Of course this is unacceptable, but I am more than willing to confess my ignorance on such matters, so here I am. What am I missing? Or is OttoFMS/OttoDeploy simply not capable of dealing with anything over 1 GB?

Hello Fred,

I enjoyed hanging at your office last year for the FM user group meeting. It seemed like a really fun bunch of people to work with. Welcome to the forum.

Thanks for taking the time to send some feedback. Happy to help. OttoFMS and OttoDeploy should be able to handle many gigabytes of files and data; we run lots of tests that do that on a regular basis. But very large file sets may require special setup, which we can help you with.

If you could send us the following information from your server as a Direct Message, we could probably tell you what happened and may even be able to suggest solutions to the issues.

Here is what I need. You can send this all to me via direct message on here (lower left corner).

  1. The Deployment JSON from OttoDeploy. Click here to see how to get that.
  2. The otto-info.log files from both servers. Click here to see how to get that.

Thanks again,

Todd

Hello Fred,

We are looking at changing the compression settings on the zip files that get created. It looks like that will make a difference in how long it takes to make the zips after the backups are done.

I just ran a test on three 31-gigabyte files, for a total of 91 gigabytes. It took just under 4 minutes to make backups of all three files and put them into a zip.

Now, this is a dev machine with plenty of power and no other users on it, so I don't expect it to be that fast on a production machine. We are going to try this in a test version tomorrow on production machines. I'll let you know how it goes.

If you get a chance, I would really like to know what sort of errors you got with your attempts. You just mention that they failed; you don't say what happened, or what errors you could see in the Deployment Details page. We would appreciate it.

Thanks

Todd

Files Sent in DM.

The errors look like:
message,phase,level,timeElapsed
deployment queued,queued,info,0
"Starting deployment: ""Populate Big Test""",starting,info,0
53 files,starting,info,9
Sending just in time build request to source server,building,info,1105
Fetching just in time build status from source server,building,info,1193
# ...Lots more "Fetching just in time build..." lines
Fetching just in time build status from source server,building,info,299683
Build server error: failed - backup or copy failed:SOMEBIG_Billing.fmp12. Error: Schedule timed out,building,error,302741
Starting to open files,opening,info,302753
Deployment process failed. Original files are unmodified.,done,error,302790
Removing builds from inbox,post-deployment steps,info,302838

That was the initial failure. Later I selected smaller files and received a similar time out:

message,phase,level,timeElapsed
deployment queued,queued,info,0
"Starting deployment: ""Not so BigStage populate""",starting,info,0
1 files,starting,info,11
Sending just in time build request to source server,building,info,1117
Fetching just in time build status from source server,building,info,1266
"Build server error: failed - backup or copy failed: SOME_FILE.fmp12. Error: File SOME_FILE.fmp12 is not open, backup failed.",building,error,4399
Starting to open files,opening,info,4406
Deployment process failed. Original files are unmodified.,done,error,4424
Removing builds from inbox,post-deployment steps,info,4463

My speculation is that there's some simple setting that needs to be changed, but the initial tests with the larger files were unsuccessful.

Also, this was all for a "deployment"; we still haven't tried a "migration" yet. That's where the rubber will really hit the road.

Looking forward to seeing this work.

Thanks Fred,

I'll take a look at your log files. They will be very helpful.

We have a couple of ideas we are trying for the large file moves. I think we can make the zip faster to some degree.

One new idea is that maybe, for very large file sets, OttoFMS shouldn't even be making a backup of the files you want. If you have 188 GB of production data on a server with users, you probably have to manage backups very carefully. You probably don't want OttoFMS running a backup whenever it needs one.

What if we let you specify one of the regular backups that you are already running as the source of the files?

OttoDeploy and OttoFMS could just grab the files from there and stream them over. No need to run another backup. Maybe we don't even zip.

What do you think of this idea?

Thanks

Todd

I am late to the party, but watching this thread. Hi Fred, thanks for all the help and feedback! Out of curiosity, have you been successful with a small test deployment?

@dude2copy, yes, when it works it's downright delightful. Sub-1 GB files were deployed with ease. I'm fairly certain that there is some obvious setting that I've missed.

Want to chime in with my experience on v3. When we reseeded the Dev server with live production data, a few large files failed to come over, so I had to copy them again; or, if for some reason it just wouldn't copy the files, I had to FTP them between servers. The effect of this is that we end up with orphan records, due to the time differential between most files coming over initially and then waiting for the FTP to copy the other files. That is why I find the concept of builds/zips helpful, and also the concept of a transactional approach: it all happens successfully, or it reverts to a prior state.

The first time I saw the missing-record issue, or extra records without parents, on a test or dev server, it took a lot of cycles to identify the cause and induced some panic. Fortunately it didn't affect testing or development once we determined the true cause.

I don't think you missed a setting. I think our default method for moving huge files is not where it needs to be. Moving tens of gigabytes off a production server under load is going to need some refactoring.

We are going to try optimizing the zips, because that will be useful for the other use cases. But we have another couple of ideas that may help even more.

We hope to have some updates soon.

Thank you

Todd

So this afternoon, I was asked to try a migration. I would have preferred that this come after a successful deployment, but sure, let's give it a try. Moreover, I read that the migration moves smaller (cloned) files, so there should be no hang-up on the donor server. But then I got a big error message (when trying to move about 8 large files for testing): Not enough space on the backup disk. Required: 266.9 GB, available: 38.32 GB.

Well, that doesn't make any sense; I have plenty of room on those drives... so I tried with the one troublesome file that took way too long to migrate last night (with a different migration system). This time I got the result: Not enough space on the backup disk. Required: 92.41 GB, available: 82.14 GB.

What? I have a 30 GB file, a 400 GB data drive, and a 500 GB backup drive. Most of that space was free, so what's with the error?

So... I discovered that that free space was actually the free space on the boot drive. But why is it asking for room on the boot drive? It looks like all the work is actually getting done on the designated backup drive.

Is there a way to "fix" this so that OttoFMS/OttoDeploy doesn't need that much extra space on the boot drive? It seems really stupid that in a production environment we would need to have even more free space (probably 500 GB) just to do a migration. Also, the warning isn't particularly clear about which server it wants all this space on. I think it was the donor.
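If you want to check which drive a free-space number actually belongs to, plain df does it; each of the boot, data, and backup drives shows up as its own mount:

  # Free space per mounted filesystem
  df -h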

Hello Fred,

OttoFMS uses three places for its work.

  1. The OttoFMS application directory. This is mostly configuration, the app's internal SQLite database, and its logs.
  2. Whatever you have set up as the default FileMaker backup directory.
  3. When it makes clones, it has to use the FileMaker Server application directory, because that is the only place you can make clones using the "clones only" feature.

By far the most space it uses is going to be in the FileMaker default backup directory. That is where it makes copies, zips, and backups.
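If you want to see where the space is actually going, a quick check of that backup folder usually tells the story. The path below is the usual Linux default for FileMaker Server; adjust it if you have moved your backups:

  # Total size of the FileMaker Server default backup folder,
  # which is also where the OttoFMS inbox builds end up
  sudo du -sh "/opt/FileMaker/FileMaker Server/Data/Backups/"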

I'd be happy to take a look at more information about what files you think Otto has put on your boot drive.

Let me know if this information helps

Thanks

Todd

Following up on S...l...o...w...n...e...s...sssssss.

While we are waiting for the next release, I was asked to retry the test migration of real data. This time I tried a sub-deployment, since OttoFMS/OttoDeploy is still checking the available space on the OS drive instead of the backup drive. This resulted in a successful migration of 6 files, including a 31 GB file and a 23 GB file, in about 3 hours. Yay.

I kept the one problematic 30 GB file that previously took 5 hours to migrate as a separate sub-deployment. For a fair repeat test, I replaced the file on the target server with a copy of the un-updated file.

This time, after 19 hours, I had to shut it all down.

After the 5-hour-long test, I checked the logs and saw that a table named "Recovered Library" had taken two hours to migrate. This is a little table with 48 records and two fields (one of which was a container with some little control icons). Consulting with the lead developer who supervises migrations for this client with our current migration software, I learned that this is a rogue table that probably came about during a crash somewhere. In the past the developers have deleted them.

So, I tried to delete the "Recovered Library" table (it was in both source and target), and FMP told me that I didn't have the right authority‽ Since I was being pressed into running the test again, I just left the tables in both the source and target files. And... yeah, that table still took two hours, but this time there was no joy of completion.
Update: It looks like that's because I didn't fix the permissions. DOH!

This morning I killed that test. The interesting part was that I SSHed into the server, turned on htop, and observed that the FMDataMigration process was taking up 100% of one of the eight processors. So... that was interesting.
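If you want to watch that process without leaving htop open, something like this works; FMDataMigration is just the process name as it appeared in htop on my server:

  # Watch CPU and memory of the migration process, refreshing once a second
  pid=$(pgrep -f FMDataMigration | head -n 1)
  top -b -d 1 -p "$pid"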

The powers that be are pressing me to make sure that this will work. So, I'll be wrangling our developers to get things cleaned up and try again.

Hey Fred,

We are doing final testing on the new release. It should be out very soon.

A recovery table is an indication of a problem. The file was damaged and was recovered. Your devs were right: you need to delete those. But you should also run a recover on the file again to make sure it has no corruption left in it.

The Data Migration tool usually just quits when it encounters corruption. OttoFMS should just say Migration Failed and leave the original files in place. But it could be that file corruption causes an incredibly slow migration. I am not aware of any migration, even ones with 100 GB of data, taking more than 3 or 4 hours. 19 hours is extreme, and a clear signal that there may be more corruption.

You could also just run the migration with the Data Migration Tool manually, just to see what it does.
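Very roughly, a manual run on Linux looks something like the sketch below. The binary path is the usual FileMaker Server install location and the flag names come from Claris's DMT documentation, so double-check both against the version installed on your server; the file paths and accounts here are placeholders:

  # Hypothetical example paths and accounts; verify flag names against Claris's DMT docs
  "/opt/FileMaker/FileMaker Server/Database Server/bin/FMDataMigration" \
    -src_path /path/to/copy_of_Billing.fmp12 \
    -clone_path /path/to/Billing_Clone.fmp12 \
    -target_path /path/to/Billing_migrated.fmp12 \
    -src_account admin -src_pwd 'xxxx' \
    -clone_account admin -clone_pwd 'xxxx' \
    -v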

The Data Migration tool (which is from Claris) will use 100% of a CPU. That is common. It will also use a huge amount of memory, which is why on Linux you must have a swap file set up, and why we have been cautious about releasing a version with concurrency.

Thanks

Todd


Just a follow-up here. We are still having issues with one file. In spite of downloading the problematic 30 GB file, running a recover on a local machine, and uploading it back to the test server, and also creating a clone of the donor file and recovering it (several times, actually), the migration is again taking 19 hours. And again it looks like it's hanging at the end.

Hello,

Are you saying the migration itself is taking 19 hours? Or is the entire process running in OttoFMS taking that long? In the progress screen of the deployment, does it get past Downloading and start migrating, or is it still downloading?

If the migration phase of the deployment is taking 19 hours, that is a very different thing than if it is stuck on the downloading phase.

If it is still downloading, did you upgrade to OttoFMS 4.2.2?

https://community.proofgeist.com/t/ottofms-version-4-2-2-released/362/5

The last couple of updates have made major improvements to the speed of both preparing the build and transferring the build. Downloads should no longer be an issue on Linux-to-Linux setups.

On our Linux-to-Linux very large file transfers, we are now getting 1.5 gb/s transfer rates.

Also, OttoDeploy now includes options to adjust the compression setting for the zip as well as the memory limit for the zip process. This will let you balance speed versus the resources available on the build machine.

Less compression = faster, but more disk space.
Less memory = slower, but less memory used.
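If you want a feel for that trade-off outside of OttoFMS, the stock zip command shows the same effect with its compression levels; this is only a generic illustration, not what OttoFMS runs internally:

  # -1 = fastest compression, biggest zip; -9 = slowest compression, smallest zip
  time zip -1 -r build_fast.zip "/path/to/backup/folder"
  time zip -9 -r build_small.zip "/path/to/backup/folder"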

Upgrading to the latest version is as simple as running a command line.

Let me know if this helps

Thanks

Todd

Todd, good questions.

Per our previous discussions I believe there is something off with the target file, and possibly the donor file.

So I took the donor file, created a clone, and then proceeded to recover it several times. Whether or not this does anything I don't know, but since it is a clone it went fast enough, and one of our FileMaker developers said they have done that in the past. This file was then hosted on the donor (source) server.

Next I downloaded the big target file. Ran one recover on a very fast Mac. This took like six hours. :frowning: Uploaded this file back to the test machine. Woo hoo. And proceeded to do a migration with OttoFMS/OttoDeploy.

Since it was cloning a clone, the first part of the migration went fine and the file got moved over without problem, and then it proceeded to migrate the schema. Looking into the very convenient file browser in OttoFMS, I could watch the new file grow in size. It was right on course to be done in three hours. Alas, I had an opera rehearsal to get to. Hours later, it still says "Migrating."

BUT, the last-modified time is at about the 6-hour point since the start, and the size of BIGFILE_otto_migrated.fmp12 is 30.81 GB, which looks correct. So... there you go.
This was done with OttoFMS 4.2.1.

Letting it run is really just out of curiosity. It's now been about a day with that test. So that raises the question: what's the best way to cancel one of these migrations? So far, I have shut down the server.

Ok that is some very good info.

I think I might want to remove OttoFMS from the mix and do a migration manually using the Data Migration Tool on the command line. My guess is that it will take a very long time there as well. But it would be good to confirm.

I can imagine a few scenarios where a particular data migration takes a very long time, possibly as long as what you are seeing.

  1. The Data Migration Tool examines the changes that were made to each table and determines whether the table can be migrated in "block" mode or "record" mode. It does this on its own as part of the migration. If it determines that a table needs to go into "record" mode and there are a lot of records, this can take a long time: each record is moved over one by one. "Block" mode just shoves the data into the file; it is very fast.

  2. If you have selected the options to rebuild indexes and re-evaluate calcs as part of the migration setup, then you have basically just set every table to "record" mode, and it will be very slow.

So... if you are getting pushed into "record" mode and there is a lot of data and a lot of indexes, then it is possible to get into a situation that takes many hours to run.

That's the bad news.

The good news is that you probably won't get pushed into "record" mode on every migration, because you won't be changing the schema in every table between migrations and you won't need to rebuild indexes or re-evaluate calcs. Under those conditions you should get block mode most of the time.

More good news: the migration log, either the one that comes out of OttoFMS or the one that the Data Migration Tool produces when you run it manually, will tell you whether each table was imported in "record" or "block" mode. That might help you understand what is happening.
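If the log is long, a quick scan for those lines is enough; the log filename below is just an example, and the exact wording can vary between DMT versions:

  # Show only the lines that mention block or record mode, with line numbers
  grep -inE 'block|record' /path/to/migration.log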

Let me know if this helps

Thanks

Todd


It's about time for an update on the slow slow slow deploy.

The problem file, which we will call "Billing" because that's its name, continues to be a problem. Although I have recovered it (this took a day), a migration would only work from the command line directly. Which is a good sign, but when I used the same files with Otto, the migration failed.

This makes certain people very grumpy because "it works with 'Midge'." Todd, as you may recall, "Midge" is our existing migration tool, which was built a while back and only works with FM 18. But still, it works and only takes a couple of hours to process that file, thus the grumpiness.

All that being said... I LOVE the new acceleration features. On the 6 other files (excluding the problematic Billing) that were part of the original test migration, everything went remarkably well. The first deployment, on an earlier version of OttoFMS/OttoDeploy, took just under 3 hours, including a 30 GB and a 25 GB file.

Yesterday I updated to the latest versions of Otto, moved the memory slider allllll the way to the right, and selected 4 concurrent migrations (working on the assumption that I have four vCPU processors on this machine).

This time, the entire migration for the exact same six files took about 90 minutes: half the time. Using htop I could see that, for the first time, most of the memory was actually being used, and the process was finally using some of that swap space. Previously, the swap space wasn't touched and the process only ever used about 4-5 GB of RAM. Go figure.

That makes me real happy, but I'm an optimist, and the grumpy ones are pessimists.

Well, this is some good news. I think.

But I would really like to understand what is going on with that one file. As you know, OttoFMS just uses the DMT that is installed by default on the server to do the migration. There isn't anything particular to OttoFMS about it. It just runs system commands, same as what I assume your in-house tool does.

What's different about that system? Does it use a particular version of the Data Migration Tool (DMT)? Maybe it is simply that the latest DMT provided by Claris is slower than, or just doesn't work like, the one you used for "Midge". If that's true, we should let Claris know.

Does Midge use a different version of the DMT?

Thanks

Todd

Correct, Midge was built around a much earlier version of the DMT. It has to be run on older hardware with FileMaker 18. So in my mind what we have here is two different cars: both are VWs, both are 4-cylinder, both use gas, but one is naturally aspirated with a carburetor whereas the other is a turbo with fuel injection...