🚀 Feature
NeMo runs its own CI on a PyTorch container from NGC (versioned as YY.MM), and these containers are generally available on other cloud providers too. Note that once PyTorch has a public release, it usually takes at least one month for the next container to actually ship that release. Until then, the current container has an alpha build of PyTorch with some cherry-picked changes rather than the full public release.
This can cause problems when improper version checking (using distutils instead of packaging.version.Version) mishandles these alpha versions in comparisons, causing PTL inside the container to pick incorrect code paths. So the ecosystem CI will work fine, but when you run it on a PyTorch container released by NVIDIA (i.e. on most cloud providers) it may fail (and not just NeMo: anything that uses PTL and hits that code path).
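To illustrate the kind of comparison involved, here is a minimal sketch using `packaging.version.Version`. The version string `1.10.0a0+0aef44c` and the helper `meets_minimum` are made up for illustration; this is not PTL's actual check. Whether an alpha build should count as "at least 1.10.0" depends on which changes were cherry-picked into the container, so the `ignore_prerelease` switch is an assumption, not a recommendation.

```python
from packaging.version import Version


def meets_minimum(installed: str, minimum: str, ignore_prerelease: bool = False) -> bool:
    """Return True if `installed` satisfies `minimum` under PEP 440 ordering.

    Hypothetical helper for illustration only.
    With ignore_prerelease=True, the alpha/local parts of `installed` are
    dropped first, so "1.10.0a0+abc" is treated like "1.10.0".
    """
    parsed = Version(installed)
    if ignore_prerelease:
        # base_version keeps only the release segment, e.g. "1.10.0"
        parsed = Version(parsed.base_version)
    return parsed >= Version(minimum)


# Illustrative container-style version string (assumed, not taken from a real image)
container_torch = "1.10.0a0+0aef44c"

parsed = Version(container_torch)
print(parsed.is_prerelease)   # alpha builds are flagged as pre-releases
print(parsed.base_version)    # release segment without the alpha/local suffix
print(meets_minimum(container_torch, "1.10.0"))
print(meets_minimum(container_torch, "1.10.0", ignore_prerelease=True))
```

Under PEP 440 an alpha sorts before its final release, so a plain `>= 1.10.0` check is false for the container build. That is exactly the situation where a feature gate can take a different code path in the container than it does with the public wheel.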
So maybe, as a separate test prior to release, run the ecosystem CI on the latest public NGC PyTorch container (or really any cloud container that has PyTorch built in). Of course this is a big task, so it's just a suggestion.
Motivation
For a current example of how we have to patch for such an issue right now (with PyTorch 1.10, NGC Container 21.01 and PyTorch Lightning 1.5.9), see https://github.com/NVIDIA/NeMo/blob/8e15ba43ba0a17b456d3bfa09444574ef1faa301/Jenkinsfile#L70-L76. The patch is needed because of an issue regarding torchtext.
For an extreme case of how bad things can get: we had to adaptively install torch, PTL and NeMo dependencies based on whether the install occurred inside a container or not. See https://github.com/NVIDIA/NeMo/blob/r1.0.0rc1/setup.py#L107-L146
Pitch
Maybe test the ecosystem CI (or even just PTL alone) on the latest public NGC PyTorch container (or really any cloud container that has PyTorch built in). Of course this is a big task, so it's just a suggestion.
Alternatives
Apart from manually patching the PTL source at install time, we haven't found any better solution than waiting a month or two until the container actually contains the code from the latest torch release.