The past few days have been pretty frustrating. The majority of my time has been spent on my research: I’ve been trying to get a neural net for lane detection set up and running on the lab’s computers. It’s been a painful process, and I’ve probably sunk about 15 hours into it with little progress.
Long story short, I’m dealing with CUDA issues, which are compounded by the fact that the computer is running Ubuntu 18, which is causing some version compatibility issues.
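As it turned out later, part of the problem was that a CPU-only build of PyTorch was being installed. A tiny (hypothetical) helper like the one below would have flagged that immediately — PyTorch wheels encode the compute platform as a local version tag, e.g. `1.10.0+cpu` versus `1.10.0+cu102`; the function name and the specific version strings here are illustrative:

```python
import re

def is_cpu_only_build(torch_version: str) -> bool:
    """Return True if a PyTorch version string looks like a CPU-only build.

    PyTorch wheels tag the compute platform in the local version segment,
    e.g. "1.10.0+cpu" (CPU-only) or "1.10.0+cu102" (built against CUDA 10.2).
    A version with no tag at all is ambiguous, so only an explicit "+cpu"
    tag is flagged.
    """
    return bool(re.search(r"\+cpu\b", torch_version))

# In practice you would pass in torch.__version__.
print(is_cpu_only_build("1.10.0+cpu"))    # CPU-only wheel
print(is_cpu_only_build("1.10.0+cu102"))  # CUDA-enabled wheel
```

If this returns true, no amount of driver fiddling will make `torch.cuda.is_available()` succeed — the binary simply wasn’t built with CUDA support.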
`torch.cuda.is_available()` was returning false. I tried a few things and spent a few hours before landing on the fix: using `pip install` instead of `conda install` to install my PyTorch versions. I found some people in GitHub issues who did this, and it worked! I have no intuition for why. One thing I noticed was that conda was installing the CPU version of PyTorch no matter what; I think it was trying to do something fancy with solving my environment and adjusted the installations based on that. After the switch, `torch.cuda.is_available()` returns true, so PyTorch is finally recognizing that I have a CUDA driver, toolkit, and device available. Now I am getting a new error, which has something to do with nvcc, the CUDA compiler, which can be downloaded separately from the toolkit.

<aside> 💡
After ~20 hours of trying, I have finally concluded that there is no reasonable solution to this problem. The project relies on CUDA toolkit 9.x or 10.x, and I have not found a conda installation of those toolkits that works in my environment. With newer CUDA toolkits (11.x or 12.x), enough code has changed that the project ends up calling functions that are no longer defined. I could have pursued a fix that migrated a bunch of functions from the legacy THC files to the ATen API, but it would have involved messing with hundreds of lines of C++ code across dozens of files. I don’t know much about this type of coding (CUDA, CUDA kernels), so I won’t be good at debugging if and when errors arise. For that reason I am moving on to a different neural net that doesn’t rely on old CUDA toolkit versions.
</aside>
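The toolkit-version dead end above could at least have been detected up front. Here is a sketch of the kind of check that would have saved time: parse the output of `nvcc --version` (whose last line typically looks like `Cuda compilation tools, release 10.2, V10.2.89`) and compare it against the toolkit major versions a project supports. The function names and the sample output are illustrative, not from the project:

```python
import re

# `nvcc --version` typically ends with a line like:
#   Cuda compilation tools, release 10.2, V10.2.89
_RELEASE_RE = re.compile(r"release (\d+)\.(\d+)")

def nvcc_release(nvcc_output: str):
    """Extract the toolkit's (major, minor) version from `nvcc --version` output."""
    m = _RELEASE_RE.search(nvcc_output)
    if m is None:
        raise ValueError("could not find a 'release X.Y' line in nvcc output")
    return int(m.group(1)), int(m.group(2))

def toolkit_supported(nvcc_output: str, supported_majors=(9, 10)) -> bool:
    """Check whether the installed toolkit's major version is one the project supports."""
    major, _ = nvcc_release(nvcc_output)
    return major in supported_majors

# In practice, nvcc_output would come from running `nvcc --version` via subprocess.
sample = (
    "nvcc: NVIDIA (R) Cuda compiler driver\n"
    "Cuda compilation tools, release 11.4, V11.4.48\n"
)
print(nvcc_release(sample))       # the installed toolkit version
print(toolkit_supported(sample))  # whether the project's 9.x/10.x requirement is met
```

Running this first would have shown immediately that the installed toolkit was outside the project’s supported range, before any hours went into the environment.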