(ch_GEElarge)=
# Managing large tasks

## Introduction
By now we have gained some experience in working with *GEE*. If you have gone through the [Introduction](ch_IntroGEE) and the chapter on [image workflows](ch_imagewf), then you are already well equipped with techniques and tools that allow you to do some easy visualizations in *GEE*, to up- and download features and feature collections, and to run some advanced analyses, such as [masking and reducing](ch_imagewf) or [image classifications](ch_GEEclassification).

What this (last) section addresses is the following: while *GEE* is meant to be run over large areas, it is much harder to get data for large areas and/or long time series down to your local machine. Very commonly, one receives a *calculation timeout* or a *user memory issue*. In the beginning this is very frustrating, as one would expect that Google's servers should be able to deal with large data. And they can: the only problem is getting the data out. The built-in tools in the `geemap` package are not really made for that, and many online courses and tutorials do not consider this issue. Still, one often wants to do some additional (secondary) analysis (e.g., advanced segmentation using the `skimage` library) for which the available methods in *GEE* do not suffice.

This chapter is specifically designed to provide some tools/techniques for use with *GEE* that allow you to do exactly that. Many of these tools were born and developed through experience and by constantly being confronted with this problem. This also means that they are not perfect and not necessarily coded in the most efficient way. Keep this in mind when going over this tutorial.

The elements that we will cover in this context are:
1. How to split large tasks into smaller chunks?
2. How to access exported datasets in `GEEassets` and `GDrive`?
3. How to keep track of running export tasks while submitting new tasks continuously?

As usual, however, we start the engine and load the `geemap` package:

In [1]:
import ee
import geemap.foliumap as geemap
try:
    ee.Initialize()
except Exception as e:
    ee.Authenticate()
    ee.Initialize()

## Split large tasks into smaller chunks
This is strongly connected to the lab we have been working on in the context of our vector sessions ([here](ch_ogr_creations) or [here](ch_ogr)). The idea is to subdivide a point dataset into chunks of a relatively limited geographic extent, do the job/extract the data for each chunk, and then puzzle the results together again. Besides the chapters mentioned before, we need the following *GEE* techniques, which you can look up in the [introduction to *GEE*](ch_IntroGEE):
1. How to create geometries - from scratch and from existing .gpkg files
2. How to convert local vector files into feature collections.

Have a look at the respective chapters if you feel unsure; we won't repeat them here.
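
To make the chunking idea more concrete, here is a minimal sketch of how a bounding box could be subdivided into regular tiles. The function name `make_grid`, the tuple layout `(xmin, ymin, xmax, ymax)`, and the example coordinates are assumptions made for illustration; they are not part of the earlier chapters.

```python
import math
from typing import List, Tuple

# Assumed layout of a bounding box: (xmin, ymin, xmax, ymax)
BBox = Tuple[float, float, float, float]


def make_grid(bbox: BBox, tile_size: float) -> List[BBox]:
    """Split a bounding box into tiles of at most tile_size per side."""
    if tile_size <= 0:
        raise ValueError("tile_size must be positive")
    xmin, ymin, xmax, ymax = bbox
    n_cols = math.ceil((xmax - xmin) / tile_size)
    n_rows = math.ceil((ymax - ymin) / tile_size)
    tiles = []
    for row in range(n_rows):
        for col in range(n_cols):
            x0 = xmin + col * tile_size
            y0 = ymin + row * tile_size
            # Clip the last column/row to the extent of the bounding box
            tiles.append((x0, y0, min(x0 + tile_size, xmax), min(y0 + tile_size, ymax)))
    return tiles


# Hypothetical study area in degrees, split into 1-degree tiles
tiles = make_grid((10.0, 50.0, 12.5, 51.0), tile_size=1.0)
print(len(tiles))
```

Each tuple could then be turned into a server-side geometry, for example with `ee.Geometry.Rectangle([xmin, ymin, xmax, ymax])`, and used to filter the points that fall into that tile.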

## Asset management
In the chapter on [image workflows](ch_imagewf) we have learned that we can export datasets into assets. What is important to know here is that we can do this both with raster data (i.e., images) and vector data (i.e., `ee.FeatureCollection()`). An example of such code for vector data is presented below (the raster counterpart is `ee.batch.Export.image.toAsset()`):

In [None]:
# Export vector data to assets
exporttask = ee.batch.Export.table.toAsset(
    collection=, # add here the feature collection
    description=, # add here a description
    assetId=) # here comes the asset
exporttask.start()

The question here is how to best manage the assets. One way I like, and probably the easiest one, is to work with a `string` that points to your asset folder. In the case of this book/script, this could be for example:

In [14]:
asset = "projects/ee-matthiasbaumann84/assets/geopy"

Now, with some easy techniques we have learned already (e.g., list files, iterations), we can do the same with assets as we do with our local file system. In this concrete example, we can list all files inside the asset folder. The two files I prepared in the folder are the same shapefiles I used in the [chapter on image classification in *GEE*](ch_GEEclassification). What is important here is that these are server-side objects that we can work with in *GEE*, so there is no risk of running into memory and timeout issues (except when using `.getInfo()`).

Let's get a list of the feature collections in the asset folder and see what we get (and how to access it):

In [15]:
fc_list = ee.data.getList({'id': asset})
print(fc_list)

[{'type': 'Table', 'id': 'projects/ee-matthiasbaumann84/assets/geopy/01_ROI_shape'}, {'type': 'Table', 'id': 'projects/ee-matthiasbaumann84/assets/geopy/02_RandomPoints_1000'}]


We have a list of assets (as expected). What we do now is the following: we will access the first element (i.e., the ROI), so that we can work with it. To do that we need to define the string indicating the asset as an `ee.FeatureCollection()`:

In [17]:
fc = ee.FeatureCollection(fc_list[0]['id'])
fc

Pretty simple, and also in-line with everything we have learned so far. We will have an exercise like this in the course, but think for the moment about the following questions:
1. How to use the `ee.batch.Export.table.toAsset()` function to export e.g., STMs for many points across large areas $\rightarrow$ think again about grids and how they help you subdivide a large task into many smaller tasks.
2. Using the `fc_list` and the possibility to iterate over it: would it be possible to merge many small assets into one single larger one? This would essentially be the reverse of step 1. Have a look at [how to merge multiple feature collections](https://developers.google.com/earth-engine/apidocs/ee-featurecollection-merge) and then at how to export the result again.
3. Assume that you only want the result of 2. and you want to delete the results of 1., as they are only temporary elements that you won't need anymore. Below is some code that helps you delete all assets in a folder.

In [None]:
listToDelete = ee.data.getList({'id': asset})
for coll in listToDelete:
    file = coll['id']
    ee.data.deleteAsset(file)
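
For question 2 above, the merge step could be sketched as follows. This is only one possible approach under stated assumptions: the helper `merge_collections` and the `FakeCollection` stand-in are hypothetical names introduced here so that the sketch runs without a connection to *GEE*. The only thing it relies on is that each loaded object offers a `.merge()` method, as `ee.FeatureCollection` does.

```python
from functools import reduce


def merge_collections(asset_ids, load):
    """Load every asset ID with `load` and fold the results with .merge()."""
    if not asset_ids:
        raise ValueError("asset_ids must not be empty")
    collections = [load(asset_id) for asset_id in asset_ids]
    return reduce(lambda merged, nxt: merged.merge(nxt), collections)


class FakeCollection:
    """Offline stand-in that mimics only the .merge() method."""

    def __init__(self, items):
        self.items = list(items)

    def merge(self, other):
        return FakeCollection(self.items + other.items)


# Hypothetical asset IDs and contents, used only to run the sketch offline
fake_assets = {"geopy/tile_1": ["a", "b"], "geopy/tile_2": ["c"]}
merged = merge_collections(
    list(fake_assets), lambda asset_id: FakeCollection(fake_assets[asset_id])
)
print(merged.items)
```

With *GEE*, one would presumably pass `[a['id'] for a in fc_list]` as `asset_ids` and `ee.FeatureCollection` as `load`, and then export the merged collection with `ee.batch.Export.table.toAsset()` as shown earlier.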

Now you are ready to work with feature collections and images as assets in the same way you can do it in a local file system. This immensely helped me organize large data (e.g., at the continental scale) and make use of *GEE* as a gigantic processing and data engine.

## Google Drive Management
You can of course do the same using exports to your Google Drive. What we know already is how to export images to your GDrive; we learned this in previous sections. Below is again an example for this, and you can imagine doing this with feature collections as well, right?

In [None]:
task = ee.batch.Export.image.toDrive(image=, # here comes the image object you want to export
                                     folder='', # the folder in your Gdrive to work in
                                     description= '', # some description for the export
                                     region=, # some geometry that tells GEE the ROI.
                                     scale=30) # the spatial resolution in meters for the export
task.start()

So far, so good. But how do we access our GDrive so that we can do the same operations (e.g., list files, download files, delete files) as we do with assets and in our local file system? To do this, we need to set some parameters and go through a few steps. Specifically:
1. Install and load the package `pydrive` and its submodules
2. Create a `client_secrets` file and authenticate with your GDrive.
3. Access the folder in your GDrive

### Install and load `pydrive`
This is an easy one. You know [how to install packages](ch_setup), and we need two submodules of this package. Afterwards you can authenticate with your Google Drive.

In [18]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive

Before we can connect and authenticate with our GDrive, we need to create a `client_secrets` file. To do this, follow the steps outlined [on this website](https://developers.google.com/identity/protocols/oauth2/web-server#creatingcred) or alternatively [here](https://www.balbooa.com/help/gridbox-documentation/integrations/other/google-client-id). After that, store your `client_secrets` file wherever you want, and start the authentication:

In [23]:
GoogleAuth.DEFAULT_SETTINGS['client_config_file'] = "PATH_TO_YOUR_FILE/client_secrets.json"  
gauth = GoogleAuth()
gauth.LocalWebserverAuth()
drive = GoogleDrive(gauth)

Your browser has been opened to visit:

    https://accounts.google.com/o/oauth2/auth?client_id=9291758365-jcrgrd0a3gfv10ipqfc5vq675ua5hve4.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Flocalhost%3A8080%2F&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive&access_type=offline&response_type=code

Authentication successful.


Now we can start managing the data. To do this, however, we need to provide the physical address of the folder in our GDrive (I have not found a better solution yet). For that reason I have created a folder in my GDrive with a name that clearly indicates that *GEE* stuff will be stored there (e.g., `geopy`). Once you have created that folder, you can get the physical address from the address field:

```{figure} figs/gdrive_screenshot.jpg
---
width: 100%
name: gdrive
---
How to get the physical address of the GDrive folder that you want to use for your data exported from Google Earth Engine. Note that this is just an example from the creation of this script. The folder (and hence the address) does not exist anymore.
```
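
If you prefer not to copy the folder ID by hand, the step could be sketched as below. This assumes that the address of your folder has the form `https://drive.google.com/drive/folders/<ID>`; the helper names `folder_id_from_url` and `children_query` are hypothetical and only used for illustration.

```python
from urllib.parse import urlparse


def folder_id_from_url(url: str) -> str:
    """Return the path segment that follows 'folders' in a folder address."""
    parts = [part for part in urlparse(url).path.split("/") if part]
    if "folders" not in parts or parts.index("folders") + 1 >= len(parts):
        raise ValueError(f"No folder ID found in {url!r}")
    return parts[parts.index("folders") + 1]


def children_query(folder_id: str) -> str:
    """Build the query string that lists non-trashed files in a folder."""
    return f"'{folder_id}' in parents and trashed=false"


# Assumed address format; the ID is the example folder used in this chapter
url = "https://drive.google.com/drive/folders/1O_SIblsy57bc6G_9DYctnnQnab3nkiFz"
folder_id = folder_id_from_url(url)
query = children_query(folder_id)
print(query)
```

The resulting string can be passed as the `'q'` entry of the dictionary given to `drive.ListFile()`.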

With this, we can now access the files in GDrive:

In [24]:
file_list = drive.ListFile({'q': "'1O_SIblsy57bc6G_9DYctnnQnab3nkiFz' in parents and trashed=false"}).GetList()
file_list

[GoogleDriveFile({'kind': 'drive#file', 'userPermission': {'id': 'me', 'type': 'user', 'role': 'owner', 'kind': 'drive#permission', 'selfLink': 'https://www.googleapis.com/drive/v2/files/1NTTAt3xwO_-P0HZ0R9a-oOSPgXCnMnvH/permissions/me', 'etag': '"Yy5GsXS7YmvEvwOmp0KzkBLA6uk"', 'pendingOwner': False}, 'fileExtension': 'shx', 'md5Checksum': 'eec4664bbaeb1b157837a0c60b8cbdfb', 'selfLink': 'https://www.googleapis.com/drive/v2/files/1NTTAt3xwO_-P0HZ0R9a-oOSPgXCnMnvH', 'ownerNames': ['Matthias Baumann'], 'lastModifyingUserName': 'Matthias Baumann', 'editable': True, 'writersCanShare': True, 'downloadUrl': 'https://www.googleapis.com/drive/v2/files/1NTTAt3xwO_-P0HZ0R9a-oOSPgXCnMnvH?alt=media&source=downloadUrl', 'mimeType': 'application/octet-stream', 'parents': [{'selfLink': 'https://www.googleapis.com/drive/v2/files/1NTTAt3xwO_-P0HZ0R9a-oOSPgXCnMnvH/parents/1O_SIblsy57bc6G_9DYctnnQnab3nkiFz', 'id': '1O_SIblsy57bc6G_9DYctnnQnab3nkiFz', 'isRoot': False, 'kind': 'drive#parentReference', 'pare

Pretty complicated output. But if you go through this list once, you may see some patterns and dictionary keys that you recognize, such as the `['id']` and the `['title']` keys. Below is an example of how to use these for getting the files:

In [27]:
for file in file_list:
    file_id = drive.CreateFile({'id': file['id']})
    file_name = file['title']
    print(file_name)

02_RandomPoints_1000.shx
02_RandomPoints_1000.shp
02_RandomPoints_1000.prj
02_RandomPoints_1000.dbf
01_ROI_shape.shp
01_ROI_shape.shx
01_ROI_shape.dbf
01_ROI_shape.prj


Last, we amend this code in two ways:
1. For downloading the files in the list. To do this, we need to combine the output folder and the file name into one path.
2. For deleting the files on your GDrive, i.e., for cleaning up GDrive.

In [None]:
# 1. Download the files
import os

outFolder = "PATH_TO_YOUR_LOCAL_FOLDER"
for file in file_list:
    file_id = drive.CreateFile({'id': file['id']})
    fname = file["title"]  # The name under which the file is stored on Google Drive
    # Define the output file name (folder + file name) and download
    outName = os.path.join(outFolder, fname)
    file_id.GetContentFile(outName)

In [None]:
# 2. Delete the files on GDrive
for file in file_list:
    file_id = drive.CreateFile({'id': file['id']})
    file_id.Delete()

And that is it. Now, we are ready to work with GDrive in the same way we know it from file systems. This has the advantage that we can think of loops (e.g., while-loops) that continuously check in GDrive whether new files have arrived and then download them. This can be helpful, for example, if you don't have a paid GoogleOne account and hence only limited storage capacity that might not be sufficient to store all data from your study area.
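
Such a polling loop could be sketched as follows. To keep the example runnable without a GDrive connection, listing and downloading are passed in as functions; `poll_and_download`, its parameters, and the file names are hypothetical. With `pydrive`, `list_files` would presumably wrap `drive.ListFile(...).GetList()` and `download` would wrap `GetContentFile()`.

```python
import time


def poll_and_download(list_files, download, expected, interval=60, sleep=time.sleep):
    """Poll list_files() until `expected` distinct names have been downloaded."""
    done = []
    while len(done) < expected:
        for name in list_files():
            if name not in done:
                download(name)
                done.append(name)
        if len(done) < expected:
            # Nothing more to do right now; wait before checking again
            sleep(interval)
    return done


# Offline demonstration: the folder content grows with every check
batches = iter([[], ["tile_1.tif"], ["tile_1.tif", "tile_2.tif"]])
fetched = []
done = poll_and_download(
    lambda: next(batches), fetched.append, expected=2, sleep=lambda seconds: None
)
print(fetched)
```

Note that this sketch has no timeout: if an export fails and a file never arrives, the loop keeps waiting, so a real version would need some stop condition.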

## Managing number of processes
The last bit we do is to look at some basic code lines to manage processes. The scenario is as follows: suppose you subdivide your study region using a grid with a large number of grid cells (e.g., several hundred). Now, you want to export STMs or a classification for all tiles to your GDrive and from there download them to your local computer. Below you find some basic elements. We describe the basic idea/functionality, but you probably need to adjust this to your personal processing engine.

*GEE* allows you to submit many tasks, but only a limited number (2-5) are processed in parallel. The first reflex would be to submit all tasks and then let *GEE* process them all. With unlimited GDrive storage this would be an option, but many people (incl. me) only have limited storage (e.g., 15GB in the free version, 100GB in the basic paid version). So, you will continuously submit tasks, and at the same time clean up your GDrive by downloading and deleting the files. Below is a loop construct that allows you to do this. The basic idea of this loop is as follows:
1. Define a maximum number of tasks that you want *GEE* to have either running or in a queue. I define this through `maxTasks = 10`, meaning that the sum of the two should not exceed 10.
2. The larger `for`-loop then submits all tasks **in theory**, but it checks whether the number of currently running/queued tasks `n_tasks` is larger than or equal to `maxTasks`. If this is the case, it stops and sleeps for 60 seconds before attempting it again.
3. If `n_tasks < maxTasks`, another tile will be submitted, and GDrive will be checked for new files that arrived. Those will be downloaded and stored locally. It looks a bit complicated at first, but in the end it is mostly task management. Have a look - **be aware that the variable names have not been defined previously in the code block, so you will have to think of how to manage the loop, and what the process is. In other words: this is a generic loop, which you will have to adapt!**

In [None]:
import os
import time

maxTasks = 10
# Initialize the number of running/queued tasks
n_tasks = 0
# Loop over the missing tile IDs
for tile in tileIDs:
    while n_tasks >= maxTasks:
        # When the maximum is reached, check every 60 seconds whether a new task can be started
        time.sleep(60)
        try:
            task_list = str(ee.batch.Task.list())
            n_running = task_list.count('RUNNING')
            n_ready = task_list.count('READY')
            n_tasks = n_running + n_ready
        except Exception:
            # The request failed; wait briefly and try again
            time.sleep(5)
    # Submit the next classification
    # Begin of the code example you want to run
    # -----------------
    # End of the code example you want to run
    # Update the number of running/queued tasks
    task_list = str(ee.batch.Task.list())
    n_running = task_list.count('RUNNING')
    n_ready = task_list.count('READY')
    n_tasks = n_running + n_ready
    # Check in GoogleDrive whether new files arrived, download them, and delete them on GDrive
    drive_list = drive.ListFile({'q': "'1O_SIblsy57bc6G_9DYctnnQnab3nkiFz' in parents and trashed=false"}).GetList()
    for file in drive_list:
        file_id = drive.CreateFile({'id': file['id']})
        fname = file["title"]
        outName = os.path.join(outFolder, fname)
        file_id.GetContentFile(outName)
        file_id.Delete()
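
The loop above counts tasks by searching the text representation of the task list for `'RUNNING'` and `'READY'`. A possibly more robust variant is to count by the state of each task, since a task whose description happens to contain one of these words would otherwise be counted as well. The sketch below assumes that each task is available as a dictionary with a `'state'` key, as returned by the `.status()` method of a task; the helper name `count_active` and the example dictionaries are made up for illustration.

```python
# States that count towards the limit of running/queued tasks
ACTIVE_STATES = {"READY", "RUNNING"}


def count_active(statuses):
    """Count status dictionaries whose 'state' is READY or RUNNING."""
    return sum(1 for status in statuses if status.get("state") in ACTIVE_STATES)


# Made-up status dictionaries; the first description contains 'RUNNING' on purpose
statuses = [
    {"description": "RUNNING_tile_01", "state": "COMPLETED"},
    {"description": "tile_02", "state": "RUNNING"},
    {"description": "tile_03", "state": "READY"},
    {"description": "tile_04", "state": "FAILED"},
]
n_tasks = count_active(statuses)
print(n_tasks)
```

With *GEE*, the list could presumably be built as `[task.status() for task in ee.batch.Task.list()]`; keep in mind that this sends one request per task.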

This is it. We have identified and explored the key elements needed to successively submit tasks to *GEE* and keep a loop running until all tasks are submitted, while at the same time downloading the files from GDrive.