I had an interesting use case as part of my work wherein we needed to generate Nested JSON out of table definitions (i.e. from the INFORMATION_SCHEMA.COLUMNS) that was then used by other system for further processing. Any changes being done on the schema, say adding or removal of column or data type change, we were doing this manually to the JSON file. This process was error prone. As the number of tables to maintain increased, the vulnerability of manually modifying the JSON as and when changes being done at the table level happened.
Looking for simpler solution when I looked around google BigQuery had some ready made operators up to the task namely –
- ARRAY_AGG – This returns an ARRAY of expression values specified and can also work with aggregation.
- STRUCT – Constructs a container of ordered fields i.e. like a list in python. Returns an ARRAY object.
Let me illustrate the use case with an example. Say we have following dataset (data taken from here)-
Objective
To obtain the JSON file in the following format. There are 2 nested levels that I want to generate. Level 1 with Name, Race, Gender (Under Superhero), Level 2 with Height, Weight, SkinColor, EyeColor (Under Attributes)-

Solution
Here is how the query for the solution would look like –
WITH BASE AS
(
SELECT
Publisher
,Name
,Race
,Gender
,ARRAY_AGG(STRUCT(Height, Weight, SkinColor, EyeColor)) AS Attributes
FROM `gcp-project-name.dataset_name.marvel_characters_info`
GROUP BY 1,2,3,4
)
SELECT
Publisher
,ARRAY_AGG(STRUCT(Name, Race, Gender, Attributes)) AS Superhero
FROM BASE
GROUP BY 1;
Output would be as shown below –

BigQuery out of the box provides option to download the data in JSON format and that would result in the format as mentioned in the problem statement. If you are working on automating this say using Python, then you would need to use python libraries (google.cloud.bigquery) and call the SQL, then use Pandas dataframe to store the output.