Common SQLite examples

This section lists recepies to frequently asked questions on “how-to’s” with NVIDIA Nsight Systems data exported in SQLite format.

General notes

The data exported into SQLite database don’t contain any indexes by default and are as close to the contents of the report as possible. Therefore, for better readability and speed of execution, queries below will alter tables.

Common helper commands

When utilizing sqlite3 commandline tool, it’s helpful to have data printed as named columns, this can be done with:

.mode column
.headers on

Default column width is determined by the data in the first row of results. If this doesn’t work out well, you can specify widths manually:

.width 10 20 50

Obtaining sample report

CLI interface of Nsight Systems was used to profile radixSortThrust CUDA sample, then the resulting .qdrep file was exported using the nsys-exporter tool.

nsys profile --trace=cuda,osrt radixSortThrust
nsys-exporter --export-sqlite report1.qdrep

Serialized process and thread identifiers

NVIDIA Nsight Systems stores identifiers where events originated in serialized form. For events that have globalTid or globalPid fields exported, use the following code to extract numeric TID and PID:

SELECT globalTid / 0x1000000 % 0x1000000 AS PID, globalTid % 0x1000000 AS TID FROM TABLE_NAME;

NOTE: globalTid field includes both TID and PID values, while globalPid only the PID value.

Correlate CUDA kernel launches with CUDA API kernel launches

ALTER TABLE CUPTI_ACTIVITY_KIND_RUNTIME ADD COLUMN name TEXT;
ALTER TABLE CUPTI_ACTIVITY_KIND_RUNTIME ADD COLUMN kernelName TEXT;

UPDATE CUPTI_ACTIVITY_KIND_RUNTIME SET kernelName =
    (SELECT value FROM StringIds
    JOIN CUPTI_ACTIVITY_KIND_KERNEL AS cuda_gpu
        ON cuda_gpu.shortName = StringIds.id
        AND CUPTI_ACTIVITY_KIND_RUNTIME.correlationId = cuda_gpu.correlationId);

UPDATE CUPTI_ACTIVITY_KIND_RUNTIME SET name =
    (SELECT value FROM StringIds WHERE nameId = StringIds.id);

-- Select 10 longest CUDA API ranges that resulted in kernel execution.
SELECT name, kernelName, start, end FROM CUPTI_ACTIVITY_KIND_RUNTIME
    WHERE kernelName IS NOT NULL ORDER BY end - start LIMIT 10;

Results:

name                    kernelName               start       end       
----------------------  -----------------------  ----------  ----------
cudaLaunchKernel_v7000  RadixSortScanBinsKernel  658863435   658868490 
cudaLaunchKernel_v7000  RadixSortScanBinsKernel  609755015   609760075 
cudaLaunchKernel_v7000  RadixSortScanBinsKernel  632683286   632688349 
cudaLaunchKernel_v7000  RadixSortScanBinsKernel  606495356   606500439 
cudaLaunchKernel_v7000  RadixSortScanBinsKernel  603114486   603119586 
cudaLaunchKernel_v7000  RadixSortScanBinsKernel  802729785   802734906 
cudaLaunchKernel_v7000  RadixSortScanBinsKernel  593381170   593386294 
cudaLaunchKernel_v7000  RadixSortScanBinsKernel  658759955   658765090 
cudaLaunchKernel_v7000  RadixSortScanBinsKernel  681549917   681555059 
cudaLaunchKernel_v7000  RadixSortScanBinsKernel  717812527   717817671 

Remove ranges overlapping with overhead

-- Use the this query to count CUDA API ranges overlapping with the overhead ones.
-- Replace "SELECT COUNT(*)" with "DELETE" to remove such ranges.

SELECT COUNT(*) FROM CUPTI_ACTIVITY_KIND_RUNTIME WHERE id IN
(
    SELECT cuda.id
    FROM PROFILER_OVERHEAD as overhead
    INNER JOIN CUPTI_ACTIVITY_KIND_RUNTIME as cuda ON
    (cuda.start BETWEEN overhead.start and overhead.end)
    OR (cuda.end BETWEEN overhead.start and overhead.end)
    OR (cuda.start < overhead.start AND cuda.end > overhead.end)
);

Results:

COUNT(*)  
----------
1095      

Backtraces for OSRT ranges

-- Adding text columns makes results of the query below more human-readable.
-- These steps are completely optional.
ALTER TABLE OSRT_API ADD COLUMN name TEXT;
UPDATE OSRT_API SET name = (SELECT value FROM StringIds WHERE OSRT_API.nameId = StringIds.id);

ALTER TABLE OSRT_CALLCHAINS ADD COLUMN symbolName TEXT;
UPDATE OSRT_CALLCHAINS SET symbolName = (SELECT value FROM StringIds WHERE symbol = StringIds.id);

ALTER TABLE OSRT_CALLCHAINS ADD COLUMN moduleName TEXT;
UPDATE OSRT_CALLCHAINS SET moduleName = (SELECT value FROM StringIds WHERE module = StringIds.id);

-- Print backtrace of the longest OSRT range
SELECT globalTid / 0x1000000 % 0x1000000 AS PID, globalTid % 0x1000000 AS TID,
    OSRT_API.id, start, end, name, callchainId, stackDepth, symbolName, moduleName
FROM OSRT_API LEFT JOIN OSRT_CALLCHAINS ON callchainId == OSRT_CALLCHAINS.id
WHERE OSRT_API.id IN (SELECT id FROM OSRT_API ORDER BY end - start DESC LIMIT 1)
ORDER BY stackDepth LIMIT 10;

Results:

PID         TID         id          start       end         name                    callchainId  stackDepth  symbolName                      moduleName                              
----------  ----------  ----------  ----------  ----------  ----------------------  -----------  ----------  ------------------------------  ----------------------------------------
19163       19176       676         360897690   860966851   pthread_cond_timedwait  88           0           pthread_cond_timedwait@GLIBC_2  /lib/x86_64-linux-gnu/libpthread-2.27.so
19163       19176       676         360897690   860966851   pthread_cond_timedwait  88           1           0x7fbc983b7227                  /usr/lib/x86_64-linux-gnu/libcuda.so.418
19163       19176       676         360897690   860966851   pthread_cond_timedwait  88           2           0x7fbc9835d5c7                  /usr/lib/x86_64-linux-gnu/libcuda.so.418
19163       19176       676         360897690   860966851   pthread_cond_timedwait  88           3           0x7fbc983b64a8                  /usr/lib/x86_64-linux-gnu/libcuda.so.418
19163       19176       676         360897690   860966851   pthread_cond_timedwait  88           4           start_thread                    /lib/x86_64-linux-gnu/libpthread-2.27.so
19163       19176       676         360897690   860966851   pthread_cond_timedwait  88           5           __clone                         /lib/x86_64-linux-gnu/libc-2.27.so      

Profiled processes output streams

ALTER TABLE ProcessStreams ADD COLUMN filename TEXT;
UPDATE ProcessStreams SET filename = (SELECT value FROM StringIds WHERE ProcessStreams.filenameId = StringIds.id);

ALTER TABLE ProcessStreams ADD COLUMN content TEXT;
UPDATE ProcessStreams SET content = (SELECT value FROM StringIds WHERE ProcessStreams.contentId = StringIds.id);

-- Select all collected stdout and stderr streams.
select globalPid / 0x1000000 % 0x1000000 AS PID, filename, content from ProcessStreams;

Results:

PID         filename                                                 content                                                                                                                                                                                                                                                                                                                    
----------  -------------------------------------------------------  --------------------------------------------------------------------------------------------------------------------                                                                                                                                                                                                       
19163       /tmp/nvidia/nsight_systems/streams/pid_19163_stdout.log  /home/user_name/NVIDIA_CUDA-10.1_Samples/6_Advanced/radixSortThrust/radixSortThrust Starting...

GPU Device 0: "Quadro P2000" with compute capability 6.1


Sorting 1048576 32-bit unsigned int keys and values

radixSortThrust, Throughput = 401.0872 MElements/s, Time = 0.00261 s, Size = 1048576 elements
Test passed

19163       /tmp/nvidia/nsight_systems/streams/pid_19163_stderr.log                                                                                                                                                                                                                                                                                                                             

Thread summary

Please note, that Nsight Systems applies additional logic during sampling events processing to work around lost events. This means that the results of the below query might differ slightly from the ones shown in “Analysis summary” tab.

-- Thread summary calculated using CPU cycles (when available)
SELECT
    globalTid / 0x1000000 % 0x1000000 AS PID,
    globalTid % 0x1000000 AS TID,
    ROUND(100.0 * SUM(cpuCycles) / 
        (
            SELECT SUM(cpuCycles) FROM COMPOSITE_EVENTS 
            GROUP BY globalTid / 0x1000000000000 % 0x100
        ),
        2
    ) as CPU_utilization,
    (SELECT value FROM StringIds WHERE id = 
        (
            SELECT nameId FROM ThreadNames 
            WHERE ThreadNames.globalTid = COMPOSITE_EVENTS.globalTid
        )
    ) as thread_name
FROM COMPOSITE_EVENTS
GROUP BY globalTid
ORDER BY CPU_utilization DESC
LIMIT 10;

Results:

PID         TID         CPU_utilization  thread_name    
----------  ----------  ---------------  ---------------
19163       19163       98.4             radixSortThrust
19163       19168       1.35             CUPTI worker th
19163       19166       0.25             [NS]           

Thread running time may be calculated using scheduling data, when PMU counter data was not collected.

CREATE INDEX sched_start ON SCHED_EVENTS (start);

-- Intermediate table
CREATE TABLE CPU_USAGE AS
SELECT
    first.globalTid as globalTid,
    (SELECT nameId FROM ThreadNames WHERE ThreadNames.globalTid = first.globalTid) as nameId,
    sum(second.start - first.start) as total_duration,
    count() as ranges_count
FROM SCHED_EVENTS as first
LEFT JOIN SCHED_EVENTS as second 
ON second.id =
    (
        SELECT id
        FROM SCHED_EVENTS
        WHERE start > first.start AND globalTid = first.globalTid
        ORDER BY start ASC
        LIMIT 1
    )
WHERE first.isSchedIn != 0
GROUP BY first.globalTid
ORDER BY total_duration DESC;

-- Thread summary calculated using scheduled time
SELECT
    globalTid / 0x1000000 % 0x1000000 AS PID,
    globalTid % 0x1000000 AS TID,
    (SELECT value FROM StringIds where nameId == id) as thread_name,
    ROUND(100.0 * total_duration / (SELECT SUM(total_duration) FROM CPU_USAGE), 2) as CPU_utilization 
FROM CPU_USAGE
ORDER BY CPU_utilization DESC;

Results:

PID         TID         thread_name      CPU_utilization
----------  ----------  ---------------  ---------------
19163       19163       radixSortThrust  93.74          
19163       19169       radixSortThrust  3.22           
19163       19168       CUPTI worker th  2.46           
19163       19166       [NS]             0.44           
19163       19172       radixSortThrust  0.07           
19163       19167       [NS Comms]       0.05           
19163       19176       radixSortThrust  0.02           
19163       19170       radixSortThrust  0.0            

Function table

These examples demonstrate how to calculate Flat and BottomUp (for top level only) views statistics.

ALTER TABLE SAMPLING_CALLCHAINS ADD COLUMN symbolName TEXT;
UPDATE SAMPLING_CALLCHAINS SET symbolName = (SELECT value FROM StringIds WHERE symbol = StringIds.id);

ALTER TABLE SAMPLING_CALLCHAINS ADD COLUMN moduleName TEXT;
UPDATE SAMPLING_CALLCHAINS SET moduleName = (SELECT value FROM StringIds WHERE module = StringIds.id);


-- Flat view
SELECT symbolName, moduleName, ROUND(100.0 * sum(cpuCycles) /
    (SELECT SUM(cpuCycles) FROM COMPOSITE_EVENTS), 2) AS flatTimePercentage
FROM SAMPLING_CALLCHAINS
LEFT JOIN COMPOSITE_EVENTS ON SAMPLING_CALLCHAINS.id == COMPOSITE_EVENTS.id
GROUP BY symbol, module
ORDER BY flatTimePercentage DESC
LIMIT 5;

-- BottomUp view (top level only)
SELECT symbolName, moduleName, ROUND(100.0 * sum(cpuCycles) /
    (SELECT SUM(cpuCycles) FROM COMPOSITE_EVENTS), 2) AS selfTimePercentage
FROM SAMPLING_CALLCHAINS
LEFT JOIN COMPOSITE_EVENTS ON SAMPLING_CALLCHAINS.id == COMPOSITE_EVENTS.id
WHERE stackDepth == 0
GROUP BY symbol, module
ORDER BY selfTimePercentage DESC
LIMIT 5;

Results:

symbolName   moduleName   flatTimePercentage
-----------  -----------  ------------------
[Max depth]  [Max depth]  99.92             
thrust::zip  /home/user_  24.17             
thrust::zip  /home/user_  24.17             
thrust::det  /home/user_  24.17             
thrust::det  /home/user_  24.17             
symbolName      moduleName                                   selfTimePercentage
--------------  -------------------------------------------  ------------------
0x7fbc984982b6  /usr/lib/x86_64-linux-gnu/libcuda.so.418.39  5.29              
0x7fbc982d0010  /usr/lib/x86_64-linux-gnu/libcuda.so.418.39  2.81              
thrust::iterat  /home/user_name/NVIDIA_CUDA-10.1_Samples/6_  2.23              
thrust::iterat  /home/user_name/NVIDIA_CUDA-10.1_Samples/6_  1.55              
void thrust::i  /home/user_name/NVIDIA_CUDA-10.1_Samples/6_  1.55              

DX12 API frame duration histogram

The example demonstrates how to calculate DX12 CPU frames durartion and construct a histogram out of it.

CREATE INDEX DX12_API_ENDTS ON DX12_API (end);

CREATE TEMP VIEW DX12_API_FPS AS SELECT end AS start,
    (SELECT end FROM DX12_API
        WHERE end > outer.end AND nameId == (SELECT id FROM StringIds
            WHERE value == "IDXGISwapChain::Present")
        ORDER BY end ASC LIMIT 1) AS end
FROM DX12_API AS outer
    WHERE nameId == (SELECT id FROM StringIds WHERE value == "IDXGISwapChain::Present")
ORDER BY end;

-- Number of frames with a duration of [X, X + 1) milliseconds
SELECT
    CAST((end - start) / 1000000.0 AS INT) AS duration_ms,
    count(*)
FROM DX12_API_FPS
WHERE end IS NOT NULL
GROUP BY duration_ms
ORDER BY duration_ms;

Results:

duration_ms  count(*)  
-----------  ----------
3            1         
4            2         
5            7         
6            153       
7            19        
8            116       
9            16        
10           8         
11           2         
12           2         
13           1         
14           4         
16           3         
17           2         
18           1         

GPU context switch events enumeration

GPU context duration is between first BEGIN and a matching END event.

SELECT (CASE tag WHEN 8 THEN "BEGIN" WHEN 7 THEN "END" END) AS tag,
    globalPid / 0x1000000 % 0x1000000 AS PID,
    vmId, seqNo, contextId, timestamp, gpuId FROM FECS_EVENTS
WHERE tag in (7, 8) ORDER BY seqNo LIMIT 10;

Results:

tag         PID         vmId        seqNo       contextId   timestamp   gpuId     
----------  ----------  ----------  ----------  ----------  ----------  ----------
BEGIN       23371       0           0           1048578     56759171    0         
BEGIN       23371       0           1           1048578     56927765    0         
BEGIN       23371       0           3           1048578     63799379    0         
END         23371       0           4           1048578     63918806    0         
BEGIN       19397       0           5           1048577     64014692    0         
BEGIN       19397       0           6           1048577     64250369    0         
BEGIN       19397       0           8           1048577     1918310004  0         
END         19397       0           9           1048577     1918521098  0         
BEGIN       19397       0           10          1048577     2024164744  0         
BEGIN       19397       0           11          1048577     2024358650  0         

Rename CUDA kernels with NVTX

The example demonstrates how to map innermost NVTX push-pop range to a matching CUDA kernel run.

ALTER TABLE CUPTI_ACTIVITY_KIND_KERNEL ADD COLUMN nvtxRange TEXT;
CREATE INDEX nvtx_start ON NVTX_EVENTS (start);


UPDATE CUPTI_ACTIVITY_KIND_KERNEL SET nvtxRange = (
    SELECT NVTX_EVENTS.text
    FROM NVTX_EVENTS JOIN CUPTI_ACTIVITY_KIND_RUNTIME ON
        NVTX_EVENTS.eventType == 59 AND
        NVTX_EVENTS.globalTid == CUPTI_ACTIVITY_KIND_RUNTIME.globalTid AND
        NVTX_EVENTS.start <= CUPTI_ACTIVITY_KIND_RUNTIME.start AND
        NVTX_EVENTS.end >= CUPTI_ACTIVITY_KIND_RUNTIME.end
    WHERE
        CUPTI_ACTIVITY_KIND_KERNEL.correlationId == CUPTI_ACTIVITY_KIND_RUNTIME.correlationId
    ORDER BY NVTX_EVENTS.start DESC LIMIT 1
);

SELECT start, end, globalPid, StringIds.value as shortName, nvtxRange
FROM CUPTI_ACTIVITY_KIND_KERNEL JOIN StringIds ON shortName == id
ORDER BY start LIMIT 6;

Results:

start       end         globalPid          shortName      nvtxRange 
----------  ----------  -----------------  -------------  ----------
526545376   526676256   72057700439031808  MatrixMulCUDA            
526899648   527030368   72057700439031808  MatrixMulCUDA  Add       
527031648   527162272   72057700439031808  MatrixMulCUDA  Add       
527163584   527294176   72057700439031808  MatrixMulCUDA  My Kernel 
527296160   527426592   72057700439031808  MatrixMulCUDA  My Range  
527428096   527558656   72057700439031808  MatrixMulCUDA